Today I was the first time I’ve used my new found Python skills in a professional capacity. As part of a larger project, the client wanted me to check each HTML document for image tags missing their “alt” attributes. Knowing how tedious and impercise it would be to do this manually, writing a script was the only logical choice. And it would be handy to be able to run this both locally on their Windows-based workstations before the files were uploaded and on the Linux-based Apache web server. Since I had already worked with the HTMLParser library, I figured this would be a perfect opportunity for Python.
Here is the entire source. Notice that I don’t have to contend with any regular expression torture, nor do I have to implement my own function to recursively search the directory structure. Python can do both these things automatically.
import urllib import htmllib import formatter import os from os.path import join import sys filename = '' class ImgParser(htmllib.HTMLParser): global filename def __init__(self, formatter): htmllib.HTMLParser.__init__(self, formatter) self.imgs =  def start_img(self, attrs): tag = dict(attrs) if not 'alt' in tag: self.imgs.append( (filename, tag['src']) ) def get_imgs(self): return self.imgs format = formatter.NullFormatter() htmlparser = ImgParser(format) search_root = sys.argv for root, dirs, files in os.walk(search_root): for name in files: if name[-5:] == '.html': filename = join(root, name) f = open(filename, 'r') htmlparser.feed(f.read()) f.close() htmlparser.close() imgs = htmlparser.get_imgs() current_file = '' print "\n\nThe following files contain images with missing alt tags:\n" for filename, src in imgs: if current_file != filename: current_file = filename print filename print " " + src
I’m normally pretty careful to comment code that other people are likely to read, however this is so simple, I didn’t bother. The “os.walk” method automatically handles the recursive directory walking and if the filename ends in “.html”, then the file is opened and run through an instance of the “ImgParser” class. ImgParser is defined as a new class that inherits the HTMLParser class from the “htmllib” module (library). The “start_img” function overrides the handler that is called when the parser encounters the start of an “img” tag. If an img tag is encountered without an alt attribute, then the filename and img’s src attibute is added to the list. Once all of the files are processed, the script loops through the list and outputs the incomplete img tags and the files in which they are contained.
It’s neither sexy or full of features, but it only took me (a Python noob) 10 minutes to write. The path that is searched is read in as the second command line argument, so the script has to be executed as python scan_img_tags.py /var/html or python scan_img_tags.py c:/myhtmlstuff. The script did exactly what I needed it to do, so I’m not likely to polish it up. But if I were, I would probably add line number reporting, a little error trapping, and the ability to specify additional file extensions from the command line.
Hopefully someone else can find some use for this snippet.