SA recently updated the layout of their smileys, and as a result, I had to update my automatic emoticon packager so that I can continue maintaining the Digsby SA emoticons. While trying to fix the regular expressions, a fellow goon pointed me to lxml, an XML/HTML library for Python. He suggested using XPath, something I had learned about in my database class just this last semester. I had already forgotten about it, and it never crossed my mind to use it! Boy, I’m glad he suggested that because the XPath code is much simpler than the equivalent regular expression.
Here is an example of what I am parsing:
<div class="text">:arghfist:</div>
<img alt="" src="http://i.somethingawful.com/forumsystem/emoticons/emot-arghfist.gif" title="SO ANGRY"/>
What I was originally doing was reading the website source code line by line and using regex to capture :arghfist: and http://i.somethingawful.com/forumsystem/emoticons/emot-arghfist.gif. However, now I can simply use these two lines to capture every single emoticon:
from lxml import html  # needed for html.parse
page = html.parse("http://forums.somethingawful.com/misc.php?s=&action=showsmilies")
emoteDict = dict(zip(page.xpath("//li[@class='smilie']/div/text()"), page.xpath("//li[@class='smilie']/img/@src")))
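If you want to try the XPath expressions without hitting the forums, here is a minimal sketch that runs them against an inline string. The sample markup is an assumption modeled on the snippet above (the second entry, `:example:`, is made up), and `parse_emotes` is a hypothetical helper name, not something from my packager. It also walks each `li` on its own instead of zipping two separate result lists, so an entry missing its `div` or `img` can't shift the pairs out of alignment.

```python
from lxml import html

# Hypothetical sample markup modeled on the snippet above; the real page may differ.
SAMPLE = """
<ul>
  <li class="smilie">
    <div class="text">:arghfist:</div>
    <img alt="" src="http://i.somethingawful.com/forumsystem/emoticons/emot-arghfist.gif" title="SO ANGRY"/>
  </li>
  <li class="smilie">
    <div class="text">:example:</div>
    <img alt="" src="http://example.com/emot-example.gif" title="EXAMPLE"/>
  </li>
</ul>
"""


def parse_emotes(markup):
    """Map each smiley code to its image URL (hypothetical helper)."""
    root = html.fromstring(markup)
    emotes = {}
    # Same XPath idea as above, but evaluated per <li> so code and URL stay paired.
    for item in root.xpath("//li[@class='smilie']"):
        codes = item.xpath("./div/text()")
        urls = item.xpath("./img/@src")
        if codes and urls:  # skip entries that lack either piece
            emotes[codes[0].strip()] = urls[0]
    return emotes


emote_dict = parse_emotes(SAMPLE)
```

Looking up `emote_dict[":arghfist:"]` should give back the image URL from the sample, the same mapping the `zip` version builds when every entry is complete.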
That’s really all there is to it! I’ve re-uploaded the source code if you’d like to take a look at it. You can get it at http://www.fangsoft.net/public/SA-emotes-source.7z. Remember that you’ll need lxml in order to run this.

You should check out BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/