19 March 2012

Simple Python script to clean up HTML produced by Excel

Here's a throwaway Python script to clean up HTML produced by Microsoft Excel 2010. I'm leaving it here so that I can find it later, and in case anybody else has the same problem; for some reason, I couldn't google an easy solution anywhere. I'm sure it doesn't cover all the corner cases and complex layouts, but it's a starting point showing most of the techniques you'll ever need: tag stripping, attribute stripping (either en masse or selective), and handling crappy declarations ("<!if" tags).

It's for Python 3 (although I think it'll work almost unmodified in 2.7; you'll just have to replace the open() calls with codecs.open()) and requires BeautifulSoup 4+, which really does all the magic. I don't know if it's the power of Py3k or BS getting better and better, but it went through a dozen files in a blink.
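
As a rough illustration of the techniques named above, here is a minimal sketch. It is not the script itself: the tag lists, the attribute whitelist, the function names `clean_excel_html` and `clean_folder`, the `.htm`/`.html` filter and the UTF-8 default are all assumptions made for the example, and Excel may well save your files in a different encoding.

```python
import os
from bs4 import BeautifulSoup, Comment, Declaration

# Hypothetical choices for this sketch; adjust them to your own files.
UNWRAP_TAGS = ["span", "font"]               # tag stripping: drop the tag, keep its text
DROP_TAGS = ["style"]                        # drop the tag together with its content
KEEP_ATTRS = {"colspan", "rowspan", "href"}  # selective attribute stripping


def clean_excel_html(html):
    """Return a cleaned-up copy of an Excel-exported HTML string."""
    soup = BeautifulSoup(html, "html.parser")

    # Conditional "<![if ...]>" / "<![endif]>" markers and ordinary comments
    # may come out of the parser as either Declaration or Comment nodes,
    # so remove both kinds.
    for node in list(soup.descendants):
        if isinstance(node, (Comment, Declaration)):
            node.extract()

    for tag in soup.find_all(DROP_TAGS):
        tag.extract()
    for tag in soup.find_all(UNWRAP_TAGS):
        tag.unwrap()

    # Keep only whitelisted attributes; use `tag.attrs = {}` to strip en masse.
    for tag in soup.find_all(True):
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in KEEP_ATTRS}

    return str(soup)


def clean_folder(in_dir, out_dir, clean=clean_excel_html, encoding="utf-8"):
    """Run `clean` over every .htm/.html file in in_dir, writing to out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for name in sorted(os.listdir(in_dir)):
        if not name.lower().endswith((".htm", ".html")):
            continue
        # Python 3's open() takes an encoding; in 2.7 the built-in does not,
        # which is where codecs.open() comes in.
        with open(os.path.join(in_dir, name), encoding=encoding) as src:
            html = src.read()
        with open(os.path.join(out_dir, name), "w", encoding=encoding) as dst:
            dst.write(clean(html))
        written.append(name)
    return written
```

Calling `clean_folder("data", "output")` would then process a whole folder in one go; both folder names are placeholders.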


  1. Giulio Piancastelli 19/3/12 22:34

    I still wonder why the natural home for this kind of script couldn't be a Gist on GitHub. Anyway, I just wrote to ask why the inDir and outDir values are formatted with Windows-like forward slashes: shouldn't everything just work by using "data" and "output"? After all, path separators are what you imported join for, aren't they?

  2. Dude, your teacher's mindset is always on show... yeah, I was just being extra-careful, making it explicit that it references the current script dir.

    I've never used Gists and I don't like the "blog by github" way of life -- you can't see stats on anything on github.

  3. Also, I do know Blogger gets preferential treatment on Google, which is my real target for this sort of snippet. GitHub, not so much.

  4. Giulio Piancastelli 20/3/12 08:50

    I see.

    Truth be told, I would need someone with the same showoff teaching attitude for the code I am currently writing - it is constantly on the verge of becoming a hideous mess.

    Anyway, I will hold myself back in the future.

    One pro of Gists (they are quite nice): anyone with an account can easily fork your script, improve or extend it, and you can merge the changes back. They are backed by git, like the repositories on GitHub.
