Subclassed: Simple Python script to clean up HTML produced by Excel

19 March 2012

Here's a throwaway Python script to clean up HTML produced by Microsoft Excel 2010. I'm leaving it here so that I can find it later, and in case anybody else has the same problem -- for some reason, I couldn't google an easy solution anywhere. I'm sure it doesn't cover all the corner cases and complex layouts, but it's a starting point showing most of the techniques you'll ever need: tag stripping, attribute stripping (either en masse or selective), and handling crappy declarations ("<!if" tags).
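For reference, the three techniques can be sketched roughly like this. It is not the original script, just a minimal illustration assuming BeautifulSoup 4.4 or later with the built-in html.parser; the sample markup, the clean_excel_html name and the KEEP_ATTRS whitelist are all made up for the example. Depending on the Python version, the parser reports Excel's "<![if ...]>" markers either as declarations or as comments, so the sketch drops both kinds.

```python
from bs4 import BeautifulSoup, Comment, Declaration

# Hypothetical whitelist: attributes worth keeping on any tag.
KEEP_ATTRS = {"colspan", "rowspan"}

# Made-up sample resembling what Excel writes when saving as a web page.
SAMPLE = """<html xmlns:o="urn:schemas-microsoft-com:office:office">
<head><style>td {mso-number-format:General;}</style></head>
<body>
<table border=0 cellpadding=0 style='border-collapse:collapse'>
 <tr height=20><td class=xl65 colspan=2 style='height:15pt'><font color="red">Total</font></td></tr>
 <![if supportMisalignedColumns]>
 <tr height=0><td width=64></td></tr>
 <![endif]>
</table>
</body></html>"""


def clean_excel_html(html):
    soup = BeautifulSoup(html, "html.parser")

    # Odd declarations: drop comments and "<![if ...]>" style markers.
    junk = soup.find_all(string=lambda s: isinstance(s, (Comment, Declaration)))
    for node in junk:
        node.extract()

    # Tag stripping, variant 1: remove the tag together with its content.
    for tag in soup.find_all("style"):
        tag.decompose()

    # Tag stripping, variant 2: remove the tag but keep what is inside it.
    for tag in soup.find_all("font"):
        tag.unwrap()

    # Attribute stripping: selective via the whitelist. An empty whitelist
    # would strip every attribute en masse.
    for tag in soup.find_all(True):
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in KEEP_ATTRS}

    return str(soup)
```

Calling clean_excel_html(SAMPLE) returns the table markup with only the colspan attribute left on the cell; the rows that sat between the "<![if ...]>" markers stay in place, only the markers themselves go.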

It's for Python 3 (although I think it'll work almost unmodified in 2.7; you'll just have to replace the open() calls with codecs.open()) and requires BeautifulSoup 4+, which really does all the magic. I don't know if it's the power of Py3k or BS getting better and better, but it went through a dozen files in a blink.
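If you want the file handling to run unchanged on both 2.7 and 3, something along these lines should do. Note that this sketch swaps in io.open rather than codecs.open: io.open is the same function as the built-in open() on Python 3 and accepts an encoding argument on 2.7 as well. The read_text and write_text helpers are hypothetical names, and the utf-8 default is an assumption; check the charset declared in the file Excel actually wrote.

```python
import io


def read_text(path, encoding="utf-8"):
    # io.open decodes the file to text on both Python 2.7 and Python 3.
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


def write_text(path, text, encoding="utf-8"):
    # On Python 2.7 the text must be a unicode object, not a byte string.
    with io.open(path, "w", encoding=encoding) as handle:
        handle.write(text)
```

The cleaned-up markup then goes from read_text() through BeautifulSoup and back out via write_text().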

4 comments:

Giulio Piancastelli said...: I still wonder why the natural home for this kind of script couldn't be a Gist on GitHub. Anyway, I just wrote to ask why inDir and outDir values are formatted with Windows-like forward slashes: shouldn't everything just work by using "data" and "output"? After all, path separators are what you imported join for, aren't they?; 19/3/12 22:34
toyg said...: Dude, your teacher's mindset is always on show... yeah, I was just being extra-careful, making it explicit that it references the current script dir.

I've never used Gists and I don't like the "blog by github" way of life -- you can't see stats on anything on github.; 19/3/12 22:37
toyg said...: Also, I do know Blogger gets preferential treatment on Google, which is my real target for this sort of snippet. GitHub, not so much.; 19/3/12 22:45
Giulio Piancastelli said...: I see.

Truth be told, I would need someone with the same showoff teaching attitude for the code I am currently writing - it is constantly on the verge of becoming a hideous mess.

Anyway, I will hold myself back in the future.

Gist (they are quite nice) pro: anyone with an account could easily fork your script, improve or extend it, and you could merge back the changes. They are backed by git, as the repositories on GitHub.; 20/3/12 08:50
