Here's a throwaway Python script to clean up HTML produced by Microsoft Excel 2010. I'm leaving it here so that I can find it later, and in case anybody else has the same problem -- for some reason, I couldn't google an easy solution anywhere. I'm sure this doesn't cover all the corner cases and complex layouts, but it's a starting point showing most of the techniques you'll ever need: tag stripping, attribute stripping (either en masse or selective), and handling crappy declarations ("<!if" tags).
It's for Python 3 (although I think it'll work almost unmodified in 2.7; you'll just have to replace open() calls with codecs.open()) and requires BeautifulSoup 4+, which really does all the magic. I don't know if it's the power of Py3k or BS getting better and better, but it went through a dozen files in a blink.
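For anyone who just wants the gist of those three techniques, here is a minimal sketch of how they might look with BeautifulSoup 4. It is not the script itself: the function names, the tag lists, the KEEP_ATTRS whitelist and the UTF-8 encoding are all assumptions made for illustration, so adjust them to whatever your Excel export actually contains.

```python
import os
from os.path import join

from bs4 import BeautifulSoup, Comment, Declaration

# Hypothetical choices for illustration, not taken from the original script.
UNWRAP_TAGS = ["span", "font"]           # drop the tag, keep its contents
REMOVE_TAGS = ["style", "meta", "link"]  # drop the tag and its contents
KEEP_ATTRS = {"colspan", "rowspan", "href"}


def clean_excel_html(html):
    """Return a stripped-down version of Excel-generated HTML."""
    soup = BeautifulSoup(html, "html.parser")

    # Declarations and comments: depending on the parser and Python version,
    # "<![if ...]>" markers come back as Declaration or Comment nodes,
    # so remove both kinds.
    junk = soup.find_all(string=lambda s: isinstance(s, (Comment, Declaration)))
    for node in junk:
        node.extract()

    # Tag stripping: remove some tags entirely, unwrap others.
    for tag in soup.find_all(REMOVE_TAGS):
        tag.decompose()
    for tag in soup.find_all(UNWRAP_TAGS):
        tag.unwrap()

    # Attribute stripping: selective, via a whitelist.
    # Assign an empty dict to tag.attrs instead to strip everything en masse.
    for tag in soup.find_all(True):
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in KEEP_ATTRS}

    return str(soup)


def clean_directory(in_dir, out_dir):
    """Clean every .htm/.html file in in_dir and write it to out_dir."""
    for name in os.listdir(in_dir):
        if not name.lower().endswith((".htm", ".html")):
            continue
        # Assumption: files are UTF-8; Excel exports may use another encoding.
        with open(join(in_dir, name), encoding="utf-8") as src:
            cleaned = clean_excel_html(src.read())
        with open(join(out_dir, name), "w", encoding="utf-8") as dst:
            dst.write(cleaned)
```

Something like clean_directory("data", "output") would then process a whole folder, assuming both directories already exist.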
4 comments:
I still wonder why the natural home for this kind of script couldn't be a Gist on GitHub. Anyway, I just wrote to ask why inDir and outDir values are formatted with Windows-like forward slashes: shouldn't everything just work by using "data" and "output"? After all, path separators are what you imported join for, aren't they?
Dude, your teacher's mindset is always on show... yeah, I was just being extra-careful, making it explicit that it references the current script dir.
I've never used Gists and I don't like the "blog by github" way of life -- you can't see stats on anything on github.
Also, I do know Blogger gets preferential treatment on Google, which is my real target for this sort of snippet. GitHub, not so much.
I see.
Truth be told, I would need someone with the same showoff teaching attitude for the code I am currently writing - it is constantly on the verge of becoming a hideous mess.
Anyway, I will hold myself back in the future.
Gist (they are quite nice) pro: anyone with an account could easily fork your script, improve or extend it, and you could merge back the changes. They are backed by git, like the repositories on GitHub.