Here's a throwaway Python script to clean up HTML produced by Microsoft Excel 2010. I'm leaving it here so that I can find it later, and in case anybody else has the same problem -- for some reason, I couldn't google an easy solution anywhere. I'm sure this doesn't cover all the corner cases and complex layouts, but it's a starting point showing most of the techniques you'll ever need: tag stripping, attribute stripping (either en masse or selective), and handling crappy declarations ("<![if" tags).
It's for Python 3 (although I think it'll work almost unmodified in 2.7; you'll just have to replace the open() calls with codecs.open()) and requires BeautifulSoup 4+, which really does all the magic. I don't know if it's the power of Py3k or BS getting better and better, but it went through a dozen files in a blink.
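As a rough illustration of those three techniques (a sketch, not the original script), something along these lines should work with BeautifulSoup 4. The function name `clean_excel_html`, the two tag lists and the `KEEP_ATTRS` whitelist are all made up for this example; adjust them to whatever your Excel export actually contains.

```python
from bs4 import BeautifulSoup, Comment, Declaration

# Hypothetical choices for this sketch, not taken from the original script.
DROP_TAGS = ["style", "script", "meta", "link"]  # removed together with their contents
UNWRAP_TAGS = ["span", "font"]                   # tag removed, contents kept
KEEP_ATTRS = {"colspan", "rowspan", "href"}      # every other attribute is stripped


def clean_excel_html(html):
    """Return a stripped-down version of Excel-generated HTML."""
    soup = BeautifulSoup(html, "html.parser")

    # Declarations: conditional comments and "<![if ...]>" markers.
    # Depending on the parser they show up as Comment or Declaration nodes.
    junk = soup.find_all(string=lambda s: isinstance(s, (Comment, Declaration)))
    for node in junk:
        node.extract()

    # Tag stripping, variant 1: throw away the tag and everything inside it.
    for tag in soup.find_all(DROP_TAGS):
        tag.extract()

    # Tag stripping, variant 2: remove the tag but keep its children.
    for tag in soup.find_all(UNWRAP_TAGS):
        tag.unwrap()

    # Attribute stripping: selective here; use `tag.attrs = {}` to go en masse.
    for tag in soup.find_all(True):
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in KEEP_ATTRS}

    return str(soup)
```

To run it over a folder, read each file with open(), pass the text to `clean_excel_html()` and write the result out again.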
I still wonder why the natural home for this kind of script couldn't be a Gist on GitHub. Anyway, I just wrote to ask why inDir and outDir values are formatted with Windows-like forward slashes: shouldn't everything just work by using "data" and "output"? After all, path separators are what you imported join for, aren't they?
Dude, your teacher's mindset is always on show... yeah, I was just being extra-careful, making it explicit that it references the current script dir.
I've never used Gists and I don't like the "blog by github" way of life -- you can't see stats on anything on github.
Also, I do know Blogger gets preferential treatment on Google, which is my real target for this sort of snippet. GitHub, not so much.
I see.
Truth be told, I would need someone with the same showoff teaching attitude for the code I am currently writing - it is constantly on the verge of becoming a hideous mess.
Anyway, I will hold myself back in the future.
One pro of Gists (they are quite nice): anyone with an account can easily fork your script, improve or extend it, and you can merge the changes back. They are backed by git, just like the repositories on GitHub.