Rick Cook wrote:
Quote:
Why do word processors insist on inserting a lot of extraneous stuff
into the HTML documents they produce? |
Probably because the vendors want to preserve as much formatting as
possible, including formatting that merely reflects the word processor's
defaults (and thus was often not _intentionally_ chosen by the user).
The underlying reason is that they don't understand web publishing and
treat it as desktop publishing.
Quote:
And is there a simple way to get rid of this junk automatically? |
Some of it. Using "filtered output" in Word helps somewhat, but Word
still inserts a bulky stylesheet (easy to delete of course) and lots of
width and height attributes for table cells and other oddities. The Tidy
software is claimed to clean things up, but it's not reliable; it also
messes things up.
What I have done, after "filtered output", is some simple Perl-based
processing or, depending on my mood and the phase of the moon, some
Emacs processing with keyboard macros. But those tools are not for
everyone. And it's not possible to automate it all: some of the
presentational-looking markup should _not_ be removed, because it
reflects structural intentions, such as highlighting.
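As a rough illustration of the kind of simple post-processing meant
here, the following is a minimal sketch in Python (not the Perl or Emacs
approach mentioned above). It assumes the input is Word's "filtered"
HTML; the function name strip_layout_junk and the regular expressions
are hypothetical and cover only the two items named above: the embedded
stylesheet and the width/height attributes on table cells. Everything
else, including markup such as <b> or <em> that may reflect real
highlighting, is left alone on purpose.

```python
import re

# Hypothetical patterns; assumptions about typical "filtered" Word
# output, not a complete or robust HTML cleaner.
STYLE_BLOCK = re.compile(r"<style\b[^>]*>.*?</style\s*>",
                         re.IGNORECASE | re.DOTALL)
CELL_TAG = re.compile(r"<(td|th)\b[^>]*>", re.IGNORECASE)
SIZE_ATTR = re.compile(
    r"""\s+(?:width|height)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
    re.IGNORECASE,
)


def strip_layout_junk(html: str) -> str:
    """Drop <style> blocks and width/height attributes on td/th tags."""
    # Remove the bulky embedded stylesheet.
    html = STYLE_BLOCK.sub("", html)
    # Inside each <td ...> or <th ...> start tag, remove size attributes
    # only; other attributes and all other tags are kept as they are.
    return CELL_TAG.sub(lambda m: SIZE_ATTR.sub("", m.group(0)), html)
```

Regular expressions are not an HTML parser, so this will miss unusual
markup (for example an attribute value containing ">"); the result
still needs to be checked by eye.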
Quote:
MS Word is probably the worst, but OpenOffice Writer does it too. |
Yeah.
Quote:
It puts CRs at the end of each line, |
That's not serious; CR is just whitespace.
Quote:
starts each paragraph with a lot
of redundant font information, etc. |
That's worse, but what's much worse is <td width="200"> and things like
that, which create a rigid layout. Redundant stuff is merely useless; it
does no further harm.
Quote:
If you want clean HTML all that stuff has to be combed out. When
you're dealing with 3000-5000 word articles this is time-consuming. |
It might actually be faster to tell the word processor to save it as
plain text, then add adequate markup "by hand".
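For the mechanical part of that approach, here is a minimal sketch in
Python. It assumes the plain-text export separates paragraphs with blank
lines (if the word processor writes each paragraph as one long line,
the split pattern needs adjusting), and the function name
text_to_paragraphs is hypothetical. It only wraps paragraphs in <p>
elements and escapes the characters &, < and >; headings, emphasis and
other real structure would still be added by hand.

```python
import html
import re


def text_to_paragraphs(text: str) -> str:
    """Wrap blank-line-separated blocks of plain text in <p> elements."""
    # Normalize CRLF and bare CR line ends; CR is just whitespace anyway.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = []
    # Assumption: one or more blank lines separate paragraphs.
    for block in re.split(r"\n\s*\n", text.strip()):
        # Collapse line breaks and runs of spaces inside a paragraph.
        body = " ".join(block.split())
        if body:
            # Escape &, < and > so the text is safe as element content.
            paragraphs.append("<p>" + html.escape(body, quote=False) + "</p>")
    return "\n".join(paragraphs)
```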
--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/