HighDots Forums  

Why do they do this? And what to do about it?

HTML Writing HTML for the Web (comp.infosystems.www.authoring.html)


  #1  
Rick Cook
Why do they do this? And what to do about it? - 12-15-2007, 02:05 AM

Why do word processors insist on inserting a lot of extraneous stuff
into the HTML documents they produce?

And is there a simple way to get rid of this junk automatically?

MS Word is probably the worst, but OpenOffice Writer does it too. It puts
CRs at the end of each line, starts each paragraph with a lot of
redundant font information, etc.

If you want clean HTML all that stuff has to be combed out. When you're
dealing with 3000-5000 word articles this is time-consuming. Editors, at
least the ones I've encountered, quite reasonably insist this is the
author's job.

Why comb it out? Because I'm not writing for my own pages. I'm writing
to editorial specification and every editor I've ever dealt with hates
this stuff.

What I'd like to find is either a) some kind of setting that turns this
stuff off (not likely) or b) some way to set up a filter to get rid of it.

  #2  
Blinky the Shark
Re: Why do they do this? And what to do about it? - 12-15-2007, 03:39 AM

Rick Cook wrote:

Quote:
Why do word processors insist on inserting a lot of extraneous stuff
into the HTML documents they produce?

Because they're word processors, not text or real HTML editors.

Quote:
And is there a simple way to get rid of this junk automatically?

Don't write HTML with word processors and you won't have anything to get
rid of.

Everyone will suggest his or her favorite. I do 99% of my work in Linux,
but since your headers imply you're using Windows, I'll just mention that
for my occasional Windows HTML editing I use Crimson Editor. Any text
editor will work for you, but be sure it has syntax highlighting for
HTML/CSS. You might also consider an HTML/CSS editor that includes
project/file management; it's convenient to be able to upload your pages
with a click or two from within your HTML editor.


--
Blinky
Killing all posts from Google Groups
The Usenet Improvement Project - http://improve-usenet.org



  #3  
Jukka K. Korpela
Re: Why do they do this? And what to do about it? - 12-15-2007, 06:16 AM

Rick Cook wrote:

Quote:
Why do word processors insist on inserting a lot of extraneous stuff
into the HTML documents they produce?

Probably because the vendors want to preserve as much formatting as
possible, even including formatting as per word processor defaults (thus
often not _intentionally_ chosen by its user). The reason behind this is
that they don't understand web publishing but see it as desktop
publishing.

Quote:
And is there a simple way to get rid of this junk automatically?

Some of it. Using "filtered output" in Word helps somewhat, but Word
still inserts a bulky stylesheet (easy to delete of course) and lots of
width and height attributes for table cells and other oddities. The Tidy
software is claimed to clean things up, but it's not reliable; it also
messes things up.

What I have done, after "filtered output", is some simple Perl-based
processing or, depending on my mood and the phase of the moon, some
Emacs processing with keyboard macros. But those tools are not for
everyone. And it's not possible to automate it all, since some of the
presentational-looking markup should _not_ be removed since it reflects
structural intentions, such as highlighting.
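
As an illustration of this kind of post-processing, here is a minimal sketch in Python (not the Perl or Emacs processing described above, which is not shown in this thread). It drops <font> tags and inline style/class attributes while leaving emphasis markup such as <b>, <i>, and <em> untouched. The function name strip_presentational and the choice of what counts as junk are assumptions made for the example, and regex-based cleanup like this is only reasonable on simple, regular word-processor output.

```python
import re

# Opening or closing <font ...> tags (assumed here to carry no structure).
FONT_TAG = re.compile(r"</?font\b[^>]*>", re.IGNORECASE)

# style="..." and class="..." attributes, with single or double quotes.
JUNK_ATTR = re.compile(
    r"""\s+(?:style|class)\s*=\s*(?:"[^"]*"|'[^']*')""",
    re.IGNORECASE,
)

def strip_presentational(html: str) -> str:
    """Remove <font> tags and style/class attributes; keep everything else."""
    html = FONT_TAG.sub("", html)
    html = JUNK_ATTR.sub("", html)
    return html

sample = ('<p class="MsoNormal" style="margin:0">'
          '<font face="Arial">Some <b>bold</b> text</font></p>')
print(strip_presentational(sample))  # <p>Some <b>bold</b> text</p>
```

Because <b>, <i>, and similar tags are never matched, markup that may reflect a structural intention survives and can be reviewed by hand afterwards.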

Quote:
MS Word is probably the worst, but OpenOffice Writer does it too.

Yeah.

Quote:
It puts CRs at the end of each line,

That's not serious; CR is just whitespace.

Quote:
starts each paragraph with a lot
of redundant font information, etc.

That's worse, but what's much worse is <td width="200"> and things like
that, which create rigid layout. Redundant stuff is just useless, not
worse.
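
A rough sketch of how fixed cell sizes could be filtered out, again in Python and under stated assumptions: the name drop_fixed_sizes is hypothetical, only table-related start tags are touched, and attribute values are assumed not to contain a ">" character. Widths and heights on other elements, such as images, are deliberately left alone.

```python
import re

# A <table>, <tr>, <td> or <th> start tag, captured as (tag name, attributes).
TABLE_TAG = re.compile(r"<(table|tr|td|th)\b([^>]*)>", re.IGNORECASE)

# A width or height attribute with a quoted or bare value.
SIZE_ATTR = re.compile(
    r"""\s+(?:width|height)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
    re.IGNORECASE,
)

def drop_fixed_sizes(html: str) -> str:
    """Delete width/height attributes from table-related start tags only."""
    def clean(match: re.Match) -> str:
        name, attrs = match.group(1), match.group(2)
        return "<" + name + SIZE_ATTR.sub("", attrs) + ">"
    return TABLE_TAG.sub(clean, html)

print(drop_fixed_sizes('<td width="200" valign="top">cell</td>'))
# <td valign="top">cell</td>
```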

Quote:
If you want clean HTML all that stuff has to be combed out. When
you're dealing with 3000-5000 word articles this is time-consuming.

It might actually be faster to tell the word processor to save it as
plain text, then add adequate markup "by hand".
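
For the plain-text route, part of the hand markup can be scripted. The sketch below is only an assumption about what "adequate markup" might start from: it treats blank lines as paragraph breaks, joins hard-wrapped lines, and escapes &, < and >. The function name text_to_paragraphs is made up for the example; headings, lists, and emphasis would still have to be added by hand.

```python
import html
import re

def text_to_paragraphs(text: str) -> str:
    """Wrap blank-line-separated blocks of plain text in <p> elements."""
    # Normalise CRLF and bare CR line endings to LF first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = []
    for block in re.split(r"\n\s*\n", text.strip()):
        if not block.strip():
            continue  # skip empty input
        # Join hard-wrapped lines and escape &, < and >.
        joined = " ".join(line.strip() for line in block.split("\n"))
        paragraphs.append("<p>" + html.escape(joined, quote=False) + "</p>")
    return "\n".join(paragraphs)

print(text_to_paragraphs("First line\r\nwraps here.\r\n\r\nFish & chips."))
# <p>First line wraps here.</p>
# <p>Fish &amp; chips.</p>
```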

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/



  #4  
Gary Peek
Re: Why do they do this? And what to do about it? - 12-15-2007, 07:52 AM

Rick Cook wrote:
Quote:
Why do word processors insist on inserting a lot of extraneous stuff
into the HTML documents they produce?

And is there a simple way to get rid of this junk automatically?

I wrote a Win32 utility called "Xtag"; you can find it at
http://industrologic.com/basic/

I threw it together to convert some documents that people gave me.
I'll try to improve it if I get feedback from those using it. It
can't work miracles, but at least it is something.

Gary Peek

