HighDots Forums  

Notepad and UTF-8

HTML Writing HTML for the Web (comp.infosystems.www.authoring.html)


#11
Ben C
Re: Notepad and UTF-8 - 03-13-2008, 01:47 PM

On 2008-03-13, Andreas Prilop <aprilop2008 (AT) trashmail (DOT) net> wrote:
Quote:
> On Thu, 13 Mar 2008, Ben C wrote:
>
> > Better to use a Content-Language header and/or set the lang attribute on
> > the html element to tell the browser the language so it can use that as
> > a hint to pick a font.
>
> But that does not work in Internet Explorer.

I didn't know that. It doesn't surprise me though.

Quote:
> It works in Mozilla & Co.
> http://www.unics.uni-hannover.de/nht...-attribute.htm
> How about others like Opera?

In that test everything gets the same font. I think what Opera does,
but this is just a guess, is choose a font based on the actual
characters.

Although I don't know how they tell the difference between zh-tw and
zh-cn (languages and codepoints very similar but you need different
fonts-- simplified characters for zh-cn and traditional ones for zh-tw).


#12
Eric Lindsay
Re: Notepad and UTF-8 - 03-13-2008, 03:44 PM

In article
<Pine.GSO.4.63.0803131832170.14921 (AT) s5b004 (DOT) rrzn.uni-hannover.de>,
Andreas Prilop <aprilop2008 (AT) trashmail (DOT) net> wrote:

Quote:
> On Thu, 13 Mar 2008, Ben C wrote:
>
> > Better to use a Content-Language header and/or set the lang attribute on
> > the html element to tell the browser the language so it can use that as
> > a hint to pick a font.
>
> But that does not work in Internet Explorer. It works in Mozilla & Co.
> http://www.unics.uni-hannover.de/nht...-attribute.htm
> How about others like Opera?

How do you even tell what characters are in the source of a page?

Each browser I try seems to have its own variation of what a page
contains.

For example, in Safari, in the title element of a commercial site
http://mecu.com.au I see a question mark on a black background between
"- " and " home". I do not see that in Opera or Firefox. The page in
question includes a meta http-equiv="Content-Type" content="text/html;
charset=iso-8859-1", however the server does not seem to provide a
charset.

curl -I http://mecu.com.au
HTTP/1.1 200 OK
Date: Thu, 13 Mar 2008 20:25:59 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
Content-Length: 26406
Content-Type: text/html; Charset=,
Set-Cookie: ASPSESSIONIDASAQCRTB=NGJFGNAAPHDIMMHKOEJHEJDP; path=/
Cache-control: private
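
As a rough illustration of why a header value like `Charset=,` gives a browser nothing to work with, here is a minimal Python sketch. The helper names `declared_charset` and `usable_charset` are made up for this example and the parsing is deliberately simplistic; real browsers follow more elaborate rules before falling back to the meta element or to guessing.

```python
import codecs
import re


def declared_charset(content_type):
    """Return the raw charset parameter of a Content-Type value, or None."""
    match = re.search(r';\s*charset\s*=\s*"?([^";]*)"?', content_type, re.IGNORECASE)
    return match.group(1).strip() if match else None


def usable_charset(content_type):
    """Return a codec name Python recognises, or None if the label is useless."""
    name = declared_charset(content_type)
    if not name:
        return None
    try:
        return codecs.lookup(name).name
    except LookupError:
        # The label is present but does not name any known encoding.
        return None


if __name__ == "__main__":
    for header in ("text/html; Charset=,", "text/html; charset=iso-8859-1", "text/html"):
        print(repr(header), "->", usable_charset(header))
```

With the header from the response above, the declared label is just a comma, so the lookup fails and the client is left with the meta element or its own guess.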

curl http://mecu.com.au
echo mecu - intelligent banking - ?home | hexdump
0000000 6d 65 63 75 20 2d 20 69 6e 74 65 6c 6c 69 67 65
0000010 6e 74 20 62 61 6e 6b 69 6e 67 20 2d 20 3f 68 6f
0000020 6d 65 0a

Source from Safari 3.0.4
echo mecu - intelligent banking - ??home | hexdump
0000000 6d 65 63 75 20 2d 20 69 6e 74 65 6c 6c 69 67 65
0000010 6e 74 20 62 61 6e 6b 69 6e 67 20 2d 20 ef bf bd
0000020 68 6f 6d 65 0a

Source from Firefox 2.0.0.10
echo mecu - intelligent banking - home | hexdump
0000000 6d 65 63 75 20 2d 20 69 6e 74 65 6c 6c 69 67 65
0000010 6e 74 20 62 61 6e 6b 69 6e 67 20 2d 20 68 6f 6d
0000020 65 0a

Source from Opera 9.5 beta
echo mecu - intelligent banking - ?\0home | hexdump
0000000 6d 65 63 75 20 2d 20 69 6e 74 65 6c 6c 69 67 65
0000010 6e 74 20 62 61 6e 6b 69 6e 67 20 2d 20 c2 a0 68
0000020 6f 6d 65 0a

--
http://www.ericlindsay.com


#13
dorayme
Re: Notepad and UTF-8 - 03-13-2008, 05:16 PM

In article
<NOwebmasterSPAM-9366B5.06440514032008 (AT) freenews (DOT) iinet.net.au>,
Eric Lindsay <NOwebmasterSPAM (AT) ericlindsay (DOT) com> wrote:

Quote:
> For example, in Safari, in the title element of a commercial site
> http://mecu.com.au I see a question mark on a black background between
> "- " and " home".

The "title element" is what in this case? The only "home" link I
am seeing in my Safari is in the footer.

--
dorayme


#14
Eric Lindsay
Re: Notepad and UTF-8 - 03-13-2008, 08:58 PM

In article
<doraymeRidThis-989660.09164914032008 (AT) news-vip (DOT) optusnet.com.au>,
dorayme <doraymeRidThis (AT) optusnet (DOT) com.au> wrote:

Quote:
> In article
> NOwebmasterSPAM-9366B5.0644051403200...) iinet.net.au>,
> Eric Lindsay <NOwebmasterSPAM (AT) ericlindsay (DOT) com> wrote:
>
> > For example, in Safari, in the title element of a commercial site
> > http://mecu.com.au I see a question mark on a black background between
> > "- " and " home".
>
> The "title element" is what in this case? The only "home" link I
> am seeing in my Safari is in the footer.

This is the text within the required <title> element within the <head>
element of the page. It is not part of the page viewport, and therefore
not part of the viewable page content. It normally shows up at the very
top of the browser, outside the viewport. At least, it did for me in
Safari, Opera and Firefox. It is the text that often appears as the
content when you bookmark a page. The text in full is as it appears in
the lines in my post starting with echo.

I can find equivalent strange question marks elsewhere in the site,
within the regular content of a page. But they are in pages that are
often changed, so I couldn't be sure they would remain available.
Whereas the title has been the same for some time.

I think they may have intended to insert a non-breaking space, but since
this entire UTF-8 thing has made me retreat to printable ASCII
characters without the high bit set, I hesitate to even guess what was
originally intended. I would like to know why so many browsers treat it
differently, however, when you try to inspect the source html.

--
http://www.ericlindsay.com


#15
dorayme
Re: Notepad and UTF-8 - 03-13-2008, 09:29 PM

In article
<NOwebmasterSPAM-34D4F8.11582214032008 (AT) freenews (DOT) iinet.net.au>,
Eric Lindsay <NOwebmasterSPAM (AT) ericlindsay (DOT) com> wrote:

Quote:
> In article
> doraymeRidThis-989660.09164914032008...ptusnet.com.au>,
> dorayme <doraymeRidThis (AT) optusnet (DOT) com.au> wrote:
>
> > In article
> > NOwebmasterSPAM-9366B5.0644051403200...) iinet.net.au>,
> > Eric Lindsay <NOwebmasterSPAM (AT) ericlindsay (DOT) com> wrote:
> >
> > > For example, in Safari, in the title element of a commercial site
> > > http://mecu.com.au I see a question mark on a black background between
> > > "- " and " home".
> >
> > The "title element" is what in this case? The only "home" link I
> > am seeing in my Safari is in the footer.
>
> This is the text within the required <title> element within the <head>
> element of the page. It is not part of the page viewport, and therefore
> not part of the viewable page content. It normally shows up at the very
> top of the browser, outside the viewport. At least, it did for me in
> Safari, Opera and Firefox. It is the text that often appears as the
> content when you bookmark a page. The text in full is as it appears in
> the lines in my post starting with echo.
>
> I can find equivalent strange question marks elsewhere in the site,
> within the regular content of a page. But they are in pages that are
> often changed, so I couldn't be sure they would remain available.
> Whereas the title has been the same for some time.
>
> I think they may have intended to insert a non-breaking space, but since
> this entire UTF-8 thing has made me retreat to printable ASCII
> characters without the high bit set, I hesitate to even guess what was
> originally intended. I would like to know why so many browsers treat it
> differently, however, when you try to inspect the source html.

I did see the title words (they are displayed at the top of the
browser) but they have nothing untoward about them in my
browsers. No question mark, no black background. The site
concerned, of course, is filled with lots of errors otherwise...

--
dorayme


#16
Ben C
Re: Notepad and UTF-8 - 03-14-2008, 04:04 AM

On 2008-03-13, Eric Lindsay <NOwebmasterSPAM (AT) ericlindsay (DOT) com> wrote:
Quote:
> In article
> Pine.GSO.4.63.0803131832170.14921 (A...ni-hannover.de>,
> Andreas Prilop <aprilop2008 (AT) trashmail (DOT) net> wrote:
>
> > On Thu, 13 Mar 2008, Ben C wrote:
> >
> > > Better to use a Content-Language header and/or set the lang attribute on
> > > the html element to tell the browser the language so it can use that as
> > > a hint to pick a font.
> >
> > But that does not work in Internet Explorer. It works in Mozilla & Co.
> > http://www.unics.uni-hannover.de/nht...-attribute.htm
> > How about others like Opera?
>
> How do you even tell what characters are in the source of a page?

Good question.

There's also content negotiation, which means that curl, Firefox,
Safari, etc. might actually get sent different bytes (and headers)
depending on what they say they want to receive.

View Source may not necessarily show exactly what the browser received
from the server at all. And in looking at the hexdumps of View Source
you're presumably copying and pasting out of the source viewer, and who
knows what that's doing.

Curl is the best way to look at what actually was received from the
server.

[...]
Quote:
> For example, in Safari, in the title element of a commercial site
> http://mecu.com.au I see a question mark on a black background between
> "- " and " home". I do not see that in Opera or Firefox. The page in
> question includes a meta http-equiv="Content-Type" content="text/html;
> charset=iso-8859-1", however the server does not seem to provide a
> charset.
>
> curl -I http://mecu.com.au
> HTTP/1.1 200 OK
> Date: Thu, 13 Mar 2008 20:25:59 GMT
> Server: Microsoft-IIS/6.0
> X-Powered-By: ASP.NET
> Content-Length: 26406
> Content-Type: text/html; Charset=,
> Set-Cookie: ASPSESSIONIDASAQCRTB=NGJFGNAAPHDIMMHKOEJHEJDP; path=/
> Cache-control: private
>
> curl http://mecu.com.au
> echo mecu - intelligent banking - ?home | hexdump
> 0000000 6d 65 63 75 20 2d 20 69 6e 74 65 6c 6c 69 67 65
> 0000010 6e 74 20 62 61 6e 6b 69 6e 67 20 2d 20 3f 68 6f
> 0000020 6d 65 0a

A literal question mark before home.

But with this "echo" command I wonder if you've pasted something out
of curl's output and that operation has changed things. If you try

$ curl --trace 1 http://mecu.com.au > 2

and then look in the two files named 1 and 2, you find an A0 in that
position, which is the correct iso-8859-1 encoding of a non-breaking
space.
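
For anyone without curl to hand, roughly the same check can be sketched in Python: read the response as raw bytes and list every byte above 0x7F before any decoding or clipboard gets involved. The function names are invented for this example, the sample bytes are a stand-in for the title as described above, and `fetch_raw` needs network access, so it is not called here.

```python
import urllib.request


def fetch_raw(url):
    """Return the response body as undecoded bytes (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def non_ascii_bytes(data):
    """List (offset, byte value) pairs for every byte above 0x7F."""
    return [(offset, byte) for offset, byte in enumerate(data) if byte > 0x7F]


def hex_line(data):
    """Format bytes as space-separated two-digit hex, like hexdump output."""
    return " ".join(f"{byte:02x}" for byte in data)


if __name__ == "__main__":
    # Stand-in for the title bytes; use fetch_raw("http://mecu.com.au") for real data.
    sample = b"mecu - intelligent banking - \xa0home"
    for offset, byte in non_ascii_bytes(sample):
        print(f"offset {offset}: 0x{byte:02x}")
    print(hex_line(sample))
```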

Quote:
> Source from Safari 3.0.4
> echo mecu - intelligent banking - ??home | hexdump
> 0000000 6d 65 63 75 20 2d 20 69 6e 74 65 6c 6c 69 67 65
> 0000010 6e 74 20 62 61 6e 6b 69 6e 67 20 2d 20 ef bf bd
> 0000020 68 6f 6d 65 0a

Safari seems to have put the UTF-8 encoding of U+FFFD, the replacement
character, in there, which means "dodgy character". I don't know what's
happened there.

Quote:
> Source from Firefox 2.0.0.10
> echo mecu - intelligent banking - home | hexdump
> 0000000 6d 65 63 75 20 2d 20 69 6e 74 65 6c 6c 69 67 65
> 0000010 6e 74 20 62 61 6e 6b 69 6e 67 20 2d 20 68 6f 6d
> 0000020 65 0a

Firefox seems to have just dropped the offending character altogether.

Quote:
> Source from Opera 9.5 beta
> echo mecu - intelligent banking - ?\0home | hexdump
> 0000000 6d 65 63 75 20 2d 20 69 6e 74 65 6c 6c 69 67 65
> 0000010 6e 74 20 62 61 6e 6b 69 6e 67 20 2d 20 c2 a0 68
> 0000020 6f 6d 65 0a

And Opera gives us the UTF-8 encoding of a non-breaking space. It looks
like Opera has got it right, but that its View Source function always
displays the source in UTF-8 whatever it was encoded with originally.

But I think you may mainly be debugging the clipboard and these
browsers' source viewers here.
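
The four byte sequences in the hexdumps can all be reproduced from a single 0xA0 byte, depending on how it is decoded and written back out. The Python sketch below shows one plausible route to each; it does not claim to be what the browsers or the clipboard actually did, and the dictionary keys are just labels chosen for this example.

```python
# The title as sent by the server, assuming the byte before "home" is 0xA0.
RAW_TITLE = b"mecu - intelligent banking - \xa0home"


def candidate_views(raw):
    """Return several ways the same raw bytes can end up after a round trip."""
    text = raw.decode("iso-8859-1")  # 0xA0 becomes U+00A0, non-breaking space
    return {
        # Decoded correctly, then saved as UTF-8: c2 a0 (as in the Opera dump).
        "reencoded_utf8": text.encode("utf-8"),
        # Wrongly decoded as UTF-8: a lone 0xA0 is invalid and becomes U+FFFD,
        # which is ef bf bd in UTF-8 (as in the Safari dump).
        "misread_as_utf8": raw.decode("utf-8", errors="replace").encode("utf-8"),
        # Non-ASCII bytes silently discarded (as in the Firefox dump).
        "dropped": raw.decode("ascii", errors="ignore").encode("ascii"),
        # Character replaced by "?" when forced into ASCII (as in the curl paste).
        "ascii_fallback": text.encode("ascii", errors="replace"),
    }


if __name__ == "__main__":
    for label, data in candidate_views(RAW_TITLE).items():
        print(f"{label:16}", " ".join(f"{byte:02x}" for byte in data[-7:]))
```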


#17
Andreas Prilop
Re: Notepad and UTF-8 - 03-14-2008, 10:35 AM

On Thu, 13 Mar 2008, Ben C wrote:

Quote:
> > http://www.unics.uni-hannover.de/nht...-attribute.htm
>
> In that test everything gets the same font. I think what Opera does,
> but this is just a guess, is choose a font based on the actual
> characters.

If that is true, you should be able to see different fonts for
Latin letters and Greek letters on
http://www.unics.uni-hannover.de/nhtcapri/greek.html7
and different fonts for Latin letters and Cyrillic letters on
http://www.unics.uni-hannover.de/nht...cyrillic.html5

But I doubt it. I believe Opera uses only one font for each of
these two test pages.

Quote:
> Although I don't know how they tell the difference between zh-tw and
> zh-cn (languages and codepoints very similar but you need different
> fonts-- simplified characters for zh-cn and traditional ones for zh-tw).

But how to do this with "charset=utf-8"? The codepoints in Unicode
are the same for CN and TW and JP.

--
Solipsists of the world - unite!


#18
Ben C
Re: Notepad and UTF-8 - 03-14-2008, 04:38 PM

On 2008-03-14, Andreas Prilop <aprilop2008 (AT) trashmail (DOT) net> wrote:
Quote:
> On Thu, 13 Mar 2008, Ben C wrote:
>
> > > http://www.unics.uni-hannover.de/nht...-attribute.htm
> >
> > In that test everything gets the same font. I think what Opera does,
> > but this is just a guess, is choose a font based on the actual
> > characters.
>
> If that is true, you should be able to see different fonts for
> Latin letters and Greek letters on
> http://www.unics.uni-hannover.de/nhtcapri/greek.html7
> and different fonts for Latin letters and Cyrillic letters on
> http://www.unics.uni-hannover.de/nht...cyrillic.html5
>
> But I doubt it. I believe Opera uses only one font for each of
> these two test pages.

Probably. I don't know what it does.

Quote:
> > Although I don't know how they tell the difference between zh-tw and
> > zh-cn (languages and codepoints very similar but you need different
> > fonts-- simplified characters for zh-cn and traditional ones for zh-tw).
>
> But how to do this with "charset=utf-8"? The codepoints in Unicode
> are the same for CN and TW and JP.

Exactly, that was my point.
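
To make the point concrete, here is a small Python sketch using one ideograph that exists in both legacy encodings. With GB2312 or Big5 the bytes differ, so the encoding itself hints at the region; in Unicode there is one code point and one UTF-8 byte sequence, so the encoding says nothing about which font is wanted. The `describe` helper is invented for this example.

```python
def describe(char):
    """Show how one character is represented in several encodings."""
    return {
        "code_point": f"U+{ord(char):04X}",
        "gb2312": char.encode("gb2312"),  # encoding typical for zh-cn pages
        "big5": char.encode("big5"),      # encoding typical for zh-tw pages
        "utf-8": char.encode("utf-8"),    # identical whatever the language
    }


if __name__ == "__main__":
    info = describe("\u4e2d")  # the character for "middle"
    for key, value in info.items():
        shown = value if isinstance(value, str) else value.hex()
        print(f"{key:10} {shown}")
```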


#19
Man-wai Chang ToDie
Re: Notepad and UTF-8 - 03-15-2008, 04:04 AM

Quote:
> Please tell me there is an easier way... I need to
> a) strip leading whitespace from the content of my html files and
> b) save these files as UTF-8 and have them STAY UTF-8. Thanks

Check out Notepad2 and Notepad++.
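
If an editor is not an option, the two requirements can also be scripted. The following is only a sketch: it assumes "leading whitespace" means spaces and tabs at the start of each line, that the input is UTF-8 with or without a byte order mark, and that the output should be UTF-8 without one. The function name and its parameter are made up for this example. Note that it would also alter indentation inside pre elements.

```python
from pathlib import Path


def strip_leading_whitespace(path, source_encoding="utf-8-sig"):
    """Remove leading spaces and tabs from each line and rewrite as UTF-8.

    "utf-8-sig" accepts UTF-8 input with or without a byte order mark.
    Pass another codec name if the file is currently in a legacy encoding.
    """
    source = Path(path)
    # Work on bytes so that the original line endings are preserved.
    text = source.read_bytes().decode(source_encoding)
    cleaned = "".join(
        line.lstrip(" \t") for line in text.splitlines(keepends=True)
    )
    # Written explicitly as UTF-8, without a byte order mark.
    source.write_bytes(cleaned.encode("utf-8"))


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        strip_leading_whitespace(name)
```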


#20
Jukka K. Korpela
Re: Notepad and UTF-8 - 03-15-2008, 02:08 PM

Scripsit Andreas Prilop:

Quote:
> There is one reason *not* to use UTF-8. Browsers usually take the
> typeface depending on the page's charset. So with charset=iso-8859-5,
> the page is displayed in the reader's preferred Cyrillic typeface.
> But with charset=utf-8, the page is displayed in the reader's
> preferred *Latin* typeface, which might be less suitable.

Such behavior is worth noting (especially since it may confuse authors),
but I don't think it's a serious consideration in selecting the
encoding.

Logically, the browser uses the encoding to make a guess, based on some
fuzzy notion that confuses languages, scripts, encodings, and fonts. It
then uses that guess to pick a default font from an internal table in
the browser.
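
The two-step guess can be pictured as a pair of lookup tables. The Python sketch below is purely illustrative: the table contents, the placeholder font names and the function name are all invented, and no browser's actual tables are being reproduced.

```python
# Hypothetical mapping from declared encoding to a guessed script.
CHARSET_TO_SCRIPT = {
    "iso-8859-1": "latin",
    "iso-8859-5": "cyrillic",
    "iso-8859-7": "greek",
    "gb2312": "simplified-chinese",
    "big5": "traditional-chinese",
}

# Hypothetical per-script default fonts, standing in for user preferences.
DEFAULT_FONTS = {
    "latin": "preferred Latin font",
    "cyrillic": "preferred Cyrillic font",
    "greek": "preferred Greek font",
    "simplified-chinese": "preferred Simplified Chinese font",
    "traditional-chinese": "preferred Traditional Chinese font",
}


def default_font(charset):
    """Guess a script from the encoding, then look up that script's font."""
    # An encoding that covers every script, such as utf-8, gives no hint,
    # so this sketch falls back to the Latin entry.
    script = CHARSET_TO_SCRIPT.get(charset.lower(), "latin")
    return DEFAULT_FONTS[script]


if __name__ == "__main__":
    for charset in ("iso-8859-5", "utf-8"):
        print(charset, "->", default_font(charset))
```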

In practice, this is of little importance, since
1) most authors set the font somehow, so that those defaults don't
matter
2) most users don't even know about these default font settings or don't
understand them (no wonder) or don't bother setting them (since they
have so little impact, partly due to item 1)
3) when users don't touch those settings, the browser defaults will be
used, and we have little reason to expect that one encoding produces any
better results than another.

Quote:
> This reason is even more important for Chinese text.

For Chinese text, the preferred way is to declare the language in
lang="..." attributes in HTML markup. This is one of the few uses of
that attribute in practice so far.

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/


