一天到晚游泳的鱼


While modifying an ASP.NET program left behind by a predecessor, I ran into a strange problem: the moment I touched the file, the program threw errors. Even adding a space somewhere insignificant broke it, and deleting the space again didn't help. Later I compared the file before and after my edit and found that the first two characters at the very beginning of the file were different, which puzzled me. Eventually I found the root of the problem online:
http://www.aspnetresources.com/blog/unicode_in_vsnet.aspx
Looking at a colleague's machine later, I saw that Visual Studio .NET's default file encoding varies with your system configuration (presumably the regional/locale setting). To pick an encoding you have to go through Advanced Save Options, and you have to do it every single time, because the IDE does not remember your previous choice. For an ASP.NET application you can configure web.config as follows:
<globalization
   requestEncoding="utf-8"
   responseEncoding="utf-8"
   fileEncoding="utf-8" />
However, this only guarantees that all page files are parsed as utf-8; the remaining files still have to be fixed by hand.
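For those remaining files, a small sketch like the following can re-save one as UTF-8 with a signature. This is only an illustration: the path is hypothetical, and it assumes the file is currently readable with the system's default ANSI encoding, otherwise the characters would be mangled on the way through.

using System.IO;
using System.Text;

class ResaveAsUtf8
{
    static void Main()
    {
        // Hypothetical path; point this at the file you want to convert.
        string path = @"C:\MySite\SomeHelper.cs";
        string text;

        // Assumption: the file was saved in the system ANSI code page.
        // The "true" argument lets StreamReader honor a BOM if one is present.
        using (StreamReader reader = new StreamReader(path, Encoding.Default, true))
        {
            text = reader.ReadToEnd();
        }

        // Write it back as UTF-8 with a signature (BOM).
        using (StreamWriter writer = new StreamWriter(path, false, new UTF8Encoding(true)))
        {
            writer.Write(text);
        }
    }
}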

Below is the original article I found. In case some readers have trouble reaching the overseas site, I'm pasting it here (does that count as a copyright violation?).

It all began a couple of weeks ago when I worked on a Spanish site. I didn't expect a little paragraph of straightforward markup to cause this much trouble and help me understand the <globalization> section of web.config better. What was supposed to look like this:

[Image: correct rendering of the Spanish text]

Rendered as:

[Image: garbled rendering of the Spanish text]

Back then I let it go, but in the back of my mind I knew it was wrong. All along I've read and heard that text in .NET was Unicode by default.

Part I

In my C++ days I had to add a #define UNICODE directive to have the compiler import the right set of libraries for wide-character (Unicode) text manipulation and use LPWSTR or LPTSTR declarations of string pointers. When allocating memory you had to really watch it if it was for Unicode. It was a major pain.

.NET makes it a lot easier to develop applications with internationalization and localization in mind. This ease of handling Unicode was one of the selling points for me when I was first introduced to .NET. According to Jeffrey Richter:

In the CLR, all characters are represented as 16-bit Unicode code values and strings are composed of 16-bit Unicode code values. This makes working with characters and strings easy at run time.

This is where we need to talk about encoding. Quoting Jeffrey Richter further:

At times, however, you want to save strings to a file and transmit them over a network. If the strings consist mostly of characters readable by English-speaking people, then saving or transmitting a set of 16-bit values isn't very efficient because half of the bytes written would contain zeros. Instead, it would be more efficient to encode the 16-bit value into a compressed array of bytes and then decode the array of bytes back into an array of 16-bit values.
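A quick sketch of the point Richter makes: encoding a mostly Latin string as raw 16-bit values costs two bytes per character, while UTF-8 gets it down to one byte per ASCII character (accented characters such as "ñ" take two). The sample string is my own, not from the article.

using System;
using System.Text;

class EncodingSizes
{
    static void Main()
    {
        string text = "España"; // six characters, one of them non-ASCII

        byte[] utf16 = Encoding.Unicode.GetBytes(text); // raw 16-bit code units
        byte[] utf8  = Encoding.UTF8.GetBytes(text);    // compressed array of bytes

        Console.WriteLine("UTF-16: {0} bytes", utf16.Length); // 12
        Console.WriteLine("UTF-8:  {0} bytes", utf8.Length);  // 7 ("ñ" takes two bytes)

        // Decoding the byte array gives the original 16-bit string back.
        Console.WriteLine(Encoding.UTF8.GetString(utf8) == text); // True
    }
}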

Knowing this I thought it was strange that Spanish characters came out all garbled in Firefox, Opera and Internet Explorer/Win.

All text processors I used over the years prepended file contents with a "Unicode signature". In geek parlance it's known as the Byte Order Mark (BOM). In a nutshell, BOM gives a hint to a text processor whether the file is encoded in some UTF format (UTF-16, UTF-8, UTF-7, etc). Even though the main purpose of BOM is to define the ordering of bytes in a text stream and therefore it's not essential that a UTF-8 encoded stream contain BOM (the Byte Order Mark (BOM) FAQ explains why), serious text processors nevertheless store it to avoid ambiguity.
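To make the BOM concrete, here is a small sketch that prints the signature bytes the .NET encodings emit; note that new UTF8Encoding(false), the "without signature" flavor, emits no preamble at all.

using System;
using System.Text;

class BomBytes
{
    static void Main()
    {
        Print("UTF-8 with signature",    new UTF8Encoding(true));    // EF-BB-BF
        Print("UTF-8 without signature", new UTF8Encoding(false));   // (no BOM)
        Print("UTF-16 little-endian",    Encoding.Unicode);          // FF-FE
        Print("UTF-16 big-endian",       Encoding.BigEndianUnicode); // FE-FF
    }

    static void Print(string name, Encoding encoding)
    {
        byte[] bom = encoding.GetPreamble();
        Console.WriteLine("{0}: {1}", name,
            bom.Length == 0 ? "(no BOM)" : BitConverter.ToString(bom));
    }
}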

Since text in .NET is all Unicode by default, my senses were telling me something was wrong with the format of my source file itself. There were no visible screw-ups because it was just plain HTML, so I decided to look at it in HEX. To my surprise the ASPX page had no "Unicode signature"!

This led me to the mighty Google newsgroups, where I found advice to do the following: go to the File menu and select Advanced Save Options.

Note: you must be editing a source file to have this option.

Whoa! You can select different encodings. By default, UTF-8 without signature is selected:

[Screenshot: the Advanced Save Options dialog]

I saved the file with UTF-8 with signature instead, requested my Spanish page and everything looked correct this time around. I also noticed that as long as I had the page opened in Visual Studio .NET it would save with the Unicode signature. But if I closed and reopened it any memory of my previous selection was lost, and I was back to saving it as UTF-8 without signature without realizing it!

Now, here's what I don't understand. Why are there two options: with and without BOM? I hope it's not about saving 3 bytes because these savings are ridiculous. Since this feature made it this far into VS.NET somebody must've given much thought to it and there's gotta be a reason. I'm very curious what this reason is. You use Unicode to play safe and have an ability to display pretty much any character in the world, so why this signature or no-signature saga? I say signature for UTF-8 it is. Always. And if there's a strong reason to keep both options, I advocate the one with the signature. Without the signature VS.NET fools itself and doesn't realize there are international characters in the source.

Here's another thought. Targeting only the US market, as big as it is, is short-term thinking. Locking yourself to a local codepage is narrow-minded thinking. You never know where your code ends up or who you hire to work with it. Stick to Unicode.

<digression>In one of my CS classes at BYU there was a guy who was surprised to find out people wrote Pascal code in English all around the globe! On a different note, when I worked with Sony developers they sent us some C++ code with comments all in Japanese. I can read 3 languages but Japanese isn't among them (I'm working on it. This language fascinates me). We never deciphered the comments. This is to back up my point that you never know where your code lands later in time.</digression>

The thought that saving reverts to UTF-8 without signature didn't feel right. I started digging deeper. Incidentally, I found a peculiar attribute of the <globalization> tag in web.config. The attribute is fileEncoding. MSDN defines it as follows:

Specifies the default encoding for .aspx, .asmx, and .asax file parsing. Unicode and UTF-8 files saved with the byte order mark prefix will be automatically recognized regardless of the value of fileEncoding.

"Automatically recognized" and "regardless" felt good. Thus I modified my web.config to contain this <globalization> element:

<globalization 
   requestEncoding="utf-8" 
   responseEncoding="utf-8"
   fileEncoding="utf-8" />

By the way, when you create a brand new web project both the request and response encodings are set to UTF-8 as shown above.

The setting of fileEncoding seemed to fix my problem. I saved my Spanish page with and without the BOM, and both times it came out just right in web browsers. fileEncoding seems to tell the page parser to treat a page as Unicode no matter what, which I welcome.

Then I started thinking, "How does it do it? How is this setting enforced?" Armed with Reflector I found a class in the System.Web.Configuration namespace called GlobalizationConfig. The class is marked as internal and therefore is not documented on MSDN. Its LoadValuesFromConfigurationXml method reads the values of fileEncoding, requestEncoding, responseEncoding, culture and uiCulture from web.config and initializes properties with corresponding names.

[Diagram: the GlobalizationConfig class]
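Since the diagram is not reproduced here, the following is a rough sketch of the shape of that class. Only the member names come from what Reflector shows; the exact signatures and field layout are my own guess.

using System.Globalization;
using System.Text;
using System.Xml;

// Not the real source: System.Web.Configuration.GlobalizationConfig is internal,
// so this is a reconstruction from the member names mentioned above.
internal class GlobalizationConfig
{
    private Encoding _fileEncoding;
    private Encoding _requestEncoding;
    private Encoding _responseEncoding;
    private CultureInfo _culture;
    private CultureInfo _uiCulture;

    // Reads fileEncoding, requestEncoding, responseEncoding, culture and uiCulture
    // from the <globalization> element and fills in the fields above.
    internal void LoadValuesFromConfigurationXml(XmlNode section) { /* ... */ }

    internal Encoding FileEncoding     { get { return _fileEncoding; } }
    internal Encoding RequestEncoding  { get { return _requestEncoding; } }
    internal Encoding ResponseEncoding { get { return _responseEncoding; } }
    internal CultureInfo Culture       { get { return _culture; } }
    internal CultureInfo UICulture     { get { return _uiCulture; } }
}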

Tracing the FileEncoding property further, I arrived at the ReaderFromFile method found in System.Web.UI.Util:

internal static TextReader ReaderFromFile (
       string filename, HttpContext context, string configPath)
{
 TextReader reader1 = null;
 GlobalizationConfig config1;
 Encoding encoding1 = null;

 if (context != null)
 {
  if (configPath == null)
  {
   config1 = ((GlobalizationConfig) context.GetConfig(
              "system.web/globalization"));
  }
  else
  {
   config1 = ((GlobalizationConfig) context.GetConfig(
               "system.web/globalization", configPath));
  }

  if (config1 != null)
  { encoding1 = config1.FileEncoding; }
 }

 // No fileEncoding configured: fall back to the system ANSI code page
 if (encoding1 == null)
 { encoding1 = Encoding.Default; }

 try
 { reader1 = new StreamReader(filename, encoding1, true, 4096); }
 catch (UnauthorizedAccessException)
 { ... }

 return reader1;
}

Does system.web/globalization look familiar? As you can see, an instance of StreamReader is created with a certain encoding. If you specify no file encoding in web.config a default one is used. What does it default to, though?

internal static Encoding CreateDefaultEncoding()
{
 int num1 = Win32Native.GetACP();

 if (num1 == 1252)
 { return new CodePageEncoding(num1); }

 return Encoding.GetEncoding(num1);
}

GetACP is an old Windows API function which "retrieves the current ANSI code-page identifier for the system". You can find the CreateDefaultEncoding method in System.Text.Encoding.

Let's recap what we've learned. web.config contains an important section, <globalization>, with an attribute, fileEncoding, that controls which encoding source files are read with. If you work with Unicode and (might) have international characters in your ASPX pages, setting the fileEncoding attribute to "utf-8" seems to be a good idea. Otherwise your pages will be processed according to the current ANSI code-page settings.
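To see why the fallback to the ANSI code page matters, here is a sketch of the failure mode itself: UTF-8 bytes on disk, decoded with Windows-1252, turn "ñ" into two junk characters, which is exactly the kind of garbling shown at the top of this post. The sample string is mine.

using System;
using System.Text;

class WrongCodePage
{
    static void Main()
    {
        string original = "mañana";

        // The bytes that actually sit in a UTF-8 encoded file (no BOM to give a hint).
        byte[] fileBytes = Encoding.UTF8.GetBytes(original);

        // What a parser sees if it falls back to the ANSI code page (1252 here).
        string misread = Encoding.GetEncoding(1252).GetString(fileBytes);

        Console.WriteLine(original); // mañana
        Console.WriteLine(misread);  // maÃ±ana -- "ñ" decoded as two Latin-1 characters
    }
}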

Part II

While we're on this subject let's also talk about the GlobalizationConfig config class and response/request encodings. The subject is quite important and I decided to cover it here. Members of the <globalization> section of web.config affect the encoding of HTTP responses.

The HttpResponse class has a property, ContentEncoding, which participates in construction of correct HTTP headers.

A response should contain the Content-Type header which is of utmost importance. A typical response of an ASP.NET page is shown below:

HTTP/1.x 200 OK
Date: Sun, 18 Jul 2004 05:06:38 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 1.1.4322
Cache-Control: private
Content-Type: text/html; charset=utf-8
Content-Length: 4262

If you request an RSS feed you see something along these lines:

HTTP/1.x 200 OK
...
Content-Type: text/xml
...

The Content-Type header tells the browser what it is it's receiving. In ASP.NET this header is built by the GenerateResponseHeaders method of System.Web.HttpResponse:

private ArrayList GenerateResponseHeaders(bool forCache)
{
 ...
 text2 = this._contentType;

 if ((this._contentType.IndexOf("charset=") < 0) && 
     (this._customCharSet || ((this._httpWriter != null) && 
     this._httpWriter.ResponseEncodingUsed)))
 {
  text3 = this.Charset;

  if (text3.Length > 0)
  { text2 = this._contentType + "; charset=" + text3; }

 }
 ...
}

Two important points here: what are this._contentType and this.Charset that are used to build the Content-Type header? The class has a public property, ContentType, which, I'm sure, most of you have set more than once via HttpContext.Response.ContentType="...". It is pre-initialized to "text/html" in the class constructor:

public HttpResponse(TextWriter writer)
{
 this._statusCode = 200;
 this._bufferOutput = true;
 this._contentType = "text/html";
  ...
}

The other significant half, CharSet, is a public property of the same class which gets its value from... content encoding!

public string get_Charset()
{
 if (this._charSet == null)
 { this._charSet = this.ContentEncoding.WebName; }

 return this._charSet;
}

//---------------------------------------------------
public Encoding get_ContentEncoding()
{
 GlobalizationConfig config1;

 if (this._encoding == null)
 {
   config1 = ((GlobalizationConfig) this._context.GetLKGConfig(
               "system.web/globalization"));

  if (config1 != null)
  { this._encoding = config1.ResponseEncoding; }

  if (this._encoding == null)
  { this._encoding = Encoding.Default; }
 }
 return this._encoding;
}

See the GlobalizationConfig class we've talked about? It's a small world, after all. As you can see, response encoding is read from the <globalization> section. If you omit declaring responseEncoding you're pretty much taking chances because a default one will be used for you.
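If you don't want to take chances, the encoding can also be forced per response rather than per application. A minimal sketch inside a page (the page class name is hypothetical; only the two Response properties matter):

using System;
using System.Text;
using System.Web.UI;

public class SpanishPage : Page
{
    protected override void OnLoad(EventArgs e)
    {
        base.OnLoad(e);

        // Overrides whatever responseEncoding web.config would have supplied
        // and ends up in the Content-Type header as "charset=utf-8".
        Response.ContentEncoding = Encoding.UTF8;
        Response.ContentType = "text/html";
    }
}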

Some people—including myself—use the http-equiv="content-type" meta tag. In the course of this research I learned that this tag has no bearing on anything, because ASP.NET will always set a response encoding—yours or a default one. That http-equiv meta tag is more of a hint to the browser, but like I said, ASP.NET takes over anyway, so you can omit it. Also, it causes problems in old versions of Netscape.

Conclusion

If you are still awake and reading this, congratulations! You made it! Encoding is no easy subject. I hope this post shed some light on this complicated topic. I do not claim to be an authority on Unicode; what I covered here was my research in the face of a strange bug. Pay attention to the <globalization> section, because it is a very important one even though its purpose is documented rather poorly.

posted on 2005-09-25 13:04  badog