Character Encoding Issues in C#

Suppose we have a salary file (salary.txt) in the following format:

李富贵             0.01 
容闳           2,057.38 
欧阳吹雪 987,654,321.09

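The file is encoded in GB18030 and each line is 23 columns wide: columns 1-8 hold the employee name and columns 10-23 hold the salary. A Chinese character here occupies two columns (and two bytes in GB18030), so "23 wide" really means 23 bytes. We now want a C# program that computes the average salary of the staff, as follows: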

using System;
using System.IO;
using System.Text;

namespace Skyiv.Ben.Test
{
    sealed class Avg
    {
        static void Main()
        {
            try
            {
                Encoding encode = Encoding.GetEncoding("GB18030"); // line 13
                using (StreamReader sr = new StreamReader("salary.txt", encode))
                {
                    decimal avg = 0; // running total until the final division
                    long rows = 0;
                    for (; ; rows++)
                    {
                        string s = sr.ReadLine();
                        if (s == null) break; // end of file
                        decimal salary = Convert.ToDecimal(s.Substring(9, 14)); // line 22
                        avg += salary;
                    }
                    avg /= rows;
                    Console.WriteLine(avg.ToString("N2"));
                } // line 27
            }
            catch (Exception ex)
            {
                Console.WriteLine("Error: " + ex.Message);
            }
        }
    }
}

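Running it produces:
Error: Index and length must refer to a location within the string.
Parameter name: length
A little analysis (or a debugger) shows that line 22 of the program is at fault: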
decimal salary = Convert.ToDecimal(s.Substring(9, 14));
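The reason is that a C# string is stored as UTF-16, so each full-width Chinese character counts as a single char. The first line of salary.txt is therefore only 20 characters long, the second 21, and the third 19. None of them reaches 23, so s.Substring(9, 14) throws. The 23-column layout is measured in GB18030 bytes, where each of these Chinese characters takes two bytes, so the field has to be sliced by byte offset rather than by char index. Changing that line to the following is enough: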
decimal salary = Convert.ToDecimal(encode.GetString(encode.GetBytes(s), 9, 14));
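Recompile and run again, and the program prints the correct result: 329,218,792.83.
A better approach, though, is to replace lines 13-27 of the program (the whole body of the try block) with the following: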
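To make the char-versus-byte mismatch concrete, here is a minimal sketch that rebuilds the three sample lines in memory and prints both counts for each. The class name WidthDemo and the helpers Gb18030, SampleLines and ByteField are made up for this illustration. It assumes .NET Core 3.0 or later (including .NET 5+), where the code-pages provider must be registered before GB18030 can be looked up; on .NET Framework that registration line can be dropped.

```csharp
using System;
using System.Text;

static class WidthDemo
{
    // Returns the GB18030 encoding. On .NET Core / .NET 5+ the code-pages
    // provider has to be registered first (assumption: that is the target runtime).
    public static Encoding Gb18030()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("GB18030");
    }

    // Rebuilds the three lines of salary.txt: name padded to 8 bytes,
    // one space, salary right-aligned in 14 columns.
    public static string[] SampleLines()
    {
        return new[]
        {
            "李富贵" + new string(' ', 2) + " " + "0.01".PadLeft(14),
            "容闳" + new string(' ', 4) + " " + "2,057.38".PadLeft(14),
            "欧阳吹雪" + " " + "987,654,321.09".PadLeft(14),
        };
    }

    // Slices a fixed-width field by byte offset instead of by char index.
    public static string ByteField(Encoding enc, string line, int offset, int count)
    {
        byte[] bytes = enc.GetBytes(line);
        return enc.GetString(bytes, offset, count);
    }

    static void Main()
    {
        Encoding enc = Gb18030();
        foreach (string line in SampleLines())
        {
            Console.WriteLine("{0} chars, {1} bytes, salary field = [{2}]",
                line.Length, enc.GetByteCount(line), ByteField(enc, line, 9, 14));
        }
    }
}
```

The char counts come out as 20, 21 and 19, while every line is 23 bytes long, which is why the byte offsets 9 and 14 work on the encoded bytes but not on the string itself.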

const int bytesPerRow = 23 + 2; // 23 data bytes plus CR LF on every line
Encoding encode = Encoding.GetEncoding("GB18030");
using (BinaryReader br = new BinaryReader(new FileStream("salary.txt", FileMode.Open, FileAccess.Read)))
{
    if (br.BaseStream.Length % bytesPerRow != 0) throw new Exception("Bad file length");
    decimal avg = 0; // running total until the final division
    long rows = br.BaseStream.Length / bytesPerRow;
    for (long i = 0; i < rows; i++)
    {
        byte[] bs = br.ReadBytes(bytesPerRow);
        decimal salary = Convert.ToDecimal(encode.GetString(bs, 9, 14)); // bytes 9..22 hold the salary
        avg += salary;
    }
    avg /= rows;
    Console.WriteLine(avg.ToString("N2"));
}
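Reading fixed-size byte records avoids decoding each line into a string only to encode it back into bytes. The following self-contained sketch applies the same idea to an in-memory copy of the sample data so that it can be run without a file. The names RecordAverage, Average and Row are invented for this example. It assumes every record ends with CR LF, rejects an empty stream instead of dividing by zero, and parses with the invariant culture so that "2,057.38" is read the same way regardless of the machine's regional settings. As before, the provider registration assumes .NET Core 3.0 or later.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;

static class RecordAverage
{
    const int DataBytes = 23;              // name (8) + space (1) + salary (14)
    const int BytesPerRow = DataBytes + 2; // plus CR LF

    // Averages the salary field over fixed-width byte records.
    public static decimal Average(Stream stream, Encoding enc)
    {
        if (stream.Length == 0 || stream.Length % BytesPerRow != 0)
            throw new InvalidDataException("unexpected file length");

        long rows = stream.Length / BytesPerRow;
        decimal total = 0;
        // leaveOpen: true, so the caller stays responsible for the stream.
        using (var reader = new BinaryReader(stream, enc, true))
        {
            for (long i = 0; i < rows; i++)
            {
                byte[] record = reader.ReadBytes(BytesPerRow);
                string field = enc.GetString(record, 9, 14);
                total += decimal.Parse(field, NumberStyles.Number, CultureInfo.InvariantCulture);
            }
        }
        return total / rows;
    }

    // Builds one sample record: name, padding spaces, separator, salary, CR LF.
    public static string Row(string name, int padding, string amount)
    {
        return name + new string(' ', padding) + " " + amount.PadLeft(14) + "\r\n";
    }

    static void Main()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding enc = Encoding.GetEncoding("GB18030");
        string text = Row("李富贵", 2, "0.01")
                    + Row("容闳", 4, "2,057.38")
                    + Row("欧阳吹雪", 0, "987,654,321.09");
        using (var ms = new MemoryStream(enc.GetBytes(text)))
        {
            decimal avg = Average(ms, enc);
            Console.WriteLine(avg.ToString("N2", CultureInfo.InvariantCulture)); // 329,218,792.83
        }
    }
}
```

Passing a FileStream opened on salary.txt instead of the MemoryStream should give the same figure, provided the file really uses CR LF line endings on every line, including the last.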

Now suppose our task is to generate salary.txt. Will the following program work?

using System;
using System.IO;
using System.Text;

namespace Skyiv.Ben.Test
{
    sealed class Salary
    {
        static void Main()
        {
            try
            {
                Encoding encode = Encoding.GetEncoding("GB18030");
                string [] names = {"李富贵", "容闳", "欧阳吹雪"};
                decimal [] salarys = {0.01m, 2057.38m, 987654321.09m};
                using (StreamWriter sw = new StreamWriter("salary.txt", false, encode))
                {
                    for (int i = 0; i < names.Length; i++)
                        sw.WriteLine("{0,-8} {1,14:N2}", names[i], salarys[i]); // line 19
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine("Error: " + ex.Message);
            }
        }
    }
}

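Running it shows that the lines of the generated file have different widths, because {0,-8} pads the name to 8 chars rather than 8 bytes. What can we do? Just change line 19 of the program (the sw.WriteLine call) to: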
sw.WriteLine("{0} {1,14:N2}", encode.GetString(encode.GetBytes(names[i].PadRight(8)), 0, 8), salarys[i]);
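and that is all it takes. PadRight(8) makes the name at least 8 bytes long once encoded, and keeping only the first 8 bytes leaves the name followed by just enough spaces. Note that a name longer than 8 bytes would be truncated, possibly in the middle of a multi-byte character.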
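As an alternative to encoding, truncating and decoding, the padding can be computed directly from the encoded byte count. The sketch below is one way to do that, not the author's method. FixedWidthWriter, PadRightBytes and FormatRow are hypothetical names. It assumes a space encodes to a single byte (true for GB18030, not for UTF-16), throws instead of silently cutting a name that does not fit, and formats with the invariant culture so the separators match the sample file. The provider registration again assumes .NET Core 3.0 or later.

```csharp
using System;
using System.Globalization;
using System.Text;

static class FixedWidthWriter
{
    // Pads s with spaces until it occupies exactly `width` bytes in enc.
    // Assumes a space is one byte in enc (holds for GB18030, not for UTF-16).
    public static string PadRightBytes(string s, int width, Encoding enc)
    {
        int used = enc.GetByteCount(s);
        if (used > width)
            throw new ArgumentException("value does not fit in the field", nameof(s));
        return s + new string(' ', width - used);
    }

    // Formats one record: 8-byte name, one space, salary in 14 columns.
    public static string FormatRow(string name, decimal salary, Encoding enc)
    {
        return string.Format(CultureInfo.InvariantCulture, "{0} {1,14:N2}",
            PadRightBytes(name, 8, enc), salary);
    }

    static void Main()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding enc = Encoding.GetEncoding("GB18030");
        string[] names = { "李富贵", "容闳", "欧阳吹雪" };
        decimal[] salaries = { 0.01m, 2057.38m, 987654321.09m };
        for (int i = 0; i < names.Length; i++)
        {
            string row = FormatRow(names[i], salaries[i], enc);
            // Every row should report 23 bytes.
            Console.WriteLine("{0} bytes: {1}", enc.GetByteCount(row), row);
        }
    }
}
```

Writing each row with a StreamWriter opened with the same encoding would then produce lines of equal byte width.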

If salary.txt were encoded in UTF-16 instead, would it be enough to change
Encoding encode = Encoding.GetEncoding("GB18030");
in these programs to:
Encoding encode = Encoding.Unicode;
? That question is left for readers to think about.

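Imagine that, in the not-too-distant future, every operating system uses UTF-16 as its character encoding, and that a full-width character and a half-width character take up the same width on screen and in print (with a monospaced font, that is; with a proportional font even the half-width A and i differ in width). The notions of full-width and half-width would then be unnecessary (since everything is the same width, full-width is simply half-width), and the problem discussed in this article would disappear, just as programmers in English-speaking countries do not run into it today (for them the concept of a full-width character simply does not exist).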

Posted on 2005-08-31 15:08 by 银河
