忽略大小写Replace效率瓶颈IndexOf - birdshome - 博客园

忽略大小写Replace效率瓶颈IndexOf

在整理My Documents文件夹时，发现了一个StringHelper.rar包，顺手解开一看是原来做的一个关于忽略大小写替换的demo。关于那次测试的来龙去脉，可以参看文章"忽略字符串大小写替换的更高效实现"。当时虽然已经取得了非常高的效率，可是对于那个ReplaceEx的实现我似乎还是有些不太满意，因为使用了两个String.ToUpper操作。

当时"忽略...实现"一文的讨论中，我就对String.ToUpper()方法有些耿耿于怀，觉得他们可能仍然是效率瓶颈，所以在打算清理掉这个项目文件前再做一次试验。不用ToUpper的第一步就是自己实现IndexOf，因为String.IndexOf是case-sensitive的。于是三下五除二作了一个纯C#版的IndexOf：

private static int IndexOf2(string T, string P, int i)

{

int step = T.Length-P.Length;

for ( ; i <= step ; ++i )

{

for ( int j=0 ; j < P.Length ; ++j )

{

if ( T[i+j] != P[j] )

{

goto LOOP;

}

}

return i;

LOOP:;

}

return -1;

}

// 是不是觉得IndexOf2里面的goto用得很faint? 其实这个的效率比没有goto的单循环版本还偏高，真晕

在使用IndexOf2替换String.IndexOf后，却发现原来那个ReplaceEx的瓶颈不是ToUpper，而是IndexOf，因为如果把IndexOf2中的:

if ( T[i+j] != P[j] )

改成

if ( char.ToUpper(T[i+j]) != char.ToUpper(P[j]) )

然后去掉ReplaceEx中的两次String.ToUpper调用，结果却比原来的实现还要糟糕很多

。

替换了String.IndexOf的ReplaceEx2代码如下：

private static string ReplaceEx2(string original, string pattern, string replacement)

{

int count, position0, position1;

count = position0 = position1 = 0;

string upperString = original.ToUpper();

string upperPattern = pattern.ToUpper();

int inc = (original.Length/pattern.Length)*(replacement.Length-pattern.Length);

char [] chars = new char[original.Length + Math.Max(0, inc)];

while( (position1 = IndexOf2(upperString, upperPattern, position0)) != -1 )

{

for ( int i=position0 ; i < position1 ; ++i ) chars[count++] = original[i];

for ( int i=0 ; i < replacement.Length ; ++i ) chars[count++] = replacement[i];

position0 = position1+pattern.Length;

}

if ( position0 == 0 ) return original;

for ( int i=position0 ; i < original.Length ; ++i ) chars[count++] = original[i];

return new string(chars, 0, count);

}

在研究String.IndexOf为什么那么衰的过程中，发现原来.NET Framework提供了case-insensitive的IndexOf的说。这样一来就可以不用String.ToUpper了

，继续改造ReplaceEx得到ReplaceEx3：

private static string ReplaceEx3(string original, string pattern, string replacement)

{

int count, position0, position1;

count = position0 = position1 = 0;

int inc = (original.Length/pattern.Length)*(replacement.Length-pattern.Length);

char [] chars = new char[original.Length + Math.Max(0, inc)];

while( (position1 = CompareInfo.GetCompareInfo("en").IndexOf(original, pattern, position0, CompareOptions.IgnoreCase)) != -1 )

{

for ( int i=position0 ; i < position1 ; ++i ) chars[count++] = original[i];

for ( int i=0 ; i < replacement.Length ; ++i ) chars[count++] = replacement[i];

position0 = position1+pattern.Length;

}

if ( position0 == 0 ) return original;

for ( int i=position0 ; i < original.Length ; ++i ) chars[count++] = original[i];

return new string(chars, 0, count);

}

// 结果使用CompareInfo.IndexOf的ReplaceEx3还是很衰，只比Regex版Replace强一点点

    使用和"忽略...实现"一文中相同的测试数据，得到的新测试结果如下(正常颜色的结果使用的.NET Framework 1.1, v1.1.4322; 棕色的结果是使用的.NET Framework 2.0, v2.0.50727)：

    1、string segment = "中abc华aBc共和国";

regexp
vb
vbReplace
ReplaceEx
ReplaceEx2
CompareInfo

= 3.38022346415536s
= 1.86814329754201s
= 1.75438442595358s
= 0.755939651547892s
= 0.361114610935189s !!!
= 2.23862194776152s

= 2.91428994467174s
= 1.29490088824138s
= 2.19564749151079s
= 0.955377848301949s
= 0.225040333338455s
= 2.83741704602121s

    // ReplaceEx2 > ReplaceEx > vbReplace > vb > CompareInfo > regexp (Fx 1.1)
    // ReplaceEx2 > ReplaceEx > vb > vbReplace > CompareInfo > regexp (Fx 2.0)

    2、string segment = "中abc华aBc共abC和国";

regexp
vb
vbReplace
ReplaceEx
ReplaceEx2
CompareInfo

= 4.89033326861375s
= 3.52037563433341s
= 3.43475804885817s
= 1.03512792827021s
= 0.50198586691884s !!!
= 3.37182379324747s

= 4.38558168705799s
= 1.78433711547138s
= 3.90191323198898s
= 1.34666193608406s
= 0.353885479858474s
= 4.24194024659559s

    // ReplaceEx2 > ReplaceEx > CompareInfo > vbReplace > vb > regexp (Fx 1.1)
    // ReplaceEx2 > ReplaceEx > vb > vbReplace > CompareInfo > regexp (Fx 2.0)

    3、string segment = "中abc华aBc共abC和Abc国";

regexp
vb
vbReplace
ReplaceEx
ReplaceEx2
CompareInfo

= 6.40228827965565s
= 5.46268701748407s
= 5.26379442079929s
= 1.17350473314346s
= 0.608962591614297s !!!
= 4.26881237699205s

= 5.7075686993738s
= 2.27852450520946s
= 5.92586404137956s
= 1.69969339678646s
= 0.403219403583416s
= 5.56232733489871s

    // ReplaceEx2 > ReplaceEx > CompareInfo > vbReplace > vb > regexp (Fx 1.1)
    // ReplaceEx2 > ReplaceEx > vb > CompareInfo > regexp > vbReplace (Fx 2.0)

    4、string segment = "ABCabcAbCaBcAbcabCABCAbcaBC";

regexp
vb
vbReplace
ReplaceEx
ReplaceEx2
CompareInfo

= 13.0521110923316s
= 22.04804465372s
= 21.8802247212984s
= 1.64543286926132s
= 0.829396473574155s !!!
= 7.74479242473555s

= 11.3952493962221s
= 3.76835407852115s
= 16.4102761663843s
= 2.70657611512078s
= 0.497913002909588s
= 10.5928678086181s

    // ReplaceEx2 > ReplaceEx > CompareInfo > regexp > vbReplace > vb (Fx 1.1)
    // ReplaceEx2 > ReplaceEx > vb > CompareInfo > regexp > vbReplace (Fx 2.0)

    看到这里，看来ReplaceEx2应该叫做ReplaceExPro，哈哈。那么是不是觉得ReplaceEx2就是无敌的快了呢？其实我也希望是，不过在极端的情况下，就是segment中没有可以替换的pattern时，ReplaceEx2就不行了:(

    5、string segment = "AaBbCc";

regexp
vb
vbReplace
ReplaceEx
ReplaceEx2
CompareInfo

= 0.471442599548267s
= 0.332260385048938s
= 0.282562727944473s
= 0.31851338647789s
= 0.15353375917889s :(
= 0.0689749674888848s

= 0.32674711450757s
= 0.283056086737281s
= 0.294417307227595s
= 0.302126387571605s
= 0.0874781571400834s
= 0.0842556805404039s

    // CompareInfo > ReplaceEx2 > vbReplace > ReplaceEx > vb > regexp (Fx 1.1)
    // CompareInfo > ReplaceEx2 > vb > vbReplace > ReplaceEx > regexp (Fx 2.0)

    测试进行到这里，似乎是在证明一个被我们误解的哲学命题：存在就是合理的。

    从上面的测试中，我们除了看到ReplaceEx2比ReplaceEx效率高出一点点外，我们还应该看到什么呢？原来我总是认为系统调用是最高效的，因为他们大都用C++编写，比如String.IndexOf最后七弯八拐调的就是native code: private static extern unsafe int IndexOfString(void* pSortingTable, int win32LCID, string source, string value, int startIndex, int count, int options);。那么造成底层代码效率反而降低的原因是什么呢？我认为除了多次的调用包裹外，底层代码还需要处理复杂的culture问题，所以带来了"大量"效率上的损耗。

    同样如果你能确定只进行英文的大小写转换操作，纯C#写的StringToUpper也比系统提供的String.ToUpper要快很多的说。当然这里不是说系统的玩意儿很差，底层的东西在字符串处理上由于需要处理多种场景，有效率上的牺牲也是没有办法的。

    整理完这篇文章我就可以删除StringHelper.rar了，宏观上的字符串效率研究基本也就是这些内容了，至于上面IndexOf2为什么没有使用KMP等算法来提高效率，因为这些试验的原始目的就是为了考察CLR下语言本身的能力，而不是比拼算法。

posted on 2005-12-09 15:17 birdshome 阅读(6284) 评论(8) 收藏举报

刷新页面返回顶部

导航

公告