Excerpt from Applied MS .NET Framework Programming
Below is Jeffrey Richter's description of String Interning and String Pooling in <Applied MS .NET Framework Programming>.
String Interning
As I said in the preceding section, comparing strings is a common operation for many
applications—it’s also a task that can hurt performance significantly. The reason for the
performance hit is that string comparisons require that each character in the string be
checked, one by one, until two characters are determined to be different. To compare a
string to see whether it contains the value “Hello”, a loop must compare two characters five
times. In addition, if you have several instances of the string “Hello” in memory, you’re
wasting memory because strings are immutable. You’ll use memory much more efficiently if
there is just one “Hello” string in memory and all references to the “Hello” string point to a
single object.
If your application frequently compares strings for equality or if you expect to have many
string objects with the same value, you can enhance performance substantially if you take
advantage of the string interning mechanism in the CLR. To understand how the string
interning mechanism works, examine the following code:
String s = "Hello";
Console.WriteLine(Object.ReferenceEquals("Hello", s));
Do you think this code displays “True” or “False”? Many people expect “False”. After all,
there are two “Hello” string objects and ReferenceEquals returns true only if the two
references passed to it point to the same object. Build and run this code, however, and you’ll
see that "True" is displayed. Let’s see why.
When the CLR initializes, it creates an internal hash table in which the keys are strings and
the values are references to string objects in the managed heap. Initially, the table is empty
(of course). When the JIT compiler compiles this method, it looks up each literal string in the
hash table. The compiler looks for the first “Hello” string and because it doesn’t find one,
constructs a new String object in the managed heap (that refers to this string) and adds
the "Hello" String and a reference to the object into the hash table. Then the JIT compiler
looks up the second "Hello" string in the hash table. It finds this string, so nothing happens.
Because there are no more literal strings in the code, the code can now execute.
When the code executes, it needs a reference to the “Hello” string. The CLR looks up “Hello”
in the hash table, finds it, and returns a reference to the previously created String object.
This reference is saved in the variable s. For the second line of code, the CLR again looks
up "Hello" in the hash table and finds it. The reference to the same String object is passed
along with s to Object’s static ReferenceEquals method, which returns true.
All literal strings embedded in the source code are always added to the internal hash table
when a method referencing the strings is JIT compiled. But what about strings that are
dynamically constructed at run time? What do you expect the following code to display?
String s1 = "Hello";
String s2 = "Hel";
String s3 = s2 + "lo";
Console.WriteLine(Object.ReferenceEquals(s1, s3));
Console.WriteLine(s1.Equals(s3));
In this code, the string referred to by s2 ("Hel"), and the literal string, "lo", are concatenated.
The result is a newly constructed string object, referred to by s3, that resides on the
managed heap.
This dynamically created string does contain “Hello”, but the string isn’t added to the internal
hash table. Therefore, ReferenceEquals returns false because the two references point
to different string objects. However, the call to Equals produces a result of true because
the strings do, in fact, represent the same set of characters. Obviously, ReferenceEquals
performs much better than Equals, and an application’s performance is greatly improved if
all the string comparisons simply compare references instead of characters. Plus, an
application requires fewer objects in the heap if there’s a way to collapse dynamic strings
with the same set of characters down to single objects in the heap.
Fortunately, the String type offers two static methods that allow you to do this:
public static String Intern(String str);
public static String IsInterned(String str);
The first method, Intern, takes a String and looks it up in the internal hash table. If the
string exists, a reference to the already existing String object is returned. If the string
doesn’t exist, it is added to the hash table and a reference to the interned string is
returned. If the application no longer holds a reference to the original String object, the
garbage collector is able to free the memory of that string. The preceding code can now be
rewritten using Intern as follows:
String s1 = "Hello";
String s2 = "Hel";
String s3 = s2 + "lo";
s3 = String.Intern(s3);
Console.WriteLine(Object.ReferenceEquals(s1, s3));
Console.WriteLine(s1.Equals(s3));
Now ReferenceEquals returns a value of true and the comparison is much faster. In
addition, the String object that s3 originally referred to is now free to be garbage collected.
This code actually executes slower than the previous version because of the work that
String’s Intern method must perform. You should intern strings only if you intend to
compare a string multiple times in your application. Otherwise, you’ll hurt performance
instead of improve it.
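The effect of Intern on strings built at run time can be seen in a small, self-contained sketch. The class name InternSketch and the helper Build are hypothetical names introduced only for this illustration; StringBuilder is used so that the value is certainly constructed at run time rather than emitted as a literal.

```csharp
using System;
using System.Text;

static class InternSketch {
    // Hypothetical helper: builds the value at run time, so the result is a
    // new heap object rather than a compile-time literal.
    public static String Build(String head, String tail) {
        return new StringBuilder(head).Append(tail).ToString();
    }

    public static void Main() {
        String a = Build("Hel", "lo");
        String b = Build("Hel", "lo");

        // Two separate objects that hold the same characters.
        Console.WriteLine(Object.ReferenceEquals(a, b)); // False
        Console.WriteLine(a.Equals(b));                  // True

        // Intern maps both values to one shared object.
        a = String.Intern(a);
        b = String.Intern(b);
        Console.WriteLine(Object.ReferenceEquals(a, b)); // True
    }
}
```

Each call to Intern is a hash table lookup, so a pattern like this only pays off when the interned references are compared many times afterward.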
Note that the garbage collector can’t free the strings the internal hash table refers to
because the hash table holds the reference to those String objects. String objects
referred to by the internal hash table can’t be freed until there are no AppDomains in the
process that refer to the string object. Also note that string interning occurs on a per-process
basis, meaning that a single string object can be accessed from multiple AppDomains,
conserving memory usage. The capability of multiple AppDomains to access a single string
also improves performance since strings never have to be marshaled across AppDomains
within a single process; just the reference is marshaled.
As I mentioned earlier, the String type also offers a static IsInterned method. Like the
Intern method, the IsInterned method takes a String and looks it up in the internal
hash table. If the string is in the hash table, IsInterned returns a reference to the interned
string object. If the string isn’t in the hash table, however, IsInterned returns null; it
doesn’t add the string to the hash table.
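A minimal sketch of the difference between the two methods follows. The class name IsInternedSketch and the helper BuildUnique are hypothetical; the helper embeds a fresh GUID on the assumption that such a value is not already in the hash table.

```csharp
using System;
using System.Text;

static class IsInternedSketch {
    // Hypothetical helper: builds a run-time value that is assumed not to be
    // in the intern table yet (it embeds a newly generated GUID).
    public static String BuildUnique() {
        return new StringBuilder("key-").Append(Guid.NewGuid().ToString()).ToString();
    }

    public static void Main() {
        String s = BuildUnique();

        // Not in the table: IsInterned returns null and adds nothing,
        // so asking a second time gives the same answer.
        Console.WriteLine(String.IsInterned(s) == null); // True
        Console.WriteLine(String.IsInterned(s) == null); // True

        // Intern adds the value; after that, the lookup succeeds.
        String.Intern(s);
        Console.WriteLine(String.IsInterned(s) != null); // True
    }
}
```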
The C# compiler uses the IsInterned method to allow switch/case statements to work
efficiently on strings. For example, you can write the following C# code:
using System;
class App {
   static void Main() {
      Lookup("Jeff", "Richter");
      Lookup("Fred", "Flintstone");
   }
   static void Lookup(String firstName, String lastName) {
      switch (firstName + " " + lastName) {
         case "Jeff Richter":
            Console.WriteLine("Jeff");
            break;
         default:
            Console.WriteLine("Unknown");
            break;
      }
   }
}
I compiled this code and used ILDasm.exe to examine the IL, which follows. I’ve inserted
comments to fully explain what’s going on.
.method private hidebysig static void Lookup(string firstName,
                                             string lastName) cil managed
{
  // Code size 53 (0x35)
  .maxstack 3
  .locals (object V_0)
  // Concatenate firstName, " ", and lastName into a new String.
  IL_0000: ldarg.0
  IL_0001: ldstr " "
  IL_0006: ldarg.1
  IL_0007: call string [mscorlib]System.String::Concat(string,
                                                       string,
                                                       string)
  // Duplicate the reference to the concatenated string.
  IL_000c: dup
  // Store a reference to the string in a temporary stack variable.
  IL_000d: stloc.0
  // If Concat returns null, branch to IL_002a.
  IL_000e: brfalse.s IL_002a
  // See if the concatenated string is in the internal hash table.
  IL_0010: ldloc.0
  IL_0011: call string [mscorlib]System.String::IsInterned(string)
  // Overwrite the temporary variable with a reference to the interned
  // string. Note that null indicates the string wasn’t in the hash table.
  IL_0016: stloc.0
  // Compare the reference of the interned 'switch' string with a
  // reference to the interned "Jeff Richter" string.
  IL_0017: ldloc.0
  IL_0018: ldstr "Jeff Richter"
  // If references refer to different String objects, branch to IL_002a.
  IL_001d: bne.un.s IL_002a
  // The references do match; display "Jeff" to the console and return.
  IL_001f: ldstr "Jeff"
  IL_0024: call void [mscorlib]System.Console::WriteLine(string)
  IL_0029: ret
  // Display "Unknown" to the console and return.
  IL_002a: ldstr "Unknown"
  IL_002f: call void [mscorlib]System.Console::WriteLine(string)
  IL_0034: ret
} // end of method App::Lookup
The important thing to notice in this code is that the IL code calls IsInterned, passing the
string specified in the switch statement. If IsInterned returns null, the string can’t
match any of the case strings, causing the default code to execute: "Unknown" is
displayed to the user. However, if IsInterned sees that the switch string does exist in
the internal hash table, it returns a reference to the hash table’s String object. The address
of the interned string is then compared with the addresses of the interned literal strings
specified by each case statement. Comparing the addresses is much faster than comparing
all the characters in each string, and the code determines very quickly which case
statement to execute.
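Read back into C#, the IL listing corresponds roughly to the hand-written sketch below. It is an approximation for illustration, not compiler output: the class name SwitchSketch and the method Classify are hypothetical, and Classify returns the text instead of printing it so that it is easier to exercise. The "Jeff" branch relies on the literal "Jeff Richter" already being in the internal hash table when the comparison runs, as the excerpt describes. The listing reflects the compiler version the author examined; other compiler versions may generate different code for a string switch.

```csharp
using System;

static class SwitchSketch {
    // Hypothetical C# rendering of the IL above: a reference comparison
    // against the interned literal replaces a character-by-character test.
    public static String Classify(String firstName, String lastName) {
        Object temp = String.Concat(firstName, " ", lastName);
        if (temp != null) {
            // Returns null when the concatenated value is not interned,
            // in which case it cannot match any case label.
            temp = String.IsInterned((String)temp);
            // Reference comparison only (the bne.un.s step in the IL).
            if (Object.ReferenceEquals(temp, "Jeff Richter")) {
                return "Jeff";
            }
        }
        return "Unknown";
    }

    public static void Main() {
        Console.WriteLine(Classify("Jeff", "Richter"));
        Console.WriteLine(Classify("Fred", "Flintstone"));
    }
}
```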
String Pooling
When compiling source code, your compiler must process each literal string and emit the
string into the managed module’s metadata. If the same literal string appears several times
in your source code, then emitting all these strings into the metadata will bloat the size of the
resulting file.
To remove this bloat, many compilers (including the C# compiler) write the literal string into the
module’s metadata only once. All code that references the string will be modified to refer to
the one string in the metadata. This ability of a compiler to merge multiple occurrences of a
single string into a single instance can reduce the size of a module substantially. This
process is nothing new—C/C++ compilers have been doing it for years. (Microsoft’s C/C++
compiler calls this string pooling.) Even so, string pooling is another way to improve the
performance of strings and just one more piece of knowledge you should have in your
repertoire.