Methods of the Java String Class

A Java object in the JVM is made up of the following parts:

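Object header: 8 bytes
Primitive-typed fields (int, float, char, and so on)

Of these, boolean and byte take 1 byte; char and short take 2 bytes; int and float take 4 bytes; long and double take 8 bytes.

References: 4 bytes each
Padding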

Beyond these, a Java object also carries extra information such as its class information, an ID, and its state inside the virtual machine.

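In the JDK's HotSpot virtual machine, an ordinary Java object needs 8 extra bytes for this header. The String class discussed here (the layout with offset and count fields, as found in JDK 6 and earlier) declares the following fields: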

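private final char value[];  // characters of the string
private final int offset;    // index of the first character used in value
private final int count;     // number of characters in the string
private int hash;            // cached hash code, defaults to 0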

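An empty String therefore occupies: the char array object (16 bytes: 8 for the object header, 4 for the array length, padded to 16) + the String's object header (8 bytes) + 3 ints (3 × 4 bytes) + 1 reference to the char array (4 bytes) = 40 bytes.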
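The arithmetic above can be written out as a small calculator. This is only a sketch under the text's assumptions (8-byte header, 4-byte references, 8-byte alignment); the class and method names are made up for illustration, and real JVMs (64-bit, compressed pointers, newer String layouts) will give different numbers.

```java
// Sketch of the size arithmetic above. Every constant is an assumption taken from the text.
final class StringFootprint {
    static final int OBJECT_HEADER = 8; // bytes per object header
    static final int REFERENCE = 4;     // bytes per reference
    static final int INT = 4;           // bytes per int field
    static final int CHAR = 2;          // bytes per char element
    static final int ARRAY_LENGTH = 4;  // length slot stored in an array object

    /** Rounds a size up to the next multiple of 8 bytes (the padding step). */
    static int align8(int bytes) {
        return (bytes + 7) / 8 * 8;
    }

    /** Estimated size of a char[] holding n characters. */
    static int charArrayBytes(int n) {
        return align8(OBJECT_HEADER + ARRAY_LENGTH + CHAR * n);
    }

    /** Estimated size of a String of n characters that owns its char[]. */
    static int stringBytes(int n) {
        // One reference (value) plus three ints (offset, count, hash).
        int stringObject = align8(OBJECT_HEADER + REFERENCE + 3 * INT);
        return stringObject + charArrayBytes(n);
    }

    public static void main(String[] args) {
        System.out.println(stringBytes(0));  // 40, the figure derived above
        System.out.println(stringBytes(10)); // 56
    }
}
```

Under this model the fixed overhead is 40 bytes per String, so short strings are dominated by overhead rather than by their characters.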

The source of the substring method is as follows:

public String substring(int beginIndex, int endIndex) {
    if (beginIndex < 0) { 
        throw new StringIndexOutOfBoundsException(beginIndex); 
    } 
    if (endIndex > count) {
        throw new StringIndexOutOfBoundsException(endIndex);
    }
    if (beginIndex > endIndex) {
        throw new StringIndexOutOfBoundsException(endIndex - beginIndex); 
    }
    return ((beginIndex == 0) && (endIndex == count)) ? this :
        new String(offset + beginIndex, endIndex - beginIndex, value);
}

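As the code shows, if the requested substring is the whole string, substring returns a reference to the same object; otherwise it returns a new String with a different offset and count that still references the same char array. For applications that cut a few short strings out of a large text, this wastes memory, because each short substring keeps the entire original char array reachable. For applications that take a fair number of substrings from ordinary-sized text, the sharing saves space.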

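If you only need part of a string, or only need to count characters, you can use a new String constructor to obtain an object that holds just the extracted characters. One way to do this is to go through toCharArray: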

String newString = new String(smallString.toCharArray());
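Current JDKs copy the characters in substring, so the sharing described above cannot be observed with a real String any more. The sketch below models it with a hypothetical SharedSlice class (not a JDK class) that has the same value/offset/count layout; compact() plays the role of new String(smallString.toCharArray()).

```java
import java.util.Arrays;

// Hypothetical model of the value/offset/count layout; for illustration only.
final class SharedSlice {
    private final char[] value;
    private final int offset;
    private final int count;

    SharedSlice(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private SharedSlice(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    /** Mirrors the substring logic shown above: the backing array is shared, not copied. */
    SharedSlice substring(int beginIndex, int endIndex) {
        if (beginIndex < 0 || endIndex > count || beginIndex > endIndex) {
            throw new StringIndexOutOfBoundsException();
        }
        return (beginIndex == 0 && endIndex == count)
                ? this
                : new SharedSlice(value, offset + beginIndex, endIndex - beginIndex);
    }

    /** Copies only the visible characters, releasing the large backing array. */
    SharedSlice compact() {
        return new SharedSlice(Arrays.copyOfRange(value, offset, offset + count), 0, count);
    }

    /** Number of chars kept alive by this slice. */
    int retainedChars() {
        return value.length;
    }

    @Override
    public String toString() {
        return new String(value, offset, count);
    }

    public static void main(String[] args) {
        SharedSlice big = new SharedSlice("0123456789abcdefghij");
        SharedSlice small = big.substring(2, 5);
        System.out.println(small + " retains " + small.retainedChars()); // 234 retains 20
        SharedSlice copy = small.compact();
        System.out.println(copy + " retains " + copy.retainedChars());   // 234 retains 3
    }
}
```

The three-character slice keeps all 20 characters alive until it is compacted, which is the waste the text describes for small extracts from large inputs.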

For String's split method, the documentation reads:

Splits this string around matches of the given regular expression. The array returned by this method contains each substring of this string that is terminated by another substring that matches the given expression or is terminated by the end of the string. The substrings in the array are in the order in which they occur in this string. If the expression does not match any part of the input then the resulting array has just one element, namely this string.

The source of split is as follows (split(regex) simply calls split(regex, 0)):

public String[] split(String regex, int limit) {
    return Pattern.compile(regex).split(this, limit);
}

The limit parameter is described as follows:

The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array’s length will be no greater than n, and the array’s last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
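The three cases of limit are easier to see on a concrete input. The following is a small sketch using the real String.split; the class name, the helper show, and the sample input are chosen only for illustration.

```java
import java.util.Arrays;

final class SplitLimitDemo {
    /** Splits input and renders the resulting array as text. */
    static String show(String input, String regex, int limit) {
        return Arrays.toString(input.split(regex, limit));
    }

    public static void main(String[] args) {
        String input = "a,b,c,,";
        // limit > 0: at most limit entries, the last one holds the unsplit rest.
        System.out.println(show(input, ",", 2));  // [a, b,c,,]
        // limit == 0: split everywhere, trailing empty strings discarded.
        System.out.println(show(input, ",", 0));  // [a, b, c]
        // limit < 0: split everywhere, trailing empty strings kept.
        System.out.println(show(input, ",", -1)); // [a, b, c, , ]
    }
}
```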

Pattern's split method is as follows:

public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<String>();
    Matcher m = matcher(input);
 
    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                    input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }
 
    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};
 
    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());
 
    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}

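As the code shows, split first uses the regular expression to find every match of the pattern, takes the substring in front of each match, and puts all of the separated substrings into matchList. What it finally returns is a new String array. Consequently, the String objects returned by split are never references to the same object, even when their contents are equal. Seen this way, split also consumes a lot of memory when parsing large volumes of text: if the program keeps a reference to every String that split returns, it ends up with a great many objects that have equal contents but distinct references.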
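The difference between equal contents and identical references can be made visible by counting the tokens both ways. This is a sketch with made-up names; the identity count reflects how current OpenJDK implementations behave and is not something the Java specification promises.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Map;

final class SplitIdentityDemo {
    /** Counts String objects by reference identity (==). */
    static int distinctObjects(String[] parts) {
        Map<String, Boolean> seen = new IdentityHashMap<>();
        for (String part : parts) {
            seen.put(part, Boolean.TRUE);
        }
        return seen.size();
    }

    /** Counts Strings by content (equals/hashCode). */
    static int distinctContents(String[] parts) {
        return new HashSet<>(Arrays.asList(parts)).size();
    }

    public static void main(String[] args) {
        String[] parts = "GET,GET,GET,POST".split(",");
        // Only two different contents occur in the input...
        System.out.println("distinct contents: " + distinctContents(parts));
        // ...but each token is typically a separate object on the heap.
        System.out.println("distinct objects:  " + distinctObjects(parts));
    }
}
```

When the tokens are retained, for example as map keys while parsing a log file, the gap between the two counts is the duplication the paragraph above warns about.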

Other String optimizations:

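When concatenating static (compile-time constant) strings, prefer the "+" operator, because the compiler folds the result into a single constant; when concatenating dynamic strings, prefer StringBuffer or StringBuilder.
A String can be created either by assigning a literal with "=" or by calling a String constructor. With a literal, the JVM checks whether its internal string pool already holds an equal string: if so, it returns the existing instance without constructing a new one; if not, it allocates memory for the new string. The constructor, in contrast, always creates a new String object on the Java heap, even when an equal string is already in the pool.
Using the String constructor can therefore produce large numbers of duplicate strings. There are two ways to deal with this: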

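1. Use String's intern() method, which returns the reference to the equal string already held in the JVM's string pool. This addresses the memory problem, but it is not recommended, for two reasons. First, the pool used by intern() is global to the JVM, and in many cases a program does not need a cache with such a wide scope. Second, intern() uses the PermGen region of the JVM heap (this holds for the HotSpot JVMs of that era, JDK 6 and earlier), which is where the metadata used for loading classes and creating instances is stored. Most of the memory a program uses at run time lives in other regions of the heap, so excessive use of intern() makes PermGen grow until an OutOfMemoryError is thrown, because the garbage collector does not reclaim the cached Strings. The second approach is therefore recommended.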

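2. Build your own cache, which has the advantage of being more flexible. Create a HashMap and store each String to be cached as both the key and the value. Suppose the string we are about to create is key and a Map<String, String> named cacheMap serves as the pool; the code that returns the cached instance is then: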

private String getCacheWord(String key) { 
    String tmp = cacheMap.get(key); 
    if(tmp != null) { 
        return tmp; 
    } else { 
        cacheMap.put(key, key); 
        return key; 
    } 
}
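The tips in this section can be tried out together in one small program. This is a sketch, not the referenced article's code: the class name StringCacheDemo and the helper join are made up, and the HashMap-based pool is not thread-safe and never evicts entries.

```java
import java.util.HashMap;
import java.util.Map;

final class StringCacheDemo {
    // Compile-time constant: the compiler folds this into the single literal "ab".
    static final String FOLDED = "a" + "b";

    // User-managed pool: each distinct content maps to one canonical instance.
    private final Map<String, String> cacheMap = new HashMap<>();

    /** Returns the cached instance equal to key, caching key itself on first sight. */
    String getCacheWord(String key) {
        String cached = cacheMap.get(key);
        if (cached != null) {
            return cached;
        }
        cacheMap.put(key, key);
        return key;
    }

    /** Dynamic concatenation with a single StringBuilder. */
    static String join(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String literal = "hello";
        String constructed = new String("hello");
        System.out.println(FOLDED == "ab");                  // true: folded constant is pooled
        System.out.println(constructed == literal);          // false: constructor made a new object
        System.out.println(constructed.intern() == literal); // true: intern() returns the pooled one

        StringCacheDemo cache = new StringCacheDemo();
        String first = cache.getCacheWord(new String("word"));
        String second = cache.getCacheWord(new String("word"));
        System.out.println(first == second);                 // true: duplicates collapse to one
        System.out.println(join(new String[] {"a", "b", "c"})); // abc
    }
}
```

Unlike intern(), the map is owned by the program, so it can be scoped to one parsing task and dropped as a whole once that task is finished.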

References:

[1] Java性能优化之String篇 (Java Performance Optimization: Strings), IBM developerWorks