Notes on OutOfMemoryError When Reading Large Excel Files with Apache POI
Problem Description
The project has a requirement to parse Excel files uploaded from the frontend and run them through ETL before further processing. However, some of the uploaded Excel files exceed 500,000 rows, and production reported an OutOfMemoryError. The Excel file has to be split before any further processing.
Solution Process
At first the file was read directly, roughly like this:
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import java.io.FileInputStream;
import java.io.IOException;

/**
 * Fine for small files, but XSSFWorkbook loads the entire workbook
 * into memory, so a large file blows the heap.
 */
public static void test(String filePath) throws IOException {
    try (FileInputStream fi = new FileInputStream(filePath)) {
        Workbook workbook = new XSSFWorkbook(fi);
        // ... process the workbook
    }
}
Searching Baidu only turned up memory-overflow problems during Excel export. No luck there, so I went to Stack Overflow, found a similar question, and came away with two approaches: 1. read the file with the excel-streaming-reader library; 2. use the SAX event-driven parsing that Apache POI recommends for large Excel files.
Excel-streaming-reader
Original question: How to load a large xlsx file with Apache POI?
Library project on GitHub: Excel-streaming-reader
Dependency import:
<dependencies>
    <dependency>
        <groupId>com.monitorjbl</groupId>
        <artifactId>xlsx-streamer</artifactId>
        <version>2.1.0</version>
    </dependency>
</dependencies>
import com.monitorjbl.xlsx.StreamingReader;
import org.apache.poi.ss.usermodel.*;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

try (InputStream is = new FileInputStream(new File("demo/xxx.xlsx"));
     Workbook workbook = StreamingReader.builder()
             .rowCacheSize(1000)   // rows kept in memory while reading (defaults to 10)
             .bufferSize(4096)     // byte buffer used when reading the stream (defaults to 1024)
             .open(is)) {          // InputStream or File of the xlsx (required)
    // In 2.x the builder returns a streaming Workbook (the 1.x
    // sheetIndex()/read() API is gone), so pick the sheet off the workbook
    Sheet sheet = workbook.getSheetAt(0);
    for (Row r : sheet) {
        for (Cell c : r) {
            System.out.println(c.getStringCellValue());
        }
    }
}
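Since the original requirement was to split 500k-row files before further processing, here is a minimal sketch of how the streaming reader could be paired with POI's SXSSFWorkbook (the low-memory streaming writer) to cut a large file into fixed-size chunks. The ExcelSplitter/split/writeChunk names and the chunk file naming are made up for illustration, and it assumes every cell can be read as a string:

import com.monitorjbl.xlsx.StreamingReader;
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
import java.io.*;

public class ExcelSplitter {

    /** Copies rows from a large xlsx into chunk files of at most chunkSize rows each. */
    public static void split(File source, int chunkSize) throws IOException {
        try (InputStream is = new FileInputStream(source);
             Workbook in = StreamingReader.builder()
                     .rowCacheSize(1000)
                     .bufferSize(4096)
                     .open(is)) {
            SXSSFWorkbook out = new SXSSFWorkbook(100); // keep only 100 rows in memory when writing
            Sheet outSheet = out.createSheet();
            int rowsInChunk = 0;
            int chunkNo = 0;
            for (Row r : in.getSheetAt(0)) {
                Row copy = outSheet.createRow(rowsInChunk++);
                for (Cell c : r) {
                    // assumes string content; add type handling for numeric/date cells as needed
                    copy.createCell(c.getColumnIndex()).setCellValue(c.getStringCellValue());
                }
                if (rowsInChunk == chunkSize) {
                    writeChunk(out, ++chunkNo);
                    out = new SXSSFWorkbook(100);
                    outSheet = out.createSheet();
                    rowsInChunk = 0;
                }
            }
            if (rowsInChunk > 0) {
                writeChunk(out, ++chunkNo); // flush the final partial chunk
            }
        }
    }

    private static void writeChunk(SXSSFWorkbook wb, int chunkNo) throws IOException {
        try (OutputStream os = new FileOutputStream("chunk-" + chunkNo + ".xlsx")) {
            wb.write(os);
        } finally {
            wb.dispose(); // delete SXSSF's temp files
        }
    }
}

SXSSFWorkbook keeps only a sliding window of rows in memory and spills older rows to temp files, so memory stays flat on both the read and the write side; dispose() removes those temp files afterwards.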
SAX Event-Driven Parsing for Large Excel Files
Original question: Low memory writing/reading with Apache POI
I also referenced this link.
/**
 * @description: Overrides startElement(), endElement() and characters() to
 *               collect the Excel data into a LinkedHashMap keyed by cell reference
 * @ClassName SheetHandler
 * @CreatedTime 2022/7/13 13:14
 */
import org.apache.poi.xssf.model.SharedStringsTable;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import java.util.LinkedHashMap;
public class SheetHandler extends DefaultHandler {

    private SharedStringsTable sst;
    private String lastContents;
    private boolean nextIsString;
    private String cellPosition;
    private LinkedHashMap<String, String> rowContents = new LinkedHashMap<String, String>();

    public LinkedHashMap<String, String> getRowContents() {
        return rowContents;
    }

    public void setRowContents(LinkedHashMap<String, String> rowContents) {
        this.rowContents = rowContents;
    }

    public SheetHandler(SharedStringsTable sst) {
        this.sst = sst;
    }

    @Override
    public void startElement(String uri, String localName, String name, Attributes attributes) throws SAXException {
        if (name.equals("c")) { // "c" opens a cell element
            cellPosition = attributes.getValue("r"); // cell reference, e.g. "B3"
            String cellType = attributes.getValue("t");
            // t="s" means the cell value is an index into the shared strings table
            nextIsString = "s".equals(cellType);
        }
        lastContents = ""; // reset the character buffer for the new element
    }

    @Override
    public void endElement(String uri, String localName, String name) throws SAXException {
        if (nextIsString) {
            // lastContents holds a shared-string index; resolve it to the actual text
            int idx = Integer.parseInt(lastContents);
            lastContents = new XSSFRichTextString(sst.getEntryAt(idx)).toString();
            nextIsString = false;
        }
        if (name.equals("v")) {
            // A value element ended: store cell reference -> content in the map.
            // Extract the row number from the reference (e.g. "AB12" -> "12")
            // so the check works for any column width, and skip the first (header) row.
            String rowNumber = cellPosition.replaceAll("[A-Za-z]", "");
            if (!"1".equals(rowNumber)) {
                rowContents.put(cellPosition, lastContents);
            }
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Text content may arrive in several chunks; accumulate it
        lastContents += new String(ch, start, length);
    }
}
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;

/**
 * Test method.
 * An .xlsx file is a ZIP archive of XML parts; the SAX parser streams
 * through the sheet XML instead of building the whole document in memory.
 */
public static void main(String[] args) throws IOException {
    String filePath = "/test.xlsx";
    InputStream sheet = null;
    OPCPackage pkg = null;
    SheetHandler sheetHandler = null;
    try {
        pkg = OPCPackage.open(filePath);
        XSSFReader reader = new XSSFReader(pkg);
        SharedStringsTable table = reader.getSharedStringsTable();
        sheetHandler = new SheetHandler(table);
        XMLReader parser = XMLReaderFactory.createXMLReader("com.sun.org.apache.xerces.internal.parsers.SAXParser");
        parser.setContentHandler(sheetHandler);
        sheet = reader.getSheet("rId1"); // "rId1" is the relationship id of sheet1
        InputSource sheetSource = new InputSource(sheet);
        // Parse the sheet record by record; startElement(), characters()
        // and endElement() fire in turn for every element
        parser.parse(sheetSource);
        LinkedHashMap<String, String> rowContents = sheetHandler.getRowContents();
        // System.out.println(rowContents);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (sheet != null) {
            sheet.close();
        }
        if (pkg != null) {
            pkg.close();
        }
    }
}
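The handler above leaves you with a flat map keyed by cell reference ("B12" -> value). A small follow-up sketch, using POI's org.apache.poi.ss.util.CellReference (which ships with poi) to regroup that map into per-row lists; groupByRow is a hypothetical helper name:

import org.apache.poi.ss.util.CellReference;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Turn the flat "B12" -> value map produced by SheetHandler into per-row lists
public static Map<Integer, List<String>> groupByRow(LinkedHashMap<String, String> rowContents) {
    Map<Integer, List<String>> rows = new TreeMap<>();
    for (Map.Entry<String, String> e : rowContents.entrySet()) {
        CellReference ref = new CellReference(e.getKey()); // "B12" -> row 11, col 1 (0-based)
        rows.computeIfAbsent(ref.getRow(), k -> new ArrayList<>()).add(e.getValue());
    }
    return rows;
}

Note that empty cells never emit a v element, so a row's list can be shorter than the sheet's column count; use ref.getCol() if exact column alignment matters.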
