Apache poi读取大文件Excel时 内存溢出问题记录

问题描述

项目上有个需求,需要把前端传来的EXCEL文件做解析,通过ETL解析后再做后续处理。但是上传的excel文件有些是超过50w行的文件。线上报内存溢出。需要拆分excel再做后续处理


解决过程

开始大概是这么直接读取文件

/**
 * 数据小没问题,一旦数据大就会爆内存
 */
public static void test(){
    FileInputStream fi = new FileInputStream(filePath);
    Workbook a = new XSSFWorkbook(fi);
    fi.close();
    ...
}

尝试百度后,都是导出时内存溢出的问题。无果,去stackoverflow找到类似问题,找到两个方法 1.通过excel-streaming-reader工具读取 2.Apache poi 推荐用 SAX事件驱动方式解析大数据量Excel文件

Excel-streaming-reader

源问题地址 How to load a large xlsx file with Apache POI?
插件项目github Excel-streaming-reader


依赖导入

<dependencies>
  <dependency>
    <groupId>com.monitorjbl</groupId>
    <artifactId>xlsx-streamer</artifactId>
    <version>2.1.0</version>
  </dependency>
</dependencies>  
InputStream is = new FileInputStream(new File("demo/xxx.xlsx"));
StreamingReader reader = StreamingReader.builder()
        .rowCacheSize(1000)    // 读取时内存中的行数 (defaults to 10)
        .bufferSize(4096)     // 读文件流时缓存字节 (defaults to 1024)
        .sheetIndex(0)        // 工作簿 (defaults to 0)
        .read(is);            // 文件流或者文件(xlsx) (required)

for (Row r : reader) {
  for (Cell c : r) {
    System.out.println(c.getStringCellValue());
  }
}     

SAX事件驱动方式解析大数据量Excel文件

源问题地址 Low memory writing/reading with Apache POI
同时也参考了 link


/**
 * @description: 重写startElement() endElement() characters() 把Excel数据封装到LinkedHashMap
 * @ClassName SheetHandler
 * @CreatedTime 2022/7/13 13:14
 */
import org.apache.poi.xssf.model.SharedStringsTable;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import java.util.LinkedHashMap;

public class SheetHandler extends DefaultHandler {
    private SharedStringsTable sst;
    private String lastContents;
    private boolean nextIsString;
    private String cellPosition;
    private LinkedHashMap<String, String> rowContents = new LinkedHashMap<String, String>();

    public LinkedHashMap<String, String> getRowContents() {
        return rowContents;
    }

    public void setRowContents(LinkedHashMap<String, String> rowContents) {
        this.rowContents = rowContents;
    }

    public SheetHandler(SharedStringsTable sst) {
        this.sst = sst;
    }

    public void startElement(String uri, String localName, String name, Attributes attributes) throws SAXException {
        if (name.equals("c")) {
            cellPosition = attributes.getValue("r");
            String cellType = attributes.getValue("t");
            if (cellType != null && cellType.equals("s")) {
                nextIsString = true;
            } else {
                nextIsString = false;
            }
        }
        lastContents = "";
    }

    public void endElement(String uri, String localName, String name)
            throws SAXException {
        if (nextIsString) {
            int idx = Integer.parseInt(lastContents);
            lastContents = new XSSFRichTextString(sst.getEntryAt(idx)).toString();
            nextIsString = false;
        }
        if (name.equals("v")) {
            // 数据读取结束后,将单元格坐标,内容存入map中
            if (!(cellPosition.length() == 2) || (cellPosition.length() == 2 && !"1".equals(cellPosition.substring(1)))) { //不保存第一行数据
                rowContents.put(cellPosition, lastContents);
            }
        }
    }
    public void characters(char[] ch, int start, int length) throws SAXException {
        lastContents += new String(ch, start, length);
    }
}
/**
 * 测试方法
 * SAX是将xlsx格式的excel文件先解析成xml
 */
  public static void main(String[] args) throws IOException{
        String filePath = "/test.xlsx";
        InputStream sheet = null;
        OPCPackage pkg = null;
        SheetHandler sheetHandler = null;
        try {
            pkg = OPCPackage.open(filePath); 
            XSSFReader reader = new XSSFReader(pkg);
            SharedStringsTable table = reader.getSharedStringsTable();
            sheetHandler = new SheetHandler(table);
            XMLReader parser = XMLReaderFactory.createXMLReader("com.sun.org.apache.xerces.internal.parsers.SAXParser");
            parser.setContentHandler(sheetHandler);
            sheet = reader.getSheet("rId1"); //rId1是sheet1
            InputSource sheetSource = new InputSource(sheet);
            parser.parse(sheetSource); // 解析excel的每条记录 在这个过程中startElement() characters() endElement() 这三个函数会依次执行
            LinkedHashMap<String, String> rowContents = sheetHandler.getRowContents();
//            System.out.println(rowContents);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (sheet != null) {
                sheet.close();
            }
            if (pkg != null) {
                pkg.close();
            }
        }
    }
posted @ 2022-07-13 16:02  _zjk  阅读(1679)  评论(0)    收藏  举报