We had some data that needed to be written to StarRocks in real time. At first the plan seemed simple: just write it with INSERT INTO VALUES. Then we ran into the "version count over 1000" problem. Only after checking the official docs did we learn that every INSERT creates a new data version, so firing a large number of INSERTs at the backend in a short time is not viable. That is when we switched to Stream Load.


java.sql.SQLSyntaxErrorException: Too many versions. tablet_id: 18398, version_count: 1001, limit: 1000

Writing data with INSERT INTO VALUES

Writing data with INSERT INTO VALUES is straightforward: connect via JDBC and execute the INSERT directly (a minimal JDBC sketch follows the SQL example below).

Example:

INSERT INTO source_wiki_edit
WITH LABEL insert_load_wikipedia
VALUES
    ("2015-09-12 00:00:00","#en.wikipedia","AustinFF",0,0,0,0,0,21,5,0),
    ("2015-09-12 00:00:00","#ca.wikipedia","helloSR",0,1,0,1,0,3,23,0);

Everything went as expected and the INSERT INTO VALUES program was written quickly. The nightmare only began once it went live.

Whenever the data volume got a little larger (tens of thousands of rows), the program kept triggering the "Too many versions" error. The only workaround was to lower the write frequency, which in turn made ingestion much slower.

While reading the official documentation, I came across this description on the page about loading data with INSERT INTO:

Frequently loading small batches of data with INSERT INTO creates too many data versions, which hurts query performance. It is therefore not recommended to load data with frequent INSERT INTO statements or to use them as a routine loading job in production. If your scenario requires streaming ingestion or frequent small-batch loading, it is recommended to use Apache Kafka® as the data source and load the data through Routine Load.

The INSERT INTO VALUES method is only suitable for loading a small amount of data for verification and demo purposes; it is not intended for large-scale tests or production environments. For loading data at scale, choose another loading method.

Our scenario is exactly streaming ingestion, but Flink was not convenient for us to use, so we decided to "borrow" from the flink-connector-starrocks code that StarRocks provides: buffer records in memory and load them in batches via Stream Load.

Loading with Stream Load

From the official docs:

StarRocks provides the HTTP-based loading method Stream Load, which helps you load data from a local file system or a streaming data source.

Stream Load is a synchronous loading method. After you submit a load job, StarRocks executes it synchronously and returns the result of the job. You can use the returned result to determine whether the load succeeded.

Loading local data files:

	You can typically use the curl command to submit a load job directly and load the data of a local file into StarRocks.

Loading real-time data streams:

	You can typically use a program such as Apache Flink® to submit a load job that continuously generates a series of load tasks, loading a real-time data stream into StarRocks without interruption.

How it works

With Stream Load, the client sends the load job request over HTTP to an FE node, and the FE node forwards the request to one of the BE nodes via an HTTP redirect.

The BE node that receives the request acts as the Coordinator BE: it splits the data according to the table schema and distributes it to the relevant BE nodes. The Coordinator BE then returns the result of the load job to the client.

Main workflow (the flow diagram from the official docs is omitted here):

Note: the content above comes from the official documentation.

Official sample

The StarRocks GitHub repo provides a Java sample that integrates Stream Load:

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.commons.codec.binary.Base64;
import org.apache.http.HttpHeaders;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPut;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.DefaultRedirectStrategy;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class StarRocksStreamLoad {
    private final static String STARROCKS_HOST = "xxx.com";
    private final static String STARROCKS_DB = "test";
    private final static String STARROCKS_TABLE = "stream_test";
    private final static String STARROCKS_USER = "root";
    private final static String STARROCKS_PASSWORD = "xxx";
    private final static int STARROCKS_HTTP_PORT = 8030;

    private void sendData(String content) throws Exception {
        final String loadUrl = String.format("http://%s:%s/api/%s/%s/_stream_load",
                STARROCKS_HOST,
                STARROCKS_HTTP_PORT,
                STARROCKS_DB,
                STARROCKS_TABLE);

        final HttpClientBuilder httpClientBuilder = HttpClients
                .custom()
                .setRedirectStrategy(new DefaultRedirectStrategy() {
                    @Override
                    protected boolean isRedirectable(String method) {
                        return true;
                    }
                });

        try (CloseableHttpClient client = httpClientBuilder.build()) {
            HttpPut put = new HttpPut(loadUrl);
            StringEntity entity = new StringEntity(content, "UTF-8");
            put.setHeader(HttpHeaders.EXPECT, "100-continue");
            put.setHeader(HttpHeaders.AUTHORIZATION, basicAuthHeader(STARROCKS_USER, STARROCKS_PASSWORD));
            // The label header is optional;
            // setting a label gives the load job at-most-once semantics.
            put.setHeader("label", "39c25a5c-7000-496e-a98e-348a264c81de");
            put.setEntity(entity);

            try (CloseableHttpResponse response = client.execute(put)) {
                String loadResult = "";
                if (response.getEntity() != null) {
                    loadResult = EntityUtils.toString(response.getEntity());
                }
                final int statusCode = response.getStatusLine().getStatusCode();
                // An HTTP 200 status only means the StarRocks BE service responded,
                // not that the stream load succeeded; check the response body to see whether it did.
                if (statusCode != 200) {
                    throw new IOException(
                            String.format("Stream load failed, statusCode=%s load result=%s", statusCode, loadResult));
                }

                System.out.println(loadResult);
            }
        }
    }

    private String basicAuthHeader(String username, String password) {
        final String tobeEncode = username + ":" + password;
        byte[] encoded = Base64.encodeBase64(tobeEncode.getBytes(StandardCharsets.UTF_8));
        return "Basic " + new String(encoded);
    }

    public static void main(String[] args) throws Exception {
        int id1 = 1;
        int id2 = 10;
        String id3 = "Simon";
        int rowNumber = 10;
        String oneRow = id1 + "\t" + id2 + "\t" + id3 + "\n";

        StringBuilder stringBuilder = new StringBuilder();
        for (int i = 0; i < rowNumber; i++) {
            stringBuilder.append(oneRow);
        }
        
        stringBuilder.deleteCharAt(stringBuilder.length() - 1);

        String loadData = stringBuilder.toString();
        StarRocksStreamLoad starrocksStreamLoad = new StarRocksStreamLoad();
        starrocksStreamLoad.sendData(loadData);
    }
}

Note: the code comes from https://github.com/StarRocks/demo/blob/master/MiscDemo/stream_load/java/StarRocksStreamLoad.java

Code logic:
1. The main function builds 10 rows and joins them together with "\n" as the separator.
2. The data is sent to StarRocks over HTTP. As the comment in the sample notes, the HTTP status code alone does not tell you whether the load succeeded; see the sketch below for checking the returned result.
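The response body returned by Stream Load is a JSON object whose Status field is what actually indicates success. A minimal check might look like this (fastjson is assumed here, since the code later in this post already uses it):

    // loadResult is the response body printed by the sample above
    private static void checkLoadResult(String loadResult) {
        // parse the JSON body returned by Stream Load (using fastjson)
        com.alibaba.fastjson.JSONObject result = com.alibaba.fastjson.JSON.parseObject(loadResult);
        String status = result.getString("Status");
        // "Success" (and "Publish Timeout") mean the data was loaded; anything else is a failure
        if (!"Success".equals(status) && !"Publish Timeout".equals(status)) {
            throw new RuntimeException("Stream load failed: " + loadResult);
        }
    }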

Actual usage

For actual use, I referenced the flink-connector-starrocks code.

When building the HTTP client, add a ContentLengthHeaderRemover interceptor to remove the Content-Length header from the HTTP request; otherwise you may hit this error: Content-Length header already present.

Header parameters:
1. columns sets the order of the columns in the data
2. row_delimiter sets the row separator
3. column_separator sets the column separator

The core code is as follows. It references a few class-level fields (url, user, pass, format, labelPrefix, ROW_SEP, LOG) that are not shown in the post; hypothetical declarations for them are sketched at the top, and a hypothetical call site follows the listing.
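
    // Class-level fields referenced by the snippets below. The original post does not show them,
    // so the values here are placeholders; adjust them to your environment.
    // (Strings is Guava's com.google.common.base.Strings, JSON is fastjson's com.alibaba.fastjson.JSON,
    // and the HTTP classes come from Apache HttpClient.)
    private static final org.slf4j.Logger LOG = org.slf4j.LoggerFactory.getLogger("StreamLoadDemo");
    private static final String url = "http://fe-host:8030";   // FE HTTP address (hypothetical)
    private static final String user = "root";                 // hypothetical credentials
    private static final String pass = "";
    private static final String format = "csv";                // "csv" or "json"
    private static final String labelPrefix = "demo_";         // prefix for stream load labels
    private static final String ROW_SEP = "\u0002";            // row separator byte, intended to match the row_delimiter header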


    private static void write(String database, String table, List<String> records) throws Exception {

        // convert each record to bytes and add it to the buffer entity
        StarRocksSinkBufferEntity bufferEntity = new StarRocksSinkBufferEntity(database, table, labelPrefix);
        for (String record : records) {
            byte[] bts = record.getBytes(StandardCharsets.UTF_8);
            bufferEntity.addToBuffer(bts);
        }

        // join the buffered rows into a single payload
        byte[] data = joinRows(bufferEntity.getBuffer(), (int) bufferEntity.getBatchSize());

        // build the stream load URL
        String loadUrl = new StringBuilder(url)
                .append("/api/")
                .append(bufferEntity.getDatabase())
                .append("/")
                .append(bufferEntity.getTable())
                .append("/_stream_load")
                .toString();

        // random load label
        String label = createBatchLabel();
        // send the HTTP request
        Map<String, Object> map = doHttpPut(loadUrl, label, data);

        // doHttpPut returns null on a non-200 response or an empty body
        if (map != null) {
            for (Map.Entry<String, Object> entry : map.entrySet()) {
                System.out.println(entry.getKey() + " : " + entry.getValue());
            }
        }

    }

    public static String createBatchLabel() {
        StringBuilder sb = new StringBuilder();
        if (!Strings.isNullOrEmpty(labelPrefix)) {
            sb.append(labelPrefix);
        }
        return sb.append(UUID.randomUUID().toString()).toString();
    }

    private static Map<String, Object> doHttpPut(String loadUrl, String label, byte[] data) throws IOException {
        LOG.info(String.format("Executing stream load to: '%s', size: '%s', thread: %d", loadUrl, data.length, Thread.currentThread().getId()));
        final HttpClientBuilder httpClientBuilder = HttpClients.custom()
                .setRedirectStrategy(new DefaultRedirectStrategy() {
                    @Override
                    protected boolean isRedirectable(String method) {
                        return true;
                    }
                })
                // work around the Content-Length bug: Caused by: org.apache.http.ProtocolException: Content-Length header already present
                .addInterceptorFirst(new ContentLengthHeaderRemover());
        try (CloseableHttpClient httpclient = httpClientBuilder.build()) {
            HttpPut httpPut = new HttpPut(loadUrl);
            // column names in the payload, in order
            httpPut.setHeader("columns", "`col1`,`col2`,`col3`,`col4`");

            if (!httpPut.containsHeader("timeout")) {
                httpPut.setHeader("timeout", "60");
            }
            if ("json".equals(format)) {
                httpPut.setHeader("strip_outer_array", "true");
                httpPut.setHeader("format", "json");
            } else {
                // row and column delimiters
                httpPut.setHeader("row_delimiter", "\\\\x02");
                httpPut.setHeader("column_separator", "\\\\x01");
            }
            httpPut.setHeader("Expect", "100-continue");

            httpPut.setHeader("label", label);
            // username and password
            httpPut.setHeader("Authorization", getBasicAuthHeader(user, pass));
            httpPut.setEntity(new ByteArrayEntity(data));
            httpPut.setConfig(RequestConfig.custom().setRedirectsEnabled(true).build());
            try (CloseableHttpResponse resp = httpclient.execute(httpPut)) {
                HttpEntity respEntity = getHttpEntity(resp);
                if (respEntity == null)
                    return null;
                return (Map<String, Object>) JSON.parse(EntityUtils.toString(respEntity));
            }
        }
    }

    private static String getBasicAuthHeader(String username, String password) {
        String auth = username + ":" + password;
        byte[] encodedAuth = Base64.encodeBase64(auth.getBytes(StandardCharsets.UTF_8));
        return new StringBuilder("Basic ").append(new String(encodedAuth)).toString();
    }

    private static HttpEntity getHttpEntity(CloseableHttpResponse resp) {
        int code = resp.getStatusLine().getStatusCode();
        if (200 != code) {
            LOG.warn("Request failed with code:{}", code);
            return null;
        }
        HttpEntity respEntity = resp.getEntity();
        if (null == respEntity) {
            LOG.warn("Request failed with empty response.");
            return null;
        }
        return respEntity;
    }
    private static class ContentLengthHeaderRemover implements HttpRequestInterceptor {
        @Override
        public void process(HttpRequest request, HttpContext context) throws HttpException, IOException {
            request.removeHeaders(HTTP.CONTENT_LEN);// fighting org.apache.http.protocol.RequestContent's ProtocolException("Content-Length header already present");
        }
    }

    private static byte[] joinRows(List<byte[]> rows, int totalBytes) {
        if ("csv".equals(format)) {
            // csv
            byte[] lineDelimiter = ROW_SEP.getBytes(StandardCharsets.UTF_8);
            ByteBuffer bos = ByteBuffer.allocate(totalBytes + rows.size() * lineDelimiter.length);
            for (byte[] row : rows) {
                bos.put(row);
                bos.put(lineDelimiter);
            }

            return bos.array();
        } else {
            ByteBuffer bos = ByteBuffer.allocate(totalBytes + (rows.isEmpty() ? 2 : rows.size() + 1));
            bos.put("[".getBytes(StandardCharsets.UTF_8));
            byte[] jsonDelimiter = ",".getBytes(StandardCharsets.UTF_8);
            boolean isFirstElement = true;
            for (byte[] row : rows) {
                if (!isFirstElement) {
                    bos.put(jsonDelimiter);
                }
                bos.put(row);
                isFirstElement = false;
            }
            bos.put("]".getBytes(StandardCharsets.UTF_8));
            return bos.array();
        }

    }
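
A hypothetical call site for the write(...) method above, assuming a database named test and a table named stream_test whose columns match the columns header; the names and values are purely illustrative:

    public static void main(String[] args) throws Exception {
        // Fields within a record are joined with a separator byte, assumed here to be 0x01;
        // make sure it matches what the column_separator header resolves to on the server side.
        // Rows are joined later by joinRows using ROW_SEP.
        String colSep = "\u0001";
        java.util.List<String> records = new java.util.ArrayList<>();
        records.add(String.join(colSep, "1", "10", "Simon", "2022-09-26"));
        records.add(String.join(colSep, "2", "20", "Kate", "2022-09-26"));
        write("test", "stream_test", records);
    }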


StarRocksSinkBufferEntity is as follows (borrowed from the flink-connector-starrocks code):


import java.io.Serializable;
import java.util.ArrayList;
import java.util.UUID;

import com.google.common.base.Strings;

public class StarRocksSinkBufferEntity implements Serializable {

    private static final long serialVersionUID = 1L;

    private ArrayList<byte[]> buffer = new ArrayList<>();
    private int batchCount = 0;
    private long batchSize = 0;
    private String label;
    private String database;
    private String table;
    private boolean EOF;
    private String labelPrefix;

    public StarRocksSinkBufferEntity(String database, String table, String labelPrefix) {
        this.database = database;
        this.table = table;
        this.labelPrefix = labelPrefix;
        label = createBatchLabel();
    }

    public StarRocksSinkBufferEntity asEOF() {
        EOF = true;
        return this;
    }

    public boolean EOF() {
        return EOF;
    }

    public String getDatabase() {
        return database;
    }

    public void setDatabase(String database) {
        this.database = database;
    }

    public String getTable() {
        return table;
    }

    public void setTable(String table) {
        this.table = table;
    }

    public String getLabel() {
        return label;
    }

    public ArrayList<byte[]> getBuffer() {
        return buffer;
    }

    public void addToBuffer(byte[] bts) {
        incBatchCount();
        incBatchSize(bts.length);
        buffer.add(bts);
    }
    public int getBatchCount() {
        return batchCount;
    }
    private void incBatchCount() {
        this.batchCount += 1;
    }
    public long getBatchSize() {
        return batchSize;
    }
    private void incBatchSize(long batchSize) {
        this.batchSize += batchSize;
    }

    public synchronized void clear() {
        buffer.clear();
        batchCount = 0;
        batchSize = 0;
        label = createBatchLabel();
    }

    public String reGenerateLabel() {
        return label = createBatchLabel();
    }

    public String createBatchLabel() {
        StringBuilder sb = new StringBuilder();
        if (!Strings.isNullOrEmpty(labelPrefix)) {
            sb.append(labelPrefix);
        }
        return sb.append(UUID.randomUUID().toString()).toString();
    }
}

Summary

After switching to Stream Load, the policy is to write to StarRocks once every minute or once every 1,000 records, whichever comes first. So far this has not run into any problems.
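The buffering logic itself is not shown in this post; a minimal sketch of the "every 1,000 records or every minute" flush policy, built on the write(...) method above, could look like this (the class name and thresholds are illustrative):

public class BufferedStreamLoader {
    private static final int FLUSH_ROWS = 1000;          // flush when this many records are buffered
    private static final long FLUSH_INTERVAL_SEC = 60;   // or when this much time has passed

    private final java.util.List<String> buffer = new java.util.ArrayList<>();
    private final java.util.concurrent.ScheduledExecutorService timer =
            java.util.concurrent.Executors.newSingleThreadScheduledExecutor();

    public BufferedStreamLoader() {
        // time-based flush
        timer.scheduleAtFixedRate(this::flush, FLUSH_INTERVAL_SEC, FLUSH_INTERVAL_SEC,
                java.util.concurrent.TimeUnit.SECONDS);
    }

    public synchronized void add(String record) {
        buffer.add(record);
        if (buffer.size() >= FLUSH_ROWS) {
            flush();   // count-based flush
        }
    }

    public synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        try {
            // write(...) is the stream load method shown earlier in this post
            write("test", "stream_test", new java.util.ArrayList<>(buffer));
            buffer.clear();
        } catch (Exception e) {
            // keep the buffer on failure; decide in real code whether to retry or drop
            e.printStackTrace();
        }
    }
}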

One thing to note: string fields that are NULL need to be replaced with "\N"; otherwise they end up in StarRocks as the literal string 'null' or as an empty string.
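For example, when building each record you can map nulls like this (a small sketch; in Java source the marker is written as the escaped literal "\\N"):

    // Convert a possibly-null field to the value Stream Load treats as NULL in CSV.
    private static String toCsvField(String value) {
        return value == null ? "\\N" : value;
    }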

References:

GitHub: starrocks-connector-for-apache-flink
GitHub: StarRocks demo

Feel free to follow the Flink菜鸟 WeChat official account, which posts occasional articles on Flink development.
