📑 Flume Taildir Source Bug Fix

Environment

Flume version: 1.10.0-SNAPSHOT

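Symptom

While looking through the logs in our environment recently, I happened to notice that when our team's Flume agent starts up, the Flume log shows an exception being thrown.

A teammate told me the error is harmless: it doesn't affect the business and everything runs normally. But I'm a bit obsessive about clean code (haha, the folks on my team have put up with plenty of nagging from me), and error logs bother me. Even though the business is unaffected, I still wanted to understand exactly what was going on and see whether the error could be eliminated. So, time to dig into the source.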

Locating the problem

First, look at the error log. The error is reported at agent startup, at the "Updating position from position file" step, which leads to the start() method of TaildirSource:

reader = new ReliableTaildirEventReader.Builder()
          .filePaths(filePaths)
          .headerTable(headerTable)
          .positionFilePath(positionFilePath)
          .skipToEnd(skipToEnd)
          .addByteOffset(byteOffsetHeader)
          .cachePatternMatching(cachePatternMatching)
          .annotateFileName(fileHeader)
          .fileNameHeader(fileHeaderKey)
          .build();

Following the call further down, we reach the initialization of ReliableTaildirEventReader:

private ReliableTaildirEventReader(Map<String, String> filePaths,
      Table<String, String, String> headerTable, String positionFilePath,
      boolean skipToEnd, boolean addByteOffset, boolean cachePatternMatching,
      boolean annotateFileName, String fileNameHeader) throws IOException {

  ... // less important code omitted
  loadPositionFile(positionFilePath);
}

Error entry point:

Tracing all the way down, the exception is actually thrown by the nextNonWhitespace(boolean throwOnEof) method in Gson's JsonReader:

Full code of nextNonWhitespace(boolean throwOnEof)
/**
   * Returns the next character in the stream that is neither whitespace nor a
   * part of a comment. When this returns, the returned character is always at
   * {@code buffer[pos-1]}; this means the caller can always push back the
   * returned character by decrementing {@code pos}.
   */
  private int nextNonWhitespace(boolean throwOnEof) throws IOException {
    /*
     * This code uses ugly local variables 'p' and 'l' representing the 'pos'
     * and 'limit' fields respectively. Using locals rather than fields saves
     * a few field reads for each whitespace character in a pretty-printed
     * document, resulting in a 5% speedup. We need to flush 'p' to its field
     * before any (potentially indirect) call to fillBuffer() and reread both
     * 'p' and 'l' after any (potentially indirect) call to the same method.
     */
    char[] buffer = this.buffer;
    int p = pos;
    int l = limit;
    while (true) {
      if (p == l) {
        pos = p;
        if (!fillBuffer(1)) {
          break;
        }
        p = pos;
        l = limit;
      }

      int c = buffer[p++];
      switch (c) {
      case '\t':
      case ' ':
      case '\n':
      case '\r':
        continue;

      case '/':
        pos = p;
        if (p == l) {
          pos--; // push back '/' so it's still in the buffer when this method returns
          boolean charsLoaded = fillBuffer(2);
          pos++; // consume the '/' again
          if (!charsLoaded) {
            return c;
          }
        }

        checkLenient();
        char peek = buffer[pos];
        switch (peek) {
        case '*':
          // skip a /* c-style comment */
          pos++;
          if (!skipTo("*/")) {
            throw syntaxError("Unterminated comment");
          }
          p = pos + 2;
          l = limit;
          continue;

        case '/':
          // skip a // end-of-line comment
          pos++;
          skipToEndOfLine();
          p = pos;
          l = limit;
          continue;

        default:
          return c;
        }

      case '#':
        pos = p;
        /*
         * Skip a # hash end-of-line comment. The JSON RFC doesn't
         * specify this behaviour, but it's required to parse
         * existing documents. See http://b/2571423.
         */
        checkLenient();
        skipToEndOfLine();
        p = pos;
        l = limit;
        continue;

      default:
        pos = p;
        return c;
      }
    }
    if (throwOnEof) {
      // This is the key point: the error message comes from here
      throw new EOFException("End of input"
          + " at line " + getLineNumber() + " column " + getColumnNumber());
    } else {
      return -1;
    }
  }

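So the problem is now clear. At its root, the position file is empty, yet it is still read with JsonReader. When nextNonWhitespace() is called, an empty file has no next non-whitespace character, so the exception is thrown. In practice, once a Taildir source has been started it creates a position file. If the monitored directory never contains any files, or files appeared and were later deleted, the position file will be empty, and the next time the Taildir source starts it throws the exception above. So this is a small bug in Flume.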
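To make the mechanism concrete, here is a minimal sketch using only the JDK. It is a simplified model of the whitespace-skipping loop, not Gson's actual code: comment handling and buffering are left out, and the class EmptyInputDemo and the helper hitsEndOfInput are hypothetical names used only for illustration.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class EmptyInputDemo {
  // Simplified model (NOT Gson's code): skip whitespace and return the next
  // character; at end of input either throw or return -1.
  static int nextNonWhitespace(Reader in, boolean throwOnEof) throws IOException {
    int c;
    while ((c = in.read()) != -1) {
      if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
        continue; // whitespace: keep scanning
      }
      return c; // first non-whitespace character
    }
    if (throwOnEof) {
      throw new EOFException("End of input");
    }
    return -1;
  }

  // Hypothetical helper: true when reading `content` reaches end of input
  // before any non-whitespace character is found.
  static boolean hitsEndOfInput(String content) {
    try {
      nextNonWhitespace(new StringReader(content), true);
      return false;
    } catch (EOFException e) {
      return true;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(hitsEndOfInput(""));     // empty position file
    System.out.println(hitsEndOfInput("  \n")); // whitespace only
    System.out.println(hitsEndOfInput("[]"));   // empty JSON array
  }
}
```

In this model, both the empty string and the whitespace-only string run into end of input, while "[]" does not, which matches the behaviour described above for an empty position file.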

Solution

Once the cause is known, the fix is simple: at startup, if the position file is empty, skip loading positions from it. So, let's get to it. Locate the code in ReliableTaildirEventReader and add the check:

  /**
   * Create a ReliableTaildirEventReader to watch the given directory.
   */
  private ReliableTaildirEventReader(Map<String, String> filePaths,
      Table<String, String, String> headerTable, String positionFilePath,
      boolean skipToEnd, boolean addByteOffset, boolean cachePatternMatching,
      boolean annotateFileName, String fileNameHeader) throws IOException {
    
    // non-essential code omitted
    .....
    logger.info("Updating position from position file: " + positionFilePath);
    File file = new File(positionFilePath);
    if (file.exists() && file.isFile() && file.length() != 0) {
      loadPositionFile(positionFilePath);
    }
  }
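The guard condition can also be tried out in isolation. The sketch below is JDK-only and is not part of Flume: the class PositionFileGuard and the helpers hasContent and writeTemp are hypothetical names, and the temporary files merely stand in for a real position file.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class PositionFileGuard {
  // Same condition as the patch above: an existing, regular, non-empty file.
  static boolean hasContent(String positionFilePath) {
    File file = new File(positionFilePath);
    return file.exists() && file.isFile() && file.length() != 0;
  }

  // Hypothetical helper: write `content` to a temp file and return its path.
  static String writeTemp(String content) {
    try {
      File f = File.createTempFile("taildir_position", ".json");
      f.deleteOnExit();
      Files.write(f.toPath(), content.getBytes(StandardCharsets.UTF_8));
      return f.getPath();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(hasContent(writeTemp("")));    // empty file: skipped
    System.out.println(hasContent(writeTemp("[]")));  // has content: loaded
    System.out.println(hasContent(writeTemp(" \n"))); // whitespace only: still loaded
  }
}
```

One thing worth noting about this design: the check is purely on file length, so a position file that contains only whitespace would still pass the guard and would presumably reach the same end-of-input path in the parser. For the scenario described in this post (a zero-length file) the length check is sufficient.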

Then compile and package:

Upload the jar to Flume's lib directory, then restart the agent:

Checking the Flume log, the java.io.EOFException: End of input at line 1 column 1 error no longer appears. Done ...

posted @ 2024-02-26 14:37  JasperD