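📑 Flume Taildir Source Bug Fix
Environment
Flume version: 1.10.0-SNAPSHOT
Symptom
While recently going through logs in our environment, I happened to notice that when our team's Flume agent starts up, the Flume log shows an exception being thrown.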

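A teammate told me the error is harmless: it doesn't affect the business and everything runs normally. But I'm a bit obsessive about clean code (haha, my team has been pestered by me plenty about it), and error logs annoy me. Even though the business isn't affected, I still wanted to understand what was actually going on and see whether the error could be eliminated. So, time to dig into the source.
Locating the problem
First, look at the error log. The error is reported at agent startup, at the "Updating position from position file" step. That points to the start() method of TaildirSource: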
reader = new ReliableTaildirEventReader.Builder()
    .filePaths(filePaths)
    .headerTable(headerTable)
    .positionFilePath(positionFilePath)
    .skipToEnd(skipToEnd)
    .addByteOffset(byteOffsetHeader)
    .cachePatternMatching(cachePatternMatching)
    .annotateFileName(fileHeader)
    .fileNameHeader(fileHeaderKey)
    .build();
Continuing down from there, ReliableTaildirEventReader is initialized:
private ReliableTaildirEventReader(Map<String, String> filePaths,
    Table<String, String, String> headerTable, String positionFilePath,
    boolean skipToEnd, boolean addByteOffset, boolean cachePatternMatching,
    boolean annotateFileName, String fileNameHeader) throws IOException {
  ... // less important code omitted
  loadPositionFile(positionFilePath);
}
Error entry point:

Tracing all the way down, the error actually comes from the nextNonWhitespace(boolean throwOnEof) method in Gson's JsonReader.
Full code of nextNonWhitespace(boolean throwOnEof)
/**
 * Returns the next character in the stream that is neither whitespace nor a
 * part of a comment. When this returns, the returned character is always at
 * {@code buffer[pos-1]}; this means the caller can always push back the
 * returned character by decrementing {@code pos}.
 */
private int nextNonWhitespace(boolean throwOnEof) throws IOException {
  /*
   * This code uses ugly local variables 'p' and 'l' representing the 'pos'
   * and 'limit' fields respectively. Using locals rather than fields saves
   * a few field reads for each whitespace character in a pretty-printed
   * document, resulting in a 5% speedup. We need to flush 'p' to its field
   * before any (potentially indirect) call to fillBuffer() and reread both
   * 'p' and 'l' after any (potentially indirect) call to the same method.
   */
  char[] buffer = this.buffer;
  int p = pos;
  int l = limit;
  while (true) {
    if (p == l) {
      pos = p;
      if (!fillBuffer(1)) {
        break;
      }
      p = pos;
      l = limit;
    }
    int c = buffer[p++];
    switch (c) {
      case '\t':
      case ' ':
      case '\n':
      case '\r':
        continue;
      case '/':
        pos = p;
        if (p == l) {
          pos--; // push back '/' so it's still in the buffer when this method returns
          boolean charsLoaded = fillBuffer(2);
          pos++; // consume the '/' again
          if (!charsLoaded) {
            return c;
          }
        }
        checkLenient();
        char peek = buffer[pos];
        switch (peek) {
          case '*':
            // skip a /* c-style comment */
            pos++;
            if (!skipTo("*/")) {
              throw syntaxError("Unterminated comment");
            }
            p = pos + 2;
            l = limit;
            continue;
          case '/':
            // skip a // end-of-line comment
            pos++;
            skipToEndOfLine();
            p = pos;
            l = limit;
            continue;
          default:
            return c;
        }
      case '#':
        pos = p;
        /*
         * Skip a # hash end-of-line comment. The JSON RFC doesn't
         * specify this behaviour, but it's required to parse
         * existing documents. See http://b/2571423.
         */
        checkLenient();
        skipToEndOfLine();
        p = pos;
        l = limit;
        continue;
      default:
        pos = p;
        return c;
    }
  }
  if (throwOnEof) {
    // This is the key part: it is where the error message comes from
    throw new EOFException("End of input"
        + " at line " + getLineNumber() + " column " + getColumnNumber());
  } else {
    return -1;
  }
}
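So the problem is clear. At root, our file is empty, yet JsonReader is used to read it. When nextNonWhitespace() is called, an empty file has no next non-whitespace character to return, so it throws. In practice, once a Taildir source has been started, it creates a position file. If the monitored directory never contains any files, or files appeared and were later deleted, that position file will be empty. When the Taildir source starts again in that situation, it throws the exception above. So this is a minor bug in Flume.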
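To make the failure mode concrete without pulling in Gson, here is a minimal sketch of the same idea. It is a simplified stand-in, not Gson's actual implementation: the class name EmptyInputDemo is hypothetical, comment skipping and buffering are left out, and the exception message omits the line and column. It only shows that skipping whitespace on empty input runs straight into end-of-stream and, with throwOnEof set, raises EOFException.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Simplified stand-in for the whitespace-skipping step; NOT Gson's real code. */
final class EmptyInputDemo {

    /**
     * Returns the next non-whitespace character from the reader.
     * At end of input it throws EOFException when throwOnEof is true,
     * otherwise it returns -1.
     */
    static int nextNonWhitespace(Reader in, boolean throwOnEof) throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            if (c != ' ' && c != '\t' && c != '\n' && c != '\r') {
                return c;
            }
        }
        if (throwOnEof) {
            throw new EOFException("End of input");
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // Input with content: the first real character is returned.
        int first = nextNonWhitespace(new StringReader("  [{\"inode\":1}]"), true);
        System.out.println("first character: " + (char) first);

        // Empty input, like an empty position file: there is nothing to return.
        try {
            nextNonWhitespace(new StringReader(""), true);
        } catch (EOFException e) {
            System.out.println("empty input: " + e.getMessage());
        }
    }
}
```

Note that input containing only whitespace behaves the same way as empty input here, since every character is skipped before the end of the stream is reached.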
Solution
Once the cause is known, the fix is simple: at startup, if the position file is empty, skip loading positions from it. So, let's get to it. Go to ReliableTaildirEventReader and add the check:
/**
 * Create a ReliableTaildirEventReader to watch the given directory.
 */
private ReliableTaildirEventReader(Map<String, String> filePaths,
    Table<String, String, String> headerTable, String positionFilePath,
    boolean skipToEnd, boolean addByteOffset, boolean cachePatternMatching,
    boolean annotateFileName, String fileNameHeader) throws IOException {
  // non-essential code omitted
  .....
  logger.info("Updating position from position file: " + positionFilePath);
  File file = new File(positionFilePath);
  // Only load positions when the position file exists, is a regular file, and is non-empty
  if (file.exists() && file.isFile() && file.length() != 0) {
    loadPositionFile(positionFilePath);
  }
}
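The guard can be tried out in isolation. The sketch below pulls the same three-part condition into a hypothetical helper, PositionFileGuard.shouldLoad (my name, not something in Flume), and runs it against temporary files. The JSON written to the second file is only an illustrative position entry, not real data.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

/** Hypothetical helper that isolates the guard condition used in the fix. */
final class PositionFileGuard {

    /** True only when the path is an existing, regular, non-empty file. */
    static boolean shouldLoad(String positionFilePath) {
        File file = new File(positionFilePath);
        return file.exists() && file.isFile() && file.length() != 0;
    }

    public static void main(String[] args) throws IOException {
        // createTempFile creates a zero-length file, like an empty position file.
        File empty = File.createTempFile("taildir_position", ".json");
        empty.deleteOnExit();
        System.out.println(shouldLoad(empty.getPath())); // false

        // A file with some content passes the guard (illustrative entry only).
        File filled = File.createTempFile("taildir_position", ".json");
        filled.deleteOnExit();
        String json = "[{\"inode\":1,\"pos\":0,\"file\":\"/tmp/a.log\"}]";
        Files.write(filled.toPath(), json.getBytes(StandardCharsets.UTF_8));
        System.out.println(shouldLoad(filled.getPath())); // true
    }
}
```

One limitation of a length check: a file that contains only whitespace has a non-zero length, so it would pass the guard and the parser would still reach end of input. That case did not come up here, where the position file was completely empty.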
Then compile and package:

Upload the jar to Flume's lib directory, then restart the agent:

The Flume log no longer shows the java.io.EOFException: End of input at line 1 column 1 error. Done.
