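Lucene 4.7 Index Source Code Study: The Segment Meta File
Starting from the segment meta file segments_N
The meta file records information about the segments in an index. If you open an index directory you will find files whose names start with segments_; these are the meta files.
Lucene uses the number after segments_ (the generation, written in base 36) to decide which file to read, normally the one with the highest generation.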
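To make the naming rule concrete, here is a minimal sketch that picks the highest generation from a directory listing. It does not use the Lucene API: the class and method names are hypothetical, and it assumes the suffix after segments_ is a base-36 number and that segments.gen is skipped while scanning.

```java
class SegmentsGenerationSketch {
    static final String SEGMENTS = "segments";

    /** Parses the generation from a segments file name; the suffix is base 36. */
    static long generationFromFileName(String fileName) {
        if (fileName.equals(SEGMENTS)) {
            return 0; // name without a suffix is treated as generation 0
        }
        if (fileName.startsWith(SEGMENTS + "_")) {
            String suffix = fileName.substring(SEGMENTS.length() + 1);
            return Long.parseLong(suffix, Character.MAX_RADIX);
        }
        throw new IllegalArgumentException("not a segments file: " + fileName);
    }

    /** Returns the highest generation found in the listing, or -1 if there is none. */
    static long lastCommitGeneration(String[] files) {
        long max = -1;
        for (String file : files) {
            // segments.gen and ordinary segment files such as _0.si are ignored
            if (file.equals(SEGMENTS) || file.startsWith(SEGMENTS + "_")) {
                max = Math.max(max, generationFromFileName(file));
            }
        }
        return max;
    }

    /** Builds the file name for a generation, the inverse of generationFromFileName. */
    static String fileNameFromGeneration(long gen) {
        if (gen < 0) {
            return null;
        }
        return gen == 0 ? SEGMENTS : SEGMENTS + "_" + Long.toString(gen, Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        String[] files = {"_0.si", "segments_9", "segments_a", "segments.gen"};
        long gen = lastCommitGeneration(files);
        System.out.println(gen + " -> " + fileNameFromGeneration(gen));
    }
}
```

Because the suffix is base 36, segments_a (generation 10) is newer than segments_9, which a plain string or decimal comparison would not reliably tell you.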
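Lucene provides different loading templates for different kinds of physical index storage; StandardDirectoryReader.java is used as the example here.
When an index is loaded through DirectoryReader.open, the code is as follows: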
  /** called from DirectoryReader.open(...) methods */
  static DirectoryReader open(final Directory directory, final IndexCommit commit,
                              final int termInfosIndexDivisor) throws IOException {
    // Build a FindSegmentsFile for the directory; run() locates segments_N and calls back into doBody()
    return (DirectoryReader) new SegmentInfos.FindSegmentsFile(directory) {
      @Override
      protected Object doBody(String segmentFileName) throws IOException {
        SegmentInfos sis = new SegmentInfos();
        sis.read(directory, segmentFileName); // parse the meta file and load the info of every segment
        final SegmentReader[] readers = new SegmentReader[sis.size()]; // sis.size() is the number of segments
        for (int i = sis.size()-1; i >= 0; i--) { // build one SegmentReader per segment
          IOException prior = null;
          boolean success = false;
          try {
            // the SegmentReader wraps the segment info and is the entry point for reading it
            readers[i] = new SegmentReader(sis.info(i), termInfosIndexDivisor, IOContext.READ);
            success = true;
          } catch(IOException ex) {
            prior = ex;
          } finally {
            if (!success) {
              IOUtils.closeWhileHandlingException(prior, readers);
            }
          }
        }
        return new StandardDirectoryReader(directory, readers, null, sis, termInfosIndexDivisor, false);
      }
    }.run(commit); // run() executes first and invokes doBody()
  }
How run() executes: Lucene looks for a segments_N file that satisfies the rules.
  public Object run(IndexCommit commit) throws IOException {
    if (commit != null) {
      if (directory != commit.getDirectory())
        throw new IOException("the specified commit does not match the specified Directory");
      return doBody(commit.getSegmentsFileName());
    }

    String segmentFileName = null;
    long lastGen = -1;
    long gen = 0;
    int genLookaheadCount = 0;
    IOException exc = null;
    int retryCount = 0;

    boolean useFirstMethod = true;

    // Loop until we succeed in calling doBody() without
    // hitting an IOException. An IOException most likely
    // means a commit was in process and has finished, in
    // the time it took us to load the now-old infos files
    // (and segments files). It's also possible it's a
    // true error (corrupt index). To distinguish these,
    // on each retry we must see "forward progress" on
    // which generation we are trying to load. If we
    // don't, then the original error is real and we throw
    // it.

    // We have three methods for determining the current
    // generation. We try the first two in parallel (when
    // useFirstMethod is true), and fall back to the third
    // when necessary.

    while(true) { // loops until a return or a throw

      if (useFirstMethod) { // initially true

        // List the directory and use the highest
        // segments_N file. This method works well as long
        // as there is no stale caching on the directory
        // contents (NOTE: NFS clients often have such stale
        // caching):
        String[] files = null;

        long genA = -1;

        files = directory.listAll();

        if (files != null) {
          genA = getLastCommitGeneration(files); // highest N among the segments_N files
        }

        if (infoStream != null) {
          message("directory listing genA=" + genA);
        }

        // Also open segments.gen and read its
        // contents. Then we take the larger of the two
        // gens. This way, if either approach is hitting
        // a stale cache (NFS) we have a better chance of
        // getting the right generation.
        long genB = -1;
        IndexInput genInput = null;
        try {
          genInput = directory.openInput(IndexFileNames.SEGMENTS_GEN, IOContext.READONCE); // open segments.gen
        } catch (IOException e) {
          if (infoStream != null) {
            message("segments.gen open: IOException " + e);
          }
        }

        if (genInput != null) {
          try {
            int version = genInput.readInt(); // format version of segments.gen (FORMAT_SEGMENTS_GEN_CURRENT is -2)
            if (version == FORMAT_SEGMENTS_GEN_CURRENT) { // format matches the current one
              long gen0 = genInput.readLong(); // the generation is stored twice
              long gen1 = genInput.readLong();
              if (infoStream != null) {
                message("fallback check: " + gen0 + "; " + gen1);
              }
              if (gen0 == gen1) { // both copies agree, so use the value as genB
                // The file is consistent.
                genB = gen0;
              }
            } else { // unrecognized format version: the file was written by a newer Lucene
              throw new IndexFormatTooNewException(genInput, version, FORMAT_SEGMENTS_GEN_CURRENT, FORMAT_SEGMENTS_GEN_CURRENT);
            }
          } catch (IOException err2) {
            // rethrow any format exception
            if (err2 instanceof CorruptIndexException) throw err2;
          } finally {
            genInput.close();
          }
        }

        if (infoStream != null) {
          message(IndexFileNames.SEGMENTS_GEN + " check: genB=" + genB);
        }

        // Pick the larger of the two gen's:
        gen = Math.max(genA, genB);

        if (gen == -1) { // no segments meta file was found
          // Neither approach found a generation
          throw new IndexNotFoundException("no segments* file found in " + directory + ": files: " + Arrays.toString(files));
        }
      }

      if (useFirstMethod && lastGen == gen && retryCount >= 2) { // no progress after retryCount reached 2
        // Give up on first method -- this is 3rd cycle on
        // listing directory and checking gen file to
        // attempt to locate the segments file.
        useFirstMethod = false;
      }

      // Second method: both directory cache and
      // file contents cache seem to be stale, just
      // advance the generation.
      if (!useFirstMethod) {
        if (genLookaheadCount < defaultGenLookaheadCount) { // defaultGenLookaheadCount = 10; try the next generation
          gen++;
          genLookaheadCount++;
          if (infoStream != null) {
            message("look ahead increment gen to " + gen);
          }
        } else {
          // All attempts have failed -- throw first exc:
          throw exc;
        }
      } else if (lastGen == gen) {
        // This means we're about to try the same
        // segments_N last tried.
        retryCount++;
      } else {
        // Segment file has advanced since our last loop
        // (we made "progress"), so reset retryCount:
        retryCount = 0;
      }

      lastGen = gen;

      // build the segments_N file name for gen and try to read it
      segmentFileName = IndexFileNames.fileNameFromGeneration(IndexFileNames.SEGMENTS, "", gen);

      try {
        Object v = doBody(segmentFileName);
        if (infoStream != null) {
          message("success on " + segmentFileName);
        }
        return v;
      } catch (IOException err) { // reading failed

        // Save the original root cause:
        if (exc == null) {
          exc = err;
        }

        if (infoStream != null) {
          message("primary Exception on '" + segmentFileName + "': " + err + "'; will retry: retryCount=" + retryCount + "; gen = " + gen);
        }

        if (gen > 1 && useFirstMethod && retryCount == 1) { // fallback: try segments_(N-1)

          // This is our second time trying this same segments
          // file (because retryCount is 1), and, there is
          // possibly a segments_(N-1) (because gen > 1).
          // So, check if the segments_(N-1) exists and
          // try it if so:
          String prevSegmentFileName = IndexFileNames.fileNameFromGeneration(IndexFileNames.SEGMENTS, "", gen-1);

          final boolean prevExists;
          prevExists = directory.fileExists(prevSegmentFileName);

          if (prevExists) {
            if (infoStream != null) {
              message("fallback to prior segment file '" + prevSegmentFileName + "'");
            }
            try {
              Object v = doBody(prevSegmentFileName); // read the previous segments file
              if (infoStream != null) {
                message("success on fallback " + prevSegmentFileName);
              }
              return v;
            } catch (IOException err2) {
              if (infoStream != null) {
                message("secondary Exception on '" + prevSegmentFileName + "': " + err2 + "'; will retry");
              }
            }
          }
        }
      }
    }
  }
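The segments.gen check in run() can be tried in isolation. The following is a minimal sketch, not Lucene code: the class and method names are hypothetical, a ByteBuffer stands in for IndexInput, and it assumes the file holds a big-endian int format marker followed by the generation written twice as longs, which is what the readInt/readLong calls in the listing suggest.

```java
import java.nio.ByteBuffer;

class SegmentsGenSketch {
    // Value noted in the listing's comment; treated here as an assumption.
    static final int FORMAT_SEGMENTS_GEN_CURRENT = -2;

    /** Encodes a segments.gen-like payload: format marker plus the generation twice. */
    static byte[] encode(int format, long gen0, long gen1) {
        return ByteBuffer.allocate(4 + 8 + 8).putInt(format).putLong(gen0).putLong(gen1).array();
    }

    /** Returns the stored generation, or -1 when the two copies disagree. */
    static long readGenB(byte[] content) {
        ByteBuffer in = ByteBuffer.wrap(content);
        int format = in.getInt();
        if (format != FORMAT_SEGMENTS_GEN_CURRENT) {
            // stands in for IndexFormatTooNewException in the real code
            throw new IllegalStateException("unsupported segments.gen format: " + format);
        }
        long gen0 = in.getLong();
        long gen1 = in.getLong();
        return gen0 == gen1 ? gen0 : -1; // a mismatch means the file was caught mid-write
    }

    /** Mirrors gen = Math.max(genA, genB) from run(). */
    static long pickGeneration(long genA, long genB) {
        return Math.max(genA, genB);
    }

    public static void main(String[] args) {
        long genA = 6; // pretend the directory listing was stale
        long genB = readGenB(encode(FORMAT_SEGMENTS_GEN_CURRENT, 7, 7));
        System.out.println("gen = " + pickGeneration(genA, genB));
    }
}
```

Writing the generation twice gives the reader a cheap way to detect a half-written file without any locking, and taking the maximum of both sources means a stale directory listing or a stale segments.gen alone does not hide the newest commit.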
Parsing the structure of the segments_N file (from 追风的蓝宝):
Header, Version, NameCounter, SegCount, <SegName, SegCodec, DelGen, DeletionCount, FieldInfosGen, UpdatesFiles>SegCount, CommitUserData, Footer
Here <SegName, SegCodec, DelGen, DeletionCount, FieldInfosGen, UpdatesFiles> describes one segment and SegCount is the number of segments, so
<SegName, SegCodec, DelGen, DeletionCount, FieldInfosGen, UpdatesFiles>SegCount means SegCount such segment entries written one after another.
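Header:
The header is a CodecHeader made up of three parts: Magic, CodecName and Version.
Magic is the start marker, normally 1071082519 (0x3fd76c17).
CodecName identifies the file.
Version is the format version of the index file. When an IndexReader of one version reads an index written by a different version, a mismatch in this value causes an error.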
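The three header parts can be illustrated with a small, self-contained sketch. It is not the CodecUtil implementation: the class and method names are hypothetical, the version numbers in main are made up, and it assumes a string is stored as a variable-length int byte count followed by UTF-8 bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CodecHeaderSketch {
    static final int CODEC_MAGIC = 1071082519; // the Magic value quoted above

    /** Writes an int using 7 bits per byte, high bit set on all but the last byte. */
    static void writeVInt(ByteBuffer out, int value) {
        while ((value & ~0x7F) != 0) {
            out.put((byte) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.put((byte) value);
    }

    static int readVInt(ByteBuffer in) {
        byte b = in.get();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.get();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Header layout: Magic (int), CodecName (string), Version (int). */
    static byte[] writeHeader(String codecName, int version) {
        byte[] name = codecName.getBytes(StandardCharsets.UTF_8);
        ByteBuffer out = ByteBuffer.allocate(4 + 5 + name.length + 4);
        out.putInt(CODEC_MAGIC);
        writeVInt(out, name.length);
        out.put(name);
        out.putInt(version);
        return Arrays.copyOf(out.array(), out.position());
    }

    /** Validates the header and returns the version actually stored in it. */
    static int checkHeader(byte[] file, String expectedCodec, int minVersion, int maxVersion) {
        ByteBuffer in = ByteBuffer.wrap(file);
        if (in.getInt() != CODEC_MAGIC) {
            throw new IllegalStateException("codec header mismatch: wrong magic");
        }
        byte[] name = new byte[readVInt(in)];
        in.get(name);
        String actualCodec = new String(name, StandardCharsets.UTF_8);
        if (!actualCodec.equals(expectedCodec)) {
            throw new IllegalStateException("codec mismatch: " + actualCodec);
        }
        int version = in.getInt();
        if (version < minVersion || version > maxVersion) {
            throw new IllegalStateException("unsupported version: " + version);
        }
        return version;
    }

    public static void main(String[] args) {
        byte[] header = writeHeader("segments", 2); // 2 is an arbitrary example version
        System.out.println("version = " + checkHeader(header, "segments", 0, 2));
    }
}
```

Returning the stored version, rather than just validating it, is what lets the caller branch on older layouts, as read() does with actualFormat.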
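Version:
The version of the index; it records how many times IndexWriter has committed changes to the index files.
Its initial value is in most cases read from the index file.
We do not care about the exact number of commits; what matters is which version is the latest. An IndexReader often compares its own version with the version in the index file to determine whether IndexWriter has updated the index since the reader was opened.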
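The comparison described above amounts to an equality test on two numbers. A minimal sketch, with a hypothetical class that is not part of Lucene and a version value passed in by hand rather than read from a segments_N file:

```java
class ReaderVersionSketch {
    private final long versionAtOpen; // version read when the reader was opened

    ReaderVersionSketch(long versionAtOpen) {
        this.versionAtOpen = versionAtOpen;
    }

    /** True if the index has not been committed to since this reader was opened. */
    boolean isCurrent(long versionInLatestSegmentsFile) {
        return versionAtOpen == versionInLatestSegmentsFile;
    }

    public static void main(String[] args) {
        ReaderVersionSketch reader = new ReaderVersionSketch(12);
        System.out.println(reader.isCurrent(12)); // no commit since open
        System.out.println(reader.isCurrent(13)); // a writer committed in the meantime
    }
}
```

Only equality matters here, which is why the text says the absolute number of commits is of no interest.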
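- NameCounter
  - The counter used to name the next new segment.
  - All index files belonging to the same segment use the segment name as their file name, typically _0.xxx, _0.yyy, _1.xxx, _1.yyy, ...
  - A new segment's name is normally the largest existing segment name plus one.
- SegCount
  - The number of segments.
- Metadata for each of the SegCount segments:
  - SegName: the segment name; every file belonging to the segment uses it as its file name.
  - SegCodec: the name of the codec that encoded the segment.
  - DelGen: the generation of the .del file.
    - In Lucene, deleted documents are kept in the .del file until an optimize is performed.
    - DelGen is incremented by 1 each time IndexWriter commits deletions to the index, and a new .del file is written.
    - A value of -1 means the segment has no deleted documents.
  - DeletionCount: the number of documents deleted in this segment.
  - FieldInfosGen: the generation of the segment's field infos file; -1 means the field infos have not been updated, a value greater than 0 means they have.
  - UpdatesFiles: the list of files updated in this segment.
- CommitUserData: a string-to-string map of user data stored with the commit (read() loads it with readStringStringMap()).
- Footer: the end of the codec encoding, containing the checksum and the checksum algorithm ID. Note that the 4.7 read() listing in this article only reads and compares a trailing long checksum.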
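The relationship between NameCounter, SegName and the per-segment file names can be sketched in a few lines. The class and method names are hypothetical, and the sketch assumes segment names are an underscore followed by the counter in base 36, which matches names such as _0, _1 and _a that appear in this article.

```java
class SegmentNameSketch {
    /** Derives a segment name from the counter value, e.g. 10 becomes "_a". */
    static String newSegmentName(int counter) {
        return "_" + Integer.toString(counter, Character.MAX_RADIX);
    }

    /** Every file of a segment shares the segment name and differs only by extension. */
    static String segmentFileName(String segmentName, String extension) {
        return segmentName + "." + extension;
    }

    public static void main(String[] args) {
        int counter = 10; // pretend this NameCounter value was read from segments_N
        String name = newSegmentName(counter);
        System.out.println(name + " -> " + segmentFileName(name, "si"));
    }
}
```

Since the counter only ever grows, a new segment never reuses the name of a segment that was merged away or deleted.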
  /**
   * Read a particular segmentFileName. Note that this may
   * throw an IOException if a commit is in process.
   *
   * @param directory -- directory containing the segments file
   * @param segmentFileName -- segment file to load
   * @throws CorruptIndexException if the index is corrupt
   * @throws IOException if there is a low-level IO error
   */
  public final void read(Directory directory, String segmentFileName) throws IOException {
    boolean success = false;

    // Clear any previous segments:
    this.clear();

    generation = generationFromSegmentsFileName(segmentFileName);

    lastGeneration = generation; // remember N as the latest generation

    ChecksumIndexInput input = new ChecksumIndexInput(directory.openInput(segmentFileName, IOContext.READ));
    try {
      final int format = input.readInt(); // first int: the codec magic for 4.0+, otherwise a pre-4.0 format number
      if (format == CodecUtil.CODEC_MAGIC) {
        // 4.0+
        int actualFormat = CodecUtil.checkHeaderNoMagic(input, "segments", VERSION_40, VERSION_46); // validate the rest of the header
        version = input.readLong(); // number of changes IndexWriter has committed
        counter = input.readInt(); // counter used to name the next new segment
        int numSegments = input.readInt(); // number of segments in the index
        if (numSegments < 0) {
          throw new CorruptIndexException("invalid segment count: " + numSegments + " (resource: " + input + ")");
        }
        // read the info for each segment
        for(int seg=0;seg<numSegments;seg++) {
          String segName = input.readString(); // segment name
          Codec codec = Codec.forName(input.readString());
          //System.out.println("SIS.read seg=" + seg + " codec=" + codec);
          // read this segment's .si file
          SegmentInfo info = codec.segmentInfoFormat().getSegmentInfoReader().read(directory, segName, IOContext.READ);
          info.setCodec(codec);
          long delGen = input.readLong();
          int delCount = input.readInt();
          if (delCount < 0 || delCount > info.getDocCount()) {
            throw new CorruptIndexException("invalid deletion count: " + delCount + " (resource: " + input + ")");
          }
          long fieldInfosGen = -1;
          if (actualFormat >= VERSION_46) {
            fieldInfosGen = input.readLong();
          }
          SegmentCommitInfo siPerCommit = new SegmentCommitInfo(info, delCount, delGen, fieldInfosGen);
          if (actualFormat >= VERSION_46) {
            int numGensUpdatesFiles = input.readInt(); // UpdatesFiles: files updated in this segment, keyed by generation
            final Map<Long,Set<String>> genUpdatesFiles;
            if (numGensUpdatesFiles == 0) {
              genUpdatesFiles = Collections.emptyMap();
            } else { // if present, collect them for the SegmentCommitInfo
              genUpdatesFiles = new HashMap<Long,Set<String>>(numGensUpdatesFiles);
              for (int i = 0; i < numGensUpdatesFiles; i++) {
                genUpdatesFiles.put(input.readLong(), input.readStringSet());
              }
            }
            siPerCommit.setGenUpdatesFiles(genUpdatesFiles);
          }
          add(siPerCommit);
        }
        userData = input.readStringStringMap();
      } else {
        Lucene3xSegmentInfoReader.readLegacyInfos(this, directory, input, format);
        Codec codec = Codec.forName("Lucene3x");
        for (SegmentCommitInfo info : this) {
          info.info.setCodec(codec);
        }
      }

      final long checksumNow = input.getChecksum();
      final long checksumThen = input.readLong();
      if (checksumNow != checksumThen) {
        throw new CorruptIndexException("checksum mismatch in segments file (resource: " + input + ")");
      }

      success = true;
    } finally {
      if (!success) {
        // Clear any segment infos we had loaded so we
        // have a clean slate on retry:
        this.clear();
        IOUtils.closeWhileHandlingException(input);
      } else {
        input.close();
      }
    }
  }
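The checksumNow / checksumThen comparison at the end of read() can be reproduced with the JDK alone. This is a sketch under assumptions rather than the ChecksumIndexInput implementation: the class and method names are hypothetical, CRC32 is assumed as the checksum algorithm, and the checksum is assumed to be stored as a big-endian long after the payload.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

class SegmentsChecksumSketch {
    /** Appends the CRC32 of the payload as a trailing 8-byte long. */
    static byte[] appendChecksum(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(payload.length + 8).put(payload).putLong(crc.getValue()).array();
    }

    /** Recomputes the checksum over everything before the trailing long and compares. */
    static boolean verify(byte[] file) {
        int payloadLength = file.length - 8;
        if (payloadLength < 0) {
            return false; // too short to even hold a checksum
        }
        CRC32 crc = new CRC32();
        crc.update(file, 0, payloadLength);
        long checksumNow = crc.getValue();
        long checksumThen = ByteBuffer.wrap(file, payloadLength, 8).getLong();
        return checksumNow == checksumThen;
    }

    public static void main(String[] args) {
        byte[] file = appendChecksum("segment metadata".getBytes(StandardCharsets.UTF_8));
        System.out.println("intact: " + verify(file));
        file[0] ^= 0x01; // simulate corruption or a partially written file
        System.out.println("after corruption: " + verify(file));
    }
}
```

In run(), a failed check of this kind surfaces as an IOException, which is exactly what triggers the retry and fallback logic shown earlier.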
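Loading the information of each segment listed in segments_N: the .si file
(from 追风的蓝宝) The .si file stores a segment's metadata and mainly involves SegmentInfoFormat.java and SegmentInfo.java. Because the quoted article covers Solr 4.8.0, the corresponding class is Lucene46SegmentInfoFormat, a subclass of SegmentInfoFormat.
First, the format of the .si file: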
Header | Version (SegVersion) | Document count (SegSize) | Compound file flag (IsCompoundFile) | Diagnostics | Files | Footer
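- Header: same structure as the header of segments_N, made up of Magic, CodecName and Version.
- Version (SegVersion): the Lucene version that created the segment.
- Size (SegSize): the number of documents indexed in the segment.
- IsCompoundFile: whether the segment is stored in compound file format; 1 means it is.
- Diagnostics: information useful for debugging, such as the Lucene version, OS, Java version, and how the segment was created (merge, add, addIndexes).
- Files: the files that make up the segment.
- Footer: the end of the codec encoding, containing the checksum and the checksum algorithm ID. The Lucene46SegmentInfoReader listing that follows reads no footer; it only verifies that the whole file was consumed.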
read
public class Lucene46SegmentInfoReader extends SegmentInfoReader {

  /** Sole constructor. */
  public Lucene46SegmentInfoReader() {
  }

  @Override
  public SegmentInfo read(Directory dir, String segment, IOContext context) throws IOException {
    final String fileName = IndexFileNames.segmentFileName(segment, "", Lucene46SegmentInfoFormat.SI_EXTENSION); // name of the .si file
    final IndexInput input = dir.openInput(fileName, context);
    boolean success = false;
    try {
      CodecUtil.checkHeader(input, Lucene46SegmentInfoFormat.CODEC_NAME,
                                   Lucene46SegmentInfoFormat.VERSION_START,
                                   Lucene46SegmentInfoFormat.VERSION_CURRENT); // validate the header
      final String version = input.readString();
      final int docCount = input.readInt(); // number of documents
      if (docCount < 0) {
        throw new CorruptIndexException("invalid docCount: " + docCount + " (resource=" + input + ")");
      }
      final boolean isCompoundFile = input.readByte() == SegmentInfo.YES; // whether the segment uses compound file format
      final Map<String,String> diagnostics = input.readStringStringMap(); // debug information
      final Set<String> files = input.readStringSet(); // files that belong to this segment

      if (input.getFilePointer() != input.length()) {
        throw new CorruptIndexException("did not read all bytes from file \"" + fileName + "\": read " + input.getFilePointer() + " vs size " + input.length() + " (resource: " + input + ")");
      }

      final SegmentInfo si = new SegmentInfo(dir, version, segment, docCount, isCompoundFile, null, diagnostics); // populate the SegmentInfo
      si.setFiles(files);

      success = true;

      return si;

    } finally {
      if (!success) {
        IOUtils.closeWhileHandlingException(input);
      } else {
        input.close();
      }
    }
  }
}
write
/**
 * Lucene 4.0 implementation of {@link SegmentInfoWriter}.
 *
 * @see Lucene46SegmentInfoFormat
 * @lucene.experimental
 */
public class Lucene46SegmentInfoWriter extends SegmentInfoWriter {

  /** Sole constructor. */
  public Lucene46SegmentInfoWriter() {
  }

  /** Save a single segment's info. */
  @Override
  public void write(Directory dir, SegmentInfo si, FieldInfos fis, IOContext ioContext) throws IOException {
    final String fileName = IndexFileNames.segmentFileName(si.name, "", Lucene46SegmentInfoFormat.SI_EXTENSION);
    si.addFile(fileName);

    final IndexOutput output = dir.createOutput(fileName, ioContext);

    boolean success = false;
    try {
      CodecUtil.writeHeader(output, Lucene46SegmentInfoFormat.CODEC_NAME, Lucene46SegmentInfoFormat.VERSION_CURRENT); // write the header
      // Write the Lucene version that created this segment, since 3.1
      output.writeString(si.getVersion()); // version
      output.writeInt(si.getDocCount()); // number of documents
      output.writeByte((byte) (si.getUseCompoundFile() ? SegmentInfo.YES : SegmentInfo.NO)); // compound file flag
      output.writeStringStringMap(si.getDiagnostics()); // diagnostics
      output.writeStringSet(si.files()); // files that belong to the segment

      success = true;
    } finally {
      if (!success) {
        IOUtils.closeWhileHandlingException(output);
        si.dir.deleteFile(fileName);
      } else {
        output.close();
      }
    }
  }
}
The SegmentReader constructor that doBody() calls for each segment:
  /**
   * Constructs a new SegmentReader with a new core.
   * @throws CorruptIndexException if the index is corrupt
   * @throws IOException if there is a low-level IO error
   */
  // TODO: why is this public?
  public SegmentReader(SegmentCommitInfo si, int termInfosIndexDivisor, IOContext context) throws IOException {
    this.si = si;
    // TODO if the segment uses CFS, we may open the CFS file twice: once for
    // reading the FieldInfos (if they are not gen'd) and second time by
    // SegmentCoreReaders. We can open the CFS here and pass to SCR, but then it
    // results in less readable code (resource not closed where it was opened).
    // Best if we could somehow read FieldInfos in SCR but not keep it there, but
    // constructors don't allow returning two things...
    fieldInfos = readFieldInfos(si);
    core = new SegmentCoreReaders(this, si.info.dir, si, context, termInfosIndexDivisor);
    segDocValues = new SegmentDocValues();

    boolean success = false;
    final Codec codec = si.info.getCodec();
    try {
      if (si.hasDeletions()) {
        // NOTE: the bitvector is stored using the regular directory, not cfs
        liveDocs = codec.liveDocsFormat().readLiveDocs(directory(), si, IOContext.READONCE);
      } else {
        assert si.getDelCount() == 0;
        liveDocs = null;
      }
      numDocs = si.info.getDocCount() - si.getDelCount();

      if (fieldInfos.hasDocValues()) {
        initDocValuesProducers(codec);
      }

      success = true;
    } finally {
      // With lock-less commits, it's entirely possible (and
      // fine) to hit a FileNotFound exception above. In
      // this case, we want to explicitly close any subset
      // of things that were opened so that we don't have to
      // wait for a GC to do so.
      if (!success) {
        doClose();
      }
    }
  }
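(from 追风的蓝宝) SegmentInfo has a toString() method; when the Solr log level is set to debug, it prints the .si information. For example, from the output "_a(3.1):c45/4" we can read the following:
1. _a is the segment name.
2. (3.1) is the Lucene version; a ? means the version is unknown.
3. c means compound file format; C means non-compound file format.
4. 45 means the segment holds 45 documents.
5. 4 is the number of deleted documents.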
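As a reading aid for such log lines, the fields can be pulled apart with a regular expression. This is a hypothetical helper, not part of Lucene or Solr, and it assumes exactly the shape of the example above, with the deletion count optional; real log output may carry additional suffixes that this pattern does not handle.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SegmentStringSketch {
    // name(version):c|C docCount[/delCount]
    private static final Pattern FORMAT =
        Pattern.compile("(_\\w+)\\(([^)]*)\\):([cC])(\\d+)(?:/(\\d+))?");

    final String name;
    final String luceneVersion;
    final boolean compound;
    final int docCount;
    final int delCount;

    private SegmentStringSketch(String name, String luceneVersion, boolean compound,
                                int docCount, int delCount) {
        this.name = name;
        this.luceneVersion = luceneVersion;
        this.compound = compound;
        this.docCount = docCount;
        this.delCount = delCount;
    }

    static SegmentStringSketch parse(String text) {
        Matcher m = FORMAT.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("unrecognized segment string: " + text);
        }
        int delCount = m.group(5) == null ? 0 : Integer.parseInt(m.group(5));
        return new SegmentStringSketch(m.group(1), m.group(2), m.group(3).equals("c"),
                                       Integer.parseInt(m.group(4)), delCount);
    }

    /** Live documents, computed the same way SegmentReader derives numDocs. */
    int numDocs() {
        return docCount - delCount;
    }

    public static void main(String[] args) {
        SegmentStringSketch s = parse("_a(3.1):c45/4");
        System.out.println(s.name + " version=" + s.luceneVersion + " compound=" + s.compound
            + " docs=" + s.docCount + " deleted=" + s.delCount + " live=" + s.numDocs());
    }
}
```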
