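Lucene 4.7 Index Source Code Study: The Segment Metadata File

Starting from the segment metadata file, segments_N

The metadata file records information about the segments in an index. If you open an index directory you will find files whose names start with segments_; these are the metadata files.

Lucene uses the number after segments_ (the generation) to decide which of these files to read.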
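
The selection rule can be sketched in plain Java without depending on Lucene. The class and method names below are hypothetical. The sketch assumes what Lucene's IndexFileNames does, namely that the suffix N is written in base 36 (Character.MAX_RADIX) and that the bare name "segments" counts as generation 0.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; not Lucene's own class.
public class LatestSegmentsFile {

    /** Generation encoded in a segments_N file name, or -1 for any other file. */
    static long generationOf(String fileName) {
        if (fileName.equals("segments")) {
            return 0; // oldest naming scheme, no suffix
        }
        if (fileName.startsWith("segments_")) {
            // Assumption: N is a base-36 number, so segments_a is generation 10.
            return Long.parseLong(fileName.substring("segments_".length()), Character.MAX_RADIX);
        }
        return -1; // segments.gen, _0.si and so on are not commit files
    }

    /** Largest generation found in a directory listing, or -1 if there is none. */
    static long lastCommitGeneration(List<String> files) {
        long max = -1;
        for (String file : files) {
            max = Math.max(max, generationOf(file));
        }
        return max;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("_0.si", "segments.gen", "segments_9", "segments_a");
        System.out.println(lastCommitGeneration(files)); // prints 10
    }
}
```

Note that segments_a wins over segments_9 only because the suffix is compared as a number, not as a string.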

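When an index is loaded through DirectoryReader.open, the code is as follows.

Lucene provides different loading implementations for different kinds of physical index storage; take StandardDirectoryReader.java as an example: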

/** called from DirectoryReader.open(...) methods */
  static DirectoryReader open(final Directory directory, final IndexCommit commit,
                          final int termInfosIndexDivisor) throws IOException {
    return (DirectoryReader) new SegmentInfos.FindSegmentsFile(directory) { // anonymous FindSegmentsFile bound to the directory; run() calls back into doBody()
      @Override
      protected Object doBody(String segmentFileName) throws IOException {
        SegmentInfos sis = new SegmentInfos();
        sis.read(directory, segmentFileName); // read the chosen metadata file and load all segment infos
        final SegmentReader[] readers = new SegmentReader[sis.size()]; // sis.size() is the number of segments
        for (int i = sis.size()-1; i >= 0; i--) { // build one SegmentReader per segment, last to first
          IOException prior = null;
          boolean success = false;
          try {
            readers[i] = new SegmentReader(sis.info(i), termInfosIndexDivisor, IOContext.READ); // the SegmentReader is the read entry point for this segment
            success = true;
          } catch(IOException ex) {
            prior = ex;
          } finally {
            if (!success) {
              IOUtils.closeWhileHandlingException(prior, readers);
            }
          }
        }
        return new StandardDirectoryReader(directory, readers, null, sis, termInfosIndexDivisor, false);
      }
    }.run(commit); // run() executes first and invokes doBody() once it has picked a segments_N
  }

How run executes: Lucene looks for a segments_N file that satisfies its rules

  public Object run(IndexCommit commit) throws IOException {
      if (commit != null) {
        if (directory != commit.getDirectory())
          throw new IOException("the specified commit does not match the specified Directory");
        return doBody(commit.getSegmentsFileName());
      }

      String segmentFileName = null;
      long lastGen = -1;
      long gen = 0;
      int genLookaheadCount = 0;
      IOException exc = null;
      int retryCount = 0;

      boolean useFirstMethod = true;

      // Loop until we succeed in calling doBody() without
      // hitting an IOException.  An IOException most likely
      // means a commit was in process and has finished, in
      // the time it took us to load the now-old infos files
      // (and segments files).  It's also possible it's a
      // true error (corrupt index).  To distinguish these,
      // on each retry we must see "forward progress" on
      // which generation we are trying to load.  If we
      // don't, then the original error is real and we throw
      // it.
      // In short: keep retrying until doBody() returns without
      // an IOException, i.e. until a usable segments_N is loaded.
      // A failure counts as a real error only when a retry sees
      // no newer generation than the previous attempt.

      // We have three methods for determining the current
      // generation.  We try the first two in parallel (when
      // useFirstMethod is true), and fall back to the third
      // when necessary.
      // While useFirstMethod is true, the directory listing and
      // segments.gen are consulted together; the remaining
      // fallback is used only after they have failed.
      while(true) { // loops until a return or a throw

        if (useFirstMethod) { // first method; useFirstMethod starts out true

          // List the directory and use the highest
          // segments_N file.  This method works well as long
          // as there is no stale caching on the directory
          // contents (NOTE: NFS clients often have such stale
          // caching):
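          // i.e. list every file and use the segments_N with the
          // largest N; this is reliable only while the listing is
          // not served from a stale cache.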

          String[] files = null;

          long genA = -1;

          files = directory.listAll();
          
          if (files != null) {
            genA = getLastCommitGeneration(files); // the largest N becomes genA
          }
          
          if (infoStream != null) {
            message("directory listing genA=" + genA);
          }

          // Also open segments.gen and read its
          // contents.  Then we take the larger of the two
          // gens.  This way, if either approach is hitting
          // a stale cache (NFS) we have a better chance of
          // getting the right generation.
          // i.e. read the generation stored in segments.gen, then keep the larger of the two
          long genB = -1;
          IndexInput genInput = null;
          try {
            genInput = directory.openInput(IndexFileNames.SEGMENTS_GEN, IOContext.READONCE); // open segments.gen
          } catch (IOException e) {
            if (infoStream != null) {
              message("segments.gen open: IOException " + e);
            }
          }

          if (genInput != null) {
            try {
              int version = genInput.readInt(); // format marker of segments.gen; FORMAT_SEGMENTS_GEN_CURRENT is -2
              if (version == FORMAT_SEGMENTS_GEN_CURRENT) { // marker matches the current format
                long gen0 = genInput.readLong(); // the generation is stored twice
                long gen1 = genInput.readLong();
                if (infoStream != null) {
                  message("fallback check: " + gen0 + "; " + gen1);
                }
                if (gen0 == gen1) { // both copies agree, so use the value as genB
                  // The file is consistent.
                  genB = gen0;
                }
              } else { // unknown format marker: the file was written by a newer Lucene
                throw new IndexFormatTooNewException(genInput, version, FORMAT_SEGMENTS_GEN_CURRENT, FORMAT_SEGMENTS_GEN_CURRENT);
              }
            } catch (IOException err2) {
              // rethrow any format exception
              if (err2 instanceof CorruptIndexException) throw err2;
            } finally {
              genInput.close();
            }
          }

          if (infoStream != null) {
            message(IndexFileNames.SEGMENTS_GEN + " check: genB=" + genB);
          }

          // Pick the larger of the two gen's:
          gen = Math.max(genA, genB);

          if (gen == -1) { // -1 means no segments metadata file was found
            // Neither approach found a generation
            throw new IndexNotFoundException("no segments* file found in " + directory + ": files: " + Arrays.toString(files));
          }
        }

        if (useFirstMethod && lastGen == gen && retryCount >= 2) { // same generation seen again with retryCount at 2: drop the first method
          // Give up on first method -- this is 3rd cycle on
          // listing directory and checking gen file to
          // attempt to locate the segments file.
          useFirstMethod = false;
        }

        // Second method: both directory cache and
        // file contents cache seem to be stale, just
        // advance the generation.
        if (!useFirstMethod) {
          if (genLookaheadCount < defaultGenLookaheadCount) { // defaultGenLookaheadCount is 10; probe gen+1, gen+2, ...
            gen++;
            genLookaheadCount++;
            if (infoStream != null) {
              message("look ahead increment gen to " + gen);
            }
          } else {
            // All attempts have failed -- throw first exc:
            throw exc;
          }
        } else if (lastGen == gen) {
          // This means we're about to try the same
          // segments_N last tried.
          retryCount++;
        } else {
          // Segment file has advanced since our last loop
          // (we made "progress"), so reset retryCount:
          retryCount = 0;
        }

        lastGen = gen;

        // build the segments_N file name for the chosen generation and try to load it
        segmentFileName = IndexFileNames.fileNameFromGeneration(IndexFileNames.SEGMENTS, "", gen);

        try {
          Object v = doBody(segmentFileName);
          if (infoStream != null) {
            message("success on " + segmentFileName);
          }
          return v;
        } catch (IOException err) { // loading failed

          // Save the original root cause:
          if (exc == null) {
            exc = err;
          }

          if (infoStream != null) {
            message("primary Exception on '" + segmentFileName + "': " + err + "'; will retry: retryCount=" + retryCount + "; gen = " + gen);
          }

          if (gen > 1 && useFirstMethod && retryCount == 1) { // fallback: try segments_(N-1)

            // This is our second time trying this same segments
            // file (because retryCount is 1), and, there is
            // possibly a segments_(N-1) (because gen > 1).
            // So, check if the segments_(N-1) exists and
            // try it if so:
            String prevSegmentFileName = IndexFileNames.fileNameFromGeneration(IndexFileNames.SEGMENTS, "", gen-1);

            final boolean prevExists;
            prevExists = directory.fileExists(prevSegmentFileName);

            if (prevExists) {
              if (infoStream != null) {
                message("fallback to prior segment file '" + prevSegmentFileName + "'");
              }
              try {
                Object v = doBody(prevSegmentFileName); // load the previous commit
                if (infoStream != null) {
                  message("success on fallback " + prevSegmentFileName);
                }
                return v;
              } catch (IOException err2) {
                if (infoStream != null) {
                  message("secondary Exception on '" + prevSegmentFileName + "': " + err2 + "'; will retry");
                }
              }
            }
          }
        }
      }
    }
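
The segments.gen check inside run can be isolated as a small stand-alone sketch. It assumes the layout the code above reads: one int format marker followed by the generation written twice as two longs, all big-endian. The class name, the encode helper, and the use of ByteBuffer in place of Lucene's IndexInput are illustrative assumptions, not Lucene API.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-alone sketch of the segments.gen consistency check.
public class SegmentsGenCheck {
    static final int FORMAT_SEGMENTS_GEN_CURRENT = -2;

    /** Builds the bytes of a segments.gen-like file (illustrative helper). */
    static byte[] encode(int format, long gen0, long gen1) {
        return ByteBuffer.allocate(4 + 8 + 8).putInt(format).putLong(gen0).putLong(gen1).array();
    }

    /** Returns the stored generation if both copies agree, otherwise -1. */
    static long readGenB(byte[] content) {
        ByteBuffer in = ByteBuffer.wrap(content);
        int format = in.getInt();
        if (format != FORMAT_SEGMENTS_GEN_CURRENT) {
            // Lucene throws IndexFormatTooNewException here; a plain exception stands in for it.
            throw new IllegalStateException("unknown segments.gen format: " + format);
        }
        long gen0 = in.getLong();
        long gen1 = in.getLong();
        // A writer that was interrupted between the two longs leaves them different.
        return gen0 == gen1 ? gen0 : -1;
    }

    public static void main(String[] args) {
        long genA = 6; // pretend the directory listing is stale and only shows segments_6
        long genB = readGenB(encode(FORMAT_SEGMENTS_GEN_CURRENT, 7, 7));
        System.out.println(Math.max(genA, genB)); // prints 7
        System.out.println(readGenB(encode(FORMAT_SEGMENTS_GEN_CURRENT, 7, 8))); // prints -1
    }
}
```

Writing the value twice is a cheap way to detect a torn write without a checksum: a half-written file simply contributes nothing, and genA from the directory listing is used instead.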

The structure of the segments_N file (from 追风的蓝宝):

Header, Version, NameCounter, SegCount, <SegName, SegCodec, DelGen, DeletionCount, FieldInfosGen, UpdatesFiles>^SegCount, CommitUserData, Footer

Here <SegName, SegCodec, DelGen, DeletionCount, FieldInfosGen, UpdatesFiles> is the information for one segment and SegCount is the number of segments, so

<SegName, SegCodec, DelGen, DeletionCount, FieldInfosGen, UpdatesFiles>^SegCount means SegCount such entries written one after another.

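  Header:

      The header is a CodecHeader made up of three parts: Magic, CodecName and Version.

      Magic is a start marker; its value is 1071082519 (CodecUtil.CODEC_MAGIC).

      CodecName identifies the file type.

      Version is the format version of the index file. Reading an index written with one version using an IndexReader that expects another fails because this value does not match.

  Version:

      The version of the index, which records how many times IndexWriter has committed changes to the index.

      In most cases its initial value is read from the index files.

      We care less about the exact number of commits than about which state is the newest. An IndexReader typically compares its own version with the version in the index files to determine whether IndexWriter has updated the index since the reader was opened.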

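  • NameCounter
    • Used to derive the name of the next new segment.
    • All index files belonging to one segment use the segment name as the base of their file names, typically _0.xxx, _0.yyy, _1.xxx, _1.yyy ...
    • A newly created segment is usually named with the largest existing segment number plus one.
  • SegCount
    • The number of segments.
  • Metadata for each of the SegCount segments:
    • SegName: the segment name; every file belonging to the segment uses it as the base of its file name.
    • SegCodec: the name of the codec that encoded the segment.
    • DelGen: the generation of the .del file.
      • In Lucene, until an optimize is run, deleted documents are recorded in .del files.
      • DelGen is incremented, and a new .del file is written, each time IndexWriter commits deletions to the index.
      • A value of -1 means the segment has no deleted documents.
    • DeletionCount: the number of deleted documents in this segment.
    • FieldInfosGen: the generation of the segment's field infos file; -1 means the field infos have never been updated, a value greater than 0 means they have.
    • UpdatesFiles: the list of files that hold this segment's updates.
  • CommitUserData: the user-supplied map of String keys to String values stored with the commit (read with readStringStringMap in the code below).
  • Footer: the end of the codec-encoded file, containing the checksum and the checksum algorithm ID. The 4.7 reader shown below reads a single trailing checksum value.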
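
To make the Header part concrete, here is a minimal sketch that writes and checks a Magic, CodecName, Version triple. It assumes the string is stored as a variable-length int byte count followed by UTF-8 bytes, which is how Lucene's DataOutput.writeString encodes strings. The class name, the ByteBuffer plumbing, and the version numbers 0 and 1 are illustrative assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone sketch of a codec header; not Lucene's CodecUtil.
public class CodecHeaderSketch {
    static final int CODEC_MAGIC = 0x3fd76c17; // 1071082519 in decimal

    /** Writes an int using 7 bits per byte, low bits first, high bit set on all but the last byte. */
    static void writeVInt(ByteBuffer out, int value) {
        while ((value & ~0x7F) != 0) {
            out.put((byte) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.put((byte) value);
    }

    static int readVInt(ByteBuffer in) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = in.get();
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    static byte[] writeHeader(String codecName, int version) {
        byte[] name = codecName.getBytes(StandardCharsets.UTF_8);
        ByteBuffer out = ByteBuffer.allocate(4 + 5 + name.length + 4);
        out.putInt(CODEC_MAGIC);      // Magic
        writeVInt(out, name.length);  // CodecName length
        out.put(name);                // CodecName bytes
        out.putInt(version);          // Version
        out.flip();
        byte[] result = new byte[out.remaining()];
        out.get(result);
        return result;
    }

    /** Validates magic, codec name and version range; returns the version actually stored. */
    static int checkHeader(byte[] bytes, String expectedCodec, int minVersion, int maxVersion) {
        ByteBuffer in = ByteBuffer.wrap(bytes);
        int magic = in.getInt();
        if (magic != CODEC_MAGIC) {
            throw new IllegalStateException("codec header mismatch: magic=" + magic);
        }
        byte[] name = new byte[readVInt(in)];
        in.get(name);
        String codec = new String(name, StandardCharsets.UTF_8);
        if (!codec.equals(expectedCodec)) {
            throw new IllegalStateException("codec mismatch: " + codec);
        }
        int version = in.getInt();
        if (version < minVersion || version > maxVersion) {
            throw new IllegalStateException("unsupported version: " + version);
        }
        return version;
    }

    public static void main(String[] args) {
        byte[] header = writeHeader("segments", 1);
        System.out.println(checkHeader(header, "segments", 0, 1)); // prints 1
    }
}
```

Returning the stored version, instead of only validating it, is what lets a reader branch on the format, as the read method below does with actualFormat.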
  /**
   * Read a particular segmentFileName.  Note that this may
   * throw an IOException if a commit is in process.
   *
   * @param directory -- directory containing the segments file
   * @param segmentFileName -- segment file to load
   * @throws CorruptIndexException if the index is corrupt
   * @throws IOException if there is a low-level IO error
   */
  public final void read(Directory directory, String segmentFileName) throws IOException {
    boolean success = false;

    // Clear any previous segments:
    this.clear();

    generation = generationFromSegmentsFileName(segmentFileName);

    lastGeneration = generation; // remember N as the latest generation

    ChecksumIndexInput input = new ChecksumIndexInput(directory.openInput(segmentFileName, IOContext.READ));
    try {
      final int format = input.readInt(); // first int: the codec magic for 4.0+ files, otherwise a legacy format marker
      if (format == CodecUtil.CODEC_MAGIC) {
        // 4.0+
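        int actualFormat = CodecUtil.checkHeaderNoMagic(input, "segments", VERSION_40, VERSION_46); // validate the rest of the header and get the format version
        version = input.readLong(); // index version: number of changes IndexWriter has committed
        counter = input.readInt(); // counter used to name the next new segment (_N)
        int numSegments = input.readInt(); // number of segments in this commit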
        if (numSegments < 0) {
          throw new CorruptIndexException("invalid segment count: " + numSegments + " (resource: " + input + ")");
        }
        // read the per-segment entries
        for(int seg=0;seg<numSegments;seg++) {
          String segName = input.readString(); // segment name
          Codec codec = Codec.forName(input.readString());
          //System.out.println("SIS.read seg=" + seg + " codec=" + codec);
          SegmentInfo info = codec.segmentInfoFormat().getSegmentInfoReader().read(directory, segName, IOContext.READ); // read this segment's .si file
          info.setCodec(codec);
          long delGen = input.readLong();
          int delCount = input.readInt();
          if (delCount < 0 || delCount > info.getDocCount()) {
            throw new CorruptIndexException("invalid deletion count: " + delCount + " (resource: " + input + ")");
          }
          long fieldInfosGen = -1;
          if (actualFormat >= VERSION_46) {
            fieldInfosGen = input.readLong();
          }
          SegmentCommitInfo siPerCommit = new SegmentCommitInfo(info, delCount, delGen, fieldInfosGen);
          if (actualFormat >= VERSION_46) {
            int numGensUpdatesFiles = input.readInt(); // UpdatesFiles: the files holding this segment's updates, grouped by generation
            final Map<Long,Set<String>> genUpdatesFiles;
            if (numGensUpdatesFiles == 0) {
              genUpdatesFiles = Collections.emptyMap();
            } else { // there are update files: collect them for the SegmentCommitInfo
              genUpdatesFiles = new HashMap<Long,Set<String>>(numGensUpdatesFiles);
              for (int i = 0; i < numGensUpdatesFiles; i++) {
                genUpdatesFiles.put(input.readLong(), input.readStringSet());
              }
            }
            siPerCommit.setGenUpdatesFiles(genUpdatesFiles);
          }
          add(siPerCommit);
        }
        userData = input.readStringStringMap();
      } else {
        Lucene3xSegmentInfoReader.readLegacyInfos(this, directory, input, format);
        Codec codec = Codec.forName("Lucene3x");
        for (SegmentCommitInfo info : this) {
          info.info.setCodec(codec);
        }
      }

      final long checksumNow = input.getChecksum();
      final long checksumThen = input.readLong();
      if (checksumNow != checksumThen) {
        throw new CorruptIndexException("checksum mismatch in segments file (resource: " + input + ")");
      }

      success = true;
    } finally {
      if (!success) {
        // Clear any segment infos we had loaded so we
        // have a clean slate on retry:
        this.clear();
        IOUtils.closeWhileHandlingException(input);
      } else {
        input.close();
      }
    }
  }
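
The checksumNow versus checksumThen comparison at the end of read can be sketched on its own: the checksum of every byte read so far is compared with the long stored at the end of the file. The sketch assumes CRC32 as the digest, which is what ChecksumIndexInput is understood to use in this version. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical stand-alone sketch of a trailing-checksum check.
public class TrailingChecksum {

    /** Appends the CRC32 of the payload as a big-endian long. */
    static byte[] seal(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(payload.length + 8).put(payload).putLong(crc.getValue()).array();
    }

    /** Recomputes the CRC32 over everything except the last 8 bytes and compares. */
    static boolean verify(byte[] file) {
        int payloadLength = file.length - 8;
        CRC32 crc = new CRC32();
        crc.update(file, 0, payloadLength);                 // checksumNow
        long checksumThen = ByteBuffer.wrap(file, payloadLength, 8).getLong();
        return crc.getValue() == checksumThen;
    }

    public static void main(String[] args) {
        byte[] file = seal("segments payload".getBytes(StandardCharsets.UTF_8));
        System.out.println(verify(file)); // prints true
        file[0] ^= 1;                     // simulate a corrupted or half-written file
        System.out.println(verify(file)); // prints false
    }
}
```

A mismatch is what turns a partially written segments_N into an exception, which in turn drives the retry loop in run.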

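Loading the information for each segment listed in segments_N: the .si file

  (from 追风的蓝宝) The .si file stores a segment's metadata and mainly involves SegmentInfoFormat.java and SegmentInfo.java. Because the quoted article covers Solr 4.8.0, the relevant class is the SegmentInfoFormat subclass Lucene46SegmentInfoFormat.

     First, the layout of the .si file:

Header, SegVersion, SegSize, IsCompoundFile, Diagnostics, Files, Footer

  • Header: same structure as the segments_N header, made up of Magic, CodecName and Version.
  • SegVersion: the version of the code that created the segment.
  • SegSize: the number of documents indexed in the segment.
  • IsCompoundFile: whether the segment is stored in compound file format; a value of 1 means it is.
  • Diagnostics: information that is useful for debugging, such as the Lucene version, the OS, the Java version, and how the segment was created (merge, add, addIndexes, and so on).
  • Files: the files that belong to this segment.
  • Footer: the end of the codec-encoded file, containing the checksum and the checksum algorithm ID. The 4.7 reader shown below does not read a footer; it only verifies that the whole file has been consumed.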
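
The field order of the .si body can be modelled with a small round trip. This is a deliberately simplified sketch: it leaves out the header, stores strings with a fixed 4-byte length instead of Lucene's variable-length encoding, and uses a hypothetical class name. The YES and NO byte values are assumed to mirror SegmentInfo.YES and SegmentInfo.NO.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical simplified model of the .si body; not the real on-disk encoding.
public class SiBodySketch {
    static final byte YES = 1;
    static final byte NO = -1;

    static void putString(ByteBuffer out, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.putInt(bytes.length).put(bytes);
    }

    static String getString(ByteBuffer in) {
        byte[] bytes = new byte[in.getInt()];
        in.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Writes SegVersion, SegSize, IsCompoundFile, Diagnostics, Files in that order. */
    static ByteBuffer write(String version, int docCount, boolean compound,
                            Map<String, String> diagnostics, Set<String> files) {
        ByteBuffer out = ByteBuffer.allocate(1024);
        putString(out, version);
        out.putInt(docCount);
        out.put(compound ? YES : NO);
        out.putInt(diagnostics.size());
        for (Map.Entry<String, String> e : diagnostics.entrySet()) {
            putString(out, e.getKey());
            putString(out, e.getValue());
        }
        out.putInt(files.size());
        for (String file : files) {
            putString(out, file);
        }
        out.flip();
        return out;
    }

    /** Reads the fields back in the same order and checks that nothing is left over. */
    static String describe(ByteBuffer in) {
        String version = getString(in);
        int docCount = in.getInt();
        if (docCount < 0) {
            throw new IllegalStateException("invalid docCount: " + docCount);
        }
        boolean compound = in.get() == YES;
        Map<String, String> diagnostics = new LinkedHashMap<>();
        for (int i = in.getInt(); i > 0; i--) {
            String key = getString(in);
            String value = getString(in);
            diagnostics.put(key, value);
        }
        Set<String> files = new LinkedHashSet<>();
        for (int i = in.getInt(); i > 0; i--) {
            files.add(getString(in));
        }
        if (in.hasRemaining()) {
            throw new IllegalStateException("did not read all bytes");
        }
        return version + " docs=" + docCount + " compound=" + compound + " " + diagnostics + " " + files;
    }

    public static void main(String[] args) {
        Map<String, String> diagnostics = new LinkedHashMap<>();
        diagnostics.put("source", "flush");
        Set<String> files = new LinkedHashSet<>(Arrays.asList("_a.cfs", "_a.cfe", "_a.si"));
        System.out.println(describe(write("4.7", 45, true, diagnostics, files)));
    }
}
```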

read

public class Lucene46SegmentInfoReader extends SegmentInfoReader {

  /** Sole constructor. */
  public Lucene46SegmentInfoReader() {
  }

  @Override
  public SegmentInfo read(Directory dir, String segment, IOContext context) throws IOException {
    final String fileName = IndexFileNames.segmentFileName(segment, "", Lucene46SegmentInfoFormat.SI_EXTENSION); // name of the .si file
    final IndexInput input = dir.openInput(fileName, context);
    boolean success = false;
    try {
      CodecUtil.checkHeader(input, Lucene46SegmentInfoFormat.CODEC_NAME,
                                   Lucene46SegmentInfoFormat.VERSION_START,
                                   Lucene46SegmentInfoFormat.VERSION_CURRENT); // validate the header
      final String version = input.readString();
      final int docCount = input.readInt(); // number of documents
      if (docCount < 0) {
        throw new CorruptIndexException("invalid docCount: " + docCount + " (resource=" + input + ")");
      }
      final boolean isCompoundFile = input.readByte() == SegmentInfo.YES; // whether the segment uses the compound file format
      final Map<String,String> diagnostics = input.readStringStringMap(); // debugging information
      final Set<String> files = input.readStringSet(); // files that belong to this segment
      
      if (input.getFilePointer() != input.length()) {
        throw new CorruptIndexException("did not read all bytes from file \"" + fileName + "\": read " + input.getFilePointer() + " vs size " + input.length() + " (resource: " + input + ")");
      }

      final SegmentInfo si = new SegmentInfo(dir, version, segment, docCount, isCompoundFile, null, diagnostics); // collect everything into a SegmentInfo
      si.setFiles(files);

      success = true;

      return si;

    } finally {
      if (!success) {
        IOUtils.closeWhileHandlingException(input);
      } else {
        input.close();
      }
    }
  }
}

 

write

/**
 * Lucene 4.0 implementation of {@link SegmentInfoWriter}.
 * 
 * @see Lucene46SegmentInfoFormat
 * @lucene.experimental
 */
public class Lucene46SegmentInfoWriter extends SegmentInfoWriter {

  /** Sole constructor. */
  public Lucene46SegmentInfoWriter() {
  }

  /** Save a single segment's info. */
  @Override
  public void write(Directory dir, SegmentInfo si, FieldInfos fis, IOContext ioContext) throws IOException {
    final String fileName = IndexFileNames.segmentFileName(si.name, "", Lucene46SegmentInfoFormat.SI_EXTENSION);
    si.addFile(fileName);

    final IndexOutput output = dir.createOutput(fileName, ioContext);

    boolean success = false;
    try {
      CodecUtil.writeHeader(output, Lucene46SegmentInfoFormat.CODEC_NAME, Lucene46SegmentInfoFormat.VERSION_CURRENT); // write the header
      // Write the Lucene version that created this segment, since 3.1
      output.writeString(si.getVersion()); // version
      output.writeInt(si.getDocCount()); // number of documents

      output.writeByte((byte) (si.getUseCompoundFile() ? SegmentInfo.YES : SegmentInfo.NO)); // compound file flag
      output.writeStringStringMap(si.getDiagnostics()); // debugging information
      output.writeStringSet(si.files()); // files that belong to the segment

      success = true;
    } finally {
      if (!success) {
        IOUtils.closeWhileHandlingException(output);
        si.dir.deleteFile(fileName);
      } else {
        output.close();
      }
    }
  }
}
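
Both read and write above rely on the same idiom: a success flag set as the last statement of the try block, and a finally block that cleans up only when the flag is still false. A minimal stand-alone sketch using java.nio.file in place of Lucene's Directory (the class name, the helper names and the simulated failure are all illustrative):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-alone sketch of the success-flag cleanup idiom.
public class WriteOrCleanUp {

    /** Writes the payload; on any failure the partial file is closed quietly and deleted. */
    static void write(Path file, byte[] payload, boolean failMidway) throws IOException {
        OutputStream out = Files.newOutputStream(file);
        boolean success = false;
        try {
            out.write(payload);
            if (failMidway) {
                throw new IOException("simulated failure");
            }
            success = true; // only reached when every write succeeded
        } finally {
            if (!success) {
                try {
                    out.close(); // close quietly so the original exception is the one that propagates
                } catch (IOException ignored) {
                    // intentionally ignored, like IOUtils.closeWhileHandlingException
                }
                Files.deleteIfExists(file); // never leave a half-written file behind
            } else {
                out.close(); // on success, a failing close must be reported
            }
        }
    }

    /** Runs write() in a fresh temp directory and reports whether the file is still there. */
    static boolean fileSurvives(boolean failMidway) {
        try {
            Path file = Files.createTempDirectory("si-sketch").resolve("_0.si");
            try {
                write(file, new byte[] {1, 2, 3}, failMidway);
            } catch (IOException expected) {
                // the simulated failure propagates to the caller, as in the Lucene writer
            }
            return Files.exists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fileSurvives(false)); // prints true
        System.out.println(fileSurvives(true));  // prints false
    }
}
```

The asymmetry between the two close calls is the point of the idiom: a close failure after a successful write is a real error, while a close failure during cleanup must not mask the exception that caused the cleanup.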

The SegmentReader constructor called from the loop in open:


/**
   * Constructs a new SegmentReader with a new core.
   * @throws CorruptIndexException if the index is corrupt
   * @throws IOException if there is a low-level IO error
   */
  // TODO: why is this public?
  public SegmentReader(SegmentCommitInfo si, int termInfosIndexDivisor, IOContext context) throws IOException {
    this.si = si;
    // TODO if the segment uses CFS, we may open the CFS file twice: once for
    // reading the FieldInfos (if they are not gen'd) and second time by
    // SegmentCoreReaders. We can open the CFS here and pass to SCR, but then it
    // results in less readable code (resource not closed where it was opened).
    // Best if we could somehow read FieldInfos in SCR but not keep it there, but
    // constructors don't allow returning two things...
    fieldInfos = readFieldInfos(si);
    core = new SegmentCoreReaders(this, si.info.dir, si, context, termInfosIndexDivisor);
    segDocValues = new SegmentDocValues();
    
    boolean success = false;
    final Codec codec = si.info.getCodec();
    try {
      if (si.hasDeletions()) {
        // NOTE: the bitvector is stored using the regular directory, not cfs
        liveDocs = codec.liveDocsFormat().readLiveDocs(directory(), si, IOContext.READONCE);
      } else {
        assert si.getDelCount() == 0;
        liveDocs = null;
      }
      numDocs = si.info.getDocCount() - si.getDelCount();
      
      if (fieldInfos.hasDocValues()) {
        initDocValuesProducers(codec);
      }

      success = true;
    } finally {
      // With lock-less commits, it's entirely possible (and
      // fine) to hit a FileNotFound exception above.  In
      // this case, we want to explicitly close any subset
      // of things that were opened so that we don't have to
      // wait for a GC to do so.
      if (!success) {
        doClose();
      }
    }
  }

 

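(from 追风的蓝宝) SegmentInfo has a toString() method; when Solr's log level is set to debug it prints the .si information. For example, from the output "_a(3.1):c45/4" we can read off the following:

1. _a is the segment name.

2. (3.1) is the Lucene version; a ? means the version is unknown.

3. c means compound file format; C means non-compound format.

4. 45 means the segment contains 45 documents.

5. 4 is the number of deleted documents.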
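
As a reading aid, the five parts can be pulled out of such a string with a regular expression. This sketch handles only the shape shown in the example above, name(version):c-or-C, document count, then an optional /deleted-count; real log output may carry further suffixes that it does not attempt to parse. The class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading strings such as "_a(3.1):c45/4".
public class SegmentStringSketch {
    private static final Pattern SHAPE =
        Pattern.compile("(_\\w+)\\(([^)]*)\\):([cC])(\\d+)(?:/(\\d+))?");

    static String describe(String text) {
        Matcher m = SHAPE.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("unrecognized segment string: " + text);
        }
        String name = m.group(1);                      // segment name
        String version = m.group(2);                   // Lucene version, or ? if unknown
        boolean compound = m.group(3).equals("c");     // lower-case c means compound file
        int docs = Integer.parseInt(m.group(4));       // documents in the segment
        int deleted = m.group(5) == null ? 0 : Integer.parseInt(m.group(5));
        return "name=" + name + " version=" + version + " compound=" + compound
            + " docs=" + docs + " deleted=" + deleted;
    }

    public static void main(String[] args) {
        System.out.println(describe("_a(3.1):c45/4"));
        // prints name=_a version=3.1 compound=true docs=45 deleted=4
    }
}
```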

posted @ 2015-04-14 16:18  mini强