Lucene 4.7 index source code study: the segment metadata file

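Starting from the segment metadata file, segments_N

The metadata file records information about the segments in an index. In an index directory you will find files whose names start with segments_; these are the metadata files.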

Lucene uses the number after segments_ (the generation) to decide which file to read.
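
As a rough illustration of choosing the file by its number, the sketch below scans a list of file names and returns the largest generation. It is a simplified stand-in for the getLastCommitGeneration call that appears in the run method, not the Lucene code itself; the class and method names are made up for this example. It assumes the suffix after segments_ is written in base 36 (Character.MAX_RADIX).

```java
import java.util.Arrays;
import java.util.List;

final class SegmentsGenerationSketch {
  private static final String PREFIX = "segments_";

  /** Returns the generation encoded in a segments_N name, or -1 if the name is not one. */
  static long generationOf(String fileName) {
    if (!fileName.startsWith(PREFIX)) {
      return -1; // also skips segments.gen, which has no underscore
    }
    // Assumption: N is written in base 36, so segments_a is generation 10.
    return Long.parseLong(fileName.substring(PREFIX.length()), Character.MAX_RADIX);
  }

  /** Largest generation among the given file names, or -1 if none is a segments_N file. */
  static long lastCommitGeneration(List<String> files) {
    long max = -1;
    for (String f : files) {
      max = Math.max(max, generationOf(f));
    }
    return max;
  }

  public static void main(String[] args) {
    List<String> files = Arrays.asList("_0.si", "segments.gen", "segments_9", "segments_a");
    System.out.println(lastCommitGeneration(files));
  }
}
```

With base 36 names, comparing the file names as plain strings would not give the right order, which is why the sketch parses the number before comparing.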

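When an index is loaded with DirectoryReader.open, the code is as follows.

Lucene provides different loading implementations for different physical index storage; StandardDirectoryReader.java is used as the example here: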

/** called from DirectoryReader.open(...) methods */
  static DirectoryReader open(final Directory directory, final IndexCommit commit,
                          final int termInfosIndexDivisor) throws IOException {
    return (DirectoryReader) new SegmentInfos.FindSegmentsFile(directory) {// build a FindSegmentsFile bound to directory; run() calls back into doBody
      @Override
      protected Object doBody(String segmentFileName) throws IOException {
        SegmentInfos sis = new SegmentInfos();
        sis.read(directory, segmentFileName);// read the chosen segments_N and load the info for every segment
        final SegmentReader[] readers = new SegmentReader[sis.size()];// sis.size() is the number of segments
        for (int i = sis.size()-1; i >= 0; i--) {// build one SegmentReader per segment
          IOException prior = null;
          boolean success = false;
          try {
            readers[i] = new SegmentReader(sis.info(i), termInfosIndexDivisor, IOContext.READ);// wraps the segment info and is the entry point for reading it
            success = true;
          } catch(IOException ex) {
            prior = ex;
          } finally {
            if (!success) {
              IOUtils.closeWhileHandlingException(prior, readers);
            }
          }
        }
        return new StandardDirectoryReader(directory, readers, null, sis, termInfosIndexDivisor, false);
      }
    }.run(commit);// run() executes first and calls doBody once it has picked a segments_N
  }

How run executes: Lucene looks for the segments_N that satisfies its rules

  public Object run(IndexCommit commit) throws IOException {
      if (commit != null) {
        if (directory != commit.getDirectory())
          throw new IOException("the specified commit does not match the specified Directory");
        return doBody(commit.getSegmentsFileName());
      }

      String segmentFileName = null;
      long lastGen = -1;
      long gen = 0;
      int genLookaheadCount = 0;
      IOException exc = null;
      int retryCount = 0;

      boolean useFirstMethod = true;

      // Loop until we succeed in calling doBody() without
      // hitting an IOException.  An IOException most likely
      // means a commit was in process and has finished, in
      // the time it took us to load the now-old infos files
      // (and segments files).  It's also possible it's a
      // true error (corrupt index).  To distinguish these,
      // on each retry we must see "forward progress" on
      // which generation we are trying to load.  If we
      // don't, then the original error is real and we throw
      // it.
      // (Annotation) The loop repeats until doBody() returns without an
      // IOException. A failure usually means a commit finished while the
      // now-stale files were being read; if a retry makes no progress on
      // the generation, the first exception is treated as real and thrown.

      // We have three methods for determining the current
      // generation.  We try the first two in parallel (when
      // useFirstMethod is true), and fall back to the third
      // when necessary.
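      // (Annotation) When useFirstMethod is true the first two methods
      // (directory listing and segments.gen) are tried together; the third
      // is used only when they keep failing.
      while(true) {// loops until a return or a throw

        if (useFirstMethod) {// first two methods; useFirstMethod starts as true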

          // List the directory and use the highest
          // segments_N file.  This method works well as long
          // as there is no stale caching on the directory
          // contents (NOTE: NFS clients often have such stale
          // caching):
          // (Annotation) List the directory and use the segments_N with the largest N; reliable as long as the listing is not served from a stale cache

          String[] files = null;

          long genA = -1;

          files = directory.listAll();
          
          if (files != null) {
            genA = getLastCommitGeneration(files);// the largest N in the listing becomes genA
          }
          
          if (infoStream != null) {
            message("directory listing genA=" + genA);
          }

          // Also open segments.gen and read its
          // contents.  Then we take the larger of the two
          // gens.  This way, if either approach is hitting
          // a stale cache (NFS) we have a better chance of
          // getting the right generation.
          // (Annotation) Read the generation stored in segments.gen, then take the larger of the two

          long genB = -1;
          IndexInput genInput = null;
          try {
            genInput = directory.openInput(IndexFileNames.SEGMENTS_GEN, IOContext.READONCE);// open segments.gen
          } catch (IOException e) {
            if (infoStream != null) {
              message("segments.gen open: IOException " + e);
            }
          }
  
          if (genInput != null) {
            try {
              int version = genInput.readInt();// format version of segments.gen; the current format is -2
              if (version == FORMAT_SEGMENTS_GEN_CURRENT) {// format matches the current one
                long gen0 = genInput.readLong();// the generation is stored twice
                long gen1 = genInput.readLong();
                if (infoStream != null) {
                  message("fallback check: " + gen0 + "; " + gen1);
                }
                if (gen0 == gen1) {// both copies agree, so assign the value to genB
                  // The file is consistent.
                  genB = gen0;
                }
              } else {// unrecognized format: segments.gen was written in a newer format
                throw new IndexFormatTooNewException(genInput, version, FORMAT_SEGMENTS_GEN_CURRENT, FORMAT_SEGMENTS_GEN_CURRENT);
              }
            } catch (IOException err2) {
              // rethrow any format exception
              if (err2 instanceof CorruptIndexException) throw err2;
            } finally {
              genInput.close();
            }
          }

          if (infoStream != null) {
            message(IndexFileNames.SEGMENTS_GEN + " check: genB=" + genB);
          }

          // Pick the larger of the two gen's:
          gen = Math.max(genA, genB);// take the larger of the two

          if (gen == -1) {// -1: no segments metadata file was found
            // Neither approach found a generation
            throw new IndexNotFoundException("no segments* file found in " + directory + ": files: " + Arrays.toString(files));
          }
        }

        if (useFirstMethod && lastGen == gen && retryCount >= 2) {// give up on the first method once retryCount reaches 2 with no progress
          // Give up on first method -- this is 3rd cycle on
          // listing directory and checking gen file to
          // attempt to locate the segments file.
          useFirstMethod = false;
        }

        // Second method:  both directory cache and
        // file contents cache seem to be stale, just
        // advance the generation.
        if (!useFirstMethod) {
          if (genLookaheadCount < defaultGenLookaheadCount) {// defaultGenLookaheadCount = 10; advance gen by one
            gen++;
            genLookaheadCount++;
            if (infoStream != null) {
              message("look ahead increment gen to " + gen);
            }
          } else {
            // All attempts have failed -- throw first exc:
            throw exc;
          }
        } else if (lastGen == gen) {
          // This means we're about to try the same
          // segments_N last tried.
          retryCount++;
        } else {
          // Segment file has advanced since our last loop
          // (we made "progress"), so reset retryCount:
          retryCount = 0;
        }

        lastGen = gen;
        // build the segments_N file name for the chosen gen and try to read it
        segmentFileName = IndexFileNames.fileNameFromGeneration(IndexFileNames.SEGMENTS,
                                                                "",
                                                                gen);

        try {
          Object v = doBody(segmentFileName);
          if (infoStream != null) {
            message("success on " + segmentFileName);
          }
          return v;
        } catch (IOException err) {// the read failed

          // Save the original root cause:
          if (exc == null) {
            exc = err;
          }

          if (infoStream != null) {
            message("primary Exception on '" + segmentFileName + "': " + err + "'; will retry: retryCount=" + retryCount + "; gen = " + gen);
          }

          if (gen > 1 && useFirstMethod && retryCount == 1) {// fallback: on the second failure of the same gen, try segments_(gen-1)

            // This is our second time trying this same segments
            // file (because retryCount is 1), and, there is
            // possibly a segments_(N-1) (because gen > 1).
            // So, check if the segments_(N-1) exists and
            // try it if so:
            String prevSegmentFileName = IndexFileNames.fileNameFromGeneration(IndexFileNames.SEGMENTS,
                                                                               "",
                                                                               gen-1);

            final boolean prevExists;
            prevExists = directory.fileExists(prevSegmentFileName);

            if (prevExists) {
              if (infoStream != null) {
                message("fallback to prior segment file '" + prevSegmentFileName + "'");
              }
              try {
                Object v = doBody(prevSegmentFileName);// read segments_(gen-1)
                if (infoStream != null) {
                  message("success on fallback " + prevSegmentFileName);
                }
                return v;
              } catch (IOException err2) {
                if (infoStream != null) {
                  message("secondary Exception on '" + prevSegmentFileName + "': " + err2 + "'; will retry");
                }
              }
            }
          }
        }
      }
    }
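
The segments.gen part of run can be isolated into a small sketch. The file holds a format int followed by the generation written twice; the value is trusted only when both copies agree, and the result is then combined with the generation found by listing the directory. The code below is an illustration under those assumptions, using ByteBuffer in place of Lucene's IndexInput; the class name, method names and the unchecked exception are hypothetical.

```java
import java.nio.ByteBuffer;

final class SegmentsGenSketch {
  // The post's annotation gives -2 as the current segments.gen format.
  static final int FORMAT_CURRENT = -2;

  /** Builds the bytes of a segments.gen-like file: format, then the generation twice. */
  static byte[] write(int format, long gen0, long gen1) {
    return ByteBuffer.allocate(20).putInt(format).putLong(gen0).putLong(gen1).array();
  }

  /** Returns genB, or -1 when the two stored copies disagree (file caught mid-write). */
  static long readGen(byte[] data) {
    ByteBuffer in = ByteBuffer.wrap(data);
    int format = in.getInt();
    if (format != FORMAT_CURRENT) {
      // Stand-in for IndexFormatTooNewException in the real code.
      throw new IllegalStateException("unknown segments.gen format: " + format);
    }
    long gen0 = in.getLong();
    long gen1 = in.getLong();
    return gen0 == gen1 ? gen0 : -1;
  }

  /** genA comes from the directory listing, genB from segments.gen; the larger wins. */
  static long pickGeneration(long genA, long genB) {
    return Math.max(genA, genB);
  }

  public static void main(String[] args) {
    long genB = readGen(write(FORMAT_CURRENT, 7, 7));
    System.out.println(pickGeneration(5, genB));
  }
}
```

Taking the larger of the two values means that a stale directory listing or a stale segments.gen alone does not send the reader back to an old commit.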

Structure of the segments_N file (from 追风的蓝宝):

Header, Version, NameCounter, SegCount, <SegName, SegCodec, DelGen, DeletionCount, FieldInfosGen, UpdatesFiles>^SegCount, CommitUserData, Footer

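Here <SegName, SegCodec, DelGen, DeletionCount, FieldInfosGen, UpdatesFiles> is the information for one segment and SegCount is the number of segments, so

<SegName, SegCodec, DelGen, DeletionCount, FieldInfosGen, UpdatesFiles>^SegCount means SegCount such records written one after another.

  Header:

    Header is a CodecHeader made up of three parts: Magic, CodecName and Version.

    Magic is the start marker; its value is 1071082519 (0x3fd76c17).

    CodecName identifies the file type.

    Version is the format version of the index file. When an IndexReader of one version reads an index written with an incompatible version, the mismatch in this value causes an error.

  Version:

    The version of the index, which records how many times IndexWriter has committed changes to the index.

    Its initial value is in most cases read from the index files.

    We do not care about the exact number of commits; what matters is which version is the newest. IndexReader often compares its own version with the version in the index files to tell whether IndexWriter has updated the index since the reader was opened.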
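
The CodecHeader check described above (Magic, then CodecName, then Version) can be sketched as follows. This is an illustration, not CodecUtil itself: the class and method names are hypothetical, the codec name is stored with a single length byte (an assumption that only holds for short names), and plain unchecked exceptions stand in for Lucene's own exception types.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class CodecHeaderSketch {
  static final int CODEC_MAGIC = 0x3fd76c17; // 1071082519, the Magic value quoted above

  /** Writes Magic, a length-prefixed codec name, and the version. */
  static byte[] writeHeader(String codecName, int version) {
    byte[] name = codecName.getBytes(StandardCharsets.UTF_8);
    if (name.length > 127) {
      throw new IllegalArgumentException("sketch supports short names only");
    }
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(ByteBuffer.allocate(4).putInt(CODEC_MAGIC).array(), 0, 4);
    out.write(name.length); // assumption: one length byte is enough below 128
    out.write(name, 0, name.length);
    out.write(ByteBuffer.allocate(4).putInt(version).array(), 0, 4);
    return out.toByteArray();
  }

  /** Validates Magic, CodecName and the version range; returns the version found. */
  static int checkHeader(byte[] data, String expectedCodec, int minVersion, int maxVersion) {
    ByteBuffer in = ByteBuffer.wrap(data);
    if (in.getInt() != CODEC_MAGIC) {
      throw new IllegalStateException("codec header mismatch");
    }
    byte[] name = new byte[in.get()];
    in.get(name);
    String codec = new String(name, StandardCharsets.UTF_8);
    if (!codec.equals(expectedCodec)) {
      throw new IllegalStateException("codec mismatch: " + codec);
    }
    int version = in.getInt();
    if (version < minVersion || version > maxVersion) {
      throw new IllegalStateException("unsupported format version: " + version);
    }
    return version;
  }

  public static void main(String[] args) {
    byte[] header = writeHeader("segments", 2);
    System.out.println(checkHeader(header, "segments", 0, 2));
  }
}
```

The version range check is what makes a reader reject an index written in a format it does not know, as described for the header's Version field.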

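NameCounter
- Used to generate the name of the next new segment.
- All index files that belong to one segment use the segment name as their file name prefix, typically _0.xxx, _0.yyy, _1.xxx, _1.yyy ...
- The name of a newly created segment is normally the current largest segment name plus one.
SegCount
- The number of segments.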
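
To make the role of NameCounter concrete, here is a small sketch of how a counter can be turned into segment names and per-segment file names. The class is hypothetical; the base 36 encoding (Character.MAX_RADIX) is my understanding of how Lucene writes the counter and should be treated as an assumption here.

```java
final class SegmentNameSketch {
  private int counter; // plays the role of NameCounter

  SegmentNameSketch(int counter) {
    this.counter = counter;
  }

  /** Returns the next segment name and advances the counter. */
  String newSegmentName() {
    // Assumption: the counter is rendered in base 36, so 10 becomes "a".
    return "_" + Integer.toString(counter++, Character.MAX_RADIX);
  }

  /** File name for one of the segment's files, for example _0.si. */
  static String segmentFileName(String segmentName, String extension) {
    return segmentName + "." + extension;
  }

  public static void main(String[] args) {
    SegmentNameSketch names = new SegmentNameSketch(9);
    System.out.println(names.newSegmentName());
    System.out.println(names.newSegmentName());
    System.out.println(segmentFileName("_0", "si"));
  }
}
```

Because the counter is stored in segments_N, a writer that reopens the index continues numbering where the previous one stopped instead of reusing a name.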

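Metadata for each of the SegCount segments:
- SegName: the segment name; every file that belongs to the segment uses it as the file name prefix.
- SegCodec: the name of the codec used to encode the segment.
- DelGen: the generation of the .del file.
  - In Lucene, deleted documents are kept in a .del file until they are merged away (optimize).
  - DelGen is incremented each time IndexWriter commits deletions to the index, and a new .del file is written.
  - A value of -1 means the segment has no deleted documents.
- DeletionCount: the number of deleted documents in this segment.
- FieldInfosGen: the generation of the segment's field infos file; -1 means the field infos have never been updated, a value greater than 0 means they have.
- UpdatesFiles: the list of files that hold updates for this segment.
CommitUserData: a string-to-string map of user data stored with the commit (the read code below loads it with readStringStringMap).
Footer: the end of the codec-encoded file, containing the checksum and the checksum algorithm ID. The 4.7 read code below reads it as a single trailing checksum value.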
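
The per-segment fields above carry a few simple rules: DelGen of -1 means no deletions, FieldInfosGen of -1 means no field updates, and DeletionCount can never exceed the document count. The sketch below restates those rules in a hypothetical value class; it mirrors the checks visible in the read code but is not Lucene's SegmentCommitInfo.

```java
final class SegmentCommitSketch {
  final String name;
  final int docCount;
  final long delGen;        // -1 means no deletions have been committed
  final int delCount;
  final long fieldInfosGen; // -1 means the field infos were never updated

  SegmentCommitSketch(String name, int docCount, long delGen, int delCount, long fieldInfosGen) {
    // Same sanity check the read code applies to DeletionCount.
    if (delCount < 0 || delCount > docCount) {
      throw new IllegalArgumentException("invalid deletion count: " + delCount);
    }
    this.name = name;
    this.docCount = docCount;
    this.delGen = delGen;
    this.delCount = delCount;
    this.fieldInfosGen = fieldInfosGen;
  }

  boolean hasDeletions() {
    return delGen != -1;
  }

  boolean hasFieldUpdates() {
    return fieldInfosGen != -1;
  }

  /** Live documents: total documents minus deleted ones. */
  int numDocs() {
    return docCount - delCount;
  }

  public static void main(String[] args) {
    SegmentCommitSketch seg = new SegmentCommitSketch("_a", 45, 1, 4, -1);
    System.out.println(seg.name + " live docs: " + seg.numDocs());
  }
}
```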

  /**
   * Read a particular segmentFileName.  Note that this may
   * throw an IOException if a commit is in process.
   *
   * @param directory -- directory containing the segments file
   * @param segmentFileName -- segment file to load
   * @throws CorruptIndexException if the index is corrupt
   * @throws IOException if there is a low-level IO error
   */
  public final void read(Directory directory, String segmentFileName) throws IOException {
    boolean success = false;

    // Clear any previous segments:
    this.clear();

    generation = generationFromSegmentsFileName(segmentFileName);

    lastGeneration = generation;// remember N as the latest generation

    ChecksumIndexInput input = new ChecksumIndexInput(directory.openInput(segmentFileName, IOContext.READ));
    try {
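      final int format = input.readInt();// first int: CODEC_MAGIC for 4.0+ files, otherwise a pre-4.0 format number
      if (format == CodecUtil.CODEC_MAGIC) {
        // 4.0+
        int actualFormat = CodecUtil.checkHeaderNoMagic(input, "segments", VERSION_40, VERSION_46);// read the rest of the header
        version = input.readLong();// Version: counts the changes committed by IndexWriter
        counter = input.readInt();// NameCounter: used to name the next new segment (_N)
        int numSegments = input.readInt();// SegCount: number of segments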
        if (numSegments < 0) {
          throw new CorruptIndexException("invalid segment count: " + numSegments + " (resource: " + input + ")");
        }
        // read the information for each segment
        for(int seg=0;seg<numSegments;seg++) {
          String segName = input.readString();// SegName
          Codec codec = Codec.forName(input.readString());// SegCodec
          //System.out.println("SIS.read seg=" + seg + " codec=" + codec);
          SegmentInfo info = codec.segmentInfoFormat().getSegmentInfoReader().read(directory, segName, IOContext.READ);// load the segment's .si file; DelGen, DeletionCount, FieldInfosGen and UpdatesFiles are read from segments_N below
          info.setCodec(codec);
          long delGen = input.readLong();
          int delCount = input.readInt();
          if (delCount < 0 || delCount > info.getDocCount()) {
            throw new CorruptIndexException("invalid deletion count: " + delCount + " (resource: " + input + ")");
          }
          long fieldInfosGen = -1;
          if (actualFormat >= VERSION_46) {
            fieldInfosGen = input.readLong();
          }
          SegmentCommitInfo siPerCommit = new SegmentCommitInfo(info, delCount, delGen, fieldInfosGen);
          if (actualFormat >= VERSION_46) {
            int numGensUpdatesFiles = input.readInt();// UpdatesFiles: the update files of this segment, keyed by generation
            final Map<Long,Set<String>> genUpdatesFiles;
            if (numGensUpdatesFiles == 0) {
              genUpdatesFiles = Collections.emptyMap();
            } else {// there are entries: collect them for the SegmentCommitInfo
              genUpdatesFiles = new HashMap<Long,Set<String>>(numGensUpdatesFiles);
              for (int i = 0; i < numGensUpdatesFiles; i++) {
                genUpdatesFiles.put(input.readLong(), input.readStringSet());
              }
            }
            siPerCommit.setGenUpdatesFiles(genUpdatesFiles);
          }
          add(siPerCommit);
        }
        userData = input.readStringStringMap();
      } else {
        Lucene3xSegmentInfoReader.readLegacyInfos(this, directory, input, format);
        Codec codec = Codec.forName("Lucene3x");
        for (SegmentCommitInfo info : this) {
          info.info.setCodec(codec);
        }
      }

      final long checksumNow = input.getChecksum();
      final long checksumThen = input.readLong();
      if (checksumNow != checksumThen) {
        throw new CorruptIndexException("checksum mismatch in segments file (resource: " + input + ")");
      }

      success = true;
    } finally {
      if (!success) {
        // Clear any segment infos we had loaded so we
        // have a clean slate on retry:
        this.clear();
        IOUtils.closeWhileHandlingException(input);
      } else {
        input.close();
      }
    }
  }
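
The checksum comparison at the end of read (checksumNow against checksumThen) follows a common pattern: hash everything before the trailing 8 bytes and compare with the stored value. The sketch below shows that pattern on a byte array. The class is hypothetical, and the use of CRC32 is an assumption about what ChecksumIndexInput computes in this version.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class SegmentsChecksumSketch {
  /** Appends the CRC32 of the payload as a trailing 8-byte value. */
  static byte[] seal(byte[] payload) {
    CRC32 crc = new CRC32();
    crc.update(payload, 0, payload.length);
    return ByteBuffer.allocate(payload.length + 8).put(payload).putLong(crc.getValue()).array();
  }

  /** Recomputes the checksum over everything before the last 8 bytes and compares. */
  static boolean verify(byte[] file) {
    if (file.length < 8) {
      return false; // too short to contain a checksum
    }
    int payloadLength = file.length - 8;
    CRC32 crc = new CRC32();
    crc.update(file, 0, payloadLength);
    long checksumNow = crc.getValue();
    long checksumThen = ByteBuffer.wrap(file, payloadLength, 8).getLong();
    return checksumNow == checksumThen;
  }

  public static void main(String[] args) {
    byte[] sealed = seal("segments payload".getBytes(StandardCharsets.UTF_8));
    System.out.println(verify(sealed));
    sealed[0] ^= 1; // simulate a corrupted byte
    System.out.println(verify(sealed));
  }
}
```

A mismatch is what turns a half-written or damaged segments_N into a CorruptIndexException, which in turn feeds the retry logic in run.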

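Loading the per-segment information for each segment listed in segments_N: the .si file

(from 追风的蓝宝) The .si file stores a segment's metadata and mainly involves two source files, SegmentInfoFormat.java and SegmentInfo.java. Because that article covers Solr 4.8.0, the corresponding subclass of SegmentInfoFormat is Lucene46SegmentInfoFormat.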

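First, the layout of the .si file:

Header, SegVersion, SegSize, IsCompoundFile, Diagnostics, Files, Footer

Header: same structure as the header of segments_N, made up of Magic, CodecName and Version.
SegVersion: the version of the code that created the segment.
SegSize: the number of documents indexed in the segment.
IsCompoundFile: whether the segment is stored in compound file format; 1 means it is.
Diagnostics: information useful for debugging, such as the Lucene version, OS, Java version, and how the segment was created (merge, add, addIndexes).
Files: the files that make up the segment.
Footer: the end of the codec-encoded file, containing the checksum and the checksum algorithm ID. The Lucene46SegmentInfoReader shown below reads no footer; it expects the file to end right after Files.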

read

public class Lucene46SegmentInfoReader extends SegmentInfoReader {

  /** Sole constructor. */
  public Lucene46SegmentInfoReader() {
  }

  @Override
  public SegmentInfo read(Directory dir, String segment, IOContext context) throws IOException {
    final String fileName = IndexFileNames.segmentFileName(segment, "", Lucene46SegmentInfoFormat.SI_EXTENSION);// name of the .si file
    final IndexInput input = dir.openInput(fileName, context);
    boolean success = false;
    try {
      CodecUtil.checkHeader(input, Lucene46SegmentInfoFormat.CODEC_NAME,
                                   Lucene46SegmentInfoFormat.VERSION_START,
                                   Lucene46SegmentInfoFormat.VERSION_CURRENT);// check the header
      final String version = input.readString();
      final int docCount = input.readInt();// number of documents
      if (docCount < 0) {
        throw new CorruptIndexException("invalid docCount: " + docCount + " (resource=" + input + ")");
      }
      final boolean isCompoundFile = input.readByte() == SegmentInfo.YES;// whether the segment uses the compound file format
      final Map<String,String> diagnostics = input.readStringStringMap();// debugging information
      final Set<String> files = input.readStringSet();// files that belong to this segment
      
      if (input.getFilePointer() != input.length()) {
        throw new CorruptIndexException("did not read all bytes from file \"" + fileName + "\": read " + input.getFilePointer() + " vs size " + input.length() + " (resource: " + input + ")");
      }

      final SegmentInfo si = new SegmentInfo(dir, version, segment, docCount, isCompoundFile, null, diagnostics);// store the values in a SegmentInfo
      si.setFiles(files);

      success = true;

      return si;

    } finally {
      if (!success) {
        IOUtils.closeWhileHandlingException(input);
      } else {
        input.close();
      }
    }
  }
}

write

/**
 * Lucene 4.0 implementation of {@link SegmentInfoWriter}.
 * 
 * @see Lucene46SegmentInfoFormat
 * @lucene.experimental
 */
public class Lucene46SegmentInfoWriter extends SegmentInfoWriter {

  /** Sole constructor. */
  public Lucene46SegmentInfoWriter() {
  }

  /** Save a single segment's info. */
  @Override
  public void write(Directory dir, SegmentInfo si, FieldInfos fis, IOContext ioContext) throws IOException {
    final String fileName = IndexFileNames.segmentFileName(si.name, "", Lucene46SegmentInfoFormat.SI_EXTENSION);
    si.addFile(fileName);

    final IndexOutput output = dir.createOutput(fileName, ioContext);

    boolean success = false;
    try {
      CodecUtil.writeHeader(output, Lucene46SegmentInfoFormat.CODEC_NAME, Lucene46SegmentInfoFormat.VERSION_CURRENT);// write the header
      // Write the Lucene version that created this segment, since 3.1
      output.writeString(si.getVersion());// version
      output.writeInt(si.getDocCount());// number of documents

      output.writeByte((byte) (si.getUseCompoundFile() ? SegmentInfo.YES : SegmentInfo.NO));// compound file flag
      output.writeStringStringMap(si.getDiagnostics());// debugging information
      output.writeStringSet(si.files());// files that belong to the segment

      success = true;
    } finally {
      if (!success) {
        IOUtils.closeWhileHandlingException(output);
        si.dir.deleteFile(fileName);
      } else {
        output.close();
      }
    }
  }
}

/**
   * Constructs a new SegmentReader with a new core.
   * @throws CorruptIndexException if the index is corrupt
   * @throws IOException if there is a low-level IO error
   */
  // TODO: why is this public?
  public SegmentReader(SegmentCommitInfo si, int termInfosIndexDivisor, IOContext context) throws IOException {
    this.si = si;
    // TODO if the segment uses CFS, we may open the CFS file twice: once for
    // reading the FieldInfos (if they are not gen'd) and second time by
    // SegmentCoreReaders. We can open the CFS here and pass to SCR, but then it
    // results in less readable code (resource not closed where it was opened).
    // Best if we could somehow read FieldInfos in SCR but not keep it there, but
    // constructors don't allow returning two things...
    fieldInfos = readFieldInfos(si);
    core = new SegmentCoreReaders(this, si.info.dir, si, context, termInfosIndexDivisor);
    segDocValues = new SegmentDocValues();
    
    boolean success = false;
    final Codec codec = si.info.getCodec();
    try {
      if (si.hasDeletions()) {
        // NOTE: the bitvector is stored using the regular directory, not cfs
        liveDocs = codec.liveDocsFormat().readLiveDocs(directory(), si, IOContext.READONCE);
      } else {
        assert si.getDelCount() == 0;
        liveDocs = null;
      }
      numDocs = si.info.getDocCount() - si.getDelCount();
      
      if (fieldInfos.hasDocValues()) {
        initDocValuesProducers(codec);
      }

      success = true;
    } finally {
      // With lock-less commits, it's entirely possible (and
      // fine) to hit a FileNotFound exception above.  In
      // this case, we want to explicitly close any subset
      // of things that were opened so that we don't have to
      // wait for a GC to do so.
      if (!success) {
        doClose();
      }
    }
  }

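(from 追风的蓝宝) SegmentInfo has a toString() method; when the Solr log level is set to debug it prints the .si information. For example, from the output "_a(3.1):c45/4" you can read the following:

1. _a is the segment name

2. (3.1) is the Lucene version; a ? means the version is unknown

3. c means compound file format; C means non-compound

4. 45 means the segment holds 45 documents

5. 4 is the number of deleted documents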
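
The five parts listed above can be put back together with a few lines of code. The formatter below is a hypothetical helper that reproduces the "_a(3.1):c45/4" shape from those rules; leaving out the "/N" part when nothing is deleted is an assumption of this sketch, not something the text above states.

```java
final class SegmentToStringSketch {
  /** Builds a description such as _a(3.1):c45/4 from its parts. */
  static String describe(String name, String version, boolean compound, int docCount, int delCount) {
    StringBuilder s = new StringBuilder();
    s.append(name);
    s.append('(').append(version == null ? "?" : version).append("):");
    s.append(compound ? 'c' : 'C'); // c = compound file format, C = non-compound
    s.append(docCount);
    if (delCount != 0) {
      s.append('/').append(delCount); // assumption: omitted when there are no deletions
    }
    return s.toString();
  }

  public static void main(String[] args) {
    System.out.println(describe("_a", "3.1", true, 45, 4));
  }
}
```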

posted @ 2015-04-14 16:18 mini强
