Using Stanford TokensRegex

Reference: https://nlp.stanford.edu/software/tokensregex.html

To match a regular expression over tokens

Programmatically, TokensRegex follows a paradigm similar to that of the Java regular expression library. A regular expression is defined as a string, which is then compiled into a TokenSequencePattern. For a given sequence of tokens, a TokenSequenceMatcher is created that matches the pattern against that sequence. This allows the TokenSequencePattern to be compiled just once and reused across many token sequences. The regular expression language over tokens is described in the Pattern Language section of the reference page linked above.
Example Usage:

   List<CoreLabel> tokens = ...;
   TokenSequencePattern pattern = TokenSequencePattern.compile(...);
   TokenSequenceMatcher matcher = pattern.getMatcher(tokens);
   
   while (matcher.find()) {
     String matchedString = matcher.group();
     List<CoreMap> matchedTokens = matcher.groupNodes();
     ...
   }
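
For comparison, here is a minimal sketch of the same compile-once, match-many paradigm using only java.util.regex, which the paragraph above names as the model for TokensRegex. The class name RegexParadigmDemo, the helper findAll, and the digit pattern are illustrative assumptions, not part of TokensRegex; the sketch works on characters rather than tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: shows the java.util.regex workflow that TokensRegex mirrors.
class RegexParadigmDemo {
    // Compiled once and reused for every input (the role TokenSequencePattern plays).
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Returns every match of NUMBER in the input, in order of appearance.
    static List<String> findAll(String input) {
        // A fresh Matcher is created per input (the role TokenSequenceMatcher plays).
        Matcher matcher = NUMBER.matcher(input);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findAll("room 12, floor 3"));
        System.out.println(findAll("no digits here"));
    }
}
```

The find()/group() loop has the same shape as the TokensRegex loop above; the difference is that TokensRegex matches over a list of tokens and can also return the matched tokens via groupNodes().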

To match multiple regular expressions over tokens

Often you may have not just one, but many regular expressions that you would like to match. TokensRegex provides a utility class, MultiPatternMatcher, for matching against multiple regular expressions. It gives higher performance than matching many patterns in turn.
Example Usage:

   List<CoreLabel> tokens = ...;
   List<TokenSequencePattern> tokenSequencePatterns = ...;
   MultiPatternMatcher<CoreMap> multiMatcher = TokenSequencePattern.getMultiPatternMatcher(tokenSequencePatterns);
   
   // Finds all non-overlapping sequences using the specified list of patterns.
   // When multiple patterns overlap, matches are selected based on priority, length, etc.
   List<SequenceMatchResult<CoreMap>> matches = multiMatcher.findNonOverlapping(tokens);
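
To make the idea of selecting among overlapping matches concrete, here is a rough, library-independent sketch. It is not the TokensRegex implementation: the Span class, the selectNonOverlapping helper, and the exact ordering (higher priority first, then longer match, then earlier start) are assumptions chosen for illustration, since the comment above only says "priority, length, etc."

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class: greedy selection of non-overlapping candidate matches.
class NonOverlappingDemo {
    // A candidate match covering token offsets [start, end), produced by a pattern
    // with the given priority (larger value wins; this is an assumed convention).
    static final class Span {
        final int start;
        final int end;
        final int priority;

        Span(int start, int end, int priority) {
            this.start = start;
            this.end = end;
            this.priority = priority;
        }

        int length() {
            return end - start;
        }

        boolean overlaps(Span other) {
            return start < other.end && other.start < end;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + ")@p" + priority;
        }
    }

    // Assumed ordering: higher priority, then longer span, then earlier start.
    static List<Span> selectNonOverlapping(List<Span> candidates) {
        List<Span> ordered = new ArrayList<>(candidates);
        ordered.sort(Comparator.comparingInt((Span s) -> -s.priority)
                .thenComparingInt(s -> -s.length())
                .thenComparingInt(s -> s.start));

        List<Span> chosen = new ArrayList<>();
        for (Span candidate : ordered) {
            boolean clash = false;
            for (Span kept : chosen) {
                if (candidate.overlaps(kept)) {
                    clash = true;
                    break;
                }
            }
            if (!clash) {
                chosen.add(candidate);
            }
        }
        // Report the surviving spans in document order.
        chosen.sort(Comparator.comparingInt((Span s) -> s.start));
        return chosen;
    }

    public static void main(String[] args) {
        List<Span> candidates = new ArrayList<>();
        candidates.add(new Span(0, 2, 0));
        candidates.add(new Span(1, 4, 0));
        candidates.add(new Span(3, 5, 1));
        candidates.add(new Span(5, 6, 0));
        System.out.println(selectNonOverlapping(candidates));
    }
}
```

In this run the span [1,4) is dropped because it overlaps the higher-priority span [3,5), while [0,2) and [5,6) survive because they overlap nothing that was already kept.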

For more complicated use cases, TokensRegex provides a pipeline for matching against multiple regular expressions in stages. It also provides a language for defining rules and for how the expression should be matched. The CoreMapExpressionExtractor class reads the TokensRegex rules from a file and uses the TokensRegex pipeline to extract matched expressions. This is used by the TokensRegexAnnotator and for developing SUTime.

 List<CoreMap> sentences = ...;
 CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromFiles(TokenSequencePattern.getNewEnv(), file1, file2, ...);
 for (CoreMap sentence : sentences) {
   List<MatchedExpression> matched = extractor.extractExpressions(sentence);
   ...
 }

 

posted @ 2017-09-25 15:23  wbinbin