Built-in standard analyzer (StandardAnalyzer)
public static void main(String[] args) throws IOException {
    Analyzer analyzer = new StandardAnalyzer();
    TokenStream ts = analyzer.tokenStream("name", "Hello China");
    // TokenStream ts = analyzer.tokenStream("name", "你好中国");
    ts.reset();
    while (ts.incrementToken()) {
        // reflectAsString(false) prints all attributes of the current token (term, offsets, type, ...)
        System.out.println(ts.reflectAsString(false));
    }
}
StandardAnalyzer segments English text reasonably well, but for Chinese it simply splits the text into single characters, which is of little use for search.
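Every example in this article uses the same reset() → incrementToken() loop. For reference, the full TokenStream contract also requires end() and close() once iteration is done; a small reusable helper along the following lines keeps that in one place (a sketch; the name printTokens and the use of CharTermAttribute instead of reflectAsString are our own choices, not part of the article):

static void printTokens(Analyzer analyzer, String text) throws IOException {
    // TokenStream is Closeable, so try-with-resources takes care of close()
    try (TokenStream ts = analyzer.tokenStream("name", text)) {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();                    // mandatory before the first incrementToken()
        while (ts.incrementToken()) {
            System.out.println(term);  // prints just the term text
        }
        ts.end();                      // finalize end-of-stream state (final offset etc.)
    }
}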
Built-in simple analyzer (SimpleAnalyzer)
public static void main(String[] args) throws IOException {
    try (Analyzer analyzer = new SimpleAnalyzer()) {
        TokenStream ts = analyzer.tokenStream("name", "Hello China");
        // TokenStream ts = analyzer.tokenStream("name", "你好中国");
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(ts.reflectAsString(false));
        }
    }
}
Output for the Chinese input:
Chinese analyzer (SmartChineseAnalyzer)
A separate dependency is required:
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-smartcn</artifactId>
    <version>8.11.4</version>
</dependency>
For Lucene 9.x and above:
<dependency>
    <groupId>org.apache.lucene</groupId>
    <!-- 9.x renamed the "analyzers" modules to "analysis" -->
    <artifactId>lucene-analysis-smartcn</artifactId>
    <version>9.12.1</version>
</dependency>
public static void main(String[] args) throws IOException {
    try (Analyzer analyzer = new SmartChineseAnalyzer()) {
        // TokenStream ts = analyzer.tokenStream("name", "Hello China");
        TokenStream ts = analyzer.tokenStream("name", "你好中国");
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(ts.reflectAsString(false));
        }
    }
}
Output for the English input:
Output for the Chinese input:
public static void main(String[] args) throws IOException {
    String[] customStopWords = {"的", "在", "呢", "是", "和", "后", "小", ",", ","};
    CharArraySet stopWordsSet = new CharArraySet(Arrays.asList(customStopWords), true);
    try (Analyzer analyzer = new SmartChineseAnalyzer()) {
    // try (Analyzer analyzer = new SmartChineseAnalyzer(stopWordsSet)) {
        // TokenStream ts = analyzer.tokenStream("name", "Hello China");
        TokenStream ts = analyzer.tokenStream("name",
                "lucene分析器使用分词器和过滤器构成一个“管道”,文本在流经这个管道后成为可以进入索引的最小单位,它主要作用是对切出来的词进行进一步的处理(如去掉敏感词、英文大小写转换、单复数处理)等。");
        ts.reset();
        int i = 0;
        while (ts.incrementToken()) {
            System.out.println((i++) + " " + ts.reflectAsString(false));
        }
    }
}
Token count with the default configuration (custom stop words not enabled):
Token count after enabling the custom stop-word set:
SmartChineseAnalyzer has three constructors: a no-arg constructor, a boolean constructor (useDefaultStopWords), and one that takes a CharArraySet of your own stop words. The no-arg constructor is equivalent to passing true to the boolean constructor, i.e. it uses the built-in stop-word dictionary.
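A quick sketch of the three constructors (the stop-word set here is only an illustration):

CharArraySet stopWords = new CharArraySet(Arrays.asList("的", "了"), true);
Analyzer a1 = new SmartChineseAnalyzer();          // same as new SmartChineseAnalyzer(true): built-in stopwords.txt is used
Analyzer a2 = new SmartChineseAnalyzer(false);     // built-in stop-word set disabled, nothing is filtered out
Analyzer a3 = new SmartChineseAnalyzer(stopWords); // your own CharArraySet replaces the built-in set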
Looking at the source, you can see that by default the stop words are read from the stopwords.txt file on the classpath:
/**
 * Atomically loads the DEFAULT_STOP_SET in a lazy fashion once the outer class accesses the
 * static final set the first time.;
 */
private static class DefaultSetHolder {
    static final CharArraySet DEFAULT_STOP_SET;

    static {
        try {
            DEFAULT_STOP_SET = loadDefaultStopWordSet();
        } catch (IOException ex) {
            // default set should always be present as it is part of the
            // distribution (JAR)
            throw new UncheckedIOException("Unable to load default stopword set", ex);
        }
    }

    static CharArraySet loadDefaultStopWordSet() throws IOException {
        // make sure it is unmodifiable as we expose it in the outer class
        return CharArraySet.unmodifiableSet(
            WordlistLoader.getWordSet(
                IOUtils.requireResourceNonNull(
                    SmartChineseAnalyzer.class.getResourceAsStream(DEFAULT_STOPWORD_FILE),
                    DEFAULT_STOPWORD_FILE),
                STOPWORD_FILE_COMMENT));
    }
}
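If you would rather keep your stop words in a file of your own instead of hard-coding them, you can load them into a CharArraySet yourself and pass that to the SmartChineseAnalyzer(CharArraySet) constructor (the quoted source does the same job with Lucene's WordlistLoader). A minimal sketch; the file name my-stopwords.txt and the helper name loadStopWords are just examples:

// requires java.nio.file.*, java.nio.charset.StandardCharsets, java.util.*, org.apache.lucene.analysis.CharArraySet
static CharArraySet loadStopWords(Path file) throws IOException {
    List<String> words = new ArrayList<>();
    for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
        String word = line.trim();
        if (!word.isEmpty()) {        // one stop word per line, blank lines ignored
            words.add(word);
        }
    }
    return new CharArraySet(words, true); // true = ignore case
}

// usage:
// try (Analyzer analyzer = new SmartChineseAnalyzer(loadStopWords(Paths.get("my-stopwords.txt")))) { ... }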
A Chinese-friendly analyzer (IK Analyzer)
A separate dependency is required:
<dependency>
    <groupId>com.github.magese</groupId>
    <artifactId>ik-analyzer</artifactId>
    <version>8.5.0</version>
</dependency>
public static void main(String[] args) throws IOException {
    try (Analyzer analyzer = new IKAnalyzer()) {
        TokenStream ts = analyzer.tokenStream("name", "Hello China");
        // TokenStream ts = analyzer.tokenStream("name", "你好中国");
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(ts.reflectAsString(false));
        }
    }
}
Output for the English input ("Hello China"):
Load extended dictionary:ext.dic
Load stopwords dictionary:stopword.dic
term=hello,bytes=[68 65 6c 6c 6f],startOffset=0,endOffset=5,positionIncrement=1,positionLength=1,type=ENGLISH,termFrequency=1
term=china,bytes=[63 68 69 6e 61],startOffset=6,endOffset=11,positionIncrement=1,positionLength=1,type=ENGLISH,termFrequency=1
Output for the Chinese input ("你好中国"):
Load extended dictionary:ext.dic
Load stopwords dictionary:stopword.dic
term=你好,bytes=[e4 bd a0 e5 a5 bd],startOffset=0,endOffset=2,positionIncrement=1,positionLength=1,type=CN_WORD,termFrequency=1
term=你,bytes=[e4 bd a0],startOffset=0,endOffset=1,positionIncrement=1,positionLength=1,type=CN_WORD,termFrequency=1
term=好中国,bytes=[e5 a5 bd e4 b8 ad e5 9b bd],startOffset=1,endOffset=4,positionIncrement=1,positionLength=1,type=CN_WORD,termFrequency=1
term=好中,bytes=[e5 a5 bd e4 b8 ad],startOffset=1,endOffset=3,positionIncrement=1,positionLength=1,type=CN_WORD,termFrequency=1
term=中国,bytes=[e4 b8 ad e5 9b bd],startOffset=2,endOffset=4,positionIncrement=1,positionLength=1,type=CN_WORD,termFrequency=1
Tip: IKAnalyzer has a useSmart mode, enabled via new IKAnalyzer(true); it defaults to false. Smart mode produces coarser-grained segmentation, which is sometimes exactly what you want and sometimes drops splits you would rather keep.
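A minimal sketch of running the Chinese input through smart mode; only the constructor argument differs from the example above:

public static void main(String[] args) throws IOException {
    // true enables useSmart: coarser-grained segmentation with fewer overlapping terms
    try (Analyzer analyzer = new IKAnalyzer(true)) {
        TokenStream ts = analyzer.tokenStream("name", "你好中国");
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(ts.reflectAsString(false));
        }
    }
}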
Custom extension and stop-word dictionaries
As shown above, by default 你好中国 is split into 你好, 你, 好中国, 好中 and 中国.
If we don't want the terms 你 and 好中, we can achieve that by configuring a custom stop-word dictionary.
Add the IKAnalyzer extension configuration
Taking a Maven project as an example, create a file named IKAnalyzer.cfg.xml in the root of the resources directory with the following content:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- users can configure their own extension dictionary here -->
    <entry key="ext_dict">ext.dic</entry>
    <!-- users can configure their own extension stop-word dictionaries here -->
    <entry key="ext_stopwords">stopword.dic;stopword2.dic</entry>
    <!-- users can configure a remote extension dictionary here -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- users can configure remote extension stop-word dictionaries here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
Create the stop-word dictionary files stopword.dic and stopword2.dic. Why two files? Simply to show that the configuration supports multiple files, separated by semicolons; see the example below.
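The article does not show the file contents themselves; each .dic file is simply a UTF-8 text file with one term per line, so based on the goal above the two files could look like this (how the two terms are split across the files is an assumption):

stopword.dic:
你

stopword2.dic:
好中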
Sample output:
Load extended dictionary:ext.dic
Load stopwords dictionary:stopword.dic
Load stopwords dictionary:stopword2.dic
term=你好,bytes=[e4 bd a0 e5 a5 bd],startOffset=0,endOffset=2,positionIncrement=1,positionLength=1,type=CN_WORD,termFrequency=1
term=好中国,bytes=[e5 a5 bd e4 b8 ad e5 9b bd],startOffset=1,endOffset=4,positionIncrement=1,positionLength=1,type=CN_WORD,termFrequency=1
term=中国,bytes=[e4 b8 ad e5 9b bd],startOffset=2,endOffset=4,positionIncrement=1,positionLength=1,type=CN_WORD,termFrequency=1
For full-text search over content that includes Chinese, the latter two analyzers are recommended. Of the two, ik-analyzer supports both a custom extension dictionary and a custom stop-word dictionary, which makes it the more capable option; the smartcn analyzer supports a custom stop-word set only.
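For context, this is roughly where an analyzer plugs into full-text indexing; a minimal sketch using Lucene's in-memory ByteBuffersDirectory (the field name "name" mirrors the examples above, everything else is illustrative):

try (Analyzer analyzer = new SmartChineseAnalyzer();
     Directory dir = new ByteBuffersDirectory();
     IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
    Document doc = new Document();
    // TextField content is passed through the analyzer when the document is indexed
    doc.add(new TextField("name", "你好中国", Field.Store.YES));
    writer.addDocument(doc);
}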
https://blog.xqlee.com/article/2504291149319024.html