Built-in standard analyzer (StandardAnalyzer)
public static void main(String[] args) throws IOException {
    Analyzer analyzer = new StandardAnalyzer();
    TokenStream ts = analyzer.tokenStream("name", "Hello China");
    // TokenStream ts = analyzer.tokenStream("name", "你好中国");
    ts.reset();
    while (ts.incrementToken()) {
        // reflectAsString(false) prints all attributes of the current token (term, offsets, type, ...)
        System.out.println(ts.reflectAsString(false));
    }
}
StandardAnalyzer segments English text reasonably well, but for Chinese it simply splits the text into single characters, which is of little use for search.
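Every example in this article uses the same reset() → incrementToken() loop. For reference, the full TokenStream contract also requires end() and close() once iteration is done; a small reusable helper along the following lines keeps that in one place (a sketch; the name printTokens and the use of CharTermAttribute instead of reflectAsString are our own choices, not part of the article):

static void printTokens(Analyzer analyzer, String text) throws IOException {
    // TokenStream is Closeable, so try-with-resources takes care of close()
    try (TokenStream ts = analyzer.tokenStream("name", text)) {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();                    // mandatory before the first incrementToken()
        while (ts.incrementToken()) {
            System.out.println(term);  // prints just the term text
        }
        ts.end();                      // finalize end-of-stream state (final offset etc.)
    }
}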
Built-in simple analyzer (SimpleAnalyzer)
public static void main(String[] args) throws IOException {
    try (Analyzer analyzer = new SimpleAnalyzer()) {
        TokenStream ts = analyzer.tokenStream("name", "Hello China");
        // TokenStream ts = analyzer.tokenStream("name", "你好中国");
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(ts.reflectAsString(false));
        }
    }
}
Output for the Chinese input:
Chinese analyzer (SmartChineseAnalyzer)
A separate dependency is required:
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-smartcn</artifactId>
    <version>8.11.4</version>
</dependency>
For Lucene 9.x and above:
<dependency>
    <groupId>org.apache.lucene</groupId>
    <!-- 9.x renamed the "analyzers" modules to "analysis" -->
    <artifactId>lucene-analysis-smartcn</artifactId>
    <version>9.12.1</version>
</dependency>
public static void main(String[] args) throws IOException {
    try (Analyzer analyzer = new SmartChineseAnalyzer()) {
        // TokenStream ts = analyzer.tokenStream("name", "Hello China");
        TokenStream ts = analyzer.tokenStream("name", "你好中国");
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(ts.reflectAsString(false));
        }
    }
}
Output for the English input:
Output for the Chinese input:
public static void main(String[] args) throws IOException {
    String[] customStopWords = {"的", "在", "呢", "是", "和", "后", "小", ",", ","};
    CharArraySet stopWordsSet = new CharArraySet(Arrays.asList(customStopWords), true);
    try (Analyzer analyzer = new SmartChineseAnalyzer()) {
    // try (Analyzer analyzer = new SmartChineseAnalyzer(stopWordsSet)) {
        // TokenStream ts = analyzer.tokenStream("name", "Hello China");
        TokenStream ts = analyzer.tokenStream("name",
                "lucene分析器使用分词器和过滤器构成一个“管道”,文本在流经这个管道后成为可以进入索引的最小单位,它主要作用是对切出来的词进行进一步的处理(如去掉敏感词、英文大小写转换、单复数处理)等。");
        ts.reset();
        int i = 0;
        while (ts.incrementToken()) {
            System.out.println((i++) + " " + ts.reflectAsString(false));
        }
    }
}
Token count with the default configuration (custom stop words not enabled):
Token count after enabling the custom stop-word set:
SmartChineseAnalyzer has three constructors: a no-arg constructor, a boolean constructor (useDefaultStopWords), and one that takes a CharArraySet of your own stop words. The no-arg constructor is equivalent to passing true to the boolean constructor, i.e. it uses the built-in stop-word dictionary.
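A quick sketch of the three constructors (the stop-word set here is only an illustration):

CharArraySet stopWords = new CharArraySet(Arrays.asList("的", "了"), true);
Analyzer a1 = new SmartChineseAnalyzer();          // same as new SmartChineseAnalyzer(true): built-in stopwords.txt is used
Analyzer a2 = new SmartChineseAnalyzer(false);     // built-in stop-word set disabled, nothing is filtered out
Analyzer a3 = new SmartChineseAnalyzer(stopWords); // your own CharArraySet replaces the built-in set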
Looking at the source, you can see that by default the stop words are read from the stopwords.txt file on the classpath:
/**
 * Atomically loads the DEFAULT_STOP_SET in a lazy fashion once the outer class accesses the
 * static final set the first time.;
 */
private static class DefaultSetHolder {
    static final CharArraySet DEFAULT_STOP_SET;

    static {
        try {
            DEFAULT_STOP_SET = loadDefaultStopWordSet();
        } catch (IOException ex) {
            // default set should always be present as it is part of the
            // distribution (JAR)
            throw new UncheckedIOException("Unable to load default stopword set", ex);
        }
    }

    static CharArraySet loadDefaultStopWordSet() throws IOException {
        // make sure it is unmodifiable as we expose it in the outer class
        return CharArraySet.unmodifiableSet(
            WordlistLoader.getWordSet(
                IOUtils.requireResourceNonNull(
                    SmartChineseAnalyzer.class.getResourceAsStream(DEFAULT_STOPWORD_FILE),
                    DEFAULT_STOPWORD_FILE),
                STOPWORD_FILE_COMMENT));
    }
}
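If you would rather keep your stop words in a file of your own instead of hard-coding them, you can load them into a CharArraySet yourself and pass that to the SmartChineseAnalyzer(CharArraySet) constructor (the quoted source does the same job with Lucene's WordlistLoader). A minimal sketch; the file name my-stopwords.txt and the helper name loadStopWords are just examples:

// requires java.nio.file.*, java.nio.charset.StandardCharsets, java.util.*, org.apache.lucene.analysis.CharArraySet
static CharArraySet loadStopWords(Path file) throws IOException {
    List<String> words = new ArrayList<>();
    for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
        String word = line.trim();
        if (!word.isEmpty()) {        // one stop word per line, blank lines ignored
            words.add(word);
        }
    }
    return new CharArraySet(words, true); // true = ignore case
}

// usage:
// try (Analyzer analyzer = new SmartChineseAnalyzer(loadStopWords(Paths.get("my-stopwords.txt")))) { ... }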
A Chinese-friendly analyzer (IK Analyzer)
A separate dependency is required:
<dependency>
    <groupId>com.github.magese</groupId>
    <artifactId>ik-analyzer</artifactId>
    <version>8.5.0</version>
</dependency>
public static void main(String[] args) throws IOException {
    try (Analyzer analyzer = new IKAnalyzer()) {
        TokenStream ts = analyzer.tokenStream("name", "Hello China");
        // TokenStream ts = analyzer.tokenStream("name", "你好中国");
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(ts.reflectAsString(false));
        }
    }
}
Output for the English input ("Hello China"):
Load extended dictionary:ext.dic
Load stopwords dictionary:stopword.dic
term=hello,bytes=[68 65 6c 6c 6f],startOffset=0,endOffset=5,positionIncrement=1,positionLength=1,type=ENGLISH,termFrequency=1
term=china,bytes=[63 68 69 6e 61],startOffset=6,endOffset=11,positionIncrement=1,positionLength=1,type=ENGLISH,termFrequency=1
Output for the Chinese input ("你好中国"):
Load extended dictionary:ext.dic
Load stopwords dictionary:stopword.dic
term=你好,bytes=[e4 bd a0 e5 a5 bd],startOffset=0,endOffset=2,positionIncrement=1,positionLength=1,type=CN_WORD,termFrequency=1
term=你,bytes=[e4 bd a0],startOffset=0,endOffset=1,positionIncrement=1,positionLength=1,type=CN_WORD,termFrequency=1
term=好中国,bytes=[e5 a5 bd e4 b8 ad e5 9b bd],startOffset=1,endOffset=4,positionIncrement=1,positionLength=1,type=CN_WORD,termFrequency=1
term=好中,bytes=[e5 a5 bd e4 b8 ad],startOffset=1,endOffset=3,positionIncrement=1,positionLength=1,type=CN_WORD,termFrequency=1
term=中国,bytes=[e4 b8 ad e5 9b bd],startOffset=2,endOffset=4,positionIncrement=1,positionLength=1,type=CN_WORD,termFrequency=1
Tip: IKAnalyzer has a useSmart mode, enabled via new IKAnalyzer(true); it defaults to false. Smart mode produces coarser-grained segmentation, which is sometimes exactly what you want and sometimes drops splits you would rather keep.
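A minimal sketch of running the Chinese input through smart mode; only the constructor argument differs from the example above:

public static void main(String[] args) throws IOException {
    // true enables useSmart: coarser-grained segmentation with fewer overlapping terms
    try (Analyzer analyzer = new IKAnalyzer(true)) {
        TokenStream ts = analyzer.tokenStream("name", "你好中国");
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(ts.reflectAsString(false));
        }
    }
}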
Custom extension and stop-word dictionaries
As shown above, by default 你好中国 is split into 你好, 你, 好中国, 好中 and 中国.
If we don't want the terms 你 and 好中, we can achieve that by configuring a custom stop-word dictionary.
Add the IKAnalyzer extension configuration
Taking a Maven project as an example, create a file named IKAnalyzer.cfg.xml in the root of the resources directory with the following content:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- users can configure their own extension dictionary here -->
    <entry key="ext_dict">ext.dic</entry>
    <!-- users can configure their own extension stop-word dictionaries here -->
    <entry key="ext_stopwords">stopword.dic;stopword2.dic</entry>
    <!-- users can configure a remote extension dictionary here -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- users can configure remote extension stop-word dictionaries here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
Create the stop-word dictionary files stopword.dic and stopword2.dic. Why two files? Simply to show that the configuration supports multiple files, separated by semicolons; see the example below.
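The article does not show the file contents themselves; each .dic file is simply a UTF-8 text file with one term per line, so based on the goal above the two files could look like this (how the two terms are split across the files is an assumption):

stopword.dic:
你

stopword2.dic:
好中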
Sample output:
Load extended dictionary:ext.dic
Load stopwords dictionary:stopword.dic
Load stopwords dictionary:stopword2.dic
term=你好,bytes=[e4 bd a0 e5 a5 bd],startOffset=0,endOffset=2,positionIncrement=1,positionLength=1,type=CN_WORD,termFrequency=1
term=好中国,bytes=[e5 a5 bd e4 b8 ad e5 9b bd],startOffset=1,endOffset=4,positionIncrement=1,positionLength=1,type=CN_WORD,termFrequency=1
term=中国,bytes=[e4 b8 ad e5 9b bd],startOffset=2,endOffset=4,positionIncrement=1,positionLength=1,type=CN_WORD,termFrequency=1
For full-text search over content that includes Chinese, the latter two analyzers are recommended. Of the two, ik-analyzer supports both a custom extension dictionary and a custom stop-word dictionary, which makes it the more capable option; the smartcn analyzer supports a custom stop-word set only.
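For context, this is roughly where an analyzer plugs into full-text indexing; a minimal sketch using Lucene's in-memory ByteBuffersDirectory (the field name "name" mirrors the examples above, everything else is illustrative):

try (Analyzer analyzer = new SmartChineseAnalyzer();
     Directory dir = new ByteBuffersDirectory();
     IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
    Document doc = new Document();
    // TextField content is passed through the analyzer when the document is indexed
    doc.add(new TextField("name", "你好中国", Field.Store.YES));
    writer.addDocument(doc);
}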
https://blog.xqlee.com/article/2504291149319024.html