Apache Lucene: Getting Started

2025-04-30 14:37:56

Apache Lucene

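A full-text indexing library hosted by the Apache Software Foundation and implemented in pure Java.

What Lucene does

User -> server -> Lucene API -> index -> database / documents / web pages -> results returned to the user.

Common search approaches

[1] Sequential scan: every query reads the full text of every record, so it does not scale to large data sets. A MySQL LIKE '%keyword%' query works this way, as does Ctrl+F in a text editor.
[2] Inverted index: extract the text of each document, split it into terms, and build an index (a "table of contents") that maps each term to the documents containing it. A query consults the index first and only then fetches the matching text. Tokenization is the critical step.
Why is an inverted index fast? Duplicate terms are collapsed and stop words (的, 地, 得, a, an, the) are dropped, so the term dictionary is far smaller than the text it describes. It works like the index of a dictionary: you look up the entry instead of reading every page.
Pros: accurate and fast. Cons: the index takes extra disk space. It trades space for time.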
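
The contrast between the two approaches can be sketched in plain Java. This is a toy illustration and not Lucene code: the class name InvertedIndexSketch, the three sample sentences and the three-word stop list are assumptions made for the example, and whitespace/punctuation splitting stands in for a real tokenizer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy comparison of a sequential scan and an inverted index (not Lucene code). */
final class InvertedIndexSketch {
    // Illustrative stop-word list; real analyzers ship much longer ones.
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "the");

    /** Sequential scan: reads every document in full for every query. */
    static List<Integer> scan(List<String> docs, String keyword) {
        List<Integer> hits = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            if (docs.get(id).toLowerCase(Locale.ROOT).contains(keyword)) {
                hits.add(id);
            }
        }
        return hits;
    }

    /** Builds term -> sorted doc ids; duplicates collapse and stop words are dropped. */
    static Map<String, TreeSet<Integer>> build(List<String> docs) {
        Map<String, TreeSet<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : docs.get(id).toLowerCase(Locale.ROOT).split("\\W+")) {
                if (term.isEmpty() || STOP_WORDS.contains(term)) {
                    continue;
                }
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    /** Index lookup: one map access, independent of how long the documents are. */
    static List<Integer> lookup(Map<String, TreeSet<Integer>> index, String keyword) {
        return new ArrayList<>(index.getOrDefault(keyword, new TreeSet<>()));
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "the phone has a large screen",
                "a phone case",
                "the laptop has an HDMI port");
        Map<String, TreeSet<Integer>> index = build(docs);
        System.out.println(scan(docs, "phone"));      // [0, 1]
        System.out.println(lookup(index, "phone"));   // [0, 1]
        System.out.println(index.containsKey("the")); // false: stop word was dropped
    }
}
```

Both calls return the same documents here, but scan touches all of the text on every query, while lookup only touches the term dictionary that was built once up front. That up-front structure is the extra space the index pays for.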

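Where full-text search is used

Site search (Baidu Tieba, JD.com, Taobao).
Vertical search (818工作网).
Dedicated search engines (Google, Baidu).

Search engines of this kind run their own in-house full-text search software rather than using an open-source component such as Apache Lucene directly.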

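What is Lucene

[1] Documents -> terms -> index (table of contents).
[2] Full-text search: look up the index first, then fetch the text.
[3] Doug Cutting started Lucene, Nutch, Hadoop and other projects. The projects were donated to the Apache Software Foundation.
[4] Website: https://lucene.apache.org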

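Overview of the indexing and search flow

[1] Source documents -> build the index (fetch the documents, build Document objects, tokenize, write the index) -> index store (built ahead of time, before any query runs).
[2] User query -> build the query -> run the query -> render the results -> return the results.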

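The Lucene indexing flow in detail

[1] A Document has a unique ID and consists of Fields (key/value pairs). id:1 name:华为手机64G brandName:华为. id:2 name:华为手机128G brandName:
[2] Text fields are tokenized, and analysis yields terms: ... Each term is stored together with the documents it occurs in: keyword 1 is in document 1, keyword 2 is in document 2, and the keyword 手机 (phone) is in documents 1 and 2.
[3] A search first finds which documents contain the keyword and then reads those documents.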
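
The three steps above can be modeled in a few lines of plain Java. This is only a sketch under stated assumptions: PostingsSketch is a made-up class, the product names are English stand-ins for the two phone records, and the term lists are written by hand because the real split depends on the analyzer. It is not how Lucene lays out its index on disk.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;

/** Toy model of documents, fields and postings (plain Java, not the Lucene API). */
final class PostingsSketch {
    // Stored documents: doc id -> (field name -> original value).
    final Map<Integer, Map<String, String>> stored = new LinkedHashMap<>();
    // Postings: "field:term" -> ids of the documents containing that term.
    final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    /** Stores the field value and records each pre-split term in the postings. */
    void addField(int docId, String field, String value, List<String> terms) {
        stored.computeIfAbsent(docId, id -> new LinkedHashMap<>()).put(field, value);
        for (String term : terms) {
            postings.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Step 1: find the doc ids for the term. Step 2: fetch the stored value of each hit. */
    List<String> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term, new TreeSet<>()).stream()
                .map(id -> stored.get(id).get(field))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        PostingsSketch index = new PostingsSketch();
        // Hand-written term lists (an assumption); a real analyzer decides the split.
        index.addField(1, "name", "Huawei phone 64G", List.of("huawei", "phone", "64g"));
        index.addField(2, "name", "Huawei phone 128G", List.of("huawei", "phone", "128g"));
        // {name:128g=[2], name:64g=[1], name:huawei=[1, 2], name:phone=[1, 2]}
        System.out.println(index.postings);
        System.out.println(index.search("name", "phone")); // [Huawei phone 64G, Huawei phone 128G]
        System.out.println(index.search("name", "128g"));  // [Huawei phone 128G]
    }
}
```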

Getting started with Lucene

Using a Java Maven project as the example

Add the Lucene dependencies

    <!-- ... properties / version definitions below -->
    <properties>
        <maven.compiler.source>17</maven.compiler.source>
        <maven.compiler.target>17</maven.compiler.target>
        <lucene.version>9.12.1</lucene.version>
        <hutool.all.version>5.8.26</hutool.all.version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>
    
    <!-- ... dependencies below -->
        <!-- Lucene core (required) -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>${lucene.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <!--9.x renamed analyzers to analysis-->
            <artifactId>lucene-analysis-common</artifactId>
            <version>${lucene.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queries</artifactId>
            <version>${lucene.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-highlighter</artifactId>
            <version>${lucene.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
            <version>${lucene.version}</version>
        </dependency>

        <!-- Chinese word segmentation dependencies -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <!--9.x renamed analyzers to analysis-->
            <artifactId>lucene-analysis-smartcn</artifactId>
            <version>${lucene.version}</version>
        </dependency>
        <dependency>
            <groupId>com.github.magese</groupId>
            <artifactId>ik-analyzer</artifactId>
            <version>8.5.0</version>
        </dependency>

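Tip: pick a Lucene version that is compatible with your JDK (Lucene 9.x requires Java 11 or later). See "Apache Lucene and JDK version compatibility" on XQLEE'Blog.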

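Sample data object

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

// Lombok (not listed in the dependencies above) generates the constructors
// and the getters/setters used by the snippets below.
@AllArgsConstructor
@NoArgsConstructor
@Data
public class Product {
    /** Product id **/
    Long id;
    /** Product name **/
    String name;
    /** Product price **/
    Integer price;
    /** Units in stock **/
    Integer stock;
    /** Product image file **/
    String image;
    /** Product brand **/
    String brand;
}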

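Creating the index

    // Imports used by the snippets in this article (Lucene 9.x package names).
    import java.io.IOException;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    /**
     * Write (create) the index.
     */
    public static void createIndex() throws IOException {
       
        // 1. Data source: mock data (in a real application it comes from the database)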
        Product p1 = new Product(1L,"小米手机15PRO",4500,20,"xiaomi15.jpg","小米");
        Product p2 = new Product(2L,"红米手机K80PRO",2999,50,"redmi_k80.jpg","红米");
        Product p3 = new Product(3L,"魅族手机20",2999,50,"meizu_20.jpg","魅族");
        List<Product> products = new ArrayList<>();
        products.add(p1);
        products.add(p2);
        products.add(p3);

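        // 2. Build the list of documents to index
        // org.apache.lucene.document.Document
        List<Document> docs = new ArrayList<Document>();
        for (Product product : products) {
            // Create the document
            Document doc = new Document();
            // StringField indexes the id as one un-analyzed term, so the exact-match
            // Term("id", ...) used for update/delete does not depend on the analyzer.
            // Field.Store.YES keeps the original value, needed to load product details by id.
            doc.add(new StringField("id",product.getId().toString(),Field.Store.YES));
            // Field.Store.YES keeps the original value; the result list page shows the name
            doc.add(new TextField("name",product.getName(),Field.Store.YES));
            /* ************************** Special fields START ******************************/
            // IntPoint indexes the int value so that range queries work,
            // e.g. IntPoint.newRangeQuery("price", 2000, 3000)
            doc.add(new IntPoint("price",product.getPrice()));
            // IntPoint does not store the value, so a StoredField with the same name
            // is added to get the original price back in search results
            doc.add(new StoredField("price",product.getPrice()));
            /* ************************** Special fields END ******************************/
            // Image file name: indexed as a single term and stored
            doc.add(new StringField("image",product.getImage(),Field.Store.YES));
            // Brand: analyzed and stored
            doc.add(new TextField("brand",product.getBrand(),Field.Store.YES));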

            docs.add(doc);
        }

        // 3. Create the analyzer
        // The default StandardAnalyzer handles Chinese poorly (it splits it into single characters)
//        Analyzer analyzer = new StandardAnalyzer();
        Analyzer analyzer = new SmartChineseAnalyzer();

        // 4. Open the index directory on disk (relative or absolute path)
       try (FSDirectory directory = FSDirectory.open(Paths.get("src/main/resources/lucene/index/products"))){
           // 5. Create the IndexWriterConfig, which tells the writer which analyzer to use
           IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
           // 6. Create the IndexWriter from the target directory and the config
           try (IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig)){
               // 7. Write the documents to the index
               for (Document doc : docs) {
                   indexWriter.addDocument(doc);
               }
               // Commit explicitly; closing the writer also commits by default
               indexWriter.commit();
           }
       }

    }

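Searching the index

 public static void searchIndex() throws ParseException, IOException {
        // 1. Create the analyzer that tokenizes the search text, e.g. 华为手机 may be split into 华为 and 手机
//        Analyzer analyzer = new StandardAnalyzer();
        Analyzer analyzer = new SmartChineseAnalyzer();
        // Important: use the same analyzer as at index time, otherwise the query terms may not match the indexed terms
        // 2. Create the query parser; the first argument is the default search field
        QueryParser queryParser = new QueryParser("name", analyzer);
        // 3. Parse the search keywords into a Query
        Query query = queryParser.parse("小米");
        // queryParser.parse("id:小米") would search the id field; without a field prefix the default name field is searched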
        try (
            // 4. Open the Directory at the index location (the same path the index was written to)
            Directory directory = FSDirectory.open(Paths.get("src/main/resources/lucene/index/products"));
            // 5. Open a reader on the index
            IndexReader indexReader = DirectoryReader.open(directory);
        ){
            // 6. Create the searcher
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);
            // 7. Run the search and keep the top 10 hits
            TopDocs topDocs_10 = indexSearcher.search(query, 10);
            // 8. Get the result array
            ScoreDoc[] scoreDocArray = topDocs_10.scoreDocs;
            // Print the results
            printResult(scoreDocArray,indexSearcher);
        }
    }


    public static void printResult(ScoreDoc [] scoreDocArray,IndexSearcher indexSearcher) throws IOException {
        // 9. Iterate over the results
        System.out.println("Found " + scoreDocArray.length + " result(s)");
        for (ScoreDoc temp : scoreDocArray) {
            // Internal document number that Lucene assigned when the document was added
            // (not the product's "id" field)
            int docId = temp.doc;
            // Load the stored fields of the document by that number
            // (IndexSearcher.doc(int) is deprecated in recent 9.x releases)
            Document document = indexSearcher.storedFields().document(docId);
            System.out.println("******************************************************************************************************");
            System.out.println("id: " + document.get("id"));
            System.out.println("name: " + document.get("name"));
            System.out.println("price: " + document.get("price"));
        }
    }
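
The warning in searchIndex about using the same analyzer at index time and at query time can be illustrated without Lucene. The sketch below is a toy model: the two "analyzers" are hypothetical functions invented for the example (one lowercases and splits on whitespace, the other keeps the text as a single term), and the English product names stand in for the Chinese sample data.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Toy demo of why index-time and query-time analysis must agree (not Lucene code). */
final class AnalyzerMismatchSketch {
    // Hypothetical "analyzers": each turns a text into a list of terms.
    static final Function<String, List<String>> LOWERCASE_WORDS = text ->
            Arrays.stream(text.toLowerCase(Locale.ROOT).trim().split("\\s+"))
                    .filter(term -> !term.isEmpty())
                    .collect(Collectors.toList());
    static final Function<String, List<String>> WHOLE_STRING = text -> List.of(text);

    /** Builds term -> doc ids using the given index-time analyzer. */
    static Map<String, TreeSet<Integer>> index(
            List<String> docs, Function<String, List<String>> analyzer) {
        Map<String, TreeSet<Integer>> postings = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : analyzer.apply(docs.get(id))) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
        return postings;
    }

    /** Analyzes the query text, then returns every doc that contains any query term. */
    static TreeSet<Integer> search(
            Map<String, TreeSet<Integer>> postings,
            String queryText,
            Function<String, List<String>> analyzer) {
        TreeSet<Integer> hits = new TreeSet<>();
        for (String term : analyzer.apply(queryText)) {
            hits.addAll(postings.getOrDefault(term, new TreeSet<>()));
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("Xiaomi Phone 15PRO", "Redmi Phone K80PRO");
        Map<String, TreeSet<Integer>> postings = index(docs, LOWERCASE_WORDS);
        // Same analyzer on both sides: "Xiaomi" becomes "xiaomi" and matches.
        System.out.println(search(postings, "Xiaomi", LOWERCASE_WORDS)); // [0]
        // Different analyzer: the term "Xiaomi" was never indexed, so nothing matches.
        System.out.println(search(postings, "Xiaomi", WHOLE_STRING));    // []
    }
}
```

The index only contains the terms the index-time analyzer produced, so a query term shaped by a different analyzer can miss even when the original text obviously contains the word.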

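Query examples

Search the name field for the keyword 小米 (Xiaomi): queryParser.parse("小米")

[Screenshot: console output of the query]

Search the name field for the keyword 手机 (phone): queryParser.parse("手机")

[Screenshot: console output of the query]

Search the brand field for the keyword 红米 (Redmi): queryParser.parse("brand:红米")

[Screenshot: console output of the query]

Tip: the colon between the field name and the keyword must be the ASCII (half-width) colon ":", not the full-width colon "：".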
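
Why the colon matters can be shown with a deliberately simplified model of the "field:keyword" syntax. The sketch below is not Lucene's QueryParser: FieldQuerySketch and its split method are hypothetical, and they only capture the one rule the tip relies on, namely that the ASCII colon separates a field name from the keyword.

```java
/** Toy split of "field:keyword" query text (a simplification, not Lucene's QueryParser). */
final class FieldQuerySketch {
    /** Returns {field, keyword}; falls back to defaultField when no ASCII ':' is present. */
    static String[] split(String queryText, String defaultField) {
        int colon = queryText.indexOf(':'); // ASCII colon U+003A only
        if (colon <= 0) {
            return new String[] {defaultField, queryText};
        }
        return new String[] {queryText.substring(0, colon), queryText.substring(colon + 1)};
    }

    public static void main(String[] args) {
        // Half-width colon: the text before it is taken as the field name.
        System.out.println(String.join(" | ", split("brand:Redmi", "name")));
        // Full-width colon (U+FF1A): no field is recognized, so the whole string
        // becomes a keyword for the default field.
        System.out.println(String.join(" | ", split("brand\uFF1ARedmi", "name")));
    }
}
```

With the full-width colon the intended field prefix silently becomes part of the keyword, so the search runs against the default field and most likely finds nothing.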

Updating the index

    public static void updateIndex() throws IOException {
        // 1. Mock document data
        Document doc = new Document();
        // StringField keeps the id as a single exact-match term
        doc.add(new StringField("id","11",Field.Store.YES));
        doc.add(new TextField("name","魅族手机15",Field.Store.YES));
        doc.add(new IntPoint("price",1999));
        doc.add(new StoredField("price",1999));
        doc.add(new StringField("image","meizu15.jpg",Field.Store.YES));
        // Brand: analyzed and stored
        doc.add(new TextField("brand","魅族",Field.Store.YES));
        // 2. Create the analyzer
        // The default StandardAnalyzer handles Chinese poorly
//        Analyzer analyzer = new StandardAnalyzer();
        Analyzer analyzer = new SmartChineseAnalyzer();
        // 3. Open the index Directory, which points at the index location
        try (
                Directory directory = FSDirectory.open(Paths.get("src/main/resources/lucene/index/products"));
                ){
            // 4. Create the IndexWriterConfig, which tells the writer which analyzer to use
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
            // 5. Create the IndexWriter from the target directory and the config
            try (IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);){
                // 6. Update the document
                // updateDocument deletes every document matching the term and then adds doc:
                // the id=1 document is gone and the id=11 document takes its place
                indexWriter.updateDocument(new Term("id", "1"), doc);
            }
        }
    }
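
The note in updateIndex follows from how updateDocument works: it first deletes the documents that contain the term and then adds the new one. A minimal plain-Java model of that behavior, with a made-up UpdateSketch class and English stand-ins for the product names, might look like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy model of update-by-term: delete every match, then add the new document. */
final class UpdateSketch {
    final List<Map<String, String>> docs = new ArrayList<>();

    void add(Map<String, String> doc) {
        docs.add(doc);
    }

    /** Removes all documents whose field equals value, then appends the replacement. */
    void update(String field, String value, Map<String, String> replacement) {
        docs.removeIf(doc -> value.equals(doc.get(field)));
        docs.add(replacement);
    }

    public static void main(String[] args) {
        UpdateSketch index = new UpdateSketch();
        index.add(Map.of("id", "1", "name", "Xiaomi Phone 15PRO"));
        index.add(Map.of("id", "2", "name", "Redmi Phone K80PRO"));
        // Replace the id=1 document with a new document whose id is 11.
        index.update("id", "1", Map.of("id", "11", "name", "Meizu Phone 15"));
        // Prints "2 Redmi Phone K80PRO" and then "11 Meizu Phone 15".
        index.docs.forEach(doc -> System.out.println(doc.get("id") + " " + doc.get("name")));
        // A term that matches nothing degrades to a plain add.
        index.update("id", "99", Map.of("id", "12", "name", "Another Phone"));
        System.out.println(index.docs.size()); // 3
    }
}
```

Because the replacement is a whole new document, any field that is left out of it is lost; nothing is merged from the document that was deleted.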

Deleting from the index

    /**
     * Delete the documents that match a condition, here by id.
     * @throws IOException if the index cannot be opened or written
     */
    public static void deleteIndex() throws IOException {
        // 1. Create the analyzer (it is not used when deleting by Term)
        Analyzer analyzer = new StandardAnalyzer();
        // 2. Open the index Directory, which points at the index location
        try (Directory directory = FSDirectory.open(Paths.get("src/main/resources/lucene/index/products"));){
            // 3. Create the IndexWriterConfig, which tells the writer which analyzer to use
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
            // 4. Create the IndexWriter from the target directory and the config
            try (IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);){
                // 5. Delete the documents that match the term
                indexWriter.deleteDocuments(new Term("id", "11"));
                // Delete everything: use with caution
                //indexWriter.deleteAll();
            }
        }

    }

    /**
     * Delete every document in the index.
     * @throws IOException if the index cannot be opened or written
     */
    public static void deleteAllIndex()throws IOException{
        // 1. Open the index Directory, which points at the index location
        try (Directory directory = FSDirectory.open(Paths.get("src/main/resources/lucene/index/products"));){
            // 2. Create the IndexWriterConfig (the no-arg constructor uses StandardAnalyzer)
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig();
            // 3. Create the IndexWriter from the target directory and the config
            try (IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);){
                // 4. Delete everything: use with caution
                indexWriter.deleteAll();
            }
        }
    }
