倒排索引

网图

概念

正排索引

正排索引，又称正向索引，是信息检索中非常常见的一种索引方式。它的主要思想是将文档的内容分解为一系列的词条（或关键词），并对每个词条建立一个索引，指向包含该词条的文档。

我们可以将正排索引理解为一个巨大的表格，表格的每一行对应一个文档，每一列对应一个词条。如果某个文档包含某个词条，那么在表格的相应位置就会有一个标记。

例如，假设我们有三篇文档，内容分别为：

“苹果是一种水果”
“香蕉也是一种水果”
“橘子和苹果都是水果”

那么，我们可以建立一个如下的正排索引：

文档	苹果	水果	香蕉	也是	橘子	和	都是
文档1	√	√
文档2		√	√	√
文档3	√	√			√	√	√

通过这个索引，我们可以快速查询到所有包含某个词条的文档，从而提高信息检索的效率。例如，如果我们想找到所有包含"苹果"的文档，只需要查看"苹果"这一列就可以了。

正排索引是构建全文搜索、信息检索系统的基础，也是许多搜索引擎的核心技术之一。

倒排索引，也称为反向索引，是一种非常常用的索引方法，特别是在全文搜索中。它的主要作用是用来存储某个单词在一个文档或者一组文档中的存储位置的映射。

我们可以通过一个简单的例子来理解倒排索引。

假设我们有以下三篇文档：

“苹果是一种水果”
“香蕉也是一种水果”
“橘子和苹果都是水果”

倒排索引就是按照每个词条来建立索引，每个词条指向包含这个词条的文档。比如我们建立的倒排索引可能如下：

词条	出现的文档
苹果	文档1, 文档3
水果	文档1, 文档2, 文档3
香蕉	文档2
也是	文档2
橘子	文档3
和	文档3
都是	文档3

可以看到，每个词条后面都跟着包含这个词条的文档列表。这样，当我们想要查找包含某个词条的所有文档时，只需要查找这个词条在倒排索引中的记录即可，大大提高了检索的效率。

倒排索引是搜索引擎等信息检索系统中的重要数据结构，它使得我们能够快速地在大量文档中查找包含特定词条的文档。

总结，正排索引的关键在于"从文档到词条"的映射，而倒排索引的关键在于"从词条到文档"的映射。在实际的信息检索系统中，正排索引和倒排索引往往会结合使用，以实现更高效的搜索和检索。

代码实现

类型定义

倒排项的定义如下:

struct InvertTerm {
    InvertTerm(const string& docid, int freqs, int loc)
        : docid_(docid)
        , freqs_(freqs) {
        locations_.emplace_back(loc);
    }
    
    bool operator==(const InvertTerm& term) const {
        return docid_ == term.docid_;
    }
    
    bool operator<(const InvertTerm& term) const {
        return docid_ < term.docid_;
    }
    
    string    docid_;     // 单词所在的文档
    int       freqs_;     // 单词出现的次数
    list<int> locations_; // 单词出现的位置
};

倒排列表的定义如下：

class InvertList {
public:
    InvertList()  = default;
    ~InvertList() = default;
    
public:
    // 添加倒排项
    void addTerm(const string& docid, int loc);
    // 获取倒排列表内容
    const list<InvertTerm>& getInvertList() const {
        return termList_;
    }
    
private:
    list<InvertTerm> termList_; // 存储当前倒排列表所有的倒排项
}

倒排索引的定义如下：

class InvertIndex {
public:
    // 查询接口
    void query(const string& phrase);
    // 设置本地搜索路径
    void setSearchPath(const string& path);
    // 添加文件后缀
    void addSuffix(const string& suffix);
    // 删除文件后缀
    void removeSuffix(const string& suffix);
    // 清空文件后缀
    void clearSuffix();
    // 后缀是否已经存在
    bool isSuffixExist(const string& suffix);
    
private:
    // 创建倒排索引结构
    void createInvertedIndex();
    // 递归扫描路径下的所有文件
    int getAllFile(const char* path);
    // 去掉单词前后多余的字符
    char* trim(char* word);
    
private:
    vector<string>                    suffixs_;   // 过滤文档后缀
    list<string>                      fileList_;  // 存储所有需要建立倒排索引的文件
    unordered_map<string, InvertList> invertMap_; // 词典+倒排列表
};

创建倒排索引

创建倒排索引的相关代码如下：

void InvertIndex::createInvertedIndex() {
    int curidx = 1;
    for (string& filePath : fileList_) {
        cout << "Current progress: " << curidx++ << " / " << fileList_.size();
        FILE* pf = fopen(filePath.c_str(), "r");
        if (pf == nullptr) {
            cout << " [Failed to open " << filePath << "!]" << endl;
            continue;
        }
        
        // 按行读取文件中的内容 按照空格进行分词
        vector<string> wordList;
        int            loc             = 0;
        const int      LINE_SIZE       = 2048;
        char           line[LINE_SIZE] = { 0 };
        
        while (!feof(pf)) {
            // 读一行文件内容
            fgets(line, LINE_SIZE, pf);
            
            // 按照空格进行分词
            char* word = strtok(line, " ");
            while (word != nullptr) {
                // 过滤前后多余的空格和换行符
                word = trim(word);
                if (strlen(word) > 0) {
                    wordList.emplace_back(word);
                }
                word = strtok(nullptr, " ");
            }
            
	        // 开始给单词创建倒排列表
            for (string& w : wordList) {
                loc++;
                auto it = invertMap_.find(w);
                if (it == invertMap_.end()) {
                    // 新建该单词的倒排列表
                    InvertList list;
                    list.addTerm(filePath, loc);
                    invertMap_.emplace(w, list);
                } else {
                    it->second.addTerm(filePath, loc);
                }
            }
        }
        
        fclose(pf);
        cout << endl;
    }
}

关键词搜索

关键词搜索相关的代码如下：

void InvertIndex::query(const string& phrase) {
    // 先进行句子分词操作
    vector<string> wordList;
    char*          word = strtok(const_cast<char*>(phrase.c_str()), " ");
    while (word != nullptr) {
        word = trim(word);
        if (strlen(word) > 0) {
            wordList.emplace_back(word);
        }
        word = strtok(nullptr, " ");
    }
    
    if (wordList.empty())
        return;
        
    if (wordList.size() == 1) {
        auto it = invertMap_.find(wordList[0]);
        if (it == invertMap_.end()) {
            cout << "No match was found!" << endl;
            return;
        }
        
        for (auto& term : it->second.getInvertList()) {
            cout << "id: " << term.docid_ << "  freq: " << term.freqs_ << endl;
        }
    } else {
        // 搜索多个单词
        vector<InvertList> invertList;
        for (int i = 0; i < wordList.size(); ++i) {
            auto it = invertMap_.find(wordList[i]);
            if (it != invertMap_.end()) {
                invertList.emplace_back(it->second);
            }
        }
        
        // 开始遍历所有的倒排列表 求倒排项的交集
        vector<InvertTerm> v1(invertList[0].getInvertList().begin(),
            invertList[0].getInvertList().end());
        vector<InvertTerm> termShared;
        for (int i = 1; i < invertList.size(); i++) {
            vector<InvertTerm> v2(invertList[i].getInvertList().begin(),
                invertList[i].getInvertList().end());
                
            // 使用set_intersection时要保证数据有序
            sort(v1.begin(), v1.end());
            sort(v2.begin(), v2.end());
            set_intersection(v1.begin(), v1.end(),
                v2.begin(), v2.end(),
                back_inserter(termShared));
                
            v1.swap(termShared);
            termShared.clear();
        }
        
        for (auto& term : v1) {
            cout << "id: " << term.docid_ << "  freq: " << term.freqs_ << endl;
        }
    }
}