NLP Notes - Word Embedding // bag of words

Audience: NLP beginners who write Python

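You may have heard of word embedding before ever working on an NLP project. One of the reasons deep learning is so popular is that it needs relatively little up-front data cleaning. In (deep) semantic matching, embedding is the step that comes right before the deep learning itself.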

Concepts

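Semantic matching: matching by meaning, i.e. deciding whether two (or more) sentences say the same thing. For example, "I want to get started with NLP." and "How do I learn NLP techniques?" can be treated as meaning the same thing, so the two sentences match. Traditional term matching only compares literal words and would not match "get started" with "learn". Add a third sentence, "What deep models are there for NLP?", which clearly means something different from the first two, and the match fails. Semantic matching is widely used in search engines and on Q&A sites such as Zhihu: you ask "How do I learn NLP techniques?", while "I want to get started with NLP." has already been answered and stored in the knowledge base. The machine's job is to match your question to the already answered one and return the corresponding answer.
Dictionary: much like the Xinhua Dictionary for people, a machine needs a dictionary to make sense of text. Each word maps to an index, which is usually a consecutive integer.
Corpus (plural: corpora): where the dictionary comes from, namely a large body of text. A corpus could be everything Shakespeare wrote, all Wikipedia articles, or the tweets of one particular person.
Word/sentence/text embedding: do not be misled by the everyday meaning of "embed" (嵌入 in Chinese). Embedding is a mathematical term for a mapping. As an analogy, a Chinese-English dictionary maps the Chinese word "钞票" to the English word "money". The technique maps the words or phrases of a vocabulary to vectors of real numbers. In the simplest case a word is just mapped to its index number, since computers can ultimately only work with numbers.
TF-IDF (term frequency–inverse document frequency): TF stands for term frequency and IDF for inverse document frequency. TF-IDF is a statistical measure of how important a word is to one document within a collection or corpus. A word's importance increases with the number of times it appears in the document and decreases with how frequently it appears across the corpus.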
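One common textbook form of the TF-IDF weighting can be written as below. This is only a sketch: the symbols $N$, $\mathrm{tf}$ and $\mathrm{df}$ are notation introduced here, and libraries differ in the logarithm base, smoothing terms, and whether the resulting vector is normalized.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}
```

Here $\mathrm{tf}(t, d)$ is the number of times term $t$ occurs in document $d$, $\mathrm{df}(t)$ is the number of documents that contain $t$, and $N$ is the total number of documents. A term that occurs in every document gets $\log 1 = 0$, which is how very common words are pushed toward zero weight.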

A small example

The getting-started post mentioned the Python package gensim. This post walks through how to use it. First run pip install gensim, then open python3; install any other missing packages yourself. (Link to the Jupyter version)

Input:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

import os
import tempfile
TEMP_FOLDER = tempfile.gettempdir()
print('Folder "{}" will be used to save temporary dictionary and corpus.'.format(TEMP_FOLDER))

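Below is a mini corpus of 9 string documents, each containing one sentence. A corpus is a collection of documents. This collection is the input to gensim, which infers its structure, topics, and so on from it. The latent structure inferred from the corpus can then be used to assign a topic to a new document.

Corpus input: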

from gensim import corpora

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",              
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

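First, some preprocessing:

Tokenize the text (tokenization)
Remove common words / stop words (such as for, a, of, the, ...)
Remove words that appear only once (to avoid excessive sparsity)

Input: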

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1] for text in texts]

from pprint import pprint  # pretty-printer
pprint(texts)

Output:

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

Preprocessing can be done in countless ways; the above is just one example. Next, build a dictionary from the words that remain. Input:

dictionary = corpora.Dictionary(texts)
dictionary.save('deerwester.dict')  # store the dictionary, for future reference
print(dictionary)

Output:

Dictionary(12 unique tokens: ['human', 'interface', 'computer', 'survey', 'user']...)

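The dictionary built from the corpus contains 12 distinct words. This means every document in the corpus, i.e. every sentence, can be represented as a 12-dimensional sparse vector.

Input:

print(dictionary.token2id)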

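The output is the dictionary mapping, in which every word in the corpus is associated with a unique id. What matters is that words and ids correspond one to one; the actual id numbers may differ when you run it yourself:

{'human': 0, 'interface': 1, 'computer': 2, 'survey': 3, 'user': 4, 'system': 5, 'response': 6, 'time': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}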

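To infer the latent structure of documents, we need a representation of them that can be handled mathematically. One approach is to express each document as a vector. There are many ways to do this; a common one is the bag-of-words model. In the bag-of-words model, each document (here, each sentence string) is represented as a vector holding the number of times each dictionary word occurs in it. A key property of the model is that it completely ignores the order in which words appear in the sentence, which is where the name "bag of words" comes from.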
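The order-insensitivity described above can be checked with a few lines of plain Python. This is a minimal sketch that does not use gensim: the helper `bag_of_words` and the three-word `token2id` dictionary are made up for illustration.

```python
from collections import Counter


def bag_of_words(sentence, token2id):
    """Return a sorted list of (token_id, count) pairs for known tokens."""
    counts = Counter(
        token2id[token]
        for token in sentence.lower().split()
        if token in token2id  # words outside the dictionary are dropped
    )
    return sorted(counts.items())


# Hypothetical three-word dictionary, for illustration only.
token2id = {"human": 0, "computer": 1, "system": 2}

a = bag_of_words("human computer system system", token2id)
b = bag_of_words("system computer system human", token2id)
print(a)       # [(0, 1), (1, 1), (2, 2)]
print(a == b)  # True: same words, different order, same bag
```

Two sentences with the same words in a different order produce identical vectors, which is exactly the information the model throws away.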

Bag-of-words example. Input:

new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)  # the word "interaction" does not appear in the dictionary and is ignored

Output:

[(0, 1), (2, 1)]

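The new sample is a new sentence (note that it does not appear in the original corpus): "Human computer interaction"

In the tuples produced by doc2bow(), the left element is the word id and the right element is the number of times that word occurs in the sample. The result is a sparse vector of the form [(word_id, word_count), ...], which is the bag of words.

"Human" and "computer" appear in the corpus and therefore in the dictionary, with ids 0 and 2. Each occurs once in the new sample, so each has a count of 1; (0, 1) and (2, 1) thus stand for "Human" and "computer". "interaction" is not in the dictionary, so it does not appear in the sparse vector. Words that are in the dictionary but occur zero times in the new sentence are omitted as well, which means the right-hand number in each pair is never less than 1.

So the 12-dimensional vector for the new sentence is, in sparse form, [(0, 1), (2, 1)]. If you do not need the counts, try the doc2idx function, which returns the ids in the order the words appear in the sentence.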
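To make the link between the sparse pairs and the 12-dimensional vector concrete, here is a small sketch that expands one into the other. The helper `to_dense` is hypothetical and written in plain Python for illustration; it is not part of gensim.

```python
def to_dense(sparse_vec, num_terms):
    """Expand [(token_id, count), ...] into a list of length num_terms."""
    dense = [0] * num_terms
    for token_id, count in sparse_vec:
        dense[token_id] = count
    return dense


new_vec = [(0, 1), (2, 1)]  # the bag-of-words for "Human computer interaction"
dense_vec = to_dense(new_vec, num_terms=12)
print(dense_vec)
# [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Ten of the twelve entries are zero, which is why storing only the non-zero (id, count) pairs is the more economical format.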

Convert every sentence in the corpus into a sparse vector. Input:

corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('deerwester.mm', corpus)  # store to disk, for later use
for vector in corpus:
    print(vector)  # print one sparse vector per line

Output:

[(0, 1), (1, 1), (2, 1)]
[(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(1, 1), (4, 1), (5, 1), (8, 1)]
[(0, 1), (5, 2), (8, 1)]
[(4, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(3, 1), (10, 1), (11, 1)]

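A larger example

The corpus in the previous example was tiny, but in practice a corpus can hold millions or even hundreds of millions of documents (think of how thick the Xinhua Dictionary is). Keeping all of it in RAM is impractical. Suppose the documents are stored in a file on disk, one document per line; gensim can then return the sparse vector of one sentence at a time.

So the essence of the larger example is simply to process one document at a time. Click here to download the sample 'mycorpus.txt'.

Input: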

class MyCorpus:
    def __iter__(self):
        # assume there's one document per line, tokens separated by whitespace
        with open('mycorpus.txt') as f:  # the file is closed once iteration ends
            for line in f:
                yield dictionary.doc2bow(line.lower().split())

corpus_memory_friendly = MyCorpus()  # doesn't load the corpus into memory!
print(corpus_memory_friendly)
# <__main__.MyCorpus object at 0x10d5690>

Input:

for vector in corpus_memory_friendly:  # load one vector into memory at a time
    print(vector)

Output:

[(0, 1), (1, 1), (2, 1)]
[(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(1, 1), (4, 1), (5, 1), (8, 1)]
[(0, 1), (5, 2), (8, 1)]
[(4, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(3, 1), (10, 1), (11, 1)]

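The result looks the same as in the small example, but the process is much friendlier to memory. You can now grow the corpus as much as you like.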
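The laziness is easier to see in a gensim-free sketch. Everything below is illustrative: `stream_bows`, the two-line temporary file, and the two-word dictionary are invented for this example and are not the tutorial's 'mycorpus.txt'.

```python
import os
import tempfile


def stream_bows(path, token2id):
    """Yield one sparse bag-of-words vector per line of the file at `path`."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:  # the file is read lazily, one line at a time
            counts = {}
            for token in line.lower().split():
                if token in token2id:
                    token_id = token2id[token]
                    counts[token_id] = counts.get(token_id, 0) + 1
            yield sorted(counts.items())


# Hypothetical two-line corpus and dictionary, for illustration only.
token2id = {"graph": 0, "trees": 1}
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Graph of trees\nTrees and more trees\n")
    corpus_path = tmp.name

vectors = stream_bows(corpus_path, token2id)  # nothing has been read yet
first = next(vectors)                         # reads only the first line
rest = list(vectors)                          # reads the remaining lines
os.remove(corpus_path)

print(first)  # [(0, 1), (1, 1)]
print(rest)   # [[(1, 2)]]
```

One design difference is worth noting: a generator function like this can be consumed only once, whereas a class with an `__iter__` method reopens the file on every loop and can therefore be iterated repeatedly, which matters for models that make several passes over the corpus.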

Next, build the dictionary, again without loading all the text into memory at once. Input:

>>> # collect statistics about all tokens
>>> dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
>>> # remove stop words and words that appear only once
>>> stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
...             if stopword in dictionary.token2id]
>>> # dictionary.dfs maps token id -> number of documents containing the token
>>> once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]
>>> dictionary.filter_tokens(stop_ids + once_ids)  # remove stop words and words that appear only once
>>> dictionary.compactify()  # remove gaps in id sequence after words that were removed
>>> print(dictionary)
Dictionary(12 unique tokens)
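The idea behind filtering tokens and then closing the gaps in the id sequence can be sketched in plain Python. This illustrates the concept only and is not gensim's implementation; `filter_and_compact` and the four-word dictionary are hypothetical.

```python
def filter_and_compact(token2id, bad_ids):
    """Drop the given ids, then renumber the survivors as 0, 1, 2, ..."""
    kept = [(tok, tid) for tok, tid in token2id.items() if tid not in bad_ids]
    kept.sort(key=lambda pair: pair[1])  # keep the original relative order
    return {tok: new_id for new_id, (tok, _) in enumerate(kept)}


# Hypothetical dictionary: ids 0 and 2 belong to stop words.
token2id = {"the": 0, "graph": 1, "of": 2, "trees": 3}
compact = filter_and_compact(token2id, bad_ids={0, 2})
print(compact)  # {'graph': 0, 'trees': 1}
```

Without the renumbering step the surviving ids would be 1 and 3, so vectors would need more dimensions than there are words left in the dictionary.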

Transformation

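Now that the corpus has been vectorized, we can apply various vector transformations to it. A transformation converts a document from one vector representation into another. In gensim, documents are represented as vectors, so a model can be thought of as a transformation between two vector spaces. The transformation is learned from the training corpus.

One of the simpler ones is TF-IDF. It maps bag-of-words vectors into another vector space in which term frequencies are weighted by the relative rarity of each word in the corpus.

Let's look at a simple example: initialize a tf-idf model, train it on our corpus, then transform the string "system minors". (Reference)

Input: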

from gensim import models
tfidf = models.TfidfModel(corpus)  # corpus: the bag-of-words vectors built in the small example
string = "system minors"
string_bow = dictionary.doc2bow(string.lower().split())
string_tfidf = tfidf[string_bow]
print(string_bow)
print(string_tfidf)

Output:

[(5, 1), (11, 1)]
[(5, 0.5898341626740045), (11, 0.8075244024440723)]

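TF-IDF returns a list of tuples. The first element of each tuple is the id and the second is the tf-idf weight. Note that "system" occurs 4 times in the original corpus (in 3 of the 9 documents) while "minors" occurs only 2 times (in 2 documents), so the first weight is smaller than the second.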
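The two weights can be reproduced by hand. The sketch below assumes what I understand to be gensim's default TfidfModel settings (raw term frequency, a base-2 logarithm for the inverse document frequency, and scaling to unit Euclidean length); the original text does not state these, so treat them as assumptions. The document frequencies are counted from the nine-sentence mini corpus.

```python
import math

num_docs = 9                    # the mini corpus has nine documents
doc_freq = {5: 3, 11: 2}        # "system" is in 3 documents, "minors" in 2
string_bow = [(5, 1), (11, 1)]  # bag-of-words for "system minors"

# Assumed weighting: tf * log2(N / df), then scale to unit Euclidean length.
raw = [(tid, tf * math.log2(num_docs / doc_freq[tid])) for tid, tf in string_bow]
norm = math.sqrt(sum(weight * weight for _, weight in raw))
weights = [(tid, weight / norm) for tid, weight in raw]

for tid, weight in weights:
    print(tid, round(weight, 4))
# 5 0.5898
# 11 0.8075
```

Because the weight depends on how many documents contain a word rather than on its total count, "system" is penalized for appearing in three documents, not for its four occurrences.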

There are several other transformations as well; the code for them is on the linked page.

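Closing notes

There are many techniques related to word embedding, and gensim has plenty of other useful features such as word2vec and doc2vec. This post is only meant as a starting point with a small example. If running through it gives you a rough feel for what word embedding is about, it has done its job~😎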