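NLP in Practice - Quora Question-Pair Semantic Matching with SimNet

Quora Question Pairs is a Kaggle competition on semantic matching between question pairs. NLP practitioners will likely know it already, and its dataset is a staple for anyone getting started in NLP. For deep semantic matching, this post uses SimNet, part of AnyQ, the semantic matching framework open-sourced by Baidu.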

Environment

  • Linux
  • Python 2.7
  • TensorFlow 1.7.0
  • CPU
  • Jupyter Notebook

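Download AnyQ

You need git first; install it if you do not have it yet. On Linux, change to the directory of your choice and run git clone https://github.com/baidu/AnyQ to download AnyQ.

Once the download finishes, SimNet is located at AnyQ/tools/simnet/train/tf/

SimNet is laid out as follows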

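simnet
  |-tf
    |- data // sample data
    |- examples // sample configuration files
    |- layers // implementations of the layers used by the networks
    |- losses // loss function implementations
    |- nets // network architecture implementations
    |- tools // data conversion and evaluation tools
    |- utils // utility classes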

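❤ I have also written a separate code walkthrough of SimNet. I strongly recommend opening it side by side with this post.

To view or edit code on Linux, I recommend Jupyter Notebook.

Other notes:

  • The folders where model files are saved have to be created by hand. Create new model and pointwise folders in this directory: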

    simnet
      |-tf
        |- data
        |- examples
        |- layers
        |- losses
        |- nets
        |- tools
        |- utils

        # create the folders below
        |- model
        |- pointwise

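Download the dataset

Download the dataset from the Quora Question Pairs competition page. This post only uses the training set, so train.csv is enough. train.csv cannot be fed to SimNet directly; it first needs preprocessing such as word embedding.

Word embedding

SimNet expects its training data in a specific format; see SimNet's README.md for details. This preprocessing step can also be done on Windows. Since the Quora training set consists of two question columns plus a label column, it fits SimNet's pointwise data format.

  • pointwise data format: three columns, namely the ID sequence of Query1 (IDs separated by spaces), the ID sequence of Query2 (IDs separated by spaces), and the Label, with columns separated by TAB, for example: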
1 1 1 1 1   2 2 2 2 2   0
1 1 1 1 1   1 1 1 1 1   1
...
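As a small illustration of the pointwise layout described above, here is a sketch that builds and parses one such line. The helper names format_pointwise and parse_pointwise are hypothetical and are not part of SimNet.

```python
def format_pointwise(query1_ids, query2_ids, label):
    """Join two ID sequences and a label into one TAB-separated line."""
    left = " ".join(str(i) for i in query1_ids)
    right = " ".join(str(i) for i in query2_ids)
    return "\t".join([left, right, str(label)])


def parse_pointwise(line):
    """Split a pointwise line back into (query1_ids, query2_ids, label)."""
    left, right, label = line.rstrip("\n").split("\t")
    return [int(i) for i in left.split()], [int(i) for i in right.split()], int(label)


line = format_pointwise([1, 1, 1, 1, 1], [2, 2, 2, 2, 2], 0)
print(repr(line))
print(parse_pointwise(line))
```

The round trip makes the two separators explicit: spaces inside a column, TAB between columns.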

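The pointwise format requires every question as a sequence of IDs, so for word embedding I chose the bag-of-words (BOW) approach.

Create a new file, quora.py, for the word embedding step. It first handles null questions, then previews 10 question pairs.

Input: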

# -*- coding: utf-8 -*-
"""
Created on Wed Aug 29 16:50:59 2018

@author: Yi
"""

import os
os.chdir("C:/Users/Yi/Desktop/nlp/quora") # directory containing quora.py

import pandas as pd
import numpy as np

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import re
from string import punctuation

train = pd.read_csv("data/train.csv") # 404290 question pairs in total
#test = pd.read_csv("data/test.csv")

# Check for any null values
print(train.isnull().sum())
#print(test.isnull().sum())

# Add the string 'empty' to empty strings
train = train.fillna('empty')
#test = test.fillna('empty')

# Preview some of the pairs of questions
a = 0
for i in range(a,a+10):
    print(train.question1[i])
    print(train.question2[i])
    print()

Output:

What is the step by step guide to invest in share market in india?
What is the step by step guide to invest in share market?

What is the story of Kohinoor (Koh-i-Noor) Diamond?
What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?

How can I increase the speed of my internet connection while using a VPN?
How can Internet speed be increased by hacking through DNS?

Why am I mentally very lonely? How can I solve it?
Find the remainder when [math]23^{24}[/math] is divided by 24,23?

Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?
Which fish would survive in salt water?

Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?
I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?

Should I buy tiago?
What keeps childern active and far from phone and video games?

How can I be a good geologist?
What should I do to be a great geologist?

When do you use シ instead of し?
When do you use "&" instead of "and"?

Motorola (company): Can I hack my Charter Motorolla DCX3400?
How do I hack Motorola DCX3400 for free internet?

Next, continue cleaning the data: synonym replacement, stop-word handling, removing prepositions, and so on. Input:

stop_words = ['the','a','an','and','but','if','or','because','as','what','which','this','that','these','those','then',
'just','so','than','such','both','through','about','for','is','of','while','during','to','What','Which',
'Is','If','While','This']

def text_to_wordlist(text, remove_stop_words=True, stem_words=False):
    # Clean the text, with the option to remove stop_words and to stem words.

    # Clean the text
    text = re.sub(r"[^A-Za-z0-9]", " ", text)
    text = re.sub(r"what's", "", text)
    text = re.sub(r"What's", "", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"I'm", "I am", text)
    text = re.sub(r" m ", " am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r"60k", " 60000 ", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r" b g ", " bg ", text)
    text = re.sub(r"\0s", "0", text)
    text = re.sub(r" 9 11 ", "911", text)
    text = re.sub(r"e-mail", "email", text)
    text = re.sub(r"\s{2,}", " ", text)
    text = re.sub(r"quikly", "quickly", text)
    text = re.sub(r" usa ", " America ", text)
    text = re.sub(r" USA ", " America ", text)
    text = re.sub(r" u s ", " America ", text)
    text = re.sub(r" uk ", " England ", text)
    text = re.sub(r" UK ", " England ", text)
    text = re.sub(r"india", "India", text)
    text = re.sub(r"switzerland", "Switzerland", text)
    text = re.sub(r"china", "China", text)
    text = re.sub(r"chinese", "Chinese", text)
    text = re.sub(r"imrovement", "improvement", text)
    text = re.sub(r"intially", "initially", text)
    text = re.sub(r"quora", "Quora", text)
    text = re.sub(r" dms ", "direct messages ", text)
    text = re.sub(r"demonitization", "demonetization", text)
    text = re.sub(r"actived", "active", text)
    text = re.sub(r"kms", " kilometers ", text)
    text = re.sub(r"KMs", " kilometers ", text)
    text = re.sub(r" cs ", " computer science ", text)
    text = re.sub(r" upvotes ", " up votes ", text)
    text = re.sub(r" iPhone ", " phone ", text)
    text = re.sub(r"\0rs ", " rs ", text)
    text = re.sub(r"calender", "calendar", text)
    text = re.sub(r"ios", "operating system", text)
    text = re.sub(r"gps", "GPS", text)
    text = re.sub(r"gst", "GST", text)
    text = re.sub(r"programing", "programming", text)
    text = re.sub(r"bestfriend", "best friend", text)
    text = re.sub(r"dna", "DNA", text)
    text = re.sub(r"III", "3", text)
    text = re.sub(r"the US", "America", text)
    text = re.sub(r"Astrology", "astrology", text)
    text = re.sub(r"Method", "method", text)
    text = re.sub(r"Find", "find", text)
    text = re.sub(r"banglore", "Banglore", text)
    text = re.sub(r" J K ", " JK ", text)


    # Remove punctuation from text
    text = ''.join([c for c in text if c not in punctuation])

    # Optionally, remove stop words
    if remove_stop_words:
        text = text.split()
        text = [w for w in text if not w in stop_words]
        text = " ".join(text)

    # Optionally, shorten words to their stems
    if stem_words:
        text = text.split()
        stemmer = SnowballStemmer('english')
        stemmed_words = [stemmer.stem(word) for word in text]
        text = " ".join(stemmed_words)

    # Return the cleaned text as a single space-separated string
    return text

def process_questions(question_list, questions, question_list_name, dataframe):
    '''transform questions and display progress'''
    for question in questions:
        question_list.append(text_to_wordlist(question))
        if len(question_list) % 100000 == 0:
            # 100.0 forces float division, so this also works on Python 2
            progress = 100.0 * len(question_list) / len(dataframe)
            print("{} is {}% complete.".format(question_list_name, round(progress, 1)))

train_question1 = []
process_questions(train_question1, train.question1, 'train_question1', train)

train_question2 = []
process_questions(train_question2, train.question2, 'train_question2', train)

#test_question1 = []
#process_questions(test_question1, test.question1, 'test_question1', test)
#
#test_question2 = []
#process_questions(test_question2, test.question2, 'test_question2', test)


# Preview some transformed pairs of questions
a = 0
for i in range(a,a+10):
    print(train_question1[i])
    print(train_question2[i])
    print()

Output, the 10 question pairs after processing:

step by step guide invest in share market in India
step by step guide invest in share market

story Kohinoor Koh i Noor Diamond
would happen Indian government stole Kohinoor Koh i Noor diamond back

How can I increase speed my internet connection using VPN
How can Internet speed be increased by hacking DNS

Why am I mentally very lonely How can I solve it
find remainder when math 23 24 math divided by 24 23

one dissolve in water quickly sugar salt methane carbon di oxide
fish would survive in salt water

astrology I am Capricorn Sun Cap moon cap rising does say me
I am triple Capricorn Sun Moon ascendant in Capricorn does say me

Should I buy tiago
keeps childern active far from phone video games

How can I be good geologist
should I do be great geologist

When do you use instead
When do you use instead

Motorola company Can I hack my Charter Motorolla DCX3400
How do I hack Motorola DCX3400 free internet
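The output above contains fragments such as "Koh i Noor", which suggests how the rule order in the cleaning function plays out: the first substitution turns every non-alphanumeric character into a space, so later patterns that contain an apostrophe or a hyphen have nothing left to match. The sketch below contrasts the two orders on a made-up sentence. Both function names are hypothetical, and expand_then_strip is only an alternative ordering for comparison, not what the post's code does.

```python
import re


def strip_then_expand(text):
    """Same order as the cleaning function: strip symbols first."""
    text = re.sub(r"[^A-Za-z0-9]", " ", text)  # apostrophes become spaces here
    text = re.sub(r"can't", "cannot ", text)   # can no longer match
    text = re.sub(r" m ", " am ", text)        # catches the leftover "I m"
    return re.sub(r"\s{2,}", " ", text).strip()


def expand_then_strip(text):
    """Alternative order: expand contractions before stripping symbols."""
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"I'm", "I am", text)
    text = re.sub(r"[^A-Za-z0-9]", " ", text)
    return re.sub(r"\s{2,}", " ", text).strip()


sentence = "I'm sure I can't"
print(strip_then_expand(sentence))  # I am sure I can t
print(expand_then_strip(sentence))  # I am sure I cannot
```

With the first order, "I'm" is still recovered through the " m " rule, while "can't" is left as "can t".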

Next, remove the custom stop words, drop words that appear only once, and merge question 1 and question 2 into one large corpus. Input:

import itertools
raw_corpus = list(itertools.chain.from_iterable([train_question1,train_question2]))
#[train_question1,train_question2]

stoplist = stop_words
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in raw_corpus]

from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

precessed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]
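To see what the frequency filter does, here is a minimal sketch on a made-up three-document corpus. It uses collections.Counter instead of defaultdict, and drop_rare_tokens is a hypothetical helper name.

```python
from collections import Counter


def drop_rare_tokens(texts, min_count=2):
    """Keep only tokens that occur at least `min_count` times in the whole corpus."""
    frequency = Counter(token for text in texts for token in text)
    return [[token for token in text if frequency[token] >= min_count]
            for text in texts]


toy_corpus = [
    ["invest", "share", "market", "india"],
    ["invest", "share", "market"],
    ["kohinoor"],
]
print(drop_rare_tokens(toy_corpus))
# [['invest', 'share', 'market'], ['invest', 'share', 'market'], []]
```

Note that the last document ends up empty. The same can happen to real questions made up entirely of rare words, which is why empty rows need to be handled before the data is written out.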

Build the dictionary. Input:

from gensim import corpora
dictionary = corpora.Dictionary(precessed_corpus)
print(dictionary)

print(dictionary.token2id)

Output, the dictionary:

Dictionary(52069 unique tokens: ['vicodin', 'mermaid', 'kgb', 'dusk', 'glonass']...)

Try the dictionary on a new piece of text. Input:

new_doc = "would happen Indian government stole Kohinoor Koh i Noor diamond back"
new_vec = dictionary.doc2bow(new_doc.lower().split())
#dictionary.doc2idx(new_doc.lower().split())
print(new_vec)
# In each tuple, the first element is the word's ID in the dictionary and the second is how many times that word occurs in this sentence.

Output:

[(8, 1),
(9, 1),
(10, 1),
(11, 1),
(12, 1),
(107, 1),
(186, 1),
(226, 1),
(416, 1),
(828, 1),
(4496, 1)]

That looks fine, so replace the words in every Quora question pair with their dictionary IDs. Input:

bow_corpus = [dictionary.doc2idx(text) for text in precessed_corpus]

bow_corpus_plus_1 = [[i+1 for i in bow_corpu] for bow_corpu in bow_corpus]
bow_corpus_str = [[str(i) for i in bow_corpu_plus] for bow_corpu_plus in bow_corpus_plus_1]
bow_corpus_join = [' '.join(bow_corpus_) for bow_corpus_ in bow_corpus_str]
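The two gensim calls used in this post return different shapes: doc2bow gives (id, count) pairs sorted by ID, while doc2idx gives one ID per token in sentence order, which is what the pointwise format needs. The sketch below imitates both in plain Python so that it runs without gensim. The token2id mapping is made up, and to_bow and to_idx are hypothetical stand-ins rather than gensim functions. The post does not say why the IDs are shifted by one; my assumption is that it keeps 0 free, for example as a padding ID.

```python
from collections import Counter

# Hypothetical token -> id mapping; a real one would come from dictionary.token2id.
token2id = {"would": 8, "happen": 9, "indian": 10}


def to_bow(tokens, token2id):
    """(id, count) pairs sorted by id; unknown tokens are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


def to_idx(tokens, token2id, unknown=-1):
    """One id per token, in sentence order; unknown tokens map to `unknown`."""
    return [token2id.get(t, unknown) for t in tokens]


tokens = "would indian would".split()
print(to_bow(tokens, token2id))           # [(8, 2), (10, 1)]
ids = to_idx(tokens, token2id)
print(ids)                                # [8, 10, 8]
print(" ".join(str(i + 1) for i in ids))  # 9 11 9
```

The last line mirrors the shift-by-one and join steps, producing the space-separated ID string that goes into one column of the pointwise file.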

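Because the corpus is question 1 followed by question 2 in order, the first half of the ID-encoded corpus is the bag-of-words form of question 1 and the second half is that of question 2. The last step restores the data to the pointwise format. Input: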

# Build the output table
pointwise_train = pd.DataFrame(bow_corpus_join[:404290], columns = ['question1'])
pointwise_train['question2'] = bow_corpus_join[404290:]
pointwise_train['is_duplicate'] = train['is_duplicate']

# Drop rows where either question ended up empty
pointwise_train = pointwise_train[[len(i)>0 for i in pointwise_train['question1']]]
pointwise_train = pointwise_train[[len(i)>0 for i in pointwise_train['question2']]]

# Split into training and test sets
size = int(round(len(pointwise_train)*0.8)) # 8:2 ratio; int() keeps the slice index valid on Python 2

# Write the data files in tsv format
pointwise_train[:size].to_csv('data/train_0829.tsv',sep = '\t', index=False, header=False)
pointwise_train[size:].to_csv('data/test_0829.tsv',sep = '\t', index=False, header=False)

This produces train_0829.tsv and test_0829.tsv in tsv format; you can name them however you like.
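Before moving the files over, it can be worth checking that every row really has three TAB-separated columns, non-empty numeric ID sequences, and a 0/1 label. The following is a minimal sketch of such a check. check_pointwise_file is a hypothetical helper, and the sample data is made up to trigger two of the problems.

```python
import io


def check_pointwise_file(handle):
    """Return (row_count, problems) for a stream of pointwise TSV lines."""
    problems = []
    rows = 0
    for line_no, line in enumerate(handle, start=1):
        rows += 1
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            problems.append((line_no, "expected 3 tab-separated columns"))
            continue
        query1, query2, label = fields
        if not query1.split() or not query2.split():
            problems.append((line_no, "empty ID sequence"))
        elif not all(tok.isdigit() for tok in query1.split() + query2.split()):
            problems.append((line_no, "non-numeric ID"))
        if label not in ("0", "1"):
            problems.append((line_no, "label is not 0 or 1"))
    return rows, problems


sample = io.StringIO(u"1 2 3\t4 5\t0\n6 7\t\t1\n8 9\t10 11\n")
rows, problems = check_pointwise_file(sample)
print(rows)
print(problems)
```

On a real file you would pass open('data/train_0829.tsv') instead of the in-memory sample. The returned row count is also the number to use as data_size in the training configuration.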

Data preparation

Switch to Linux and put the processed tsv datasets under AnyQ/tools/simnet/train/tf/data:

simnet
  |-tf
    |- data // sample data, tsv format, no header row
      |- train_0829.tsv // training set
      |- test_0829.tsv // test set

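Under AnyQ/tools/simnet/train/tf/ (the path shown in the figure below), create a new script file, run_convert_data.sh.

It simply pulls the data conversion commands out of the original run_train.sh. Several models will later be run on the same data, so converting the data once is enough. Its contents are: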

set -e # set -o errexit
set -u # set -o nounset
set -o pipefail

echo "convert train data"
# arguments after "pointwise": training file to convert, then name of the converted output
python ./tools/tf_record_writer.py pointwise ./data/train_0829.tsv ./data/convert_train_0829 0 32
echo "convert test data"
# arguments after "pointwise": test file to convert, then name of the converted output
python ./tools/tf_record_writer.py pointwise ./data/test_0829.tsv ./data/convert_test_0829 0 32
echo "convert data finish"

In a Linux terminal, run ./run_convert_data.sh. If you get a permission denied error, first run chmod 777 on the file, which here is chmod 777 run_convert_data.sh. On success it prints:

convert train data
convert test data
convert data finish

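Two converted data files, convert_train_0829 and convert_test_0829, now appear in the data folder.

Code changes

Modify the configuration files

Use Jupyter Notebook to open every configuration file in the examples folder named like xxx-pointwise.json, and change the following parameter values:

  • data_size = 323273, because train_0829.tsv has 323273 samples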
  • vocabulary_size = 1000000
  • batch_size = 800
  • num_epochs = 1
  • print_iter = 10
  • train_file = data/convert_train_0829
  • test_file = data/convert_test_0829
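Editing seven JSON files by hand is easy to get wrong, so the same changes could also be applied with a short script. The sketch below sets each listed key wherever it appears in a nested structure. The layout of toy_conf is made up for illustration and the real files may nest these keys differently. update_keys is a hypothetical helper, and the value types (number versus string) should match whatever the existing file uses.

```python
import json


def update_keys(node, overrides):
    """Recursively set every key in `node` that also appears in `overrides`."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in overrides:
                node[key] = overrides[key]
            else:
                update_keys(value, overrides)
    elif isinstance(node, list):
        for item in node:
            update_keys(item, overrides)
    return node


# Made-up layout, for illustration only.
toy_conf = {
    "train_data": {"train_file": "data/old_train", "data_size": 400},
    "model": {"vocabulary_size": 3},
    "global": {"batch_size": 64, "num_epochs": 5, "print_iter": 100},
}

overrides = {
    "data_size": 323273,
    "vocabulary_size": 1000000,
    "batch_size": 800,
    "num_epochs": 1,
    "print_iter": 10,
    "train_file": "data/convert_train_0829",
    "test_file": "data/convert_test_0829",
}

print(json.dumps(update_keys(toy_conf, overrides), indent=2, sort_keys=True))
```

A key that is absent from a file, such as test_file in the toy example, is left alone rather than added, so it is worth checking the printed result before writing anything back with json.dump.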

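The above only changes the configuration for model training; the configuration for model evaluation has to be changed separately. Open tf_simnet.py under AnyQ/tools/simnet/train/tf/, find def predict(conf_dict), and locate the following code (it should be around line 90):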

conf_dict.update({"num_epochs": "1", "batch_size": "1",
"shuffle": "0", "train_file": conf_dict["test_file"]})

Change it to:

conf_dict.update({"num_epochs": "1", "batch_size": "400",
"shuffle": "0", "train_file": conf_dict["test_file"]})

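Save and close the file.

Change the model-saving rule

By default the code saves a model at every epoch and one more when training finishes. Because the model files are large, I changed the code to save only the final model; skip this change if you do not need it. Open controler.py in the utils folder and comment out lines 100-104: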

#                if step % epoch_iter == 0:
#                    print("save model epoch%d" % (epoch_num))
#                    save_path = saver.save(sess,
#                                    "%s/%s.epoch%d" % (model_path, model_file, epoch_num))
#                    epoch_num += 1

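Modify the print statements

By default, during training the code prints a loss value every print_iter steps, and during evaluation it prints one accuracy value at the end. To watch iteration speed alongside accuracy, I also want each loss report to show how many seconds that print_iter interval took (skip this change if you do not need it). Open controler.py in the utils folder and change lines 90-99 to the following: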

# assumes `import datetime` is present at the top of controler.py
epoch_num = 1
last_timestamp = datetime.datetime.now() # added
while not coord.should_stop():
    try:
        step += 1
        c, _= sess.run([loss, optimizer])
        avg_cost += c

        if step % print_iter == 0:
            now_timestamp = datetime.datetime.now() # added
            # total_seconds() keeps the fractional part; .seconds would only give whole seconds
            print("step: %d, loss: %4.4f (%4.2f sec/print_iter)" % (step,(avg_cost / print_iter),(now_timestamp-last_timestamp).total_seconds())) # changed
            avg_cost = 0.0
            last_timestamp = now_timestamp # added

Save and close the file.
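When timing each print_iter interval with datetime, note that a timedelta exposes two similar-looking values. The short sketch below, using made-up durations, shows how timedelta.seconds and timedelta.total_seconds() differ.

```python
import datetime

short = datetime.timedelta(seconds=2, milliseconds=750)
print(short.seconds)          # 2     (whole seconds only, the fraction is dropped)
print(short.total_seconds())  # 2.75

long_run = datetime.timedelta(days=1, seconds=5)
print(long_run.seconds)          # 5  (the days component is not included)
print(long_run.total_seconds())  # 86405.0
```

For a format string such as %4.2f, total_seconds() is the value that actually carries two meaningful decimals.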

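Comparing the models

Add .sh files under AnyQ/tools/simnet/train/tf/. SimNet currently offers 7 networks to choose from: bow, cnn, knrm, lstm, mmdnn, mvlstm and pyramid, each matching a file in the nets folder, so every task type gets 7 .sh scripts. The task types are train/predict/freeze, which correspond to model training, model evaluation, and exporting the model result.

Add the .sh files for the training task

Taking cnn as an example, create run_train_cnn.sh with the following contents: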

set -e # set -o errexit
set -u # set -o nounset
set -o pipefail

in_task_type='train'
in_task_conf='./examples/cnn-pointwise.json'
python tf_simnet.py \
--task $in_task_type \
--task_conf $in_task_conf

Add the .sh files for the evaluation task

Taking cnn as an example, create run_predict_cnn.sh with the following contents:

set -e # set -o errexit
set -u # set -o nounset
set -o pipefail

in_task_type='predict'
in_task_conf='./examples/cnn-pointwise.json'
python tf_simnet.py \
--task $in_task_type \
--task_conf $in_task_conf

Add the .sh files for the freeze task

Taking cnn as an example, create run_freeze_cnn.sh with the following contents:

set -e # set -o errexit
set -u # set -o nounset
set -o pipefail

in_task_type='freeze'
in_task_conf='./examples/cnn-pointwise.json'
python tf_simnet.py \
--task $in_task_type \
--task_conf $in_task_conf
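The three templates above differ only in the task name and the network name, so the scripts could also be generated instead of typed out. The sketch below writes them into a temporary directory. It assumes that every network's configuration file follows the same xxx-pointwise.json naming as the cnn example, and write_scripts is a hypothetical helper.

```python
import os
import stat
import tempfile

NETS = ["bow", "cnn", "knrm", "lstm", "mmdnn", "mvlstm", "pyramid"]
TASKS = ["train", "predict", "freeze"]

# Same body as the hand-written scripts, with the task and net left as placeholders.
TEMPLATE = """set -e # set -o errexit
set -u # set -o nounset
set -o pipefail

in_task_type='{task}'
in_task_conf='./examples/{net}-pointwise.json'
python tf_simnet.py \\
    --task $in_task_type \\
    --task_conf $in_task_conf
"""


def write_scripts(target_dir):
    """Write run_<task>_<net>.sh for every combination and mark each executable."""
    paths = []
    for net in NETS:
        for task in TASKS:
            path = os.path.join(target_dir, "run_{}_{}.sh".format(task, net))
            with open(path, "w") as handle:
                handle.write(TEMPLATE.format(task=task, net=net))
            os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
            paths.append(path)
    return paths


demo_dir = tempfile.mkdtemp()
created = write_scripts(demo_dir)
print(len(created))  # 21
```

Pointing write_scripts at AnyQ/tools/simnet/train/tf/ instead of a temporary directory would place the scripts next to tf_simnet.py. Setting the executable bit here also avoids the permission denied error when running them.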

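In the end you have 7 run_train_xxx.sh files, 7 run_predict_xxx.sh files and 7 run_freeze_xxx.sh files, or just the few you choose to create, as shown in the figure below:

Switch back to the Linux command line and start running the commands; the results are shown in the accompanying screenshots.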

bow

  • ./run_train_bow.sh

  • ./run_predict_bow.sh

cnn

  • ./run_train_cnn.sh

  • ./run_predict_cnn.sh

knrm

  • ./run_train_knrm.sh

  • ./run_predict_knrm.sh

lstm

  • ./run_train_lstm.sh

  • ./run_predict_lstm.sh

mmdnn

  • ./run_train_mmdnn.sh

  • ./run_predict_mmdnn.sh

mvlstm

  • ./run_train_mvlstm.sh

  • ./run_predict_mvlstm.sh

pyramid

  • ./run_train_pyramid.sh
  • ./run_predict_pyramid.sh

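Feel free to try the ./run_freeze_xxx.sh commands yourself. Once all the commands above have run, prediction files are created automatically in the original folder, as follows:

Open any one of them to take a look:

These are the automatically saved model files:

A log folder is also created automatically, and running the freeze task creates a graph folder where the task results are saved.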

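Pros and cons

Cons:

  1. SimNet requires a specific data format in which all text is replaced by IDs, so word embedding here uses bag-of-words (BOW) processing.
  2. The text-cleaning rules in the early stage could be refined further.
  3. Well... it's Baidu (shrug 🤷‍♀️)

Pros:

  1. SimNet is an integrated package that offers many deep models to choose from.

Final notes

  • I ran through the SimNet workflow only to compare results, without trying to push accuracy. I later tried setting batch_size to 30, which gives more than 10000 steps and improved accuracy by a few percentage points, but that is still far from enough. There is a long way to go and a lot left to learn.
  • This post does not use the test set. If you were entering the competition, should the words in the test set also be added to the dictionary when building it for word embedding? I have not actually tried this yet; after all, the test set test.csv is frighteningly large!
  • If you have questions, feel free to leave a comment or get in touch with me.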
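As a rough check of the step counts mentioned in these notes, the number of training steps follows from data_size, batch_size and num_epochs. The sketch below assumes the final partial batch is counted as a step; whether SimNet's input pipeline does so is not something I have verified, so the real count may be one lower per epoch. steps_per_run is a hypothetical helper.

```python
import math


def steps_per_run(data_size, batch_size, num_epochs=1):
    """Number of optimizer steps, counting the last partial batch of each epoch."""
    return int(math.ceil(float(data_size) / batch_size)) * num_epochs


print(steps_per_run(323273, 800))  # 405
print(steps_per_run(323273, 30))   # 10776
```

With batch_size = 30 this gives a little over 10000 steps for one epoch, in line with the figure quoted above.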