Kaggle - Bag of Words Meets Bags of Popcorn

When it comes to sentiment analysis, one can't skip the must-do exercise for NLP practitioners: Kaggle's Bag of Words Meets Bags of Popcorn, hereafter the "movie-review task". To be honest, it is a very simple problem: predict the sentiment polarity (positive/negative) of a movie review. The data is English text; download the dataset yourself: 🔗data

Text Cleaning Techniques

1. re: Regular Expressions

Cleaning Chinese text:

Strip every character that is not Chinese. After this step the output contains only Chinese characters, with commas joining the broken-up fragments. It washes away everything non-Chinese — digits, English letters, punctuation, HTML markup, and so on — which is fairly aggressive, but this trick is quite useful on Chinese datasets.

import re

def only_chinese(comment):
    line = comment.strip()  # strip leading/trailing whitespace
    p2 = re.compile(u'[^\u4e00-\u9fa5]')  # the Chinese character range is \u4e00 to \u9fa5
    zh = " ".join(p2.split(line)).strip()
    outStr = ",".join(zh.split())  # join all remaining fragments with commas
    return outStr

comment = " 武林外传的情节设计基本没什么bug!╭(●`∀´●)╯!!\
看了10年都看不腻~送你个网pan链接:\
http://fakewebsite.com"

test = only_chinese(comment)
print(test)

# output
# 武林外传的情节设计基本没什么,看了,年都看不腻,送你个网,链接
Cleaning English text:

Replace every character that is not an English letter with a space. Regular expressions handle this kind of job with ease.

import re

comment = '最喜欢的话是Coding is the new SEXY!'
review_text = re.sub("[^a-zA-Z]", " ", comment)
print(review_text.strip())  # strip the spaces left where non-letters were removed

# Coding is the new SEXY

2. BeautifulSoup: Cleaning HTML and Junk Characters

Review text scraped from the web often comes with HTML markup, which is all noise and needs to be removed. First run "pip install beautifulsoup4".

from bs4 import BeautifulSoup

review = '<br /><br />\"Elvira, Mistress of the Dark\"'
review_text = BeautifulSoup(review, "html.parser").get_text()  # name a parser to avoid a warning
print(review_text)

# "Elvira, Mistress of the Dark"

The cleaned text, "Elvira, Mistress of the Dark", is the genuinely useful content.

With these cleaning techniques covered, let's return to the movie-review task itself and get started.

Step 1: Create a Text-Cleaning Function

The reviews in this task are in English, so first we write a simple function that cleans each review into a usable format: we want only the raw text, not the surrounding HTML or other junk characters.

import re
from bs4 import BeautifulSoup

def review_to_wordlist(review):
    '''
    Meant for converting each of the IMDB reviews into a list of words.
    '''
    # First remove the HTML.
    review_text = BeautifulSoup(review, "html.parser").get_text()

    # Use regular expressions to only include words.
    review_text = re.sub("[^a-zA-Z]", " ", review_text)

    # Convert words to lower case and split them into separate words.
    words = review_text.lower().split()

    # Rejoin into a single cleaned string
    # (uncomment the next line to return the word list instead).
    # return words
    return " ".join(words)
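One refinement worth knowing about (mentioned again at the end of this post, but not applied in the rest of the walkthrough) is stop-word removal. A minimal sketch of how it could be folded into the cleaning step — note that the tiny STOP_WORDS set here is hand-picked for illustration; in practice you would use a full list such as nltk.corpus.stopwords.words('english'):

```python
import re

# Hand-picked illustrative stop words; a real list (e.g. NLTK's) is much longer.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "it"}

def review_to_wordlist_nostop(review):
    """Clean a review as above, then drop stop words (HTML stripping omitted)."""
    review_text = re.sub("[^a-zA-Z]", " ", review)  # keep letters only
    words = review_text.lower().split()
    words = [w for w in words if w not in STOP_WORDS]
    return " ".join(words)

print(review_to_wordlist_nostop("The plot is full of surprises, and it works!"))
# plot full surprises works
```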

Load the data and clean every sample with the function above:

import pandas as pd

# load data
data = pd.read_csv('data/labeledTrainData.tsv', delimiter="\t")

# clean data
clean_data = []
for rv in data['review']:
    clean_data.append(review_to_wordlist(rv))

data['clean_review'] = clean_data

Here I treat the entire labeled train data as the full dataset and split it into a training set and a test set:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data['clean_review'], data['sentiment'], test_size=0.2, random_state=1)

Step 2: Generate Word Vectors

First, look at the average review length (in words):

data['clean_review'].apply(lambda x: len(x.split(" "))).mean()
# 236.82856

Tokenize with Keras's Tokenizer:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

max_features = 6000  # maximum vocabulary size
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(X_train)
list_tokenized_train = tokenizer.texts_to_sequences(X_train)

maxlen = 130  # maximum sequence length
X_tr = pad_sequences(list_tokenized_train, maxlen=maxlen)
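To make concrete what these two calls produce, here is a rough pure-Python sketch of the same idea — rank words by training-set frequency, map each review to a sequence of indices (dropping out-of-vocabulary words), then pad on the left / truncate from the front to maxlen, mirroring pad_sequences' default padding='pre' behaviour. This is only an illustration, not Keras's actual implementation:

```python
from collections import Counter

def fit_vocab(texts, num_words):
    """Rank words by frequency; index 1 is the most frequent (0 is the pad value)."""
    counts = Counter(w for t in texts for w in t.split())
    ranked = [w for w, _ in counts.most_common(num_words - 1)]
    return {w: i + 1 for i, w in enumerate(ranked)}

def texts_to_padded(texts, vocab, maxlen):
    """Map words to indices, drop unknown words, then left-pad / head-truncate."""
    seqs = []
    for t in texts:
        seq = [vocab[w] for w in t.split() if w in vocab][-maxlen:]  # keep the tail
        seqs.append([0] * (maxlen - len(seq)) + seq)                 # pad in front
    return seqs

vocab = fit_vocab(["great great great movie", "bad movie"], num_words=10)
print(vocab)  # {'great': 1, 'movie': 2, 'bad': 3}
print(texts_to_padded(["great movie", "awful"], vocab, maxlen=4))
# [[0, 0, 1, 2], [0, 0, 0, 0]] -- 'awful' is out-of-vocabulary
```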

Step 3: Build a Classifier / Model

There are countless classifiers out there; pick a few and try them for yourself.

BiLSTM Classifier

from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, Dropout, Bidirectional, GlobalMaxPool1D

embed_size = 256
model = Sequential()
model.add(Embedding(max_features, embed_size))
model.add(Bidirectional(LSTM(32, return_sequences=True)))
model.add(GlobalMaxPool1D())
model.add(Dense(20, activation="relu"))
model.add(Dropout(0.05))
model.add(Dense(1, activation="sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

batch_size = 100
epochs = 5
model.fit(X_tr, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.2)

This takes a while; during training you will see something like:

Train on 16000 samples, validate on 4000 samples
Epoch 1/5
16000/16000 [==============================] - 31s 2ms/step - loss: 0.4832 - acc: 0.7597 - val_loss: 0.3214 - val_acc: 0.8608
Epoch 2/5
16000/16000 [==============================] - 30s 2ms/step - loss: 0.2633 - acc: 0.8929 - val_loss: 0.3142 - val_acc: 0.8642
Epoch 3/5
16000/16000 [==============================] - 30s 2ms/step - loss: 0.1876 - acc: 0.9292 - val_loss: 0.3474 - val_acc: 0.8557
Epoch 4/5
16000/16000 [==============================] - 29s 2ms/step - loss: 0.1211 - acc: 0.9593 - val_loss: 0.4179 - val_acc: 0.8560
Epoch 5/5
16000/16000 [==============================] - 29s 2ms/step - loss: 0.0760 - acc: 0.9754 - val_loss: 0.5393 - val_acc: 0.8440
<keras.callbacks.History at 0x2684c124940>

Step 4: Model Evaluation

list_sentences_test = X_test
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)
X_te = pad_sequences(list_tokenized_test, maxlen=maxlen)
prediction = model.predict(X_te)
y_pred = (prediction > 0.5).astype(int).ravel()  # flatten the (n, 1) output to a 1-D label array

from sklearn.metrics import f1_score, confusion_matrix
# Note: sklearn's argument convention is (y_true, y_pred).
print('F1-score: {0}'.format(f1_score(y_test, y_pred)))
print('Confusion matrix:')
confusion_matrix(y_test, y_pred)

Evaluation results:

F1-score: 0.8357478065700876
Confusion matrix:
array([[2147,  356],
       [ 449, 2048]], dtype=int64)
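As a sanity check, the F1 score can be recomputed directly from the confusion-matrix counts, since F1 = 2*TP / (2*TP + FP + FN): 2048 correctly predicted positives, plus the two off-diagonal misclassification counts (356 and 449):

```python
# Counts read off the confusion matrix above (positive class = 1).
tp = 2048         # correctly predicted positives
errs = 356 + 449  # the two off-diagonal (misclassified) counts

f1 = 2 * tp / (2 * tp + errs)  # F1 = 2*TP / (2*TP + FP + FN)
print(round(f1, 4))  # 0.8357, matching the reported score
```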

That completes the basic pipeline. From here it is a matter of model tuning, or swapping in other classifiers.
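As one example of swapping in another classifier, a much lighter baseline is TF-IDF features plus logistic regression, which is often surprisingly strong on this dataset. A minimal sketch — the toy reviews below are made up purely for illustration; in practice you would fit on X_train and evaluate on X_test as above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus standing in for the cleaned reviews.
train_texts = ["great film loved it", "wonderful acting great plot",
               "terrible boring film", "awful plot hated it"]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

# Expect a positive prediction, then a negative one.
print(clf.predict(["loved the acting", "boring and awful"]))
```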

Final Notes

This post just walks through a basic pipeline and is meant only as an example. Beyond the techniques shown here there are many details to fill in, such as removing stop words, and many knobs to tune, such as the embedding dimension. I will keep maintaining this post and adding more useful content when I have time.
