ナード戦隊データマン

Fighting evil with data science

Tweet sentiment analysis with gensim and Keras

This post shows, using sentiment analysis as an example, how to load gensim's pretrained word2vec vectors into a Keras Embedding layer and freeze it.

Preparation

First, the sentiment analysis data is the same as in the following post: https://qiita.com/sugiyamath/items/7cabef39390c4a07e4d8

The pretrained word2vec and sentencepiece models are the ones from this post: https://qiita.com/sugiyamath/items/b7ea3e7484cf210b9ad4

Running it in a Jupyter notebook

First, load the data.

In[1]:

import pandas as pd
from os.path import join
from sklearn.utils import shuffle

# six emotion classes; "disgust" is split across two JSON files, so it is given as a list
emotions = ["happy", "sad", ["disgust", "disgust2"], "angry", "fear", "surprise"]
dir_path = "gathering/ja_tweets_sentiment"
size = 60000  # tweets sampled per class
df = []
for i, es in enumerate(emotions):
    if isinstance(es, list):
        # split the per-class budget evenly across the files
        for e in es:
            data = shuffle(pd.read_json(join(dir_path, "{}.json".format(e)))).iloc[:int(size/len(es))]
            data['label'] = i
            df.append(data)
    else:
        data = shuffle(pd.read_json(join(dir_path, "{}.json".format(es)))).iloc[:size]
        data['label'] = i
        df.append(data)

df = pd.concat(df)
df = shuffle(df)
X = df['text']
y = df['label']
df.shape
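
As a quick sanity check (not in the original notebook), you can also confirm that the classes are roughly balanced:

df['label'].value_counts()  # each of the six labels should cover roughly 60000 tweets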

Next, tokenize the text with the pretrained sentencepiece model.

In[2]:

import re
import sentencepiece as spm

# patterns for stripping URLs and @mentions before tokenizing
regexs = []
regexs.append(re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'))
regexs.append(re.compile(r'@(\w){1,15}'))

def tokenize(data, regexs, sp):
    results = []
    for d in data:
        try:
            for regex in regexs:
                d = re.sub(regex, "", d)
            # join the sentencepiece pieces with spaces, dropping the "▁" markers
            d = ' '.join([l.replace("▁", "").replace("#", "") for l in sp.EncodeAsPieces(d)])
        except Exception:
            d = ""
        results.append(d)
    return results

sp = spm.SentencePieceProcessor()
sp.Load("twitterstream2word2vec/model/sp/sp.model")

X = tokenize(X, regexs, sp)

Next, run the tokenized text through the Keras tokenizer, pad the sequences, and split the data.

In[3]:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split

max_features = 32000  # cap on the vocabulary size
maxlen = 280          # maximum padded sequence length

y = to_categorical(y)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# the text is already space-separated sentencepiece tokens, so disable filtering and lowercasing
tokenizer = Tokenizer(num_words=max_features, filters="", lower=False)
tokenizer.fit_on_texts(list(X_train))

def preprocess(data, tokenizer, maxlen=maxlen):
    return pad_sequences(tokenizer.texts_to_sequences(data), maxlen=maxlen)

X_train = preprocess(X_train, tokenizer)
X_test = preprocess(X_test, tokenizer)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train)

Next, map the Keras tokenizer's word indices to the pretrained gensim word2vec vectors, build a weight matrix from them, and freeze that matrix as a Keras Embedding layer.

In[4]:

import numpy as np
from gensim.models import word2vec
from keras.layers import Embedding

word_index = tokenizer.word_index
word_vectors = word2vec.Word2Vec.load("twitterstream2word2vec/model/w2v_gensim/word2vec_tweet.model")

EMBEDDING_DIM = 200
vocabulary_size = min(len(word_index) + 1, max_features)
embedding_matrix = np.zeros((vocabulary_size, EMBEDDING_DIM))

for word, i in word_index.items():
    if i >= max_features:
        continue
    try:
        # copy the pretrained vector into row i of the weight matrix
        embedding_matrix[i] = word_vectors.wv[word]
    except KeyError:
        # words missing from the word2vec vocabulary get a small random vector
        embedding_matrix[i] = np.random.normal(0, np.sqrt(0.25), EMBEDDING_DIM)

del(word_vectors)

# trainable=False freezes the pretrained weights during training
embedding_layer = Embedding(vocabulary_size, EMBEDDING_DIM, weights=[embedding_matrix], trainable=False)

All that remains is to connect the frozen embedding_layer to a suitable deep learning model; the CNN below follows the Kaggle kernel listed in the references.

In[5]:

from keras.layers import Input, Dense, Conv2D, MaxPooling2D, Dropout, concatenate
from keras.layers.core import Reshape, Flatten
from keras.callbacks import EarlyStopping
from keras.optimizers import Adam
from keras.models import Model
from keras import regularizers

sequence_length = X_train.shape[1]
filter_sizes = [3, 4, 5]  # n-gram widths of the parallel convolution branches
num_filters = 100
drop = 0.5

inputs = Input(shape=(sequence_length,))
embedding = embedding_layer(inputs)
# add a channel axis so the embedded sequence can be fed to Conv2D
reshape = Reshape((sequence_length, EMBEDDING_DIM, 1))(embedding)

# each branch convolves full-width windows of 3, 4, and 5 tokens over the embeddings
conv_0 = Conv2D(num_filters, (filter_sizes[0], EMBEDDING_DIM), activation='relu', kernel_regularizer=regularizers.l2(0.01))(reshape)
conv_1 = Conv2D(num_filters, (filter_sizes[1], EMBEDDING_DIM), activation='relu', kernel_regularizer=regularizers.l2(0.01))(reshape)
conv_2 = Conv2D(num_filters, (filter_sizes[2], EMBEDDING_DIM), activation='relu', kernel_regularizer=regularizers.l2(0.01))(reshape)

# max-pool each branch over the whole sequence
maxpool_0 = MaxPooling2D((sequence_length - filter_sizes[0] + 1, 1), strides=(1, 1))(conv_0)
maxpool_1 = MaxPooling2D((sequence_length - filter_sizes[1] + 1, 1), strides=(1, 1))(conv_1)
maxpool_2 = MaxPooling2D((sequence_length - filter_sizes[2] + 1, 1), strides=(1, 1))(conv_2)

merged_tensor = concatenate([maxpool_0, maxpool_1, maxpool_2], axis=1)
flatten = Flatten()(merged_tensor)
dropout = Dropout(drop)(flatten)
output = Dense(units=6, activation='softmax', kernel_regularizer=regularizers.l2(0.01))(dropout)

model = Model(inputs, output)

adam = Adam(lr=1e-3)

model.compile(loss='categorical_crossentropy',
              optimizer=adam,
              metrics=['acc'])

callbacks = [EarlyStopping(monitor='val_loss')]

model.fit(X_train, y_train, batch_size=1000, epochs=10, verbose=1, validation_data=(X_val, y_val), callbacks=callbacks)

Out[5]:

Train on 202500 samples, validate on 67500 samples
Epoch 1/10
202500/202500 [==============================] - 35s 174us/step - loss: 1.7235 - acc: 0.3952 - val_loss: 1.4867 - val_acc: 0.4731
Epoch 2/10
202500/202500 [==============================] - 34s 167us/step - loss: 1.5152 - acc: 0.4459 - val_loss: 1.4519 - val_acc: 0.4771
Epoch 3/10
202500/202500 [==============================] - 34s 167us/step - loss: 1.4907 - acc: 0.4517 - val_loss: 1.4514 - val_acc: 0.4750
Epoch 4/10
202500/202500 [==============================] - 34s 168us/step - loss: 1.4890 - acc: 0.4510 - val_loss: 1.4484 - val_acc: 0.4770
Epoch 5/10
202500/202500 [==============================] - 34s 168us/step - loss: 1.4914 - acc: 0.4502 - val_loss: 1.4515 - val_acc: 0.4752

Finally, print a classification report.

In[6]:

import numpy as np
from sklearn.metrics import classification_report

y_preds = model.predict(X_test)
y_preds = np.argmax(y_preds, axis=1)
y_true = np.argmax(y_test, axis=1)

# flatten the emotion list back to one label name per class (use the first name for grouped classes)
emolabels = []
for e in emotions:
    if isinstance(e, list):
        emolabels.append(e[0])
    else:
        emolabels.append(e)

print(classification_report(y_true, y_preds, target_names=emolabels))

Out[6]:

             precision    recall  f1-score   support

      happy       0.56      0.56      0.56     15007
        sad       0.60      0.54      0.57     15012
    disgust       0.43      0.27      0.34     15214
      angry       0.47      0.56      0.51     14893
       fear       0.38      0.40      0.39     14888
   surprise       0.42      0.51      0.46     14986

avg / total       0.48      0.47      0.47     90000

A one-dimensional character-based CNN gave better accuracy, but this model still performs reasonably well.
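
As a usage sketch (not part of the original notebook), the trained pipeline can be applied to new tweets by reusing the tokenize and preprocess helpers defined above. The helper name predict_emotion and the example tweet are made up for illustration.

import numpy as np

def predict_emotion(texts):
    # strip URLs/mentions, tokenize with sentencepiece, pad, then classify
    seqs = preprocess(tokenize(texts, regexs, sp), tokenizer)
    probs = model.predict(seqs)
    return [emolabels[i] for i in np.argmax(probs, axis=1)]

print(predict_emotion(["今日はとても楽しかった!"]))  # made-up example tweet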

Notes

If you set trainable=False on a Keras layer, its weights are never updated during training. This is what "freezing" means.
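
As a minimal illustration (a toy model, separate from the one above), a layer built with trainable=False contributes only non-trainable parameters:

from keras.layers import Input, Embedding, Flatten, Dense
from keras.models import Model

inp = Input(shape=(10,))
emb = Embedding(100, 8, trainable=False)(inp)  # frozen: never updated by fit()
out = Dense(1, activation='sigmoid')(Flatten()(emb))
toy = Model(inp, out)
toy.summary()  # the 800 embedding weights are listed under "Non-trainable params"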

For a more detailed explanation of freezing, see: https://keras.io/ja/getting-started/faq/#freeze

Keras also makes things like fine-tuning easy, so give it a try.
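
For example, one common recipe (a sketch continuing from the model above, not something run in this post) is to unfreeze the embedding layer after the initial training, recompile, and train a few more epochs with a smaller learning rate:

embedding_layer.trainable = True  # unfreeze the pretrained embeddings

# changes to `trainable` only take effect after recompiling
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(lr=1e-4),  # smaller learning rate for fine-tuning
              metrics=['acc'])

model.fit(X_train, y_train, batch_size=1000, epochs=3, verbose=1,
          validation_data=(X_val, y_val), callbacks=callbacks)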

References

[0] https://keras.io/ja/
[1] https://www.kaggle.com/marijakekic/cnn-in-keras-with-pretrained-word2vec-weights