ナード戦隊データマン

A blog about machine learning and natural language processing

Trying out targeted sentiment analysis

TSA (targeted sentiment analysis) is the task of predicting a sentiment label for each target mentioned in a sentence.
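The dataset used below follows a BIO-style tag set with a sentiment polarity attached to each target: o, b-neutral, i-neutral, b-positive, i-positive, b-negative and i-negative. As a purely illustrative example, a sentence could be annotated like this (one tag per token):

obama   b-positive
is      o
a       o
great   o
guy     o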

Data

Download the data from: https://github.com/sugiyamath/tsa_examples/tree/master/data/conll

Running in Jupyter

First, download a pretrained FastText model.

wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
gunzip cc.en.300.bin.gz

Next, load it.

from gensim.models.fasttext import FastText

kv = FastText.load_fasttext_format("./cc.en.300.bin")
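As a quick sanity check (an illustrative snippet, not part of the original post), the loaded model should return 300-dimensional vectors even for rare words, since FastText composes vectors from character n-grams:

print(kv.wv["language"].shape)  # expected: (300,)
print(len(kv.wv.vocab))         # size of the pretrained vocabulary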

Then, define a few helper functions.

import numpy as np
from tqdm import tqdm

def create_embedding_weights(kv, nb_words=100000, emb_dim=300):
    # Build an index-to-vector matrix for the top nb_words words,
    # reserving index 0 for unknown words (<UNK>).
    embedding_matrix = np.zeros((nb_words, emb_dim))
    embedding_dict = {}
    embedding_dict["<UNK>"] = 0
    for i, (word, _) in tqdm(enumerate(kv.wv.vocab.items())):
        if i >= nb_words-1:
            break
        embedding_dict[word] = i+1
        embedding_matrix[i+1] = kv.wv[word]
    return embedding_dict, embedding_matrix
import nltk  # nltk.download('punkt') may be required for word_tokenize

def tokenize(text, embedding_dict, sequence=False):
    # Map each word to its index in embedding_dict; unknown words fall back to <UNK>.
    out = []
    if sequence:
        words = text
    else:
        words = nltk.word_tokenize(text)
        
    for word in words:
        word = word.lower()
        try:
            out.append(embedding_dict[word])
        except KeyError:
            out.append(embedding_dict["<UNK>"])
    return np.array(out, dtype=np.int32)

Create embedding_matrix (the weights for the Embedding layer).

embedding_dict, embedding_matrix = create_embedding_weights(kv)
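To confirm that tokenize and embedding_dict work together, a minimal check (illustrative only) might look like this:

# Words within the top-100k vocabulary get their own index;
# anything else falls back to <UNK> (index 0).
ids = tokenize("This movie was great !", embedding_dict)
print(ids)        # one integer index per token
print(ids.shape)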

Load the CoNLL data.

def load_data(datafile="../data/conll/all.conll.train"):
    labels = {"o":0,"b-neutral":1,"i-neutral":2,"b-positive":3,"i-positive":4,"b-negative":5,"i-negative":6}
    rows_word = []
    row_word = []
    rows_tag = []
    row_tag = []
    with open(datafile) as f:
        for line in f:
            line = line.strip()
            if line:
                tmp = line.split()
                if len(tmp) != 2:
                    continue
                word, tag = line.split()
                if tag not in labels:
                    continue                
                row_word.append(word)
                row_tag.append(tag)
            else:
                assert len(row_word) == len(row_tag)
                rows_word.append(row_word)
                rows_tag.append(row_tag)
                row_word = []
                row_tag = []
    if row_word and row_tag:
        assert len(row_word) == len(row_tag)
        rows_word.append(row_word)
        rows_tag.append(row_tag)
        
    assert len(rows_word) == len(rows_tag)
    return rows_word, rows_tag, labels

rows_word, rows_tag, labels = load_data()
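A quick look at what load_data returns (illustrative; the actual words depend on the dataset):

print(len(rows_word), len(rows_tag))             # number of sentences / tag sequences
print(list(zip(rows_word[0], rows_tag[0]))[:5])  # first few (word, tag) pairs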

Preprocess the data.

from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

def preprocess(rows_word, rows_tag, labels, embedding_dict, maxlen=128):
    # Convert each sentence into an index sequence and each tag sequence into one-hot labels.
    X = [tokenize(words, embedding_dict, sequence=True) for words in rows_word]
    X = pad_sequences(maxlen=maxlen, sequences=X)
    y = [[labels[tag] for tag in tags] for tags in rows_tag]
    y = pad_sequences(maxlen=maxlen, sequences=y)
    y = np.array([to_categorical(i, num_classes=len(labels)) for i in y])
    return X, y

X, y = preprocess(rows_word, rows_tag, labels, embedding_dict)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
X.shape, y.shape
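With maxlen=128 and the 7 tag classes defined above, X should come out as (n_sentences, 128) and y as (n_sentences, 128, 7).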

Build the model.

from keras.models import Model, Input
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional
from keras_contrib.layers import CRF

maxlen=128
n_words = 100000
dim = 300
n_tags = len(list(labels.items()))

input = Input(shape=(maxlen,))
model = Embedding(n_words, dim, weights=[embedding_matrix], trainable=False, input_length=maxlen)(input)
model = Bidirectional(LSTM(units=128, return_sequences=True, recurrent_dropout=0.1))(model)
model = TimeDistributed(Dense(128, activation="relu"))(model)
crf = CRF(n_tags)
out = crf(model)

model = Model(input, out)
model.compile(optimizer="rmsprop", loss=crf.loss_function, metrics=[crf.accuracy])
model.summary()
Layer (type)                 Output Shape              Param #   
=================================================================
input_6 (InputLayer)         (None, 128)               0         
_________________________________________________________________
embedding_5 (Embedding)      (None, 128, 300)          30000000  
_________________________________________________________________
bidirectional_4 (Bidirection (None, 128, 256)          439296    
_________________________________________________________________
time_distributed_4 (TimeDist (None, 128, 128)          32896     
_________________________________________________________________
crf_4 (CRF)                  (None, 128, 7)            966       
=================================================================
Total params: 30,473,158
Trainable params: 473,158
Non-trainable params: 30,000,000
_________________________________________________________________

Train the model.

history = model.fit(X_train, y_train, batch_size=32, epochs=5, validation_split=0.1, verbose=1)

Evaluate accuracy.

from seqeval.metrics import classification_report

y_pred = model.predict(X_test, verbose=1)

idx2tag = {i: w for w, i in labels.items()}

def pred2label(pred, idx2tag):
    out = []
    for pred_i in pred:
        out_i = []
        for p in pred_i:
            p_i = np.argmax(p)
            out_i.append(idx2tag[p_i])
        out.append(out_i)
    return out
    
pred_labels = pred2label(y_pred, idx2tag)
test_labels = pred2label(y_test, idx2tag)

print(classification_report(test_labels, pred_labels))
           precision    recall  f1-score   support

        o       0.57      0.51      0.54      1871
  neutral       0.42      0.44      0.43       471
 negative       0.26      0.20      0.22       230
 positive       0.52      0.10      0.17       248

micro avg       0.52      0.44      0.47      2820
macro avg       0.52      0.44      0.46      2820

Run prediction on a sample sentence.

words = {y:x for x, y in embedding_dict.items()}

def sample_prediction(X_row, y_row):
    p = model.predict(np.array([X_row]))
    p = np.argmax(p, axis=-1)
    true = np.argmax(y_row, -1)
    print("{:15}||{:5}||{}".format("Word", "True", "Pred"))
    print(30 * "=")
    for w, t, pred in zip(X_row, true, p[0]):
        if w != 0:
            print("{:15}: {:5} {}".format(words[w], idx2tag[t], idx2tag[pred]))

target = 142
sample_prediction(X_test[target], y_test[target])
Word           ||True ||Pred
==============================
why            : o     o
is             : o     o
obama          : b-negative b-negative
's             : o     o
foreign        : o     o
policy         : o     o
backing        : o     o
the            : o     o
same           : o     o
people         : o     o
as             : o     o
al             : o     o
in             : o     o
?              : o     o
via            : o     o
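The same pipeline can also be applied to a raw sentence outside the test set. The helper below is a rough sketch (predict_sentence is not part of the original post) that reuses tokenize, pad_sequences and idx2tag:

def predict_sentence(text):
    # Tokenize, map to indices, pad to the model's input length, and decode tags.
    tokens = nltk.word_tokenize(text)
    ids = tokenize(tokens, embedding_dict, sequence=True)
    x = pad_sequences(maxlen=maxlen, sequences=[ids])
    p = np.argmax(model.predict(x), axis=-1)[0]
    # pad_sequences pads on the left by default, so the real tokens sit at the end.
    return list(zip(tokens, [idx2tag[t] for t in p[-len(ids):]]))

predict_sentence("obama is a great guy")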

Discussion

What makes this problem difficult is that, unlike named entity recognition, the same word can receive different predictions depending on the sentence.

For example:

"obama is a stupid guy" -> obama: b-negative
"obama is a great guy" -> obama: b-positive
"obama is a normal guy" -> obama: b-neutral

As these examples show, the label to assign depends on the context, and that is what makes this task hard. Because of this, the model above tends to fall back on the neutral label: it has effectively turned into a plain named-entity-recognition-style model and fails to extract sentiment polarity correctly.

For example, there appear to be papers that address this point, such as the following:

References