ナード戦隊データマン

A blog about machine learning and natural language processing

Argument Mining: IBM Debater Claim Stance Dataset

The IBM Debater Claim Stance Dataset is a dataset for stance classification: given a topic and a claim as input, the task is to decide whether the claim is for (PRO) or against (CON) the topic.
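The loading code below assumes a JSON layout that pairs each topic with a list of claims. A minimal sketch of turning one such record into (topic, claim, label) triples — the example text is made up, but the field names `topicTarget`, `claims`, `claimCorrectedText`, and `stance` match those used in the loading code:

```python
import json

# A made-up record mimicking the claim_stance_dataset_v1.json structure;
# only the field names follow the real dataset.
record = json.loads("""
{
  "topicTarget": "the sale of violent video games to minors",
  "claims": [
    {"claimCorrectedText": "Violent games increase aggression.",
     "stance": "PRO"},
    {"claimCorrectedText": "There is no proven link to real violence.",
     "stance": "CON"}
  ]
}
""")

# Each (topic, claim) pair becomes one binary example: label True for PRO.
examples = [(record["topicTarget"], c["claimCorrectedText"], c["stance"] == "PRO")
            for c in record["claims"]]
for topic, claim, label in examples:
    print(label, "|", claim)
```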

Code

import json

import pandas as pd
from sklearn.utils import shuffle
from keras.models import Model
from keras.layers import Input, Dense, Dropout
from bert_serving.client import BertClient

bc = BertClient()
 
 
def load(
        trainfile="./data/IBM_Debater_(R)_CS_EACL-2017.v1/claim_stance_dataset_v1.json"):
    with open(trainfile) as f:
        data = json.load(f)
    out = []
    for ds in data:
        for d in ds["claims"]:
            out.append({
                "X1": ds["topicTarget"],
                "X2": d["claimCorrectedText"],
                "y": d["stance"] == "PRO"
            })
 
    df = shuffle(pd.DataFrame(out))
    # Join topic and claim with " ||| " so bert-as-service encodes them
    # as a sentence pair.
    df["X"] = [' ||| '.join([a, b]) for a, b in zip(df["X1"], df["X2"])]
    X = bc.encode(df["X"].tolist())
    y = df["y"]
    # 90% train; the remaining 10% is split again into test (90%) and
    # validation (10%).
    size = int(X.shape[0] * 0.9)
    X_train, X_test, y_train, y_test = X[:size], X[size:], y[:size], y[size:]
    size = int(X_test.shape[0] * 0.9)
    X_test, X_val, y_test, y_val = (X_test[:size], X_test[size:],
                                    y_test[:size], y_test[size:])
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
 
 
def build_model():
    in1 = Input(shape=(768, ))  # bert-as-service returns 768-dim vectors for BERT-base
    out = Dense(1024, activation="relu", kernel_initializer="he_normal")(in1)
    out = Dropout(0.5)(out)
    out = Dense(1, activation="sigmoid", kernel_initializer="normal")(out)
    model = Model(in1, out)
    model.compile(loss='binary_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])
    return model
 
 
if __name__ == "__main__":
    from sklearn.metrics import classification_report, accuracy_score
    from keras.models import load_model
    epochs = 500
    batch_size = 1000
    tr, va, te = load()
    model = build_model()
    model.fit(*tr,
              validation_data=va,
              epochs=epochs,
              batch_size=batch_size,
              verbose=1)
    model.save("model.h5")
    model = load_model("model.h5")
    y_pred = [x[0] > 0.5 for x in model.predict(te[0])]
    print("ACC:", accuracy_score(te[1], y_pred))
    print(classification_report(te[1], y_pred))

[Results]

ACC: 0.8287037037037037
              precision    recall  f1-score   support

       False       0.78      0.81      0.80        89
        True       0.86      0.84      0.85       127

   micro avg       0.83      0.83      0.83       216
   macro avg       0.82      0.83      0.82       216
weighted avg       0.83      0.83      0.83       216

Explanation

Sentences are encoded with bert-as-service, and the resulting vectors are classified with a simple DNN. Because BERT supports two-sentence inputs, the topic and the claim can be fed in together and encoded as a single vector.
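The two-sentence input is expressed by joining topic and claim with the " ||| " separator, which bert-as-service treats as a sentence-pair boundary. A minimal sketch — the topic/claim text is made up, and the `BertClient` call is commented out because it needs a running bert-as-service server:

```python
# Build the " ||| "-separated pair strings fed to bert-as-service.
topics = ["the sale of violent video games to minors"]
claims = ["Violent games increase aggression."]

pairs = [' ||| '.join([t, c]) for t, c in zip(topics, claims)]
print(pairs[0])

# With a bert-as-service server running, each pair string maps to one
# fixed-length vector (768-dim for BERT-base):
# from bert_serving.client import BertClient
# vecs = BertClient().encode(pairs)  # shape: (len(pairs), 768)
```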

References

IBM Research | Debater Datasets

GitHub - hanxiao/bert-as-service: Mapping a variable-length sentence to a fixed-length vector using BERT model