ナード戦隊データマン

データサイエンスを用いて悪と戦うぞ

固有表現抽出ツールanagoの訓練データを京都ウェブ文書リードコーパスから用意する

前回( https://qiita.com/sugiyamath/items/365b263d4f03d3bca26f ), Hironsanのgithubのツール「anago」を試しましたが、十分なデータの用意ができませんでした。今回は、KWDLCから入力形式のデータを生成します。

形式のルール

  1. 単語とラベルをタブでつなぐ。
  2. 一文ごとに改行だけの行を挿入する。
  3. ラベルは"IOBタグ-分類(英大文字3字)"。
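
具体的には、例えば以下のような形式になります(単語・ラベルは説明用に作った仮の例です。文末の「。」の後には改行だけの行が入ります)。

太郎	B-PER
は	O
京都	B-LOC
へ	O
行っ	O
た	O
。	O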

データの置き場所

データは以下のURLからダウンロードします。 http://nlp.ist.i.kyoto-u.ac.jp/nl-resource/KWDLC/download_kwdlc.cgi

このデータを展開し、nltkのコーパスリーダーから読めるようにnltk_data/corpora/kwdlcに置きます。

コード

import nltk,re
from nltk.corpus.reader.util import *
from nltk.corpus.util import LazyCorpusLoader

root = str(nltk.data.find('corpora/kwdlc/dat/rel'))  # 文字列パスとして扱う
fileids = [f for f in find_corpus_fileids(FileSystemPathPointer(root), ".*")
               if re.search(r".+-.+", f)]
data = ""

for fileid in fileids:

    with open(root+"/" + fileid, "r") as file:
        data = data + file.read()

data_spl = data.split("\n")

regexp0 = re.compile(r"^#")
regexp1 = re.compile(r"^([^ ]+) [^ ]+ [^ ]+ [^ ]+ [^ ]+ [^ ]+ [^ ]+$")
regexp2 = re.compile(r".*ne type=\"([A-Z]+)\" target=\"(.+)\".*")

data_fmt = []
for d in data_spl:
    if regexp0.search(d):
        continue
    if regexp1.search(d):
        data_fmt.append(re.sub(regexp1, r"\1", d))
    if regexp2.search(d):
        data_fmt.append(re.sub(regexp2, r"NE:\1:\2", d))


def fill_array(size, it):
    out = []
    for i in range(size):
        out.append(it)
    return out

outs = []
labels = fill_array(len(data_fmt), None)

for i,d in enumerate(data_fmt):
    regexp3 = re.compile("NE:([A-Z]+):(.+)")
    if regexp3.search(d):
        rs = d.split(":")
        tmp_bools = []

        counter = i
        while(counter > 0):
            counter = counter - 1
            if data_fmt[counter] in rs[2]:
                tmp_bools.append(counter)
            else:
                break

        counter = i
        while(counter < len(data_fmt) - 1):  # 末尾でのIndexErrorを避ける
            counter = counter + 1
            if data_fmt[counter] in rs[2]:
                tmp_bools.append(counter)
            else:
                break

        tmp_bools.sort()
        first_flag = True
        for b in tmp_bools:
            if first_flag:
                labels[b] = "B-" + rs[1][:3]
                first_flag = False
            else:
                labels[b] = "I-" + rs[1][:3]
        outs.append(None)
    else:
        outs.append(d)

with open("kwdlc.txt", "w") as file:
    for i, d in enumerate(outs):
        regexp4 = re.compile("。")
        if d is not None:
            if labels[i] is None:
                if regexp4.search(d):
                    file.write(d + "\t" + "O" + "\n")
                else:
                    file.write(d + "\t" + "O")
            else:
                file.write(d + "\t" + labels[i])
            file.write("\n")
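
生成されたkwdlc.txtは、例えば次のようなスニペットで簡単に確認できます(動作確認用の仮の例です)。

# 生成結果の簡易チェック(仮のスニペット)
from collections import Counter

with open("kwdlc.txt") as f:
    rows = [line.split("\t") for line in f.read().split("\n") if line]

print("トークン数:", len(rows))
print("ラベル分布:", Counter(r[1] for r in rows if len(r) == 2).most_common(10))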

日本語の固有表現抽出をanaGoで行う

Hironsanの記事( https://qiita.com/Hironsan/items/326b66711eb4196aa9d4 )で使われているhironsan.txtを用いて、彼の作ったanaGoを試します。 ( https://github.com/Hironsan/anago )

hironsan.txtダウンロードと整形

本当は、KNBなどを用いてモデルを作成したほうがいいのですが、時間がかかりそうなのでとりあえずすでに整形されているデータをanaGo用に変換します。

https://github.com/Hironsan/IOB2Corpus

# coding: utf-8
with open("hironsan.txt") as file:
    data = file.read().split("\n")
    
data_spl = [x.split("\t") for x in data]
data_fin = [[x[0],x[-1]] for x in data_spl]
with open("hironsan_anago.txt", "w") as file:
    for x in data_fin:
        file.write(str(x[0])+"\t"+str(x[1])+"\n")
# coding: utf-8
def my_train_test_split(data, train_size=0.8):
    train_num = int(len(data) * train_size)
    return data[:train_num], data[train_num:]

with open("hironsan_anago.txt", "r") as file:
    data = file.read().split("\n")
    train, test = my_train_test_split(data)
    valid, test = my_train_test_split(test, train_size=0.5)
    
with open("hironsan_train.txt", "w") as file:
    for x in train:
        file.write(x+"\n")
        
with open("hironsan_test.txt", "w") as file:
    for x in test:
        file.write(x+"\n")
        
with open("hironsan_valid.txt", "w") as file:
    for x in valid:
        file.write(x+"\n")

これらを実行すると、4つのファイルが生成されます。

  • hironsan_anago.txt
  • hironsan_train.txt
  • hironsan_test.txt
  • hironsan_valid.txt

学習済みword2vecを取得・embeddingファイルを生成

$ mkdir w2v_model
$ cd w2v_model
$ wget http://public.shiroyagi.s3.amazonaws.com/latest-ja-word2vec-gensim-model.zip
$ unzip latest-ja-word2vec-gensim-model.zip
# coding: utf-8
from gensim.models import Word2Vec
model = Word2Vec.load("w2v_model/word2vec.gensim.model")
word_vectors = model.wv
word_vectors.save_word2vec_format("emb.txt", binary=False)

emb.txtはgithub上で説明されているglove.6B.100d.txtに対応します。
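
補足すると、save_word2vec_formatで出力したテキストは「1行目が『語彙数 次元』のヘッダ、2行目以降が『単語 値1 値2 …』」という形式です。次元が本当に50かどうかは、例えば次のように確認できます(仮のスニペットです)。

# word2vecモデルの次元とemb.txtのヘッダを確認する(仮のスニペット)
from gensim.models import Word2Vec

model = Word2Vec.load("w2v_model/word2vec.gensim.model")
print(model.vector_size)          # 50 であることを想定

with open("emb.txt") as f:
    print(f.readline().strip())   # 「語彙数 次元」のヘッダ行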

anaGoの修正

いくつか修正する部分があります。まず、data/reader.pyのload_word_embeddings関数を編集します。これは、emb.txtの各行をスペースでsplitしたとき、語自体にスペースが含まれていて先頭の要素だけでは語を表せない場合があるためです。

def load_word_embeddings(vocab, glove_filename, dim):
    """Loads GloVe vectors in numpy array.                                                                                                                   
                                                                                                                                                             
    Args:                                                                                                                                                    
        vocab (): dictionary vocab[word] = index.                                                                                                            
        glove_filename (str): a path to a glove file.                                                                                                        
        dim (int): dimension of embeddings.                                                                                                                  
                                                                                                                                                             
    Returns:                                                                                                                                                 
        numpy array: an array of word embeddings.                                                                                                            
    """

    embeddings = np.zeros([len(vocab), dim])
    with open(glove_filename) as f:
        for line in f:
            line = line.strip().split(' ')
            #word = line[0]                                                                                                                                  
            word_i = 1
            for i in range(dim):
                try:
                    tmp = float(line[i])
                    if tmp < 1.0:
                        break
                    else:
                        word_i = i+1
                except:
                    word_i = i+1

            #print(word_i)                                                                                                                                   
            word = ' '.join([str(x) for x in line[0: word_i]])
            embedding = [float(x) for x in line[word_i:dim+word_i]]
            if word in vocab:
                word_idx = vocab[word]
                embeddings[word_idx] = np.asarray(embedding)

    return embeddings

次に、evaluator.pyのeval関数が間違っているところがあるので修正します。

    def eval(self, x_test, y_test):

        # Prepare test data(steps, generator)                                                                                                                
        train_steps, train_batches = batch_iter(x_test, y_test, self.config.batch_size, preprocessor=self.preprocessor)

        # Build the model                                                                                                                                    
        model = SeqLabeling(self.config, ntags=len(self.preprocessor.vocab_tag))
        model.load(filepath=os.path.join(self.save_path, self.weights))

        # Build the evaluator and evaluate the model                                                                                                         
        f1score = F1score(train_steps, train_batches, self.preprocessor)
        f1score.model = model
        f1score.on_epoch_end(epoch=-1)  # epoch takes any integer.  

修正後のコードではbatch_iterにx_testとy_testを別々に渡していますが、元のコードは以下のようになっています。

   def eval(self, x_test, y_test):

        # Prepare test data(steps, generator)
        train_steps, train_batches = batch_iter(
            list(zip(x_test, y_test)), self.config.batch_size, preprocessor=self.preprocessor)

        # Build the model
        model = SeqLabeling(self.config, ntags=len(self.preprocessor.vocab_tag))
        model.load(filepath=os.path.join(self.save_path, self.weights))

        # Build the evaluator and evaluate the model
        f1score = F1score(train_steps, train_batches, self.preprocessor)
        f1score.model = model
        f1score.on_epoch_end(epoch=-1)

batch_iterへの引数の渡し方が間違っていたようです。

最後にconfig.pyを編集します。利用するword2vecの次元が100ではなく50だからです。

class ModelConfig(object):
    """Wrapper class for model hyperparameters."""

    def __init__(self):
        """Sets the default model hyperparameters."""

        # Number of unique words in the vocab (plus 2, for <UNK>, <PAD>).
        self.vocab_size = None
        self.char_vocab_size = None

        # Batch size.
        self.batch_size = 32

        # Scale used to initialize model variables.
        self.initializer_scale = 0.08

        # LSTM input and output dimensionality, respectively.
        self.char_embedding_size = 25
        self.num_char_lstm_units = 25
        self.word_embedding_size = 50
        self.num_word_lstm_units = 50

        # If < 1.0, the dropout keep probability applied to LSTM variables.
        self.dropout = 0.5

        # If True, use character feature.
        self.char_feature = True

        # If True, use crf.
        self.crf = True


class TrainingConfig(object):
    """Wrapper class for training hyperparameters."""

    def __init__(self):
        """Sets the default training hyperparameters."""

        # Batch size
        self.batch_size = 10

        # Optimizer for training the model.
        self.optimizer = 'adam'

        # Learning rate for the initial phase of training.
        self.learning_rate = 0.001
        self.lr_decay = 0.9

        # If not None, clip gradients to this value.
        self.clip_gradients = 5.0

        # The number of max epoch size
        self.max_epoch = 50

        # Parameters for early stopping
        self.early_stopping = True
        self.patience = 3

        # Fine-tune word embeddings
        self.train_embeddings = True

        # How many model checkpoints to keep.
        self.max_checkpoints_to_keep = 5

以上で設定は終了です。ただし、今まで生成されたhironsan_*.txtとemb.txtは任意の場所においてください。

訓練・テスト

それでは、実行してみます。hironsan_*.txtはdata2/hironsan/へ入れて、modelsディレクトリとlogsディレクトリを作成してください。emb.txtはdata2/へ入れます。

# coding: utf-8
import os
import anago
from anago.data.reader import load_data_and_labels, load_word_embeddings
from anago.data.preprocess import prepare_preprocessor
from anago.config import ModelConfig, TrainingConfig

DATA_ROOT = 'data2/hironsan/'
SAVE_ROOT = './models'  # trained model
LOG_ROOT = './logs'     # checkpoint, tensorboard
embedding_path = './data2/emb.txt'

model_config = ModelConfig()
training_config = TrainingConfig()

train_path = os.path.join(DATA_ROOT, 'hironsan_train.txt')
valid_path = os.path.join(DATA_ROOT, 'hironsan_valid.txt')
test_path = os.path.join(DATA_ROOT, 'hironsan_test.txt')

x_train, y_train = load_data_and_labels(train_path)
x_valid, y_valid = load_data_and_labels(valid_path)
x_test, y_test = load_data_and_labels(test_path)

p = prepare_preprocessor(x_train, y_train)

embeddings = load_word_embeddings(p.vocab_word, embedding_path, 50)
model_config.vocab_size = len(p.vocab_word)
model_config.char_vocab_size = len(p.vocab_char)

trainer = anago.Trainer(model_config, training_config, checkpoint_path=LOG_ROOT, save_path=SAVE_ROOT,preprocessor=p, embeddings=embeddings)
trainer.train(x_train, y_train, x_valid, y_valid)
                        
weights = 'model_weights.h5'
evaluator = anago.Evaluator(model_config, weights, save_path=SAVE_ROOT, preprocessor=p)
evaluator.eval(x_test, y_test)

Out[1]:

 - f1: 56.88

F値はそれほど高くはないことがわかります。

考察

実行自体はできましたが、結果は十分とは言えません。この手法で高い精度を出すには、まず利用するデータの量をもっと増やす必要があるかもしれません。

また、結果は、利用したword2vecモデルがどのような分かち書きで学習されたのかにも左右されるでしょう。

一番面倒な点は、hironsan.txtのような形式でKNBなどのコーパスを出力しなければならない点です。また、自力でword2vecモデルを作成する必要もあるでしょう。

ただし、config.pyのearly_stoppingをFalseにしてepochを増やせば精度が高まる可能性があります。

教師なし学習で画像分類をする

教師なし学習の主な目的はEDA(探索的データ分析)です。しかし、類似画像を教師なし学習で提示するようなサービスも存在します。ここでは、教師なしのクラスタリング手法であるKMeans、AgglomerativeClustering、DBSCANを比較します。

顔画像データの準備

まず、利用するデータをインポートしてデータについて少々理解を得ていきましょう。

In[1]:

%matplotlib inline
from sklearn.datasets import fetch_lfw_people
import matplotlib.pyplot as plt

people = fetch_lfw_people(min_faces_per_person=20, resize=0.7)

image_shape = people.images[0].shape
fig, axes = plt.subplots(2, 5, figsize=(15, 8), subplot_kw={'xticks':(), 'yticks':()})
for target, image, ax in zip(people.target, people.images, axes.ravel()):
    ax.imshow(image)
    ax.set_title(people.target_names[target])

Out[1]: download.png

上のような顔画像データをインポートしています。次に、人物ごとの画像枚数を見てみます。

In[2]:

import numpy as np
counts = np.bincount(people.target)

for i, (count, name) in enumerate(zip(counts, people.target_names)):
    print("{0:25} {1:3}".format(name, count), end='   ')
    if (i+1) % 3 == 0:
        print()

Out[2]:

Alejandro Toledo           39   Alvaro Uribe               35   Amelie Mauresmo            21   
Andre Agassi               36   Angelina Jolie             20   Ariel Sharon               77   
Arnold Schwarzenegger      42   Atal Bihari Vajpayee       24   Bill Clinton               29   
Carlos Menem               21   Colin Powell              236   David Beckham              31   
Donald Rumsfeld           121   George Robertson           22   George W Bush             530   
Gerhard Schroeder         109   Gloria Macapagal Arroyo    44   Gray Davis                 26   
Guillermo Coria            30   Hamid Karzai               22   Hans Blix                  39   
Hugo Chavez                71   Igor Ivanov                20   Jack Straw                 28   
Jacques Chirac             52   Jean Chretien              55   Jennifer Aniston           21   
Jennifer Capriati          42   Jennifer Lopez             21   Jeremy Greenstock          24   
Jiang Zemin                20   John Ashcroft              53   John Negroponte            31   
Jose Maria Aznar           23   Juan Carlos Ferrero        28   Junichiro Koizumi          60   
Kofi Annan                 32   Laura Bush                 41   Lindsay Davenport          22   
Lleyton Hewitt             41   Luiz Inacio Lula da Silva  48   Mahmoud Abbas              29   
Megawati Sukarnoputri      33   Michael Bloomberg          20   Naomi Watts                22   
Nestor Kirchner            37   Paul Bremer                20   Pete Sampras               22   
Recep Tayyip Erdogan       30   Ricardo Lagos              27   Roh Moo-hyun               32   
Rudolph Giuliani           26   Saddam Hussein             23   Serena Williams            52   
Silvio Berlusconi          33   Tiger Woods                23   Tom Daschle                25   
Tom Ridge                  33   Tony Blair                144   Vicente Fox                32   
Vladimir Putin             49   Winona Ryder               24   

偏りが大きいので、1人あたりの画像枚数を最大50枚に制限します。さらに、PCAで100次元に圧縮しておきます。

In[3]:

from sklearn.decomposition import PCA
mask = np.zeros(people.target.shape, dtype=bool)
for target in np.unique(people.target):
    mask[np.where(people.target == target)[0][:50]] = 1
X_people = people.data[mask]
y_people = people.target[mask]

X_people = X_people / 255

pca = PCA(n_components=100, whiten=True, random_state=0)
X_pca = pca.fit_transform(X_people)

DBSCAN

まず、DBSCANを試してみます。近傍半径epsを設定する必要がありますが、クラスタ数は自動的に決定されます。

In[4]:

from sklearn.cluster import AgglomerativeClustering,DBSCAN,KMeans
for eps in [1, 3, 5, 7, 9, 11, 13, 15, 17]:
    print("eps:{}".format(eps))
    dbscan = DBSCAN(eps=eps, min_samples=3)
    labels = dbscan.fit_predict(X_pca)
    print("Clusters present:{}".format(np.unique(labels)))
    print("Clusters size:{}".format(np.bincount(labels + 1)))
    print()

Out[4]:

eps:1
Clusters present:[-1]
Clusters size:[2063]

eps:3
Clusters present:[-1]
Clusters size:[2063]

eps:5
Clusters present:[-1]
Clusters size:[2063]

eps:7
Clusters present:[-1  0  1  2  3  4  5  6  7  8  9 10 11 12]
Clusters size:[2003    4   14    7    4    3    3    4    4    3    3    5    3    3]

eps:9
Clusters present:[-1  0  1  2]
Clusters size:[1306  751    3    3]

eps:11
Clusters present:[-1  0]
Clusters size:[ 413 1650]

eps:13
Clusters present:[-1  0]
Clusters size:[ 120 1943]

eps:15
Clusters present:[-1  0]
Clusters size:[  31 2032]

eps:17
Clusters present:[-1  0]
Clusters size:[   6 2057]

-1はノイズを表しています。まともにクラスタが出力されているのはeps=7だけのようです。

In[5]:

dbscan = DBSCAN(eps=7, min_samples=3)
labels = dbscan.fit_predict(X_pca)

for cluster in range(max(labels)+1):
    mask = labels == cluster
    n_images = np.sum(mask)
    fig, axes = plt.subplots(1, n_images, figsize=(n_images + 1.5, 4),
                            subplot_kw={'xticks':(), 'yticks':()})
    for image, label, ax in zip(X_people[mask], y_people[mask], axes):
        ax.imshow(image.reshape(image_shape), vmin=0, vmax=1)
        ax.set_title(people.target_names[label].split()[-1])

Out[5]:

Screenshot from 2017-11-15 07-44-20.png

出力が大きいので一部のみ表示しています。小泉は(西洋人が多いこのデータセットの中では)顔立ちの特徴が際立つためか、正しく同じクラスタにまとまっています。

Kmeans

つぎに、KMeansを用いてみます。KMeansは各クラスタのクラスタセンタ(重心)を生成します。

In[6]:

km = KMeans(n_clusters=10, random_state=0)
labels_km = km.fit_predict(X_pca)
print("Cluster size k-means:{}".format(np.bincount(labels_km)))

Out[6]:

Cluster size k-means:[113 256 188 147 216 180 258 211 139 355]

In[7]:

import mglearn
mglearn.plots.plot_kmeans_faces(km, pca, X_pca, X_people, y_people, people.target_names)

Out[7]: download (2).png

一番左の出力がクラスタセンタ、その次の5つが最もクラスタセンタに近い画像、右の5つが最も遠い画像です。視線の方向や顔の角度が影響していることがわかります。
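
参考までに、mglearnを使わずに同様の可視化を行う場合のスケッチを載せます(KMeans.transformで各クラスタセンタまでの距離が得られること、pca.inverse_transformでセンタを画素空間に戻せることを利用した仮の例です)。

# mglearnを使わない場合の可視化スケッチ(仮の例)
distances = km.transform(X_pca)                        # 各サンプルと各クラスタセンタの距離
centers = pca.inverse_transform(km.cluster_centers_)   # クラスタセンタを元の画素空間へ戻す

for cluster in range(km.n_clusters):
    order = np.argsort(distances[:, cluster])
    picked = np.hstack([order[:5], order[-5:]])        # 近い5枚と遠い5枚
    fig, axes = plt.subplots(1, 11, figsize=(15, 2), subplot_kw={'xticks':(), 'yticks':()})
    axes[0].imshow(centers[cluster].reshape(image_shape), vmin=0, vmax=1)
    for ax, idx in zip(axes[1:], picked):
        ax.imshow(X_people[idx].reshape(image_shape), vmin=0, vmax=1)
        ax.set_title(people.target_names[y_people[idx]].split()[-1], fontdict={'fontsize':9})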

AgglomerativeClustering

最後に、AgglomerativeClusteringを試します。

In[8]:

agglomerative = AgglomerativeClustering(n_clusters=10)
labels_agg = agglomerative.fit_predict(X_pca)
print("Cluster size:{}".format(np.bincount(labels_agg)))

Out[8]:

Cluster size:[478 254 317 119  96 191 424  17  55 112]

調整ランドスコア(adjusted rand score)をチェックし、KMeansの結果と比較します。

In[9]:

from sklearn.metrics import adjusted_rand_score
adjusted_rand_score(labels_agg, labels_km)

Out[9]:

0.06975311521947071

0.069とほぼ0に近い値なので、KMeansとはほとんど一致しないクラスタリング結果になったことがわかります。分類された画像を見てみましょう。

In[10]:

n_clusters = 10
for cluster in range(n_clusters):
    mask = labels_agg == cluster
    fig, axes = plt.subplots(1, 10, subplot_kw={'xticks':(), 'yticks':()}, figsize=(15, 8))
    axes[0].set_ylabel(np.sum(mask))
    for image, label, ax in zip(X_people[mask], y_people[mask], axes):
        ax.imshow(image.reshape(image_shape), vmin=0, vmax=1)
        ax.set_title(people.target_names[label].split()[-1], fontdict={'fontsize':9})

Out[10]: Screenshot from 2017-11-15 07-53-26.png
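
クラスタリング結果が正解(人物ラベル)とどの程度一致しているかも、adjusted_rand_scoreで確認できます(以下は仮のスニペットです)。

# 正解ラベル(人物ID)との一致度を確認する(仮の例)
from sklearn.metrics import adjusted_rand_score
print("ARI (k-means vs 正解): {:.3f}".format(adjusted_rand_score(y_people, labels_km)))
print("ARI (agglomerative vs 正解): {:.3f}".format(adjusted_rand_score(y_people, labels_agg)))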

おわりに

ここでは、3つの教師なし学習(クラスタリング)手法で顔画像を分類しました。EDAで教師なし学習を使うと、データについてなんらかの理解を得られる可能性があります。しかし、実運用で教師なし学習を用いる場合、正解ラベルがないため教師あり学習のようなスコアリング(定量的な評価)が難しい点に注意が必要です。また、クラスタリングアルゴリズムによってまったく異なる出力になることにも注意したほうがよいでしょう。

参考

https://github.com/amueller/introduction_to_ml_with_python/blob/master/03-unsupervised-learning.ipynb

Dialogflowでダイアログシステムを簡単作成

ダイアログシステムとは、ユーザの質問に対して回答を行うようなシステムの総称です。ここでは、Dialogflow( https://dialogflow.com/ )という無料のサービスを用いて、リクエスト曲のYoutubeリンクをレスポンスとして返すものを作ります。

登録

Screenshot from 2017-11-13 21-25-58.png

まず、dialogflowにAgentを追加します。Agentとは機能の単位のことです。ここでは、歌や音楽のYoutubeリンクを返してくれる日本語Agentを作成します。

エンティティの作成

Screenshot from 2017-11-13 21-30-15.png

次にエンティティを作成します。エンティティとは、質問や回答に含まれる主に名詞句のことです。artistsエンティティを作成することにより、アーティスト名を利用して回答できるようにします。

当然、このようなエンティティ作成は手間がかかるため、csvをアップロードして登録することもできます。

Screenshot from 2017-11-13 21-58-59.png

intentsの作成

それでは、intentsを作成します。intentsは、ユーザが投げるであろう質問を登録し、その質問に対してどのようなレスポンスを返すのかを定義します。

Screenshot from 2017-11-13 21-52-06.png Screenshot from 2017-11-13 21-52-58.png

目的はdialogflowを試しに使う程度なので、youtube検索結果URLをレスポンスとして投げます。このとき、検索結果URLにはエンティティ情報を使っています。

試す

Try itと書かれた検索バーにリクエストを入力すると、レスポンスを見ることができます。

Screenshot from 2017-11-13 22-05-14.png

Screenshot from 2017-11-13 22-05-36.png

運用する

運用するには、TwitterやGoogle Assistantなどと連携するか、SDKにアクセストークンを渡して呼び出すことで利用できます。

Screenshot from 2017-11-13 22-08-31.png

Screenshot from 2017-11-13 22-08-21.png
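
参考までに、SDKを使わずHTTPで直接呼び出す場合の最小例を載せます(当時のV1 APIを想定した筆者による仮のスケッチで、エンドポイント名やパラメータ名は公式ドキュメントで確認してください。ACCESS_TOKENはエージェントのclient access tokenを想定しています)。

# Dialogflow V1 APIを想定した最小のリクエスト例(仮のスケッチ)
import requests

ACCESS_TOKEN = "YOUR_CLIENT_ACCESS_TOKEN"  # 仮のトークン
res = requests.post(
    "https://api.dialogflow.com/v1/query?v=20150910",
    headers={"Authorization": "Bearer " + ACCESS_TOKEN,
             "Content-Type": "application/json; charset=utf-8"},
    json={"query": "ここにリクエスト文を入れる", "lang": "ja", "sessionId": "test-session"},
)
print(res.json())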

おわりに

まだ詳しく調べれば様々な使い方がありそうですが、tensorflowなどを用いて一からダイアログシステムを構築するよりは簡単そうです。自然言語処理でなかなか成果を挙げられないという人は、このdialogflowを使うことを検討してみてはどうでしょうか。

リンク

https://dialogflow.com/

CapsuleNetをMNISTで試す

CapsuleNetとは、Google BrainのHintonらによって提案されたモデルです。ここでは、kaggleのdigit recognizerのコンペでCapsuleNetを使う方法を紹介します。ただ、参考にしたgithubリンクのソースをほとんど真似ているだけなので、日本語を読むのが面倒くさい人はリンクへ飛んでください。

(Jupyter Notebookで実行)

実装について

CapsuleNetの実装は、参考リンク1のリポジトリ( https://github.com/XifengGuo/CapsNet-Keras/ )に基づいています。

レイヤーを定義する

まず、CapsuleNetのレイヤーと、いくつかの便利な関数を定義します。今のところ、特に最適化はしていません。

In[1]:

import keras.backend as K
import tensorflow as tf
from keras import initializers, layers

class Length(layers.Layer):
    
    def call(self, inputs, **kwargs):
        return K.sqrt(K.sum(K.square(inputs), -1))

    def compute_output_shape(self, input_shape):
        return input_shape[:-1]

class Mask(layers.Layer):

    def call(self, inputs, **kwargs):
        if type(inputs) is list:  
            inputs, mask = inputs
        else:  
            x = inputs
            x = (x - K.max(x, 1, True)) / K.epsilon() + 1
            mask = K.clip(x, 0, 1)  

        inputs_masked = K.batch_dot(inputs, mask, [1, 1])
        return inputs_masked

    def compute_output_shape(self, input_shape):
        if type(input_shape[0]) is tuple:  
            return tuple([None, input_shape[0][-1]])
        else:
            return tuple([None, input_shape[-1]])


def squash(vectors, axis=-1):

    s_squared_norm = K.sum(K.square(vectors), axis, keepdims=True)
    scale = s_squared_norm / (1 + s_squared_norm) / K.sqrt(s_squared_norm)
    return scale * vectors


class CapsuleLayer(layers.Layer):

    def __init__(self, num_capsule, dim_vector, num_routing=3,
                 kernel_initializer='glorot_uniform',
                 bias_initializer='zeros',
                 **kwargs):
        super(CapsuleLayer, self).__init__(**kwargs)
        self.num_capsule = num_capsule
        self.dim_vector = dim_vector
        self.num_routing = num_routing
        self.kernel_initializer = initializers.get(kernel_initializer)
        self.bias_initializer = initializers.get(bias_initializer)

    def build(self, input_shape):
        self.input_num_capsule = input_shape[1]
        self.input_dim_vector = input_shape[2]

        self.W = self.add_weight(shape=[self.input_num_capsule, self.num_capsule, self.input_dim_vector, self.dim_vector],
                                 initializer=self.kernel_initializer,
                                 name='W')

        self.bias = self.add_weight(shape=[1, self.input_num_capsule, self.num_capsule, 1, 1],
                                    initializer=self.bias_initializer,
                                    name='bias',
                                    trainable=False)
        self.built = True

    def call(self, inputs, training=None):

        inputs_expand = K.expand_dims(K.expand_dims(inputs, 2), 2)

        inputs_tiled = K.tile(inputs_expand, [1, 1, self.num_capsule, 1, 1])

        inputs_hat = tf.scan(lambda ac, x: K.batch_dot(x, self.W, [3, 2]),
                             elems=inputs_tiled,
                             initializer=K.zeros([self.input_num_capsule, self.num_capsule, 1, self.dim_vector]))

        for i in range(self.num_routing):
            c = tf.nn.softmax(self.bias, dim=2)
            outputs = squash(K.sum(c * inputs_hat, 1, keepdims=True))

            if i != self.num_routing - 1:
                self.bias += K.sum(inputs_hat * outputs, -1, keepdims=True)
        return K.reshape(outputs, [-1, self.num_capsule, self.dim_vector])

    def compute_output_shape(self, input_shape):
        return tuple([None, self.num_capsule, self.dim_vector])

def PrimaryCap(inputs, dim_vector, n_channels, kernel_size, strides, padding):
    output = layers.Conv2D(filters=dim_vector*n_channels, kernel_size=kernel_size, strides=strides, padding=padding)(inputs)
    outputs = layers.Reshape(target_shape=[-1, dim_vector])(output)
    return layers.Lambda(squash)(outputs)
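
定義したsquash関数の挙動は、例えば次のように簡単に確認できます(ダミー入力を使った仮のスニペットで、出力ベクトルのノルムが1未満に収まることを見るだけのものです)。

# squash関数の簡易動作確認(仮のスニペット)
import numpy as np
v = K.constant(np.random.rand(2, 3, 8))          # (バッチ, カプセル数, 次元)のダミー入力
norms = K.sqrt(K.sum(K.square(squash(v)), -1))   # squash後の各ベクトルのノルム
print(K.eval(norms))                             # すべて1未満になる想定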

モデルをビルドする

次に、モデルをビルドします。以下が参考となるネットワーク構造です。

Screenshot from 2017-11-11 19-57-27.png

注意点としては、X -> y ではなく、(X, y) -> (y, X) という入出力を使っていることです。これはGANではなく、クラス予測と同時に入力画像を再構成するデコーダで、再構成誤差を正則化として利用する仕組みです。
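
後述のtrain関数のコンパイル設定(loss=[margin_loss, 'mse'], loss_weights=[1., 0.0005])から読み取ると、全体の損失はおおよそ次の形です(数式は筆者による整理です)。

L_k = T_k \max(0,\, 0.9 - \lVert v_k \rVert)^2 + 0.5\,(1 - T_k)\,\max(0,\, \lVert v_k \rVert - 0.1)^2

\mathrm{Loss} = \sum_k L_k + 0.0005 \cdot \mathrm{MSE}(X,\, X_{\mathrm{recon}})

ここで T_k はクラス k の正解ラベル(one-hot)、\lVert v_k \rVert は出力カプセル k のベクトル長です。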

In[2]:

from sklearn.model_selection import train_test_split
import pandas as pd
from keras import layers, models, optimizers
from keras import backend as K
from keras.utils import to_categorical
import numpy as np

def CapsNet(input_shape, n_class, num_routing):

    x = layers.Input(shape=input_shape)

    conv1 = layers.Conv2D(filters=256, kernel_size=9, strides=1, padding='valid', activation='relu', name='conv1')(x)
    primarycaps = PrimaryCap(conv1, dim_vector=8, n_channels=32, kernel_size=9, strides=2, padding='valid')
    digitcaps = CapsuleLayer(num_capsule=n_class, dim_vector=16, num_routing=num_routing, name='digitcaps')(primarycaps)
    out_caps = Length(name='out_caps')(digitcaps)


    y = layers.Input(shape=(n_class,))
    masked = Mask()([digitcaps, y])
    x_recon = layers.Dense(512, activation='relu')(masked)
    x_recon = layers.Dense(1024, activation='relu')(x_recon)
    x_recon = layers.Dense(np.prod(input_shape), activation='sigmoid')(x_recon)
    x_recon = layers.Reshape(target_shape=input_shape, name='out_recon')(x_recon)

    return models.Model([x, y], [out_caps, x_recon])


def margin_loss(y_true, y_pred):
    L = y_true * K.square(K.maximum(0., 0.9 - y_pred)) + \
        0.5 * (1 - y_true) * K.square(K.maximum(0., y_pred - 0.1))

    return K.mean(K.sum(L, 1))


def train(model, data, epoch_size=100):

    (x_train, y_train), (x_test, y_test) = data

    model.compile(optimizer="adam",
                  loss=[margin_loss, 'mse'],
                  loss_weights=[1., 0.0005],
                  metrics={'out_caps': 'accuracy'})

    model.fit([x_train, y_train],[y_train, x_train], batch_size=100, epochs=epoch_size,
              validation_data=[[x_test, y_test], [y_test, x_test]])


    return model


def combine_images(generated_images):
    num = generated_images.shape[0]
    width = int(np.sqrt(num))
    height = int(np.ceil(float(num)/width))
    shape = generated_images.shape[1:3]
    image = np.zeros((height*shape[0], width*shape[1]),
                     dtype=generated_images.dtype)
    for index, img in enumerate(generated_images):
        i = int(index/width)
        j = index % width
        image[i*shape[0]:(i+1)*shape[0], j*shape[1]:(j+1)*shape[1]] = \
            img[:, :, 0]
    return image


def test(model, data):
    x_test, y_test = data
    y_pred, x_recon = model.predict([x_test, y_test], batch_size=100)
    print('-'*50)
    print('Test acc:', np.sum(np.argmax(y_pred, 1) == np.argmax(y_test, 1))/y_test.shape[0])

    import matplotlib.pyplot as plt
    from PIL import Image

    img = combine_images(np.concatenate([x_test[:50],x_recon[:50]]))
    image = img * 255
    Image.fromarray(image.astype(np.uint8)).save("real_and_recon.png")
    print()
    print('Reconstructed images are saved to ./real_and_recon.png')
    print('-'*50)
    plt.imshow(plt.imread("real_and_recon.png", ))
    plt.show()


def load_mnist(filename):
    data_train = pd.read_csv(filename)
    X_full = data_train.iloc[:,1:]
    y_full = data_train.iloc[:,:1]
    x_train, x_test, y_train, y_test = train_test_split(X_full, y_full, test_size = 0.3)
    x_train = x_train.values.reshape(-1, 28, 28, 1).astype('float32') / 255.
    x_test = x_test.values.reshape(-1, 28, 28, 1).astype('float32') / 255.
    y_train = to_categorical(y_train.astype('float32'))
    y_test = to_categorical(y_test.astype('float32'))
    return (x_train, y_train), (x_test, y_test)

訓練とテスト

上記のtrain関数を使って訓練します。訓練データはkaggleからダウンロードしたものです。

In[3]:

(x_train, y_train), (x_test, y_test) = load_mnist("../input/train.csv")
    
model = CapsNet(input_shape=[28, 28, 1], n_class=10, num_routing=3)
train(model=model, data=((x_train, y_train), (x_test, y_test)), epoch_size=4)

訓練にかなり時間がかかります。注意してください。次いでテストします。

In[4]:

test(model=model, data=(x_test, y_test))

Out[4]:

--------------------------------------------------
Test acc: 0.99

Reconstructed images are saved to ./real_and_recon.png
--------------------------------------------------

予測する

訓練して精度もそれなりだったので検証としては十分ですが、kaggleなのでsubmissionデータを生成する方法も一応書いておきます。

In[5]:

data_test = pd.read_csv('../input/test.csv')
data_test = data_test.values.reshape(-1, 28, 28, 1).astype('float32') / 255.
y_pred, _ = model.predict([data_test, 
                           np.zeros((data_test.shape[0],10))], 
                           batch_size = 32, verbose = True)

with open('submission.csv', 'w') as out_file:
    out_file.write('ImageId,Label\n')
    for img_id, guess_label in enumerate(np.argmax(y_pred,1),1):
        out_file.write('%d,%d\n' % (img_id, guess_label))

おわりに

CapsuleNetを使ってみる、ということだけが目標でしたが、高い精度が達成できました。訓練に数時間を要しますが、再構成用のデコーダを外せばもっと訓練は速くなると思います。最近のディープラーニング界隈はこういった論文が多いので、それをコード化できるスキルが重要かもしれません。

参考

  1. https://github.com/XifengGuo/CapsNet-Keras/
  2. https://arxiv.org/pdf/1710.09829.pdf

デキるアメリカのデータサイエンティストはKaggleのアンケートになんと答えたか

以前、どんなデータサイエンティストの給与が高いのかを分析しました。 ( https://qiita.com/sugiyamath/items/37582d09227afbd0098b ) しかし、結果としては「インド人が貧乏で、アメリカ人がリッチなんだろ」ということぐらいしか見せませんでした。そこで、今回は国をアメリカに限定し、さらに特徴量(アンケートの質問と回答)を300ほど全て見せます。

特徴量選択してロジスティック回帰を実行

前回とやってることはほとんど一緒なので、データのロードに関しての説明などは省きます。

In[1]:

import pandas as pd
import numpy as np
import re

df = pd.read_csv("multipleChoiceResponses.csv",encoding = "ISO-8859-1")
rates = pd.read_csv("conversionRates.csv", encoding="ISO-8859-1")

df = df[df["CompensationAmount"].notnull()]
df = df[df["CompensationCurrency"].notnull()]
df = df[df["CompensationAmount"].ne("-")]
df = df[df["CompensationCurrency"] == "USD"]

origins = rates["originCountry"].tolist()
exchangeRates = rates["exchangeRate"].tolist()
rate_dict = {}
for origin, exchangeRate in zip(origins,exchangeRates):
    rate_dict[origin] = exchangeRate

df = df[df["CompensationCurrency"].isin(rate_dict.keys())]

CompensationUSD = []
currencies = df["CompensationCurrency"].tolist()
amounts = df["CompensationAmount"].tolist()
for currency, amount in zip(currencies,amounts):
    tmp = re.sub(",","",amount)
    CompensationUSD.append(float(tmp)*float(rate_dict[currency]))
df["CompensationUSD"] = CompensationUSD
df = df[df["CompensationUSD"].notnull()]

df = df.drop('CompensationAmount', 1)
df = df.drop('CompensationCurrency', 1)
df = df.drop('Country', 1)

y = df["CompensationUSD"]
X = df.drop("CompensationUSD", 1)
X = X.fillna(0)
X_dummied = pd.get_dummies(X)
y_binary = y > np.median(y)

今回は、50000ドルではなく、アメリカのデータサイエンティストの給与の中央値を境に目的変数を2値化しました。

In[2]:

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_dummied, y_binary, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

out = pd.DataFrame()
out["feature_name"] = X_dummied.columns.tolist()
out["feature_importance"] = clf.feature_importances_

top_306_features = out.sort_values("feature_importance", ascending=False)[0:306]
X_selected = X_dummied[top_306_features["feature_name"]]


X_train, X_test, y_train, y_test = train_test_split(X_selected, y_binary, random_state=0)
lr = LogisticRegression().fit(X_train, y_train)

print("score:{}".format(lr.score(X_test, y_test)))

Out[2]:

score:0.7090909090909091

国籍に関する特徴量を消したため、モデルの精度は下がりました。しかし、これによってアメリカのデータサイエンティストに限定して重要な回答項目を特定できます。

In[3]:

out2 = pd.DataFrame()
out2["feature_name"] = X_selected.columns.tolist()
out2["coef"] = lr.coef_[0].tolist()
out2.to_csv("result.csv", index=False)

Out[3]:

feature_name coef
0 Age 0.068479
1 Tenure_More than 10 years 1.697553
2 CurrentEmployerType_Employed by college or uni... -2.098635
3 DataScienceIdentitySelect_0 0.281331
4 LearningCategoryOnlineCourses -0.047623
5 TimeVisualizing -0.015580
6 TimeModelBuilding -0.026931
7 Tenure_1 to 2 years -0.803940
8 LearningCategorySelftTaught -0.041710
9 TimeGatheringData -0.013498
10 LearningCategoryWork -0.038596
11 TimeFindingInsights 0.006048
12 Tenure_3 to 5 years -0.321073
13 FormalEducation_Doctoral degree 0.724343
14 CurrentJobTitleSelect_Data Scientist 0.281331
15 TimeProduction -0.021605
16 CurrentJobTitleSelect_Data Analyst -1.449301
17 LearningCategoryUniversity -0.033854
18 WorkToolsFrequencyAWS_0 0.387572
19 WorkToolsFrequencyHadoop_0 -0.321495
20 LearningCategoryKaggle -0.066354
21 EmployerIndustry_Academic -0.037940
22 EmployerIndustry_Technology 0.587736
23 CurrentEmployerType_Employed by a company that... 0.463840
24 DataScienceIdentitySelect_No -0.314967
25 WorkMethodsFrequencyEnsembleMethods_0 -0.510260
26 EmployerMLTime_Don\'t know -0.703553
27 WorkMethodsFrequencyTimeSeriesAnalysis_0 -0.224616
28 Tenure_Less than a year -0.435826
29 LearningPlatformUsefulnessCollege_0 0.281135
30 AlgorithmUnderstandingLevel_Enough to run the ... -0.861677
31 WorkChallengeFrequencyExpectations_0 -0.375583
32 WorkDataStorage_Flat files not in a database o... -0.386620
33 EmploymentStatus_Employed full-time 0.578870
34 WorkToolsFrequencyAWS_Often 1.352131
35 WorkMethodsFrequencyLiftAnalysis_0 -0.491133
36 EmployerSizeChange_Increased significantly 0.557436
37 JobFunctionSelect_Build and/or run a machine l... 0.340561
38 WorkDataVisualizations_76-99% of projects 0.373011
39 WorkDatasetSize_0 -0.746442
40 WorkDataStorage_Row-oriented relational (e.g. ... -0.235493
41 SalaryChange_Has increased 20% or more 0.772570
42 WorkToolsFrequencyPython_0 -0.112314
43 FormalEducation_Bachelor\'s degree -0.350542
44 WorkMethodsFrequencyA/B_0 -0.352774
45 LearningPlatformUsefulnessCollege_Somewhat useful -0.956894
46 WorkMethodsFrequencySimulation_0 -0.220552
47 WorkMethodsFrequencyGBM_0 -0.255413
48 WorkChallengeFrequencyDirtyData_0 -0.014298
49 WorkToolsFrequencySQL_0 -0.124224
50 LanguageRecommendationSelect_Python 0.782124
51 WorkCodeSharing_Git 0.160173
52 BlogsPodcastsNewslettersSelect_0 0.637040
53 WorkMethodsFrequencyPrescriptiveModeling_0 -0.345483
54 WorkMethodsFrequencyLogisticRegression_Sometimes 0.121432
55 WorkToolsFrequencySpark_0 0.197360
56 WorkInternalVsExternalTools_More internal than... -0.042683
57 WorkDataStorage_Flat files not in a database o... -0.372058
58 WorkMethodsFrequencyCross-Validation_Most of t... 0.104426
59 EmployerSize_10,000 or more employees 0.149857
60 LearningPlatformUsefulnessProjects_0 -0.058177
61 WorkChallengeFrequencyPolitics_0 0.343984
62 FormalEducation_Master\'s degree 0.067436
63 WorkMethodsFrequencyTextAnalysis_Sometimes 0.317632
64 WorkDatasetSize_100GB 0.074480
65 Tenure_6 to 10 years 0.290472
66 LearningPlatformUsefulnessKaggle_Very useful 0.044072
67 WorkMethodsFrequencyRecommenderSystems_0 0.387902
68 LearningPlatformUsefulnessConferences_0 -0.124684
69 EmployerSearchMethod_An external recruiter or ... 0.868228
70 MajorSelect_Physics 0.296851
71 EmployerSizeChange_Increased slightly 0.722816
72 EmployerMLTime_0 0.435451
73 WorkToolsFrequencyNoSQL_0 0.781856
74 ParentsEducation_A master\'s degree -0.348896
75 LearningPlatformUsefulnessCompany_0 -0.470440
76 WorkToolsFrequencyUnix_0 -0.373172
77 WorkMethodsFrequencyCross-Validation_0 0.111001
78 WorkMethodsFrequencyDecisionTrees_0 -0.159270
79 WorkDataTypeSelect_Text data -0.189846
80 LearningPlatformUsefulnessYouTube_0 0.258828
81 LearningPlatformUsefulnessCourses_0 -0.099933
82 WorkMethodsFrequencyPCA_0 -0.352539
83 WorkDataTypeSelect_Relational data -0.166469
84 WorkMethodsFrequencyDataVisualization_0 -0.781894
85 CurrentEmployerType_Employed by company that m... 0.396833
86 WorkChallengeFrequencyDirtyData_Most of the time 0.450069
87 WorkToolsFrequencyR_Most of the time 0.645424
88 WorkMethodsFrequencyNLP_0 0.276013
89 AlgorithmUnderstandingLevel_Enough to explain ... -0.178403
90 MajorSelect_Computer Science -0.256908
91 EmployerMLTime_3-5 years -0.304332
92 DataScienceIdentitySelect_Yes -0.219098
93 WorkToolsFrequencyAWS_Most of the time 0.486646
94 WorkToolsFrequencySQL_Most of the time -0.207669
95 EmployerSearchMethod_A friend, family member, ... -0.391141
96 WorkToolsFrequencyJupyter_0 -0.211600
97 LanguageRecommendationSelect_R -0.218728
98 WorkDatasetSize_10GB -0.422807
99 WorkToolsFrequencyPython_Most of the time 0.129843
100 LearningCategoryOther -0.013314
101 CurrentEmployerType_Employed by professional s... -0.429094
102 UniversityImportance_Very important -0.249939
103 LearningPlatformUsefulnessProjects_Very useful 0.253083
104 SalaryChange_I was not employed 3 years ago -0.017191
105 EmploymentStatus_Employed part-time -0.448167
106 TitleFit_Fine 0.437946
107 LearningPlatformUsefulnessTextbook_Very useful 0.288480
108 LearningPlatformUsefulnessYouTube_Somewhat useful -0.737736
109 WorkMethodsFrequencyLogisticRegression_Often -0.211330
110 WorkMethodsFrequencyKNN_0 0.482404
111 WorkDataVisualizations_10-25% of projects -0.294009
112 LearningPlatformUsefulnessBlogs_0 -0.183831
113 EmployerSize_I don\'t know -0.126138
114 WorkChallengeFrequencyTalent_0 -0.104591
115 WorkProductionFrequency_Most of the time -0.066297
116 WorkMethodsFrequencyA/B_Sometimes 0.267184
117 WorkToolsFrequencyTensorFlow_0 -0.429121
118 SalaryChange_Has stayed about the same (has no... 0.285703
119 SalaryChange_Has increased between 6% and 19% 0.803690
120 WorkMethodsFrequencyTimeSeriesAnalysis_Often 0.721306
121 WorkMethodsFrequencyTextAnalysis_0 0.124592
122 MLToolNextYearSelect_TensorFlow -0.402375
123 MLMethodNextYearSelect_Deep learning -0.013382
124 LearningPlatformUsefulnessCourses_Somewhat useful 0.727625
125 WorkMethodsFrequencyLogisticRegression_0 -0.148098
126 EmploymentStatus_Independent contractor, freel... -0.133118
127 WorkToolsFrequencyMATLAB_0 0.022003
128 WorkMethodsFrequencyRandomForests_0 0.146480
129 MajorSelect_Information technology, networking... -1.550536
130 JobFunctionSelect_Analyze and understand data ... 0.025936
131 LearningPlatformUsefulnessTextbook_0 0.232973
132 CurrentEmployerType_Employed by a company that... 0.366073
133 MLToolNextYearSelect_I don\'t plan on learning ... 0.006818
134 LearningPlatformUsefulnessSO_Somewhat useful 0.355973
135 AlgorithmUnderstandingLevel_Enough to tune the... -1.226598
136 RemoteWork_Never 0.249343
137 TimeOtherSelect -0.008814
138 WorkChallengeFrequencyDataAccess_Sometimes 0.699259
139 WorkMethodsFrequencyNaiveBayes_0 -0.521839
140 WorkToolsFrequencyJupyter_Most of the time -0.053016
141 LearningPlatformUsefulnessBlogs_Somewhat useful 0.130403
142 WorkMethodsFrequencyRecommenderSystems_Sometimes 0.736061
143 WorkDatasetSize_10MB -1.274020
144 WorkChallengeFrequencyClarity_Often 0.552631
145 LearningPlatformUsefulnessKaggle_0 0.017762
146 WorkChallengeFrequencyPrivacy_0 0.755218
147 EmployerIndustry_Internet-based 0.659964
148 RemoteWork_Sometimes 0.350711
149 WorkMethodsFrequencyDataVisualization_Most of ... -0.782662
150 LearningPlatformUsefulnessArxiv_Very useful 0.103936
151 WorkChallengeFrequencyDataAccess_0 0.192623
152 WorkChallengeFrequencyUnusedResults_0 -0.225472
153 EmployerSearchMethod_I was contacted directly ... -0.260579
154 GenderSelect_Female -0.946012
155 JobSatisfaction_10 - Highly Satisfied 0.059784
156 WorkDataSharing_Share Drive/SharePoint 0.237191
157 WorkMLTeamSeatSelect_Standalone Team -0.163679
158 WorkChallengeFrequencyTools_0 0.453525
159 WorkMethodsFrequencySegmentation_0 0.234805
160 WorkToolsFrequencyUnix_Most of the time -0.412597
161 WorkToolsFrequencyR_Sometimes 0.534221
162 WorkMethodsFrequencyDecisionTrees_Often 0.259252
163 WorkDatasetsChallenge_0 0.231599
164 LearningPlatformUsefulnessYouTube_Very useful -0.088507
165 WorkMethodsFrequencyEnsembleMethods_Sometimes 0.282194
166 EmployerMLTime_More than 10 years -0.086353
167 MLToolNextYearSelect_Python -0.148804
168 WorkDatasetSize_1GB -0.126173
169 WorkHardwareSelect_Traditional Workstation 0.917003
170 WorkMethodsFrequencySVMs_0 0.053407
171 CurrentJobTitleSelect_Other 0.660138
172 WorkProductionFrequency_Sometimes 0.262847
173 ParentsEducation_A bachelor\'s degree 0.209113
174 LearningPlatformUsefulnessArxiv_0 -0.001661
175 WorkMethodsFrequencyDecisionTrees_Most of the ... 0.059537
176 ParentsEducation_A doctoral degree 0.015415
177 WorkToolsFrequencyR_0 0.097373
178 WorkDatasets_0 0.008264
179 LearningPlatformUsefulnessBlogs_Very useful 0.026194
180 WorkMethodsFrequencyDataVisualization_Often -0.788286
181 WorkChallengeFrequencyHiringFunds_0 0.087427
182 FirstTrainingSelect_Online courses (coursera, ... 0.261640
183 WorkToolsFrequencyCloudera_0 -0.246073
184 WorkChallengeFrequencyIntegration_0 -0.441233
185 WorkChallengeFrequencyUnusefulInstrumenting_0 0.084732
186 WorkMethodsFrequencyBayesian_0 0.499167
187 WorkChallengeFrequencyEnvironments_0 0.405823
188 UniversityImportance_Somewhat important -0.274890
189 LearningPlatformUsefulnessSO_Very useful 0.402665
190 WorkMethodsFrequencyNeuralNetworks_0 -0.301835
191 JobSatisfaction_5 0.072469
192 WorkMLTeamSeatSelect_Other -0.231015
193 AlgorithmUnderstandingLevel_Enough to code it ... -0.093961
194 WorkChallengeFrequencyUnusedResults_Often 0.078110
195 WorkDataVisualizations_51-75% of projects 0.190737
196 MLMethodNextYearSelect_Neural Nets -0.024591
197 LearningPlatformUsefulnessDocumentation_0 -0.047319
198 LearningPlatformUsefulnessTextbook_Somewhat us... 0.087347
199 WorkChallengeFrequencyExplaining_0 0.180495
200 WorkDataVisualizations_100% of projects -0.184048
201 LearningPlatformUsefulnessFriends_0 0.145526
202 LearningPlatformUsefulnessCollege_Very useful 0.156822
203 WorkToolsFrequencySQL_Often 0.066856
204 LearningPlatformUsefulnessCourses_Very useful -0.147143
205 FirstTrainingSelect_University courses -0.118731
206 WorkToolsFrequencyHadoop_Often 0.796860
207 WorkMethodsFrequencyRandomForests_Sometimes -0.519073
208 WorkMethodsFrequencyAssociationRules_0 0.644449
209 MajorSelect_Mathematics or statistics -0.056290
210 WorkMethodsFrequencyKNN_Sometimes 0.563291
211 MajorSelect_Engineering (non-computer focused) 0.073537
212 CurrentJobTitleSelect_Scientist/Researcher 0.107928
213 WorkDataTypeSelect_Text data,Relational data -0.464725
214 LearningPlatformUsefulnessSO_0 0.038219
215 WorkChallengeFrequencyTalent_Often 0.584796
216 MLTechniquesSelect_0 0.083263
217 EmployerSizeChange_0 -0.002907
218 WorkMethodsFrequencySimulation_Often -0.093468
219 WorkMethodsFrequencyPCA_Sometimes -0.385294
220 RemoteWork_Rarely -0.033807
221 WorkToolsFrequencyJupyter_Often -0.596180
222 PublicDatasetsSelect_Dataset aggregator/platfo... 0.244097
223 WorkChallengeFrequencyClarity_Most of the time 0.190254
224 WorkDataStorage_Flat files not in a database o... 1.192469
225 WorkToolsFrequencyGCP_0 0.057922
226 EmployerSizeChange_Stayed the same 0.467980
227 WorkHardwareSelect_Laptop + Cloud service (AWS... 0.552663
228 WorkCodeSharing_Other -0.081971
229 WorkDataSharing_Email 0.129658
230 JobSatisfaction_9 -0.200769
231 WorkMethodsFrequencyRandomForests_Often -0.297451
232 PastJobTitlesSelect_Data Analyst -1.639222
233 WorkMLTeamSeatSelect_IT Department -0.633447
234 JobSatisfaction_8 -0.270292
235 MLSkillsSelect_Supervised Machine Learning (Ta... 0.386059
236 MLMethodNextYearSelect_Other 0.046018
237 TitleFit_Poorly 1.026585
238 WorkToolsFrequencyMATLAB_Often -1.487252
239 WorkToolsFrequencyJava_0 -0.438842
240 EmployerMLTime_1-2 years -0.310438
241 WorkChallengeFrequencyTalent_Most of the time 0.146010
242 WorkMethodsFrequencyCross-Validation_Often 0.104158
243 MLMethodNextYearSelect_Time Series Analysis 0.094379
244 WorkChallengeFrequencyPrivacy_Often 0.919702
245 JobSatisfaction_7 -0.214038
246 WorkChallengeFrequencyExplaining_Sometimes -0.026764
247 WorkToolsFrequencyPython_Often 0.087898
248 WorkChallengeFrequencyDomainExpertise_0 0.606880
249 WorkToolsFrequencyHadoop_Most of the time -0.023946
250 LearningPlatformUsefulnessTutoring_0 0.161013
251 WorkChallengeFrequencyScaling_Sometimes 1.045299
252 WorkChallengeFrequencyDomainExpertise_Often 0.357464
253 EmployerSize_1,000 to 4,999 employees -0.041616
254 WorkProductionFrequency_Rarely 0.129231
255 UniversityImportance_Important -0.286191
256 GenderSelect_Male -0.384720
257 DataScienceIdentitySelect_Sort of (Explain more) 0.250319
258 WorkToolsFrequencyNoSQL_Sometimes 0.235315
259 EmployerMLTime_6-10 years 0.251603
260 EmployerSize_100 to 499 employees 0.236422
261 WorkMLTeamSeatSelect_Business Department 0.005845
262 WorkToolsFrequencyNoSQL_Often -0.680183
263 TitleFit_Perfectly 0.153481
264 CurrentJobTitleSelect_Software Developer/Softw... 0.206485
265 MLToolNextYearSelect_Other -0.101759
266 LearningPlatformUsefulnessNewsletters_0 0.905325
267 WorkMethodsFrequencyGBM_Most of the time -0.139914
268 WorkMethodsFrequencyCNNs_0 -0.011438
269 WorkMethodsFrequencyTimeSeriesAnalysis_Most of... -0.300182
270 WorkHardwareSelect_Laptop or Workstation and p... 0.242171
271 WorkChallengeFrequencyDataAccess_Most of the time 0.189199
272 WorkMethodsFrequencyEnsembleMethods_Often -0.237337
273 AlgorithmUnderstandingLevel_Enough to refine a... -0.051625
274 WorkChallengeFrequencyExplaining_Often -0.297820
275 WorkMethodsFrequencyDecisionTrees_Sometimes -0.343836
276 WorkToolsFrequencySpark_Sometimes -0.019461
277 WorkToolsFrequencyC_0 0.188561
278 MajorSelect_0 -0.463205
279 EmployerSearchMethod_0 -0.449360
280 WorkToolsFrequencyPython_Sometimes 0.257468
281 WorkChallengeFrequencyClarity_Sometimes -0.137937
282 WorkHardwareSelect_Basic laptop (Macbook),Lapt... 1.040318
283 WorkToolsFrequencyJupyter_Sometimes -0.709074
284 WorkMethodsFrequencyEnsembleMethods_Most of th... -0.367561
285 JobFunctionSelect_Other 0.113986
286 WorkChallengeFrequencyDirtyData_Sometimes -0.377112
287 WorkInternalVsExternalTools_Approximately half... -0.106135
288 WorkToolsFrequencyAWS_Sometimes 0.651202
289 ParentsEducation_High school -0.103820
290 WorkToolsFrequencyExcel_0 0.046580
291 WorkDatasetSize_100MB -0.265782
292 LearningPlatformUsefulnessConferences_Very useful 0.017170
293 RemoteWork_Always 0.341071
294 WorkHardwareSelect_Basic laptop (Macbook) 0.360181
295 WorkChallengeFrequencyPolitics_Sometimes 0.565232
296 WorkMethodsFrequencyBayesian_Sometimes 0.188945
297 WorkMethodsFrequencyPrescriptiveModeling_Often 0.746246
298 LearningPlatformUsefulnessKaggle_Somewhat useful -0.348093
299 WorkMethodsFrequencyNaiveBayes_Often 0.003339
300 WorkChallengeFrequencyHiringFunds_Most of the ... 0.255938
301 FirstTrainingSelect_Self-taught 0.000264
302 WorkChallengeFrequencyClarity_0 0.300438
303 LearningPlatformUsefulnessCompany_Very useful -0.671252
304 WorkToolsFrequencyTableau_0 -0.270873
305 WorkDataVisualizations_26-50% of projects -0.536218

上記のテーブルが、アンケートの回答項目とロジスティック回帰の係数の関係です。ブラウザによっては、テーブルを右にスクロールしないと係数が見えないので注意してください。ほとんどの項目は0/1のダミー変数なので、係数が正に大きいほど「給与が中央値を超える」方向に、負に大きいほどその逆方向に効く項目です(年齢だけは連続値なので係数のスケールが他と異なります)。また、テーブルはランダムフォレストの特徴量重要度の降順に並んでいるため、上の行ほど情報量の多い項目です(年齢が最も重要)。
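
係数の大きい項目・小さい項目だけを確認したい場合は、例えば次のようにresult.csvを並べ替えられます(仮のスニペットです)。

# result.csvを係数で並べ替えて上位・下位を見る(仮の例)
import pandas as pd
res = pd.read_csv("result.csv")
print(res.sort_values("coef", ascending=False).head(10))   # 正の係数が大きい項目
print(res.sort_values("coef").head(10))                    # 負の係数が大きい項目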

回答項目の説明

回答項目の意味を知りたい場合は、以下のスキーマを見てください。 https://www.kaggle.com/kaggle/kaggle-survey-2017/downloads/schema.csv

汎化性能

ROCをプロットしたので、参考にどうぞ。 download (3).png
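
ROC曲線は、おおよそ次のようなコードで描けます(実際に使ったコードとは細部が異なるかもしれない仮のスケッチです)。

# ROC曲線を描くスケッチ(仮)
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

probs = lr.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
plt.plot(fpr, tpr, label="AUC={:.3f}".format(auc(fpr, tpr)))
plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()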

リンク

https://www.kaggle.com/kaggle/kaggle-survey-2017/