Nerd Sentai Dataman

Fighting evil with data science

Training a DOM-based content extraction model on data from a vision-based model

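A DOM-based content extraction model is a content extraction method that uses the structure of the DOM as features. In this post, I take the features used by a tool called web2text and feed them to a RandomForest classifier.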
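To make "using the structure of the DOM as features" concrete, here is a minimal sketch that turns one XPath into a few structural values. The function name `xpath_structure` and the particular features it returns are assumptions chosen for illustration; they are not the web2text feature set.

```python
import re

# Positional indices such as [3] are removed so that sibling nodes share one path.
INDEX_RE = re.compile(r"\[[0-9]+\]")


def xpath_structure(xpath):
    """Return simple structural features for one text node's XPath (illustrative only)."""
    steps = INDEX_RE.sub("", xpath).strip("/").split("/")
    return {
        "leaf_tag": steps[-1],                               # tag holding the text
        "parent_tag": steps[-2] if len(steps) > 1 else None,  # its direct parent
        "depth": len(steps),                                 # number of steps from the root
        "inside_nav": "nav" in steps,                        # any <nav> ancestor?
    }


features = xpath_structure("/html/body/header/div[4]/div/div/ul[1]/li[3]/nav/ul/li[1]/a")
print(features)
```

Values like these are what a tree-based classifier can split on: a link deep inside a `nav` element is much less likely to be article text than a `p` directly under the article container.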

Feature list

[Figure: table of the web2text features (Screenshot_2018-10-12_20-54-56.png)]

I use a subset of the features above.

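Data preparation

From the HTML files fetched from a list of article URLs, I extract the following:

  1. The text of each node that has a text element
  2. The XPath of each node that has a text element
  3. Whether that text element is content we want to extract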
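The first two items can be sketched with the standard library alone. This is a sketch under assumptions: the input is a small well-formed snippet (real pages would need a lenient HTML parser), `iter_text_nodes` is a hypothetical helper, and positional indices are emitted only when a tag has several same-named siblings, mirroring the paths in the CSV sample. The label in item 3 comes from a separate source and is not computed here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SAMPLE = (
    "<html><body>"
    "<header><a>Home</a><a>Login</a></header>"
    "<div><h1>Title</h1><p>Body text.</p></div>"
    "</body></html>"
)


def iter_text_nodes(elem, path):
    """Yield (text, xpath) for every non-blank text node, in document order."""
    if elem.text and elem.text.strip():
        yield elem.text.strip(), path
    totals = Counter(child.tag for child in elem)  # siblings per tag name
    seen = Counter()
    for child in elem:
        seen[child.tag] += 1
        step = child.tag
        if totals[child.tag] > 1:  # index only when the tag is ambiguous
            step += "[%d]" % seen[child.tag]
        yield from iter_text_nodes(child, path + "/" + step)
        # Text that follows a child element belongs to the current element.
        if child.tail and child.tail.strip():
            yield child.tail.strip(), path


root = ET.fromstring(SAMPLE)
rows = list(iter_text_nodes(root, "/" + root.tag))
for text, xpath in rows:
    print(text, xpath)
```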

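Below is an example of the CSV. (Note that this CSV was generated from the Pascal VOC data of the previous article, so the rows labeled True also include a small amount of content that we do not actually want to extract.)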

#text,label,xpath
"We use cookies to ensure that we give you the best experience on our website and to ensure we show advertising that is relevant to you. By continuing to use our website, you agree to our use of such cookies. You can change this and find out more by following: ",False,/html/body/div[1]/div/p
 クッキーポリシー ,False,/html/body/div[1]/div/p/a
継続,False,/html/body/div[1]/div/div/div
" $(document).ready(function(){ $('#cookiesLegal .continue').on('click', function(){ $('#cookiesLegal').remove(); $('body').removeClass('cookies_on'); //alert('off'); }) }); ",False,/html/body/div[1]/script
 日本語 ,False,/html/body/header/div[1]/div/div[2]/div/a
English,False,/html/body/header/div[1]/div/div[2]/div/ul/li[1]/a
Español,False,/html/body/header/div[1]/div/div[2]/div/ul/li[2]/a
Italiano,False,/html/body/header/div[1]/div/div[2]/div/ul/li[3]/a
Français,False,/html/body/header/div[1]/div/div[2]/div/ul/li[4]/a
日本語,False,/html/body/header/div[1]/div/div[2]/div/ul/li[5]/a
Deutsch,False,/html/body/header/div[1]/div/div[2]/div/ul/li[6]/a
ログイン,False,/html/body/header/div[1]/div/div[2]/span/a[1]/span
登録,False,/html/body/header/div[1]/div/div[2]/span/a[2]
ログイン,False,/html/body/header/div[1]/div/div[3]/div/a
初めてのアクセスですか?,False,/html/body/header/div[1]/div/div[3]/div/p/span
登録,False,/html/body/header/div[1]/div/div[3]/div/p/a
 Tickets purchase ,False,/html/body/header/div[3]/div/div[1]
 VideoPass purchase ,False,/html/body/header/div[3]/div/div[2]
ホーム,False,/html/body/header/div[4]/div/div/ul[1]/li[1]/a
Live,False,/html/body/header/div[4]/div/div/ul[1]/li[2]/a
ビデオ,False,/html/body/header/div[4]/div/div/ul[1]/li[3]/a/span
ベスト・オブ,False,/html/body/header/div[4]/div/div/ul[1]/li[3]/nav/ul/li[1]/a
Live,False,/html/body/header/div[4]/div/div/ul[1]/li[3]/nav/ul/li[2]/a
2018 年シーズン,False,/html/body/header/div[4]/div/div/ul[1]/li[3]/nav/ul/li[3]/a
スポイラー,False,/html/body/header/div[4]/div/div/ul[1]/li[3]/nav/ul/li[4]/a
ショー,False,/html/body/header/div[4]/div/div/ul[1]/li[3]/nav/ul/li[5]/a
過去のシーズン,False,/html/body/header/div[4]/div/div/ul[1]/li[3]/nav/ul/li[6]/a
オールビデオ,False,/html/body/header/div[4]/div/div/ul[1]/li[3]/nav/ul/li[7]/a
フォトギャラリー,False,/html/body/header/div[4]/div/div/ul[1]/li[4]/a/span
ベスト・オブ,False,/html/body/header/div[4]/div/div/ul[1]/li[4]/nav/ul/li[1]/a
Grand Prix,False,/html/body/header/div[4]/div/div/ul[1]/li[4]/nav/ul/li[2]/a
ライダー,False,/html/body/header/div[4]/div/div/ul[1]/li[4]/nav/ul/li[3]/a
チーム,False,/html/body/header/div[4]/div/div/ul[1]/li[4]/nav/ul/li[4]/a
リザルト,False,/html/body/header/div[4]/div/div/ul[1]/li[5]/a
カレンダー,False,/html/body/header/div[4]/div/div/ul[1]/li[6]/a
インサイド ,False,/html/body/header/div[4]/div/div/ul[1]/li[7]/a/span
チーム&ライダー,False,/html/body/header/div[4]/div/div/ul[1]/li[7]/nav/ul[1]/li/a
MotoGP VIP Village™,False,/html/body/header/div[4]/div/div/ul[1]/li[7]/nav/ul[2]/li[1]/a
スポンサー,False,/html/body/header/div[4]/div/div/ul[1]/li[7]/nav/ul[2]/li[2]/a
MotoGP Buzz,False,/html/body/header/div[4]/div/div/ul[1]/li[7]/nav/ul[2]/li[3]/a
Red Bull MotoGP™ Rookies Cup,False,/html/body/header/div[4]/div/div/ul[1]/li[7]/nav/ul[2]/li[4]/a
FIM Enel MotoE™ World Cup,False,/html/body/header/div[4]/div/div/ul[1]/li[7]/nav/ul[2]/li[5]/a
Two Wheels For Life,False,/html/body/header/div[4]/div/div/ul[1]/li[7]/nav/ul[2]/li[6]/a
MotoGP™リーグ,False,/html/body/header/div[4]/div/div/ul[1]/li[7]/nav/ul[3]/li[1]/a
ビデオゲーム,False,/html/body/header/div[4]/div/div/ul[1]/li[7]/nav/ul[3]/li[2]/a
eSport,False,/html/body/header/div[4]/div/div/ul[1]/li[7]/nav/ul[3]/li[3]/a
タイミングパス,False,/html/body/header/div[4]/div/div/div[2]/a[1]
ビデオパス,False,/html/body/header/div[4]/div/div/div[2]/a[2]
チケット,False,/html/body/header/div[4]/div/div/ul[2]/li[1]/a
アプリ,False,/html/body/header/div[4]/div/div/ul[2]/li[2]/a
オンラインショップ,False,/html/body/header/div[4]/div/div/ul[2]/li[3]/a
よくある質問,False,/html/body/header/div[4]/div/div/ul[2]/li[4]/a
Contact,False,/html/body/header/div[4]/div/div/ul[2]/li[5]/a
ビデオパス,False,/html/body/header/div[4]/div/div/ul[2]/li[6]/a
News ,False,/html/body/div[4]/div/div/div[3]/div[2]/div[1]/div[1]
14 days 前,False,/html/body/div[4]/div/div/div[3]/div[2]/div[1]/div[2]/div
Author,False,/html/body/div[4]/div/div/div[3]/div[2]/dl/dt[1]
motogp.com,False,/html/body/div[4]/div/div/div[3]/div[2]/dl/dd[1]
Published,False,/html/body/div[4]/div/div/div[3]/div[2]/dl/dt[2]
14 days ago,False,/html/body/div[4]/div/div/div[3]/div[2]/dl/dd[2]/span
By,False,/html/body/div[4]/div/div/div[3]/div[2]/div[2]/div[1]/span
 motogp.com,False,/html/body/div[4]/div/div/div[3]/div[2]/div[2]/div[1]
ユーロスポーツがオランダでの中継を継続,True,/html/body/div[4]/div/div/div[3]/div[3]/h1
フランス北部とベルギー西部のフランダース地域にも21年まで生中継を提供。,True,/html/body/div[4]/div/div/div[3]/div[3]/h2
Tags ,True,/html/body/div[4]/div/div/div[3]/div[3]/div[1]
MotoGP,True,/html/body/div[4]/div/div/div[3]/div[3]/div[1]/a[1]
2018,True,/html/body/div[4]/div/div/div[3]/div[3]/div[1]/a[2]
ドルナスポーツは19日、ユーロスポーツとの間で、オランダとフランダース地域におけるテレビ放送に関して、21年まで契約を延長することで合意。3クラスのフリー走行1から決勝レースまでの中継に加え、パソコンやスマートフォン、タブレッド向けのサービスを展開するヨーロスポーツプレイヤーでは、ライブタイミングやライブトラッキングも配信する。,True,/html/body/div[4]/div/div/div[3]/div[3]/div[2]/div/p
推奨記事,False,/html/body/div[4]/div/div/div[4]/h2
MotoGP,False,/html/body/div[4]/div/div/div[4]/div/article[1]/header/div/span/a
14 hours ago ,False,/html/body/div[4]/div/div/div[4]/div/article[1]/div/div[2]/p
Jump onboard with Marc Márquez as the #MotoGP Champ takes,False,/html/body/div[4]/div/div/div[4]/div/article[1]/div/a/h2
Jump onboard with Marc Márquez as the #MotoGP Champ takes a tuk tuk for a spin around the streets of Bangkok! ,False,/html/body/div[4]/div/div/div[4]/div/article[1]/div/p/a
Thailand,False,/html/body/div[4]/div/div/div[4]/div/article[1]/footer/div/div[2]/section/dl/dd/a[1]
" , ",False,/html/body/div[4]/div/div/div[4]/div/article[1]/footer/div/div[2]/section/dl/dd
Marc Marquez,False,/html/body/div[4]/div/div/div[4]/div/article[1]/footer/div/div[2]/section/dl/dd/a[2]
123,False,/html/body/div[4]/div/div/div[4]/div/article[1]/footer/div/div[3]/div[1]/span
 motogp.com ,False,/html/body/div[4]/div/div/div[4]/div/article[2]/header/div/span
1 week ago ,False,/html/body/div[4]/div/div/div[4]/div/article[2]/div/div/p
360度パノラマ映像~ピットレーンのセレブレーション,False,/html/body/div[4]/div/div/div[4]/div/article[2]/div/a[2]/h2
Aragon,False,/html/body/div[4]/div/div/div[4]/div/article[2]/footer/dl/dd/a[1]
" , ",False,/html/body/div[4]/div/div/div[4]/div/article[2]/footer/dl/dd
360,False,/html/body/div[4]/div/div/div[4]/div/article[2]/footer/dl/dd/a[2]
 motogp.com ,False,/html/body/div[4]/div/div/div[4]/div/article[3]/header/div/span
1 week ago ,False,/html/body/div[4]/div/div/div[4]/div/article[3]/div/div/p
After the Flag: エピソード13,False,/html/body/div[4]/div/div/div[4]/div/article[3]/div/a[2]/h2
Aragon,False,/html/body/div[4]/div/div/div[4]/div/article[3]/footer/dl/dd/a
 motogp.com ,False,/html/body/div[4]/div/div/div[4]/div/article[4]/header/div/span
2 weeks ago ,False,/html/body/div[4]/div/div/div[4]/div/article[4]/div/div/p
マルチオンボードスタート,False,/html/body/div[4]/div/div/div[4]/div/article[4]/div/a[2]/h2
Aragon,False,/html/body/div[4]/div/div/div[4]/div/article[4]/footer/dl/dd/a[1]
" , ",False,/html/body/div[4]/div/div/div[4]/div/article[4]/footer/dl/dd
#AragonGP,False,/html/body/div[4]/div/div/div[4]/div/article[4]/footer/dl/dd/a[2]
2 weeks ago ,False,/html/body/div[4]/div/div/div[4]/div/article[5]/div[1]/div/span
MotoGP™クラス‐決勝レースハイライト,False,/html/body/div[4]/div/div/div[4]/div/article[5]/div[1]/a/h2
Aragon,False,/html/body/div[4]/div/div/div[4]/div/article[5]/footer/dl/dd/a[1]
" , ",False,/html/body/div[4]/div/div/div[4]/div/article[5]/footer/dl/dd
#AragonGP,False,/html/body/div[4]/div/div/div[4]/div/article[5]/footer/dl/dd/a[2]
@MotoGP,False,/html/body/div[4]/div/div/div[4]/div/article[6]/header/div/span[1]/a
2 weeks ago ,False,/html/body/div[4]/div/div/div[4]/div/article[6]/div/div[2]/p
We think he's happy with that one... We're not sure,False,/html/body/div[4]/div/div/div[4]/div/article[6]/div/a/h2
We think he's happy with that one... We're not sure about the dance moves though...,False,/html/body/div[4]/div/div/div[4]/div/article[6]/div/p/a
Aragon,False,/html/body/div[4]/div/div/div[4]/div/article[6]/footer/div/div[2]/section/dl/dd/a[1]
" , ",False,/html/body/div[4]/div/div/div[4]/div/article[6]/footer/div/div[2]/section/dl/dd
Marc Marquez,False,/html/body/div[4]/div/div/div[4]/div/article[6]/footer/div/div[2]/section/dl/dd/a[2]
369,False,/html/body/div[4]/div/div/div[4]/div/article[6]/footer/div/div[3]/div[2]/span
757,False,/html/body/div[4]/div/div/div[4]/div/article[6]/footer/div/div[3]/div[3]/span
 motogp.com ,False,/html/body/div[4]/div/div/div[4]/div/article[7]/header/div/span
2 weeks ago ,False,/html/body/div[4]/div/div/div[4]/div/article[7]/div/div/p
ラスト3分間のポールポジションバトル,False,/html/body/div[4]/div/div/div[4]/div/article[7]/div/a[2]/h2
Aragon,False,/html/body/div[4]/div/div/div[4]/div/article[7]/footer/dl/dd/a
More motogp.com,False,/html/body/nav[1]/div/div/div/p
ソーシャルネットワーク,False,/html/body/nav[1]/div/ul/li[1]/header/p
アバウト,False,/html/body/nav[1]/div/ul/li[2]/header/p
dorna.com,False,/html/body/nav[1]/div/ul/li[2]/ul/li[1]/a
クッキーポリシー,False,/html/body/nav[1]/div/ul/li[2]/ul/li[2]/a
Terms & Conditions,False,/html/body/nav[1]/div/ul/li[2]/ul/li[3]/a
コンタクト,False,/html/body/nav[1]/div/ul/li[3]/header/p
コンタクト,False,/html/body/nav[1]/div/ul/li[3]/ul/li[1]/a
FAQ,False,/html/body/nav[1]/div/ul/li[3]/ul/li[2]/a
アドバタイズ,False,/html/body/nav[1]/div/ul/li[3]/ul/li[3]/a
motogp.com,False,/html/body/nav[1]/div/ul/li[4]/header/p
ビデオパス,False,/html/body/nav[1]/div/ul/li[4]/ul/li[1]/a
MotoGP™ Tickets,False,/html/body/nav[1]/div/ul/li[4]/ul/li[2]/a
MotoGP™ League,False,/html/body/nav[1]/div/ul/li[4]/ul/li[3]/a
TV Broadcast,False,/html/body/nav[1]/div/ul/li[4]/ul/li[4]/a
MotoGP™ Apps,False,/html/body/nav[1]/div/ul/li[4]/ul/li[5]/a
MotoGP VIP Village™,False,/html/body/nav[1]/div/ul/li[4]/ul/li[6]/a
MotoGP™ Store,False,/html/body/nav[1]/div/ul/li[4]/ul/li[7]/a
スポイラー,False,/html/body/nav[1]/div/ul/li[4]/ul/li[8]/a
MotoGP™ Cashback,False,/html/body/nav[1]/div/ul/li[4]/ul/li[9]/a
© 2016 Dorna Sports SL. All rights reserved. All trademarks are the property of their respective owners.,False,/html/body/nav[1]/div/p
ログイン,False,/html/body/nav[2]/div/nav/a[1]
登録,False,/html/body/nav[2]/div/nav/a[2]
ホーム,False,/html/body/nav[2]/div/i/div/ul[1]/li[1]/a
Live,False,/html/body/nav[2]/div/i/div/ul[1]/li[2]/a
ビデオ,False,/html/body/nav[2]/div/i/div/ul[1]/li[3]/a/span
ベスト・オブ,False,/html/body/nav[2]/div/i/div/ul[1]/li[3]/nav/ul/li[1]/a
Live,False,/html/body/nav[2]/div/i/div/ul[1]/li[3]/nav/ul/li[2]/a
2018 年シーズン,False,/html/body/nav[2]/div/i/div/ul[1]/li[3]/nav/ul/li[3]/a
スポイラー,False,/html/body/nav[2]/div/i/div/ul[1]/li[3]/nav/ul/li[4]/a
ショー,False,/html/body/nav[2]/div/i/div/ul[1]/li[3]/nav/ul/li[5]/a
過去のシーズン,False,/html/body/nav[2]/div/i/div/ul[1]/li[3]/nav/ul/li[6]/a
オールビデオ,False,/html/body/nav[2]/div/i/div/ul[1]/li[3]/nav/ul/li[7]/a
フォトギャラリー,False,/html/body/nav[2]/div/i/div/ul[1]/li[4]/a/span
ベスト・オブ,False,/html/body/nav[2]/div/i/div/ul[1]/li[4]/nav/ul/li[1]/a
Grand Prix,False,/html/body/nav[2]/div/i/div/ul[1]/li[4]/nav/ul/li[2]/a
ライダー,False,/html/body/nav[2]/div/i/div/ul[1]/li[4]/nav/ul/li[3]/a
チーム,False,/html/body/nav[2]/div/i/div/ul[1]/li[4]/nav/ul/li[4]/a
リザルト,False,/html/body/nav[2]/div/i/div/ul[1]/li[5]/a
カレンダー,False,/html/body/nav[2]/div/i/div/ul[1]/li[6]/a
チーム&ライダー,False,/html/body/nav[2]/div/i/div/ul[1]/li[7]/a
インサイド ,False,/html/body/nav[2]/div/i/div/ul[1]/li[8]/a/span
MotoGP VIP Village™,False,/html/body/nav[2]/div/i/div/ul[1]/li[8]/nav/ul[1]/li[1]/a
スポンサー,False,/html/body/nav[2]/div/i/div/ul[1]/li[8]/nav/ul[1]/li[2]/a
MotoGP Buzz,False,/html/body/nav[2]/div/i/div/ul[1]/li[8]/nav/ul[1]/li[3]/a
Red Bull MotoGP™ Rookies Cup,False,/html/body/nav[2]/div/i/div/ul[1]/li[8]/nav/ul[1]/li[4]/a
FIM Enel MotoE™ World Cup,False,/html/body/nav[2]/div/i/div/ul[1]/li[8]/nav/ul[1]/li[5]/a
Two Wheels For Life,False,/html/body/nav[2]/div/i/div/ul[1]/li[8]/nav/ul[1]/li[6]/a
MotoGP™リーグ,False,/html/body/nav[2]/div/i/div/ul[1]/li[8]/nav/ul[2]/li[1]/a
ビデオゲーム,False,/html/body/nav[2]/div/i/div/ul[1]/li[8]/nav/ul[2]/li[2]/a
eSport,False,/html/body/nav[2]/div/i/div/ul[1]/li[8]/nav/ul[2]/li[3]/a
タイミングパス,False,/html/body/nav[2]/div/i/div/div[2]/a[1]
ビデオパス,False,/html/body/nav[2]/div/i/div/div[2]/a[2]
チケット,False,/html/body/nav[2]/div/i/div/ul[2]/li[1]/a
アプリ,False,/html/body/nav[2]/div/i/div/ul[2]/li[2]/a
オンラインショップ,False,/html/body/nav[2]/div/i/div/ul[2]/li[3]/a
よくある質問,False,/html/body/nav[2]/div/i/div/ul[2]/li[4]/a
Contact,False,/html/body/nav[2]/div/i/div/ul[2]/li[5]/a
ビデオパス,False,/html/body/nav[2]/div/i/div/ul[2]/li[6]/a
 日本語 ,False,/html/body/div[8]/nav/span/a
English,False,/html/body/div[8]/nav/span/ul/li[1]/a
Español,False,/html/body/div[8]/nav/span/ul/li[2]/a
Italiano,False,/html/body/div[8]/nav/span/ul/li[3]/a
Français,False,/html/body/div[8]/nav/span/ul/li[4]/a
日本語,False,/html/body/div[8]/nav/span/ul/li[5]/a
Deutsch,False,/html/body/div[8]/nav/span/ul/li[6]/a
ログイン,False,/html/body/div[8]/nav/a[1]/span
登録,False,/html/body/div[8]/nav/a[2]
Share options:,False,/html/body/div[11]/div/div
Twitter,False,/html/body/div[11]/div/a[1]
Google+,False,/html/body/div[11]/div/a[2]
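As a quick check of the file layout, the sketch below reads three rows copied from the sample above with the standard `csv` module and converts the `label` column to a boolean. The variable names are assumptions for illustration; the post itself loads the files with `pd.read_csv`.

```python
import csv
import io

# Three rows copied from the sample CSV, including one quoted field.
SAMPLE_CSV = '''#text,label,xpath
English,False,/html/body/header/div[1]/div/div[2]/div/ul/li[1]/a
" , ",False,/html/body/div[4]/div/div/div[4]/div/article[1]/footer/div/div[2]/section/dl/dd
MotoGP,True,/html/body/div[4]/div/div/div[3]/div[3]/div[1]/a[1]
'''

rows = []
for record in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    rows.append({
        "text": record["#text"],             # raw text of the node
        "label": record["label"] == "True",  # True means "content to extract"
        "xpath": record["xpath"],            # absolute path with positional indices
    })

n_content = sum(r["label"] for r in rows)
print(len(rows), n_content)
```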

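Feature engineering code

The code below computes each feature. It occasionally raises a division-by-zero error, which I have not fixed; this happened rarely on my data, so I ran it as-is for now (files that raise are caught in `build` and skipped).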
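One possible way to avoid the division-by-zero cases is to route every ratio and logarithm through small guard functions. This is a sketch, not part of the original code: `safe_ratio`, `safe_log`, and `digit_ratio` are hypothetical helpers, and the default of 0.0 is an assumption chosen to match the later step that replaces infinities and NaN with 0.

```python
import math


def safe_ratio(numerator, denominator, default=0.0):
    """Divide, returning `default` instead of raising when the denominator is 0."""
    if denominator == 0:
        return default
    return numerator / denominator


def safe_log(value, default=0.0):
    """Natural log that returns `default` for non-positive input."""
    return math.log(value) if value > 0 else default


def digit_ratio(text):
    """Share of ASCII digits in `text`; an empty string gives 0.0 instead of an error."""
    return safe_ratio(sum(ch in "0123456789" for ch in text), len(text))


print(digit_ratio(""), digit_ratio("a1"), safe_log(0))
```

With guards like these, a file containing an empty text node would still produce a feature row instead of being dropped as a whole.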

import re
import pandas as pd
import MeCab
import string
import numpy as np
from nltk.corpus import stopwords


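def prepare_df(filepath):
    df = pd.read_csv(filepath)
    reg = re.compile(r"\[[0-9]+\]")
    # Strip positional indices such as [3] so that sibling nodes share one path.
    df['xpath_fixed'] = [reg.sub("", x) for x in df['xpath']]
    # Drop text that lives inside <script> elements.
    df = df[["script" not in x for x in df['xpath_fixed']]]
    # Reset the index so that Series assigned later in build() (b1, label)
    # align row by row with the freshly indexed feature frame.
    return df.reset_index(drop=True)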


def b1(df, column="#text"):
    return df[[column]].duplicated()


def b3(df, column="xpath_fixed"):
    return [sum(df[column] == x)/df.shape[0] for x in df[column]]


def b5(df, tagger, column="#text"):
    return [np.log(len(tagger.parse(x).split())) for x in df[column]]


def b6(df, tagger, column="#text"):
    return [np.mean([len(y) for y in tagger.parse(x).split()]) for x in df[column]]


def b7(df, tagger, column="#text"):
    jpstps = stopwords.words('japanese')
    enstps = stopwords.words('english')
    
    stopword_ratio = []
    for x in df[column]:
        tmp = np.isin(tagger.parse(x).split(), jpstps+enstps)
        if len(tmp) == 0:
            stopword_ratio.append(0)
        else:
            stopword_ratio.append(float(sum(tmp))/float(len(tmp)))
        
    return stopword_ratio


def b9(df, column="#text"):
    return [np.log(len(list(x))) for x in df[column]]


def b10_b24(df, column="#text"):
    plist = list(".,?!。!?、")
    punkt_ratio = []
    n_punkt = []
    for x in df[column]:
        tmp = np.isin(list(x), plist)
        t = float(sum(tmp))
        n_punkt.append(t)
        if t == 0:
            punkt_ratio.append(0.0)
        else:   
            punkt_ratio.append(np.log(
                float(t) / float(len(tmp))
            ))

    return punkt_ratio, n_punkt


def b11(df, column="#text"):
    nums = list("0123456789")
    num_ratio = []
    for x in df[column]:
        tmp = list(x)
        num_ratio.append(float(sum(np.isin(tmp,nums)))/float(len(tmp)))
    return num_ratio


def b14(df, column="#text"):
    endmark = list(".,?!。!?、")
    return [np.any([x.strip().endswith(y) for y in endmark]) for x in df[column]]


def b26(df, column="#text"):
    return [float(n)/float(df.shape[0]) for n, x in enumerate(df[column])]


def b29_b30(df, col1="xpath", col2="#text", col3="xpath_fixed", parent_level=1):
    body_percentage = []
    link_density = []
    for x, text in zip(df[col1], df[col2]):
        if parent_level is None:
            parent = "/html/body"
        else:
            parent = '/'.join(x.split("/")[:-parent_level])
        # Select the ancestor's subtree by path prefix; a plain substring test
        # would also match unrelated paths (e.g. div[1] vs div[11]).
        prefix = parent + "/"
        target = df[[y == parent or y.startswith(prefix) for y in df[col1]]]
        body_percentage.append(float(len(text))/float(sum(len(y) for y in target[col2])))
        # Match "/a" so that tags merely ending in the letter "a" are not counted as links.
        atags = [y.endswith("/a") for y in target[col3]]
        link_density.append(float(sum(atags))/float(len(atags)))
    return body_percentage, link_density


def b31(df, tagger, col1="xpath", col2="#text", col3="xpath_fixed", parent_level=1):
    b6_p = []
    b7_p = []
    b9_p = []
    b10_p = []
    b11_p = []
    b14_p = []
    pf = {}
    plist = list(".,?!。!?、")
    nums = list("0123456789")
    endmark = list(".,?!。!?、")
    jpstps = stopwords.words('japanese')
    enstps = stopwords.words('english')

    for x, text in zip(df[col1], df[col2]):
        if parent_level is None:
            parent = "/html/body"
        else:
            parent = '/'.join(x.split("/")[:-parent_level])
        if parent in pf:
            b6_p.append(pf[parent]['b6'])
            b7_p.append(pf[parent]['b7'])
            b9_p.append(pf[parent]['b9'])
            b10_p.append(pf[parent]['b10'])
            b11_p.append(pf[parent]['b11'])
            b14_p.append(pf[parent]['b14'])
            continue

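        # Match the ancestor's subtree by path prefix rather than by substring.
        prefix = parent + "/"
        target = df[[y == parent or y.startswith(prefix) for y in df[col1]]]
        b6 = np.mean(np.concatenate([[len(w) for w in tagger.parse(y).split()] for y in target[col2]]))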
        b7 = []
        for y in target[col2]:
            tmp = np.isin(tagger.parse(y).split(), jpstps+enstps)
            b7.append(tmp)
        b7 = np.concatenate(b7)
        if len(b7) == 0:
            b7 = 0
        else:
            b7 = float(sum(b7))/float(len(b7))
        b9 = np.log(len(list(' '.join([y for y in target[col2]]))))

        # Punctuation ratio over all characters of the subtree's text nodes
        # (the denominator is the character count, as in b10_b24).
        tmp = np.concatenate([np.isin(list(y), plist) for y in target[col2]])
        t = float(sum(tmp))
        if t == 0:
            punkt_ratio = 0.0
        else:
            punkt_ratio = np.log(t / float(len(tmp)))
        b10 = punkt_ratio
        
        num_ratio = []
        tmp = []
        for y in target[col2]:
            tmp += list(y)
        num_ratio = float(sum(np.isin(tmp,nums)))/float(len(tmp))
        b11 = num_ratio
        b14 = np.any([target[col2].tolist()[-1].endswith(y) for y in endmark])
    
        pf[parent] = {'b6':b6, 'b7':b7, 'b9':b9, 'b10': b10, 'b11':b11, 'b14':b14}
        b6_p.append(pf[parent]['b6'])
        b7_p.append(pf[parent]['b7'])
        b9_p.append(pf[parent]['b9'])
        b10_p.append(pf[parent]['b10'])
        b11_p.append(pf[parent]['b11'])
        b14_p.append(pf[parent]['b14'])
                     
    return b6_p, b7_p, b9_p, b10_p, b11_p, b14_p


def b49(df, column="xpath_fixed", parent_level=1):
    tags = "td div p tr table body ul span li blockquote b small a ol ul i form dl strong pre".split()
    ptag_features = []

    for x in df[column]:
        tmp = np.zeros(len(tags))
        t = x.split("/")[-(parent_level+1)]
        try:
            ind = tags.index(t)
            tmp[ind] = 1.0
        except ValueError:  # tag is not in the one-hot vocabulary
            pass
        ptag_features.append(tmp)
    return pd.DataFrame(ptag_features, columns=list(map(str, list(range(49, 49+len(tags))))))


def b110(df, column="xpath_fixed"):
    tags = "a p td b li span i tr div strong em h3 h2 table h4 small sup h1 blockquote".split()
    tag_features = []

    for x in df[column]:
        tmp = np.zeros(len(tags))
        t = x.split("/")[-1]
        try:
            ind = tags.index(t)
            tmp[ind] = 1.0
        except ValueError:  # tag is not in the one-hot vocabulary
            pass
        tag_features.append(tmp)
    return pd.DataFrame(tag_features, columns=list(map(str, list(range(110, 110+len(tags))))))


def build(filepath):
    print(filepath, end=" ", flush=True)
    tagger = MeCab.Tagger("-Owakati")
    try:
        df = prepare_df(filepath)
        out = pd.concat([b49(df), b110(df)], axis=1)
        out["b1"] = b1(df)
        out["b3"] = b3(df)
        out["b5"] = b5(df,tagger)
        out["b6"] = b6(df, tagger)
        out["b7"] = b7(df, tagger)
        out["b9"] = b9(df)
        out["b10"], out["b24"] = b10_b24(df)
        out["b11"] = b11(df)
        out["b14"] = b14(df)
        out["b26"] = b26(df)
        out["b29"], out["b30"] = b29_b30(df)
        out["b31"], out["b32"], out["b34"], out["b35"], out["b36"], out["b39"] = b31(df, tagger)
        out["b70"], out["b71"] = b29_b30(df, parent_level=2)
        out["b72"], out["b73"], out["b75"], out["b76"], out["b77"], out["b80"] = b31(df, tagger, parent_level=2)
        out["b90"], out["b91"] = b29_b30(df, parent_level=None)
        out["b92"], out["b93"], out["b95"], out["b96"], out["b97"], out["b100"] = b31(df, tagger, parent_level=None)
        out["label"] = df["label"]
        return True, out
    except Exception as e:
        print(e)
        return False, e


import os
from multiprocessing import Pool

# Build the features for every CSV file in parallel and keep the successful results.
path = "candidates_text/"
filepaths = [os.path.join(path, filename) for filename in os.listdir(path)]
with Pool(8) as pool:
    out = pool.map(build, filepaths)
data = pd.concat([o[1] for o in out if o[0]])

The feature-engineered data is available here -> https://github.com/sugiyamath/information_extraction_experiments/blob/master/model4/example_data/data.7z

Training and evaluation

Training and evaluation are done in Jupyter.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

data = data.replace([np.inf, -np.inf], np.nan)
data = data.fillna(0)

y = data['label'] == True
X = data.iloc[:,:-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, random_state=42)

clf = RandomForestClassifier().fit(X_train, y_train)
y_preds = clf.predict(X_test)
print(classification_report(y_test, y_preds))
             precision    recall  f1-score   support

      False       0.98      1.00      0.99     93130
       True       0.97      0.88      0.93     19292

avg / total       0.98      0.98      0.98    112422
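For reading the report: precision, recall, and F1 for one class follow directly from its true-positive, false-positive, and false-negative counts. The sketch below uses hypothetical counts chosen only to give round numbers; they are not taken from the experiment above.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class metrics from confusion counts, with 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts: 88 content nodes found, 3 false alarms, 12 content nodes missed.
p, r, f = precision_recall_f1(tp=88, fp=3, fn=12)
print(round(p, 2), round(r, 2), round(f, 2))
```

Read this way, the recall of 0.88 on the True class means that roughly 12% of the nodes labeled as content were missed, while the precision of 0.97 means that few non-content nodes were extracted by mistake.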

References

[1] https://arxiv.org/abs/1801.02607