ナード戦隊データマン

A blog about machine learning and natural language processing

geolife: Predicting transportation mode from GPS logs

The Geolife dataset [1] is a collection of GPS logs from 182 users, gathered by the Microsoft Research Asia Geolife project between 2007 and 2012.

Overview

When the data is extracted, it is organized into directories named after each user ID, and each directory contains GPS logs in the *.plt format.

Some of the data comes with transportation-mode labels; the goal here is to predict those modes.
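
For orientation, the extracted archive looks roughly like this: directory names are user IDs, labels.txt exists only for users who annotated their transportation modes, and each .plt file starts with six metadata lines that the loader below skips (the sample values are illustrative).

Data/
  000/
    Trajectory/
      20081023025304.plt
      ...
  010/
    labels.txt
    Trajectory/
      ...

# one .plt data line: lat, lon, 0, altitude(ft), days since 1899-12-30, date, time
39.906631,116.385564,0,492,39745.1201736111,2008-10-24,02:53:03

# one labels.txt line (after its header): start date/time, end date/time, mode
2008/10/24 02:53:03    2008/10/24 03:10:20    bus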


Loading the data

Save the following code as read_geolife.py [2].

import numpy as np
import pandas as pd
import glob
import os.path
import datetime
import os

def read_plt(plt_file):
    points = pd.read_csv(plt_file, skiprows=6, header=None,
                         parse_dates=[[5, 6]], infer_datetime_format=True)

    # for clarity rename columns
    points.rename(inplace=True, columns={'5_6': 'time', 0: 'lat', 1: 'lon', 3: 'alt'})

    # remove unused columns
    points.drop(inplace=True, columns=[2, 4])

    return points

mode_names = ['walk', 'bike', 'bus', 'car', 'subway','train', 'airplane', 'boat', 'run', 'motorcycle', 'taxi']
mode_ids = {s : i + 1 for i, s in enumerate(mode_names)}

def read_labels(labels_file):
    labels = pd.read_csv(labels_file, skiprows=1, header=None,
                         parse_dates=[[0, 1], [2, 3]],
                         infer_datetime_format=True, delim_whitespace=True)

    # for clarity rename columns
    labels.columns = ['start_time', 'end_time', 'label']

    # replace 'label' column with integer encoding
    labels['label'] = [mode_ids[i] for i in labels['label']]

    return labels

def apply_labels(points, labels):
    # for each point, find the last label interval whose start_time <= point time
    indices = labels['start_time'].searchsorted(points['time'], side='right') - 1
    # points before the first interval, or past the matched interval's end_time, get label 0
    no_label = (indices < 0) | (points['time'].values >= labels['end_time'].iloc[indices].values)
    points['label'] = labels['label'].iloc[indices].values
    points.loc[no_label, 'label'] = 0

def read_user(user_folder):
    labels = None

    plt_files = glob.glob(os.path.join(user_folder, 'Trajectory', '*.plt'))
    df = pd.concat([read_plt(f) for f in plt_files])

    labels_file = os.path.join(user_folder, 'labels.txt')
    if os.path.exists(labels_file):
        labels = read_labels(labels_file)
        apply_labels(df, labels)
    else:
        df['label'] = 0

    return df

def read_all_users(folder):
    subfolders = os.listdir(folder)
    dfs = []
    for i, sf in enumerate(subfolders):
        print('[%d/%d] processing user %s' % (i + 1, len(subfolders), sf))
        df = read_user(os.path.join(folder,sf))
        df['user'] = int(sf)
        dfs.append(df)
    return pd.concat(dfs)

Next, use it to load and save the data.

import read_geolife
df = read_geolife.read_all_users('Data').sort_values(["user", "time"])
df.to_csv("data.csv", index=False)

Splitting the data

Now group the data. Note that, before grouping, data.csv must be sorted by user and then by time.

# coding: utf-8

import pandas as pd
from tqdm import tqdm
import numpy as np

df = pd.read_csv("../data/data.csv")
df["time"] = pd.to_datetime(df["time"])
df = df[df["label"] != 0]


def grouping(df):
    # assign a new group ID whenever the user or the label changes,
    # or when the gap between consecutive points exceeds two minutes
    df = df.sort_values(["user", "time", "label"])
    prev_user = None
    prev_time = None
    prev_label = None
    group_num = 0
    out = []
    for user, time, label in tqdm(zip(df["user"], df["time"], df["label"])):
        if prev_label is None:
            pass
        elif prev_label == label and prev_user == user:
            diff = time - prev_time
            if diff.days > 0:
                group_num += 1
            elif diff.seconds > 120:
                group_num += 1
        else:
            group_num += 1
        prev_user = user
        prev_time = time
        prev_label = label
        out.append(group_num)
    return out


df["group"] = grouping(df)
df.to_csv("data_fixed.csv")

Feature extraction

Based on the paper [3], write the feature-extraction code. The following are used as features:

  1. The distance between two consecutive logged points.
  2. The speed computed from that distance and the elapsed time.
  3. The acceleration computed from the speed and time.
  4. The jerk computed from the acceleration and time.
  5. The bearing (direction) computed from the two points.
  6. A binary value indicating padding.
# coding: utf-8
 
import numpy as np
import pandas as pd
import math
from scipy import signal
from functools import partial
from compassbearing import calculate_initial_compass_bearing
 
 
def distance(points):
    # naive Euclidean distance between consecutive (lat, lon) pairs, in degrees
    # (see the addendum at the end for a geodesic version in meters)
    out = []
    prev_point = None
    for point in points:
        if prev_point is None:
            distance = 0.0
        else:
            distance = math.sqrt(
                ((point[0] - prev_point[0])**2) +
                ((point[1] - prev_point[1])**2))
        out.append(distance)
        prev_point = point
    return out
 
 
def speed(distances, times):
    prev_time = None
    out = []
    for dist, time in zip(distances, times):
        if prev_time is None:
            s = 0.0
        else:
            try:
                difftime = time - prev_time
                dt_s = difftime.seconds
                dt_s += difftime.microseconds / (1000 * 1000)
                s = float(dist) / float(dt_s)
            except ZeroDivisionError:
                s = 0.0
        out.append(s)
        prev_time = time
    return out
 
 
def diff(data, times):
    prev_d = None
    prev_t = None
    out = []
    for d, time in zip(data, times):
        if prev_d is None:
            derivative = 0.0
        else:
            try:
                diff_d = d - prev_d
                diff_t = time - prev_t
                dt_s = diff_t.seconds
                dt_s += diff_t.microseconds / (1000 * 1000)
                derivative = float(diff_d) / float(dt_s)
            except ZeroDivisionError:
                derivative = 0.0
        out.append(derivative)
        prev_d = d
        prev_t = time
    return out
 
 
def create_embedding(points, times, window_size=9):
    # per-point feature vector: [distance, speed, acceleration, jerk, bearing]
    # (window_size is currently unused)
    distances = distance(points)
    speeds = speed(distances, times)
    a = diff(speeds, times)
    j = diff(a, times)
    b = [0.0] + [
        float(calculate_initial_compass_bearing(point1, point2)) / float(360)
        for point1, point2 in zip(points[0:-1], points[1:])
    ]
    
    return [
        [x0, x1, x2, x3, x4]
        for x0, x1, x2, x3, x4 in zip(distances, speeds, a, j, b)
    ]
 
 
def split_data(data, dim=5, size=200, inplace=0, batch_size=10):
    # cut the sequence into segments of length `size`, padding the last segment;
    # when dim is given, a trailing flag marks real rows (0) vs padded rows (1)
    out = []
    row = []
    for i, d in enumerate(data):
        if dim is None:
            row.append(d)
        else:
            row.append(d + [0])
        if len(row) == size:
            out.append(row)
            row = []
    if row:
        if dim is None:
            row += [0 for x in range(size - len(row))]
        else:
            row += [
                np.array([inplace for _ in range(dim)]+[1])
                for i in range(size - len(row))
            ]
        out.append(row)
    out = np.array(out)
    spl_size = out.shape[0]//batch_size
    if spl_size == 0:
        spl_size = 1
    return np.array_split(out, spl_size) 
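
A minimal usage sketch (assuming this file is saved as feature.py, the module name imported by the generator below, and that compassbearing is available) with three made-up points confirms the shapes: create_embedding yields one 5-dimensional vector per point, and split_data pads each segment to length 200 and appends the padding flag, giving chunks of shape (batch, 200, 6).

import pandas as pd
from feature import create_embedding, split_data

# three dummy points logged 5 seconds apart (illustrative coordinates)
points = [(39.9066, 116.3855), (39.9067, 116.3857), (39.9069, 116.3860)]
times = list(pd.to_datetime(["2008-10-24 02:53:00",
                             "2008-10-24 02:53:05",
                             "2008-10-24 02:53:10"]))

emb = create_embedding(points, times)  # [distance, speed, acceleration, jerk, bearing] per point
print(len(emb), len(emb[0]))           # 3 5

chunks = split_data(emb)               # padded to segment length 200, padding flag appended
print(chunks[0].shape)                 # (1, 200, 6)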

Creating the data generator

Create a generator that feeds the data to the model.

from feature import create_embedding, split_data
import pandas as pd
import numpy as np
from keras.utils import to_categorical
from sklearn.utils import shuffle
import pickle
from collections import defaultdict
 
 
def frequent_label(y):
    # most frequent non-padding label in a segment, after label_fix remapping
    out = defaultdict(int)
    maxvalue = 0
    maxlabel = None
    for i in y:
        i = label_fix(i)
        if i != 0:
            out[i] += 1
            if out[i] > maxvalue:
                maxvalue = out[i]
                maxlabel = i
    return maxlabel
 
 
def label_fix(label):
    # ['walk':1, 'bike':2, 'bus':3, 'car':4, 'subway':5,'train':5, 'airplane':6, 'boat':6, 'run':6, 'motorcycle':7, 'taxi':4]
    
    mode_tr = {
        0: 0, 
        1: 1,
        2: 2,
        3: 3,
        4: 4,
        5: 5,
        6: 5,
        7: 6,
        8: 6,
        9: 6,
        10: 7,
        11: 4
    }
    return mode_tr[label]
 
 
def generate(df, groups, num_classes=12, onetime=False):
    # yield (X, y) per group: X has shape (batch, 1, 200, 6),
    # y is the one-hot encoding of the segment's most frequent mode
    df["time"] = pd.to_datetime(df["time"])
    while True:
        for group in groups:
            selected = df[df["group"] == group].sort_values(["user", "time"])
            points = [(lat, lon)
                      for lat, lon in zip(selected["lat"], selected["lon"])]
            times = selected["time"]
            labels = selected["label"]
            result = create_embedding(points, times)
            result_spl = split_data(result)
            labels_spl = split_data(labels, dim=None, inplace=0)
            for X, y in zip(result_spl, labels_spl):
                y_fixed = np.array([
                    to_categorical(
                        frequent_label(row_y), num_classes=num_classes)
                    for row_y in y
                ])
                X_fixed = X.reshape(X.shape[0], 1, X.shape[1], X.shape[2])
                if onetime:
                    yield X_fixed, y_fixed, y
                else:
                    yield X_fixed, y_fixed
        if onetime:
            break
 
 
def train_valid_test(df, train_size=0.8):
    # split group IDs into train / valid / test (valid and test share the remainder equally)
    groups = df["group"].unique()
    groups = shuffle(groups)
    s = int(groups.shape[0] * train_size)
    train, tmp_test = groups[:s], groups[s:]
    s = int(tmp_test.shape[0] * 0.5)
    valid, test = tmp_test[:s], tmp_test[s:]
 
    with open("spl.pkl", "wb") as f:
        pickle.dump({"train": train, "test": test, "valid": valid}, f)
 
    return train, valid, test
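
Before training, the generator output can be checked against the shapes the model below expects (a sketch; data_generator is the module name used by the training script):

import pandas as pd
from data_generator import generate, train_valid_test

df = pd.read_csv("../data/data_fixed.csv")
train, valid, test = train_valid_test(df)

# one batch from the infinite generator
X_fixed, y_fixed = next(generate(df, train, num_classes=8))
print(X_fixed.shape)  # (batch, 1, 200, 6): segments of 200 points, 6 features each
print(y_fixed.shape)  # (batch, 8): one one-hot mode label per segment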

Training the model

import pandas as pd
import pickle
import numpy as np
from tqdm import tqdm
from keras.models import Sequential
from keras.layers import Dense, Dropout, SeparableConv2D, MaxPooling2D, Flatten
from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint
from data_generator import generate, train_valid_test
 
 
def build_model(n_labels):
    model = Sequential([
        SeparableConv2D(32, (1, 3),
                        strides=(1, 1),
                        padding="same",
                        activation="relu",
                        input_shape=(1, 200, 6)),
        SeparableConv2D(32, (1, 3),
                        strides=(1, 1),
                        padding="same",
                        activation="relu"),
        MaxPooling2D((1, 2)),
        SeparableConv2D(64, (1, 3),
                        strides=(1, 1),
                        padding="same",
                        activation="relu"),
        SeparableConv2D(64, (1, 3),
                        strides=(1, 1),
                        padding="same",
                        activation="relu"),
        MaxPooling2D((1, 2)),
        SeparableConv2D(128, (1, 3),
                        strides=(1, 1),
                        padding="same",
                        activation="relu"),
        SeparableConv2D(128, (1, 3),
                        strides=(1, 1),
                        padding="same",
                        activation="relu"),
        MaxPooling2D((1, 2)),
        Dropout(0.5),
        Flatten(),
    ])
    # dense layer sized to a quarter of the flattened feature dimension
    A = model.output_shape
    model.add(Dense(int(A[1]*1/4.), activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(n_labels, activation='softmax'))
    optimizer = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
 
    model.compile(optimizer=optimizer,
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
 
 
if __name__ == "__main__":
    n_labels = 8
    df = pd.read_csv("../data/data_fixed.csv")
    model = build_model(n_labels)
    train, valid, test = train_valid_test(df)
    #with open("../data/spl.pkl", "rb") as f:
    #    indices = pickle.load(f)
    #    train, valid, test = indices["train"], indices["valid"], ""
 
    X_train = []
    y_train = []
    for X_row, y_row, _ in tqdm(generate(df, train, n_labels, True)):
        X_train.append(X_row)
        y_train.append(y_row)
 
    X_train = np.concatenate(tuple(X_train), axis=0)
    y_train = np.concatenate(tuple(y_train), axis=0)
 
    X_val = []
    y_val = []
    for X_row, y_row, _ in tqdm(generate(df, valid, n_labels, True)):
        X_val.append(X_row)
        y_val.append(y_row)
 
    X_val = np.concatenate(tuple(X_val), axis=0)
    y_val = np.concatenate(tuple(y_val), axis=0)
    
    callbacks = [
        ModelCheckpoint("./model_best.h5",
                        monitor="val_loss",
                        save_best_only=True,
                        mode="min")
    ]
 
    model.fit(X_train, y_train,
              epochs=100, batch_size=64, callbacks=callbacks,
              validation_data=(X_val, y_val))

Testing the model

import os
os.environ["CUDA_VISIBLE_DEVICES"]="-1" 
 
import pickle
import pandas as pd
from tqdm import tqdm
from keras.models import load_model
from data_generator import generate, label_fix
from sklearn.metrics import classification_report
 
if __name__ == "__main__":
    with open("./spl.pkl", "rb") as f:
        indices = pickle.load(f)
        indices = indices["test"]
 
    df = pd.read_csv("../data/data_fixed.csv")
    model = load_model("./model.h5")  # adjust the path if you only kept the training checkpoint (model_best.h5)
 
    y_pred = []
    y_true = []
    for X_batch, _, y_batch in tqdm(generate(df, indices, 8, onetime=True)):
        tmp_pred = model.predict_classes(X_batch)  # X_batch: (None, 1, 200, 6)
        assert X_batch.shape[0] == tmp_pred.shape[0]
        assert y_batch.shape[0] == tmp_pred.shape[0]
        for X, y, pred_i in zip(X_batch, y_batch, tmp_pred):
            # X: (1, 200, 6)
            pred = []
            true = []
            for X_i in X:
                # X_i: (200, 6)
                # y: (200, )
                for x, y_i in zip(X_i, y):
                    if x[-1] == 0:
                        pred.append(label_fix(pred_i))
                        true.append(label_fix(y_i))
                    else:
                        #pred.append(0)
                        pass
            assert len(pred) == len(true)
            y_pred += pred
            y_true += true
 
    result = classification_report(y_true, y_pred)
    with open("result3.txt", "w") as f:
        f.write(result)

Accuracy

              precision    recall  f1-score   support
 
           1       0.70      0.85      0.77    177178
           2       0.60      0.47      0.53     97368
           3       0.60      0.62      0.61    130121
           4       0.45      0.39      0.41     68278
           5       0.82      0.76      0.79    121442
           6       0.00      0.00      0.00      1005
 
    accuracy                           0.66    595392
   macro avg       0.53      0.51      0.52    595392
weighted avg       0.66      0.66      0.66    595392

Discussion

For some reason, the accuracy does not reach the level reported in the paper. I may have made a mistake somewhere in preprocessing, or misunderstood something in the feature design.

A strength of this model is that it uses only features that can be computed from GPS position and time. It turns out that the transportation mode can be predicted reasonably well from such simple data alone.

Incidentally, the data is split into segments of shape 1 × 200 × 6, where 200 is the segment length and 6 is the number of features per point.

Segmentation is needed because predicting from individual points is difficult. Since what we are predicting is movement, capturing it as a "line" of motion provides far more information than looking at isolated points.

At the data-splitting stage I cheated by splitting on the basis of the labels, but even so the accuracy did not improve much.

If you notice anything odd in the code, please let me know.

Addendum

2019/07/10 9:42

Removing the relative distance from the features improved the accuracy.

              precision    recall  f1-score   support

           1       0.76      0.83      0.79    177178
           2       0.62      0.66      0.64     97368
           3       0.73      0.63      0.68    130121
           4       0.57      0.51      0.54     68278
           5       0.85      0.87      0.86    121442
           6       0.00      0.00      0.00      1005

    accuracy                           0.73    595392
   macro avg       0.59      0.58      0.58    595392
weighted avg       0.73      0.73      0.73    595392

2019-07-10 9:57

Rewriting the distance function as follows improves the accuracy further.

from geopy.distance import vincenty

def distance(points):
    # geodesic (Vincenty) distance between consecutive points, in meters
    out = []
    prev_point = None
    for point in points:
        if prev_point is None:
            distance = 0.0
        else:
            try:
                distance = vincenty(tuple(point), tuple(prev_point)).meters
            except ValueError:
                print(point, prev_point)
                distance = 0.0
        out.append(distance)
        prev_point = point
    return out
              precision    recall  f1-score   support

           1       0.81      0.93      0.87    177178
           2       0.88      0.79      0.83     97368
           3       0.79      0.76      0.77    130121
           4       0.71      0.65      0.68     68278
           5       0.93      0.91      0.92    121442
           6       0.00      0.00      0.00      1005

    accuracy                           0.83    595392
   macro avg       0.69      0.67      0.68    595392
weighted avg       0.83      0.83      0.83    595392
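
As a side note, vincenty has since been deprecated and was removed in geopy 2.0; with a recent geopy, the same computation can be written with geodesic (a sketch, not what was used for the numbers above):

from geopy.distance import geodesic

def distance(points):
    out = []
    prev_point = None
    for point in points:
        if prev_point is None:
            d = 0.0
        else:
            # geodesic distance in meters between consecutive GPS fixes
            d = geodesic(tuple(prev_point), tuple(point)).meters
        out.append(d)
        prev_point = point
    return out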

References