ナード戦隊データマン

A blog about machine learning and natural language processing

The SHL Dataset: a multi-purpose annotated dataset of smartphone sensor data

The SHL dataset [1] is a dataset in which user data collected from a large number of smartphone sensors is annotated with several kinds of labels, including the mode of transportation.

github.com

Note: the code is published in the project above.

Overview

The labels and sensor types include the following.

Transportation mode labels in the SHL dataset (a small ID-to-name mapping sketch follows the list):

  • Still: standing or sitting; inside or outside a building
  • Walking: inside or outside
  • Run
  • Bike
  • Car: as driver, or as passenger
  • Bus: standing or sitting; lower deck or upper deck
  • Train: standing or sitting
  • Subway: standing or sitting
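
For reference, the numeric class IDs 1-8 used by the code and the classification report later in this post map to these labels as follows. This dict is a hypothetical helper added for readability, not something shipped with the dataset:

# Hypothetical helper: class IDs as used later in this post.
LABEL_NAMES = {
    1: "Still", 2: "Walking", 3: "Run", 4: "Bike",
    5: "Car", 6: "Bus", 7: "Train", 8: "Subway",
}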

Sensor types:

  • Accelerometer: x, y, z in m/s²
  • Gyroscope: x, y, z in rad/s
  • Magnetometer: x, y, z in μT
  • Orientation: quaternions in the form of a w, x, y, z vector
  • Gravity: x, y, z in m/s²
  • Linear acceleration: x, y, z in m/s²
  • Ambient pressure in hPa
  • Google’s activity recognition API output: 0-100% of confidence for each class (“in vehicle”, “on bicycle”, “on foot”, “running”, “still”, “tilting”, “unknown”, “walking”)
  • Ambient light in lx
  • Battery level (0-100%) and temperature (in °C)
  • Satellite reception: ID, SNR, azimuth and elevation of each visible satellite
  • Wifi reception including SSID, RSSI, frequency and capabilities (i.e. encryption type)
  • Mobile phone cell reception including network type (e.g. GSM, LTE), CID, location area code (LAC), mobile country code (MCC), mobile network code (MNC), signal strength
  • Location obtained from satellites (latitude, longitude, altitude, accuracy)
  • Audio

Example code for transportation mode prediction

Downloading the data (download.sh)

First, download the data. The archive is split into five parts, which the script fetches in parallel.

wget http://www.shl-dataset.org/wp-content/uploads/SHLDataset_User1Hips_v1/SHLDataset_User1Hips_v1.zip.001 &
wget http://www.shl-dataset.org/wp-content/uploads/SHLDataset_User1Hips_v1/SHLDataset_User1Hips_v1.zip.002 &
wget http://www.shl-dataset.org/wp-content/uploads/SHLDataset_User1Hips_v1/SHLDataset_User1Hips_v1.zip.003 &
wget http://www.shl-dataset.org/wp-content/uploads/SHLDataset_User1Hips_v1/SHLDataset_User1Hips_v1.zip.004 &
wget http://www.shl-dataset.org/wp-content/uploads/SHLDataset_User1Hips_v1/SHLDataset_User1Hips_v1.zip.005 &

wait

echo "Done"

Install p7zip-full.

apt install p7zip-full

Extract the data.

7z x SHLDataset_User1Hips_v1.zip.001

Note: when extracting a split archive with 7zip, you only need to point it at the first part (001); the remaining parts are picked up automatically.

Converting the data to CSV (data_fixer.py)

Here we extract only the accelerometer and gyroscope readings.

import os
import pandas as pd
from tqdm import tqdm


root_path = "./release/User1/"
csv_root = "./release/User1/"

"""
./release/User1/010317/Hips_Motion.txt
./release/User1/010317/Label.txt
"""

names1 = [
    "time", "accx", "accy", "accz", "gyrox", "gyroy", "gyroz", "magx", "magy",
    "magz", "oriw", "orix", "oriy", "oriz", "gravx", "gravy", "gravz", "laccx",
    "laccy", "laccz", "press", "alt", "temp"
]

names2 = [
    "time", "label", "finelabel", "roadlabel", "trafficlabel", "tunnelslabel",
    "sociallabel", "foodlabel"
]


def blocks(f, size=65536):
    while True:
        b = f.read(size)
        if not b:
            break
        yield b


def line_count(path):
    # Count lines by streaming the file in chunks, without loading it whole.
    with open(path) as f:
        total = sum(b1.count("\n") for b1 in blocks(f))
    return total


def load_one(path):

    file1 = "Hips_Motion.txt"
    file2 = "Label.txt"
    
    feature_path = os.path.join(path, file1)
    label_path = os.path.join(path, file2)

    # Skip recordings where the motion and label files don't line up row-for-row.
    if line_count(feature_path) != line_count(label_path):
        return None, False

    df = pd.read_csv(feature_path, sep=" ", header=None, names=names1)
    df_label = pd.read_csv(label_path, sep=" ", header=None, names=names2)

    df["time"] = df["time"].astype("int")
    df_label["time"] = df["time"].astype("int")

    features = ["time", "accx", "accy", "accz", "gyrox", "gyroy", "gyroz"]

    labels = ["time", "label"]

    df = df[features]
    df_label = df_label[labels]

    df = df.merge(df_label, on=["time"])
    del df_label
    
    return df, True


def load_all():
    # Each subdirectory of root_path is one recording date (e.g. 010317).
    for idx in tqdm(os.listdir(root_path)):
        path = os.path.join(root_path, idx)
        try:
            df, flag = load_one(path)
            if flag:
                df.to_csv(os.path.join(root_path, idx+".csv"), index=False)
        except Exception as e:
            with open("fixing.log", "a") as f:
                f.write("path:{}, error:{}\n".format(path, repr(e)))


def concat_data():
    dfs = []
    for filename in tqdm(os.listdir(csv_root)):
        if filename.endswith(".csv"):
            dfs.append(
                pd.read_csv(os.path.join(csv_root, filename)))
    df = pd.concat(dfs)
    df["time"].astype("int")
    df = df.sort_values(by=["time"])
    df.to_csv("data.csv", index=False)


if __name__ == "__main__":
    load_all()
    concat_data()
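
To sanity-check the conversion, something like the following can be run (a minimal sketch; data.csv and its columns come from the script above):

import pandas as pd

df = pd.read_csv("data.csv")
print(df.columns.tolist())  # time, accx, accy, accz, gyrox, gyroy, gyroz, label
print(len(df))
print(df["label"].value_counts())  # label 0 (unannotated) is dropped in the next step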

Window segmentation (window_segmentation.py)

A single raw sample is not informative enough to classify the transportation mode, so we segment the time series into fixed-length windows of 512 consecutive samples and predict one label per window (the motion sensors are sampled at 100 Hz, so a window covers roughly 5 seconds).

import pandas as pd
from tqdm import tqdm
import numpy as np

cols = ["time", "accx", "accy", "accz", "gyrox", "gyroy", "gyroz", "label"]


def most_frequent(arr):
    # Majority vote: a window takes the most common label among its samples.
    return max(set(arr), key=arr.count)


def window_segmentation(df):
    df["time"] = df["time"].astype("int")
    df = df[df["label"] != 0]  # label 0 means "no annotation"
    df = df.sort_values(by=["time"])

    out = []
    out_label = []
    prev_time = None
    current_group = []
    current_group_label = []

    for time, accx, accy, accz, gyrox, gyroy, gyroz, label in tqdm(
            zip(*[df[col] for col in cols])):
        if len(current_group) == 512:
            out.append(current_group)
            out_label.append(most_frequent(current_group_label))
            current_group = []
            current_group_label = []
        if prev_time is None:
            current_group.append([accx, accy, accz, gyrox, gyroy, gyroz])
            current_group_label.append(label)
        elif time - prev_time < 100:  # timestamps in ms; tolerate gaps < 100 ms
            current_group.append([accx, accy, accz, gyrox, gyroy, gyroz])
            current_group_label.append(label)
        elif len(current_group) < 512:
            # A larger gap breaks continuity; discard the partial window.
            current_group = []
            current_group_label = []
        prev_time = time
    out, out_label = np.array(out), np.array(out_label)
    print(out.shape)
    print(out_label.shape)

    np.save("data_features.npy", out)
    np.save("data_labels.npy", out_label)
    return out, out_label


if __name__ == "__main__":
    df = pd.read_csv("./data.csv")
    window_segmentation(df)
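
After the script finishes, the saved arrays can be checked like this (a minimal sketch; the shapes follow from the 512-sample window and the six selected channels):

import numpy as np

X = np.load("data_features.npy")
y = np.load("data_labels.npy")
print(X.shape)  # (num_windows, 512, 6)
print(y.shape)  # (num_windows,)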

Class balancing (fix_window.py)

We remove the class imbalance by downsampling every class to the size of the smallest one.

import numpy as np


def balancing(feature_data="./data_features.npy",
              label_data="./data_labels.npy"):

    labels = [1, 2, 3, 4, 5, 6, 7, 8]
    X = np.load(feature_data)
    y = np.load(label_data)

    # Find the size of the smallest class.
    min_length = min(np.sum(y == label) for label in labels)

    data = []
    data_label = []
    for label in labels:
        # Take the first min_length windows of each class.
        indices = np.where(y == label)[0][:min_length]
        data.append(X[indices])
        data_label.append(y[indices])

    X = np.concatenate(data)
    y = np.concatenate(data_label)

    print(X.shape)
    print(y.shape)
    np.save("data_features_balanced.npy", X)
    np.save("data_labels_balanced.npy", y)
    return X, y


if __name__ == "__main__":
    balancing()
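
A quick check that the classes really are balanced now (a minimal sketch over the files saved above):

import numpy as np

y = np.load("data_labels_balanced.npy")
print(np.unique(y, return_counts=True))  # all eight classes share the same count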

Training and testing (train.py)

We train a CNN model built from 1D separable convolutions.

import numpy as np
from keras.models import Sequential, load_model
from keras.layers import SeparableConv1D, MaxPooling1D, Flatten
from keras.layers import Dense
from keras.callbacks import ModelCheckpoint
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


def build_model(nlabels=9):
    # Labels run from 1 to 8; with sparse_categorical_crossentropy the output
    # size must cover the largest label, so index 0 simply goes unused.
    print("preparing model")
    model = Sequential([
        SeparableConv1D(48, 4, 1, input_shape=(512, 6)),
        SeparableConv1D(48, 4, 1),
        MaxPooling1D(2),
        SeparableConv1D(64, 4, 1),
        SeparableConv1D(64, 4, 1),
        MaxPooling1D(4),
        SeparableConv1D(80, 4, 1),
        SeparableConv1D(80, 4, 1),
        MaxPooling1D(4),
        Flatten(),
        Dense(nlabels, activation="softmax")
    ])

    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer="nadam",
                  metrics=["accuracy"])

    return model


if __name__ == "__main__":
    model = build_model()
    callbacks = [
        ModelCheckpoint("./model_best.h5",
                        monitor="val_loss",
                        save_best_only=True,
                        mode="min")
    ]

    # Outputs of fix_window.py above.
    X = np.load("./data_features_balanced.npy")
    y = np.load("./data_labels_balanced.npy")
    X = np.nan_to_num(X)
    y = np.nan_to_num(y)
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
    X_valid, X_test, y_valid, y_test = train_test_split(X_test,
                                                        y_test,
                                                        train_size=0.5)
    model.fit(X_train,
              y_train,
              validation_data=(X_valid, y_valid),
              epochs=300,
              batch_size=1000,
              callbacks=callbacks)
    model = load_model("./model_best.h5")
    y_pred = np.argmax(model.predict(X_test), axis=-1)
    print(classification_report(y_test, y_pred))
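
Once trained, the best checkpoint can classify a single window like this (a minimal sketch; the input file and model path come from the scripts above):

import numpy as np
from keras.models import load_model

model = load_model("./model_best.h5")
window = np.load("./data_features_balanced.npy")[0:1]  # one window, shape (1, 512, 6)
pred = np.argmax(model.predict(window), axis=-1)[0]
print(pred)  # numeric class ID, e.g. 2 = Walking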

Accuracy

1=Still, 2=Walking, 3=Run, 4=Bike, 5=Car, 6=Bus, 7=Train, 8=Subway

              precision    recall  f1-score   support

           1       0.64      0.83      0.73       605
           2       0.92      0.87      0.89       634
           3       0.98      0.99      0.99       626
           4       0.95      0.92      0.93       606
           5       0.90      0.93      0.92       579
           6       0.91      0.78      0.84       645
           7       0.61      0.68      0.64       643
           8       0.53      0.41      0.46       606

    accuracy                           0.80      4944
   macro avg       0.81      0.80      0.80      4944
weighted avg       0.81      0.80      0.80      4944

References