Nerd Sentai Dataman

Fighting evil with data science

How did well-paid American data scientists answer the Kaggle survey?

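Previously, I analyzed what kind of data scientists earn high salaries ( https://qiita.com/sugiyamath/items/37582d09227afbd0098b ). However, about all that result showed was "Indian respondents are poorly paid and American respondents are well paid." So this time I restrict the country to the United States and, in addition, show all of the roughly 300 features (survey questions and answers).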

Feature selection followed by logistic regression

What I do here is almost the same as last time, so I skip explanations of things like loading the data.

In[1]:

import pandas as pd
import numpy as np
import re

df = pd.read_csv("multipleChoiceResponses.csv", encoding="ISO-8859-1")
rates = pd.read_csv("conversionRates.csv", encoding="ISO-8859-1")

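# Keep only rows with a usable salary amount and currency.
df = df[df["CompensationAmount"].notnull()]
df = df[df["CompensationCurrency"].notnull()]
df = df[df["CompensationAmount"].ne("-")]
# Keep respondents paid in USD; in this article that stands in for "US respondents".
df = df[df["CompensationCurrency"] == "USD"]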

# Map each currency code to its exchange rate into USD.
rate_dict = dict(zip(rates["originCountry"], rates["exchangeRate"]))

df = df[df["CompensationCurrency"].isin(rate_dict.keys())]

# Strip thousands separators, then convert every amount to USD.
amounts = df["CompensationAmount"].str.replace(",", "", regex=False).astype(float)
df["CompensationUSD"] = amounts * df["CompensationCurrency"].map(rate_dict).astype(float)
df = df[df["CompensationUSD"].notnull()]

# Drop the raw salary columns and Country so they are not used as features.
df = df.drop(columns=["CompensationAmount", "CompensationCurrency", "Country"])

y = df["CompensationUSD"]
X = df.drop(columns="CompensationUSD")
X = X.fillna(0)
X_dummied = pd.get_dummies(X)
y_binary = y > np.median(y)

This time the target variable is binarized at the median salary of the American data scientists, rather than at 50,000 dollars.
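As a small illustration of the median split, here is a minimal sketch on made-up salaries (the numbers are hypothetical, not survey data). It assumes the same strict `>` comparison as the cell above, so a value exactly equal to the median lands in the lower class.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries in USD, for illustration only (not taken from the survey).
y = pd.Series([40000, 65000, 90000, 120000, 150000, 200000])

# Binarize at the median: True means "above the median salary".
threshold = np.median(y)
y_binary = y > threshold

print("median:", threshold)
print(y_binary.value_counts())
```

Splitting at the median keeps the two classes roughly balanced, which makes plain accuracy easier to read than it would be with a fixed cutoff.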

In[2]:

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Fit a random forest on the training split to rank the dummy features by importance.
X_train, X_test, y_train, y_test = train_test_split(X_dummied, y_binary, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

out = pd.DataFrame()
out["feature_name"] = X_dummied.columns.tolist()
out["feature_importance"] = clf.feature_importances_

top_306_features = out.sort_values("feature_importance", ascending=False)[0:306]
X_selected = X_dummied[top_306_features["feature_name"]]


# Split again with the same random_state (same rows in each split) and fit logistic regression.
X_train, X_test, y_train, y_test = train_test_split(X_selected, y_binary, random_state=0)
lr = LogisticRegression().fit(X_train, y_train)

print("score:{}".format(lr.score(X_test, y_test)))

Out[2]:

score:0.7090909090909091

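Because the nationality features were removed, the model's accuracy went down. In exchange, this lets us identify the answer items that matter specifically among American data scientists.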
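To put an accuracy figure like this in context, it may help to compare it with a trivial baseline. The sketch below is not part of the original analysis: it uses synthetic data from `make_classification` as a stand-in for the survey features, and scikit-learn's `DummyClassifier` as the majority-class baseline. With a median split the classes are close to balanced, so such a baseline would be expected to sit near 0.5.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey features (assumption: not the real data).
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_score = baseline.score(X_test, y_test)

# Logistic regression on the same split, scored the same way.
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
lr_score = lr.score(X_test, y_test)

print("baseline accuracy: {:.3f}".format(baseline_score))
print("logistic regression accuracy: {:.3f}".format(lr_score))
```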

In[3]:

out2 = pd.DataFrame()
out2["feature_name"] = X_selected.columns.tolist()
out2["coef"] = lr.coef_[0].tolist()
out2.to_csv("result.csv", index=False)

Out[3]:

feature_name coef
0 Age 0.068479
1 Tenure_More than 10 years 1.697553
2 CurrentEmployerType_Employed by college or uni... -2.098635
3 DataScienceIdentitySelect_0 0.281331
4 LearningCategoryOnlineCourses -0.047623
5 TimeVisualizing -0.015580
6 TimeModelBuilding -0.026931
7 Tenure_1 to 2 years -0.803940
8 LearningCategorySelftTaught -0.041710
9 TimeGatheringData -0.013498
10 LearningCategoryWork -0.038596
11 TimeFindingInsights 0.006048
12 Tenure_3 to 5 years -0.321073
13 FormalEducation_Doctoral degree 0.724343
14 CurrentJobTitleSelect_Data Scientist 0.281331
15 TimeProduction -0.021605
16 CurrentJobTitleSelect_Data Analyst -1.449301
17 LearningCategoryUniversity -0.033854
18 WorkToolsFrequencyAWS_0 0.387572
19 WorkToolsFrequencyHadoop_0 -0.321495
20 LearningCategoryKaggle -0.066354
21 EmployerIndustry_Academic -0.037940
22 EmployerIndustry_Technology 0.587736
23 CurrentEmployerType_Employed by a company that... 0.463840
24 DataScienceIdentitySelect_No -0.314967
25 WorkMethodsFrequencyEnsembleMethods_0 -0.510260
26 EmployerMLTime_Don't know -0.703553
27 WorkMethodsFrequencyTimeSeriesAnalysis_0 -0.224616
28 Tenure_Less than a year -0.435826
29 LearningPlatformUsefulnessCollege_0 0.281135
30 AlgorithmUnderstandingLevel_Enough to run the ... -0.861677
31 WorkChallengeFrequencyExpectations_0 -0.375583
32 WorkDataStorage_Flat files not in a database o... -0.386620
33 EmploymentStatus_Employed full-time 0.578870
34 WorkToolsFrequencyAWS_Often 1.352131
35 WorkMethodsFrequencyLiftAnalysis_0 -0.491133
36 EmployerSizeChange_Increased significantly 0.557436
37 JobFunctionSelect_Build and/or run a machine l... 0.340561
38 WorkDataVisualizations_76-99% of projects 0.373011
39 WorkDatasetSize_0 -0.746442
40 WorkDataStorage_Row-oriented relational (e.g. ... -0.235493
41 SalaryChange_Has increased 20% or more 0.772570
42 WorkToolsFrequencyPython_0 -0.112314
43 FormalEducation_Bachelor's degree -0.350542
44 WorkMethodsFrequencyA/B_0 -0.352774
45 LearningPlatformUsefulnessCollege_Somewhat useful -0.956894
46 WorkMethodsFrequencySimulation_0 -0.220552
47 WorkMethodsFrequencyGBM_0 -0.255413
48 WorkChallengeFrequencyDirtyData_0 -0.014298
49 WorkToolsFrequencySQL_0 -0.124224
50 LanguageRecommendationSelect_Python 0.782124
51 WorkCodeSharing_Git 0.160173
52 BlogsPodcastsNewslettersSelect_0 0.637040
53 WorkMethodsFrequencyPrescriptiveModeling_0 -0.345483
54 WorkMethodsFrequencyLogisticRegression_Sometimes 0.121432
55 WorkToolsFrequencySpark_0 0.197360
56 WorkInternalVsExternalTools_More internal than... -0.042683
57 WorkDataStorage_Flat files not in a database o... -0.372058
58 WorkMethodsFrequencyCross-Validation_Most of t... 0.104426
59 EmployerSize_10,000 or more employees 0.149857
60 LearningPlatformUsefulnessProjects_0 -0.058177
61 WorkChallengeFrequencyPolitics_0 0.343984
62 FormalEducation_Master's degree 0.067436
63 WorkMethodsFrequencyTextAnalysis_Sometimes 0.317632
64 WorkDatasetSize_100GB 0.074480
65 Tenure_6 to 10 years 0.290472
66 LearningPlatformUsefulnessKaggle_Very useful 0.044072
67 WorkMethodsFrequencyRecommenderSystems_0 0.387902
68 LearningPlatformUsefulnessConferences_0 -0.124684
69 EmployerSearchMethod_An external recruiter or ... 0.868228
70 MajorSelect_Physics 0.296851
71 EmployerSizeChange_Increased slightly 0.722816
72 EmployerMLTime_0 0.435451
73 WorkToolsFrequencyNoSQL_0 0.781856
74 ParentsEducation_A master's degree -0.348896
75 LearningPlatformUsefulnessCompany_0 -0.470440
76 WorkToolsFrequencyUnix_0 -0.373172
77 WorkMethodsFrequencyCross-Validation_0 0.111001
78 WorkMethodsFrequencyDecisionTrees_0 -0.159270
79 WorkDataTypeSelect_Text data -0.189846
80 LearningPlatformUsefulnessYouTube_0 0.258828
81 LearningPlatformUsefulnessCourses_0 -0.099933
82 WorkMethodsFrequencyPCA_0 -0.352539
83 WorkDataTypeSelect_Relational data -0.166469
84 WorkMethodsFrequencyDataVisualization_0 -0.781894
85 CurrentEmployerType_Employed by company that m... 0.396833
86 WorkChallengeFrequencyDirtyData_Most of the time 0.450069
87 WorkToolsFrequencyR_Most of the time 0.645424
88 WorkMethodsFrequencyNLP_0 0.276013
89 AlgorithmUnderstandingLevel_Enough to explain ... -0.178403
90 MajorSelect_Computer Science -0.256908
91 EmployerMLTime_3-5 years -0.304332
92 DataScienceIdentitySelect_Yes -0.219098
93 WorkToolsFrequencyAWS_Most of the time 0.486646
94 WorkToolsFrequencySQL_Most of the time -0.207669
95 EmployerSearchMethod_A friend, family member, ... -0.391141
96 WorkToolsFrequencyJupyter_0 -0.211600
97 LanguageRecommendationSelect_R -0.218728
98 WorkDatasetSize_10GB -0.422807
99 WorkToolsFrequencyPython_Most of the time 0.129843
100 LearningCategoryOther -0.013314
101 CurrentEmployerType_Employed by professional s... -0.429094
102 UniversityImportance_Very important -0.249939
103 LearningPlatformUsefulnessProjects_Very useful 0.253083
104 SalaryChange_I was not employed 3 years ago -0.017191
105 EmploymentStatus_Employed part-time -0.448167
106 TitleFit_Fine 0.437946
107 LearningPlatformUsefulnessTextbook_Very useful 0.288480
108 LearningPlatformUsefulnessYouTube_Somewhat useful -0.737736
109 WorkMethodsFrequencyLogisticRegression_Often -0.211330
110 WorkMethodsFrequencyKNN_0 0.482404
111 WorkDataVisualizations_10-25% of projects -0.294009
112 LearningPlatformUsefulnessBlogs_0 -0.183831
113 EmployerSize_I don't know -0.126138
114 WorkChallengeFrequencyTalent_0 -0.104591
115 WorkProductionFrequency_Most of the time -0.066297
116 WorkMethodsFrequencyA/B_Sometimes 0.267184
117 WorkToolsFrequencyTensorFlow_0 -0.429121
118 SalaryChange_Has stayed about the same (has no... 0.285703
119 SalaryChange_Has increased between 6% and 19% 0.803690
120 WorkMethodsFrequencyTimeSeriesAnalysis_Often 0.721306
121 WorkMethodsFrequencyTextAnalysis_0 0.124592
122 MLToolNextYearSelect_TensorFlow -0.402375
123 MLMethodNextYearSelect_Deep learning -0.013382
124 LearningPlatformUsefulnessCourses_Somewhat useful 0.727625
125 WorkMethodsFrequencyLogisticRegression_0 -0.148098
126 EmploymentStatus_Independent contractor, freel... -0.133118
127 WorkToolsFrequencyMATLAB_0 0.022003
128 WorkMethodsFrequencyRandomForests_0 0.146480
129 MajorSelect_Information technology, networking... -1.550536
130 JobFunctionSelect_Analyze and understand data ... 0.025936
131 LearningPlatformUsefulnessTextbook_0 0.232973
132 CurrentEmployerType_Employed by a company that... 0.366073
133 MLToolNextYearSelect_I don't plan on learning ... 0.006818
134 LearningPlatformUsefulnessSO_Somewhat useful 0.355973
135 AlgorithmUnderstandingLevel_Enough to tune the... -1.226598
136 RemoteWork_Never 0.249343
137 TimeOtherSelect -0.008814
138 WorkChallengeFrequencyDataAccess_Sometimes 0.699259
139 WorkMethodsFrequencyNaiveBayes_0 -0.521839
140 WorkToolsFrequencyJupyter_Most of the time -0.053016
141 LearningPlatformUsefulnessBlogs_Somewhat useful 0.130403
142 WorkMethodsFrequencyRecommenderSystems_Sometimes 0.736061
143 WorkDatasetSize_10MB -1.274020
144 WorkChallengeFrequencyClarity_Often 0.552631
145 LearningPlatformUsefulnessKaggle_0 0.017762
146 WorkChallengeFrequencyPrivacy_0 0.755218
147 EmployerIndustry_Internet-based 0.659964
148 RemoteWork_Sometimes 0.350711
149 WorkMethodsFrequencyDataVisualization_Most of ... -0.782662
150 LearningPlatformUsefulnessArxiv_Very useful 0.103936
151 WorkChallengeFrequencyDataAccess_0 0.192623
152 WorkChallengeFrequencyUnusedResults_0 -0.225472
153 EmployerSearchMethod_I was contacted directly ... -0.260579
154 GenderSelect_Female -0.946012
155 JobSatisfaction_10 - Highly Satisfied 0.059784
156 WorkDataSharing_Share Drive/SharePoint 0.237191
157 WorkMLTeamSeatSelect_Standalone Team -0.163679
158 WorkChallengeFrequencyTools_0 0.453525
159 WorkMethodsFrequencySegmentation_0 0.234805
160 WorkToolsFrequencyUnix_Most of the time -0.412597
161 WorkToolsFrequencyR_Sometimes 0.534221
162 WorkMethodsFrequencyDecisionTrees_Often 0.259252
163 WorkDatasetsChallenge_0 0.231599
164 LearningPlatformUsefulnessYouTube_Very useful -0.088507
165 WorkMethodsFrequencyEnsembleMethods_Sometimes 0.282194
166 EmployerMLTime_More than 10 years -0.086353
167 MLToolNextYearSelect_Python -0.148804
168 WorkDatasetSize_1GB -0.126173
169 WorkHardwareSelect_Traditional Workstation 0.917003
170 WorkMethodsFrequencySVMs_0 0.053407
171 CurrentJobTitleSelect_Other 0.660138
172 WorkProductionFrequency_Sometimes 0.262847
173 ParentsEducation_A bachelor's degree 0.209113
174 LearningPlatformUsefulnessArxiv_0 -0.001661
175 WorkMethodsFrequencyDecisionTrees_Most of the ... 0.059537
176 ParentsEducation_A doctoral degree 0.015415
177 WorkToolsFrequencyR_0 0.097373
178 WorkDatasets_0 0.008264
179 LearningPlatformUsefulnessBlogs_Very useful 0.026194
180 WorkMethodsFrequencyDataVisualization_Often -0.788286
181 WorkChallengeFrequencyHiringFunds_0 0.087427
182 FirstTrainingSelect_Online courses (coursera, ... 0.261640
183 WorkToolsFrequencyCloudera_0 -0.246073
184 WorkChallengeFrequencyIntegration_0 -0.441233
185 WorkChallengeFrequencyUnusefulInstrumenting_0 0.084732
186 WorkMethodsFrequencyBayesian_0 0.499167
187 WorkChallengeFrequencyEnvironments_0 0.405823
188 UniversityImportance_Somewhat important -0.274890
189 LearningPlatformUsefulnessSO_Very useful 0.402665
190 WorkMethodsFrequencyNeuralNetworks_0 -0.301835
191 JobSatisfaction_5 0.072469
192 WorkMLTeamSeatSelect_Other -0.231015
193 AlgorithmUnderstandingLevel_Enough to code it ... -0.093961
194 WorkChallengeFrequencyUnusedResults_Often 0.078110
195 WorkDataVisualizations_51-75% of projects 0.190737
196 MLMethodNextYearSelect_Neural Nets -0.024591
197 LearningPlatformUsefulnessDocumentation_0 -0.047319
198 LearningPlatformUsefulnessTextbook_Somewhat us... 0.087347
199 WorkChallengeFrequencyExplaining_0 0.180495
200 WorkDataVisualizations_100% of projects -0.184048
201 LearningPlatformUsefulnessFriends_0 0.145526
202 LearningPlatformUsefulnessCollege_Very useful 0.156822
203 WorkToolsFrequencySQL_Often 0.066856
204 LearningPlatformUsefulnessCourses_Very useful -0.147143
205 FirstTrainingSelect_University courses -0.118731
206 WorkToolsFrequencyHadoop_Often 0.796860
207 WorkMethodsFrequencyRandomForests_Sometimes -0.519073
208 WorkMethodsFrequencyAssociationRules_0 0.644449
209 MajorSelect_Mathematics or statistics -0.056290
210 WorkMethodsFrequencyKNN_Sometimes 0.563291
211 MajorSelect_Engineering (non-computer focused) 0.073537
212 CurrentJobTitleSelect_Scientist/Researcher 0.107928
213 WorkDataTypeSelect_Text data,Relational data -0.464725
214 LearningPlatformUsefulnessSO_0 0.038219
215 WorkChallengeFrequencyTalent_Often 0.584796
216 MLTechniquesSelect_0 0.083263
217 EmployerSizeChange_0 -0.002907
218 WorkMethodsFrequencySimulation_Often -0.093468
219 WorkMethodsFrequencyPCA_Sometimes -0.385294
220 RemoteWork_Rarely -0.033807
221 WorkToolsFrequencyJupyter_Often -0.596180
222 PublicDatasetsSelect_Dataset aggregator/platfo... 0.244097
223 WorkChallengeFrequencyClarity_Most of the time 0.190254
224 WorkDataStorage_Flat files not in a database o... 1.192469
225 WorkToolsFrequencyGCP_0 0.057922
226 EmployerSizeChange_Stayed the same 0.467980
227 WorkHardwareSelect_Laptop + Cloud service (AWS... 0.552663
228 WorkCodeSharing_Other -0.081971
229 WorkDataSharing_Email 0.129658
230 JobSatisfaction_9 -0.200769
231 WorkMethodsFrequencyRandomForests_Often -0.297451
232 PastJobTitlesSelect_Data Analyst -1.639222
233 WorkMLTeamSeatSelect_IT Department -0.633447
234 JobSatisfaction_8 -0.270292
235 MLSkillsSelect_Supervised Machine Learning (Ta... 0.386059
236 MLMethodNextYearSelect_Other 0.046018
237 TitleFit_Poorly 1.026585
238 WorkToolsFrequencyMATLAB_Often -1.487252
239 WorkToolsFrequencyJava_0 -0.438842
240 EmployerMLTime_1-2 years -0.310438
241 WorkChallengeFrequencyTalent_Most of the time 0.146010
242 WorkMethodsFrequencyCross-Validation_Often 0.104158
243 MLMethodNextYearSelect_Time Series Analysis 0.094379
244 WorkChallengeFrequencyPrivacy_Often 0.919702
245 JobSatisfaction_7 -0.214038
246 WorkChallengeFrequencyExplaining_Sometimes -0.026764
247 WorkToolsFrequencyPython_Often 0.087898
248 WorkChallengeFrequencyDomainExpertise_0 0.606880
249 WorkToolsFrequencyHadoop_Most of the time -0.023946
250 LearningPlatformUsefulnessTutoring_0 0.161013
251 WorkChallengeFrequencyScaling_Sometimes 1.045299
252 WorkChallengeFrequencyDomainExpertise_Often 0.357464
253 EmployerSize_1,000 to 4,999 employees -0.041616
254 WorkProductionFrequency_Rarely 0.129231
255 UniversityImportance_Important -0.286191
256 GenderSelect_Male -0.384720
257 DataScienceIdentitySelect_Sort of (Explain more) 0.250319
258 WorkToolsFrequencyNoSQL_Sometimes 0.235315
259 EmployerMLTime_6-10 years 0.251603
260 EmployerSize_100 to 499 employees 0.236422
261 WorkMLTeamSeatSelect_Business Department 0.005845
262 WorkToolsFrequencyNoSQL_Often -0.680183
263 TitleFit_Perfectly 0.153481
264 CurrentJobTitleSelect_Software Developer/Softw... 0.206485
265 MLToolNextYearSelect_Other -0.101759
266 LearningPlatformUsefulnessNewsletters_0 0.905325
267 WorkMethodsFrequencyGBM_Most of the time -0.139914
268 WorkMethodsFrequencyCNNs_0 -0.011438
269 WorkMethodsFrequencyTimeSeriesAnalysis_Most of... -0.300182
270 WorkHardwareSelect_Laptop or Workstation and p... 0.242171
271 WorkChallengeFrequencyDataAccess_Most of the time 0.189199
272 WorkMethodsFrequencyEnsembleMethods_Often -0.237337
273 AlgorithmUnderstandingLevel_Enough to refine a... -0.051625
274 WorkChallengeFrequencyExplaining_Often -0.297820
275 WorkMethodsFrequencyDecisionTrees_Sometimes -0.343836
276 WorkToolsFrequencySpark_Sometimes -0.019461
277 WorkToolsFrequencyC_0 0.188561
278 MajorSelect_0 -0.463205
279 EmployerSearchMethod_0 -0.449360
280 WorkToolsFrequencyPython_Sometimes 0.257468
281 WorkChallengeFrequencyClarity_Sometimes -0.137937
282 WorkHardwareSelect_Basic laptop (Macbook),Lapt... 1.040318
283 WorkToolsFrequencyJupyter_Sometimes -0.709074
284 WorkMethodsFrequencyEnsembleMethods_Most of th... -0.367561
285 JobFunctionSelect_Other 0.113986
286 WorkChallengeFrequencyDirtyData_Sometimes -0.377112
287 WorkInternalVsExternalTools_Approximately half... -0.106135
288 WorkToolsFrequencyAWS_Sometimes 0.651202
289 ParentsEducation_High school -0.103820
290 WorkToolsFrequencyExcel_0 0.046580
291 WorkDatasetSize_100MB -0.265782
292 LearningPlatformUsefulnessConferences_Very useful 0.017170
293 RemoteWork_Always 0.341071
294 WorkHardwareSelect_Basic laptop (Macbook) 0.360181
295 WorkChallengeFrequencyPolitics_Sometimes 0.565232
296 WorkMethodsFrequencyBayesian_Sometimes 0.188945
297 WorkMethodsFrequencyPrescriptiveModeling_Often 0.746246
298 LearningPlatformUsefulnessKaggle_Somewhat useful -0.348093
299 WorkMethodsFrequencyNaiveBayes_Often 0.003339
300 WorkChallengeFrequencyHiringFunds_Most of the ... 0.255938
301 FirstTrainingSelect_Self-taught 0.000264
302 WorkChallengeFrequencyClarity_0 0.300438
303 LearningPlatformUsefulnessCompany_Very useful -0.671252
304 WorkToolsFrequencyTableau_0 -0.270873
305 WorkDataVisualizations_26-50% of projects -0.536218

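The table above shows the relationship between survey answers and coefficients. Depending on your browser, you may have to scroll the table to the right to see the coefficients. Apart from Age, the larger a coefficient's absolute value, the more strongly that item is associated with salary; a positive sign points toward above-median pay and a negative sign toward below-median pay. Age is a numeric feature (as are the other columns that appear without a category suffix, such as TimeVisualizing), so its coefficient is per unit and cannot be compared directly with the dummy-variable coefficients. Note also that the rows are ordered by random forest feature importance, so items nearer the top carry more information (Age carries the most).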
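One way to read the coefficients is to rank them by absolute value and convert them to odds ratios with `exp(coef)`. The sketch below does this for a handful of rows copied from the table above; the column names `odds_ratio` and `abs_coef` are just illustrative. The same steps should work on the full `result.csv` loaded with `pd.read_csv`.

```python
import numpy as np
import pandas as pd

# A few rows copied from the coefficient table above.
coefs = pd.DataFrame({
    "feature_name": [
        "Age",
        "Tenure_More than 10 years",
        "CurrentJobTitleSelect_Data Analyst",
        "GenderSelect_Female",
        "WorkToolsFrequencyAWS_Often",
    ],
    "coef": [0.068479, 1.697553, -1.449301, -0.946012, 1.352131],
})

# exp(coef) is the multiplicative change in the odds of above-median pay.
coefs["odds_ratio"] = np.exp(coefs["coef"])

# Rank by magnitude so strong negative effects are not buried at the bottom.
coefs["abs_coef"] = coefs["coef"].abs()
ranked = coefs.sort_values("abs_coef", ascending=False).reset_index(drop=True)

print(ranked[["feature_name", "coef", "odds_ratio"]])
```

Read this way, "Tenure_More than 10 years" corresponds to odds of above-median pay roughly 5.5 times higher with the other features held fixed, while Age corresponds to about 1.07 times per year. These are associations from a regularized model (scikit-learn's `LogisticRegression` applies an L2 penalty by default), not causal effects.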

Description of the answer items

If you want to know what each answer item means, see the following schema: https://www.kaggle.com/kaggle/kaggle-survey-2017/downloads/schema.csv

Generalization performance

I plotted the ROC curve, so here it is for reference. [Figure: ROC curve (download (3).png)]
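The plotting code is not shown in the article, so the following is only a sketch of how such a curve could be computed with scikit-learn's `roc_curve` and `roc_auc_score`. It uses synthetic data as a stand-in; with the article's variables, the same calls would take `lr.predict_proba(X_test)[:, 1]` and `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the selected survey features (assumption: not the real data).
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ROC needs scores, not hard labels: use the predicted probability of the positive class.
scores = lr.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)

print("AUC: {:.3f}".format(auc))
```

The `fpr` and `tpr` arrays can then be passed to a plotting library, for example `matplotlib.pyplot.plot(fpr, tpr)`, to draw the curve.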

Links

https://www.kaggle.com/kaggle/kaggle-survey-2017/