Recommended: Top 10 Coding Mistakes Made by Data Scientists (with Solutions)

September 2, 2018 · 9suan

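Author: Norman Niemer; Translator: Li Runjia (李润嘉); Proofreader: Li Jie (李洁)

About 2,000 characters; suggested reading time: 10 minutes.

This article presents solutions to the 10 mistakes a senior data scientist sees most often.

A data scientist is "a person who is better at statistics than any software engineer and better at software engineering than any statistician". Many data scientists have a statistics background but little experience in software engineering. I'm a senior data scientist ranked in the top 1% on Stack Overflow for Python, and I work with many (junior) data scientists. Here are the 10 mistakes I see most often, along with a solution for each: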

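Not sharing the data referenced in code
Hardcoding inaccessible paths
Mixing code with data
Committing data to git along with source code
Writing functions instead of DAGs
Writing for loops
Not writing unit tests
Not documenting code
Saving data as CSV or pickle
Using Jupyter notebooks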

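1. Not sharing the data referenced in code

Data science needs code and data. So for someone else to reproduce your results, they need access to the data. It sounds basic, but many people forget to share the data that goes with their code.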


    import pandas as pd

    df = pd.read_csv('file-i-dont-have.csv')  # fails: the file was never shared
    do_stuff(df)

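Solution: Use d6tpipe (https://github.com/d6t/d6tpipe) to share data files with your code, upload them to S3, the web, Google Drive, and so on, or save them to a database so that others can retrieve the files (but don't add them to git; see below).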

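2. Hardcoding inaccessible paths

Similar to mistake #1, if you hardcode paths that others cannot access, they cannot run your code and have to hunt through it to change the paths by hand. Maddening!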


    import pandas as pd

    df = pd.read_csv('/path/i-dont/have/data.csv')  # fails: this path exists only on the author's machine
    do_stuff(df)

    # or
    import os

    os.chdir(r'c:\Users\yourname\desktop\python')  # fails: this directory exists only on the author's machine

Solution: Use relative paths, global path configuration variables, or d6tpipe to make your data easy to access.

d6tpipe:

https://github.com/d6t/d6tpipe
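One way to apply the relative-path and global-configuration advice is sketched below. The `PROJECT_DATA_DIR` environment variable, the `data` folder name, and the `data_path` helper are assumptions made for illustration; none of them come from the original article or from d6tpipe.

```python
import os
from pathlib import Path

# Resolve the data folder in one place. The default is a relative path, and
# PROJECT_DATA_DIR is a hypothetical environment variable that overrides it.
DATA_DIR = Path(os.environ.get("PROJECT_DATA_DIR", "data"))


def data_path(filename: str) -> Path:
    """Return the location of a data file inside the configured data folder."""
    return DATA_DIR / filename


# Scripts ask the helper for paths instead of hardcoding them.
print(data_path("data.csv"))
```

A colleague whose data lives elsewhere then only needs to set one variable rather than edit every script.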

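3. Mixing code with data

Since data science code needs data, why not dump everything in the same directory? While you are at it, you can save images, reports, and other junk there too. Yikes, what a mess!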


    ├── data.csv
    ├── ingest.py
    ├── other-data.csv
    ├── output.png
    ├── report.html
    └── run.py

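Solution: Organize your directory into categories such as data, reports, and code. See Cookiecutter Data Science or the d6tflow project templates [see #5], and use the tools mentioned in #1 to store and share data.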

Cookiecutter Data Science:

https://drivendata.github.io/cookiecutter-data-science/

d6tflow project templates:

https://github.com/d6t/d6tflow-template
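As a rough illustration of sorting a messy directory into categories, the sketch below assigns each file from the listing above to a subfolder. The folder names (`data`, `reports`, `src`, `misc`) and the extension mapping are assumptions; the templates linked above define their own layouts.

```python
from pathlib import PurePath

# Hypothetical mapping from file extension to project subfolder.
FOLDER_BY_SUFFIX = {
    ".csv": "data",
    ".png": "reports",
    ".html": "reports",
    ".py": "src",
}


def target_folder(filename: str) -> str:
    """Pick the subfolder a file belongs in; unknown types go to 'misc'."""
    return FOLDER_BY_SUFFIX.get(PurePath(filename).suffix.lower(), "misc")


messy = ["data.csv", "ingest.py", "other-data.csv", "output.png", "report.html", "run.py"]

# Group the file names by their target folder.
layout = {}
for name in messy:
    layout.setdefault(target_folder(name), []).append(name)

for folder in sorted(layout):
    print(folder, layout[folder])
```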

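4. Committing data to git along with source code

Most people now version control their code (if you don't, that is another mistake; see git: https://git-scm.com/). When trying to share data, it is tempting to add the data files to version control too. That is fine for very small files, but git is not optimized for data, especially large files.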


    git add data.csv

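Solution: Use the tools mentioned in #1 to store and share data. If you really want to version control data, see d6tpipe, DVC, and Git Large File Storage.

d6tpipe:

https://github.com/d6t/d6tpipe

DVC:

https://dvc.org/

Git Large File Storage:

https://git-lfs.github.com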

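5. Writing functions instead of DAGs

Enough about data; let's talk about the actual code! Functions are among the first things you learn when you learn to code, so data science code usually ends up as a series of functions that run linearly.

That causes several problems; see "4 Reasons Why Your Machine Learning Code Is Probably Bad":

https://github.com/d6t/d6t-python/blob/master/blogs/reasons-why-bad-ml-code.rst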


    import pandas as pd
    import sklearn.svm

    def process_data(data, parameter):
        data = do_stuff(data)
        data.to_pickle('data.pkl')

    data = pd.read_csv('data.csv')
    process_data(data, parameter=None)  # parameter is unused in this example
    df_train = pd.read_pickle('data.pkl')
    model = sklearn.svm.SVC()
    model.fit(df_train.iloc[:, :-1], df_train['y'])

Solution: Data science code is better written as a set of tasks with dependencies between them than as a linear chain of functions. Use d6tflow or airflow.

d6tflow:

https://github.com/d6t/d6tflow-template

airflow:

https://airflow.apache.org
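To make "a set of tasks with dependencies" concrete, here is a minimal sketch that uses only the Python standard library. It is not the d6tflow or airflow API; the task names and the `run_pipeline` helper are hypothetical, and skipping finished tasks is deliberately simplistic (it does not check whether an upstream change invalidated a cached result).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks it depends on (hypothetical names).
dependencies = {
    "load_data": set(),
    "clean_data": {"load_data"},
    "build_features": {"clean_data"},
    "train_model": {"build_features"},
    "make_report": {"train_model", "clean_data"},
}


def run_pipeline(deps, already_done=()):
    """Visit tasks in dependency order, skipping those already completed."""
    executed = []
    for task in TopologicalSorter(deps).static_order():
        if task in already_done:
            continue  # output is assumed to be cached, so nothing to recompute
        executed.append(task)  # a real task would do its work here
    return executed


print(run_pipeline(dependencies))
print(run_pipeline(dependencies, already_done={"load_data", "clean_data"}))
```

Declaring the dependencies explicitly is what lets a workflow tool decide what to run and in which order, instead of relying on the order of lines in a script.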

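6. Writing for loops

Like functions, for loops are among the first things you learn when you learn to code. They are easy to understand, but they are slow and overly verbose, and they usually mean you are unaware of the vectorized alternatives.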


    import math

    x = range(10)
    avg = sum(x) / len(x)
    std = math.sqrt(sum((i - avg) ** 2 for i in x) / len(x))
    zscore = [(i - avg) / std for i in x]
    # should be: scipy.stats.zscore(x)

    # or
    groupavg = []
    for i in df['g'].unique():
        dfg = df[df['g'] == i]
        groupavg.append(dfg['g'].mean())
    # should be: df.groupby('g').mean()

Solution: NumPy, SciPy, and pandas provide vectorized functions for most of the cases where you might think you need a for loop.

NumPy:

http://www.numpy.org/

SciPy:

https://www.scipy.org/

pandas:

https://pandas.pydata.org
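The two loops above can be replaced along the following lines. This is a sketch with made-up data; the `value` column is an assumption added so that the group average has something to aggregate.

```python
import numpy as np
import pandas as pd

# Z-score without a loop: NumPy applies the arithmetic to the whole array.
x = np.arange(10)
zscore = (x - x.mean()) / x.std()  # x.std() divides by n, like the loop version

# Group averages without a loop; 'g' and 'value' are hypothetical column names.
df = pd.DataFrame({"g": ["a", "a", "b", "b"], "value": [1.0, 3.0, 10.0, 20.0]})
groupavg = df.groupby("g")["value"].mean()

print(zscore.round(2))
print(groupavg)
```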

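7. Not writing unit tests

As data, parameters, or user input change, your code can break, sometimes without you noticing. That can produce bad output, and if someone makes decisions based on your output, bad data will lead to bad decisions.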

Solution: Use assert statements to check data quality. pandas has equality tests, d6tstack has checks for data ingestion, and d6tjoin has checks for data joins.

pandas equality tests:

https://pandas.pydata.org/pandas-docs/stable/reference/general_utility_functions.html

d6tstack:

https://github.com/d6t/d6tstack

d6tjoin:

https://github.com/d6t/d6tjoin/blob/master/examples-prejoin.ipyn

Example data checks:


    import d6tjoin.utils

    assert df['id'].unique().shape[0] == len(ids)  # have data for all ids?
    assert df.isna().mean().max() < 0.9  # catch missing values: under 90% missing in every column
    assert df.groupby(['g', 'date']).size().max() == 1  # no duplicate values/date?
    assert d6tjoin.utils.PreJoin([df1, df2], ['id', 'date']).is_all_matched()  # all ids matched?
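The pandas equality tests mentioned above can also back a conventional unit test. The sketch below assumes a hypothetical `add_total` transformation with made-up `price` and `qty` columns; only `pd.testing.assert_frame_equal` is part of pandas.

```python
import pandas as pd


def add_total(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transformation under test: add a 'total' column."""
    out = df.copy()
    out["total"] = out["price"] * out["qty"]
    return out


def test_add_total():
    df = pd.DataFrame({"price": [2.0, 5.0], "qty": [3, 1]})
    expected = pd.DataFrame({"price": [2.0, 5.0], "qty": [3, 1], "total": [6.0, 5.0]})
    # Raises AssertionError with a readable diff if the frames differ.
    pd.testing.assert_frame_equal(add_total(df), expected)
    # The input must not be modified in place.
    assert "total" not in df.columns


test_add_total()
```

A test runner such as pytest would pick up `test_add_total` by its name, so the explicit call on the last line is only needed when running the file directly.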

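8. Not documenting code

I get it, you are in a hurry to produce some analysis. You hack things together and hand the results to your client or boss. A week later they come back and say "can you change xyz" or "can you update this please". You look at your code and cannot remember why you wrote it that way. Now it feels like running someone else's code.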


    def some_complicated_function(data):
        data = data[data['column'] != 'wrong']
        data = data.groupby('date').apply(lambda x: complicated_stuff(x))
        data = data[data['value'] < 0.9]
        return data

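Solution: Take the extra time, even if it is after you have delivered the analysis, to document what you did. You will thank yourself, and others will thank you even more. You will look like a pro!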
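For comparison, a documented variant of a function like the one above might look as follows. This is only a sketch: the function name, the `max_value` parameter, and the reasons given in the comments are invented for illustration, and the undefined `complicated_stuff` step is left out.

```python
import pandas as pd


def filter_valid_values(data: pd.DataFrame, max_value: float = 0.9) -> pd.DataFrame:
    """Keep trustworthy rows whose value is below a cutoff.

    Parameters
    ----------
    data : DataFrame with 'column' and 'value' columns.
    max_value : rows with value >= max_value are dropped (default 0.9).

    Returns
    -------
    A filtered copy; the input is left unchanged.
    """
    # Rows marked 'wrong' come from a known-bad source (hypothetical reason).
    data = data[data["column"] != "wrong"]
    # Values at or above the cutoff are treated as outliers (hypothetical reason).
    return data[data["value"] < max_value].copy()


df = pd.DataFrame({"column": ["ok", "wrong", "ok"], "value": [0.5, 0.1, 0.95]})
print(filter_valid_values(df))
```

The docstring records what the function expects and returns, while the inline comments record why each filter exists, which is the part that is hardest to reconstruct a week later.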

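9. Saving data as CSV or pickle

Back to data; it is data science after all. Just like functions and for loops, CSV and pickle files are commonly used, but they are not actually very good. CSV files do not include a schema, so everyone has to parse numbers and dates again. Pickle files solve that, but they only work in Python and are not compressed. Neither is a good format for storing large datasets.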


    import pandas as pd

    def process_data(data, parameter):
        data = do_stuff(data)
        data.to_pickle('data.pkl')

    data = pd.read_csv('data.csv')
    process_data(data, parameter=None)  # parameter is unused in this example
    df_train = pd.read_pickle('data.pkl')

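Solution: Use parquet or another binary data format that carries a data schema, ideally one that also compresses the data. d6tflow automatically saves the data output of tasks as parquet, so you do not have to deal with it.

parquet:

https://github.com/dask/fastparquet

d6tflow:

https://github.com/d6t/d6tflow-template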
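The schema problem with CSV is easy to reproduce. The sketch below round-trips a small frame with made-up values through CSV and through pickle, entirely in memory, and compares the column types afterwards.

```python
import io
import pickle

import pandas as pd

# 'id' holds zero-padded strings and 'date' holds real timestamps (made-up values).
df = pd.DataFrame({
    "id": ["007", "042"],
    "date": pd.to_datetime(["2019-01-31", "2019-02-28"]),
})

# Round trip through CSV: the text file carries no types.
csv_text = df.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# Round trip through pickle: types survive, but only Python can read the result.
from_pickle = pickle.loads(pickle.dumps(df))

print(from_csv.dtypes)     # 'id' was parsed as an integer, 'date' is no longer a datetime
print(from_pickle.dtypes)  # same dtypes as the original frame
```

A schema-carrying format such as parquet is meant to keep the types the way pickle does while staying readable outside Python; in pandas that is `df.to_parquet(...)`, which needs a parquet engine such as pyarrow or fastparquet installed.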

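10. Using Jupyter notebooks

Let's conclude with a controversial one: Jupyter notebooks are as common as CSV files. Lots of people use them, but that does not make them good. Jupyter notebooks promote many of the bad programming habits mentioned above, in particular: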

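Keeping all files in one directory
Writing code that runs top to bottom instead of as a DAG
Not modularizing your code
Difficult to debug
Code and output mixed in one file
Poor version control

They are easy to get started with, but they scale poorly.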

Solution: Use PyCharm and/or Spyder.

PyCharm:

https://www.jetbrains.com/pycharm/

Spyder:

https://www.spyder-ide.org

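About the author: Norman Niemer is the Chief Data Scientist at a large asset manager, where he delivers data-driven investment insights. He holds an MS in Financial Engineering from Columbia University and a BS in Banking and Finance from Cass Business School (London).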

Original title:

Top 10 Coding Mistakes Made by Data Scientists

Original link:

https://github.com/d6t/d6t-python/blob/master/blogs/top10-mistakes-coding.md

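About the translator: Li Runjia (李润嘉) is a master's student in applied statistics at Capital Normal University, with a strong interest in data science and machine learning and a love of learning languages. Aspires to be an interesting person: to learn what is worth learning, go where the heart leads, think and act boldly, and make the years count.

Reposted from: the 数据派THU WeChat official account.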


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
