Starting from a short introduction to Pandas, this article walks step by step through the library's current state, memory optimization, and more. It is a best-practices tutorial, suited both to readers who have already used Pandas and to beginners who want to get started.
Through this article, you will hopefully discover one or several new ways of writing Pandas code.
Current state of Pandas;
Memory optimization;
Indexing;
Method chaining;
Miscellaneous tips.
While reading this article, I recommend looking up the docstring of every function you are not familiar with. A quick Google search and a few seconds in the Pandas documentation will make the whole read much more enjoyable.
Pandas is "an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language." In short, it offers data abstractions called DataFrame and Series (and, for those who once used Panel, that one has been deprecated). Behind managed indexes, it gives fast access to data, supports analytical and transformation operations, and can even plot (through the matplotlib backend).
At the time of writing, the latest Pandas release is v0.25.0 (https://github.com/pandas-dev/pandas/releases/tag/v0.25.0).
Pandas is progressively rolling toward version 1.0, and to get there it is changing many details people have taken for granted. Marc Garcia, one of the core Pandas developers, gave a very interesting talk, "Towards Pandas 1.0."
Talk link: https://www.youtube.com/watch?v=hK6o_TDXXN8
To sum it up in one sentence, Pandas v1.0 mainly improves stability (e.g. time series) and removes unused code paths (e.g. SparseDataFrame).
Let's get started! As a toy dataset, we will use "Suicide rates overview 1985 to 2016." It is simple enough, yet more than sufficient to get your hands dirty with Pandas.
Dataset link: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016
Before diving into the code, if you want to reproduce the results, run the data-preparation snippet below first to make sure the column names and types are correct.
```python
import pandas as pd
import numpy as np
import os

# to download: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016
data_path = 'path/to/folder/'
df = (pd.read_csv(filepath_or_buffer=os.path.join(data_path, 'master.csv'))
      .rename(columns={'suicides/100k pop': 'suicides_per_100k',
                       'gdp_for_year ($)': 'gdp_year',
                       'gdp_per_capita ($)': 'gdp_capita',
                       'country-year': 'country_year'})
      .assign(gdp_year=lambda _df: _df['gdp_year']
              .str.replace(',', '')
              .astype(np.int64)))
```
Tip: if you are reading a large file, pass chunksize=N to read_csv (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html); it returns an iterator that yields DataFrame objects.
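To make the tip concrete, here is a minimal, self-contained sketch; the tiny in-memory CSV stands in for a real large file on disk:

```python
import io
import pandas as pd

# A small in-memory CSV stands in for a large file on disk.
csv_buffer = io.StringIO("country,year\nAlbania,1987\nAlbania,1988\nPoland,1987\n")

# chunksize=N makes read_csv return an iterator of DataFrames,
# each holding at most N rows, so the file never loads all at once.
chunks = pd.read_csv(csv_buffer, chunksize=2)
total_rows = sum(len(chunk) for chunk in chunks)
print(total_rows)  # 3
```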
```python
>>> df.columns
Index(['country', 'year', 'sex', 'age', 'suicides_no', 'population',
       'suicides_per_100k', 'country_year', 'HDI for year', 'gdp_year',
       'gdp_capita', 'generation'],
      dtype='object')
```
There are 101 countries, years from 1985 to 2016, two sexes, six generations, and six age groups. A few ways to obtain this information:
unique() and nunique() return the unique values in a column (or the number of unique values);
```python
>>> df['generation'].unique()
array(['Generation X', 'Silent', 'G.I. Generation', 'Boomers',
       'Millenials', 'Generation Z'], dtype=object)

>>> df['country'].nunique()
101
```
describe() outputs different statistics for each column (e.g. min, max, mean, count); with include='all' it also reports, for each object column, the number of unique elements and the most frequent element;
head() and tail() visualize a small slice of the data frame.
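A small illustration of the describe(include='all') behavior described above (the toy frame here is invented for the example, not the suicide dataset):

```python
import pandas as pd

toy = pd.DataFrame({'country': ['Albania', 'Albania', 'Poland'],
                    'year': [1987, 1988, 1987]})

# describe() covers numeric columns by default; include='all' adds
# object columns, reporting unique counts and the most frequent value.
summary = toy.describe(include='all')

# 'unique' and 'top' rows only exist because of include='all'
print(summary.loc['unique', 'country'])  # 2
print(summary.loc['top', 'country'])     # Albania
```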
Before processing the data, it is an important step to understand it and to pick the right type for each column of the data frame.
Internally, Pandas stores a data frame as numpy arrays grouped by type (e.g. one float64 block, one int32 block).
```python
import pandas as pd


def mem_usage(df: pd.DataFrame) -> str:
    """Style the memory usage of a DataFrame to be readable as MB.

    Parameters
    ----------
    df : pd.DataFrame
        Data frame to measure.

    Returns
    -------
    str
        Complete memory usage as a string formatted for MB.
    """
    return f'{df.memory_usage(deep=True).sum() / 1024 ** 2 : 3.2f} MB'


def convert_df(df: pd.DataFrame, deep_copy: bool = True) -> pd.DataFrame:
    """Automatically convert columns that are worth storing as ``category`` dtype.

    Parameters
    ----------
    df : pd.DataFrame
        Data frame to convert.
    deep_copy : bool
        Whether or not to perform a deep copy of the original data frame.

    Returns
    -------
    pd.DataFrame
        Optimized copy of the input data frame.
    """
    return df.copy(deep=deep_copy).astype({
        col: 'category' for col in df.columns
        if df[col].nunique() / df[col].shape[0] < 0.5})
```
Pandas provides a method called memory_usage() that reports a data frame's memory consumption. In the code above, deep=True is specified to make sure the actual system usage is taken into account.
memory_usage(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.memory_usage.html
Understanding column types (https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-dtypes) matters: two simple techniques can save up to 90% of memory usage:
Besides downsizing numeric types (int32 instead of int64), Pandas offers the categorical type: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
If you come from R, you will recognize it as the factor type.
This categorical type replaces repeated values with an index and stores the actual values elsewhere. The textbook example is countries. Instead of storing the same strings "Switzerland" or "Poland" many times, why not simply replace them with 0 and 1 and keep the mapping in a dictionary?
```python
categorical_dict = {0: 'Switzerland', 1: 'Poland'}
```
Pandas does virtually the same thing, while adding all the methods needed to actually use this type and still display the country names.
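A minimal sketch of what the category dtype does under the hood (toy data invented for the example, not the suicide dataset):

```python
import pandas as pd

countries = pd.Series(['Switzerland', 'Poland', 'Switzerland', 'Poland'],
                      dtype='category')

# Values are stored once in .cat.categories; each row keeps only a
# small integer code pointing into that lookup table.
print(list(countries.cat.categories))  # ['Poland', 'Switzerland']
print(list(countries.cat.codes))       # [1, 0, 1, 0]
```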
Back to the convert_df() method: it automatically converts a column's type to category if fewer than 50% of its values are unique. That threshold is arbitrary, but since converting a column's type means moving data between numpy arrays, the gain must outweigh the cost.
```python
>>> mem_usage(df)
10.28 MB
>>> mem_usage(df.set_index(['country', 'year', 'sex', 'age']))
5.00 MB
>>> mem_usage(convert_df(df))
1.40 MB
>>> mem_usage(convert_df(df.set_index(['country', 'year', 'sex', 'age'])))
1.40 MB
```
With the "smart" converter, the data frame uses almost 10x less memory (7.34x to be precise).
Pandas is powerful, but it comes at a price. When you load a DataFrame, it creates an index and stores the data in numpy arrays. What does this mean? Once the data frame is loaded, data can be accessed very quickly, as long as the index is managed properly.
There are two main ways to access the data: by index and by query. Depending on the situation you may only have one of these options, but in most cases the index (and multi-index) is the better choice. Consider the following example:
```python
>>> %%time
>>> df.query('country == "Albania" and year == 1987 and sex == "male" and age == "25-34 years"')
CPU times: user 7.27 ms, sys: 751 µs, total: 8.02 ms
# ==================
>>> %%time
>>> mi_df.loc['Albania', 1987, 'male', '25-34 years']
CPU times: user 459 µs, sys: 1 µs, total: 460 µs
# ==================
>>> %%time
>>> mi_df = df.set_index(['country', 'year', 'sex', 'age'])
CPU times: user 10.8 ms, sys: 2.2 ms, total: 13 ms
```
Accessing the data through the query is far slower here (roughly 8 ms versus 460 µs), though building the multi-index itself costs about 13 ms. So if you only need to retrieve the data once (which rarely happens), query is the right method. Otherwise, stick with the index; the CPU will thank you for it.
.set_index(drop=False) keeps the column used as the new index instead of dropping it.
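A short sketch of the drop=False behavior (the toy frame is invented for the example):

```python
import pandas as pd

df = pd.DataFrame({'country': ['Albania', 'Poland'], 'year': [1987, 1987]})

# drop=False keeps 'country' both as the index and as a regular column,
# so label-based lookup and column access keep working side by side.
indexed = df.set_index('country', drop=False)
print('country' in indexed.columns)  # True
print(indexed.index.name)            # country
```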
The .loc[]/.iloc[] methods are great for reading a data frame, but not for modifying it. If you need to build one by hand (e.g. in a loop), first consider another data structure (e.g. a dictionary or a list) and create the DataFrame once all the data is ready. Otherwise, for every new row, Pandas will update the index, which is not a simple hash-map operation.
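A minimal sketch of the recommended pattern: accumulate plain Python rows first, then build the DataFrame once (the loop and column names are made up for the example):

```python
import pandas as pd

# Building rows in a plain list of dicts first avoids re-building the
# index on every insertion; the DataFrame is created once at the end.
rows = []
for year in (1987, 1988, 1989):
    rows.append({'year': year, 'count': year - 1986})

df = pd.DataFrame(rows)
print(df.shape)  # (3, 2)
```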
```python
>>> (pd.DataFrame({'a': range(2), 'b': range(2)}, index=['a', 'a'])
     .loc['a'])
   a  b
a  0  0
a  1  1
```
Consequently, an unsorted index can degrade performance. To check whether an index is sorted, and to sort it, there are mainly two methods:
```python
>>> %%time
>>> mi_df.sort_index()
CPU times: user 34.8 ms, sys: 1.63 ms, total: 36.5 ms
>>> mi_df.index.is_monotonic
True
```
Method chaining on a DataFrame means chaining multiple methods that each return a DataFrame, i.e. methods of the DataFrame class. In current Pandas versions, method chaining is used to avoid storing intermediate variables and to avoid code like the following:
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a_column': [1, -999, -999],
                   'powerless_column': [2, 3, 4],
                   'int_column': [1, 1, -1]})
df['a_column'] = df['a_column'].replace(-999, np.nan)
df['power_column'] = df['powerless_column'] ** 2
df['real_column'] = df['int_column'].astype(np.float64)
df = df.apply(lambda _df: _df.replace(4, np.nan))
df = df.dropna(how='all')
```
```python
df = (pd.DataFrame({'a_column': [1, -999, -999],
                    'powerless_column': [2, 3, 4],
                    'int_column': [1, 1, -1]})
      .assign(a_column=lambda _df: _df['a_column'].replace(-999, np.nan))
      .assign(power_column=lambda _df: _df['powerless_column'] ** 2)
      .assign(real_column=lambda _df: _df['int_column'].astype(np.float64))
      .apply(lambda _df: _df.replace(4, np.nan))
      .dropna(how='all'))
```
The method-chaining toolbox consists of various methods (e.g. apply, assign, loc, query, pipe, groupby, and agg) whose output is a DataFrame or Series object (or a DataFrameGroupBy).
```python
(df
 .groupby('age')
 .agg({'generation': 'unique'})
 .rename(columns={'generation': 'unique_generation'}))

# Recommended from v0.25
# .agg(unique_generation=('generation', 'unique')))
```
Aside from learning that "Generation X" covers three age groups, let's break this chain down. The first step groups by age group. This returns a DataFrameGroupBy object, on which each group is aggregated by selecting its unique generation labels.
Here the aggregation method is 'unique', but agg also accepts any (anonymous) function.
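As a small illustration that agg accepts an anonymous function (toy data; the per-group range computed by the lambda is not part of the original analysis):

```python
import pandas as pd

toy = pd.DataFrame({'age': ['15-24', '15-24', '25-34'],
                    'suicides_no': [3, 5, 7]})

# agg accepts an arbitrary (anonymous) function per column;
# here a lambda computes the range (max - min) within each group.
out = toy.groupby('age').agg({'suicides_no': lambda s: s.max() - s.min()})
print(out.loc['15-24', 'suicides_no'])  # 2
```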
In version 0.25, Pandas introduced a new way of using agg: https://dev.pandas.io/whatsnew/v0.25.0.html#groupby-aggregation-with-relabeling.
```python
(df
 .groupby(['country', 'year'])
 .agg({'suicides_per_100k': 'sum'})
 .rename(columns={'suicides_per_100k': 'suicides_sum'})
 # Recommended from v0.25
 # .agg(suicides_sum=('suicides_per_100k', 'sum'))
 .sort_values('suicides_sum', ascending=False)
 .head(10))
```
Getting the ten countries and years with the highest suicide rate, using sort_values and head.
```python
(df
 .groupby(['country', 'year'])
 .agg({'suicides_per_100k': 'sum'})
 .rename(columns={'suicides_per_100k': 'suicides_sum'})
 # Recommended from v0.25
 # .agg(suicides_sum=('suicides_per_100k', 'sum'))
 .nlargest(10, columns='suicides_sum'))
```
Getting the ten countries and years with the highest suicide rate, using nlargest.
In both examples the output is the same: a DataFrame with a two-level MultiIndex (country and year), plus a new column suicides_sum holding the ten largest values in sorted order.
nlargest(10) is more efficient than sort_values(ascending=False).head(10).
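A quick sanity check of that claim on a toy Series (invented for the example): both spellings agree on the result, while nlargest avoids sorting the whole data:

```python
import pandas as pd

s = pd.Series([5, 1, 9, 3, 7])

# nlargest keeps only the top n values instead of fully sorting,
# which matters on large frames; both spellings agree on the result.
top_nlargest = s.nlargest(2)
top_sorted = s.sort_values(ascending=False).head(2)
print(top_nlargest.equals(top_sorted))  # True
```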
Another interesting method is unstack (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.unstack.html), which pivots an index level into columns.
```python
(mi_df
 .loc[('Switzerland', 2000)]
 .unstack('sex')
 [['suicides_no', 'population']])
```
"age" is the index, and the columns "suicides_no" and "population" each get a second column level, "sex".
The next method, pipe, is one of the most versatile. It allows pipeline-style operations (as in a shell script) that go beyond what plain chaining can express.
```python
def log_head(df, head_count=10):
    print(df.head(head_count))
    return df


def log_columns(df):
    print(df.columns)
    return df


def log_shape(df):
    print(f'shape = {df.shape}')
    return df
```
As an example, let's validate that the country_year column is consistent with the year column:
```python
import re

(df
 .assign(valid_cy=lambda _serie: _serie.apply(
     lambda _row: re.split(r'(?=\d{4})', _row['country_year'])[1] == str(_row['year']),
     axis=1))
 .query('valid_cy == False')
 .pipe(log_shape))
```
A pipeline validating the year embedded in the "country_year" column.
The output of the pipeline is a DataFrame, but it also prints to standard output (console/REPL).
```
shape = (0, 13)
```
```python
(df
 .pipe(log_shape)
 .query('sex == "female"')
 .groupby(['year', 'country'])
 .agg({'suicides_per_100k': 'sum'})
 .pipe(log_shape)
 .rename(columns={'suicides_per_100k': 'sum_suicides_per_100k_female'})
 # Recommended from v0.25
 # .agg(sum_suicides_per_100k_female=('suicides_per_100k', 'sum'))
 .nlargest(n=10, columns=['sum_suicides_per_100k_female']))
```

```
shape = (27820, 12)
shape = (2321, 1)
```
Besides logging to the console, pipe can also apply functions directly to the data frame's columns.
```python
from sklearn.preprocessing import MinMaxScaler


def norm_df(df, columns):
    return df.assign(**{col: MinMaxScaler().fit_transform(
        df[[col]].values.astype(float))
        for col in columns})


for sex in ['male', 'female']:
    print(sex)
    print(df
          .query(f'sex == "{sex}"')
          .groupby(['country'])
          .agg({'suicides_per_100k': 'sum', 'gdp_year': 'mean'})
          .rename(columns={'suicides_per_100k': 'suicides_per_100k_sum',
                           'gdp_year': 'gdp_year_mean'})
          # Recommended in v0.25
          # .agg(suicides_per_100k=('suicides_per_100k_sum', 'sum'),
          #      gdp_year=('gdp_year_mean', 'mean'))
          .pipe(norm_df, columns=['suicides_per_100k_sum', 'gdp_year_mean'])
          .corr(method='spearman'))
    print('')
```
Is the number of suicides correlated with GDP decline? Does it depend on sex?
```
male
                       suicides_per_100k_sum  gdp_year_mean
suicides_per_100k_sum               1.000000       0.421218
gdp_year_mean                       0.421218       1.000000

female
                       suicides_per_100k_sum  gdp_year_mean
suicides_per_100k_sum               1.000000       0.452343
gdp_year_mean                       0.452343       1.000000
```
Digging into the code: norm_df() takes a DataFrame and a list of columns to scale with MinMaxScaler. Using a dictionary comprehension, it builds a dictionary {column_name: scaled_values, ...}, which is then unpacked into the arguments of assign() (column_name=scaled_values, ...).
In this particular case, the min-max scaling does not change the output of corr (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html); it is there only for the sake of the example.
In a (distant?) future, lazy evaluation may come to method chains, so investing in chaining now may be a good idea.
itertuples() iterates over the rows of a data frame far more efficiently than iterrows();
```python
>>> %%time
>>> for row in df.iterrows(): continue
CPU times: user 1.97 s, sys: 17.3 ms, total: 1.99 s

>>> for tup in df.itertuples(): continue
CPU times: user 55.9 ms, sys: 2.85 ms, total: 58.8 ms
```
In a Jupyter notebook, writing %%time at the top of a cell is a handy way to measure its execution time;
the UInt8 dtype (https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#support-for-integer-na) supports missing (NaN) values alongside integers;
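A minimal sketch of a nullable integer dtype holding a missing value (toy Series; behavior as in recent pandas versions):

```python
import pandas as pd

# A plain numpy int64 column cannot hold NaN, but pandas nullable
# integer dtypes (e.g. 'UInt8') represent missing values as pd.NA.
s = pd.Series([1, 2, None], dtype='UInt8')
print(s.dtype)         # UInt8
print(s.isna().sum())  # 1
```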
Remember that any intensive I/O (e.g. unpacking a big CSV file) will perform better with lower-level methods (sticking to core Python functions as much as possible).
There are also some useful methods and data structures not covered in this article that are well worth your time:
Pivot tables: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html;
Time series / date functionality: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html;
Plotting: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html.
I hope this short article has given you a better understanding of how Pandas works under the hood and of the current state of the library. It also demonstrated different tools for optimizing a data frame's memory usage and for analyzing data quickly. By now, the concepts of indexing and querying should also be clearer. Finally, try writing longer chains with method chaining.
Some additional notes are available here: https://github.com/unit8co/medium-pandas-wan
Besides all the code from this article, they include timing benchmarks comparing a simply indexed data frame (df) with a multi-indexed one (mi_df).
Practice makes perfect, so keep sharpening your skills and help us build a better world.
Original article: https://medium.com/unit8-machine-learning-publication/from-pandas-wan-to-pandas-master-4860cf0ce442
Rudolf Höhn