Learning Resources
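The most authoritative resource for learning pandas is probably "Python for Data Analysis, 2nd Edition", which the author of pandas wrote and open-sourced himself.
There is a Chinese translation approved by the author, thanks to the experts who did the groundwork. For beginners, working through this book step by step from the Python basics is a genuinely good learning experience.
After reading these two books, the official pandas online documentation will feel familiar. Reference: the official pandas documentation.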
Getting Started with pandas
The first step, of course, is to import the pandas module. We import numpy along with it.

    import pandas as pd
    import numpy as np
Introduction to pandas Data Structures
pandas has essentially two data structures: Series (one-dimensional) and DataFrame (two-dimensional).
Series
    # Generate a normally distributed Series of 10 elements; the index is 0-9
0 -2.030423
1 -0.142240
2 0.467358
3 -1.106364
4 1.700141
5 -1.656620
6 -0.302091
7 0.205873
8 -0.769502
9 -0.022895
dtype: float64
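
The code that produced this output did not survive in the listing above. A minimal sketch that yields a Series of the same shape, assuming `np.random.randn` was the generator (the values differ on every run):

```python
import numpy as np
import pandas as pd

# Assumption: the original used NumPy's standard-normal generator.
s = pd.Series(np.random.randn(10))  # no index given, so labels default to 0..9
print(s)
```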
DataFrame
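The most intuitive way to understand a DataFrame is to think of it as an Excel sheet or a two-dimensional database table: an abstraction of real-world two-dimensional data.

    # Convert a dict to a DataFrame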
| | state | year | pop |
|---|---|---|---|
| 0 | Ohio | 2000 | 1.5 |
| 1 | Ohio | 2001 | 1.7 |
| 2 | Ohio | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |
| 5 | Nevada | 2003 | 3.2 |
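
The dict itself is not shown, but the table makes its contents clear. A sketch of the conversion, with the values copied from the table and the variable names `data` and `frame` chosen here for illustration:

```python
import pandas as pd

# Values copied from the table above; the names `data` and `frame` are assumptions.
data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}
frame = pd.DataFrame(data)  # each dict key becomes a column
print(frame)
```

Each list must have the same length, since every list fills one column.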
    # Generate a 3x5 normally distributed DataFrame
| | A | B | C | D | E |
|---|---|---|---|---|---|
| 2019 | 0.625769 | -0.359062 | -1.444692 | -0.908577 | 0.968998 |
| 2020 | -0.472260 | -0.275508 | -1.666659 | -1.202708 | -1.542985 |
| 2021 | -0.866006 | -0.203043 | 0.436053 | -0.703099 | 1.131525 |
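
A DataFrame of this shape could be built roughly as follows. This is a sketch: the integer year labels and the use of `np.random.randn` are assumptions read off the table, and the random values will differ.

```python
import numpy as np
import pandas as pd

# Assumption: integer year labels and letter column names, as in the table above.
df = pd.DataFrame(
    np.random.randn(3, 5),
    index=[2019, 2020, 2021],
    columns=list("ABCDE"),
)
print(df)
```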
    print(df.dtypes)
    print(df.describe())
A float64
B float64
C float64
D float64
E float64
dtype: object
A B C D E
count 3.000000 3.000000 3.000000 3.000000 3.000000
mean -0.237499 -0.279205 -0.891766 -0.938128 0.185846
std 0.773099 0.078075 1.155268 0.251112 1.499415
min -0.866006 -0.359062 -1.666659 -1.202708 -1.542985
25% -0.669133 -0.317285 -1.555675 -1.055642 -0.286993
50% -0.472260 -0.275508 -1.444692 -0.908577 0.968998
75% 0.076755 -0.239276 -0.504320 -0.805838 1.050262
max 0.625769 -0.203043 0.436053 -0.703099 1.131525
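> Note: the column returned by indexing a DataFrame is a view (a slice) of the underlying data, not a newly created copy. Any in-place change to that Series is therefore reflected in the DataFrame (at least in the pandas version used here), unless we create an independent one with the copy method.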
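
Whether an in-place change to a selected column propagates back depends on the pandas version and its Copy-on-Write setting, so the sketch below only demonstrates the part that holds everywhere: a Series obtained with `copy()` is independent of its DataFrame. The small frame and its names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]})

col = df["A"].copy()   # explicit copy: owns its own data
col.iloc[0] = 100.0    # change the copy only

print(col.iloc[0])     # value in the copy
print(df.loc[0, "A"])  # value in the DataFrame, still the original 1.0
```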
    # Modifying columns through a slice works fine
| | A | B | C | D | E |
|---|---|---|---|---|---|
| 2019 | 0.625769 | 0.0 | 0.0 | -0.908577 | 0.968998 |
| 2020 | -0.472260 | 0.0 | 0.0 | -1.202708 | -1.542985 |
| 2021 | -0.866006 | 0.0 | 0.0 | -0.703099 | 1.131525 |
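
The assignment that zeroed columns B and C is missing from the listing. One way to get that result, assuming the two columns were selected with a list of names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3, 5),
                  index=[2019, 2020, 2021], columns=list("ABCDE"))

# Assumption: the original selected the two columns with a list of names.
df[["B", "C"]] = 0.0
print(df)
```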
    # Modifying rows through a slice also works fine
| | A | B | C | D | E |
|---|---|---|---|---|---|
| 2019 | 0.625769 | 0.0 | 0.0 | -0.908577 | 0.968998 |
| 2020 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 2021 | -0.866006 | 0.0 | 0.0 | -0.703099 | 1.131525 |
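
The row assignment is also missing. A sketch that zeroes only the 2020 row, assuming a positional row slice; the original may have used a different slicing form:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3, 5),
                  index=[2019, 2020, 2021], columns=list("ABCDE"))

# Assumption: a positional row slice; 1:2 selects only the second row (2020).
df.iloc[1:2] = 0.0
print(df)
```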
    # Modifying specific rows and columns of a DataFrame through slices makes pandas warn and tell us to use loc instead
/home/yutianc/miniconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
| | A | B | C | D | E |
|---|---|---|---|---|---|
| 2019 | 0.625769 | 0.0 | 0.0 | -0.908577 | 0.968998 |
| 2020 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 2021 | -0.866006 | 0.0 | 0.0 | -0.703099 | 1.131525 |
    # pandas has presumably tightened its safety checks; it recommends assigning with loc rather than through a slice.
| | A | B | C | D | E |
|---|---|---|---|---|---|
| 2019 | 0.625769 | 0.0 | 0.0 | -0.908577 | 0.968998 |
| 2020 | 0.000000 | 2.0 | 2.0 | 0.000000 | 0.000000 |
| 2021 | -0.866006 | 0.0 | 0.0 | -0.703099 | 1.131525 |
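
The `loc` form that the warning recommends can be sketched like this. The frame is rebuilt with zeros for brevity, and the labels and the value 2.0 are taken from the table above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 5)),
                  index=[2019, 2020, 2021], columns=list("ABCDE"))

# One indexing call selects the row label and the column labels together,
# so the assignment writes to df itself rather than to a temporary object.
df.loc[2020, ["B", "C"]] = 2.0
print(df)
```

A chained form such as `df[...][...] = value` first builds an intermediate object and then assigns into it, which is the situation the warning describes.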
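Index Objects
Both Series and DataFrame contain Index objects. A Series can be seen as data with an index, while a DataFrame can be seen as a group of Series that share a common index.
Any array or other sequence of labels used to construct a Series or DataFrame is automatically converted to an Index: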
    obj = pd.Series(range(3), index=['a', 'b', 'c'])
Index(['a', 'b', 'c'], dtype='object')
Index(['b', 'c'], dtype='object')
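
Only the first line of this snippet survived. A sketch of the likely remainder, assuming the two outputs above come from printing the index and a slice of it (the name `index` matches its later use in this section):

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])
index = obj.index      # the list of labels was converted to an Index

print(index)
print(index[1:])       # slicing an Index returns another Index
```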
    labels = pd.Index(np.arange(3))
0 1.5
1 -2.5
2 0.0
dtype: float64
True
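
The rest of this snippet is missing. The output suggests a Series built on `labels`, followed by an identity check. A sketch under that assumption, with `obj2` as an illustrative name; whether the identity check prints True may depend on the pandas version:

```python
import numpy as np
import pandas as pd

labels = pd.Index(np.arange(3))
# Assumption: values taken from the output above; `obj2` is an illustrative name.
obj2 = pd.Series([1.5, -2.5, 0], index=labels)

print(obj2)
print(obj2.index is labels)  # identity check: is the very same Index object reused?
```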
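Index objects are immutable, which is what makes it safe to share a single Index object among data structures, as above. Trying to modify one raises a TypeError: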
    index[1] = 'B'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-93-717d70dc796b> in <module>
----> 1 index[1] = 'B'
~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
2063
2064 def __setitem__(self, key, value):
-> 2065 raise TypeError("Index does not support mutable operations")
2066
2067 def __getitem__(self, key):
TypeError: Index does not support mutable operations
Besides behaving like an array, an Index also behaves like a fixed-size set.
Unlike a Python set, a pandas Index can contain duplicate labels.
Essential Functionality
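
Both properties can be checked with a short sketch. The frame contents and the label names are illustrative, not taken from the original notes:

```python
import pandas as pd

frame = pd.DataFrame({"Ohio": [1, 2], "Texas": [3, 4]})  # illustrative data

# Set-like membership tests against the column and row labels
print("Ohio" in frame.columns)
print(5 in frame.index)

# Unlike a set, an Index may hold repeated labels
dup_labels = pd.Index(["foo", "foo", "bar", "bar"])
print(dup_labels.is_unique)
```

Selecting with a duplicated label returns every matching entry rather than a single one.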
Reindexing
    # For Series
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
    obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
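
Put together, the Series example looks roughly like this. The construction of `obj` is reconstructed from the first output above, so treat it as an assumption:

```python
import pandas as pd

# Assumption: values and labels read off the first output above.
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

# reindex returns a new Series ordered by the requested labels;
# "e" has no value in obj, so it is filled with NaN.
obj2 = obj.reindex(["a", "b", "c", "d", "e"])
print(obj2)
```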
    # For DataFrame
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
| | Ohio | Texas | California |
|---|---|---|---|
| a | 0.0 | 1.0 | 2.0 |
| b | NaN | NaN | NaN |
| c | 3.0 | 4.0 | 5.0 |
| d | 6.0 | 7.0 | 8.0 |
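
The DataFrame case can be sketched the same way. The construction from `np.arange(9)` is an assumption based on the values 0 to 8 shown above, and `frame2` is an illustrative name:

```python
import numpy as np
import pandas as pd

# Assumption: the frame was built from np.arange(9), matching the values 0..8 above.
frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "c", "d"],
    columns=["Ohio", "Texas", "California"],
)

# Row "b" does not exist, so it appears as NaN and the columns become float64.
frame2 = frame.reindex(["a", "b", "c", "d"])
print(frame2)
```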
    pd.options.display.max_rows
60
Possible data inputs to DataFrame constructor
| Type | Notes |
|---|---|
| 2D ndarray | A matrix of data, passing optional row and column labels |
| dict of arrays, lists, or tuples | Each sequence becomes a column in the DataFrame; all sequences must be the same length |
| NumPy structured/record array | Treated as the “dict of arrays” case |
| dict of Series | Each value becomes a column; indexes from each Series are unioned together to form the result’s row index if no explicit index is passed |
| dict of dicts | Each inner dict becomes a column; keys are unioned to form the row index as in the “dict of Series” case |
| List of dicts or Series | Each item becomes a row in the DataFrame; union of dict keys or Series indexes become the DataFrame’s column labels |
| List of lists or tuples | Treated as the “2D ndarray” case |
| Another DataFrame | The DataFrame’s indexes are used unless different ones are passed |
| NumPy MaskedArray | Like the “2D ndarray” case except masked values become NA/missing in the DataFrame result |
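
Two rows of this table are easy to confuse, so here is a small sketch of the "dict of dicts" and "list of dicts" cases. All data and variable names are illustrative:

```python
import pandas as pd

# dict of dicts: outer keys become columns, inner keys become the row index
pop = {
    "Nevada": {2001: 2.4, 2002: 2.9},
    "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
}
from_nested = pd.DataFrame(pop)

# list of dicts: each item becomes a row, the union of keys becomes the columns
from_rows = pd.DataFrame([{"a": 1, "b": 2}, {"b": 3, "c": 4}])

print(from_nested)
print(from_rows)
```

In both cases, keys missing from one of the inputs show up as NaN in the result.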