01 Getting Started with Pandas

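Learning Resources

The most authoritative material for learning pandas is probably "Python for Data Analysis, 2nd Edition", open-sourced by the author of pandas himself.

There is a Chinese translation approved by the author, thanks to the experts who did the groundwork. For beginners, working through the book step by step, starting from the Python basics, really is a pleasant way to learn.

After going through the book and its translation, the official pandas online documentation will feel familiar: pandas official documentation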

Getting Started with pandas

The first step, of course, is to import pandas. Let's import numpy along with it.

import pandas as pd
import numpy as np

Introduction to pandas Data Structures

pandas has two core data structures: Series (one-dimensional) and DataFrame (two-dimensional).

Series


# Build a Series of 10 samples from the standard normal distribution, with index 0-9
obj = pd.Series(np.random.randn(10), index=np.arange(10))
print(obj)

0   -2.030423
1   -0.142240
2    0.467358
3   -1.106364
4    1.700141
5   -1.656620
6   -0.302091
7    0.205873
8   -0.769502
9   -0.022895
dtype: float64
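
To make the Series structure a little more concrete, here is a minimal sketch of how the values and the index can be inspected separately. The seeded generator and the variable names (`rng`, `third`, `negatives`) are assumptions added for this illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Seeded generator so repeated runs give the same numbers (an assumption for this sketch)
rng = np.random.default_rng(0)
obj = pd.Series(rng.standard_normal(10), index=np.arange(10))

values = obj.to_numpy()       # the underlying NumPy array
labels = obj.index            # the Index object holding the labels 0..9
third = obj.loc[2]            # select one element by its label
negatives = obj[obj < 0]      # boolean filtering keeps the matching labels
```

Using `loc` for the single-element lookup keeps it clear that the selection is by label, even though the labels here happen to be integers.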

DataFrame

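The most intuitive way to think of a DataFrame is as an Excel sheet or a two-dimensional database table. It is an abstraction of the two-dimensional data we meet in practice.

# Convert a dict to a DataFrame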

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'], 
        'year': [2000, 2001, 2002, 2001, 2002, 2003], 
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

frame = pd.DataFrame(data)
frame

	state	year	pop
0	Ohio	2000	1.5
1	Ohio	2001	1.7
2	Ohio	2002	3.6
3	Nevada	2001	2.4
4	Nevada	2002	2.9
5	Nevada	2003	3.2
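
As a small follow-up to the dict example, the sketch below shows a few things one would typically do next: fix the column order, pull out one column, and filter rows. The names `states`, `first_rows`, and `nevada` are hypothetical and only used for this illustration.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

# columns= fixes the column order of the resulting frame
frame = pd.DataFrame(data, columns=['year', 'state', 'pop'])

states = frame['state']                       # a single column comes back as a Series
first_rows = frame.head(3)                    # the first three rows
nevada = frame[frame['state'] == 'Nevada']    # boolean row filter
```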


# Build a 3x5 DataFrame of samples from the standard normal distribution
df = pd.DataFrame(np.random.randn(3,5), index=['2019','2020','2021'], columns=['A','B','C','D','E'])
df

	A	B	C	D	E
2019	0.625769	-0.359062	-1.444692	-0.908577	0.968998
2020	-0.472260	-0.275508	-1.666659	-1.202708	-1.542985
2021	-0.866006	-0.203043	0.436053	-0.703099	1.131525

print(df.dtypes)
print(df.describe())

A    float64
B    float64
C    float64
D    float64
E    float64
dtype: object
              A         B         C         D         E
count  3.000000  3.000000  3.000000  3.000000  3.000000
mean  -0.237499 -0.279205 -0.891766 -0.938128  0.185846
std    0.773099  0.078075  1.155268  0.251112  1.499415
min   -0.866006 -0.359062 -1.666659 -1.202708 -1.542985
25%   -0.669133 -0.317285 -1.555675 -1.055642 -0.286993
50%   -0.472260 -0.275508 -1.444692 -0.908577  0.968998
75%    0.076755 -0.239276 -0.504320 -0.805838  1.050262
max    0.625769 -0.203043  0.436053 -0.703099  1.131525

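> Note: selecting a column returns a view (a slice) of the DataFrame's data, not a new copy, so in-place changes to that Series show up in the DataFrame, unless we create an independent one with the copy method. This describes the pandas version used here; in newer versions with Copy-on-Write enabled, a selected column behaves like a copy.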
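
A minimal sketch of the copy method mentioned in the note, assuming a small all-zero frame built just for this illustration (`col_copy` is a hypothetical name). Whatever the pandas version does for plain column selection, an explicit copy should always be independent of the frame it came from.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 2)),
                  index=['2019', '2020', '2021'],
                  columns=['A', 'B'])

# An explicit copy owns its own data
col_copy = df['A'].copy()
col_copy.loc['2020'] = 99.0

# The original frame is untouched by the change made to the copy
original_value = df.loc['2020', 'A']
```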


# Assigning to a selection of columns works fine
df[['B','C']] = 0
df

	A	B	C	D	E
2019	0.625769	0.0	0.0	-0.908577	0.968998
2020	-0.472260	0.0	0.0	-1.202708	-1.542985
2021	-0.866006	0.0	0.0	-0.703099	1.131525


# Assigning to a row slice also works fine
df[1:2]=0
df

	A	B	C	D	E
2019	0.625769	0.0	0.0	-0.908577	0.968998
2020	0.000000	0.0	0.0	0.000000	0.000000
2021	-0.866006	0.0	0.0	-0.703099	1.131525


# Setting specific rows and columns through chained indexing makes pandas warn and suggest loc instead
df[['B','C']][1:2]=1
df

/home/yutianc/miniconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

	A	B	C	D	E
2019	0.625769	0.0	0.0	-0.908577	0.968998
2020	0.000000	0.0	0.0	0.000000	0.000000
2021	-0.866006	0.0	0.0	-0.703099	1.131525

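# Chained indexing first builds df[['B','C']], which is a copy, and then assigns into that copy, so df itself is unchanged.
# A single loc call does the assignment on df directly, and the warning is gone.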

df.loc['2020',['B','C']] = 2
df

	A	B	C	D	E
2019	0.625769	0.0	0.0	-0.908577	0.968998
2020	0.000000	2.0	2.0	0.000000	0.000000
2021	-0.866006	0.0	0.0	-0.703099	1.131525
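
For comparison, here is a short sketch of the single-step assignment styles that avoid chained indexing: by label with `loc`, by position with `iloc`, and with a boolean mask. The all-zero frame is an assumption made so the results are easy to check; it is not the random frame used above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 5)),
                  index=['2019', '2020', '2021'],
                  columns=['A', 'B', 'C', 'D', 'E'])

df.loc['2020', ['B', 'C']] = 2       # by label: one row, two columns
df.iloc[0, 0] = 1                    # by position: first row, first column
df.loc[df['B'] == 2, 'E'] = -1       # boolean row mask plus a column label
```

Each statement is one indexing operation on `df` itself, so there is no intermediate object for the assignment to get lost in.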

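Index Objects

Both Series and DataFrame hold Index objects. A Series can be seen as data with an index, and a DataFrame as a group of Series that share one index.

Any array or other sequence of labels used to construct a Series or DataFrame is converted to an Index automatically: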
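
The "group of Series sharing one index" picture can be checked directly. The sketch below uses a small hypothetical frame (`col_a`, `col_b`, and `same_labels` are names made up for this illustration) and compares labels with `Index.equals` rather than object identity, since whether the very same Index object is reused may depend on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.0, 6.0]},
                  index=['x', 'y', 'z'])

col_a = df['A']     # each column is a Series
col_b = df['B']

# Every column carries the same row labels as the frame
same_labels = col_a.index.equals(df.index) and col_b.index.equals(df.index)

# Columns may still differ in dtype
dtypes = df.dtypes.astype(str).to_dict()
```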

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
print(index)
index[1:]

Index(['a', 'b', 'c'], dtype='object')
Index(['b', 'c'], dtype='object')

labels = pd.Index(np.arange(3))
labels
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
print(obj2)
obj2.index is labels

0    1.5
1   -2.5
2    0.0
dtype: float64

True

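Index objects are immutable.

Because they cannot be modified, sharing an Index object between data structures is safe (obj2.index is labels above returned True). Trying to change a label raises a TypeError:

index[1] = 'B'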

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-93-717d70dc796b> in <module>
----> 1 index[1] = 'B'


~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
   2063 
   2064     def __setitem__(self, key, value):
-> 2065         raise TypeError("Index does not support mutable operations")
   2066 
   2067     def __getitem__(self, key):


TypeError: Index does not support mutable operations

Besides behaving like an array, an Index also behaves like a fixed-size set.
Unlike a Python set, a pandas Index can contain duplicate labels.
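
A brief sketch of both points, with example labels chosen only for illustration: an Index that holds duplicates, and the set-style operations that an Index supports.

```python
import pandas as pd

# Duplicate labels are allowed
dup = pd.Index(['foo', 'foo', 'bar', 'bar'])
has_foo = 'foo' in dup          # membership test, like a set
unique = dup.is_unique          # False, because labels repeat

# Selecting a duplicated label returns every matching row
s = pd.Series(range(4), index=dup)
foo_rows = s.loc['foo']

# Set-like operations between two indexes
left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])
both = left.intersection(right)
either = left.union(right)
only_left = left.difference(right)
```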

Essential Functionality

Reindexing


# For Series
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
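
Missing labels do not have to become NaN. As a minimal sketch, reindex also accepts a `method` argument for filling (for example forward fill on ordered data) and a `fill_value` argument for a constant. The `obj3` series below is an example made up for this illustration.

```python
import pandas as pd

# Forward fill: each new label takes the last known value before it
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
filled = obj3.reindex(range(6), method='ffill')

# Constant fill: new labels get the given value instead of NaN
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
zero_filled = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
```

Forward fill needs the original index to be sorted, which is why the sketch uses the ordered labels 0, 2, 4.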

# For Dataframe
# Conform DataFrame to new index with optional filling logic, placing
# NA/NaN in locations having no value in the previous index. A new object
# is produced unless the new index is equivalent to the current one and
# copy=False

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
print(frame)
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

	Ohio	Texas	California
a	0.0	1.0	2.0
b	NaN	NaN	NaN
c	3.0	4.0	5.0
d	6.0	7.0	8.0
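
Note that the integers turned into floats in frame2: NaN is a floating-point value, so a column that gains a missing entry is upcast to float64. Reindexing also works on columns through the `columns` keyword. The sketch below assumes the same frame as above, and 'Utah' is a label chosen only to show what happens when a column does not exist yet.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

states = ['Texas', 'Utah', 'California']

# Reindex the columns: 'Utah' is new and becomes all NaN, 'Ohio' is dropped
by_columns = frame.reindex(columns=states)

# Rows and columns can be reindexed in one call
both = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)
```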

pd.options.display.max_rows
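
This option controls how many rows pandas prints before it truncates the output. As a hedged sketch, it can be read with `pd.get_option` and changed temporarily with `pd.option_context`. The 100-element series and the limit of 10 are assumptions picked for this illustration.

```python
import pandas as pd

current = pd.get_option('display.max_rows')    # the value currently in effect

long_series = pd.Series(range(100))

# Temporarily lower the limit; the old value is restored when the block exits
with pd.option_context('display.max_rows', 10):
    inside = pd.get_option('display.max_rows')
    short_repr = repr(long_series)             # truncated text representation

after = pd.get_option('display.max_rows')
```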