01 Getting Started with Pandas

NBviewer link

Learning Resources

The most authoritative resource for learning pandas is probably "Python for Data Analysis, 2nd Edition", written and openly published by the creator of pandas himself.

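There is a Chinese translation of the book that the author has approved, thanks to the experts who did the groundwork. For beginners, working through the book step by step, starting from the Python basics, is a genuinely good learning experience.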

After reading these two books, the official pandas online documentation will feel familiar. (pandas official documentation)

Getting Started with pandas

The first step, of course, is to import the pandas module. We import numpy along with it.

import pandas as pd
import numpy as np

Introduction to pandas Data Structures

The two core pandas data structures are Series (one-dimensional) and DataFrame (two-dimensional).

Series

# Create a Series of 10 values drawn from a standard normal distribution, with index 0-9
obj = pd.Series(np.random.randn(10),index=np.arange(10))
print(obj)
0   -2.030423
1   -0.142240
2    0.467358
3   -1.106364
4    1.700141
5   -1.656620
6   -0.302091
7    0.205873
8   -0.769502
9   -0.022895
dtype: float64
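
To make the Series structure a little more concrete, here is a minimal sketch. The labels and values are made up for illustration and are not part of the original notebook. It shows that a Series pairs an array of values with an index, and that elements can be selected by label.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration
s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

print(s.values)   # the underlying NumPy array
print(s.index)    # the labels attached to each value
print(s['a'])     # select a single value by label
print(s[s > 0])   # boolean filtering keeps the index labels

# A dict can also be turned into a Series; its keys become the index
s_from_dict = pd.Series({'Ohio': 35000, 'Texas': 71000})
print(s_from_dict)
```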

DataFrame

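The most intuitive way to understand a DataFrame is to think of it as an Excel sheet or a two-dimensional database table. It is an abstraction of real-world two-dimensional data.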

# Convert a dict to a DataFrame

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

frame = pd.DataFrame(data)
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
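
As a small follow-up sketch on the dict-to-DataFrame conversion: the constructor also accepts a `columns` argument that controls column order. The `debt` column and the shortened data are made up for illustration and do not appear in the original notebook.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Nevada'],
        'year': [2000, 2001, 2001],
        'pop': [1.5, 1.7, 2.4]}

# Pass columns to control the column order; a name missing from the dict
# ('debt' here, a hypothetical column) is filled with NaN
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'])
print(frame2)

# A single column comes back as a Series
print(frame2['state'])

# head(n) shows only the first n rows
print(frame2.head(2))
```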

# Create a 3x5 DataFrame of values drawn from a standard normal distribution
df = pd.DataFrame(np.random.randn(3,5), index=['2019','2020','2021'], columns=['A','B','C','D','E'])
df

             A         B         C         D         E
2019  0.625769 -0.359062 -1.444692 -0.908577  0.968998
2020 -0.472260 -0.275508 -1.666659 -1.202708 -1.542985
2021 -0.866006 -0.203043  0.436053 -0.703099  1.131525

print(df.dtypes)
print(df.describe())


A    float64
B    float64
C    float64
D    float64
E    float64
dtype: object
              A         B         C         D         E
count  3.000000  3.000000  3.000000  3.000000  3.000000
mean  -0.237499 -0.279205 -0.891766 -0.938128  0.185846
std    0.773099  0.078075  1.155268  0.251112  1.499415
min   -0.866006 -0.359062 -1.666659 -1.202708 -1.542985
25%   -0.669133 -0.317285 -1.555675 -1.055642 -0.286993
50%   -0.472260 -0.275508 -1.444692 -0.908577  0.968998
75%    0.076755 -0.239276 -0.504320 -0.805838  1.050262
max    0.625769 -0.203043  0.436053 -0.703099  1.131525

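> Note: in the pandas version used here, selecting a column returns a view (a slice) of the underlying data rather than a new copy. Any in-place change to that Series is therefore reflected in the DataFrame, unless we use the copy method to create an independent one.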
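
A minimal sketch of the copy case from the note above, using a made-up frame named `df_demo`. Only the explicit copy is demonstrated, because whether a plain `df_demo['A']` shares memory with the frame depends on the pandas version and its copy-on-write setting.

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for this illustration
df_demo = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=['A', 'B'])

# Take an explicit copy of column A, then modify the copy in place
col = df_demo['A'].copy()
col.iloc[:] = -1.0

print(col)       # the copy now holds -1.0 everywhere
print(df_demo)   # the original column A is untouched
```

Because `col` is an explicit copy, changes to it never reach `df_demo`, whichever pandas version is in use.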


# Assigning to a selection of columns works fine
df[['B','C']] = 0
df

             A    B    C         D         E
2019  0.625769  0.0  0.0 -0.908577  0.968998
2020 -0.472260  0.0  0.0 -1.202708 -1.542985
2021 -0.866006  0.0  0.0 -0.703099  1.131525


# Assigning to a row slice also works fine
df[1:2]=0
df

             A    B    C         D         E
2019  0.625769  0.0  0.0 -0.908577  0.968998
2020  0.000000  0.0  0.0  0.000000  0.000000
2021 -0.866006  0.0  0.0 -0.703099  1.131525



# Chaining two selections to modify specific rows and columns makes pandas
# emit a warning that tells us to use loc instead
df[['B','C']][1:2]=1
df

/home/yutianc/miniconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

             A    B    C         D         E
2019  0.625769  0.0  0.0 -0.908577  0.968998
2020  0.000000  0.0  0.0  0.000000  0.000000
2021 -0.866006  0.0  0.0 -0.703099  1.131525
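
# The chained assignment above wrote to a temporary copy, which is why df did not change.
# With loc the assignment is a single indexing operation, so it takes effect and the warning is gone.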

df.loc['2020',['B','C']] = 2
df

             A    B    C         D         E
2019  0.625769  0.0  0.0 -0.908577  0.968998
2020  0.000000  2.0  2.0  0.000000  0.000000
2021 -0.866006  0.0  0.0 -0.703099  1.131525
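
The assignment patterns above can be summarized in a short sketch. The frame `df_demo` is a made-up stand-in for `df`. It contrasts label-based `loc`, position-based `iloc`, and a boolean mask passed to `loc`; each performs the assignment in a single indexing step.

```python
import numpy as np
import pandas as pd

# Hypothetical frame of zeros with the same row labels as the example above
df_demo = pd.DataFrame(np.zeros((3, 3)),
                       index=['2019', '2020', '2021'],
                       columns=['A', 'B', 'C'])

# Label-based: row '2020', columns 'B' and 'C'
df_demo.loc['2020', ['B', 'C']] = 2

# Position-based: first row, last column
df_demo.iloc[0, 2] = 9

# Boolean mask with loc: set A to 1 on the rows where B equals 2
df_demo.loc[df_demo['B'] == 2, 'A'] = 1

print(df_demo)
```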

Index Objects

Both Series and DataFrame contain an Index object. A Series can be seen as data with an index, while a DataFrame can be seen as a group of Series that share a common index.

Any array or other sequence of labels used to construct a Series or DataFrame is automatically converted to an Index:

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
print(index)
index[1:]
Index(['a', 'b', 'c'], dtype='object')
Index(['b', 'c'], dtype='object')

labels = pd.Index(np.arange(3))
labels
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
print(obj2)
obj2.index is labels
0    1.5
1   -2.5
2    0.0
dtype: float64

True

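Index objects are immutable.

Because they cannot be modified, it is safe to share an Index object among data structures. Trying to assign to one of its elements raises a TypeError: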

index[1] = 'B'
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-93-717d70dc796b> in <module>
----> 1 index[1] = 'B'


~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
   2063 
   2064     def __setitem__(self, key, value):
-> 2065         raise TypeError("Index does not support mutable operations")
   2066 
   2067     def __getitem__(self, key):


TypeError: Index does not support mutable operations

Besides being array-like, an Index also behaves like a fixed-size set.
Unlike a Python set, a pandas Index can contain duplicate labels.
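
The set-like behavior and the duplicate labels mentioned above can be sketched as follows. The label values are made up for illustration.

```python
import pandas as pd

columns = pd.Index(['Ohio', 'Texas', 'California'])

# Membership tests work like they do on a set
print('Ohio' in columns)      # True
print('Utah' in columns)      # False

# Set-style operations return a new Index
other = pd.Index(['Texas', 'Utah'])
print(columns.intersection(other))
print(columns.union(other))

# Unlike a set, duplicate labels are allowed
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
print(dup_labels.is_unique)   # False

# Selecting a duplicated label returns every matching entry
s = pd.Series([1, 2, 3, 4], index=dup_labels)
print(s['foo'])
```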

Essential Functionality

Reindexing

# For Series
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
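
Labels that are missing from the original index do not have to end up as NaN. As a small sketch (the sample Series are made up for illustration), `reindex` accepts a `method` argument for filling from neighboring values and a `fill_value` argument for a constant:

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Forward-fill: each new label takes the last valid value before it
filled = obj3.reindex(range(6), method='ffill')
print(filled)

# Alternatively, use a constant for labels missing from the original index
obj4 = pd.Series([4.5, 7.2], index=['a', 'b'])
padded = obj4.reindex(['a', 'b', 'c'], fill_value=0.0)
print(padded)
```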

# For DataFrame
# Conform DataFrame to new index with optional filling logic, placing
# NA/NaN in locations having no value in the previous index. A new object
# is produced unless the new index is equivalent to the current one and
# copy=False

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
print(frame)
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0
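
Columns can be reindexed as well. The sketch below rebuilds the same `frame` and reindexes it with the `columns` keyword; the `Utah` label is made up to show how a missing column is handled.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# Reindex the columns; 'Utah' is not in the original, so it becomes NaN
states = ['Texas', 'Utah', 'California']
by_columns = frame.reindex(columns=states)
print(by_columns)

# Rows and columns can be reindexed in one call
both = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)
print(both)
```

The integer values show up as floats after reindexing because NaN cannot be stored in a NumPy integer column.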

pd.options.display.max_rows
60

Possible data inputs to DataFrame constructor
| Type | Notes |
|------|-------|
| 2D ndarray | A matrix of data, passing optional row and column labels |
| dict of arrays, lists, or tuples | Each sequence becomes a column in the DataFrame; all sequences must be the same length |
| NumPy structured/record array | Treated as the “dict of arrays” case |
| dict of Series | Each value becomes a column; indexes from each Series are unioned together to form the result’s row index if no explicit index is passed |
| dict of dicts | Each inner dict becomes a column; keys are unioned to form the row index as in the “dict of Series” case |
| List of dicts or Series | Each item becomes a row in the DataFrame; union of dict keys or Series indexes become the DataFrame’s column labels |
| List of lists or tuples | Treated as the “2D ndarray” case |
| Another DataFrame | The DataFrame’s indexes are used unless different ones are passed |
| NumPy MaskedArray | Like the “2D ndarray” case except masked values become NA/missing in the DataFrame result |
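
A few rows of the table can be tried directly. This is a minimal sketch with made-up labels and values, covering the 2D ndarray, dict of Series, dict of dicts, and list of dicts cases:

```python
import numpy as np
import pandas as pd

# 2D ndarray with optional row and column labels
from_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                          index=['r1', 'r2'], columns=['x', 'y', 'z'])

# dict of Series: the Series indexes are unioned to form the row index
from_series = pd.DataFrame({'a': pd.Series([1, 2], index=['i', 'j']),
                            'b': pd.Series([3, 4], index=['j', 'k'])})

# dict of dicts: outer keys become columns, inner keys become the row index
from_dicts = pd.DataFrame({'Nevada': {2001: 2.4, 2002: 2.9},
                           'Ohio': {2000: 1.5, 2001: 1.7}})

# list of dicts: each item becomes a row, the keys become column labels
from_rows = pd.DataFrame([{'x': 1, 'y': 2}, {'x': 3, 'z': 4}])

for df_example in (from_array, from_series, from_dicts, from_rows):
    print(df_example, end='\n\n')
```

Wherever a label exists in one input but not in another, the corresponding cell is filled with NaN.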