A 10-Minute Quick Tour of Common Pandas Operations
Original | https://pandas.pydata.org/pandas-docs/version/0.18.0/
Compiled by | 刘早起 (abridged and edited)
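Contents
- Creating data
- Viewing data
- Selecting data
- Selecting with []
- Selecting by label
- Selecting by position
- Boolean indexing
- Modifying data
- Handling missing values
- reindex
- Dropping missing values
- Filling missing values
- Common operations
- Statistics
- The apply function
- value_counts()
- String methods
- Merging data
- Concat
- Join
- Append
- Grouping
- Reshaping
- Stacking
- Pivot tables
- Time series
- Working flexibly with categorical data
- Data visualization
- Importing and exporting data
- Getting help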
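First, import the three libraries most commonly used for data processing in Python.
If any of them is missing, run the matching cell below to install it.
# install pandas
!pip install pandas
# install numpy
!pip install numpy
# install matplotlib
!pip install matplotlib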
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Creating data
Create a Series object with pd.Series
s = pd.Series([1,3,5,np.nan,6,8])
s
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
Create a DataFrame object from a NumPy array
dates = pd.date_range('20130101', periods=6)
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | -0.469364 | -1.389291 | 0.844032 | 0.042866 |
| 2013-01-02 | 0.986576 | -0.689543 | -0.383265 | -1.104932 |
| 2013-01-03 | -0.192426 | 1.740765 | 0.730479 | -1.320781 |
| 2013-01-04 | 0.047348 | -1.952303 | -0.691544 | -1.403883 |
| 2013-01-05 | 0.233021 | 0.619112 | 0.628579 | -0.802585 |
| 2013-01-06 | 0.493946 | 0.848247 | 1.633055 | -0.740562 |
Create a DataFrame object from a dict
df2 = pd.DataFrame({ 'A' : 1.,
'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.array([3] * 4,dtype='int32'),
'E' : pd.Categorical(["test","train","test","train"]),
'F' : 'foo' })
df2
|  | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
df2.dtypes
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
dir(df2)
['A',
'B',
'C',
'D',
'E',
'F',
'T',
'_AXIS_ALIASES',
'_AXIS_IALIASES',
'_AXIS_LEN',
'_AXIS_NAMES',
'_AXIS_NUMBERS',
'_AXIS_ORDERS',
'_AXIS_REVERSED',
······
'unstack',
'update',
'values',
'var',
'where',
'xs']
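Viewing data
These basic methods are essential. For more ways to inspect data, see the official documentation[1].
The methods below show the top and bottom rows of the data.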
df.head()
|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | -0.469364 | -1.389291 | 0.844032 | 0.042866 |
| 2013-01-02 | 0.986576 | -0.689543 | -0.383265 | -1.104932 |
| 2013-01-03 | -0.192426 | 1.740765 | 0.730479 | -1.320781 |
| 2013-01-04 | 0.047348 | -1.952303 | -0.691544 | -1.403883 |
| 2013-01-05 | 0.233021 | 0.619112 | 0.628579 | -0.802585 |
df.tail(3)
|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-04 | 0.047348 | -1.952303 | -0.691544 | -1.403883 |
| 2013-01-05 | 0.233021 | 0.619112 | 0.628579 | -0.802585 |
| 2013-01-06 | 0.493946 | 0.848247 | 1.633055 | -0.740562 |
Inspect the DataFrame's index, column names and underlying data
df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
df.values
array([[-0.46936354, -1.38929068, 0.84403157, 0.04286594],
[ 0.98657633, -0.68954348, -0.38326456, -1.10493201],
[-0.19242554, 1.74076522, 0.73047859, -1.32078058],
[ 0.04734752, -1.95230265, -0.6915437 , -1.40388308],
[ 0.23302102, 0.61911183, 0.628579 , -0.80258543],
[ 0.49394583, 0.84824737, 1.633055 , -0.74056229]])
Descriptive statistics
df.describe()
|  | A | B | C | D |
|---|---|---|---|---|
| count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
| mean | 0.183184 | -0.137169 | 0.460223 | -0.888313 |
| std | 0.515722 | 1.430893 | 0.855835 | 0.528401 |
| min | -0.469364 | -1.952303 | -0.691544 | -1.403883 |
| 25% | -0.132482 | -1.214354 | -0.130304 | -1.266818 |
| 50% | 0.140184 | -0.035216 | 0.679529 | -0.953759 |
| 75% | 0.428715 | 0.790963 | 0.815643 | -0.756068 |
| max | 0.986576 | 1.740765 | 1.633055 | 0.042866 |
Transposing the data
df.T
|  | 2013-01-01 00:00:00 | 2013-01-02 00:00:00 | 2013-01-03 00:00:00 | 2013-01-04 00:00:00 | 2013-01-05 00:00:00 | 2013-01-06 00:00:00 |
|---|---|---|---|---|---|---|
| A | -0.469364 | 0.986576 | -0.192426 | 0.047348 | 0.233021 | 0.493946 |
| B | -1.389291 | -0.689543 | 1.740765 | -1.952303 | 0.619112 | 0.848247 |
| C | 0.844032 | -0.383265 | 0.730479 | -0.691544 | 0.628579 | 1.633055 |
| D | 0.042866 | -1.104932 | -1.320781 | -1.403883 | -0.802585 | -0.740562 |
Sort by column name
df.sort_index(axis=1, ascending=False)
|  | D | C | B | A |
|---|---|---|---|---|
| 2013-01-01 | 0.042866 | 0.844032 | -1.389291 | -0.469364 |
| 2013-01-02 | -1.104932 | -0.383265 | -0.689543 | 0.986576 |
| 2013-01-03 | -1.320781 | 0.730479 | 1.740765 | -0.192426 |
| 2013-01-04 | -1.403883 | -0.691544 | -1.952303 | 0.047348 |
| 2013-01-05 | -0.802585 | 0.628579 | 0.619112 | 0.233021 |
| 2013-01-06 | -0.740562 | 1.633055 | 0.848247 | 0.493946 |
Sort by the values in column B
df.sort_values(by='B')
|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-04 | 0.047348 | -1.952303 | -0.691544 | -1.403883 |
| 2013-01-01 | -0.469364 | -1.389291 | 0.844032 | 0.042866 |
| 2013-01-02 | 0.986576 | -0.689543 | -0.383265 | -1.104932 |
| 2013-01-05 | 0.233021 | 0.619112 | 0.628579 | -0.802585 |
| 2013-01-06 | 0.493946 | 0.848247 | 1.633055 | -0.740562 |
| 2013-01-03 | -0.192426 | 1.740765 | 0.730479 | -1.320781 |
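Selecting data
The official documentation recommends the optimized pandas data access methods .at, .iat, .loc and .iloc. Some older pandas versions also supported .ix, which was removed in pandas 1.0.
These selection methods are worth mastering; I have also written related articles to help explain them.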
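As a rough illustration of how the four accessors differ, here is a minimal sketch on a small hand-made frame. The name `demo` and its values are assumptions made up for this example, not data from the article.

```python
import pandas as pd

# Small deterministic frame; names and values are illustrative only.
idx = pd.date_range("2013-01-01", periods=4)
demo = pd.DataFrame({"A": [10, 20, 30, 40], "B": [1.5, 2.5, 3.5, 4.5]}, index=idx)

# .loc selects by label; a label slice includes both endpoints.
by_label = demo.loc["2013-01-02":"2013-01-03", "A"]

# .iloc selects by integer position; the end of the slice is excluded.
by_position = demo.iloc[1:3, 0]

# .at and .iat read a single scalar by label and by position respectively.
scalar_label = demo.at[idx[0], "A"]
scalar_position = demo.iat[0, 0]

print(by_label.tolist(), by_position.tolist(), scalar_label, scalar_position)
```

Both slices pick the same two rows here, which makes the inclusive/exclusive difference between label and position slicing easy to see.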
Selecting with []
Select a single column, equivalent to df.A:
df['A']
2013-01-01 -0.469364
2013-01-02 0.986576
2013-01-03 -0.192426
2013-01-04 0.047348
2013-01-05 0.233021
2013-01-06 0.493946
Freq: D, Name: A, dtype: float64
Select rows by slicing with []
df[0:3]
|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | -0.469364 | -1.389291 | 0.844032 | 0.042866 |
| 2013-01-02 | 0.986576 | -0.689543 | -0.383265 | -1.104932 |
| 2013-01-03 | -0.192426 | 1.740765 | 0.730479 | -1.320781 |
df['20130102':'20130104']
|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-02 | 0.986576 | -0.689543 | -0.383265 | -1.104932 |
| 2013-01-03 | -0.192426 | 1.740765 | 0.730479 | -1.320781 |
| 2013-01-04 | 0.047348 | -1.952303 | -0.691544 | -1.403883 |
Selecting by label
df.loc[dates[0]]
A -0.469364
B -1.389291
C 0.844032
D 0.042866
Name: 2013-01-01 00:00:00, dtype: float64
df.loc[:,['A','B']]
|  | A | B |
|---|---|---|
| 2013-01-01 | -0.469364 | -1.389291 |
| 2013-01-02 | 0.986576 | -0.689543 |
| 2013-01-03 | -0.192426 | 1.740765 |
| 2013-01-04 | 0.047348 | -1.952303 |
| 2013-01-05 | 0.233021 | 0.619112 |
| 2013-01-06 | 0.493946 | 0.848247 |
df.loc['20130102':'20130104',['A','B']]
|  | A | B |
|---|---|---|
| 2013-01-02 | 0.986576 | -0.689543 |
| 2013-01-03 | -0.192426 | 1.740765 |
| 2013-01-04 | 0.047348 | -1.952303 |
df.loc['20130102',['A','B']]
A 0.986576
B -0.689543
Name: 2013-01-02 00:00:00, dtype: float64
df.loc[dates[0],'A']
-0.46936353804430075
df.at[dates[0],'A']
-0.46936353804430075
Selecting by position
df.iloc[3]
A 0.047348
B -1.952303
C -0.691544
D -1.403883
Name: 2013-01-04 00:00:00, dtype: float64
df.iloc[3:5, 0:2]
|  | A | B |
|---|---|---|
| 2013-01-04 | 0.047348 | -1.952303 |
| 2013-01-05 | 0.233021 | 0.619112 |
df.iloc[[1,2,4],[0,2]]
|  | A | C |
|---|---|---|
| 2013-01-02 | 0.986576 | -0.383265 |
| 2013-01-03 | -0.192426 | 0.730479 |
| 2013-01-05 | 0.233021 | 0.628579 |
df.iloc[1:3]
|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-02 | 0.986576 | -0.689543 | -0.383265 | -1.104932 |
| 2013-01-03 | -0.192426 | 1.740765 | 0.730479 | -1.320781 |
df.iloc[:, 1:3]
|  | B | C |
|---|---|---|
| 2013-01-01 | -1.389291 | 0.844032 |
| 2013-01-02 | -0.689543 | -0.383265 |
| 2013-01-03 | 1.740765 | 0.730479 |
| 2013-01-04 | -1.952303 | -0.691544 |
| 2013-01-05 | 0.619112 | 0.628579 |
| 2013-01-06 | 0.848247 | 1.633055 |
df.iloc[1, 1]
-0.689543482094678
df.iat[1, 1]
-0.689543482094678
Boolean indexing
df[df.A>0]
|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-02 | 0.986576 | -0.689543 | -0.383265 | -1.104932 |
| 2013-01-04 | 0.047348 | -1.952303 | -0.691544 | -1.403883 |
| 2013-01-05 | 0.233021 | 0.619112 | 0.628579 | -0.802585 |
| 2013-01-06 | 0.493946 | 0.848247 | 1.633055 | -0.740562 |
df[df>0]
|  | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | NaN | NaN | 0.844032 | 0.042866 |
| 2013-01-02 | 0.986576 | NaN | NaN | NaN |
| 2013-01-03 | NaN | 1.740765 | 0.730479 | NaN |
| 2013-01-04 | 0.047348 | NaN | NaN | NaN |
| 2013-01-05 | 0.233021 | 0.619112 | 0.628579 | NaN |
| 2013-01-06 | 0.493946 | 0.848247 | 1.633055 | NaN |
df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']
df2
|  | A | B | C | D | E |
|---|---|---|---|---|---|
| 2013-01-01 | -0.469364 | -1.389291 | 0.844032 | 0.042866 | one |
| 2013-01-02 | 0.986576 | -0.689543 | -0.383265 | -1.104932 | one |
| 2013-01-03 | -0.192426 | 1.740765 | 0.730479 | -1.320781 | two |
| 2013-01-04 | 0.047348 | -1.952303 | -0.691544 | -1.403883 | three |
| 2013-01-05 | 0.233021 | 0.619112 | 0.628579 | -0.802585 | four |
| 2013-01-06 | 0.493946 | 0.848247 | 1.633055 | -0.740562 | three |
df2[df2['E'].isin(['two','four'])]
|  | A | B | C | D | E |
|---|---|---|---|---|---|
| 2013-01-03 | -0.192426 | 1.740765 | 0.730479 | -1.320781 | two |
| 2013-01-05 | 0.233021 | 0.619112 | 0.628579 | -0.802585 | four |
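The boolean-indexing patterns above can be sketched on fixed data so the result is reproducible. The frame `demo` and its values are assumptions for illustration only.

```python
import pandas as pd

# Illustrative data; not taken from the article's random frame.
demo = pd.DataFrame({"A": [-1.0, 2.0, -3.0, 4.0],
                     "E": ["one", "two", "three", "four"]})

# A boolean Series keeps only the rows where the condition holds.
positive_rows = demo[demo["A"] > 0]

# A boolean DataFrame keeps the shape and puts NaN where the condition fails.
numeric = demo[["A"]]
masked = numeric[numeric > 0]

# Conditions combine with & and |; each one needs its own parentheses.
combined = demo[(demo["A"] > 0) & demo["E"].isin(["two", "three"])]

print(positive_rows)
print(masked)
print(combined)
```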
Modifying data
Add a new column; the data is aligned automatically by index
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
s1
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
df['F'] = s1
df.at[dates[0], 'A'] = 0
df.iat[0, 1] = 0
df.loc[:, 'D'] = np.array([5] * len(df))
df
|  | A | B | C | D | F |
|---|---|---|---|---|---|
| 2013-01-01 | 0.000000 | 0.000000 | 0.844032 | 5 | NaN |
| 2013-01-02 | 0.986576 | -0.689543 | -0.383265 | 5 | 1.0 |
| 2013-01-03 | -0.192426 | 1.740765 | 0.730479 | 5 | 2.0 |
| 2013-01-04 | 0.047348 | -1.952303 | -0.691544 | 5 | 3.0 |
| 2013-01-05 | 0.233021 | 0.619112 | 0.628579 | 5 | 4.0 |
| 2013-01-06 | 0.493946 | 0.848247 | 1.633055 | 5 | 5.0 |
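To make the index alignment explicit, the sketch below assigns a Series whose index is offset by one day. The names `demo` and `shifted` are hypothetical and the values are made up.

```python
import pandas as pd

idx = pd.date_range("2013-01-01", periods=3)
demo = pd.DataFrame({"A": [1.0, 2.0, 3.0]}, index=idx)

# The new Series starts one day later than the frame's index.
shifted = pd.Series([10, 20, 30], index=pd.date_range("2013-01-02", periods=3))

# Assignment aligns on index labels: unmatched rows become NaN,
# and labels missing from the frame (2013-01-04) are dropped.
demo["F"] = shifted

print(demo)
```

This is why the first row of column F in the table above is NaN: the Series has no value for 2013-01-01.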
df2 = df.copy()
df2[df2 > 0] = -df2
df2
|  | A | B | C | D | F |
|---|---|---|---|---|---|
| 2013-01-01 | 0.000000 | 0.000000 | -0.844032 | -5 | NaN |
| 2013-01-02 | -0.986576 | -0.689543 | -0.383265 | -5 | -1.0 |
| 2013-01-03 | -0.192426 | -1.740765 | -0.730479 | -5 | -2.0 |
| 2013-01-04 | -0.047348 | -1.952303 | -0.691544 | -5 | -3.0 |
| 2013-01-05 | -0.233021 | -0.619112 | -0.628579 | -5 | -4.0 |
| 2013-01-06 | -0.493946 | -0.848247 | -1.633055 | -5 | -5.0 |
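Handling missing values
Handling missing values is part of everyday data processing in pandas; only some of the operations are shown below.
For more on handling missing values, see the following two articles:
reindex
pandas uses np.nan to represent missing values. reindex lets you change, add or delete the index on a specified axis.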
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1
|  | A | B | C | D | F | E |
|---|---|---|---|---|---|---|
| 2013-01-01 | 0.000000 | 0.000000 | 0.844032 | 5 | NaN | 1.0 |
| 2013-01-02 | 0.986576 | -0.689543 | -0.383265 | 5 | 1.0 | 1.0 |
| 2013-01-03 | -0.192426 | 1.740765 | 0.730479 | 5 | 2.0 | NaN |
| 2013-01-04 | 0.047348 | -1.952303 | -0.691544 | 5 | 3.0 | NaN |
Dropping missing values
Drop every row that contains NaN
df1.dropna(how='any')
|  | A | B | C | D | F | E |
|---|---|---|---|---|---|---|
| 2013-01-02 | 0.986576 | -0.689543 | -0.383265 | 5 | 1.0 | 1.0 |
Filling missing values
Fill in the missing data
df1.fillna(value=5)
|  | A | B | C | D | F | E |
|---|---|---|---|---|---|---|
| 2013-01-01 | 0.000000 | 0.000000 | 0.844032 | 5 | 5.0 | 1.0 |
| 2013-01-02 | 0.986576 | -0.689543 | -0.383265 | 5 | 1.0 | 1.0 |
| 2013-01-03 | -0.192426 | 1.740765 | 0.730479 | 5 | 2.0 | 5.0 |
| 2013-01-04 | 0.047348 | -1.952303 | -0.691544 | 5 | 3.0 | 5.0 |
pd.isnull(df1)
|  | A | B | C | D | F | E |
|---|---|---|---|---|---|---|
| 2013-01-01 | False | False | False | False | True | False |
| 2013-01-02 | False | False | False | False | False | False |
| 2013-01-03 | False | False | False | False | False | True |
| 2013-01-04 | False | False | False | False | False | True |
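The three missing-value operations can be compared side by side on a tiny frame. This is a minimal sketch with made-up data; `demo` is a hypothetical name.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"x": [1.0, np.nan, 3.0],
                     "y": [np.nan, np.nan, 6.0]})

# how="any" drops a row if at least one value is missing.
any_dropped = demo.dropna(how="any")

# how="all" drops a row only if every value is missing.
all_dropped = demo.dropna(how="all")

# fillna returns a new frame with the gaps replaced.
filled = demo.fillna(value=0)

# isna().sum() counts the missing values in each column.
missing_per_column = demo.isna().sum()

print(any_dropped, all_dropped, filled, missing_per_column, sep="\n\n")
```

None of these calls modifies `demo` in place; each one returns a new object.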
Common operations
Statistics
Statistical operations generally exclude missing values.
Descriptive statistics
Mean of each column
df.mean()
A 0.261411
B 0.094380
C 0.460223
D 5.000000
F 3.000000
dtype: float64
Mean of each row
df.mean(1)
2013-01-01 1.461008
2013-01-02 1.182754
2013-01-03 1.855764
2013-01-04 1.080700
2013-01-05 2.096142
2013-01-06 2.595050
Freq: D, dtype: float64
s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
s
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
df.sub(s, axis='index')
|  | A | B | C | D | F |
|---|---|---|---|---|---|
| 2013-01-01 | NaN | NaN | NaN | NaN | NaN |
| 2013-01-02 | NaN | NaN | NaN | NaN | NaN |
| 2013-01-03 | -1.192426 | 0.740765 | -0.269521 | 4.0 | 1.0 |
| 2013-01-04 | -2.952652 | -4.952303 | -3.691544 | 2.0 | 0.0 |
| 2013-01-05 | -4.766979 | -4.380888 | -4.371421 | 0.0 | -1.0 |
| 2013-01-06 | NaN | NaN | NaN | NaN | NaN |
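The following sketch shows how missing values affect these statistics and how `sub` broadcasts a Series down the rows. The data and the names `demo` and `offsets` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, 6.0]})

# By default NaN is skipped, so column A averages only 1.0 and 3.0.
col_mean = demo.mean()

# axis=1 averages across the columns of each row.
row_mean = demo.mean(axis=1)

# skipna=False lets a single NaN propagate into the result.
strict_mean = demo.mean(skipna=False)

# sub(..., axis="index") aligns the Series with the row labels and
# subtracts it from every column; a NaN offset gives a NaN row.
offsets = pd.Series([1.0, np.nan, 2.0])
diff = demo.sub(offsets, axis="index")

print(col_mean, row_mean, strict_mean, diff, sep="\n\n")
```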
The apply function
df.apply(np.cumsum)
|  | A | B | C | D | F |
|---|---|---|---|---|---|
| 2013-01-01 | 0.000000 | 0.000000 | 0.844032 | 5 | NaN |
| 2013-01-02 | 0.986576 | -0.689543 | 0.460767 | 10 | 1.0 |
| 2013-01-03 | 0.794151 | 1.051222 | 1.191246 | 15 | 3.0 |
| 2013-01-04 | 0.841498 | -0.901081 | 0.499702 | 20 | 6.0 |
| 2013-01-05 | 1.074519 | -0.281969 | 1.128281 | 25 | 10.0 |
| 2013-01-06 | 1.568465 | 0.566278 | 2.761336 | 30 | 15.0 |
df.apply(lambda x: x.max() - x.min())
A 1.179002
B 3.693068
C 2.324599
D 0.000000
F 4.000000
dtype: float64
value_counts()
The documentation calls this section Histogramming, but the example simply demonstrates .value_counts().
s = pd.Series(np.random.randint(0, 7, size=10))
s
0 6
1 1
2 4
3 6
4 3
5 2
6 3
7 5
8 2
9 2
dtype: int64
s.value_counts()
2 3
6 2
3 2
5 1
4 1
1 1
dtype: int64
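Beyond raw counts, `value_counts` can also return proportions. A minimal sketch reusing the ten values printed above as a fixed list:

```python
import pandas as pd

# The same ten values as the random draw above, written out explicitly.
s = pd.Series([6, 1, 4, 6, 3, 2, 3, 5, 2, 2])

# Counts per distinct value, most frequent first.
counts = s.value_counts()

# normalize=True returns each value's share of the total instead.
shares = s.value_counts(normalize=True)

print(counts)
print(shares)
```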
String methods
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
Merging data
Concat
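pandas provides various facilities for easily combining Series and DataFrame objects, with several kinds of set logic for the indexes and relational-algebra functionality for join/merge-type operations.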
df = pd.DataFrame(np.random.randn(10, 4))
df
|  | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.413620 | -1.114527 | 0.322678 | 1.207744 |
| 1 | -1.812499 | -1.338866 | 0.611622 | 0.445057 |
| 2 | 0.365098 | 0.177919 | 0.823212 | 1.529158 |
| 3 | -0.803774 | -1.422255 | 1.411392 | 0.400721 |
| 4 | 0.732753 | 1.413181 | -0.338617 | 0.088442 |
| 5 | -0.509033 | -1.237311 | 1.021978 | -0.596258 |
| 6 | 0.841053 | -0.404684 | 1.528639 | -0.273577 |
| 7 | 0.966884 | -2.142516 | 1.041670 | 0.109264 |
| 8 | 2.231267 | 2.011625 | 0.601062 | 0.533928 |
| 9 | -0.134641 | 0.165157 | -1.236827 | 1.681187 |
pieces = [df[:3], df[3:6], df[7:]]
pd.concat(pieces)
|  | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.413620 | -1.114527 | 0.322678 | 1.207744 |
| 1 | -1.812499 | -1.338866 | 0.611622 | 0.445057 |
| 2 | 0.365098 | 0.177919 | 0.823212 | 1.529158 |
| 3 | -0.803774 | -1.422255 | 1.411392 | 0.400721 |
| 4 | 0.732753 | 1.413181 | -0.338617 | 0.088442 |
| 5 | -0.509033 | -1.237311 | 1.021978 | -0.596258 |
| 7 | 0.966884 | -2.142516 | 1.041670 | 0.109264 |
| 8 | 2.231267 | 2.011625 | 0.601062 | 0.533928 |
| 9 | -0.134641 | 0.165157 | -1.236827 | 1.681187 |
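Note
Adding a column to a DataFrame is relatively fast.
Adding a row, however, requires a copy and can be expensive.
We recommend passing a pre-built list of records to the DataFrame constructor instead of building a DataFrame by iteratively appending records to it.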
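The recommendation in the note can be sketched as follows: collect plain Python dicts in a list, then build the DataFrame once. The field names `id` and `square` are hypothetical.

```python
import pandas as pd

# Accumulate rows as ordinary dicts; list.append is cheap.
records = []
for i in range(5):
    records.append({"id": i, "square": i * i})

# Build the DataFrame a single time from the finished list.
frame = pd.DataFrame(records)

print(frame)
```

Appending to a Python list does not copy the existing rows, whereas growing a DataFrame row by row copies the data on every step.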
Join
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
left
|  | key | lval |
|---|---|---|
| 0 | foo | 1 |
| 1 | foo | 2 |
right
|  | key | rval |
|---|---|---|
| 0 | foo | 4 |
| 1 | foo | 5 |
pd.merge(left, right, on='key')
|  | key | lval | rval |
|---|---|---|---|
| 0 | foo | 1 | 4 |
| 1 | foo | 1 | 5 |
| 2 | foo | 2 | 4 |
| 3 | foo | 2 | 5 |
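Because both frames above repeat the key foo, the merge returns every left/right combination (2 x 2 = 4 rows). With distinct keys the `how` parameter decides which rows survive; the sketch below uses made-up keys to show the difference.

```python
import pandas as pd

# Illustrative frames that share only the key "foo".
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})

# The default how="inner" keeps only keys present on both sides.
inner = pd.merge(left, right, on="key")

# how="outer" keeps every key and fills the gaps with NaN.
outer = pd.merge(left, right, on="key", how="outer")

print(inner)
print(outer)
```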
Append
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
df
|  | A | B | C | D |
|---|---|---|---|---|
| 0 | -0.142659 | -0.941171 | -0.186519 | -0.811977 |
| 1 | 0.584561 | 0.177886 | -0.190396 | 0.664233 |
| 2 | -1.807829 | 0.268193 | 0.683990 | 0.477042 |
| 3 | -1.474986 | -1.098600 | -0.038280 | 2.087236 |
| 4 | 1.906703 | 0.678425 | -0.090156 | -0.444430 |
| 5 | 0.329748 | 1.110306 | 0.713732 | -0.714841 |
| 6 | 1.218329 | -0.376264 | 0.389029 | -1.526025 |
| 7 | 0.423347 | 1.821127 | -1.795346 | -0.795738 |
s = df.iloc[3]
df.append(s, ignore_index=True)
|  | A | B | C | D |
|---|---|---|---|---|
| 0 | -0.142659 | -0.941171 | -0.186519 | -0.811977 |
| 1 | 0.584561 | 0.177886 | -0.190396 | 0.664233 |
| 2 | -1.807829 | 0.268193 | 0.683990 | 0.477042 |
| 3 | -1.474986 | -1.098600 | -0.038280 | 2.087236 |
| 4 | 1.906703 | 0.678425 | -0.090156 | -0.444430 |
| 5 | 0.329748 | 1.110306 | 0.713732 | -0.714841 |
| 6 | 1.218329 | -0.376264 | 0.389029 | -1.526025 |
| 7 | 0.423347 | 1.821127 | -1.795346 | -0.795738 |
| 8 | -1.474986 | -1.098600 | -0.038280 | 2.087236 |
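DataFrame.append was removed in pandas 2.0, so on recent versions the same result is usually obtained with pd.concat. The following is a sketch of one possible equivalent, using a small made-up frame named `demo`.

```python
import pandas as pd

demo = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Take one row as a Series, as in the example above.
row = demo.iloc[1]

# Turn the Series into a one-row frame, then concatenate it;
# ignore_index=True renumbers the result 0..n-1.
appended = pd.concat([demo, row.to_frame().T], ignore_index=True)

print(appended)
```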
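Grouping
"Grouping" refers to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results
See the official documentation[2] for more operations.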
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
df
|  | A | B | C | D |
|---|---|---|---|---|
| 0 | foo | one | -1.145254 | 0.974305 |
| 1 | bar | one | 1.195757 | -0.187145 |
| 2 | foo | two | -0.699446 | 0.248682 |
| 3 | bar | three | -0.587003 | -0.200543 |
| 4 | foo | two | 2.046185 | -1.377637 |
| 5 | bar | two | 0.444696 | -0.880975 |
| 6 | foo | one | 0.057713 | -1.275762 |
| 7 | foo | three | 0.272196 | 0.016167 |
df.groupby('A')[['C', 'D']].sum()
| A | C | D |
|---|---|---|
| bar | 1.053451 | -1.268663 |
| foo | 0.531394 | -1.414245 |
df.groupby(['A', 'B']).sum()
| A | B | C | D |
|---|---|---|---|
| bar | one | 1.195757 | -0.187145 |
|  | three | -0.587003 | -0.200543 |
|  | two | 0.444696 | -0.880975 |
| foo | one | -1.087541 | -0.301457 |
|  | three | 0.272196 | 0.016167 |
|  | two | 1.346739 | -1.128956 |
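The split, apply and combine steps can be traced on a small frame with whole-number values, which makes the sums easy to check by hand. The data below is illustrative, and selecting the numeric column explicitly is a choice made for this sketch.

```python
import pandas as pd

demo = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                     "B": ["one", "one", "two", "two"],
                     "C": [1, 2, 3, 4]})

# Split on column A, sum column C within each group, combine into a Series.
totals = demo.groupby("A")["C"].sum()

# Grouping on two columns produces a hierarchical (MultiIndex) result.
multi = demo.groupby(["A", "B"])["C"].sum()

# agg applies several functions to each group in one pass.
summary = demo.groupby("A")["C"].agg(["sum", "mean"])

print(totals, multi, summary, sep="\n\n")
```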
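Reshaping
For a detailed tutorial, see the "Hierarchical indexing and reshaping" sections of the official documentation[3].
Stacking
stack() "compresses" a level of the DataFrame's columns into the index.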
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two',
'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2
| first | second | A | B |
|---|---|---|---|
| bar | one | -0.625492 | 2.471493 |
|  | two | 0.934708 | 1.595349 |
| baz | one | 0.686079 | 0.279957 |
|  | two | 0.039190 | -0.534317 |
stacked = df2.stack()
stacked
first second
bar one A -0.625492
B 2.471493
two A 0.934708
B 1.595349
baz one A 0.686079
B 0.279957
two A 0.039190
B -0.534317
dtype: float64
The inverse of stack() is unstack(), which by default unstacks the last level:
stacked.unstack()
| first | second | A | B |
|---|---|---|---|
| bar | one | -0.625492 | 2.471493 |
|  | two | 0.934708 | 1.595349 |
| baz | one | 0.686079 | 0.279957 |
|  | two | 0.039190 | -0.534317 |
stacked.unstack(1)
| first |  | second: one | second: two |
|---|---|---|---|
| bar | A | -0.625492 | 0.934708 |
|  | B | 2.471493 | 1.595349 |
| baz | A | 0.686079 | 0.039190 |
|  | B | 0.279957 | -0.534317 |
stacked.unstack(0)
| second |  | first: bar | first: baz |
|---|---|---|---|
| one | A | -0.625492 | 0.686079 |
|  | B | 2.471493 | 0.279957 |
| two | A | 0.934708 | 0.039190 |
|  | B | 1.595349 | -0.534317 |
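A compact way to see that stack() and unstack() are inverses is to round-trip a small frame. The sketch below uses integer values, which are assumptions for illustration, so the cells are easy to follow.

```python
import pandas as pd

index = pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]],
                                   names=["first", "second"])
demo = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]}, index=index)

# stack() moves the column labels into a new innermost index level.
stacked = demo.stack()

# unstack() with no argument moves that innermost level back to columns.
restored = stacked.unstack()

# A level can also be chosen by name instead of by number.
by_first = stacked.unstack("first")

print(stacked, restored, by_first, sep="\n\n")
```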
Pivot tables
Building a pivot table in pandas is straightforward, although it is less flexible than in Excel; see my article on the topic.
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
'B' : ['A', 'B', 'C'] * 4,
'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
'D' : np.random.randn(12),
'E' : np.random.randn(12)})
df
|  | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | one | A | foo | -0.072719 | -0.034173 |
| 1 | one | B | foo | 1.262336 | -0.907695 |
| 2 | two | C | foo | 0.093161 | -1.516473 |
| 3 | three | A | bar | 0.190056 | 0.481209 |
| 4 | one | B | bar | 1.319855 | 0.255924 |
| 5 | one | C | bar | 0.374758 | -0.019331 |
| 6 | two | A | foo | -1.019282 | 0.673759 |
| 7 | three | B | foo | -1.526206 | -0.521203 |
| 8 | one | C | foo | 1.600168 | 1.632461 |
| 9 | one | A | bar | -2.410462 | -0.271305 |
| 10 | two | B | bar | 0.387701 | -1.039195 |
| 11 | three | C | bar | -1.367669 | -1.760517 |
df.pivot_table(values='D', index=['A', 'B'], columns='C')
| A | B | C: bar | C: foo |
|---|---|---|---|
| one | A | -2.410462 | -0.072719 |
|  | B | 1.319855 | 1.262336 |
|  | C | 0.374758 | 1.600168 |
| three | A | 0.190056 | NaN |
|  | B | NaN | -1.526206 |
|  | C | -1.367669 | NaN |
| two | A | NaN | -1.019282 |
|  | B | 0.387701 | NaN |
|  | C | NaN | 0.093161 |
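pivot_table aggregates with the mean by default, which the example above does not show because each cell holds a single value. The sketch below, on made-up data, places two rows in one cell and also fills an empty cell.

```python
import pandas as pd

demo = pd.DataFrame({"A": ["one", "one", "two", "two"],
                     "C": ["foo", "bar", "foo", "foo"],
                     "D": [1.0, 2.0, 3.0, 5.0]})

# Default aggregation is the mean; combinations with no data become NaN.
mean_table = demo.pivot_table(values="D", index="A", columns="C")

# aggfunc switches the aggregation and fill_value replaces empty cells.
sum_table = demo.pivot_table(values="D", index="A", columns="C",
                             aggfunc="sum", fill_value=0)

print(mean_table)
print(sum_table)
```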
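Time series
pandas has simple, powerful and efficient functionality for resampling during frequency conversion (for example, converting per-second data into 5-minute data). This is very common in financial applications, but is not limited to them. See the "Time series" section of the official documentation[4].
rng = pd.date_range('1/1/2012', periods=100, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample('5Min').sum()
2012-01-01 27339
Freq: 5T, dtype: int64
Time zone representation
rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')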
ts = pd.Series(np.random.randn(len(rng)), rng)
ts
2012-03-06 -0.118691
2012-03-07 -1.424038
2012-03-08 0.377441
2012-03-09 -1.116195
2012-03-10 1.180595
Freq: D, dtype: float64
ts_utc = ts.tz_localize('UTC')
ts_utc
2012-03-06 00:00:00+00:00 -0.118691
2012-03-07 00:00:00+00:00 -1.424038
2012-03-08 00:00:00+00:00 0.377441
2012-03-09 00:00:00+00:00 -1.116195
2012-03-10 00:00:00+00:00 1.180595
Freq: D, dtype: float64
Time zone conversion
ts_utc.tz_convert('US/Eastern')
2012-03-05 19:00:00-05:00 -0.118691
2012-03-06 19:00:00-05:00 -1.424038
2012-03-07 19:00:00-05:00 0.377441
2012-03-08 19:00:00-05:00 -1.116195
2012-03-09 19:00:00-05:00 1.180595
Freq: D, dtype: float64
Converting between time span representations
rng = pd.date_range('1/1/2012', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
2012-01-31 1.138201
2012-02-29 0.677539
2012-03-31 0.272933
2012-04-30 -0.238112
2012-05-31 -1.122162
Freq: M, dtype: float64
ps = ts.to_period()
ps
2012-01 1.138201
2012-02 0.677539
2012-03 0.272933
2012-04 -0.238112
2012-05 -1.122162
Freq: M, dtype: float64
ps.to_timestamp()
2012-01-01 1.138201
2012-02-01 0.677539
2012-03-01 0.272933
2012-04-01 -0.238112
2012-05-01 -1.122162
Freq: MS, dtype: float64
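Converting between periods and timestamps makes some convenient arithmetic functions available.
In the following example, we convert a quarterly frequency with the year ending in November to 9 a.m. on the first day of the month following the quarter end: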
prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
ts.head()
1990-03-01 09:00 -1.555191
1990-06-01 09:00 1.535344
1990-09-01 09:00 -0.092187
1990-12-01 09:00 1.285081
1991-03-01 09:00 1.130063
Freq: H, dtype: float64
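In practice, the commonly used time series operations go well beyond the official examples above. In short, pandas handles date-related work well, from creation to conversion.
Working flexibly with categorical data
pandas can include categorical data in a DataFrame. For full documentation, see the categorical introduction and the API documentation.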
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
df['grade'] = df['raw_grade'].astype("category")
df['grade']
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
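Rename the categories to more meaningful names (with Series.cat.rename_categories)
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])
Reorder the categories and add the missing ones at the same time (methods under Series.cat return a new Series by default).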
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
df.sort_values(by='grade')
|  | id | raw_grade | grade |
|---|---|---|---|
| 5 | 6 | e | very bad |
| 1 | 2 | b | good |
| 2 | 3 | b | good |
| 0 | 1 | a | very good |
| 3 | 4 | a | very good |
| 4 | 5 | a | very good |
df.groupby("grade").size()
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
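The categorical workflow can be condensed into one runnable sketch. It maps categories with a dict and passes observed=False explicitly so that empty categories still appear in the grouped result; both are choices made for this example rather than something the article prescribes.

```python
import pandas as pd

demo = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                     "raw_grade": ["a", "b", "b", "a", "a", "e"]})

# Convert to category and rename via an explicit old -> new mapping.
grade = demo["raw_grade"].astype("category").cat.rename_categories(
    {"a": "very good", "b": "good", "e": "very bad"})

# set_categories fixes the category order and adds the unused levels.
demo["grade"] = grade.cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"])

# Sorting follows the category order, not alphabetical order.
worst_first = demo.sort_values(by="grade")

# observed=False keeps categories that have no rows (count 0).
sizes = demo.groupby("grade", observed=False).size()

print(worst_first)
print(sizes)
```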
Data visualization
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts.head()
2000-01-01 -1.946554
2000-01-02 -0.354670
2000-01-03 0.361473
2000-01-04 -0.109408
2000-01-05 0.877671
Freq: D, dtype: float64
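ts = ts.cumsum()  # cumulative sum
In pandas you can plot directly with .plot(), which supports many chart types and customization options; see the official documentation[5].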
ts.plot()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
Plot with plt; for the available parameters, see the official matplotlib documentation.
plt.figure(); df.plot(); plt.legend(loc='best')
Importing and exporting data
Write the data to a csv file; if it contains Chinese text, pay attention to the encoding
df.to_csv('foo.csv')
Read the data back from the csv file
pd.read_csv('foo.csv').head()
|  | Unnamed: 0 | A | B | C | D |
|---|---|---|---|---|---|
| 0 | 2000-01-01 | -0.640246 | -1.846295 | -0.181754 | 0.981574 |
| 1 | 2000-01-02 | -1.580720 | -2.382281 | -0.745580 | 0.175213 |
| 2 | 2000-01-03 | -2.745502 | -1.809188 | -0.371424 | -0.724011 |
| 3 | 2000-01-04 | -2.576642 | -1.287329 | -0.615925 | -1.154665 |
| 4 | 2000-01-05 | -2.442921 | -0.481561 | -0.283864 | 0.068934 |
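The "Unnamed: 0" column appears because the index is written to the file and then read back as ordinary data. One way to restore it, sketched here with an in-memory buffer instead of a real file (an assumption that keeps the example self-contained), is to pass index_col and parse_dates.

```python
import io

import pandas as pd

idx = pd.date_range("2000-01-01", periods=3)
demo = pd.DataFrame({"A": [1.5, 2.5, 3.5], "B": [4, 5, 6]}, index=idx)

# Write to an in-memory text buffer rather than to disk.
buffer = io.StringIO()
demo.to_csv(buffer)

# A plain read keeps the old index as a data column named "Unnamed: 0".
buffer.seek(0)
naive = pd.read_csv(buffer)

# index_col=0 restores the index; parse_dates=True parses it as dates.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)

print(naive)
print(restored)
```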
Export the data to hdf format
df.to_hdf('foo.h5','df')
Read the first five rows from the hdf file
pd.read_hdf('foo.h5','df').head()
|  | A | B | C | D |
|---|---|---|---|---|
| 2000-01-01 | -0.640246 | -1.846295 | -0.181754 | 0.981574 |
| 2000-01-02 | -1.580720 | -2.382281 | -0.745580 | 0.175213 |
| 2000-01-03 | -2.745502 | -1.809188 | -0.371424 | -0.724011 |
| 2000-01-04 | -2.576642 | -1.287329 | -0.615925 | -1.154665 |
| 2000-01-05 | -2.442921 | -0.481561 | -0.283864 | 0.068934 |
Save the data in xlsx format
df.to_excel('foo.xlsx', sheet_name='Sheet1')
Read the data from Sheet1 of the xlsx file with the specified options
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA']).head()
|  | A | B | C | D |
|---|---|---|---|---|
| 2000-01-01 | -0.640246 | -1.846295 | -0.181754 | 0.981574 |
| 2000-01-02 | -1.580720 | -2.382281 | -0.745580 | 0.175213 |
| 2000-01-03 | -2.745502 | -1.809188 | -0.371424 | -0.724011 |
| 2000-01-04 | -2.576642 | -1.287329 | -0.615925 | -1.154665 |
| 2000-01-05 | -2.442921 | -0.481561 | -0.283864 | 0.068934 |
Getting help
If you run into an error while using pandas, like the one below:
>>> if pd.Series([False, True, False]):
...     print("I was true")
Traceback
...
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
consult the official documentation to find out how to resolve it!
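For this particular error, the usual fix is to state which truth test is intended. A minimal sketch, with the hypothetical variable name `flags`:

```python
import pandas as pd

flags = pd.Series([False, True, False])

# Using a Series directly as a condition raises ValueError.
try:
    bool(flags)
    message = ""
except ValueError as exc:
    message = str(exc)

# Say explicitly what "true" should mean for the whole Series.
any_true = flags.any()    # at least one element is True
all_true = flags.all()    # every element is True
is_empty = flags.empty    # the Series has no elements

print(message)
print(any_true, all_true, is_empty)
```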
References
[1] https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics
[2] https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#groupby
[3] https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced-hierarchical
[4] https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries
[5] https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html#plotting