pandas: Reshaping with Indexes
import numpy as np
import pandas as pd
There are a number of basic operations for rearranging tabular data. These are referred to interchangeably as reshape or pivot operations.
Reshaping with Hierarchical Indexing
Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:
stack (columns are folded into a longer index)
This "rotates" or pivots from the columns in the data to the rows.
unstack
This pivots from the rows into the columns.
I'll illustrate these operations through a series of examples. Consider a small DataFrame with string arrays as row and column indexes:
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
index=pd.Index(['Ohio', 'Colorado'], name='state'),
columns=pd.Index(['one', 'two', 'three'],
name='number'))
data
number | one | two | three |
---|---|---|---|
state | |||
Ohio | 0 | 1 | 2 |
Colorado | 3 | 4 | 5 |
Using the stack method on this data pivots the columns into the rows, producing a Series.
"stack turns each row into a Series and piles the rows up"
result = data.stack()
result
'stack turns each row into a Series and piles the rows up'
state number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
dtype: int32
From a hierarchically indexed Series, you can rearrange the data back into a DataFrame with unstack.
"unstack turns the stacked Series back into a DataFrame"
result.unstack()
'unstack turns the stacked Series back into a DataFrame'
number | one | two | three |
---|---|---|---|
state | |||
Ohio | 0 | 1 | 2 |
Colorado | 3 | 4 | 5 |
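As a quick check of the round trip described above, the following sketch rebuilds the same small frame and compares the result of stack followed by unstack with the original. The variable names stacked and roundtrip are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(['Ohio', 'Colorado'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)

# stack: the 'number' column labels become the inner level of the row index
stacked = data.stack()

# unstack: the inner level goes back to the columns
roundtrip = stacked.unstack()

print(list(stacked.index.names))
print(roundtrip.equals(data))
```

For this frame no labels are missing, so the two reshapes undo each other without introducing NaN.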
By default the innermost level is unstacked (the same is true of stack). You can unstack a different level by passing a level number or name.
result.unstack(level=0)
state | Ohio | Colorado |
---|---|---|
number | ||
one | 0 | 3 |
two | 1 | 4 |
three | 2 | 5 |
result.unstack(level='state')
state | Ohio | Colorado |
---|---|---|
number | ||
one | 0 | 3 |
two | 1 | 4 |
three | 2 | 5 |
Unstacking might introduce missing data if all of the values in the level aren't found in each of the subgroups.
s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])
data2
one a 0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: int64
data2.unstack()  # behaves like an outer join on the inner labels
a | b | c | d | e | |
---|---|---|---|---|---|
one | 0.0 | 1.0 | 2.0 | 3.0 | NaN |
two | NaN | NaN | 4.0 | 5.0 | 6.0 |
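If NaN is not wanted for the missing combinations, unstack accepts a fill_value argument. The sketch below rebuilds data2 and compares the default result with a filled one; the choice of 0 as the fill value is an assumption made for illustration, not something the example above requires.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])

# Default: label combinations absent from the Series become NaN
with_nan = data2.unstack()

# fill_value replaces those missing combinations instead
filled = data2.unstack(fill_value=0)

print(with_nan.isna().sum().sum())  # number of missing cells
print(filled)
```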
%time data2.unstack().stack()
Wall time: 5 ms
one a 0.0
b 1.0
c 2.0
d 3.0
two c 4.0
d 5.0
e 6.0
dtype: float64
%time data2.unstack().stack(dropna=False)
Wall time: 3 ms
one a 0.0
b 1.0
c 2.0
d 3.0
e NaN
two a NaN
b NaN
c 4.0
d 5.0
e 6.0
dtype: float64
When you unstack in a DataFrame, the level unstacked becomes the lowest level in the result:
df = pd.DataFrame({'left': result, 'right': result + 5},
columns=pd.Index(['left', 'right'], name='side'))
df
side | | left | right |
---|---|---|---|
state | number | | |
Ohio | one | 0 | 5 |
Ohio | two | 1 | 6 |
Ohio | three | 2 | 7 |
Colorado | one | 3 | 8 |
Colorado | two | 4 | 9 |
Colorado | three | 5 | 10 |
df.unstack("state")
side | left | right | ||
---|---|---|---|---|
state | Ohio | Colorado | Ohio | Colorado |
number | ||||
one | 0 | 3 | 5 | 8 |
two | 1 | 4 | 6 | 9 |
three | 2 | 5 | 7 | 10 |
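One way to confirm where the unstacked level ends up is to inspect the names of the column levels. This sketch rebuilds df from scratch and prints them; wide is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(['Ohio', 'Colorado'], name='state'),
    columns=pd.Index(['one', 'two', 'three'], name='number'),
)
result = data.stack()
df = pd.DataFrame({'left': result, 'right': result + 5},
                  columns=pd.Index(['left', 'right'], name='side'))

wide = df.unstack('state')

# Column levels are listed outermost first; 'state' sits below 'side'
print(list(wide.columns.names))
print(list(wide.index.names))
```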
When calling stack, we can indicate the name of the axis to stack:
%time df.unstack('state').stack('side')
Wall time: 118 ms
state | | Colorado | Ohio |
---|---|---|---|
number | side | | |
one | left | 3 | 0 |
one | right | 8 | 5 |
two | left | 4 | 1 |
two | right | 9 | 6 |
three | left | 5 | 2 |
three | right | 10 | 7 |
Pivoting "Long" to "Wide" Format
A common way to store multiple time series in databases and CSV files is the so-called long or stacked format. Let's load some example data and do a small amount of time series wrangling and other data cleaning:
%%time
data = pd.read_csv("../examples/macrodata.csv")
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203 entries, 0 to 202
Data columns (total 14 columns):
year 203 non-null float64
quarter 203 non-null float64
realgdp 203 non-null float64
realcons 203 non-null float64
realinv 203 non-null float64
realgovt 203 non-null float64
realdpi 203 non-null float64
cpi 203 non-null float64
m1 203 non-null float64
tbilrate 203 non-null float64
unemp 203 non-null float64
pop 203 non-null float64
infl 203 non-null float64
realint 203 non-null float64
dtypes: float64(14)
memory usage: 22.3 KB
Wall time: 142 ms
data.head()
year | quarter | realgdp | realcons | realinv | realgovt | realdpi | cpi | m1 | tbilrate | unemp | pop | infl | realint | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1959.0 | 1.0 | 2710.349 | 1707.4 | 286.898 | 470.045 | 1886.9 | 28.98 | 139.7 | 2.82 | 5.8 | 177.146 | 0.00 | 0.00 |
1 | 1959.0 | 2.0 | 2778.801 | 1733.7 | 310.859 | 481.301 | 1919.7 | 29.15 | 141.7 | 3.08 | 5.1 | 177.830 | 2.34 | 0.74 |
2 | 1959.0 | 3.0 | 2775.488 | 1751.8 | 289.226 | 491.260 | 1916.4 | 29.35 | 140.5 | 3.82 | 5.3 | 178.657 | 2.74 | 1.09 |
3 | 1959.0 | 4.0 | 2785.204 | 1753.7 | 299.356 | 484.052 | 1931.3 | 29.37 | 140.0 | 4.33 | 5.6 | 179.386 | 0.27 | 4.06 |
4 | 1960.0 | 1.0 | 2847.699 | 1770.5 | 331.722 | 462.199 | 1955.5 | 29.54 | 139.6 | 3.50 | 5.2 | 180.007 | 2.31 | 1.19 |
periods = pd.PeriodIndex(year=data.year, quarter=data.quarter, name='date')
columns = pd.Index(['realgdp', 'infl', 'unemp'], name='item')
# Keep only these columns and set the name of the column index
data = data.reindex(columns=columns)
data.index = periods.to_timestamp('D', 'end')
ldata = data.stack().reset_index().rename(columns={0:'value'})
ldata[:10]
date | item | value | |
---|---|---|---|
0 | 1959-03-31 | realgdp | 2710.349 |
1 | 1959-03-31 | infl | 0.000 |
2 | 1959-03-31 | unemp | 5.800 |
3 | 1959-06-30 | realgdp | 2778.801 |
4 | 1959-06-30 | infl | 2.340 |
5 | 1959-06-30 | unemp | 5.100 |
6 | 1959-09-30 | realgdp | 2775.488 |
7 | 1959-09-30 | infl | 2.740 |
8 | 1959-09-30 | unemp | 5.300 |
9 | 1959-12-31 | realgdp | 2785.204 |
This is the so-called long format for multiple time series, or other observational data with two or more keys. Each row in the table represents a single observation.
Data is frequently stored this way in relational databases like MySQL, as a fixed schema allows the number of distinct values in the item column to change as data is added to the table. In the previous example, date and item would usually be the primary keys, offering both relational integrity and easier joins. In some cases, the data may be more difficult to work with in this format; you might prefer to have a DataFrame containing one column per distinct item value, indexed by timestamps in the date column. DataFrame's pivot method performs exactly this transformation:
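# Keyword arguments work across pandas versions; positional arguments are rejected by pandas 2.0 and later
pivoted = ldata.pivot(index='date', columns='item', values='value')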
pivoted[:5]
item | infl | realgdp | unemp |
---|---|---|---|
date | |||
1959-03-31 | 0.00 | 2710.349 | 5.8 |
1959-06-30 | 2.34 | 2778.801 | 5.1 |
1959-09-30 | 2.74 | 2775.488 | 5.3 |
1959-12-31 | 0.27 | 2785.204 | 5.6 |
1960-03-31 | 2.31 | 2847.699 | 5.2 |
The first two values passed are the columns to be used respectively as the row and column index, then finally an optional value column to fill the DataFrame. Suppose you had two value columns that you wanted to reshape simultaneously:
ldata['value2'] = np.random.randn(len(ldata))
ldata[:10]
date | item | value | value2 | |
---|---|---|---|---|
0 | 1959-03-31 | realgdp | 2710.349 | -0.143460 |
1 | 1959-03-31 | infl | 0.000 | -0.422318 |
2 | 1959-03-31 | unemp | 5.800 | 0.389872 |
3 | 1959-06-30 | realgdp | 2778.801 | -0.208526 |
4 | 1959-06-30 | infl | 2.340 | -1.538956 |
5 | 1959-06-30 | unemp | 5.100 | -0.143273 |
6 | 1959-09-30 | realgdp | 2775.488 | 0.385763 |
7 | 1959-09-30 | infl | 2.740 | 0.564365 |
8 | 1959-09-30 | unemp | 5.300 | 0.266295 |
9 | 1959-12-31 | realgdp | 2785.204 | -1.267871 |
By omitting the last argument, you obtain a DataFrame with hierarchical columns:
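The call for this step is not shown in the text. The sketch below illustrates it on a small hand-built frame, long_df, which is a hypothetical stand-in for ldata (the column names and values are chosen here for illustration). It also shows that the same reshape can be written with set_index followed by unstack.

```python
import pandas as pd

# Hypothetical stand-in for ldata: two dates, two items, two value columns
long_df = pd.DataFrame({
    'date': pd.to_datetime(['1959-03-31', '1959-03-31',
                            '1959-06-30', '1959-06-30']),
    'item': ['realgdp', 'infl', 'realgdp', 'infl'],
    'value': [2710.349, 0.00, 2778.801, 2.34],
    'value2': [1.0, 2.0, 3.0, 4.0],
})

# No values argument: every remaining column is pivoted,
# giving two column levels (value column, then item)
pivoted = long_df.pivot(index='date', columns='item')

# The same reshape spelled out with set_index and unstack
unstacked = long_df.set_index(['date', 'item']).unstack('item')

print(pivoted['value'])
print(pivoted.equals(unstacked))
```

Selecting pivoted['value'] returns the same single-level frame that passing values='value' would have produced.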
Wide to Long
The inverse operation to pivot for DataFrame is pandas.melt. Rather than transforming one column into many in a new DataFrame, it merges multiple columns into one, producing a DataFrame that is longer than the input. Let's look at an example:
df = pd.DataFrame({
'key': ['foo', 'bar', 'baz'],
'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]
})
df
key | A | B | C | |
---|---|---|---|---|
0 | foo | 1 | 4 | 7 |
1 | bar | 2 | 5 | 8 |
2 | baz | 3 | 6 | 9 |
The 'key' column may be a group indicator, and the other columns are data values. When using pandas.melt, we must indicate which columns (if any) are group indicators. Let's use 'key' as the only group indicator here:
melted = pd.melt(df, ['key'])
melted
key | variable | value | |
---|---|---|---|
0 | foo | A | 1 |
1 | bar | A | 2 |
2 | baz | A | 3 |
3 | foo | B | 4 |
4 | bar | B | 5 |
5 | baz | B | 6 |
6 | foo | C | 7 |
7 | bar | C | 8 |
8 | baz | C | 9 |
Using pivot, we can reshape back to the original layout:
reshaped = melted.pivot(index='key', columns='variable', values='value')
reshaped
variable | A | B | C |
---|---|---|---|
key | |||
bar | 2 | 5 | 8 |
baz | 3 | 6 | 9 |
foo | 1 | 4 | 7 |
Since the result of pivot creates an index from the column used as the row labels, we may want to use reset_index to move the data back into a column:
reshaped.reset_index()
variable | key | A | B | C |
---|---|---|---|---|
0 | bar | 2 | 5 | 8 |
1 | baz | 3 | 6 | 9 |
2 | foo | 1 | 4 | 7 |
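To see how close this gets to the original frame, the following sketch runs the full melt and pivot round trip and compares the result with df. The rename_axis step and the names restored and expected are additions made here for illustration; the row order differs from the original because pivot returns the key labels sorted.

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['foo', 'bar', 'baz'],
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9],
})

melted = pd.melt(df, id_vars=['key'])
reshaped = melted.pivot(index='key', columns='variable', values='value')

# reset_index moves 'key' back into a column; rename_axis drops the
# leftover 'variable' label from the column index
restored = reshaped.reset_index().rename_axis(columns=None)

# Compare against the original rows sorted by key
expected = df.sort_values('key').reset_index(drop=True)

print(restored)
print(restored.to_dict('list') == expected.to_dict('list'))
```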
You can also specify a subset of columns to use as value columns:
pd.melt(df, id_vars=['key'], value_vars=['A', 'B'])
key | variable | value | |
---|---|---|---|
0 | foo | A | 1 |
1 | bar | A | 2 |
2 | baz | A | 3 |
3 | foo | B | 4 |
4 | bar | B | 5 |
5 | baz | B | 6 |
pandas.melt can be used without any group identifiers, too:
pd.melt(df, value_vars=['A', 'B', 'C'])
variable | value | |
---|---|---|
0 | A | 1 |
1 | A | 2 |
2 | A | 3 |
3 | B | 4 |
4 | B | 5 |
5 | B | 6 |
6 | C | 7 |
7 | C | 8 |
8 | C | 9 |
pd.melt(df, value_vars=['key', 'A', 'B'])
variable | value | |
---|---|---|
0 | key | foo |
1 | key | bar |
2 | key | baz |
3 | A | 1 |
4 | A | 2 |
5 | A | 3 |
6 | B | 4 |
7 | B | 5 |
8 | B | 6 |
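The default column names variable and value can be replaced through the var_name and value_name parameters of pandas.melt. A minimal sketch follows; the names 'measure' and 'reading' are arbitrary choices made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['foo', 'bar', 'baz'],
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9],
})

# var_name labels the column holding the former column names,
# value_name labels the column holding the data values
melted = pd.melt(df, id_vars=['key'], value_vars=['A', 'B'],
                 var_name='measure', value_name='reading')

print(melted)
```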
Summary
Now that you have some pandas basics for data import, cleaning, and reorganization under your belt, we are ready to move on to data visualization with matplotlib. We will return to pandas later in the book when we discuss more advanced analytics.
Original article: https://www.cnblogs.com/chenjieyouge/p/11945169.html