Scientific Computing with Python: A Brief Introduction to pandas

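pandas is an open-source Python package for data analysis and processing. It is flexible, easy to use, and fast. This post briefly covers some of its more commonly used features.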

Reading data

pandas supports many ways of reading data; two of them are briefly introduced here.

Reading data from a CSV file:

$ pip install pandas
$ python
>>> import pandas as pd
>>> data = pd.read_csv('test.csv')
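
To try read_csv without a file on disk, the CSV text can be wrapped in io.StringIO. This is only a minimal sketch: the sample content below is made up for illustration and is not the real test.csv. It shows that empty fields are parsed as NaN by default.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for test.csv
csv_text = """x,y,shape,color,xx
0.8,,21,2,0.60
,0.9,23,2,0.93
0.5,0.3,,1,0.30
"""

# read_csv accepts any file-like object, not only a path
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)         # number of rows and columns
print(data.isna().sum())  # count of missing values per column
```

With a real file, passing the path as in the session above works the same way.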

Reading data from MySQL:

$ pip install sqlalchemy
$ pip install MySQL-python
$ python
>>> import pandas as pd
>>> from sqlalchemy import create_engine
>>> engine = create_engine('mysql://user:password@localhost/test')
>>> with engine.connect() as conn, conn.begin():
...     data = pd.read_sql_table('data', conn)
...
>>> data
    x    y  shape  color    xx
0  0.8          21      2  0.60
1  NaN  0.9     23      2  0.93
2  0.5  0.3    NaN      1  0.30
3  0.3  0.5     24      1  0.10
4  0.0  0.2     25      2  0.00
5  0.3  0.3     25      1  0.10
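
If no MySQL server is at hand, the same idea can be tried with an in-memory SQLite database from the standard library. This is a sketch under assumptions: it swaps MySQL for SQLite, uses pd.read_sql_query with a plain sqlite3 connection instead of read_sql_table, and the table contents are invented for illustration.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database as a stand-in for the MySQL "test" database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (x REAL, y REAL, color INTEGER)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [(0.8, None, 2), (0.5, 0.3, 1), (0.3, 0.5, 1)],  # made-up rows
)
conn.commit()

# Run a query and load the result set into a DataFrame
data = pd.read_sql_query("SELECT * FROM data", conn)
conn.close()

print(data)
```

SQL NULL values arrive in the DataFrame as missing values, so the cleaning step works the same way regardless of where the data came from.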

Data cleaning

Remove data that does not meet the requirements, here by dropping every row that contains a null value (NaN):

>>> data.dropna(how='any')
     x    y  shape  color   xx
0  0.8          21      2  0.6
3  0.3  0.5     24      1  0.1
4  0.0  0.2     25      2  0.0
5  0.3  0.3     25      1  0.1
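
Two details of dropna are easy to miss: it returns a new DataFrame rather than modifying the original, and the how and subset parameters control which rows count as incomplete. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Small illustrative frame, not the data from the session above
df = pd.DataFrame({
    "x": [0.8, np.nan, 0.5, np.nan],
    "y": [0.1, 0.9, np.nan, np.nan],
})

any_dropped = df.dropna(how="any")        # drop rows with at least one NaN
all_dropped = df.dropna(how="all")        # drop rows where every value is NaN
subset_dropped = df.dropna(subset=["x"])  # only look at column x

print(len(df), len(any_dropped), len(all_dropped), len(subset_dropped))
```

To keep the cleaned result, assign it back, for example data = data.dropna(how='any').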

Data processing

Get the number of rows and columns:

>>> data.shape  # 6 rows, 5 columns
(6, 5)

Get the column names:

>>> data.columns
Index([u'x', u'y', u'shape', u'color', u'xx'], dtype='object')
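
In a script, the shape tuple is usually unpacked and the column index converted to a plain list. A small sketch with an invented two-column frame:

```python
import pandas as pd

# Illustrative frame, not the data from the session above
df = pd.DataFrame({"x": [0.8, 0.5], "color": [2, 1]})

n_rows, n_cols = df.shape          # shape is a (rows, columns) tuple
col_names = df.columns.tolist()    # Index object converted to a list

print(n_rows, n_cols, col_names)
print(len(df))                     # len() gives the row count only
```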

Comparison with SQL:

select statement: similar to select shape, color from data limit 3;

>>> data[['shape','color']].head(3) 
    shape  color
0     21      2
1     23      2
2    NaN      1
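
The double brackets matter here: a list of column names returns a DataFrame, while a single name returns a Series. A short sketch on made-up data:

```python
import pandas as pd

# Illustrative frame, not the data from the session above
df = pd.DataFrame({
    "x": [0.8, 0.5, 0.3, 0.0],
    "shape": [21, 23, 24, 25],
    "color": [2, 2, 1, 2],
})

one_column = df["shape"]                     # single name -> Series
two_columns = df[["shape", "color"]].head(3)  # list of names -> DataFrame

print(type(one_column).__name__)
print(two_columns)
```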

where statement: similar to select * from data where color = 2 limit 3;

>>> data[data['color'] == 2].head(3) 
     x    y  shape  color    xx
0  0.8          21      2  0.60
1  NaN  0.9     23      2  0.93
4  0.0  0.2     25      2  0.00
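
The expression data['color'] == 2 produces a boolean Series, and indexing with it keeps the rows where it is True. To restrict the columns as well, .loc accepts the mask and a column list together. A sketch on invented data:

```python
import pandas as pd

# Illustrative frame, not the data from the session above
df = pd.DataFrame({"shape": [21, 23, 24, 25], "color": [2, 2, 1, 2]})

mask = df["color"] == 2                    # boolean Series, one entry per row
result = df.loc[mask, ["color"]].head(3)   # filter rows, then pick columns

print(mask.tolist())
print(result)
```

Note that the original row labels are preserved in the result, which is why the index in the output above skips from 1 to 4.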

group by statement: similar to select color, count(*) from data group by color;

>>> data.groupby('color').size() 
    color
1    3
2    3
dtype: int64
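
size() counts every row in a group, like count(*) in SQL, whereas count() on a column skips NaN, like count(column). A sketch on made-up data showing the difference, plus a simple aggregate:

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value in "shape"
df = pd.DataFrame({
    "color": [2, 2, 1, 1, 2, 1],
    "shape": [21, 23, np.nan, 24, 25, 25],
})

sizes = df.groupby("color").size()             # rows per group, NaN included
counts = df.groupby("color")["shape"].count()  # non-NaN values per group
means = df.groupby("color")["shape"].mean()    # NaN is ignored by mean()

print(sizes)
print(counts)
print(means)
```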

join statement: similar to select * from data inner join data2 on data.x = data2.x;

>>> pd.merge(data, data2, on='x') 
     x  y_x  shape_x  color_x  xx_x  y_y  shape_y  color_y  xx_y
0  0.8            21        2  0.60            21        2  0.60
1  NaN  0.9       23        2  0.93  0.9       23        2  0.93
2  0.5  0.3      NaN        1  0.30  0.3      NaN        1  0.30
3  0.3  0.5       24        1  0.10  0.5       24        1  0.10
4  0.3  0.5       24        1  0.10  0.3       25        1  0.10
5  0.3  0.3       25        1  0.10  0.5       24        1  0.10
6  0.3  0.3       25        1  0.10  0.3       25        1  0.10
7  0.0  0.2       25        2  0.00  0.2       25        2  0.00
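
Two things in this output are worth noting: columns that exist in both frames get the _x and _y suffixes, and a key that appears twice on each side produces four rows. The how parameter selects the join type, with inner as the default. A sketch on invented frames:

```python
import pandas as pd

# Illustrative frames; the key 0.3 appears twice on each side
left = pd.DataFrame({"x": [0.8, 0.3, 0.3], "y": [1, 2, 3]})
right = pd.DataFrame({"x": [0.3, 0.3, 0.0], "y": [10, 20, 30]})

inner = pd.merge(left, right, on="x", how="inner")     # matching keys only
left_join = pd.merge(left, right, on="x", how="left")  # keep all left rows

print(inner)
print(left_join)
```

In the left join, the row with x = 0.8 has no match on the right, so its y_y value is NaN.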

Data visualization

$ pip install matplotlib
$ python
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> d = {'hubei': 20, 'guangdong': 10, 'zhejiang': 15} # demo data as key:value pairs
>>> ts = pd.Series(d)                                  # build a Series from the dict
>>> ts.plot(kind='barh')                               # draw a horizontal bar chart
>>> plt.savefig('test.png')                            # save the figure as an image
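
When running this as a script on a machine without a display, it helps to select a non-interactive backend before importing pyplot. The following is a sketch under assumptions: the Agg backend, the axis label, and the temporary output path are choices made for illustration, not part of the session above.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

d = {"hubei": 20, "guangdong": 10, "zhejiang": 15}  # demo data
ts = pd.Series(d)

ax = ts.plot(kind="barh")   # plot() returns the matplotlib Axes
ax.set_xlabel("value")      # hypothetical axis label

# Write to a temporary directory instead of the working directory
out_path = os.path.join(tempfile.mkdtemp(), "test.png")
plt.tight_layout()
plt.savefig(out_path)
plt.close("all")

print(out_path)
```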