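Python Data Analysis and Visualization (1): Acquiring and Representing Data (Notes)

I. Acquiring local data: files

1. The three steps of working with a file

Open the file -> read or write the file -> close the file

Why does a file need to be closed? Python may buffer the data you write, so if the program crashes before the buffer is flushed, that data may never reach the file. To be safe, close the file explicitly once reading or writing is done.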
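
The open, write, close sequence can be sketched as follows. This is a minimal illustration rather than the only way to do it: it uses try/finally so the file is closed even if the write raises, and the file name demo_close.txt and the temporary directory are assumptions chosen for the demo.

```python
import os
import tempfile

# Hypothetical file inside a temporary directory, used only for this demo.
path = os.path.join(tempfile.mkdtemp(), "demo_close.txt")

f = open(path, "w")          # step 1: open
try:
    f.write("Hello, World")  # step 2: write (the data may still sit in a buffer)
finally:
    f.close()                # step 3: close, which flushes the buffer to disk

print(f.closed)              # True once the file has been closed

# Reopen the file to confirm the data actually reached it.
with open(path) as check:
    content = check.read()
print(content)
```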

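2. Opening a file

Use the open function. Its first argument is the file name (which may include a path), the second is the read/write mode, and the third controls buffering.

First argument: required.

Second argument: defaults to r (read-only) and may be omitted. Modes include r, w, a, r+, w+, a+, and so on.

Third argument: defaults to -1, which means the system default buffer size is used, and may be omitted. In Python, binary files may be opened unbuffered, but text files must be buffered.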
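
The three arguments can be tried out with a short sketch like the one below. The file name modes.txt and the temporary directory are assumptions for illustration; the last part checks the buffering rule by asking for an unbuffered text file, which Python rejects with a ValueError.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")  # hypothetical demo file

# 'w' creates the file, or truncates it if it already exists.
f = open(path, "w")
f.write("first\n")
f.close()

# 'a' appends to the end and keeps the existing content.
f = open(path, "a")
f.write("second\n")
f.close()

# 'r' is the default mode, so the second argument can be omitted.
f = open(path)
text = f.read()
f.close()
print(repr(text))

# buffering=0 (unbuffered) is accepted in binary mode.
f = open(path, "rb", 0)
raw = f.read()
f.close()

# The same request in text mode raises ValueError.
try:
    open(path, "r", 0)
    unbuffered_text_allowed = True
except ValueError:
    unbuffered_text_allowed = False
print(unbuffered_text_allowed)
```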

open returns an iterable file object, so we can loop over it; each item it yields is one line of the file.
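
As a small sketch of that iteration (the file lines.txt and its three lines are made up for the demo):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "lines.txt")  # hypothetical demo file
with open(path, "w") as f:
    f.write("alpha\nbeta\ngamma\n")

lines = []
with open(path) as f:
    for line in f:                       # the file object yields one line at a time
        lines.append(line.rstrip("\n"))  # each line still ends with its newline

print(lines)
```

Looping this way reads the file lazily, one line at a time, so it also works for files too large to hold in memory at once.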

3. Reading and writing a file

Assume f = open("hello.txt", 'w')

a. Writing: f.write('Hello, World')

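# A better way to read and write
with open('hello.txt', 'w') as f:
    f.write('Hello, World')
# Note: the with statement closes the file handle automatically when the block exits, so no explicit close call is needed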

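b. Reading: f.read() takes an optional size argument and reads at most size characters from the file (bytes if the file was opened in binary mode), returning them as a string. The file must have been opened in a mode that allows reading, such as r or r+, not the w mode assumed above.

With no argument, it reads to the end of the file and still returns a single string.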
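
A minimal sketch of both forms of read, assuming a throwaway text file whose content is Hello, World (in text mode the size argument counts characters):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "hello.txt")  # hypothetical demo file
with open(path, "w") as f:
    f.write("Hello, World")

with open(path) as f:
    head = f.read(5)    # at most 5 characters
    rest = f.read()     # everything from the current position to the end
    at_eof = f.read()   # an empty string once the end has been reached

print(repr(head), repr(rest), repr(at_eof))
```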

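c. Related methods:

f.readlines() reads all remaining lines and returns them as a list

f.readline() reads a single line

f.writelines() writes a sequence of strings; it does not add line breaks itself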
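
The three methods can be combined in a short sketch like this one. The file name and the three sample lines are assumptions for the demo; note that the newlines are supplied by the caller.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "multi.txt")  # hypothetical demo file

with open(path, "w") as f:
    # writelines writes each string as-is, so every item carries its own "\n".
    f.writelines(["one\n", "two\n", "three\n"])

with open(path) as f:
    first = f.readline()       # reads up to and including the first newline
    remaining = f.readlines()  # reads the rest into a list of lines

print(repr(first))
print(remaining)
```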

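d. Moving the file pointer

f.seek(offset, whence=0) moves the file pointer offset bytes from the position given by whence (0 is the start of the file, 1 is the current position, 2 is the end of the file). whence is optional and defaults to 0.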
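
The three whence values can be illustrated with the sketch below. It opens the file in binary mode, because in text mode Python only allows nonzero offsets relative to the start of the file. The file name and the ten-byte content are assumptions for the demo.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "digits.bin")  # hypothetical demo file
with open(path, "wb") as f:
    f.write(b"0123456789")

with open(path, "rb") as f:
    f.seek(3)        # whence defaults to 0: 3 bytes from the start
    a = f.read(2)    # reads bytes 3 and 4, leaving the pointer at 5
    f.seek(2, 1)     # whence=1: skip 2 bytes from the current position, to 7
    b = f.read(1)    # reads byte 7
    f.seek(-3, 2)    # whence=2: 3 bytes before the end of the file
    c = f.read()     # reads the last 3 bytes
    pos = f.tell()   # current pointer position, now at the end

print(a, b, c, pos)
```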

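e. Further notes

Python's three standard streams:

stdin: standard input    stdout: standard output    stderr: standard error

In Python, the keyboard and the display terminal are also treated as files. They are exposed as the file objects sys.stdin, sys.stdout, and sys.stderr in the sys module. For example, print('hello, world') is roughly equivalent to:

import sys
sys.stdout.write('hello, world\n')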
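
To see that print really goes through sys.stdout, the sketch below temporarily redirects standard output into an in-memory buffer and compares what the two calls produce. The buffer and the variable names are illustrative assumptions; note that print appends a newline, so the explicit write needs one too to match.

```python
import io
import sys
from contextlib import redirect_stdout

buffer = io.StringIO()
with redirect_stdout(buffer):
    print("hello, world")               # goes to sys.stdout, currently the buffer
    sys.stdout.write("hello, world\n")  # the explicit call with the same effect
captured = buffer.getvalue()

# Both calls wrote exactly the same text.
print(captured.splitlines())

# print can target any stream, for example standard error.
print("something went wrong", file=sys.stderr)
```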

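II. Acquiring data from the network

1. Workflow

Getting data from the network takes two stages: crawl -> parse

2. Crawling

The Requests library

Basic method: requests.get() requests the resource at a given URL and corresponds to the GET method of HTTP

Notes: get returns a Response object, which holds both the request information and the server's response. Requests decodes the server's content automatically. If the page returns JSON, call r.json() on the response object r to parse it. If it returns binary data, read the raw bytes from r.content (an attribute, not a method). r.text (also an attribute) returns the body decoded as text, with the encoding inferred from the response. You can override the encoding by assigning to r.encoding; utf-8 is the most common choice.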
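
To make the relationship between raw bytes, decoded text, and parsed JSON concrete without touching the network, the sketch below imitates the three views using only the standard library. It does not call Requests at all: the payload is a made-up byte string standing in for what a response body might contain, and the comments name the Response member each step roughly corresponds to.

```python
import json

# Hypothetical raw payload, standing in for the bytes a response body would hold
# (the role played by r.content).
raw = '{"title": "数据分析", "pages": 3}'.encode("utf-8")

# Decoding the bytes with the right encoding gives the text view
# (the role played by r.text once r.encoding is 'utf-8').
text = raw.decode("utf-8")

# Parsing the text as JSON gives Python objects (the role played by r.json()).
data = json.loads(text)

# Decoding with the wrong encoding does not fail here, but garbles the
# non-ASCII characters, which is why the encoding sometimes has to be set by hand.
wrong = raw.decode("latin-1")

print(data["pages"])
print(text == wrong)
```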

Examples:

# Example 1
# If the resource is a binary file, the data can be saved like this
import requests

r = requests.get('https://www.baidu.com/img/bd_logo1.png')
with open('baidu.png', 'wb') as fp:
    fp.write(r.content)

# Example 2
# As an anti-crawling measure, some sites check the User-Agent header. Pass the headers
# to the headers parameter of get. Zhihu, for example, returns 400 when accessed directly
# but responds correctly once the headers parameter is supplied:
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.83 Safari/535.11"}
r = requests.get('https://www.zhihu.com', headers=headers)
print(r.status_code)

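3. Parsing

Once the source has been fetched, it has to be parsed

Source with regular, well-formed tags is a good fit for the Beautiful Soup library, while source with a more complex structure is better handled by extracting data with regular expressions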

# Beautiful Soup parsing example
import requests
from bs4 import BeautifulSoup

r = requests.get('https://movie.douban.com/subject/10533913/')
soup = BeautifulSoup(r.text, 'lxml')
# find every <span> whose class is 'short' (the short comments)
pattern = soup.find_all('span', 'short')
for item in pattern:
    print(item.string)

# Regular expression parsing example
import requests
from bs4 import BeautifulSoup
import re

s = 0  # running total of the star ratings
r = requests.get('https://book.douban.com/subject/1165179/comments/')
soup = BeautifulSoup(r.text, 'lxml')
pattern = soup.find_all('span', 'short')
for item in pattern:
    print(item.string)
# capture the digits that follow "allstar" in each rating tag
pattern_s = re.compile('<span class="user-stars allstar(.*?) rating">')
p = re.findall(pattern_s, r.text)
for star in p:
    s += int(star)
print(s)
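
The regular expression step can be tried offline with a sketch like the following. The HTML fragment is hypothetical, written only to imitate the rating markup that the pattern above targets, and the three rating values are made up.

```python
import re

# Hypothetical fragment imitating the rating tags matched by the pattern above.
html = '''
<span class="user-stars allstar50 rating"></span>
<span class="user-stars allstar40 rating"></span>
<span class="user-stars allstar30 rating"></span>
'''

# The group (.*?) captures whatever sits between "allstar" and " rating".
pattern = re.compile(r'<span class="user-stars allstar(.*?) rating"')
stars = [int(s) for s in pattern.findall(html)]

total = sum(stars)
average = total / len(stars)
print(stars, total, average)
```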

Finally, here is a complete crawler: it fetches the first 50 short comments for a book on Douban and computes the average rating

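import re
import time

import requests
from bs4 import BeautifulSoup

count = 0       # number of comments seen so far
i = 0           # zero-based page index
count_del = 0   # comments beyond the 50th on the last page fetched
lst_stars = []  # star ratings collected from every page
pattern = re.compile('<span class="user-stars allstar(.*?)rating"')

while count < 50:
    try:
        r = requests.get('https://book.douban.com/subject/10517238/comments/hot?p=' + str(i + 1))
    except Exception as err:
        print(err)
        break
    soup = BeautifulSoup(r.text, 'lxml')
    comments = soup.find_all('span', 'short')
    if not comments:
        # no comments on this page (last page or blocked request): stop instead of looping forever
        break
    for item in comments:
        count += 1
        if count > 50:
            count_del += 1
        else:
            print(count, item.string)
    for star in re.findall(pattern, r.text):
        lst_stars.append(int(star))
    time.sleep(5)  # pause between requests
    i += 1

# Drop the ratings of the surplus comments, then average once after the loop.
# This assumes every comment carries a rating, so ratings line up with comments.
kept = lst_stars[:max(len(lst_stars) - count_del, 0)]
if kept:
    print(sum(kept) // len(kept))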

Original article: https://www.cnblogs.com/laideng/p/11440209.html