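Filling the Holes in Scraped Data: Smoothing Out the Bumps in Python Web Scraping

A scrub-level amateur's notes on filling the holes left by missing scraped data: a few lessons learned about smoothing out the bumps in Python data scraping, plus a worked example of backfilling missing data. These are just a few pointers for your amusement, and if anything looks familiar, I must have copied it!

When you scrape data with Python, especially from your own computer, you will often run into network latency, or a part-time network admin unplugging and restarting the network. This scrub hits that all the time. Running the scraper on a server is still the recommended approach, of course.

These are the common and controllable kinds of scraping failure, and there are quite a few ways to deal with them. Filling the holes they leave in the scraped data is the focus of this post.

Fix 1: set timeout=x

When fetching pages with requests, always set the timeout parameter. timeout=5 is typical, and 5 s or more is recommended. If your network is poor, or the target server responds slowly (for example, accessing an overseas server from mainland China), use 10 s or more!

Why set timeout=x?

Without it, a network stall can hang the program indefinitely: no error is raised, and it just sits in the middle of the page request. This is especially common in exe programs packaged with pyinstaller!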

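Timeouts (timeout)

Most requests to external servers should have a timeout attached, in case the server does not respond in a timely manner.

By default, requests does not time out unless a timeout value is set explicitly.

Without a timeout, your code may hang for minutes or more.

The connect timeout is the number of seconds Requests will wait for your client to establish a connection to a port on the remote machine (corresponding to the connect() call).

A good practice is to set the connect timeout to slightly more than a multiple of 3, because 3 is the default TCP packet retransmission window.

With scraping proxies we often hit request timeouts: the code just hangs there, with no error and no response from requests.

The usual fix is to pass timeout to requests.get() to limit the request time: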

req = requests.get(url, headers=headers, proxies=proxies, timeout=5)

If requests still hang for a long time with timeout=5, you can split the timeout into its two parts.

After the following change, the problem went away:

req = requests.get(url, headers=headers, proxies=proxies, timeout=(3,7))

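timeout limits how long requests waits. The wait has two parts, the connect timeout and the read timeout: timeout=(3,7) means a connect timeout of 3 s and a read timeout of 7 s. If you pass a single value, it is applied to the connect timeout and the read timeout separately, not as a total for both!

Source: CSDN blogger 「明天依旧可好」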
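
To make the two timeouts easier to tell apart in practice, here is a minimal sketch that passes a (connect, read) tuple and reports which one fired. The function name fetch, the optional session parameter and the 3.05 s / 10 s values are my own illustrative assumptions, not something from the quoted sources.

```python
import requests


def fetch(url, timeout=(3.05, 10), session=None):
    """Return the page text, or None if the request timed out or failed.

    timeout is a (connect, read) tuple in seconds; the values are assumptions.
    session is an optional requests.Session (or any object with a .get method).
    """
    http = session or requests
    try:
        resp = http.get(url, timeout=timeout)
        resp.raise_for_status()
        return resp.text
    except requests.exceptions.ConnectTimeout:
        # The connection could not be established within the connect timeout
        print(f"connect timeout: {url}")
    except requests.exceptions.ReadTimeout:
        # Connected, but the server did not send data within the read timeout
        print(f"read timeout: {url}")
    except requests.exceptions.RequestException as e:
        # Any other requests error (DNS failure, HTTP error status, ...)
        print(f"request failed: {url} ({e})")
    return None
```

Knowing which timeout fired helps when tuning: repeated connect timeouts usually point at the network or proxy, while read timeouts point at a slow server.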

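Fix 2: retrying requests on timeout

On to configuring retries for requests. The error message you know all too well is the read timeout error.

Retrying on timeout cannot eliminate read timeout errors entirely, but it greatly increases how much data you actually get: an occasional network timeout no longer costs you the data, and you avoid a large backfill job later.

On a timeout we usually don't give up right away; instead we set up a mechanism that tries up to three times.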

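import requests

def gethtml(url):
    """Fetch url, trying up to 3 times; returns None if every attempt fails."""
    i = 0
    while i < 3:
        try:
            html = requests.get(url, timeout=5).text
            return html
        except requests.exceptions.RequestException:
            i += 1
    return None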

In fact, requests already has this built in. (Though the code seems to have gotten longer...)

import time
import requests
from requests.adapters import HTTPAdapter


s = requests.Session()
s.mount('http://', HTTPAdapter(max_retries=3))
s.mount('https://', HTTPAdapter(max_retries=3))


print(time.strftime('%Y-%m-%d %H:%M:%S'))
try:
    r = s.get('http://www.google.com.hk', timeout=5)
    # return is only valid inside a function, so print the result here
    print(r.text)
except requests.exceptions.RequestException as e:
    print(e)
print(time.strftime('%Y-%m-%d %H:%M:%S'))

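max_retries is the maximum number of retries. Three retries plus the initial request make four attempts in total, so the code above takes 20 seconds to run (4 × 5 s) rather than 15: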

2020-01-11 15:34:03
HTTPConnectionPool(host='www.google.com.hk', port=80): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x0000000013269630>, 'Connection to www.google.com.hk timed out. (connect timeout=5)'))
2020-01-11 15:34:23

Source: 大龄码农的Python之路
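
As a variation on the session setup above, max_retries also accepts a urllib3 Retry object, which lets you add a pause between attempts. The sketch below is only one possible configuration: the helper name make_session, the backoff factor and the list of status codes are assumptions I picked for illustration.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(retries=3, backoff=0.5):
    """Build a Session that retries failed requests with a growing pause."""
    retry = Retry(
        total=retries,            # at most this many retries per request
        backoff_factor=backoff,   # the pause between retries grows exponentially
        # Also retry on these HTTP status codes (applies to GET and the other
        # idempotent methods that urllib3 retries by default)
        status_forcelist=(500, 502, 503, 504),
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

You would then call make_session().get(url, timeout=5) exactly as with the plain session; the timeout still has to be passed on every request.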

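Fix 3: downloading images with urlretrieve()

Fixing incomplete urlretrieve downloads without taking too long

Downloading a file can raise urllib.ContentTooShortError (urllib.error.ContentTooShortError in Python 3), and downloading it again can take a long time. It often takes several attempts, sometimes more than ten, and occasionally gets stuck in an endless loop, which is far from ideal. To deal with this, the author uses the socket module to shorten each retry and avoid the endless loop, which makes the program run more efficiently.

The code is as follows: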

import socket
import urllib.request
# Set the timeout to 30 s
socket.setdefaulttimeout(30)
# Handle incomplete downloads and avoid looping forever
# (url and image_name are assumed to be defined earlier)
try:
    urllib.request.urlretrieve(url,image_name)
except socket.timeout:
    count = 1
    while count <= 5:
        try:
            urllib.request.urlretrieve(url,image_name)
            break
        except socket.timeout:
            err_info = 'Reloading for %d time'%count if count == 1 else 'Reloading for %d times'%count
            print(err_info)
            count += 1
    if count > 5:
        print("downloading picture failed!")

Source: CSDN blogger 「山阴少年」
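
The same idea can be folded into a single loop that also catches the incomplete-download error itself. This is a rough sketch rather than the quoted author's code: the name download_with_retry and the retrieve parameter (there only so the download call can be swapped out) are my own assumptions.

```python
import socket
import urllib.error
import urllib.request


def download_with_retry(url, filename, retries=5, timeout=30,
                        retrieve=urllib.request.urlretrieve):
    """Try to download url to filename up to `retries` times.

    Returns True on success and False if every attempt failed.
    """
    # Global socket timeout, so a stalled download raises instead of hanging
    socket.setdefaulttimeout(timeout)
    for attempt in range(1, retries + 1):
        try:
            retrieve(url, filename)
            return True
        # ContentTooShortError is a subclass of URLError, so it is caught here too
        except (socket.timeout, urllib.error.URLError) as e:
            print(f"attempt {attempt}/{retries} failed: {e!r}")
    return False
```

Because the number of attempts is bounded by the for loop, the function cannot get stuck retrying forever; a False return is the cue to write the URL to your failure log.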

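Fix 4: using time.sleep

Python's time.sleep() function suspends the calling thread; the secs argument is the number of seconds to suspend it for.

Some sites reject requests that arrive too quickly: without a delay of 1-2 s between requests, you won't get any data!

That said, such cases are fairly rare!

Whatever method you use, collecting data smoothly comes down to one thing: recording the final state. In other words, your scrape logging must be thorough!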
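
Putting the delay and the logging together might look something like the sketch below. Everything named here is an assumption for illustration: crawl, the fetch callable you pass in, the "spider" logger name and the 1-2 s range.

```python
import logging
import random
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("spider")


def crawl(urls, fetch, low=1.0, high=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing low..high seconds after each one.

    Every outcome is logged; the URLs that raised are returned so they can
    be retried later.
    """
    failed = []
    for url in urls:
        try:
            fetch(url)
            log.info("ok %s", url)
        except Exception as e:
            log.error("failed %s: %s", url, e)
            failed.append(url)
        # Random pause so requests are not sent back to back
        sleep(random.uniform(low, high))
    return failed
```

The returned list is the "final state" in its simplest form: whatever is in it still needs to be fetched.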

Appendix:

A complete backfill example:

Source code that records the exceptions:

# Excerpt from the body of get_img() in the full source at the end of this post
s = requests.session()
s.mount('http://', HTTPAdapter(max_retries=3))
s.mount('https://', HTTPAdapter(max_retries=3))
try:
    print(f">>> Downloading image {img_name} ...")
    r=s.get(img_url,headers=ua(),timeout=15)
    with open(f'{path}/{img_name}','wb') as f:
        f.write(r.content)
    print(f">>> Image {img_name} downloaded successfully!")
    time.sleep(2)
except requests.exceptions.RequestException as e:
    print(f"Image {img_name} - {img_url} failed to download!")
    # One record per line; the '-下载失败' marker is what the backfill code splits on
    with open(f'{path}/imgspider.txt','a+') as f:
        f.write(f'{img_url},{img_name},{path}-下载失败，错误代码：{e}！\n')

[Screenshot: errors reported while downloading images]

Records written to the failure log file:

https://www.red-dot.org/index.php?f=65894&token=2aa10bf1c4ad54ea3b55f0f35f57abb4ba22cc76&eID=tx_solr_image&size=large&usage=overview,1_1_KRELL Automotive.jpg,2019Communication Design/Film & Animation-下载失败，错误代码：HTTPSConnectionPool(host='www.red-dot.org', port=443): Max retries exceeded with url: /index.php?f=65894&token=2aa10bf1c4ad54ea3b55f0f35f57abb4ba22cc76&eID=tx_solr_image&size=large&usage=overview (Caused by ReadTimeoutError("HTTPSConnectionPool(host='www.red-dot.org', port=443): Read timed out. (read timeout=15)"))！
https://www.red-dot.org/index.php?f=65913&token=8cf9f213e28d0e923e1d7c3ea856210502f57df3&eID=tx_solr_image&size=large&usage=overview,1_2_OLX – Free Delivery.jpg,2019Communication Design/Film & Animation-下载失败，错误代码：HTTPSConnectionPool(host='www.red-dot.org', port=443): Read timed out.！
https://www.red-dot.org/index.php?f=65908&token=426484d233356d6a1d4b8044f4994e1d7f8c141a&eID=tx_solr_image&size=large&usage=overview,1_3_Dentsu Aegis Network’s Data Training – Data Foundation.jpg,2019Communication Design/Film & Animation-下载失败，错误代码：HTTPSConnectionPool(host='www.red-dot.org', port=443): Max retries exceeded with url: /index.php?f=65908&token=426484d233356d6a1d4b8044f4994e1d7f8c141a&eID=tx_solr_image&size=large&usage=overview (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003943320>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed'))！

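Backfill approach:

Step 1: search for the failure log files and get their paths

Step 2: open each file and read the recorded information

Step 3: download the images again to fill in the missing image data

A few key points:

1. Searching for the failure log file, which here is imgspider.txt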

import os

# Search for files
def search(path,key):
    """
    Search a directory for files whose names contain key and return their paths.
    :param path: directory to search
    :param key:  keyword to look for in file names
    :return: list of paths of the matching files
    """
    key_paths=[]
    # List the entries (files and folders) in the current directory
    allfilelist = os.listdir(path)
    print(allfilelist)
    for filelist in allfilelist:
        # Names without a "." are treated as folders
        if "." not in filelist:
            filespath=os.path.join(path, filelist)
            files= os.listdir(filespath)
            print(files)
            for file in files:
                if "." not in file:
                    filepath=os.path.join(filespath, file)
                    file_names = os.listdir(filepath)
                    for file_name in file_names:
                        if key in file_name:
                            key_path=os.path.join(filepath,file_name)
                            print(f'Found file, path: {key_path}')
                            key_paths.append(key_path)
        else:
            if key in filelist:
                key_path=os.path.join(path, filelist)
                print(f'Found file, path: {key_path}')
                key_paths.append(key_path)

    return key_paths

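This only goes two directory levels deep. It could be rewritten as a recursive function and combined with a GUI to make a simple file-search helper!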
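
For the recursive version, os.walk from the standard library already does the descent for you. A minimal sketch, assuming a helper name of my own (search_recursive) and leaving the GUI part out:

```python
import os


def search_recursive(root, key):
    """Return the paths of all files under root whose names contain key."""
    matches = []
    # os.walk visits root and every subdirectory below it, at any depth
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if key in name:
                matches.append(os.path.join(dirpath, name))
    return matches
```

Unlike the hand-written loops, this checks real files rather than guessing from whether a name contains a dot, so folders with a dot in their name are still searched.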

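[Screenshot: file search result]

2. Processing the image data

The string split function, split

Three pieces of information need to be extracted, namely the contents of each failure record:

1. img_url: the image download URL

2. img_name: the image name

3. path: the directory the image is stored in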

    for data in datas:
        img_data=data.split('-下载失败')[0]
        img_url=img_data.split(',')[0]
        img_name = img_data.split(',')[1]
        path = img_data.split(',')[2]
        print(img_name,img_url,path)
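
Splitting on every comma works for the records shown above, but it breaks as soon as an image name contains a comma, and a blank line raises IndexError. The sketch below is a slightly more defensive take; the function name parse_failure_record is my own, and it assumes the URL and the storage path never contain commas.

```python
MARKER = "-下载失败"  # the marker written into each failure record


def parse_failure_record(line):
    """Return (img_url, img_name, path) for one log line, or None if malformed."""
    head, marker, _error = line.strip().partition(MARKER)
    if not marker:
        return None  # blank line or a line without the failure marker
    url_and_rest = head.split(",", 1)      # the URL is the first field
    if len(url_and_rest) != 2:
        return None
    img_url, rest = url_and_rest
    name_and_path = rest.rsplit(",", 1)    # the path is the last field
    if len(name_and_path) != 2:
        return None
    img_name, path = name_and_path
    return img_url, img_name, path
```

Taking the URL from the left and the path from the right leaves whatever is in between as the image name, commas included.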

[Screenshot: backfill result]

Full source code:

# -*- coding: utf-8 -*-
#python3.7
# 20200111 by WeChat: huguo00289
import os,time,requests
from fake_useragent import UserAgent
from requests.adapters import HTTPAdapter # import HTTPAdapter

# Build the request headers
def ua():
    ua=UserAgent()
    headers={"User-Agent":ua.random}
    return headers


# Search for files
def search(path,key):
    """
    Search a directory for files whose names contain key and return their paths.
    :param path: directory to search
    :param key:  keyword to look for in file names
    :return: list of paths of the matching files
    """
    key_paths=[]
    # List the entries (files and folders) in the current directory
    allfilelist = os.listdir(path)
    print(allfilelist)
    for filelist in allfilelist:
        # Names without a "." are treated as folders
        if "." not in filelist:
            filespath=os.path.join(path, filelist)
            files= os.listdir(filespath)
            print(files)
            for file in files:
                if "." not in file:
                    filepath=os.path.join(filespath, file)
                    file_names = os.listdir(filepath)
                    for file_name in file_names:
                        if key in file_name:
                            key_path=os.path.join(filepath,file_name)
                            print(f'Found file, path: {key_path}')
                            key_paths.append(key_path)
        else:
            if key in filelist:
                key_path=os.path.join(path, filelist)
                print(f'Found file, path: {key_path}')
                key_paths.append(key_path)

    return key_paths


# Collect the paths of the log files that record failed image downloads
def get_pmimgspider():
    img_paths=[]
    key = "imgspider"
    categorys = [
        "Advertising", "Annual Reports", "Apps", "Brand Design & Identity", "Brands", "Corporate Design & Identity",
        "Fair Stands", "Film & Animation", "Illustrations", "Interface & User Experience Design",
        "Online", "Packaging Design", "Posters", "Publishing & Print Media", "Retail Design", "Sound Design",
        "Spatial Communication", "Typography", "Red Dot_Junior Award",
    ]
    for category in categorys:
        path = f'2019Communication Design/{category}'
        key_paths = search(path, key)
        img_paths.extend(key_paths)

    print(img_paths)
    return img_paths


# Download an image
def get_img(img_name,img_url,path):
    s = requests.session()
    s.mount('http://', HTTPAdapter(max_retries=3))
    s.mount('https://', HTTPAdapter(max_retries=3))
    try:
        print(f">>> Downloading image {img_name} ...")
        r=s.get(img_url,headers=ua(),timeout=15)
        with open(f'{path}/{img_name}','wb') as f:
            f.write(r.content)
        print(f">>> Image {img_name} downloaded successfully!")
        time.sleep(2)
    except requests.exceptions.RequestException as e:
        print(f"Image {img_name} - {img_url} failed to download!")
        # One record per line; main2() splits each record on the '-下载失败' marker
        with open(f'{path}/imgspider.txt','a+') as f:
            f.write(f'{img_url},{img_name},{path}-下载失败，错误代码：{e}！\n')

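def main2():
    img_paths = get_pmimgspider()
    for img_path in img_paths:
        print(img_path)
        with open(img_path) as f:
            datas = f.readlines()
        print(datas)

        # Retry the records of this log file. The loop has to sit inside the
        # file loop; otherwise only the last file's records are processed.
        for data in datas:
            if '-下载失败' not in data:
                continue  # skip blank or malformed lines
            img_data=data.split('-下载失败')[0]
            img_url=img_data.split(',')[0]
            img_name = img_data.split(',')[1]
            path = img_data.split(',')[2]
            print(img_name,img_url,path)
            get_img(img_name, img_url, path)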


if __name__=="__main__":
    main2()
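
One thing to watch when running the backfill more than once: a download that fails again is appended to the same imgspider.txt, so the same URL can show up several times. A possible way to avoid retrying duplicates is sketched below. The helper name load_unique_records and its encoding parameter are assumptions of mine, and it treats everything before the first comma as the URL.

```python
def load_unique_records(log_paths, encoding=None):
    """Read failure records from several log files, keeping one per image URL.

    encoding=None uses the platform default, matching a plain open(path).
    """
    seen_urls = set()
    records = []
    for log_path in log_paths:
        with open(log_path, encoding=encoding) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                img_url = line.split(",", 1)[0]
                if img_url not in seen_urls:
                    seen_urls.add(img_url)
                    records.append(line)
    return records
```

The returned lines keep their original order and format, so they can be split into URL, name and path the same way as before.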