[Python] Crawling all images under a Zhihu question and removing their watermarks
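Getting the URL
Open the page of a Zhihu question, press F12 to open the developer tools, and switch to the Network panel.
The Network panel shows the resources the page requests from the server, the size of each resource, how long each took to load, and which ones failed to load. It also lets you inspect the HTTP request headers and the returned content.
Take the question "你有哪些可爱的猫猫照片?" ("What cute cat photos do you have?") as an example and open its Network panel.
Press Ctrl + F and search the panel for a piece of text that appears in one of the answers; this locates the target URL and its response.
Install the required packages. Most of them are straightforward; the one to watch is cv2, Python's image-processing package, which is installed with: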
pip install opencv-python
URL analysis
1. Parameter analysis
The URL we just obtained is:
https://www.zhihu.com/api/v4/questions/356541789/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cvip_info%2Cbadge%5B*%5D.topics%3Bdata%5B*%5D.settings.table_of_content.enabled&offset=&limit=3&sort_by=default&platform=desktop
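It contains the following parameters:
- limit: the number of answers returned per page
- offset: the offset of the page, i.e. how many answers to skip before the first one returned
- sort_by: how the answers are sorted; the default order and ordering by time are supported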
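To check these parameters without reading the long URL by eye, the query string can be split with the standard library. This is only an illustrative sketch: `captured_url` is a shortened stand-in for the URL above (its `include` value is abbreviated for readability), and `query_params` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened stand-in for the captured URL; the real `include` value is much longer.
captured_url = (
    "https://www.zhihu.com/api/v4/questions/356541789/answers"
    "?include=data%5B*%5D.content&offset=&limit=3&sort_by=default&platform=desktop"
)


def query_params(url):
    """Return the query parameters of `url` as a plain dict of strings."""
    query = urlsplit(url).query
    # keep_blank_values=True keeps `offset=` even though its value is empty
    parsed = parse_qs(query, keep_blank_values=True)
    return {key: values[0] for key, values in parsed.items()}


params = query_params(captured_url)
print(params["limit"], repr(params["offset"]), params["sort_by"])
```

Note that `parse_qs` also percent-decodes the values, so `include` comes back as `data[*].content`.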
2. Parsing the response
Try sending a request and capturing the HTTP response:
# python3
import requests
import json

if __name__ == '__main__':
    target_url = "https://www.zhihu.com/api/v4/questions/356541789/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cvip_info%2Cbadge%5B*%5D.topics%3Bdata%5B*%5D.settings.table_of_content.enabled&offset=&limit=3&sort_by=default&platform=desktop"
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    }
    response = requests.get(url=target_url, headers=headers)
    html = response.text
    print(html)
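In the response we get back, what we need is the link of every image. After running the returned JSON through a JSON tool, we can see where the images sit in it; the crawler then only has to parse out the download addresses.
Tip: a site's response format changes frequently, and different sites organize their responses differently, so do not copy this blindly.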
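As a rough illustration of "finding where the images sit", the sketch below walks a response-shaped JSON document and collects the `src` of every `<img>` inside a `<noscript>` element, which is the same location the crawler code later in this article reads. It uses only the standard library instead of pyquery so that it runs on its own. The `sample` payload is invented for the example; only the `data` and `content` field names come from the article, and `NoscriptImageParser` and `image_urls` are hypothetical names.

```python
import json
from html.parser import HTMLParser


class NoscriptImageParser(HTMLParser):
    """Collect the src of every <img> nested inside a <noscript> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of <noscript> elements we are currently inside
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self.depth += 1
        elif tag == "img" and self.depth > 0:
            src = dict(attrs).get("src")
            if src:
                self.urls.append(src)

    def handle_endtag(self, tag):
        if tag == "noscript" and self.depth > 0:
            self.depth -= 1


def image_urls(response_text):
    """Return the image URLs found in every answer of a JSON response."""
    payload = json.loads(response_text)
    urls = []
    for answer in payload.get("data", []):
        parser = NoscriptImageParser()
        parser.feed(answer.get("content", ""))
        parser.close()
        urls.extend(parser.urls)
    return urls


# Invented payload: one answer with an image, one answer without.
sample = json.dumps({
    "data": [
        {"id": 1, "content": '<p>cat</p><figure><noscript>'
                             '<img src="https://example.com/cat_720w.jpg?source=abc"/>'
                             '</noscript><img src="data:image/svg+xml;utf8,placeholder"/></figure>'},
        {"id": 2, "content": "<p>no pictures here</p>"},
    ]
})
print(image_urls(sample))
```

The `<img>` outside `<noscript>` in the sample is deliberately ignored, mirroring the `find('noscript').find('img')` selection used later.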
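3. Getting the URLs of all answers
Using the same trick of searching for answer keywords in the developer tools, we can collect the URLs behind several answers and look for the pattern in them: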
https://www.zhihu.com/api/v4/questions/356541789/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=3&offset=3&platform=desktop&sort_by=default
https://www.zhihu.com/api/v4/questions/356541789/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=3&offset=0&platform=desktop&sort_by=default
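Although the URLs are not formatted identically, I found that they basically follow the format below, and only the offset parameter needs to change: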
https://www.zhihu.com/api/v4/questions/356541789/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=3&offset=0&platform=desktop&sort_by=default
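The "only offset changes" pattern can be sketched as a small generator that builds one URL per page. The `include` value is shortened to `data[*].content` to keep the example readable; whether the API accepts such a short list is an assumption, so use the full value from the URL above in practice. `page_urls` and `API_TEMPLATE` are hypothetical names.

```python
from urllib.parse import urlencode

API_TEMPLATE = "https://www.zhihu.com/api/v4/questions/{number}/answers"


def page_urls(number, limit, max_pages, include="data[*].content"):
    """Yield one answers URL per page; only `offset` differs between pages."""
    base = API_TEMPLATE.format(number=number)
    for page in range(max_pages):
        query = urlencode({
            "include": include,        # percent-encoded by urlencode
            "limit": limit,
            "offset": page * limit,    # 0, limit, 2 * limit, ...
            "platform": "desktop",
            "sort_by": "default",
        })
        yield f"{base}?{query}"


for url in page_urls(356541789, limit=3, max_pages=3):
    print(url)
```

With `limit=3` the offsets are 0, 3 and 6, which matches the two captured URLs above (offset=0 and offset=3).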
Code
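1. Simulating the request
Adding request headers is enough. Zhihu's checks are less strict than those of other sites, but if you request too frequently, access is restricted for a while. Here I simply use random request headers and proxy IPs to deal with that: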
import json

import requests
from loguru import logger  # assumed: the "{}" placeholder logging style used below matches loguru

# fake_useragent.get_random_useragent() and IPPool are helpers used by the article
# but not shown in it; `limit` is a module-level setting (answers per page).


def get_http_content(number, offset):
    """Request one page of answers for a Zhihu question and return the parsed JSON.
    Args:
        number: unique identifier of the Zhihu question
        offset: offset of the page
    """
    target_url = "https://www.zhihu.com/api/v4/questions/{number}/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2" \
                 "Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2" \
                 "Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2" \
                 "Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2" \
                 "Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5" \
                 "D.author.follower_count%2Cbadge%5B*%5D.topics&offset={offset}&limit={limit}&sort_by=default&platform=desktop".format(
                     number=number, offset=offset, limit=limit)
    logger.info("target_url:{}", target_url)
    headers = {
        'User-Agent': fake_useragent.get_random_useragent(),
    }
    ip = IPPool().get_random_key()
    # Only the "http" scheme is mapped here, so this mapping does not apply to https:// URLs
    proxies = {"http": "http://" + ip}
    response = requests.get(target_url, headers=headers, proxies=proxies)
    # requests.get raises on connection errors rather than returning None,
    # so only the status code needs checking
    if response.status_code != 200:
        logger.warning("http request failed, number={}, offset={}, status_code={}".format(
            number, offset, response.status_code))
        return None
    html = response.text
    return json.loads(html)
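Because access is restricted for a while when requests come too fast, one common complement to random headers and proxies is to wait and retry with a growing delay. The sketch below is not part of the original program; `fetch_with_backoff` is a hypothetical helper that wraps any zero-argument callable returning `None` on failure, which is how `get_http_content` reports a failed request. The stand-in `flaky` function only exists to make the example runnable without network access.

```python
import time


def fetch_with_backoff(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-None result or attempts run out.

    The wait doubles after each failure: base_delay, 2 * base_delay, 4 * base_delay, ...
    """
    for attempt in range(max_attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_attempts - 1:   # no point in sleeping after the last try
            sleep(base_delay * (2 ** attempt))
    return None


# Demonstration with a stand-in fetch that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    return None if calls["n"] < 3 else {"data": []}


waits = []  # record the delays instead of actually sleeping
result = fetch_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

In the crawler this could be used as `fetch_with_backoff(lambda: get_http_content(number, offset))`; the attempt count and base delay are arbitrary starting values.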
2. Parsing out the image URLs
import time

from pyquery import PyQuery as pq

# number, limit and max_pages are module-level settings: the question id,
# the answers per page and the number of pages to crawl.


def start_crawl():
    """Crawl the answers page by page and download their images.
    """
    for i in range(0, max_pages):
        offset = limit * i
        logger.info("download pictures with offset {}".format(offset))
        # Fetch the parsed JSON for this page
        content_dict = get_http_content(number, offset)
        if content_dict is None:
            logger.error(
                "get http resp fail, number={} offset={}", number, offset)
            continue
        # content_dict['data'] holds the list of answers
        if 'data' not in content_dict:
            logger.error("parse data from http resp fail, dict={}", content_dict)
            continue
        for answer_text in content_dict['data']:
            logger.info(
                "get pictures from answer: https://www.zhihu.com/question/{}/answer/{}", number, answer_text['id'])
            if 'content' not in answer_text:
                logger.error(
                    "parse content from answer text fail, text={}", answer_text)
                continue
            answer_content = pq(answer_text['content'])
            img_urls = answer_content.find('noscript').find('img')
            # Log answers that contain no pictures, to make debugging easier
            if len(img_urls) == 0:
                logger.warning(
                    "this answer has no pictures, url:https://www.zhihu.com/question/{}/answer/{}", number, answer_text['id'])
                continue
            for img_url in img_urls.items():
                # Example src: https://pic2.zhimg.com/50/v2-c970108cd260ea095383627362c1d04f_720w.jpg?source=1940ef5c
                src = img_url.attr("src")
                # Drop the "?source=..." query string if there is one
                source_index = src.rfind('?source')
                if source_index == -1:
                    logger.warning("no '?source' found, src:{}", src)
                    suffix = src
                else:
                    suffix = src[0:source_index]
                # Keep only the file extension, e.g. ".jpg" or ".gif"
                suffix_index = suffix.rfind('.')
                if suffix_index == -1:
                    logger.error("find suffix fail, src:{} suffix_index:{}",
                                 src, suffix_index)
                    continue
                suffix = suffix[suffix_index:]
                logger.info("get picture url, src:{} suffix:{}", src, suffix)
                store_picture(src, suffix)
        # Pause between pages to limit the request rate
        time.sleep(1)
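The extension can also be taken from the URL with the standard library's URL parsing, which avoids hand-written index arithmetic and works whether or not a query string is present. This is an alternative sketch rather than the article's method; `image_suffix` is a hypothetical helper, and falling back to `.jpg` when no extension is found is an assumption.

```python
import posixpath
from urllib.parse import urlsplit


def image_suffix(src, default=".jpg"):
    """Return the file extension of an image URL, ignoring any query string."""
    path = urlsplit(src).path              # drops "?source=..." if present
    suffix = posixpath.splitext(path)[1]   # "" when the path has no extension
    return suffix if suffix else default


example = "https://pic2.zhimg.com/50/v2-c970108cd260ea095383627362c1d04f_720w.jpg?source=1940ef5c"
print(image_suffix(example))
```

Only the path component is inspected, so dots in the host name (`pic2.zhimg.com`) cannot be mistaken for an extension.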
3. Saving the images locally
import uuid


def store_picture(img_url, suffix):
    """Save an image into the picture folder.
    Args:
        img_url: URL of the image
        suffix: file extension of the image, e.g. '.jpg' or '.gif'
    """
    headers = {
        'User-Agent': fake_useragent.get_random_useragent(),
    }
    ip = IPPool().get_random_key()
    proxies = {"http": "http://" + ip}
    http_resp = requests.get(img_url, headers=headers, proxies=proxies)
    if http_resp.status_code != 200:
        logger.warning("get http resp fail, url={} http_resp={}",
                       img_url, http_resp)
        return
    content = http_resp.content
    # picture_path is a module-level setting: the folder the images are saved to.
    # A random UUID is used as the file name so downloads never overwrite each other.
    with open(f"{picture_path}/{uuid.uuid4()}{suffix}", 'wb') as f:
        f.write(content)
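`store_picture` expects the target folder to exist already. A small sketch of the same write step that creates the folder on first use and returns the saved path is shown below; `save_bytes` is a hypothetical helper, and the temporary directory and fake bytes are only there so the example runs without downloading anything.

```python
import os
import tempfile
import uuid


def save_bytes(content, directory, suffix):
    """Write `content` to a new uniquely named file in `directory`; return its path."""
    os.makedirs(directory, exist_ok=True)   # create the folder if it is missing
    path = os.path.join(directory, f"{uuid.uuid4()}{suffix}")
    with open(path, "wb") as f:
        f.write(content)
    return path


with tempfile.TemporaryDirectory() as tmp:
    saved = save_bytes(b"fake image bytes", os.path.join(tmp, "pictures"), ".jpg")
    print(os.path.basename(saved), os.path.getsize(saved))
```

Returning the path makes it easy to log where each image ended up or to hand the file to the cropping step.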
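4. Removing the watermark
I originally planned to use image recognition to cut the watermark out (Zhihu's watermark is fairly simple and uniform in style), but I have had a lot to deal with lately, so I simply crop the image with the opencv package instead: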
import os

import cv2


def crop_watermark(ori_dir, adjusted_dir):
    """Remove the watermark by cropping the image. Note: GIF images cannot be handled.
    Args:
        ori_dir: folder containing the original images
        adjusted_dir: folder in which to store the cropped images
    """
    img_path_list = os.listdir(ori_dir)  # all files in the folder
    total = len(img_path_list)
    # enumerate keeps the progress counter correct even when an image is skipped
    for cnt, img_path in enumerate(img_path_list, start=1):
        logger.info(
            "the overall process::{}/{}, now handle the picture:{}", cnt, total, img_path)
        img_abs_path = os.path.join(ori_dir, img_path)
        img = cv2.imread(img_abs_path)
        if img is None:
            logger.error("cv2.imread fail, picture:{}", img_path)
            continue
        height, width = img.shape[0:2]
        # Cut off the bottom 40 pixel rows to remove the watermark
        cropped = img[0:height-40, 0:width]
        adjusted_img_abs_path = os.path.join(adjusted_dir, img_path)
        cv2.imwrite(adjusted_img_abs_path, cropped)
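The crop in `crop_watermark` assumes every image is taller than 40 pixels. With `height == 40` the slice `img[0:height-40]` is empty, and with a smaller height the negative end index silently keeps the wrong rows. A minimal sketch of the checks one might run before cropping is given below; `crop_rows` and `is_croppable` are hypothetical helpers, and the 40-pixel strip is the value used in the article's code.

```python
import os

WATERMARK_HEIGHT = 40  # pixel rows cut from the bottom, as in crop_watermark


def crop_rows(height, strip=WATERMARK_HEIGHT):
    """Return how many rows to keep, or None if the image is too short to crop."""
    keep = height - strip
    return keep if keep > 0 else None


def is_croppable(filename):
    """Skip GIFs, which the article says this cropping approach cannot handle."""
    return os.path.splitext(filename)[1].lower() != ".gif"


for name, height in [("cat.jpg", 720), ("tiny.png", 30), ("anim.gif", 300)]:
    print(name, is_croppable(name), crop_rows(height))
```

Inside the loop of `crop_watermark`, the result of `crop_rows(height)` would replace `height-40` as the slice end, and images for which it returns `None` would be skipped or copied unchanged.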
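Closing thoughts
I wrote this program mainly to learn HTML parsing and to practise Python. Looking back at it now that it is finished, there is nothing remarkable about it, so I am simply sharing the code here for everyone's reference.
Also, the only purpose of this program is to automate my own process of collecting images and removing watermarks. Once again, please do not let your crawler put load on other people's servers.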
Reference
Original article: https://www.cnblogs.com/TOMOCAT/p/15314131.html