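Crawling two- or three-level links with Selenium
Date: 2019-01-11
This article walks through crawling pages that sit two or three levels of links deep, using Selenium together with requests and lxml, with a worked example and a few practical points to watch for.
Today's example uses Maoyan cinemas:
The goal is to crawl the information for every cinema in each district listed on the site.
url: https://maoyan.com/cinemas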
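import time
from urllib import parse

import requests
from lxml import etree
from selenium import webdriver

# Written for Selenium 3; PhantomJS support was removed in Selenium 4,
# where headless Chrome or Firefox is the usual replacement.
# The executable path below is specific to the author's machine.
dirver = webdriver.PhantomJS(executable_path=r'D:\ysc桌面\Desktop\phantomjs-2.1.1-windows\bin\phantomjs.exe')
# dirver = webdriver.Chrome()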
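# Proxy IPs (requests looks proxies up by the lowercase URL scheme, so the keys must be lowercase)
proxy = {
    "http": "113.3.152.88:8118",
    "https": "219.234.5.128:3128",
}
# Spoofed request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
}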
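One detail worth checking in the proxy settings: requests selects a proxy by the lowercase scheme of the target URL, so uppercase keys such as "HTTP" are never matched and the request goes out without a proxy. A minimal sketch of normalizing such a mapping follows; `normalize_proxy_keys` is a hypothetical helper for illustration, not part of requests.

```python
def normalize_proxy_keys(proxies):
    """Return a copy of a proxies mapping with lowercase scheme keys.

    Hypothetical helper: it only rewrites the keys and leaves the
    proxy addresses untouched.
    """
    return {scheme.lower(): address for scheme, address in proxies.items()}


raw_proxy = {
    "HTTP": "113.3.152.88:8118",
    "HTTPS": "219.234.5.128:3128",
}
fixed_proxy = normalize_proxy_keys(raw_proxy)
print(fixed_proxy)
```

The resulting mapping can be passed as `proxies=` to `requests.get` in place of the original one.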
# Target address
base_url = "https://maoyan.com/cinemas"
# Fetch the page
response = requests.get(url=base_url, headers=headers, proxies=proxy)
html = response.content.decode("utf-8")
# Save a local copy of the page for inspection
with open("maoyan.html", "w", encoding="utf-8") as fb:
    fb.write(html)
# Parse the HTML into an element tree with etree.HTML
html_tree = etree.HTML(html)
# The filter rows hold the brand ids, district ids and special-hall ids
li_tree = html_tree.xpath('//ul[@class="tags-lines"]/li')
# Brand ids. The union XPath returns nodes in document order, so the
# data-id attribute comes first and the link text last; [1:] skips the
# first item in the row.
brandId_dict = {}
for i in li_tree[0].xpath('./ul/li')[1:]:
    brand = i.xpath('./a/text()|./a/@data-id')
    brandId_dict[brand[-1]] = brand[0]
# Special-hall ids
hallType_dict = {}
for k in li_tree[-1].xpath('./ul/li')[1:]:
    hallType = k.xpath('./a/text()|./a/@data-id')
    hallType_dict[hallType[-1]] = hallType[0]
# District ids
districtId_dict = {}
for j in li_tree[1].xpath('./ul/li')[1:]:
    district = j.xpath('./a/text()|./a/@data-id')
    districtId_dict[district[-1]] = district[0]
# print(brandId_dict)
# print(hallType_dict)
# print(districtId_dict)
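The three loops above share one pattern: read each link's text and its `data-id` from a filter row. A minimal sketch that reads the two values explicitly, rather than relying on the order of the union XPath result, might look like this. `extract_options` and the sample HTML are illustrative assumptions, not the real Maoyan markup.

```python
from lxml import etree


def extract_options(row):
    """Map link text to data-id for every option in one filter row,
    skipping the first item as the original loops do."""
    options = {}
    for li in row.xpath('./ul/li')[1:]:
        for link in li.xpath('./a'):
            name = (link.text or "").strip()
            data_id = link.get("data-id")
            # Keep only links that carry both a label and an id
            if name and data_id is not None:
                options[name] = data_id
    return options


# Made-up markup that mimics the shape the script expects
sample_html = """
<ul class="tags-lines">
  <li><ul>
    <li><a data-id="-1">All</a></li>
    <li><a data-id="101">Brand A</a></li>
    <li><a data-id="102">Brand B</a></li>
  </ul></li>
</ul>
"""
rows = etree.HTML(sample_html).xpath('//ul[@class="tags-lines"]/li')
print(extract_options(rows[0]))
```

With a helper like this, the brand, special-hall and district dictionaries could each be built with a single call on the matching row.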
# Select the cinema brand and hall type first; the district is chosen last
for brandId in brandId_dict.values():
    for hallType in hallType_dict.values():
        data = {
            'brandId': brandId,
            'hallType': hallType,
        }
        # response = requests.get(url=base_url, params=data)
        data_str = parse.urlencode(data)
        new_url = base_url + "?" + data_str
        dirver.get(new_url)
        n = 1
        for districtId in districtId_dict.keys():
            # Find the district link and click it
            print(districtId)
            dirver.find_element_by_link_text(districtId).click()
            time.sleep(1)
            # Look for the first-level areas under the district
            if dirver.page_source.find("float-filter") == -1:
                continue
            filter_tree = etree.HTML(dirver.page_source)
            oneplace = filter_tree.xpath('//div[@class="float-filter"]/ul[@class="tags"]/li/a/text()')[1:]
            one_url = filter_tree.xpath('//div[@class="float-filter"]/ul[@class="tags"]/li/a/@href')[1:]
            if n != 1:
                for oneurl in one_url:
                    new_one = base_url + oneurl
                    print(new_one)
                    # Fetch the first-level area page directly
                    two_res = requests.get(url=new_one, headers=headers, proxies=proxy)
                    info_tree = etree.HTML(two_res.content.decode("utf-8"))
                    info_total = info_tree.xpath('//div[@class="cinema-info"]/a/text()|//div[@class="cinema-info"]/p/text()')
                    if info_total:
                        print("Cinema name: " + info_total[0] + " " + info_total[-1])
                        with open("maoyanyingyuan.txt", "a", encoding="utf-8") as fb:
                            fb.write("Cinema name: " + info_total[0] + " " + info_total[-1] + "\n")
                    else:
                        print("No cinema information for this area yet")
                continue
            n += 1
            oneplace_dict = {}
            for one in oneplace:
                # Find the first-level area link and click it
                print(one)
                dirver.find_element_by_link_text(one).click()
                if dirver.page_source.find("station-tags") == -1:
                    oneplace_dict[one] = ""
                    continue
                time.sleep(1)
                two_tree = etree.HTML(dirver.page_source)
                # twoplace = two_tree.xpath('//div[@class="float-filter"]/ul[@class="tags station-tags"]/li/a/text()')[1:]
                href_url = two_tree.xpath('//div[@class="float-filter"]/ul[@class="tags station-tags"]/li/a/@href')[1:]
                # print(href_url)
                for two_url in href_url:
                    two_url = base_url + two_url
                    print(two_url)
                    # Fetch the second-level area page
                    two_res = requests.get(url=two_url, headers=headers, proxies=proxy)
                    info_tree = etree.HTML(two_res.content.decode("utf-8"))
                    info_total = info_tree.xpath('//div[@class="cinema-info"]/a/text()|//div[@class="cinema-info"]/p/text()')
                    if info_total:
                        print("Cinema name: " + info_total[0] + " " + info_total[-1])
                        with open("maoyanyingyuan.txt", "a", encoding="utf-8") as fb:
                            fb.write("Cinema name: " + info_total[0] + " " + info_total[-1] + "\n")
                    else:
                        print("No cinema information for this area yet")
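The two outermost loops build one filtered URL per brand and hall-type pair before the browser opens it. A minimal sketch of that step in isolation, assuming the same `brandId` and `hallType` query parameters, is shown below; `build_filter_urls` and the sample ids are hypothetical.

```python
from itertools import product
from urllib.parse import urlencode

BASE_URL = "https://maoyan.com/cinemas"


def build_filter_urls(brand_ids, hall_types, base_url=BASE_URL):
    """Yield one URL for every (brandId, hallType) combination."""
    for brand_id, hall_type in product(brand_ids, hall_types):
        query = urlencode({"brandId": brand_id, "hallType": hall_type})
        yield f"{base_url}?{query}"


# Sample ids are made up; the real ones come from the filter rows
for url in build_filter_urls(["101", "102"], ["1", "2"]):
    print(url)
```

Generating the URLs up front keeps the crawl loop flat and makes it easy to check the full list of combinations before starting the browser.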