Scraping with Selenium and uploading the data to a database

I. Preparation

1.1 Installing the software

Install Python and Google Chrome, then put chromedriver.exe in Python's Scripts folder.
On my machine the path is: C:\Users\1\AppData\Local\Programs\Python\Python37\Scripts
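A quick way to confirm the driver can be found is to look it up on PATH. This is only a sketch: it assumes the Scripts folder is on PATH (which is why placing the driver there works), and the helper name find_chromedriver is made up for illustration.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the driver executable found on PATH, or None."""
    # shutil.which also tries the Windows executable extensions (.exe etc.)
    return shutil.which(name)


if __name__ == "__main__":
    path = find_chromedriver()
    print(path or "chromedriver not found on PATH")
```

If this prints "not found", either add the folder to PATH or move the driver to a folder that is already on it.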

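1.2 Python libraries used

The Python libraries used are: time, datetime, os, selenium, pandas, pymysql, logging, twisted

Wrap pymysql in a small module of your own (the code in 2.1 imports it from sqlConnect):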

# encoding:utf-8
import pymysql.cursors




class MysqlOperation(object):
    def __init__(self, config):
        self.connection = pymysql.connect(host=config['mysql_host'],
                                          port=config['mysql_port'],
                                          user=config['mysql_user'],
                                          # a direct pymysql connection takes passwd; a connection pool takes password
                                          passwd=config['mysql_passwd'],
                                          db=config['mysql_db'],
                                          charset='utf8',
                                          cursorclass=pymysql.cursors.DictCursor
                                          )

    def read_sql(self, sql):
        with self.connection.cursor() as cursor:
            try:
                cursor.execute(sql)
                result = cursor.fetchall()
                return result
            except Exception as e:
                self.connection.rollback()  # roll back
                print('Transaction failed', e)

    def insert_sql(self, sql):
        with self.connection.cursor() as cursor:
            try:
                cursor.execute(sql)
                self.connection.commit()
            except Exception as e:
                self.connection.rollback()  # roll back
                print('Transaction failed', e)

    def update_sql(self, sql):
        # sql_update ="update user set username = '%s' where id = %d"

        with self.connection.cursor() as cursor:
            try:
                cursor.execute(sql)  # run the SQL statement
                # commit
                self.connection.commit()
            except Exception as e:
                # roll back on error
                self.connection.rollback()
                print('Transaction failed', e)

    def delect_sql(self, sql_delete):
        with self.connection.cursor() as cursor:
            try:
                cursor.execute(sql_delete)  # run the SQL statement
                # commit
                self.connection.commit()
            except Exception as e:
                # roll back on error
                self.connection.rollback()
                print('Transaction failed', e)

    def read_one(self, sql):
        with self.connection.cursor() as cursor:
            try:
                cursor.execute(sql)
                result = cursor.fetchone()
                return result
            except Exception as e:
                self.connection.rollback()  # roll back
                print('Transaction failed', e)

    def reConnect(self):
        # ping(reconnect=True) checks the connection and re-opens it if it has dropped
        self.connection.ping(reconnect=True)

II. Writing the code

2.1 Common code

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.keys import Keys
import time
import datetime
import os
import pandas as pd
from sqlConnect import MysqlOperation
import math


# Configure the browser
options = Options()
download_path = r"E:\splider\pk"
options.add_experimental_option("prefs", {
    "download.default_directory": download_path,
    "download.prompt_for_download": False,
    "download.directory_upgrade": True,
    "safebrowsing.enabled": True
})

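# Configure the database (fill in your own connection details)
config = {'mysql_host': '',
          'mysql_port': 3306,  # placeholder, MySQL's default; use your own port (an int)
          'mysql_user': '',
          'mysql_passwd': '',
          'mysql_db': ''
          }

mysql = MysqlOperation(config=config)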



# Clear the download folder up front so leftover files cannot cause mix-ups
file_list = os.listdir(download_path)
for file in file_list:
    file_path = os.path.join(download_path, file)
    if os.path.isfile(file_path):
        os.remove(file_path)
print('#Cleared')

# Convert a percentage string to a decimal
def judge_percent(x):
    if isinstance(x, str):
        if '%' in x:
            return round(float(x[:-1]) / 100, 2)
        else:
            return x
    else:
        return x

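def rename(download_path, start_time, end_time, province='all_country'):
    # Rename the downloaded file
    number = 0
    original_file_path = 'old_name'  # placeholder: path of the downloaded file
    rename_file_path = 'new_name'    # placeholder: path to rename it to
    # Poll every 10 s, at most 5 times, until the download shows up
    while not os.path.exists(original_file_path) and number < 5:
        time.sleep(10)
        number = number + 1
    os.rename(original_file_path, rename_file_path)
    print("Rename complete")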

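Notes:

The browser is configured in order to change the download location: download_path is the custom download directory.

The database configuration holds the details used to connect to the database.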

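The folder is cleared up front because files already in it could interfere with the newly downloaded data. Here I delete everything; you can also delete only specific files.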
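Deleting only specific files could look like the sketch below. The helper name clear_matching and the "*.xlsx" pattern are assumptions for illustration; use whatever pattern matches the files your download produces.

```python
import glob
import os


def clear_matching(folder, pattern="*.xlsx"):
    """Delete only the files in `folder` whose names match `pattern`.

    Returns the sorted names of the files that were removed.
    """
    removed = []
    for path in glob.glob(os.path.join(folder, pattern)):
        if os.path.isfile(path):  # skip sub-folders that happen to match
            os.remove(path)
            removed.append(os.path.basename(path))
    return sorted(removed)
```

Files that do not match the pattern are left untouched, so the folder can be shared with other data.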

The downloaded file is renamed in order to replace its Chinese name with an English one.
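A variant of the rename step that fails loudly when the file never arrives might look like this. It is a sketch, not the code used above: the function name wait_and_rename and its timeout and poll parameters are made up, and it assumes the final file name only appears in the folder once the download has finished.

```python
import os
import time


def wait_and_rename(folder, old_name, new_name, timeout=50, poll=1.0):
    """Wait until `old_name` exists in `folder`, then rename it to `new_name`.

    Raises TimeoutError if the file has not appeared after `timeout` seconds.
    """
    src = os.path.join(folder, old_name)
    dst = os.path.join(folder, new_name)
    deadline = time.monotonic() + timeout
    while not os.path.exists(src):
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{old_name} did not appear within {timeout}s")
        time.sleep(poll)
    os.replace(src, dst)  # overwrites an existing file with the new name
    return dst
```

Raising an explicit timeout makes a failed download easier to spot than the FileNotFoundError that os.rename would raise on a missing file.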

2.2 Simulating the download with Selenium

I locate elements by XPath.

bro = webdriver.Chrome(options=options)
bro.implicitly_wait(10)   # implicit wait: up to 10 s for elements to appear
time.sleep(10)            # unconditional 10 s pause

Sometimes, because of network speed and similar factors, the element an XPath points to loads slowly, so you need to wait for it.

a = bro.find_element(By.XPATH, '')  # find_element_by_xpath was removed in Selenium 4
a.click()
a.clear()
a.send_keys('text to type')
a.send_keys(Keys.ENTER)  # simulate pressing the Enter key

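While scraping you sometimes need to type text into a field and then either click a button or press Enter.

In that case, use Selenium to simulate the Enter key.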

2.3 Renaming

After the download finishes, rename the file.

2.4 Working with the database

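Once the data has been downloaded, it needs to be uploaded to the database.

Remember to delete before uploading. For example, if the downloaded data covers the 1st to the 5th, first delete the rows for the 1st to the 5th from the database, then upload.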
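The delete-then-upload step can be sketched as a single transaction, so a failed insert does not leave the date range empty. Everything specific here is an assumption: the table daily_data, its columns and the sample dates are hypothetical, and sqlite3 stands in for MySQL only so the sketch runs without a server. With pymysql the placeholder is %s rather than ?.

```python
import sqlite3


def replace_date_range(conn, start, end, rows):
    """Delete rows dated start..end, then insert `rows`, as one transaction."""
    try:
        cur = conn.cursor()
        cur.execute("DELETE FROM daily_data WHERE day BETWEEN ? AND ?", (start, end))
        cur.executemany("INSERT INTO daily_data (day, value) VALUES (?, ?)", rows)
        conn.commit()
    except Exception:
        conn.rollback()  # undo the delete if the insert failed
        raise


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_data (day TEXT, value REAL)")
conn.executemany("INSERT INTO daily_data VALUES (?, ?)",
                 [("2019-11-01", 1.0), ("2019-11-06", 6.0)])
conn.commit()

# Re-upload the 1st to the 5th; the row for the 6th must survive.
replace_date_range(conn, "2019-11-01", "2019-11-05",
                   [("2019-11-01", 10.0), ("2019-11-02", 20.0)])
result = conn.execute("SELECT day, value FROM daily_data ORDER BY day").fetchall()
print(result)
```

Passing the dates as parameters instead of formatting them into the SQL string also avoids quoting problems.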

If the data set is small, it can be uploaded in one go.

If the data set is fairly large, it can be uploaded in batches.
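Splitting the rows into batches can be sketched as follows. The helper name iter_batches and the batch size are illustrative; a plain list is used here, and a pandas DataFrame could be sliced the same way with .iloc.

```python
import math


def iter_batches(rows, batch_size):
    """Yield consecutive slices of `rows`, each with at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    n_batches = math.ceil(len(rows) / batch_size)
    for i in range(n_batches):
        yield rows[i * batch_size:(i + 1) * batch_size]


rows = list(range(10))
batches = list(iter_batches(rows, 4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Each batch would then go through one insert and one commit, which keeps individual statements and transactions small.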

Original article: https://www.cnblogs.com/qianslup/p/11805514.html