Abstract: Setting up a local MySQL server, installing pymysql, creating a table, and operating MySQL from Python are all fairly simple; with a little database background you can get started right away. Just don't forget to commit at the end, otherwise the data is only cached and never written to the database. A complete example crawls the hottest few news titles on Baidu and stores them in the database.
Preparation
Start the local database server and connect as root
mysql -u root -p
Install pymysql
pip install pymysql
Create the table
CREATE DATABASE crawls;
-- SHOW DATABASES;
USE crawls;
CREATE TABLE IF NOT EXISTS baiduNews(
    id INT PRIMARY KEY NOT NULL AUTO_INCREMENT,
    ranking VARCHAR(30),
    title VARCHAR(60),
    datetime TIMESTAMP,
    hot VARCHAR(30));
-- SHOW TABLES;
Connect to the database with pymysql
db = pymysql.connect(host="localhost", port=3306, user="root",
                     passwd="123456", db="crawls", charset="utf8")
cursor = db.cursor()
cursor.execute(sql_query)
db.commit()
Operating MySQL from Python is fairly simple; with a little database background you can pick it up right away. Just don't forget to call commit() at the end, otherwise the data only sits in the pending transaction and never makes it into the database.
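To see why the commit matters, here is a minimal sketch, assuming the crawls database and baiduNews table created above: pymysql opens connections with autocommit turned off, so an INSERT stays in a pending transaction until db.commit() is called, and is rolled back if the connection closes first.

import pymysql

# Minimal sketch; assumes the crawls database and baiduNews table exist
db = pymysql.connect(host="localhost", port=3306, user="root",
                     passwd="123456", db="crawls", charset="utf8")
try:
    cursor = db.cursor()
    # NOW() fills the TIMESTAMP column; the other values are sample data
    cursor.execute(
        "INSERT INTO baiduNews(ranking, title, datetime, hot) "
        "VALUES (%s, %s, NOW(), %s)",
        ("1", "a test title", "100"))
    db.commit()  # without this line the row is discarded when the connection closes
finally:
    db.close()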
Complete example: crawl the hottest few news titles on Baidu and store them in the database (make sure the local MySQL server is running).
""" Get the hottest news title on baidu page, then save these data into mysql """ import datetime import pymysql from pyquery import PyQuery as pq import requests from requests.exceptions import ConnectionError URL = "https://www.baidu.com/s?wd=%E7%83%AD%E7%82%B9" headers = { "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36", "Upgrade-Insecure-Requests": "1" } def get_html(url): try: response = requests.get(url, headers=headers) if response.status_code == 200: return response.text return None except ConnectionError as e: print(e.args) return None def parse_html(html): doc = pq(html) trs = doc(".FYB_RD table.c-table tr").items() for tr in trs: index = tr("td:nth-child(1) span.c-index").text() title = tr("td:nth-child(1) span a").text() hot = tr("td:nth-child(2)").text().strip(""") yield { "index":index, "title":title, "hot":hot } def save_to_mysql(items): try: db = pymysql.connect(host="localhost", port=3306, user="root", passwd="123456", db="crawls", charset="utf8") cursor = db.cursor() cursor.execute("use crawls;") cursor.execute("CREATE TABLE IF NOT EXISTS baiduNews(" "id INT PRIMARY KEY NOT NULL AUTO_INCREMENT," "ranking VARCHAR(30)," "title VARCHAR(60)," "datetime TIMESTAMP," "hot VARCHAR(30));") try: for item in items: print(item) now = datetime.datetime.now() now = now.strftime("%Y-%m-%d %H:%M:%S") sql_query = "INSERT INTO baiduNews(ranking, title, datetime, hot) VALUES ("%s", "%s", "%s", "%s")" % ( item["index"], item["title"], now, item["hot"]) cursor.execute(sql_query) print("Save into mysql") db.commit() except pymysql.MySQLError as e: db.rollback() print(e.args) return except pymysql.MySQLError as e: print(e.args) return def check_mysql(): try: db = pymysql.connect(host="localhost", port=3306, user="root", passwd="123456", db="crawls", charset="utf8") cursor = db.cursor() cursor.execute("use crawls;") sql_query = "SELECT * FROM baiduNews" results = cursor.execute(sql_query) print(results) except pymysql.MySQLError as e: print(e.args) def main(): html = get_html(URL) items = parse_html(html) save_to_mysql(items) #check_mysql() if __name__ == "__main__": main()