Summary: A task is a further wrapper around a coroutine object and tracks the task's states. A future represents a task that will run later or has not run yet; in practice it is not fundamentally different from a task.
- Single-threaded and multi-threaded
- Thread pools
- Single coroutine and multiple coroutines
- The role of Referer in headers
- Using the asynchronous module aiohttp
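High-performance asynchronous crawler: use asynchronous techniques in a crawler to fetch data with high performance.
Traditional crawling is sequential: a for loop calls the get method on one url after another.
Each get call inside the for loop blocks the program, so the data for the next url can only be requested after the current response has arrived.
It follows that going asynchronous improves the crawler's data-fetching efficiency. There are two common ways to do it: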
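- Multi-threading, multi-processing
Pros: a separate thread or process can be started for each blocking operation, so the blocking operations run asynchronously
Cons: threads or processes cannot be created without limit
- Thread pools, process pools (use in moderation)
Pros: the system creates and destroys threads or processes less often, which lowers system overhead
Cons: the number of threads or processes in the pool has an upper limit.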
    import time

    # Run in a single thread, one download after another
    def get_page(name):
        print("Downloading:", name)
        time.sleep(2)  # stands in for a blocking request
        print("Download finished:", name)

    name_list = ["xiaozi", "aa", "bb", "cc"]
    start_time = time.time()
    for name in name_list:
        get_page(name)
    end_time = time.time()
    print("%d second" % (end_time - start_time))
    import time
    # Import the thread pool class
    from multiprocessing.dummy import Pool

    # Run with a thread pool
    # A Pool should be applied to blocking operations such as get_page
    start_time = time.time()

    def get_page(name):
        print("Downloading:", name)
        time.sleep(2)  # stands in for a blocking request
        print("Download finished:", name)

    name_list = ["xiaozi", "aa", "bb", "cc"]
    # Create a thread pool object
    pool = Pool(4)  # 4 worker threads
    # Pass every element of the list to get_page
    # pool.map returns a list holding get_page()'s return value for each element
    pool.map(get_page, name_list)
    end_time = time.time()
    print(end_time - start_time)
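The same idea can also be written with the standard library's concurrent.futures module. The following is only a sketch: the `download_all` helper, the shortened sleep and the returned strings are illustrative choices, not part of the original example. It shows that the pool is capped at a fixed number of workers and that `map` hands back one result per input, in input order.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def get_page(name):
    """Stand-in for a blocking download; sleeps instead of doing network I/O."""
    time.sleep(0.2)
    return "downloaded:" + name


def download_all(names, workers=4):
    # The pool never runs more than `workers` threads at the same time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields one result per input, in the same order as the inputs.
        return list(pool.map(get_page, names))


if __name__ == "__main__":
    start = time.time()
    print(download_all(["xiaozi", "aa", "bb", "cc"]))
    print("%.1f second" % (time.time() - start))
```

Because the four sleeps overlap, the total time should be close to one sleep rather than four.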
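This example crawls the Pear Video site (梨视频, https://www.pearvideo.com/).
Open the video to be crawled, press F12, and inspect the page to work out the relevant detail_url and name.
At this point you might think it is enough to grab the address with xpath and download it, but that is wrong: parsing the page with etree gives an empty result, because some of the page's data is loaded dynamically. This can be verified as follows:
Having found that the data is loaded dynamically, we need to work out where the real download link is.
Thanks to these two bloggers for the idea: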
- 404 error:
https://video.pearvideo.com/mp4/adshort/20210927/1632790695959-15774345_adpkg-ad_hd.mp4
- Correct path:
https://video.pearvideo.com/mp4/adshort/20210927/cont-1742572-15774345_adpkg-ad_hd.mp4
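In addition, repeated experiments show that part of the number in the 404 address changes on every request. In the responses printed later in this article, that part is identical to the systemTime field. The addresses the page requests look like this: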
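- 宝藏老师|数学老师的浪漫:用函数讲述自己的爱情故事 ("Treasure teacher | A math teacher's romance: telling his love story with functions"):
https://www.pearvideo.com/videoStatus.jsp?contId=1742617&mrd=0.6446715186101781
- 小伙肿瘤医院旁开共享厨房,宣传5元吃饱亏本运营 ("A young man opens a shared kitchen next to a cancer hospital, advertising a full meal for 5 yuan and running at a loss"):
https://www.pearvideo.com/videoStatus.jsp?contId=1742572&mrd=0.5091336020577668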
Comparing the two addresses, both contain contId. Going back to the data of the original video page:
Later, because of my own limited skill, I could not parse this attribute value correctly, so I changed approach:
https://www.pearvideo.com/videoStatus.jsp?contId=1742572&mrd=0.8749545784196235
https://www.pearvideo.com/videoStatus.jsp?contId=1742572&mrd=0.5510063831062151
https://www.pearvideo.com/videoStatus.jsp?contId=1742572&mrd=0.5091336020577668
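The three addresses above differ only in mrd, so an address of this shape can be assembled from a contId and a fresh random number. The sketch below rests on one assumption taken from the examples in this article: the digits in a detail-page href such as "video_1742572" are the same as the contId. The helper names `cont_id_from_href` and `build_status_url` are made up for illustration.

```python
import random
from urllib.parse import parse_qs, urlencode, urlparse

STATUS_URL = "https://www.pearvideo.com/videoStatus.jsp"


def cont_id_from_href(href):
    # Assumption: href looks like "video_1742572" and the digits are the contId.
    return href.rsplit("_", 1)[-1]


def build_status_url(cont_id):
    # mrd is a new random float in [0, 1) on every call.
    query = urlencode({"contId": cont_id, "mrd": random.random()})
    return STATUS_URL + "?" + query


if __name__ == "__main__":
    url = build_status_url(cont_id_from_href("video_1742572"))
    print(url)
    # Read the query string back to confirm both parameters are present.
    print(parse_qs(urlparse(url).query))
```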
    import requests
    from lxml import etree
    import random
    import json
    from multiprocessing.dummy import Pool

    # Goal: crawl video data from Pear Video
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36"
    }
    # Principle: a thread pool handles operations that are blocking and time-consuming
    # Request the url below and parse out each video's detail-page url and name
    url = "https://www.pearvideo.com/"
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # xpath returns a list such as ["video_1742557"]
    all_addresses = tree.xpath('//div[@id="vervideoTlist"]//a[@class="vervideo-lilink actplay"]/@href')
    # print(all_addresses)
    # ["video_1742557", "video_1742534", "video_1742545", "video_1733739", "video_1718659"]
    all_names = tree.xpath('//div[@id="vervideoTlist"]//div[@class="vervideo-name"]/text()')
    # print(all_names)
    urls = []
    for i in range(len(all_addresses)):
        video_url = "https://www.pearvideo.com/" + all_addresses[i]
        mp4_name = all_names[i] + ".mp4"
        video_page_text = requests.get(url=video_url, headers=headers).text
        video_tree = etree.HTML(video_page_text)
        # https://www.pearvideo.com/videoStatus.jsp?contId=1742572&mrd=0.8749545784196235
        contId = video_tree.xpath('//div[@class="fav"]/@data-id')[0]
        # print(contId)
        mrd = random.random()  # e.g. 0.5330239801324711
        new_headers = {
            "Referer": video_url,
            "User-Agent": headers["User-Agent"]
        }
        ajax_url = "https://www.pearvideo.com/videoStatus.jsp?contId=" + str(contId) + "&mrd=" + str(mrd)
        # The response body is a JSON string
        real_video_content = requests.get(url=ajax_url, headers=new_headers).text
        # Get the fake address; json.loads only parses JSON, whereas eval would execute the response
        false_video_url = json.loads(real_video_content)["videoInfo"]["videos"]["srcUrl"]
        # Build the real address
        old = false_video_url.split("/")[-1].split("-")[0]
        new = "cont-" + str(contId)
        true_video_url = false_video_url.replace(old, new)
        dic = {
            "name": mp4_name,
            "video_url": true_video_url
        }
        urls.append(dic)

    # Use a thread pool to request the video data
    def get_video_data(dic):
        print(dic["name"] + " download started.....")
        data_url = dic["video_url"]
        data = requests.get(url=data_url, headers=headers).content
        with open(dic["name"], "wb") as f:
            f.write(data)
        print(dic["name"] + " downloaded")

    pool = Pool(4)
    pool.map(get_video_data, urls)
    pool.close()  # close the pool so that it accepts no new tasks
    pool.join()   # block the main thread until the worker threads have finished

    # print(video_url, mp4_name):
    # https://www.pearvideo.com/video_1742572 小伙肿瘤医院旁开共享厨房,宣传5元吃饱亏本运营.mp4
    # https://www.pearvideo.com/video_1742617 宝藏老师|数学老师的浪漫:用函数讲述自己的爱情故事.mp4
    # https://www.pearvideo.com/video_1742606 每一帧都如梦境!航拍神农架绝美秋季,云雾浩渺红叶争艳.mp4
    # https://www.pearvideo.com/video_1740575 89岁浙大教师拾破烂9年,资金全部捐助贫困生.mp4
    # https://www.pearvideo.com/video_1727677 听说过章丘铁锅,听说过章丘菜刀吗?18道工序,手工锻打.mp4
random.random() returns a randomly generated real number in the range [0, 1).
    >>> import random
    >>> random.random()
    0.5330239801324711
Referer is part of the HTTP request header. When a browser sends a request to a web server, the request headers usually include Referer.
The Referer tells the server which page the request was linked from, and the server can use that information when it handles the request.
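On the client side, this simply means putting the detail-page address into the Referer header of the follow-up request. The sketch below builds such a request with the standard library's urllib.request without sending it; `build_status_request` is a hypothetical helper, the User-Agent is a shortened placeholder, and the mrd value is an arbitrary example.

```python
from urllib.request import Request

USER_AGENT = "Mozilla/5.0"  # shortened placeholder, not a full browser string


def build_status_request(ajax_url, detail_url):
    # Referer says "this request was made from the detail page".
    headers = {"Referer": detail_url, "User-Agent": USER_AGENT}
    return Request(ajax_url, headers=headers)


if __name__ == "__main__":
    req = build_status_request(
        "https://www.pearvideo.com/videoStatus.jsp?contId=1742572&mrd=0.5",
        "https://www.pearvideo.com/video_1742572",
    )
    # Nothing is sent here; the request object is only inspected.
    print(req.full_url)
    print(req.get_header("Referer"))
```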
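What is Referer used for?
- Hotlink protection
For example, if the 办事通 server only lets its own site access its static resources, the server has to check on every request whether the Referer value is zwfw.yn.gov.cn. If it is, the request goes through; if not, it is blocked.
- Blocking malicious requests
For example, if static requests end in .html and dynamic requests end in .shtml, then every *.shtml request must carry a Referer that points to my own site before it is served. That is what Referer is for.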
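The server-side check described above can be sketched as a small function. This is an illustration under assumptions, not how any particular server implements it: `referer_allowed` is a made-up name, and the comparison is an exact host match.

```python
from urllib.parse import urlparse


def referer_allowed(referer, allowed_host):
    """Return True only when the Referer header points at allowed_host."""
    if not referer:
        # A missing Referer is rejected in this sketch.
        return False
    # Compare the parsed host rather than searching the raw string, so that
    # "zwfw.yn.gov.cn.evil.example" does not pass as "zwfw.yn.gov.cn".
    return urlparse(referer).hostname == allowed_host


if __name__ == "__main__":
    print(referer_allowed("https://zwfw.yn.gov.cn/index.html", "zwfw.yn.gov.cn"))
    print(referer_allowed("https://example.com/", "zwfw.yn.gov.cn"))
```

A crawler that sets Referer to the expected page passes exactly this kind of check, which is why the Pear Video request needs the header.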
Purpose of the slicing: extract the part of the address that has to be replaced.
404 error: https://video.pearvideo.com/mp4/adshort/20210927/1632790695959-15774345_adpkg-ad_hd.mp4
Correct path: https://video.pearvideo.com/mp4/adshort/20210927/cont-1742572-15774345_adpkg-ad_hd.mp4

    # Build the real address
    old = false_video_url.split("/")[-1].split("-")[0]
    new = "cont-" + str(contId)
    true_video_url = false_video_url.replace(old, new)
    >>> url = "https://video.pearvideo.com/mp4/adshort/20210927/1632790695959-15774345_adpkg-ad_hd.mp4"
    >>> url.split("/")
    ["https:", "", "video.pearvideo.com", "mp4", "adshort", "20210927", "1632790695959-15774345_adpkg-ad_hd.mp4"]
    >>> name = url.split("/")[-1]
    >>> name
    "1632790695959-15774345_adpkg-ad_hd.mp4"
    >>> name.split("-")
    ["1632790695959", "15774345_adpkg", "ad_hd.mp4"]
    >>> name.split("-")[0]
    "1632790695959"
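The slicing steps above can be wrapped in one function. This is a sketch of a slightly stricter variant, not the code used in this article: str.replace substitutes every occurrence of the old text anywhere in the address, whereas this version rewrites only the first segment of the file name. The name `real_video_url` is made up for illustration.

```python
def real_video_url(fake_url, cont_id):
    """Swap the leading number of the file name for "cont-<cont_id>"."""
    # Split off the file name, e.g. "1632790695959-15774345_adpkg-ad_hd.mp4".
    base, filename = fake_url.rsplit("/", 1)
    # Drop everything before the first "-" and keep the rest unchanged.
    _, rest = filename.split("-", 1)
    return "{}/cont-{}-{}".format(base, cont_id, rest)


if __name__ == "__main__":
    fake = "https://video.pearvideo.com/mp4/adshort/20210927/1632790695959-15774345_adpkg-ad_hd.mp4"
    print(real_video_url(fake, "1742572"))
```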
    print("-----eval-----")
    print(eval(real_video_content))
    print("-----json.loads()-----")
    print(json.loads(real_video_content))
    -----eval-----
    {"resultCode": "1", "resultMsg": "success", "reqId": "c44f6eb1-971a-410a-9959-a92d2afeb05f", "systemTime": "1632799446578", "videoInfo": {"playSta": "1", "video_image": "https://image2.pearvideo.com/cont/20210927/cont-1742572-12624944.jpg", "videos": {"hdUrl": "", "hdflvUrl": "", "sdUrl": "", "sdflvUrl": "", "srcUrl": "https://video.pearvideo.com/mp4/adshort/20210927/1632799446578-15774345_adpkg-ad_hd.mp4"}}}
    -----json.loads()-----
    {"resultCode": "1", "resultMsg": "success", "reqId": "c44f6eb1-971a-410a-9959-a92d2afeb05f", "systemTime": "1632799446578", "videoInfo": {"playSta": "1", "video_image": "https://image2.pearvideo.com/cont/20210927/cont-1742572-12624944.jpg", "videos": {"hdUrl": "", "hdflvUrl": "", "sdUrl": "", "sdflvUrl": "", "srcUrl": "https://video.pearvideo.com/mp4/adshort/20210927/1632799446578-15774345_adpkg-ad_hd.mp4"}}}
    -----eval-----
    {"resultCode": "1", "resultMsg": "success", "reqId": "e488637c-0200-41f0-a259-9493913f1d11", "systemTime": "1632799446898", "videoInfo": {"playSta": "1", "video_image": "https://image1.pearvideo.com/cont/20210927/cont-1742617-12625072.jpg", "videos": {"hdUrl": "", "hdflvUrl": "", "sdUrl": "", "sdflvUrl": "", "srcUrl": "https://video.pearvideo.com/mp4/adshort/20210927/1632799446898-15774709_adpkg-ad_hd.mp4"}}}
    -----json.loads()-----
    {"resultCode": "1", "resultMsg": "success", "reqId": "e488637c-0200-41f0-a259-9493913f1d11", "systemTime": "1632799446898", "videoInfo": {"playSta": "1", "video_image": "https://image1.pearvideo.com/cont/20210927/cont-1742617-12625072.jpg", "videos": {"hdUrl": "", "hdflvUrl": "", "sdUrl": "", "sdflvUrl": "", "srcUrl": "https://video.pearvideo.com/mp4/adshort/20210927/1632799446898-15774709_adpkg-ad_hd.mp4"}}}
    -----eval-----
    {"resultCode": "1", "resultMsg": "success", "reqId": "b2b607e8-9685-4f3f-8c16-e10aedad97d0", "systemTime": "1632799447261", "videoInfo": {"playSta": "1", "video_image": "https://image2.pearvideo.com/cont/20210927/cont-1742606-12625013.png", "videos": {"hdUrl": "", "hdflvUrl": "", "sdUrl": "", "sdflvUrl": "", "srcUrl": "https://video.pearvideo.com/mp4/adshort/20210927/1632799447261-15774653_adpkg-ad_hd.mp4"}}}
    -----json.loads()-----
    {"resultCode": "1", "resultMsg": "success", "reqId": "b2b607e8-9685-4f3f-8c16-e10aedad97d0", "systemTime": "1632799447261", "videoInfo": {"playSta": "1", "video_image": "https://image2.pearvideo.com/cont/20210927/cont-1742606-12625013.png", "videos": {"hdUrl": "", "hdflvUrl": "", "sdUrl": "", "sdflvUrl": "", "srcUrl": "https://video.pearvideo.com/mp4/adshort/20210927/1632799447261-15774653_adpkg-ad_hd.mp4"}}}
    -----eval-----
    {"resultCode": "1", "resultMsg": "success", "reqId": "16929bb3-257e-4c01-94df-4343a42b99d9", "systemTime": "1632799447584", "videoInfo": {"playSta": "1", "video_image": "https://image1.pearvideo.com/cont/20210227/cont-1721628-12558949.png", "videos": {"hdUrl": "", "hdflvUrl": "", "sdUrl": "", "sdflvUrl": "", "srcUrl": "https://video.pearvideo.com/mp4/adshort/20210227/1632799447584-15618162_adpkg-ad_hd.mp4"}}}
    -----json.loads()-----
    {"resultCode": "1", "resultMsg": "success", "reqId": "16929bb3-257e-4c01-94df-4343a42b99d9", "systemTime": "1632799447584", "videoInfo": {"playSta": "1", "video_image": "https://image1.pearvideo.com/cont/20210227/cont-1721628-12558949.png", "videos": {"hdUrl": "", "hdflvUrl": "", "sdUrl": "", "sdflvUrl": "", "srcUrl": "https://video.pearvideo.com/mp4/adshort/20210227/1632799447584-15618162_adpkg-ad_hd.mp4"}}}
    -----eval-----
    {"resultCode": "1", "resultMsg": "success", "reqId": "34db5053-02f9-40c1-98c8-4d83a0e7d50f", "systemTime": "1632799447961", "videoInfo": {"playSta": "1", "video_image": "https://image.pearvideo.com/cont/20210521/cont-1729993-12587145.png", "videos": {"hdUrl": "", "hdflvUrl": "", "sdUrl": "", "sdflvUrl": "", "srcUrl": "https://video.pearvideo.com/mp4/adshort/20210521/1632799447961-15678629_adpkg-ad_hd.mp4"}}}
    -----json.loads()-----
    {"resultCode": "1", "resultMsg": "success", "reqId": "34db5053-02f9-40c1-98c8-4d83a0e7d50f", "systemTime": "1632799447961", "videoInfo": {"playSta": "1", "video_image": "https://image.pearvideo.com/cont/20210521/cont-1729993-12587145.png", "videos": {"hdUrl": "", "hdflvUrl": "", "sdUrl": "", "sdflvUrl": "", "srcUrl": "https://video.pearvideo.com/mp4/adshort/20210521/1632799447961-15678629_adpkg-ad_hd.mp4"}}}
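The tests above show that json.loads() and eval turn this JSON into the same result. They are still different tools: eval evaluates the text as a Python expression, so it runs whatever the server sends back and fails on JSON literals such as true, false and null, while json.loads only parses JSON.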
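One place where the two diverge is easy to reproduce. The payload below is a hypothetical response, not one returned by Pear Video: it contains the JSON literals true and null, which json.loads maps to True and None but which are not defined names in Python, so eval raises NameError.

```python
import json

# Hypothetical JSON text containing literals that are not valid Python names.
payload = '{"resultCode": "1", "ok": true, "hdUrl": null}'


def parse_with_eval(text):
    """Try eval on JSON text and report the failure instead of crashing."""
    try:
        return eval(text)
    except NameError as exc:
        return "eval failed: {}".format(exc)


if __name__ == "__main__":
    print(json.loads(payload))
    print(parse_with_eval(payload))
```

The responses shown earlier contain only strings, which is why both functions agreed there.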
event_loop: the event loop, comparable to an infinite loop. Functions can be registered on it, and when certain conditions are met the loop executes them.
coroutine: a coroutine object. It can be registered on the event loop, which then calls it. A method defined with the async keyword is not executed immediately when it is called; the call returns a coroutine object instead.
task: a task, a further wrapper around a coroutine object that tracks the task's states.
future: represents a task that will run later or has not run yet; in practice it is not fundamentally different from a task.
async: defines a coroutine.
await: suspends the current coroutine until the awaited operation has finished.
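These terms fit together as in the following sketch, which reuses the simulated download from the thread examples. It is an illustration only: asyncio.sleep stands in for a non-blocking request (a real crawler would await an aiohttp call instead), and the shortened delay and return strings are arbitrary choices.

```python
import asyncio
import time


async def get_page(name):
    # Calling get_page(...) only creates a coroutine object; nothing runs yet.
    # await hands control back to the event loop while this coroutine waits.
    await asyncio.sleep(0.2)
    return "downloaded:" + name


async def main(names):
    # create_task wraps each coroutine object in a Task and schedules it.
    tasks = [asyncio.create_task(get_page(n)) for n in names]
    # gather waits for every task and returns the results in input order.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    start = time.time()
    # asyncio.run creates the event loop, runs main() on it, then closes it.
    print(asyncio.run(main(["xiaozi", "aa", "bb", "cc"])))
    print("%.1f second" % (time.time() - start))
```

All four waits overlap on a single thread, so the total time should again be close to one sleep.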