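selenium: https://github.com/SeleniumHQ...
Current version: 3.0.1
A browser automation framework and ecosystem
phantomjs: http://phantomjs.org/
PhantomJS is a server-side JavaScript API for WebKit, in other words a headless browser. It supports a range of web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG.
Most web scraping can be done with urllib, but once a page relies on JavaScript or Ajax rendering, urlopen is of no use, because it only returns the raw HTML and never executes scripts. In that case you have to simulate a browser. There are many ways to do that; this article uses selenium2 + phantomjs.
selenium2 supports all mainstream browsers as well as headless browsers such as phantomjs.
Installation:
pip install selenium
phantomjs is not a Python program; download the build for your operating system from the official site and install it.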
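Because phantomjs is installed separately from the Python package, a script has to know where the executable lives. The following is a minimal sketch, using only the standard library, of one way to look it up before passing the result as executable_path. The function name find_phantomjs and its extra_dirs parameter are hypothetical helpers for illustration, not part of selenium.

```python
import os
import shutil


def find_phantomjs(extra_dirs=()):
    """Return a path to a phantomjs executable, or None if none is found.

    Checks PATH first, then each directory in extra_dirs (a hypothetical
    parameter for places such as the bin folder of an unpacked download).
    """
    found = shutil.which("phantomjs")
    if found:
        return found
    for directory in extra_dirs:
        for name in ("phantomjs", "phantomjs.exe"):
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                return candidate
    return None


if __name__ == "__main__":
    path = find_phantomjs()
    print(path if path else "phantomjs not found; pass executable_path explicitly")
```

If the function returns None, the executable is neither on PATH nor in the listed directories, and the full path has to be supplied by hand as in the examples that follow.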
from selenium import webdriver
import time

# Raw string so the backslashes in the Windows path are kept literally.
driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe")
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
# Crude fixed wait to give the Ajax request time to finish.
time.sleep(3)
print(driver.find_element_by_id("content").text)
driver.close()

from selenium import webdriver

driver = webdriver.PhantomJS(executable_path=r"C:\Users\Gentlyguitar\Desktop\phantomjs-1.9.7-windows\phantomjs.exe")
driver.set_window_size(1120, 550)
driver.get("http://duckduckgo.com/")
# Type a query into the search box and submit it.
driver.find_element_by_id("search_form_input_homepage").send_keys("Nirvana")
driver.find_element_by_id("search_button_homepage").click()
print(driver.current_url)
driver.close()
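The get method blocks until the page has fully loaded and only then lets the program continue, but it cannot wait for content that is loaded afterwards via Ajax.
send_keys fills in an input form field.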
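Since get cannot wait for Ajax content, the usual alternative to a fixed time.sleep is to poll until the content appears. The following is a simplified stand-in for that idea, written without selenium so it can be run anywhere. The name wait_until and its parameters are assumptions made for this sketch, and it is not how Selenium's own WebDriverWait is implemented.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it returns a truthy value or timeout elapses.

    Returns the truthy value, or raises TimeoutError when time runs out.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
        except Exception:
            # Treat any error (e.g. "element not found yet") as "not ready".
            result = None
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)
```

With a driver, the condition could be something like lambda: driver.find_element_by_id("loadedButton"), which raises until the Ajax call has inserted the element. Swallowing every exception keeps the sketch short; a real script would normally ignore only the "not found" error.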
# Wait for the page to finish rendering
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe", desired_capabilities=dcap)
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
try:
    # Wait up to 10 seconds for the element that the Ajax call inserts.
    element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "loadedButton")))
finally:
    print(driver.find_element_by_id("content").text)
    driver.close()

Handling JavaScript redirects
# Handle JavaScript redirects
from selenium import webdriver
import time
from selenium.common.exceptions import StaleElementReferenceException

def waitForLoad(driver):
    elem = driver.find_element_by_tag_name("html")
    count = 0
    while True:
        count += 1
        if count > 20:
            print("Timing out after 10 seconds and returning")
            return
        time.sleep(.5)
        try:
            # Querying the old element forces a round trip to the browser.
            # A StaleElementReferenceException means elem no longer exists,
            # which in turn means the page has been redirected.
            elem.tag_name
        except StaleElementReferenceException:
            return

driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe")
driver.get("http://pythonscraping.com/pages/javascript/redirectDemo1.html")
waitForLoad(driver)
print(driver.page_source)

Setting the PhantomJS User-Agent
Some web servers restrict the User-Agent and may reject requests from User-Agents they do not recognize.
To set the PhantomJS user agent, set the desired capability "phantomjs.page.settings.userAgent".
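The capability is just a key in a dictionary, so the step can be sketched without a browser. In the snippet below, with_user_agent is a hypothetical helper and BASE_CAPS is a stand-in for DesiredCapabilities.PHANTOMJS with assumed values, used only so the example runs without selenium installed.

```python
def with_user_agent(base_caps, user_agent):
    """Return a copy of base_caps with the PhantomJS user-agent setting added."""
    caps = dict(base_caps)  # copy so the shared defaults are not mutated
    caps["phantomjs.page.settings.userAgent"] = user_agent
    return caps


# Stand-in for DesiredCapabilities.PHANTOMJS; these values are assumptions
# for illustration only.
BASE_CAPS = {"browserName": "phantomjs", "javascriptEnabled": True}

caps = with_user_agent(BASE_CAPS, "Mozilla/5.0 (Windows NT 10.0; WOW64)")
print(caps["phantomjs.page.settings.userAgent"])
```

Copying the defaults first matters because the capabilities object is shared; writing into it directly would change the user agent for every driver created later in the same process.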
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
driver = webdriver.PhantomJS(executable_path="./phantomjs.exe", desired_capabilities=dcap)
driver.get("http://dianping.com/")
cap_dict = driver.desired_capabilities
# List all available desired_capabilities entries.
for key in cap_dict:
    print("%s: %s" % (key, cap_dict[key]))
print(driver.current_url)
driver.quit()

Demo
github
# pip install selenium
# install phantomjs
from selenium import webdriver
import time
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe", desired_capabilities=dcap)
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
time.sleep(3)
print(driver.find_element_by_id("content").text)
driver.close()

# Set the PhantomJS User-Agent
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
driver = webdriver.PhantomJS(executable_path="./phantomjs.exe", desired_capabilities=dcap)
driver.get("http://dianping.com/")
cap_dict = driver.desired_capabilities
# List all available desired_capabilities entries.
for key in cap_dict:
    print("%s: %s" % (key, cap_dict[key]))
print(driver.current_url)
driver.quit()

# Wait for the page to finish rendering
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe")
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
try:
    element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "loadedButton")))
finally:
    print(driver.find_element_by_id("content").text)
    driver.close()

# Handle JavaScript redirects
from selenium import webdriver
import time
from selenium.common.exceptions import StaleElementReferenceException

def waitForLoad(driver):
    elem = driver.find_element_by_tag_name("html")
    count = 0
    while True:
        count += 1
        if count > 20:
            print("Timing out after 10 seconds and returning")
            return
        time.sleep(.5)
        try:
            # Querying the old element raises once the page has been replaced.
            elem.tag_name
        except StaleElementReferenceException:
            return

driver = webdriver.PhantomJS(executable_path=r"C:\Users\taojw\Desktop\pywork\phantomjs-2.1.1-windows\bin\phantomjs.exe")
driver.get("http://pythonscraping.com/pages/javascript/redirectDemo1.html")
waitForLoad(driver)
print(driver.page_source)

##################################################################################
# Simulate drag and drop
from selenium import webdriver
from selenium.webdriver import ActionChains

driver = webdriver.PhantomJS(executable_path="phantomjs/bin/phantomjs")
driver.get("http://pythonscraping.com/pages/javascript/draggableDemo.html")
print(driver.find_element_by_id("message").text)
element = driver.find_element_by_id("draggable")
target = driver.find_element_by_id("div2")
actions = ActionChains(driver)
actions.drag_and_drop(element, target).perform()
print(driver.find_element_by_id("message").text)

##################################################################################
# Take a screenshot
driver.get_screenshot_as_file("tmp/pythonscraping.png")
####

##################################################################################
# Log in to Zhihu, then automatically click "More" at the bottom of the page to load more content
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver import ActionChains
import time
import sys

driver = webdriver.PhantomJS(executable_path=r"C:\Users\Gentlyguitar\Desktop\phantomjs-1.9.7-windows\phantomjs.exe")
driver.get("http://www.zhihu.com/#signin")
#driver.find_element_by_name("email").send_keys("your email")
driver.find_element_by_xpath("//input[@name='password']").send_keys("your password")
#driver.find_element_by_xpath("//input[@name='password']").send_keys(Keys.RETURN)
time.sleep(2)
driver.get_screenshot_as_file("show.png")
#driver.find_element_by_xpath("//button[@class='sign-button']").click()
driver.find_element_by_xpath("//form[@class='zu-side-login-box']").submit()
try:
    # Wait for the page to finish loading
    dr = WebDriverWait(driver, 5)
    dr.until(lambda the_driver: the_driver.find_element_by_xpath("//a[@class='zu-top-nav-userinfo ']").is_displayed())
except Exception:
    print("Login failed")
    sys.exit(0)
driver.get_screenshot_as_file("show.png")
#user = driver.find_element_by_class_name("zu-top-nav-userinfo ")
#webdriver.ActionChains(driver).move_to_element(user).perform()  # move the mouse over my user name
loadmore = driver.find_element_by_xpath("//a[@id='zh-load-more']")
actions = ActionChains(driver)
actions.move_to_element(loadmore)
actions.click(loadmore)
actions.perform()
time.sleep(2)
driver.get_screenshot_as_file("show.png")
print(driver.current_url)
print(driver.page_source)
driver.quit()
##################################################################################