Scrapy Learning Path 2 (Downloading Images and Getting the Download Path)

WelliJhon, published 2019-07-30 15:21 / 812 reads

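Summary: Downloading images and getting their paths after download. The spider crawls the small cover image and passes it through meta to parse_info, crawls each detail-page URL and builds its full address, then crawls and requests the next page. Also covered: a fallback value for a missing director, enabling the pipeline, notes on further customization, and a supplementary utils/common.py helper.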

Downloading images and getting the download path

items.py

import scrapy

class InfoItem(scrapy.Item):
    url = scrapy.Field()
    url_object_id = scrapy.Field()
    small_image = scrapy.Field()
    small_image_path = scrapy.Field()
    big_image = scrapy.Field()
    big_image_path = scrapy.Field()
    code = scrapy.Field()
    date = scrapy.Field()
    lengths = scrapy.Field()
    author = scrapy.Field()
    cate = scrapy.Field()
    av_artor = scrapy.Field()

spider/jxxx.py

# -*- coding: utf-8 -*-
import scrapy
from urllib import parse
from scrapy.http import Request
from JaSpider.items import InfoItem
from JaSpider.utils.common import get_md5


class JxxxSpider(scrapy.Spider):
    name = "jxxx"
    allowed_domains = ["www.jxxx.com"]
    start_urls = ["http://www.jxxx.com/cn/vl_update.php"]

    def parse(self, response):
        for i in response.css(".video"):
            small_image = i.css("img::attr(src)").extract_first() # small cover image; passed to parse_info later through meta
            link = i.css("a::attr(href)").extract_first() # detail-page URL
            real_url = parse.urljoin(response.url, link) # full detail-page address
            yield Request(url=real_url, meta={"small_image": small_image}, callback=self.parse_info)
        # Crawl and request the next page
        next_url = response.css(".page_selector .page.next::attr(href)").extract_first()
        if next_url:
            perfect_next_url = parse.urljoin(response.url, next_url)
            yield Request(url=perfect_next_url, callback=self.parse)

    def parse_info(self, response):
        small_image = "http:"+response.meta["small_image"]
        big_image = "http:" + response.xpath('//div[@id="video_jacket"]/img/@src').extract_first()
        code = response.css("#video_id .text::text").extract_first()
        date = response.css("#video_date .text::text").extract_first()
        lengths = response.css("#video_length .text::text").extract_first()
        author = response.css("#video_director .director a::text").extract_first() or "不明"  # "不明" means "unknown"
        cate = ",".join([i.css("a::text").extract_first() for i in response.css("#video_genres .genre") if i.css("a::text").extract_first()])
        av_artor = ",".join([i.css("a::text").extract_first() for i in response.css(".star") if i.css("a::text").extract_first()])
        # print("http:"+small_image)
        info_item = InfoItem()
        info_item["url"] = response.url
        info_item["url_object_id"] = get_md5(response.url)
        info_item["small_image"] = small_image
        info_item["big_image"] = [big_image]
        info_item["code"] = code
        info_item["date"] = date
        info_item["lengths"] = lengths
        info_item["author"] = author
        info_item["cate"] = cate
        info_item["av_artor"] = av_artor
        yield info_item
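
The spider prepends "http:" by hand, which suggests the site returns protocol-relative image addresses such as //host/path.jpg (an assumption, since the pages themselves are not shown here). As a small sketch, urllib.parse.urljoin can resolve both these and ordinary relative links against the page URL, so the scheme does not have to be hard-coded. The host pics.example.com and the file names are made up for illustration.

```python
from urllib.parse import urljoin

# Page the links were found on (taken from start_urls in the spider)
page_url = "http://www.jxxx.com/cn/vl_update.php"

# Hypothetical values, standing in for what the selectors might return
protocol_relative = "//pics.example.com/covers/abc.jpg"
relative_link = "detail.php?v=abc"

# A "//host/path" reference inherits only the scheme of the page URL
image_url = urljoin(page_url, protocol_relative)

# A plain relative link is resolved against the page's directory
detail_url = urljoin(page_url, relative_link)

print(image_url)
print(detail_url)
```

If the site ever switches to https, this approach follows the page's scheme automatically.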

Enable the pipeline in settings.py
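
The original settings are not reproduced in this copy, so the following is only a sketch of what enabling Scrapy's built-in ImagesPipeline typically looks like. ITEM_PIPELINES, IMAGES_URLS_FIELD and IMAGES_STORE are real Scrapy settings; the priority 1 and the directory name images are assumptions. The image pipeline also needs the Pillow package installed.

```python
# settings.py (sketch; values are assumptions, not the author's originals)

# Turn on Scrapy's built-in image pipeline
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}

# Item field that holds the list of image URLs to download
IMAGES_URLS_FIELD = "big_image"

# Directory where downloaded files are stored
IMAGES_STORE = "images"
```

An absolute path for IMAGES_STORE is usually more predictable than a relative one.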

Note:
spider/jxxx.py
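
The note itself did not survive in this copy. Since the spider assigns info_item["big_image"] = [big_image], it presumably concerns this point: the field that the image pipeline reads should be a list of URLs even when there is only one image, because the pipeline loops over the field's value. The plain-Python sketch below only illustrates what looping over a bare string would do; it does not call Scrapy, and the URL is made up.

```python
# Hypothetical image URL, for illustration only
big_image = "http://pics.example.com/covers/abc.jpg"

# Looping over a bare string yields single characters, not URLs
as_string = [u for u in big_image]

# Wrapping the URL in a list yields the URL itself
as_list = [u for u in [big_image]]

print(as_string[:4])
print(as_list)
```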

For further customization
settings.py
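
The customized settings are missing from this copy. As a sketch, switching from the built-in pipeline to a subclass usually means pointing ITEM_PIPELINES at your own class. The dotted path JaSpider.pipeline.JaImagesPipeline is hypothetical: JaSpider is the package name used in the spider's imports and pipeline matches the file name given in this article, while the class name is an assumption. Projects generated by scrapy startproject use pipelines.py, so adjust the path to match your file.

```python
# settings.py (sketch; the class path below is hypothetical)
ITEM_PIPELINES = {
    # Custom ImagesPipeline subclass instead of the built-in one
    "JaSpider.pipeline.JaImagesPipeline": 1,
}

# Same image settings as with the built-in pipeline (assumed values)
IMAGES_URLS_FIELD = "big_image"
IMAGES_STORE = "images"
```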

pipeline.py
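
The pipeline code is missing from this copy. One common way to record where each image was saved is to subclass ImagesPipeline and override item_completed. Its results argument is a list of (success, info) pairs, and on success info is a dict that includes a path key relative to IMAGES_STORE. The sketch below follows that approach; the class name JaImagesPipeline and the helper downloaded_paths are hypothetical, and the guarded import exists only so the helper can be tried without Scrapy installed.

```python
try:
    from scrapy.pipelines.images import ImagesPipeline
except ImportError:
    # Fallback so the helper below can be run without Scrapy
    ImagesPipeline = object


def downloaded_paths(results):
    """Return the stored paths of the successful downloads in `results`."""
    return [info["path"] for ok, info in results if ok]


class JaImagesPipeline(ImagesPipeline):
    # Hypothetical subclass; not the author's original code
    def item_completed(self, results, item, info):
        paths = downloaded_paths(results)
        if paths:
            # Only big_image is downloaded here, so fill the matching field
            item["big_image_path"] = ",".join(paths)
        return item


# Stand-in for what Scrapy passes to item_completed (values are made up)
sample_results = [
    (True, {"url": "http://pics.example.com/a.jpg",
            "path": "full/0a1b2c.jpg",
            "checksum": "abc"}),
    (False, Exception("download failed")),
]
print(downloaded_paths(sample_results))
```

Returning the item from item_completed passes it on to any later pipelines, now carrying big_image_path.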

Supplement
Create utils/common.py

import hashlib


def get_md5(url):
    if isinstance(url, str):
        url = url.encode("utf-8")
    m = hashlib.md5()
    m.update(url)
    return m.hexdigest()


if __name__ == "__main__":
    a = get_md5("http://www.haddu.com")
    print(a)

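Copyright of this article belongs to the author. Please do not repost without permission. If this article violates any rules, you can contact the administrator to have it removed.

When reposting, please cite this article's address: http://m.specialneedsforspecialkids.com/yun/41201.html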
