Web Development - Web Crawlers

Object, published 2019-08-14 17:00 / 1114 reads

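Summary: WebCollector is a Java crawler framework. Compared with crawling directly with HttpClient and JSoup, it has significant advantages: the framework has resumable crawling, URL deduplication, custom HTTP requests, and more built in. Frameworks such as Nutch and Heritrix are implemented similarly under the hood.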

Web Crawlers

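WebCollector is a Java crawler framework. Compared with crawling directly with HttpClient and JSoup, it has significant advantages: resumable crawling, URL deduplication, and custom HTTP requests are built into the framework. Other frameworks such as Nutch and Heritrix are implemented similarly under the hood.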
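To make the URL deduplication idea concrete, here is a minimal sketch in JavaScript of a crawl frontier that refuses URLs it has already seen. It is only an illustration under assumptions: the `Frontier` class, the `normalize` helper, and the `/topic/1` path are hypothetical and not part of WebCollector, and this version keeps its state in memory, so it does not survive a restart the way a resumable crawler's state must.

```javascript
// Hypothetical in-memory crawl frontier with URL deduplication.
// Frontier and normalize are illustrative names, not WebCollector APIs.

// Normalize a URL so trivially different spellings compare equal:
// the WHATWG URL parser lowercases scheme and host, and we drop the fragment.
function normalize(rawUrl) {
  const u = new URL(rawUrl);
  u.hash = "";
  return u.toString();
}

class Frontier {
  constructor() {
    this.seen = new Set(); // every URL ever added
    this.queue = [];       // URLs still waiting to be fetched
  }

  // Returns true if the URL was new and has been queued.
  add(rawUrl) {
    const key = normalize(rawUrl);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    this.queue.push(key);
    return true;
  }

  // Returns the next URL to fetch, or undefined when the frontier is empty.
  next() {
    return this.queue.shift();
  }
}

const frontier = new Frontier();
frontier.add("https://cnodejs.org/");
frontier.add("https://CNODEJS.org/#top"); // duplicate after normalization
frontier.add("https://cnodejs.org/topic/1");
console.log(frontier.queue.length);
```

A real crawler would persist both the seen set and the queue so that an interrupted crawl can pick up where it stopped.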

Below are two crawler implementations:

1. Node crawler

Install the modules with npm:

var eventproxy = require("eventproxy"); // installed from npm, like the modules below
var ep = new eventproxy();
var superagent = require("superagent");
var cheerio = require("cheerio");
var url = require("url");

var cnodeUrl = "https://cnodejs.org/";

superagent.get(cnodeUrl).end(function(err,res){
    if(err)
        return console.error(err);
    var topicUrls = [];
    var $ = cheerio.load(res.text);
    // Collect all topic links on the home page
    $("#topic_list .topic_title").each(function(idx,element){
        var $element = $(element);
        var href = url.resolve(cnodeUrl,$element.attr("href"));
        topicUrls.push(href);
    });
    console.log(topicUrls);

    // Tell ep to wait for the `topic_html` event topicUrls.length times (40 here) before acting
    ep.after("topic_html", topicUrls.length, function (topics) {
      // topics is an array holding the 40 pairs passed to the 40 ep.emit("topic_html", pair) calls

      // Start processing
      topics = topics.map(function (topicPair) {
        // From here on it is plain jQuery-style usage
        var topicUrl = topicPair[0];
        var topicHtml = topicPair[1];
        var $ = cheerio.load(topicHtml);
        return ({
          title: $(".topic_full_title").text().trim(),
          href: topicUrl,
          comment1: $(".reply_content").eq(0).text().trim(),
        });
      });

      console.log("final:");
      console.log(topics);
    });

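    topicUrls.forEach(function (topicUrl) {
      superagent.get(topicUrl)
        .end(function (err, res) {
          if (err) {
            // Emit an empty page so that ep.after still reaches its count
            console.error("fetch " + topicUrl + " failed: " + err.message);
            return ep.emit("topic_html", [topicUrl, ""]);
          }
          console.log("fetch " + topicUrl + " successful");
          ep.emit("topic_html", [topicUrl, res.text]);
        });
    });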
});
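// Asynchronous concurrency: ep.all fires once both events have been emitted
ep.all("data1","data2",function(data1,data2){
    console.log(data1+","+data2);
});
superagent.get(cnodeUrl).end(function(err,res){
    if(err)
        return console.error(err);
    ep.emit("data1",res.text);
});
superagent.get(cnodeUrl).end(function(err,res){
    if(err)
        return console.error(err);
    ep.emit("data2",res.text);
});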
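The `ep.after("topic_html", n, callback)` call above waits until the event has been emitted n times and then hands all collected payloads to the callback. A minimal stand-in for that counting pattern, written with Node's built-in `events` module, might look like the sketch below. The `after` helper and the `/topic/a` style paths are hypothetical; this is not eventproxy's implementation, only an illustration of the idea, and it collects payloads in arrival order.

```javascript
const { EventEmitter } = require("events");

// Hypothetical helper: call `done` once `eventName` has fired `times` times,
// passing every payload in the order it arrived.
function after(emitter, eventName, times, done) {
  const results = [];
  if (times === 0) {
    done(results);
    return;
  }
  const onEvent = (payload) => {
    results.push(payload);
    if (results.length === times) {
      emitter.removeListener(eventName, onEvent);
      done(results);
    }
  };
  emitter.on(eventName, onEvent);
}

const emitter = new EventEmitter();
let collected = null;
after(emitter, "topic_html", 3, (pairs) => {
  collected = pairs;
  console.log("all pages fetched:", pairs.length);
});

// Simulate three fetches finishing; no network access is involved.
["/topic/a", "/topic/b", "/topic/c"].forEach((path) => {
  emitter.emit("topic_html", [path, "<html>" + path + "</html>"]);
});
```

The point of the pattern is that the individual fetches can finish in any order while the final processing step still runs exactly once, after the last one arrives.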

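2. WebCollector

Required JARs:

WebCollector: after unpacking, add the JARs from webcollector-2.32-bin to your project.

Selenium (its HtmlUnitDriver is used to load and parse the login page's HTML).

Below is the code for crawling Sina Weibo: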

import cn.edu.hfut.dmic.webcollector.model.CrawlDatum;
import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.net.HttpRequest;
import cn.edu.hfut.dmic.webcollector.net.HttpResponse;
import cn.edu.hfut.dmic.webcollector.plugin.berkeley.BreadthCrawler;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * 
 * Crawls Weibo.
 * @author Alex
 *
 */
public class WeiboCrawler extends BreadthCrawler {

    private String cookie;

    public WeiboCrawler(String crawlPath, boolean autoParse) throws Exception {
        super(crawlPath, autoParse);
        cookie = WeiboCN.getSinaCookie("XXXXXXXXXXX", "XXXXXXXXXX");// account, password
    }

    @Override
    public HttpResponse getResponse(CrawlDatum crawlDatum) throws Exception {
        HttpRequest request = new HttpRequest(crawlDatum);
        request.setCookie(cookie);
        return request.getResponse();
    }

    public void visit(Page page, CrawlDatums next) {
        int pageNum = Integer.valueOf(page.getMetaData("pageNum"));
        Elements weibos = page.select("div.c");// or: Document doc = page.doc();
        for (Element weibo : weibos) {
            System.out.println("Page " + pageNum + "\t" + weibo.text());
        }
    }

    public static void main(String[] args) throws Exception {
        WeiboCrawler crawler = new WeiboCrawler("WeiboCrawler", false);
        crawler.setThreads(3);// number of threads
        for (int i = 1; i <= 5; i++) {// crawl the first 5 pages of XXX
            crawler.addSeed(new CrawlDatum("http://weibo.cn/zhouhongyi?vt=4&page=" + i).putMetaData("pageNum", i + ""));
        }
        //crawler.setResumable(true);// resumable crawling
        crawler.start(1);
    }

}

import cn.edu.hfut.dmic.webcollector.net.HttpRequest;
import cn.edu.hfut.dmic.webcollector.net.HttpResponse;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.util.Set;
import javax.imageio.ImageIO;

import org.openqa.selenium.By;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

import java.awt.BorderLayout;
import java.awt.Container;
import java.awt.Dimension;
import java.awt.Graphics;
import java.awt.Toolkit;
import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import java.awt.image.BufferedImage;
import javax.swing.JButton;
import javax.swing.JFrame;
import javax.swing.JPanel;
import javax.swing.JTextField;

/**
 * 
 * @author Alex
 *
 */
public class WeiboCN {

    public static String getSinaCookie(String username, String password) throws Exception {
        HtmlUnitDriver driver = new HtmlUnitDriver();// load the HTML-parsing driver
        driver.setJavascriptEnabled(true);
        driver.get("http://login.weibo.cn/login/");
        
        WebElement ele = driver.findElementByCssSelector("img");// Selenium selector
        String src = ele.getAttribute("src");
        String cookie = concatCookie(driver);
        
        HttpRequest request = new HttpRequest(src);// request the CAPTCHA image
        request.setCookie(cookie);
        
        HttpResponse response = request.getResponse();
        ByteArrayInputStream is = new ByteArrayInputStream(response.getContent());
        BufferedImage img = ImageIO.read(is);
        is.close();
        ImageIO.write(img, "png", new File("result.png"));
        String userInput = new CaptchaFrame(img).getUserInput();
        
        // Simulate the form login
        WebElement mobile = driver.findElementByCssSelector("input[name=mobile]");
        mobile.sendKeys(username);
        WebElement pass = driver.findElementByCssSelector("input[type=password]");
        pass.sendKeys(password);
        WebElement code = driver.findElementByCssSelector("input[name=code]");
        code.sendKeys(userInput);
        WebElement rem = driver.findElementByCssSelector("input[name=remember]");
        rem.click();
        WebElement submit = driver.findElementByCssSelector("input[name=submit]");
        submit.click();
        String result = concatCookie(driver);
        driver.close();
        if (result.contains("gsid_CTandWM")) {
            return result;
        } else {
            throw new Exception("weibo login failed");
        }
    }

    public static String concatCookie(HtmlUnitDriver driver) {
        Set<Cookie> cookieSet = driver.manage().getCookies();
        StringBuilder sb = new StringBuilder();
        for (Cookie cookie : cookieSet) {
            sb.append(cookie.getName() + "=" + cookie.getValue() + ";");
        }
        String result = sb.toString();
        return result;
    }

    // Shows the CAPTCHA image in a window and reads the code the user types
    public static class CaptchaFrame {
        JFrame frame;// window
        JPanel panel;// panel
        JTextField input;// input field
        int inputWidth = 100;
        BufferedImage img;
        String userInput = null;

        public CaptchaFrame(BufferedImage img) {
            this.img = img;
        }

        public String getUserInput() {
            frame = new JFrame("Enter CAPTCHA");
            final int imgWidth = img.getWidth();
            final int imgHeight = img.getHeight();
            int width = imgWidth * 2 + inputWidth * 2;
            int height = imgHeight * 2+50;
            Dimension dim = Toolkit.getDefaultToolkit().getScreenSize();
            int startx = (dim.width - width) / 2;
            int starty = (dim.height - height) / 2;
            frame.setBounds(startx, starty, width, height);
            Container container = frame.getContentPane();
            container.setLayout(new BorderLayout());
            panel = new JPanel() {
                @Override
                public void paintComponent(Graphics g) {// draw the image on the panel
                    super.paintComponent(g);
                    g.drawImage(img, 0, 0, imgWidth * 2, imgHeight * 2, null);
                }
            };
            panel.setLayout(null);
            container.add(panel);
            input = new JTextField(6);
            input.setBounds(imgWidth * 2, 0, inputWidth, imgHeight * 2);
            panel.add(input);
            JButton btn = new JButton("Log in");
            btn.addActionListener(new ActionListener() {// register the listener
                public void actionPerformed(ActionEvent e) {
                    userInput = input.getText().trim();
                    synchronized (CaptchaFrame.this) {// wake the thread waiting in getUserInput()
                        CaptchaFrame.this.notify();
                    }
                }
            });
            btn.setBounds(imgWidth * 2 + inputWidth, 0, inputWidth, imgHeight * 2);
            panel.add(btn);
            frame.setVisible(true);
            synchronized (this) {
                try {
                    this.wait();
                } catch (InterruptedException ex) {
                    ex.printStackTrace();
                }
            }
            frame.dispose();
            return userInput;
        }
    }

}

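Pay attention to the password field! Its attribute is name="password_9384", and the number in it is generated dynamically and changes on every request, so the Selenium selector in the code above has to be input[type=password].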
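The same reasoning can be shown without a browser. The sketch below, in JavaScript, locates the password field by its stable `type` attribute rather than its changing `name`. Everything in it is an assumption made for illustration: `renderForm` produces hypothetical markup, not the real Weibo login page, and the regular expressions are a stand-in for a CSS selector such as input[type=password]; real code should use a proper HTML parser or the Selenium selector.

```javascript
// Hypothetical login-form markup; the numeric suffix differs on every request.
function renderForm(suffix) {
  return '<form><input type="text" name="mobile">' +
         '<input type="password" name="password_' + suffix + '">' +
         '</form>';
}

// Find the name of the password field via its stable `type` attribute
// instead of hard-coding the changing `name`.
function passwordFieldName(html) {
  const inputs = html.match(/<input\b[^>]*>/gi) || [];
  for (const tag of inputs) {
    if (/\btype\s*=\s*["']?password["']?/i.test(tag)) {
      const m = tag.match(/\bname\s*=\s*["']?([^"'\s>]+)/i);
      return m ? m[1] : null;
    }
  }
  return null;
}

console.log(passwordFieldName(renderForm(9384)));
console.log(passwordFieldName(renderForm(1027)));
```

A lookup keyed on the exact name found in one response would fail on the next request, while the type-based lookup keeps working.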

That's all for now. I'm not much of a writer, so please bear with me.

