Scraping Lagou job data with Node.js and exporting it to an Excel file

dkzwm, published 2019-08-22 18:30 / 1548 views

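Preface

I have been learning node.js on and off for a while, so today I am using Lagou (a Chinese job-listing site) for practice, and getting a feel for the current job market from the data along the way. I am still a newcomer to node, and I hope we can learn and improve together.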

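1. Overview

First, the concrete requirements:

Running node index <city> <position> scrapes the postings for that city and position

Running node index start scrapes a predefined array of cities and positions, looping over every position in every city

The scraped results are stored locally in the ./data directory

A corresponding Excel file is generated and saved locally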

2. Modules used by the crawler

fs: reads and writes files and directories on the system

async: flow control

superagent: HTTP client request module

node-xlsx: exports data in a specific format to Excel

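3. Main steps of the crawler

Initializing the project

Create the project directory

Create the project directory node-crwl-lagou in a suitable location on disk

Initialize the project

Enter the node-crwl-lagou folder
Run npm init to initialize the package.json file

Install the dependencies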

npm install async

npm install superagent

npm install node-xlsx

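Handling command-line input

Command-line input can be read from process.argv, which returns an array: the first two items are the path to the node executable and the path to the script, and the remaining items are the arguments the user typed.
To tell node index <city> <position> apart from node index start, the simplest approach is to check the length of process.argv. If the length is four, we call the crawler main program directly. If the length is three, we build the URLs from the predefined city and position arrays and call the main program in a loop with async.mapSeries. The command-handling code in the entry file is as follows: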

const async = require("async")
// requsetCrwl (the crawler main program) and defaultArgv (the predefined cities and
// positions) are the project's own modules, required at the top of index.js as well

if (process.argv.length === 4) {
  // node index <city> <position>
  let args = process.argv
  console.log("Requesting " + args[3] + " job data for " + args[2]);
  requsetCrwl.controlRequest(args[2], args[3])
} else if (process.argv.length === 3 && process.argv[2] === "start") {
  // node index start: build every city/position combination
  let arr = []
  for (let i = 0; i < defaultArgv.city.length; i++) {
    for (let j = 0; j < defaultArgv.position.length; j++) {
      let obj = {}
      obj.city = defaultArgv.city[i]
      obj.position = defaultArgv.position[j]
      arr.push(obj)
    }
  }
  // Run the combinations one at a time
  async.mapSeries(arr, function (item, callback) {
    console.log("Requesting " + item.position + " job data for " + item.city);
    requsetCrwl.controlRequest(item.city, item.position, callback)
  }, function (err) {
    if (err) throw err
  })
} else {
  // Single quotes on the outside so the quoted commands inside do not end the string
  console.log('Please enter the city and position to scrape. Usage: "node index <city> <keyword>" or "node index start", for example "node index 北京 php" or "node index start"')
}

The predefined city and position arrays are as follows:

{
    "city": ["北京","上海","廣州","深圳","杭州","南京","成都","西安","武漢","重慶"],
    "position": ["前端","java","php","ios","android","c++","python",".NET"]
}
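
As a rough sketch of how the start mode could expand this configuration, the snippet below builds every city/position pair and processes the pairs one at a time. It uses native async/await in place of async.mapSeries, and buildTasks, runSeries, and fakeWorker are hypothetical names introduced here for illustration, not part of the original project.

```javascript
// Hypothetical config with the same shape as the predefined arrays above (shortened).
const defaultArgv = {
  city: ["北京", "上海"],
  position: ["前端", "php", "python"],
};

// Build one { city, position } task for every combination.
function buildTasks(config) {
  const tasks = [];
  for (const city of config.city) {
    for (const position of config.position) {
      tasks.push({ city, position });
    }
  }
  return tasks;
}

// Run the worker on each task strictly one after another, collecting results in order.
async function runSeries(tasks, worker) {
  const results = [];
  for (const task of tasks) {
    results.push(await worker(task));
  }
  return results;
}

// Stand-in worker: a real one would call the crawler for task.city and task.position.
const fakeWorker = async (task) => `${task.city}_${task.position}`;

runSeries(buildTasks(defaultArgv), fakeWorker).then((names) => {
  console.log(names.length); // one result per city/position pair
});
```

Running the tasks in series rather than in parallel keeps the request rate low, which matters given the site's rate limiting.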

Next comes the analysis of the crawler main program.

Analyzing the page to find the request URL

First open the Lagou home page, enter a search term (for example node), then open the browser console and find the relevant request.

This POST request to https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false is the one we need. It returns different data depending on three request parameters, and a quick look shows what they mean: first marks whether this is the first page (true or false), pn is the current page number, and kd is the search keyword.
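
To make the three parameters concrete, here is a minimal sketch of how the request for a given page might be assembled. buildSearchRequest is a hypothetical helper. The URL layout follows the one used later in this article, and encodeURIComponent is an added precaution for non-ASCII city names rather than something the original code does.

```javascript
const BASE_URL = "https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false";

// Return the URL and the POST body for one page of search results.
function buildSearchRequest(city, keyword, page) {
  const url = `${BASE_URL}&city=${encodeURIComponent(city)}&kd=${encodeURIComponent(keyword)}&pn=${page}`;
  const body = {
    first: page === 1, // true only for the first page
    pn: page,          // current page number
    kd: keyword,       // search keyword
  };
  return { url, body };
}

const req = buildSearchRequest("北京", "node", 2);
console.log(req.url);
console.log(req.body); // { first: false, pn: 2, kd: 'node' }
```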

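Requesting the data with superagent

Note first that the whole program is asynchronous, so we use async.series to run the steps in order.
Looking at the response returned by the request:

content.positionResult.totalCount is the total number of matching postings, which is what we need to work out the number of pages.
If we call the POST request directly with superagent, the console shows the following message, which says the requests are too frequent and asks us to try again later: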

{"success": False, "msg": "您操作太频繁,请稍后再访问", "clientIp": "122.xxx.xxx.xxx"}

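This is one of the site's anti-crawler measures. Adding request headers to the call is enough to get past it, and the headers can be copied from the request captured in the browser console earlier.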

Then call the POST request with superagent. The main code is as follows:

// First fetch the total count, which is used to work out the number of pages
    (cb) => {
      superagent
        .post(`https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&city=${city}&kd=${position}&pn=1`)
        .send({
          "pn": 1,
          "kd": position,
          "first": true
        })
        .set(options.options) // request headers, needed to get past the anti-crawler check
        .end((err, res) => {
          if (err) throw err
          // console.log(res.text)
          let resObj = JSON.parse(res.text)
          if (resObj.success === true) {
            // Despite its name, totalPage holds the total number of postings
            totalPage = resObj.content.positionResult.totalCount;
            cb(null, totalPage);
          } else {
            console.log(`Failed to fetch data: ${res.text}`)
          }
        })
    },

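Once we have the total count, dividing it by 15 (the number of postings per page) gives the number of pages and therefore the range of the pn parameter. We loop to generate every URL and store them in urls: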

(cb) => {
      for (let i=0;Math.ceil(i
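
The snippet above is cut off in the source. As a hedged sketch of what this step describes, the code below derives the page count from the total number of postings and builds one URL per page. buildUrls and PAGE_SIZE are hypothetical names, the page size of 15 is taken from the division mentioned in the text, and the URL layout is copied from the first request.

```javascript
const PAGE_SIZE = 15; // assumed postings per page, based on the division by 15 in the text

// Build one request URL per result page.
function buildUrls(city, position, totalCount) {
  const pageCount = Math.ceil(totalCount / PAGE_SIZE);
  const urls = [];
  for (let pn = 1; pn <= pageCount; pn++) {
    urls.push(`https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false&city=${city}&kd=${position}&pn=${pn}`);
  }
  return urls;
}

const urls = buildUrls("西安", "php", 31);
console.log(urls.length); // 3, because 31 postings need three pages of 15

// The page number can be read back from the fourth query parameter.
console.log(urls[2].split("&")[3].split("=")[1]); // prints 3
```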
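With all the URLs in hand, scraping all the data is straightforward: keep using superagent's post method to request each URL in turn, and after each response create a JSON file in the data directory and write the returned data to it. It looks simple, but two points need attention:

To avoid having the IP blocked for too many concurrent requests, loop over the URLs with async.mapLimit to cap concurrency at 3, and wait two seconds after each request before sending the next one
In the final callback of async.mapLimit (its fourth argument), we check whether the main function received a third argument in order to tell which command was used. For node index start we use async.mapSeries and call the main function with (city, position, callback) each time, so in that mode we must pass null back through the callback once the data has been fetched, otherwise the next iteration never starts

The main code is as follows: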
// Cap concurrency at 3
    (cb) => {
      async.mapLimit(urls, 3, (url, callback) => {
        num++;
        // The page number is the value of the fourth query parameter (pn)
        let page = url.split("&")[3].split("=")[1];
        superagent
          .post(url)
          .send({
            "pn": page, // the current page number, not the total count
            "kd": position,
            "first": false
          })
          .set(options.options)
          .end((err, res) => {
            if (err) throw err
            let resObj = JSON.parse(res.text)
            if (resObj.success === true) {
              console.log(`Fetching page ${page}, current concurrency: ${num}`);
              if (!fs.existsSync("./data")) {
                fs.mkdirSync("./data");
              }
              // Save the data as a .json file in the data folder
              fs.writeFile(`./data/${city}_${position}_${page}.json`, res.text, (err) => {
                if (err) throw err;
                // After the write finishes, wait two seconds before sending the next request
                setTimeout(() => {
                  num--;
                  console.log(`Page ${page} written successfully`);
                  callback(null, "success");
                }, 2000);
              });
            }
          })
      }, (err, result) => {
        if (err) throw err;
        // Arrow functions have no arguments object of their own, so this refers to the
        // arguments of the enclosing controlRequest function (assuming it is a regular
        // function); a third argument means the looping (start) mode
        if (arguments[2]) {
          ok = 1;
        }
        cb(null, ok)
      })
    },
    () => {
      if (ok) {
        setTimeout(function () {
          console.log(`Finished requesting ${position} data for ${city}`);
          indexCallback(null);
        }, 5000);
      } else {
        console.log(`Finished requesting ${position} data for ${city}`);
      }
      // exportExcel.exportExcel() // export to Excel
    }
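
The throttling idea above (at most three requests in flight, with a pause after each one) can also be sketched without any library. This is not how async.mapLimit is implemented. mapWithLimit, sleep, and the fetchPage stand-in are hypothetical names, and the delay is shortened so the example finishes quickly.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run worker over items with at most `limit` calls in flight; results keep input order.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function runner() {
    while (next < items.length) {
      const index = next++; // claim the next item before yielding at the await
      results[index] = await worker(items[index], index);
    }
  }
  const runners = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results;
}

// Stand-in for one page request: track concurrency, then pause before freeing the slot.
let active = 0;
let maxActive = 0;
async function fetchPage(page) {
  active++;
  maxActive = Math.max(maxActive, active);
  await sleep(20); // the article waits 2000 ms at this point
  active--;
  return `page ${page} done`;
}

mapWithLimit([1, 2, 3, 4, 5, 6, 7], 3, fetchPage).then((results) => {
  console.log(results.length, "pages fetched, peak concurrency:", maxActive);
});
```

Because each slot is held during the pause, the delay limits the overall request rate as well as the number of simultaneous connections.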
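[Figure: the exported JSON files]

Exporting the JSON files to Excel
There are several ways to export JSON files to Excel. I used the node-xlsx package, which takes data in a fixed format and writes it out, so the first step is to assemble the data in the format it expects: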
// Assumes fs and xlsx (the node-xlsx module) are required at the top of the file
function exportExcel() {
  let list = fs.readdirSync("./data")
  let dataArr = []
  list.forEach((item, index) => {
    let path = `./data/${item}`
    let obj = fs.readFileSync(path, "utf-8")
    let content = JSON.parse(obj).content.positionResult.result
    // Header row, followed by one row per posting
    let arr = [["companyFullName", "createTime", "workYear", "education", "city", "positionName", "positionAdvantage", "companyLabelList", "salary"]]
    content.forEach((contentItem) => {
      // The second column is createTime to match the header; guard against a missing companyLabelList
      arr.push([contentItem.companyFullName, contentItem.createTime, contentItem.workYear, contentItem.education, contentItem.city, contentItem.positionName, contentItem.positionAdvantage, (contentItem.companyLabelList || []).join(","), contentItem.salary])
    })
    dataArr[index] = {
      data: arr,
      name: path.split("./data/")[1] // sheet name (the file name); it must not contain / ? * [ ]
    }
  })

// Data format expected by node-xlsx
// var data = [
//   {
//     name : "sheet1",
//     data : [
//       [
//         "ID",
//         "Name",
//         "Score"
//       ],
//       [
//         "1",
//         "Michael",
//         "99"
//
//       ],
//       [
//         "2",
//         "Jordan",
//         "98"
//       ]
//     ]
//   },
//   {
//     name : "sheet2",
//     data : [
//       [
//         "AA",
//         "BB"
//       ],
//       [
//         "23",
//         "24"
//       ]
//     ]
//   }
// ]

// Write the xlsx file
  var buffer = xlsx.build(dataArr)
  fs.writeFile("./result.xlsx", buffer, function (err)
    {
      if (err)
        throw err;
      console.log("Write to xlsx has finished");

// Read the xlsx file back
//     var obj = xlsx.parse("./" + "result.xlsx");
//     console.log(JSON.stringify(obj));
    }
  );
}
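
As an illustration of the sheet format shown in the comment above, the sketch below turns an array of posting objects into one sheet description and cleans up the sheet name. toSheet and sanitizeSheetName are hypothetical helpers and the sample posting is made up. The 31-character limit and the forbidden characters are Excel's rules for sheet names.

```javascript
const COLUMNS = ["companyFullName", "city", "positionName", "companyLabelList", "salary"];

// Excel sheet names may not contain \ / ? * [ ] : and are limited to 31 characters.
function sanitizeSheetName(name) {
  return name.replace(/[\\\/?*\[\]:]/g, "_").slice(0, 31);
}

// Build { name, data } where data is a header row followed by one row per posting.
function toSheet(name, postings) {
  const rows = postings.map((posting) =>
    COLUMNS.map((key) => {
      const value = posting[key];
      if (Array.isArray(value)) return value.join(","); // flatten label lists into one cell
      return value == null ? "" : value;                // missing fields become empty cells
    })
  );
  return { name: sanitizeSheetName(name), data: [COLUMNS, ...rows] };
}

// Made-up sample record, only for demonstration.
const sheet = toSheet("西安_.NET_1.json", [
  {
    companyFullName: "Example Co.",
    city: "西安",
    positionName: ".NET developer",
    companyLabelList: ["label-a", "label-b"],
    salary: "10k-15k",
  },
]);
console.log(sheet.name, sheet.data.length);
```

An array of such sheet objects has the same shape as dataArr in the function above, so it could be handed to xlsx.build in the same way.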
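The exported Excel file is as follows. Each page of data gets its own sheet, which keeps things clear:

From it we can clearly see the current hiring situation for .net in Xi'an. Later on, the scraped data could also be presented as charts, which should be more intuitive.
Summary
The crawler as a whole is not complicated, but there are many small details to watch out for, such as how to use the various async methods and how to set the request headers. All in all, I learned a lot!
Source code
GitHub: https://github.com/fighting12...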