Original title: So another language is needed here to support the service

Date: 2019-08-01

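PhantomJS is an economical, practical way to take web page screenshots, but its API is small, so anything beyond that gets awkward. For example, its built-in web server, Mongoose, can handle at most 10 concurrent requests, so expecting PhantomJS to stand alone as a service is not realistic. Another language is needed to support the service, and here we choose NodeJS.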

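Installing PhantomJS

First, download the build for your platform from the official PhantomJS website, or download the source code and compile it yourself. Then add PhantomJS to your PATH environment variable and type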

$ phantomjs

If the command responds, you can move on to the next step.

Taking a simple screenshot with PhantomJS

var webpage = require('webpage')
  , page = webpage.create();

page.viewportSize = { width: 1024, height: 800 };
page.clipRect = { top: 0, left: 0, width: 1024, height: 800 };
page.settings = {
  javascriptEnabled: false,
  loadImages: true,
  userAgent: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) PhantomJS/19.0'
};

page.open('', function (status) {
  if (status === 'fail') {
    console.log('open page fail!');
  } else {
    page.render('./snapshot/test.png');
  }
  // release the memory
  page.close();
  // without this call the PhantomJS process keeps running
  phantom.exit();
});

Here we set the viewport size to 1024 × 800:

page.viewportSize = { width: 1024, height: 800 };

Capture a 1024 × 800 image with its origin at (0, 0):

page.clipRect = { top: 0, left: 0, width: 1024, height: 800 };

Disable JavaScript, allow images to load, and change the userAgent to "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) PhantomJS/19.0":

page.settings = { javascriptEnabled: false, loadImages: true, userAgent: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) PhantomJS/19.0'};

Then open the page with page.open and, finally, write the screenshot to ./snapshot/test.png:

page.render('./snapshot/test.png');

 

Communication between NodeJS and PhantomJS

Let's first look at which communication channels PhantomJS offers.

Command-line arguments

For example:

phantomjs snapshot.js

Command-line arguments can only be passed when PhantomJS starts. Nothing more can be passed while it is running.
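As a rough illustration of how a script might read such arguments, here is a minimal sketch. The helper name parseArgs and the sample value campaign42 are hypothetical. The only PhantomJS fact relied on is that require('system').args holds the script name at index 0 and the user-supplied arguments from index 1.

```javascript
// Parse a PhantomJS-style argument list.
// args[0] is the script name; user arguments start at index 1.
// parseArgs is a hypothetical helper, not part of PhantomJS.
function parseArgs(args) {
  var campaignId = args[1];
  if (!campaignId) {
    throw new Error('usage: phantomjs ' + args[0] + ' <campaignId>');
  }
  return { script: args[0], campaignId: campaignId };
}

// Inside PhantomJS this would be: parseArgs(require('system').args)
var parsed = parseArgs(['snapshot.js', 'campaign42']);
console.log(parsed.campaignId);
```

On the NodeJS side, the matching call would pass the value in the argument array given to child_process.spawn.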

Standard output

Standard output can carry data from PhantomJS to NodeJS, but it cannot carry data from NodeJS to PhantomJS.

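In our tests, however, standard output was the fastest of these channels, so it is worth considering when large amounts of data must be transferred.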
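If standard output were used as the channel, the NodeJS side would have to reassemble messages, because 'data' events deliver arbitrary chunks rather than whole lines. The sketch below assumes a convention that is not in the original article: the PhantomJS script prints one JSON object per line. The name createLineParser is hypothetical.

```javascript
// Reassemble newline-delimited JSON messages from arbitrary stdout chunks.
// Assumption: the child prints exactly one JSON object per line.
function createLineParser(onMessage) {
  var buffer = '';
  return function (chunk) {
    buffer += String(chunk);
    var lines = buffer.split('\n');
    buffer = lines.pop(); // keep the trailing partial line for the next chunk
    lines.forEach(function (line) {
      if (line.trim() !== '') onMessage(JSON.parse(line));
    });
  };
}

var received = [];
var feed = createLineParser(function (msg) { received.push(msg); });

// Simulate a message that is split across two chunks.
feed('{"id":0,"status":1}\n{"id":1,');
feed('"status":0}\n');
console.log(received.length); // 2
```

With a real child process, the returned function could be registered directly as the handler for the child's stdout 'data' event.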

HTTP

PhantomJS sends an HTTP request to the NodeJS service, and NodeJS returns the corresponding data.

This approach is simple, but requests can only be initiated by PhantomJS.

Websocket

Notably, PhantomJS 1.9.0 supports Websocket. Unfortunately it is the hixie-76 version of Websocket, but it does at least give NodeJS a way to initiate communication with PhantomJS.

In our tests, PhantomJS still needed about 1 second to connect to a local Websocket service, so we set this approach aside for now.

phantomjs-node

phantomjs-node succeeds in making PhantomJS usable as a NodeJS module, but let's look at the author's own explanation:

I will answer that question with a question. How do you communicate with a process that doesn't support shared memory, sockets, FIFOs, or standard input?

Well, there's one thing PhantomJS does support, and that's opening webpages. In fact, it's really good at opening web pages. So we communicate with PhantomJS by spinning up an instance of ExpressJS, opening Phantom in a subprocess, and pointing it at a special webpage that turns socket.io messages into alert() calls. Those alert() calls are picked up by Phantom and there you go!

The communication itself happens via James Halliday's fantastic dnode library, which fortunately works well enough when combined with browserify to run straight out of PhantomJS's pidgin Javascript environment.

In fact, phantomjs-node also communicates over HTTP or Websocket, but its dependencies are heavy. We only want to build something simple, so we will not consider it for now.

 

Design diagram

Figure 1

 

Let's get started

In the first version we implement the communication over HTTP.

First, use cluster for simple process supervision (index.js):

module.exports = (function () {
  "use strict"
  var cluster = require('cluster')
    , fs = require('fs');

  if(!fs.existsSync('./snapshot')) {
    fs.mkdirSync('./snapshot');
  }

  if (cluster.isMaster) {
    cluster.fork();

    cluster.on('exit', function (worker) {
      console.log('Worker ' + worker.id + ' died :(');
      process.nextTick(function () {
        cluster.fork();
      });
    })
  } else {
    require('./extract.js');
  }
})();

Next, use connect to build our external API (extract.js):

module.exports = (function () {
  "use strict"
  var connect = require('connect')
    , fs = require('fs')
    , spawn = require('child_process').spawn
    , jobMan = require('./lib/jobMan.js')
    , bridge = require('./lib/bridge.js')
    , pkg = JSON.parse(fs.readFileSync('./package.json'));

  var app = connect()
    .use(connect.logger('dev'))
    .use('/snapshot', connect.static(__dirname + '/snapshot', { maxAge: pkg.maxAge }))
    .use(connect.bodyParser())
    .use('/bridge', bridge)
    .use('/api', function (req, res, next) {
      if (req.method !== "POST" || !req.body.campaignId) return next();
      if (!req.body.urls || !req.body.urls.length) return jobMan.watch(req.body.campaignId, req, res, next);

      var campaignId = req.body.campaignId
        , imagesPath = './snapshot/' + campaignId + '/'
        , urls = []
        , url
        , imagePath;

      function _deal(id, url, imagePath) {
        // just push into urls list
        urls.push({
          id: id,
          url: url,
          imagePath: imagePath
        });
      }

      for (var i = req.body.urls.length; i--;) {
        url = req.body.urls[i];
        imagePath = imagesPath + i + '.png';
        _deal(i, url, imagePath);
      }

      jobMan.register(campaignId, urls, req, res, next);
      var snapshot = spawn('phantomjs', ['snapshot.js', campaignId]);
      snapshot.stdout.on('data', function (data) {
        console.log('stdout: ' + data);
      });
      snapshot.stderr.on('data', function (data) {
        console.log('stderr: ' + data);
      });
      snapshot.on('close', function (code) {
        console.log('snapshot exited with code ' + code);
      });

    })
    .use(connect.static(__dirname + '/html', { maxAge: pkg.maxAge }))
    .listen(pkg.port, function () { console.log('listen: ' + ':' + pkg.port); });

})();

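Here we reference two modules, bridge and jobMan.

bridge is the HTTP communication bridge, and jobMan is the job manager. Each campaignId corresponds to one job. We hand the job and the response over to jobMan to manage, then launch PhantomJS to do the processing.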
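To make the campaignId-to-job mapping concrete, the sketch below builds the list of per-URL work items that the API handler registers with jobMan. The helper name buildJobs and the example URLs are hypothetical. The field names (id, url, imagePath) and the ./snapshot/<campaignId>/<index>.png layout follow extract.js.

```javascript
// Build one work item per URL for a campaign.
// buildJobs is a hypothetical helper; the item shape mirrors extract.js.
function buildJobs(campaignId, urls) {
  var imagesPath = './snapshot/' + campaignId + '/';
  return urls.map(function (url, i) {
    return { id: i, url: url, imagePath: imagesPath + i + '.png' };
  });
}

var jobs = buildJobs('campaign42', [
  'http://example.com/a',
  'http://example.com/b'
]);
console.log(jobs[1].imagePath); // ./snapshot/campaign42/1.png
```

Unlike the loop in extract.js, which walks the array backwards, this variant keeps the items in request order. The id field still identifies each URL either way.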

The communication bridge is responsible for receiving or returning job-related information and passing it to jobMan (bridge.js):

module.exports = (function () {
  "use strict"
  var jobMan = require('./jobMan.js')
    , fs = require('fs')
    , pkg = JSON.parse(fs.readFileSync('./package.json'));

  return function (req, res, next) {
      if (req.headers.secret !== pkg.secret) return next();
      // Snapshot APP can post url information
      if (req.method === "POST") {
        var body = JSON.parse(JSON.stringify(req.body));
        jobMan.fire(body);
        res.end('');
      // Snapshot APP can get the urls should extract
      } else {
        var urls = jobMan.getUrls(req.url.match(/campaignId=([^&]*)(\s|&|$)/)[1]);
        res.writeHead(200, {'Content-Type': 'application/json'});
        res.end(JSON.stringify({ urls: urls }));
      }
  };

})();

If the request method is POST, we assume PhantomJS is pushing job-related information to us. If it is GET, we assume it wants to fetch the job's information.
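For the GET case, bridge.js pulls campaignId out of the query string with a regular expression, which throws if the parameter is absent because match returns null. One possible alternative, assuming a modern NodeJS where URL is a global (it was not in the NodeJS versions of the article's era), is sketched below. The function name getCampaignId and the placeholder base are hypothetical.

```javascript
// Read campaignId from a path-relative request URL such as req.url.
// Assumes a NodeJS version where URL is available as a global.
function getCampaignId(reqUrl) {
  // req.url has no scheme or host, so a placeholder base is needed to parse it.
  var parsed = new URL(reqUrl, 'http://placeholder.invalid');
  return parsed.searchParams.get('campaignId'); // null when the parameter is absent
}

console.log(getCampaignId('/?campaignId=campaign42&x=1')); // campaign42
console.log(getCampaignId('/'));                           // null
```

Returning null instead of throwing lets the handler decide how to answer a malformed request.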

jobMan is responsible for managing jobs and for returning the job information gathered so far to the client through the response (jobMan.js):

module.exports = (function () {
  "use strict"
  var fs = require('fs')
    , fetch = require('./fetch.js')
    , _jobs = {};

  function _send(campaignId){
    var job = _jobs[campaignId];
    if (!job) return;
    if (job.waiting) {
      job.waiting = false;
      clearTimeout(job.timeout);
      var finished = (job.urlsNum === job.finishNum)
        , data = {
        campaignId: campaignId,
        urls: job.urls,
        finished: finished
      };
      job.urls = [];
      var res = job.res;
      if (finished) {
        _jobs[campaignId] = null;
        delete _jobs[campaignId];
      }
      res.writeHead(200, {'Content-Type': 'application/json'});
      res.end(JSON.stringify(data));
    }
  }

  function register(campaignId, urls, req, res, next) {
    _jobs[campaignId] = {
      urlsNum: urls.length,
      finishNum: 0,
      urls: [],
      cacheUrls: urls,
      res: null,
      waiting: false,
      timeout: null
    };
    watch(campaignId, req, res, next);
  }

  function watch(campaignId, req, res, next) {
    _jobs[campaignId].res = res;
    // 20s timeout
    _jobs[campaignId].timeout = setTimeout(function () {
      _send(campaignId);
    }, 20000);
  }

  function fire(opts) {
    var campaignId = opts.campaignId
      , job = _jobs[campaignId]
      , fetchObj = fetch(opts.html);

    if (job) {
      if (+opts.status && fetchObj.title) {
        job.urls.push({
          id: opts.id,
          url: opts.url,
          image: opts.image,
          title: fetchObj.title,
          description: fetchObj.description,
          status: opts.status
        });
      } else {
        job.urls.push({
          id: opts.id,
          url: opts.url,
          status: opts.status
        });
      }

      if (!job.waiting) {
        job.waiting = true;
        setTimeout(function () {
          _send(campaignId);
        }, 500);
      }
      job.finishNum++;
    } else {
      console.log('job not found!');
    }
  }

  function getUrls(campaignId) {
    var job = _jobs[campaignId];
    if (job) return job.cacheUrls;
  }

  return {
    register: register,
    watch: watch,
    fire: fire,
    getUrls: getUrls
  };

})();
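The bookkeeping at the heart of jobMan, stripped of the HTTP response handling and timers, can be sketched as follows. This is a simplified illustration rather than a replacement for the module above. The names createJob, report and flush are hypothetical, while the payload fields (campaignId, urls, finished) match what _send returns.

```javascript
// Track results for one campaign and build client payloads from them.
// createJob, report and flush are hypothetical names for illustration.
function createJob(campaignId, total) {
  var pending = [];  // results not yet delivered to the client
  var finished = 0;  // how many URLs have reported back so far
  return {
    // Record one finished URL (the role of fire for each POST from PhantomJS).
    report: function (result) {
      pending.push(result);
      finished += 1;
    },
    // Build the response payload and clear the delivered results.
    flush: function () {
      var payload = {
        campaignId: campaignId,
        urls: pending,
        finished: finished === total
      };
      pending = [];
      return payload;
    }
  };
}

var job = createJob('campaign42', 2);
job.report({ id: 0, status: 1 });
var first = job.flush();
job.report({ id: 1, status: 0 });
var second = job.flush();
console.log(first.finished, second.finished); // false true
```

Each flush delivers only the results gathered since the previous one, so the client can poll repeatedly until finished is true.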

Here we use fetch to extract the title and description from the HTML. Its implementation is fairly simple (fetch.js):

module.exports = (function () {
  "use strict"

  return function (html) {
    if (!html) return { title: false, description: false };

    var title = html.match(/<title>(.*?)<\/title>/)
      , meta = html.match(/<meta\s(.*?)\/?>/g)
      , description;

    if (meta) {
      for (var i = meta.length; i--;) {
        if(meta[i].indexOf('name="description"') > -1 || meta[i].indexOf('name="Description"') > -1){
          description = meta[i].match(/content="(.*?)"/)[1];
        }
      }
    }

    (title && title[1] !== '') ? (title = title[1]) : (title = 'No Title');
    description || (description = 'No Description');

    return {
      title: title,
      description: description
    };
  };

})();
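To try this kind of regex-based extraction on sample input, here is a compact, self-contained variant. It is not the author's fetch.js: the name extractMeta is hypothetical, matching is case-insensitive, and it assumes the name attribute appears before content inside the meta tag.

```javascript
// Extract <title> and the description meta tag from an HTML string.
// Assumption: name="description" precedes content="..." in the tag.
function extractMeta(html) {
  var title = /<title>(.*?)<\/title>/i.exec(html);
  var description =
    /<meta\s[^>]*name="description"[^>]*content="(.*?)"/i.exec(html);
  return {
    title: title && title[1] !== '' ? title[1] : 'No Title',
    description: description ? description[1] : 'No Description'
  };
}

var sample = '<html><head><title>Hello</title>' +
  '<meta name="description" content="A test page"></head></html>';
var meta = extractMeta(sample);
console.log(meta.title);       // Hello
console.log(meta.description); // A test page
```

Regular expressions are fragile on real-world HTML, so a proper parser is the safer choice once pages deviate from this simple shape.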

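Finally, here is the source code that PhantomJS runs. After starting, it fetches the job information from bridge over HTTP, and each time it finishes one of the job's URLs it reports the result back to bridge over HTTP (snapshot.js):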

var webpage = require('webpage')
  , args = require('system').args
  , fs = require('fs')
  , campaignId = args[1]
  , pkg = JSON.parse(fs.read('./package.json'));

function snapshot(id, url, imagePath) {
  var page = webpage.create()
    , send
    , begin
    , save
    , end;
  page.viewportSize = { width: 1024, height: 800 };
  page.clipRect = { top: 0, left: 0, width: 1024, height: 800 };
  page.settings = {
    javascriptEnabled: false,
    loadImages: true,
    userAgent: 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.31 (KHTML, like Gecko) PhantomJS/1.9.0'
  };
  page.open(url, function (status) {
    var data;
    if (status === 'fail') {
      data = [
        'campaignId=',
        campaignId,
        '&url=',
        encodeURIComponent(url),
        '&id=',
        id,
        '&status=',
        0 // assumed value (missing in the source): 0 marks a failed load
      ].join('');
      // queue through postMan; no postPage variable exists in this scope
      postMan.post(data);
    } else {
      page.render(imagePath);
      var html = page.content;
      // callback NodeJS
      data = [
        'campaignId=',
        campaignId,
        '&html=',
        encodeURIComponent(html),
        '&url=',
        encodeURIComponent(url),
        '&image=',
        encodeURIComponent(imagePath),
        '&id=',
        id,
        '&status=',
        1 // assumed value (missing in the source): 1 marks a successful load
      ].join('');
      postMan.post(data);
    }
    // release the memory
    page.close();
  });
}

var postMan = {
  postPage: null,
  posting: false,
  datas: [],
  len: 0,
  currentNum: 0,
  init: function (snapshot) {
    var postPage = webpage.create()
      , that = this;
    postPage.customHeaders = {
      'secret': pkg.secret
    };
    postPage.open(':' + pkg.port + '/bridge?campaignId=' + campaignId, function () {
      var urls = JSON.parse(postPage.plainText).urls
        , url;

      // use `that`: inside this callback `this` is not postMan
      that.len = urls.length;

      if (that.len) {
        for (var i = that.len; i--;) {
          url = urls[i];
          snapshot(url.id, url.url, url.imagePath);
        }
      }
    });
    this.postPage = postPage;
  },
  post: function (data) {
    this.datas.push(data);
    if (!this.posting) {
      this.posting = true;
      this.fire();
    }
  },
  fire: function () {
    if (this.datas.length) {
      var data = this.datas.shift()
        , that = this;
      this.postPage.open(':' + pkg.port + '/bridge', 'POST', data, function () {
        that.fire();
        // kill the child process once every url has been reported
        setTimeout(function () {
          if (++that.currentNum === that.len) {
            that.postPage.close();
            phantom.exit();
          }
        }, 500);
      });
    } else {
      this.posting = false;
    }
  }
};
postMan.init(snapshot);
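The POST bodies in snapshot.js are joined by hand from key and value fragments. A small helper can produce the same application/x-www-form-urlencoded format from an object, as sketched below. The name encodeForm and the sample values are hypothetical. encodeURIComponent is the same standard function the script already uses.

```javascript
// Encode an object as an application/x-www-form-urlencoded string.
// encodeForm is a hypothetical helper for illustration.
function encodeForm(fields) {
  return Object.keys(fields).map(function (key) {
    return encodeURIComponent(key) + '=' + encodeURIComponent(fields[key]);
  }).join('&');
}

var body = encodeForm({
  campaignId: 'campaign42',
  url: 'http://example.com/?a=1&b=2',
  id: 0,
  status: 1
});
console.log(body);
// campaignId=campaign42&url=http%3A%2F%2Fexample.com%2F%3Fa%3D1%26b%3D2&id=0&status=1
```

Encoding every value matters here because a page URL or HTML snippet that contains & or = would otherwise be split into bogus fields when the NodeJS side parses the body.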

Result

Figure 2

 
