Scraping with Goutte

Until now I had been using PHP Simple HTML DOM Parser for scraping,
but Goutte seems to be gaining popularity lately, so I gave it a try.
I am keeping this here as a template for my own use.

Goutte can be installed with composer in the usual way.
However, if you are on PHP 5.3, you need to pin Goutte to version 1.0.6 when installing.
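
As a rough illustration of the version pin described above, a composer.json for a PHP 5.3 project might look like the fragment below. The package name fabpot/goutte is the name Goutte is published under on Packagist. The exact constraint style (a fixed version rather than a range) is an assumption, so adjust it to your project.

```json
{
    "require": {
        "fabpot/goutte": "1.0.6"
    }
}
```

After saving this, running composer install should place Goutte and its dependencies under vendor/.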

Sample code follows.

The sample accesses a single URL and
parses each row of the table under the div that has the CSS class foo.
Picture a page that displays a list of users.

Class definition

<?php
use Goutte\Client;

class WebScraper {
    const URL = 'http://example.com/';
    private $data = array();
    protected $client;
    protected $crawler;

    public function __construct() {
        $this->client = new Client();
    }

    public function execute(){
        $this->crawler = $this->client->request('GET',self::URL);
        $this->data = $this->crawler->filter('div.foo table tr')
                ->each(function($node){ // each() runs the callback for every tr
            $row = array();
            $tds = $node->children(); // $node is a tr; this gets its child td elements
            $row['id'] = $tds->first()->filter('a')->text(); // the first td can also be fetched with first()
            $row['name'] = $tds->eq(1)->text(); // eq() takes a zero-based index, so this is the second td
            $row['address'] = $tds->eq(2)->text();
            return $row; // the returned value becomes one element of the array stored in $this->data
        });
    }
    
    public function getList(){
        return $this->data;
    }

}

Calling the class

<?php
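// Assumed paths: Composer's autoloader and the file holding the class above.
require_once __DIR__ . '/vendor/autoload.php';
require_once __DIR__ . '/WebScraper.php'; // hypothetical file name

$scraper = new WebScraper();
$scraper->execute();
$list = $scraper->getList();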
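
For reference, here is one way the array returned by getList() could be consumed. This is only a sketch: the sample rows and the helper name formatUserRows are hypothetical and stand in for whatever execute() actually scrapes.

```php
<?php
// Hypothetical sample data in the shape returned by getList():
// a list of rows, each with 'id', 'name' and 'address' keys.
$list = array(
    array('id' => '1', 'name' => 'Alice', 'address' => 'Tokyo'),
    array('id' => '2', 'name' => 'Bob', 'address' => 'Osaka'),
);

/**
 * Hypothetical helper: turns each scraped row into one tab-separated line.
 * text() may return surrounding whitespace, so every value is trimmed.
 */
function formatUserRows(array $rows)
{
    $lines = array();
    foreach ($rows as $row) {
        $lines[] = implode("\t", array(
            trim($row['id']),
            trim($row['name']),
            trim($row['address']),
        ));
    }
    return $lines;
}

// Print one line per user.
echo implode(PHP_EOL, formatUserRows($list)), PHP_EOL;
```

In the real script, the sample array would simply be replaced by the result of $scraper->getList().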

References

https://github.com/FriendsOfPHP/Goutte
http://d.hatena.ne.jp/hnw/20120115
http://ss-complex.com/2014/05/php_blog/
http://qiita.com/77web@github/items/3cd3b56985d5c6845661

kikukawa's diary

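Written by a system engineer working in Tokyo. I post about technologies that caught my interest, pitfalls I ran into, and notes to myself. Lately mostly web-related.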
