
Very slow performance scraping coinmarketcap.com #2

@viktorius007

The following example code executes in 1.3s on my MacBook...


<?php

use Scraper\Scrape\Crawler\Types\GeneralCrawler;
use Scraper\Scrape\Extractor\Types\MultipleRowExtractor;

require_once(__DIR__ . '/../vendor/autoload.php');
date_default_timezone_set('UTC');

// Create crawler
$crawler = new GeneralCrawler('https://coinmarketcap.com/');

// Setup configuration
$configuration = new \Scraper\Structure\Configuration();
$configuration->setTargetXPath('//table[@id="currencies"]');
$configuration->setRowXPath('.//tbody/tr');
$configuration->setFields(
    [
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Rank',
                'xpath' => '//td[1]',
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Name',
                'xpath' => '//td[2]',
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Market Cap',
                'xpath' => '//td[3]',
            ]
        ),
        new \Scraper\Structure\TextField(
            [
                'name'  => 'Price',
                'xpath' => '//td[4]',
            ]
        ),
        new \Scraper\Structure\RegexField(
            [
                'name'  => '% Change',
                'xpath' => '//td[7]',
                'regex' => '/(.*)%/'
            ]
        ),
    ]
);

// Extract data
$extractor = new MultipleRowExtractor($crawler, $configuration);
$data = $extractor->extract();
print_r($data);

However, this slightly tweaked version takes 4.5 minutes!


<?php

use Scraper\Scrape\Crawler\Types\GeneralCrawler;
use Scraper\Scrape\Extractor\Types\MultipleRowExtractor;

require_once(__DIR__ . '/../vendor/autoload.php');
date_default_timezone_set('UTC');

// Create crawler
$crawler = new GeneralCrawler('https://coinmarketcap.com/currencies/volume/monthly/');

// Setup configuration
$configuration = new \Scraper\Structure\Configuration();
$configuration->setTargetXPath('//table[@id="currencies-volume"]');
$configuration->setRowXPath('.//tbody/tr');
$configuration->setFields(
    [
      new \Scraper\Structure\TextField(
          [
              'name'  => 'Rank',
              'xpath' => './/td[1]',
          ]
      ),
      new \Scraper\Structure\TextField(
          [
              'name'  => 'Name',
              'xpath' => './/td[2]',
          ]
      ),
      new \Scraper\Structure\TextField(
          [
              'name'  => 'Symbol',
              'xpath' => './/td[3]',
          ]
      ),
      new \Scraper\Structure\TextField(
          [
              'name'  => 'Volume_1D',
              'xpath' => './/td[4]',
          ]
      ),
      new \Scraper\Structure\TextField(
          [
              'name'  => 'Volume_7D',
              'xpath' => './/td[5]',
          ]
      ),
      new \Scraper\Structure\TextField(
          [
              'name'  => 'Volume_30D',
              'xpath' => './/td[6]',
          ]
      ),
    ]
);

// Extract data
$extractor = new MultipleRowExtractor($crawler, $configuration);
$data = $extractor->extract();

print_r(array_slice($data, 0, 10));

Can you confirm this performance problem on your system? If so, what is causing such a large slowdown?
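
For anyone trying to reproduce the numbers, here is a minimal sketch of how the extraction step could be timed. The `timeIt` helper is hypothetical and not part of the scraper library. It uses only PHP's built-in `microtime`, and the workload is a stand-in so the snippet runs on its own.

```php
<?php

// Hypothetical helper (not part of the scraper library): run a callable
// and return its result together with the elapsed wall-clock seconds.
function timeIt(callable $fn): array
{
    $start = microtime(true);
    $result = $fn();

    return [$result, microtime(true) - $start];
}

// Stand-in workload so the sketch runs without the library or network.
// In the scripts above, the closure would instead be:
//   function () use ($extractor) { return $extractor->extract(); }
[$rows, $seconds] = timeIt(function () {
    return range(1, 1000);
});

printf("Extracted %d rows in %.3f s\n", count($rows), $seconds);
```

Wrapping only the `extract()` call should show how much of the total run time is spent in the library, as opposed to autoloading and printing, though it would not separate the page download from the XPath evaluation.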
