Day 18 - Crawling and Scraping HTTP Applications

In the PHP world, crawling web applications can be done via Guzzle or by using a web crawler like Goutte, which adds a nice DOM manipulation layer on top of Guzzle. Functional or acceptance tests for Web applications can be written via some other Open-Source projects like Behat or Codeception.

Blackfire provides an alternative Open-Source tool that sits between the web crawling and functional testing spaces: Blackfire Player. This is an exciting tool that lets developers define crawling scenarios, set expectations on responses, and of course run Blackfire assertions against your code. The main advantage of Blackfire Player over existing solutions is the balance it offers between native features and the simplicity of writing custom crawlers.

Warning

Blackfire Player is still a very young project that we have Open-Sourced recently to get feedback from the community. Please consider it as experimental.

The easiest way to use Blackfire Player is to download the phar file:

curl -OLsS http://get.blackfire.io/blackfire-player.phar

Then, use php blackfire-player.phar to run the player, or make it executable and move it to a directory in your PATH:

chmod 755 blackfire-player.phar
mv blackfire-player.phar .../bin/blackfire-player

Crawling an HTTP Application

Let's crawl the GitList application by defining a scenario in a gitlist.bkf file:

endpoint 'https://gitlist.demo.blackfire.io/'

scenario
    name 'GitList Scenario'

    visit url('/')
        name 'Homepage'
        expect status_code() == 200
        expect header('content_type') matches '/html/'
        expect css('footer').text() matches '/Powered by GitList/'

A scenario can have several steps (like visit, click or submit), each one having its own options.

The visit step requires a URL to hit, such as url('/').

Other options used in this example are:

  • expect: Some optional expectations on the HTTP response.
  • name: The step name.

Run the scenario via the blackfire-player command line tool:

blackfire-player run gitlist.bkf -vv

The -vv flag increases the verbosity of the output and adds some information about the HTTP interactions with the application:

Blackfire Player

Scenario  "GitList Scenario"
 "Homepage"
GET https://gitlist.demo.blackfire.io/
  OK

 OK  Scenarios  1  - Steps  1

If an expectation fails, the scenario stops and an error message is displayed; the command then exits with a status code of 64 instead of 0. For example, here is the output when the footer expectation is changed to match /Powered by PHP/:

Blackfire Player

Scenario  "GitList Scenario"
 "Homepage"
GET https://gitlist.demo.blackfire.io/
  Failure on step  "Homepage"  defined in ./.blackfire/24days.bkf at line  6
  └ Expectation "css('footer').text() matches '/Powered by PHP/'" failed.
    └ css("footer").text() = "Powered by GitList 0.5.0
"

 KO  Scenarios  1  - Steps  1  - Failures  1
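
Because the exit status separates success from failure, a CI script can branch on it. The following is a minimal sketch, not part of Blackfire Player: the describe_player_status helper is a hypothetical name, and only the values 0 and 64 come from the text above; any other status is reported as unexpected rather than interpreted.

```shell
#!/bin/sh
# Sketch: map the player's exit status to a CI-friendly message.
# Only 0 (success) and 64 (failed expectation) are taken from the text;
# describe_player_status is a hypothetical helper name.
describe_player_status() {
    case "$1" in
        0)  echo "scenario passed" ;;
        64) echo "scenario failed: an expectation was not met" ;;
        *)  echo "unexpected exit status: $1" ;;
    esac
}

# Typical use in a CI step (assumes blackfire-player is on the PATH):
#   blackfire-player run gitlist.bkf -vv
#   status=$?
#   describe_player_status "$status"
#   exit "$status"
```

Propagating the original status with exit keeps the build red when the scenario fails, while the message makes the log easier to scan.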

Use -vvv to make the logs very verbose. This flag adds debug information to the output.

Running Assertions

In addition to expectations, the player can also generate profiles and run assertions defined in the .blackfire.yml file by passing the --blackfire-env flag (all profiles are stored in a build):

blackfire-player run gitlist.bkf --blackfire-env=ENV_NAME_OR_UUID -vv

The output displays failed assertions:

Blackfire Player

Scenario  "GitList Scenario"
 "Homepage"
GET https://gitlist.demo.blackfire.io/

  Failure on step  "Homepage"  defined in ./.blackfire/24days.bkf at line  6
  └ Assertions failed:
      metrics.twig.display.count + metrics.twig.render.count < 5
Blackfire Report at https://blackfire.io/build-sets/2c44ba7d-139b-41ca-b843-a3d1e2763539

 KO  Scenarios  1  - Steps  1  - Failures  1

Now, override the endpoint to https://fix2-ijtxpsladv67o.eu.platform.sh/ via the --endpoint flag:

blackfire-player run gitlist.bkf \
--blackfire-env=ENV_NAME_OR_UUID \
--endpoint=https://fix2-ijtxpsladv67o.eu.platform.sh/ \
-vv

Blackfire assertions should pass and the scenario should end successfully.
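
Since --endpoint overrides the endpoint declared in the scenario file, the same scenario can be replayed against several deployments in one go. Here is a small sketch under stated assumptions: run_against_endpoints and the PLAYER variable are hypothetical names introduced for illustration, and only the run, --endpoint, and -vv arguments come from the commands above.

```shell
#!/bin/sh
# PLAYER is a hypothetical override hook: set PLAYER=echo for a dry run
# that prints each command instead of executing it.
PLAYER="${PLAYER:-blackfire-player}"

# Run one scenario file against each endpoint given as an argument.
# Succeeds only if every run exits with status 0.
run_against_endpoints() {
    scenario="$1"
    shift
    failed=0
    for endpoint in "$@"; do
        "$PLAYER" run "$scenario" --endpoint="$endpoint" -vv || failed=1
    done
    [ "$failed" -eq 0 ]
}

# Example (assumes blackfire-player is on the PATH):
#   run_against_endpoints gitlist.bkf \
#       https://gitlist.demo.blackfire.io/ \
#       https://fix2-ijtxpsladv67o.eu.platform.sh/
```

The loop keeps going after a failure so that every endpoint gets checked; add --blackfire-env to the command inside the loop if the assertions should run as well.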

When the --blackfire-env option is used (which is always the case when scenarios are run from our servers), each step is profiled automatically. To disable profiling for a step, set its blackfire setting to false:

visit url('/')
    name 'Homepage'
    blackfire false

Values Extraction

Now let's rewrite the scenario and remove the hardcoding of Twig by using variable extraction:

visit url('/')
    name 'Homepage'
    expect status_code() == 200
    expect header('content_type') matches '/html/'
    expect css('footer').text() matches '/Powered by GitList/'
    set repo_name css('.repository a').first().text()

click css('.repository a').first()
    name "First Project Page"
    expect status_code() == 200
    expect css('.breadcrumb li a').text() matches '/' ~ repo_name ~ '/'

The set option can be used to extract data from the HTTP response (the body should be HTML, XML, or JSON). The first argument is the variable name, the second is the value. Values can be any valid expressions evaluated against the HTTP response. Here, the name of the first repository listed on the homepage is extracted into the repo_name variable. This value is then used in the next step to check the breadcrumb on the project page.

Submitting Forms

Let's submit the search form as an additional step when on the Twig page:

endpoint 'https://gitlist.demo.blackfire.io/'

scenario
    name 'GitList Scenario'

    set query 'foo'

    # steps as above

    submit css('.form-search')
        name "Search"
        param query query
        expect css('table.tree td').count() > 10
        expect not (body() matches '/No results found./')

submit takes either a button (via form("button_name")) or, as above, a form (the GitList search form has no button). Note that the default value of the query variable is defined with the set option at the top of the scenario.

Variables can also be defined or overridden via the --variable CLI flag:

blackfire-player run gitlist.bkf --variable "query=bar"

Crawling APIs

Crawling APIs can be done with the exact same primitives. For JSON responses, use JSON paths in expressions:

visit url('/' ~ repo_name ~ '/network/1.x/0.json')
    name 'Network Ajax Request'
    expect status_code() == 200
    expect json('repo') == repo_name
    expect json('commitishPath') == '1.x'

The json() function extracts data from JSON responses by using JSON expressions (see JMESPath for their syntax).

Scraping Values

The css(), xpath(), and json() functions can also be used to scrape data out of HTTP responses via the set option:

visit url('/' ~ repo_name ~ '/network/1.x/0.json')
    name 'Network Ajax Request'
    expect status_code() == 200
    expect json('repo') == repo_name
    expect json('commitishPath') == '1.x'

    # commits.keys(@) is a JMESPath expression
    set commits json('commits.keys(@)')

Store a report of the execution with the extracted values via the --full-report flag:

blackfire-player run gitlist.bkf --variable "query=bar" --full-report > values.json

The values.json file contains all variables from the scenario run:

{
    "1": {
        "values": {
            "commits": [
                "db93842e80d758f6c74c3e1f1f088badb3fc84f0",
                "de15975cb3753e7325fffbb1cc0e13b814d2727f",
                "9d5e74602634cdd1e7899cddcbded01bc584f7a8",
                "093bf70d6ccb6033772dd7160354b822dcc87c26",
                "b7c124bae00ab0a019a084c6b99f1df023bcb996",
                "bf326dbb75d06a467a48e8dc4cf7261cd5fdf581",
                "d9b6333ae8dd2c8e3fd256e127548def0bc614c6",
                "e4bedf07d8d8418ff2537073022814ec365ffabe",
                "93ec90b857460c1e8c43b5d218ca072d0441856c",
                "aa7c53ae0a4c1e02931cb6eac034bc9e13191b5a",
                "b8a6e72414c5c8aa3ca913b06968dc2e35856f23",
                "0f5f9aaf2a72e178726b825f907dba76d74758fa",
                "8bb9b83e32d0659fe966cf10ddc9eedc8c9d629b",
                "355213ba6c4bf31af229d18c00741c5b788691e3",
                "d4e3adaaeeecb4c1d549747368f6f15ce6078428"
            ]
        },
        "error": null
    }
}
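
For a quick sanity check of such a report without a JSON parser, the extracted commit hashes can be counted with grep. This is only a sketch: count_commits is a hypothetical helper, and it assumes the pretty-printed layout shown above, with one quoted 40-character hash per line. It is a text match, not a real JSON parse.

```shell
#!/bin/sh
# Count the lines of a report file that contain a quoted 40-character
# hexadecimal string (a full commit hash).
# Assumes one hash per line, as in the pretty-printed report above.
count_commits() {
    grep -c -E '"[0-9a-f]{40}"' "$1"
}

# Example:
#   count_commits values.json
```

Run against the report shown above, this prints 15. For anything beyond a count, a proper JSON tool is the safer choice.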

Defining Scenarios in PHP

Scenarios can also be defined in plain PHP code. For example, the previous scenario could also be written like this:

<?php

require_once __DIR__.'/vendor/autoload.php';

use Blackfire\Client as BlackfireClient;
use Blackfire\ClientConfiguration;
use Blackfire\Player\Extension\BlackfireExtension;
use Blackfire\Player\Player;
use Blackfire\Player\Scenario;
use GuzzleHttp\Client as GuzzleClient;
use Monolog\Logger;
use Monolog\Handler\StreamHandler;

$scenario = new Scenario('GitList Scenario');
$scenario
    ->endpoint('https://gitlist.demo.blackfire.io/')
    ->value('query', 'foo')

    ->visit("url('/')")
    ->title("Homepage")
    ->expect("status_code() == 200")
    ->expect("header('content_type') matches '/html/'")
    ->expect("css('footer').text() matches '/Powered by GitList/'")
    ->extract("repo_name", "css('.repository a').first().text()")

    ->click("css('.repository a').first()")
    ->title("First Project Page")
    ->expect("status_code() == 200")
    ->expect("css('.breadcrumb li a').text() matches '/Twig/'")

    ->submit("css('.form-search')", ["query" => "query"])
    ->title("Search")
    ->expect("status_code() == 200")
    ->expect("css('table.tree td').count() > 10")
    ->expect("not (body() matches '/No results found./')")

    ->visit("url('/' ~ repo_name ~ '/network/1.x/0.json')")
    ->title("Network Ajax Request")
    ->expect("status_code() == 200")
    ->expect("json('repo') == repo_name")
    ->expect("json('commitishPath') == '1.x'")
    ->extract("commits", "json('commits.keys(@)')")
;

$config = new ClientConfiguration();
$config->setEnv('symfony.com');
$blackfire = new BlackfireClient($config);

$logger = new Logger('player');
$logger->pushHandler(new StreamHandler('php://stdout', Logger::DEBUG));

$guzzle = new GuzzleClient(['cookies' => true]);

$player = new Player($guzzle);
$player->addExtension(new BlackfireExtension($blackfire, $logger));
$player->setLogger($logger);
$result = $player->run($scenario);

print_r($result->getValues()->all());

$report = $result->getExtra()->get('blackfire_report');

Conclusion

Blackfire Player is a very powerful Open-Source library for crawling, testing, and scraping HTTP applications. We have barely scratched the surface of all its features:

  • Several scenarios can be defined in a .bkf file or in PHP;
  • Abstract scenarios to reuse common steps;
  • Delays between requests;
  • Conditional scenario execution based on extracted values;
  • etc.

You can read Blackfire Player's extensive documentation to learn more about all its features.

Like Blackfire Player, many other Open-Source libraries provide native integrations with Blackfire. The next chapter covers the main integrations and how you can help us add more.