Day 18 - Crawling and Scraping HTTP Applications

In the PHP world, web applications can be crawled with Guzzle or with a web crawler like Goutte, which adds a nice DOM manipulation layer on top of Guzzle. Functional or acceptance tests for web applications can be written with other Open-Source projects such as Behat or Codeception.

Blackfire provides an alternative Open-Source tool that sits between the web crawling and functional testing spaces: Blackfire Player. This is an exciting tool that lets developers define crawling scenarios, set expectations on responses, and of course run Blackfire assertions against their code. The main advantage of Blackfire Player over existing solutions is the balance it offers between native features and the simplicity of writing custom crawlers.

Warning

Blackfire Player is still a very young project that we have Open-Sourced recently to get feedback from the community. Please consider it as experimental.

The easiest way to use Blackfire Player is to download the phar file:

curl -OLsS http://get.blackfire.io/blackfire-player.phar

Then, use php blackfire-player.phar to run the player or make it executable and move it to a directory in your PATH:

chmod 755 blackfire-player.phar
mv blackfire-player.phar .../bin/blackfire-player

Crawling an HTTP Application

Let's crawl the GitList application by defining a scenario in a gitlist.yml file:

scenario:
    options:
        title: GitList Scenario
        endpoint: http://gitlist.demo.blackfire.io/

    steps:
        - title: "Homepage"
          visit: url('/')
          expect:
              - status_code() == 200
              - header('content_type') matches '/html/'
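              - css('footer').text() matches '/Powered by GitList/'

        - title: "Twig Project Page"
          visit: url('/Twig/')
          expect:
              - status_code() == 200
              - css('.breadcrumb li a').text() matches '/Twig/'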

A scenario is made of steps, and each step accepts several settings:

  • A mandatory action that triggers a request, visit being the simplest one (you can also click a link, submit a form, follow a redirection, and more);
  • expect: some optional expectations on the HTTP response.

At the scenario level, options defines global settings such as the scenario title or the default endpoint.

Run the scenario via the blackfire-player command line tool:

blackfire-player run gitlist.yml -vv

The -vv flag increases the verbosity of the output and adds some information about the HTTP interactions with the application:

[info] Starting scenario "GitList Scenario" (sent to client 0)
[info] Step 1: Homepage GET http://gitlist.demo.blackfire.io/
[info] Step 2: Twig Project Page GET http://gitlist.demo.blackfire.io/Twig/

If an expectation fails, the scenario is stopped and an error message is displayed. The command also exits with a status code of 1 instead of 0:

[info] Starting scenario "GitList Scenario" (sent to client 0)
[info] Step 1: Homepage GET http://gitlist.demo.blackfire.io/
[error] Expectation "css('footer').text() matches '/Powered by PHP/'" failed
[error] Scenario "GitList Scenario" ended with an error: Expectation "css('footer').text() matches '/Powered by PHP/'" failed
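
Because the exit status reflects the outcome, the player can gate a build script. The following is only a minimal sketch, not part of Blackfire Player: the check_scenario function and the PLAYER variable are hypothetical names, and it assumes blackfire-player is installed in your PATH as described above.

```shell
# Hypothetical helper: check_scenario and PLAYER are illustrative names,
# not part of Blackfire Player.
# PLAYER defaults to the command installed earlier in this chapter.
PLAYER="${PLAYER:-blackfire-player}"

check_scenario() {
    # $1 is the scenario file. The player exits with 0 when the run
    # succeeds and with a non-zero status when an expectation fails.
    if "$PLAYER" run "$1" -vv; then
        echo "scenario passed"
    else
        status=$?
        echo "scenario failed (exit status $status)" >&2
        return "$status"
    fi
}
```

Calling check_scenario gitlist.yml || exit 1 from a build script would then stop the build as soon as a scenario fails.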

Use -vvv to make the logs very verbose. This flag will add debug information to the output, including your expectations:

[debug] Concurrency set to "1"
[info] Starting scenario "GitList Scenario" (sent to client 0)
[info] Step 1: Homepage GET http://gitlist.demo.blackfire.io/
[debug] Expectation "status_code() == 200" pass
[debug] Expectation "header('content_type') matches '/html/'" pass
[debug] Expectation "css('footer').text() matches '/Powered by GitList/'" pass
[info] Step 2: Twig Project Page GET http://gitlist.demo.blackfire.io/Twig/
[debug] Expectation "status_code() == 200" pass
[debug] Expectation "css('.breadcrumb li a').text() matches '/Twig/'" pass

Running Assertions

In addition to checking expectations, the player can generate profiles and run the assertions defined in the .blackfire.yml file when the --blackfire flag is passed (all profiles are stored in a build):

blackfire-player run gitlist.yml --blackfire=ENV_NAME_OR_UUID -vv

The output displays failed assertions:

[info] Starting scenario "Blackfire Player Scenario" (sent to client 0)
[info] Step 1: Homepage GET http://gitlist.demo.blackfire.io/
[info] Step 2: First Project Page GET http://gitlist.demo.blackfire.io/Twig/
[info] Step 3: Search POST http://gitlist.demo.blackfire.io/Twig/tree/1.x/search
[info] Step 4: Network Ajax Request GET http://gitlist.demo.blackfire.io/Twig/network/1.x/0.json
[error] Assertion "main.wall_time 92.1ms < 50ms" failed
[error] Report "Blackfire Player Scenario" failed
[info] Report "Blackfire Player Scenario" URL: https://blackfire.io/build/e1dac5c2-13fb-4f30-92d4-9499c84b88c5

Now, override the endpoint to http://fix2-ijtxpsladv67o.eu.platform.sh/ via the --endpoint flag:

blackfire-player run gitlist.yml \
--blackfire=ENV_NAME_OR_UUID \
--endpoint=http://fix2-ijtxpsladv67o.eu.platform.sh/ \
-vv

Blackfire assertions should pass and the scenario should end successfully.
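
To compare a reference deployment with a candidate fix, the same scenario can be run against each endpoint in turn. This is a sketch under stated assumptions: run_on_endpoints and PLAYER are hypothetical names, and only the run command and the --blackfire and --endpoint flags shown above are used.

```shell
# Hypothetical helper; run_on_endpoints and PLAYER are illustrative names.
PLAYER="${PLAYER:-blackfire-player}"

run_on_endpoints() {
    # $1: scenario file, $2: Blackfire environment name or UUID,
    # remaining arguments: the endpoints to test.
    scenario=$1
    bf_env=$2
    shift 2
    failures=0
    for endpoint in "$@"; do
        # The player output is discarded to keep the summary short;
        # drop the redirection to see the logs and the report URL.
        if "$PLAYER" run "$scenario" --blackfire="$bf_env" \
            --endpoint="$endpoint" >/dev/null 2>&1; then
            echo "PASS $endpoint"
        else
            echo "FAIL $endpoint"
            failures=$((failures + 1))
        fi
    done
    # Succeed only if every endpoint passed.
    [ "$failures" -eq 0 ]
}
```

For instance, run_on_endpoints gitlist.yml ENV_NAME_OR_UUID followed by the two endpoints used in this chapter would print one PASS or FAIL line per endpoint.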

By default, only the last request of each step is automatically profiled. To force a profile or to disable Blackfire, use the blackfire setting:

- title: "Homepage"
  visit: url('/')
  blackfire: false

Values Extraction

Now let's rewrite the scenario and remove the hardcoding of Twig by using variable extraction:

steps:
    - title: "Homepage"
      visit: url('/')
      expect:
          - status_code() == 200
          - header('content_type') matches '/html/'
          - css('footer').text() matches '/Powered by GitList/'
      extract:
          repo_name: css('.repository a').first().text()

    - title: "First Project Page"
      click: css('.repository a').first()
      expect:
          - status_code() == 200
          - css('.breadcrumb li a').text() matches '/' ~ repo_name ~ '/'

The extract setting extracts data from the HTTP response (the body should be HTML, XML, or JSON). Keys are variable names and values can be any valid expression evaluated against the HTTP response. Here, the name of the first repository listed on the homepage is extracted into the repo_name variable. This value is then used in the next step to check the breadcrumb on the project page.

Submitting Forms

Let's submit the search form as an additional step when on the Twig page:

options:
    title: GitList Scenario
    endpoint: http://gitlist.demo.blackfire.io/
    variables:
        query: foo

steps:
    # steps as above

    - title: "Search"
      submit: css('.form-search')
      params:
          query: query
      expect:
          - status_code() == 200
          - css('table.tree td').count() > 10
          - not (body() matches '/No results found./')

submit takes a button (via form("button_name")) or a form like above (as the GitList search form does not have a button anyway). Notice that we have defined the default value of the query variable in the variables option.

Variables can also be defined or overridden via the --variables CLI flag:

blackfire-player run gitlist.yml --variables "query=bar"
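
Overriding variables from the command line makes it easy to replay one scenario with several inputs. A minimal sketch, assuming blackfire-player is in your PATH; run_with_queries and PLAYER are hypothetical names introduced for illustration.

```shell
# Hypothetical helper; run_with_queries and PLAYER are illustrative names.
PLAYER="${PLAYER:-blackfire-player}"

run_with_queries() {
    # $1: scenario file; remaining arguments: values for the query variable.
    scenario=$1
    shift
    for query in "$@"; do
        echo "Running $scenario with query=$query"
        # Stop at the first failing run and propagate its exit status.
        "$PLAYER" run "$scenario" --variables "query=$query" || return $?
    done
}
```

A call such as run_with_queries gitlist.yml foo bar would run the scenario twice, once per search term.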

Crawling APIs

Crawling APIs can be done with the exact same primitives. For JSON responses, use JSON paths in expressions:

- title: "Network Ajax Request"
  visit: url('/' ~ repo_name ~ '/network/1.x/0.json')
  expect:
      - status_code() == 200
      - json('repo') == repo_name
      - json('commitishPath') == '1.x'

The json() function extracts data from JSON responses by using JSON expressions (see JMESPath for their syntax).

Scraping Values

The css(), xpath(), and json() functions can also be used to scrape data out of HTTP responses via the extract entry:

- title: "Network Ajax Request"
  visit: url('/' ~ repo_name ~ '/network/1.x/0.json')
  expect:
      - status_code() == 200
      - json('repo') == repo_name
      - json('commitishPath') == '1.x'
  extract:
      # commits.keys(@) is a JMESPath expression
      commits: json('commits.keys(@)')

Store extracted values via the --output flag:

blackfire-player run gitlist.yml --variables "query=bar" --output values.json

The values.json file contains all variables from the scenario run:

[
    {
        "query": "foo",
        "repo_name": "Twig",
        "commits": [
            "db93842e80d758f6c74c3e1f1f088badb3fc84f0",
            "de15975cb3753e7325fffbb1cc0e13b814d2727f",
            "9d5e74602634cdd1e7899cddcbded01bc584f7a8",
            "093bf70d6ccb6033772dd7160354b822dcc87c26",
            "b7c124bae00ab0a019a084c6b99f1df023bcb996",
            "bf326dbb75d06a467a48e8dc4cf7261cd5fdf581",
            "d9b6333ae8dd2c8e3fd256e127548def0bc614c6",
            "e4bedf07d8d8418ff2537073022814ec365ffabe",
            "93ec90b857460c1e8c43b5d218ca072d0441856c",
            "aa7c53ae0a4c1e02931cb6eac034bc9e13191b5a",
            "b8a6e72414c5c8aa3ca913b06968dc2e35856f23",
            "0f5f9aaf2a72e178726b825f907dba76d74758fa",
            "8bb9b83e32d0659fe966cf10ddc9eedc8c9d629b",
            "355213ba6c4bf31af229d18c00741c5b788691e3",
            "d4e3adaaeeecb4c1d549747368f6f15ce6078428"
        ]
    }
]
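
The stored values can then feed other tools. The sketch below assumes that the jq command-line JSON processor is installed (it is not part of Blackfire Player) and that the file has the structure shown above; list_commits and count_commits are hypothetical names.

```shell
# Hypothetical helpers; they assume the jq JSON processor is installed.

list_commits() {
    # $1: path to the file written by --output.
    # Take the first entry of the top-level array and print each item
    # of its "commits" array on its own line.
    jq -r '.[0].commits[]' "$1"
}

count_commits() {
    # Print the number of items in the "commits" array of the first entry.
    jq '.[0].commits | length' "$1"
}
```

With the values.json sample above, count_commits values.json would print 15.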

Defining Scenarios in PHP

Scenarios can also be defined in plain PHP code. For example, the previous scenario could also be written like this:

<?php

require_once __DIR__.'/vendor/autoload.php';

use Blackfire\Client as BlackfireClient;
use Blackfire\ClientConfiguration;
use Blackfire\Player\Extension\BlackfireExtension;
use Blackfire\Player\Player;
use Blackfire\Player\Scenario;
use GuzzleHttp\Client as GuzzleClient;
use Monolog\Logger;
use Monolog\Handler\StreamHandler;

$scenario = new Scenario('GitList Scenario');
$scenario
    ->endpoint('http://gitlist.demo.blackfire.io/')
    ->value('query', 'foo')

    ->visit("url('/')")
    ->title("Homepage")
    ->expect("status_code() == 200")
    ->expect("header('content_type') matches '/html/'")
    ->expect("css('footer').text() matches '/Powered by GitList/'")
    ->extract("repo_name", "css('.repository a').first().text()")

    ->click("css('.repository a').first()")
    ->title("First Project Page")
    ->expect("status_code() == 200")
    ->expect("css('.breadcrumb li a').text() matches '/Twig/'")

    ->submit("css('.form-search')", ["query" => "query"])
    ->title("Search")
    ->expect("status_code() == 200")
    ->expect("css('table.tree td').count() > 10")
    ->expect("not (body() matches '/No results found./')")

    ->visit("url('/' ~ repo_name ~ '/network/1.x/0.json')")
    ->title("Network Ajax Request")
    ->expect("status_code() == 200")
    ->expect("json('repo') == repo_name")
    ->expect("json('commitishPath') == '1.x'")
    ->extract("commits", "json('commits.keys(@)')")
;

$config = new ClientConfiguration();
$config->setEnv('symfony.com');
$blackfire = new BlackfireClient($config);

$logger = new Logger('player');
$logger->pushHandler(new StreamHandler('php://stdout', Logger::DEBUG));

$guzzle = new GuzzleClient(['cookies' => true]);

$player = new Player($guzzle);
$player->addExtension(new BlackfireExtension($blackfire, $logger));
$player->setLogger($logger);
$result = $player->run($scenario);

print_r($result->getValues()->all());

$report = $result->getExtra()->get('blackfire_report');

Conclusion

Blackfire Player is a very powerful Open-Source library for crawling, testing, and scraping HTTP applications. We have barely scratched the surface of all its features:

  • Several scenarios can be defined in a YAML file or in PHP;
  • Scenarios can be run concurrently;
  • Abstract scenarios let you reuse common steps;
  • Delays can be added between requests;
  • Scenarios can be executed conditionally, based on extracted values;
  • etc.

You can read Blackfire Player's extensive documentation to learn more about all its features.

Like Blackfire Player, many other Open-Source libraries provide native integrations with Blackfire. The next chapter covers the main integrations and how you can help us add more.