Sewing without threads: PHP Multi cURL

Non-blocking HTTP requests without multithreading

Posted by Rafael Malgor on April 12, 2020 · 10 mins read

Let's suppose that you have to write a script to download information through HTTP requests. Let's also suppose that, for whatever reason, you have to write it in PHP. (Maybe because it is cheap to host, or you are working with some legacy code, or maybe you like it; I won't judge.) Let's also suppose that you have to send several HTTP requests on each run of your script. Let's finally suppose that you cannot rely on third-party implementations like Guzzle (if you can, by all means do). With all those suppositions in place you are basically stuck with cURL. I say stuck because, even though it performs well and has stood the test of time, its interface is a bit of a mess.

The naïve approach

Just in case you haven't used cURL, let's see how it works. Performing an HTTP request is rather simple; you just need a few functions:

  • curl_init to create the handle.
  • curl_setopt to set parameters for the call.
  • curl_exec to actually fire the request.
  • curl_close to release the handle.

<?php
    /* create curl handle */
    $handle = curl_init();

    /* set the endpoint to query */
    curl_setopt($handle, CURLOPT_URL, "example.com");

    /* tell cURL that you want the response of the request as a string */
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);

    /* get the content of the response */
    $output = curl_exec($handle);

    /* free the resources */
    curl_close($handle);  

Nice, we have our request. Of course, you might need to set up some request headers and whatnot for the endpoint you are going to use, but that is beside the point.
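For the curious, here is a rough sketch of what that extra setup could look like. It passes several options at once with curl_setopt_array and checks whether curl_exec returned false. The helper names make_handle and fetch_once, the example header and the 10-second timeout are assumptions made for illustration; they are not part of the script above.

```php
<?php
/* Hypothetical helper: build a configured handle for one endpoint. */
function make_handle(string $url, array $headers = [], int $timeout = 10)
{
    $handle = curl_init();
    curl_setopt_array($handle, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,     /* return the body as a string */
        CURLOPT_HTTPHEADER     => $headers, /* e.g. ["Accept: application/json"] */
        CURLOPT_TIMEOUT        => $timeout, /* assumed limit, in seconds */
    ]);
    return $handle;
}

/* Hypothetical helper: run one blocking request and throw if it fails. */
function fetch_once(string $url, array $headers = []): string
{
    $handle = make_handle($url, $headers);
    $output = curl_exec($handle);

    if ($output === false) {
        /* curl_error gives a human-readable reason for the failure. */
        $error = curl_error($handle);
        curl_close($handle);
        throw new RuntimeException("Request to $url failed: $error");
    }

    curl_close($handle);
    return $output;
}
```

Calling fetch_once("http://example.com/nice_things", ["Accept: application/json"]) would then either return the body or throw, instead of silently handing back false.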

As we said we need to perform requests to several endpoints, so let's do that:

<?php
    $endpoints=array(
        "http://example.com/_things",
        "http://example.com/nice_things",
        "http://example.com/more_things",
        "http://example.com/extra_things",
        "http://example.com/free_things",
        "http://example.com/paid_things",
    );
    $outputs=[];
    /* We only create 1 handle and reuse it. */
    /* This improves performance IF, and only if, we are querying the same host each time. */
    $handle = curl_init();
    foreach($endpoints as $endpoint){

        /* Set the endpoint to query */
        curl_setopt($handle, CURLOPT_URL, $endpoint);

        /* Tell cURL that you want the response of the request as a string */
        curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);

        /* Get the content of the response */
        $outputs[] = curl_exec($handle);
    }

    /* free the resources */
    curl_close($handle);

Simple enough, and it should work fine. We run the script to test it and... it takes forever, or at least longer than we expected. That happens because curl_exec is a blocking function, so we are waiting for each request to finish before firing the next one. With this approach we are wasting time: in most cases the hardware on both ends should be able to handle many requests at the same time, so it would be great to have a way to do so.
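If you want to see the blocking for yourself, one rough way is to time every curl_exec call. The sketch below is only an illustration: timed_sequential_fetch is a hypothetical helper name, and it throws the responses away and keeps just the seconds spent waiting on each endpoint.

```php
<?php
/* Hypothetical helper: fetch each endpoint in turn and time every call. */
function timed_sequential_fetch(array $endpoints): array
{
    $timings = [];
    $handle = curl_init();
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);

    foreach ($endpoints as $endpoint) {
        curl_setopt($handle, CURLOPT_URL, $endpoint);

        /* hrtime(true) is a monotonic clock in nanoseconds (PHP 7.3+). */
        $start = hrtime(true);
        curl_exec($handle); /* blocks until this request is done */
        $timings[$endpoint] = (hrtime(true) - $start) / 1e9;
    }

    curl_close($handle);
    return $timings;
}
```

Because the requests run one after another, array_sum(timed_sequential_fetch($endpoints)) is roughly the total time the script spends waiting.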

Multi cURL: uglier and faster

PHP's cURL extension comes with multi-request capabilities built in. To use them we have to introduce a few extra functions:

  • curl_multi_init to create a group of handles.
  • curl_multi_add_handle to add a handle to the group.
  • curl_multi_exec to process the pending transfers of the group.
  • curl_multi_select to wait for activity on any handle.
  • curl_multi_getcontent to get the result of a handle.
  • curl_multi_remove_handle to remove a handle from the group.
  • curl_multi_close to release the resources of the group.

Let's see them in action:

<?php
$endpoints=array(
    "http://example.com/_things",
    "http://example.com/nice_things",
    "http://example.com/more_things",
    "http://example.com/extra_things",
    "http://example.com/free_things",
    "http://example.com/paid_things",
);

$outputs = [];
$handles = [];

/* Create the group handle. */
$curl_multi = curl_multi_init();

/* Similar to the previous example but we do not call curl_exec. */
foreach ($endpoints as $endpoint) {
 echo "Creating for $endpoint\n";

 /* Create the handle for the endpoint. */
 $handle = curl_init();

 /* set the endpoint to query. */
 curl_setopt($handle, CURLOPT_URL, $endpoint);

 /* tell cURL that you want the response of the request as a string. */
 curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);

 curl_multi_add_handle($curl_multi, $handle);

 /* Save each reference, we need it later. */
 $handles[] = $handle;
}

/* While we're still active, execute curl. */
$active = null;
do {
 $mrc = curl_multi_exec($curl_multi, $active);

 /* CURLM_CALL_MULTI_PERFORM means "keep calling curl_multi_exec" so we do. */
} while ($mrc == CURLM_CALL_MULTI_PERFORM);

while ($active && $mrc == CURLM_OK) {

 /* Wait for activity on any of the connections. */
 if (curl_multi_select($curl_multi) == -1) {

  /* -1 means the select call failed: pause briefly instead of spinning. */
  usleep(100);
 }

 /* Process whatever is pending on the connections. */
 do {
  $mrc = curl_multi_exec($curl_multi, $active);
 } while ($mrc == CURLM_CALL_MULTI_PERFORM);
}

foreach ($handles as $han) {

 /* Get the result from the handle. */
 $outputs[] = curl_multi_getcontent($han);

 /* Remove the handle from the group. */
 curl_multi_remove_handle($curl_multi, $han);

 /* Close the handle. */
 curl_close($han);
}

/* Close the group handle. */
curl_multi_close($curl_multi);

That is a lot to swallow, trust me, I know, but that is how it's done. You might be wondering what all those while loops are about. The outer loop, while ($active && $mrc == CURLM_OK), is basically saying "while there are pending requests to process", and while ($mrc == CURLM_CALL_MULTI_PERFORM) means "while cURL still has data to send or receive right away". Something very important to notice is the call to curl_multi_select in the check if (curl_multi_select($curl_multi) == -1). The function curl_multi_select blocks the program until there is something to be done on any of the connections, or until its timeout (one second by default) expires; a return value of -1 means the underlying select call failed. This blocking is good because it frees up the CPU until it is really needed.
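One way to tame all of this is to hide it behind a function. The sketch below is a variant under stated assumptions rather than the canonical form: fetch_all and $select_timeout are hypothetical names, and the two nested loops are folded into a single do/while. That folding relies on libcurl 7.20.0 and later never returning CURLM_CALL_MULTI_PERFORM, so check your version before borrowing it.

```php
<?php
/* Hypothetical helper: fetch all URLs concurrently, keyed like the input. */
function fetch_all(array $urls, float $select_timeout = 1.0): array
{
    $multi = curl_multi_init();
    $handles = [];

    foreach ($urls as $key => $url) {
        $handle = curl_init();
        curl_setopt($handle, CURLOPT_URL, $url);
        curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($multi, $handle);
        $handles[$key] = $handle;
    }

    do {
        /* Drive every transfer as far as it can go without blocking. */
        $status = curl_multi_exec($multi, $active);

        if ($active && $status === CURLM_OK) {
            /* Sleep until a connection is ready or the timeout expires. */
            if (curl_multi_select($multi, $select_timeout) === -1) {
                usleep(1000); /* select failed: back off instead of spinning */
            }
        }
    } while ($active && $status === CURLM_OK);

    $outputs = [];
    foreach ($handles as $key => $handle) {
        /* A transfer that failed simply yields an empty body here. */
        $outputs[$key] = curl_multi_getcontent($handle);
        curl_multi_remove_handle($multi, $handle);
        curl_close($handle);
    }

    curl_multi_close($multi);
    return $outputs;
}
```

With that in place, the whole script shrinks to $outputs = fetch_all($endpoints);, and each response sits under the same key as its endpoint.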

Googling around, you might find implementations that don't use curl_multi_select and do something like this instead:

do {
    curl_multi_exec($mh, $active);
} while ($active);
Warning! Do not try this at home!

That code will spin in a tight loop until all the calls are finished, pushing CPU usage to 100% while doing nothing useful. That is not good, so even though the code might look simpler, you shouldn't use that approach.