Let's suppose that you have to write a script to download information through HTTP requests. Let's also suppose that, for whatever reason, you have to write it in PHP (maybe because it is cheap to host, or you are working with some legacy code, or maybe you like it; I won't judge). Let's also suppose that you have to send several HTTP requests on each run of your script. Let's finally suppose that you cannot rely on third-party implementations like Guzzle (if you can, by all means do). With all those suppositions in place, you are basically stuck with cURL. I say stuck because, even though it performs well and has stood the test of time, its interface is a bit of a mess.
Just in case you haven't used cURL, let's see how it works. Performing an HTTP request is rather simple; you just need a few functions:
curl_init to create the handle.
curl_setopt to set parameters for the call.
curl_exec to actually fire the request.
curl_close to release the handle.
<?php
/* create curl handle */
$handle = curl_init();
/* set the endpoint to query */
curl_setopt($handle, CURLOPT_URL, "example.com");
/* tell cURL that you want the response of the request as a string */
curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);
/* get the content of the response */
$output = curl_exec($handle);
/* free the resources */
curl_close($handle);
Nice, we have our request. Of course, you might need to set up some request headers and whatnot for the endpoint you are going to use, but that is beside the point.
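For the curious, here is a rough sketch of what that extra setup could look like. The function name fetch_one, the example header and the ten-second timeout are assumptions made up for illustration; they are not part of the original script.

```php
<?php
/**
 * Hypothetical helper: fetch a single URL with a few common options
 * and basic error reporting.
 */
function fetch_one(string $url, array $headers = [], int $timeoutSeconds = 10): string
{
    $handle = curl_init();
    /* curl_setopt_array sets several options in one call. */
    curl_setopt_array($handle, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPHEADER     => $headers,        // e.g. ["Accept: application/json"]
        CURLOPT_TIMEOUT        => $timeoutSeconds, // give up on slow endpoints
    ]);

    /* With CURLOPT_RETURNTRANSFER set, curl_exec returns false on failure. */
    $output = curl_exec($handle);
    if ($output === false) {
        throw new RuntimeException("Request to $url failed: " . curl_error($handle));
    }

    /* The handle is released when it goes out of scope. */
    return $output;
}
```

You would call it like fetch_one("http://example.com/nice_things", ["Accept: application/json"]) and wrap the call in a try/catch if a failed request should not stop the script.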
As we said we need to perform requests to several endpoints, so let's do that:
<?php
$endpoints = array(
    "http://example.com/_things",
    "http://example.com/nice_things",
    "http://example.com/more_things",
    "http://example.com/extra_things",
    "http://example.com/free_things",
    "http://example.com/paid_things",
);
$outputs = [];
/* We only create one handle and reuse it. */
/* This improves performance if, and only if, we are querying the same host each time, because cURL can then reuse the connection. */
$handle = curl_init();
foreach ($endpoints as $endpoint) {
    /* Set the endpoint to query. */
    curl_setopt($handle, CURLOPT_URL, $endpoint);
    /* Tell cURL that you want the response of the request as a string. */
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);
    /* Get the content of the response. */
    $outputs[] = curl_exec($handle);
}
/* Free the resources. */
curl_close($handle);
Simple enough, and it should work fine. We run the script to test it and... it takes forever, or at least longer than we expected.
That is happening because curl_exec is a blocking function, so we are waiting for each request to finish before firing the next one. With this approach we are wasting time: in most cases the hardware on both ends should be able to handle many requests at the same time, so it would be great to have a way to do so.
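To put a number on the wasted time, here is a back-of-the-envelope sketch. The response times are made-up values, and the concurrent estimate is idealized (it ignores bandwidth and server limits), so take it as an illustration rather than a benchmark.

```php
<?php
/* Hypothetical response times in seconds, one per endpoint (made-up numbers). */
$latencies = [0.8, 1.2, 0.5, 2.0, 0.9, 1.1];

/* Blocking requests run one after another, so the waits add up. */
$sequential = array_sum($latencies);

/* Requests that are in flight at the same time finish when the slowest one does. */
$concurrent = max($latencies);

printf("sequential: %.1fs, concurrent: about %.1fs\n", $sequential, $concurrent);
/* prints "sequential: 6.5s, concurrent: about 2.0s" */
```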
The PHP cURL extension comes with multi-request capabilities built in. To use them we have to introduce a few extra functions:
curl_multi_init to create a group of handles.
curl_multi_add_handle to add a handle to the group.
curl_multi_exec to process the pending actions on the handles of the group.
curl_multi_select to wait for activity on any handle.
curl_multi_getcontent to get the result of a handle.
curl_multi_remove_handle to remove a handle from the group.
curl_multi_close to release the resources of the group.
Let's see them in action:
<?php
$endpoints = array(
    "http://example.com/_things",
    "http://example.com/nice_things",
    "http://example.com/more_things",
    "http://example.com/extra_things",
    "http://example.com/free_things",
    "http://example.com/paid_things",
);
$outputs = [];
$handles = [];
/* Create the group handle. */
$curl_multi = curl_multi_init();
/* Similar to the previous example but we do not call curl_exec. */
foreach ($endpoints as $endpoint) {
    echo "Creating for $endpoint\n";
    /* Create the handle for the endpoint. */
    $handle = curl_init();
    /* Set the endpoint to query. */
    curl_setopt($handle, CURLOPT_URL, $endpoint);
    /* Tell cURL that you want the response of the request as a string. */
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);
    curl_multi_add_handle($curl_multi, $handle);
    /* Save each reference, we need it later. */
    $handles[] = $handle;
}
/* While we're still active, execute curl. */
$active = null;
do {
    $mrc = curl_multi_exec($curl_multi, $active);
    /* CURLM_CALL_MULTI_PERFORM means "keep calling curl_multi_exec" so we do. */
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
while ($active && $mrc == CURLM_OK) {
    /* Wait for activity on any of the connections. */
    if (curl_multi_select($curl_multi) == -1) {
        /* select failed: pause briefly instead of spinning, then still call multi_exec below. */
        usleep(1000);
    }
    /* Let cURL process whatever is pending on the connections. */
    do {
        $mrc = curl_multi_exec($curl_multi, $active);
    } while ($mrc == CURLM_CALL_MULTI_PERFORM);
}
foreach ($handles as $han) {
    /* Get the result from the handle. */
    $outputs[] = curl_multi_getcontent($han);
    /* Remove the handle from the group. */
    curl_multi_remove_handle($curl_multi, $han);
    /* Close the handle. */
    curl_close($han);
}
/* Close the group handle. */
curl_multi_close($curl_multi);
That is a lot to swallow, trust me, I know, but that is how it's done.
You might be wondering what all those while loops are about. The outer loop, while ($active && $mrc == CURLM_OK), is basically saying "while there are pending requests to process", and while ($mrc == CURLM_CALL_MULTI_PERFORM) means "while cURL asks to be called again right away because there is still data to send or receive". (libcurl 7.20.0 and later never return CURLM_CALL_MULTI_PERFORM, so on current versions those inner loops run only once per pass.)
Something very important to notice is the check if (curl_multi_select($curl_multi) == -1), which is telling the program to wait if there is really nothing to do. The function curl_multi_select blocks the program until there is something to be done on any of the connections, or until its timeout (one second by default) expires. This is good because it frees up the CPU until it is really needed.
Googling around, you might find implementations that don't use curl_multi_select and do something like this instead:
do {
    curl_multi_exec($mh, $active);
} while ($active);
That code will spin in a busy loop until all the calls are finished, pushing CPU usage to 100% while doing nothing useful. That is not good, so even though the code looks simpler, you shouldn't use that approach.
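If you want the short loop without the busy waiting, one possible shape is sketched below: a single loop that calls curl_multi_select whenever transfers are still active. The function name fetch_all, its parameters and the choice to return null for failed transfers are my own assumptions for illustration, not something cURL prescribes.

```php
<?php
/**
 * Hypothetical helper: fetch several URLs concurrently.
 * Returns an array with the same keys as $urls; each value is the
 * response body, or null if cURL reported an error for that transfer.
 */
function fetch_all(array $urls, float $selectTimeout = 1.0): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $handle = curl_init();
        curl_setopt($handle, CURLOPT_URL, $url);
        curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($multi, $handle);
        $handles[$key] = $handle;
    }

    do {
        $status = curl_multi_exec($multi, $active);
        if ($active && curl_multi_select($multi, $selectTimeout) === -1) {
            /* select failed: pause briefly so the loop does not spin. */
            usleep(1000);
        }
    } while ($active && $status === CURLM_OK);

    /* Ask cURL which transfers ended with an error. */
    $failed = [];
    while (($info = curl_multi_info_read($multi)) !== false) {
        if ($info['msg'] === CURLMSG_DONE && $info['result'] !== CURLE_OK) {
            $key = array_search($info['handle'], $handles, true);
            if ($key !== false) {
                $failed[$key] = $info['result'];
            }
        }
    }

    $results = [];
    foreach ($handles as $key => $handle) {
        $results[$key] = array_key_exists($key, $failed)
            ? null
            : curl_multi_getcontent($handle);
        curl_multi_remove_handle($multi, $handle);
    }
    curl_multi_close($multi);

    return $results;
}
```

Calling fetch_all($endpoints) with the list from the earlier examples should give you one entry per endpoint, in the same order, while the CPU sleeps inside curl_multi_select between bursts of activity.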