This repository was archived by the owner on May 31, 2021. It is now read-only.

Commit 55d3a5b (parent 81606af)

Add description for aiohttp example.

1 file changed: webscraper.rst (76 additions, 7 deletions)
@@ -1,6 +1,6 @@
-+++++++++++++++++++++++++++++
-Larger Example - Web Scraping
-+++++++++++++++++++++++++++++
+++++++++++++
+Web Scraping
+++++++++++++
 
 Web scraping means downloading multiple web pages, often from different
 servers.
@@ -21,7 +21,7 @@ This is a very simple web server. (See below for the code.)
 Its only purpose is to wait for a given amount of time.
 Test it by running it from the command line::
 
-    python simple_server.py
+    $ python simple_server.py
 
 It will answer like this::
 
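The `simple_server.py` script itself is not part of this hunk. A rough, hypothetical stand-in for such a delay server can be written with plain `asyncio`; this sketch speaks a simple line protocol instead of HTTP, and the names (`handle`, the reply text) are made up for illustration:

```python
import asyncio

async def handle(reader, writer):
    # Read the requested waiting time, sleep that long, then answer.
    wait = float((await reader.readline()).decode())
    await asyncio.sleep(wait)
    writer.write('waited for {:4.2f} seconds\n'.format(wait).encode())
    await writer.drain()
    writer.close()

async def main():
    # Port 0 lets the OS pick a free port.
    server = await asyncio.start_server(handle, '127.0.0.1', 0)
    port = server.sockets[0].getsockname()[1]
    # Act as our own client: ask the server to wait 0.1 seconds.
    reader, writer = await asyncio.open_connection('127.0.0.1', port)
    writer.write(b'0.1\n')
    await writer.drain()
    answer = (await reader.readline()).decode()
    writer.close()
    server.close()
    await server.wait_closed()
    return answer

answer = asyncio.run(main())
```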
@@ -125,7 +125,7 @@ provides the elapsed run time.
 
 Finally, we can run our client::
 
-    python synchronous_client.py
+    $ python synchronous_client.py
 
 and get this output::
 
@@ -254,7 +254,7 @@ This means, we wait until each pages has been retrieved before asking for
 the next.
 Let's run it from the command-line to see what happens::
 
-    async_client_blocking.py
+    $ async_client_blocking.py
     It took 11.06 seconds for a total waiting time of 11.00.
     Waited for 1.00 seconds.
     That's all.
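The blocking behaviour this hunk refers to, awaiting each page before even requesting the next, can be reproduced in isolation. This is a sketch, not the tutorial's actual client: `asyncio.sleep` stands in for the real page retrieval, and this dummy `fetch_page` takes only a waiting time:

```python
import asyncio
import time

async def fetch_page(wait):
    # Dummy stand-in: just sleep instead of talking to a server.
    await asyncio.sleep(wait)
    return 'Waited for {:4.2f} seconds.'.format(wait)

async def get_multiple_pages(waits):
    pages = []
    for wait in waits:
        # Awaiting inside the loop means each "page" must arrive
        # before the next request is even started.
        pages.append(await fetch_page(wait))
    return pages

start = time.perf_counter()
pages = asyncio.run(get_multiple_pages([0.1, 0.2, 0.3]))
duration = time.perf_counter() - start
# duration is roughly the *sum* of all waits (about 0.6 seconds)
```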
@@ -320,7 +320,7 @@ So, for a list with 100 tasks it would mean:
 
 Let's see if we got any faster::
 
-    async_client_nonblocking.py
+    $ async_client_nonblocking.py
     It took 5.08 seconds for a total waiting time of 11.00.
     Waited for 1.00 seconds.
     That's all.
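The speed-up shown in this hunk comes from starting all requests together with `asyncio.gather` and only then waiting for the results. A minimal sketch of the pattern, again with `asyncio.sleep` standing in for real page retrieval and a dummy `fetch_page` that takes only a waiting time:

```python
import asyncio
import time

async def fetch_page(wait):
    # Dummy stand-in: just sleep instead of talking to a server.
    await asyncio.sleep(wait)
    return 'Waited for {:4.2f} seconds.'.format(wait)

async def get_multiple_pages(waits):
    # gather() runs all coroutines concurrently and returns their
    # results in the order the coroutines were passed in.
    return await asyncio.gather(*(fetch_page(wait) for wait in waits))

start = time.perf_counter()
pages = asyncio.run(get_multiple_pages([0.1, 0.2, 0.3]))
duration = time.perf_counter() - start
# duration is roughly the *longest* single wait (about 0.3 seconds),
# not the sum of all waits
```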
@@ -355,7 +355,76 @@ Try numbers greater than five.
 High-Level Approach with ``aiohttp``
 ------------------------------------
 
+The library aiohttp_ allows you to write HTTP client and server applications
+using a high-level approach.
+Install it with::
+
+    $ pip install aiohttp
+
+
+.. _aiohttp: http://aiohttp.readthedocs.io/en/stable/
+
+The whole program looks like this:
+
+.. literalinclude:: examples/aiohttp_client.py
+
+The function to get one page is asynchronous, because of the ``async def``:
+
 .. literalinclude:: examples/aiohttp_client.py
+    :language: python
+    :start-after: import aiohttp
+    :end-before: def get_multiple_pages
+
+The arguments are the same as for the previous function to retrieve one page,
+plus the additional argument ``session``.
+The first task is to construct the full URL as a string from the given
+host, port, and the desired waiting time.
+
+We use a timeout of 10 seconds.
+If it takes longer than the given time to retrieve a page, the program
+raises a ``TimeoutError``.
+Therefore, to make this more robust, you might want to catch this error and
+handle it appropriately.
+
+The ``async with`` provides a context manager that gives us a response.
+After checking that the status is ``200``, which means that all is alright,
+we need to ``await`` again to return the body of the page, using the method
+``text()`` on the response.
+
+This is the interesting part of ``get_multiple_pages()``:
+
+.. code-block:: python
+
+    with closing(asyncio.get_event_loop()) as loop:
+        with aiohttp.ClientSession(loop=loop) as session:
+            for wait in waits:
+                tasks.append(fetch_page(session, host, port, wait))
+            pages = loop.run_until_complete(asyncio.gather(*tasks))
+
+It is very similar to the code in the example of the time-saving implementation
+with ``asyncio``.
+The only difference is the opened client session, which is handed over
+to ``fetch_page()`` as the first argument.
+
+Finally, we run this program::
+
+    $ python aiohttp_client.py
+    It took 5.04 seconds for a total waiting time of 11.00.
+    Waited for 1.00 seconds.
+    That's all.
 
+    Waited for 5.00 seconds.
+    That's all.
+
+    Waited for 3.00 seconds.
+    That's all.
+
+    Waited for 2.00 seconds.
+    That's all.
 
+It also takes about five seconds and gives the same output as our version
+before.
+But the implementation for getting a single page is much simpler and takes
+care of the encoding and other aspects not mentioned here.
 

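The commit's advice to catch the ``TimeoutError`` and handle it appropriately can be illustrated without aiohttp, using ``asyncio.wait_for``. This is a hedged sketch, not the tutorial's code: ``asyncio.sleep`` stands in for retrieving a slow page, and ``fetch_page`` here is a made-up helper:

```python
import asyncio

async def fetch_page(wait, timeout):
    # asyncio.sleep() stands in for retrieving a (possibly slow) page.
    try:
        await asyncio.wait_for(asyncio.sleep(wait), timeout=timeout)
        return 'Waited for {:4.2f} seconds.'.format(wait)
    except asyncio.TimeoutError:
        # Handle the timeout instead of letting it abort the program.
        return 'Timed out after {:4.2f} seconds.'.format(timeout)

fast = asyncio.run(fetch_page(0.01, timeout=0.5))   # finishes in time
slow = asyncio.run(fetch_page(0.5, timeout=0.05))   # hits the timeout
```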