++++++++++++
Web Scraping
++++++++++++

Web scraping means downloading multiple web pages, often from different
servers.
This is a very simple web server. (See below for the code.)
Its only purpose is to wait for a given amount of time.
Test it by running it from the command line::

    $ python simple_server.py

It will answer like this::

Finally, we can run our client::

    $ python synchronous_client.py

and get this output::

This means we wait until each page has been retrieved before asking for
the next.
Let's run it from the command line to see what happens::

    $ python async_client_blocking.py
    It took 11.06 seconds for a total waiting time of 11.00.
    Waited for 1.00 seconds.
    That's all.

Let's see if we got any faster::

    $ python async_client_nonblocking.py
    It took 5.08 seconds for a total waiting time of 11.00.
    Waited for 1.00 seconds.
    That's all.

Try numbers greater than five.

High-Level Approach with ``aiohttp``
------------------------------------

The library aiohttp_ allows you to write HTTP client and server applications
using a high-level approach.
Install it with::

    $ pip install aiohttp

.. _aiohttp: http://aiohttp.readthedocs.io/en/stable/

The whole program looks like this:

.. literalinclude:: examples/aiohttp_client.py

The function to get one page is asynchronous because of the ``async def``:

.. literalinclude:: examples/aiohttp_client.py
    :language: python
    :start-after: import aiohttp
    :end-before: def get_multiple_pages

The arguments are the same as for the previous function to retrieve one page,
plus the additional argument ``session``.
The first task is to construct the full URL as a string from the given
host, port, and the desired waiting time.

We use a timeout of 10 seconds.
If it takes longer than the given time to retrieve a page, the program
raises a ``TimeoutError``.
Therefore, to make this more robust, you might want to catch this error and
handle it appropriately.

The ``async with`` provides a context manager that gives us a response.
After checking that the status is ``200``, which means that all is alright,
we need to ``await`` again to return the body of the page, using the method
``text()`` on the response.
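
The timeout handling can be tried in isolation with the standard library
alone.
This sketch (the coroutine ``slow_page()`` is invented for illustration)
uses ``asyncio.wait_for()``, which raises ``asyncio.TimeoutError`` just like
a timed-out download, and catches the error instead of crashing:

```python
import asyncio


async def slow_page():
    """Stand-in for a page that takes too long to arrive."""
    await asyncio.sleep(2)
    return '<html>too late</html>'


async def fetch_with_timeout():
    try:
        # give up after 0.1 seconds instead of the 10 seconds in the example
        return await asyncio.wait_for(slow_page(), timeout=0.1)
    except asyncio.TimeoutError:
        # handle the error instead of letting the program crash
        return None


result = asyncio.run(fetch_with_timeout())
print(result)  # None
```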

This is the interesting part of ``get_multiple_pages()``:

.. code-block:: python

    with closing(asyncio.get_event_loop()) as loop:
        with aiohttp.ClientSession(loop=loop) as session:
            for wait in waits:
                tasks.append(fetch_page(session, host, port, wait))
            pages = loop.run_until_complete(asyncio.gather(*tasks))

It is very similar to the code in the example of the time-saving
implementation with ``asyncio``.
The only difference is the opened client session, which is handed over
to ``fetch_page()`` as the first argument.
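
The ``closing()`` pattern from this snippet can be exercised with a trivial
coroutine in place of the aiohttp calls (``double()`` and ``run_tasks()``
are made up for illustration).
``contextlib.closing()`` guarantees that ``loop.close()`` is called when the
``with`` block is left, even if an error occurs:

```python
import asyncio
from contextlib import closing


async def double(n):
    """Trivial stand-in for fetch_page()."""
    await asyncio.sleep(0)
    return n * 2


async def run_tasks(numbers):
    tasks = [double(n) for n in numbers]
    return await asyncio.gather(*tasks)


# closing() calls loop.close() on exit, even after an exception
with closing(asyncio.new_event_loop()) as loop:
    pages = loop.run_until_complete(run_tasks([1, 2, 3]))

print(pages)             # [2, 4, 6]
print(loop.is_closed())  # True
```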

Finally, we run this program::

    $ python aiohttp_client.py
    It took 5.04 seconds for a total waiting time of 11.00.
    Waited for 1.00 seconds.
    That's all.

    Waited for 5.00 seconds.
    That's all.

    Waited for 3.00 seconds.
    That's all.

    Waited for 2.00 seconds.
    That's all.

It also takes about five seconds and gives the same output as our version
before.
But the implementation for getting a single page is much simpler and takes
care of the encoding and other aspects not mentioned here.