@@ -4,13 +4,13 @@ Web Scraping
 
 Web scraping means downloading multiple web pages, often from different
 servers.
-Typically, there is a considerable waiting time involved between sending a
-request and receiving the answer.
+Typically, there is a considerable waiting time between sending a request and
+receiving the answer.
 Using a client that always waits for the server to answer before sending
-the next request, means spending most of time waiting.
-Here ``asyncio`` can help to send many request without waiting for a response
+the next request, can lead to spending most of the time waiting.
+Here ``asyncio`` can help to send many requests without waiting for a response
 and collecting the answers later.
-The next examples show how a synchronous client spends most of the
+The following examples show how a synchronous client spends most of the time
 waiting and how to use ``asyncio`` to write an asynchronous client that
 can handle many requests concurrently.
 
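The idea described above can be sketched with the standard library alone. This is a minimal illustration, not code from the accompanying example files; the waiting times and the use of ``asyncio.run()`` (Python 3.7+) are assumptions:

```python
import asyncio
import time

# Illustrative waiting times; the example files may use different values.
WAITS = [1.0, 1.5, 2.0]

async def fake_request(wait):
    """Stand-in for a network request: just wait, then report the delay."""
    await asyncio.sleep(wait)
    return wait

async def main():
    start = time.perf_counter()
    # All "requests" run concurrently, so the total elapsed time is
    # roughly max(WAITS), not sum(WAITS) as in a synchronous client.
    results = await asyncio.gather(*(fake_request(w) for w in WAITS))
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
```

Run concurrently, the three simulated requests finish in about two seconds instead of four and a half.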
@@ -75,7 +75,7 @@ The request handler only has a ``GET`` method:
 It takes the last entry in the path with ``self.path[1:]``, i.e.
 our ``2.5``, and tries to convert it into a floating point number.
 This will be the time the function is going to sleep, using ``time.sleep()``.
-This means waits 2.5 seconds until it answers.
+This means it waits 2.5 seconds before it answers.
 The rest of the method contains the HTTP header and message.
 
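The handler described above can be sketched as follows. This is a hedged reconstruction, not the original example code; the fallback to ``0.0`` for a malformed path is an assumption:

```python
import time
from http.server import BaseHTTPRequestHandler

def sleep_time_from_path(path):
    """Convert a path like '/2.5' into seconds; assume 0.0 if it fails."""
    try:
        return float(path[1:])
    except ValueError:
        return 0.0

class SleepingHandler(BaseHTTPRequestHandler):
    """Sketch of a handler that waits before answering."""

    def do_GET(self):
        wait = sleep_time_from_path(self.path)
        time.sleep(wait)  # wait the requested number of seconds
        self.send_response(200)
        self.send_header('Content-type', 'text/html; charset=utf-8')
        self.end_headers()
        self.wfile.write('waited {} seconds'.format(wait).encode('utf-8'))
```

Requesting ``http://localhost:8000/2.5`` against such a handler would take about 2.5 seconds to answer.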
 A Synchronous Client
@@ -86,11 +86,11 @@ This is the full implementation:
 
 .. literalinclude:: examples/synchronous_client.py
 
-Again, we go through step-by-step.
+Again, we go through it step-by-step.
 
 While about 80 % of the websites use ``utf-8`` as encoding
 (provided by the default in ``ENCODING``), it is a good idea to actually use
-the encoding of that is specified by ``charset``.
+the encoding specified by ``charset``.
 This is our helper to find out what the encoding of the page is:
 
 .. literalinclude:: examples/synchronous_client.py
@@ -120,8 +120,8 @@ Now, we want multiple pages:
 We just iterate over the waiting times and call ``get_page()`` for all
 of them.
 The function ``time.perf_counter()`` provides a time stamp.
-Taking two time stamps a different and calculating their difference
-provides the elapsed run time.
+Taking two time stamps at different points in time and calculating their
+difference provides the elapsed run time.
 
 Finally, we can run our client::
 
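The timing pattern can be sketched like this; the ``time.sleep(0.1)`` is an illustrative stand-in for the client's work:

```python
import time

start = time.perf_counter()      # first time stamp
time.sleep(0.1)                  # stand-in for the work being timed
elapsed = time.perf_counter() - start  # difference gives the run time
```

``time.perf_counter()`` is monotonic and has a high resolution, which makes it better suited for measuring elapsed time than ``time.time()``.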
@@ -145,7 +145,7 @@ and get this output::
 Because we wait for each call to ``get_page()`` to complete, we need to
 wait about 11 seconds.
 That is the sum of all waiting times.
-Let's see see if we can do better going asynchronously.
+Let's see if we can do better by going asynchronous.
 
 
 Getting One Page Asynchronously
@@ -159,7 +159,7 @@ using the new Python 3.5 keywords ``async`` and ``await``:
 As with the synchronous example, finding out the encoding of the page
 is a good idea.
 This function helps here by going through the lines of the HTTP header,
-which it gets as an argument, searching for ``charset`` and returning is value
+which it gets as an argument, searching for ``charset`` and returning its value
 if found.
 Again, the default encoding is ``ISO-8859-1``:
 
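A sketch of such a helper, under the assumption that the header lines arrive as decoded strings; the original example file may parse them differently:

```python
def get_encoding(header_lines, default='ISO-8859-1'):
    """Search the HTTP header lines for ``charset`` and return its value."""
    for line in header_lines:
        if 'charset=' in line:
            # e.g. 'Content-Type: text/html; charset=utf-8'
            return line.split('charset=')[-1].strip()
    return default  # fall back to ISO-8859-1 if no charset was found
```

The helper returns the default whenever no line mentions a ``charset``.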
@@ -189,7 +189,7 @@ Therefore, we need to convert our strings in to bytestrings.
 
 Next, we read header and message from the reader, which is a ``StreamReader``
 instance.
-We need to iterate over the reader by using the specific for loop for
+We need to iterate over the reader by using the special ``for`` loop for
 ``asyncio``:
 
 .. code-block:: python
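Iterating a ``StreamReader`` with ``async for`` can be sketched without a server by feeding it bytes by hand; in the real client the reader comes from ``asyncio.open_connection()``, and the response data below is illustrative:

```python
import asyncio

async def read_lines(data):
    reader = asyncio.StreamReader()
    reader.feed_data(data)  # hand-feed bytes instead of a network socket
    reader.feed_eof()
    lines = []
    async for line in reader:  # the asyncio-specific ``for`` loop
        lines.append(line)
    return lines

lines = asyncio.run(read_lines(b'HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nhi'))
```

Each iteration yields one line ending in ``\n``; the final chunk is yielded at end-of-file even without a trailing newline.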
@@ -350,7 +350,7 @@ Exercise
 Add more waiting times to the list ``waits`` and see how this impacts
 the run times of the blocking and the non-blocking implementation.
 Try (positive) numbers that are all less than five.
-Try numbers greater than five.
+Then try numbers greater than five.
 
 High-Level Approach with ``aiohttp``
 ------------------------------------
@@ -376,8 +376,8 @@ The function to get one page is asynchronous, because of the ``async def``:
    :start-after: import aiohttp
    :end-before: def get_multiple_pages
 
-The arguments are the same as for the previous function to retrieve one page
-plus the additional argument ``session``.
+The arguments are the same as those for the previous function to retrieve one
+page plus the additional argument ``session``.
 The first task is to construct the full URL as a string from the given
 host, port, and the desired waiting time.
 
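The URL construction can be sketched as follows; the helper name and the host and port values in the usage line are illustrative, not taken from the example files:

```python
def make_url(host, port, wait):
    """Build the full URL string from host, port, and waiting time."""
    return 'http://{}:{}/{}'.format(host, port, wait)

url = make_url('localhost', 8000, 2.5)
```

With the sketch above, ``make_url('localhost', 8000, 2.5)`` yields ``'http://localhost:8000/2.5'``, matching the path format the test server expects.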