
Web Scraping 201: finding the API

February 15, 2015 // scraping , python , data , tutorial

This is part of a series of posts I have written about web scraping with Python.

  1. Web Scraping 101 with Python, which covers the basics of using Python for web scraping.
  2. Web Scraping 201: Finding the API, which covers when sites load data client-side with Javascript.
  3. Asynchronous Scraping with Python, showing how to use multithreading to speed things up.
  4. Scraping Pages Behind Login Forms, which shows how to log into sites using Python.

Update: Sorry folks, it looks like the NBA doesn't make shot log data accessible anymore. The general principles of this post still apply, but the particular example used is no longer functional. I do not intend to rewrite this post.


Previously, I explained how to scrape a page where the content is rendered server-side. However, the increasing popularity of Javascript frameworks such as AngularJS, coupled with RESTful APIs, means that fewer sites are generated server-side and more are instead being rendered client-side.

In this post, I'll give a brief overview of the differences between the two and show how to find the underlying API, allowing you to get the data you're looking for.

Server-side vs. client-side

Imagine we have a database of sports statistics and would like to build a web application on top of it (e.g. something like Basketball Reference).

If we build our web app using a server-side framework like Django [1], something akin to the following happens each time a user visits a page.

  1. User's browser sends a request to the server hosting our application.
  2. Our server processes the request, checking to make sure the URL requested exists (among other things).
  3. If the requested URL does not exist, send an error back to the user's browser and direct them to a 404 page.
  4. If the requested URL does exist, execute some code on the server which gets data from our database. Let's say the user wants to see John Wall's game-by-game stats for the 2014-15 NBA season. In this case, our Django/Python code queries the database and receives the data.
  5. Our Django/Python code injects the data into our application's templates to complete the HTML for the page.
  6. Finally, the server sends the HTML to the user's browser (a response to their request) and the page is displayed.
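The steps above can be sketched in miniature. Everything here is a hypothetical stand-in: the "database" is a dict and the "template" is a plain format string, rather than a real database and a real Django template.

```python
# A minimal sketch of server-side rendering (hypothetical data and templates).

# Step 4: the "database" -- a dict standing in for a real stats table.
DATABASE = {
    ("John Wall", "2014-15"): [
        {"date": "2014-10-29", "pts": 17},
        {"date": "2014-11-01", "pts": 21},
    ],
}

# Step 5: the "template" -- a plain string standing in for a Django template.
PAGE_TEMPLATE = "<html><body><table>{rows}</table></body></html>"
ROW_TEMPLATE = "<tr><td>{date}</td><td>{pts}</td></tr>"

def render_game_log(player, season):
    """Steps 4-6: query the database, fill the template, return the HTML."""
    games = DATABASE.get((player, season))
    if games is None:
        return "<html><body>404 Not Found</body></html>"  # step 3
    rows = "".join(ROW_TEMPLATE.format(**g) for g in games)
    return PAGE_TEMPLATE.format(rows=rows)

html = render_game_log("John Wall", "2014-15")
# The data ends up right there in the "page source" the browser receives.
print("2014-10-29" in html)
```

The point is the last line: because the server injected the data before responding, the finished HTML contains it.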

To illustrate the last step, go to John Wall's game log and view the page source. Ctrl+f or Cmd+f and search for "2014-10-29". This is the first row of the game-by-game stats table. We know the page was created server-side because the data is present in the page source.

However, if the web application is built with a client-side framework like Angular, the process is slightly different. In this case, the server still sends the static content (the HTML, CSS, and Javascript), but the HTML is only a template - it doesn't hold any data. Separately, the Javascript in the server response fetches the data from an API and uses it to create the page client-side.

To illustrate, view the source of John Wall's shot log page on NBA.com - there's no data to scrape! See for yourself. Ctrl+f or Cmd+f for "Was @". Despite there being many instances of it in the shot log table, none are found in the page source.

If you're thinking "Oh crap, I can't scrape this data," well, you're in luck! Applications using an API are often easier to scrape - you just need to know how to find the API. Which means I should probably tell you how to do that.

Finding the API

With a client-side app, your browser is doing much of the work. And because your browser is what's rendering the HTML, we can use it to see where the data is coming from using its built-in developer tools.

To illustrate, I'll be using Chrome, but Firefox should be more or less the same (Internet Explorer users … you should switch to Chrome or Firefox and not look back).

To open Chrome's Developer Tools, go to View -> Developer -> Developer Tools. In Firefox, it's Tools -> Web Developer -> Toggle Tools. We'll be using the Network tab, so click on that one. It should be empty.

Now, go to the page that has your data. In this case, it's John Wall's shot logs. If you're already on the page, hit refresh. Your Network tab should look similar to this:

network tab example

Next, click on the XHR filter. XHR is short for XMLHttpRequest - this is the type of request used to fetch XML or JSON data. You should see a couple of entries in this table (screenshot below). One of them is the API request that returns the data you're looking for (in this case, John Wall's shots).

XHR requests example

At this point, you'll need to explore a bit to determine which request is the one you want. For our example, the one starting with "playerdashptshotlog" sounds promising. Let's click on it and view it in the Preview tab. Things should now look like this:

API response preview

Bingo! That's the API endpoint. We can use the Preview tab to explore the response.

API results preview

You should see a couple of objects:

  1. The resource name - playerdashptshotlog.
  2. The parameters (you might need to expand the resource section). These are the request parameters that were passed to the API. You can think of them like the WHERE clause of a SQL query. This request has parameters of Season=2014-15 and PlayerID=202322 (among others). Change the parameters in the URL and you'll get different data (more on that in a bit).
  3. The result sets. This is self-explanatory.
  4. Inside the result sets, you'll find the headers and row set. Each object in the row set is essentially the result of a database query, while the headers tell you the column order. We can see that the first item in each row corresponds to the Game_ID, while the second is the Matchup.
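To see how the headers and row set fit together, here's a sketch that zips them into dicts. The rows and the exact header names are made up for illustration (modeled on the endpoint's shape); the real response has many more columns.

```python
# A response in the same shape as the endpoint's result sets
# (header names modeled on the real endpoint; rows are made up).
result_set = {
    "headers": ["GAME_ID", "MATCHUP", "SHOT_NUMBER", "SHOT_RESULT"],
    "rowSet": [
        ["0021400037", "OCT 29, 2014 - WAS @ MIA", 1, "missed"],
        ["0021400037", "OCT 29, 2014 - WAS @ MIA", 2, "made"],
    ],
}

# Zip each row with the headers to get a list of dicts,
# which is easier to work with than positional indexing.
shots = [dict(zip(result_set["headers"], row)) for row in result_set["rowSet"]]

print(shots[0]["MATCHUP"])      # OCT 29, 2014 - WAS @ MIA
print(shots[1]["SHOT_RESULT"])  # made
```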

Now, go to the Headers tab, grab the request URL, and open it in a new browser tab - we'll see the data we're looking for (example below). Note that I'm using JSONView, which nicely formats JSON in your browser.

API response

To grab this data, we can use something like Python's requests library. Here's an example:

    import requests

    shots_url = 'http://stats.nba.com/stats/playerdashptshotlog?' + \
        'DateFrom=&DateTo=&GameSegment=&LastNGames=0&LeagueID=00&' + \
        'Location=&Month=0&OpponentTeamID=0&Outcome=&Period=0&' + \
        'PlayerID=202322&Season=2014-15&SeasonSegment=&' + \
        'SeasonType=Regular+Season&TeamID=0&VsConference=&VsDivision='

    # request the URL and parse the JSON
    response = requests.get(shots_url)
    response.raise_for_status()  # raise exception if invalid response
    shots = response.json()['resultSets'][0]['rowSet']

    # do whatever we want with the shots data
    do_things(shots)
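An aside on building that long URL: requests can construct the query string for you from a dict via its params argument, which makes individual parameters easier to tweak. Here's a sketch using the standard library's urlencode to show the string it would produce (the endpoint itself is no longer live, per the update above, so no request is made).

```python
from urllib.parse import urlencode

# The same query expressed as a dict -- easier to read and to modify.
params = {
    'DateFrom': '', 'DateTo': '', 'GameSegment': '', 'LastNGames': 0,
    'LeagueID': '00', 'Location': '', 'Month': 0, 'OpponentTeamID': 0,
    'Outcome': '', 'Period': 0, 'PlayerID': 202322, 'Season': '2014-15',
    'SeasonSegment': '', 'SeasonType': 'Regular Season', 'TeamID': 0,
    'VsConference': '', 'VsDivision': '',
}

# requests.get(base_url, params=params) would encode this for you;
# urlencode shows the query string that gets appended to the URL.
query = urlencode(params)
print('PlayerID=202322' in query)            # True
print('SeasonType=Regular+Season' in query)  # True -- spaces become '+'
```

With this style, swapping in a different Season or PlayerID is a one-line dict change instead of string surgery.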

That's it. Now you have the data and can get to work.

Note that passing different parameter values to the API yields different results. For instance, change the Season parameter to 2013-14 - now you have John Wall's shots for the 2013-14 season. Change the PlayerID to 201935 - now you have James Harden's shots.

Additionally, different APIs return different types of data. Some might send XML; others, JSON. Some might store the results in an array of arrays; others, an array of maps or dictionaries. Some might not return the column headers at all. These things vary between sites.
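As a small sketch of coping with that variety, here's a helper that normalizes the two common JSON shapes - array-of-arrays with separate headers, and array-of-dicts - into the same list of dicts. Both example payloads are made up.

```python
def normalize(payload):
    """Return a list of dicts whether the API sent rows as arrays
    (with separate headers) or as dicts already."""
    rows = payload['rows']
    if rows and isinstance(rows[0], dict):
        return rows  # already an array of maps/dicts
    # Otherwise pair each positional row with the column headers.
    return [dict(zip(payload['headers'], row)) for row in rows]

# Made-up responses in the two common shapes.
array_style = {'headers': ['player', 'pts'], 'rows': [['John Wall', 17]]}
dict_style = {'rows': [{'player': 'John Wall', 'pts': 17}]}

print(normalize(array_style) == normalize(dict_style))  # True
```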

Ever had a situation where you couldn't find the data you're looking for in the page source? Well, now you know how to find it.

Was there something I missed? Have questions? Let me know.


[1] Really, this can be any server-side framework - Ruby on Rails, PHP's Drupal or CodeIgniter, etc.

Built with Pelican and the newbird theme

© Copyright Greg Reda, 2013 to present


Source: http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/
