Using the New York Times API to Scrape Metadata

Introduction

Last week, I wrote an introduction to scraping web pages to collect metadata, mentioning that it's not possible to scrape the New York Times site. The Times paywall blocks your attempts to gather basic metadata. But there is a way around this using the New York Times API.

Recently I began building a community site on top of the Yii platform, which I'll write about in a future tutorial. I wanted to make it easy to add links related to content on the site. While it's easy for people to paste URLs into forms, it becomes time-consuming to also provide title and source information.

So in today's tutorial, I'm going to expand the scraping code I wrote recently to leverage the New York Times API to gather headlines when Times links are added.

Remember, I participate in the comment threads below, so tell me what you think! You can also reach me on Twitter @lookahead_io.

Getting Started

Sign Up for an API Key

New York Times API - API Gallery Home Page

First, let's sign up to request an API Key:

New York Times API - API Sign Up Page

After you submit the form, you'll receive your key in an email:

New York Times API - Email with API Key

Exploring the New York Times API

New York Times API - Categories

The Times offers APIs in the following categories:

  • Archive
  • Article Search
  • Books
  • Community
  • Geographic
  • Most Popular
  • Movie Reviews
  • Semantic
  • Times Newswire
  • TimesTags
  • Top Stories

That's a lot to choose from. From the Gallery page, you can click on any category to see its individual API documentation:

New York Times API - Documentation of articlesearch json

The Times uses LucyBot to power their API docs, and there is a helpful FAQ:

New York Times API - FAQ

They even show you how to quickly get your API usage limits (you'll need to plug in your key):
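As a rough sketch of that check, written in Python rather than the PHP used elsewhere on the site, the snippet below sends a HEAD request and keeps only the rate-limit headers. It assumes the limits come back in response headers whose names start with `X-RateLimit`, and it uses the Article Search endpoint simply because any keyed endpoint should do; the function names are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: any endpoint that accepts your key works; Article Search is used here.
ENDPOINT = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def rate_limit_headers(headers):
    """Keep only headers whose names start with X-RateLimit (case-insensitive)."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower().startswith("x-ratelimit")
    }


def fetch_rate_limits(api_key):
    """Send a HEAD request with the key and return the rate-limit headers."""
    url = ENDPOINT + "?" + urlencode({"api-key": api_key})
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=10) as response:
        return rate_limit_headers(dict(response.headers.items()))
```

Calling `fetch_rate_limits("your-api-key")` with a real key should return whatever rate-limit headers the service sends, or an empty dictionary if it sends none.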

I initially struggled to make sense of the documentation—it's a parameter-based specification, not a programming guide. However, I posted some questions as issues to the New York Times API GitHub page, and they were quickly and helpfully answered.

Working With Article Search

For today's episode, I'm going to focus on using the NY Times Article Search. Basically, we'll extend the Create Link form from the last tutorial:

New York Times API - Create Link Form with NYT Story URL about Polar Bears

When the user clicks Lookup, we'll make an Ajax request that is routed through to Link::grab($url). Here's the jQuery:

Here's the controller and model method:
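To illustrate just the routing decision (this is a Python sketch, not the site's PHP), the idea is that Times links go to the API while every other link keeps using the scraper. The function names and the strategy labels `nyt_api` and `scrape` are hypothetical.

```python
from urllib.parse import urlparse


def is_times_url(url):
    """Return True when the host is nytimes.com or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == "nytimes.com" or host.endswith(".nytimes.com")


def choose_strategy(url):
    """Pick the lookup path: the API for Times links, scraping for the rest."""
    return "nyt_api" if is_times_url(url) else "scrape"
```

Matching on the parsed hostname, instead of searching the whole URL string, avoids treating a link as a Times link just because `nytimes.com` appears somewhere in its path.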

Next, let's use our API key to make an article search request:
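A minimal Python sketch of such a request might look like the following. It assumes the version 2 Article Search endpoint, a `fq` filter on the `web_url` field to match the pasted link, and a response shaped as `response.docs[0].headline.main`; the helper names are invented for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_ENDPOINT = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_search_url(article_url, api_key, fields="headline"):
    """Build a query that filters on the article's URL and limits returned fields."""
    params = {
        "fq": 'web_url:("{}")'.format(article_url),  # assumed filter syntax
        "fl": fields,
        "api-key": api_key,
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def extract_headline(payload):
    """Return the main headline of the first matching document, or None."""
    docs = payload.get("response", {}).get("docs", [])
    if not docs:
        return None
    return docs[0].get("headline", {}).get("main")


def lookup_headline(article_url, api_key):
    """Fetch the search result over HTTP and pull out the headline."""
    with urlopen(build_search_url(article_url, api_key), timeout=10) as response:
        return extract_headline(json.load(response))
```

Keeping URL building and response parsing separate from the network call makes both easy to check without spending any of your daily request quota.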

And it works quite easily. Here's the resulting headline (by the way, climate change is killing polar bears and we should care):

New York Times API - Create Link Form with NYT Story URL and Headline from Article Search API

If you want more details from your API request, just add more field names to the ?fl=headline argument of the request, such as keywords and lead_paragraph:

Here's the result:

The response from the API request

Perhaps I'll write a PHP library to better parse the NYT API in coming episodes, but this code breaks out the keywords and the lead paragraph:
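The same parsing step can be sketched in Python. This assumes a request made with `fl=headline,keywords,lead_paragraph`, and that each entry in `keywords` is an object carrying a `value` field; the function name and the sample document are made up for illustration and are not real API output.

```python
def summarize_doc(doc):
    """Pull the headline, lead paragraph, and keyword values out of one doc."""
    values = [kw.get("value", "") for kw in doc.get("keywords", [])]
    return {
        "headline": doc.get("headline", {}).get("main"),
        "lead_paragraph": doc.get("lead_paragraph"),
        "keywords": [value for value in values if value],  # drop empty values
    }


# Invented sample in the assumed response shape, not an actual API result.
sample_doc = {
    "headline": {"main": "Sample headline"},
    "lead_paragraph": "Sample lead paragraph.",
    "keywords": [
        {"name": "subject", "value": "Polar Bears"},
        {"name": "subject", "value": "Global Warming"},
    ],
}

summary = summarize_doc(sample_doc)
print(", ".join(summary["keywords"]))
```

Using `.get()` with defaults means a document that lacks one of the requested fields still produces a usable summary instead of raising an error.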

Here's what it shows for this article:

Hopefully that starts to expand your imagination about how to use these APIs. It's pretty exciting what may now be possible.

In Closing

The New York Times API is very useful, and I'm glad to see them offering it to the developer community. It was also refreshing to get such quick API support via GitHub—I just didn't expect this. Keep in mind that it's intended for non-commercial projects. If you have some money-making idea, send them a note to see if they'll work with you. Publishers are eager for new sources of revenue.

I hope you found these web scraping episodes helpful and put them to use in your projects. If you'd like to see today's episode in action, you can try out some of the web scraping on my site, Active Together.

Please do share any thoughts and feedback in the comments. You can also always reach me on Twitter @lookahead_io directly. And be sure to check out my instructor page and other series, Building Your Startup With PHP and Programming With Yii2.
