How to Scrape Web Pages With Node.js and jQuery

Node.js is growing rapidly, in large part thanks to the developers who build tools that significantly improve productivity with Node. In this article, we will walk through a basic installation of Express, a web development framework, and create a basic project with it.


What We're Going to Build Today

Node is similar in design to, and influenced by, systems like Ruby's EventMachine or Python's Twisted. Node takes the event model a bit further: it presents the event loop as a language construct instead of as a library.

In this tutorial, we will scrape the YouTube home page, collect all the regular-sized thumbnails along with their links and video durations, send those elements to a jQuery Mobile template, and play the videos using the YouTube embed, which does a nice job of detecting device media support (Flash or HTML5 video).

We will also learn how to begin using npm and Express, including npm's module installation process and basic Express routing, and how to use two Node modules: request and jsdom.

For those of you who aren't yet familiar with what Node.js is and how to install it, please refer to the Node.js home page and the npm GitHub project page.

You should also refer to our "Node.js: Step by Step" series.

Note: This tutorial assumes that you understand what Node.js is and that you already have Node.js and npm installed.


Step 1: Setting Up Express

So what exactly is Express? According to its developers, it's an...

Insanely fast (and small) server-side JavaScript web development framework built on Node and Connect.

Sounds cool, right? Let's use npm to install Express. Open a Terminal window and type the following command:

By passing -g as a parameter to the install command, we're telling npm to make a global installation of the module.

I'm using /home/node-server/nettuts for this example, but you can use whatever you feel comfortable with.

After creating our Express project, we need to instruct npm to install Express's dependencies.

If the output ends with "ok," you're good to go. You can now run your project:

In your browser, go to http://localhost:3000.


Step 2: Installing Needed Modules

JSDOM

A JavaScript implementation of the W3C DOM.

Go back to your Terminal and, after stopping your current server (Ctrl + C), install jsdom:


Request

Simplified HTTP request method.

Type the following into the Terminal:

Everything should be set up now. It's time to get into some actual code!


Step 3: Creating a Simple Scraper

app.js

First, let's include all our dependencies. Open your app.js file and add the following code at the very top:

You will notice that Express has created some code for us. What you see in app.js is the most basic structure for a Node server using Express. In our previous code block, we told Express to include our recently installed modules: jsdom and request. Also, we're including the URL module, which will help us parse the video URL we will scrape from YouTube later.
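As a rough illustration of what URL parsing buys us later, the sketch below pulls the video ID out of a watch link. It uses the URL class exported by Node's built-in url module in current releases, which may differ from the call used when this tutorial was written. The videoIdFrom helper, the sample ID, and the base address are assumptions made for this example.

```javascript
const { URL } = require('url');

// Hypothetical helper: extract the "v" query parameter from a YouTube watch link.
// Scraped hrefs are often relative ("/watch?v=..."), so resolve them against a base.
function videoIdFrom(href, base = 'https://www.youtube.com') {
  const parsed = new URL(href, base);
  // searchParams.get returns null when the parameter is absent.
  return parsed.searchParams.get('v');
}

console.log(videoIdFrom('/watch?v=abc123XYZ_0&feature=related')); // abc123XYZ_0
console.log(videoIdFrom('/about')); // null
```

Resolving against a base keeps the helper usable for both relative and absolute links, since an absolute href simply ignores the base.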

Scraping YouTube.com

Within app.js, search for the "Routes" section (around line 40) and add the following code (read through the comments to understand what is going on):

In this case, we're fetching the content from the YouTube home page. Once complete, we're printing the text contained in the page's title tag (<title>). Return to the Terminal and run your server again.

In your browser, go to: http://localhost:3000/nodetube

You should see, "YouTube - Broadcast Yourself," which is YouTube's title.
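The fetch, parse, and respond flow of this route can be sketched without committing to a particular version of request or jsdom. In the sketch below, fetchHtml is a hypothetical stand-in for the HTTP client, injected so the handler can be exercised without network access, and parseTitle uses a deliberately naive regular expression in place of real DOM parsing. Neither name comes from the tutorial's own code.

```javascript
// Naive stand-in for DOM parsing: grabs the text between <title> and </title>.
// A real scraper would query a parsed DOM (for example via jsdom) instead.
function parseTitle(html) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}

// Builds an Express-style handler. fetchHtml(url, callback) is a hypothetical
// wrapper around whatever HTTP client you use; callback receives (err, html).
function makeTitleHandler(fetchHtml) {
  return function (req, res) {
    fetchHtml('https://www.youtube.com', function (err, html) {
      if (err) {
        res.statusCode = 502;
        return res.end('Could not fetch the page');
      }
      res.end(parseTitle(html) || 'No title found');
    });
  };
}

// Exercise the handler with a canned page instead of a live request.
const fakeFetch = (url, cb) =>
  cb(null, '<html><head><title>Example Page</title></head></html>');
const fakeRes = { end: (body) => console.log(body) };
makeTitleHandler(fakeFetch)({}, fakeRes); // prints "Example Page"
```

In app.js, a handler built this way would be registered with a call along the lines of app.get('/nodetube', makeTitleHandler(fetchHtml)).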

Now that we have everything set up and running, it is time to get some video URLs. Go to the YouTube homepage and right click on any thumbnail from the "recommended videos" section. If you have Firebug installed (which is highly recommended), you should see something like the following:

There's an identifiable pattern here, one that is present in almost all other regular video links:

Let's focus on those elements. Go back to your editor, and in app.js, add the following code to the /nodetube route:
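A minimal sketch of that selection step, assuming the pattern in question is a /watch path carrying a v query parameter (an assumption about YouTube's link format, not something taken from the page markup). The links array, its field names, and the toVideoItems helper are all hypothetical. In the real route these records would come from the anchors found in the scraped DOM.

```javascript
// Hypothetical input: one plain object per <a> element found on the page.
const links = [
  { href: '/watch?v=aaa111', title: 'First clip', duration: '3:05' },
  { href: '/about', title: 'About', duration: null },
  { href: '/watch?v=bbb222&feature=home', title: 'Second clip', duration: '12:40' },
];

// Keep only links whose path is /watch and that carry a "v" parameter,
// then reduce each one to the fields a list view needs.
// URL is available as a global in current Node releases.
function toVideoItems(rawLinks) {
  return rawLinks
    .map((link) => ({ link, url: new URL(link.href, 'https://www.youtube.com') }))
    .filter(({ url }) => url.pathname === '/watch' && url.searchParams.has('v'))
    .map(({ link, url }) => ({
      id: url.searchParams.get('v'),
      title: link.title,
      duration: link.duration,
    }));
}

console.log(toVideoItems(links));
```

Logging the result is a quick way to confirm that only the two watch links survive before wiring the data into a template.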

It's time to restart our server one more time and reload the page in our browser (http://localhost:3000/nodetube). In your Terminal, you should see something like the following:

This looks good, but we need a way to display our results in the browser. For this, I will use the Jade template engine:

Jade is a high performance template engine heavily influenced by Haml, but implemented with JavaScript for Node.

In your editor, open views/layout.jade, which is the basic layout structure used when rendering a page with Express. It's a good start, but we need to modify it a bit.

views/layout.jade

If you compare the code above with the default code in layout.jade, you will notice that a few things have changed: the doctype, the viewport meta tag, and the style and script tags served from jquery.com. Let's create our list view:

views/list.jade

Before we start, please browse through jQuery Mobile's (JQM from now on) documentation on page layouts and anatomy.

The basic idea is to use a JQM listview in which each item shows a thumbnail, a title, and a video duration label, along with a link to a video page for that item.

Note: Be careful with the indentation you use in your Jade documents, as Jade accepts either spaces or tabs, but not both in the same document.

That is all we need to create our listing. Return to app.js and replace the following code:

with this:
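In broad terms, the change swaps a plain-text response for a template render. A minimal sketch, assuming the view is named list (matching views/list.jade) and that the template locals are called title and items. Both local names and the sendList helper are chosen for illustration only.

```javascript
// Sends scraped items to a view. res.render(view, locals) is the standard
// Express call for rendering a template with data.
function sendList(res, items) {
  res.render('list', { title: 'NodeTube', items: items });
}

// Stand-in response object so the sketch runs without Express installed.
const recordedCalls = [];
const stubRes = {
  render: (view, locals) => recordedCalls.push({ view, locals }),
};

sendList(stubRes, [{ id: 'aaa111', title: 'First clip', duration: '3:05' }]);
console.log(recordedCalls[0].view, recordedCalls[0].locals.items.length); // list 1
```

Inside the template, each entry in items can then be looped over to produce one listview row.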

Restart your server one more time and reload your browser:

Note: Because we're using jQuery Mobile, I recommend using a WebKit-based browser or an iPhone/Android phone (or simulator) for better results.


Step 4: Viewing Videos

Let's create a view for our /watch route. Create views/video.jade and add the following code:

Again, go back to your Terminal, restart your server, reload your page, and click on any of the listed items. This time a video page will be displayed and you will be able to play the embedded video!
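On the server side, the route feeding this view might look like the sketch below. It assumes the video ID arrives as a route parameter named id and that the view receives the embed address in a local called src. Both names, along with the embedUrl and watch helpers, are illustrative guesses rather than the tutorial's own code.

```javascript
// Builds the iframe source for YouTube's embedded player.
function embedUrl(videoId) {
  return 'https://www.youtube.com/embed/' + encodeURIComponent(videoId);
}

// Express-style handler. Assumes it is mounted on a route with an :id
// parameter, e.g. app.get('/watch/:id', watch); the route shape is a guess.
function watch(req, res) {
  res.render('video', { title: 'Now playing', src: embedUrl(req.params.id) });
}

// Stub request/response so the sketch runs without Express installed.
const stubReq = { params: { id: 'aaa111' } };
const stubRes = { render: (view, locals) => console.log(view, locals.src) };
watch(stubReq, stubRes); // prints "video https://www.youtube.com/embed/aaa111"
```

Encoding the ID before building the address keeps unexpected characters in a scraped link from breaking the embed URL.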


Bonus: Using Forever to Run Your Server

There are several ways to keep our server running in the background, but the one I prefer is Forever, a Node module we can easily install using npm:

This will globally install Forever. Let's start our nodeTube application:

You can also restart your server, use custom log files, and pass environment variables, among other useful things:


Final Thoughts

I hope I've demonstrated how easy it is to begin using Node.js, Express and npm. In addition, you've learned how to install Node modules, add routes to Express, fetch remote pages using the Request module, and plenty of other helpful techniques.

If you have any comments or questions, please let me know in the comments section below!
