Fetching Web Pages In Python Using Urllib2

This tutorial was built using Python 2.7. It will not work on Python 3+, where urllib2 was split into urllib.request and urllib.error.

In this tutorial I will show you how to fetch a webpage using the urllib2 Python module. This is a relatively simple process that can be accomplished in 5 lines of code.

The Imports

To begin with, we need to import the urllib2 Python module so that we can use its functionality:

import urllib2
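
If the same script ever has to run on Python 3 as well, one common pattern is a guarded import. This is only a sketch and not part of the original tutorial: it assumes you are happy to import the individual names directly rather than referring to them through the urllib2 module.

```python
try:
    # Python 2.7: everything lives in urllib2
    from urllib2 import Request, urlopen, HTTPError, URLError
except ImportError:
    # Python 3: the same names are split across two modules
    from urllib.request import Request, urlopen
    from urllib.error import HTTPError, URLError
```

With this in place you would call Request(...) and urlopen(...) directly instead of urllib2.Request(...) and urllib2.urlopen(...).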

Fetching a Website

Once we've imported the urllib2 module, we can move on to fetching our desired webpage. For the purpose of this example we'll be fetching this very webpage, and we can do that like so:

req = urllib2.Request('http://www.tutorialedge.net/python/fetching-web-pages-python/')
response = urllib2.urlopen(req)
the_page = response.read()
# Here we print out the retrieved page's HTML to the console
# once we've got this we can start performing some analysis of 
# the webpage and do some cooler things.
print(the_page)
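
The snippet above never closes the response and will wait indefinitely if the server stops answering. A minimal sketch of a more defensive version follows. The helper name fetch and the 10 second default timeout are my own assumptions, not part of the original tutorial, and the guarded import is only there so the sketch also runs on Python 3.

```python
from contextlib import closing

try:
    from urllib2 import urlopen          # Python 2.7
except ImportError:
    from urllib.request import urlopen   # Python 3

def fetch(url, timeout=10):
    """Return the raw body of url (hypothetical helper, illustrative timeout)."""
    # closing() guarantees response.close() is called, even if read() raises
    with closing(urlopen(url, timeout=timeout)) as response:
        return response.read()
```

Calling fetch('http://www.tutorialedge.net/python/fetching-web-pages-python/') should then return the same HTML that the_page holds above.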

Some servers respond with 403 Forbidden when a request arrives with urllib2's default headers. To avoid this we need to define a set of browser-like headers for our HTTP request. The full script looks like this:

import urllib2

# the URL of the page we want to fetch, stored as a string
site = "http://tutorialedge.net"
# create the appropriate headers for our HTTP request so that we won't run
# into any 403 Forbidden errors
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

# Build the HTTP request object by passing in our desired URL and the headers
# that we've defined above. Nothing is sent until we call urlopen.
req = urllib2.Request(site, headers=hdr)

try:
    # Open our desired page using urllib2.urlopen, passing in our request object.
    # The call is wrapped in a try/except so that an HTTP error (such as a 403)
    # is reported cleanly instead of crashing the script with a traceback.
    page = urllib2.urlopen(req)
except urllib2.HTTPError as e:
    # print out the status code and the body of the error response
    print e.code
    print e.read()
else:
    # The request succeeded, so read the response and store it under content.
    # Doing this in the else branch avoids a NameError on page after a failure.
    content = page.read()
    # and then print out this page.
    print content
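
Before sending a request it can be useful to check which headers it will actually carry. The sketch below does this without touching the network. The helper build_request is hypothetical and the shortened User-Agent string is just an illustrative value. The guarded import is only there so the sketch also runs on Python 3.

```python
try:
    import urllib2 as request_lib          # Python 2.7
except ImportError:
    import urllib.request as request_lib   # Python 3

def build_request(url, user_agent):
    """Return a Request carrying a custom User-Agent header (hypothetical helper)."""
    return request_lib.Request(url, headers={'User-Agent': user_agent})

req = build_request('http://tutorialedge.net', 'Mozilla/5.0 (X11; Linux x86_64)')

# Request stores header names with only the first letter capitalized,
# so the key to look up is 'User-agent', not 'User-Agent'.
print(req.get_full_url())
print(req.has_header('User-agent'))
print(req.get_header('User-agent'))
```

Note the lookup key: the header is still sent to the server as a normal User-Agent header, but get_header and has_header only find it under the name 'User-agent'.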

Moving Forward

This tutorial will act as the base for a number of later tutorials in which we will calculate key SEO metrics such as keyword density.
