# Calculating Keyword Density of a Web Page with Python
In this tutorial I will show you how to calculate the keyword density of a
web page using the Python programming language. It continues the previous
tutorial, in which we retrieved a web page using Python's urllib2 module.

## Keyword Density Calculation
Keyword density is an easy metric to calculate because its formula is simple: the keyword density of a term is the number of occurrences of the chosen keyword divided by the total number of words in the body of text. For example, a keyword that appears 5 times in a 200-word page has a density of 5 / 200 = 0.025, or 2.5%.
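
As a minimal sketch of that formula, the snippet below computes the density of one keyword in a short sample sentence. The function name `keyword_density` and the sample text are assumptions made for illustration; they do not come from the earlier tutorials.

```python
def keyword_density(keyword_count, total_words):
    """Return keyword occurrences divided by the total word count."""
    if total_words == 0:
        # Avoid dividing by zero when the text contains no words.
        return 0.0
    return float(keyword_count) / total_words

# Hypothetical sample text used only to illustrate the formula.
text = "python is fun and python is popular"
words = text.split()

# "python" appears 2 times out of 7 words.
density = keyword_density(words.count("python"), len(words))
print(round(density, 3))  # 0.286
```

Multiplying the result by 100 gives the density as a percentage, roughly 28.6% for this sentence.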

## Implementation
In the previous two tutorials I showed you how to fetch a web page and how to strip the HTML tags from a fetched page, each on its own. The next stage of this series puts those pieces together and then devises a method for counting both the total number of words on our web page and the number of occurrences of the chosen keyword.

## Utilizing the Dictionary Data Structure
A simple and efficient way to store each word along with its number of occurrences is Python's dictionary data structure: each word becomes a key, and its count is the value.

    ## declaring a dictionary in Python
    word_list = {}

Now that we've got our dictionary structure defined, we can split the text of our HTML document into words, once it has had its HTML tags removed, and loop through every word.
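
The counting loop can be sketched on a plain string before we involve a real web page. This is only an illustration: the helper name `count_words`, the lowercasing step and the sample sentence are assumptions of mine, not part of the original series.

```python
def count_words(text):
    """Build a dictionary mapping each lowercased word to its count."""
    word_counts = {}
    # split() breaks the string on whitespace; looping over the string
    # directly would visit single characters instead of words.
    for word in text.lower().split():
        word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts

# Hypothetical sample text.
counts = count_words("Python is fun and python is popular")
print(counts["python"])  # 2
print(len(counts))       # 5 distinct words
```

Lowercasing first means "Python" and "python" are counted as the same keyword, which is usually what you want for a density figure. Punctuation is left attached to words here, so a real page would likely need a little more cleaning.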

## Source Code

    import urllib2  ## Python 2 only; Python 3 moved this functionality to urllib.request
    import re

    ## matches anything that looks like an HTML tag, e.g. <p> or </div>
    TAG_RE = re.compile(r'<[^>]+>')

    def fetch_page(siteURL):
        ## create the appropriate headers for our HTTP request so that we won't
        ## run into any 403 Forbidden errors
        hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
               'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
               'Accept-Encoding': 'none',
               'Accept-Language': 'en-US,en;q=0.8',
               'Connection': 'keep-alive'}
        ## build the HTTP request from our desired URL and the headers
        ## that we've defined above
        req = urllib2.Request(siteURL, headers=hdr)
        try:
            ## open our desired page using urllib2.urlopen, passing in our
            ## request object; the try/except means that, should the request
            ## fail, the script fails gracefully instead of just crashing
            page = urllib2.urlopen(req)
        except urllib2.HTTPError as e:
            ## print out the body of the HTTPError and return an empty page
            print(e.fp.read())
            return ''
        ## lastly, read the response which was generated by opening the URL
        ## and return it as a string
        return page.read()

    def remove_tags(text):
        ## strip every HTML tag, leaving only the text between the tags
        return TAG_RE.sub('', text)

    def main():
        page = fetch_page("https://tutorialedge.net")
        wordsNoTags = remove_tags(page)
        word_list = {}
        ## split on whitespace first; looping over the string itself would
        ## visit single characters rather than words
        for word in wordsNoTags.split():
            if word not in word_list:
                word_list[word] = 1
            else:
                word_list[word] += 1
        ## print the number of distinct words found on the page
        print(len(word_list))

    if __name__ == "__main__":
        main()

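
The listing above stops at counting words, so here is one possible sketch of the final step: turning those counts into a keyword density. It works on an HTML string rather than a live request so that it runs without network access. The function name `keyword_density_from_html`, the `WORD_RE` pattern and the sample HTML are all assumptions for illustration. Note also that this sketch replaces tags with a space instead of an empty string, which differs from `remove_tags` above, so that words separated only by a tag do not fuse together.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')
# Assumed definition of a "word": runs of letters, digits and apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")

def keyword_density_from_html(html, keyword):
    """Return the fraction of words in the HTML text that equal keyword."""
    # Replace each tag with a space so neighbouring words stay separate.
    text = TAG_RE.sub(' ', html)
    # Lowercase the text and pull out the words, dropping punctuation.
    words = WORD_RE.findall(text.lower())
    if not words:
        return 0.0
    return float(words.count(keyword.lower())) / len(words)

# Hypothetical page content standing in for the result of fetch_page().
sample_html = ("<html><body><h1>Python tips</h1>"
               "<p>Learn Python, then practise Python daily.</p>"
               "</body></html>")

density = keyword_density_from_html(sample_html, "python")
print("%.1f%%" % (density * 100))  # 37.5%
```

In the full script you could pass the string returned by `fetch_page` to this function in place of `sample_html`. Pages that embed scripts or styles would also contribute their contents as "words", so treat the result as a rough estimate.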