
# Calculating Keyword Density of a Web Page with Python

β° 3 Minutes π Apr 15, 2017

In this tutorial I will be showing you how to calculate the keyword density of a web page using the Python programming language. This is a continuation of the previous tutorial, in which we retrieved a web page using Python's `urllib2` module.

## Keyword Density Calculation

Keyword density is an easy metric to calculate as it has a relatively simple formula. The keyword density of a specific term is measured as the number of occurrences of the chosen keyword over the total number of words in the body of text.
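As a quick sketch of the formula, the calculation can be expressed directly in Python (the function name and sample sentence here are purely illustrative):

```python
def keyword_density(text, keyword):
    ## split the body of text into individual words, ignoring case
    words = text.lower().split()
    if not words:
        return 0.0
    ## occurrences of the chosen keyword divided by the total word count
    occurrences = words.count(keyword.lower())
    return occurrences / float(len(words))

## "python" appears twice in this eight-word sentence
sample = "Python is great and Python is very easy"
print(keyword_density(sample, "python"))  ## -> 0.25
```

So a keyword appearing 2 times in a body of 8 words has a density of 0.25, or 25%.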

## Implementation

In the previous two tutorials I showed you how to fetch a web page and how to strip the HTML tags from a fetched page. The next stage of this tutorial series is to put what we've learned together and devise a method for counting the total number of words in our web page, as well as the total number of occurrences of our chosen keyword.

### Utilizing the Dictionary Data Structure

The easiest and fastest way to store each word along with its number of occurrences is to use Python's dictionary data structure.

```
## declaring a dictionary in python
word_list = {}
```

Now that we've got our dictionary structure defined, we can loop through every word from our HTML document after its HTML tags have been removed.

## Source Code

```
import urllib2
import re

TAG_RE = re.compile(r'<[^>]+>')

def fetch_page(siteURL):
    ## create the appropriate headers for our http request so that we won't run
    ## into any 403 forbidden errors. All of this will be available at the tutorial
    ## page that I will link to in the description below
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}

    ## build our HTTP request by passing in our desired URL and setting our headers
    ## to equal the headers that we've defined above
    req = urllib2.Request(siteURL, headers=hdr)

    try:
        ## here we open our desired page using urllib2.urlopen, passing in our
        ## request object as a parameter. As a means of protection we surround
        ## this with a try/except so that, should the script run into any errors,
        ## it will fail gracefully instead of just crashing
        page = urllib2.urlopen(req)
    except urllib2.HTTPError, e:
        ## print out the HTTPError and return an empty string
        print(e)
        return ""

    ## lastly we read the response which was generated by opening
    ## the url and store it under content
    content = page.read()
    return content

def remove_tags(text):
    return TAG_RE.sub('', text)

def main():
    page = fetch_page("https://tutorialedge.net")
    wordsNoTags = remove_tags(page)

    word_list = {}

    ## split the stripped text into individual words and tally each one
    for word in wordsNoTags.split():
        if word not in word_list:
            word_list[word] = 1
        else:
            word_list[word] += 1

    print(len(word_list))

if __name__ == "__main__":
    main()
```
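With the word counts stored in our dictionary, the density of any chosen keyword can be read straight out of it. The helper below is a minimal sketch of that final step; the function name and sample counts are illustrative, not part of the script above:

```python
def density_of(word_list, keyword):
    ## the total word count is the sum of every tally in the dictionary
    total = sum(word_list.values())
    if total == 0:
        return 0.0
    ## occurrences of our chosen keyword, defaulting to 0 if it never appeared
    return word_list.get(keyword, 0) / float(total)

## e.g. 3 occurrences out of 10 total words gives a density of 0.3
counts = {"python": 3, "tutorial": 2, "keyword": 5}
print(density_of(counts, "python"))  ## -> 0.3
```

Passing the `word_list` built in `main()` to a helper like this would give the keyword density we set out to calculate.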