Calculating Keyword Density of a Web Page with Python

Elliot Forbes ⏰ 3 Minutes 📅 Apr 15, 2017

In this tutorial I will be showing you how to calculate the keyword density of a web page using the Python programming language. This is a continuation of the previous tutorial, in which we retrieved a web page using Python's urllib2 module.

Keyword Density Calculation

Keyword density is an easy metric to calculate as it has a relatively simple formula. The keyword density of a specific term is the number of occurrences of the chosen keyword divided by the total number of words in the body of text.
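
As a quick illustration, here is a minimal sketch of that formula in Python. The counts are made-up example values rather than numbers taken from a real page.

## a minimal sketch of the keyword density formula
keyword_count = 7     ## occurrences of the chosen keyword (example value)
total_words = 350     ## total number of words in the body of text (example value)

density = float(keyword_count) / total_words
print("Keyword density: {:.2%}".format(density))    ## -> Keyword density: 2.00%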

Implementation

In the previous two tutorials I showed you how to fetch a web page and how to strip the HTML tags from a fetched page. The next stage of this tutorial series is to put what we've learned together and devise a method for counting the total number of words on our web page as well as the total number of occurrences of the chosen keyword.

Utilizing the Dictionary Data Structure

The easiest and fastest way to store each word along with its number of occurrences is to use Python's dictionary data structure.

## declaring a dictionary in python
word_list = {}

Now that we've got our dictionary defined, we can loop through every word in our HTML document once its HTML tags have been removed, as sketched below.
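
As a rough sketch of that loop, assuming the stripped page text is held in a string called text, the counting looks like this:

## a minimal sketch of the counting loop, assuming the stripped
## page text is held in a string called text
word_list = {}

for word in text.split():
    if word not in word_list:
        word_list[word] = 1
    else:
        word_list[word] += 1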

Source Code

import urllib2
import re

TAG_RE = re.compile(r'<[^>]+>')

def fetch_page(siteURL):
    ## create a variable which will hold our desired web page as a string
    site = siteURL
    ## create the appropriate headers for our http request so that we won't
    ## run into any 403 forbidden errors
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}

    ## build an HTTP request object with our desired URL and the
    ## headers that we've defined above.
    req = urllib2.Request(site, headers=hdr)

    try:
        ## here we are going to open our desired page using urllib2.urlopen,
        ## passing in our request object as a parameter; as a means of protection
        ## we surround this with a try/except so that, should the script run into
        ## any errors, it will fail gracefully instead of just crashing.
        page = urllib2.urlopen(req)
    except urllib2.HTTPError as e:
        ## print out the HTTPError and bail out of the function
        print(e.fp.read())
        return None

    ## lastly we read the response that was generated by opening
    ## the url, store it under content and return it to the caller
    content = page.read()
    return content

def remove_tags(text):
    return TAG_RE.sub('', text)


def main():
    page = fetch_page("https://tutorialedge.net")
    wordsNoTags = remove_tags(page)

    word_list = {}

    ## split the stripped text into individual words and tally the
    ## number of occurrences of each one
    for word in wordsNoTags.split():
        if word not in word_list:
            word_list[word] = 1
        else:
            word_list[word] += 1

    ## print the number of distinct words found on the page
    print(len(word_list))

if __name__ == "__main__":
    main()
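
main() currently only prints the number of distinct words it found. With word_list populated, though, the density of any chosen keyword is just a lookup divided by the total word count. Here is a hedged sketch of that final step; the keyword "python" is only an example, and in practice you would probably lower-case each word before counting so that "Python" and "python" are tallied together.

## a rough sketch of the final density calculation, assuming word_list
## has been populated as it is in main() above
keyword = "python"    ## example keyword, swap in whichever term you are tracking
total_words = sum(word_list.values())
occurrences = word_list.get(keyword, 0)

density = float(occurrences) / total_words if total_words else 0.0
print("Density of '{}': {:.2%}".format(keyword, density))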