Removing HTML Tags from a String with Python
This tutorial demonstrates two methods for removing HTML tags from a string, such as the page source we retrieved in my previous tutorial on fetching a web page using Python.
The first method removes HTML tags from a string using a regular expression.
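    import re

    # Matches anything that looks like a tag: '<', one or more non-'>' characters, then '>'
    TAG_RE = re.compile(r'<[^>]+>')

    def remove_tags(text):
        # Replace every match with an empty string
        return TAG_RE.sub('', text)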
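As a quick check of the regex approach, here is a minimal sketch run against a made-up sample string. It also shows one limitation: the pattern removes tags but leaves character entities such as &amp; untouched. Decoding them with html.unescape from the standard library is an addition for illustration and is not part of the original method.

```python
import html
import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

# Made-up sample input, used only for illustration
sample = '<p>Hello, <b>world</b> &amp; friends!</p>'

stripped = remove_tags(sample)
print(stripped)                 # Hello, world &amp; friends!

# Optional extra step (an assumption, not in the original method):
# decode character entities after the tags are gone
print(html.unescape(stripped))  # Hello, world & friends!
```

Keep in mind that this pattern only strips the tags themselves. The contents of script and style elements are left in the output as ordinary text.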
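The second method uses the xml.etree.ElementTree module from the Python standard library, so there is nothing extra to install, although the module still has to be imported. Note that fromstring() expects well-formed XML and raises a ParseError on markup that is not, such as a page with an unclosed <br> tag.

    import xml.etree.ElementTree

    def remove_tags(text):
        # Parse the markup, then join the text found in every element
        return ''.join(xml.etree.ElementTree.fromstring(text).itertext())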
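The sketch below tries the ElementTree approach on two made-up strings, one well-formed and one with an unclosed <br> tag, to show where the XML parser gives up. As a possible fallback it also includes a lenient version built on html.parser.HTMLParser from the standard library. The names _TextCollector and remove_tags_lenient are hypothetical helpers introduced here for illustration and do not come from the original methods.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

def remove_tags(text):
    # Works only when text is well-formed XML
    return ''.join(ET.fromstring(text).itertext())

class _TextCollector(HTMLParser):
    """Hypothetical helper: collects the text between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def remove_tags_lenient(text):
    # Hypothetical fallback that tolerates unclosed tags
    collector = _TextCollector()
    collector.feed(text)
    collector.close()
    return ''.join(collector.parts)

# Made-up sample inputs, used only for illustration
well_formed = '<p>Hello, <b>world</b>!</p>'
malformed = '<p>Line one<br>Line two</p>'

print(remove_tags(well_formed))        # Hello, world!

try:
    remove_tags(malformed)
except ET.ParseError as exc:
    print('Not well-formed XML:', exc)

print(remove_tags_lenient(malformed))  # Line oneLine two
```

Which version fits depends on the input: pages fetched from the web are often not well-formed XML, so the strict parser is best suited to markup you control.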
In the coming tutorials we will learn how to calculate important SEO metrics such as keyword density, which will let us analyze competing sites and try to understand how they have achieved their success.
The methods for tag removal can be found here: http://stackoverflow.com/questions/9662346/python-code-to-remove-html-tags-from-a-string