About Extracting Information from an HTML File
You can extract information from an HTML file by subclassing html.parser.HTMLParser and overriding its handle_*() methods. For example, this class lets you extract Open Graph information from a web page:
from html.parser import HTMLParser
import requests
from pprint import pprint


class OpenGraphParser(HTMLParser):
    OG_PROPERTIES = ["og:title", "og:type", "og:image", "og:url"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.og_data = {}

    def handle_starttag(self, tag, attrs):
        # Open Graph data lives in <meta property="og:..." content="..."> tags.
        if tag.lower() == "meta":
            attrs_dict = dict(attrs)
            if (
                (prop := attrs_dict.get("property"))
                and (content := attrs_dict.get("content"))
                and prop in self.OG_PROPERTIES
            ):
                # Store the value under the property name without the "og:" prefix.
                self.og_data[prop.replace("og:", "")] = content

    def get_data(self):
        return self.og_data


if __name__ == "__main__":
    response = requests.get("https://www.djangotricks.com/tricks/3J96KxVxbApk/")
    og_parser = OpenGraphParser()
    og_parser.feed(response.text)
    og_data = og_parser.get_data()
    pprint(og_data)
These methods are called repeatedly, once for each occurrence, so you can collect the values or search for a specific tag, text, character, or comment:
handle_startendtag(self, tag, attrs) - for each self-closing tag
handle_starttag(self, tag, attrs) - for each opening tag
handle_endtag(self, tag) - for each closing tag
handle_charref(self, name) - for each character reference, e.g. &#129321; for 🤩 (only called when the parser is created with convert_charrefs=False)
handle_entityref(self, name) - for each entity reference, e.g. &euro; for € (only called when the parser is created with convert_charrefs=False)
handle_data(self, data) - for each piece of inner text, including inline scripts and styles
handle_comment(self, data) - for each HTML comment
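To see several of these handlers working together, here is a minimal sketch that parses an inline HTML string instead of a downloaded page. The PageSummaryParser class, its attribute names, and the sample markup are hypothetical and only illustrate the idea. The parser is created with convert_charrefs=False, because with the default setting the references are decoded and passed to handle_data() instead.

```python
from html.parser import HTMLParser


class PageSummaryParser(HTMLParser):
    """Hypothetical example: collects links, text, references, and comments."""

    def __init__(self):
        # convert_charrefs=False keeps handle_charref() and
        # handle_entityref() in play; the default (True) bypasses them.
        super().__init__(convert_charrefs=False)
        self.links = []
        self.text_parts = []
        self.references = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Skip chunks that contain only whitespace.
        if data.strip():
            self.text_parts.append(data.strip())

    def handle_entityref(self, name):
        # name is the part between "&" and ";", e.g. "euro".
        self.references.append(f"&{name};")

    def handle_charref(self, name):
        # name is the part between "&#" and ";", e.g. "129321".
        self.references.append(f"&#{name};")

    def handle_comment(self, data):
        self.comments.append(data.strip())


SAMPLE_HTML = """
<!-- generated for the example -->
<p>Price: 10 &euro; &#129321;</p>
<a href="https://example.com/">Example</a>
<br />
"""

parser = PageSummaryParser()
parser.feed(SAMPLE_HTML)
parser.close()

print(parser.links)
print(parser.text_parts)
print(parser.references)
print(parser.comments)
```

Each handler only appends to a list, so the parser stays a single pass over the markup and you can decide afterwards what to do with the collected values.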