About Extracting Information from an HTML File
You can extract information from an HTML file by extending html.parser.HTMLParser
and overriding its handle_*()
methods. For example, this class lets you extract Open Graph information from a web page:
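A minimal sketch of such a class might look like the following. The class name OpenGraphParser, the og dictionary, and the sample markup are assumptions made for illustration; only HTMLParser and its handle_starttag() hook come from the standard library.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:*" content="..."> values (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        self.og = {}  # hypothetical container, e.g. {"title": "...", "type": "..."}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a value may be None.
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if prop.startswith("og:") and attrs.get("content") is not None:
            # Store the value under the property name without the "og:" prefix.
            self.og[prop[3:]] = attrs["content"]


# Sample markup, made up for this example.
html = """
<html><head>
<meta property="og:title" content="Example Title">
<meta property="og:type" content="article" />
<meta name="description" content="Not Open Graph">
</head><body></body></html>
"""

parser = OpenGraphParser()
parser.feed(html)
parser.close()
print(parser.og)  # {'title': 'Example Title', 'type': 'article'}
```

The self-closing `<meta ... />` tag is picked up as well, because the default handle_startendtag() forwards to handle_starttag() and handle_endtag().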
These methods are called repeatedly, once for each occurrence, so you can collect the results or search for a specific tag, text, character reference, or comment:
handle_startendtag(self, tag, attrs) - for each self-closing tag, e.g. <br />
handle_starttag(self, tag, attrs) - for each opening tag
handle_endtag(self, tag) - for each closing tag
handle_charref(self, name) - for each character reference, e.g. &#129321; (🤩); only called when the parser is created with convert_charrefs=False
handle_entityref(self, name) - for each entity reference, e.g. &euro; (€); only called when the parser is created with convert_charrefs=False
handle_data(self, data) - for each piece of inner text, including inline scripts and styles
handle_comment(self, data) - for each HTML comment
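To see which handler fires for which piece of markup, a small logging subclass can help. This is only a sketch: the EventLogger name, the events list, and the sample string are made up for illustration. Since Python 3.5 the parser defaults to convert_charrefs=True, which turns references into plain text passed to handle_data(), so the sketch passes convert_charrefs=False to make handle_charref() and handle_entityref() fire.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record every handler call as a (kind, value) tuple (illustrative sketch)."""

    def __init__(self):
        # Without convert_charrefs=False, the two *ref handlers are never called.
        super().__init__(convert_charrefs=False)
        self.events = []  # hypothetical log of handler calls

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_comment(self, data):
        self.events.append(("comment", data))


logger = EventLogger()
logger.feed("<p>Price: 5&euro; &#129321;<br /><!-- note --></p>")
logger.close()

for kind, value in logger.events:
    print(kind, repr(value))
```

Overriding handle_startendtag() here means `<br />` is logged once as a self-closing tag instead of being forwarded to the start and end handlers.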