sgmllib3k
sgmllib3k is a Python 3 port of the `sgmllib` module, which was deprecated in Python 2.6 and removed in Python 3.0. It provides a basic SGML/HTML parser for legacy applications. The current version is 1.0.0, released in 2011, and the project appears to be abandoned.
Warnings
- breaking sgmllib3k is not actively maintained and was last updated in 2011. It is designed for early Python 3 versions (e.g., 3.0-3.2) and may not be compatible or stable with modern Python 3.x releases (3.6+).
- deprecated This library itself is a port of a deprecated module (`sgmllib`). It lacks modern features like HTML5 support, robust error handling, and performance optimizations found in contemporary parsing libraries. Its use is strongly discouraged for new projects.
- gotcha Unlike tree-building libraries such as `BeautifulSoup`, `sgmllib3k` (and the original `sgmllib`) provides a very low-level, event-driven (SAX-like) parser. You must subclass `SGMLParser` and implement the handler methods yourself (e.g., `start_<tag>`, `end_<tag>`, `do_<tag>`, `unknown_starttag`, `unknown_endtag`, `handle_data`), which can be verbose and error-prone for complex parsing tasks.
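To make the handler-method convention concrete, here is a minimal sketch of per-tag dispatch: a start tag `<a>` is routed to a method named `start_a`, and its end tag to `end_a`. The names `extract_links`, `LinkParser`, and `SAMPLE` are hypothetical, chosen only for this illustration. The import is guarded because the package may not be installed; note that the module is imported as `sgmllib`.

```python
try:
    from sgmllib import SGMLParser  # module shipped by the sgmllib3k distribution
except ImportError:
    SGMLParser = None  # sgmllib3k is not installed in this environment

SAMPLE = '<p><a href="https://example.com/">Example</a> and <a href="/docs">docs</a></p>'


def extract_links(html):
    """Return the href values of <a> tags, or None if sgmllib is unavailable."""
    if SGMLParser is None:
        return None

    class LinkParser(SGMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def start_a(self, attrs):
            # attrs is a list of (name, value) pairs for the <a> tag
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

        def end_a(self):
            # Defined so that the closing </a> has a matching handler
            pass

    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


links = extract_links(SAMPLE)
print(links)
```

Tags without a dedicated handler, such as `<p>` here, fall through to `unknown_starttag` and `unknown_endtag`, which do nothing by default.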
Install
-
pip install sgmllib3k
Imports
- SGMLParser
from sgmllib import SGMLParser  # the sgmllib3k distribution installs its module as "sgmllib"
Quickstart
import sgmllib  # the sgmllib3k distribution installs its module as "sgmllib"


class MyParser(sgmllib.SGMLParser):
    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.data = []

    def handle_data(self, data):
        # Called with each run of text found between tags
        self.data.append(data)

    def unknown_starttag(self, tag, attrs):
        # Called for start tags that have no start_<tag> or do_<tag> handler
        pass

    def unknown_endtag(self, tag):
        # Called for end tags that have no end_<tag> handler
        pass


html_content = "<html><body><h1>Hello</h1><p>World</p></body></html>"
parser = MyParser()
parser.feed(html_content)
parser.close()  # flush any buffered input
print("Extracted data:", parser.data)
# Expected output: Extracted data: ['Hello', 'World']
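
For comparison, since this library is discouraged for new projects, the same text extraction can be sketched with the standard library's `html.parser`, which needs no third-party install. The class name `TextCollector` is hypothetical; this is an illustration of the equivalent approach, not a drop-in replacement for every `sgmllib` behavior.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes, mirroring the handle_data approach used with SGMLParser."""

    def __init__(self):
        super().__init__()
        self.data = []

    def handle_data(self, data):
        # Called with each run of text found between tags
        self.data.append(data)


collector = TextCollector()
collector.feed("<html><body><h1>Hello</h1><p>World</p></body></html>")
collector.close()
print("Extracted data:", collector.data)
# Extracted data: ['Hello', 'World']
```

The main difference in handler style is that `HTMLParser` routes all tags through `handle_starttag(tag, attrs)` and `handle_endtag(tag)` rather than looking up per-tag method names.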