HTML to JSON Converter
The `html-to-json` Python library, currently at version 2.0.0, converts HTML strings into a JSON representation and includes a table-aware conversion for HTML tables. The project is in maintenance mode: the author is seeking sponsorship to fund active development and ongoing upkeep.
Common errors
- TypeError: convert() got an unexpected keyword argument 'some_old_param'
  cause: The call passes a parameter that existed only in older (possibly non-`fhightower/html-to-json`) versions, or one that was removed or renamed in version 2.0.0.
  fix: Check the official GitHub README for `fhightower/html-to-json` version 2.0.0 for the correct function signature and available keyword arguments (e.g., `capture_element_values`, `capture_element_attributes`).
- KeyError: '_value' or KeyError: '_attributes' in output JSON
  cause: The calling code expects a flattened or otherwise different structure and does not account for the library's output format, which uses `_value` for text content and `_attributes` for element attributes. An element only carries the keys that apply to it, so an element without text has no `_value` key.
  fix: Treat the output as nested dictionaries and lists in which `_value` holds text and `_attributes` holds HTML attributes, and adjust your access logic accordingly. For example, use `output_json['head'][0]['title'][0]['_value']` to read the title text.
- json.decoder.JSONDecodeError: Expecting value: line X column Y (char Z)
  cause: `json.loads()` was given text that is not valid JSON, such as an empty string or the raw HTML itself. The examples on this page index the result of `html_to_json.convert()` directly, which suggests it is already a Python structure and does not need `json.loads()`. Highly malformed input HTML can also lead to unexpected output.
  fix: Inspect the raw return value of `html_to_json.convert()` before passing anything to `json.loads()`. If you need JSON text, serialize the result with `json.dumps()`. Ensure the input HTML is well-formed, and use an HTML validator if the source HTML is external or untrusted.
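The `KeyError` case above can be avoided with a small defensive accessor. The following is a minimal sketch, not part of the library: `get_path` is a hypothetical helper, and `sample_output` is a hand-written dictionary in the shape described above (it was not produced by running `html_to_json.convert`). It also assumes the converted result is a plain dict/list structure, as the indexing example implies.

```python
import json

# Hand-written sample in the documented shape (an assumption, not real library output).
sample_output = {
    "head": [
        {
            "title": [{"_value": "Test site"}],
            "meta": [{"_attributes": {"charset": "UTF-8"}}],
        }
    ]
}


def get_path(node, path, default=None):
    """Follow dict keys and list indexes in order; return default if any step is missing."""
    for step in path:
        try:
            node = node[step]
        except (KeyError, IndexError, TypeError):
            return default
    return node


# Present key: returns the text.
title = get_path(sample_output, ["head", 0, "title", 0, "_value"])

# <meta> has attributes but no text, so '_value' is absent; fall back to a default.
meta_value = get_path(sample_output, ["head", 0, "meta", 0, "_value"], default="")
charset = get_path(sample_output, ["head", 0, "meta", 0, "_attributes", "charset"])

# json.dumps turns the structure into JSON text; json.loads parses that text back.
as_text = json.dumps(sample_output)
round_tripped = json.loads(as_text)
```

Returning a default instead of raising keeps downstream code short when some elements have only text and others only attributes.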
Warnings
- maintenance: The library is in a maintenance-only state. The author has indicated that active development requires sponsorship, so new features and rapid bug fixes may not be prioritized without community support.
- gotcha: Version 2.0.0 introduced the `capture_element_values` and `capture_element_attributes` parameters on the `convert` function. Both default to `True`, but when upgrading from an earlier version, consider setting them explicitly if your downstream code relies on a specific JSON structure.
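One way to set the capture flags explicitly in a single place is to wrap the conversion function with `functools.partial`. This is a sketch under stated assumptions: `pin_capture_flags` is a hypothetical helper, and `fake_convert` is a stand-in with the same keyword arguments so the example runs without the library installed.

```python
from functools import partial


def pin_capture_flags(convert, values=True, attributes=True):
    """Return a callable that always passes both capture flags explicitly."""
    return partial(
        convert,
        capture_element_values=values,
        capture_element_attributes=attributes,
    )


# Stand-in for html_to_json.convert (assumption: same two keyword arguments).
def fake_convert(html_string, capture_element_values=True, capture_element_attributes=True):
    return {"values": capture_element_values, "attributes": capture_element_attributes}


convert_pinned = pin_capture_flags(fake_convert, attributes=False)
result = convert_pinned("<p>hi</p>")
```

In real code you would pass `html_to_json.convert` instead of `fake_convert`, so every call site shares the same flag settings.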
Install
- pip install html-to-json
Imports
- html_to_json
import html_to_json
Quickstart
import html_to_json
html_string = """<head>
<title>Test site</title>
<meta charset="UTF-8">
<p>This is some <b>bold</b> text.</p>
<table>
<thead>
<tr><th>Header 1</th><th>Header 2</th></tr>
</thead>
<tbody>
<tr><td>Data 1</td><td>Data 2</td></tr>
</tbody>
</table>
</head>"""
# Convert HTML to JSON
output_json = html_to_json.convert(html_string)
print(output_json)
# Convert HTML to JSON without capturing element values
output_no_values = html_to_json.convert(html_string, capture_element_values=False)
print(output_no_values)
# Convert HTML to JSON without capturing element attributes
output_no_attributes = html_to_json.convert(html_string, capture_element_attributes=False)
print(output_no_attributes)
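Once converted, the nested structure can be walked recursively to pull out every piece of text. The sketch below is illustrative only: `iter_values` is a hypothetical helper, and `sample` is a hand-written dictionary using the `_value` / `_attributes` convention rather than actual output from the quickstart above.

```python
def iter_values(node, tag=None):
    """Yield (tag, text) pairs for every '_value' in a nested dict/list structure."""
    if isinstance(node, dict):
        for key, child in node.items():
            if key == "_value":
                yield tag, child
            elif key != "_attributes":
                # Any other key is treated as a child element name.
                yield from iter_values(child, key)
    elif isinstance(node, list):
        # A list holds repeated elements that share the same tag.
        for item in node:
            yield from iter_values(item, tag)


# Hand-written sample (assumption: mirrors the documented output convention).
sample = {
    "head": [
        {
            "title": [{"_value": "Test site"}],
            "meta": [{"_attributes": {"charset": "UTF-8"}}],
            "p": [{"_value": "Some text"}],
        }
    ]
}

pairs = list(iter_values(sample))
for tag, text in pairs:
    print(f"{tag}: {text}")
```

Attributes are skipped here on purpose; a similar walk keyed on `_attributes` would collect them instead.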