{"id":4098,"library":"maincontentextractor","title":"MainContentExtractor","description":"MainContentExtractor is a Python library that extracts the main content from HTML documents. It aims to address limitations found in other extraction tools, such as the inability to output clean HTML directly. The library is useful for LLM-related tasks and for feeding data into frameworks like LangChain and LlamaIndex, since it can return the extracted content as HTML, plain text, or Markdown. The current version is 0.0.4, and the project is under active development.","status":"active","version":"0.0.4","language":"en","source_language":"en","source_url":"https://github.com/HawkClaws/main_content_extractor","tags":["html","content-extraction","web-scraping","llm","langchain","llamaindex","markdown"],"install":[{"cmd":"pip install MainContentExtractor","lang":"bash","label":"PyPI"}],"dependencies":[{"reason":"Used for HTML parsing and manipulation.","package":"beautifulsoup4","optional":false},{"reason":"Used for converting HTML to Markdown or plain text.","package":"html2text","optional":false},{"reason":"The core main content extraction relies on trafilatura internally.","package":"trafilatura","optional":false}],"imports":[{"note":"The PyPI package name is `MainContentExtractor` (capitalized), but the Python module is `main_content_extractor` (lowercase with underscores); only the class imported from it is named `MainContentExtractor`. 
Ensure correct casing for import statements.","wrong":"from MainContentExtractor import MainContentExtractor","symbol":"MainContentExtractor","correct":"from main_content_extractor import MainContentExtractor"}],"quickstart":{"code":"from main_content_extractor import MainContentExtractor\n\n# Example HTML content (or fetch from a URL, see below)\nhtml_content = \"\"\"\n<html>\n<head><title>Example Page</title></head>\n<body>\n    <header>Navigation Bar</header>\n    <main>\n        <h1>Important Article Title</h1>\n        <p>This is the main content paragraph.</p>\n        <p>Another paragraph with <a href=\"#\">a link</a> inside.</p>\n    </main>\n    <footer>Footer content</footer>\n</body>\n</html>\n\"\"\"\n\n# Or, fetch from a URL (requires 'requests'; uncomment the lines below)\n# import requests\n# url = \"https://www.example.com\"\n# response = requests.get(url)\n# response.encoding = 'utf-8'\n# html_content = response.text\n\n# Extract main content as HTML (the default output format)\nextracted_html = MainContentExtractor.extract(html_content)\nprint(\"--- Extracted HTML ---\")\nprint(extracted_html)\n\n# Extract main content as Markdown\nextracted_markdown = MainContentExtractor.extract(html_content, output_format=\"markdown\")\nprint(\"\\n--- Extracted Markdown ---\")\nprint(extracted_markdown)\n\n# Extract main content as plain text\nextracted_text = MainContentExtractor.extract(html_content, output_format=\"text\")\nprint(\"\\n--- Extracted Text ---\")\nprint(extracted_text)","lang":"python","description":"This quickstart extracts the main content from an HTML string using MainContentExtractor and prints it as HTML, Markdown, and plain text. To fetch the HTML from a URL instead, uncomment the `requests` lines and make sure `requests` is installed (`pip install requests`)."},"warnings":[{"fix":"Be aware that the output HTML might not be an exact replica of the original main content. 
Validate the extracted output against your specific use case, especially if exact fidelity to the original HTML structure is required.","message":"The library internally uses `trafilatura` and converts its XML output to HTML. This conversion is described as irreversible and may not perfectly match the original HTML structure.","severity":"gotcha","affected_versions":"0.0.1 - 0.0.4"},{"fix":"Always use `pip install MainContentExtractor` for installation and `from main_content_extractor import MainContentExtractor` for importing. Pay close attention to casing and underscores.","message":"The PyPI package name (`MainContentExtractor`, capitalized) is easy to confuse with the Python module name used for import (`main_content_extractor`, lowercase with underscores). A `ModuleNotFoundError` will occur if the import statement uses the wrong casing or omits the underscores.","severity":"gotcha","affected_versions":"0.0.1 - 0.0.4"},{"fix":"Pin your project's dependency to an exact version (e.g., `MainContentExtractor==0.0.4`) and thoroughly test your application after any update. Monitor the GitHub repository for release notes and changes.","message":"Given the library's early stage (version 0.0.4), API stability is not guaranteed. Any release before 1.0.0, including a patch release, may introduce breaking changes or significant modifications to the API without deprecation warnings.","severity":"breaking","affected_versions":"< 1.0.0"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}