MainContentExtractor

library · 0.0.4 · python

✓ verified May 22, 2026

MainContentExtractor is a Python library that extracts the main content from HTML documents. It aims to address limitations of other extraction tools, such as the inability to output clean HTML directly. Because it can return the extracted content as HTML, text, or Markdown, it is suited to LLM-related tasks and to feeding data into frameworks such as LangChain and LlamaIndex. The current version is 0.0.4, and development is relatively active.
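The three output formats can be illustrated with a small helper. This is a minimal sketch, not the library's documented API: the import path `main_content_extractor`, the `MainContentExtractor.extract(html, output_format=...)` call, and the format names `"html"`, `"text"`, and `"markdown"` are assumptions based on the project's README and should be checked against the installed version. The helper name `extract_formats` and its `extract_fn` parameter are hypothetical, introduced here only for illustration.

```python
from typing import Callable, Dict, Iterable, Optional


def extract_formats(
    html: str,
    extract_fn: Optional[Callable[..., str]] = None,
    formats: Iterable[str] = ("html", "text", "markdown"),
) -> Dict[str, str]:
    """Run one extractor over `html` once per requested output format.

    `extract_fn` is any callable accepting (html, output_format=...).
    When it is omitted, the helper tries to use MainContentExtractor.
    """
    if extract_fn is None:
        # ASSUMPTION: module name and call signature follow the project
        # README; verify against the version you have installed.
        from main_content_extractor import MainContentExtractor

        extract_fn = MainContentExtractor.extract

    # Map each format name to the extractor's output for that format.
    return {fmt: extract_fn(html, output_format=fmt) for fmt in formats}
```

Calling `extract_formats(page_html)` with the library installed would, under the assumptions above, return a dictionary keyed by format name. Passing a custom `extract_fn` keeps the helper usable with a different extractor or with a stub in tests.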

Resources

github: github.com/HawkClaws/main_content_extractor

package: pypi.org/project/maincontentextractor/

API endpoints

full doc /v1/registry/maincontentextractor

install /v1/registry/maincontentextractor/install

compatibility /v1/registry/maincontentextractor/compatibility