LlamaIndex Confluence Reader
The `llama-index-readers-confluence` library provides a data loader for ingesting content from Confluence Cloud instances into LlamaIndex. It supports several authentication methods (OAuth 2.0, API tokens, and basic authentication) and can retrieve pages by ID, space key, label, or Confluence Query Language (CQL). It can also include and parse attachments from Confluence pages. The package is part of the LlamaIndex ecosystem, which is updated frequently. The current version is 0.7.0 and requires Python >=3.10 and <4.0.
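To make the OAuth 2.0 option more concrete, here is a minimal sketch of assembling the `oauth2` dictionary. It assumes the shape is a `client_id` plus a nested `token` dictionary holding `access_token` and `token_type`; that nesting, the helper name `build_oauth2_config`, and the environment variable names `CONFLUENCE_CLIENT_ID` and `CONFLUENCE_ACCESS_TOKEN` are assumptions for illustration and should be checked against the installed version.

```python
import os


def build_oauth2_config(client_id: str, access_token: str, token_type: str = "Bearer") -> dict:
    """Assemble an oauth2 dict (hypothetical helper); raise ValueError on obviously bad input."""
    if not client_id or not access_token:
        raise ValueError("client_id and access_token are both required")
    if "@" in client_id:
        # The client_id comes from the Atlassian Developer Console; it is not an email address.
        raise ValueError("client_id looks like an email address")
    return {
        "client_id": client_id,
        # Assumed shape: the token is a nested dict with at least these two keys.
        "token": {"access_token": access_token, "token_type": token_type},
    }


# Placeholder values are used when the (hypothetical) environment variables are unset.
oauth2 = build_oauth2_config(
    client_id=os.environ.get("CONFLUENCE_CLIENT_ID") or "example-client-id",
    access_token=os.environ.get("CONFLUENCE_ACCESS_TOKEN") or "example-access-token",
)
```

The resulting dictionary would then be passed as the `oauth2` argument when constructing `ConfluenceReader`, in place of `api_token` or `user_name`/`password`.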
Common errors
- HTTPError: Current user not permitted to use Confluence
  cause Incorrect authentication parameters (e.g., wrong `base_url`, `api_token`, `user_name`, or `password`), or the authenticated user lacks permission to access the specified Confluence instance or space.
  fix Double-check that `base_url` ends with `/wiki` and that `api_token` or `user_name`/`password` are correct. Verify that the user associated with the credentials has read permission for the Confluence instance and for the specific space or pages you are trying to access.
- SSL error with OAuth2 when connecting to company Confluence; client_id not found.
  cause When using OAuth 2.0 (3LO) for authentication in a corporate Confluence environment, `client_id` is a mandatory component of the `oauth2` dictionary. The `client_id` is typically generated when setting up an OAuth 2.0 app in the Atlassian Developer Console. This error may occur if your company's Confluence setup does not expose a `client_id`, or if you are trying to use a personal access token in its place.
  fix For OAuth 2.0, ensure you have configured an OAuth 2.0 app in the Atlassian Developer Console to obtain a `client_id` and the necessary `access_token`/`token_type`. If your company's Confluence does not provide this option, consider an API token (Confluence Cloud) or basic authentication with username/password (Confluence Server) instead. For OAuth2, the `client_id` is *not* your email address.
- Retrievers perform poorly when querying by identifiers present only in page titles (e.g., 'Can you summarize ticket XYZ?')
  cause By default, LlamaIndex retrievers focus primarily on the document's text. If identifiers such as ticket numbers appear only in the page title (metadata) and not in the main text, the retriever may fail to find the relevant chunks.
  fix Consider adding the page title (or other relevant metadata) directly to the `Document` text during loading. You can also explore retrieval techniques such as `MetadataFilters` or `BM25` search alongside vector search, or use a `DocumentSummaryIndex` with a retriever focused on summaries.
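The suggestion of adding the page title to the document text can be sketched as follows. This is a minimal illustration, not the library's own mechanism: the helper name `text_with_title` is hypothetical, and it assumes the reader stores the page title under a `title` key in each document's metadata, which should be confirmed by inspecting `doc.metadata` on loaded documents.

```python
def text_with_title(text: str, metadata: dict, title_key: str = "title") -> str:
    """Return text prefixed with the page title (hypothetical helper).

    The text is returned unchanged when no title is present or when the
    text already starts with the title.
    """
    title = str(metadata.get(title_key) or "").strip()
    if not title or text.lstrip().startswith(title):
        return text
    return f"{title}\n\n{text}"


# Stand-in values; with real data these would be doc.text and doc.metadata.
sample_text = "The login page times out after 30 seconds."
sample_metadata = {"title": "Ticket XYZ: login timeout", "page_id": "12345"}
print(text_with_title(sample_text, sample_metadata))
```

With loaded documents, one option is to rebuild each one before indexing, for example `Document(text=text_with_title(doc.text, doc.metadata), metadata=doc.metadata)` with `Document` imported from `llama_index.core`, so that identifiers in the title become part of the embedded text.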
Warnings
- gotcha The `base_url` parameter for `ConfluenceReader` must end with `/wiki` (e.g., `https://your-domain.atlassian.net/wiki`). Omitting `/wiki` or providing an incorrect format will lead to connection errors.
- deprecated The `limit` parameter in `load_data` is deprecated. Use `max_num_results` instead for specifying the maximum number of pages to return.
- gotcha When using environment variables for basic authentication, `CONFLUENCE_PASSWORD` expects an API token generated from your Atlassian profile's security settings, not your actual Confluence user password. Using the actual password will result in authentication failures.
- breaking With LlamaIndex v0.10+, the library underwent a significant packaging refactor. While namespace imports are generally preserved (e.g., `from llama_index.readers.confluence import ConfluenceReader`), all third-party integrations, including this Confluence reader, are now separate PyPI packages. Direct imports from `llama_index.readers` might require `pip install llama-index-readers-confluence` first.
- gotcha The `ConfluenceReader` currently supports attachment types like PDF, PNG, JPEG/JPG, SVG, Word, and Excel. PowerPoint attachments are not supported for parsing by default, though custom parsers can be provided.
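Because a malformed `base_url` surfaces only as a connection error, it may help to check it before constructing the reader. The sketch below is one possible pre-flight check; `normalize_base_url` is a hypothetical helper, not part of the library, and it encodes only the `/wiki` rule stated in the first warning.

```python
from urllib.parse import urlparse


def normalize_base_url(base_url: str) -> str:
    """Strip trailing slashes and verify the URL ends with /wiki (hypothetical helper)."""
    cleaned = base_url.strip().rstrip("/")
    parsed = urlparse(cleaned)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"base_url must be an absolute http(s) URL, got {base_url!r}")
    if not parsed.path.endswith("/wiki"):
        raise ValueError(f"base_url must end with /wiki, got {base_url!r}")
    return cleaned


print(normalize_base_url("https://your-domain.atlassian.net/wiki/"))
# prints: https://your-domain.atlassian.net/wiki
```

Failing early with a `ValueError` gives a clearer message than the HTTP or connection error that would otherwise appear on the first `load_data` call.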
Install
- pip install llama-index-readers-confluence
Imports
- ConfluenceReader
from llama_index.readers.confluence import ConfluenceReader
Quickstart
import os
from llama_index.readers.confluence import ConfluenceReader
# Ensure CONFLUENCE_BASE_URL, CONFLUENCE_USERNAME, and CONFLUENCE_API_TOKEN are set as environment variables
# CONFLUENCE_BASE_URL should end with /wiki, e.g., 'https://your-domain.atlassian.net/wiki'
# CONFLUENCE_API_TOKEN is generated from your Atlassian profile security settings, not your password
base_url = os.environ.get('CONFLUENCE_BASE_URL', 'https://your-confluence-instance.atlassian.net/wiki')
username = os.environ.get('CONFLUENCE_USERNAME', 'your_email@example.com')
api_token = os.environ.get('CONFLUENCE_API_TOKEN', 'your_api_token')
# Initialize the reader with basic authentication (API token is preferred over password)
reader = ConfluenceReader(
    base_url=base_url,
    user_name=username,
    api_token=api_token,  # Use api_token for Confluence Cloud
)
# Load data from a specific space key
try:
    documents = reader.load_data(space_key='YOUR_SPACE_KEY', max_num_results=10, include_attachments=False)
    for doc in documents:
        print(f"Document ID: {doc.doc_id}")
        print(f"Content snippet: {doc.text[:200]}...")
except Exception as e:
    print(f"An error occurred: {e}")
    print("Please check your Confluence URL, credentials, and permissions.")
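The quickstart loads by space key, but pages can also be selected by ID, label, or CQL. The sketch below builds the keyword arguments for `load_data` and checks that exactly one selector is supplied. The helper `build_load_kwargs` is hypothetical, and the parameter names `page_ids`, `label`, and `cql` are assumed to match the `load_data` signature of the installed version; `max_num_results` is used in place of the deprecated `limit`.

```python
from typing import Any, Dict, List, Optional


def build_load_kwargs(
    space_key: Optional[str] = None,
    page_ids: Optional[List[str]] = None,
    label: Optional[str] = None,
    cql: Optional[str] = None,
    max_num_results: Optional[int] = None,
) -> Dict[str, Any]:
    """Build kwargs for load_data, requiring exactly one page selector (hypothetical helper)."""
    selectors = {"space_key": space_key, "page_ids": page_ids, "label": label, "cql": cql}
    chosen = {name: value for name, value in selectors.items() if value}
    if len(chosen) != 1:
        raise ValueError(
            "Specify exactly one of space_key, page_ids, label, or cql; "
            f"got {sorted(chosen) or 'none'}"
        )
    kwargs: Dict[str, Any] = dict(chosen)
    if max_num_results is not None:
        kwargs["max_num_results"] = max_num_results
    return kwargs


# Example: select pages with a CQL query (the query string is illustrative).
print(build_load_kwargs(cql='type = "page" AND label = "runbook"', max_num_results=5))
```

With a configured reader, the result would be used as `reader.load_data(**build_load_kwargs(label="runbook", max_num_results=20))`.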