LlamaIndex File Readers
The `llama-index-readers-file` library provides specialized data loaders for local file formats (e.g., PDF, DOCX, CSV, TXT, and images) within the LlamaIndex ecosystem. It ingests these files into LlamaIndex `Document` objects for indexing and retrieval. The current version is 0.6.0; releases typically align with LlamaIndex core library updates.
Warnings
- breaking: Before LlamaIndex v0.10.x, some file readers were available directly from the main `llama_index` package. Since the modularization, specific readers live in separate packages such as `llama-index-readers-file`, which must be installed separately.
- gotcha: Many file-type readers (e.g., `PDFReader`, `DocxReader`, `CSVReader`) rely on additional third-party libraries that are not installed by default with `llama-index-readers-file`. You must install these optional dependencies explicitly.
- gotcha: When dealing with complex documents (e.g., PDFs with tables, scanned images, or nested structures), `FlatReader` or basic type-specific readers may not extract content optimally. For advanced parsing, consider `UnstructuredReader` (available in `llama-index-readers-unstructured`).
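The optional-dependency gotcha above can be caught early with a small pre-flight check. This is a minimal sketch using only the standard library. The helper name `missing_modules` is hypothetical, and the module names passed to it (`pypdf`, `docx2txt`, `pandas`) are assumptions about what individual readers import, so verify them against the reader you actually use.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level module names that cannot be imported here."""
    # find_spec returns None when a top-level module is not installed;
    # it does not import the module itself.
    return [name for name in module_names if find_spec(name) is None]


# Assumed module names, not taken from the package metadata.
print(missing_modules(["pypdf", "docx2txt", "pandas"]))
```

An empty list means every listed module is importable; any name that is printed still needs to be installed before the matching reader is used.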
Install
- pip install llama-index-readers-file
- pip install llama-index-readers-file[pdf,docx,xlsx]
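After installing, it can be useful to confirm which version of the package is present in the active environment. The sketch below uses the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of LlamaIndex.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string for a distribution, or None."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in this environment.
        return None


print(installed_version("llama-index-readers-file"))
```

A result of `None` means the package is not installed in the interpreter that ran the check, which is a common cause of import errors when several virtual environments are in use.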
Imports
- FlatReader
from llama_index.readers.file import FlatReader
- PDFReader
from llama_index.readers.file import PDFReader
- DocxReader
from llama_index.readers.file import DocxReader
- CSVReader
from llama_index.readers.file import CSVReader
Quickstart
import tempfile
from pathlib import Path
from llama_index.readers.file import FlatReader
# Create a dummy text file
file_content = "This is a sample document for LlamaIndex. It contains some text."
with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.txt') as tmp_file:
    tmp_file.write(file_content)
    tmp_file_path = Path(tmp_file.name)
# Initialize the FlatReader
reader = FlatReader()
# Load data from the temporary file
documents = reader.load_data(file=tmp_file_path)
# Print the content of the first document
if documents:
    print(f"Loaded document content: {documents[0].text[:100]}...")
    print(f"Metadata: {documents[0].metadata}")
# Clean up the temporary file
tmp_file_path.unlink()
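The quickstart loads a single file; the same `load_data(file=...)` call can be repeated over a directory. The following is a minimal sketch under that assumption. The helper `load_text_files` is hypothetical, and it accepts any object exposing `load_data(file=Path)` so that it does not depend on a particular reader class.

```python
from pathlib import Path


def load_text_files(reader, directory, pattern="*.txt"):
    """Load every file in `directory` matching `pattern` using `reader`.

    `reader` is any object with a `load_data(file=Path)` method that
    returns a list of documents, as FlatReader does in the quickstart.
    """
    documents = []
    # Sort the paths so the result order is deterministic.
    for path in sorted(Path(directory).glob(pattern)):
        # Each call returns a list, so extend rather than append.
        documents.extend(reader.load_data(file=path))
    return documents
```

With the package installed, passing `FlatReader()` as `reader` together with a directory of your own should yield one combined list of documents. Only the top level of the directory is searched; a pattern such as `**/*.txt` would be needed to recurse.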