Internet Archive Python Library
internetarchive is a Python interface to archive.org that provides both a command-line interface (the `ia` CLI) and a Python API. It gives programmatic access to searching, downloading, uploading, and other Internet Archive services. The library is actively maintained: 5.8.0 is the latest stable release, and new versions appear periodically to add features, improve performance, and fix bugs or security vulnerabilities.
Common errors
- HTTPError: 403 Client Error: Forbidden for url: https://s3.us.archive.org/...
  Cause: Incorrect or missing Internet Archive S3 credentials (IA_ACCESS_KEY, IA_SECRET_KEY) for operations that require authentication (upload, metadata modification).
  Fix: Ensure the `IA_ACCESS_KEY` and `IA_SECRET_KEY` environment variables are set to valid keys. Alternatively, use `internetarchive.configure()` or the `ia configure` CLI command to store them in the configuration file (`~/.ia`).
- All files in item 'my_item_id' were deleted unexpectedly after running 'ia delete --glob="*.txt"'
  Cause: A critical bug in the `ia delete` command in versions 5.4.1 through 5.6.x caused glob patterns to be ignored, leading to unintended mass deletions.
  Fix: Upgrade the `internetarchive` library to version 5.7.0 or newer. Always run `ia delete --dry-run` first to verify that only the intended files are targeted.
- Error: Failed to download file 'malicious_path/../../etc/passwd'. Permission denied.
  Cause: Either a potentially malicious filename attempted directory traversal (a vulnerability in `File.download()` prior to v5.5.1), or the target directory has a general permission problem.
  Fix: Upgrade to `internetarchive` v5.5.1 or higher to patch the directory traversal vulnerability. Ensure the target download directory is writable by the user running the script.
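The directory traversal case above comes down to checking that a resolved download path stays inside the target directory. The following is a minimal sketch of that kind of check, not the library's actual patch; `safe_join` is a hypothetical helper name introduced here for illustration.

```python
import os


def safe_join(base_dir: str, untrusted_name: str) -> str:
    """Join untrusted_name onto base_dir, refusing paths that escape base_dir.

    Hypothetical helper for illustration; not part of internetarchive.
    """
    base = os.path.realpath(base_dir)
    # realpath collapses ".." segments and resolves symlinks before comparing.
    target = os.path.realpath(os.path.join(base, untrusted_name))
    # commonpath equals base only when target is base itself or lives beneath it.
    if os.path.commonpath([base, target]) != base:
        raise ValueError(f"unsafe path outside {base!r}: {untrusted_name!r}")
    return target
```

A filename such as 'malicious_path/../../etc/passwd' resolves to a location above the base directory, so the helper raises `ValueError` instead of returning a path.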
Warnings
- breaking: Versions 5.4.1 through 5.6.x contain a critical bug (fixed in v5.7.0) that makes `ia delete --glob` and `ia delete --format` delete *all* files in an item, regardless of the specified pattern, which can cause significant data loss.
- breaking: Versions <=5.5.0 contain a critical directory traversal vulnerability in `File.download()` (fixed in v5.5.1). Malicious filenames could write files outside the target directory, a severe risk, especially on Windows.
- gotcha: The Internet Archive API enforces rate limits, especially for uploads. Exceeding them can lock an account out of API uploads, either temporarily or, in some cases, until the lockout is lifted manually.
- gotcha: Certain metadata fields (e.g., `mediatype`, `collection`) are 'write-once' and can only be set during the *initial* upload of an item. Subsequent attempts to modify them will be ignored or cause errors.
- gotcha: Installing `internetarchive` via unsupported third-party package managers (e.g., Homebrew, MacPorts, Linux system packages like `apt` or `yum`) often results in severely outdated, incompatible, or broken versions.
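The version ranges in the two breaking warnings can be encoded as a small check. This is a sketch built only from the ranges stated above; `parse_version` and `known_issues` are hypothetical names, and the parser assumes plain `major.minor.patch` version strings with an optional leading `v`.

```python
import re


def parse_version(version: str) -> tuple[int, int, int]:
    """Parse 'X.Y' or 'X.Y.Z' (optionally prefixed with 'v') into a tuple."""
    match = re.match(r"v?(\d+)\.(\d+)(?:\.(\d+))?", version)
    if not match:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def known_issues(version: str) -> list[str]:
    """Return the warnings listed above that apply to the given version."""
    v = parse_version(version)
    issues = []
    if v <= (5, 5, 0):
        issues.append("directory traversal in File.download() (fixed in 5.5.1)")
    if (5, 4, 1) <= v < (5, 7, 0):
        issues.append("ia delete ignores --glob/--format (fixed in 5.7.0)")
    return issues
```

For example, `known_issues("5.5.0")` reports both problems, while `known_issues("5.7.0")` returns an empty list.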
Install
- pip install internetarchive
- pip install "internetarchive[speedups]"
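After installing, it can help to confirm which version Python actually sees, particularly if a system package manager may have left an older copy behind. A minimal sketch using only the standard library; `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "internetarchive") -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("internetarchive is not installed in this environment")
    else:
        print(f"internetarchive {found} is installed")
```

This reads the package metadata of the current interpreter's environment, so it should be run with the same Python that will run your scripts.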
Imports
- get_item
from internetarchive import get_item
- search_items
from internetarchive import search_items
- upload
from internetarchive import upload
- download
from internetarchive import download
- configure
wrong: import internetarchive.configure (`configure` is a function, not a submodule)
correct: from internetarchive import configure
Quickstart
import os
import tempfile

from internetarchive import search_items, get_item, upload

# --- Authentication ---
# Uploads and metadata modification require your IA S3 keys.
# Generate them at https://archive.org/account/s3.php, then set them as
# environment variables for programmatic access:
#   export IA_ACCESS_KEY='YOUR_ACCESS_KEY'
#   export IA_SECRET_KEY='YOUR_SECRET_KEY'

# --- 1. Search for items ---
print("Searching for items with subject 'NASA'...")
# Request the title explicitly; by default results carry only the identifier.
search_results = search_items('subject:NASA', fields=['identifier', 'title'])
for i, result in enumerate(search_results.iter_as_results()):
    if i >= 3:  # Limit to 3 results for brevity
        break
    print(f"  - Identifier: {result['identifier']}, Title: {result.get('title')}")

# --- 2. Download a file from an item ---
# 'nasa_images_1960s' is an example identifier; substitute any public item
# with downloadable files.
print("\nAttempting to download a file from 'nasa_images_1960s'...")
try:
    item_to_download = get_item('nasa_images_1960s')
    # get_files() returns a generator, so materialize it before indexing.
    files = list(item_to_download.get_files(formats=['JPEG', 'PNG', 'image/jpeg']))
    if files:
        file_to_download = files[0]
        print(f"Downloading {file_to_download.name}...")
        # Use tempfile for a safe, temporary download location
        with tempfile.TemporaryDirectory() as tmpdir:
            # download()'s first positional argument is the file path, so
            # pass the target directory by keyword.
            file_to_download.download(destdir=tmpdir)
            downloaded_path = os.path.join(tmpdir, file_to_download.name)
            print(f"Downloaded to: {downloaded_path}")
    else:
        print("No suitable files found to download from 'nasa_images_1960s'.")
except Exception as e:
    print(f"Error during download: {e}")

# --- 3. Upload a dummy file ---
# Requires IA_ACCESS_KEY and IA_SECRET_KEY to be set as environment variables
access_key = os.environ.get('IA_ACCESS_KEY', 'YOUR_ACCESS_KEY')
secret_key = os.environ.get('IA_SECRET_KEY', 'YOUR_SECRET_KEY')
if access_key == 'YOUR_ACCESS_KEY' or secret_key == 'YOUR_SECRET_KEY':
    print("\nSkipping upload example: IA_ACCESS_KEY or IA_SECRET_KEY not set.")
    print("Please set environment variables or use 'ia configure' to enable uploads.")
else:
    print("\nAttempting to upload a dummy file...")
    temp_file_name = "my_dummy_file.txt"  # Name the file will have in the item
    temp_file_content = "This is a test upload from the internetarchive Python library."
    identifier = "my_unique_test_item_12345"  # Replace with a truly unique identifier
    # delete=False keeps the file on disk after the with-block so it can be uploaded.
    with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.txt') as f:
        f.write(temp_file_content)
        temp_file_path = f.name
    metadata = {
        'title': f'My Test Item {identifier}',
        'description': 'A dummy item uploaded via Python library quickstart.',
        'mediatype': 'data',  # Required
        'collection': 'test_collection',  # Replace with a collection you have write access to
    }
    try:
        print(f"Uploading {temp_file_path} to {identifier}...")
        # A dict maps the remote filename to the local path.
        r = upload(identifier, files={temp_file_name: temp_file_path}, metadata=metadata)
        print(f"Upload successful! Status: {r[0].status_code}")
        print(f"View item at: https://archive.org/details/{identifier}")
    except Exception as e:
        print(f"Error during upload: {e}")
    finally:
        # Always remove the local temporary file, even if the upload failed.
        os.remove(temp_file_path)
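
Because uploads are rate limited, a script that uploads many files may want to pause and retry instead of failing on the first error. The following is a generic sketch of retrying with exponential backoff and jitter. It is not a feature of the library: `retry_with_backoff` and its parameters are hypothetical, and which exceptions are worth retrying is an assumption you should narrow for your own use.

```python
import random
import time


def retry_with_backoff(operation, *, attempts=5, base_delay=1.0, max_delay=60.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call operation() until it succeeds or attempts are exhausted.

    Hypothetical helper: waits base_delay * 2**attempt seconds (capped at
    max_delay) plus up to 50% random jitter between failed attempts.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

With the quickstart's variables, the upload call could be wrapped as `retry_with_backoff(lambda: upload(identifier, files=[temp_file_path], metadata=metadata))`. The `sleep` parameter exists so that the waiting can be replaced in tests.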