pgcopy
pgcopy is a Python library for fast data loading into PostgreSQL databases via the PostgreSQL binary COPY protocol. It supports several database adapters (psycopg2, psycopg, pg8000, and PyGreSQL) and handles a wide range of PostgreSQL data types, including arrays. The library is currently at version 1.6.2 and focuses on efficient bulk data insertion.
Common errors
- ERROR: could not open file "/path/to/file.csv" for reading: Permission denied
  cause: The PostgreSQL server process (not the client application running pgcopy) does not have the file system permissions needed to read the specified file. This is a common issue because a server-side COPY command reads the file from the server's perspective.
  fix: Ensure the file is accessible and readable by the PostgreSQL server's user. Place the file in a universally accessible directory such as `/tmp` (on Linux/macOS) or `C:\Users\Public` (on Windows), or adjust file/directory permissions for the PostgreSQL user. Alternatively, ensure the PostgreSQL user is a superuser (less secure) or use client-side copy methods such as `COPY ... FROM STDIN`, which is what pgcopy uses.
- ERROR: missing data for column "column_name"
  cause: The input data (e.g., a CSV file) has a row with fewer columns than the target table expects, or a non-numeric value (often an empty string) appears in a numeric column. It can also occur when the column order in the input data does not match the table schema.
  fix: Verify that the input data's structure (number and order of columns, data types) exactly matches the target table's schema. If empty strings should be treated as NULLs, use `NULL ''` in your `COPY` command options (when using SQL directly) or convert empty values to `None` when preparing records for pgcopy.
- ERROR: could not extend file "base/..." or "pg_wal/...": No space left on device
  cause: The disk where PostgreSQL stores its data, WAL (Write-Ahead Log) files, or temporary files has run out of space. This can halt transactions, cause panics, and even lead to server shutdown.
  fix: Immediately free up disk space by removing old log files, clearing temporary files, or truncating unnecessary data. If `pg_wal` is full, investigate WAL archiving and stale replication slots. Implement proactive disk usage monitoring and regular `VACUUM` operations to prevent bloat.
- server closed the connection unexpectedly
  cause: The PostgreSQL server terminated abnormally before or while processing the request, or the connection was lost due to network issues, firewall blocking, or the server not running or not listening on the expected address/port.
  fix: Verify that the PostgreSQL server is running and listening on the correct IP address and port (`listen_addresses` and `port` in `postgresql.conf`). Check firewall rules to ensure port 5432 (or your custom port) is open. Review the PostgreSQL logs for server-side errors that might have caused the unexpected termination.
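For the "missing data" case, when feeding records through pgcopy rather than raw SQL `COPY`, a small normalization pass can map empty strings to `None`, which pgcopy writes as NULL. A minimal sketch; `normalize_empty` is a hypothetical helper, not part of pgcopy:

```python
def normalize_empty(records):
    """Convert empty strings to None so they load as SQL NULLs.

    `records` is an iterable of row tuples, shaped like the records
    you would pass to CopyManager.copy.
    """
    return [
        tuple(None if value == '' else value for value in row)
        for row in records
    ]

rows = [(1, 'Oslo', '12.5'), (2, '', '')]
print(normalize_empty(rows))  # [(1, 'Oslo', '12.5'), (2, None, None)]
```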
Warnings
- gotcha When using `psycopg2` (a common dependency for `pgcopy`), be aware of the `psycopg2-binary` package. `psycopg2-binary` is convenient for quick starts and development, but the upstream project recommends building `psycopg2` from source for production, because the binary wheels bundle their own libpq and SSL libraries, which can conflict with other libraries on the system. `psycopg` (Psycopg 3), with its clearer binary/source distinction, is another option.
- gotcha The `pgcopy.Replace` context manager, which automates replacing an entire table for faster bulk updates, can lead to issues with concurrent queries on PostgreSQL versions 9.1 and earlier, as concurrent queries on the table will fail once the old table is dropped. Additionally, it's not suitable if other tables depend on the table being replaced.
- gotcha PostgreSQL's `COPY` command (which `pgcopy` uses) performs best when tables have no triggers, foreign keys, or other constraints during the bulk import. These can significantly slow down the process as they are validated for each inserted row.
- gotcha When dealing with large datasets, a single, very large `COPY` transaction can stress PostgreSQL by generating enormous Write-Ahead Log (WAL) segments, leading to rapid storage filling and replication falling behind. Conversely, very small batches can incur excessive transaction overhead.
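The last gotcha suggests a middle ground: copy in moderate-sized batches and commit each one, so no single transaction generates unbounded WAL. A sketch of a generic chunking helper; `chunked` is a hypothetical name, and the commented lines assume a live connection and a `CopyManager` as in the Quickstart below:

```python
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Usage against a live database (assumes `conn` and `mgr` already exist):
# for batch in chunked(records, 10_000):
#     mgr.copy(batch)
#     conn.commit()  # bounded transactions keep WAL growth in check

print([len(b) for b in chunked(range(25), 10)])  # [10, 10, 5]
```

The batch size is workload-dependent; tens of thousands of rows per commit is a common starting point.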
Install
- pip install pgcopy
Imports
- CopyManager
  from pgcopy import CopyManager
- Replace
  from pgcopy import Replace
  from pgcopy.util import Replace
Quickstart
import os
from datetime import datetime

import psycopg
from pgcopy import CopyManager

# Ensure you have a running PostgreSQL instance and a 'weather_db' database.
# For example: `psql -c 'CREATE DATABASE weather_db;'`
# And inside weather_db:
#   CREATE TABLE measurements_table (
#       id INT, timestamp TIMESTAMP, location VARCHAR(255), temperature NUMERIC);

db_name = os.environ.get('PG_DB_NAME', 'weather_db')
db_user = os.environ.get('PG_DB_USER', 'postgres')
db_password = os.environ.get('PG_DB_PASSWORD', '')
db_host = os.environ.get('PG_DB_HOST', 'localhost')
db_port = os.environ.get('PG_DB_PORT', '5432')
conn_string = f"dbname={db_name} user={db_user} password={db_password} host={db_host} port={db_port}"

conn = None
try:
    conn = psycopg.connect(conn_string)
    print(f"Connected to database: {db_name}")
    # Define columns and records
    cols = ('id', 'timestamp', 'location', 'temperature')
    now = datetime.now()
    records = [
        (0, now, 'Jerusalem', 72.2),
        (1, now, 'New York', 75.6),
        (2, now, 'Moscow', 54.3),
    ]
    # Use CopyManager for fast insertion
    mgr = CopyManager(conn, 'measurements_table', cols)
    mgr.copy(records)
    conn.commit()
    print("Records copied successfully.")
    # Verify data (optional)
    with conn.cursor() as cursor:
        cursor.execute("SELECT * FROM measurements_table ORDER BY id;")
        print("\nData in table after copy:")
        for row in cursor.fetchall():
            print(row)
finally:
    if conn is not None:
        conn.close()
        print("Database connection closed.")
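The Replace import above pairs naturally with CopyManager for full-table refreshes: pgcopy's `Replace` context manager yields the name of a temporary clone of the table, and on a clean exit swaps it in for the original. A sketch assuming an open connection like the Quickstart's; `bulk_replace` is a hypothetical wrapper, not a pgcopy API:

```python
def bulk_replace(conn, table, cols, records):
    """Replace the entire contents of `table` with `records` in one swap."""
    # Imported inside the function so this sketch can be loaded without a
    # database session; requires `pip install pgcopy`.
    from pgcopy import CopyManager, Replace

    # Replace yields a temporary table name; on normal exit from the
    # block, the temp table replaces the original.
    with Replace(conn, table) as temp_name:
        mgr = CopyManager(conn, temp_name, cols)
        mgr.copy(records)
    conn.commit()

# Example (requires a running server and the Quickstart table):
# bulk_replace(conn, 'measurements_table', cols, records)
```

Note the gotchas above: this approach is unsuitable when other tables depend on the one being replaced, and on PostgreSQL 9.1 and earlier concurrent queries fail once the old table is dropped.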