{"id":9641,"library":"dataengine","title":"Dataengine","description":"Dataengine is a general-purpose Python package designed for streamlined data engineering tasks. It provides a unified API for working with various data processing backends such as Pandas, Polars, Spark, and Ray, along with integrations for common data formats (CSV, Parquet, JSON) and relational databases via SQLAlchemy and DuckDB. Currently at version 0.0.92, it is under active development with a focus on providing flexible and scalable data manipulation tools.","status":"active","version":"0.0.92","language":"en","source_language":"en","source_url":"https://github.com/leealessandrini/dataengine","tags":["data-engineering","dataframe","etl","pandas","polars","spark","ray","database"],"install":[{"cmd":"pip install dataengine","lang":"bash","label":"Base Install"},{"cmd":"pip install 'dataengine[spark]' 'dataengine[s3]' 'dataengine[postgres]'","lang":"bash","label":"Install with Spark, S3, and PostgreSQL extras"}],"dependencies":[{"reason":"Core data processing backend","package":"pandas","optional":false},{"reason":"Core data processing backend","package":"polars","optional":false},{"reason":"Underpins data handling for many formats (e.g., Parquet)","package":"pyarrow","optional":false},{"reason":"Embedded SQL engine for local data","package":"duckdb","optional":false},{"reason":"Unified database connectivity","package":"sqlalchemy","optional":false},{"reason":"Data validation and settings management","package":"pydantic","optional":false},{"reason":"Optional distributed computing backend","package":"ray","optional":true},{"reason":"Optional distributed computing backend for Apache Spark","package":"pyspark","optional":true},{"reason":"Optional for S3 storage integration","package":"s3fs","optional":true},{"reason":"Optional for Azure Blob Storage integration","package":"azure-storage-blob","optional":true},{"reason":"Optional for Google Cloud Storage 
integration","package":"google-cloud-storage","optional":true},{"reason":"Optional for MySQL database connectivity","package":"mysqlclient","optional":true},{"reason":"Optional for PostgreSQL database connectivity","package":"psycopg","optional":true}],"imports":[{"symbol":"SparkSession","correct":"from dataengine import SparkSession"},{"symbol":"PandasDataFrame","correct":"from dataengine import PandasDataFrame"},{"symbol":"CSV","correct":"from dataengine import CSV"}],"quickstart":{"code":"import dataengine as de\nimport os\n\n# Create a dummy CSV file for the example (plain file I/O; pandas is not needed for this step)\ndummy_csv_content = \"id,name,value\\n1,Alice,100\\n2,Bob,200\\n3,Charlie,150\"\ntemp_csv_file = \"temp_quickstart.csv\"\nwith open(temp_csv_file, \"w\") as f:\n    f.write(dummy_csv_content)\n\ntry:\n    # 1. Initialize a session (PandasSession is simplest for local execution)\n    # For Spark, you would use: session = de.SparkSession() (requires pyspark)\n    session = de.PandasSession()\n    print(f\"Initialized session type: {session.__class__.__name__}\")\n\n    # 2. Create a DataFrame directly from Python data\n    data = {'item': ['apple', 'banana', 'orange'], 'price': [1.0, 0.5, 0.75]}\n    df_from_dict = de.PandasDataFrame(data=data)\n    print(\"\\nDataFrame created from dict:\")\n    print(df_from_dict.to_pandas())\n\n    # 3. Read data from the temporary CSV file\n    csv_reader = de.CSV(session)\n    df_from_csv = csv_reader.read(temp_csv_file)\n    print(\"\\nDataFrame read from temporary CSV file:\")\n    print(df_from_csv.to_pandas())\n\n    # 4. Perform a simple transformation (e.g., filter)\n    filtered_df = df_from_csv.filter(df_from_csv['value'] > 100)\n    print(\"\\nFiltered DataFrame (value > 100):\")\n    print(filtered_df.to_pandas())\n\n    # 5. 
Example of writing (conceptual - actual write requires specific setup)\n    # For instance: filtered_df.write.parquet(\"output.parquet\")\n\nexcept Exception as e:\n    print(f\"An error occurred during quickstart: {e}\")\nfinally:\n    # Clean up the dummy file\n    if os.path.exists(temp_csv_file):\n        os.remove(temp_csv_file)\n\nprint(\"\\nQuickstart complete. Explore de.SparkSession, de.PolarsDataFrame, de.Parquet etc. for more advanced usage.\")","lang":"python","description":"This quickstart demonstrates initializing a `PandasSession` (suitable for local use without heavy dependencies), creating a DataFrame from a Python dictionary, reading data from a temporary CSV file, and performing a basic filter operation. It highlights Dataengine's unified API for data manipulation."},"warnings":[{"fix":"As a pre-1.0 library, `dataengine`'s API can change rapidly. Consult the latest GitHub README and example code for current usage. Pin your `dataengine` version to avoid unexpected updates: `pip install dataengine==0.0.92`.","message":"API Breaking Changes in Pre-1.0 Versions","severity":"breaking","affected_versions":"<1.0.0"},{"fix":"Attempting to use functionality for a specific backend (e.g., Spark, Ray) or database/cloud storage (e.g., S3, PostgreSQL) without installing its optional dependencies will result in `ModuleNotFoundError`. Ensure you install `dataengine` with the necessary extras, such as `pip install 'dataengine[spark]'` or `pip install 'dataengine[s3]'`. Refer to the `pyproject.toml` on GitHub for a complete list of available extras.","message":"Missing Optional Dependencies for Specific Backends/Connectors","severity":"gotcha","affected_versions":"*"},{"fix":"Dataengine supports multiple backends (Pandas, Polars, Spark, Ray), each with distinct performance characteristics. `PandasSession` is single-threaded, `PolarsSession` is multi-threaded, and `SparkSession`/`RaySession` are distributed. 
Understand which backend is active and choose the appropriate session type and operations for your data size and computational requirements to optimize performance.","message":"Performance Variance Across Different Backends","severity":"gotcha","affected_versions":"*"}],"env_vars":null,"last_verified":"2026-04-17T00:00:00.000Z","next_check":"2026-07-16T00:00:00.000Z","problems":[{"fix":"Install `dataengine` with the Spark extra: `pip install 'dataengine[spark]'`.","cause":"Attempting to use `de.SparkSession` or `de.SparkDataFrame` without installing the `pyspark` optional dependency.","error":"ModuleNotFoundError: No module named 'pyspark'"},{"fix":"Check the `dataengine` GitHub repository for recent changes and update your code to use the new class/function names. Consider pinning your `dataengine` version if stability is critical for your project.","cause":"An API class, function, or module name has changed between `dataengine` versions, which is common during pre-1.0 development.","error":"AttributeError: module 'dataengine' has no attribute 'OldClassName'"},{"fix":"Install the required database driver as an extra (e.g., `pip install 'dataengine[postgres]'` for PostgreSQL, `pip install 'dataengine[mysql]'` for MySQL). Additionally, verify your connection string, host, port, user, and password for correctness.","cause":"Missing a database-specific driver or incorrect connection parameters when interacting with databases via `dataengine`'s SQL capabilities.","error":"dataengine.errors.DataEngineError: Failed to connect to database... (e.g., psycopg is not installed)"}]}