Dataengine

0.0.92 · active · verified Fri Apr 17

Dataengine is a general-purpose Python package designed for streamlined data engineering tasks. It provides a unified API for working with various data processing backends such as Pandas, Polars, Spark, and Ray, along with integrations for common data formats (CSV, Parquet, JSON) and relational databases via SQLAlchemy and DuckDB. Currently at version 0.0.92, it is under active development with a focus on providing flexible and scalable data manipulation tools.
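The central idea — one DataFrame interface dispatched to interchangeable engines — can be illustrated with a minimal stdlib-only sketch. The class names here (`Backend`, `ListBackend`, `UnifiedFrame`) are hypothetical, chosen only to show the pattern; they are not Dataengine's own API:

```python
from typing import Any, Callable, Protocol

Row = dict[str, Any]

class Backend(Protocol):
    """Minimal backend contract: each engine knows how to filter rows."""
    def filter(self, rows: list[Row], predicate: Callable[[Row], bool]) -> list[Row]: ...

class ListBackend:
    """Trivial in-memory engine, standing in for Pandas/Polars/Spark/Ray."""
    def filter(self, rows: list[Row], predicate: Callable[[Row], bool]) -> list[Row]:
        return [r for r in rows if predicate(r)]

class UnifiedFrame:
    """Facade: user code targets this API regardless of the engine behind it."""
    def __init__(self, rows: list[Row], backend: Backend):
        self.rows, self.backend = rows, backend

    def filter(self, predicate: Callable[[Row], bool]) -> "UnifiedFrame":
        # Delegate to the backend; swapping engines never changes user code.
        return UnifiedFrame(self.backend.filter(self.rows, predicate), self.backend)

df = UnifiedFrame(
    [{"id": 1, "value": 100}, {"id": 2, "value": 200}, {"id": 3, "value": 150}],
    backend=ListBackend(),
)
over_100 = df.filter(lambda r: r["value"] > 100)
print([r["id"] for r in over_100.rows])  # → [2, 3]
```

Swapping `ListBackend` for an engine-specific implementation leaves the calling code untouched, which is the property the library's session classes provide across its supported backends.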

Install
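Assuming the package is published on PyPI under its project name (not confirmed on this page), installation follows the usual pattern; engines beyond Pandas are installed separately, as the quickstart notes for Spark:

```shell
# Assumption: the PyPI distribution name matches the project name
pip install dataengine

# Optional engines are assumed to be installed directly, not via extras
pip install pyspark   # only needed for de.SparkSession()
```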

Imports
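The quickstart below relies on the following imports. The `de` alias matches this page's own examples; the `try`/`except` guard is added here only so the snippet runs cleanly in environments where the package is absent:

```python
import os  # stdlib; used by the quickstart to clean up its temporary CSV

# Assumption: the package imports under its project name.
try:
    import dataengine as de
except ImportError:
    de = None  # not installed; the quickstart will not run without it
```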

Quickstart

This quickstart demonstrates initializing a `PandasSession` (suitable for local use without heavy dependencies), creating a DataFrame from a Python dictionary, reading data from a temporary CSV file, and performing a basic filter operation. It highlights Dataengine's unified API for data manipulation.

import dataengine as de
import os

# Create a dummy CSV file for the example
dummy_csv_content = "id,name,value\n1,Alice,100\n2,Bob,200\n3,Charlie,150"
temp_csv_file = "temp_quickstart.csv"
with open(temp_csv_file, "w") as f:
    f.write(dummy_csv_content)

try:
    # 1. Initialize a session (PandasSession is simplest for local execution)
    # For Spark, you would use: session = de.SparkSession() (requires pyspark)
    session = de.PandasSession()
    print(f"Initialized session type: {session.__class__.__name__}")

    # 2. Create a DataFrame directly from Python data
    data = {'item': ['apple', 'banana', 'orange'], 'price': [1.0, 0.5, 0.75]}
    df_from_dict = de.PandasDataFrame(data=data)
    print("\nDataFrame created from dict:")
    print(df_from_dict.to_pandas())

    # 3. Read data from the temporary CSV file
    csv_reader = de.CSV(session)
    df_from_csv = csv_reader.read(temp_csv_file)
    print("\nDataFrame read from temporary CSV file:")
    print(df_from_csv.to_pandas())

    # 4. Perform a simple transformation (e.g., filter)
    filtered_df = df_from_csv.filter(df_from_csv['value'] > 100)
    print("\nFiltered DataFrame (value > 100):")
    print(filtered_df.to_pandas())

    # 5. Example of writing (conceptual - actual write requires specific setup)
    # For instance: filtered_df.write.parquet("output.parquet")

except Exception as e:
    print(f"An error occurred during quickstart: {e}")
finally:
    # Clean up the dummy file
    if os.path.exists(temp_csv_file):
        os.remove(temp_csv_file)

print("\nQuickstart complete. Explore de.SparkSession, de.PolarsDataFrame, de.Parquet, etc. for more advanced usage.")
