Hyparquet
Hyparquet is a pure JavaScript library for parsing Apache Parquet files directly in web browsers and Node.js. It retrieves data efficiently from cloud storage by using HTTP range requests, so Parquet files can be queried over the network without a server-side intermediary. The library has been dependency-free since 2023, which keeps it lightweight. Its current stable version is 1.25.6, and releases follow active development. Key differentiators are selective row and column filtering to minimize data fetches, support for all Parquet types, encodings, and compression codecs, and bundled TypeScript definitions. It is well suited to data engineering, data science, and machine learning applications where large datasets stored in Parquet format need to be accessed and processed client-side.
Common errors
- TypeError: require is not a function
  cause: Attempting to import hyparquet using CommonJS `require()` syntax in a Node.js environment or bundler that expects ES modules.
  fix: Change `const { ... } = require('hyparquet')` to `import { ... } from 'hyparquet'`. Ensure your Node.js project or bundler is configured for ES modules (e.g., `"type": "module"` in package.json, or a `.mjs` file extension).
- TypeError: Cannot read properties of undefined (reading 'slice')
  cause: The `file` argument passed to `parquetReadObjects` or `parquetMetadataAsync` is not a valid `AsyncBuffer` instance or a compatible object.
  fix: Create `file` with `asyncBufferFromUrl({ url })` for remote files or `asyncBufferFromFile('path/to/file.parquet')` for local Node.js files, or provide a custom object that implements the `AsyncBuffer` interface correctly.
- Uncaught (in promise) TypeError: x.map is not a function
  cause: The code maps over a value that is not an array. This can indicate a malformed Parquet file or schema parsing that returned an unexpected structure.
  fix: Verify the integrity and structure of your Parquet file. Check that `parquetSchema` is used correctly and that its output (e.g., `schema.children`) is what you expect to iterate over. Inspect the `metadata` and `schema` objects before mapping.
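The `slice` error above comes down to the shape of the `file` argument. The following is a minimal sketch, assuming the `AsyncBuffer` interface is an object with a numeric `byteLength` and a `slice(start, end)` method that returns an `ArrayBuffer` or a promise of one. The helper names `asyncBufferFromBytes` and `assertAsyncBufferLike` are hypothetical and are not part of hyparquet.

```javascript
// Hypothetical helper: wrap in-memory bytes (a Uint8Array) in an
// AsyncBuffer-shaped object. Assumes the shape { byteLength, slice(start, end) }.
function asyncBufferFromBytes(bytes) {
  // Copy only the bytes this view covers, in case it is a window on a larger buffer
  const buffer = bytes.buffer.slice(bytes.byteOffset, bytes.byteOffset + bytes.byteLength)
  return {
    byteLength: buffer.byteLength,
    // Async so local and remote sources can be treated the same way
    slice: async (start, end) => buffer.slice(start, end),
  }
}

// Hypothetical guard: fail early with a readable message instead of
// "Cannot read properties of undefined (reading 'slice')"
function assertAsyncBufferLike(file) {
  if (!file || typeof file.byteLength !== 'number' || typeof file.slice !== 'function') {
    throw new TypeError('file must expose byteLength and slice(start, end)')
  }
  return file
}
```

Calling `assertAsyncBufferLike(file)` before handing `file` to a reader also catches the common mistake of passing a URL string, which has a `slice` method but no `byteLength`.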
Warnings
- breaking Hyparquet is distributed exclusively as an ES module (ESM). Direct 'require()' calls for CommonJS environments are not supported.
- gotcha When reading remote Parquet files with `asyncBufferFromUrl`, efficient performance relies on the server supporting HTTP range requests. Without proper server support, the entire file might be downloaded.
- gotcha The `num_rows` property from Parquet metadata is returned as a `BigInt`. Mixing it with a `Number` in arithmetic throws a `TypeError`, and strict equality (`===`) against a `Number` is always false. Convert it with `Number(metadata.num_rows)` when the row count fits in a safe integer.
- gotcha To optimize data fetching for large remote files, always specify `columns`, `rowStart`, and `rowEnd` parameters in `parquetReadObjects`. Failing to do so will result in downloading and parsing the entire file.
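The `BigInt` row count and the `rowStart`/`rowEnd` advice combine naturally when paging through a large file. Below is a minimal sketch, assuming `rowEnd` is exclusive. The helper `pageBounds` is a hypothetical name, and the literal `25n` stands in for a real `metadata.num_rows` value.

```javascript
// Hypothetical helper: compute a rowStart/rowEnd window for one page.
// numRows may be a BigInt (as metadata.num_rows is) or a Number.
function pageBounds(numRows, pageSize, pageIndex) {
  // Number() conversion is only exact up to Number.MAX_SAFE_INTEGER
  const total = Number(numRows)
  if (!Number.isSafeInteger(total)) {
    throw new RangeError('row count exceeds the safe integer range')
  }
  // Clamp both ends so the last page never runs past the file
  const rowStart = Math.min(pageIndex * pageSize, total)
  const rowEnd = Math.min(rowStart + pageSize, total)
  return { rowStart, rowEnd }
}

// Stand-in for metadata.num_rows: 25 rows, pages of 10, third page
const bounds = pageBounds(25n, 10, 2)
console.log(bounds) // { rowStart: 20, rowEnd: 25 }
```

The result can be spread into the read options, for example `parquetReadObjects({ file, columns, ...pageBounds(metadata.num_rows, 100, 0) })`, so each call fetches only one page of the selected columns.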
Install
- npm install hyparquet
- yarn add hyparquet
- pnpm add hyparquet
Imports
- parquetReadObjects
  wrong: const { parquetReadObjects } = require('hyparquet')
  correct: import { parquetReadObjects } from 'hyparquet'
- asyncBufferFromUrl
  import { asyncBufferFromUrl } from 'hyparquet'
- parquetMetadataAsync
  wrong: import parquetMetadataAsync from 'hyparquet'
  correct: import { parquetMetadataAsync, parquetSchema } from 'hyparquet'
- AsyncBuffer
  import type { AsyncBuffer } from 'hyparquet'
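For code that must stay in CommonJS, the standard dynamic `import()` expression can load an ES module without `require()`. The following is a minimal sketch; the wrapper name `loadEsmModule` is hypothetical, and the usage comment assumes hyparquet is installed.

```javascript
// Hypothetical helper: load an ES module from CommonJS or ESM code.
// Dynamic import() returns a promise of the module namespace object.
async function loadEsmModule(specifier = 'hyparquet') {
  return import(specifier)
}

// Usage sketch inside an async function (requires hyparquet to be installed):
//   const { asyncBufferFromUrl, parquetReadObjects } = await loadEsmModule()
```

Some bundlers cannot analyze a variable specifier, so in bundled code writing `import('hyparquet')` with a literal string may be the safer choice.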
Quickstart
import { asyncBufferFromUrl, parquetReadObjects } from 'hyparquet'

async function fetchData() {
  const url = 'https://hyperparam-public.s3.amazonaws.com/bunnies.parquet'
  // Wrap the URL for asynchronous fetching with HTTP range requests
  const file = await asyncBufferFromUrl({ url })
  // Read objects, filtering by specific columns and rows for efficiency
  const data = await parquetReadObjects({
    file,
    columns: ['Breed Name', 'Lifespan'],
    rowStart: 10,
    rowEnd: 20,
  })
  console.log('Fetched data:', data)
}

fetchData().catch(console.error)
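Choosing `columns` requires knowing the column names first. The sketch below lists top-level names from a schema tree, assuming the `{ element: { name }, children: [...] }` shape that `parquetSchema(metadata)` is described as returning. The helper `topLevelColumnNames` is hypothetical, and `exampleSchema` is a hand-written stand-in, not real file metadata.

```javascript
// Hypothetical helper: list top-level column names from a schema tree.
// Assumes each node looks like { element: { name }, children: [...] }.
function topLevelColumnNames(schema) {
  // Guard against the "x.map is not a function" failure mode
  if (!schema || !Array.isArray(schema.children)) {
    throw new TypeError('schema.children is not an array; inspect the schema object first')
  }
  return schema.children.map((child) => child.element.name)
}

// Stand-in schema tree for illustration only
const exampleSchema = {
  element: { name: 'root' },
  children: [
    { element: { name: 'Breed Name' }, children: [] },
    { element: { name: 'Lifespan' }, children: [] },
  ],
}
console.log(topLevelColumnNames(exampleSchema)) // [ 'Breed Name', 'Lifespan' ]
```

With a real file, the input would come from `parquetSchema(await parquetMetadataAsync(file))`, and the returned names can be passed as the `columns` option of `parquetReadObjects`.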