StringZilla
StringZilla is a Python library that significantly accelerates string operations like search, hashing, sorting, and processing, leveraging SIMD (Single Instruction, Multiple Data) and SWAR (SIMD Within A Register) for performance. It's designed to handle large textual datasets (100+ GB) efficiently, offering a `Str` class similar to Python's built-in `str` and a `File` class for memory-mapping files. Currently at version 4.6.0, it sees frequent updates with multiple patch and minor releases per month.
Warnings
- gotcha The base `stringzilla` package provides only serial algorithms, so installing it alone does not give parallel performance. Multi-CPU backends require `stringzillas-cpus`, and Nvidia GPU support requires `stringzillas-cuda`. These are distinct packages and must be installed separately.
- gotcha Optimal performance and certain advanced features (e.g., specific case-folding and case-insensitive search paths) heavily rely on modern CPU extensions like AVX-512, Arm Neon, or SVE. Running on older hardware or virtualized environments without these features may result in lower performance than expected or fallback to less optimized scalar implementations.
- gotcha The `stringzillas` (parallel) components are explicitly noted as being in 'beta' and 'under active development, and are likely to break in subsequent releases'. Users of these parallel backends should expect breaking API changes between minor versions.
- gotcha StringZilla aims for full Unicode 17.0 compliance, particularly for case-folding and case-insensitive searches. Ensure that your input data is correctly encoded in UTF-8 to leverage these features accurately. Incorrect encodings can lead to unexpected search results or behavior.
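The UTF-8 requirement in the last warning can be checked before any data reaches StringZilla. The following is a minimal sketch using only the Python standard library; the helper name `decode_utf8_strict` is hypothetical and is not part of StringZilla.

```python
def decode_utf8_strict(raw: bytes) -> str:
    """Decode bytes as UTF-8, raising ValueError with the failing offset on bad input."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"input is not valid UTF-8 at byte offset {exc.start}") from exc


good = "Straße".encode("utf-8")
bad = "Straße".encode("latin-1")  # a lone 0xDF byte is not a valid UTF-8 sequence

print(decode_utf8_strict(good))
try:
    decode_utf8_strict(bad)
except ValueError as err:
    print(f"rejected: {err}")
```

Rejecting mis-encoded input up front is one option; re-encoding from a known source encoding is another, depending on where the data comes from.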
Install
- pip install stringzilla
- pip install stringzillas-cpus
- pip install stringzillas-cuda
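Because the serial and parallel packages are installed separately, it can help to confirm which ones are actually present in the current environment. The sketch below uses only `importlib` from the standard library. It assumes the import names are `stringzilla` for the base package and `stringzillas` for the parallel backends, and that the base module exposes a `__capabilities__` attribute; both are assumptions to verify against the installed version.

```python
import importlib
import importlib.util


def installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Assumed import names: `stringzilla` (base) and `stringzillas` (parallel backends).
status = {name: installed(name) for name in ("stringzilla", "stringzillas")}
for name, present in status.items():
    print(f"{name}: {'installed' if present else 'missing'}")

if status["stringzilla"]:
    sz = importlib.import_module("stringzilla")
    # `__capabilities__` is assumed here; fall back to "unknown" if it is absent.
    print("capabilities:", getattr(sz, "__capabilities__", "unknown"))
```

If `stringzillas` is reported as missing, only the serial algorithms from the base package are available.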
Imports
- Str
from stringzilla import Str
- File
from stringzilla import File
- Strs
from stringzilla import Strs
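As a rough illustration of how these classes relate, the sketch below splits a `Str` into fields. It assumes that `Str.split` returns a `Strs` collection supporting `len()`; if `stringzilla` is not installed, it falls back to the built-in `str` so the snippet still runs. The `BACKEND` variable is purely illustrative.

```python
try:
    from stringzilla import Str  # accelerated implementation
    BACKEND = "stringzilla"
except ImportError:
    Str = str  # fallback so the sketch still runs without the package
    BACKEND = "builtin"

text = Str("alpha,beta,gamma")
# Assumed to be a `Strs` collection under StringZilla, a plain list otherwise.
parts = text.split(",")

print(f"backend: {BACKEND}")
print(f"fields: {len(parts)}")
print(f"offset of 'beta': {text.find('beta')}")
```

The fallback keeps calling code portable, at the cost of losing the SIMD-accelerated paths when the package is absent.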
Quickstart
import os
from stringzilla import Str, File

# Using Str for in-memory strings
text_str = Str('hello stringzilla and the world!')
print(f"Length of text_str: {len(text_str)}")
print(f"Does 'stringzilla' exist? {'stringzilla' in text_str}")
print(f"Index of 'world': {text_str.find('world')}")

# Example with File: write a small dummy file first, then memory-map it
with open('example.txt', 'w') as f:
    f.write('This is a test file for StringZilla.\n')
    f.write('It demonstrates memory-mapped file usage.')

# Wrap the memory-mapped File in Str to get the string-like interface
file_str = Str(File('example.txt'))
print(f"Length of file_str: {len(file_str)}")
print(f"Does 'test file' exist? {'test file' in file_str}")
print(f"Lines in file_str: {len(file_str.splitlines())}")

# Drop the reference so the mapping can be released, then clean up the dummy file
del file_str
os.remove('example.txt')