StringZilla

4.6.0 · active · verified Fri Apr 10

StringZilla is a Python library that accelerates string operations such as search, hashing, and sorting by using SIMD (Single Instruction, Multiple Data) and SWAR (SIMD Within A Register) techniques. It is designed to handle large textual datasets (100+ GB) efficiently, offering a `Str` class similar to Python's built-in `str` and a `File` class for memory-mapping files. As of version 4.6.0, it is updated frequently, with multiple patch and minor releases per month.
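To give a feel for what SWAR means, here is a minimal pure-Python sketch of the classic "find a byte inside a 64-bit word" bit trick. It is an illustration only, not StringZilla's actual implementation; the function name `swar_find_byte` and the constants are hypothetical, and real SWAR code runs in native registers rather than on Python integers.

```python
# Illustrative SWAR sketch: test all 8 bytes of a 64-bit word at once.
# NOT StringZilla's code; names and constants here are assumptions.
LOW_BITS = 0x0101010101010101    # 0x01 repeated in every byte
LOW_7 = 0x7F7F7F7F7F7F7F7F       # 0x7F repeated in every byte
HIGH_BITS = 0x8080808080808080   # 0x80 repeated in every byte

def swar_find_byte(chunk: bytes, needle: int) -> int:
    """Return the index of the first `needle` byte in an 8-byte chunk, or -1."""
    if len(chunk) != 8 or not 0 <= needle <= 255:
        raise ValueError("expected an 8-byte chunk and a byte value 0..255")
    word = int.from_bytes(chunk, "little")
    # Broadcast the needle to every byte; matching bytes become 0x00 after XOR.
    x = word ^ (LOW_BITS * needle)
    # Set the high bit of every NON-zero byte (no carries cross byte lanes).
    nonzero = ((x & LOW_7) + LOW_7) | x
    # Keep only the high bits of bytes that are exactly zero, i.e. matches.
    matches = ~nonzero & HIGH_BITS
    if matches == 0:
        return -1
    lowest = matches & -matches           # isolate the lowest set bit
    return (lowest.bit_length() - 1) // 8  # convert bit position to byte index

print(swar_find_byte(b"abcdefgh", ord("d")))
```

The point of the technique is that one XOR, one add, and a few bitwise operations replace eight separate byte comparisons; SIMD instructions apply the same idea to wider registers.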

Warnings

Install

Imports

Quickstart

Demonstrates basic usage of `Str` for in-memory string operations and `File` for memory-mapped file handling, including length, substring checks, and finding substrings.

from stringzilla import Str, File

# Using Str for in-memory strings
text_str = Str('hello stringzilla and the world!')
print(f"Length of text_str: {len(text_str)}")
print(f"Does 'stringzilla' exist? {'stringzilla' in text_str}")
print(f"Index of 'world': {text_str.find('world')}")

# Example with File: write a small sample file, then memory-map it
import os
with open('example.txt', 'w') as f:
    f.write('This is a test file for StringZilla.\n')
    f.write('It demonstrates memory-mapped file usage.')

# Wrap the memory-mapped File in Str to get the string interface
file_str = Str(File('example.txt'))
print(f"Length of file_str: {len(file_str)}")
print(f"Does 'test file' exist? {'test file' in file_str}")
print(f"Lines in file_str: {len(file_str.splitlines())}")

# Release the mapping before deleting the sample file
del file_str
os.remove('example.txt')
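Because `Str` is described as similar to the built-in `str`, code written against the shared operations used above (`len`, `in`, `find`) can in principle accept either type. The sketch below assumes exactly that; the helper name `summarize` and the fallback to `str` when StringZilla is not installed are illustrative choices, not part of the library.

```python
# Minimal sketch: a helper that relies only on operations shared by
# the built-in str and stringzilla.Str (len, `in`, find).
try:
    from stringzilla import Str   # accelerated backend, if installed
except ImportError:
    Str = str                     # assumption: fall back to the built-in type

def summarize(text, needle):
    """Return (length, needle present?, first index of needle)."""
    return len(text), needle in text, text.find(needle)

text = Str('hello stringzilla and the world!')
print(summarize(text, 'world'))
```

Keeping helpers limited to this common subset makes it easy to switch between the two types while benchmarking.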
