Keras Preprocessing

1.1.2 · deprecated · verified Fri Apr 10

Keras Preprocessing is a standalone Python library of data preprocessing and augmentation utilities for deep learning models, covering image, text, and sequence data. Its last official release (1.1.2) dates from 2020; its functionality has since been integrated directly into `tf.keras.preprocessing` and superseded by native Keras 3 preprocessing layers. The GitHub repository for the standalone package is officially marked as deprecated.

Warnings

Install

Imports

Quickstart

This quickstart demonstrates two core features: image data augmentation with `ImageDataGenerator.flow_from_dataframe` (which needs a Pandas DataFrame and an image directory), and text tokenization with `Tokenizer` followed by sequence padding with `pad_sequences`. It first creates dummy image files (written with Pillow) and a DataFrame, so the example runs as-is.

import os

import pandas as pd
from PIL import Image

# Create one directory per class for the dummy images
os.makedirs('data/img_dir/cat', exist_ok=True)
os.makedirs('data/img_dir/dog', exist_ok=True)

# Write one solid-color 64x64 image per class
Image.new('RGB', (64, 64), color='red').save('data/img_dir/cat/cat1.jpg')
Image.new('RGB', (64, 64), color='blue').save('data/img_dir/dog/dog1.jpg')

# Filenames are relative to the `directory` passed to flow_from_dataframe
df = pd.DataFrame({
    'filename': ['cat/cat1.jpg', 'dog/dog1.jpg'],
    'class': ['cat', 'dog']
})

from keras_preprocessing.image import ImageDataGenerator
from keras_preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences

# Image Preprocessing and Augmentation
datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True
)

# Using flow_from_dataframe (requires pandas)
image_generator = datagen.flow_from_dataframe(
    dataframe=df,
    directory='data/img_dir',
    x_col='filename',
    y_col='class',
    target_size=(64, 64),
    batch_size=1,
    class_mode='categorical'
)

print(f"First batch of images shape: {next(image_generator)[0].shape}")

# Text Preprocessing
sentences = [
    "This is a sample sentence",
    "Another example sentence here",
    "Keras preprocessing is useful"
]

# num_words=10 keeps only word indices below 10; rarer words map to the OOV token
tokenizer = Tokenizer(num_words=10, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)

sequences = tokenizer.texts_to_sequences(sentences)
print(f"Token sequences: {sequences}")

# Sequence Padding: padding='post' appends zeros up to maxlen
padded_sequences = pad_sequences(sequences, maxlen=5, padding='post')
print(f"Padded sequences: {padded_sequences}")

# Cleanup dummy directories (optional)
import shutil
shutil.rmtree('data', ignore_errors=True)
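
To make the text half of the quickstart easier to follow, here is a minimal sketch of the indexing and padding rules in plain Python and NumPy. It is a simplified illustration, not the library's implementation: the helper names `build_word_index`, `to_sequences`, and `pad_post` are hypothetical, punctuation filtering is left out, and over-long rows are cut from the front on the assumption that `pad_sequences` is used with its default `truncating='pre'`.

```python
from collections import Counter

import numpy as np

OOV_INDEX = 1  # the OOV token takes index 1; index 0 is left for padding


def build_word_index(texts):
    """Rank words by descending frequency, starting at index 2 (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # sorted() is stable, so equally frequent words keep first-seen order.
    ranked = sorted(counts, key=lambda word: -counts[word])
    return {word: i for i, word in enumerate(ranked, start=OOV_INDEX + 1)}


def to_sequences(texts, word_index, num_words):
    """Map words to indices; unknown or too-rare words become OOV_INDEX."""
    sequences = []
    for text in texts:
        ids = [word_index.get(word) for word in text.lower().split()]
        sequences.append(
            [i if i is not None and i < num_words else OOV_INDEX for i in ids]
        )
    return sequences


def pad_post(sequences, maxlen, value=0):
    """Right-pad with `value`; over-long rows keep their last `maxlen` items."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all padding
        kept = seq[-maxlen:]
        padded[row, :len(kept)] = kept
    return padded


sentences = [
    "This is a sample sentence",
    "Another example sentence here",
    "Keras preprocessing is useful",
]

word_index = build_word_index(sentences)
sequences = to_sequences(sentences, word_index, num_words=10)
padded = pad_post(sequences, maxlen=5)

print(sequences)  # [[4, 2, 5, 6, 3], [7, 8, 3, 9], [1, 1, 2, 1]]
print(padded)
```

In this sketch "keras", "preprocessing", and "useful" receive indices 10 to 12, so with `num_words=10` they collapse to the OOV index 1, which is why the third sequence is mostly ones.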
