Keras Preprocessing
Keras Preprocessing is a standalone Python library that provides utilities for data preprocessing and augmentation for deep learning models, specifically for image, text, and sequence data. Its last official release (1.1.2) was in 2020; the functionality it provided has since been integrated directly into `tf.keras.preprocessing` and superseded by native Keras 3 preprocessing layers. The GitHub repository for the standalone package is officially marked as deprecated.
Warnings
- breaking The `keras-preprocessing` PyPI package and its GitHub repository are deprecated. All core functionalities have been moved to `tf.keras.preprocessing` within the TensorFlow package, or replaced by Keras 3 preprocessing layers (e.g., `keras.layers.TextVectorization`, `tf.keras.utils.image_dataset_from_directory`). It is highly recommended to migrate to `tf.keras` imports or Keras 3 layers for active development.
- gotcha Import paths frequently cause `ImportError`. Users often mistakenly try to import from `keras.preprocessing` or `tensorflow.keras.preprocessing` when targeting the standalone `keras-preprocessing` package, or vice-versa. The correct import for the standalone package is `keras_preprocessing.*` (note the underscore).
- gotcha The `num_words` argument in `Tokenizer` acts as a vocabulary cutoff during the `texts_to_sequences` conversion, not when `fit_on_texts` is called. `tokenizer.word_index` will still contain all discovered words, but only words whose index is strictly less than `num_words` appear in the output sequences; out-of-range words are dropped, or mapped to the OOV index when `oov_token` is set (the OOV token itself occupies index 1).
- breaking In version 1.1.0, the `class_mode` argument of `DataFrameIterator` (used by `ImageDataGenerator.flow_from_dataframe`) changed: the value `"other"` was removed, and new values `"raw"` and `"multi_output"` were added to support multi-label or regression tasks directly from dataframes. Additionally, the `drop_duplicates` argument was removed, and `weight_col` was added. (See the 1.1.0 release notes.)
- deprecated In version 1.0.6, the `has_ext` argument in `flow_from_dataframe` and the `sort` argument in `DataFrameIterator` were deprecated; avoid relying on them. (See the 1.0.6 release notes.)
Install
- pip install keras-preprocessing
Imports
- ImageDataGenerator
from keras_preprocessing.image import ImageDataGenerator
- Tokenizer
from keras_preprocessing.text import Tokenizer
- pad_sequences
from keras_preprocessing.sequence import pad_sequences
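Since the standalone package is deprecated, the same classes are also shipped inside TensorFlow 2.x (through 2.15, before `tf.keras` switched to Keras 3) and are drop-in replacements. A hedged migration sketch with a fallback:

```python
# Prefer the copy bundled with TensorFlow 2.x (or Keras 3 preprocessing
# layers) over the standalone keras_preprocessing package.
try:
    from tensorflow.keras.preprocessing.image import ImageDataGenerator
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences
except ImportError:
    # Fall back to the standalone package if TensorFlow is unavailable.
    from keras_preprocessing.image import ImageDataGenerator
    from keras_preprocessing.text import Tokenizer
    from keras_preprocessing.sequence import pad_sequences
```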
Quickstart
import numpy as np
import pandas as pd
import os
# Create dummy image files and a dataframe
os.makedirs('data/img_dir/cat', exist_ok=True)
os.makedirs('data/img_dir/dog', exist_ok=True)
# Create dummy image files
from PIL import Image
img = Image.new('RGB', (64, 64), color='red')
img.save('data/img_dir/cat/cat1.jpg')
img = Image.new('RGB', (64, 64), color='blue')
img.save('data/img_dir/dog/dog1.jpg')
df = pd.DataFrame({
    'filename': ['cat/cat1.jpg', 'dog/dog1.jpg'],
    'class': ['cat', 'dog']
})
from keras_preprocessing.image import ImageDataGenerator
from keras_preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences
# Image Preprocessing and Augmentation
datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True
)
# Using flow_from_dataframe (requires pandas)
image_generator = datagen.flow_from_dataframe(
    dataframe=df,
    directory='data/img_dir',
    x_col='filename',
    y_col='class',
    target_size=(64, 64),
    batch_size=1,
    class_mode='categorical'
)
print(f"First batch of images shape: {next(image_generator)[0].shape}")
# Text Preprocessing
sentences = [
"This is a sample sentence",
"Another example sentence here",
"Keras preprocessing is useful"
]
tokenizer = Tokenizer(num_words=10, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)
print(f"Original sequences: {sequences}")
# Sequence Padding
padded_sequences = pad_sequences(sequences, maxlen=5, padding='post')
print(f"Padded sequences: {padded_sequences}")
# Cleanup dummy directories (optional)
import shutil
shutil.rmtree('data', ignore_errors=True)