{"id":2555,"library":"keras-preprocessing","title":"Keras Preprocessing","description":"Keras Preprocessing is a standalone Python library that provided utilities for data preprocessing and augmentation for deep learning models, specifically for image, text, and sequence data. While its last official release (1.1.2) was in 2020, the functionality it provided has since been integrated directly into `tf.keras.preprocessing` and superseded by native Keras 3 preprocessing layers. The GitHub repository for this standalone package is officially marked as deprecated.","status":"deprecated","version":"1.1.2","language":"en","source_language":"en","source_url":"https://github.com/keras-team/keras-preprocessing","tags":["deep learning","keras","tensorflow","preprocessing","data augmentation","image processing","text processing","sequence processing","deprecated"],"install":[{"cmd":"pip install keras-preprocessing","lang":"bash","label":"PyPI"}],"dependencies":[{"reason":"Required for numerical array operations, especially with image and sequence data.","package":"numpy"},{"reason":"Required for image processing utilities like ImageDataGenerator.","package":"Pillow","optional":true},{"reason":"Required for flow_from_dataframe method in ImageDataGenerator.","package":"pandas","optional":true}],"imports":[{"note":"For newer Keras/TensorFlow, prefer `tf.keras.preprocessing.image.ImageDataGenerator` or `tf.keras.utils.image_dataset_from_directory` with Keras 3 preprocessing layers. The standalone `keras-preprocessing` uses `keras_preprocessing` prefix.","wrong":"from keras.preprocessing.image import ImageDataGenerator","symbol":"ImageDataGenerator","correct":"from keras_preprocessing.image import ImageDataGenerator"},{"note":"For newer Keras/TensorFlow, prefer `tf.keras.preprocessing.text.Tokenizer` or `keras.layers.TextVectorization`. 
The standalone `keras-preprocessing` uses `keras_preprocessing` prefix.","wrong":"from keras.preprocessing.text import Tokenizer","symbol":"Tokenizer","correct":"from keras_preprocessing.text import Tokenizer"},{"note":"For newer Keras/TensorFlow, prefer `tf.keras.utils.pad_sequences`. The standalone `keras-preprocessing` uses `keras_preprocessing` prefix.","wrong":"from keras.preprocessing.sequence import pad_sequences","symbol":"pad_sequences","correct":"from keras_preprocessing.sequence import pad_sequences"}],"quickstart":{"code":"import numpy as np\nimport pandas as pd\nimport os\n\n# Create dummy image files and a dataframe\nif not os.path.exists('data/img_dir/cat'):\n    os.makedirs('data/img_dir/cat')\nif not os.path.exists('data/img_dir/dog'):\n    os.makedirs('data/img_dir/dog')\n\n# Create dummy image files\nfrom PIL import Image\nimg = Image.new('RGB', (64, 64), color = 'red')\nimg.save('data/img_dir/cat/cat1.jpg')\nimg = Image.new('RGB', (64, 64), color = 'blue')\nimg.save('data/img_dir/dog/dog1.jpg')\n\ndf = pd.DataFrame({\n    'filename': ['cat/cat1.jpg', 'dog/dog1.jpg'],\n    'class': ['cat', 'dog']\n})\n\nfrom keras_preprocessing.image import ImageDataGenerator\nfrom keras_preprocessing.text import Tokenizer\nfrom keras_preprocessing.sequence import pad_sequences\n\n# Image Preprocessing and Augmentation\ndatagen = ImageDataGenerator(\n    rescale=1./255,\n    rotation_range=20,\n    width_shift_range=0.2,\n    height_shift_range=0.2,\n    horizontal_flip=True\n)\n\n# Using flow_from_dataframe (requires pandas)\nimage_generator = datagen.flow_from_dataframe(\n    dataframe=df,\n    directory='data/img_dir',\n    x_col='filename',\n    y_col='class',\n    target_size=(64, 64),\n    batch_size=1,\n    class_mode='categorical'\n)\n\nprint(f\"First batch of images shape: {next(image_generator)[0].shape}\")\n\n# Text Preprocessing\nsentences = [\n    \"This is a sample sentence\",\n    \"Another example sentence here\",\n    \"Keras preprocessing is 
useful\"\n]\n\ntokenizer = Tokenizer(num_words=10, oov_token=\"<OOV>\")\ntokenizer.fit_on_texts(sentences)\n\nsequences = tokenizer.texts_to_sequences(sentences)\nprint(f\"Original sequences: {sequences}\")\n\n# Sequence Padding\npadded_sequences = pad_sequences(sequences, maxlen=5, padding='post')\nprint(f\"Padded sequences: {padded_sequences}\")\n\n# Cleanup dummy directories (optional)\nimport shutil\nshutil.rmtree('data', ignore_errors=True)","lang":"python","description":"This quickstart demonstrates key functionalities: image data augmentation using `ImageDataGenerator.flow_from_dataframe` (requiring a Pandas DataFrame and image directory), and text tokenization with `Tokenizer` followed by sequence padding using `pad_sequences`. It includes creation of dummy files and a DataFrame for a runnable example."},"warnings":[{"fix":"Migrate imports from `keras_preprocessing.*` to `tensorflow.keras.preprocessing.*` or, for modern workflows, utilize Keras 3's native preprocessing layers (`keras.layers.*`) and `tf.data` utilities.","message":"The `keras-preprocessing` PyPI package and its GitHub repository are deprecated. All core functionalities have been moved to `tf.keras.preprocessing` within the TensorFlow package, or replaced by Keras 3 preprocessing layers (e.g., `keras.layers.TextVectorization`, `tf.keras.utils.image_dataset_from_directory`). It is highly recommended to migrate to `tf.keras` imports or Keras 3 layers for active development.","severity":"breaking","affected_versions":"All versions, especially when used with TensorFlow 2.x and Keras 3."},{"fix":"Always use `from keras_preprocessing.<module> import <Symbol>` for this standalone package. If using TensorFlow's integrated Keras, use `from tensorflow.keras.preprocessing.<module> import <Symbol>`.","message":"Import paths frequently cause `ImportError`. 
Users often mistakenly try to import from `keras.preprocessing` or `tensorflow.keras.preprocessing` when targeting the standalone `keras-preprocessing` package, or vice versa. The correct import for the standalone package is `keras_preprocessing.*` (note the underscore).","severity":"gotcha","affected_versions":"All versions, especially during migration or mixed environment setups."},{"fix":"Be aware that `tokenizer.word_index` may be larger than `num_words`. `num_words` effectively limits the vocabulary size during the actual sequence generation, dropping less frequent words or mapping them to an OOV token.","message":"The `num_words` argument in `Tokenizer` acts as a vocabulary cutoff during the `texts_to_sequences` conversion, not when `fit_on_texts` is called. `tokenizer.word_index` still contains every discovered word, but only words whose index is less than `num_words` are kept in the sequences. When `oov_token` is set, it occupies index 1, so one fewer real word fits under the cutoff, and words at or above the cutoff are mapped to the OOV index instead of being dropped.","severity":"gotcha","affected_versions":"All versions of `keras_preprocessing.text.Tokenizer`."},{"fix":"Update `class_mode` usage: replace `\"other\"` with `\"raw\"` or `\"multi_output\"` as appropriate. Adjust code for `drop_duplicates` removal and consider `weight_col` if needed.","message":"In version 1.1.0, the `DataFrameIterator` (used by `ImageDataGenerator.flow_from_dataframe`) had its `class_mode` argument modified. The value `\"other\"` was removed, and new values `\"raw\"` and `\"multi_output\"` were added to support multi-label or regression tasks directly from dataframes. Additionally, the `drop_duplicates` argument was removed, and `weight_col` was added (see the 1.1.0 release notes).","severity":"breaking","affected_versions":"1.1.0 and later."},{"fix":"Ensure the `x_col` in your dataframe for `flow_from_dataframe` contains full filenames including extensions (e.g., 'image.jpg') instead of relying on `has_ext`. 
Avoid using the `sort` argument.","message":"In version 1.0.6, the `has_ext` argument in `flow_from_dataframe` and the `sort` argument in `DataFrameIterator` were deprecated. Relying on these arguments is discouraged (see the 1.0.6 release notes).","severity":"deprecated","affected_versions":"1.0.6 and later."}],"env_vars":null,"last_verified":"2026-04-10T00:00:00.000Z","next_check":"2026-07-09T00:00:00.000Z"}