{"id":6241,"library":"sklearn-crfsuite","title":"sklearn-crfsuite","description":"sklearn-crfsuite is a thin wrapper around the `python-crfsuite` library, providing an interface similar to scikit-learn. It enables the use of scikit-learn's model selection utilities (like cross-validation and hyperparameter optimization) with Conditional Random Field (CRF) models, and allows saving/loading models using joblib. The library is actively maintained, with its latest release (0.5.0) in June 2024.","status":"active","version":"0.5.0","language":"en","source_language":"en","source_url":"https://github.com/TeamHG-Memex/sklearn-crfsuite","tags":["machine-learning","crf","sequence-labeling","scikit-learn-compatible","nlp"],"install":[{"cmd":"pip install sklearn-crfsuite","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Core CRF engine. Version 0.9.7 or higher required for sklearn-crfsuite >= 0.4.0.","package":"python-crfsuite","optional":false},{"reason":"Required for using scikit-learn-compatible metrics and scorers, or for integration with scikit-learn pipelines. Optional for basic CRF fitting and prediction. Version 0.24.0 or higher required for sklearn-crfsuite >= 0.4.0.","package":"scikit-learn","optional":true},{"reason":"Required by sklearn-crfsuite >= 0.4.0, often used for displaying metrics.","package":"tabulate","optional":false},{"reason":"Commonly used in tutorials and examples for tokenization, POS tagging, and accessing corpora like CoNLL-2002/2003 for sequence labeling tasks.","package":"nltk","optional":true}],"imports":[{"symbol":"CRF","correct":"from sklearn_crfsuite import CRF"},{"symbol":"metrics","correct":"from sklearn_crfsuite import metrics"},{"symbol":"scorers","correct":"from sklearn_crfsuite import scorers"}],"quickstart":{"code":"import sklearn_crfsuite\nfrom sklearn_crfsuite import metrics\n\n# Dummy data for a simple sequence labeling task (e.g., POS tagging)\n# Each sentence is a list of (word, pos_tag)\n# Features are extracted for each word, labels are the expected tags\n\ndef word2features(sent, i):\n    word = sent[i][0]\n    postag = sent[i][1]\n\n    features = {\n        'bias': 1.0,\n        'word.lower()': word.lower(),\n        'word.isupper()': word.isupper(),\n        'word.istitle()': word.istitle(),\n        'word.isdigit()': word.isdigit(),\n        'postag': postag,\n        'postag[:2]': postag[:2],\n    }\n    if i > 0:\n        word1 = sent[i-1][0]\n        postag1 = sent[i-1][1]\n        features['-1:word.lower()'] = word1.lower()\n        features['-1:word.istitle()'] = word1.istitle()\n        features['-1:word.isupper()'] = word1.isupper()\n        features['-1:postag'] = postag1\n        features['-1:postag[:2]'] = postag1[:2]\n    else:\n        features['BOS'] = True  # Beginning of Sentence\n\n    if i < len(sent)-1:\n        word1 = sent[i+1][0]\n        postag1 = sent[i+1][1]\n        features['+1:word.lower()'] = word1.lower()\n        features['+1:word.istitle()'] = word1.istitle()\n        features['+1:word.isupper()'] = word1.isupper()\n        features['+1:postag'] = postag1\n        features['+1:postag[:2]'] = postag1[:2]\n    else:\n        features['EOS'] = True  # End of Sentence\n\n    return features\n\ndef sent2features(sent):\n    return [word2features(sent, i) for i in range(len(sent))]\n\ndef sent2labels(sent):\n    return [label for word, label in sent]\n\ntrain_sents = [\n    [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')],\n    [('I', 'PRP'), ('love', 'VBP'), ('Python', 'NNP')],\n    [('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('is', 'VBZ'), ('fun', 'JJ')],\n]\n\nX_train = [sent2features(s) for s in train_sents]\ny_train = [sent2labels(s) for s in train_sents]\n\n# Initialize and train the CRF model\ncrf = sklearn_crfsuite.CRF(\n    algorithm='lbfgs',\n    c1=0.1,  # L1 regularization coefficient\n    c2=0.1,  # L2 regularization coefficient\n    max_iterations=100,\n    all_possible_transitions=True\n)\n\ncrf.fit(X_train, y_train)\n\n# Make predictions on new data\ntest_sents = [\n    [('A', 'DT'), ('fast', 'JJ'), ('red', 'JJ'), ('car', 'NN'), ('drives', 'VBZ'), ('by', 'IN')],\n]\nX_test = [sent2features(s) for s in test_sents]\ny_pred = crf.predict(X_test)\n\nprint(\"Predicted labels for test sentence:\")\nfor sent_idx, labels in enumerate(y_pred):\n    print(f\"Sentence {sent_idx+1}: {labels}\")\n\n# Example of using metrics (requires scikit-learn)\ny_true = [sent2labels(s) for s in test_sents]  # In a real scenario, this would be actual ground truth\nif y_true:\n    print(\"\\nClassification Report:\")\n    print(metrics.flat_classification_report(y_true, y_pred))\n","lang":"python","description":"This quickstart demonstrates how to prepare data, extract basic features, train a Conditional Random Field (CRF) model using `sklearn_crfsuite.CRF`, and make predictions. It also shows how to leverage `sklearn_crfsuite.metrics` for evaluating model performance. The example uses a small, self-contained dummy dataset to illustrate a part-of-speech (POS) tagging task."},"warnings":[{"fix":"Update code to expect and handle NumPy array outputs from `predict()` and `predict_marginals()`.","message":"In version 0.5.0, the `CRF.predict()` and `CRF.predict_marginals()` methods now return a NumPy array instead of a list of lists, aligning with expectations from newer scikit-learn versions.","severity":"breaking","affected_versions":">=0.5.0"},{"fix":"Ensure your Python environment is 3.8+ and update `python-crfsuite` and `scikit-learn` to their specified minimum versions or newer. Consider pinning dependencies in your `requirements.txt`.","message":"Version 0.4.0 dropped official support for Python 3.7 and lower, and explicitly added support for Python 3.8 and higher. It also increased minimum versions for dependencies like `python-crfsuite` (0.9.7) and `scikit-learn` (0.24.0).","severity":"breaking","affected_versions":">=0.4.0"},{"fix":"Update any code referencing `crf.tagger` to `crf.tagger_`. If relying on exceptions to detect an untrained state, adapt logic to check for `None` instead.","message":"In version 0.2, the `crf.tagger` attribute was renamed to `crf.tagger_`. Additionally, accessing `crf.tagger_` before training no longer raises an exception but returns `None`.","severity":"breaking","affected_versions":">=0.2"},{"fix":"Each component of an array feature (like a word embedding vector) must be flattened and passed as a separate dictionary feature (e.g., `{'v0': value_0, 'v1': value_1, ...}`). This can significantly increase the number of features and training time.","message":"`python-crfsuite` and `sklearn-crfsuite` do not natively support array-like features (e.g., word embeddings) directly. Attempting to pass a NumPy array as a single feature will result in errors.","severity":"gotcha","affected_versions":"All"},{"fix":"Always apply the exact same feature extraction logic and transformations to both your training and inference data. Consider encapsulating feature extraction in a consistent pipeline or utility function.","message":"As with general scikit-learn practices, inconsistent preprocessing between training and test data (e.g., differing feature extraction functions) can lead to unexpected model performance.","severity":"gotcha","affected_versions":"All"}],"env_vars":null,"last_verified":"2026-04-14T00:00:00.000Z","next_check":"2026-07-13T00:00:00.000Z"}