Itemloaders

1.4.0 · active · verified Fri Apr 10

Itemloaders is the base library behind Scrapy's ItemLoader. It collects values extracted from various sources (XPath, CSS, regular expressions, JMESPath), passes them through a chain of input and output processors, and uses the results to populate Scrapy Items. The current version is 1.4.0, and the library is actively released, with new versions regularly adding and dropping supported Python versions.

Warnings

breaking Python version compatibility has changed frequently, dropping support for older versions. For example, v1.4.0 dropped Python 3.8-3.9, v1.2.0 dropped Python 3.7, and v1.1.0 dropped Python 3.6.
Fix: Ensure your project's Python version meets the minimum requirements for the `itemloaders` version you are using. Check the release notes for specific version requirements before upgrading.
gotcha Version 1.3.0 introduced a regression where nested loaders would raise an error when encountering empty matches.
Fix: Upgrade to version 1.3.1 or newer, which includes a fix for this issue.
gotcha In version 1.0.5, passing a compiled regular expression pattern (e.g., `re.compile('...')`) to the `re` parameter of methods like `ItemLoader.add_xpath` or `add_css` could raise an exception, because the value was passed straight through to `lxml`.
Fix: Upgrade to version 1.0.6 or newer, which fixed this regression. If constrained to 1.0.5, pass `re` as a string pattern instead of a compiled one.
gotcha JMESPath support, introduced in v1.1.0 with methods like `ItemLoader.add_jmes`, requires `parsel` version 1.8.1 or newer. While `itemloaders` itself might declare a lower minimum `parsel` dependency, using JMESPath features necessitates the newer `parsel` version.
Fix: If using JMESPath features, ensure your `parsel` dependency is explicitly set to `parsel>=1.8.1`.

Install

`pip install itemloaders` (installs the stable version)

Imports

ItemLoader
```
from itemloaders import ItemLoader
```

TakeFirst
```
from itemloaders.processors import TakeFirst
```

MapCompose
```
from itemloaders.processors import MapCompose
```

Quickstart

This quickstart defines a simple item, subclasses `itemloaders.ItemLoader`, and populates the item's fields from an HTML string. Values are extracted through a `parsel.Selector` with CSS and XPath selectors, then cleaned by the `MapCompose` and `TakeFirst` processors.

```
from dataclasses import dataclass
from typing import Optional

from itemloaders import ItemLoader
from itemloaders.processors import MapCompose, TakeFirst
from parsel import Selector


# A minimal item. The loader needs an item type supported by itemadapter
# (such as dict, dataclass, attrs or scrapy.Item); a plain class is rejected.
@dataclass
class MyItem:
    name: Optional[str] = None
    price: Optional[float] = None
    description: Optional[str] = None


# Define an ItemLoader for MyItem
class ProductLoader(ItemLoader):
    default_item_class = MyItem
    # Output processor for every field: keep the first non-empty value
    default_output_processor = TakeFirst()

    # Input processors run on each extracted value as it is added.
    # price uses an input processor so that TakeFirst still yields a single
    # float; MapCompose as an output processor would return a list.
    name_in = MapCompose(str.strip, str.title)
    price_in = MapCompose(lambda x: x.replace('$', ''), float)
    description_in = MapCompose(str.strip)


# Example HTML fragment
html_data = '''
<div class="product">
    <h1 class="name">  product a  </h1>
    <span class="price">$12.99</span>
    <div class="description">A really good product.</div>
</div>
'''

# Using parsel.Selector for data extraction
selector = Selector(text=html_data)

# Instantiate the loader and populate the item
loader = ProductLoader(selector=selector)
loader.add_css('name', '.name::text')
loader.add_xpath('price', '//span[@class="price"]/text()')
# A field can collect values from several sources; with TakeFirst,
# the value added first is the one that ends up in the item
loader.add_value('description', 'Short description from custom source.')
loader.add_css('description', '.description::text')

# Load the item
item = loader.load_item()

print(item)
# Expected output:
# MyItem(name='Product A', price=12.99, description='Short description from custom source.')
```
