Itemloaders
Itemloaders is the base library behind Scrapy's ItemLoader, providing a flexible way to parse data and populate Scrapy Items. It extracts data from several kinds of sources (XPath, CSS, regular expressions, JMESPath) and passes it through a chain of input and output processors. The current version is 1.4.0. Releases are frequent and regularly drop support for older Python versions.
Warnings
- breaking Python version compatibility has changed frequently, dropping support for older versions. For example, v1.4.0 dropped Python 3.8-3.9, v1.2.0 dropped Python 3.7, and v1.1.0 dropped Python 3.6.
- gotcha Version 1.3.0 introduced a regression where nested loaders would raise an error when encountering empty matches.
- gotcha In version 1.0.5, passing a compiled regular expression pattern (e.g., `re.compile('...')`) to the `re` parameter of methods like `ItemLoader.add_xpath` or `add_css` could cause an exception due to it being passed directly to `lxml`.
- gotcha JMESPath support, introduced in v1.1.0 with methods like `ItemLoader.add_jmes`, requires `parsel` version 1.8.1 or newer. While `itemloaders` itself might declare a lower minimum `parsel` dependency, using JMESPath features necessitates the newer `parsel` version.
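The JMESPath requirement above can be checked before any JMESPath-based loader method is called. The following is a minimal sketch using only the standard library; the helper names `version_tuple` and `jmespath_available` are hypothetical and not part of `itemloaders` or `parsel`.

```python
from importlib.metadata import PackageNotFoundError, version


def version_tuple(raw):
    """Turn '1.8.1' (or '1.9.0rc1') into a comparable tuple of ints."""
    parts = []
    for piece in raw.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as 'rc1'
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def jmespath_available(minimum=(1, 8, 1)):
    """Return True if the installed parsel meets the minimum version."""
    try:
        return version_tuple(version("parsel")) >= minimum
    except PackageNotFoundError:
        # parsel is not installed at all
        return False


if jmespath_available():
    print("parsel is new enough for JMESPath-based loader methods")
else:
    print("upgrade parsel to 1.8.1 or newer before calling add_jmes")
```

A check like this can run once at start-up so that a too-old `parsel` is reported with a clear message instead of failing inside the loader.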
Install
- pip install itemloaders
Imports
- ItemLoader
from itemloaders import ItemLoader
- TakeFirst
from itemloaders.processors import TakeFirst
- MapCompose
from itemloaders.processors import MapCompose
Quickstart
from dataclasses import dataclass
from typing import Optional

from itemloaders import ItemLoader
from itemloaders.processors import MapCompose, TakeFirst
from parsel import Selector

# A minimal item type. The loader needs an item type that itemadapter supports,
# such as a dict, dataclass, attrs class, or scrapy.Item; a plain class will not work.
@dataclass
class MyItem:
    # Fields need defaults so the loader can create the item with no arguments
    name: Optional[str] = None
    price: Optional[float] = None
    description: Optional[str] = None

# Define an ItemLoader for MyItem
class ProductLoader(ItemLoader):
    default_item_class = MyItem
    default_output_processor = TakeFirst()
    name_in = MapCompose(lambda x: x.strip(), str.title)
    # Input processor, so the default TakeFirst() output still yields a single float
    # (as an output processor, MapCompose would return a list such as [12.99])
    price_in = MapCompose(lambda x: x.replace('$', ''), float)
    description_in = MapCompose(lambda x: x.strip())

# Example HTML fragment
html_data = '''
<div class="product">
  <h1 class="name"> product a </h1>
  <span class="price">$12.99</span>
  <div class="description">A really good product.</div>
</div>
'''

# Using parsel.Selector for data extraction
selector = Selector(text=html_data)

# Instantiate the loader and populate the item
loader = ProductLoader(selector=selector)
loader.add_css('name', '.name::text')
loader.add_xpath('price', '//span[@class="price"]/text()')
loader.add_value('description', 'Short description from custom source.')  # Add a fixed value
# Values from several sources accumulate; TakeFirst() keeps the first non-empty one
loader.add_css('description', '.description::text')

# Load the item
item = loader.load_item()
print(item)
# Expected output: MyItem(name='Product A', price=12.99, description='Short description from custom source.')