What are Scrapy Item and ItemLoader objects and how do you use them?

Scrapy's Item and ItemLoader classes are a convenient way to store and manage scraped data.

The Item class is a dict-like container, similar to Python's @dataclass or pydantic.BaseModel, where the data fields are declared:

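import scrapy

class Person(scrapy.Item):
    # each Field() declares one key the item is allowed to hold
    name = scrapy.Field()
    last_name = scrapy.Field()
    bio = scrapy.Field()
    age = scrapy.Field()
    weight = scrapy.Field()
    height = scrapy.Field()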

ItemLoader objects, in turn, are used to populate the items with data:

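import scrapy
# Scrapy 2.2+; older releases expose these under scrapy.loader.processors
from itemloaders.processors import Compose, Join, TakeFirst
from scrapy.loader import ItemLoader

class PersonLoader(ItemLoader):
    default_item_class = Person
    # <fieldname>_out defines the output rule for that field;
    # each rule receives the list of all values collected for the field
    name_out = TakeFirst()
    last_name_out = TakeFirst()
    bio_out = Compose(Join(''), str.strip)
    age_out = Compose(TakeFirst(), int)
    weight_out = Compose(TakeFirst(), int)
    height_out = Compose(TakeFirst(), int)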

class MySpider(scrapy.Spider):
    ...
    def parse(self, response):
        # create loader and pass response object to it:
        loader = PersonLoader(response=response)
        # add parsing rules like XPath:
        loader.add_xpath('name', "//div[contains(@class,'name')]/text()")
        loader.add_xpath('bio', "//div[contains(@class,'bio')]/text()")
        loader.add_xpath('age', "//div[@class='age']/text()")
        loader.add_xpath('weight', "//div[@class='weight']/text()")
        loader.add_xpath('height', "//div[@class='height']/text()")
        # call load item to parse data and return item:
        yield loader.load_item()

Here, the parsing rules are defined on the PersonLoader class itself:

  • taking the first found value for the name.
  • converting numeric values to integers.
  • joining all values for the bio field.
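
Each of these rules is just a callable that receives the list of strings collected for a field. To make that concrete, here is a minimal sketch of the three rules as hand-written plain-Python functions. The function names and sample values are invented for illustration and are not part of Scrapy:

```python
def take_first(values):
    """Return the first non-empty value from the collected list."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


def join_and_strip(values):
    """Concatenate all collected fragments and trim outer whitespace."""
    return "".join(values).strip()


def first_as_int(values):
    """Take the first collected value and convert it to an integer."""
    return int(take_first(values))


# Hypothetical raw values, shaped like the lists an XPath query returns.
raw = {
    "name": ["Ada", "Lovelace"],
    "bio": [" Mathematician ", "and writer. "],
    "age": ["36"],
}

parsed = {
    "name": take_first(raw["name"]),
    "bio": join_and_strip(raw["bio"]),
    "age": first_as_int(raw["age"]),
}
print(parsed)
```

Note that the integer rule has to pick a single value before converting, because calling int() on the whole list would fail.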

Then, calling loader.load_item() applies these rules to the values extracted from the response and builds the final item.

Using Item and ItemLoader classes is the standard way to structure scraped data in Scrapy
and a convenient way to keep data processing tidy and understandable.

Provided by Scrapfly
