Scrapy's `Item` and `ItemLoader` classes are a convenient way to store and manage scraped data.

The `Item` class is a data container similar to Python's `@dataclass` or `pydantic.BaseModel`, where the data fields are defined:
```python
import scrapy

class Person(scrapy.Item):
    name = scrapy.Field()
    last_name = scrapy.Field()
    bio = scrapy.Field()
    age = scrapy.Field()
    weight = scrapy.Field()
    height = scrapy.Field()
```
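Once defined, an item behaves much like a dictionary with a fixed set of allowed keys. A minimal sketch (the values are made up for illustration):

```python
person = Person(name="John", last_name="Doe")
person['age'] = 30       # fields can also be set like dict keys
print(person['name'])    # John
# person['email'] = 'x'  # raises KeyError: undeclared fields are rejected
```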
`ItemLoader` objects, in turn, are used to populate the items with data:
```python
from scrapy.loader import ItemLoader

class PersonLoader(ItemLoader):
    default_item_class = Person
    # <fieldname>_out defines the output parsing rule for each field
    name_out = lambda values: values[0]
    last_name_out = lambda values: values[0]
    bio_out = lambda values: ''.join(values).strip()
    age_out = lambda values: int(values[0])
    weight_out = lambda values: int(values[0])
    height_out = lambda values: int(values[0])
```
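The lambdas above work, but the same rules can also be expressed with the built-in processors from the `itemloaders` package, which Scrapy uses under the hood. A sketch of an equivalent loader:

```python
from itemloaders.processors import Compose, Join, TakeFirst
from scrapy.loader import ItemLoader

class PersonLoader(ItemLoader):
    default_item_class = Person
    name_out = TakeFirst()                  # first non-empty value
    last_name_out = TakeFirst()
    bio_out = Compose(Join(''), str.strip)  # join all values, then strip
    age_out = Compose(TakeFirst(), int)     # first value, cast to int
    weight_out = Compose(TakeFirst(), int)
    height_out = Compose(TakeFirst(), int)
```

Note that `TakeFirst()` is slightly stricter than `values[0]`: it returns the first non-empty value, skipping `None` and empty strings.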
```python
import scrapy

class MySpider(scrapy.Spider):
    ...

    def parse(self, response):
        # create the loader and pass the response object to it:
        loader = PersonLoader(response=response)
        # add parsing rules, such as XPath selectors:
        loader.add_xpath('name', "//div[contains(@class,'name')]/text()")
        loader.add_xpath('bio', "//div[contains(@class,'bio')]/text()")
        loader.add_xpath('age', "//div[@class='age']/text()")
        loader.add_xpath('weight', "//div[@class='weight']/text()")
        loader.add_xpath('height', "//div[@class='height']/text()")
        # call load_item() to apply the rules and return the item:
        yield loader.load_item()
```
Here we defined parsing rules in the `PersonLoader` definition, like:
- taking the first found value for the name.
- converting numeric values to integers.
- joining all values for the bio field.
Then, to parse the response with these rules, `loader.load_item()` is called, forming our final item.
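To try these rules without running a full spider, you can feed the loader a `Selector` built from an HTML string (the markup below is hypothetical, just for illustration):

```python
from scrapy.selector import Selector

html = """
<div class="name">John</div>
<div class="bio">Likes </div><div class="bio">scraping. </div>
<div class="age">30</div>
"""

loader = PersonLoader(selector=Selector(text=html))
loader.add_xpath('name', "//div[contains(@class,'name')]/text()")
loader.add_xpath('bio', "//div[contains(@class,'bio')]/text()")
loader.add_xpath('age', "//div[@class='age']/text()")
print(loader.load_item())
# {'age': 30, 'bio': 'Likes scraping.', 'name': 'John'}
```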
Using the `Item` and `ItemLoader` classes is the standard way to structure scraped data in Scrapy, and it keeps the whole data process tidy and understandable.