Parsing datetime strings is one of the most common challenges in web scraping because the web is full of different date and time formats.
In this article, we'll take a look at a Python package called dateparser, which can automatically parse datetime strings from almost any text format. In short, we'll be converting Python strings to datetime objects using the most hands-off approach available in Python.
What is Dateparser?
Dateparser is a smart date and time text parsing library for Python. It's a popular community package that can extract real datetime objects from almost any text containing date or time data.
How to use Dateparser?
The key function of Dateparser is the dateparser.parse function.
It takes a string as an argument and tries to find datetime data in it. For example, suppose we scraped a handful of date strings from a website and want to run them through dateparser:
import dateparser
# Parsing dates in different formats
dates_in_different_formats = [
    "2023-06-07",  # ISO 8601 format
    "06/07/2023",  # US format
    "07/06/2023",  # European format
    "June 7, 2023",  # Long format
    "7 Jun 2023",  # Another common format
    "2023-06-07T15:25:10",  # ISO 8601 with time
    "2023-06-07 15:25:10",  # Space separated date and time
    "2023-06-07 15:25:10.555",  # Time with milliseconds
]
for date_string in dates_in_different_formats:
    parsed_date = dateparser.parse(date_string)
    print(f"Original: {date_string}\n Parsed: {parsed_date}")
"""
Original: 2023-06-07
Parsed: 2023-06-07 00:00:00
Original: 06/07/2023
Parsed: 2023-06-07 00:00:00
Original: 07/06/2023
Parsed: 2023-07-06 00:00:00
Original: June 7, 2023
Parsed: 2023-06-07 00:00:00
Original: 7 Jun 2023
Parsed: 2023-06-07 00:00:00
Original: 2023-06-07T15:25:10
Parsed: 2023-06-07 15:25:10
Original: 2023-06-07 15:25:10
Parsed: 2023-06-07 15:25:10
Original: 2023-06-07 15:25:10.555
Parsed: 2023-06-07 15:25:10.555000
"""
Using Dateparser, we can parse all of these formats without specifying any datetime format explicitly, which makes date parsing in web scraping much more accessible.
Common Problems
Datetime strings are still complicated and highly contextual, so not every problem can be solved by Dateparser automatically without specifying parsing preferences. Here are the most common problems encountered with Dateparser and how to address them using dateparser settings.
Date Object Order
The most common issue with parsing datetime strings using Dateparser is that it can't reliably resolve dates with an ambiguous date object order.
For example, if we have a date like 07/06/2023, it's impossible to know whether it's the 7th of June or the 6th of July. To address this, the DATE_ORDER setting can be used:
import dateparser
# Day first and year first dates
dates = [
    "13/06/2023",
    "06/13/2023",
    "23/06/13",
    "13/06/23",
    "2023/06/13",
]
# Parsing with 'DMY' order
print("Parsing with 'DMY' order")
for date_string in dates:
    parsed_date = dateparser.parse(date_string, settings={'DATE_ORDER': 'DMY'})
    print(f"Original: {date_string}, Parsed: {parsed_date}")
# Parsing with 'YMD' order
print("\nParsing with 'YMD' order")
for date_string in dates:
    parsed_date = dateparser.parse(date_string, settings={'DATE_ORDER': 'YMD'})
    print(f"Original: {date_string}, Parsed: {parsed_date}")
"""
Parsing with 'DMY' order
Original: 13/06/2023, Parsed: 2023-06-13 00:00:00
Original: 06/13/2023, Parsed: None
Original: 23/06/13, Parsed: 2013-06-23 00:00:00
Original: 13/06/23, Parsed: 2023-06-13 00:00:00
Original: 2023/06/13, Parsed: None
Parsing with 'YMD' order
Original: 13/06/2023, Parsed: 2023-06-13 00:00:00
Original: 06/13/2023, Parsed: 2023-06-13 00:00:00
Original: 23/06/13, Parsed: 2023-06-13 00:00:00
Original: 13/06/23, Parsed: 2013-06-23 00:00:00
Original: 2023/06/13, Parsed: 2023-06-13 00:00:00
"""
The date order can usually be guessed from the geolocation or the language of the scraped website. For example, US websites usually use the MDY format, while most other regions use DMY or YMD.
Handling Implicit Timezones
Many scraped websites use implicit timezones. For example, if we scrape a website that shows content relative to New York, the scraped datetime strings are likely to be in New York time.
For this, the timezone can be specified manually using the TIMEZONE setting:
Some datetime strings can be implicitly incomplete. For example, if we scrape "December 2023" we don't know the exact date. For this, the PREFER_DAY_OF_MONTH setting can be used:
For cases where the year is implicit, the PREFER_DATES_FROM setting can be used:
import dateparser
# Results are shown as comments; exact values depend on the date the code is run
# default implies the date is from the current year
dateparser.parse('March')
# datetime.datetime(2023, 3, 7, 0, 0)
# to imply the date is from the future
dateparser.parse('March', settings={'PREFER_DATES_FROM': 'future'})
# datetime.datetime(2024, 3, 7, 0, 0)
# to imply the date is from the past
dateparser.parse('March', settings={'PREFER_DATES_FROM': 'past'})
# datetime.datetime(2022, 3, 7, 0, 0)
Summary
Dateparser is a powerful library for parsing datetime strings. It can parse dates in many different formats without having to specify any datetime formats ourselves. It also has many settings that can be used to handle common problems like implicit timezones and incomplete dates.
To see a real-life example of Dateparser in web scraping, see our "How to Scrape eBay" tutorial, where we use Dateparser to parse the dates used in eBay listings.