Scrapfly Web Scraping Academy
Learn Everything about Web Scraping!
Scrapfly Academy covers modern web scraping issues and their solutions, walking you through them step by step.
Web scraping can be hard and confusing at times, and we know it! We have been doing it for years and have run into every issue you can imagine. We built Scrapfly to make web scraping easy and accessible, and here we are sharing everything we know! (well, almost 😉)
Web Scraping Roadmap
For a quick overview of all web scraping topics, challenges and everything that makes web scraping such a fascinating subject see our interactive academy mindmap 👇
CSS Selectors are the gold standard when it comes to parsing HTML. Most commonly they are used to apply styles to HTML pages, but they can also be used to extract data from HTML.
XPath is a very powerful HTML query language that can be used to parse HTML in web scraping. Compared to CSS selectors, XPath is significantly more powerful.
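Python's standard library supports only a small XPath subset (the full XPath 1.0 language needs a library like lxml), but even that is enough to illustrate the idea on some made-up HTML:

```python
import xml.etree.ElementTree as ET

# well-formed example markup (ElementTree requires valid XML)
html = """
<html><body>
  <div class="product"><a href="/item/1">First item</a></div>
  <div class="product"><a href="/item/2">Second item</a></div>
</body></html>
"""

tree = ET.fromstring(html)
# select every <a> under a div with class="product"
links = tree.findall(".//div[@class='product']/a")
urls = [a.get("href") for a in links]
print(urls)  # ['/item/1', '/item/2']
```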
HTML is designed to be machine-parsable, which means extracting content when web scraping is easy even for very complex pages. For this, XPath or CSS Selectors are usually used. Read More
JMESPath is a query language for JSON documents. It's simple, easy to use and available in almost every programming language.
Static Page Scraping
Dynamic Page Scraping
Dynamic pages are very different from classic static HTML pages, as the page rendering is done client-side by the web browser. This becomes an issue in web scraping because web scrapers are not web browsers (well, not usually).
An example of a dynamic page would be web-scraping.dev/testimonials where more testimonials are loaded as the user scrolls down the page. Read More
Headless browsers are special versions of web browsers that contain no UI elements and run in the background. This makes them ideal for web automation and for scraping dynamic pages.
Scraping using headless browsers has some clear advantages and disadvantages:
- Easier to scrape dynamic pages as scrapers see everything the browser sees.
- Can help with scraper blocking.
- Extremely resource intensive. Browsers use significantly more resources, from processing power and memory to bandwidth.
- Prone to bugs, as browsers are very complicated.
Hidden API Scraping
When web pages need to load data on demand, background requests to hidden data APIs are often used. These hidden APIs can be scraped directly. Read More
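The usual pattern can be sketched as: find the endpoint in the browser's devtools Network tab, request it directly, then parse the JSON body. The URL and response shape below are hypothetical:

```python
import json
import urllib.request

def scrape_testimonials(page: int) -> dict:
    # hypothetical endpoint discovered through browser devtools
    url = f"https://web-scraping.dev/api/testimonials?page={page}"
    req = urllib.request.Request(url, headers={"X-Requested-With": "XMLHttpRequest"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# hidden APIs return structured JSON, so no HTML parsing is needed:
body = '{"results": [{"text": "Great service!", "rating": 5}]}'
data = json.loads(body)
print(data["results"][0]["rating"])  # 5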
Hidden Web Data Scraping
Some data can be hidden in the invisible parts of HTML pages. Often this data is stored in <script> tags in JSON format, where it can be conveniently extracted as whole datasets.
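As a sketch, such a dataset can be pulled out of a <script> tag with a regular expression and the standard json module (the tag id and data below are made up):

```python
import json
import re

html = """
<html><body>
<script id="__DATA__" type="application/json">
{"product": {"name": "Box of Chocolate Candy", "price": 9.99}}
</script>
</body></html>
"""

# grab everything between the script tags; DOTALL lets . match newlines
match = re.search(
    r'<script id="__DATA__" type="application/json">(.+?)</script>',
    html,
    re.DOTALL,
)
dataset = json.loads(match.group(1))
print(dataset["product"]["name"])  # Box of Chocolate Candy
```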
Reverse Engineering Websites
Understanding how websites work can help to scrape them or even discover hidden APIs and data. This is called reverse engineering and there are tools that can assist with this.Read More
Pages can require secret HTTP attributes to load successfully. These details need to be discovered and included in web scrapers, which can be done through browser developer tools and reverse engineering.
For an example, see this Referer-lock Scrapeground exercise, which loads the page only when the correct Referer is provided.
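As a sketch using Python's standard urllib, a request carrying the expected Referer can be built and inspected before sending (the URL below is hypothetical):

```python
import urllib.request

req = urllib.request.Request(
    "https://web-scraping.dev/referer",  # hypothetical Referer-locked page
    headers={"Referer": "https://web-scraping.dev/"},
)
# the request object can be inspected before it is sent
print(req.get_header("Referer"))  # https://web-scraping.dev/
```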
Pages can require secret tokens to load successfully. These details need to be discovered and included in web scrapers, which can be done through browser developer tools and reverse engineering.
For an example, see this CSRF-lock Scrapeground exercise, where the page only loads when the correct secret token is provided. More on Reverse Engineering
Web scraper connections can be fingerprinted in various ways, which can lead to scrapers being blocked or even being fed false data. Read More
The TLS handshake is the first step of every https (secure) connection, and it can be fingerprinted. This fingerprint is used to track and identify web scrapers. For more, see our intro to TLS fingerprinting.
Scrapfly automatically bypasses TLS fingerprinting. More on Scraper Blocking
HTTP/2 is a complex enough protocol that it can be fingerprinted to track and identify web scrapers.
Scrapfly automatically bypasses HTTP/2 fingerprinting. More on Scraper Blocking
HTTP request headers provide metadata about outgoing requests. Scrapers that send headers different from those of real web browsers can be easily identified and blocked. For more, see Introduction to request headers. More on Scraper Blocking
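As a sketch, a scraper can send a browser-like header set instead of the HTTP client's defaults. The exact values below are illustrative and are best copied from your own browser's devtools:

```python
import urllib.request

# header values copied from a real browser session (illustrative)
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

req = urllib.request.Request("https://example.com", headers=browser_headers)
# the default urllib User-Agent ("Python-urllib/3.x") is an instant giveaway
print(req.get_header("User-agent").startswith("Mozilla/5.0"))  # True
```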
There are many software-as-a-service tools that try to identify and block web scrapers. These tools use all of the fingerprinting and IP tracking techniques above to identify web scrapers.
Here's a list of the most popular ones, with introduction articles on how they work and what can be done to bypass them:
Scrapfly automatically bypasses anti-scraping services when the ASP feature is used. More on Scraper Blocking
There are 3 primary proxy types:
- Datacenter - IP addresses assigned to datacenter corporations (AWS, Google etc.)
- Residential - IP addresses assigned to residential homes.
- Mobile - IP addresses assigned to mobile 3G/4G/5G towers.
Datacenter proxies are easy to identify and block, while residential and mobile proxies appear as natural traffic and are thus better for web scraping.
Scrapfly offers millions of residential proxies from 50+ countries. More on Proxies
To take full advantage of proxies, the addresses need to be rotated for each scraping request. Rotation logic can impact the overall blocking rate when scraping targets that use behavior analysis to block scrapers.
For an example, see our intro to proxy rotation article.
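A minimal round-robin rotator can be sketched with the standard library (the proxy addresses below are placeholders from a documentation IP range); real rotation logic often also weights proxies by health and recent block rate:

```python
from itertools import cycle

# placeholder addresses from the 198.51.100.0/24 documentation range
proxies = [
    "http://198.51.100.1:8080",
    "http://198.51.100.2:8080",
    "http://198.51.100.3:8080",
]
rotation = cycle(proxies)  # endless round-robin iterator

def next_proxy() -> str:
    return next(rotation)

picked = [next_proxy() for _ in range(4)]
# after a full cycle the first proxy comes around again
print(picked[0] == picked[3])  # True
```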
Scrapfly automatically rotates proxies for you from 50+ countries of your choice. More on Proxies
Proxy IP addresses are tracked and analyzed globally, which can lead to scraper blocking. Proxies are identified by subnet, ASN (autonomous system number), country, city, ISP and more.
Scrapfly has millions of proxies specifically made for web scraping. More on Proxies
Geographically Locked Content
Some websites can only be accessed from specific countries, which means proxies are needed to scrape that content. Scraping from the target website's natural country can also drastically reduce the scraper blocking rate.
Scrapfly has millions of proxies from 50+ countries. More on Proxies
Scraped Data Parsing
The web is full of different data formats: HTML, XML and JSON, just to name a few. These data formats need to be parsed using robust parsing techniques like XPath, CSS selectors and JMESPath. Read More
Scraped Data Validation
For long-term web scraping projects, data output validation tests are a great way to keep an eye on scraper performance. Static data models instantly catch any changes but are fragile, while schema-based validators are more flexible.
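A minimal schema-based validator can be sketched using only the standard library; in practice, libraries such as jsonschema or pydantic are commonly used for this:

```python
def validate(item: dict, schema: dict) -> list:
    """Return a list of validation errors (empty list = valid)."""
    errors = []
    for field, expected_type in schema.items():
        if field not in item:
            errors.append(f"missing field: {field}")
        elif not isinstance(item[field], expected_type):
            errors.append(f"bad type for {field}: {type(item[field]).__name__}")
    return errors

# example schema for a scraped product item
product_schema = {"name": str, "price": float, "in_stock": bool}

ok = validate({"name": "Candy", "price": 9.99, "in_stock": True}, product_schema)
bad = validate({"name": "Candy", "price": "9.99"}, product_schema)
print(ok)   # []
print(bad)  # ['bad type for price: str', 'missing field: in_stock']
```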
Read more about data validation techniques in web scraping. More on Data Processing
Web Scraper Tests
Testing web scrapers is a bit different from testing other programs. Schema-based result testing is the most common way to test web scrapers.
Read more about data validation using schemas. More on Data Processing
Continuously scraping websites to detect changes is a popular way to keep up with new data and with website changes that could break scraping. More on Data Processing
As scrapers collect unknown data from the internet, data cleanup is an important step of the delivery process. This involves natural language parsing, data normalization and so on. Read More
Not all scraped data is structured. Tools like regular expressions, natural language processing and even AI can be used to make sense of unstructured scraped data.
For some examples, see these articles: More on Data Processing
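For instance, a regular expression can pull structured prices out of free-form text:

```python
import re

text = "The new console retails for $499.99 (or 4 payments of $125)."
# capture dollar amounts, with or without a cents part
prices = [float(m) for m in re.findall(r"\$(\d+(?:\.\d{2})?)", text)]
print(prices)  # [499.99, 125.0]
```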
Converting currencies and units (especially dates) can be a major pain point in scraper development.
For some examples, see these articles: More on Data Processing
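For instance, scraped date strings in mixed formats can be normalized to ISO dates with the standard library (the input formats below are illustrative):

```python
from datetime import datetime

raw_dates = ["Jan 5, 2023", "05/01/2023", "2023-01-05"]
formats = ["%b %d, %Y", "%d/%m/%Y", "%Y-%m-%d"]

def normalize(value: str) -> str:
    # try each known format until one parses
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value}")

normalized = [normalize(d) for d in raw_dates]
print(normalized)  # ['2023-01-05', '2023-01-05', '2023-01-05']
```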
Scraping tasks can be processing-intensive, especially when it comes to data parsing. Multi-processing allows distributing scrape tasks across multiple processes that can take advantage of multiple CPU cores.
See the Multi Processing Section of this tutorial.
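The idea can be sketched with the standard library's concurrent.futures, where parse_page stands in for a CPU-heavy parsing function:

```python
from concurrent.futures import ProcessPoolExecutor

def parse_page(html: str) -> int:
    # placeholder for expensive parsing work
    return len(html)

pages = ["<html>a</html>", "<html>bb</html>"]

if __name__ == "__main__":
    # distribute parsing across worker processes (one per CPU core by default)
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(parse_page, pages))
    print(results)  # [14, 15]
```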
HTTP caching is an important scraper optimization step, especially for repeated scraping, testing and debugging.
Scrapfly can store and manage web scraping cache using the cache feature.
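A minimal response cache can be sketched as a dictionary keyed by URL; real HTTP caches also honor headers like ETag and Cache-Control:

```python
cache: dict = {}
fetch_count = 0

def fetch(url: str) -> str:
    """Return a cached body if available, otherwise 'download' and cache it."""
    global fetch_count
    if url in cache:
        return cache[url]
    fetch_count += 1
    body = f"<html>body of {url}</html>"  # stand-in for a real HTTP request
    cache[url] = body
    return body

fetch("https://example.com/page/1")
fetch("https://example.com/page/1")  # second call is served from cache
print(fetch_count)  # 1
```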
As web scraping relies on IO-bound operations (HTTP requests), asynchronous programming can speed up scrapers by hundreds or even thousands of times. For more, see our intro to asynchronous requests in scraping article.
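The speedup comes from firing requests concurrently instead of one after another, which can be sketched with asyncio (asyncio.sleep stands in for a real async HTTP request made with e.g. aiohttp or httpx):

```python
import asyncio

async def fetch(url: str) -> str:
    await asyncio.sleep(0.01)  # simulated network latency
    return f"response from {url}"

async def scrape_all(urls: list) -> list:
    # gather awaits all requests concurrently rather than sequentially
    return await asyncio.gather(*(fetch(u) for u in urls))

urls = [f"https://example.com/page/{i}" for i in range(3)]
results = asyncio.run(scrape_all(urls))
print(results[0])  # response from https://example.com/page/0
```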
Separating scraper functionality into different services is the most common way of scaling up web scrapers. Using a full web scraping service like Scrapfly is an option, but it's also possible to host parts of the web scraping stack as services yourself. Here are some examples: