Knowledge Base

Quick answers to common web scraping questions (161 answers)

Answers

24 answers

Web scraping - what is HTTP 499 status code?

Response error 499 is a non-standard status code, used by nginx, meaning the client closed the connection before the server finished responding. In web scraping, it can also mean the client is being blocked. Here's how to fix i...

blocking

Web scraping - what is HTTP 503 status code?

Response error 503 generally means the server is temporarily unavailable; however, it can also indicate blocking. Here's how to fix it.

blocking

Web scraping - what is HTTP 520 status code?

Response error 520 generally means the server cannot create a valid response. This could also mean the client is being blocked. Here's how to fix it.

blocking

What are Cloudflare Errors 1006, 1007, 1008?

Cloudflare is a popular anti-scraping service, and errors 1006, 1007 and 1008 are common signs that a web scraper is being blocked. Here's how to avoid them.

blocking

What is Cloudflare Error 1009?

Cloudflare is a popular anti-scraping service, and error 1009 (access denied) is a common sign that a web scraper is being blocked. Here's how to avoid it...

blocking

What is Cloudflare Error 1010?

Cloudflare is a popular anti-scraping service, and error 1010 (access denied) is a common sign that a web scraper is being blocked. Here's how to avoid it...

blocking

What is Cloudflare Error 1020?

Cloudflare error 1020 (access denied) is a common error in web scraping, caused by the Cloudflare anti-scraping service. Here's how to avoid it.

blocking

3 ways to install Python Requests library

The Python requests library is a popular HTTP client. Here's how to install it using pip, poetry and pipenv.

requests

How to scrape Perimeter X: Please verify you are human?

Perimeter X is a popular anti-scraping protection service - here's how to avoid it when web scraping.

blocking

XPath vs CSS selectors: what's the difference?

CSS selectors and XPath are both path languages for HTML parsing. XPath is more powerful but CSS is more approachable. Which one is better?

css-selectors xpath

How to save and load cookies in Python requests?

To keep a session between script runs, we can save the requests session cookies to disk and load them back. Here's how to do it in Python requests.

http python requests
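As a rough illustration of the idea, here is a minimal sketch that stores a session's cookies as JSON and restores them later. The helper names, the `cookies.json` path and the `sessionid` cookie are hypothetical; the sketch assumes only cookie names and values matter, so domain, path and expiry are not preserved.

```python
import json
from pathlib import Path


def save_cookies(session, path):
    """Write the session's cookies to disk as a JSON name -> value mapping."""
    # RequestsCookieJar.get_dict() flattens the jar into a plain dict.
    Path(path).write_text(json.dumps(session.cookies.get_dict()))


def load_cookies(session, path):
    """Read cookies written by save_cookies() back into the session."""
    # RequestsCookieJar.update() accepts a plain dict of name -> value.
    session.cookies.update(json.loads(Path(path).read_text()))


def demo():
    # Imported here so the helpers above can be read without requests installed.
    import requests

    first = requests.Session()
    first.cookies.set("sessionid", "abc123")  # hypothetical cookie
    save_cookies(first, "cookies.json")

    second = requests.Session()  # stands in for a later script run
    load_cookies(second, "cookies.json")
    return second.cookies.get_dict()
```

Calling `demo()` writes `cookies.json` to the working directory and returns the cookies restored into the second session.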

How to download a file with Playwright and Python?

To download files using Playwright, we can either simulate the button click or extract the URL and download it over HTTP. Here's how.

playwright

How to get the file type of a URL in Python?

There are 2 ways to determine a URL's file type: guess from the URL extension using the mimetypes module, or make an HTTP HEAD request. Here's how.

python crawling
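The two approaches could be sketched as follows, using only the standard library. The function names are hypothetical, and the HEAD variant assumes the server answers HEAD requests and sends a Content-Type header, which not every server does.

```python
import mimetypes
import urllib.request
from urllib.parse import urlparse


def guess_type_from_url(url):
    """Guess the MIME type from the extension in the URL path; None if unknown."""
    path = urlparse(url).path  # drop the query string and fragment
    mime, _encoding = mimetypes.guess_type(path)
    return mime


def type_from_head(url, timeout=10):
    """Ask the server with a HEAD request and return its Content-Type header."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")
```

Guessing from the extension needs no network access but trusts the URL; the HEAD request costs a round trip but reports what the server actually claims to serve.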

How to load local files in Playwright?

To load local files as page URLs in Playwright we can use the file:// protocol. Here's how to do it.

playwright
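A minimal sketch of this, assuming the synchronous Playwright API and a Chromium install; the function names are hypothetical. The path is converted to an absolute `file://` URL with `pathlib` before being passed to `page.goto()`.

```python
from pathlib import Path


def local_file_url(path):
    """Turn a filesystem path into an absolute file:// URL."""
    return Path(path).resolve().as_uri()


def open_local_file(path):
    # Imported here so local_file_url() works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page()
        page.goto(local_file_url(path))  # e.g. file:///home/user/page.html
        title = page.title()
        browser.close()
    return title
```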

How to save and load cookies in Playwright?

To persist a Playwright session between program runs, we can save cookies to disk and load them back. Here's how.

playwright
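One way this might look, as a sketch: Playwright's browser context exposes `cookies()` and `add_cookies()`, so the cookie list can be round-tripped through a JSON file. The helper names and the file path are hypothetical.

```python
import json
from pathlib import Path


def save_context_cookies(context, path):
    """Dump every cookie in the browser context to a JSON file."""
    # BrowserContext.cookies() returns a list of plain dicts.
    Path(path).write_text(json.dumps(context.cookies()))


def load_context_cookies(context, path):
    """Add cookies written by save_context_cookies() to the context."""
    context.add_cookies(json.loads(Path(path).read_text()))
```

In a real script, `context` would come from `browser.new_context()`, with `load_context_cookies()` called before the first page is opened.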

How to take a screenshot with Playwright?

To take page screenshots in Playwright, we can use the page.screenshot() method. Here's how to select areas of the page and screenshot them.

python playwright
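A short sketch of both cases, assuming an already opened Playwright `page`; the function name, the selector and the output file names are hypothetical. `page.screenshot()` captures the page, while calling `screenshot()` on a locator captures only the matched element.

```python
def take_screenshots(page, selector, page_path="page.png", element_path="element.png"):
    """Save a full-page screenshot and a screenshot of one selected element."""
    # full_page=True captures the whole scrollable page, not just the viewport.
    page.screenshot(path=page_path, full_page=True)
    # Locator.screenshot() crops the image to the element matched by the selector.
    page.locator(selector).screenshot(path=element_path)
```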

How to block image loading in Selenium?

To improve Selenium's performance we can block image loading. With the Chrome browser, this can be done through the "prefs" launch option. Here's how.

python selenium
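As a sketch of the "prefs" approach, the snippet below sets the commonly used Chrome preference `profile.managed_default_content_settings.images` to 2 (block). The function names are hypothetical, and the sketch assumes Selenium 4 with a local Chrome install.

```python
def image_blocking_prefs():
    """Chrome preferences that disable image loading (2 means 'block')."""
    return {"profile.managed_default_content_settings.images": 2}


def make_driver():
    # Imported here so the prefs helper can be read without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_experimental_option("prefs", image_blocking_prefs())
    return webdriver.Chrome(options=options)
```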

Scrapy vs BeautifulSoup - what's the difference?

Scrapy and BeautifulSoup are two popular but very different web scraping libraries. Scrapy is a framework, while BeautifulSoup is an HTML parser.

beautifulsoup scrapy

How to scroll to an element in Selenium?

In Selenium, the scrollIntoView JavaScript function can be used to scroll to a specific HTML element. Here's how to use it in Selenium.

python selenium
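A minimal sketch of the scrollIntoView approach: the element is passed to the browser as a script argument and scrolled into view with JavaScript. The helper name, the URL parameter and the `footer` selector are hypothetical.

```python
def scroll_to(driver, element):
    """Scroll the page until the given element is in view."""
    # arguments[0] is the element passed as the second parameter.
    driver.execute_script("arguments[0].scrollIntoView();", element)


def demo(url):
    # Imported here so scroll_to() can be read without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        footer = driver.find_element(By.CSS_SELECTOR, "footer")  # hypothetical selector
        scroll_to(driver, footer)
    finally:
        driver.quit()
```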

How to find elements by XPath selectors in Playwright?

To execute XPath selectors in Playwright, the page.locator() method can be used. Here's how.

playwright xpath

How to block resources in Playwright and Python?

Blocking non-vital resources can drastically speed up Playwright. To do that, the request interception feature can be used. Here's how.

python playwright
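A sketch of how interception might be used for this: every request is routed through a handler that aborts unwanted resource types and lets the rest continue. The `BLOCKED_TYPES` set and the function names are assumptions for illustration; which types count as non-vital depends on the target site.

```python
# Resource types treated as non-vital in this sketch (an assumption).
BLOCKED_TYPES = {"image", "font", "media", "stylesheet"}


def handle_route(route):
    """Abort requests for blocked resource types and let the rest through."""
    if route.request.resource_type in BLOCKED_TYPES:
        route.abort()
    else:
        route.continue_()


def open_without_extras(url):
    # Imported here so handle_route() can be read without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page()
        page.route("**/*", handle_route)  # intercept every request
        page.goto(url)
        html = page.content()
        browser.close()
    return html
```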

How to capture background requests and responses in Playwright?

To capture background requests and responses in Playwright, we can use the request/response interception feature through the page.on() method. Here's how.

python playwright
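As an illustration, the sketch below registers a "response" listener with `page.on()` and records a few fields of every response the page receives. The function name and the recorded fields are hypothetical choices.

```python
def collect_responses(page):
    """Record every response the page receives; returns the live list."""
    captured = []

    def on_response(response):
        captured.append(
            {
                "url": response.url,
                "status": response.status,
                "resource_type": response.request.resource_type,
            }
        )

    # Register the listener before navigating so early responses are not missed.
    page.on("response", on_response)
    return captured
```

After `page.goto(...)` completes, the returned list holds one entry per response, which can then be filtered, for example by `resource_type` values such as "xhr" or "fetch".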

How to find elements by CSS selectors in Playwright?

To execute CSS selectors on the current HTML data in Playwright, the page.locator() method can be used. Here's how.

python playwright

How to parse dynamic CSS classes when web scraping?

Dynamic CSS classes can make pages very difficult to scrape. There are a few tricks and common idioms for approaching this, though.

data-parsing
