Is Web Scraping Legal?

6 Ways to Make Sure Your Data Scraping Is Legal and Compliant with the GDPR

How to Perform Web Scraping Legally?

When the General Data Protection Regulation (GDPR) came into effect in May 2018, companies working with the personal data of European Union residents were concerned that they would no longer be allowed to perform web scraping. Their disquiet was justified, because the GDPR did place some legal limits on data scraping, in addition to the ethical and technical ones.

At the same time, data scraping is a more common practice than you might think: according to some estimates, more than 50% of all website visits are for data scraping purposes. So, if data scraping is vital for your business, you need to be aware of the legal issues related to web scraping and comply with regulations so you can continue collecting useful data without breaking the law.

What is web scraping?

Web scraping is a form of data scraping that is conducted exclusively online: it consists of harvesting publicly available data from online sources. The data collected automatically by scraping software is used for identifying trends, helping with recruitment, assessing credit risk, determining customer sentiment, selling, etc. How do you ensure you are doing web scraping legally, though?

1. The purpose of web scraping must be legal

A legal and ethical plan for extracting and using data needs to meet the following criteria:

  1. The data must be collected only for your company’s purpose and not made public
  2. The data must not cause financial or reputational losses to its owners

Web scraping is legal and ethical when you extract data only for personal use and analysis. Things are completely different when you want to republish the collected data, in which case you need to ask for the data subjects' permission and check website policies before scraping; otherwise you may infringe personal data protection laws. Web crawlers do not have the freedom to use the obtained data for unlimited commercial purposes, and copyright in the data remains enforceable no matter how the data was obtained.

2. The data you want to get must be publicly available information

Even if data is published on a website for everyone to have access to it, copying it may not be legal. In this case the solution is to check website policies in order to make sure all the data you access and acquire is authorized for scraping.

The rule of thumb is that you can collect information that does not include personal data and does not violate the website's terms of service. The terms of service (ToS) section is usually linked in the footer of the page and states what data you can collect and what data can put you at risk of being fined for web scraping without the owner's permission. There is also information that is secured, such as usernames, passwords, and access codes, which you are also not allowed to collect. Regulations regarding data scraping usually limit the freedom to obtain data from sites that require authentication.

3. Check copyrights

Another tool that websites use to control web scraping is copyright rules, which users need to respect as well. In other words, before copying any kind of content, such as text, images, trademarks, and databases, you need to make sure the information you want to scrape is not copyrighted. Without consent from the copyright holder, you cannot republish scraped data. It may still be possible, however, to use facts from a creative work when only its format is copyrighted, as long as you modify those elements and deliver them in an original manner.

4. Pay attention to the web scraping rate

Web scraping is performed by powerful software that can put a heavy load on website servers. You should keep the scraping rate low enough that the bandwidth and performance of the web server are not affected. The robots.txt file usually specifies the crawl-delay setting you need to respect; if it does not, you should stick to an average scraping rate of approximately 1 request every 10-15 seconds. Otherwise, the web server could automatically block your IP and prevent you from accessing the page again.
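As a rough illustration of the rate rule above, the following Python sketch reads a Crawl-delay value from robots.txt text and otherwise falls back to a default. It is a minimal sketch, not a complete scraper: the names `resolve_delay`, `Throttle`, `ExampleBot`, and the sample robots.txt content are hypothetical, and the 10-second default is an assumption taken from the low end of the 10-15 second range mentioned above.

```python
import time
import urllib.robotparser

# Assumption: fall back to the low end of the 10-15 second range given above.
DEFAULT_DELAY_SECONDS = 10.0


def resolve_delay(robots_txt: str, user_agent: str) -> float:
    """Return the Crawl-delay for user_agent, or the default if none is set."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else DEFAULT_DELAY_SECONDS


class Throttle:
    """Blocks so that consecutive requests are at least delay_seconds apart."""

    def __init__(self, delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self._delay = delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_request = None  # time of the previous request, if any

    def wait(self):
        if self._last_request is not None:
            remaining = self._delay - (self._clock() - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request = self._clock()


# Hypothetical robots.txt content, used here instead of a live download.
SAMPLE_ROBOTS = "User-agent: *\nCrawl-delay: 5\n"
throttle = Throttle(resolve_delay(SAMPLE_ROBOTS, "ExampleBot"))
# Call throttle.wait() immediately before each request to the site.
```

Calling `throttle.wait()` before every request keeps the spacing in one place, so the rest of the scraper does not need to track timing itself.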

5. Use a path similar to the search engine when performing web scraping

In order to avoid damaging the website's code and interfering with its normal operation, the best web scraping method is to use crawlers that access website data as a visitor would and follow the same path as a search engine. Another advantage of this method is that it allows you to scrape without registering as a user or accepting any terms of use, and thus to access any public information available to the typical user.

6. Let the website know you are scraping

You can identify your web scraper with a legitimate user agent string. The string can link to a page that informs the website owners about your activity, its purpose, and the organization in whose name you are scraping. Not only are you showing respect to the website owner, but you are also giving them a link back to that page in your user agent string.
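One way to do this, sketched in Python with the standard library, is to set the User-Agent header on every request. The bot name, version, and info-page URL below are hypothetical placeholders, and the `Name/Version (+URL)` layout is a common convention rather than a requirement.

```python
import urllib.request

# Hypothetical placeholders: replace with your own bot name and info page.
BOT_NAME = "ExampleBot"
BOT_VERSION = "1.0"
INFO_PAGE = "https://example.com/bot-info"


def build_user_agent(name: str, version: str, info_url: str) -> str:
    """Format a user agent string that links back to an explanatory page."""
    return f"{name}/{version} (+{info_url})"


def build_request(url: str) -> urllib.request.Request:
    """Create a request object that carries the identifying User-Agent header."""
    headers = {"User-Agent": build_user_agent(BOT_NAME, BOT_VERSION, INFO_PAGE)}
    return urllib.request.Request(url, headers=headers)


# No network traffic happens here; the request is only constructed.
request = build_request("https://example.com/jobs")
print(request.get_header("User-agent"))
```

The page behind the URL in the string is where you would describe the scraper's purpose and the organization responsible for it.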

In order to enjoy the benefits of this useful and affordable method for collecting data for your business, you need to conduct web scraping in a responsible and respectful manner that prevents problems and keeps your business legal and protected.

Can you legally scrape data from LinkedIn?

One of the most famous web scraping disputes is the one between LinkedIn and hiQ, a data scraping business from Silicon Valley. At stake was whether LinkedIn can prevent other businesses from accessing data that is publicly available on its social network, or whether it must allow access even when those businesses are its competitors.

LinkedIn's first step was to send a cease and desist letter to the startup, asking it to immediately stop scraping data from its servers. Its main argument was that scraping was a violation of the Computer Fraud and Abuse Act (CFAA) and of the Digital Millennium Copyright Act.

But LinkedIn failed to prevent hiQ from scraping data from its platform: hiQ filed its own suit against LinkedIn and obtained an injunction from the court forcing LinkedIn to provide access to its servers until the case is decided. The court eventually ruled against the well-known professional network, as selectively banning potential competitors from using data that is publicly available can be considered unfair competition. LinkedIn's latest plans were to escalate the case to the Supreme Court.

Useful terms for legal web scraping:

Although web scraping is not necessarily illegal, the purpose for which you use this data collection method can make it either legal or illegal. Review the following terms for clarification:

GDPR

The GDPR, or General Data Protection Regulation, has been enforced in the European Union since 2018 to enable residents to control their own data. The regulation prevents businesses from doing whatever they want with personally identifiable data such as names, addresses, phone numbers, and emails. So, data scraping itself is not illegal, but the use of personal information is limited. For instance, businesses can scrape data and use it for various purposes only if they have the explicit consent of consumers. One of the practices that the GDPR does not allow is scraping names and emails from a website to generate leads without the consent of those customers.

Terms of Service

Aside from the GDPR, you also need to take into consideration the requirements imposed by websites. When you accept certain terms of service, you are entering into a contract with that website, and you cannot conduct activity that the website has prohibited you from doing.

Robots.txt

You can easily find out which scraping practices a website accepts from its robots.txt file. The file can tell you what access your scraping tools may have, the amount of time you are allowed on the website, and the number of information requests you can make. Ignoring the robots.txt file on a website and its provisions is not illegal, but it is very unethical, and doing so can cause you to have your IP blocked on that specific server.
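To make this concrete, here is a minimal Python sketch that checks robots.txt rules with the standard library's `urllib.robotparser`. The sample file content, the `ExampleBot` name, the `load_rules` helper, and the example.com URLs are all hypothetical; a real site's robots.txt may use different directives or omit them entirely.

```python
import urllib.robotparser

# Hypothetical robots.txt content, parsed from a string instead of downloaded.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Request-rate: 1/10
"""


def load_rules(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Parse robots.txt text into a queryable rule set."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS)

# Access: is this user agent allowed to fetch a given URL?
print(rules.can_fetch("ExampleBot", "https://example.com/jobs"))
print(rules.can_fetch("ExampleBot", "https://example.com/private/report"))

# Number of requests: Request-rate is exposed as (requests, seconds), if present.
rate = rules.request_rate("ExampleBot")
if rate is not None:
    print(f"{rate.requests} request(s) per {rate.seconds} seconds")
```

For a live site, the same parser can download the file itself via `set_url()` followed by `read()`; parsing a string here keeps the sketch free of network access.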