The internet is crawling with bots. A bot is a software program that runs automated tasks over the internet, typically performing simple, repetitive tasks at speeds that are unattainable by, or undesirable for, humans. Bots are responsible for many small jobs that we take for granted, such as search engine crawling, website health monitoring, fetching web content, measuring site speed and powering APIs. They can also be used to automate security auditing by scanning your network and websites to find vulnerabilities and help remediate them.

According to our 2015 Bot Traffic Report, almost half of all web traffic comes from bots, and two thirds of the bot traffic we’ve analyzed is malicious. One of the ways that bots can harm businesses is by engaging in web scraping. We often work with customers on this issue and wanted to share what we’ve learned. This post discusses what web scraping is, how it works, and why it’s a problem for website owners.

What is scraping?

Web scraping is the process of automatically collecting information from the web. The most common type of scraping is site scraping, which aims to copy or steal web content for use elsewhere. This repurposing of content may or may not be approved by the website owner.

Typically, bots do this by crawling a website, accessing its source code and then parsing it to extract the key pieces of data they want. After obtaining the content, they typically post it elsewhere on the internet.
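To make the parsing step concrete, here is a minimal sketch using only Python's standard library. The page source, the `price` class name and the `PriceExtractor` helper are all hypothetical and chosen for illustration; a real scraper would first fetch the source over HTTP, which is omitted here.

```python
from html.parser import HTMLParser


class PriceExtractor(HTMLParser):
    """Collects the text of every element whose class attribute is "price"."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        # Any closing tag ends the current price element (no nesting assumed).
        self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


# Hypothetical page source standing in for a fetched web page.
PAGE_SOURCE = """
<html><body>
  <div class="product"><h2>Widget</h2><span class="price">$19.99</span></div>
  <div class="product"><h2>Gadget</h2><span class="price">$34.50</span></div>
</body></html>
"""


def extract_prices(source):
    """Parse the page source and return the price strings it contains."""
    parser = PriceExtractor()
    parser.feed(source)
    return parser.prices
```

Calling `extract_prices(PAGE_SOURCE)` returns the two price strings, which illustrates how little effort it takes to lift structured data out of a page once its markup is known.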

A more advanced type of scraping is database scraping. Conceptually this is similar to site scraping except that hackers will create a bot which interacts with a target site’s application to retrieve data from its database. An example of database scraping is when a bot targets an insurance website to receive quotes on coverage. The bot will try all possible combinations in the web application to obtain quotes and pricing for all scenarios.

In this example, the bot tells the application it is a 25-year-old male looking for a quote for a Honda, then for a Toyota, then a Ferrari. Each time the bot gets a different result back from the application. Given enough tries, it is possible to obtain entire datasets. Clearly with the number of permutations available in this scenario, a bot would be preferable to a human.
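To give a sense of the scale involved, the following sketch counts the quote requests a site would see if a bot walked through every combination of a few form fields. The field names and value ranges are assumptions made up for illustration; a real quote form would have many more fields, and the count grows multiplicatively with each one.

```python
from itertools import product

# Hypothetical form fields and values, chosen only for illustration.
AGES = range(18, 81)                       # ages 18 through 80
GENDERS = ("male", "female")
CAR_MAKES = ("Honda", "Toyota", "Ferrari")


def all_quote_requests():
    """Yield one dictionary per combination of the form fields."""
    for age, gender, make in product(AGES, GENDERS, CAR_MAKES):
        yield {"age": age, "gender": gender, "make": make}


def count_quote_requests():
    """Number of combinations: the product of the field sizes."""
    return len(AGES) * len(GENDERS) * len(CAR_MAKES)
```

Even with only three small fields this comes to 63 × 2 × 3 = 378 requests, which is tedious for a person but trivial for a bot. It also suggests why an unusually systematic sweep of form values can be a useful signal for defenders.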

Database scraping can be used to steal intellectual property, price lists, customer lists, insurance pricing and other datasets that would require an effort prohibitively tedious for humans, but perfectly within the range of what bots routinely do.

Consider the case of a rental car agency. If a company created a bot that regularly checked its competitor’s prices and slightly undercut them at every price point, it would gain a competitive advantage. The lower price would appear on all aggregator sites that compare both companies, and would likely result in more car rental conversions and higher search engine rankings.

To deal with the threat that scraping poses to your business, it’s advisable to employ a solution that adequately detects, identifies and mitigates bots.

Not all web scraping is bad

Scraping isn’t always malicious. There are many cases where data owners want to propagate data to as many people as possible. For example, many government websites provide data for the general public. This data is frequently available over APIs, but because of the scale of work required to provide it that way, scrapers must sometimes be employed to gather the data.

Another example of legitimate scraping – which is often powered by bots – includes aggregation websites such as travel sites, hotel booking portals and concert ticket websites. Bots that distribute content from these sites obtain data through an API or by scraping, and tend to drive traffic to the data owners’ websites. In this case bots may function as a critical part of their business model.

Are bots legal?

According to Eric Goldman, a professor of law at Santa Clara University School of Law who writes about internet law:

Although scraping is ubiquitous, it’s not clearly legal. A variety of laws may apply to unauthorized scraping, including contract, copyright and trespass to chattels laws. (“Trespass to chattels” protects against unauthorized use of someone’s personal property, such as computer servers). The fact that so many laws restrict scraping means it is legally dubious.

Since, as we mentioned, scraping bots may also harm your business, it’s important to create an ecosystem that is bot-friendly but also able to block malicious automated clients. Website owners can significantly improve the security of their websites by blocking bad bots without excluding legitimate ones.

Four things you can do to detect and stop site scraping

Site scraping can be a powerful tool. In the right hands, it automates the gathering and dissemination of information. In the wrong hands, it can lead to theft of intellectual property or an unfair competitive edge.

Over the last two decades, bots have evolved from simple scripts with minimal capabilities to complex, intelligent programs that are sometimes able to convince websites and their security systems that they are humans.

We use the following process to classify automated clients and determine next steps.

[Figure: layer 7 DDoS client classification process]

You can use the following methods to classify and mitigate bots, including detecting scraping bots:

Use an analysis tool: You can identify and mitigate bots, including site scrapers, by using a static analysis tool that examines the structure of web requests and their header information. By correlating that information with what a bot claims to be, you can determine its true identity and block it as needed.

Employ a challenge-based approach: This approach is the next step in detecting a scraping bot. Use proactive web components to evaluate visitor behavior, for example whether the visitor supports cookies and JavaScript. You can also use scrambled imagery such as CAPTCHA, which can block some attacks.

Take a behavioral approach: A behavioral approach to bot mitigation is the next step. Here you can look at the activity associated with a particular bot to determine if it is what it claims to be. Most bots link themselves to a parent program like JavaScript, Internet Explorer or Chrome. If the bot behaves differently from the parent program, you can use the anomaly to detect and block the bot and to mitigate similar problems in the future.
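The static and behavioral checks described above can be sketched roughly as follows. This is a simplified illustration rather than how any particular product works: the function and class names are hypothetical, the rule that browsers send `Accept` and `Accept-Language` headers is an assumed heuristic, and the request-rate threshold is arbitrary.

```python
from collections import deque

# Substrings assumed to indicate that a client claims to be a browser.
BROWSER_TOKENS = ("Chrome", "Firefox", "MSIE", "Trident")


def claims_browser(user_agent):
    """Return True if the User-Agent string claims to be a common browser."""
    return any(token in user_agent for token in BROWSER_TOKENS)


def looks_inconsistent(headers):
    """Static check: do the headers contradict the claimed identity?

    `headers` is a plain dict; header names are assumed to be already
    normalized to the capitalization used below.
    """
    user_agent = headers.get("User-Agent", "")
    if not user_agent:
        return True  # no identity claimed at all
    if claims_browser(user_agent):
        # Assumed heuristic: a real browser sends both of these headers.
        return not ("Accept" in headers and "Accept-Language" in headers)
    return False


class RateTracker:
    """Behavioral check: flag clients that exceed a request rate."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._times = {}  # client id -> deque of request timestamps

    def is_too_fast(self, client_id, now):
        """Record a request at time `now` and report whether the client
        has made more than max_requests within the sliding window."""
        times = self._times.setdefault(client_id, deque())
        times.append(now)
        # Drop timestamps that have fallen out of the window.
        while times and now - times[0] > self.window_seconds:
            times.popleft()
        return len(times) > self.max_requests
```

In practice these signals would be combined with others, such as the challenge results mentioned above, before deciding to block a client, since any single heuristic can misclassify legitimate visitors.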

Using robots.txt

You can use robots.txt to shield your site from scraping bots, but it may not be effective in the long run. robots.txt works by telling a bot which parts of your site it is not welcome to visit. Since bad bots don’t adhere to rules, they will ignore these directives. In some situations, malicious bots will even look inside robots.txt for hidden gems (private folders, admin pages) that the site owner is trying to hide from Google’s index, and exploit them.
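The sketch below illustrates both points using Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleBot` name and the example.com URLs are hypothetical. A well-behaved bot consults the rules before fetching a page, but nothing enforces them, and the same file openly lists the paths the owner would rather keep quiet.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text the way a compliant crawler would."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def disallowed_paths(robots_txt):
    """List every path named in a Disallow line.

    Anyone can read robots.txt, so these paths are visible to bad bots too.
    """
    paths = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# A compliant bot asks before fetching; a bad bot simply skips this step.
rules = build_parser(ROBOTS_TXT)
may_fetch_admin = rules.can_fetch("ExampleBot", "https://example.com/admin/login")
may_fetch_home = rules.can_fetch("ExampleBot", "https://example.com/index.html")
```

Here `may_fetch_admin` is `False` and `may_fetch_home` is `True`, but only because the bot chose to ask. `disallowed_paths(ROBOTS_TXT)` returns the two protected paths, which is why sensitive areas need real access controls rather than a robots.txt entry.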

So it’s more important than ever that your bot defense solution can fully assess the impact of a specific bot before deciding whether or not to allow it to access your website. To see if your current solution is adequate, ask these questions: Does this automated client add value to your business or subtract from it? Is it driving traffic toward your website, or away from it? Answering these questions will help you determine which course to take to build bot detection and mitigation into your security systems.

