Today’s botnets take web scraping to a whole new and elusive level. “The reward-to-risk ratio in cybercrime is actually pretty good, and that’s why we're seeing an uptick in volume in web scraping,” says Morgan Gerhart, vice president of product marketing at cybersecurity company Imperva. “Generally 60 percent of traffic that hits a website is bots. Not all bot traffic is bad, but we know half of that is from somebody that is up to no good.” Random spikes in bot traffic reduce website performance and increase infrastructure costs, which in turn degrade the user experience.
What you can do
Web scraping is not going away, Bercovich says, but companies can take several steps to fight back. Botnets arrive fast and in large volumes, and usually slow systems down. “If they’re filtering at superhuman speeds, or paginating quickly, or never scroll down the page,” then it’s probably a bot, he says. Even after bots are detected and blocked, the fight is rarely over. “They’ll try a few variations to see if they can escape detection, but by then we’re totally on top of it,” Bercovich adds.
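The behavioral signals Bercovich describes — superhuman request speed, rapid pagination, never scrolling — can be sketched as a simple heuristic. This is a minimal illustration, not any vendor's actual detection logic; the `Session` fields, the thresholds, and the `looks_like_bot` function are all invented for the example.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical thresholds -- a real product would tune these from observed traffic.
MIN_HUMAN_INTERVAL = 0.5   # seconds between actions; faster looks automated
MAX_PAGES_PER_MIN = 30     # pagination rate beyond plausible human browsing

@dataclass
class Session:
    request_times: List[float] = field(default_factory=list)  # monotonic seconds
    pages_viewed: int = 0
    ever_scrolled: bool = False

def looks_like_bot(s: Session) -> bool:
    """Flag sessions whose behavior matches the signals described above."""
    intervals = [b - a for a, b in zip(s.request_times, s.request_times[1:])]
    # Actions arriving faster than any human could click
    superhuman = bool(intervals) and min(intervals) < MIN_HUMAN_INTERVAL
    # Paging through results far faster than a person would read them
    elapsed_min = (
        (s.request_times[-1] - s.request_times[0]) / 60
        if len(s.request_times) > 1 else 1.0
    )
    rapid_pagination = s.pages_viewed / max(elapsed_min, 1 / 60) > MAX_PAGES_PER_MIN
    # Many page views without a single scroll event
    never_scrolls = s.pages_viewed > 5 and not s.ever_scrolled
    return superhuman or rapid_pagination or never_scrolls
```

In practice these signals would be combined with scoring rather than a hard boolean, but the sketch shows why purely behavioral data can separate bots from browsers.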
A multi-layer defense is the best offense to combat web-scraping bots, Gerhart says.
Application-level intelligence
A product with application-level intelligence can look at traffic to determine if it’s a browser on the other end or a bot. “People who are good at [scraping] can make it look like a browser,” Gerhart says. “That’s when you get into more sophisticated behavior.”
Not all web scraping is inherently bad, Gerhart says, “but you don’t want web scraping traffic to interfere with other users.” On a per-connection basis, limit a user’s actions to no more than x number of actions in x amount of time, he says. “Even if you’re OK with scraping of your site, you may not want it at a rapid pace so it overwhelms your CPUs.”
Other deterrents include constantly changing HTML tags to deter repeated scraping attacks, and using fake web content, images, or links to catch site scrapers who republish the content, Gerhart says.
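The fake-link deterrent is often built as a honeypot: a link hidden from human visitors that only a scraper following every href will ever request. The path, markup, and `handle_request` function below are invented for illustration; the idea, not the specifics, is what matters.

```python
# Hypothetical honeypot endpoint -- humans never see the link, so any client
# that requests this path is almost certainly a scraper.
HONEYPOT_PATH = "/catalog-full-export"

# The trap link is hidden via CSS, so browsers never render it for users.
HIDDEN_LINK_HTML = (
    f'<a href="{HONEYPOT_PATH}" style="display:none" rel="nofollow">full catalog</a>'
)

_flagged_ips = set()

def handle_request(client_ip: str, path: str) -> int:
    """Return an HTTP status code; flag and block clients that hit the trap."""
    if path == HONEYPOT_PATH:
        _flagged_ips.add(client_ip)  # remember this client as a scraper
        return 403
    if client_ip in _flagged_ips:
        return 403  # previously trapped clients stay blocked
    return 200
```

Seeding pages with uniquely watermarked fake content works the same way in reverse: when the planted text or image reappears on another site, it identifies who scraped it.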
Safety in numbers
Some companies combat distributed botnets by partnering with large service providers that have exposure to a big portion of all of the requests on the Internet. They’re able to see attack patterns, collect those IP addresses and block them for all of their clients. Graphiq chose to outsource its bot protection to a provider with broader knowledge of scraping attacks.