Data Scraping 101

Everything you need to know about data/web scraping and how you could (perhaps) protect yourself...


Every time you post something on the Internet, there is a chance that someone will collect that data and use it against you, or use it in a way that causes you no direct harm but that you would still object to. For instance, someone may build their own content by scraping your work, or compile a collection of user profiles that includes your personal information.

This practice, called data scraping, is the subject of today’s article. Read on for details…

Data scraping 101

Data scraping means using software to record information that was meant for human eyes. Most commonly it takes the form of web scraping, where a program copies data from a web page, often while posing as a human visitor.
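To make that concrete, here is a minimal sketch of what "copying data from a web page" can look like in Python, using only the standard library. The page markup, the class names (`profile`, `name`, `city`) and the helper `scrape_profiles` are all invented for illustration; a real scraper would first download the page and would have to cope with much messier HTML.

```python
from html.parser import HTMLParser

# Hypothetical page markup, invented for this example.
SAMPLE_HTML = """
<html><body>
  <div class="profile"><span class="name">Alice Example</span>
    <span class="city">Springfield</span></div>
  <div class="profile"><span class="name">Bob Example</span>
    <span class="city">Shelbyville</span></div>
</body></html>
"""


class ProfileScraper(HTMLParser):
    """Collects the text of <span class="name"> and <span class="city">."""

    FIELDS = {"name", "city"}

    def __init__(self):
        super().__init__()
        self.records = []           # one dict per profile found
        self._current_field = None  # field whose text we are currently inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "profile":
            # A new profile block starts: open a fresh record.
            self.records.append({})
        elif tag == "span" and attrs.get("class") in self.FIELDS:
            self._current_field = attrs["class"]

    def handle_data(self, data):
        # Only keep text that sits inside one of the spans we care about.
        if self._current_field and self.records:
            self.records[-1][self._current_field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._current_field = None


def scrape_profiles(html):
    """Return a list of {'name': ..., 'city': ...} dicts found in the HTML."""
    parser = ProfileScraper()
    parser.feed(html)
    return parser.records


if __name__ == "__main__":
    for record in scrape_profiles(SAMPLE_HTML):
        print(record)
```

The point of the sketch is how little it takes: text laid out for a human reader becomes structured records that can be stored, merged and sold.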

Companies use web scraper software (or just “web scrapers”) for different purposes, some of which are perfectly legal. For instance, they could employ web scrapers to keep tabs on their competitors’ websites — scanning for updates, inventory and price changes.
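As a rough illustration of the competitor-monitoring case, the sketch below compares two scraped snapshots of a price list and reports what changed. The product names, the prices (in cents) and the function `diff_prices` are hypothetical; this is one simple way to do it, not a description of any particular product.

```python
def diff_prices(previous, current):
    """Compare two {item: price_in_cents} snapshots.

    Returns {item: (kind, old_price, new_price)} where kind is
    "added", "changed" or "removed".
    """
    changes = {}
    for item, new_price in current.items():
        old_price = previous.get(item)
        if old_price is None:
            changes[item] = ("added", None, new_price)
        elif old_price != new_price:
            changes[item] = ("changed", old_price, new_price)
    for item, old_price in previous.items():
        if item not in current:
            changes[item] = ("removed", old_price, None)
    return changes


if __name__ == "__main__":
    # Hypothetical snapshots scraped on two different days.
    yesterday = {"widget": 1999, "gadget": 4999, "gizmo": 999}
    today = {"widget": 1799, "gadget": 4999, "doohickey": 2500}
    for item, change in sorted(diff_prices(yesterday, today).items()):
        print(item, change)
```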

Another good example is travel websites, which scrape data from different airlines and hotels to show users price comparisons. Retailers also scrape Twitter and review sites like Yelp for sales leads.

Finally, even search engines use a form of data scraping, called web crawling, to build an index of the World Wide Web and make it searchable for users.
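Crawling and indexing can be sketched in a few lines. The example below does not touch the network: it walks a tiny, made-up "web" held in a dictionary, follows links from page to page, and builds a word-to-pages index that can then be searched. The URLs, page texts and the names `crawl_and_index` and `search` are assumptions for illustration; real search engines are vastly more elaborate.

```python
from collections import defaultdict, deque

# Hypothetical miniature "web": URL -> (page text, outgoing links).
TINY_WEB = {
    "a.example/": ("cheap flights to rome", ["a.example/hotels", "b.example/"]),
    "a.example/hotels": ("cheap hotels in rome", ["a.example/"]),
    "b.example/": ("flights and hotels compared", []),
    "c.example/": ("unreachable page", []),  # nothing links here
}


def crawl_and_index(start_url, web):
    """Breadth-first crawl from start_url, returning {word: set of URLs}."""
    index = defaultdict(set)
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url not in web:
            continue  # dead link: nothing to fetch
        text, links = web[url]
        for word in text.lower().split():
            index[word].add(url)
        for link in links:
            if link not in seen:  # visit each page only once
                seen.add(link)
                queue.append(link)
    return index


def search(index, word):
    """Return the sorted list of crawled URLs whose text contains word."""
    return sorted(index.get(word.lower(), set()))


if __name__ == "__main__":
    index = crawl_and_index("a.example/", TINY_WEB)
    print(search(index, "rome"))
    print(search(index, "flights"))
```

Note that the page nobody links to never makes it into the index, which is why crawlers only see what is reachable from pages they already know.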

So far, nothing wrong with that. The problem arises when someone uses web scraping to copy, en masse, the publicly available information of individuals on social media and then organizes it into large collections of data for sale.

As far as you (don’t) know, your profile could be included in that data collection.

More problems with data scraping

It’s not just individuals: companies operating web services also don’t like it when their users’ data is harvested with scraper software. Some do provide integration options in the form of APIs, but those are meant for legitimate developers.

That is why you see so many CAPTCHAs around the Internet, asking you to confirm that a human being is actually accessing the site. The technique is meant to stop automated software from scraping all the data, or from making other requests, such as sending a message through the website.

Which leads us to the logical question…

Is data scraping legal?

In theory, nothing’s wrong with web scraping. After all, we are talking about publicly available information. Whether it’s morally OK is less clear, and I guess it all depends on what someone is trying to accomplish.

But… It’s not that simple. Some services have terms of service that explicitly prohibit data scraping, and even then the consequences of violating them vary. For instance, small-scale scraping may simply get the user blocked from the service.

On the other hand, those who go overboard may even face legal action. That happened when eBay sued Bidder’s Edge, a service that aggregated auction data scraped from eBay and generated approximately 100,000 extra server requests per day. Other examples include Craigslist (v. Padmapper), QVC (v. Resultly), and LinkedIn (v. hiQ), setting more and more precedents for legal action against data scrapers.

What about privacy?

The problem with data scraping becomes personal when someone collects data from social media websites, where “you” (all of us) are the product. Here, the practice can be a real threat to personal privacy.

In case you wonder, this has already happened with Facebook, when personal data from more than 533 million users (including their phone numbers, email addresses, and full names) appeared on a hacking forum. Mind you, this wasn’t due to a hack: the data was simply scraped “thanks” to a loophole in Facebook’s contact import feature. We’re happy to say that the issue has since been addressed.

An even more controversial application of data scraping comes from a company called Clearview AI, which was co-founded by an Australian tech developer and an American former political aide. The technology the company has developed uses facial recognition to give police departments access to a database of over 3 billion photos of faces scraped from social media. It lets a police department submit a photo of a suspect’s face and get back every available post containing that face.

To prevent Clearview from getting your photo, if it doesn’t already have it, you’ll have to set your sharing settings to “private.” But since the company already holds a huge pile of photos, chances are it can already identify you. Talk about privacy intrusion.

What’s ahead for data scraping

We can’t say good days are ahead; quite the contrary. With machine learning algorithms getting better by the day, soon every tech-savvy government agency with proper access will be able to identify every one of us, unless we do something about it. We can vote differently, contact our representatives and so on, but chances are they can do little about it. Even if they do manage to move the needle, that still leaves tech giants like Google and Facebook with a ton of data about every one of us.

We can use a VPN to hide our digital footprints, but that’s not enough. We also have to stop oversharing on social media and elsewhere, all while hoping that some future legislation will put the government agencies and tech giants in their proper place and give power back to the people.