A Rundown On Web Crawling For The Buyside

by Dallán Ryan | Jul 19, 2020 | Blog

Background:

Multiple surveys and our own experience at Eagle Alpha highlight web crawled data as one of the most popular categories of alternative data. There are multiple reasons for this, including breadth of applications, ease of access and low price points. In this article, I will touch on what I consider to be the most important aspects of this alternative data category.

What is Web Crawled Data?

One of our monthly alpha workshops recently focused on the topic of web crawled data. Web crawling was defined as a means of aggregating data “via a computer program which requests information from public URLs. The data can be collected in-house or by companies that specialize in customized data collection”.
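As a rough illustration of that definition, the sketch below requests a public URL and pulls the links out of the returned HTML using only the Python standard library. The names `fetch_html` and `extract_links`, the user-agent string and the example URL are assumptions made for illustration; they do not come from the workshop.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag found in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))


def fetch_html(url, user_agent="example-research-bot/0.1"):
    """Request a public URL and return the response body as text."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_links(html, base_url):
    """Parse HTML and return the absolute URLs it links to."""
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical usage (performs a real network request):
#   page_url = "https://example.com/"
#   links = extract_links(fetch_html(page_url), page_url)
```

Keeping the request and the parsing in separate functions means the parsing logic can be checked against saved pages without touching the live site.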

For some time now, funds have been gathering data from websites including auto retailers, e-commerce sites, real estate listings, employment listings and online travel agencies. Web crawled data also constitutes a large portion of the data for other alternative data categories such as social media data, employment data, store location data, pricing data and ratings/reviews data.
In a 2019 survey by law firm Lowenstein Sandler, 49% of funds responded that they used web crawled data and 57% said they plan to use the data in the next 12 months. It’s worth noting that 67% of survey respondents used social media data, which is frequently collected via web crawling.
Analysis of aggregated and anonymized user click data from the Eagle Alpha platform reveals that the market share of datasets tagged as web crawling is declining, but the employment data and pricing data categories, which also utilize web crawling techniques, have seen an increase in market share.

The Value of Web Crawled Data:

In the same alpha workshop, we explored the most common use cases for web crawled data.

Figure 1 is based on data that has been scraped from Lululemon’s website. The dataset tracks key KPIs for the company’s online presence such as pricing, discounting and SKU count. These metrics are measured for We Made Too Much (inventory on sale), What’s New (new products), Bestsellers and aggregate level data for the Men’s and Women’s categories. The dataset has proven useful for tracking growth in the expanding men’s category and for measuring the influence of pricing and discounting on margins.
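To make these metrics concrete, here is a minimal sketch of how SKU count, average price and discounting could be computed from one day's scraped product records. The `Product` fields, the `category_kpis` function and the sample rows are hypothetical; the actual dataset's schema is not described in this article.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Product:
    sku: str
    category: str      # e.g. "Men" or "Women"
    list_price: float  # full price shown on the page
    sale_price: float  # price currently charged


def category_kpis(products, category):
    """Return SKU count, average price and discounting metrics for one category."""
    items = [p for p in products if p.category == category]
    if not items:
        return {"sku_count": 0}
    discounted = [p for p in items if p.sale_price < p.list_price]
    depth = mean(1 - p.sale_price / p.list_price for p in discounted) if discounted else 0.0
    return {
        "sku_count": len({p.sku for p in items}),
        "avg_sale_price": mean(p.sale_price for p in items),
        "share_discounted": len(discounted) / len(items),
        "avg_discount_depth": depth,
    }


# Made-up sample snapshot, for illustration only.
snapshot = [
    Product("M-001", "Men", 100.0, 100.0),
    Product("M-002", "Men", 80.0, 60.0),
    Product("W-001", "Women", 120.0, 120.0),
]
men = category_kpis(snapshot, "Men")
```

Running the same calculation on each daily snapshot would give the kind of time series that can be compared against reported category growth or margin commentary.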

Another common use for web crawled data is tracking online job listings. Figure 2 shows an analysis of a job listings dataset that looked at the hiring of legal personnel at some of the largest technology companies. The analysis showed that Facebook was hiring legal-related staff at a much higher rate than its peers. This proved to be insightful as Facebook mentioned on its December quarter conference call that “G&A grew 87% largely driven by higher legal fees and settlements”. Backing out a settlement charge of $550m, G&A still grew 31% YoY in the quarter.
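The two growth rates and the settlement charge quoted above are enough to back out the approximate G&A levels they imply. The short calculation below is only a consistency check on those rounded figures, not reported data, and the function name `implied_gna` is an assumption for illustration.

```python
def implied_gna(reported_growth, adjusted_growth, one_off_charge):
    """Solve for prior- and current-period G&A from two growth rates.

    current                  = prior * (1 + reported_growth)
    current - one_off_charge = prior * (1 + adjusted_growth)
    Subtracting the second from the first gives
    one_off_charge = prior * (reported_growth - adjusted_growth).
    """
    prior = one_off_charge / (reported_growth - adjusted_growth)
    current = prior * (1 + reported_growth)
    return prior, current


# Inputs from the text: 87% reported growth, 31% ex-settlement, $550m charge.
prior, current = implied_gna(0.87, 0.31, 550.0)  # figures in $m
print(f"prior G&A ~ {prior:.0f}m, current G&A ~ {current:.0f}m")
```

With these inputs the implied prior-year quarter is roughly $0.98bn and the current quarter roughly $1.84bn; because the percentages are rounded, the true figures may differ slightly.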

A final popular use case for web crawled data is social media analysis. Figure 3 shows social media mentions and sentiment data for online streaming platforms. The plot on the left shows social media mentions excluding Netflix. The data revealed that Apple TV+ garnered a short-lived bounce in consumer interest when it was announced in March of 2019. In contrast, Disney+ saw a much higher count of mentions for several months, an indication of consumer interest in the service. The social media data was indicative of subscriber growth for Apple TV+ and Disney+ when the companies updated investors on quarterly conference calls.

[Data Strategy clients can click here to access an archive of 25 web scraping case studies from our Alpha Center.]

Challenges When Working with Web Crawled Data:

In our web crawling alpha workshop, we also discussed the challenges of working with web crawled data. The two greatest challenges highlighted were history and legal considerations.

Challenge #1: Lack of History

When engaging in an internal web crawling project, we need to accept that historical data is typically not available. This is particularly challenging for e-commerce sites, where historical pricing and availability are important KPIs a user might track.
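Because history usually cannot be recovered after the fact, an internal project generally has to build its own by storing a dated snapshot on every crawl. The sketch below shows one way to do that with SQLite from the Python standard library; the table layout and the function names are assumptions for illustration.

```python
import sqlite3
from datetime import date


def open_store(path=":memory:"):
    """Open (or create) a small database that holds one row per SKU per day."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS price_history (
               snapshot_date TEXT NOT NULL,
               sku TEXT NOT NULL,
               price REAL,
               in_stock INTEGER,
               PRIMARY KEY (snapshot_date, sku)
           )"""
    )
    return conn


def record_snapshot(conn, rows, snapshot_date=None):
    """Store one crawl's (sku, price, in_stock) rows under the crawl date.

    Re-running on the same day overwrites that day's rows rather than
    duplicating them.
    """
    day = (snapshot_date or date.today()).isoformat()
    conn.executemany(
        "INSERT OR REPLACE INTO price_history VALUES (?, ?, ?, ?)",
        [(day, sku, price, int(in_stock)) for sku, price, in_stock in rows],
    )
    conn.commit()


def price_series(conn, sku):
    """Return the accumulated (date, price) history for one SKU."""
    cur = conn.execute(
        "SELECT snapshot_date, price FROM price_history "
        "WHERE sku = ? ORDER BY snapshot_date",
        (sku,),
    )
    return cur.fetchall()
```

The history only starts on the day the first snapshot is recorded, which is why the go-forward nature of web crawled data matters when planning a project.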

There are some databases provided by organisations such as https://archive.org/ or https://commoncrawl.org/ but coverage is typically not sufficient for an investment application. Some sites will include historical data, most notably forums or rating and reviews sites where historical posts are available and clearly time-stamped.
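One quick way to gauge coverage for a given page is the Internet Archive's Wayback Machine availability endpoint, which returns the archived snapshot closest to a requested date. The sketch below builds the query and reads the JSON response; the helper names are assumptions, and the response layout reflects the public API as I understand it, so it should be checked against the current documentation before being relied on.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

WAYBACK_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the query URL; timestamp is a string such as YYYYMMDD."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{WAYBACK_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(payload):
    """Return (timestamp, snapshot_url) from a response, or None if absent."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["timestamp"], closest["url"]
    return None


def lookup(page_url, timestamp=None):
    """Query the endpoint over the network and return the closest snapshot."""
    with urlopen(availability_url(page_url, timestamp), timeout=10) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `lookup` for a sample of product pages across a range of dates gives a rough picture of how sparse the archived history is before committing to it as a data source.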

An alternative to an internal web crawling project is to work with a third-party provider. Frequently these will have historical data, but typically only for very niche applications. One example is employment data, as highlighted earlier. Where web crawling providers do not have historical data, they will most likely be able to help on a go-forward basis.

[Data Strategy clients can click here to access the full archive of alpha workshops.]

Next, we’ll address the second challenge of working with web crawled data – legal considerations.

Challenge #2: Legal Considerations

Web crawling is one of the most common topics raised in our monthly legal workshops, so much so that we dedicated an entire session to the topic.

In that workshop, Peter Greene from Lowenstein Sandler placed particular emphasis on the question of whether data on the web is public data. He concluded that it can be argued that web crawled data is in the public domain, as long as you don’t need to enter a password to view the information. As long as it’s considered public, the legal analysis takes you out of the insider trading realm in Peter’s opinion.

Lowenstein Sandler draws the line on password-protected content. Data from a section of a website that is behind a password is not public data in their opinion.

It is also important to note that the website operators have a lot of tools they can use to block someone from scraping. Lowenstein Sandler doesn’t recommend clients make efforts to circumvent these obstacles.
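One widely used way for site operators to state their crawling preferences is a robots.txt file, which a crawler can check before requesting a page. The workshop did not discuss robots.txt specifically, so treat the sketch below as one illustrative example of respecting an operator's signals rather than working around them; the rules and user-agent name are made up.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /account/
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(rules, url, user_agent="example-research-bot"):
    """Return True if the rules permit this user agent to fetch the URL."""
    return rules.can_fetch(user_agent, url)


rules = build_rules(SAMPLE_ROBOTS)
print(allowed(rules, "https://example.com/products/1"))
print(allowed(rules, "https://example.com/account/orders"))
```

A crawler built this way simply skips any URL for which `allowed` returns False. A robots.txt check is not a legal safe harbour, and it does not replace the review process described later in this section.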

One of the highest-profile cases involving web crawling is between LinkedIn and a company called HiQ.

HiQ’s business is based on working with corporations with respect to the job moves of employees. LinkedIn profiles had been the primary source of its data, and HiQ would search the entirety of LinkedIn’s database. However, HiQ received a cease and desist letter from LinkedIn in May 2017. HiQ complied and started to scrape only publicly available data.

LinkedIn then decided to prevent any kind of scraping – even of public information – and put technological barriers in place.

In June 2017, HiQ commenced an action for an injunction to allow it to continue to scrape public profiles. The United States District Court for the Northern District of California agreed with HiQ. LinkedIn appealed that decision to the United States Court of Appeals for the Ninth Circuit. On September 9, 2019, the Ninth Circuit rejected LinkedIn’s effort to stop HiQ from using information crawled from LinkedIn’s website.

Most observers have taken the rulings in the HiQ vs. LinkedIn case as evidence that web crawling is legal. We have written multiple articles on the case and we even dedicated an entire legal workshop to the topic.

It’s also worth noting that regulations regarding web scraping vary by region. For instance, in the past, we published an article discussing guidance on web crawling from The National Commission on Informatics and Liberty (CNIL), a French regulatory body. The guidance indicates that even if individual contact details are collected from public posts, it doesn’t mean that individuals were expecting their data to be harvested for “prospecting”. Therefore, the CNIL treats these public posts as personal data which cannot be used without consent as specified under the GDPR.

Peter Greene highlighted what he suggests to his clients who engage in web crawling:

Develop a one-pager scraping permission sheet.
Carefully negotiate the agreements with scrapers and crawlers, including the reps and the data provenance.

This process will serve as evidence to a regulator that a firm took the necessary steps when engaging in web crawling.

[Data Strategy clients can click here to access the archive of legal workshops and click here to access our archive of legal articles.]

Conclusion:

Web crawled data has consistently ranked as one of the most popular categories of alternative data due to its broad applications, ease of use and relatively low cost. Although datasets tagged as web crawling have been losing share of clicks on Eagle Alpha’s platform, other datasets that rely on web crawled data, such as employment data and pricing data, are gaining share. The lack of historical data can sometimes be overcome through public databases or niche datasets from specialist vendors. The major legal consideration is whether the data is public, and expert opinion suggests that web data is public as long as it’s not behind a password.

[If you would like to learn more about any of the data sources mentioned or how Eagle Alpha is helping clients in the current environment then please contact us at inquiries@eaglealpha.com]
