
August 19, 2019

Building an Attack-Focused Web Scraper

By: Tell Hause


Why Hackers Need a Different Kind of Web Scraper

Humans are generally poor number parsers; we are much better suited to processing pictures and language. This is clear from solutions we have engineered, like DNS, and from the number of visually striking websites on the modern web. Hackers are no different. Let’s take a look at the difference between 1990s Yahoo and 2019’s Yahoo. These images are very effective in eliciting an attacker’s reaction of “This is going to be fun” and, conversely, the defender’s “Oh, no. That’s on my perimeter”. To make this possible at scale, we had to build a web scraper designed with hackers in mind.

Unfortunately, gathering all the information needed to come to these conclusions is non-trivial. Those feelings can come from a page being rendered without CSS, a default landing page, or something else that is obvious to a human but difficult for a computer. These ideas spurred the design and functionality of our reconnaissance web scraper, which is both a screenshot collector and an HTTP session recorder.

Validating the data

Fundamentally, web scraping is a trivial task. Opening a headless browser, navigating to a website, and rendering an image are all solved and documented problems. But solving this problem in a scalable way involves more than might be clear from a first implementation. In Randori’s case, we are concerned with things like attribution and fingerprinting, which invariably make the engineering more complicated than a simple approach. Our scrapers vary browser versions, display sizes, and the location of the computers taking the screenshots within reasonable bounds, which wouldn’t be a reasonable ask for most individuals.
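To make the idea of reasonable variance concrete, here is a minimal sketch of picking a different profile for each visit. The pools, the `ScrapeProfile` class, and `pick_profile` are hypothetical names with placeholder values chosen for illustration; they are not taken from Randori's scraper.

```python
import random
from dataclasses import dataclass

# Hypothetical pools: placeholder values for illustration only.
BROWSER_VERSIONS = ["75.0", "76.0", "77.0"]
VIEWPORTS = [(1366, 768), (1440, 900), (1920, 1080)]
REGIONS = ["us-east", "eu-west", "ap-south"]


@dataclass(frozen=True)
class ScrapeProfile:
    browser_version: str
    viewport: tuple
    region: str


def pick_profile(rng: random.Random) -> ScrapeProfile:
    """Choose one value from each pool so successive visits do not look identical."""
    return ScrapeProfile(
        browser_version=rng.choice(BROWSER_VERSIONS),
        viewport=rng.choice(VIEWPORTS),
        region=rng.choice(REGIONS),
    )
```

Passing in a seeded `random.Random` keeps a run reproducible for debugging, while an unseeded one gives a fresh profile on every visit.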

One concern in collecting these data is whether the data returned can be trusted. For example, if we reach out to a hostname multiple times and the owner has set up load balancing, the browser can be routed to many different IP addresses, resulting in different data each time. This behavior differs from hitting the IP and port hosting the content directly. These routing concerns become more complex when considering the ways internet traffic can be redirected. It behooved us to collect this routing information in transit and use it later to determine whether our request was routed to something we, as attackers, find interesting. By setting up trusted certs in our browser and proxying the connection on our collector, effectively man-in-the-middling ourselves, we can save the routing information, images, JavaScript, and full web pages in transit, then use this information after collection to ensure that our results correspond to our intended target.
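One way to use recorded routing information after collection is to replay the redirect chain and check where the session ended up. The sketch below assumes the proxy log has already been reduced to a list of `(status, location)` pairs in the order they were seen; that record shape and both function names are assumptions made for illustration.

```python
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def final_url(requested_url, hops):
    """Follow recorded (status, location) pairs and return the last URL reached."""
    current = requested_url
    for status, location in hops:
        if status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URL.
            current = urljoin(current, location)
        else:
            # First non-redirect response: this is where the content came from.
            break
    return current


def stayed_on_target(requested_url, hops):
    """True when the final response came from the hostname that was requested."""
    wanted = urlsplit(requested_url).hostname
    reached = urlsplit(final_url(requested_url, hops)).hostname
    return wanted == reached
```

For example, a session recorded as `[(302, "https://sso.example.net/auth"), (200, None)]` for `http://example.com/` would be flagged as having left the target host, so its screenshot should not be attributed to `example.com` without a closer look.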

On a more subtle level, we had to consider what a screenshot should contain. Extending a screenshot to a full webpage’s output can get complicated, but it saves the scraper redundant visits to a page. Modern websites often dynamically load new content and add it to the DOM when a user reaches the end of the loaded content. Many elements on a webpage can remain static, or behave differently in different contexts. Modern web frameworks, like React, don’t provide any guarantee that all the elements on a page have finished loading. These are just a few of the many challenges we faced in implementing this design in our scraper.

On a practical level, we resolved this by acting like a regular user would: scrolling through the page, then stitching together the various renders of the screen into one image. This is done by collecting the page’s initial height with JavaScript, taking a screenshot of the webpage, removing all static DOM elements with JavaScript, scrolling to the next location, and collecting a new image. Because we know the initial height and our window height, we can calculate how many renders we will need to stitch together. The goal is to blend in with the noise of the internet, preventing any possible observer from detecting the scraper’s traffic.
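The render count described above is a ceiling division of the page height by the window height. The following sketch, with a hypothetical `scroll_offsets` helper, also returns the scroll position for each capture. It assumes heights are in pixels and that the browser cannot scroll past the bottom of the page, so the last position is clamped.

```python
def scroll_offsets(page_height, window_height):
    """Return the vertical scroll position for each screenshot to capture."""
    if page_height <= 0 or window_height <= 0:
        raise ValueError("heights must be positive")
    # Ceiling division: a partial final window still needs its own render.
    count = (page_height + window_height - 1) // window_height
    # The browser cannot scroll beyond this point.
    max_scroll = max(0, page_height - window_height)
    return [min(i * window_height, max_scroll) for i in range(count)]
```

With a 2500 px page and a 1000 px window this yields three captures at 0, 1000, and 1500. The last capture overlaps the previous one by 500 px, which the stitching step would need to crop.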

Having the complete image also allows more flexibility when presenting these results to the end user. Rendering a single shot with the static elements in a standard 16:9 ratio, or squaring the first two renders and scaling them down for a thumbnail can all be performed depending on your design constraints.
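As a rough illustration of these presentation choices, the sketch below computes a crop box anchored at the top of the stitched image for a given aspect ratio. The `top_crop_box` helper is hypothetical; it returns a `(left, upper, right, lower)` tuple, the same ordering Pillow's `Image.crop` accepts, and it narrows the crop when the image is too short for a full-width box.

```python
def top_crop_box(image_width, image_height, ratio_w, ratio_h):
    """Return a (left, upper, right, lower) box with the requested aspect ratio."""
    width = image_width
    height = image_width * ratio_h // ratio_w
    if height > image_height:
        # Image is too short for a full-width crop: keep the ratio, lose width.
        height = image_height
        width = image_height * ratio_w // ratio_h
    return (0, 0, width, height)
```

On a stitched image 1920 px wide and 5000 px tall, a 16:9 request gives `(0, 0, 1920, 1080)` and a 1:1 request gives `(0, 0, 1920, 1920)`, which can then be scaled down for a thumbnail.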

Screenshots aside, the scraper also uses an HTTP proxy to capture everything the browser does during the HTTP session. This allows for detailed programmatic analysis of the HTML, JavaScript, HTTP headers, CSS, cookies, or any other information passed. The HTTP proxy also allows us to determine whether the browser followed any redirects and to ensure the data was collected from the requested location.
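A small example of the kind of programmatic analysis a recorded session makes possible: pulling software banners and cookie names out of response headers. This sketch assumes headers are available as a list of `(name, value)` pairs; the `summarize_headers` function and the choice of which headers count as interesting are illustrative assumptions.

```python
SOFTWARE_HEADERS = {"server", "x-powered-by"}


def summarize_headers(headers):
    """Collect software banners and cookie names from (name, value) header pairs."""
    summary = {"software": [], "cookies": []}
    for name, value in headers:
        key = name.lower()  # header names are case-insensitive
        if key in SOFTWARE_HEADERS:
            summary["software"].append(value)
        elif key == "set-cookie":
            # The cookie name is everything before the first '='.
            summary["cookies"].append(value.split("=", 1)[0].strip())
    return summary
```

A cookie name such as `JSESSIONID` or a banner such as `PHP/7.2` is often enough to suggest what is running behind a page before anyone looks at the screenshot.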

Having all of these pieces working together on a single scraper allows an automated system to collect a wealth of information that gives an adversary what is needed to begin planning attacks defenders don’t expect. This information only becomes more powerful in aggregate, where usage information for specific services can be gleaned easily, giving attackers even more insight into what the juiciest targets are.
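To illustrate the aggregate view, the sketch below counts how often each server product appears across a set of hosts. It assumes a mapping from hostname to a recorded `Server` banner; the `service_usage` name and that input shape are hypothetical.

```python
from collections import Counter


def service_usage(server_by_host):
    """Count server products across hosts, ignoring version numbers."""
    products = Counter()
    for banner in server_by_host.values():
        # "nginx/1.14.0" -> "nginx"; banners without a version are kept whole.
        products[banner.split("/", 1)[0].strip()] += 1
    return products
```

Calling `service_usage(...).most_common(3)` on a full perimeter scan would surface the most widely deployed products, which is one simple way to decide where to look first.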


To sum it up

Data has become the basis of all decision-making, whether it is a business or a security decision. When building an attack-focused scraper, one must be prepared to deal with extra levels of complexity in remaining invisible to defenders’ eyes. As businesses become increasingly dependent on data, and security resources become even more strained for time, it is now essential for any security professional, attacker or defender, to have access to the most complete data available.
