Why Hackers Need A Different Kind of Web Scraper
Humans are generally poor number parsers; we are much better suited to processing pictures and language. This is clear from our engineering of solutions like DNS, and from the number of visually rich websites on the modern web. Hackers are no different. Take the difference between 1990s Yahoo and 2019’s Yahoo. These images are very effective at eliciting the attacker’s reaction of “This is going to be fun” and, conversely, the defender’s “Oh, no. That’s on my perimeter.” To make this possible at scale, we had to build a web scraper designed with hackers in mind.
Unfortunately, gathering all the information needed to reach these conclusions is non-trivial. Those feelings can come from a page being rendered without CSS, a default landing page, or something else obvious to a human but difficult for a computer. These ideas spurred the design and functionality of our reconnaissance web scraper, which is both a screenshot collector and an HTTP session recorder.
Validating the data
Fundamentally, web scraping is a trivial task. Opening a headless browser, navigating to a website, and rendering an image are all solved and documented problems. But solving the problem at scale involves issues that might not be clear in a first implementation of a program like this. In Randori’s case, we are concerned with things like attribution and fingerprinting, which invariably make the engineering more complicated than a naive approach. Providing reasonable variance in browser versions, display sizes, and the locations of the machines taking the screenshots is built into our scrapers; these wouldn’t be reasonable asks for most individuals.
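As a rough sketch of what that variance might look like, each scrape job could draw its fingerprint from pools of plausible values. The pools and the `random_profile` helper below are illustrative assumptions, not our actual configuration:

```python
import random

# Hypothetical pools of fingerprint values -- illustrative only.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/12.1.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:66.0) Gecko/20100101 Firefox/66.0",
]
VIEWPORTS = [(1366, 768), (1440, 900), (1920, 1080)]

def random_profile(rng=random):
    """Pick a browser fingerprint for a single scrape job."""
    width, height = rng.choice(VIEWPORTS)
    return {
        "user_agent": rng.choice(USER_AGENTS),
        "width": width,
        "height": height,
    }
```

A real deployment would also rotate egress IPs across regions, which sits outside the browser process entirely and is handled at the infrastructure level.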
On a more subtle level, we had to consider what a screenshot should contain. Extending a screenshot to a full webpage’s output can get complicated, but it saves the scraper redundant visits to a page. Modern websites often dynamically load new content and apply it to the DOM when a user reaches the end of the loaded content. Many elements on a webpage remain static, or behave differently in different contexts. Modern web frameworks, like React, don’t provide any guarantees that all the elements on a page have finished loading. These are just a few of the many challenges we faced in implementing this design in our scraper.
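One common way to handle that lazy loading is to scroll to the bottom repeatedly until the document height stops growing. The `scroll_until_stable` loop below is a hypothetical sketch of that idea, with `FakePage` standing in for a real headless-browser handle:

```python
def scroll_until_stable(page, max_rounds=10):
    """Scroll to the bottom until the document height stops growing,
    so lazily loaded content is present before the screenshot is taken.

    `page` is any object exposing height() and scroll_to_bottom() --
    a stand-in for a real headless-browser handle.
    """
    last_height = page.height()
    for _ in range(max_rounds):
        page.scroll_to_bottom()
        new_height = page.height()
        if new_height == last_height:
            break  # nothing new was loaded; the page is stable
        last_height = new_height
    return last_height

class FakePage:
    """Simulates a page that lazy-loads two extra screens of content."""
    def __init__(self):
        self._height = 1080
        self._loads_left = 2

    def height(self):
        return self._height

    def scroll_to_bottom(self):
        if self._loads_left:
            self._height += 1080
            self._loads_left -= 1
```

The `max_rounds` cap matters in practice: some infinite-scroll pages never stabilize, and the scraper has to give up and capture what it has.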
Having the complete image also allows more flexibility when presenting results to the end user. Rendering a single shot of the static elements in a standard 16:9 ratio, or squaring the first two renders and scaling them down for a thumbnail, can all be done depending on your design constraints.
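The arithmetic behind those presentations is simple crop geometry. The helpers below are hypothetical illustrations of a top-anchored 16:9 crop and a centered square crop over a full-page capture:

```python
def fit_16_9(width, height):
    """Dimensions of the largest 16:9 window anchored at the top of a
    full-page capture (the single shot of the static elements)."""
    target_height = width * 9 // 16
    return width, min(height, target_height)

def square_crop(width, height):
    """Bounding box (left, top, right, bottom) of the largest centered
    square that fits in the capture, ready to scale to a thumbnail."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side
```

Either box can then be handed to an image library to crop and downscale; the full-page capture is only taken once.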
Having all of these pieces working together on a single scraper allows an automated system to collect a wealth of information that gives an adversary what they need to begin planning attacks defenders don’t expect. This information only becomes more powerful in aggregate, where usage information for specific services can be gleaned easily, giving attackers even more insight into what the juiciest targets are.
To sum it up
Data has become the basis of all decision-making processes, whether the decision is about business or security. When building an attack-focused scraper, one must be prepared to deal with extra levels of complexity to remain invisible to defenders’ eyes. As businesses become increasingly dependent on data, and security teams become ever more strained for time, it is now essential for any security professional to have access to the most complete data available, from both the attacker’s and the defender’s perspectives.