Lookyloo is the glue that puts together and sometimes repurposes a bunch of tools developed by others.
Redis is the database used to store the cached information visualized on the web interface.
To learn more about Splash, see the official website.
In Lookyloo, Splash loads a web page in a similar fashion to a web browser. When the page is done loading, it scrolls down and waits a little bit longer. When it is done, it returns an HTTP Archive (HAR), along with:
the page rendered at the end of the capture (that would be the same as what you would see when viewing the source of the page in your browser)
a screenshot of the whole page
all the cookies that were received and created during the capture
the URL in the address bar of the browser
The capture is controlled by a Lua script.
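As a rough illustration of the capture steps described above, the sketch below embeds a small Lua script in a JSON payload for Splash's `/execute` endpoint. The script is hypothetical and is not the one shipped with Lookyloo; the names `LUA_CAPTURE`, `build_execute_payload`, and `last_url` are assumptions made for this example.

```python
import json

# Hypothetical Lua script illustrating the capture steps described above.
# It is NOT the script used by Lookyloo.
LUA_CAPTURE = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  -- Scroll to the bottom of the page, then wait a little longer.
  splash:runjs("window.scrollTo(0, document.body.scrollHeight);")
  assert(splash:wait(args.wait))
  return {
    har = splash:har(),
    html = splash:html(),
    png = splash:png{render_all=true},
    cookies = splash:get_cookies(),
    last_url = splash:url(),
  }
end
"""


def build_execute_payload(url: str, wait: float = 2.0) -> str:
    """Build the JSON body for a POST request to Splash's /execute endpoint.

    Extra keys such as `url` and `wait` are exposed to the Lua script
    through `args`.
    """
    return json.dumps({"lua_source": LUA_CAPTURE, "url": url, "wait": wait})
```

The payload would be sent with a `Content-Type: application/json` header to a running Splash instance; sending it is left out here so the sketch has no network dependency.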
In order to extract information from the page and instrument the capture itself, we use Scrapy:
Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
We use Scrapy along with a lightweight connector called scrapy-splash.
The whole system is bundled in a home-made standalone wrapper called ScrapySplashWrapper.
The core feature of Lookyloo is the visualization of what is happening in the browser when you load a page. Our approach was to use a tree, and who does trees better than researchers working in phylogenetics and genomics?
This is the reason we used ETE Toolkit:
ETE (Environment for Tree Exploration) is a Python programming toolkit that assists in the automated manipulation, analysis and visualization of phylogenetic trees. Clustering trees or any other tree-like data structure are also supported.
To create that tree, we use the HAR output generated by Splash, and pass it to another standalone library, har2tree.
We won’t go into too much detail about the process of building a tree out of a HAR file, but you can read the comments directly in the code.
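To give a feel for the general idea, here is a deliberately simplified sketch that attaches each request in a HAR file to the URL named in its `Referer` header. This is an assumption-laden toy, not har2tree's actual algorithm, which has to handle redirects, iframes, and many other cases; `build_tree` is a hypothetical name.

```python
from collections import defaultdict


def build_tree(har: dict) -> dict:
    """Nest HAR entries under the URL that referred to them (toy version)."""
    entries = har["log"]["entries"]
    children = defaultdict(list)  # referer URL (or None) -> requested URLs
    seen = set()
    for entry in entries:
        url = entry["request"]["url"]
        if url in seen:
            continue  # keep only the first request for a given URL
        seen.add(url)
        headers = {h["name"].lower(): h["value"]
                   for h in entry["request"]["headers"]}
        children[headers.get("referer")].append(url)

    def node(url: str, ancestors: frozenset) -> dict:
        path = ancestors | {url}
        # Skip URLs already on the current path to avoid infinite recursion.
        kids = [node(c, path) for c in children.get(url, []) if c not in path]
        return {"url": url, "children": kids}

    # Assume the first request without a referer is the page that was loaded.
    roots = children.get(None, [])
    root_url = roots[0] if roots else entries[0]["request"]["url"]
    return node(root_url, frozenset())
```

Requests whose referer never appears in the capture are silently dropped by this sketch; a real implementation needs a strategy for such orphans.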
We use D3.js to render the tree and implemented a compatible JSON output in har2tree.
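As an illustration of what a D3-friendly JSON output can look like, the sketch below serializes a tree into nested objects with a `children` array, which is the shape `d3.hierarchy` reads by default. The `Node` class and `to_d3` function are hypothetical and do not reflect har2tree's real classes or field names.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node; har2tree's real nodes carry far more data."""
    name: str
    children: list["Node"] = field(default_factory=list)


def to_d3(node: Node) -> dict:
    """Convert a node into nested dicts with a `children` list."""
    out = {"name": node.name}
    if node.children:
        out["children"] = [to_d3(child) for child in node.children]
    return out


tree = Node("https://a.test/", [Node("https://a.test/app.js"),
                                Node("https://cdn.test/lib.js")])
d3_json = json.dumps(to_d3(tree))  # string ready to hand to the front end
```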
We’re not re-using the SRIs for a pretty silly reason: many resources used on websites are the same as the ones in CDNJS, but with an extra newline at the end of the file. SaneJS computes hashes for every file with and without a newline, allowing us to match more resources.
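The newline trick can be sketched in a few lines. The example below hashes a resource both with and without a single trailing newline, so either variant matches the same set. The use of SHA-512 and the name `candidate_hashes` are assumptions for this illustration, not a description of SaneJS internals.

```python
import hashlib


def candidate_hashes(content: bytes) -> set[str]:
    """Return the hashes of `content` without and with one trailing newline.

    SHA-512 is an assumption made for this sketch.
    """
    base = content[:-1] if content.endswith(b"\n") else content
    return {
        hashlib.sha512(base).hexdigest(),
        hashlib.sha512(base + b"\n").hexdigest(),
    }
```

A lookup would then succeed if any hash of the captured resource appears among the hashes stored for a known library file.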
VirusTotal allows users to submit and query URLs in their huge dataset. Lookyloo does exactly that and tells the user whether the content is flagged as malicious.
Phishing Initiative allows users to submit and query URLs to identify malicious content. As it requires an API key, the feature is disabled by default.
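For readers curious about what such a lookup involves, the sketch below builds the report endpoint for a URL under version 3 of the VirusTotal API, where a URL can be identified by its unpadded URL-safe base64 encoding. Whether Lookyloo uses this exact API version or identifier is an assumption; `vt_url_id` and `vt_report_endpoint` are hypothetical helper names.

```python
import base64

# Base endpoint for URL reports in the VirusTotal v3 API.
VT_API = "https://www.virustotal.com/api/v3/urls/"


def vt_url_id(url: str) -> str:
    """Encode a URL as unpadded URL-safe base64, one accepted v3 identifier."""
    return base64.urlsafe_b64encode(url.encode()).decode().rstrip("=")


def vt_report_endpoint(url: str) -> str:
    """Return the endpoint to GET for the report on `url`."""
    return VT_API + vt_url_id(url)
```

An actual request would also need the API key in the `x-apikey` header; it is omitted here so the example runs without credentials or network access.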
To facilitate the deployment of Lookyloo, we use Poetry:
Poetry helps you declare, manage, and install dependencies of Python projects, ensuring you have the right stack everywhere.