Implementation Details

Lookyloo is the glue that puts together and sometimes repurposes a bunch of tools developed by others.

Backend

Redis

Redis is the database used to store the cached information visualized on the web interface.

Splash and Scrapy

Splash is a JavaScript rendering service. It is a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and Qt 5. The (Twisted) Qt reactor is used to make the service fully asynchronous, allowing users to take advantage of WebKit concurrency via the Qt main loop.

To learn more about Splash, see the documentation on the official website.

In Lookyloo, Splash loads a web page in a similar fashion to a web browser. When the page is done loading, it scrolls down and waits a little bit longer. When it is done, it returns an HTTP Archive (HAR), along with:

  • the page rendered at the end of the capture (that would be the same as what you would see when viewing the source of the page in your browser)

  • a screenshot of the whole page

  • all the cookies that were received and created during the capture

  • the URL in the address bar of the browser

The capture is controlled by a Lua script.
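As a rough illustration of the outputs listed above, the sketch below builds a request URL for Splash's `render.json` endpoint asking for the HAR, the rendered HTML and a full-page screenshot. This is not how Lookyloo performs its captures (it relies on a Lua script, which is also what gives it access to cookies); the function name and the `http://localhost:8050` address are assumptions made for the example.

```python
from urllib.parse import urlencode


def build_splash_render_url(target_url, splash_base="http://localhost:8050", wait=2.0):
    """Build a render.json URL (hypothetical helper, not part of Lookyloo).

    splash_base is an assumed address for a locally running Splash instance.
    """
    params = {
        "url": target_url,  # page to load
        "har": 1,           # include the HTTP Archive in the response
        "html": 1,          # include the rendered page source
        "png": 1,           # include a screenshot
        "render_all": 1,    # screenshot the whole page; needs a non-zero wait
        "wait": wait,       # seconds to wait after the page has loaded
    }
    return f"{splash_base}/render.json?{urlencode(params)}"


if __name__ == "__main__":
    print(build_splash_render_url("https://www.example.com"))
```

Sending a GET request to the resulting URL should return a JSON document containing the requested fields, provided a Splash instance is listening at that address.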

In order to extract information from the page and instrument the capture itself, we use Scrapy:

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

We use Scrapy along with a lightweight connector called scrapy-splash.

The whole system is bundled in a home-made standalone wrapper called ScrapySplashWrapper.

ETE Toolkit & har2tree

The core feature of Lookyloo is the visualization of what is happening in the browser when you load a page. Our approach was to use a tree, and who does trees better than researchers working in phylogenetics and genomics?

This is the reason we used ETE Toolkit:

ETE (Environment for Tree Exploration) is a Python programming toolkit that assists in the automated manipulation, analysis and visualization of phylogenetic trees. Clustering trees or any other tree-like data structure are also supported.

To create that tree, we use the HAR output generated by Splash, and pass it to another standalone library, har2tree.

We won’t go into too much detail about the process of building a tree out of a HAR file, but you can read the comments directly in the code.
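To give an intuition of what building a tree out of a HAR file can look like, here is a deliberately simplified sketch that links each request to its parent using the `Referer` header. It is not the har2tree algorithm, which handles many more cases (redirects, for instance); the function name and the sample data are made up for the example.

```python
def build_tree(har):
    """Return (root_url, children), where children maps a URL to the URLs it triggered.

    Simplified illustration only: the parent of a request is guessed from its
    Referer header, and anything without a known referer hangs off the root.
    """
    children = {}
    root = None
    for entry in har["log"]["entries"]:
        request = entry["request"]
        url = request["url"]
        referer = next(
            (h["value"] for h in request.get("headers", [])
             if h["name"].lower() == "referer"),
            None,
        )
        if root is None:
            # The first entry is treated as the page the user asked for.
            root = url
            children.setdefault(url, [])
            continue
        parent = referer if referer in children and referer != url else root
        children.setdefault(url, [])
        children[parent].append(url)
    return root, children


# Made-up HAR fragment, reduced to the fields used above.
SAMPLE_HAR = {"log": {"entries": [
    {"request": {"url": "https://example.com/", "headers": []}},
    {"request": {"url": "https://example.com/style.css",
                 "headers": [{"name": "Referer", "value": "https://example.com/"}]}},
    {"request": {"url": "https://example.com/font.woff",
                 "headers": [{"name": "Referer", "value": "https://example.com/style.css"}]}},
]}}

if __name__ == "__main__":
    root, children = build_tree(SAMPLE_HAR)
    print(root, children)
```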

Frontend

D3.js

Initially, we used the webplugin developed by ETE Toolkit, but it turned out to be too limited for what we wanted to do, so we switched to D3.js:

D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG, and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.

To feed D3.js, we implemented a compatible JSON output in har2tree.
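The general shape of such an output can be sketched as follows: a nested structure where each node lists its children under a `children` key, which is what `d3.hierarchy` reads by default. The actual JSON produced by har2tree carries far more information per node; the `name` key, the function name and the sample data are assumptions made for the example.

```python
import json


def to_d3(url, children):
    """Turn a {url: [child urls]} mapping into a nested dict (hypothetical helper)."""
    return {
        "name": url,
        "children": [to_d3(child, children) for child in children.get(url, [])],
    }


# Made-up parent/child relations between requests.
CHILDREN = {
    "https://example.com/": ["https://example.com/style.css"],
    "https://example.com/style.css": ["https://example.com/font.woff"],
}

if __name__ == "__main__":
    print(json.dumps(to_d3("https://example.com/", CHILDREN), indent=2))
```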

Flask & Bootstrap

For the web interface, we use Flask and Bootstrap 4.

Everything related to the website is in a subdirectory in the Lookyloo repository.

Modules and Third-Party Components

In order to give more context about the URLs given by the users, we use some third-party services. Some are used to match received content with known resources (JavaScript libraries, CSS, images), while others allow us to match requests with known malicious content.

SaneJS

SaneJS uses the CDNJS repository to build a hash database of a vast number of resources used all over the internet.

The project is standalone and can be used outside of Lookyloo (even if it was primarily developed for it) with a Python client called pysanejs.

We’re not reusing the SRI (Subresource Integrity) hashes for a pretty silly reason: many resources used on websites are the same as the ones in CDNJS, but with an extra newline at the end of the file. SaneJS computes hashes for every file with and without a trailing newline, allowing us to match more resources.
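The idea of hashing each file with and without a trailing newline can be sketched as follows. This is an illustration, not the SaneJS code: the function names are made up, and SHA-512 is picked here as an assumption.

```python
import hashlib


def hash_variants(content: bytes) -> set:
    """Return the digests of the content with and without one trailing newline."""
    if content.endswith(b"\n"):
        with_newline, without_newline = content, content[:-1]
    else:
        with_newline, without_newline = content + b"\n", content
    return {
        hashlib.sha512(with_newline).hexdigest(),
        hashlib.sha512(without_newline).hexdigest(),
    }


def build_index(resources: dict) -> dict:
    """Map every digest variant to the name of the known resource."""
    index = {}
    for name, content in resources.items():
        for digest in hash_variants(content):
            index[digest] = name
    return index


def lookup(index: dict, body: bytes):
    """Return the name of the known resource matching the body, or None."""
    return index.get(hashlib.sha512(body).hexdigest())


if __name__ == "__main__":
    index = build_index({"lib.js": b"var a = 1;"})
    # The copy served by a website has an extra newline but still matches.
    print(lookup(index, b"var a = 1;\n"))
```

Because both variants are stored when the index is built, a lookup needs only one hash of the received body.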

VirusTotal

VirusTotal allows users to submit and query URLs in their huge dataset. Lookyloo does exactly that and informs the user when the content is flagged as malicious.
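For reference, version 3 of the public API identifies a URL by its unpadded URL-safe base64 encoding. The sketch below only computes that identifier and the matching report endpoint; it performs no request, and it is not necessarily how the Lookyloo module talks to the service. The function names are made up for the example.

```python
import base64


def vt_url_id(url: str) -> str:
    """Return the API v3 identifier of a URL: URL-safe base64 without padding."""
    return base64.urlsafe_b64encode(url.encode()).decode().strip("=")


def vt_report_endpoint(url: str) -> str:
    """Return the endpoint to query for the report of a URL (hypothetical helper)."""
    return f"https://www.virustotal.com/api/v3/urls/{vt_url_id(url)}"


if __name__ == "__main__":
    print(vt_report_endpoint("https://www.example.com/"))
```

An actual query would send a GET request to that endpoint with the API key in the `x-apikey` header.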

Phishing Initiative

Phishing Initiative allows users to submit and query URLs to identify malicious content. As it requires an API key, the feature is disabled by default.

Packaging

Poetry

To facilitate the deployment of Lookyloo, we use Poetry:

Poetry helps you declare, manage, and install dependencies of Python projects, ensuring you have the right stack everywhere.