Lookyloo is the glue that puts together and sometimes repurposes a bunch of tools developed by others.
Redis is the database used to store the cached information visualized on the web interface.
Playwright is a Python library to automate Chromium, Firefox and WebKit browsers with a single API. Playwright delivers automation that is ever-green, capable, reliable and fast.
To learn more about Playwright, see the documentation on the official website.
In Lookyloo, the Playwright capture module loads a web page in a headless web browser. When the page is done loading, it scrolls down and waits a little bit longer. When the capture is finished, it returns an HTTP Archive (HAR) along with the following:
the page rendered at the end of the capture (that would be the same as what you would see when viewing the source of the page in your browser)
a screenshot of the whole page
all the cookies that were received and created during the capture
the URL in the address bar of the browser
The capture is controlled by a Python script.
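The capture steps described above can be sketched with Playwright's synchronous Python API. This is a minimal illustration, not the capture module Lookyloo ships: the function name `capture`, the `settle_ms` delay, the HAR file path, and the keys of the returned dictionary are all assumptions made for the example.

```python
from pathlib import Path


def capture(url: str, har_path: str = "capture.har", settle_ms: int = 2000) -> dict:
    """Load `url` in headless Chromium and collect the capture artefacts.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    # Imported here so the sketch can be imported without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        # record_har_path makes Playwright write a HAR file when the context closes.
        context = browser.new_context(record_har_path=har_path)
        page = context.new_page()
        page.goto(url, wait_until="load")
        # Scroll to the bottom, then wait a little longer for late requests.
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(settle_ms)
        result = {
            "html": page.content(),                  # rendered page
            "png": page.screenshot(full_page=True),  # screenshot of the whole page
            "cookies": context.cookies(),            # cookies seen during the capture
            "url": page.url,                         # URL in the address bar
        }
        context.close()  # flushes the HAR to disk
        browser.close()
    result["har"] = Path(har_path).read_text(encoding="utf-8")
    return result
```

Calling `capture("https://example.com")` requires Playwright and a Chromium build to be installed (for instance via `playwright install chromium`).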
ETE Toolkit & har2tree
The core feature of Lookyloo is the visualization of what is happening in the browser when you load a page. Our approach was to use a tree, and who does trees better than researchers working in phylogenetics and genomics?
This is the reason we used ETE Toolkit:
ETE (Environment for Tree Exploration) is a Python programming toolkit that assists in the automated manipulation, analysis and visualization of phylogenetic trees. Clustering trees or any other tree-like data structure are also supported.
To create that tree, we take the HAR output generated by the Playwright capture and pass it to another standalone library, har2tree.
We won’t go into too much detail about the process of building a tree out of a HAR file, but you can read the comments directly in the code.
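For a rough idea of what turning a HAR into a tree involves, here is a deliberately simplified sketch. It is not har2tree's algorithm: it assumes every request can be attached to its parent using only the `Referer` header, whereas real pages also involve redirects, iframes, and script-initiated requests. The function names are hypothetical; only the HAR layout (`log.entries[].request.url` and `request.headers`) comes from the HAR format itself.

```python
from typing import Optional


def referer_of(entry: dict) -> Optional[str]:
    """Return the Referer header of a HAR entry, if present."""
    for header in entry["request"].get("headers", []):
        if header["name"].lower() == "referer":
            return header["value"]
    return None


def build_tree(har: dict) -> dict:
    """Build a {"url": ..., "children": [...]} tree rooted at the first request."""
    entries = har["log"]["entries"]
    root = {"url": entries[0]["request"]["url"], "children": []}
    nodes = {root["url"]: root}
    for entry in entries[1:]:
        url = entry["request"]["url"]
        if url in nodes:
            continue  # keep only the first occurrence of each URL
        # Attach to the referring node if we have already seen it, else to the root.
        parent = nodes.get(referer_of(entry), root)
        node = {"url": url, "children": []}
        parent["children"].append(node)
        nodes[url] = node
    return root
```

Because a node can only be attached to a URL that appeared earlier in the HAR, the result is always a tree and never contains a cycle.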
Initially, we used the webplugin developed by ETE Toolkit, but it turned out to be too limited for what we wanted to do. We switched to D3.js for the rendering and implemented a compatible JSON output in har2tree.
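As an illustration of what a D3.js-compatible JSON output can look like, the sketch below serializes a small tree into nested objects with a `children` array, which is the shape `d3.hierarchy` reads by default. The `Node` class and the `name` field are assumptions for this example and do not reflect har2tree's actual schema.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node: a label plus child nodes."""
    name: str
    children: List["Node"] = field(default_factory=list)

    def to_dict(self) -> dict:
        """Nested dict; leaves omit the "children" key."""
        data = {"name": self.name}
        if self.children:
            data["children"] = [child.to_dict() for child in self.children]
        return data


root = Node("https://example.com/", [
    Node("https://example.com/app.js", [Node("https://cdn.example/lib.js")]),
    Node("https://example.com/style.css"),
])
print(json.dumps(root.to_dict(), indent=2))
```

On the JavaScript side, the resulting object could be passed directly to `d3.hierarchy(data)`.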
Flask & Bootstrap
For the web interface, we use Flask and Bootstrap 4.
Everything related to the website is in a subdirectory in the Lookyloo repository.
Modules and Third Party components
SaneJS uses the CDNJS repository to build a hash database of a vast number of resources used all over the internet.
The project is standalone and can be used outside of Lookyloo (even if it was primarily developed for it) with a Python client called pysanejs.
We’re not re-using the SRIs for a pretty silly reason: Many resources used on websites are the same as the ones in CDNJS, but with an extra newline at the end of the file. SaneJS computes hashes for every file with and without a newline, allowing us to match more resources.
VirusTotal allows users to submit URLs and query them against its huge dataset. Lookyloo does exactly that and informs the user if the content is flagged as malicious.
Phishing Initiative allows users to submit and query URLs to identify malicious content. As it requires an API key, the feature is disabled by default.
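The newline trick can be sketched as follows. This is an illustration of the idea only: the choice of SHA-512, the function names, and the library path used in the usage note are assumptions, not SaneJS's actual implementation.

```python
import hashlib
from typing import Dict, List, Optional


def variants(content: bytes) -> List[bytes]:
    """Return the file without and with a single trailing newline."""
    stripped = content[:-1] if content.endswith(b"\n") else content
    return [stripped, stripped + b"\n"]


def build_index(library_files: Dict[str, bytes]) -> Dict[str, str]:
    """Map the hash of every variant of every file to the file's name."""
    index: Dict[str, str] = {}
    for name, content in library_files.items():
        for variant in variants(content):
            index[hashlib.sha512(variant).hexdigest()] = name
    return index


def lookup(index: Dict[str, str], resource: bytes) -> Optional[str]:
    """Return the matching library file name, or None if unknown."""
    return index.get(hashlib.sha512(resource).hexdigest())
```

With an index built from `{"example-lib/1.0.0/lib.min.js": b"console.log(1);"}`, both `b"console.log(1);"` and `b"console.log(1);\n"` resolve to the same library file, while any other content returns `None`.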
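To give an idea of what such a lookup involves, the sketch below prepares (without sending) a URL report request, assuming VirusTotal's public v3 REST API, where a URL is identified by its unpadded URL-safe base64 encoding and the key is passed in the `x-apikey` header. The helper names are hypothetical, and Lookyloo's own module may work differently.

```python
import base64
import urllib.request


def vt_url_id(url: str) -> str:
    """URL identifier: URL-safe base64 of the URL, without '=' padding."""
    return base64.urlsafe_b64encode(url.encode()).decode().rstrip("=")


def build_vt_request(url: str, api_key: str) -> urllib.request.Request:
    """Prepare a GET request for the URL report; nothing is sent here."""
    return urllib.request.Request(
        f"https://www.virustotal.com/api/v3/urls/{vt_url_id(url)}",
        headers={"x-apikey": api_key},
    )
```

The prepared request can then be sent with `urllib.request.urlopen(...)` once a valid API key is available.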
To facilitate the deployment of Lookyloo, we use Poetry:
Poetry helps you declare, manage, and install dependencies of Python projects, ensuring you have the right stack everywhere.