Proposal for a Distributed Content Verification System

Date
2020-05-16 14:49 CET

#100DaysToOffload >> 2020-05-16 >> 018/100

Preamble

This is going to be a long post. In it, I will describe a system that I have been thinking of for the last week. The way I see it, there are three possible outcomes: a) it's genius, I've outdone myself and I should build it; b) it's genius but other people have already solved this issue (perhaps in a different way); c) it's a mediocre/inadequate solution to a problem that doesn't need solving. I need help in figuring out which description suits this idea the best. Let me know on the fediverse.

Background

Story 1 - Linux Mint hack

Two short stories are required. The first is based around Linux Mint and what happened in 2016. TLDR from the blog post: "Hackers made a modified Linux Mint ISO, with a backdoor in it, and managed to hack our website to point to it". In addition to just linking to the modified ISO file, they also changed the MD5 hash to match their modified version.

The web is fragile. If you post MD5 hashes on your website so people can trust your software and your website gets hacked and the hashes changed, there's no trust. This is not Linux Mint's fault, this is the way the internet works. I have had hackers on my shared hosting servers who uploaded a whole bunch of suspicious files. Because of this, the fix was easy. But what if they just made a minor change in a single file? I would have been none the wiser.

Story 2 - Keybase

You need an external source of truth and this is what the second short story is about: Keybase. I verified my website and my accounts on various services through their website. If you know me through my fosstodon.org account, you could check if that Keybase account was really mine, and if so, you could verify that this website is really mine as well as some other accounts. A nifty solution for authenticity proof of my distributed online presence.

But there are drawbacks. The actual content on my website is not verified. The system is centralised and also not FOSS. Lastly, due to their recent acquiring, I will no longer be using Keybase.

So my new authenticity proof? My website. The links on my website are who I am on various online services. I curated those links. I checked for each one if they link to what I intended them to link to.

But that's not enough. What if my website gets hacked? And a social link gets replaced? "Well, that doesn't happen to me", some might say. Fine, let's look at a second example. Visit someone else's personal site and click their social links. How do you know if you can trust those links? "So what", you say? Let's go further. You want to donate to someone using a cryptocurrency. They have their wallet on their website. Is that really their wallet though?

I think we can solve this issue the way we would want to: using a distributed system.

Proposal for a Distributed Content Verification System

Overview

The concept is based around a network of two different types of nodes: the "content" nodes and the "truth" nodes. The "content" nodes are websites with content that need to be verified. The "truth" nodes are servers that periodically check all known pages for changes.

The idea is that a hacker needs to obtain a developer's cryptographic keypair and infiltrate both a "content" node and one or more "truth" nodes in order to get away with their malicious activity.

Step 1 - Linking a "content" node to a "truth" node

First, the website owner needs to make their website ("content" node) known to the network of "truth" nodes. This can either be done manually by asking someone they trust and owns a "truth" node, or "truth" nodes could implement some sort of registration form.

A valid contact method like an email address is mandatory to communicate irregularities to the website owner. A public cryptographic key is also required in order to check the signature of the hashes (see below).

Step 2 - Updating the "truth"

During the process of uploading the updated content of their website to their server, the website owner also sends the hashes of the updated files to the "truth" node they registered with. This could be done by using a command-line tool on the server, on the developer's machine, it could even be part of the CI/CD/CD pipeline. Hashes are signed using a cryptographic keypair to make sure the website owner is the one who updated the content.

Alternatively, the "truth" node could have a web interface with a button to trigger the (near-)immediate download of the pages and computation of the hashes. However, this method does not easily allow for the cryptographic signing of the hashes and should therefore not be advised or even accepted.

Once updated, the "truth" nodes exchange the updated hashes with each other.

One thing to consider: websites can be dynamic, for example by including posts from social networks. HTML tags that contain dynamically generated content should get a specific tag so they get excluded from the hashing.

Step 3 - Verification of content

On a regular basis, the "truth" nodes download the pages and compute the hashes. If they match with the hashes in their database, all is well.

If a discrepancy is found, a "truth" node should ask other "truth" nodes if updates exist for this particular website and the updated hashes simply haven't propagated yet. If so, fetch the new hashes and run this step again.

If no updated hashes are found or the new hashes still don't match, contact the website owner and let them know something has changed on their website that they haven't told the "truth" nodes about.

Optional step 4 - User benefits

In addition to the measures taken in step 3 when detecting anomalies, browser plugins could warn visitors of websites that the content they are seeing may not be what the website owner intended it to be.

Possible attack surfaces

If a "truth" node is hacked, hashes could easily be changed. However, signing the hashes using a cryptographic keypair should mitigate this problem. Other nodes will not trust the newly propagated hashes and will flag that "truth" node as corrupted.

If a "truth" node is hacked and the website owner's credentials are changed, they would no longer receives notifications. Credentials should also be signed by the cryptographic keypair to make changes like these detectable.

If a "truth" node is hacked and the stored public key is modified, we have a problem. "Truth" nodes should verify each other as well to make sure no funny business like this happens.

If a "truth" node is hacked and the "content verification" code is changed, we have a problem. Again, some form of collaboration between "truth" nodes should prevent hacked "truth" nodes from doing harm to the system.

If a "content" node is hacked and new files are uploaded, the "truth" node will not be triggered as it won't handle these files. But at least, the content displayed to visitors remains unchanged.

If a "content" node is hacked and existing files are modified, the "truth" nodes will be triggered and there's no code on the "content" node that could prevent this from happening.

If a "content" node is hacked and existing files are modified in such a way that the hashes match, we have a problem. Proper research needs to be done to correctly implement cyptographic hashing functions to avoid this issue.

Things that need to be worked out

How exactly does a new website enter the network?
How to coordinate page downloading and hash computation to avoid redundancy and load on the hosting server?
How to measure credibility among "truth" nodes and detect corruption of individual nodes?
How to prevent hash collision?

Federated AND peer-to-peer

The concept described above is technically based on federation. However, I initially imagined several websites hosting both their own websites and the hashes of websites they selected. This is still possible: the concept described above should support both a federated content verification system and a peer-to-peer content verification system.