
I am familiar with and see the benefits of Subresource Integrity (SRI).

I understand that with SRI, once you've added a script reference with the correct integrity attribute, if the remote script is subsequently compromised, the SRI Hash will no longer match the remote asset and the browser will refuse to run the script.
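For concreteness, this is roughly how an `integrity` value is produced (a minimal Node/TypeScript sketch; the URL is a placeholder):

```typescript
// Compute an SRI integrity value for a remote asset.
// Requires Node 18+ for the built-in fetch.
import { createHash } from "node:crypto";

async function sriIntegrity(url: string): Promise<string> {
  const body = Buffer.from(await (await fetch(url)).arrayBuffer());
  // SRI accepts sha256, sha384 or sha512 digests, base64-encoded.
  return "sha384-" + createHash("sha384").update(body).digest("base64");
}

sriIntegrity("https://example.com/remote-folder/remote-script.js")
  .then((value) => console.log(`integrity="${value}"`));
```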

That's an effective safeguard as long as the remote script was uncompromised at the point when you first referenced it.

But what if, at that point, the remote script was already compromised (and a new SRI Hash generated to match the compromised asset)?

The asset must be able to self-verify. My principal concern here is self-verification.

That is, an asset needs to be able to authenticate itself in the absence of third-party verification.

(An SRI Hash will verify that the asset is a data-match, but that doesn't help if a bad actor is able to alter both the asset and the SRI Hash.)

At the very least, we need more than just an SRI Hash.

To verify the integrity of a new unknown remote asset from a new unknown source, when we cannot say for certain whether either the remote asset or the remote source has been compromised, we need something that cannot be plausibly altered without giving the game away.

Can this be achieved by adding, to the asset and the SRI Hash, the combination of:

  • a timestamp
  • a geostamp

N.B. I'm specifically talking about a legitimate asset which has been compromised, not a bad asset which was bad from the outset. I recognise there's nothing that can be done about the latter.

This is what I've come up with so far in terms of an unknown remote asset self-verifying its integrity:

The remote asset has a conventional SRI Hash.

  1. the asset has a name and a version
  2. the named, versioned asset contains a Unix Timestamp giving the time it was first published
  3. the named, versioned asset contains Lat and Long Geo-coordinates giving the location it was first published

The three pieces of data above are used to derive, from the SRI Hash, a 256-character key.

That 256-character key is then used to derive, from the asset itself, a 16-character slug.

That 16-character slug is then inserted into the named, versioned filename and becomes a canonical part of that filename (a code sketch of the full derivation chain follows below):

https://example.com/remote-folder/remote-script--2-4--39dsoe26shw82czm.js

Wherever Remote Script 2.4 is hosted on the web, it will always have the filename:

remote-script--2-4--39dsoe26shw82czm.js
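To make the chain concrete, here is a minimal sketch of one way the derivation could work (Node/TypeScript; the choice of HKDF, HMAC, base36 truncation, and all sample inputs are my assumptions, since nothing above fixes the actual algorithms):

```typescript
import { createHmac, hkdfSync } from "node:crypto";

// Step 1: derive a 256-character key from the SRI Hash, timestamp and geostamp.
function deriveKey(sriHash: string, timestamp: number, lat: number, lon: number): string {
  const info = `${timestamp}|${lat}|${lon}`;
  // 128 bytes of HKDF output, hex-encoded, is exactly 256 characters.
  return Buffer.from(hkdfSync("sha256", sriHash, "", info, 128)).toString("hex");
}

// Step 2: derive a 16-character slug from the asset bytes under that key.
// HMAC + base36 truncation is assumed here, to match the [0-9a-z] slug style.
function deriveSlug(key: string, asset: Buffer): string {
  const mac = createHmac("sha256", key).update(asset).digest("hex");
  return BigInt("0x" + mac).toString(36).slice(0, 16);
}

// Hypothetical inputs for "Remote Script 2.4".
const asset = Buffer.from("/* remote-script 2.4 source ... */");
const key = deriveKey("sha384-<base64 digest>", 1643846400, 40.7128, -74.006);
console.log(`remote-script--2-4--${deriveSlug(key, asset)}.js`);
```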

Nearly all pieces of information (the SRI Hash, the timestamp, the geo-coordinates, the 16-character slug) are referenced in the data itself so that everything can either:

i) be checked as identical by the computer (e.g. Is the SRI Hash in the attribute the same as the one listed in the asset? Does the filename duplicate the same 16-character sequence from the asset? Is the Asset Name the same as listed? Is the Asset Version the same as listed?); or else

ii) be put to the technician referencing the remote script, who can be asked whether they trust the asset's signature (e.g. "This data reports that it was published [in the middle of the Pacific] in 1981. Do you wish to continue?")
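As a sketch of the automated half of this, check (i) might look like the following, reusing `deriveKey` and `deriveSlug` from the sketch above (the embedded-metadata interface is a hypothetical convention, not part of the proposal):

```typescript
// Metadata the asset is assumed to carry inside itself.
interface AssetMeta {
  name: string;      // e.g. "remote-script"
  version: string;   // e.g. "2.4"
  sriHash: string;   // the SRI Hash listed inside the asset
  timestamp: number; // Unix timestamp of first publication
  lat: number;       // publishing latitude
  lon: number;       // publishing longitude
}

// Does the filename duplicate the slug derived from the asset itself?
function filenameIsCanonical(filename: string, asset: Buffer, meta: AssetMeta): boolean {
  const key = deriveKey(meta.sriHash, meta.timestamp, meta.lat, meta.lon);
  const expected =
    `${meta.name}--${meta.version.replace(/\./g, "-")}--${deriveSlug(key, asset)}.js`;
  return filename === expected;
}
```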

It occurs to me that a determined exploiter who wants to generate a 256-character key for their own compromised version, where the key is consistent with a plausible publishing time and location, will need to manipulate the asset itself by inserting comments. Consequently, a semi-automated check would also be required to verify that the asset doesn't contain any unusual comments.
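A minimal sketch of what that semi-automated check might look like (the regex and the thresholds are arbitrary assumptions; a real check would want a proper JavaScript tokenizer rather than a regex):

```typescript
// Flag comments that look like padding inserted to steer the derived slug.
function suspiciousComments(source: string): string[] {
  const comments = source.match(/\/\*[\s\S]*?\*\/|\/\/[^\n]*/g) ?? [];
  return comments.filter((comment) => {
    const body = comment.replace(/^\/\*|\*\/$|^\/\//g, "").trim();
    const nonWord = (body.match(/[^\w\s]/g) ?? []).length;
    // Flag very long comments, or comments dominated by symbol soup.
    return body.length > 200 || nonWord / Math.max(body.length, 1) > 0.4;
  });
}
```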

Is this level of self-verification of integrity (using a timestamp and a geostamp) enough? Or is this too easily circumvented?

This is mostly as far as I've got on my own, but I'm happy to answer any further questions to clarify any detail I may have inadvertently left out.

Rounin
  • Currently reading about Phil Zimmermann's concept of a [Web of Trust](https://en.wikipedia.org/wiki/Web_of_trust) but I don't think it's intended for precisely this kind of scenario. – Rounin Feb 03 '22 at 15:53
  • Similar question here: https://softwareengineering.stackexchange.com/questions/327805/how-to-verify-that-a-file-has-not-been-tampered-with – Rounin Feb 03 '22 at 16:42
  • Now reading about [W3C Decentralized Identifiers (DIDs) v1.0](https://www.w3.org/TR/did-core/) – Rounin Feb 03 '22 at 16:46
  • For a cryptographically-assured untampered history, perhaps this question would have been better submitted to https://crypto.stackexchange.com instead. – Andreas ZUERCHER Feb 03 '22 at 17:05
  • I'm still trying to figure out what the 16-character slug is giving you in a world where: two sites can disagree on the canonical filename, timestamps and geo-coords can be faked, and hashes can be calculated by anyone on anything. I can see this defeating the legitimate asset author's attempt to sneak in a change without a version increment, since now they aren't in control of republishing the hash. But that is only caught by people outside your use case, who have seen the asset before and already know its "canonical" file name. – candied_orange Feb 03 '22 at 17:32
  • @candied_orange - for avoidance of ambiguity, in this situation the canonical filename is a _known entity_. If the SRI Hash were shorter, I would include the SRI Hash in the canonical filename and be done. I needed something shorter than a typical SRI Hash and it occurred to me that if I generated something from the asset data using the SRI Hash plus a timestamp plus a geostamp then I could have something of fixed length which included two "sniff-test" checks on top of the SRI Hash. [1/2] – Rounin Feb 03 '22 at 19:11
  • @candied_orange - Re: _"I can see this defeating the legitimate asset author attempt to sneak in a change without a version increment"_ I hear what you're saying and it's a good point - and something I hadn't thought of. Thinking about it now, I'm okay with a legitimate asset author not being able to sneak in a change without a version increment. The assets _are_ dynamic (obviously, otherwise, a single verifiable SRIHash would suffice), but I don't want them to be _that_ dynamic with legions of unversioned-but-slightly-distinct alternatives. [2/2] – Rounin Feb 03 '22 at 19:16
  • @Rounin Let me make the point this way: `remote-script--2-4--39fakymcfakehash.js`. I say that's the canonical file name. If you've never seen it before, then how do you know that I'm full of it? – candied_orange Feb 03 '22 at 19:20
  • If it were me in this scenario, I'd search for `remote-script--2-4--` in Google / Bing / DuckDuckGo / Brave etc. and see how many search result snippets cite `remote-script--2-4--39fakymcfakehash.js`, especially if they're trusted sites, directories, code repositories and blogs. If that filename is only coming up on unusual URLs or not coming up at all and - especially - if the same name prefix is appearing _a lot_ with a consistently different slug, I'd regard this slug (`39fakymcfakehash`) as fishy and the latter as legit. – Rounin Feb 03 '22 at 19:25
  • We do all that already with separate file hashes. What are you buying us with this? Because you're costing us the ability to go back and hash files created and named before adoption of this scheme. – candied_orange Feb 03 '22 at 19:29
  • I may be missing something, but what does the geolocation identifier accomplish? I don't see anything here that prevents an attacker from using whatever geolocation they like, such as the one that was used in the original hash. It's not like someone trying to subvert this scheme is going to tell you their actual location while they are perpetrating a crime unless they are really very stupid. Same goes for the timestamp too. – JimmyJames Feb 03 '22 at 20:34
  • @JimmyJames - The 256-character key (which helps determine the 16-character slug integrated into the canonical filename) is generated from the timestamp, geostamp and the SRI Hash. If a bad actor changes the data but doesn't change anything else, then they will have a different SRI Hash but an identical timestamp and geostamp. This will generate _a different 256-character key_ which, combined with _the different data_, will generate a **radically different** 16-character slug. The resulting filename will be immediately recognised as non-canonical. – Rounin Feb 05 '22 at 14:54
  • The time + geo stamp has to be provided by the same actor generating the resource, making it untrustworthy. Alternatively, a “trusted third party” like a CA would have to act as a timestamping server but that is a high-cost strategy that doesn't actually provide that much more security. TOFU (trust on first use) is a sensible and widespread security stance (e.g. SSH or HSTS key pinning, Signal messenger) that makes subsequent manipulations detectable without lots of infrastructure and overhead. SRI is another instance of that idea. Of course it's imperfect, but it's better than nothing. – amon Feb 05 '22 at 19:32
  • @amon - Re: _"The time + geo stamp has to be provided by the same actor generating the resource, making it untrustworthy."_ See the note in the question above: "N.B. I'm specifically talking about a _legitimate_ asset which has been compromised, _not_ a _bad asset_ which was bad from the outset. I recognise there's nothing that can be done about the latter." – Rounin Feb 06 '22 at 12:16
  • @Rounin Talking about “good” or “bad” assets is meaningless unless there is a way to distinguish them. Here, both good and bad actors can create resources that pass your proposed verification mechanism. You may be concerned only about subsequent manipulations, but SRI is just as good for that as your approach – how do I know if your canonical filename wasn't already manipulated before I first accessed it? A HMAC (keyed hash) could solve this, but would require the asset creator to have a trustworthy public key (e.g. with a certificate from a CA). – amon Feb 06 '22 at 13:36
  • @amon - Re: _"Talking about “good” or “bad” assets is meaningless unless there is a way to distinguish them."_ Let's take 1) `jQuery`, 2) `kQuery` and 3) an asset that calls itself `jQuery` but has been compromised. 2) and 3) are not the same. There isn't much demand for 2) and where there is, _because it's not well-known_, it will be inspected by people who can read the code and (advisedly) left alone by others who cannot. There _is_ lots of demand for 1). Consequently it's important for anyone downloading it to know they are actually downloading 1) and not 3). – Rounin Feb 06 '22 at 20:19
  • "I'm specifically talking about a legitimate asset which has been compromised, not a bad asset which was bad from the outset." But how is knowing the 'good' filename different from knowing the 'good' hash value? – JimmyJames Feb 07 '22 at 15:19
  • Also, the geolocation and timestamp don't seem to add much to this scheme and actually provide inputs that an attacker controls, and therefore make it easier to create a corrupted version with the same filename. There are many locations on earth and many timestamps that could be 'reasonable'. – JimmyJames Feb 07 '22 at 15:22
  • @JimmyJames - Re: _" But how is knowing the 'good' filename different from knowing the 'good' hash value"_ That's a fair question. There isn't a substantial difference in terms of utility. But the filename slug is 16 characters whereas SRIHashes can be considerably longer - which would lead to unwieldy filenames if the SRIHash were to be included in the filename. – Rounin Feb 07 '22 at 16:05
  • @JimmyJames - Re: _"There are many locations on earth and many timestamps that could be 'reasonable'"_ This metadata is both verified inside the asset and known outside the asset. E.g. Stack Exchange is based in NYC, right? If a product from Stack Exchange tells you it was published in Lagos or Odessa, when every other product you've seen from Stack Exchange has told you it was published in NYC, you'd want to verify further, right? If a publisher has a known headquarters, I don't think there are many geostamps which could be reasonable. – Rounin Feb 07 '22 at 16:15
  • Are we talking about an automated check? There are some practical complications such as the fact that companies move their physical headquarters or don't have one. But that aside, if the filename hasn't changed, you would need to store the original values somewhere and verify them. And if you are doing that, any known value could serve the same purpose. I'm just not seeing how this is an improvement over existing solutions. – JimmyJames Feb 07 '22 at 16:24
  • @JimmyJames - Re: _"Are we talking about an automated check?"_ No. See above: _"ii) the technician referencing the remote script can be asked if they trust the asset's signature (eg, This data reports that it was published [in the middle of the Pacific] in 1981? Do you wish to continue?)"_ – Rounin Feb 07 '22 at 16:47
  • Maybe I'm misunderstanding the context. The SRI MDN page you link to is for browsers. Who is the technician here? Are we talking about a different scenario? – JimmyJames Feb 07 '22 at 16:51
  • @JimmyJames - Re: _"[...] companies move their physical headquarters or don't have one [...]"_ Sure. But this doesn't matter. The geostamp and the timestamp are an extended signature - they may but are not obliged to represent factual accuaracy. If a publisher wishes to consistent publish their geostamp as the Great Pyramid of Giza, that's their choice. – Rounin Feb 07 '22 at 16:52
  • @JimmyJames - Re: _"you would need to store the original values somewhere and verify them"_ The original values are stored within the asset. Those values are hashed in the SRIHash along with the rest of the asset. The values are then used again to derive from the SRIHash a 256 character key which, when combined with the asset, generates a 16-character slug to go in the filename. – Rounin Feb 07 '22 at 16:55
  • "If a publisher wishes to consistent publish their geostamp as the Great Pyramid of Giza" Right, so again, any known value can be used for this. Saying we are using a geolocation but it can be anywhere doesn't really make much sense. – JimmyJames Feb 07 '22 at 16:55
  • @JimmyJames - Re: _"Who is the technician here?"_ The person responsble for adding a third-party script to an organisation's website who may well not be able to read the language the script is written in, but needs to know if the asset is what it says it is, rather than being compromised asset masquerading as a legitimate asset. – Rounin Feb 07 '22 at 16:57
  • @JimmyJames - Re: _"Saying we are using a geolocation but it can be anywhere doesn't really make much sense."_ The intention is that the publishing datetime and the publishing geo-coordinates represent an extended part of the asset's signature. They are details that can be verified within the asset and known outside the asset. – Rounin Feb 07 '22 at 17:31
  • At present, there are a number of challenges to the question, but few attempts to answer either a) _"How to verify that a legitimate (but unknown) remote asset from an unknown source has not been compromised and that its integrity remains intact?"_ or b) _"Is this level of self-verification of integrity (using a timestamp and a geostamp) enough? Or is this too easily circumvented?"_ As suggested early on, should I transfer this question to https://crypto.stackexchange.com/ so that the question might be tackled rather than challenged? Thanks for your advice / recommendations. – Rounin Feb 08 '22 at 09:22
  • Re: _"[...] is this too easily circumvented?"_ It occurred to me that a URL containing the canonical filename (ie. with the correct 16-character slug) could be 302 Redirected (via .htaccess, PHP or whatever else) to a different URL with an incorrect 16-character slug and, if undetected, this would represent an almost trivial circumvention. (That said, I'm not sure if I'm thinking this through correctly...) – Rounin Apr 28 '22 at 07:29
  • So, on the basis that URL redirection is elementary, we can rule out any part of the URL as a reliable piece of signature data. However, we _can_ keep the i) publishing datetime and ii) the publishing geo-coordinates as two parts of the asset's signature alongside iii) the SRI Hash, iv) the name of the publisher, v) the name of the digital asset, vi) the version of the digital asset, vii) the length and / or size of the digital asset and... viii) a hash of some kind of fragment from the digital asset. This last part of the asset signature is what I'm working on currently. – Rounin May 24 '22 at 17:02
  • I see [**JSON Web Signatures**](https://en.wikipedia.org/wiki/JSON_Web_Signature) are another method (similar to but not the same as **Subresource Integrity**) by which data can self-authenticate. Also: 1) [How to digitally sign or verify third party javascript files?](https://stackoverflow.com/questions/68993226/how-to-digitally-sign-or-verify-third-party-javascript-files) and 2) [How can I make sure that my JavaScript files delivered over a CDN are not altered?](https://stackoverflow.com/questions/38700923/how-can-i-make-sure-that-my-javascript-files-delivered-over-a-cdn-are-not-altere) – Rounin Jun 05 '22 at 18:26

2 Answers


_"I'm trying to understand how it will be possible (I refuse to believe it isn't possible) to verify the integrity of a new unknown remote asset from a new unknown source, when you cannot say for certain if either the remote asset or the remote source has not been compromised."_

Sure you can do this. It's called the Wayback Machine.

But all that will tell you is whether the hash has been changed. There is no way to know whether it was changed because the site hosting the hash got hacked to hide that the asset is compromised, or because the legitimate asset author decided to sneak in a change without changing the version number.

And of course the Wayback Machine can get hacked, the legitimate asset author can turn to the dark side, and cosmic rays can flip your bits. Honestly, when this stuff works the way it's supposed to, consider yourself lucky.

candied_orange

So my understanding of SRI is:

  • you write a script
  • you generate the hash
  • you upload it to third party CDN
  • you reference the CDN version in your webpage, hosted by you
  • someone loads your webpage
  • the browser requests the script
  • the browser checks that the script it got from the CDN has the same hash as the link in your webpage.

Here the browser is checking that the script the CDN serves is the same as the one your webpage wants.
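To illustrate, here's a rough Node/TypeScript equivalent of the check the browser performs natively (the URL and integrity value are placeholders):

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Fetch the asset and compare its digest against the declared integrity value.
async function passesSriCheck(url: string, integrity: string): Promise<boolean> {
  const dash = integrity.indexOf("-");
  const algo = integrity.slice(0, dash);          // e.g. "sha384"
  const expected = Buffer.from(integrity.slice(dash + 1), "base64");
  const body = Buffer.from(await (await fetch(url)).arrayBuffer());
  const actual = createHash(algo).update(body).digest();
  return actual.length === expected.length && timingSafeEqual(actual, expected);
}

passesSriCheck(
  "https://cdn.example.com/remote-script.js",
  "sha384-...base64-digest-from-your-page...",
).then((ok) => console.log(ok ? "execute" : "refuse"));
```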

The scenario you seem to be asking about is different if I understand it correctly.

  • you write a website and want to use someone else's script which is hosted on a CDN
  • you link to the script and add the hash that the CDN supplies.
  • ...

Here the browser is doing the same thing. But you haven't checked that the script you referenced is actually the one you want.

You need to add a step at the beginning, where you download the script and check that the source code does what you think it does, look for security problems and backdoors, etc.

All of your ideas fall down when you only have the one source for all the data. Who is telling you what the filename and version should be, or when it was published? Why is the middle of the Pacific in 1981 a red flag compared to anywhere or anywhen else? Why 2.4 and not 3.1?

You must have external information that you trust about the code you expect, in order to compare it against your untrusted source.
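As a sketch of what that external check might look like in practice (both URLs are placeholders, and "independent and trusted" is the hard part no code can supply), you could cross-check the CDN copy against a copy hosted somewhere you already trust:

```typescript
import { createHash } from "node:crypto";

async function sha384Of(url: string): Promise<string> {
  const body = Buffer.from(await (await fetch(url)).arrayBuffer());
  return createHash("sha384").update(body).digest("base64");
}

// Compare the CDN copy against an independently hosted copy of the same script.
async function sourcesAgree(cdnUrl: string, upstreamUrl: string): Promise<boolean> {
  const [cdn, upstream] = await Promise.all([sha384Of(cdnUrl), sha384Of(upstreamUrl)]);
  return cdn === upstream;
}

sourcesAgree(
  "https://cdn.example.com/remote-script.js",
  "https://author.example.org/remote-script.js",
).then((same) => console.log(same ? "sources agree" : "mismatch - investigate"));
```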

Ewan
  • Well, they have to just trust whoever told them that link was good then. – Ewan Feb 03 '22 at 15:34
  • You could, for example, meet the author in person and get them to give you the hash in the form of a paper note, which you then carry back to your desk, being extra careful to avoid any "now you see me" style hijinks. – Ewan Feb 03 '22 at 15:37
  • Do you see that what you are asking is the same as "I have been given a box; how do I know what's inside without looking?" – Ewan Feb 03 '22 at 15:39
  • Or really more like, "I've looked at the thing, but I don't know what I ordered. Is it what I ordered?" – Ewan Feb 03 '22 at 15:40
  • Sure, I guess that would be like: I linked the wrong script and I get errors when I call it. It doesn't help if I make fake coins to send you. A bank would check the coin against a real coin and see if they are the same. – Ewan Feb 03 '22 at 15:46
  • Perhaps I can explain the situation a different way? Imagine a CDN containing one script gets hacked. Imagine that the script is changed and the SRI Hash is changed. Imagine that, having never heard of the CDN before, you now find the CDN and its script via a search engine. The SRI Hash alone is not going to tell you anything is wrong with the script, is it? – Rounin Feb 03 '22 at 15:50
  • No. But if you google the author of the script, you will find their GitHub or whatever, and it will hopefully state a signature or hash. Do you trust this information? – Ewan Feb 03 '22 at 15:57
  • I think I understand the question. – Ewan Feb 03 '22 at 16:14