Tool to identify (and remove) unnecessary website files?

Question

Inevitably I'll stop using an antiquated CSS, script, or image file, especially when a separate designer is tinkering with things and testing out a few versions of images. Before I build one myself, are there any tools out there that will drill through a website and list unlinked files? Specifically, I'm interested in ASP.NET MVC sites, so detecting calls to @Url.Content(...) (among many other things) is important.

There seems to be a similar question on SO (http://stackoverflow.com/q/5665979/866172) that doesn't have any answer for more than a year now, suggesting there is no such tool yet. The only attempt at an answer explains why such a tool does not exist yet. — Jalayn, Oct 19 '12 at 06:17

Arseni Mourzenko · Answer 1 · 2012-10-19T10:56:09.147

Aside from strictly static websites, the results of such a tool would be rather unreliable:

You can't scan the source code in order to find the links, since links can be generated. Imagine the following case:

On a page, when a user performs an action, an image is added to the DOM (so there is no <img/> element in the original HTML). The link to the image is assigned by JavaScript. To obtain one part of this link, JavaScript makes an AJAX request; the other part is hardcoded in the JavaScript code. The final URI is http://example.com/photos/nature/polar-bear.jpg?width=800

The server receives the request for the image and rewrites the URL to http://example.com/generate-photo.aspx?category=nature&name=polar-bear.jpg&width=800. It turns out that the new URI points to a dynamic resource which generates the image by taking an existing one (/photos/catalog/133d6566-3c98-4690-be4a-caad41c0e21d.jpg) and adding a copyright.

Could you possibly track this situation automatically?
You can't rely on logs, since the fact that the resource was not requested for a while doesn't mean that it will never be requested.

The only viable alternative is to:

List every resource on the website,
Collect statistics from the logs to identify the resources which were requested during the past N months. Keep in mind the many small issues which can arise: there is URL rewriting, you need to canonicalize the requests, there are default pages (http://example.com/index.html will mostly be requested as http://example.com/), etc.
Based on those statistics, set aside the resources which are in use: you don't need to remove them.
For the remaining resources, try to guess for each one the context in which it could be used, and check whether it is. This last step is extremely complex for a program and requires a human brain (or years and years of R&D).
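
The first three steps can be sketched roughly as follows. This is only an illustration in Python, not an existing tool: the function names, the index.html default document, the common-log-format assumption, and the case-insensitive comparison (typical for IIS) are all assumptions, and URL rewriting rules are not handled at all.

```python
from pathlib import Path
from urllib.parse import urlsplit, unquote

# Assumption: the site's default document is index.html.
DEFAULT_DOCUMENT = "index.html"


def canonicalize(url):
    """Reduce a requested URL to a site-relative path comparable with files on disk."""
    path = unquote(urlsplit(url).path)  # drops scheme, host, and query string
    if path == "" or path.endswith("/"):
        path += "/" if path == "" else ""
        path += DEFAULT_DOCUMENT
    # Assumption: paths are compared case-insensitively.
    return path.lstrip("/").lower()


def list_resources(root):
    """Step 1: every file under the site root, as a site-relative path."""
    return {p.relative_to(root).as_posix().lower()
            for p in root.rglob("*") if p.is_file()}


def requested_paths(log_lines):
    """Step 2: paths seen in the logs (assumes the request is the first quoted field)."""
    seen = set()
    for line in log_lines:
        parts = line.split('"')
        if len(parts) < 2:
            continue
        request = parts[1].split()  # e.g. ['GET', '/a.png', 'HTTP/1.1']
        if len(request) >= 2:
            seen.add(canonicalize(request[1]))
    return seen


def candidates_for_review(root, log_lines):
    """Step 3: files never requested; these still need the manual check of step 4."""
    return sorted(list_resources(Path(root)) - requested_paths(log_lines))
```

The output is a list of candidates, not a list of files that are safe to delete: a file like the catalog image in the example above would show up here even though it is in use.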

As a side note, did you know that instead of Url.Content, ASP.NET MVC 4 allows you to use ~ directly, like this:

<a href="~/Products/Edit/458">Edit</a>

These are valid points. But I'm thinking of a flexible solution where it allows one to extend the engine with custom regexs or plugin providers. So if you have a non-standard way to point to a resource then you can handle your specific case. In any case, I'll probably just create a tool for my specific needs. — xanadont, Oct 22 '12 at 14:07
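
An extensible, regex-based reference scanner of the kind described in this comment might look something like the sketch below. It is written in Python purely for illustration; the built-in patterns, the register_pattern helper, and the convention that group 1 captures the path are all assumptions, and it only sees references that appear literally in the source text.

```python
import re

# Each pattern must capture the referenced site-relative path in group 1.
# These two defaults are illustrative, not an exhaustive list.
PATTERNS = [
    # @Url.Content("~/Content/site.css")
    re.compile(r'''Url\.Content\(\s*["']~?/([^"']+)["']\s*\)'''),
    # href="~/Products/Edit/458" or src="/Scripts/app.js"
    re.compile(r'''(?:href|src)\s*=\s*["']~?/([^"']+)["']'''),
]


def register_pattern(regex):
    """Plug in a project-specific way of pointing to a resource."""
    PATTERNS.append(re.compile(regex))


def find_references(text, patterns=PATTERNS):
    """Return every path captured by any registered pattern in the given text."""
    found = set()
    for pattern in patterns:
        found.update(match.group(1) for match in pattern.finditer(text))
    return found
```

For instance, a pattern for CSS url(...) references could be added with register_pattern(r'url\(\s*["\']?/?([^"\')]+)["\']?\s*\)'). Links assembled at runtime, as in the polar bear example in the answer, would still be missed.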
