7

I often see sites that keep only fresh content on the home page or in subsections, while the rest of the content is kept in a separate section called 'archive'.

Recently I have also heard that NoSQL DBs like MongoDB are good for archiving (which makes me think this is related to performance).

So why do sites archive their content? What's the benefit over say, a simple paginator through which you could reach all the content?

Is archiving done for performance? Or SEO? Or just user experience?

HappyDeveloper

2 Answers

6

True story

The simplest reason to separate newer content from older content is that your database is getting too big. A few years ago I was involved with building a huge financial application that was based on a pre-existing database. It kept a variety of fiscal data, and it already had several years' worth of data in it when I first became involved with the project. Having a single database store years of data only makes sense when you actually need all this data at any given time. The rest of the team said that was the case, so I didn't think much of it.

A few weeks after joining the team I realised that the reality was a little bit different. Our users only needed to access the current fiscal year's data, except:

  • When producing yearly reports, something that happened once a year
  • When producing three year reports, something that happened once every three years

We decided to split the database into separate databases, one per year. It wasn't an easy decision but it worked: the current fiscal year database had lightning fast responses, as it was only scanning through a very small subset of the whole data. The yearly reports were also generated on the fly, for the same reason. The three-year reports were a little bit slower to generate than before and it took a lot of creativity to combine three archives, but that was a very small disadvantage of the process.

So our database was split into small archive databases, and everyone was happy. (Not really, there were a lot of rough edges and of course this is an oversimplified version of the story, but overall the decision to archive was a good one.)
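To make the trade-off concrete, here is a minimal sketch of the per-year split, using in-memory SQLite databases as a stand-in. The table layout, the account names, and the figures are all invented for illustration; the real application was far more involved.

```python
import sqlite3


def make_year_db(rows):
    """Create an in-memory database holding one fiscal year's entries."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entries (account TEXT, amount REAL)")
    conn.executemany("INSERT INTO entries VALUES (?, ?)", rows)
    conn.commit()
    return conn


# One database per fiscal year (hypothetical data).
databases = {
    2009: make_year_db([("sales", 100.0), ("rent", -40.0)]),
    2010: make_year_db([("sales", 120.0), ("rent", -45.0)]),
    2011: make_year_db([("sales", 150.0), ("rent", -50.0)]),
}


def yearly_total(year):
    """Yearly report: touches a single, small database."""
    (total,) = databases[year].execute(
        "SELECT COALESCE(SUM(amount), 0) FROM entries"
    ).fetchone()
    return total


def multi_year_total(years):
    """Multi-year report: queries each archive and combines the results."""
    return sum(yearly_total(year) for year in years)


print(yearly_total(2011))
print(multi_year_total([2009, 2010, 2011]))
```

The everyday query only ever sees one year's rows, while the rare multi-year report pays the cost of visiting every archive and merging the results in application code.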

Heterogeneous data

Another reason to archive is a lack of data homogeneity over time. When the schema of the database changes, especially for huge databases, your best bet is a document-oriented database like MongoDB. Document-oriented databases have flexible schemas, which means that they don't care if you don't use the same fields to describe every record, hence you don't end up with the empty fields you would have in a relational database.

And as Jeff O correctly notes in a comment to another answer, archived data by definition won't change, so you don't need to care about transactions and other relational functionality. (Added here, in case comment or answer goes AWOL)
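The empty-fields point can be illustrated with plain Python dictionaries standing in for documents. This is only a sketch: the field names are hypothetical, and no actual database is involved.

```python
# Two archived articles written under different versions of the schema
# (field names are hypothetical).
old_article = {"title": "Hello", "body": "first post", "author": "alice"}
new_article = {
    "title": "World",
    "body": "second post",
    "authors": ["bob", "carol"],
    "tags": ["news"],
}

# Relational style: one fixed set of columns, so every row carries every column
# and fields a record never had show up as empty (None) values.
columns = sorted(set(old_article) | set(new_article))
rows = [{col: doc.get(col) for col in columns} for doc in (old_article, new_article)]
empty_fields = sum(value is None for row in rows for value in row.values())

# Document style: each record stores only the fields it actually has.
documents = [old_article, new_article]
stored_fields = sum(len(doc) for doc in documents)

print(columns)
print(empty_fields, stored_fields)
```

With only two schema versions the difference is small, but every further schema change adds more columns that older rows have to leave empty.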

Archiving on news oriented websites

News oriented websites with lots of data may opt to archive their older content into a document-oriented database because, since they deal in news, their fresher content is a lot more valuable to them from a business point of view.

SEO & pagination

Lastly, archiving has nothing to do with SEO or pagination (pagination will work regardless of where your content is stored). A bot will scan through your paginated content, following the pagination links like any other link. If you adopt a sensible URI schema for all your content, you have nothing to worry about. For example, imagine you have a blog with ten years of articles and you've decided to move all articles dated 2010-12-31 or earlier to a document storage archive.

Your homepage would probably have a list of newest articles, something like

http://example.com/articles/2011-11-1/title
http://example.com/articles/2011-10-30/title
http://example.com/articles/2011-10-20/title

Going through the pages you finally stumble upon a page:

http://example.com/articles/2011-1-1/title
http://example.com/articles/2010-12-30/title
http://example.com/articles/2010-12-25/title

Same URI schema, regardless of whether the article is stored in your current database or your archive database. All you have to do is a simple server side check when your visitor (human or bot) clicks on the 2010-12-30 article:

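// date:   the article's publication date, parsed from the URI
// cutoff: the archive boundary (2010-12-31), as a date value, not a bare literal
if (date <= cutoff) {
    // get article information from archive
} else {
    // get article information from current database
}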
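For a slightly fuller picture, the same check can be sketched in Python. The path pattern follows the example URIs above; the names `ARCHIVE_CUTOFF`, `ARTICLE_PATH`, and `pick_store` are made up for this sketch, and a real site would plug its own data access in where the store name is returned.

```python
import re
from datetime import date

ARCHIVE_CUTOFF = date(2010, 12, 31)  # cutoff taken from the example above

# Matches paths like /articles/2010-12-30/title; month and day may be unpadded.
ARTICLE_PATH = re.compile(r"^/articles/(\d{4})-(\d{1,2})-(\d{1,2})/([^/]+)$")


def pick_store(path):
    """Return 'archive' or 'current' for an article path, or None if invalid."""
    match = ARTICLE_PATH.match(path)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups()[:3])
    try:
        published = date(year, month, day)
    except ValueError:
        # The path looked like a date but was not a real one (e.g. month 13).
        return None
    return "archive" if published <= ARCHIVE_CUTOFF else "current"


print(pick_store("/articles/2011-1-1/title"))
print(pick_store("/articles/2010-12-30/title"))
```

Parsing the date into a real date value matters here: the example URIs use unpadded months and days, so comparing the raw strings would give wrong answers.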

Now, why some sites choose to move archived content into a special archive section is something only those who built them can answer. There may be a few user experience factors involved, but that's off topic for Programmers. You could try asking the folks over at User Experience Stack Exchange.

yannis
-2

"NoSQL" is actually a pretty broad category, so the designation doesn't mean much. And no, those DBs aren't any better at archiving old content than relational DBs. You should just keep the old content in the same data store that you're using for new content - no need to complicate things.

Mike Baranczak
  • -1 `And no, those DBs aren't any better at archiving old content than relational DBs.` That's just wrong. – yannis Nov 15 '11 at 17:50
  • Why is it wrong? – yfeldblum Nov 15 '11 at 17:54
  • 2
    @YannisRizos - If you're gonna tell me I'm wrong, you need to explain why I'm wrong. – Mike Baranczak Nov 15 '11 at 18:16
  • 2
    Sorry it took me so long, I was writing my answer. Document oriented databases are better at archiving content because they don't care for lack of data homogeneity over time, a prime reason for archiving. When archiving your best bet is something that won't have empty fields, and flexible schemas do that for you. – yannis Nov 15 '11 at 18:44
  • 3
    Archiving also doesn't require managing transactions (the data aren't going to change by definition.). – JeffO Nov 15 '11 at 18:53