
I have 500,000 files with a combined size of 350GB, so the average file is about 0.7MB. Each file has metadata: 1 to 100 keywords and optionally a short description. I need to filter the files and match the keywords of the search expression against the keywords and descriptions of the files. Note that some of the files are text files; for those, I need to find the keywords in the file content as well, which means full-text search.

  1. Should I store the metadata and the files in the same database, or should I store the metadata and the text files in the same database and the binary files somewhere else?
  2. What type of database can store so many files?

Note that the databases are RAID-protected, but I can add a non-RAID-protected filesystem cache in case they would make streaming slow.

I am concerned only about the performance of search and file access, not consistency, convenience, security or resource utilization. I can use even the file system if that makes things faster.

Robert Harvey
inf3rno
  • Have you looked into Elasticsearch? If you need full-text search, I'm not familiar with any other options, actually. – gardenhead Sep 24 '16 at 20:16
  • I don't think you will have a problem putting all the data and files in a single DB, e.g. MS SQL, but it may not be optimal depending on how you intend to use the data. Will every search cover all the files, or will the searches end up being grouped in some way, i.e. only search the text files for x and the binary files for y? – Ewan Sep 24 '16 at 20:19
  • @Ewan I will use semantic search by keywords, so it is very complicated and I would rather not talk about it. This question is more about where to store the binary files. Yes, it will search the whole database. I am not sure whether an SQL database is the best solution here, what about noSQL databases? – inf3rno Sep 24 '16 at 20:23
  • Respond to Christophe in a comment below his answer, not in your question. @replies do not work in posts. – Robert Harvey Sep 24 '16 at 23:01
  • @RobertHarvey I added the edit because it makes the question less broad... I know that they don't work in posts. They should btw. :-) – inf3rno Sep 24 '16 at 23:02
  • Sure, but @replies don't work in questions. He'll never get notified of your reply, and it's just noise for people trying read your post. They *shouldn't.* Stack Exchange is not a forum environment. – Robert Harvey Sep 24 '16 at 23:03
  • @RobertHarvey Actually it is important. By removing the `@Christophe` you moved the paragraph out of context. So people won't know that I am not concerned about consistency, convenience and security, since those are in Christophe's answer... – inf3rno Sep 24 '16 at 23:05
  • Then move the paragraph to wherever it makes sense in your question to make it coherent whole. But please don't put things like @someone, "EDIT:" or similar monikers in your posts. Your question should be able to stand on its own without these things, and without referring to someone's answer. Stack Exchange already has first-class mechanisms to handle those things, and subverting those mechanisms just hurts the quality of the site overall. – Robert Harvey Sep 24 '16 at 23:11
  • @RobertHarvey Is this documented somewhere in e.g. in SO rules and recommendations, meta, etc..? If not, then this is not valid, just your personal opinion. – inf3rno Sep 24 '16 at 23:14
  • Of course it's documented. Why would I be wasting my breath on stuff that is not official policy? It's also common sense; why make people jump all over the page to try and figure out what you're talking about? See my latest edit to your question. – Robert Harvey Sep 24 '16 at 23:14
  • Also, for what it's worth, your question has been [asked many, many times before](https://www.google.com/search?q=store+files+on+disk). The usual answer is that, probably 80 to 95 percent of the time, you're better off storing files in the file system, *especially with respect to performance.* That is, after all, what the file system is specifically designed to do. The only way to know for sure on performance questions is, of course, to *measure the performance yourself.* – Robert Harvey Sep 24 '16 at 23:19
  • @RobertHarvey Actually it is more complex in general as you can read in Christophe's answer, but yes, in the current case it is probably better to use the filesystem for the read model, since only performance is required. – inf3rno Sep 24 '16 at 23:24
  • @RobertHarvey "It's also common sense; why make people jump all over the page to try and figure out what you're talking about?" - Actually what you just described here is the same as browsing the web or adding a reference to other chapter of a text. Can you give me a link about this policy? – inf3rno Sep 24 '16 at 23:29
  • Related: http://serverfault.com/a/95454 – Robert Harvey Sep 25 '16 at 00:04

2 Answers


Where to store the files?

The question of whether or not to store the files in the database has to be considered from several angles:

  • Consistency: storing the metadata and the files (as BLOBs) together in the database ensures that what belongs together remains together. No risk of inconsistency if an insertion is interrupted, and no separate storage location to manage with absolute or relative URLs in the database.

  • Convenience: you may move/backup/replicate/monitor your database using just the database tools. With separate files, you have to organize all these operations yourself. It is not necessarily difficult, but you have to take care of it.

  • Security: most DBMS offer some authorization mechanisms for user access, and even encryption if needed. So having the files in the DBMS ensures that nobody tampers with them, and only those with the necessary DB privileges can access them. With separate files outside the database, it's much more difficult to organize this (unless you're on a server and the clients can't access the folders directly).

  • Performance: this is something that you have to check carefully with the DBMS that you'll choose: the API for accessing BLOBs might add some overhead by transferring data from/to the database in smaller chunks, so you should request such an object only when necessary. With files in the file system, it's faster to access the raw data when it is needed. However, with so many files, you might have to distribute them across several folders so as not to suffer from the lookup cost of every filename in one huge directory.

  • Resources: if you consider using an in-memory database to accelerate your "semantic" work on the metadata, it would be very costly to also keep all the raw file data in memory. There, separate files could be a real advantage.
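
One way to address the directory-size issue mentioned under Performance is to shard files into nested subdirectories derived from a hash of the filename. This is a minimal sketch (the root path, hash choice, and two-level layout are illustrative assumptions, not something prescribed by the answer):

```python
import hashlib
import os

def shard_path(root: str, filename: str, levels: int = 2) -> str:
    """Derive a nested directory path from a hash of the filename,
    so no single directory has to hold all 500,000 files."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    # Use the first `levels` pairs of hex digits as subdirectory names,
    # e.g. root/ab/cd/filename
    parts = [digest[i * 2:i * 2 + 2] for i in range(levels)]
    return os.path.join(root, *parts, filename)

path = shard_path("/data/files", "report-2016.pdf")
# -> "/data/files/<xx>/<yy>/report-2016.pdf", where <xx>/<yy> come from the hash
```

With two levels of 256 subdirectories each, 500,000 files average fewer than ten files per leaf directory, which keeps filename lookups cheap on most file systems.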

Not knowing what exactly your application is doing, it wouldn't be wise to advise you firmly on one way or the other.

Real life examples

  • At my company we use a huge ERP. The records of the financial transactions in the database refer to scanned financial documents that are stored outside the DB on a distinct content server. The content server is a kind of web server, that locally stores the image files (JPG, PDF, ...) in its local file system. The security of access is organized via a complex URL validation scheme.
  • Another system stores scanned documents for an activity not covered by the ERP. The images are stored directly in the database.

So in practice, both approaches work. The first is based on standard software products; the second was developed in-house. From the point of view of performance they are very similar, because the images are accessed from the client (i.e. the potential overhead of BLOB management on the DB side is compensated by the overhead of an additional transfer via the additional web server).

Relational or not ?

If you go relational, you may want to manage:

  • the file records (e.g. identification, some unique metadata, and the BLOB);
  • the keywords (metadata + a filtered list of plain-text words);
  • the association of keywords to file records (many-to-many).

There is no doubt that the performance and flexibility will be there, because searching for keys, merging several searches, etc. is the core business of an RDBMS. But you'll have to work out how best to structure the metadata.
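
A minimal sketch of that relational layout, using SQLite for brevity (table and column names are illustrative assumptions, not prescribed by the answer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file_record (
        id          INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        description TEXT            -- optional short description
        -- content BLOB             -- only if the files live in the DB
    );
    CREATE TABLE keyword (
        id   INTEGER PRIMARY KEY,
        word TEXT NOT NULL UNIQUE
    );
    CREATE TABLE file_keyword (     -- the many-to-many association
        file_id    INTEGER NOT NULL REFERENCES file_record(id),
        keyword_id INTEGER NOT NULL REFERENCES keyword(id),
        PRIMARY KEY (file_id, keyword_id)
    );
    CREATE INDEX idx_fk_keyword ON file_keyword(keyword_id);
""")

# Store one sample record, then search by keyword.
conn.execute("INSERT INTO file_record (id, name) VALUES (1, 'notes.txt')")
conn.execute("INSERT INTO keyword (id, word) VALUES (1, 'finance')")
conn.execute("INSERT INTO file_keyword VALUES (1, 1)")
rows = conn.execute("""
    SELECT f.name
    FROM file_record f
    JOIN file_keyword fk ON fk.file_id = f.id
    JOIN keyword k       ON k.id = fk.keyword_id
    WHERE k.word = 'finance'
""").fetchall()
```

The index on `file_keyword(keyword_id)` is what makes keyword lookups fast in the many-to-many direction; the composite primary key already covers lookups by file.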

You could also opt for a NoSQL database; they are more flexible about data structure. Intuitively, I'd suggest starting with a look at document databases. If, however, you prefer to keep the files out of the database, you might be more interested in a key-value store, or even a wide-column store if you manage different kinds of keywords for different kinds of metadata.

Christophe
  • Thanks Christophe! I edited my question to make it less broad... – inf3rno Sep 24 '16 at 23:02
  • I just read your edit. I won't have many resources here, at most about 8GB memory. I am not sure whether that is enough to store the text content in memory. Maybe partially. – inf3rno Sep 24 '16 at 23:09
  • @inf3rno Use an in-memory DB only with machines having very large RAM (e.g. larger than the size of all your metadata) and if the processing of the metadata is very complex. With a more limited 8GB your DB will do very well, using caching and other mechanisms. – Christophe Sep 24 '16 at 23:22
  • Thanks! Your answer is good for beginning, I'll investigate this question further. Maybe I'll separate the full-text search from the metadata search. – inf3rno Sep 24 '16 at 23:57

I am concerned only about the performance of search and file access, not consistency, convenience, security or resource utilization.

For reading and writing files of that size (0.7MB on average), a file system is probably faster. (For much smaller files, e.g. 0.7KB, a database would probably be faster; typical file systems are bad at handling lots of tiny files.)

Searching is a different issue. The way to get fast search over a large amount of data is to construct indexes.

  • For structured data (e.g. the file metadata), you typically create database tables to hold the searchable data and then add database indexes to make the common (simple) searches faster. Avoid searches that require LIKE, regexes and so on, because those require linear scans.

  • For unstructured data (especially text), the standard solution is a free-text search engine. This works by constructing an inverted index over the entire corpus, which gives you orders-of-magnitude better performance than doing a linear scan of the files on each search. It can also efficiently handle searches with multiple terms or phrase searches.
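
The idea behind an inverted index can be sketched in a few lines: each term maps to the set of documents containing it, and a multi-term search intersects those sets instead of scanning files. (This is a toy illustration of the principle, not how a real engine like Elasticsearch or Lucene is implemented.)

```python
from collections import defaultdict

def build_index(docs: dict) -> dict:
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index: dict, *terms: str) -> set:
    """Multi-term AND query: intersect the posting sets."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

docs = {1: "annual finance report", 2: "meeting notes", 3: "finance meeting"}
idx = build_index(docs)
print(search(idx, "finance", "meeting"))  # {3}
```

Note that the expensive work (tokenizing every document) happens once, at indexing time; each query afterwards only touches the posting sets for its terms.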

It sounds to me like the free-text search engine approach is what you need. A good free-text search engine will also provide a way to hold and search the file metadata.

Stephen C
  • Thanks, this is an interesting answer. I realized meanwhile that what I really need is a new index. I mean I need to process the text files, derive more metadata, and search only on the structured metadata instead of using full-text search. I guess full-text search works in a similar way, so if the metadata were a simple series of keywords, a full-text search would suffice. In the current case it is not. :-) – inf3rno Sep 25 '16 at 03:17