3

I have a program in Microsoft Access that processes text and also inserts data into a MySQL database. This operation takes 30 minutes or less to finish.

I translated it into VB.NET, and now it takes 2 hours to finish.

The program goes like this:

  1. A text file contains the individual swipes of each person: their ID, the date and time of the swipe on the machine, and an indicator of whether it is a time-in or a time-out.
  2. I parse this text, separate the fields, and insert the time-in and time-out as individual rows.
  3. I also check whether there are duplicate records in the database.
  4. After checking, I merge the time-in and time-out of the corresponding person into a single row.

This process takes 2 hours to finish in VB.NET, given that the table I compare against contains 600,000+ rows.
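
For reference, the per-line logic is roughly as follows (a minimal sketch only: the comma-delimited layout, the class and field names, and the assumption of one time-in and one time-out per person per day are all illustrative, since the actual file format and code are not shown):

```vb
' Sketch of steps 1, 2 and 4. Assumes a comma-delimited line laid out
' as: person id, swipe date/time, IN/OUT flag. All names are illustrative.
Imports System
Imports System.Collections.Generic
Imports System.Linq

Public Class SwipeRecord
    Public Property PersonId As String
    Public Property SwipeTime As DateTime
    Public Property IsTimeIn As Boolean
End Class

Public Class MergedRow
    Public Property PersonId As String
    Public Property TimeIn As DateTime
    Public Property TimeOut As DateTime
End Class

Module SwipeSketch
    Function ParseSwipe(line As String) As SwipeRecord
        Dim parts = line.Split(","c)
        Return New SwipeRecord With {
            .PersonId = parts(0).Trim(),
            .SwipeTime = DateTime.Parse(parts(1).Trim()),
            .IsTimeIn = (parts(2).Trim().ToUpperInvariant() = "IN")
        }
    End Function

    ' Step 4: merge each person's swipes for one day into a single row.
    ' Assumes every person has at least one IN and one OUT swipe per day.
    Function MergeDays(records As IEnumerable(Of SwipeRecord)) As IEnumerable(Of MergedRow)
        Return records.
            GroupBy(Function(r) New With {Key r.PersonId, Key .Day = r.SwipeTime.Date}).
            Select(Function(g) New MergedRow With {
                .PersonId = g.Key.PersonId,
                .TimeIn = g.Where(Function(r) r.IsTimeIn).Min(Function(r) r.SwipeTime),
                .TimeOut = g.Where(Function(r) Not r.IsTimeIn).Max(Function(r) r.SwipeTime)
            })
    End Function
End Module
```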

Now, I have read on the internet that Python is best at text processing, and I have already run a test, but I have doubts about its database performance.

What do you think is the best programming language for this kind of problem?

How can I speed up the process? My first idea was to use Python instead of VB.NET, but since people here on PSE are telling me that this most probably won't help, I am searching for different solutions.

Here is a link to the test I've done.

mmr
  • 131
  • 5
  • 1
    Your original question was likely to be closed soon due to the FAQ of this site (language comparisons are off-topic!), so I changed it in a way to ask for you what you really wanted to know. – Doc Brown Nov 05 '13 at 07:08
  • Are your text file contents *additions* to the database, or an accumulating list? This very much determines the number of select/update/insert queries that your app performs. Adding numbers to your question may help. – Jan Doggen Nov 05 '13 at 08:24
  • @JanDoggen, it is an addition to the database – mmr Nov 05 '13 at 09:18
  • "I also check if there are double occurrences in the database." Do you mean you are checking for duplicates of the row you just inserted? How are you doing this checking? This could potentially be a massive bottleneck. – Evicatos Nov 05 '13 at 23:55

2 Answers

3

I find it very unlikely that changing the programming language will get you any further. So apply the topmost rule of performance optimization:

Find the bottleneck by measuring the running time of the individual steps!

It may be the text processing, the DB operations, or both. You did not show us any code, but in situations like this the DB operations would be my first suspect.

For example, you could comment out the DB operations, leave only the text processing active, and then measure and compare the running times.
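
A minimal sketch of that kind of measurement using a `Stopwatch`; `ProcessText` and `WriteToDatabase` are placeholders for the program's real routines, not the actual code:

```vb
' Sketch: time the text-processing phase and the DB phase separately.
' ProcessText and WriteToDatabase are placeholders, not the real code.
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module TimingSketch
    Sub Main()
        Dim sw = Diagnostics.Stopwatch.StartNew()
        Dim records = ProcessText("swipes.txt")   ' text processing only
        Console.WriteLine("Text processing: {0} ms", sw.ElapsedMilliseconds)

        sw.Restart()
        WriteToDatabase(records)                  ' comment out to isolate parsing
        Console.WriteLine("DB operations:   {0} ms", sw.ElapsedMilliseconds)
    End Sub

    Function ProcessText(path As String) As List(Of String)
        ' Placeholder: the real parsing/segregation logic goes here.
        Return IO.File.ReadAllLines(path).ToList()
    End Function

    Sub WriteToDatabase(records As List(Of String))
        ' Placeholder: the real duplicate check and inserts go here.
    End Sub
End Module
```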

Doc Brown
  • 199,015
  • 33
  • 367
  • 565
  • Primary suspect: insert queries. If you have many and your database allows it, do bulk inserts from a text file. Reading a text file, then writing a text file, and then bulk-inserting that may seem like 'extra work', but with many RDBMSs it is much faster than a large number of individual inserts (see the sketch after these comments). – Jan Doggen Nov 05 '13 at 08:26
  • Before inserting into the database, I first check whether the record is a duplicate... I have already removed this method; could you give me an idea of the best method for checking duplicates? – mmr Nov 05 '13 at 09:25
  • @mmr: come on, if you want help with this question, tell us **what you have measured for the individual parts of your code**, and show us the details of the **related code of the bottleneck** (after you have spotted exactly where it is). Currently it looks like you are still refusing to provide any helpful information, leaving us all in the dark. – Doc Brown Nov 05 '13 at 21:13
  • @DocBrown please see above link – mmr Nov 06 '13 at 04:35
  • 1
    @mmr: The code has more than 1200 lines; seriously, you don't expect me to review that for you? The hard part is yours: identify the parts which need the most running time. Either add some time-measuring code yourself around the functions in question, or use a profiler like "Slim Tune" (http://code.google.com/p/slimtune/). Then, when you have spotted the problematic part and you don't know why it is slow, come back and present the (<10) lines of code on www.stackoverflow.com. Then you will surely get help. – Doc Brown Nov 06 '13 at 07:40
  • @DocBrown: no, I used dotTrace to profile my application; I attached the results as well as the code for more info. Anyway, thanks for pointing out SlimTune. – mmr Nov 06 '13 at 08:46
  • @mmr: I am not going to interpret the dottrace logs for you, either. – Doc Brown Nov 06 '13 at 14:12
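
Picking up Jan Doggen's bulk-insert suggestion above, here is a minimal sketch using Connector/NET's `MySqlBulkLoader`. The table name and file layout are hypothetical, and the server must permit `LOCAL INFILE`:

```vb
' Sketch of the bulk-insert idea: write the parsed rows to a temporary
' CSV file, then load them in one statement instead of one INSERT per
' row. Assumes MySQL Connector/NET (MySql.Data); "swipes" and the file
' layout are hypothetical.
Imports MySql.Data.MySqlClient

Module BulkLoadSketch
    Sub LoadSwipes(conn As MySqlConnection, tempCsvFile As String)
        ' conn is assumed to be open already.
        Dim loader As New MySqlBulkLoader(conn) With {
            .TableName = "swipes",
            .FileName = tempCsvFile,
            .FieldTerminator = ",",
            .LineTerminator = vbLf,
            .Local = True
        }
        Dim rows As Integer = loader.Load()
        Console.WriteLine("{0} rows bulk-loaded", rows)
    End Sub
End Module
```
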
0

I agree with Doc Brown that you should first try to find the bottleneck in your application; otherwise we are all just hypothesizing, and there is no real use in that. Anything could be the suspect.

The first thing you could do is measure the duration of:

  • the entire process, consisting of reading a line from the file, checking for duplicates, and inserting a row
  • the entire process again, but for the case where you find a duplicate and therefore do not insert a row

Also, take measurements of the individual operations. How long does it take, on average, to read a line from the file? How long does it take to check for duplicates? How long does it take to insert a record? Does the duration keep growing with the size of the table?

The only way to answer these questions is to get the numbers.

Finally, I think there are several possible optimizations, depending on where the bottleneck is. For example, you could:

  • read more than one line at a time, say 10, and dispatch the processing of these lines to 10 separate threads.
  • more generally, separate the process of reading the file from writing to the database by using the publish/subscribe pattern.
  • optimize indexes in your database table.
  • instead of querying for duplicates on every line, store the entire table's keys in a map in memory before you start processing the file (see the sketch below). This could take a few seconds (and some memory), but it saves one IO round trip per line.
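
A minimal sketch of that last point, using a `HashSet` of composite keys as the in-memory map; the table and column names are assumptions:

```vb
' Sketch of the in-memory duplicate check: load all existing keys once,
' then test each parsed line against the set instead of issuing one
' SELECT per line. Table and column names are hypothetical.
Imports System
Imports System.Collections.Generic
Imports MySql.Data.MySqlClient

Module DuplicateCheckSketch
    Function LoadExistingKeys(conn As MySqlConnection) As HashSet(Of String)
        Dim keys As New HashSet(Of String)()
        Using cmd As New MySqlCommand(
                "SELECT CONCAT(person_id, '|', swipe_time) FROM swipes", conn)
            Using reader = cmd.ExecuteReader()
                While reader.Read()
                    keys.Add(reader.GetString(0))
                End While
            End Using
        End Using
        Return keys
    End Function

    ' Add returns False when the key is already present, so duplicates
    ' are skipped without any database round trip.
    Function IsNewSwipe(keys As HashSet(Of String), personId As String,
                        swipeTime As DateTime) As Boolean
        Return keys.Add(personId & "|" & swipeTime.ToString("yyyy-MM-dd HH:mm:ss"))
    End Function
End Module
```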

To conclude, I think you are right about not changing your language; the language is not the problem.

Jalayn
  • 9,789
  • 4
  • 39
  • 58