
I need to import 100,000 products from one database to another using the results from a RESTful API. But I can only do this overnight. So if I can't get the job done in one day, I need to spread it out over multiple nights.

I have one endpoint that returns a list of ids for a given set of criteria, and another that returns the full product information for one or more products (I can pass it any number of ids and get back full objects).

I don't want to fetch 100,000 ids and just start iterating with a for-loop. That seems like it could break unexpectedly, and then I wouldn't know where I ended. I need some way to break the job apart and save my progress. Does anyone have any experience with this sort of thing?

The software I'm using: Flask, SQLAlchemy, and Huey (for scheduling).

Colton Allen
    Make a nested `for` loop, and take 100 at a time. – Robert Harvey May 20 '16 at 00:10
  • @RobertHarvey That was my thought as well. If I can't complete the process in time I was just going to write the position to a text file. – Colton Allen May 20 '16 at 00:43
  • Judicious use of logging will also help you to know what completed and where in the cycle you were. – Adam Zuckerman May 20 '16 at 02:56
  • Why don't you paginate the calls to the API? Just persist the last page processed for the next overnight run. The important thing is to reproduce exactly the same criteria on every call. The problem is if the data changes between runs: then the results of each call may or may not line up as expected. – Laiv May 20 '16 at 12:53

1 Answer


If you import the entries one by one, you can tell where the import stopped simply by looking at the destination database (I imagine there is some unique identifier in the destination database that makes it possible to link every product there to a product in the source database, right?).
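For instance, a minimal sketch of that check with SQLAlchemy, assuming a hypothetical destination-side `Product` model whose `source_id` column mirrors the id assigned by the source database:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Product(Base):
    """Hypothetical destination-side model; source_id mirrors
    the id assigned by the source database."""
    __tablename__ = "products"
    id = Column(Integer, primary_key=True)
    source_id = Column(Integer, unique=True, nullable=False, index=True)
    name = Column(String)

engine = create_engine("postgresql:///destination")  # assumed connection URL

def already_imported_ids(session):
    """Return the set of source ids already present in the destination."""
    return {source_id for (source_id,) in session.query(Product.source_id)}
```

Whatever was imported before a crash is exactly what that query returns, so no separate progress file is needed.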

If you do a bulk insert, you may split the 100,000 products into pages. How many pages depends on the performance impact and on how many products you are prepared to lose if a page import is interrupted. A page containing few products will probably lead to a slow import; a page containing too many products will reduce your visibility into the import's progress.
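A minimal sketch of that paging, assuming two hypothetical API endpoints (`/product-ids` for the id list, `/products` for bulk retrieval) matching the ones described in the question:

```python
import requests

API = "https://api.example.com"  # assumed base URL
PAGE_SIZE = 100  # tune: small pages are slow, huge pages lose granularity

def fetch_all_ids(criteria):
    """Hypothetical endpoint returning every product id matching the criteria."""
    return requests.get(f"{API}/product-ids", params=criteria).json()

def fetch_products(ids):
    """Hypothetical bulk endpoint: full objects for the given ids."""
    return requests.get(f"{API}/products",
                        params={"ids": ",".join(map(str, ids))}).json()

def import_in_pages(ids, save_page):
    """Import ids one page at a time; save_page should commit per page."""
    for start in range(0, len(ids), PAGE_SIZE):
        save_page(fetch_products(ids[start:start + PAGE_SIZE]))
```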

Make sure you don't use an external data source, such as a text file, to record the progress. A safer approach is to treat the problem as an incremental operation. At a given moment there are 100,000 products in the source database and 23,600 in the destination: map those 23,600 back to their counterparts in the source, and import only the products that haven't been imported yet.
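Tying it together as a nightly Huey task, here is a sketch that reuses the helpers from the snippets above; the schedule, the cutoff time, and the payload field names are all assumptions:

```python
from datetime import datetime, time
from huey import SqliteHuey, crontab

# Session, engine, Product, already_imported_ids, fetch_all_ids,
# fetch_products, and PAGE_SIZE come from the sketches above.

huey = SqliteHuey(filename="import.db")  # assumed broker; use your existing Huey app

@huey.periodic_task(crontab(minute="0", hour="1"))  # assumed 01:00 start
def nightly_import():
    with Session(engine) as session:
        done = already_imported_ids(session)
        criteria = {"category": "..."}  # placeholder for the real criteria
        remaining = [i for i in fetch_all_ids(criteria) if i not in done]
        for start in range(0, len(remaining), PAGE_SIZE):
            if datetime.now().time() >= time(6, 0):  # assumed 06:00 cutoff
                break                                # resume tomorrow night
            page = fetch_products(remaining[start:start + PAGE_SIZE])
            session.add_all([Product(source_id=p["id"], name=p["name"])
                             for p in page])         # assumed payload fields
            session.commit()                         # progress is durable per page
```

Because the progress lives in the destination database itself, a crash at any point just means the next night's run recomputes the difference and picks up where the last one left off.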

Arseni Mourzenko