
I have a lot of data. Objects with other objects nested like this (JSON):

{
  id: 12,
  Code: "a code",
  country: "at",
  prices:
  [
     { price: 100, from: "2017-05-05", to: "2017-06-05" },
     { price: 150, from: "2017-05-06", to: "2017-06-06" },
     { price: 200, from: "2017-05-07", to: "2017-06-07" }
  ]
}

etc. I have about 60k of these main/parent objects. The prices are about 3 million objects.

Current situation

I have made a relational Azure SQL DB with tables for the objects and a price table with a foreign key, and the same for a country table, etc. I have no issues with relational databases. I have tried both an S3 instance of an Azure SQL DB and a LocalDB; neither is fast enough. To insert or update, I use stored procedures, one per table. So to store just one object, I execute a stored proc that finds out whether it needs an insert or an update and stores the main/parent object, returns the (new) ID which I set as the foreign key for the price objects, stores the roughly 100 price objects, and moves on to the next. It takes over an hour to store 100 main/parent objects with their related prices.

The solutions I am considering

  1. upscale to Azure SQL Premium and scale back to S0 (Standard) when I am done (docs); a sketch of this is shown below the list
  2. move to in-memory (which does not have foreign keys...)
  3. move to NoSQL (Cosmos DB)
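
For option 1, a rough sketch of what the scale-up/scale-down could look like in T-SQL (the database name and tier names are placeholders; the same change can also be made from the portal or PowerShell):

-- scale up before the bulk load (placeholder database name and tier)
ALTER DATABASE [MyDatabase] MODIFY (EDITION = 'Premium', SERVICE_OBJECTIVE = 'P2');

-- ... run the import ...

-- scale back down when finished
ALTER DATABASE [MyDatabase] MODIFY (EDITION = 'Standard', SERVICE_OBJECTIVE = 'S0');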

I tried solution 3 and came up with this:

I have made a proof of concept storing the objects as JSON in Cosmos DB (it was still DocumentDB back then) and it was fast. However, querying it was hard because you cannot do a subquery, AVG, or SUM: https://stackoverflow.com/questions/43521000/documentdb-get-min-of-subarray

Or should I store the price rows as separate documents and have a 'foreign key' to the main/parent object?
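
Roughly, each price would then become its own small document pointing back to its parent, following the JSON shape above (the parentId field name is just an assumption for illustration):

{ id: "12-1", parentId: 12, price: 100, from: "2017-05-05", to: "2017-06-05" }
{ id: "12-2", parentId: 12, price: 150, from: "2017-05-06", to: "2017-06-06" }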

Should I dig in more in option 3, or would you recommend 1 or 2?

Edit for Ewan: my stored procs. For the main/parent object:

ALTER PROCEDURE [dbo].[InsertUpdateParent]
(
    @code nvarchar(max),
    @name nvarchar(max),
    @lat float,
    @lng float,
    @country int,
    @countryCode nvarchar(2)
)
AS
BEGIN
DECLARE @id int;
DECLARE @err int;
DECLARE @newObject bit;
    SET NOCOUNT ON

    IF (@country = 0)
    begin
         select @country = id from countries where code = @countrycode;
         SELECT @err = @@error;
    end
    IF EXISTS (SELECT ID FROM ParentObject where code = @code)
    begin
        SELECT @id = ID FROM ParentObject where code = @code;
        SELECT @err = @@error;
        SELECT @newObject = 0;
        UPDATE ParentObject SET 
            [Name] = @name,
            Latitude = @lat,
            Longitude = @lng,
            CountryId = @country
        WHERE code = @code;
        SELECT @err = @@error;
    end
    else
    begin
        SELECT @newObject = 1;
        INSERT INTO [dbo].[ParentObject]
           ([Code]
           ,[Name]
           ,[Latitude]
           ,[Longitude]
           ,[CountryId])
        VALUES
           (@code, @name, @lat, @lng, @country);
            select @id = SCOPE_IDENTITY();
            SELECT @err = @@error;
    end
    IF @err <> 0
        PRINT '@err is ' + ltrim(str(@err)) + '.' 
    else
    return @id;
END
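
For reference, a hypothetical call site (the values are invented for illustration) that captures the id the proc returns:

DECLARE @parentId int;

EXEC @parentId = [dbo].[InsertUpdateParent]
     @code = N'a code',
     @name = N'a name',
     @lat = 47.07,
     @lng = 15.44,
     @country = 0,
     @countryCode = N'at';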

and for prices:

ALTER PROCEDURE [dbo].[InsertUpdatePrices]
(
    @from as datetime,
    @to as datetime,
    @parentId as int,
    @price as decimal(18,2)
)
AS
BEGIN
    declare @id int;
    SET NOCOUNT ON

    IF EXISTS (SELECT ID FROM prices where ArrivalDateFrom = @from and ArrivalDateTo= @to and ParentObjectId = @parentId )
    begin

        SELECT @id = ID FROM prices where ArrivalDateFrom = @from and ArrivalDateTo= @to and ParentObjectId = @parentId;

            UPDATE [dbo].[Prices]
            SET 
                [ArrivalDateFrom] = @from,
                [ArrivalDateTo] = @to,
                [Price] = @price,
                [ParentObjectId] = @parentId
            WHERE ID = @id

    end
    else
    begin

        INSERT INTO [dbo].[Prices]
               ([ArrivalDateFrom]
               ,[ArrivalDateTo]
               ,[Price]
               ,[ParentObjectId])
         VALUES
               (@from
               ,@to
               ,@price
               ,@parentId);
        select @id = SCOPE_IDENTITY();
    end
    return @id;
END
JP Hellemons
  • Why is it so slow? Can you post the SQL? – Ewan Jul 18 '17 at 12:07
  • I have added the stored proc code. – JP Hellemons Jul 18 '17 at 12:29
  • I'd focus more on why the SQL is slow rather than muck around with anything else. Since your code works (although slow) you might want to try moving this question to [CodeReview](https://codereview.stackexchange.com/) or [Database Administrators](https://dba.stackexchange.com/). Be sure to read their help and on-topic sections first. – Dan Pichelman Jul 18 '17 at 13:08
  • Sounds like a sql problem, but JSON is not an efficient serialization format. – Frank Hileman Jul 18 '17 at 18:41

2 Answers


OK. I think the slowness is mainly a factor of your SQL, which can be streamlined.

First, examine whether you really need the database-generated id rather than a GUID, or, in the case of your parent object, the 'code' parameter.

Second, instead of insert-or-update, try delete-and-insert, i.e. Delete where id=@id; Insert into...

This removes the selects that check for existence and also allows a much more streamlined sub-object update, i.e.:

Delete from Prices where ParentObjectId = @parentId

Insert into Prices (ArrivalDateFrom, ArrivalDateTo, Price, ParentObjectId) values (@from1, @to1, @price1, @parentId)
Insert into Prices (ArrivalDateFrom, ArrivalDateTo, Price, ParentObjectId) values (@from2, @to2, @price2, @parentId)
Insert into Prices (ArrivalDateFrom, ArrivalDateTo, Price, ParentObjectId) values (@from3, @to3, @price3, @parentId)

Removing the selects from your process will prevent table locks and make it easier to run things in parallel.

I think you should see a massive increase in speed with these changes.

I would lose the sprocs and do it all in code. The sprocs won't add much speed, and they prevent you from optimising based on what you already know, e.g. "I've deleted all the sub-objects, so I know it's OK to insert without checking", etc. Also, you can pick when you open and close your transactions and possibly batch up multiple inserts, as sketched below.
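
A minimal sketch of what one such batched round trip could look like, using the Prices table from the question (the literal values are placeholders):

BEGIN TRANSACTION;

DELETE FROM dbo.Prices WHERE ParentObjectId = @parentId;

-- one statement inserts many rows; SQL Server accepts up to 1000 rows per VALUES list
INSERT INTO dbo.Prices (ArrivalDateFrom, ArrivalDateTo, Price, ParentObjectId)
VALUES ('2017-05-05', '2017-06-05', 100, @parentId),
       ('2017-05-06', '2017-06-06', 150, @parentId),
       ('2017-05-07', '2017-06-07', 200, @parentId);

COMMIT TRANSACTION;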

Moving to a NoSQL db would have the same effect, in that you'll only be checking the primary key of the parent object before overwriting it.

However, your current relational structure enables you to select parent objects based on the attributes of their children, e.g. "get me products that cost between x and y today".
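
For example, a sketch of that kind of query against the tables from the question (@minPrice and @maxPrice are placeholders):

SELECT DISTINCT p.ID, p.Code, p.[Name]
FROM dbo.ParentObject AS p
JOIN dbo.Prices AS pr ON pr.ParentObjectId = p.ID
WHERE pr.Price BETWEEN @minPrice AND @maxPrice
  AND CAST(GETDATE() AS date) BETWEEN pr.ArrivalDateFrom AND pr.ArrivalDateTo;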

Ewan
  • Will test this and I will mark this as solved if it speeds up a lot. – JP Hellemons Jul 18 '17 at 13:15
  • Also, your program could cache the countryIds, knocking out another select. – Ewan Jul 18 '17 at 13:17
  • Do you mean this for batching up multiple inserts: https://stackoverflow.com/a/452934/169714, per 1000 price objects? – JP Hellemons Jul 18 '17 at 13:30
  • I didn't, but yes, that could be a further optimisation. – Ewan Jul 18 '17 at 13:31
  • I have a foreach with 10,000 iterations wrapped around `var task = Task.Run(() => { /* the sql */ })` and the transactions gave me deadlocks. So I am considering adding NOLOCK to the deletes and inserts, because the outer loop makes it parent-object specific; all the SQL is also parent-object specific, so there are no collisions there. Or other recommendations? – JP Hellemons Jul 24 '17 at 07:58
  • Don't try to parallelize it. Take out the Task.Run. – Ewan Jul 24 '17 at 08:05
  • You shouldn't get deadlocks though. Sounds like you still have some selects in there somewhere. – Ewan Jul 24 '17 at 08:06
  • But running all this SQL as async seems to speed it up a lot. Having it wrapped in Task.Run gave me exceptions on the transactions. Tried now with `IsolationLevel.ReadUncommitted`. Maybe I should migrate my issue to Stack Overflow and fully focus on the SQL performance combined with my C# code. – JP Hellemons Jul 24 '17 at 08:10
  • They all run on the same database. You are overcomplicating the problem; 10k inserts should run in sub-second time. – Ewan Jul 24 '17 at 08:12
  • Thank you @Ewan for your time and effort. You helped me the most. Will mark it as solved. I continue here with my disaster code :) https://stackoverflow.com/questions/45275729/storing-a-lot-of-data-fast-in-sql-server-using-async-caused-deadlock-how-to-sp – JP Hellemons Jul 24 '17 at 08:31

It really depends on your data and how much of it you have. A very common approach these days is to have a hybrid approach. The SQL database is the store of record, and then something like ElasticSearch or Solr is populated from that data. Both of those use a Lucene index under the covers, and it basically means you have very powerful search capabilities.

With NoSQL you lose ACID, and have to live with BASE (ref). This is the price you pay for the ability to scale out. If you need ACID transactions, then optimize your SQL.

Having done the transition from SQL to NoSQL (ElasticSearch), I can say from experience that while you gain an order of magnitude of speed, you lose the safety of SQL queries. Because SQL is a standard that's been around for a long time, you can be relatively sure that your SQL you wrote on an old version of a server will continue to work on a newer version. You don't have that confidence with NoSQL databases (ElasticSearch had some big changes from version 1.x to 2.x that caused me a lot of rework).

One thing that can help with SQL inserts is to use batch processing. Basically what this does is turn off row by row indexing, and do it all at once when you commit the transaction. That saves a lot of time. Unfortunately the syntax for batch processing is a bit different for each SQL database.
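
On SQL Server / Azure SQL, one common way to batch is a table-valued parameter, so a whole set of rows arrives in a single call and is inserted with one set-based statement. A minimal sketch against the Prices table from the question (the type and procedure names are invented for illustration):

-- a user-defined table type for passing many price rows in one call
CREATE TYPE dbo.PriceTableType AS TABLE
(
    ArrivalDateFrom datetime NOT NULL,
    ArrivalDateTo   datetime NOT NULL,
    Price           decimal(18,2) NOT NULL,
    ParentObjectId  int NOT NULL
);
GO

-- a proc that inserts the whole batch with one set-based statement
CREATE PROCEDURE dbo.InsertPriceBatch (@prices dbo.PriceTableType READONLY)
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.Prices (ArrivalDateFrom, ArrivalDateTo, Price, ParentObjectId)
    SELECT ArrivalDateFrom, ArrivalDateTo, Price, ParentObjectId
    FROM @prices;
END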

Berin Loritsch
  • I have read https://docs.microsoft.com/en-us/azure/sql-database/sql-database-use-batching-to-improve-performance#batching-strategies and I should wrap my foreach in a transaction to make the log writing/flushing faster and gain more speed. @Ewan gave me some good hints, but thanks for explaining the nosql option(s) to me! – JP Hellemons Jul 18 '17 at 13:57