9

I'm trying to develop a small reporting tool (with a SQLite backend). I can best describe this tool as a "transaction" ledger. What I'm trying to do is keep track of "transactions" from a weekly data extract:

  • "new" (or add) - resource is new to my app since my app may not have tracked this resource before as it hasn't been seen via extracts.
  • "update" (or hit) - there is a recent use of that resource, update retention period by another week.
  • "delete" (or drop) - this item saw no use since last report (optional, but would be nice to have for graphing week-to-week changes in demand for resources).

All I've got is a weekly data extract (pipe-delimited flat file) coming from a legacy archiving/record-management system that I have no control over.

Each line can be distilled to basically this:
resource_id | resource info | customer_id | customer_info

Sample data:

10| Title X       | 1 | Bob
11| Another title | 1 | Bob
10| Title X       | 2 | Alice

The goal is to make it easy to report on resources that haven't seen use for X months (based on the last hit). There is a retention period where resources are kept around for ease of access if they're popular. A resource that hasn't seen use for 18 months is marked for long-term archival elsewhere.

This must be a common problem. Is there a general-purpose algorithm to determine what's new/same/removed between two data sets (DB vs. latest extract)?

Swartz
  • 141
  • 3

7 Answers

1

Well, your answer is... yes. There is a simple algorithm you can implement that doesn't require any of that other stuff: a net present value algorithm. It's easy to implement, and all it requires on the DB end is that you date-stamp the weekly data and write one simple query and one small recursive function or for loop (or you could do one of those other solutions).

NPV = PV - PV * (CP / T), i.e. the Net Present Value equals the Present Value minus the Present Value times the Current Period (months since the last entry) divided by the Term (e.g. 18 months). When the NPV falls to 0, the resource's value is expended.

If you give me a language you want it in, I'll post the code here in an edit.
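
In the meantime, a minimal sketch of that decay rule in C++ (the asker mentioned C++ as an option) might look like the following; the hardcoded resources and their months-since-last-hit values are stand-ins for rows pulled from the weekly extract:

#include <iostream>
#include <string>
#include <vector>

// Decay rule from above: NPV = PV - PV * (CP / T).
// PV is normalized to 1.0 here; CP is months since the last entry,
// T is the retention term (e.g. 18 months). NPV <= 0 means expended.
double net_present_value(double pv, double cp, double t) {
    return pv - pv * (cp / t);
}

struct Resource {
    std::string id;
    double months_since_last_hit;  // CP for this resource
};

int main() {
    const double term = 18.0;  // T: retention term in months
    std::vector<Resource> resources = {{"10", 2.0}, {"11", 19.0}};

    for (const auto& r : resources) {
        double npv = net_present_value(1.0, r.months_since_last_hit, term);
        if (npv <= 0.0)
            std::cout << r.id << ": expended, mark for archival\n";
        else
            std::cout << r.id << ": keep (npv = " << npv << ")\n";
    }
}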

J-Boss
  • 175
  • 6
  • Language isn't that important. Ruby or C++ if I had to choose. If you can write an algorithm in HTML 4.0 Strict you will be my hero. Kidding about that last part :) – Swartz Sep 05 '15 at 07:29
  • Would be interested to see the code. Ruby or C++. Thank you. – Swartz Sep 08 '15 at 08:00
0

If you’re keeping the updates in a SQLite backend anyway, you could load the weekly update into a new (staging) table, compare it to the archived data with queries, and then merge it.

Example of using SQL to find new additions to a table: https://stackoverflow.com/questions/2077807/sql-query-to-return-differences-between-two-tables

If a field in your DB stores the date of the transaction, you could just query all users who have had transactions in the past 18 months. Then the archive is just the full DB. Alternatively, you could query all users who haven’t, extract their data, then drop them. Updates are just any rows timestamped this week.
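
For example, a sketch of that comparison from C++ using the SQLite C API (the resources and extract table names are illustrative; error handling is mostly elided):

#include <cstdio>
#include <sqlite3.h>

// Runs a query and prints the first column of each row under a label.
static void print_ids(sqlite3* db, const char* label, const char* sql) {
    sqlite3_stmt* stmt = nullptr;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK) {
        std::fprintf(stderr, "prepare failed: %s\n", sqlite3_errmsg(db));
        return;
    }
    std::printf("%s:\n", label);
    while (sqlite3_step(stmt) == SQLITE_ROW)
        std::printf("  %s\n", (const char*)sqlite3_column_text(stmt, 0));
    sqlite3_finalize(stmt);
}

int main() {
    sqlite3* db = nullptr;
    sqlite3_open("ledger.db", &db);

    // New: in this week's extract but not tracked yet.
    print_ids(db, "new", "SELECT resource_id FROM extract "
                         "EXCEPT SELECT resource_id FROM resources");
    // Dropped: tracked but absent from this week's extract.
    print_ids(db, "dropped", "SELECT resource_id FROM resources "
                             "EXCEPT SELECT resource_id FROM extract");
    // Hits: present in both; extend their retention by a week.
    print_ids(db, "hits", "SELECT resource_id FROM resources "
                          "INTERSECT SELECT resource_id FROM extract");

    sqlite3_close(db);
}

EXCEPT and INTERSECT are plain SQLite, so the whole diff can stay in SQL before the merge step.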

Davislor
  • 1,513
  • 10
  • 13
  • Better, it's a data-centric solution at least, but it's still overkill – J-Boss Sep 05 '15 at 05:30
  • I'm using SQLite for the time being as it's easy to start with. Could easily switch to MySQL (or PostgreSQL). If using a no-SQL backend would net anything to make this work even better, I'm all ears. – Swartz Sep 05 '15 at 07:31
  • Well, my thinking was mainly that you’re converting it to rows in a database *anyway*. If you don’t need to run it from multiple processes concurrently, I don’t think you want to switch to something more heavyweight than SQLite. – Davislor Sep 05 '15 at 07:35
  • No need for concurrent processing. But I need to store the data about resources somewhere. An SQL db seemed like a good choice; however, there is nothing preventing me from loading data into any datatype for processing deltas. All I want at the end of each extract run is to figure out what's new, what remained the same, and what has disappeared. I can figure out how to update records as necessary from this info. – Swartz Sep 05 '15 at 07:43
  • After you’ve parsed the data and put it into the database, it’s probably simpler to write a query than implement an algorithm. That said, if you do want to code it, the algorithm you want is set difference, and there’s an implementation in the C++ STL that you can use to do it in a single line once you’ve put both sets of data into the container of your choice, probably a `std::vector`. – Davislor Sep 05 '15 at 07:53
0

Alternative idea:

  1. Parse your list of transactions into some kind of data structure, such as an array. (In C++, think `std::vector`, and in Java, `ArrayList`.)

  2. Perform a query on your SQL backend such as SELECT DISTINCT customer_id FROM Transactions ORDER BY customer_id and pack the sorted distinct customer IDs into a set, old. If you do the exact same thing with a WHERE clause separating the old and new transactions, you can skip step 3.

  3. Get the unique customer IDs from the new updates into a separate data structure, new, in sorted order. There are a couple of data structures you could use to get it there: insertion sort into a doubly-linked list is very simple, an intermediate hashtable would run in close to linear time, or, if you’re sorting the original array anyway, getting a set out of that is easy.

  4. Take the set difference new - old using your favorite language’s standard library (see the sketch below). Your favorite language does have this algorithm in its standard library?

The other things you want to do are definitely SQL queries after you’ve updated your transaction database.

Note on step 3: Consider the nature of your data. Suppose that your text file lists orders chronologically, and in a typical week, there are a lot of first-time customers who are given a new customer_id in ascending order. Suppose that most other orders are from a small number of loyal repeat customers, with lower customer_id. Then your inputs are already mostly-sorted. An insertion sort where you try to insert low customer_id at the front of a doubly-linked list and high customer_id at the back would, in that situation, perform well in practice.
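
A sketch of steps 2–4 in C++ using the STL’s set_difference, with hardcoded IDs standing in for the query results (the same approach works for resource IDs):

#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

int main() {
    // Step 2: sorted distinct IDs already in the DB (e.g. from
    // SELECT DISTINCT customer_id ... ORDER BY customer_id).
    std::vector<int> old_ids = {1, 2, 5, 7};
    // Step 3: sorted distinct IDs from this week's extract.
    std::vector<int> new_ids = {1, 2, 3, 7, 9};

    // Step 4: new - old gives first-time IDs; old - new gives dropped
    // ones. Both inputs must be sorted.
    std::vector<int> added, dropped;
    std::set_difference(new_ids.begin(), new_ids.end(),
                        old_ids.begin(), old_ids.end(),
                        std::back_inserter(added));
    std::set_difference(old_ids.begin(), old_ids.end(),
                        new_ids.begin(), new_ids.end(),
                        std::back_inserter(dropped));

    for (int id : added)   std::cout << "new: " << id << '\n';
    for (int id : dropped) std::cout << "dropped: " << id << '\n';
}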

Davislor
  • 1,513
  • 10
  • 13
  • 1
    I'm more interested in the new/same/updated **resources** rather than customers. But yes, the idea would be the same. – Swartz Sep 05 '15 at 19:47
0

As I understand from your question, you actually have a resource_id (+ info) and a "list" of customers (id + info).

So you can easily keep a list of customers per resource and check the last node in each list (in order to know the last operation time; you just need to add a date field to your customer in the code).

I'm not familiar with SQL, therefore I give my example with a HashMap and a List, but I'm sure it's the same idea: HashMap<Resource, List<Customer>>, where Resource contains the resourceID as key and Customer contains the customer ID, info, and operation date.

With this idea you can easily know the last operation time and can modify any resource (add/remove a resource/customer).
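
A rough C++ equivalent of that structure (type and field names are illustrative):

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// The HashMap<Resource, List<Customer>> idea: the resource ID is the
// key, and each customer entry carries an operation date.
struct Customer {
    std::string id;
    std::string info;
    std::string date;  // operation date added in code, e.g. "2015-03-08"
};

int main() {
    std::unordered_map<std::string, std::vector<Customer>> ledger;

    // Appending in extract order keeps the most recent use at the back.
    ledger["10"].push_back({"1", "Bob", "2015-03-01"});
    ledger["10"].push_back({"2", "Alice", "2015-03-08"});

    // Last operation time for resource 10 is the date on the last node.
    std::cout << "last use of 10: " << ledger["10"].back().date << '\n';
}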

AsfK
  • 101
  • 1
0

If you are using a SQLite database and you add the date of the batch as a column of the table,

10| Title X       | 1 | Bob    | 2015-03-01
11| Another title | 1 | Bob    | 2015-03-01
...............................
10| Title X       | 2 | Alice  | 2015-03-05

it would be pretty easy to use SQL to get the resources not used in the last X days:

SELECT r.ResourceID
FROM Resources r
GROUP BY r.ResourceID
HAVING julianday('now') - julianday(MAX(r.DateUpdated)) > X;

I have not tested the SQL but it should give you an idea

Low Flying Pelican
  • 1,602
  • 9
  • 13
0

From the original post, it sounds like the data being ingested does not have a field to indicate the date/time of the transaction, and I presume the file is ingested on a regular schedule such as weekly, daily, or hourly.

I would handle this by adding a SQL timestamp column that is autogenerated either at the database level or by the code that extracts the data and inserts it into the DB. Then you put an index on that timestamp column and are done with it. Let the DB engine do the job of making it efficient to answer questions like "how many transactions haven't happened since this time" or "how many between this time and that time".

Then you schedule a job to query and calculate the differentials that you want to report on. Transactions that are "new" are transactions that don't have any records in the DB prior to the date you are asking "new since". Old records are those that have no transactions since a cut-off date.
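
A sketch of that setup against SQLite from C++ (schema names are illustrative; note that SQLite only allows a CURRENT_TIMESTAMP default in CREATE TABLE, not via ALTER TABLE ... ADD COLUMN):

#include <sqlite3.h>

int main() {
    sqlite3* db = nullptr;
    sqlite3_open("ledger.db", &db);

    // DB-side timestamp, stamped automatically on insert.
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS transactions ("
        "  resource_id TEXT, customer_id TEXT,"
        "  imported_at TEXT DEFAULT CURRENT_TIMESTAMP);",
        nullptr, nullptr, nullptr);

    // Index so date-range questions stay cheap as the table grows.
    sqlite3_exec(db,
        "CREATE INDEX IF NOT EXISTS idx_tx_imported_at "
        "ON transactions(imported_at);",
        nullptr, nullptr, nullptr);

    // The scheduled job is then a plain range query, e.g. resources
    // with no transaction in the last 18 months:
    //   SELECT resource_id FROM transactions
    //   GROUP BY resource_id
    //   HAVING MAX(imported_at) < datetime('now', '-18 months');

    sqlite3_close(db);
}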

Thomas Carlisle
  • 1,293
  • 9
  • 11
-2

Isn't this what HashTables are for? If all you want to do is keep records of which resources have been used in the past months and delete resources that haven't been accessed in the last 18 months, then you can use a HashTable where the key is the resource_id and the value is the last access date.

For archiving the >18-month records, you can go through all of the records in the hash table and just remove (or move) those specific records (you can do this weekly when the report comes in).
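
A minimal sketch of that weekly sweep in C++ (idle time is hardcoded in months for brevity; a real version would store the last access date):

#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    // Key: resource_id. Value: months since last access (a stand-in
    // for a real last-access date).
    std::unordered_map<std::string, int> months_idle = {
        {"10", 2}, {"11", 20}, {"12", 18}};

    // Weekly sweep: remove (or move) anything idle for over 18 months.
    for (auto it = months_idle.begin(); it != months_idle.end();) {
        if (it->second > 18) {
            std::cout << "archive " << it->first << '\n';
            it = months_idle.erase(it);  // erase returns the next iterator
        } else {
            ++it;
        }
    }
}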

Adrian Buzea
  • 273
  • 2
  • 9
  • Why the need for HashTable if I'm storing stuff in the database? I can do updates to db records. I'm more interested in a case: take two data sets, find out the differences (what's added, remains the same, deleted) between the two sets. How would a HashTable technique assist in finding new and "removed" records? – Swartz Aug 26 '15 at 16:57
  • If the tables are indexed in the database then they're basically also HashTables behind the scenes. If you have 2 tables, each representing a data set then you can get your new and removed records by doing some outer joins. See this for reference: http://i.stack.imgur.com/pxUO3.png. Make sure you have indexes on the resource_id column and it should be pretty quick. If you had to implement this from scratch then I think HashTables would still be the way to go as you can do lookup/insertion/deletion in O(1) amortized time. Can't think of a more efficient way to do this. – Adrian Buzea Aug 26 '15 at 20:17
  • 3
    There are better data structures that handle aging without the extra steps of cramming this into a hash table. –  Aug 26 '15 at 20:52
  • Care to mention some ? – Adrian Buzea Aug 27 '15 at 07:28
  • @Snowman - I wish I could up-vote that a few more times; I'll just have to emphatically agree in this comment – J-Boss Sep 05 '15 at 05:05
  • Would love to hear about these other data structures. Pls provide a name or link or something so I can look into these further. ty. – Swartz Sep 05 '15 at 07:35
  • Apparently everyone knows they exist but no one knows what they are. – Adrian Buzea Sep 05 '15 at 10:11
  • There are priority queues that can handle aging: the concept of a priority queue is rather abstract, and it may be backed by one or several data structures. It just provides the semantics of how they work. One concrete example is Java's WeakHashMap which will reclaim old entries as needed. –  Sep 07 '15 at 04:21
  • So it's basically still a hash structure at its core, but optimized for specific scenarios. You said there are better data structures that are NOT hash tables. Am I missing something here? – Adrian Buzea Sep 07 '15 at 14:49