
At my company, the time has come to work on a system that produces statistics from data gathered from the database.

How do you efficiently gather statistics from a database in a way that does not add too much overhead to page loads, or too much complexity to cache management?

Currently, statistics are calculated at run-time and no data is saved or cached. The problem is that as new statistics are added, each also calculated at run-time, I'll reach a point where the website becomes really slow, which is not acceptable.

The only idea I've come up with to solve this is to cache results whose date filters end before the day on which they are calculated.

For example, let's say I'd like to know whether a user visited a specific page between 2017-01-01 and 2017-01-08. Since today is 2017-01-12, that result can never change, because the selected range lies entirely in the past.
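That idea maps directly onto Laravel's cache: if the requested range ends before today, the result is immutable and can be cached forever; otherwise it should be recomputed, or cached only briefly. A minimal sketch, assuming a hypothetical `computeVisits()` helper for the actual query, and using the Laravel 4.x `Cache` facade (where `remember()` takes a TTL in minutes):

```php
// Illustrative sketch: computeVisits() stands in for whatever query
// actually counts the visits; Cache is Laravel's cache facade.
function visitsBetween($userId, \Carbon\Carbon $start, \Carbon\Carbon $end)
{
    $key = "stats:visits:{$userId}:{$start->toDateString()}:{$end->toDateString()}";

    if ($end->lt(\Carbon\Carbon::today())) {
        // The whole range is in the past: the result can never change,
        // so it is safe to cache it forever.
        return \Cache::rememberForever($key, function () use ($userId, $start, $end) {
            return computeVisits($userId, $start, $end);
        });
    }

    // The range includes today: cache only briefly (10 minutes here).
    return \Cache::remember($key, 10, function () use ($userId, $start, $end) {
        return computeVisits($userId, $start, $end);
    });
}
```

The interesting property is that the expensive "cache forever" branch is self-selecting: only queries that provably cannot change ever take it.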

This is an example of how I calculate statistics in Laravel (4.x):

namespace App\Composers\Users;

use Illuminate\Support\Collection;
use User;

class ShowComposer
{
    public function compose($view)
    {
        $viewData = $view->getData();

        $view->with([
            'sellings'    => $this->getSellingStatistics($viewData['user'])
        ]);
    }

    public function getSellingStatistics(User $user)
    {
        $sellings = [];

        // Counts a user's sold MAVs for the month $months months ago.
        $getSellingsOf = function (User $user, $months) {
            $startOfMonth = \Carbon::now()->subMonths($months)->startOfMonth();
            $endOfMonth   = \Carbon::now()->subMonths($months)->endOfMonth();

            return $user
                ->mavs()
                ->whereHas('buyerProposal', function ($proposal) use ($startOfMonth, $endOfMonth) {
                    $proposal->whereBetween('sold_at', [
                        $startOfMonth, $endOfMonth
                    ]);
                })
                ->count();
        };

        // Monthly counts of the best seller over the last 12 months.
        $sellings['best'] = value(function () use ($getSellingsOf) {
            $months = [];

            for ($month = 0; $month < 12; $month++) {
                $startOfMonth = \Carbon::now()->subMonths($month)->startOfMonth();
                $endOfMonth   = \Carbon::now()->subMonths($month)->endOfMonth();

                $query = <<<SQL
SELECT id,
    (SELECT COUNT(*)
     FROM `mav`
     INNER JOIN `mav_proposals` ON `mav`.`mav_proposal_id` = `mav_proposals`.`id`
     WHERE `sold_at` BETWEEN ? AND ?
       AND `mav`.`user_id` = `users`.`id`) AS sellings
FROM users
ORDER BY sellings DESC
LIMIT 1
SQL;

                $response = \DB::select($query, [
                    $startOfMonth->toDateTimeString(),
                    $endOfMonth->toDateTimeString()
                ]);

                $user = User::find($response[0]->id);

                $months[] = $getSellingsOf($user, $month);
            }

            $months = array_reverse($months);

            return $months;
        });

        // The current user's own monthly counts.
        $sellings['personal'] = value(function () use ($user, $getSellingsOf) {
            $months = [];

            for ($month = 0; $month < 12; $month++) {
                $months[] = $getSellingsOf($user, $month);
            }

            $months = array_reverse($months);

            return $months;
        });

        // Company-wide average: sold MAVs per user with at least one sale, per month.
        $sellings['global'] = value(function () use ($user) {
            $months = [];

            for ($month = 0; $month < 12; $month++) {
                $startOfMonth = \Carbon::now()->subMonths($month)->startOfMonth();
                $endOfMonth   = \Carbon::now()->subMonths($month)->endOfMonth();

                $companySoldMavs = \App\Models\MAV::whereHas('buyerProposal',
                    function ($proposal) use ($startOfMonth, $endOfMonth) {
                        $proposal->whereBetween('sold_at', [
                            $startOfMonth, $endOfMonth
                        ]);
                    })->count();

                $usersWithSoldMavs = \User::whereHas('mavs', function ($mav) use ($startOfMonth, $endOfMonth) {
                    $mav->whereHas('buyerProposal', function ($proposal) use ($startOfMonth, $endOfMonth) {
                        $proposal->whereBetween('sold_at', [
                            $startOfMonth, $endOfMonth
                        ]);
                    });
                })->count();

                $months[] = ($usersWithSoldMavs > 0)
                    ? round($companySoldMavs / $usersWithSoldMavs)
                    : 0;
            }

            $months = array_reverse($months);

            return $months;
        });

        return $sellings;
    }
}

Now, here are the only two options I have thought of:

  • Calculate statistics every 24 hours and save them in a database.
  • Cache data based on the parameters used to gather the statistics.
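The second option boils down to deriving a deterministic cache key from the statistic's name and its parameters, so identical requests hit the same cache entry. A minimal sketch in plain PHP (the helper name is mine):

```php
// Build a stable cache key from a statistic name and its parameters.
// ksort() makes the key independent of parameter order, so
// ['from' => ..., 'to' => ...] and ['to' => ..., 'from' => ...]
// deliberately produce the same key.
function statsCacheKey($name, array $params)
{
    ksort($params);
    return 'stats:' . $name . ':' . md5(serialize($params));
}
```

Combined with the date-range rule above (immutable ranges cached forever, open ranges cached briefly), most of the cache-invalidation headache disappears, because immutable entries never need invalidating.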

The first option is quite complicated and takes a lot of time to develop properly.

The second option could be interesting; however, I am afraid the cache is going to give me headaches sooner or later.

Is there another way to do this efficiently? How do enterprises approach data mining? Are languages like R always used in these cases, or can PHP be fine if used properly?

It's a new world for me, please be kind.

GiamPy
  • Instead of generating and calculating the statistics on the production system, you could do this work on a slave (backup) system that is updated once a day. That way the production server wouldn't suffer from possible performance issues. – KhorneHoly Jan 12 '17 at 15:24
  • What do you mean by "your statistics could work on this server"? The issue isn't the calculation itself of whatever statistic I'd like to mine; it's more of a logistical issue. How do you perform these calculations? Where is this data stored? How do you deal with caching and statistics in specific date ranges? Do you mine ALL the database prior to researching? I am confused. Is MySQL even suitable for this purpose? – GiamPy Jan 12 '17 at 14:56
  • Question migrated from SO to SE. – GiamPy Jan 16 '17 at 11:35
  • It is hard to give a decent answer without knowing a database schema. Also, is your example relevant to the code you gave? – nerdlyist Jan 20 '17 at 20:55
  • @nerdlyist Yes. Is a complete database schema necessary? – GiamPy Jan 21 '17 at 01:31

1 Answer


There are two problems here.

First, storing results (in the database, in a cache, etc.) so that the calculations do not have to be redone every time a page is loaded.

Second, you need a mechanism to redo all calculations: you may find a flaw in an old calculation, old data may be updated or purged, or new calculations may be added.

The best way to do this is to add tables to the database that contain the calculations up to a given point in time (e.g. yesterday at midnight). As new calculations are added, you'll need to update those tables or add new ones.

Whenever you need to, you can empty the tables and run a script that redoes the calculations. This could very well be a script run automatically each night.

But you will also need to be able to clear the tables and run the script manually.
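In the asker's Laravel 4 setup, that script could be an Artisan command that empties and repopulates a summary table, run nightly from cron and runnable by hand when needed. A rough sketch; the `monthly_sellings` table and its columns are illustrative, while `mav` and `mav_proposals` come from the question's own SQL:

```php
use Illuminate\Console\Command;

// Hypothetical command: rebuilds the pre-aggregated statistics tables.
class RebuildStatistics extends Command
{
    protected $name        = 'stats:rebuild';
    protected $description = 'Empty and recompute the pre-aggregated statistics tables.';

    public function fire()
    {
        \DB::transaction(function () {
            // delete() rather than truncate(): TRUNCATE would implicitly
            // commit the surrounding MySQL transaction.
            \DB::table('monthly_sellings')->delete();

            // Re-aggregate per-user, per-month sales from the raw data,
            // up to (but excluding) today.
            \DB::insert('
                INSERT INTO monthly_sellings (user_id, month, sellings)
                SELECT mav.user_id,
                       DATE_FORMAT(p.sold_at, "%Y-%m-01"),
                       COUNT(*)
                FROM mav
                INNER JOIN mav_proposals p ON mav.mav_proposal_id = p.id
                WHERE p.sold_at < CURDATE()
                GROUP BY mav.user_id, DATE_FORMAT(p.sold_at, "%Y-%m-01")
            ');
        });
    }
}
```

A nightly cron entry running `php artisan stats:rebuild` covers the scheduled case, and invoking the same command manually covers the "redo everything" case.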

One important note is that the statistics tables are not meant to hold authoritative data; everything in them must be easy to regenerate.

The idea that old data never changes is a dangerous assumption. A data source may later turn out to be unreliable, and there are many other reasons old data can become obsolete. Designing a system to rely on old data never changing is designing for failure; design so that you don't have to rely on it.

Bent
  • That is what I think Piwik is doing, and it does not slow down... so the way to go is to never calculate/sum up twice, which is sort of like doing a lot of caching. – snitch182 Jan 16 '17 at 12:54