Best way to pre-aggregate time-series data in postgres - sql

I have a table of sent alerts as below:
id | user_id | sent_at
1 | 123 | 01/01/2020 12:09:39
2 | 452 | 04/01/2020 02:39:50
3 | 264 | 11/01/2020 05:09:39
4 | 123 | 16/01/2020 11:09:39
5 | 452 | 22/01/2020 16:09:39
Alerts are sparse and I have around 100 million user_ids. The table has roughly 500 million rows in total (covering the last 2 months).
I want to query the number of alerts per user in the last X hours/days/weeks/months for 10 million user_ids (saved in another table). I cannot use an external time-series database; it has to be done in Postgres only.
I tried keeping hourly buckets for each user, but the data is so sparse that I end up with too many rows (user_ids * hours). For example, getting the alert count for 10 million users over the last 10 hours takes a long time against this table:
user_id | hour | count
123 | 01/01/2020 12:00:00 | 2
123 | 01/01/2020 10:00:00 | 1
234 | 11/01/2020 12:00:00 | 1
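For reference, a minimal sketch of how such a bucket table could be built and queried, assuming the raw table is named alerts, the bucket table alert_counts_hourly, and the 10 million saved user_ids live in target_users (all assumed names):
-- Pre-aggregated hourly buckets.
CREATE TABLE alert_counts_hourly (
    user_id bigint      NOT NULL,
    hour    timestamptz NOT NULL,
    count   integer     NOT NULL,
    PRIMARY KEY (user_id, hour)
);
-- Populate the buckets by truncating sent_at to the hour.
INSERT INTO alert_counts_hourly (user_id, hour, count)
SELECT user_id, date_trunc('hour', sent_at), count(*)
FROM alerts
GROUP BY user_id, date_trunc('hour', sent_at);
-- The "alerts in the last 10 hours for the saved users" query against this layout.
SELECT b.user_id, sum(b.count) AS alerts
FROM alert_counts_hourly b
JOIN target_users t USING (user_id)
WHERE b.hour >= date_trunc('hour', now()) - interval '10 hours'
GROUP BY b.user_id;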

There are not many alerts per user, so an index on (user_id) should be sufficient.
However, you might as well include the time, so I would recommend (user_id, sent_at). This covers the WHERE clause of your query. Postgres will still need to visit the table's heap pages to check row visibility, unless the visibility map allows an index-only scan.
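As a concrete illustration of that recommendation (the table names alerts and target_users are assumptions, as above):
-- Composite index covering the equality on user_id and the range on sent_at.
CREATE INDEX idx_alerts_user_sent ON alerts (user_id, sent_at);
-- Alerts per user over the last 10 hours, restricted to the saved user_ids.
SELECT a.user_id, count(*) AS alerts
FROM alerts a
JOIN target_users t USING (user_id)
WHERE a.sent_at >= now() - interval '10 hours'
GROUP BY a.user_id;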

Related

How to use groupby and nth_value at the same time in pyspark?

So, I have a dataset with some repeated data, which I need to remove. For some reason, the data I need is always in the middle:
--> df_apps
DATE | APP | DOWNLOADS | ACTIVE_USERS
______________________________________________________
2021-01-10 | FACEBOOK | 1000 | 5000
2021-01-10 | FACEBOOK | 20000 | 900000
2021-02-10 | FACEBOOK | 9000 | 72000
2021-01-11 | FACEBOOK | 4000 | 2000
2021-01-11 | FACEBOOK | 40000 | 85000
2021-02-11 | FACEBOOK | 1000 | 2000
In pandas, it'd be as simple as df_apps_grouped = df_apps.groupby('DATE').nth(1) and I'd get the result below:
--> df_apps_grouped
DATE | APP | DOWNLOADS | ACTIVE_USERS
______________________________________________________
2021-01-10 | FACEBOOK | 20000 | 900000
2021-01-11 | FACEBOOK | 40000 | 85000
But for one specific project, I must use pyspark and I can't get this result on it.
Could you please help me with this?
Thanks!
You'll want to do something like this:
from pyspark.sql import Window, functions as F
# Number the rows within each DATE group (the ordering caveat is discussed below).
w = Window.partitionBy('DATE').orderBy('DATE')
# row_number() is 1-based, so the second row of each group has row_n == 2.
df = df.withColumn('row_n', F.row_number().over(w)).filter('row_n == 2').drop('row_n')
Because of Spark's distributed nature, the rows within a group come back in an arbitrary order, and the row that receives a given row number might be different the second time you query it. This is why you need an order by: it makes sure you get the same result every time.
What you are looking for is row_number applied over a window partitioned by DATE and ordered by DATE; however, due to the distributed nature of Spark, we can't guarantee that during ordering
2021-01-10 | FACEBOOK | 1000 | 5000
will always come before
2021-01-10 | FACEBOOK | 20000 | 900000
I would suggest including a line number if you are reading from a file, and ordering based on that line number; there are ways of achieving this in Spark.

PostgreSQL query and data caching

I have this SQL query:
SELECT p.timestamp,
       COUNT(*) AS total,
       date_part('hour', p.timestamp) AS hour
FROM parties AS p
WHERE p.timestamp >= TIMESTAMP 'today'
  AND p.timestamp < TIMESTAMP 'tomorrow'
  AND p.member_id = 1
GROUP BY p.timestamp, hour;
which returns how many people there are, grouped by hour:
+-------------------------+-------+------+
| Timestamp | Total | Hour |
+-------------------------+-------+------+
| 2018-11-21 12:00:00+07 | 10 | 12 |
| 2018-11-21 13:00:00+07 | 2 | 13 |
| 2018-11-21 14:00:00+07 | 2 | 14 |
| 2018-11-21 16:00:00+07 | 1 | 16 |
| 2018-11-21 17:00:00+07 | 21 | 17 |
| 2018-11-21 19:00:00+07 | 18 | 19 |
| 2018-11-21 20:00:00+07 | 8 | 20 |
| 2018-11-21 21:00:00+07 | 1 | 21 |
+-------------------------+-------+------+
My question is: if I re-fetch an API endpoint that runs the statement above, will the data for past hours be cached automatically? In my case, when new data arrives, only the last hour's row changes.
If not, how can I cache it? Thanks in advance.
PostgreSQL cannot cache query results itself. The solution is to cache the result at the API/application layer.
I prefer using Redis to cache it: use a hash whose fields are year+month+day+hour and whose values are the total online users for each hour. Example:
hash: useronline
field: 2018112112 - value: 10
field: 2018112113 - value: 2
You can also set a timeout on the key; after the timeout has expired, the key will automatically be deleted. I will set it to 1 hour here:
EXPIRE useronline 3600
When an API request comes in, look up the result in the Redis cache first. If it does not exist or has expired, run the query against the database layer, save the result to the Redis cache again, and then return the result to the client.
Redis client libraries are available for most programming languages.

I think I need a loop in an MS Access Query

I have a table of login and logout times for users, table looks something like below:
| ID | User | WorkDate | Start | Finish |
| 1 | Bill | 07/12/2017 | 09:00:00 | 17:00:00 |
| 2 | John | 07/12/2017 | 09:00:00 | 12:00:00 |
| 3 | John | 07/12/2017 | 12:30:00 | 17:00:00 |
| 4 | Mary | 07/12/2017 | 09:00:00 | 10:00:00 |
| 5 | Mary | 07/12/2017 | 10:10:00 | 12:00:00 |
| 6 | Mary | 07/12/2017 | 12:10:00 | 17:00:00 |
I'm running a query to find out the length of the breaks that each user took by taking the date difference between the Min of Finish and the Max of Start, then doing some other sums/queries to work out their break length.
This works where I have a maximum of two rows per User per WorkDate, so rows 1, 2, 3 give me workable data.
Rows 4, 5, 6 do not.
So, long story short: how can I calculate the break times based on the above data in an MS Access query? I'm assuming I'm going to need some looping statement, but I have no idea where to begin.
Here is a solution that comes to mind first.
First query to get the min/max start and end times.
Second query to calculate the total elapsed time for each day, using the Min(start time) and Max(end time) from the first query.
Third query to calculate the total time worked for each shift (time difference between start and end times) and then do a daily sum.
Fourth query to calculate the difference between the elapsed time from the second query and the total worked time from the third query. The difference gives you the amount of break time they took.
If you need additional help, I can provide some screenshots of example queries.
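A rough sketch of those queries in Access SQL, collapsing the first two steps into one saved query; the table name Logins and the saved query names qrySpan and qryWorked are assumptions, and DateDiff with "n" returns minutes.
qrySpan (elapsed time per user per day, first login to last logout):
SELECT [User], WorkDate, DateDiff("n", Min([Start]), Max([Finish])) AS SpanMinutes
FROM Logins
GROUP BY [User], WorkDate;
qryWorked (time actually logged in per user per day):
SELECT [User], WorkDate, Sum(DateDiff("n", [Start], [Finish])) AS WorkedMinutes
FROM Logins
GROUP BY [User], WorkDate;
Break time = elapsed span minus time logged in:
SELECT s.[User], s.WorkDate, s.SpanMinutes - w.WorkedMinutes AS BreakMinutes
FROM qrySpan AS s
INNER JOIN qryWorked AS w ON (s.[User] = w.[User]) AND (s.WorkDate = w.WorkDate);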

Matching disambiguating data to existing duplicate records

I have a table called transactions that has the ledger from a storefront. Let's say it looks like this, for simplicity:
trans_id | cust | date | num_items | cost
---------+------+------+-----------+------
1 | Joe | 4/18 | 6 | 14.83
2 | Sue | 4/19 | 3 | 8.30
3 | Ann | 4/19 | 1 | 2.28
4 | Joe | 4/19 | 4 | 17.32
5 | Sue | 4/19 | 3 | 8.30
6 | Lee | 4/19 | 2 | 9.55
7 | Ann | 4/20 | 1 | 2.28
For the credit card purchases, I subsequently get an electronic ledger that has the full timestamp. So I have a table called cctrans with date, time, cust, cost, and some other info. I want to add a column trans_id to the cctrans table, that references the transactions table.
The update statement for this is simple enough, except for one hitch: I have an 11 AM transaction from Sue on 4/19 for $8.30 and a 3 PM transaction from Sue on 4/19 for $8.30 that are the same in the transactions table except for the trans_id field. I don't really care which record of the cctrans table gets linked to trans_id 2 and which one gets linked to trans_id 5, but they can't both be assigned the same trans_id.
The question here is: How do I accomplish that (ideally in a way that also works when a customer makes the same purchase three or four times in a day)?
The best I have so far is to do:
UPDATE cctrans AS cc
SET trans_id = t.trans_id
FROM transactions AS t
WHERE cc.cust = t.cust AND cc.date = t.date AND cc.cost = t.cost;
And then fix them one-by-one via manual inspection. But obviously that's not my preferred solution.
Thanks for any help you can provide.
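One way to make the assignment unique is to number the duplicates on both sides and match on that position as well; here is a sketch in Postgres, assuming cctrans has its own key column cc_id (not shown in the question) and using the time column to order the credit-card rows:
WITH t AS (
  SELECT trans_id, cust, date, cost,
         row_number() OVER (PARTITION BY cust, date, cost ORDER BY trans_id) AS rn
  FROM transactions
), c AS (
  SELECT cc_id, cust, date, cost,
         row_number() OVER (PARTITION BY cust, date, cost ORDER BY time) AS rn
  FROM cctrans
)
UPDATE cctrans AS cc
SET trans_id = t.trans_id
FROM c
JOIN t ON t.cust = c.cust AND t.date = c.date AND t.cost = c.cost AND t.rn = c.rn
WHERE cc.cc_id = c.cc_id;
The nth duplicate in cctrans gets the nth matching trans_id, which also works when the same purchase appears three or four times in a day.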

Postgres table with more rows or fewer rows with more data in each row

I'm in a situation where I need to track user information very similar to fitbit steps, and am looking for feedback on two thoughts I had on modelling the data.
My requirement is to store the step samples on a minute-by-minute basis. These are also going to be associated with a user (who did the steps), and with challenges and tasks for the user to complete (gamification).
Now I can store all the samples in one table
id(pk) | user | start date | steps | challengeId
uuid1 | user1 | 1/1/2015 10:00PM | 100 | challenge1
uuid2 | user1 | 1/1/2015 10:01PM | 101 | challenge1
... can have hundreds of minutes with a challenge
uuid3 | user1 | 1/1/2015 10:02PM | 102 |
uuid4 | user2 | 1/1/2015 10:00PM | 100 |
so user1 has 303 steps between 10:00PM and 10:02PM but was only participating in challenge1 at 10:00PM and 10:01 PM
However, I don't think this can scale. Assuming ideal data for a single user in a year:
12 (active hours per day) * 60 (minutes per hour) * 365 (days per year) = 262,800 records for 1 user. With 100k users, that is roughly 26 billion rows per year, so the table would become very large.
I'm also thinking about the idea of grouping the minutes into a concept of a session, where it would look like
id(pk) | user | start date | steps | challengeId
uuid1 | user1 | 1/1/2015 10:00PM | [100,101] | challenge1
uuid2 | user1 | 1/1/2015 10:02PM | [102] |
uuid3 | user2 | 1/1/2015 10:00PM | [100] |
where the steps array assumes 1-minute intervals. Based on the use cases, there could be hundreds or thousands of minutes in a challenge.
I think the second approach makes sense, since it means querying single records vs hundreds or thousands and could shrink the table by a factor of hundreds, but if there are any gotchas to this approach or any thoughts, it would be appreciated.
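For concreteness, a sketch of the two layouts as Postgres DDL; the table and column names are assumptions based on the examples above:
-- Approach 1: one row per user per minute.
CREATE TABLE step_samples (
    id           uuid PRIMARY KEY,
    user_id      text        NOT NULL,
    start_date   timestamptz NOT NULL,
    steps        integer     NOT NULL,
    challenge_id text
);
-- Approach 2: one row per session; steps[i] holds minute i of the session.
CREATE TABLE step_sessions (
    id           uuid PRIMARY KEY,
    user_id      text        NOT NULL,
    start_date   timestamptz NOT NULL,
    steps        integer[]   NOT NULL,
    challenge_id text
);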