Joining on closest timestamp? Presto SQL

I have 2 tables with epoch values. One with multiple samples per minute such as:
id | First_name | epoch_time
1 | Paul | 1650317420
2 | Jeff | 1650317443
3 | Raul | 1650317455
And one with 1 sample per minute:
id | Home | epoch_time
1 | New York | 1650317432
What I would like to do is join on the closest timestamp between the two tables: for each row in table 1, find the row in table 2 with the closest epoch_time, then populate the 'Home' field from table 2 while keeping the rest of the table 1 record as is, such as:
id | Name | Home | epoch_time
1 | Paul | New York | 1650317420
2 | Jeff | New York | 1650317443
3 | Raul | New York | 1650317455
The problem is the actual join. The ID is not unique, which is why I need to not only join on ID but also scan for the closest epoch time between the two tables. I cannot use correlated subqueries, since Presto doesn't support them.

Answered my own question. It was as simple as first computing an end time for each minute sample with LEAD() (the next sample's time minus one second) and then using BETWEEN in the join, so that each row from table 1 matches the minute sample whose interval, looking ahead up to 59 seconds, contains it. Such that:
WITH tbl1 AS (
    SELECT
        *
    FROM table_1
),
tbl2 AS (
    SELECT
        *,
        -- end of this sample's interval: one second before the next sample
        LEAD(epoch_time) OVER (
            PARTITION BY
                id
            ORDER BY
                epoch_time
        ) - 1 AS next_time
    FROM table_2
)
SELECT
    t1.Id,
    t1.First_name AS Name,
    t2.Home,
    t1.epoch_time
FROM tbl1 t1
LEFT JOIN tbl2 t2
    ON t1.Id = t2.Id
    AND t1.epoch_time BETWEEN t2.epoch_time AND t2.next_time
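The interval join above matches each row to the sample at or before it, which is not always the nearest one. As a minimal sketch of a literal closest-timestamp match, the snippet below ranks candidate pairs by absolute time difference and keeps the best one per row. It runs against in-memory SQLite from Python purely as a stand-in for Presto (an assumption made so the example is runnable; ROW_NUMBER() and ABS() exist in Presto as well). Because the sample table_2 holds a single row, the sketch matches on time alone; where ids correspond across the tables, an id equality condition would be added to the join.

```python
import sqlite3

# In-memory SQLite stands in for Presto (assumption for runnability).
# Window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (id INTEGER, First_name TEXT, epoch_time INTEGER);
    CREATE TABLE table_2 (id INTEGER, Home TEXT, epoch_time INTEGER);
    INSERT INTO table_1 VALUES
        (1, 'Paul', 1650317420),
        (2, 'Jeff', 1650317443),
        (3, 'Raul', 1650317455);
    INSERT INTO table_2 VALUES (1, 'New York', 1650317432);
""")

closest_sql = """
    SELECT id, First_name, Home, epoch_time
    FROM (
        SELECT t1.id         AS id,
               t1.First_name AS First_name,
               t2.Home       AS Home,
               t1.epoch_time AS epoch_time,
               -- rank table_2 candidates by distance in time; 1 = closest
               ROW_NUMBER() OVER (
                   PARTITION BY t1.id, t1.epoch_time
                   ORDER BY ABS(t1.epoch_time - t2.epoch_time)
               ) AS rn
        FROM table_1 t1
        CROSS JOIN table_2 t2
    ) ranked
    WHERE rn = 1
    ORDER BY id
"""
closest_rows = conn.execute(closest_sql).fetchall()
for row in closest_rows:
    print(row)
```

The cross join compares every pair of rows, so on real data the join would likely need a bound, for example only considering samples within 60 seconds of each row.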

Related

Postgres, groupBy and count for table and relations at the same time

I have a table called 'users' that has the following structure:
id (PK) | campaign_id | createdAt
1 | 123 | 2022-07-14T10:30:01.967Z
2 | 1234 | 2022-07-14T10:30:01.967Z
3 | 123 | 2022-07-14T10:30:01.967Z
4 | 123 | 2022-07-14T10:30:01.967Z
At the same time I have a table that tracks clicks per user:
id (PK) | user_id(FK) | createdAt
1 | 1 | 2022-07-14T10:30:01.967Z
2 | 2 | 2022-07-14T10:30:01.967Z
3 | 2 | 2022-07-14T10:30:01.967Z
4 | 2 | 2022-07-14T10:30:01.967Z
Both of these tables hold up to millions of records, so I need the most efficient query to group the data per campaign_id.
The result I am looking for would look like this:
campaign_id | total_users | total_clicks
123 | 3 | 1
1234 | 1 | 3
Unfortunately, I have no idea how to achieve this while minding performance. Most importantly, I need to use WHERE or HAVING to limit the query to a certain time range by createdAt.
Note, PostgreSQL is not my forte, nor is SQL, but I'm learning by spending some time on your question. Have a go with an INNER JOIN between two separate SELECT statements:
SELECT * FROM
(
    SELECT campaign_id, COUNT(t1."id(PK)") total_users
    FROM t1
    GROUP BY campaign_id
) tbl1
INNER JOIN
(
    SELECT campaign_id, COUNT(t2."user_id(FK)") total_clicks
    FROM t2
    INNER JOIN t1 ON t1."id(PK)" = t2."user_id(FK)"
    GROUP BY campaign_id
) tbl2
USING (campaign_id)
I believe this is now also ready for a WHERE clause in both SELECT statements to filter by "createdAt". I'm pretty sure someone else will come up with something better.
Good luck.
Hope this will help you.
select u.campaign_id,
       count(distinct u.id) users_count,
       count(c.user_id) clicks_count
from users u
left join clicks c on u.id = c.user_id
group by 1;
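Neither query above includes the createdAt time range the question asks for. As a hedged sketch of one way to add it, the snippet below counts clicks per user first, then joins that smaller result to users, with the same range applied to both tables. It uses in-memory SQLite from Python as a stand-in for PostgreSQL (an assumption for runnability); the :since and :until parameter names and the choice to filter both tables by the same range are illustrative, not taken from the question.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL (assumption for runnability).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, campaign_id INTEGER, createdAt TEXT);
    CREATE TABLE clicks (id INTEGER PRIMARY KEY, user_id INTEGER, createdAt TEXT);
    INSERT INTO users VALUES
        (1, 123,  '2022-07-14T10:30:01.967Z'),
        (2, 1234, '2022-07-14T10:30:01.967Z'),
        (3, 123,  '2022-07-14T10:30:01.967Z'),
        (4, 123,  '2022-07-14T10:30:01.967Z');
    INSERT INTO clicks VALUES
        (1, 1, '2022-07-14T10:30:01.967Z'),
        (2, 2, '2022-07-14T10:30:01.967Z'),
        (3, 2, '2022-07-14T10:30:01.967Z'),
        (4, 2, '2022-07-14T10:30:01.967Z');
""")

campaign_sql = """
    SELECT u.campaign_id,
           COUNT(*) AS total_users,
           COALESCE(SUM(c.click_count), 0) AS total_clicks
    FROM users u
    LEFT JOIN (
        -- one row per user, so the outer COUNT(*) still counts users
        SELECT user_id, COUNT(*) AS click_count
        FROM clicks
        WHERE createdAt >= :since AND createdAt < :until
        GROUP BY user_id
    ) c ON c.user_id = u.id
    WHERE u.createdAt >= :since AND u.createdAt < :until
    GROUP BY u.campaign_id
    ORDER BY u.campaign_id
"""
# Hypothetical range covering the sample day; ISO-8601 strings in the same
# format compare correctly as text.
time_range = {"since": "2022-07-14T00:00:00.000Z", "until": "2022-07-15T00:00:00.000Z"}
campaign_rows = conn.execute(campaign_sql, time_range).fetchall()
print(campaign_rows)
```

Whether this is faster than the single join on millions of rows depends on the indexes available (for example on createdAt and user_id), so it is worth comparing query plans on the real tables.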

How to include values with count 0?

I have the following table called Trains.
id | name | train_id
1 Carl 1
2 Kat 1
3 Paul 2
4 Adam 4
5 Janet 4
6 James 4
I am trying to count for each name how many other people are in the same train.
Here's what I've gotten so far:
SELECT T1.name, COUNT(T2.name)
FROM Trains T1, Trains T2
WHERE T1.name<>T2.name AND T1.train_id=T2.train_id
GROUP BY T1.name;
However, the result I get is
Janet 2
Adam 2
Kat 1
Carl 1
James 2
but I should also have the name 'Paul' there with count 0. I am new to SQL and I am unsure of how I could change my code to have the zero values here as well.
If you phrase your current logic as a left join, it should work:
SELECT t1.name, COUNT(t2.name) AS cnt
FROM Trains t1
LEFT JOIN Trains t2
ON t1.name <> t2.name AND t1.train_id = t2.train_id
GROUP BY t1.name;
The problem with your current approach is that it does an old-school implicit inner join, not a left join. The join keeps only pairs of different names on the same train, and Paul, who is alone on his train, has no such pair, so his row never appears in the result. With a left join, every name on the left side of the join is retained, and COUNT(t2.name) returns 0 when there is no match.
I don't think a join is needed. Just use window functions:
select name, count(*) over (partition by train_id) - 1
from trains t;
Basically, count(*) over (partition by train_id) counts the number of rows on the train. The - 1 is to subtract the current row.
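As a quick check of the two approaches above, the following sketch loads the sample Trains rows into in-memory SQLite (a stand-in chosen only so the example runs anywhere; window functions need SQLite 3.25 or newer) and runs both the left join and the window function query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Trains (id INTEGER PRIMARY KEY, name TEXT, train_id INTEGER);
    INSERT INTO Trains VALUES
        (1, 'Carl', 1), (2, 'Kat', 1), (3, 'Paul', 2),
        (4, 'Adam', 4), (5, 'Janet', 4), (6, 'James', 4);
""")

# Left self-join: unmatched names survive with a count of 0.
left_join_counts = dict(conn.execute("""
    SELECT t1.name, COUNT(t2.name) AS cnt
    FROM Trains t1
    LEFT JOIN Trains t2
      ON t1.name <> t2.name AND t1.train_id = t2.train_id
    GROUP BY t1.name
""").fetchall())

# Window function: rows on the same train, minus the current row.
window_counts = dict(conn.execute("""
    SELECT name, COUNT(*) OVER (PARTITION BY train_id) - 1
    FROM Trains
""").fetchall())

print(left_join_counts)
print(window_counts)
```

Both queries give Paul a count of 0. They would differ if two people on the same train shared a name, since the left join compares names while the window function only subtracts the current row.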
We use COUNT(*), which counts all of the input rows for a group.
(COUNT() also works with expressions, but it has slightly different behavior: COUNT(expr) skips rows where the expression is NULL.)
Here's how the database executes this query:
FROM Trains: first, retrieve all of the records from the Trains table.
GROUP BY name: next, determine the unique name groups.
SELECT ...: finally, select the name and the count of the number of rows in that group.
We also give this count of rows an alias using AS people_in_same_train to make the output more readable.
Note that with the sample data each name appears only once, so this query returns 1 for every name rather than the number of other people on the same train.
SELECT
    name, COUNT(*) AS people_in_same_train
FROM Trains
GROUP BY name;

unable to use LIMIT when using correlated query

I have two tables in Postgres. I want to get the data for the latest 3 records from the table.
Below is the query:
select two.sid as sid,
two.sidname as sidname,
two.myPercent as mypercent,
two.saccur as saccur,
one.totalSid as totalSid
from table1 one,table2 two
where one.sid = two.sid;
The above query displays all records matching the condition one.sid = two.sid. I want to get only the data for the 3 most recent records (4, 5, 6) from table2.
I know in Postgres we can use LIMIT to limit the rows retrieved, but here in table2 I have multiple rows for each ID. So I guess I cannot use LIMIT on table2 but should use it on table1. Any suggestions?
table1:
sid totalSid
1 10
2 20
3 30
4 40
5 50
6 60
table2:
sid sidname myPercent saccur
1 aaaa 11 11t
1 bbb 13 13g
1 ccc 11 11g
1 qw 88 88k
//more data for 2,3,4,5....
6 xyz 89 895W
6 xyz1 90 90k
6 xyz2 91 91p
6 xyz3 92 92q
Given a changed understanding of the question, a simple subquery and join should suffice.
We select everything from table1, limited to 3 records in descending sid order. This gives us the 3 most recent SIDs, which we then join to table2 to get the other SID-relevant data. The assumption here is that SID is unique in table1 and that "most recent" means the records with the highest SID.
SELECT two.sid as sid
, two.sidname as sidname
, two.myPercent as mypercent
, two.saccur as saccur
, one.totalSid as totalSid
FROM (SELECT * FROM table1 ORDER BY SID DESC LIMIT 3) one
INNER JOIN table2 two
ON one.sid = two.sid;
*Note: I removed a comma after the one alias above.
Below, the same query is written with the ANSI-89 join syntax using the , notation.
SELECT two.sid as sid
, two.sidname as sidname
, two.myPercent as mypercent
, two.saccur as saccur
, one.totalSid as totalSid
FROM (SELECT * FROM table1 ORDER BY SID DESC LIMIT 3) one
, table2 two
WHERE one.sid = two.sid;
Conceptually, this syntax says: take the 3 most recent SIDs from table1, cross join them with table2 (match each record in one to every record in two), and then return only the records that have the same SID on both sides. Modern query optimizers may use cost-based optimization to avoid materializing the entire cross join; however, the logical order of operations describes it this way. If one and two are both tables of substantial size, the cross join could result in a very large temporary dataset.
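As a small sketch of the subquery-then-join approach, the snippet below runs the INNER JOIN version against in-memory SQLite (a stand-in for Postgres, assumed only for runnability). It loads just the table2 rows that the question actually shows, for sids 1 and 6; the rows for sids 2 to 5 are elided in the question and are left out here, so only sid 6 can match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (sid INTEGER PRIMARY KEY, totalSid INTEGER);
    CREATE TABLE table2 (sid INTEGER, sidname TEXT, myPercent INTEGER, saccur TEXT);
    INSERT INTO table1 VALUES (1, 10), (2, 20), (3, 30), (4, 40), (5, 50), (6, 60);
    -- Only the rows shown in the question (sids 1 and 6).
    INSERT INTO table2 VALUES
        (1, 'aaaa', 11, '11t'), (1, 'bbb', 13, '13g'),
        (1, 'ccc', 11, '11g'),  (1, 'qw', 88, '88k'),
        (6, 'xyz', 89, '895W'), (6, 'xyz1', 90, '90k'),
        (6, 'xyz2', 91, '91p'), (6, 'xyz3', 92, '92q');
""")

latest_rows = conn.execute("""
    SELECT two.sid, two.sidname, two.myPercent, two.saccur, one.totalSid
    FROM (SELECT * FROM table1 ORDER BY sid DESC LIMIT 3) one  -- sids 6, 5, 4
    INNER JOIN table2 two
        ON one.sid = two.sid
    ORDER BY two.sid, two.sidname
""").fetchall()
for row in latest_rows:
    print(row)
```

The LIMIT applies to table1 before the join, so every table2 row for each of the three selected sids is returned, and the sid 1 rows are excluded.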

SQL query with grouping and MAX

I have a table that looks like the following but also has more columns that are not needed for this instance.
ID DATE Random
-- -------- ---------
1 4/12/2015 2
2 4/15/2015 2
3 3/12/2015 2
4 9/16/2015 3
5 1/12/2015 3
6 2/12/2015 3
ID is the primary key
Random is a foreign key, but I am not actually using the table it points to.
I am trying to design a query that groups the results by Random and Date, selects the MAX Date within each group, and then gives me the associated ID.
If I do the following query
select top 100 ID, Random, MAX(Date) from DateBase group by Random, Date, ID
I get duplicate Randoms since ID is the primary key and will always be unique.
The results I need would look something like this
ID DATE Random
-- -------- ---------
2 4/15/2015 2
4 9/16/2015 3
Also, another question: there could be times where there are many rows with the same date. What will MAX do in that case?
You can use NOT EXISTS() :
SELECT * FROM YourTable t
WHERE NOT EXISTS(SELECT 1 FROM YourTable s
WHERE s.random = t.random
AND s.date > t.date)
This will select only the rows for which no row with a later date exists for the same random value.
Can also be done using IN() :
SELECT * FROM YourTable t
WHERE (t.random,t.date) in (SELECT s.random,max(s.date)
FROM YourTable s
GROUP BY s.random)
Or with a join:
SELECT t.* FROM YourTable t
INNER JOIN (SELECT s.random, max(s.date) as max_date
            FROM YourTable s
            GROUP BY s.random) tt
ON (t.date = tt.max_date and tt.random = t.random)
In SQL Server you could do something like the following,
select a.* from DateBase a inner join
(select Random,
MAX(dt) as dt from DateBase group by Random) as x
on a.dt =x.dt and a.random = x.random
This method will work in all versions of SQL as there are no vendor specifics (you'll need to format the dates using your vendor specific syntax)
You can do this in two stages:
The first step is to work out the max date for each random:
SELECT MAX(DateField) AS MaxDateField, Random
FROM Example
GROUP BY Random
Now you can join back onto your table to get the max ID for each combination:
SELECT MAX(e.ID) AS ID
,e.DateField AS DateField
,e.Random
FROM Example AS e
INNER JOIN (
SELECT MAX(DateField) AS MaxDateField, Random
FROM Example
GROUP BY Random
) data
ON data.MaxDateField = e.DateField
AND data.Random = e.Random
GROUP BY e.DateField, e.Random
To answer your second question:
If there are multiples of the same date, the MAX(e.ID) will simply choose the highest number. If you want the lowest, you can use MIN(e.ID) instead.
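To illustrate the tie behavior, here is a minimal sketch that runs the two-stage query against in-memory SQLite (a stand-in assumed for runnability). Dates are stored as ISO strings so they compare correctly as text, and the row with ID 7 is a hypothetical addition, not part of the question's data, used to create a tie on the latest date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Example (ID INTEGER PRIMARY KEY, DateField TEXT, Random INTEGER);
    INSERT INTO Example VALUES
        (1, '2015-04-12', 2), (2, '2015-04-15', 2), (3, '2015-03-12', 2),
        (4, '2015-09-16', 3), (5, '2015-01-12', 3), (6, '2015-02-12', 3);
""")

max_date_sql = """
    SELECT MAX(e.ID) AS ID, e.DateField AS DateField, e.Random
    FROM Example AS e
    INNER JOIN (
        SELECT MAX(DateField) AS MaxDateField, Random
        FROM Example
        GROUP BY Random
    ) data
        ON data.MaxDateField = e.DateField
       AND data.Random = e.Random
    GROUP BY e.DateField, e.Random
    ORDER BY e.Random
"""
rows_without_tie = conn.execute(max_date_sql).fetchall()

# Hypothetical extra row sharing the latest date for Random = 3.
conn.execute("INSERT INTO Example VALUES (7, '2015-09-16', 3)")
rows_with_tie = conn.execute(max_date_sql).fetchall()

print(rows_without_tie)
print(rows_with_tie)
```

With the tie in place, MAX(e.ID) returns ID 7 for Random 3; swapping in MIN(e.ID) would return ID 4 instead.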

Get top values from two columns

Let's say I have a table like this:
id | peru | usa
1 20 10
2 5 100
3 1 5
How can I get the top values from peru and usa as well as the specific ids, so that I get this result:
usa_id: 2 | usa: 100 | peru_id: 1 | peru: 20
Is this possible in one query? Or do I have to do two ORDER BY queries?
I'm using PostgreSQL.
You can do this with some subqueries and a cross join:
select
    u.id usa_id,
    u.usa,
    p.id peru_id,
    p.peru
from
    (select id, usa from mytable
     where usa = (select max(usa) from mytable)
     order by id limit 1) u
cross join
    (select id, peru from mytable
     where peru = (select max(peru) from mytable)
     order by id limit 1) p
;
In the case that there are multiple rows with the same max value (for usa or peru, independently), this solution will select the one with the lowest id (I've assumed that id is unique).
SELECT
t1.id as peru_id, t1.peru
, t2.id as usa_id, t2.usa
FROM tab1 t1, tab1 t2
ORDER BY t1.peru desc, t2.usa desc
limit 1
http://sqlfiddle.com/#!15/0c12f/6
As what this does is basically a simple Cartesian product, I guess that performance WILL be poor for large datasets.
On the fiddle it took 196ms for a 1k-row table. On a 10k-row table, sqlFiddle hung.
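You can consider using the MAX aggregate function in conjunction with the ARRAY type. Arrays compare element by element, so MAX(ARRAY[value, id]) picks the row with the highest value and carries its id along. Check this out: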
CREATE TEMPORARY TABLE _test(
id integer primary key,
peru integer not null,
usa integer not null
);
INSERT INTO _test(id, peru, usa)
VALUES
(1,20,10),
(2,5,100),
(3,1,5);
SELECT MAX(ARRAY[peru, id]) AS max_peru, MAX(array[usa, id]) AS max_usa FROM _test;
SELECT x.max_peru[1] AS peru, x.max_peru[2] AS peru_id, x.max_usa[1]
AS usa, x.max_usa[2] AS usa_id FROM (
SELECT MAX(array[peru, id]) AS max_peru,
MAX(array[usa, id]) AS max_usa FROM _test ) as x;
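The array trick relies on element-by-element comparison: the first element decides, and the second only breaks ties. Python tuples compare the same way, so the idea can be sketched without a database. This is an analogy in plain Python, not PostgreSQL itself, and the variable names are illustrative.

```python
# Sample rows from the question.
rows = [
    {"id": 1, "peru": 20, "usa": 10},
    {"id": 2, "peru": 5, "usa": 100},
    {"id": 3, "peru": 1, "usa": 5},
]

# max() over (value, id) pairs mirrors MAX(ARRAY[value, id]):
# the value is compared first, the id only when values are equal.
max_peru = max((r["peru"], r["id"]) for r in rows)
max_usa = max((r["usa"], r["id"]) for r in rows)

peru, peru_id = max_peru
usa, usa_id = max_usa
print(f"usa_id: {usa_id} | usa: {usa} | peru_id: {peru_id} | peru: {peru}")
```

One consequence of this ordering is that, when several rows share the maximum value, the pair with the highest id wins, unlike the subquery answer above, which picks the lowest id.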