Inner Join - special time conditions - sql

Given an hourly table A with full heart_rate records, e.g.:
User Hour Heart_rate
Joe 1 60
Joe 2 70
Joe 3 72
Joe 4 75
Joe 5 68
Joe 6 71
Joe 7 78
Joe 8 83
Joe 9 85
Joe 10 80
And a table B holding the subset of hours in which a purchase happened, e.g.:
User Hour Purchase
Joe 3 'Soda'
Joe 9 'Coke'
Joe 10 'Doughnut'
I want to keep only those records from A that are in B or at most 2 hours before a record in B, without duplication, while preserving both the heart_rate from A and the item purchased from B, so the outcome is:
User Hour Heart_rate Purchase
Joe 1 60 null
Joe 2 70 null
Joe 3 72 'Soda'
Joe 7 78 null
Joe 8 83 null
Joe 9 85 'Coke'
Joe 10 80 'Doughnut'
How can this result be achieved with an inner join without duplication (in this case, hours 8 and 9 each match two purchases)? This is an MWE; assume multiple users and timestamps instead of hours.
The obvious solution is to combine
Inner Join + deduplication
Left join
Can this be achieved in a more elegant way?

You could use an INNER join of the tables and conditional aggregation for the deduplication:
SELECT a.User, a.Hour, a.Heart_rate,
MAX(CASE WHEN a.Hour = b.Hour THEN b.Purchase END) Purchase
FROM a INNER JOIN b
ON b.User = a.User AND a.Hour BETWEEN b.Hour - 2 AND b.Hour
WHERE a.User = 'Joe' -- remove this line if you want results for all users
GROUP BY a.User, a.Hour, a.Heart_rate;
Or with MAX() window function:
SELECT DISTINCT a.*,
MAX(CASE WHEN a.Hour = b.Hour THEN b.Purchase END) OVER (PARTITION BY a.User, a.Hour) Purchase
FROM a INNER JOIN b
ON b.User = a.User AND a.Hour BETWEEN b.Hour - 2 AND b.Hour;
See the demo (for MySQL, but it is standard SQL).
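To check the conditional-aggregation query against the sample data from the question, here is a minimal sketch that runs it in an in-memory SQLite database through Python's sqlite3 module. The choice of SQLite, the function name purchases_with_lookback, and the column name usr (used instead of User, which is reserved in several engines) are assumptions made for illustration only.

```python
import sqlite3

def purchases_with_lookback(lookback_hours=2):
    """Return rows of A that fall on, or up to `lookback_hours` before, a purchase in B."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE a (usr TEXT, hour INTEGER, heart_rate INTEGER);
        CREATE TABLE b (usr TEXT, hour INTEGER, purchase TEXT);
        INSERT INTO a VALUES ('Joe',1,60),('Joe',2,70),('Joe',3,72),('Joe',4,75),
                             ('Joe',5,68),('Joe',6,71),('Joe',7,78),('Joe',8,83),
                             ('Joe',9,85),('Joe',10,80);
        INSERT INTO b VALUES ('Joe',3,'Soda'),('Joe',9,'Coke'),('Joe',10,'Doughnut');
    """)
    rows = con.execute("""
        SELECT a.usr, a.hour, a.heart_rate,
               -- only a purchase in the same hour is kept; other matches yield NULL
               MAX(CASE WHEN a.hour = b.hour THEN b.purchase END) AS purchase
        FROM a
        INNER JOIN b
          ON b.usr = a.usr AND a.hour BETWEEN b.hour - ? AND b.hour
        GROUP BY a.usr, a.hour, a.heart_rate   -- collapses the duplicate matches
        ORDER BY a.usr, a.hour
    """, (lookback_hours,)).fetchall()
    con.close()
    return rows

for row in purchases_with_lookback():
    print(row)
```

Hours 8 and 9 each join to two purchases, and the GROUP BY folds them back into one row per heart-rate record.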

Your solutions should work and sound good.
There is another way, using three nested SELECT statements.
The inner SELECT combines both tables with UNION ALL. Because UNION ALL only works on inputs with the same columns, fields that exist in only one table have to be defined in the other as well and set to null. The column hour_eat is added to record when the purchase occurred. By sorting this combined table, we can ensure that each row from table B is directly followed by the row of table A that occurs next.
In the middle SELECT, lag(Purchase) fetches the Purchase value of the previous row. If we consider only the rows from the first table, the Purchase value from the second table is now in the right place. This comes in handy when timestamps rather than fixed hours are used. The line with last_value calculates the time between the heart-rate measurement and the next purchase.
The outer SELECT filters for the rows of interest: only rows of the first table that fall within the two hours before a purchase.
With
heart_tbl as (SELECT "Joe" as USER, row_number() over() Hour, Heart_rate from unnest([60,70,72,75,68,71,78,83,85,80]) Heart_rate ),
eat_tbl as (Select "Joe" as User ,3 Hour , 'Soda' as Purchase UNION ALL SELECT "Joe", 9, 'Coke' UNION ALL SELECT "Joe", 10, 'Doughnut' )
SELECT user, hour,heart_rate,Purchase_,hours_till_Purchase
from
(
SELECT *,
lag(Purchase) over (order by hour, heart_rate is not null) as Purchase_,
hour-last_value(hour_eat ignore nulls) over (order by hour desc,heart_rate is not null) as hours_till_Purchase
From # combine both tables to one table (ordered by hours)
(
SELECT user, hour,heart_rate, null as Purchase, null as hour_eat from heart_tbl
UNION ALL
Select user, hour, null as heart_rate, Purchase, hour from eat_tbl
)
)
Where heart_rate is not null and hours_till_Purchase >= -2
order by hour
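The query above uses BigQuery syntax (unnest, last_value ... ignore nulls). As a rough sketch of the same UNION ALL idea on an engine without IGNORE NULLS, the version below runs on SQLite (3.25 or later for window functions) through Python's sqlite3 module. It swaps last_value(... ignore nulls) for MIN(hour_eat) over the descending window, which also skips NULLs and returns the nearest purchase hour at or after the current row. The function name and the column name usr are hypothetical, and like the original it handles a single user only.

```python
import sqlite3

def union_lag_variant(max_hours_before=2):
    """UNION ALL both tables, then use window functions to attach purchases."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE heart_tbl (usr TEXT, hour INTEGER, heart_rate INTEGER);
        CREATE TABLE eat_tbl (usr TEXT, hour INTEGER, purchase TEXT);
        INSERT INTO heart_tbl VALUES ('Joe',1,60),('Joe',2,70),('Joe',3,72),('Joe',4,75),
                                     ('Joe',5,68),('Joe',6,71),('Joe',7,78),('Joe',8,83),
                                     ('Joe',9,85),('Joe',10,80);
        INSERT INTO eat_tbl VALUES ('Joe',3,'Soda'),('Joe',9,'Coke'),('Joe',10,'Doughnut');
    """)
    rows = con.execute("""
        SELECT hour, heart_rate, purchase_, hours_till_purchase
        FROM (
            SELECT *,
                   -- purchase rows sort just before the heart-rate row of the same hour
                   LAG(purchase) OVER (ORDER BY hour, heart_rate IS NOT NULL) AS purchase_,
                   -- MIN ignores NULLs: nearest purchase hour at or after this row
                   hour - MIN(hour_eat) OVER (ORDER BY hour DESC, heart_rate IS NOT NULL)
                       AS hours_till_purchase
            FROM (
                SELECT usr, hour, heart_rate, NULL AS purchase, NULL AS hour_eat
                FROM heart_tbl
                UNION ALL
                SELECT usr, hour, NULL, purchase, hour FROM eat_tbl
            )
        )
        WHERE heart_rate IS NOT NULL AND hours_till_purchase >= -?
        ORDER BY hour
    """, (max_hours_before,)).fetchall()
    con.close()
    return rows

for row in union_lag_variant():
    print(row)
```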

Related

Frequency of Address changes in number of days SQL

Hi, I'm trying to find out how frequently a business changes its address. I have two tables, one with trading addresses and the other with office addresses. The complication is that one ID can have several sequence numbers. I need to find the difference between one address's create date and the next address's create date.
Trading address table
ID | Create_date | Seq_no | Address
1 | 2002-03-23 | 1 | 20 bottle way
1 | 2002-05-23 | 2 | 12 sunset blvd
2 | 2003-01-14 | 1 | 76 moonrise ct
Office address table
ID | Create_date | Seq_no | Address
1 | 2004-02-13 | 1 | 12 paper st
2 | 2005-03-01 | 1 | 30 pencil way
2 | 2005-04-01 | 2 | 25 mouse rd
2 | 2005-08-01 | 3 | 89 glass cct
My result set will be
Difference | NumberOfID's
30 days | 1
60 days | 1
120 days | 1
Other | 2
I think I solved it. The steps are:
1) Union the two tables and add a separate column holding the actual sequence number within the combined set.
2) Use the LEAD function to bring the next create date onto each row in a separate column.
3) Take the date difference between the two dates for each ID.
4) Use a CASE expression to categorize the day counts, then count the IDs.
WITH BASE AS (
SELECT ID,SEQ_NO,CREATE_DATE
FROM TradingAddress
UNION ALL
SELECT ID,SEQ_NO,CREATE_DATE
FROM OfficeAddress
),
WORKINGS AS (
SELECT ID,CREATE_DATE,
DENSE_RANK() OVER (PARTITION BY ID ORDER BY CREATE_DATE ASC) AS SNO,
LEAD(CREATE_DATE) OVER (PARTITION BY ID ORDER BY CREATE_DATE) AS REF_DATE,
DATEDIFF(DAY,CREATE_DATE,LEAD(CREATE_DATE) OVER (PARTITION BY ID ORDER BY CREATE_DATE)) AS DATE_DIFFERENCE
FROM BASE
),
WORKINGS_2 AS (
SELECT *,
CASE WHEN DATE_DIFFERENCE BETWEEN 1 AND 30 THEN '1-30 DAYS'
WHEN DATE_DIFFERENCE BETWEEN 31 AND 60 THEN '31-60 DAYS'
WHEN DATE_DIFFERENCE BETWEEN 61 AND 90 THEN '61-90 DAYS'
WHEN DATE_DIFFERENCE BETWEEN 91 AND 120 THEN '91-120 DAYS'
ELSE 'MORE THAN 120 DAYS'
END AS DIFFERENCE_DAYS
FROM WORKINGS
WHERE REF_DATE IS NOT NULL
)
SELECT DIFFERENCE_DAYS,COUNT(DIFFERENCE_DAYS) AS NUMBEROFIDS
FROM WORKINGS_2
GROUP BY DIFFERENCE_DAYS
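To try the UNION ALL + LEAD + CASE steps on the sample rows from the question, here is a minimal sketch using an in-memory SQLite database via Python's sqlite3 module (SQLite 3.25 or later for window functions). SQLite has no DATEDIFF, so the sketch substitutes a julianday() subtraction; that substitution and the function name address_change_buckets are assumptions for illustration, not part of the original T-SQL.

```python
import sqlite3

def address_change_buckets():
    """Bucket the gaps (in days) between consecutive address records per ID."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE TradingAddress (id INTEGER, create_date TEXT, seq_no INTEGER, address TEXT);
        CREATE TABLE OfficeAddress  (id INTEGER, create_date TEXT, seq_no INTEGER, address TEXT);
        INSERT INTO TradingAddress VALUES
            (1,'2002-03-23',1,'20 bottle way'),
            (1,'2002-05-23',2,'12 sunset blvd'),
            (2,'2003-01-14',1,'76 moonrise ct');
        INSERT INTO OfficeAddress VALUES
            (1,'2004-02-13',1,'12 paper st'),
            (2,'2005-03-01',1,'30 pencil way'),
            (2,'2005-04-01',2,'25 mouse rd'),
            (2,'2005-08-01',3,'89 glass cct');
    """)
    rows = con.execute("""
        WITH base AS (
            SELECT id, create_date FROM TradingAddress
            UNION ALL
            SELECT id, create_date FROM OfficeAddress
        ),
        workings AS (
            -- next address date for the same ID, NULL on the last record
            SELECT id, create_date,
                   LEAD(create_date) OVER (PARTITION BY id ORDER BY create_date) AS ref_date
            FROM base
        ),
        diffs AS (
            -- julianday() subtraction stands in for DATEDIFF(DAY, ...)
            SELECT CAST(julianday(ref_date) - julianday(create_date) AS INTEGER)
                       AS date_difference
            FROM workings
            WHERE ref_date IS NOT NULL
        )
        SELECT CASE WHEN date_difference BETWEEN 1 AND 30 THEN '1-30 DAYS'
                    WHEN date_difference BETWEEN 31 AND 60 THEN '31-60 DAYS'
                    WHEN date_difference BETWEEN 61 AND 90 THEN '61-90 DAYS'
                    WHEN date_difference BETWEEN 91 AND 120 THEN '91-120 DAYS'
                    ELSE 'MORE THAN 120 DAYS'
               END AS difference_days,
               COUNT(*) AS number_of_changes
        FROM diffs
        GROUP BY difference_days
        ORDER BY difference_days
    """).fetchall()
    con.close()
    return rows

print(address_change_buckets())
```

Note that the count is per address change rather than per distinct ID, so an ID with several changes contributes to several buckets.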
You can do it this way:
SELECT DATEDIFF(day, t1.create_date, t2.create_date) AS 'yourdats',
COUNT(*) AS ids
FROM test1 t1
JOIN test2 t2 ON t1.id = t2.id
GROUP BY DATEDIFF(day, t1.create_date, t2.create_date)

SQL Query Problem Involving (SUM, Group By, Order by, I guess? and maybe total, or even count)

Using a SQL query, find the stores with the top 5 highest total transaction values. Which industries do they belong to, and how many of these stores are in each industry?
My SQL data looks like this:
Store Name | Industry | Transaction Value
Ace | A | 196
Ace | A | 193
Area | A | 168
Apple | A | 165
Boy | B | 145
Boy | B | 143
Bull | B | 136
Bread | B | 131
Cat | C | 116
Cat | C | 106
Cake | C | 104
Candy | C | 102
Dog | D | 101
Dog | D | 92
Door | D | 80
Daddy | D | 75
Egg | E | 70
Egg | E | 67
Earl | E | 66
Eagle | E | 61
This is just for your reference, Top 5 highest Transaction Value are:
No. | Store Name | Industry | Total Transaction Value
1 | Ace | A | 389
2 | Boy | B | 288
3 | Cat | C | 222
4 | Dog | D | 193
5 | Area | A | 168
SQL Query Results should look something like this:
Industry | No. of Stores
A | 2
B | 1
C | 1
D | 1
E | 0
select a.industry, sum(case when b.name is null then 0 else 1 end) as no
from
(select distinct industry from transactions ) a
left join
(select name, industry
from transactions
group by name, industry
order by sum(transaction_value) desc limit 5) b
on a.industry = b.industry
group by a.industry
order by a.industry
I think I have a solution for you. Please check my code; I have used a common table expression, CASE, SUM, and GROUP BY:
WITH CTE AS
(
SELECT industry, SUM(TransactionValue) AS Transaction_Value,
COUNT(StoreName) AS StoreCount FROM MYTable
GROUP BY StoreName,industry
ORDER BY SUM(TransactionValue) DESC
Limit 5
)
SELECT T1.industry,
SUM((CASE WHEN c.industry IS NULL THEN 0
ELSE 1 END)) as CT
FROM
(SELECT DISTINCT Industry FROM MYTable) AS T1
LEFT JOIN CTE as c ON T1.industry=c.industry
GROUP BY T1.industry
Note: a subquery is not best practice, but in your case I think there will be no performance issue. Also, please check the code: I do not have a Snowflake database installed, so there may be some syntax errors.
To get a deterministic result, you must be aware of ties. Let's say the top 9 results are
Cat/A/600, Dog/A/500, Cat/B/500, Dog/B/400, Cat/C/300, Dog/C/300, Cat/D/300, Dog/D/200, Cat/E/100
Which is the top fifth? Cat/C/300 or Dog/C/300 or Cat/D/300? Or none of them? If we pick a row arbitrarily (by LIMIT 5 or FETCH FIRST 5 ROWS ONLY) we prefer one industry over another.
In standard SQL we have the clause FETCH FIRST 5 ROWS WITH TIES, but Snowflake doesn't feature this, unfortunately. It does, however, feature DENSE_RANK, which ranks my sample rows thus:
#1: Cat/A/600
#2: Dog/A/500
#2: Cat/B/500
#3: Dog/B/400
#4: Cat/C/300
#4: Dog/C/300
#4: Cat/D/300
#5: Dog/D/200
#6: Cat/E/100
because the five top values are 600, 500, 400, 300, and 200.
The query:
select industry, count(case when rnk <= 5 then 1 end) as stores
from
(
select industry, dense_rank() over (order by sum(transaction_value) desc) as rnk
from mytable
group by store_name, industry
) ranked
group by industry
order by industry;
If you only want to show top industries:
select industry, count(*) as stores
from
(
select industry, dense_rank() over (order by sum(transaction_value) desc) as rnk
from mytable
group by store_name, industry
) ranked
where rnk <= 5
group by industry
order by industry;
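As a quick check of the DENSE_RANK approach against the question's sample data, here is a minimal sketch on an in-memory SQLite database through Python's sqlite3 module (SQLite 3.25 or later for window functions). It computes the per-store totals in a separate CTE before ranking, which is a restructuring for readability rather than the exact query above; the function name top_store_counts is hypothetical.

```python
import sqlite3

SAMPLE = [
    ("Ace", "A", 196), ("Ace", "A", 193), ("Area", "A", 168), ("Apple", "A", 165),
    ("Boy", "B", 145), ("Boy", "B", 143), ("Bull", "B", 136), ("Bread", "B", 131),
    ("Cat", "C", 116), ("Cat", "C", 106), ("Cake", "C", 104), ("Candy", "C", 102),
    ("Dog", "D", 101), ("Dog", "D", 92), ("Door", "D", 80), ("Daddy", "D", 75),
    ("Egg", "E", 70), ("Egg", "E", 67), ("Earl", "E", 66), ("Eagle", "E", 61),
]

def top_store_counts(top_n=5):
    """Count, per industry, the stores whose total ranks within the top_n values."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE mytable (store_name TEXT, industry TEXT, transaction_value INTEGER)"
    )
    con.executemany("INSERT INTO mytable VALUES (?, ?, ?)", SAMPLE)
    rows = con.execute("""
        WITH store_totals AS (
            SELECT store_name, industry, SUM(transaction_value) AS total
            FROM mytable
            GROUP BY store_name, industry
        ),
        ranked AS (
            -- ties share a rank, so no store is picked arbitrarily
            SELECT industry, DENSE_RANK() OVER (ORDER BY total DESC) AS rnk
            FROM store_totals
        )
        SELECT industry, COUNT(CASE WHEN rnk <= ? THEN 1 END) AS stores
        FROM ranked
        GROUP BY industry
        ORDER BY industry
    """, (top_n,)).fetchall()
    con.close()
    return rows

print(top_store_counts())
```

Industries with no store in the top ranks still appear with a count of 0, because the CASE expression filters inside the aggregate instead of in a WHERE clause.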

SQL Joining One Table to a Selection of Rows from Second Table that Contains a Max Value per Group

I have a table of Cases with info like the following -
ID | CaseName | Date | Occupation
11 | John | 2020-01-01 | Joiner
12 | Mark | 2019-10-10 | Mechanic
And a table of Financial information like the following -
ID | CaseID | Date | Value
1 | 11 | 2020-01-01 | 1,000
2 | 11 | 2020-02-03 | 2,000
3 | 12 | 2019-10-10 | 3,000
4 | 12 | 2019-12-25 | 4,000
What I need to produce is a list of Cases including details of the most recent Financial value, for example -
ID | CaseName | Occupation | Latest Value
11 | John | Joiner | 2,000
12 | Mark | Mechanic | 4,000
Now I can join my tables easy enough with -
SELECT *
FROM Cases AS c
LEFT JOIN Financial AS f ON f.CaseID = c.ID
And I can find the most recent date per case from the financial table with -
SELECT CaseID, MAX(Date) AS LastDate
FROM Financial
GROUP BY CaseID
But I am struggling to find a way to bring these two together to produce the required results as per the table set out above.
A simple method is window functions:
SELECT *
FROM Cases c LEFT JOIN
(SELECT f.*, MAX(date) OVER (PARTITION BY CaseId) as max_date
FROM Financial f
) f
ON f.CaseID = c.ID AND f.max_date = f.date;
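To see the window-function join produce the desired table, here is a minimal sketch that runs it on the sample rows in an in-memory SQLite database via Python's sqlite3 module (SQLite 3.25 or later). The snake_case column names (case_name, fin_date, and so on) and the function name latest_financials are assumptions chosen for the illustration.

```python
import sqlite3

def latest_financials():
    """Join each case to its most recent financial row."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE cases (id INTEGER, case_name TEXT, case_date TEXT, occupation TEXT);
        CREATE TABLE financial (id INTEGER, case_id INTEGER, fin_date TEXT, value INTEGER);
        INSERT INTO cases VALUES (11,'John','2020-01-01','Joiner'),
                                 (12,'Mark','2019-10-10','Mechanic');
        INSERT INTO financial VALUES (1,11,'2020-01-01',1000),
                                     (2,11,'2020-02-03',2000),
                                     (3,12,'2019-10-10',3000),
                                     (4,12,'2019-12-25',4000);
    """)
    rows = con.execute("""
        SELECT c.id, c.case_name, c.occupation, f.value AS latest_value
        FROM cases c
        LEFT JOIN (
            -- tag every financial row with the latest date of its case
            SELECT f.*, MAX(fin_date) OVER (PARTITION BY case_id) AS max_date
            FROM financial f
        ) f
          ON f.case_id = c.id AND f.max_date = f.fin_date
        ORDER BY c.id
    """).fetchall()
    con.close()
    return rows

print(latest_financials())
```

If a case has two financial rows on the same latest date, both are returned; a ROW_NUMBER() filter would be one way to force a single row.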

Combining Two Tables & Summing REV amts by Mth

Below are my two tables of data
Acct BillingDate REV
101 01/05/2018 5
101 01/30/2018 4
102 01/15/2018 2
103 01/4/2018 3
103 02/05/2018 2
106 03/06/2018 5
Acct BillingDate Lease_Rev
101 01/15/2018 2
102 01/16/2018 1
103 01/19/2018 2
104 02/05/2018 3
105 04/02/2018 1
Desired Output
Acct Jan Feb Mar Apr
101 11
102 3
103 5 2
104 3
105 1
106 5
My SQL Script is Below:
SELECT [NewSalesHistory].[Region]
,[NewSalesHistory].[Account]
,SUM(case when [NewSalesHistory].[billingdate] between '6/1/2016' and '6/30/2016' then REV else 0 end ) + [X].[Jun-16] AS 'Jun-16'
FROM [NewSalesHistory]
FULL join (SELECT [Account]
,SUM(case when [BWLease].[billingdate] between '6/1/2016' and '6/30/2016' then Lease_REV else 0 end ) as 'Jun-16'
FROM [AirgasPricing].[dbo].[BWLease]
GROUP BY [Account]) X ON [NewSalesHistory].[Account] = [X].[Account]
GROUP BY [NewSalesHistory].[Region]
,[NewSalesHistory].[Account]
,[X].[Jun-16]
I am having trouble combining these tables. If an account has both a rev amount and a lease rev amount, the two are combined (summed) correctly. If there is no lease rev amount (which is the majority of cases), the query returns NULLs for all the other accounts' rev amounts from table 1. Table 1 can contain the same account several times with different REV values, while table 2 has one row per account with its lease rev. The output above is how I would like to see the data.
What am I missing here? Thanks!
I would suggest union all and group by:
select acct,
sum(case when billingdate >= '2016-01-01' and billingdate < '2016-02-01' then rev end) as rev_201601,
sum(case when billingdate >= '2016-02-01' and billingdate < '2016-03-01' then rev end) as rev_201602,
. . .
from ((select nsh.acct, nsh.billingdate, nsh.rev
from NewSalesHistory nsh
) union all
(select bl.acct, bl.billingdate, bl.lease_rev
from AirgasPricing..BWLease bl
)
) x
group by acct;
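The UNION ALL plus conditional SUM pattern can be tried on the question's sample rows with the sketch below, which uses an in-memory SQLite database via Python's sqlite3 module. Dates are stored as ISO text (an assumption, so that string comparison orders them correctly), only the four months from the desired output are pivoted, and the function name monthly_totals is hypothetical.

```python
import sqlite3

def monthly_totals():
    """Stack REV and Lease_Rev rows, then pivot the sums into one column per month."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE NewSalesHistory (acct INTEGER, billingdate TEXT, rev INTEGER);
        CREATE TABLE BWLease (acct INTEGER, billingdate TEXT, lease_rev INTEGER);
        INSERT INTO NewSalesHistory VALUES
            (101,'2018-01-05',5),(101,'2018-01-30',4),(102,'2018-01-15',2),
            (103,'2018-01-04',3),(103,'2018-02-05',2),(106,'2018-03-06',5);
        INSERT INTO BWLease VALUES
            (101,'2018-01-15',2),(102,'2018-01-16',1),(103,'2018-01-19',2),
            (104,'2018-02-05',3),(105,'2018-04-02',1);
    """)
    rows = con.execute("""
        SELECT acct,
               SUM(CASE WHEN billingdate >= '2018-01-01' AND billingdate < '2018-02-01'
                        THEN rev END) AS jan,
               SUM(CASE WHEN billingdate >= '2018-02-01' AND billingdate < '2018-03-01'
                        THEN rev END) AS feb,
               SUM(CASE WHEN billingdate >= '2018-03-01' AND billingdate < '2018-04-01'
                        THEN rev END) AS mar,
               SUM(CASE WHEN billingdate >= '2018-04-01' AND billingdate < '2018-05-01'
                        THEN rev END) AS apr
        FROM (
            SELECT acct, billingdate, rev FROM NewSalesHistory
            UNION ALL
            -- the lease amount lands in the same column as rev
            SELECT acct, billingdate, lease_rev FROM BWLease
        ) x
        GROUP BY acct
        ORDER BY acct
    """).fetchall()
    con.close()
    return rows

for row in monthly_totals():
    print(row)
```

Because the rows are stacked rather than joined, accounts that appear in only one table (104, 105, 106) still show up, and months without data come back as NULL instead of 0.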
Okay, so there are a few things going on here:
1) As Gordon Linoff mentioned you can perform a union all on the two tables. Be sure to limit your column selections and name your columns appropriately:
select
x as consistentname1,
y as consistentname2,
z as consistentname3
from [NewSalesHistory]
union all
select
a as consistentname1,
b as consistentname2,
c as consistentname3
from [BWLease]
2) Your desired result contains a pivoted month column. Generate a column with your desired granularity on the result of the union in step one. For example, for months:
concat(datepart(yy, Date_),'-',datename(mm,Date_)) as yyyyM
Then perform aggregation using a group by:
select sum(...) as desiredcolumnname
...
group by PK1, PK2, yyyyM
Finally, PIVOT to obtain your result: https://learn.microsoft.com/en-us/sql/t-sql/queries/from-using-pivot-and-unpivot?view=sql-server-2017
3) If you have other fields/columns that you wish to present then you first need to determine whether they are measures (can be aggregated) or are dimensions. That may be best addressed in a follow up question after you've achieved what you set out for in this part.
Hope it helps
As an aside, it seems like you are preparing data for reporting. Performing these transformations can be facilitated using a GUI such as MS Power Query. As long as your end goal is not data manipulation in the DB itself, you do not need to resort to raw sql.

Return rows where specific number is reached for the first time (postgres)

Have hit a roadblock.
Context: am using PostgreSQL 9.5.8
I have a table, as follows, with customers' points accrued. The table has multiple rows per customer as it records every change in points (like an event table). i.e. customer 1 may buy 1 item and accrue 10 points which is one row, then on another day spend some of these points and be left with 5 points which is another row, and then purchase another item and accrue a further 10 bringing them back up to 15 which displays as another row. Each of these rows with point amounts has a created_at column.
Example table:
Customer ID created_at no_points row
123 17/09/2017 5 1
123 09/10/2017 8 2
124 10/10/2017 12 3
123 10/10/2017 15 4
125 12/10/2017 12 5
126 17/09/2017 6 6
123 11/10/2017 11 7
123 12/10/2017 9 8
127 17/09/2017 5 9
124 11/10/2017 5 10
125 13/10/2017 5 11
123 13/10/2017 12 12
I want to track the first time a customer reaches a certain threshold i.e. >= 10 points. It doesn't matter how much they go over 10 points, the only criteria is that I select the first time the customer reaches this threshold. I would also like this query to fetch only rows where the customer has reached the threshold of 10 for the first time in the last week.
Following these rules, in the above example, I would like my query to select rows 3, 4 and 5.
I have tried the following query:
SELECT x.id,
min(x.created_at)
FROM (
SELECT
p.id as id,
p.created_at as created_at,
p.amount as amount
FROM "points" p
WHERE p.amount >= 10 ) x
WHERE x.created_at >= (now()::date - 7)
AND x.created_at < now()::date
GROUP BY x.id
I'm unsure whether I'm retrieving the right thing from the result set I am seeing, and the result set is huge, so it's not evident. Could someone sanity-check this?
Thanks in advance.
Use cumulative functions:
select p.*
from (select p.*,
sum(num_points) over (partition by p.customer_id order by p.created_at) as cume_num_points
from points p
) p
where cume_num_points >= 10 and
(cume_num_points - num_points) < 10;
EDIT:
I may have misunderstood the question. If you just want the first break, one method uses window functions:
select p.*
from (select p.*,
lag(num_points) over (partition by p.customer_id order by p.created_at) as prev_num_points
from points p
) p
where num_points >= 10 and
(prev_num_points < 10 or prev_num_points is null);
Or, without a subquery:
select distinct on (p.customer_id) p.*
from points p
where num_points >= 10
order by p.customer_id, p.created_at;
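DISTINCT ON is PostgreSQL-specific. As a portable sketch of the same "first row at or above the threshold per customer" idea, the example below uses ROW_NUMBER() instead and runs on an in-memory SQLite database via Python's sqlite3 module (SQLite 3.25 or later). The week boundaries are passed in as fixed ISO dates rather than derived from now(), and the column names, the default dates, and the function name first_threshold_rows are all assumptions for illustration.

```python
import sqlite3

POINTS = [
    (1, 123, "2017-09-17", 5), (2, 123, "2017-10-09", 8), (3, 124, "2017-10-10", 12),
    (4, 123, "2017-10-10", 15), (5, 125, "2017-10-12", 12), (6, 126, "2017-09-17", 6),
    (7, 123, "2017-10-11", 11), (8, 123, "2017-10-12", 9), (9, 127, "2017-09-17", 5),
    (10, 124, "2017-10-11", 5), (11, 125, "2017-10-13", 5), (12, 123, "2017-10-13", 12),
]

def first_threshold_rows(threshold=10, week_start="2017-10-07", week_end="2017-10-14"):
    """Rows where a customer first reaches `threshold`, limited to a date window."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE points (row_id INTEGER, customer_id INTEGER, "
        "created_at TEXT, num_points INTEGER)"
    )
    con.executemany("INSERT INTO points VALUES (?, ?, ?, ?)", POINTS)
    rows = con.execute("""
        SELECT row_id, customer_id, created_at, num_points
        FROM (
            -- number the qualifying rows per customer, earliest first
            SELECT p.*,
                   ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at) AS rn
            FROM points p
            WHERE num_points >= ?
        )
        WHERE rn = 1                               -- first time at or above the threshold
          AND created_at >= ? AND created_at < ?   -- applied after picking the first row
        ORDER BY row_id
    """, (threshold, week_start, week_end)).fetchall()
    con.close()
    return rows

for row in first_threshold_rows():
    print(row)
```

The date filter sits outside the ranking subquery on purpose: filtering by week first would wrongly report a later crossing as the first one for customers who reached the threshold earlier.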