SQL query for column threaded relationship

This is a simplified view of a table. I apologize, but I could not save a picture of the table so I hope this is ok.
c1___c2
1____a
1____b
2____a
2____b
2____c
2____d
3____e
3____a
4____z
5____d
The result is that, because of the relationships in column c2:
Group 1 would include 1, 2, 3, 5 (because they have overlapping c2 values, effectively stating a=b=c=d=e)
Group 2 would include 4
I have millions of rows with this kind of data and currently there is a cursor job that runs x number of times to build these groups. I am able to visualize how this should work, but I have not been able to build a query that can pull out this relationship.
Any suggestions?
Thank you

Tested on SQL Server 2012:
WITH t AS (
SELECT
t.c1,
t.c2,
tm.c1_min
FROM
Test t
JOIN
(
SELECT
c2,
MIN(c1) AS c1_min
FROM
Test
GROUP BY
c2
) AS tm
ON
t.c2 = tm.c2
),
rt AS (
SELECT
c1_min,
c1,
1 AS cnt
FROM
t
UNION ALL
SELECT
rt.c1_min,
t.c1,
rt.cnt + 1 AS cnt
FROM
rt
JOIN
t
ON
rt.c1 = t.c1_min
AND
rt.c1 < t.c1
)
SELECT
SUM(t.rst) OVER (ORDER BY t.ord ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS group_number,
t.c1
FROM
(
SELECT
t.c1,
t.rst,
t.ord
FROM
(
SELECT
rt.c1,
CASE
WHEN rt.c1_min = MIN(rt.c1_min) OVER (ORDER BY rt.c1_min, rt.c1 ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) THEN 0
ELSE 1
END AS rst,
ROW_NUMBER() OVER (ORDER BY rt.c1_min, rt.c1) AS ord,
ROW_NUMBER() OVER (PARTITION BY rt.c1 ORDER BY rt.c1_min, rt.cnt) AS qfy
FROM
rt
) AS t
WHERE
t.qfy = 1
) AS t;
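Outside SQL, the grouping described in the question amounts to finding connected components: two c1 values belong together when they share a c2 value, directly or through a chain. The following is a minimal Python sketch of that idea using a union-find structure. It is an illustration of the expected result, not a translation of the query above, and the function name `group_by_shared_values` is hypothetical.

```python
def group_by_shared_values(rows):
    """Map each c1 to a group number; c1 values linked through shared c2 values share a group."""
    parent = {}

    def find(x):
        # Walk up to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller c1 as the root so labels are deterministic.
            parent[max(ra, rb)] = min(ra, rb)

    first_c1_for_c2 = {}
    for c1, c2 in rows:
        parent.setdefault(c1, c1)
        if c2 in first_c1_for_c2:
            union(c1, first_c1_for_c2[c2])
        else:
            first_c1_for_c2[c2] = c1

    # Number the groups in order of their smallest c1.
    roots = sorted({find(c1) for c1 in parent})
    group_no = {root: i + 1 for i, root in enumerate(roots)}
    return {c1: group_no[find(c1)] for c1 in sorted(parent)}


# Sample data from the question.
rows = [(1, "a"), (1, "b"), (2, "a"), (2, "b"), (2, "c"), (2, "d"),
        (3, "e"), (3, "a"), (4, "z"), (5, "d")]
groups = group_by_shared_values(rows)
print(groups)
```

On the sample data this places 1, 2, 3 and 5 in group 1 and 4 in group 2, matching the result described in the question.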


Get max value within PARTITION BY in CTE function

I'm struggling with log aggregation with window functions. I have table with columns
Item (string)
User (string)
TimeStart (timestamp)
With a CTE, I'm trying to merge all entries within 5 min into one, to create a time interval. In C1 I'm looking for the start of next interval, C2 is grouping records into intervals. To this point it works fine.
When I select from C2, my data is labeled with a group ID. But when I try to get the max time within each group as TimeEnd in C3, the result is scrambled and useless. As I understand OVER (PARTITION BY ...), it should simply compare the TimeStart values within each partition and return the MAX, but instead I get scrambled group IDs and TimeEnds.
Code below, SQL Server Express:
WITH C1 AS
(
SELECT
Item,
User,
TimeStart,
CASE
WHEN DATEDIFF(MINUTE, LAG(TimeStart) OVER (PARTITION BY Item, User ORDER BY TimeStart), TimeStart) < 5
THEN 0
ELSE 1
END AS isstart
FROM
Log
),
C2 AS
(
SELECT
*,
SUM(isstart) OVER (ORDER BY TimeStart ROWS UNBOUNDED PRECEDING) AS grp
FROM
C1
),
C3 AS
(
SELECT
*,
MAX(TimeStart) OVER (PARTITION BY grp) AS TimeEnd
FROM
C2
)
SELECT *
FROM C3
Could you explain to me what happened here and how to solve it?
PS: I could use GROUP BY clause but I'll lose non aggregated columns.
Thanks to @Alex, I got this solution:
WITH C1 AS
(
SELECT
Item,
User,
TimeStart,
CASE
WHEN DATEDIFF(MINUTE, LAG(TimeStart) OVER (PARTITION BY Item, User ORDER BY TimeStart), TimeStart) < 5
THEN 0
ELSE 1
END AS isstart
FROM
Log
),
C2 AS
(
SELECT
*,
SUM(isstart) OVER ( ORDER BY Item, User, TimeStart
ROWS UNBOUNDED PRECEDING) as grp
FROM
C1
),
C3 AS(
SELECT *, MAX(TimeStart) OVER (PARTITION BY grp ) AS [TimeEnd] FROM C2
)
SELECT * FROM C3
where isstart=1
ORDER BY grp
The problem was in the second CTE, which ordered the rows incorrectly: the running SUM has to follow the same order as the partitions used in the first CTE.
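The role of the ordering can be seen in a small Python sketch of the same three steps (flag interval starts, take a running count of the flags, take the max time per group). This is only an illustration under assumptions: the function name `label_intervals` and the sample log are hypothetical, and the gap test uses an exact five-minute difference, whereas DATEDIFF(MINUTE, ...) counts minute boundaries.

```python
from datetime import datetime, timedelta


def label_intervals(rows, gap=timedelta(minutes=5)):
    """rows: iterable of (item, user, time_start).

    Returns a list of (item, user, time_start, grp, time_end).
    """
    # Sort by item, user, then time: the running count below relies on this order.
    ordered = sorted(rows)
    labeled = []
    grp = 0
    prev = None
    for item, user, ts in ordered:
        # A row starts a new interval when the item/user pair changes
        # or the gap to the previous row is at least `gap`.
        is_start = prev is None or prev[:2] != (item, user) or ts - prev[2] >= gap
        if is_start:
            grp += 1  # running SUM(isstart)
        labeled.append((item, user, ts, grp))
        prev = (item, user, ts)

    # Equivalent of MAX(TimeStart) OVER (PARTITION BY grp).
    time_end = {}
    for _, _, ts, g in labeled:
        time_end[g] = max(ts, time_end.get(g, ts))
    return [(i, u, ts, g, time_end[g]) for i, u, ts, g in labeled]


# Hypothetical sample log with two users interleaved in time.
t = datetime(2020, 1, 1, 12, 0)
log = [
    ("A", "bob", t + timedelta(minutes=2)),
    ("A", "amy", t),
    ("A", "amy", t + timedelta(minutes=3)),
    ("A", "amy", t + timedelta(minutes=20)),
    ("A", "bob", t + timedelta(minutes=4)),
]
result = label_intervals(log)
for row in result:
    print(row)
```

If the rows were ordered by time alone, the rows of the two users would interleave and the running count would assign group numbers that cut across users, which is the symptom described in the question.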

How to query the V shaped data?

TxnID RunningAmount MemberID
==================================
1 80000 20
2 90000 20
3 70000 20 //<==== Falls but previously never below 100k, hence ignore
4 90000 20
5 110000 20
6 60000 20 //<==== Falls below 100k, hence we want ID 8
7 80000 20
8 120000 20
9 85000 28
...
....
How can I construct a query that groups by member and gets the first transaction ID that forms the "V" shape? Even pseudocode is fine; I can't share an attempt because I am totally clueless about how to do it.
UPDATES:
Sorry for the lack of explanation of the conditions. The base amount we are looking at is 100k. IDs are random, so we definitely need a row number.
We ignore all transactions before ID = 5 because their RunningAmount never exceeded 100k.
At ID=5 the amount exceeds 100k, so we check whether the transactions after ID=5 show a downtrend in RunningAmount that falls below 100k.
Immediately we see that ID=6 falls below 100k, so we want to find the first transaction that exceeds 100k again (if there is one).
From the data sample above, the expected result is only one record, which is ID=8.
For every member, either one or zero records will be found based on the conditions I've mentioned.
Try this query:
declare @tbl table(TxnID int, RunningAmount int, MemberID int);
insert into @tbl values
(1, 80000, 20),
(2, 90000, 20),
(3, 70000, 20),
(4, 90000, 20),
(5, 110000, 20),
(6, 60000, 20),
(7, 120000, 20),
(8, 85000, 28);
select TxnID, RunningAmount, MemberID,
LAG(VShape) over (partition by MemberID order by TxnID) VShape
from (
select TxnID, RunningAmount, MemberID,
case when rn < lagrn and rn < leadrn then 1 else 0 end VShape
from (
select *,
LAG(rn) over (partition by MemberID order by TxnID) lagRn,
LEAD(rn) over (partition by MemberID order by TxnID) leadRn
from (
select TxnID,
RunningAmount,
MemberID,
ROW_NUMBER() over (partition by MemberID order by RunningAmount) rn
from @tbl
) a
) a
) a
The last column, VShape, indicates whether the value in RunningAmount completes a V shape (although you could be clearer about what that means instead of leaving everybody to figure it out). Now you can filter values based on RunningAmount (whether they fall below or above 100k).
Here is a version for earlier versions of SQL Server that don't have the LAG and LEAD functions:
;with cte as (
select *,
ROW_NUMBER() over (partition by MemberID order by RunningAmount) rn
from @tbl
), cte2 as (
select c1.TxnID, c1.RunningAmount, c1.MemberID, c1.rn, c2.rn [lagRn] , c3.rn [leadRn]
from cte c1
left join cte c2 on c1.TxnID = c2.TxnID + 1 and c1.MemberID = c2.MemberID
left join cte c3 on c1.TxnID = c3.TxnID - 1 and c1.MemberID = c3.MemberID
), cte3 as (
select TxnID, RunningAmount, MemberID,
case when rn < lagrn and rn < leadrn then 1 else 0 end VShape
from cte2
), FinalResult as (
select c1.TxnID, c1.RunningAmount, c1.MemberID, c2.VShape
from cte3 c1
left join cte3 c2 on c1.TxnID = c2.TxnID + 1 and c1.MemberID = c2.MemberID
)
select fr.*, fr2.RunningAmount RunningAmountLagBy2 from FinalResult fr
left join FinalResult fr2 on fr.TxnID = fr2.TxnID + 2
where fr.RunningAmount > 100000 and fr2.RunningAmount > 100000 and fr.VShape = 1
UPDATE
After the question update, here's a solution:
select TxnID from (
select *, ROW_NUMBER() over (partition by MemberID, VShape order by TxnID) CompletesVShape from (
select TxnID,
RunningAmount,
MemberID,
sum(case when RunningAmount >= 100000 then 1 else 0 end) over (partition by MemberID order by TxnID rows between unbounded preceding and current row) VShape
from @tbl
) a
) a where VShape > 1 and CompletesVShape = 1
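For reference, the rule stated in the question update can be written procedurally as a small per-member state machine. The Python sketch below is an illustration under stated assumptions, not a translation of the query above: the function name `first_recovery` is hypothetical, "exceeds" is taken as strictly greater than 100k and "falls below" as strictly less, and TxnID is assumed to reflect transaction order (the question notes that real IDs are random, in which case a proper ordering column would be needed).

```python
def first_recovery(transactions, threshold=100_000):
    """transactions: iterable of (txn_id, running_amount, member_id).

    Returns {member_id: txn_id} with the first transaction that exceeds the
    threshold again after the member has exceeded it once and then dipped below.
    """
    state = {}   # member_id -> "start" | "above" | "dipped" | "done"
    result = {}
    # Assumption: txn_id order is the transaction order within a member.
    for txn_id, amount, member in sorted(transactions, key=lambda t: (t[2], t[0])):
        s = state.get(member, "start")
        if s == "start" and amount > threshold:
            s = "above"            # first time over the threshold
        elif s == "above" and amount < threshold:
            s = "dipped"           # fell back below after having exceeded
        elif s == "dipped" and amount > threshold:
            result[member] = txn_id  # completes the "V"
            s = "done"
        state[member] = s
    return result


# Sample data from the question.
txns = [
    (1, 80000, 20), (2, 90000, 20), (3, 70000, 20), (4, 90000, 20),
    (5, 110000, 20), (6, 60000, 20), (7, 80000, 20), (8, 120000, 20),
    (9, 85000, 28),
]
found = first_recovery(txns)
print(found)
```

On the question's sample data this yields transaction 8 for member 20 and nothing for member 28, which matches the expected result given in the update.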
Based on your question update, and assuming the necessary condition for a V shape is that the rows above and below have running amounts > 100000 and the middle row is smaller than both of them, below is a query showing how to do it in SQL Server 2008.
; with firstlargeamount as
(
select MemberId, minTrxid=min(TxnID)
from t
where RunningAmount>100000
group by MemberId
)
,tbl as
(
select *,
rn=row_number() over( partition by MemberId order by TxnId)
from
t
)
select t3.*,f.*
from tbl t1
join tbl t2
on
t1.memberId=t2.memberid and t1.rn=t2.rn +1
and t1.RunningAmount<t2.RunningAmount
join tbl t3
on
t1.memberId=t3.memberid and t1.rn=t3.rn -1
and t1.RunningAmount<t3.RunningAmount
join firstlargeamount f
on
f.Memberid=t2.memberid and f.minTrxid>=t1.TxnID
Explanation:
The first step is to generate a row number sequence at member level in the CTE tbl, and the minimum limiting transaction in the CTE firstlargeamount.
The second step is a double self join to find the records above and below each row that satisfy the V shape criteria, as well as a join with firstlargeamount to find the rows that satisfy the 100000 criterion.
Note that the above and below records are simply found using +1/-1 from the current record's row number computed in step 1.

SQL query challenge - find top frequent items in columns and summarize result to a pivot table

I am looking for a query to do the following transformation.
Basically I want to find the top 3 most frequent SELL_COUNTRY values and the top 3 most frequent categories on a per-website, per-day basis. (For example, for website 1 on date 6-5-2017 there are 2*US, 1*JP and 1*UK for SELL_COUNTRY, so TOP1_SELL_COUNTRY is US, and JP and UK go to TOP2_SELL_COUNTRY and TOP3_SELL_COUNTRY. The same idea applies to the CATEGORY column.)
My current solution involves many subqueries. It works, but I feel it is too complicated. I am interested in how a SQL master would do it in an elegant way.
Currently I know how to do it uses
From
To
I would do that in 3 steps:
group by country and rank by count
group by category and rank by count
blend results using conditional aggregate (which will just place the values in the necessary cells because the result of the CASE would be just your value and many NULL values, so min() outputs the value)
Like this:
WITH
countries as (
SELECT *, row_number() over (partition by website,date order by count desc)
FROM (
SELECT
website
,date::date
,sell_country
,count(1)
FROM your_table
GROUP BY 1,2,3
) AS c
)
,categories as (
SELECT *, row_number() over (partition by website,date order by count desc)
FROM (
SELECT
website
,date::date
,category
,count(1)
FROM your_table
GROUP BY 1,2,3
) AS c
)
SELECT
website
,date
,coalesce(min(case when t1.row_number=1 then t1.sell_country end),'NA') as top1_sell_country
,coalesce(min(case when t1.row_number=2 then t1.sell_country end),'NA') as top2_sell_country
,coalesce(min(case when t1.row_number=3 then t1.sell_country end),'NA') as top3_sell_country
,coalesce(min(case when t2.row_number=1 then t2.category end),'NA') as top1_sell_category
,coalesce(min(case when t2.row_number=2 then t2.category end),'NA') as top2_sell_category
,coalesce(min(case when t2.row_number=3 then t2.category end),'NA') as top3_sell_category
FROM countries t1
FULL JOIN categories t2
USING (website,date)
GROUP BY 1,2
ORDER BY 1,2
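The three steps (count per group, rank by count, place the top three into columns) can also be sketched in Python. This is a hypothetical illustration: the function name `top3_pivot`, the row layout and the category values are assumptions, and ties are broken alphabetically here, whereas row_number() with equal counts picks an arbitrary order.

```python
from collections import Counter, defaultdict


def top3_pivot(rows):
    """rows: iterable of (website, day, sell_country, category).

    Returns {(website, day): [top1..top3 country, top1..top3 category]},
    padded with 'NA' where fewer than three distinct values exist.
    """
    countries = defaultdict(Counter)
    categories = defaultdict(Counter)
    for website, day, country, category in rows:
        countries[(website, day)][country] += 1
        categories[(website, day)][category] += 1

    def top3(counter):
        # Rank by count descending; alphabetical tie-break is an assumption.
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        names = [name for name, _ in ranked[:3]]
        return names + ["NA"] * (3 - len(names))

    return {key: top3(countries[key]) + top3(categories[key])
            for key in sorted(countries)}


# Countries follow the example in the question; categories are made up.
sample = [
    (1, "2017-06-05", "US", "book"),
    (1, "2017-06-05", "US", "book"),
    (1, "2017-06-05", "JP", "toy"),
    (1, "2017-06-05", "UK", "book"),
]
pivot = top3_pivot(sample)
print(pivot)
```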
WITH a1 AS
(
SELECT *,
COUNT(*) OVER( PARTITION BY website,SUBSTRING(visit_date,1,8),sell_country ) AS sell_cntry,
COUNT(*) OVER( PARTITION BY website,SUBSTRING(visit_date,1,8),pur_country ) AS pur_cntry
FROM Yourtable
),
a2 AS
(
SELECT website,
visit_date,
sell_country,
RANK() OVER ( PARTITION BY website,SUBSTRING(visit_date,1,8) ORDER BY sell_cntry DESC ) AS sell_cntry_rnk
FROM a1
),
a3 AS
(
SELECT website,
visit_date,
pur_country,
RANK() OVER ( PARTITION BY website,SUBSTRING(visit_date,1,8) ORDER BY pur_cntry DESC ) AS pur_cntry_rnk
FROM a1
),
a4 AS
(
SELECT a2.website AS company,
a2.v_date,
CASE WHEN a2.sell_cntry_rn = 1 THEN a2.sell_country END AS TOP1_SELL_COUNTRY,
CASE WHEN a2.sell_cntry_rn = 2 THEN a2.sell_country END AS TOP2_SELL_COUNTRY,
CASE WHEN a2.sell_cntry_rn = 3 THEN a2.sell_country END AS TOP3_SELL_COUNTRY,
CASE WHEN a3.pur_cntry_rn = 1 THEN a3.pur_country END AS TOP1_PUR_COUNTRY,
CASE WHEN a3.pur_cntry_rn = 2 THEN a3.pur_country END AS TOP2_PUR_COUNTRY,
CASE WHEN a3.pur_cntry_rn = 3 THEN a3.pur_country END AS TOP3_PUR_COUNTRY
FROM (
SELECT Z.*,
ROW_NUMBER() OVER( PARTITION BY website,v_date ORDER BY sell_cntry_rnk,sell_country ) AS sell_cntry_rn
FROM
(
SELECT DISTINCT website,
SUBSTRING(visit_date,1,8) AS v_date,
sell_cntry_rnk,
sell_country
FROM a2
) Z
WHERE Z.sell_cntry_rnk <= 3
) a2
INNER JOIN
(
SELECT *,
ROW_NUMBER() OVER( PARTITION BY website,v_date ORDER BY pur_cntry_rnk,pur_country ) AS pur_cntry_rn
FROM
( SELECT DISTINCT website,
SUBSTRING(visit_date,1,8) AS v_date,
pur_cntry_rnk,
pur_country
FROM a3
) Z
WHERE Z.pur_cntry_rnk <= 3
) a3
ON a2.website = a3.website
AND a2.v_date = a3.v_date
),
a5 AS
(
SELECT company,
v_date,
MAX(TOP1_SELL_COUNTRY) AS TOP1_SELL_COUNTRY,
MAX(TOP2_SELL_COUNTRY) AS TOP2_SELL_COUNTRY,
MAX(TOP3_SELL_COUNTRY) AS TOP3_SELL_COUNTRY,
MAX(TOP1_PUR_COUNTRY) AS TOP1_PUR_COUNTRY,
MAX(TOP2_PUR_COUNTRY) AS TOP2_PUR_COUNTRY,
MAX(TOP3_PUR_COUNTRY) AS TOP3_PUR_COUNTRY
FROM a4
GROUP BY company,
v_date
)
SELECT company,
v_date,
CASE WHEN TOP1_SELL_COUNTRY IS NULL THEN 'NA' ELSE TOP1_SELL_COUNTRY END AS TOP1_SELL_COUNTRY,
CASE WHEN TOP2_SELL_COUNTRY IS NULL THEN 'NA' ELSE TOP2_SELL_COUNTRY END AS TOP2_SELL_COUNTRY,
CASE WHEN TOP3_SELL_COUNTRY IS NULL THEN 'NA' ELSE TOP3_SELL_COUNTRY END AS TOP3_SELL_COUNTRY,
CASE WHEN TOP1_PUR_COUNTRY IS NULL THEN 'NA' ELSE TOP1_PUR_COUNTRY END AS TOP1_PUR_COUNTRY,
CASE WHEN TOP2_PUR_COUNTRY IS NULL THEN 'NA' ELSE TOP2_PUR_COUNTRY END AS TOP2_PUR_COUNTRY,
CASE WHEN TOP3_PUR_COUNTRY IS NULL THEN 'NA' ELSE TOP3_PUR_COUNTRY END AS TOP3_PUR_COUNTRY
FROM a5
ORDER BY company,v_date;

Remove duplicates based on a condition

For the below given data set I want to remove the row which has later timestamp.
**37C1Z2990E5E0 (TRXID) should be UNIQUE** in the below dataSet
JKLAMMSDF123 20141112 20141117 5000.0 P 1.22 RT101018 *2014-11-12 10:10:26* 37C1Z2990E5E0 101018
JKLAMMSDF123 20141110 20141114 5000.0 P 1.22 RT161002 *2014-11-12 10:11:33* 37C1Z2990E5E0 161002
-- More rows
Try this:
;WITH DATA AS
(
SELECT TRXID, MAX(YourTimestampColumn) AS TS
FROM YourTable
GROUP BY TRXID
HAVING COUNT(*) > 1
)
DELETE T
FROM YourTable AS T
INNER JOIN DATA AS D
ON T.TRXID = D.TRXID
AND T.YourTimestampColumn = D.TS;
Select the min of the timestamp column and group by all of the other columns.
SELECT MIN(TIMESTAMP), C1, C2, C3...
FROM YOUR_TABLE
GROUP BY C1, C2, C3..
I would do this using a window function plus a CTE.
To check the result after removing duplicates, use this:
;WITH DATA
AS (SELECT *,
Row_number()OVER(partition BY TRXID ORDER BY YourTimestampColumn) rn
FROM YourTable)
select *
FROM data
WHERE rn = 1
To delete the duplicates use this.
;WITH DATA
AS (SELECT *,
Row_number()OVER(partition BY TRXID ORDER BY YourTimestampColumn) rn
FROM YourTable)
DELETE FROM data
WHERE rn > 1
This will work even if you have more than one duplicate for the same TRXID.
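The effect of keeping only `rn = 1` can be checked with a short Python sketch that keeps the earliest row per key. The function name `keep_earliest` and the field names `trxid`, `ts` and `ref` are hypothetical; the two sample rows reuse the TRXID and timestamps from the question, with the timestamps compared as ISO-formatted strings.

```python
def keep_earliest(rows, key, ts):
    """Keep one row per `key`, the one with the smallest `ts` value."""
    earliest = {}
    for row in rows:
        k = row[key]
        # Replace the stored row only when this one is strictly earlier.
        if k not in earliest or row[ts] < earliest[k][ts]:
            earliest[k] = row
    return list(earliest.values())


rows = [
    {"trxid": "37C1Z2990E5E0", "ts": "2014-11-12 10:10:26", "ref": "RT101018"},
    {"trxid": "37C1Z2990E5E0", "ts": "2014-11-12 10:11:33", "ref": "RT161002"},
]
kept = keep_earliest(rows, key="trxid", ts="ts")
print(kept)
```

Only the row with the earlier timestamp survives, regardless of how many later duplicates share the same key.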

PostgreSQL - column value changed - select query optimization

Say we have a table:
CREATE TABLE p
(
id serial NOT NULL,
val boolean NOT NULL,
PRIMARY KEY (id)
);
Populated with some rows:
insert into p (val)
values (true),(false),(false),(true),(true),(true),(false);
ID VAL
1 1
2 0
3 0
4 1
5 1
6 1
7 0
I want to determine when the value has been changed. So the result of my query should be:
ID VAL
2 0
4 1
7 0
I have a solution with joins and subqueries:
select min(id) id, val from
(
select p1.id, p1.val, max(p2.id) last_prev
from p p1
join p p2
on p2.id < p1.id and p2.val != p1.val
group by p1.id, p1.val
) tmp
group by val, last_prev
order by id;
But it is very inefficient and will run extremely slowly on tables with many rows.
I believe there could be a more efficient solution using PostgreSQL window functions?
This is how I would do it with an analytic:
SELECT id, val
FROM ( SELECT id, val
,LAG(val) OVER (ORDER BY id) AS prev_val
FROM p ) x
WHERE val <> COALESCE(prev_val, val)
ORDER BY id
Update (some explanation):
Analytic functions operate as a post-processing step. The query result is broken into groupings (partition by) and the analytic function is applied within the context of a grouping.
In this case, the query is a selection from p. The analytic function being applied is LAG. Since there is no partition by clause, there is only one grouping: the entire result set. This grouping is ordered by id. LAG returns the value of the previous row in the grouping using the specified order. The result is each row having an additional column (aliased prev_val) which is the val of the preceding row. That is the subquery.
Then we look for rows where the val does not match the val of the previous row (prev_val). The COALESCE handles the special case of the first row which does not have a previous value.
Analytic functions may seem a bit strange at first, but a search on analytic functions finds a lot of examples walking through how they work. For example: http://www.cs.utexas.edu/~cannata/dbms/Analytic%20Functions%20in%20Oracle%208i%20and%209i.htm Just remember that it is a post-processing step. You won't be able to perform filtering, etc on the value of an analytic function unless you subquery it.
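To make the LAG-then-filter logic concrete, here is a minimal Python sketch of the same idea: walk the rows in id order, remember the previous val, and keep the rows where it differs. The function name `changed_rows` is hypothetical, and the first row is never reported, mirroring the COALESCE in the query.

```python
def changed_rows(rows):
    """rows: iterable of (id, val). Returns the rows whose val differs from the previous row's."""
    ordered = sorted(rows)          # ORDER BY id
    result = []
    prev_val = None
    for i, (id_, val) in enumerate(ordered):
        # The first row has no predecessor, so it never counts as a change.
        if i > 0 and val != prev_val:
            result.append((id_, val))
        prev_val = val              # becomes the "lag" value for the next row
    return result


# Same data as table p in the question.
p = [(1, True), (2, False), (3, False), (4, True), (5, True), (6, True), (7, False)]
changes = changed_rows(p)
print(changes)
```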
Window function
Instead of calling COALESCE, you can provide a default from the window function lag() directly. A minor detail in this case since all columns are defined NOT NULL. But this may be essential to distinguish "no previous row" from "NULL in previous row".
SELECT id, val
FROM (
SELECT id, val, lag(val, 1, val) OVER (ORDER BY id) <> val AS changed
FROM p
) sub
WHERE changed
ORDER BY id;
Compute the result of the comparison immediately, since the previous value is not of interest per se, only a possible change. Shorter and may be a tiny bit faster.
If you consider the first row to be "changed" (unlike your demo output suggests), you need to observe NULL values - even though your columns are defined NOT NULL. Basic lag() returns NULL in case there is no previous row:
SELECT id, val
FROM (
SELECT id, val, lag(val) OVER (ORDER BY id) IS DISTINCT FROM val AS changed
FROM p
) sub
WHERE changed
ORDER BY id;
Or employ the additional parameters of lag() once again:
SELECT id, val
FROM (
SELECT id, val, lag(val, 1, NOT val) OVER (ORDER BY id) <> val AS changed
FROM p
) sub
WHERE changed
ORDER BY id;
Recursive CTE
As proof of concept. :)
Performance won't keep up with posted alternatives.
WITH RECURSIVE cte AS (
SELECT id, val
FROM p
WHERE NOT EXISTS (
SELECT 1
FROM p p0
WHERE p0.id < p.id
)
UNION ALL
SELECT p.id, p.val
FROM cte
JOIN p ON p.id > cte.id
AND p.val <> cte.val
WHERE NOT EXISTS (
SELECT 1
FROM p p0
WHERE p0.id > cte.id
AND p0.val <> cte.val
AND p0.id < p.id
)
)
SELECT * FROM cte;
With an improvement from @wildplasser.
SQL Fiddle demonstrating all.
Can even be done without window functions.
SELECT * FROM p p0
WHERE EXISTS (
SELECT * FROM p ex
WHERE ex.id < p0.id
AND ex.val <> p0.val
AND NOT EXISTS (
SELECT * FROM p nx
WHERE nx.id < p0.id
AND nx.id > ex.id
)
);
UPDATE: Self-joining a non-recursive CTE (could also be a subquery instead of a CTE)
WITH drag AS (
SELECT id
, rank() OVER (ORDER BY id) AS rnk
, val
FROM p
)
SELECT d1.*
FROM drag d1
JOIN drag d0 ON d0.rnk = d1.rnk -1
WHERE d1.val <> d0.val
;
This nonrecursive CTE approach is surprisingly fast, although it needs an implicit sort.
Using two row_number() computations: this can also be done with the usual "islands and gaps" SQL technique (which could be useful if you can't use the lag() window function for some reason):
with cte1 as (
select
*,
row_number() over(order by id) as rn1,
row_number() over(partition by val order by id) as rn2
from p
)
select *, rn1 - rn2 as g
from cte1
order by id
So this query will give you all the islands:
ID VAL RN1 RN2 G
1 1 1 1 0
2 0 2 1 1
3 0 3 2 1
4 1 4 2 2
5 1 5 3 2
6 1 6 4 2
7 0 7 3 4
You can see how the G field could be used to group these islands together:
with cte1 as (
select
*,
row_number() over(order by id) as rn1,
row_number() over(partition by val order by id) as rn2
from p
)
select
min(id) as id,
val
from cte1
group by val, rn1 - rn2
order by 1
So you'll get
ID VAL
1 1
2 0
4 1
7 0
The only thing left is to remove the first record, which can be done with a min(...) over() window function:
with cte1 as (
...
), cte2 as (
select
min(id) as id,
val,
min(min(id)) over() as mid
from cte1
group by val, rn1 - rn2
)
select id, val
from cte2
where id <> mid
And results:
ID VAL
2 0
4 1
7 0
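The rn1 - rn2 trick can be traced in a few lines of Python: a global row number minus a per-value row number stays constant within a run of equal values, so the pair (val, rn1 - rn2) identifies an island. This is only a sketch of the technique; the function name `island_starts` is hypothetical.

```python
from collections import defaultdict


def island_starts(rows):
    """rows: iterable of (id, val). Returns the first row of every island except the first one."""
    ordered = sorted(rows)                  # ORDER BY id
    seen_per_val = defaultdict(int)
    islands = {}
    for rn1, (id_, val) in enumerate(ordered, start=1):
        seen_per_val[val] += 1
        rn2 = seen_per_val[val]             # row_number() partitioned by val
        key = (val, rn1 - rn2)              # constant within one island
        islands.setdefault(key, id_)        # rows arrive in id order, so the first id is the min
    starts = sorted((id_, val) for (val, _), id_ in islands.items())
    return starts[1:]                       # drop the island containing the first row


p = [(1, True), (2, False), (3, False), (4, True), (5, True), (6, True), (7, False)]
starts = island_starts(p)
print(starts)
```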
A simple inner join can do it, provided the ids are consecutive with no gaps:
select p2.id, p2.val
from
p p1
inner join
p p2 on p2.id = p1.id + 1
where p2.val != p1.val