Include only transition states in SQL query


I have a table with customers and their purchase behaviour that looks as follows:
customer shop time
----------------------------
1 5 13.30
1 5 14.33
1 10 22.17
2 3 12.15
2 1 13.30
2 1 15.55
2 3 17.29
Since I only want the rows where a customer switches shops, I need the following output:
customer shop time
----------------------------
1 5 13.30
1 10 22.17
2 3 12.15
2 1 13.30
2 3 17.29
I have tried using
ROW_NUMBER() OVER (PARTITION BY customer, shop ORDER BY time ASC) AS counter
and then keeping only the rows where counter = 1. However, this breaks when a customer returns to a shop they visited earlier, as with customer=2 and shop=3 in my example.
I came up with this:
WITH a AS
(
SELECT
customer, shop, time,
ROW_NUMBER() OVER (PARTITION BY customer ORDER BY time ASC) AS counter
FROM
db
)
SELECT a1.*
FROM a a1
JOIN a AS a2 ON (a1.customer = a2.customer AND a2.counter + 1 = a1.counter AND a2.shop <> a1.shop)
UNION
SELECT a.*
FROM a
WHERE counter = 1
However, this is very inefficient, and running it on AWS, where my data is located, results in an error telling me that
Query exhausted resources at this scale factor
Is there any way to make this query more efficient?

This is a gaps-and-islands problem. But the simplest solution uses lag():
select customer, shop, time
from (select t.*, lag(shop) over (partition by customer order by time) as prev_shop
from t
) t
where prev_shop is null or prev_shop <> shop;
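
To check the lag() approach against the sample data, here is a minimal sketch that runs the same query in an in-memory SQLite database through Python's built-in sqlite3 module. It assumes SQLite 3.25 or later (needed for window functions); the table name t comes from the answer, while the helper name transitions is only illustrative.

```python
import sqlite3

# Sample rows from the question: (customer, shop, time).
ROWS = [
    (1, 5, 13.30), (1, 5, 14.33), (1, 10, 22.17),
    (2, 3, 12.15), (2, 1, 13.30), (2, 1, 15.55), (2, 3, 17.29),
]

# Keep a row only if it is the customer's first row or the shop differs
# from the shop on that customer's previous row.
QUERY = """
SELECT customer, shop, time
FROM (SELECT t.*,
             LAG(shop) OVER (PARTITION BY customer ORDER BY time) AS prev_shop
      FROM t) AS x
WHERE prev_shop IS NULL OR prev_shop <> shop
ORDER BY customer, time
"""


def transitions(rows):
    """Load rows into a throwaway table and return the transition rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (customer INTEGER, shop INTEGER, time REAL)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in transitions(ROWS):
        print(row)
```

Because lag() needs only one pass over each customer's ordered rows, it avoids the self-join from the question, which is probably what exhausted resources at scale.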

Related

Access sql Moving Average of Top N With 2 criterias

I have been searching the forum and found a single post that is a little similar to my problem: Calculate average for Top n combined with SQL Group By.
My situation is:
I have a table tblWEIGHT that contains: ID, Date, idPONR, Weight
I have a second table tblSALES that contains: ID, Date, Sales, idPONR
I have a third table tblPONR that contains: ID, PONR, idProduct
And a fourth table tblPRODUCT that contains: ID, Product
The linking:
tblWEIGHT.idPONR = tblPONR.ID
tblSALES.idPONR = tblPONR.ID
tblPONR.idProduct = tblPRODUCT.ID
The main table of my query is tblSALES. I want all my sales listed, with the moving average of the top 5 weights of the PRODUCT where the date of the weight is earlier than the sales date and the product is the same as the sold product. It's IMPORTANT that the result isn't grouped by the date. I need all the records of tblSALES.
I have gotten as far as getting the top 1 weight, but I'm not able to get the moving average instead.
The query that gets the top 1 is the following, and I am guessing that the query I need is going to look a lot like it.
SELECT tblSALES.ID, tblSALES.Date, tblPONR.idPRODUCT,
(
SELECT top 1 Weight FROM tblWEIGHT INNER JOIN tblPONR ON tblWeight.idPONR = tblPONR.ID
WHERE tblPONR.idPRODUCT = idPRODUCT AND
tblSALES.Date > tblWEIGHT.Date
ORDER BY tblWEIGHT.Date desc
) AS LatestWeight
FROM tblSALES INNER JOIN tblPONR ON tblSALES.idPONR = tblPONR.ID
This is not my exact query: I'm Danish, and the original names wouldn't make sense here. I know I'm not supposed to use Date as a field name.
I imagine the final query would be something like:
SELECT tblSALES.ID..... avg(SELECT TOP 5 weight .........)
but doing this I keep getting the error "At most one record can be returned by this subquery".
Final question:
How do I make a query that creates a moving average of the top 5 weights of my sold product, where the date of the weight is earlier than the date I sold the product?
EDIT Sampledata:
DATEFORMAT: dd/mm/yyyy
tblWEIGHT
ID Date idPONR Weight
1 01-01-2020 1 100
2 02-01-2020 2 200
3 03-01-2020 3 200
4 04-01-2020 3 400
5 05-01-2020 2 250
6 06-01-2020 1 150
7 07-01-2020 2 200
tblSALES
ID Date Sales(amt) idPONR
1 05-01-2020 30 1
2 06-01-2020 15 2
3 10-01-2020 20 3
tblPONR
ID PONR(production Number) idProduct
1 2521 1
2 1548 1
3 5484 2
tblPRODUCT
ID Product
1 Bricks
2 Tiles
Desired outcome (read the comments for AvgWeight; 123 is a placeholder):
tblSALES.ID tblSALES.Date tblSales.Sales(amt) AvgWeight
1 05-01-2020 30 123 -->avg(top 5 newest weight of both idPONR 1 And 2 because they are the same product, and where tblWeight.Date<05-01-2020)
2 06-01-2020 15 123 -->avg(top 5 newest weight of both idPONR 1 And 2 because they are the same product, and where tblWeight.Date<06-01-2020)
3 10-01-2020 20 123 -->avg(top 5 newest weight of idPONR 3 since thats the only idPONR with that product, and where tblWeight.Date<10-01-2020)

Consider:
Query1
SELECT tblWeight.ID AS WeightID, tblWeight.Date AS WtDate,
tblWeight.idPONR, tblPONR.PONR, tblPONR.idProduct, tblWeight.Weight, tblSales.SalesAmt,
tblSales.ID AS SalesID, tblSales.Date AS SalesDate
FROM (tblPONR INNER JOIN tblWeight ON tblPONR.ID = tblWeight.idPONR)
INNER JOIN tblSales ON tblPONR.ID = tblSales.idPONR;
Query2
SELECT * FROM Query1 WHERE WeightID IN (
SELECT TOP 5 WeightID FROM Query1 AS Dupe WHERE Dupe.idProduct = Query1.idProduct
AND Dupe.WtDate<Query1.SalesDate ORDER BY Dupe.WtDate);
Query3
SELECT Query2.SalesID, Query2.SalesDate, Query2.SalesAmt,
First(DAvg("Weight","Query2","idProduct=" & [idProduct] & " AND WtDate<#" & [SalesDate] & "#")) AS AvgWt
FROM Query2
GROUP BY Query2.SalesID, Query2.SalesDate, Query2.SalesAmt;
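
To make the intended result concrete, here is a minimal sketch that computes the average of the five newest earlier weights per sale on the sample data. It is not Access SQL: it uses SQLite through Python's sqlite3 module (3.25 or later for ROW_NUMBER), and the table and column names (weights, sales, ponr, wt_date, sale_date) are simplified stand-ins for the ones in the question. Dates are stored as ISO strings so that string comparison orders them correctly.

```python
import sqlite3

SCHEMA = """
CREATE TABLE ponr    (id INTEGER, ponr INTEGER, id_product INTEGER);
CREATE TABLE weights (id INTEGER, wt_date TEXT, id_ponr INTEGER, weight REAL);
CREATE TABLE sales   (id INTEGER, sale_date TEXT, sales REAL, id_ponr INTEGER);
INSERT INTO ponr VALUES (1, 2521, 1), (2, 1548, 1), (3, 5484, 2);
INSERT INTO weights VALUES
  (1, '2020-01-01', 1, 100), (2, '2020-01-02', 2, 200),
  (3, '2020-01-03', 3, 200), (4, '2020-01-04', 3, 400),
  (5, '2020-01-05', 2, 250), (6, '2020-01-06', 1, 150),
  (7, '2020-01-07', 2, 200);
INSERT INTO sales VALUES
  (1, '2020-01-05', 30, 1), (2, '2020-01-06', 15, 2), (3, '2020-01-10', 20, 3);
"""

# For each sale, rank all earlier weights of the same product (across every
# PONR of that product) from newest to oldest, then average the newest five.
QUERY = """
WITH ranked AS (
    SELECT s.id AS sale_id, w.weight,
           ROW_NUMBER() OVER (PARTITION BY s.id ORDER BY w.wt_date DESC) AS rn
    FROM sales AS s
    JOIN ponr AS ps ON ps.id = s.id_ponr
    JOIN ponr AS pw ON pw.id_product = ps.id_product
    JOIN weights AS w ON w.id_ponr = pw.id AND w.wt_date < s.sale_date
)
SELECT sale_id, AVG(weight) AS avg_weight
FROM ranked
WHERE rn <= 5
GROUP BY sale_id
ORDER BY sale_id
"""


def avg_top5_weights():
    """Return (sale_id, avg_weight) for every sale with an earlier weight."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in avg_top5_weights():
        print(row)
```

Note that the inner joins drop any sale that has no earlier weight for its product. If every row of the sales table must appear, a left join from sales would be needed instead.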

Count the number of transactions per month for an individual group by date Hive

I have a table of customer transactions where each item purchased by a customer is stored as one row, so a single transaction can span multiple rows in the table. There is also a column called visit_date.
There is a category column called cal_month_nbr which ranges from 1 to 12 based on which month transaction occurred.
The data looks like below
Id visit_date Cal_month_nbr
---- ------ ------
1 01/01/2020 1
1 01/02/2020 1
1 01/01/2020 1
2 02/01/2020 2
1 02/01/2020 2
1 03/01/2020 3
3 03/01/2020 3
First, I want to know how many times each customer visits per month, based on visit_date,
i.e. I want the output below:
id cal_month_nbr visit_per_month
--- --------- ----
1 1 2
1 2 1
1 3 1
2 2 1
3 3 1
Second, I want the average number of visits per month for each id,
i.e.
id Avg_freq_per_month
---- -------------
1 1.33
2 1
3 1
I tried the query below, but it counts each item as one transaction:
select avg(count_e) as num_visits_per_month,individual_id
from
(
select r.individual_id, cal_month_nbr, count(*) as count_e
from
ww_customer_dl_secure.cust_scan r
GROUP by
r.individual_id, cal_month_nbr
order by count_e desc
) as t
group by individual_id
I would appreciate any help, guidance or suggestions

You can divide the total visits by the number of months:
select individual_id,
count(*) / count(distinct cal_month_nbr)
from ww_customer_dl_secure.cust_scan c
group by individual_id;
If you want the average number of distinct visit days per month, then:
select individual_id,
count(distinct visit_date) / count(distinct cal_month_nbr)
from ww_customer_dl_secure.cust_scan c
group by individual_id;
Actually, Hive may not be efficient at calculating count(distinct), so multiple levels of aggregation might be faster:
select individual_id, avg(num_visit_days)
from (select individual_id, cal_month_nbr, count(*) as num_visit_days
from (select distinct individual_id, visit_date, cal_month_nbr
from ww_customer_dl_secure.cust_scan c
) iv
group by individual_id, cal_month_nbr
) ic
group by individual_id;
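
As a quick check against the sample data, the following sketch runs the distinct-visit-day logic in SQLite through Python's sqlite3 module rather than in Hive. The table name cust_scan and the column names follow the question; the helper run is only illustrative.

```python
import sqlite3

# Sample rows from the question: (individual_id, visit_date, cal_month_nbr).
ROWS = [
    (1, "01/01/2020", 1), (1, "01/02/2020", 1), (1, "01/01/2020", 1),
    (2, "02/01/2020", 2), (1, "02/01/2020", 2),
    (1, "03/01/2020", 3), (3, "03/01/2020", 3),
]

# Visits per customer per month: several item rows on the same date count once.
PER_MONTH = """
SELECT individual_id, cal_month_nbr, COUNT(DISTINCT visit_date) AS visits
FROM cust_scan
GROUP BY individual_id, cal_month_nbr
ORDER BY individual_id, cal_month_nbr
"""

# Average of the monthly visit counts per customer.
AVG_PER_ID = """
SELECT individual_id, AVG(visits) AS avg_visits
FROM (SELECT individual_id, cal_month_nbr,
             COUNT(DISTINCT visit_date) AS visits
      FROM cust_scan
      GROUP BY individual_id, cal_month_nbr) AS m
GROUP BY individual_id
ORDER BY individual_id
"""


def run(query, rows=ROWS):
    """Load rows into an in-memory table and return the query result."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cust_scan "
        "(individual_id INTEGER, visit_date TEXT, cal_month_nbr INTEGER)"
    )
    conn.executemany("INSERT INTO cust_scan VALUES (?, ?, ?)", rows)
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(run(PER_MONTH))
    print(run(AVG_PER_ID))
```

The average is taken only over months in which the customer appears at all, which matches the 1.33 shown for id 1 in the question.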

Vertica SQL for running count distinct and running conditional count

I'm trying to build a department level score table based on a deeper product url level score table.
Dates are not consecutive.
Not all URLs get score updates on the same day (they are independent of each other).
dist_url should be a running (cumulative) distinct count.
Both "dist urls" and "urls score >=30" are distinct counts.
What I have now is:
Date url Store Dept Page Score
10/1 a US A X 10
10/1 b US A X 30
10/1 c US A X 60
10/4 a US A X 20
10/4 d US A X 60
10/6 b US A X 22
10/9 a US A X 40
10/9 e US A X 10
What I want is:
Date Store Dept Page dist urls urls score >=30
10/1 US A X 3 2
10/4 US A X 4 3
10/6 US A X 4 2
10/9 US A X 5 2
I think dist_url can be computed with a window function, but I'm not sure how to write the query.
My current query is below, but it's wrong because the distinct count is not cumulative:
SELECT
bm.AnalysisDate,
su.SoID AS Store,
su.DptCaID AS DTID,
su.PageTypeID AS PTID,
COUNT(DISTINCT bm.SeoURLID) AS NumURLsWithDupScore,
SUM(CASE WHEN bm.DuplicationScore > 30 THEN 1 ELSE 0 END) AS Over30Count
FROM csn_seo.tblBotifyMetrics bm
INNER JOIN csn_seo.tblSEOURLs su
ON bm.SeoURLID = su.ID
WHERE su.DptCaID IS NOT NULL
AND su.DptCaID <> 0
AND su.PageTypeID IS NOT NULL
AND su.PageTypeID <> -1
AND bm.iscompliant = 1
GROUP BY bm.AnalysisDate, su.SoID, su.DptCaID, su.PageTypeID;
Please let me know if anyone has any idea.

Based on your question, you seem to want two levels of logic:
select date, store, dept,
sum(sum(start)) over (partition by dept, page order by date) as distinct_urls,
sum(sum(start_30)) over (partition by dept, page order by date) as distinct_urls_30
from ((select store, dept, page, url, min(date) as date, 1 as start, 0 as start_30
from t
group by store, dept, page, url
) union all
(select store, dept, page, url, min(date) as date, 0, 1
from t
where score >= 30
group by store, dept, page, url
)
) t
group by date, store, dept, page;
I don't understand how your query is related to your question.

Try as I might, I can't reproduce your output either.
But I think you can avoid the UNION SELECTs. Does this do what you expect?
NULLs are ignored by COUNT(DISTINCT), and here you can combine an aggregate expression with an OLAP one.
Vertica also has named windows to improve readability.
WITH
input(Date,url,Store,Dept,Page,Score) AS (
SELECT DATE '2019-10-01','a','US','A','X',10
UNION ALL SELECT DATE '2019-10-01','b','US','A','X',30
UNION ALL SELECT DATE '2019-10-01','c','US','A','X',60
UNION ALL SELECT DATE '2019-10-04','a','US','A','X',20
UNION ALL SELECT DATE '2019-10-04','d','US','A','X',60
UNION ALL SELECT DATE '2019-10-06','b','US','A','X',22
UNION ALL SELECT DATE '2019-10-09','a','US','A','X',40
UNION ALL SELECT DATE '2019-10-09','e','US','A','X',10
)
SELECT
date
, store
, dept
, page
, SUM(COUNT(DISTINCT url) ) OVER(w) AS dist_urls
, SUM(COUNT(DISTINCT CASE WHEN score >=30 THEN url END)) OVER(w) AS dist_urls_gt_30
FROM input
GROUP BY
date
, store
, dept
, page
WINDOW w AS (PARTITION BY store,dept,page ORDER BY date)
;
-- out date | store | dept | page | dist_urls | dist_urls_gt_30
-- out ------------+-------+------+------+-----------+-----------------
-- out 2019-10-01 | US | A | X | 3 | 2
-- out 2019-10-04 | US | A | X | 5 | 3
-- out 2019-10-06 | US | A | X | 6 | 3
-- out 2019-10-09 | US | A | X | 8 | 4
-- out (4 rows)
-- out
-- out Time: First fetch (4 rows): 45.321 ms. All rows formatted: 45.364 ms
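
Summing per-day distinct counts, as in the output above, counts a URL again on every day it reappears, so the totals (3, 5, 6, 8) overshoot the desired 3, 4, 4, 5. A cumulative distinct count can instead be built on the first-appearance idea from the first answer: record the first date each URL is seen, then count, for every reporting date, the URLs first seen on or before it. The sketch below checks this on the sample data using SQLite through Python's sqlite3 module, not Vertica; the table name scores and column name score_date are illustrative, and the year 2019 is taken from the test data above. The "urls score >=30" column is left out because its intended semantics are not clear from the question.

```python
import sqlite3

# Sample rows: (score_date, url, store, dept, page, score).
ROWS = [
    ("2019-10-01", "a", "US", "A", "X", 10),
    ("2019-10-01", "b", "US", "A", "X", 30),
    ("2019-10-01", "c", "US", "A", "X", 60),
    ("2019-10-04", "a", "US", "A", "X", 20),
    ("2019-10-04", "d", "US", "A", "X", 60),
    ("2019-10-06", "b", "US", "A", "X", 22),
    ("2019-10-09", "a", "US", "A", "X", 40),
    ("2019-10-09", "e", "US", "A", "X", 10),
]

# first_seen: the first date on which each URL appears within its group.
# days: every reporting date per group, including dates with no new URL.
QUERY = """
WITH first_seen AS (
    SELECT store, dept, page, url, MIN(score_date) AS first_date
    FROM scores
    GROUP BY store, dept, page, url
),
days AS (
    SELECT DISTINCT score_date, store, dept, page FROM scores
)
SELECT d.score_date, d.store, d.dept, d.page,
       (SELECT COUNT(*)
        FROM first_seen AS f
        WHERE f.store = d.store AND f.dept = d.dept AND f.page = d.page
          AND f.first_date <= d.score_date) AS dist_urls
FROM days AS d
ORDER BY d.store, d.dept, d.page, d.score_date
"""


def running_distinct_urls(rows=ROWS):
    """Return one row per date and group with the cumulative URL count."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE scores (score_date TEXT, url TEXT, store TEXT, "
        "dept TEXT, page TEXT, score INTEGER)"
    )
    conn.executemany("INSERT INTO scores VALUES (?, ?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in running_distinct_urls():
        print(row)
```

Taking the reporting dates from a separate days set keeps dates such as 10/6, on which no URL appears for the first time, in the output.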

How to remove duplicate accounts in SQL?

I am using SQL Server 2008 and I was wondering how to remove duplicate customers either from the table or exclude it in my query. An Account_ID can only have 1 product associated with it. And the account with the most recent purchase date is what should be showing. An example is below:
Account_ID, Account_Purchase, Purchase_Date
1 Product 1 1/1/2016
2 Product 1 1/2/2016
3 Product 2 1/5/2016
1 Product 3 3/12/2016
4 Product 3 1/5/2016
Ideally I would only see:
Account_ID, Account_Purchase, Purchase_Date
2 Product 1 1/2/2016
3 Product 2 1/5/2016
1 Product 3 3/12/2016
4 Product 3 1/5/2016
This row should not show up because it is not the most recent purchase for account 1:
Account_ID, Account_Purchase, Purchase_Date
1 Product 1 1/1/2016
Thank you all for help, folks!

Simply get the latest Purchase_Date per Account_ID using MAX and GROUP BY, then inner join back to the table to get the remaining columns.
SELECT TABLE_NAME.* FROM TABLE_NAME
INNER JOIN(
SELECT Account_ID, MAX(Purchase_Date) AS Purchase_Date
FROM TABLE_NAME
GROUP BY Account_ID
) LatestPurchases
ON TABLE_NAME.Account_ID = LatestPurchases.Account_ID
AND TABLE_NAME.Purchase_Date = LatestPurchases.Purchase_Date

Try the query below, replacing TABLENAME with your table:
WITH CTE
AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY Account_ID ORDER BY Purchase_Date DESC) AS RN
FROM TABLENAME
)
SELECT
*
FROM CTE
WHERE RN = 1

Here is another query
SELECT
t.Account_id,
t.Account_Purchase,
t.Purchase_Date
FROM
tablename t
WHERE
t.Purchase_Date = (SELECT MAX(Purchase_date) FROM Tablename WHERE Account_ID = t.Account_ID)
ORDER BY
t.Purchase_Date DESC
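
The ROW_NUMBER approach can be checked against the sample data with the sketch below. It runs in SQLite through Python's sqlite3 module (3.25 or later) rather than SQL Server; the table name purchases is illustrative, and the dates are stored as ISO strings so that they sort chronologically.

```python
import sqlite3

# Sample rows: (account_id, account_purchase, purchase_date as ISO text).
ROWS = [
    (1, "Product 1", "2016-01-01"),
    (2, "Product 1", "2016-01-02"),
    (3, "Product 2", "2016-01-05"),
    (1, "Product 3", "2016-03-12"),
    (4, "Product 3", "2016-01-05"),
]

# Number each account's purchases from newest to oldest and keep the newest.
QUERY = """
WITH ranked AS (
    SELECT account_id, account_purchase, purchase_date,
           ROW_NUMBER() OVER (PARTITION BY account_id
                              ORDER BY purchase_date DESC) AS rn
    FROM purchases
)
SELECT account_id, account_purchase, purchase_date
FROM ranked
WHERE rn = 1
ORDER BY account_id
"""


def latest_purchases(rows=ROWS):
    """Return the most recent purchase row for every account."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE purchases "
        "(account_id INTEGER, account_purchase TEXT, purchase_date TEXT)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in latest_purchases():
        print(row)
```

Unlike the MAX-based variants, ROW_NUMBER returns exactly one row per account even when two purchases share the latest date, although which of the tied rows is kept is then arbitrary.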

Firebird Query- Return first row each group

In a Firebird database with a table "Sales", I need to select the first sale of each customer. Below is a sample showing the table and the desired query result.
---------------------------------------
SALES
---------------------------------------
ID CUSTOMERID DTHRSALE
1 25 01/04/16 09:32
2 30 02/04/16 11:22
3 25 05/04/16 08:10
4 31 07/03/16 10:22
5 22 01/02/16 12:30
6 22 10/01/16 08:45
Result: only first sale, based on sale date.
ID CUSTOMERID DTHRSALE
1 25 01/04/16 09:32
2 30 02/04/16 11:22
4 31 07/03/16 10:22
6 22 10/01/16 08:45
I've already tried the code from "Select first row in each GROUP BY group?", but it did not work.

In Firebird 2.5 you can do this with the following query; this is a minor modification of the second part of the accepted answer of the question you linked to tailored to your schema and requirements:
select x.id,
x.customerid,
x.dthrsale
from sales x
join (select customerid,
min(dthrsale) as first_sale
from sales
group by customerid) p on p.customerid = x.customerid
and p.first_sale = x.dthrsale
order by x.id
The order by is not necessary, I just added it to make it give the order as shown in your question.
With Firebird 3 you can use the window function ROW_NUMBER which is also described in the linked answer. The linked answer incorrectly said the first solution would work on Firebird 2.1 and higher. I have now edited it.

Search for the sales with no earlier sales:
SELECT S1.*
FROM SALES S1
LEFT JOIN SALES S2 ON S2.CUSTOMERID = S1.CUSTOMERID AND S2.DTHRSALE < S1.DTHRSALE
WHERE S2.ID IS NULL
Define an index over (customerid, dthrsale) to make it fast.
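
Here is a minimal sketch of the anti-join on the sample data, run in SQLite through Python's sqlite3 module rather than Firebird. The dd/mm/yy timestamps from the question are rewritten as ISO strings (an assumption about the intended dates) so that they compare chronologically; the index name is illustrative.

```python
import sqlite3

# Sample rows: (id, customerid, dthrsale), dates converted from dd/mm/yy.
ROWS = [
    (1, 25, "2016-04-01 09:32"),
    (2, 30, "2016-04-02 11:22"),
    (3, 25, "2016-04-05 08:10"),
    (4, 31, "2016-03-07 10:22"),
    (5, 22, "2016-02-01 12:30"),
    (6, 22, "2016-01-10 08:45"),
]

# A sale is the customer's first if no earlier sale exists for that customer:
# the LEFT JOIN finds earlier sales, and the NULL check keeps unmatched rows.
QUERY = """
SELECT s1.id, s1.customerid, s1.dthrsale
FROM sales AS s1
LEFT JOIN sales AS s2
       ON s2.customerid = s1.customerid AND s2.dthrsale < s1.dthrsale
WHERE s2.id IS NULL
ORDER BY s1.id
"""


def first_sales(rows=ROWS):
    """Return the earliest sale row for every customer."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (id INTEGER, customerid INTEGER, dthrsale TEXT)"
    )
    # Composite index suggested in the answer to speed up the self-join.
    conn.execute("CREATE INDEX idx_sales_cust_date ON sales (customerid, dthrsale)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in first_sales():
        print(row)
```

If two sales of one customer share the same earliest timestamp, both rows are returned, since neither has a strictly earlier sale.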

In Firebird 3, get the first row for each customer by minimum sales_date:
SELECT id, customer_id, total, sales_date
FROM (
SELECT id, customer_id, total, sales_date
, row_number() OVER(PARTITION BY customer_id ORDER BY sales_date ASC ) AS rn
FROM SALES
) sub
WHERE rn = 1;
If you want to get other related columns, this is where your self-answer fails:
select customer_id , min(sales_date)
, id, total --what about other columns
from SALES
group by customer_id

As simple as:
select CUSTOMERID, min(DTHRSALE) from SALES group by CUSTOMERID
