SQL - How to find if a combination of columns has occurred before or not?

The following example demonstrates the question:
id | location | dt
1 | India | 2020-01-01
2 | Usa | 2020-02-01
1 | Usa | 2020-03-01
3 | China | 2020-04-01
1 | India | 2020-05-01
2 | France | 2020-06-01
1 | India | 2020-07-01
2 | Usa | 2020-08-01
This table is sorted by date.
I want to create another column that tells whether the id has been to the location before.
The output would look like this:
id | location | dt | travelled
1 | India | 2020-01-01 | 0
2 | Usa | 2020-02-01 | 0
1 | Usa | 2020-03-01 | 0
3 | China | 2020-04-01 | 0
1 | India | 2020-05-01 | 1
2 | France | 2020-06-01 | 0
1 | India | 2020-07-01 | 1
2 | Usa | 2020-08-01 | 1
The issue I am facing is that, for every row, I need to consider only the rows above it.

Use EXISTS in a CASE expression:
SELECT t1.id, t1.location, t1.dt,
       CASE
         WHEN EXISTS (
           SELECT 1
           FROM tablename t2
           -- an earlier row for the same id and location means "been there before"
           WHERE t2.id = t1.id AND t2.location = t1.location AND t2.dt < t1.dt
         ) THEN 1
         ELSE 0
       END AS travelled
FROM tablename t1

I would strongly recommend window functions for this:
select t.*,
       -- the first visit per (id, location) gets row number 1; later visits are repeats
       (case when row_number() over (partition by id, location order by dt) > 1
             then 1 else 0
        end) as travelled
from t;
Window functions are usually faster than alternative methods.
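To check that both answers give the output asked for, here is a small sketch that runs them against the sample rows using Python's built-in sqlite3 module. The table name visits and the helper travelled_flags are assumptions made for illustration, and the window-function query needs SQLite 3.25 or newer.

```python
import sqlite3

# Sample rows copied from the question (id, location, dt).
ROWS = [
    (1, "India", "2020-01-01"),
    (2, "Usa", "2020-02-01"),
    (1, "Usa", "2020-03-01"),
    (3, "China", "2020-04-01"),
    (1, "India", "2020-05-01"),
    (2, "France", "2020-06-01"),
    (1, "India", "2020-07-01"),
    (2, "Usa", "2020-08-01"),
]

# Correlated subquery: look for an earlier row with the same id and location.
EXISTS_QUERY = """
SELECT CASE WHEN EXISTS (
         SELECT 1 FROM visits v2
         WHERE v2.id = v1.id AND v2.location = v1.location AND v2.dt < v1.dt
       ) THEN 1 ELSE 0 END AS travelled
FROM visits v1
ORDER BY v1.dt
"""

# Window function: any row after the first per (id, location) is a repeat visit.
WINDOW_QUERY = """
SELECT CASE WHEN ROW_NUMBER() OVER (PARTITION BY id, location ORDER BY dt) > 1
            THEN 1 ELSE 0 END AS travelled
FROM visits
ORDER BY dt
"""


def travelled_flags(query):
    """Load the sample rows into an in-memory table (hypothetical name: visits)
    and return the travelled column in date order."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (id INTEGER, location TEXT, dt TEXT)")
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", ROWS)
    flags = [row[0] for row in conn.execute(query)]
    conn.close()
    return flags


if __name__ == "__main__":
    print(travelled_flags(EXISTS_QUERY))
    print(travelled_flags(WINDOW_QUERY))
```

Both queries should agree on this data. Which one is faster on a real table depends on the database and its indexes.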

Related

Specific grouping elements in SQL Server

I've got a problem with my SQL task and didn't find any answers yet.
I've got a table with this sample data:
ID | Value | Date
1 | 1 | 2020-01-01
1 | 2 | 2020-03-02
1 | 1 | 2020-03-21
1 | 1 | 2020-04-14
1 | 3 | 2020-05-01
1 | 1 | 2020-08-09
1 | 1 | 2020-09-12
1 | 1 | 2020-10-12
1 | 3 | 2020-12-04
All I want to get is:
ID | Value | Date
1 | 1 | 2020-01-01
1 | 2 | 2020-03-02
1 | 1 | 2020-03-21
1 | 3 | 2020-05-01
1 | 1 | 2020-08-09
1 | 3 | 2020-12-04
Some kind of changing value history, but only if the value was changed - when value on new record is the same, get value with min date.
I tried with grouping and row_number, but got no positive results. Any ideas how to do that?
One way to articulate your logic is to say that you want to retain a record when the previous record, as ordered by the date (within a given ID), has a different value than the current record.
WITH cte AS (
    SELECT *, LAG(Value) OVER (PARTITION BY ID ORDER BY Date) LagValue
    FROM yourTable
)
SELECT ID, Value, Date
FROM cte
WHERE LagValue <> Value OR LagValue IS NULL
ORDER BY Date;
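The LAG approach can be tried on the sample data with a short sketch like the one below. It uses Python's built-in sqlite3 module (SQLite 3.25+ for window functions) rather than SQL Server, and the table name history, the column name dt, and the helper value_changes are assumptions for illustration.

```python
import sqlite3

# Sample rows copied from the question (ID, Value, Date).
ROWS = [
    (1, 1, "2020-01-01"),
    (1, 2, "2020-03-02"),
    (1, 1, "2020-03-21"),
    (1, 1, "2020-04-14"),
    (1, 3, "2020-05-01"),
    (1, 1, "2020-08-09"),
    (1, 1, "2020-09-12"),
    (1, 1, "2020-10-12"),
    (1, 3, "2020-12-04"),
]


def value_changes(rows):
    """Keep a row only when its value differs from the previous row's value
    (per id, ordered by date), or when there is no previous row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE history (id INTEGER, value INTEGER, dt TEXT)")
    conn.executemany("INSERT INTO history VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        WITH cte AS (
            SELECT id, value, dt,
                   LAG(value) OVER (PARTITION BY id ORDER BY dt) AS lag_value
            FROM history
        )
        SELECT id, value, dt
        FROM cte
        WHERE lag_value <> value OR lag_value IS NULL
        ORDER BY id, dt
        """
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in value_changes(ROWS):
        print(row)
```

The IS NULL branch is what keeps the first row of each ID, since LAG has nothing to look back at there.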

Aggregate data based on fixed moving date window in Presto

I wanted to:
aggregate numbers in a "3-month" rolling window (e.g. Jan-Mar, Feb-Apr, Mar-May, ...)
then compare the same country & city with last year's same rolling window
Table I already have: (unique at: country + city + month level)
country city month sum
US A 2019-03-01 3
US B 2019-03-01 4
DE C 2019-03-01 5
US A 2019-03-01 3
CN B 2019-03-01 4
US B 2019-04-01 4
UK C 2019-04-01 7
US C 2019-04-01 2
....
US A 2019-12-01 10
US B 2020-12-01 6
US C 2021-01-01 7
Step 1 ideal output:
country city period sum
US A 2019-03-01~2019-05-01 XXX
US A 2019-04-01~2019-06-01 YYY
UK A 2019-03-01~2019-05-01 ZZZ
...
UK A 2020-12-01~2021-02-01 BBB
Step 2 ideal output:
country city period sum last_year_sum year_over_year_%
US A 2019-03-01~2019-05-01 XXX 111 40%
US A 2019-04-01~2019-06-01 YYY 1111 30%
UK A 2019-03-01~2019-05-01 ZZZ 11111 20%
...
UK A 2020-12-01~2021-02-01 BBB 1111 15%
Ideally, I wanted to achieve this in Presto - any idea how to do that? Thanks!!
Unfortunately, Presto doesn't support the range window frame specification using dates. One method uses joins and aggregation and then lag() to get the last year amount:
select t.country, t.city, t.month,
       sum(t2.sum) as this_year_sum,
       lag(sum(t2.sum), 12) over (partition by t.country, t.city order by t.month) as prev_year_sum,
       (-1 +
        sum(t2.sum) * 1.0 /  -- * 1.0 avoids integer division
        lag(sum(t2.sum), 12) over (partition by t.country, t.city order by t.month)
       ) as yoy_increase
from t left join
     t t2
     on t2.country = t.country and
        t2.city = t.city and
        t2.month >= t.month and
        t2.month <= t.month + interval '2' month
-- group by month (not sum) so that each row is one rolling window
group by t.country, t.city, t.month;
Note: This assumes that you have data for all months for each country/city combination.
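As a rough illustration of the join-then-lag idea, the sketch below runs the same logic in SQLite through Python's sqlite3 module instead of Presto. The table name sales, the column name amount, the helper rolling_yoy, and the generated data (15 consecutive months for one city with amounts 1 to 15) are all assumptions. The aggregation is done in a CTE first and LAG is applied afterwards, which is a slightly different layout from the single-query form above.

```python
import sqlite3


def rolling_yoy():
    """Forward-looking 3-month rolling sums per country/city, compared with
    the window that started 12 months earlier."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (country TEXT, city TEXT, month TEXT, amount INTEGER)"
    )
    # Hypothetical data: 2019-01-01 .. 2020-03-01, amounts 1..15, no missing months.
    months = [f"{2019 + i // 12}-{i % 12 + 1:02d}-01" for i in range(15)]
    conn.executemany(
        "INSERT INTO sales VALUES ('US', 'A', ?, ?)",
        [(m, i + 1) for i, m in enumerate(months)],
    )
    rows = conn.execute(
        """
        WITH rolling AS (
            -- self-join: each month collects itself and the next two months
            SELECT s.country, s.city, s.month, SUM(s2.amount) AS window_sum
            FROM sales s
            LEFT JOIN sales s2
              ON s2.country = s.country
             AND s2.city = s.city
             AND s2.month >= s.month
             AND s2.month <= date(s.month, '+2 months')
            GROUP BY s.country, s.city, s.month
        )
        SELECT country, city, month, window_sum,
               LAG(window_sum, 12) OVER (
                   PARTITION BY country, city ORDER BY month
               ) AS last_year_sum,
               window_sum * 1.0 / LAG(window_sum, 12) OVER (
                   PARTITION BY country, city ORDER BY month
               ) - 1 AS yoy_increase
        FROM rolling
        ORDER BY country, city, month
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in rolling_yoy():
        print(row)
```

LAG(..., 12) counts rows, not calendar months, so it only lines up with last year's window when every month is present for the country/city pair, as the note above says. Windows near the end of the data are also partial because the following months do not exist yet.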

Sum only for Employee ID's present in latest snapshot

I have a database with a row per month for each employee working in our company. So, if employee A has been working for our company from July 2016 till now, this person has approx. 24 rows (one row for each month she was in service).
I'm trying to summarize the experience each of the current employees have in a particular function. So, if employee A has worked 6 months in Sales and 18 months in Marketing, then I count the number of rows this employee has Sales or Marketing in the column indicating the function.
I have written a query that seems to count the functional experience per employee, but it double-counts rows. It does not take the latest snapshot as its starting point.
SELECT A.EMPLOYEE_ID,
SUM(CASE WHEN A.FUNCTION_CODE ='CUS' THEN 1 ELSE 0 END) AS EXP_CUS,
SUM(CASE WHEN A.FUNCTION_CODE ='MKT' THEN 1 ELSE 0 END) AS EXP_MKT
FROM [dbname].[AGL_V_HRA_FE_R].[VW_HRA_EMPLOYEE_DETAIL] AS A INNER JOIN [dbname].[AGL_V_HRA_FE_R].[VW_HRA_EMPLOYEE_DETAIL] AS B ON A.EMPLOYEE_ID = B.EMPLOYEE_ID
WHERE B.WORKLEVEL_CODE > '1'
GROUP BY A.EMPLOYEE_ID
I expected the output for employee A to be EXP_CUS = 6 and EXP_MKT = 18. Instead, the output for both is much higher as it is double counting rows. When I add the line AND B.SNAPSHOT_DATE = '2019-06-30', the output is correct. I don't like to manually adjust the code every month and rather refer to the latest snapshot date.
ADDED
The original table looks like this
SNAPSHOT_DATE | EMPLOYEE_ID | FUNCTION_CODE
2019-06-30 | 000000001 | CUS
2019-06-30 | 000000002 | MKT
2019-05-31 | 000000001 | CUS
2019-05-31 | 000000002 | MKT
2019-04-30 | 000000001 | MKT
2019-04-30 | 000000002 | MKT
The desired output would be
EMPLOYEE_ID | EXP_CUS | EXP_MKT
000000001 | 2 | 1
000000002 | 0 | 3
You can use PIVOT to get your desired result, as below:
SELECT EMPLOYEE_ID,
ISNULL([CUS],0) AS [EXP_CUS],
ISNULL([MKT],0) AS [EXP_MKT]
FROM
(
SELECT EMPLOYEE_ID,FUNCTION_CODE,COUNT(SNAPSHOT_DATE) T
FROM your_table
GROUP BY EMPLOYEE_ID,FUNCTION_CODE
)P
PIVOT(
SUM(T)
FOR FUNCTION_CODE IN ([CUS],[MKT])
)PVT
The output is:
EMPLOYEE_ID EXP_CUS EXP_MKT
000000001 2 1
000000002 0 3
I don't understand why you are using a self join. This seems to do what you want:
SELECT ED.EMPLOYEE_ID,
SUM(CASE WHEN ED.FUNCTION_CODE ='CUS' THEN 1 ELSE 0 END) AS EXP_CUS,
SUM(CASE WHEN ED.FUNCTION_CODE ='MKT' THEN 1 ELSE 0 END) AS EXP_MKT
FROM [dbname].[AGL_V_HRA_FE_R].[VW_HRA_EMPLOYEE_DETAIL] ed
WHERE ED.WORKLEVEL_CODE > '1'
GROUP BY ED.EMPLOYEE_ID;
If you only want employees with the most recent snapshot date, then you can use window functions:
SELECT ED.EMPLOYEE_ID,
       SUM(CASE WHEN ED.FUNCTION_CODE = 'CUS' THEN 1 ELSE 0 END) AS EXP_CUS,
       SUM(CASE WHEN ED.FUNCTION_CODE = 'MKT' THEN 1 ELSE 0 END) AS EXP_MKT
FROM (SELECT ED.*,
             MAX(SNAPSHOT_DATE) OVER () AS OVERALL_MAX_SNAPSHOT_DATE,
             MAX(SNAPSHOT_DATE) OVER (PARTITION BY EMPLOYEE_ID) AS EMPLOYEE_MAX_SNAPSHOT_DATE
      FROM [dbname].[AGL_V_HRA_FE_R].[VW_HRA_EMPLOYEE_DETAIL] ED
     ) ED
WHERE ED.WORKLEVEL_CODE > '1' AND
      -- keep only employees whose latest row is in the latest snapshot overall
      EMPLOYEE_MAX_SNAPSHOT_DATE = OVERALL_MAX_SNAPSHOT_DATE
GROUP BY ED.EMPLOYEE_ID;
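The "latest snapshot" filter can be tried out with a sketch like this one, which uses Python's sqlite3 module (SQLite 3.25+) in place of SQL Server. The table name employee_detail and the helper experience_of_current_employees are assumptions, the WORKLEVEL_CODE filter is left out because the sample table does not show that column, and employee 000000003 is a made-up leaver added so that the filter has something to exclude.

```python
import sqlite3

# Rows from the question, plus one hypothetical employee who left after April.
ROWS = [
    ("2019-06-30", "000000001", "CUS"),
    ("2019-06-30", "000000002", "MKT"),
    ("2019-05-31", "000000001", "CUS"),
    ("2019-05-31", "000000002", "MKT"),
    ("2019-04-30", "000000001", "MKT"),
    ("2019-04-30", "000000002", "MKT"),
    ("2019-04-30", "000000003", "CUS"),  # assumption: not in the latest snapshot
]


def experience_of_current_employees(rows):
    """Count months per function, but only for employees whose most recent
    snapshot equals the most recent snapshot in the whole table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employee_detail "
        "(snapshot_date TEXT, employee_id TEXT, function_code TEXT)"
    )
    conn.executemany("INSERT INTO employee_detail VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT employee_id,
               SUM(CASE WHEN function_code = 'CUS' THEN 1 ELSE 0 END) AS exp_cus,
               SUM(CASE WHEN function_code = 'MKT' THEN 1 ELSE 0 END) AS exp_mkt
        FROM (
            SELECT ed.*,
                   MAX(snapshot_date) OVER () AS overall_max,
                   MAX(snapshot_date) OVER (PARTITION BY employee_id) AS employee_max
            FROM employee_detail ed
        ) AS flagged
        WHERE employee_max = overall_max
        GROUP BY employee_id
        ORDER BY employee_id
        """
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in experience_of_current_employees(ROWS):
        print(row)
```

Because the latest date is computed inside the query, nothing has to be edited by hand when a new monthly snapshot arrives.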

simple sql over (partition by) not working as expected

Feels like it should be simple but my mind has gone blank so would appreciate any help!
Let's say I have this dataset
Date sale_id salesperson Missed_payment_this_month
01/01/2016 1001 John 1
01/01/2016 1002 Bob 0
01/01/2016 1003 Bob 0
01/01/2016 1004 John N/A
01/02/2016 1001 John 1
01/02/2016 1002 Bob 1
01/02/2016 1003 Bob 0
01/02/2016 1004 John 1
01/03/2016 1001 John 1
01/03/2016 1002 Bob 0
01/03/2016 1003 Bob 0
01/03/2016 1004 John 1
And want to add these two columns to the end. They look at the number of missed payments previously, by sales_id and salesperson.
Previous_missed_payment_by_sale_id Previous_missed_payment_by_sales person
0 0
0 0
0 0
0 0
1 1
0 0
0 0
0 1
2 3
1 1
0 1
1 3
sales_id is ok but getting it over sales persons is giving me an error (group by) or adding in extra columns. I need to keep the rows constant.
My best guess that returns extra columns:
select t1.Date, t1.sale_id, t1.salesperson
,sum(case when t2.Missed_payment_this_month = '1' then 1 else 0 end) previous_missed_sales_id
,sum(case when t2.Missed_payment_this_month = '1' then 1 else 0 end) OVER (PARTITION by t1.salesperson) previous_missed_salesperson
from [dbo].[simple_join_table2] t1
inner join [dbo].[simple_join_table2] t2 on
(t2.[Date] < t1.[Date] AND t1.[sale_id] = t2.[sale_id])
group by t1.Date, t1.sale_id, t1.salesperson
,case when t2.Missed_payment_this_month = '1' then 1 else 0 end
this is the output:
Date sale_id salesperson previous_missed_sales_id previous_missed_salesperson
01/02/2016 1002 Bob 0 1
01/02/2016 1003 Bob 0 1
01/03/2016 1002 Bob 0 1
01/03/2016 1002 Bob 1 1
01/03/2016 1003 Bob 0 1
01/02/2016 1001 John 1 3
01/02/2016 1004 John 0 3
01/03/2016 1001 John 2 3
01/03/2016 1004 John 0 3
01/03/2016 1004 John 1 3
Is this possible without another subquery? I guess another way to put it is that I'm trying to mimic the SUMX and EARLIER functions of Power Pivot.
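If you are on SQL Server 2012+, use windowed aggregates. Previous = sum of everything up to and including the current date - sum for the current date. When ORDER BY is specified, SQL Server's default window frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so all rows that share the current date are included in the running sum.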
with [simple_join_table2] as(
-- sample data
select cast(valuesDate as Date) valuesDate, sale_id, salesperson, Missed_payment_this_month
from (
values
('20160101',1001,'John', 1)
,('20160101',1002,'Bob ', 0)
,('20160101',1003,'Bob ', 0)
,('20160101',1004,'John',null)
,('20160201',1001,'John', 1)
,('20160201',1002,'Bob ', 1)
,('20160201',1003,'Bob ', 0)
,('20160201',1004,'John', 1)
,('20160301',1001,'John', 1)
,('20160301',1002,'Bob ', 0)
,('20160301',1003,'Bob ', 0)
,('20160301',1004,'John', 1)
) t(valuesDate, sale_id, salesperson, Missed_payment_this_month)
)
select valuesDate,sale_id, salesperson, Missed_payment_this_month,
byidprevmonth = sum(Missed_payment_this_month ) over(partition by sale_id order by valuesDate)
- sum(Missed_payment_this_month) over(partition by valuesDate, sale_id),
bypersonprevmonth = sum(Missed_payment_this_month) over(partition by salesperson order by valuesDate)
- sum(Missed_payment_this_month) over(partition by valuesDate, salesperson)
from [simple_join_table2]
order by salesperson, valuesDate
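The "running sum minus current period" trick can be checked against the expected columns in the question with the sketch below. It runs in SQLite (3.25+) through Python's sqlite3 module, where the default frame with ORDER BY is also RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. The table name payments, the column names dt and missed, the helper previous_missed, and the COALESCE that treats N/A as 0 are assumptions made for illustration.

```python
import sqlite3

# Sample rows from the question; N/A is stored as NULL (None).
ROWS = [
    ("2016-01-01", 1001, "John", 1),
    ("2016-01-01", 1002, "Bob", 0),
    ("2016-01-01", 1003, "Bob", 0),
    ("2016-01-01", 1004, "John", None),
    ("2016-02-01", 1001, "John", 1),
    ("2016-02-01", 1002, "Bob", 1),
    ("2016-02-01", 1003, "Bob", 0),
    ("2016-02-01", 1004, "John", 1),
    ("2016-03-01", 1001, "John", 1),
    ("2016-03-01", 1002, "Bob", 0),
    ("2016-03-01", 1003, "Bob", 0),
    ("2016-03-01", 1004, "John", 1),
]


def previous_missed(rows):
    """For each row, count missed payments in earlier periods, once per
    sale_id and once per salesperson, without changing the row count."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments "
        "(dt TEXT, sale_id INTEGER, salesperson TEXT, missed INTEGER)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT dt, sale_id, salesperson,
               -- running total up to and including this date, minus this date
               SUM(COALESCE(missed, 0)) OVER (PARTITION BY sale_id ORDER BY dt)
                 - SUM(COALESCE(missed, 0)) OVER (PARTITION BY dt, sale_id)
                 AS prev_by_sale,
               SUM(COALESCE(missed, 0)) OVER (PARTITION BY salesperson ORDER BY dt)
                 - SUM(COALESCE(missed, 0)) OVER (PARTITION BY dt, salesperson)
                 AS prev_by_person
        FROM payments
        ORDER BY dt, sale_id
        """
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in previous_missed(ROWS):
        print(row)
```

The subtraction only works because the RANGE frame pulls in every row with the same date. With a ROWS frame, the running sum for two sales by the same person on the same date would differ from row to row.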

PostgreSQL GROUP BY: SELECT column on MAX of another WHERE a third column = x

Let's suppose we have two tables in PostgreSQL:
Table "citizens"
country_ref citizen_name entry_date
-----------------------------------------------------
0 peter 2013-01-14 21:00:00.000
1 fernando 2013-01-14 20:00:00.000
0 robert 2013-01-14 19:00:00.000
3 albert 2013-01-14 18:00:00.000
2 esther 2013-01-14 17:00:00.000
1 juan 2013-01-14 16:00:00.000
3 egbert 2013-01-14 15:00:00.000
1 francisco 2013-01-14 14:00:00.000
3 adolph 2013-01-14 13:00:00.000
2 emilie 2013-01-14 12:00:00.000
2 jacques 2013-01-14 11:00:00.000
0 david 2013-01-14 10:00:00.000
Table "countries"
country_id country_name country_group
-------------------------------------------
0 england 0
1 spain 0
2 france 1
3 germany 1
Now I want to obtain the last entered citizen on the "citizens" table for each country of a given country_group.
My best try so far is this query (Let's call it Query_1) :
SELECT country_ref, MAX(entry_date) FROM citizens
LEFT JOIN countries ON country_id = country_ref
WHERE country_group = 1 GROUP BY country_ref
Output:
country_ref max
---------------------------------
3 2013-01-14 18:00:00
2 2013-01-14 17:00:00
So then I could do:
SELECT citizen_name FROM citizens WHERE (country_ref, entry_date) IN (Query_1)
... which will give me the output I'm looking for: albert and esther.
But I'd prefer to achieve this in a single query. I wonder if it's possible?
This should be simplest and fastest:
SELECT DISTINCT ON (i.country_ref)
i.citizen_name
FROM citizens i
JOIN countries o ON o.country_id = i.country_ref
WHERE o.country_group = 1
ORDER BY i.country_ref, i.entry_date DESC
You can easily return more columns from both tables by simply adding them to the SELECT list.
Details, links and explanation in this related answer:
Select first row in each GROUP BY group?
SELECT citizen_name,
country_ref,
entry_date
from (
SELECT cit.citizen_name,
cit.country_ref,
MAX(cit.entry_date) over (partition by cit.country_ref) as max_date,
cit.entry_date
FROM citizens cit
LEFT JOIN countries cou ON cou.country_id = cit.country_ref
WHERE cou.country_group = 1
) t
where max_date = entry_date
SQLFiddle demo: http://www.sqlfiddle.com/#!12/50776/1
Why don't you simply:
SELECT citizen_name FROM citizens WHERE (country_ref, entry_date) IN (
SELECT country_ref, MAX(entry_date) FROM citizens
LEFT JOIN countries ON country_id = country_ref
WHERE country_group = 1 GROUP BY country_ref
)
It might not be the best plan, but it depends on many factors, and it is simple to write.
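For databases without DISTINCT ON, the same "latest row per group" result can be sketched with ROW_NUMBER(). The example below is a swapped-in variant, not one of the answers above: it runs in SQLite (3.25+) through Python's sqlite3 module, and the helper name latest_citizens is an assumption.

```python
import sqlite3

# Sample data copied from the question.
CITIZENS = [
    (0, "peter", "2013-01-14 21:00:00.000"),
    (1, "fernando", "2013-01-14 20:00:00.000"),
    (0, "robert", "2013-01-14 19:00:00.000"),
    (3, "albert", "2013-01-14 18:00:00.000"),
    (2, "esther", "2013-01-14 17:00:00.000"),
    (1, "juan", "2013-01-14 16:00:00.000"),
    (3, "egbert", "2013-01-14 15:00:00.000"),
    (1, "francisco", "2013-01-14 14:00:00.000"),
    (3, "adolph", "2013-01-14 13:00:00.000"),
    (2, "emilie", "2013-01-14 12:00:00.000"),
    (2, "jacques", "2013-01-14 11:00:00.000"),
    (0, "david", "2013-01-14 10:00:00.000"),
]
COUNTRIES = [
    (0, "england", 0),
    (1, "spain", 0),
    (2, "france", 1),
    (3, "germany", 1),
]


def latest_citizens(country_group):
    """Return the most recently entered citizen for each country in the
    given country_group, ordered by country_ref."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE citizens (country_ref INTEGER, citizen_name TEXT, entry_date TEXT)"
    )
    conn.execute(
        "CREATE TABLE countries (country_id INTEGER, country_name TEXT, country_group INTEGER)"
    )
    conn.executemany("INSERT INTO citizens VALUES (?, ?, ?)", CITIZENS)
    conn.executemany("INSERT INTO countries VALUES (?, ?, ?)", COUNTRIES)
    names = [
        row[0]
        for row in conn.execute(
            """
            SELECT citizen_name
            FROM (
                SELECT c.citizen_name, c.country_ref,
                       -- rank rows per country, newest entry first
                       ROW_NUMBER() OVER (
                           PARTITION BY c.country_ref ORDER BY c.entry_date DESC
                       ) AS rn
                FROM citizens c
                JOIN countries o ON o.country_id = c.country_ref
                WHERE o.country_group = ?
            ) AS ranked
            WHERE rn = 1
            ORDER BY country_ref
            """,
            (country_group,),
        )
    ]
    conn.close()
    return names


if __name__ == "__main__":
    print(latest_citizens(1))
```

Unlike the MAX() OVER variant, ROW_NUMBER() returns exactly one row per country even if two citizens share the same entry_date, which may or may not be what is wanted.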