SQL/HQL - Partition By

Trying to understand Partition By and getting super confused, I have the following data:
Name ID Product Date Amount
Jason 1 Car Jan 2017 $10
Jason 1 Car Feb 2017 $5
Jason 2 Car Jan 2017 $50
Jason 2 Car Feb 2017 $60
Jason 3 House Jan 2017 $20
Jason 3 House Feb 2017 $30
Would doing:
Select Name, ID, Product, Date, Amount,
LAG(Amount, 1) OVER (PARTITION BY Name ORDER BY Date)
FROM table
give me Jason's correct previous month amount for the appropriate Product and ID number? So, for example at Feb 2017: Jason, ID 1 and Product Car's should give me the amount $5.
OR
Would I need to modify the Partition by to include the Product and ID, such as:
Select Name, ID, Product, Date, Amount,
LAG(Amount, 1) OVER (PARTITION BY Name, ID, Product ORDER BY Date)
FROM table
Thanks!

I also came here in search of some understanding of the "partition by" clause, but to answer your question: LAG returns the previous row's value within each partition. If you partition only by Name, all of Jason's rows fall into a single partition, so the "previous" row for the Feb 2017, ID 1, Car entry could just as easily come from ID 2 or from the House rows. To get the previous month's amount for the same ID and Product, you do need the second form, with Name, ID and Product all in the Partition By clause.
Essentially, you would have your existing 5 columns, plus one new column containing the previous row's value of "amount" from the same partition.
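The difference is easy to check with a small in-memory table. A sketch using Python's sqlite3 (which supports window functions); the table name and values mirror the question, with sortable `YYYY-MM` date strings assumed in place of "Jan 2017":

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (Name TEXT, ID INT, Product TEXT, Date TEXT, Amount INT)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", [
    ("Jason", 1, "Car",   "2017-01", 10),
    ("Jason", 1, "Car",   "2017-02", 5),
    ("Jason", 2, "Car",   "2017-01", 50),
    ("Jason", 2, "Car",   "2017-02", 60),
    ("Jason", 3, "House", "2017-01", 20),
    ("Jason", 3, "House", "2017-02", 30),
])

# Partitioning by Name, ID and Product keeps each series separate,
# so LAG returns the previous month of the SAME ID and Product.
rows = con.execute("""
    SELECT Name, ID, Product, Date, Amount,
           LAG(Amount, 1) OVER (PARTITION BY Name, ID, Product ORDER BY Date) AS prev_amount
    FROM sales
    ORDER BY ID, Date
""").fetchall()
for r in rows:
    print(r)
```

Each partition's first row gets NULL for prev_amount, and the Feb 2017 row for ID 1, Car correctly picks up the $10 from January of the same series.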

Related

Creating an indicator based on the same ID in different years in PROC SQL/SQL

I'd like to create an indicator based on the ID and product type. My data:
Year ID Purchase_Category
2020 1 Kitchen
2020 2 Home
2020 2 Kitchen
2020 3 Home
2021 1 Home
2021 2 Kitchen
2021 3 Kitchen
If someone with the same ID purchased Kitchen in 2020 and then Home in 2021 or vice versa, then they are deemed holistic. ID 2 in this case is not holistic because Home and Kitchen were purchased in the same year.
The output should look like this:
ID Indicator
1 Holistic
2 Not Holistic
3 Holistic
Something like this might work:
SELECT ID,
       CASE COUNT(*) WHEN 1 THEN 'Not Holistic' ELSE 'Holistic' END AS Indicator
FROM (SELECT ID, "Year" FROM DATA GROUP BY ID, "Year") t
GROUP BY ID
First, determine the distinct years per ID; then, from that set, if an ID appears only once, everything was purchased in the same year; otherwise, there were products purchased in different years.
You just need a distinct count on the Year column per ID. No need for two steps.
select ID,
case when count(distinct "Year") > 1
then 'Holistic' else 'Not Holistic' end as Indicator
from T
group by ID
It would be just as easy to say:
case when max("Year") > min("Year") then ...
I don't know which one seems more natural. If you have a lot of data the second approach is potentially faster.
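One caveat: a distinct-year count alone labels ID 2 'Holistic' as well, since its purchases span 2020 and 2021; reproducing the expected output also requires that no year contain more than one category. A sketch using Python's sqlite3, assuming one row per ID/year/category as in the sample data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (Year INT, ID INT, Purchase_Category TEXT)")
con.executemany("INSERT INTO data VALUES (?, ?, ?)", [
    (2020, 1, "Kitchen"), (2020, 2, "Home"), (2020, 2, "Kitchen"),
    (2020, 3, "Home"),    (2021, 1, "Home"), (2021, 2, "Kitchen"),
    (2021, 3, "Kitchen"),
])

# Holistic: purchases span more than one year AND no year holds more than
# one category (with one row per ID/year/category, the second condition
# is COUNT(*) = COUNT(DISTINCT Year)).
rows = con.execute("""
    SELECT ID,
           CASE WHEN COUNT(DISTINCT Year) > 1
                 AND COUNT(*) = COUNT(DISTINCT Year)
                THEN 'Holistic' ELSE 'Not Holistic' END AS Indicator
    FROM data
    GROUP BY ID
    ORDER BY ID
""").fetchall()
print(rows)  # [(1, 'Holistic'), (2, 'Not Holistic'), (3, 'Holistic')]
```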

SQL - Monthly cumulative count of new customer based on a created date field

Thanks in advance.
I have Customer records that look like this:
Customer_Number
Create_Date
34343
01/22/2001
54554
03/03/2020
85296
01/01/2001
...
I have about a thousand of these records (customer number is unique) and the bossman wants to see how the number of customers has grown over time.
The output I need:
Customer_Count
Monthly_Bucket
7
01/01/2021
9
02/01/2021
13
03/01/2021
20
04/01/2021
The customer count is cumulative, and the Monthly Bucket will just feed the graphing package to make a nice bar chart answering the question "how many customers do we have in total in a particular month and how is it growing over time?"
Try the following SELECT with a correlated sub-query (note the sub-query must be restricted to the same year as the outer row, otherwise months from other years get counted in):
SELECT Customer_Count =
       (SELECT COUNT(s.[Create_Date])
        FROM [Customer_Sales] s
        WHERE YEAR(s.[Create_Date]) = YEAR(t.[Create_Date])
          AND MONTH(s.[Create_Date]) <= MONTH(t.[Create_Date]))
     , Monthly_Bucket = MONTH(t.[Create_Date])
FROM [Customer_Sales] t
WHERE YEAR(t.[Create_Date]) = ????
GROUP BY MONTH(t.[Create_Date])
Where [Customer_Sales] is the sales table and ???? = your year
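On a database with window functions, the running total can also be taken directly over monthly counts, which avoids the correlated sub-query and is not limited to a single year. A sketch using Python's sqlite3 (the table name customers and ISO date strings are assumptions):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (Customer_Number INT, Create_Date TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?)", [
    (34343, "2001-01-22"),
    (54554, "2020-03-03"),
    (85296, "2001-01-01"),
])

# Count new customers per month, then take a running total over the months.
rows = con.execute("""
    SELECT monthly_bucket,
           SUM(cnt) OVER (ORDER BY monthly_bucket) AS customer_count
    FROM (SELECT strftime('%Y-%m', Create_Date) AS monthly_bucket,
                 COUNT(*) AS cnt
          FROM customers
          GROUP BY monthly_bucket)
    ORDER BY monthly_bucket
""").fetchall()
print(rows)  # [('2001-01', 2), ('2020-03', 3)]
```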

Finding, grouping and deleting the duplicate records in MS SQL and keeping the oldest

I have a table that holds the employee's bank account records. There are duplicate records of the last record created by a user, which were created due to daily job execution.
For example: for Employee E1 below are the bank account records in DB
Jan 1 2015 - Bank X
Jan 1 2018 - Bank Y
Jan 1 2020 - Bank X
There are multiple duplicate records for Bank X and Bank Y, created by the daily job after each manual change on Jan 1 of 2015, 2018 and 2020.
Now, I have to delete the records that are duplicates in terms of the values of the columns "BankName" and "BankAccountNumber".
In this scenario, the records that should remain in the system are the ones updated on Jan 1 of 2015, 2018 and 2020, even though the bank name and account number are the same for the two Bank X entries.
The columns we are considering in the table for preparing script are:
1. Recordid(Uniqueidentifier) primary key
2. Recordsequence(INT) Identity column by 1
3. EmployeeID (INT): the set of records is linked to an employee through the employee table.
My current Logic is to find the duplicate and delete the records:
;WITH BARecords AS (
    SELECT recordid,
           ROW_NUMBER() OVER (
               PARTITION BY employeeID, BankName, AccountNumber
               ORDER BY recordsequence
           ) AS row_num
    FROM employeebankaccount WITH (NOLOCK)
    WHERE employeeid IN (SELECT Id FROM #EMPLOYEEIDs)
)
DELETE FROM BARecords
WHERE row_num > 1
My current logic removes the Jan 1 2020 Bank X record as well and keeps only the one from Jan 1 2015.
Since the user selected Bank X again on Jan 1 2020, that record should also remain in the system.
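Since only consecutive repeats of the same bank details are job-created duplicates, comparing each row with its predecessor (by recordsequence) keeps every manual change, including the return to Bank X in 2020. A sketch using Python's sqlite3 with simplified columns; the real table and the T-SQL syntax will differ:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE employeebankaccount
               (recordid INT, recordsequence INT, employeeID INT,
                BankName TEXT, AccountNumber TEXT)""")
con.executemany("INSERT INTO employeebankaccount VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 100, "Bank X", "111"),  # Jan 1 2015: manual change
    (2, 2, 100, "Bank X", "111"),  # daily-job duplicate
    (3, 3, 100, "Bank Y", "222"),  # Jan 1 2018: manual change
    (4, 4, 100, "Bank X", "111"),  # Jan 1 2020: manual change back to X
    (5, 5, 100, "Bank X", "111"),  # daily-job duplicate
])

# A row is a duplicate only if the PREVIOUS row (by sequence) for the
# same employee already holds the same bank name and account number.
con.execute("""
    DELETE FROM employeebankaccount
    WHERE recordid IN (
        SELECT recordid
        FROM (SELECT recordid, BankName, AccountNumber,
                     LAG(BankName)      OVER (PARTITION BY employeeID
                                              ORDER BY recordsequence) AS prev_bank,
                     LAG(AccountNumber) OVER (PARTITION BY employeeID
                                              ORDER BY recordsequence) AS prev_acct
              FROM employeebankaccount)
        WHERE BankName = prev_bank AND AccountNumber = prev_acct
    )
""")
rows = con.execute("""SELECT recordid, BankName FROM employeebankaccount
                      ORDER BY recordsequence""").fetchall()
print(rows)  # [(1, 'Bank X'), (3, 'Bank Y'), (4, 'Bank X')]
```

All three manual changes survive, including the second Bank X record.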

Finding a lagged cumulative total from a table containing both cumulative and delta entries

I have a SQL table with a schema where a value is either a cumulative value for a particular category, or a delta on top of the previous value. While I appreciate this is not a particularly great design, it comes from an external source and thus I can't change it in any way.
The table looks something like the following:
Date Category AmountSoldType AmountSold
-----------------------------------------------------
Jan 1 Apples Cumulative 100
Jan 1 Bananas Cumulative 50
Jan 2 Apples Delta 20
Jan 2 Bananas Delta 10
Jan 3 Apples Delta 25
Jan 3 Bananas Cumulative 75
For this example, I want to produce the total cumulative number of fruits sold by item at the beginning of each day:
Date Category AmountSold
--------------------------------
Jan 1 Apples 0
Jan 1 Bananas 0
Jan 2 Apples 100
Jan 2 Bananas 50
Jan 3 Apples 170
Jan 3 Bananas 60
Jan 4 Apples 195
Jan 4 Bananas 75
Intuitively, I want to take the most recent cumulative total, and add any deltas that have appeared since that entry.
I imagine something akin to
SELECT Date, Category
LEAD((subquery??), 1) OVER (PARTITION BY Category ORDER BY Date) AS Amt
FROM Fruits
GROUP BY Date, Category
ORDER BY Date ASC
is what I want, but I'm having trouble putting the right subquery together. Any suggestions?
You seem to want to add the deltas to the most recent cumulative -- all before the current date.
If so, I think this logic does what you want:
select f.*,
       (max(case when date = date_cumulative then amountsold end) over (
            partition by category) +
        sum(case when date > date_cumulative then amountsold else 0 end) over (
            partition by category
            order by date
            rows between unbounded preceding and 1 preceding)
       ) as amt
from (select f.*,
             max(case when AmountSoldType = 'Cumulative' then date end) over (
                 partition by category
                 order by date
                 rows between unbounded preceding and current row
             ) as date_cumulative
      from fruits f
     ) f
I'm a bit confused by this data set (notwithstanding the mistake in adding up the apples). I assume the raw data states end-of-day figures, so for example 20 apples were sold on Jan 2 (because there is a delta of 20 reported for that day).
In your example results, it does not appear valid to say that zero apples were sold on Jan 1. It isn't actually possible to say how many were sold on that day, because it is not clear whether the 100 cumulative apples were accrued during Jan 1 (and thus should be excluded from the start-of-day figure you seek) or whether they were accrued on previous days (and should be included), or some mix of the two. That day's data should thus be null.
It is also not clear whether all data sets must begin with a cumulative, or whether data sets can begin with a delta (which might require working backwards from a subsequent cumulative), and whether you potentially have access to multiple data sets from your external source which form a continuous consistent sequence, or whether "cumulatives" relate purely to a single data set received. I'm going to assume at least that all data sets begin with a cumulative.
All that said, this problem is a simple case of firstly converting all rows into either all deltas, or all cumulatives. Assuming we go for all cumulatives, then recursing through each row in order, it is a case of either selecting the AmountSold as-is (if the row is a cumulative), or adding the AmountSold to the result of the previous step (if it is a delta).
Once pre-processed like this, then for a start-of-day cumulative, it is all just a question of looking at the previous day's cumulative (which was an end-of-day cumulative, if my initial assumption was correct that all raw data relates to end-of-day figures).
Using the LAG function in this final step to get the previous day's cumulative, will also neatly produce a null for the first row.
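The two-step approach above can be sketched as follows, using Python's sqlite3 with integer dates 1-3 standing in for Jan 1-3: each row is first converted to an end-of-day cumulative (most recent cumulative plus deltas since), then LAG produces the start-of-day figure, with NULL on each category's first day:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Fruits (Date INT, Category TEXT, AmountSoldType TEXT, AmountSold INT)")
con.executemany("INSERT INTO Fruits VALUES (?, ?, ?, ?)", [
    (1, "Apples",  "Cumulative", 100), (1, "Bananas", "Cumulative", 50),
    (2, "Apples",  "Delta",      20),  (2, "Bananas", "Delta",      10),
    (3, "Apples",  "Delta",      25),  (3, "Bananas", "Cumulative", 75),
])

rows = con.execute("""
    WITH marked AS (
        -- date of the most recent 'Cumulative' row at or before each row
        SELECT Date, Category,
               MAX(CASE WHEN AmountSoldType = 'Cumulative' THEN Date END)
                   OVER (PARTITION BY Category ORDER BY Date) AS cum_date
        FROM Fruits
    ),
    eod AS (
        -- end-of-day total: latest cumulative plus deltas since then
        SELECT m.Date, m.Category,
               (SELECT f.AmountSold FROM Fruits f
                WHERE f.Category = m.Category AND f.Date = m.cum_date
                  AND f.AmountSoldType = 'Cumulative')
             + COALESCE((SELECT SUM(f.AmountSold) FROM Fruits f
                WHERE f.Category = m.Category AND f.AmountSoldType = 'Delta'
                  AND f.Date > m.cum_date AND f.Date <= m.Date), 0) AS total
        FROM marked m
    )
    SELECT Date, Category,
           LAG(total) OVER (PARTITION BY Category ORDER BY Date) AS start_of_day
    FROM eod
    ORDER BY Date, Category
""").fetchall()
for r in rows:
    print(r)
```

Note that the day-3 Apples figure comes out as 120 (100 + 20), consistent with the arithmetic rather than with the 170 in the question's example table.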

Calculating Cumulative Sum in PostgreSQL

I want to find the cumulative or running amount of field and insert it from staging to table. My staging structure is something like this:
ea_month id amount ea_year circle_id
April 92570 1000 2014 1
April 92571 3000 2014 2
April 92572 2000 2014 3
March 92573 3000 2014 1
March 92574 2500 2014 2
March 92575 3750 2014 3
February 92576 2000 2014 1
February 92577 2500 2014 2
February 92578 1450 2014 3
I want my target table to look something like this:
ea_month id amount ea_year circle_id cum_amt
February 92576 2000 2014 1 2000
March 92573 3000 2014 1 5000
April 92570 1000 2014 1 6000
February 92577 2500 2014 2 2500
March 92574 2500 2014 2 5000
April 92571 3000 2014 2 8000
February 92578 1450 2014 3 1450
March 92575 3750 2014 3 5200
April 92572 2000 2014 3 7200
I am really very much confused with how to go about achieving this result. I want to achieve this result using PostgreSQL.
Can anyone suggest how to go about achieving this result-set?
Basically, you need a window function. That's a standard feature nowadays. In addition to genuine window functions, you can use any aggregate function as window function in Postgres by appending an OVER clause.
The special difficulty here is to get partitions and sort order right:
SELECT ea_month, id, amount, ea_year, circle_id
, sum(amount) OVER (PARTITION BY circle_id
ORDER BY ea_year, ea_month) AS cum_amt
FROM tbl
ORDER BY circle_id, ea_year, ea_month;
And no GROUP BY.
The sum for each row is calculated from the first row in the partition to the current row - or quoting the manual to be precise:
The default framing option is RANGE UNBOUNDED PRECEDING, which is
the same as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With
ORDER BY, this sets the frame to be all rows from the partition
start up through the current row's last ORDER BY peer.
Bold emphasis mine.
This is the cumulative (or "running") sum you are after.
In default RANGE mode, rows with the same rank in the sort order are "peers" - same (circle_id, ea_year, ea_month) in this query. All of those show the same running sum with all peers added to the sum. But I assume your table is UNIQUE on (circle_id, ea_year, ea_month), then the sort order is deterministic and no row has peers. (And you might as well use the cheaper ROWS mode.)
Postgres 11 added tools to include / exclude peers with the new frame_exclusion options. See:
Aggregating all values not in the same group
Now, ORDER BY ... ea_month won't work with strings for month names. Postgres would sort alphabetically according to the locale setting.
If you have actual date values stored in your table you can sort properly. If not, I suggest to replace ea_year and ea_month with a single column the_date of type date in your table.
Transform what you have with to_date():
to_date(ea_year || ea_month , 'YYYYMonth') AS the_date
For display, you can get original strings with to_char():
to_char(the_date, 'Month') AS ea_month
to_char(the_date, 'YYYY') AS ea_year
While stuck with the unfortunate design, this will work:
SELECT ea_month, id, amount, ea_year, circle_id
, sum(amount) OVER (PARTITION BY circle_id ORDER BY the_date) AS cum_amt
FROM (SELECT *, to_date(ea_year || ea_month, 'YYYYMonth') AS the_date FROM tbl) sub
ORDER BY circle_id, the_date;
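The core of the answer, a running SUM(...) OVER (PARTITION BY ... ORDER BY ...), behaves the same in any database with window functions. A sketch using Python's sqlite3, with a proper date column (the_date) assumed in place of the year/month strings, covering circles 1 and 2 of the sample:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl (id INT, amount INT, circle_id INT, the_date TEXT)")
con.executemany("INSERT INTO tbl VALUES (?, ?, ?, ?)", [
    (92576, 2000, 1, "2014-02-01"), (92573, 3000, 1, "2014-03-01"),
    (92570, 1000, 1, "2014-04-01"), (92577, 2500, 2, "2014-02-01"),
    (92574, 2500, 2, "2014-03-01"), (92571, 3000, 2, "2014-04-01"),
])

# The running sum restarts for each circle_id and accumulates in date order.
rows = con.execute("""
    SELECT id, circle_id, amount,
           SUM(amount) OVER (PARTITION BY circle_id ORDER BY the_date) AS cum_amt
    FROM tbl
    ORDER BY circle_id, the_date
""").fetchall()
print(rows)
```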