PostgreSQL sum not working as expected when it is clear - sql

I am solving the following hard LeetCode SQL question.
Link to question: https://leetcode.com/problems/trips-and-users/
(You can look at the solution there directly to understand the problem.)
Question:
Trips table:
+----+-----------+-----------+---------+---------------------+------------+
| id | client_id | driver_id | city_id | status              | request_at |
+----+-----------+-----------+---------+---------------------+------------+
| 1  | 1         | 10        | 1       | completed           | 2013-10-01 |
| 2  | 2         | 11        | 1       | cancelled_by_driver | 2013-10-01 |
| 3  | 3         | 12        | 6       | completed           | 2013-10-01 |
| 4  | 4         | 13        | 6       | cancelled_by_client | 2013-10-01 |
| 5  | 1         | 10        | 1       | completed           | 2013-10-02 |
| 6  | 2         | 11        | 6       | completed           | 2013-10-02 |
| 7  | 3         | 12        | 6       | completed           | 2013-10-02 |
| 8  | 2         | 12        | 12      | completed           | 2013-10-03 |
| 9  | 3         | 10        | 12      | completed           | 2013-10-03 |
| 10 | 4         | 13        | 12      | cancelled_by_driver | 2013-10-03 |
+----+-----------+-----------+---------+---------------------+------------+
Users table:
+----------+--------+--------+
| users_id | banned | role   |
+----------+--------+--------+
| 1        | No     | client |
| 2        | Yes    | client |
| 3        | No     | client |
| 4        | No     | client |
| 10       | No     | driver |
| 11       | No     | driver |
| 12       | No     | driver |
| 13       | No     | driver |
+----------+--------+--------+
Output:
+------------+-------------------+
| Day        | Cancellation Rate |
+------------+-------------------+
| 2013-10-01 | 0.33              |
| 2013-10-02 | 0.00              |
| 2013-10-03 | 0.50              |
+------------+-------------------+
Here's my code:
SELECT request_at,
       count(*) c,
       sum(case when status='cancelled_by_driver' or status='cancelled_by_client' then 1 else 0 end) s,
       round(sum(case when status='cancelled_by_client' or status='cancelled_by_client' then 1 else 0 end)/count(*),2) as Cancellation_rate
FROM trips
WHERE client_id not in (select users_id from users where banned = 'Yes')
  AND driver_id not in (select users_id from users where banned = 'Yes')
GROUP BY request_at;
And the output is:
 request_at | c | s | cancellation_rate
------------+---+---+-------------------
 2013-10-01 | 3 | 1 |              0.00
 2013-10-03 | 2 | 1 |              0.00
 2013-10-02 | 2 | 0 |              0.00
How is the cancellation_rate 0.00 when it is clear from the previous columns (s / c) that it should be 0.33, 0.50, and 0.00?

The good news is that you're mostly just off by a typo. In the rate expression you test cancelled_by_client twice:
round(sum(case when status='cancelled_by_client' or status='cancelled_by_client' then 1 else 0 end)/count(*),2) as Cancellation_rate
rather than:
.. when status='cancelled_by_driver' or status='cancelled_by_client' then ..
There is a second problem, though, and it is what makes even 1/3 come out as 0.00: in PostgreSQL both sum(...) and count(*) return bigint, so the division is integer division and truncates to 0 before round() ever sees it. Cast one side to numeric, e.g. sum(...)::numeric / count(*). With both fixes in place the query returns:
 request_at | c | s | cancellation_rate
------------+---+---+-------------------
 2013-10-01 | 3 | 1 |              0.33
 2013-10-02 | 2 | 0 |              0.00
 2013-10-03 | 2 | 1 |              0.50
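Putting it together, one way to write the corrected query (using status in (...) as shorthand for the two OR'd comparisons, and ::numeric to force numeric division) is:
SELECT request_at,
       round(sum(case when status in ('cancelled_by_driver', 'cancelled_by_client')
                      then 1 else 0 end)::numeric / count(*), 2) AS cancellation_rate
FROM trips
WHERE client_id NOT IN (SELECT users_id FROM users WHERE banned = 'Yes')
  AND driver_id NOT IN (SELECT users_id FROM users WHERE banned = 'Yes')
GROUP BY request_at
ORDER BY request_at;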

Related

SQL window excluding current group?

I'm trying to produce rolled-up summaries of the following data, for each group both including the group in question and excluding it. I think this can be done with a window function, but I'm having problems getting the syntax down (in my case, Hive SQL).
I want the following data to be aggregated
+------------+---------+--------+
| date       | product | rating |
+------------+---------+--------+
| 2018-01-01 | A       | 1      |
| 2018-01-02 | A       | 3      |
| 2018-01-20 | A       | 4      |
| 2018-01-27 | A       | 5      |
| 2018-01-29 | A       | 4      |
| 2018-02-01 | A       | 5      |
| 2017-01-09 | B       | NULL   |
| 2017-01-12 | B       | 3      |
| 2017-01-15 | B       | 4      |
| 2017-01-28 | B       | 4      |
| 2017-07-21 | B       | 2      |
| 2017-09-21 | B       | 5      |
| 2017-09-13 | C       | 3      |
| 2017-09-14 | C       | 4      |
| 2017-09-15 | C       | 5      |
| 2017-09-16 | C       | 5      |
| 2018-04-01 | C       | 2      |
| 2018-01-13 | D       | 1      |
| 2018-01-14 | D       | 2      |
| 2018-01-24 | D       | 3      |
| 2018-01-31 | D       | 4      |
+------------+---------+--------+
Aggregated results:
+------+-------+---------+----+------------+------------------+----------+
| year | month | product | ct | avg_rating | avg_rating_other | other_ct |
+------+-------+---------+----+------------+------------------+----------+
| 2018 | 1     | A       | 5  | 3.4        | 2.5              | 4        |
| 2018 | 2     | A       | 1  | 5          | NULL             | 0        |
| 2017 | 1     | B       | 4  | 3.6666667  | NULL             | 0        |
| 2017 | 7     | B       | 1  | 2          | NULL             | 0        |
| 2017 | 9     | B       | 1  | 5          | 4.25             | 4        |
| 2017 | 9     | C       | 4  | 4.25       | 5                | 1        |
| 2018 | 4     | C       | 1  | 2          | NULL             | 0        |
| 2018 | 1     | D       | 4  | 2.5        | 3.4              | 5        |
+------+-------+---------+----+------------+------------------+----------+
I've also considered producing two aggregates, one with the product in question and one without, but having trouble with creating the appropriate joining key.
You can do:
select year(date), month(date), product,
       count(*) as ct, avg(rating) as avg_rating,
       sum(count(*)) over (partition by year(date), month(date)) - count(*) as other_ct,
       ((sum(sum(rating)) over (partition by year(date), month(date)) - sum(rating)) /
        (sum(count(*)) over (partition by year(date), month(date)) - count(*))
       ) as avg_rating_other
from t
group by year(date), month(date), product;
The rating for the "other" rows is the tricky part. You add everything up for the month, subtract out the current group, and calculate the average as that sum divided by that count. (When a product is alone in its month, the denominator is 0 and Hive's division returns NULL, which matches the expected output.)
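To check the arithmetic against the sample data, take product A in 2018-01: A has 5 rows summing to 17 (avg_rating 3.4), and the month has 9 rows summing to 27 in total, so other_ct = 9 - 5 = 4 and avg_rating_other = (27 - 17) / (9 - 5) = 2.5, exactly the first row of the expected output.
The two-aggregate approach you considered also works, and the joining key you were missing is just the month bucket itself, (year, month). A sketch of that alternative, assuming the same table t:
select p.yr, p.mon, p.product, p.ct, p.avg_rating,
       (m.sum_rating - p.sum_rating) / (m.ct - p.ct) as avg_rating_other,
       m.ct - p.ct as other_ct
from (select year(date) as yr, month(date) as mon, product,
             count(*) as ct, avg(rating) as avg_rating, sum(rating) as sum_rating
      from t
      group by year(date), month(date), product
     ) p
join (select year(date) as yr, month(date) as mon,
             count(*) as ct, sum(rating) as sum_rating
      from t
      group by year(date), month(date)
     ) m
  on p.yr = m.yr and p.mon = m.mon;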

SQL - Window Functions with dense_rank()

I have a dataset structured such as the one below stored in Hive, call it df:
+-----+-----+----------+--------+
| id1 | id2 | date     | amount |
+-----+-----+----------+--------+
| 1   | 2   | 11-07-17 | 0.93   |
| 2   | 2   | 11-11-17 | 1.94   |
| 2   | 2   | 11-09-17 | 1.90   |
| 1   | 1   | 11-10-17 | 0.33   |
| 2   | 2   | 11-10-17 | 1.93   |
| 1   | 1   | 11-07-17 | 0.25   |
| 1   | 1   | 11-09-17 | 0.33   |
| 1   | 1   | 11-12-17 | 0.33   |
| 2   | 2   | 11-08-17 | 1.90   |
| 1   | 1   | 11-08-17 | 0.30   |
| 2   | 2   | 11-12-17 | 2.01   |
| 1   | 2   | 11-12-17 | 1.00   |
| 1   | 2   | 11-09-17 | 0.94   |
| 2   | 2   | 11-07-17 | 1.94   |
| 1   | 2   | 11-11-17 | 1.92   |
| 1   | 1   | 11-11-17 | 0.33   |
| 1   | 2   | 11-10-17 | 1.92   |
| 1   | 2   | 11-08-17 | 0.94   |
+-----+-----+----------+--------+
I wish to partition by id1 and id2, order by date descending within each grouping, and then rank "amount" within that, where the same "amount" on consecutive days receives the same rank. The ordered and ranked output I'd hope to see is shown here:
+-----+-----+------------+--------+------+
| id1 | id2 | date       | amount | rank |
+-----+-----+------------+--------+------+
| 1   | 1   | 2017-11-12 | 0.33   | 1    |
| 1   | 1   | 2017-11-11 | 0.33   | 1    |
| 1   | 1   | 2017-11-10 | 0.33   | 1    |
| 1   | 1   | 2017-11-09 | 0.33   | 1    |
| 1   | 1   | 2017-11-08 | 0.30   | 2    |
| 1   | 1   | 2017-11-07 | 0.25   | 3    |
| 1   | 2   | 2017-11-12 | 1.00   | 1    |
| 1   | 2   | 2017-11-11 | 1.92   | 2    |
| 1   | 2   | 2017-11-10 | 1.92   | 2    |
| 1   | 2   | 2017-11-09 | 0.94   | 3    |
| 1   | 2   | 2017-11-08 | 0.94   | 3    |
| 1   | 2   | 2017-11-07 | 0.93   | 4    |
| 2   | 2   | 2017-11-12 | 2.01   | 1    |
| 2   | 2   | 2017-11-11 | 1.94   | 2    |
| 2   | 2   | 2017-11-10 | 1.93   | 3    |
| 2   | 2   | 2017-11-09 | 1.90   | 4    |
| 2   | 2   | 2017-11-08 | 1.90   | 4    |
| 2   | 2   | 2017-11-07 | 1.94   | 5    |
+-----+-----+------------+--------+------+
I attempted this with the following SQL query:
SELECT id1,
       id2,
       date,
       amount,
       dense_rank() OVER (PARTITION BY id1, id2 ORDER BY date DESC) AS rank
FROM df
GROUP BY id1, id2, date, amount
But that query doesn't produce the output I'm looking for.
It seems like a window function using dense_rank(), PARTITION BY, and ORDER BY is what I need, but I can't quite get it to give me the sample output I desire. Any help would be much appreciated! Thanks!
This is quite tricky. A plain dense_rank() over amount won't work, because equal amounts should only share a rank when they occur on consecutive days. Instead, you can use lag() to see where the value changes and then take a cumulative sum of the change flags:
select df.*,
       sum(case when prev_amount = amount then 0 else 1 end) over
           (partition by id1, id2 order by date desc) as rank
from (select df.*,
             lag(amount) over (partition by id1, id2 order by date desc) as prev_amount
      from df
     ) df;
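As a check, walk through the id1 = 2, id2 = 2 partition: reading the dates descending, prev_amount comes out as NULL, 2.01, 1.94, 1.93, 1.90, 1.90, so the case expression contributes 1, 1, 1, 1, 0, 1 and the running sum produces the ranks 1, 2, 3, 4, 4, 5. This matches the expected output, including giving the second 1.94 (on 2017-11-07) a new rank of 5, since it is not adjacent to the 1.94 on 2017-11-11.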

MDX last non empty over multiple dimensions

I would greatly appreciate it if someone could help me with the following problem. I have this fact table:
+---------+--------+-----------+----------+------------+---------------+-------------+----------------+
| EntryNo | ItemNo | CompanyId | BranchId | LocationId | ValuationDate | ValuatedQty | ValuatedAmount |
+=========+========+===========+==========+============+===============+=============+================+
| 1       | Item1  | 1         | 1        | 1          | 2016-03-01    | 0           | 0              |
+---------+--------+-----------+----------+------------+---------------+-------------+----------------+
| 2       | Item1  | 1         | 2        | 1          | 2016-03-01    | 4           | 400            |
+---------+--------+-----------+----------+------------+---------------+-------------+----------------+
| 3       | Item1  | 1         | 1        | 1          | 2016-03-02    | 10          | 1000           |
+---------+--------+-----------+----------+------------+---------------+-------------+----------------+
| 4       | Item2  | 1         | 1        | 2          | 2016-03-02    | 4           | 200            |
+---------+--------+-----------+----------+------------+---------------+-------------+----------------+
| 5       | Item2  | 2         | 2        | 2          | 2016-03-02    | 6           | 300            |
+---------+--------+-----------+----------+------------+---------------+-------------+----------------+
| 6       | Item1  | 2         | 2        | 1          | 2016-03-03    | 0           | 0              |
+---------+--------+-----------+----------+------------+---------------+-------------+----------------+
| 7       | Item3  | 1         | 2        | 3          | 2016-03-03    | 0           | 0              |
+---------+--------+-----------+----------+------------+---------------+-------------+----------------+
| 8       | Item1  | 2         | 2        | 3          | 2016-03-03    | 9           | 450            |
+---------+--------+-----------+----------+------------+---------------+-------------+----------------+
There are two measures that represent "overstocked" items on a particular day.
Is it possible to create a calculated member that allows slicing the data on all the linked dimensions (Items, Companies, etc.)? I guess the LastNonEmpty aggregation would be useful here, except that it is not available in the Standard Edition.
Given the example, the results should be as follows:
By Company:
+---------+-------------+----------------+
| Company | ValuatedQty | ValuatedAmount |
+=========+=============+================+
| 1       | 14          | 1200           |
+---------+-------------+----------------+
| 2       | 15          | 750            |
+---------+-------------+----------------+
By Date:
+------------+-------------+----------------+
| Date       | ValuatedQty | ValuatedAmount |
+============+=============+================+
| 2016-03-01 | 4           | 400            |
+------------+-------------+----------------+
| 2016-03-02 | 16          | 1300           |
+------------+-------------+----------------+
| 2016-03-03 | 9           | 450            |
+------------+-------------+----------------+
By Item:
+-------+-------------+----------------+
| Item  | ValuatedQty | ValuatedAmount |
+=======+=============+================+
| Item1 | 9           | 450            |
+-------+-------------+----------------+
| Item2 | 6           | 300            |
+-------+-------------+----------------+
| Item3 | 0           | 0              |
+-------+-------------+----------------+
Two functions that come to mind for your requirements are:
Tail: https://msdn.microsoft.com/en-us/library/ms146056.aspx
Bottomcount: https://msdn.microsoft.com/en-us/library/ms144864.aspx
So with Tail something like the following is possible:
WITH SET [LastYearPerSubCat] AS
    GENERATE(
        [Product].[Product Categories].[SubCategory].MEMBERS AS S,
        S.CURRENTMEMBER
        *
        TAIL(
            NONEMPTY(
                [Date].[Calendar Year].[Calendar Year].MEMBERS,
                S.CURRENTMEMBER
            )
        )
    )
SELECT
    [Measures].[Reseller Gross Profit] ON 0,
    [LastYearPerSubCat] ON 1
FROM [Adventure Works];
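The set this builds is, for each subcategory, that subcategory crossed with the last calendar year in which it has data, which is essentially LastNonEmpty behaviour expressed in plain MDX. To adapt it to your cube, you would GENERATE over your Item attribute instead of [SubCategory], take the TAIL of the non-empty [ValuationDate] members (whatever your date hierarchy is actually named), and then aggregate ValuatedQty and ValuatedAmount over that set.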

Count rows each month of a year - SQL Server

I have a table "Product" like this:
| ProductId | ProductCatId | Price | Date       | Deadline   |
--------------------------------------------------------------
| 1         | 1            | 10.00 | 2016-01-01 | 2016-01-27 |
| 2         | 2            | 10.00 | 2016-02-01 | 2016-02-27 |
| 3         | 3            | 10.00 | 2016-03-01 | 2016-03-27 |
| 4         | 1            | 10.00 | 2016-04-01 | 2016-04-27 |
| 5         | 3            | 10.00 | 2016-05-01 | 2016-05-27 |
| 6         | 3            | 10.00 | 2016-06-01 | 2016-06-27 |
| 7         | 1            | 20.00 | 2016-01-01 | 2016-01-27 |
| 8         | 2            | 30.00 | 2016-02-01 | 2016-02-27 |
| 9         | 1            | 40.00 | 2016-03-01 | 2016-03-27 |
| 10        | 4            | 15.00 | 2016-04-01 | 2016-04-27 |
| 11        | 1            | 25.00 | 2016-05-01 | 2016-05-27 |
| 12        | 5            | 55.00 | 2016-06-01 | 2016-06-27 |
| 13        | 5            | 55.00 | 2016-06-01 | 2016-01-27 |
| 14        | 5            | 55.00 | 2016-06-01 | 2016-02-27 |
| 15        | 5            | 55.00 | 2016-06-01 | 2016-03-27 |
I want to create a stored procedure that counts the rows of Product for each month, with the condition Year = CurrentYear, like:
| Month | SumProducts | SumExpiredProducts |
--------------------------------------------
| 1     | 3           | 3                  |
| 2     | 3           | 3                  |
| 3     | 3           | 3                  |
| 4     | 2           | 2                  |
| 5     | 2           | 2                  |
| 6     | 2           | 2                  |
What should I do?
You can use conditional aggregation, like the following. COUNT(CASE WHEN ... THEN 1 END) counts only the rows where the condition holds, because the CASE yields NULL otherwise and COUNT ignores NULLs:
SELECT MONTH([Date]) AS [Month],
       COUNT(*) AS SumProducts,
       COUNT(CASE WHEN [Date] > Deadline THEN 1 END) AS SumExpiredProducts
FROM mytable
WHERE YEAR([Date]) = YEAR(GETDATE())
GROUP BY MONTH([Date])
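Since you asked for a stored procedure, here is a minimal sketch wrapping that query in one (the procedure name is just an example, and it assumes your table is named Product as shown above):
CREATE PROCEDURE dbo.GetMonthlyProductCounts
AS
BEGIN
    SET NOCOUNT ON;

    -- one row per month of the current year, with total and expired counts
    SELECT MONTH([Date]) AS [Month],
           COUNT(*) AS SumProducts,
           COUNT(CASE WHEN [Date] > Deadline THEN 1 END) AS SumExpiredProducts
    FROM Product
    WHERE YEAR([Date]) = YEAR(GETDATE())
    GROUP BY MONTH([Date])
    ORDER BY MONTH([Date]);
END
You would then call it with EXEC dbo.GetMonthlyProductCounts;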

Join Distinct or First

I have a table structure for SalesItems and Sales.
SalesItems is set up something like this:
| SaleItemID | SaleID | ProductID | ProductType |
| 1          | 1      | 1         | 1           |
| 2          | 1      | 2         | 2           |
| 3          | 1      | 15        | 1           |
| 4          | 2      | 5         | 2           |
| 5          | 3      | 1         | 1           |
| 6          | 3      | 8         | 5           |
And Sales is set up something like this:
| Sale | Cash  |
| 1    | 1.00  |
| 2    | 10.00 |
| 3    | 28.50 |
I am trying to export a basic 'Daily History' that uses joins to spit out the information like this:
| Date | StoreID | Type1Sales | Type2Sales | ... | Cash Taken |
| 5/2  | 50      | 50         | 40         | ... | 39.50      |
| 5/3  | 50      | 10         | 32.50      | ... | 48.50      |
The issue I'm having is that if I do an inner join from Sales to SalesItems, I'll end up with this:
| SaleItemID | SaleID | ProductID | ProductType | Sale | Cash  |
| 1          | 1      | 1         | 1           | 1    | 1.00  |
| 2          | 1      | 2         | 2           | 1    | 1.00  |
| 3          | 1      | 15        | 1           | 1    | 1.00  |
| 4          | 2      | 5         | 2           | 2    | 10.00 |
| 5          | 3      | 1         | 1           | 3    | 28.50 |
| 6          | 3      | 8         | 5           | 3    | 28.50 |
So if I do SUM(Cash), I'll end up returning $70.00 instead of the correct $39.50. I'm not the best with joins, so I've been researching outer joins and such, but none of those seem to work, as the rows still match up. Is there a way to match only the FIRST instance and return NULL for the rest? For example, something like this:
| SaleItemID | SaleID | ProductID | ProductType | Sale | Cash  |
| 1          | 1      | 1         | 1           | 1    | 1.00  |
| 2          | 1      | 2         | 2           | 1    | NULL  |
| 3          | 1      | 15        | 1           | 1    | NULL  |
| 4          | 2      | 5         | 2           | 2    | 10.00 |
| 5          | 3      | 1         | 1           | 3    | 28.50 |
| 6          | 3      | 8         | 5           | 3    | NULL  |
Or do you have any other suggestions for returning back the correct amount of Cash for each particular day?
Use DISTINCT(SaleID) in your SELECT to return a single row for each Sale ID.
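If you specifically want the shape you sketched (Cash kept on the first item of each sale and NULL on the rest), a window function can do it. A minimal sketch, assuming SQL Server style syntax and the table and column names from your example:
SELECT si.SaleItemID,
       si.SaleID,
       si.ProductID,
       si.ProductType,
       s.Sale,
       -- keep Cash only on the first SaleItem of each sale so SUM(Cash) is not inflated
       CASE WHEN ROW_NUMBER() OVER (PARTITION BY si.SaleID
                                    ORDER BY si.SaleItemID) = 1
            THEN s.Cash
       END AS Cash
FROM SalesItems si
JOIN Sales s
  ON s.Sale = si.SaleID;
Summing the Cash column of this result gives 39.50 for your sample data rather than 70.00. Alternatively, aggregate Sales on its own and join the two aggregates, so each sale's Cash is counted exactly once.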