I have this table of financial transactions:
PersonID | SeqId | FundId | PortfolioDbu | Date
----------------------------------------------------------
456 | 1 | B | 0.1 | 2012-04-03
456 | 1 | F | 0.5 | 2012-04-03
456 | 1 | H | 0.3 | 2012-04-03
456 | 1 | Z | 0.1 | 2012-04-03
8 | 1 | B | 0.5 | 2012-03-23
8 | 1 | A | 0.5 | 2012-03-23
8 | 2 | C | 0.3 | 2011-03-24
8 | 2 | X | 0.3 | 2011-03-24
8 | 2 | F | 0.4 | 2011-03-24
6001 | 1 | J | 0.5 | 2008-01-01
6001 | 1 | R | 0.5 | 2008-01-01
76 | 1 | A | 0.25 | 2010-09-26
76 | 1 | B | 0.25 | 2010-09-26
76 | 1 | C | 0.25 | 2010-09-26
76 | 1 | D | 0.25 | 2010-09-26
321 | 1 | X | 0.2 | 2012-02-21
321 | 1 | Y | 0.2 | 2012-02-21
321 | 1 | U | 0.2 | 2012-02-21
321 | 1 | P | 0.2 | 2012-02-21
321 | 1 | W | 0.2 | 2012-02-21
456 | 2 | Y | 1 | 2012-11-01
which I need to convert to a "wide" format, like so:
Date | PersonId | SeqId | Fund1 | Fund2 | Fund3 | Fund4 | Fund5 | Dbu1 | Dbu2 | Dbu3 | Dbu4 | Dbu5
----------------------------------------------------------------------------------------------------------
2012-04-03 | 456 | 1 | B | F | H | Z | . | 0.1 | 0.5 | 0.3 | 0.1 | .
2012-03-23 | 8 | 1 | B | A | . | . | . | 0.5 | 0.5 | . | . | .
2011-03-24 | 8 | 2 | C | X | F | . | . | 0.3 | 0.3 | 0.4 | . | .
2008-01-01 | 6001 | 1 | J | R | . | . | . | 0.5 | 0.5 | . | . | .
2010-09-26 | 76 | 1 | A | B | C | D | . | 0.25 | 0.25 | 0.25 | 0.25 | .
2012-02-21 | 321 | 1 | X | Y | U | P | W | 0.2 | 0.2 | 0.2 | 0.2 | 0.2
2012-11-01 | 456 | 2 | Y | . | . | . | . | 1 | . | . | . | .
Is this possible even though I don't want to aggregate the data in any way?
SQL Fiddle
I'm not very good with PIVOT tables, but you can use the following conditional-aggregation (CASE) pattern to get the output you're looking for. Note that although it uses MAX() and GROUP BY, nothing is really aggregated away: after numbering each fund within its (PersonID, SeqId) group, each CASE expression matches at most one row per group:
WITH T AS (
  SELECT
    personid,
    seqid,
    ROW_NUMBER() OVER (PARTITION BY personid, seqid ORDER BY fundid) AS rn,
    fundid,
    portfoliodbu,
    date
  FROM transactions
)
SELECT
  date,
  personid,
  seqid,
  MAX(CASE WHEN rn = 1 THEN fundid END) AS fund1,
  MAX(CASE WHEN rn = 2 THEN fundid END) AS fund2,
  MAX(CASE WHEN rn = 3 THEN fundid END) AS fund3,
  MAX(CASE WHEN rn = 4 THEN fundid END) AS fund4,
  MAX(CASE WHEN rn = 5 THEN fundid END) AS fund5,
  MAX(CASE WHEN rn = 1 THEN portfoliodbu END) AS dbu1,
  MAX(CASE WHEN rn = 2 THEN portfoliodbu END) AS dbu2,
  MAX(CASE WHEN rn = 3 THEN portfoliodbu END) AS dbu3,
  MAX(CASE WHEN rn = 4 THEN portfoliodbu END) AS dbu4,
  MAX(CASE WHEN rn = 5 THEN portfoliodbu END) AS dbu5
FROM T
GROUP BY date, personid, seqid
Demo: SQL Fiddle
Results:
| DATE | PERSONID | SEQID | FUND1 | FUND2 | FUND3 | FUND4 | FUND5 | DBU1 | DBU2 | DBU3 | DBU4 | DBU5 |
----------------------------------------------------------------------------------------------------------------------------------------------
| January, 01 2008 00:00:00+0000 | 6001 | 1 | J | R | (null) | (null) | (null) | 0.5 | 0.5 | (null) | (null) | (null) |
| September, 26 2010 00:00:00+0000 | 76 | 1 | A | B | C | D | (null) | 0.25 | 0.25 | 0.25 | 0.25 | (null) |
| March, 24 2011 00:00:00+0000 | 8 | 2 | C | F | X | (null) | (null) | 0.3 | 0.4 | 0.3 | (null) | (null) |
| February, 21 2012 00:00:00+0000 | 321 | 1 | P | U | W | X | Y | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 |
| March, 23 2012 00:00:00+0000 | 8 | 1 | A | B | (null) | (null) | (null) | 0.5 | 0.5 | (null) | (null) | (null) |
| April, 03 2012 00:00:00+0000 | 456 | 1 | B | F | H | Z | (null) | 0.1 | 0.5 | 0.3 | 0.1 | (null) |
| November, 01 2012 00:00:00+0000 | 456 | 2 | Y | (null) | (null) | (null) | (null) | 1 | (null) | (null) | (null) | (null) |
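As a side note, the query above caps the output at five funds per (PersonID, SeqId) group. If the widest group isn't known in advance, the same conditional-aggregation query can be generated dynamically. This is only a sketch, not part of the original answer; it assumes SQL Server (the question mentions PIVOT) and uses the table and column names above:
DECLARE @n int, @i int = 1;
DECLARE @fundcols nvarchar(max) = N'', @dbucols nvarchar(max) = N'', @sql nvarchar(max);
-- the widest (personid, seqid) group determines how many fundN/dbuN columns are needed
SELECT @n = MAX(cnt)
FROM (SELECT COUNT(*) AS cnt FROM transactions GROUP BY personid, seqid) AS g;
WHILE @i <= @n
BEGIN
    SET @fundcols += N', MAX(CASE WHEN rn = ' + CAST(@i AS nvarchar(10)) + N' THEN fundid END) AS fund' + CAST(@i AS nvarchar(10));
    SET @dbucols  += N', MAX(CASE WHEN rn = ' + CAST(@i AS nvarchar(10)) + N' THEN portfoliodbu END) AS dbu' + CAST(@i AS nvarchar(10));
    SET @i += 1;
END;
SET @sql = N'WITH T AS (
  SELECT personid, seqid, fundid, portfoliodbu, date,
         ROW_NUMBER() OVER (PARTITION BY personid, seqid ORDER BY fundid) AS rn
  FROM transactions
)
SELECT date, personid, seqid' + @fundcols + @dbucols + N'
FROM T
GROUP BY date, personid, seqid;';
EXEC sp_executesql @sql;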
I am trying out S3 Select from Presto using the Hive connector and a Minio object store. I am able to create an external table and run all the SQL queries, but S3 Select does not seem to be working, even with hive.s3select-pushdown.enabled=true set in the properties file in the catalog folder. I ran a packet trace on the Minio server and only see GET/LIST calls being made; I never see a POST /{Key+}?select&select-type=2 HTTP/1.1 request.
Below is the Hive catalog properties file:
hive.metastore.uri=thrift://hadoop-master:9083
hive.s3.path-style-access=true
hive.s3.endpoint=http://X.X.X.X:9000
hive.s3.aws-access-key=minioadmin
hive.s3.aws-secret-key=minioadmin
hive.non-managed-table-writes-enabled=true
hive.storage-format=ORC
hive.s3select-pushdown.enabled=true
I can see that the same settings are enabled in the SESSION parameters in Presto:
minio.s3_select_pushdown_enabled | true | true
minio.projection_pushdown_enabled | true | true
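For what it's worth, the pushdown can also be set explicitly per session from the Presto CLI. This is just a sketch, assuming the catalog is named minio as the session listing above suggests:
-- enable S3 Select pushdown for the current session only
SET SESSION minio.s3_select_pushdown_enabled = true;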
This is how I am creating the external table from the Presto CLI.
presto:default> CREATE TABLE nyc_9 ( vendorid VARCHAR, tpep_pickup_datetime VARCHAR, tpep_dropoff_datetime VARCHAR, passenger_count VARCHAR, trip_distance VARCHAR, ratecodeid VARCHAR, store_and_fwd_flag VARCHAR, pulocationid VARCHAR, dolocationid VARCHAR, payment_type VARCHAR, fare_amount VARCHAR, extra VARCHAR, mta_tax VARCHAR, tip_amount VARCHAR, tolls_amount VARCHAR, improvement_surcharge VARCHAR, total_amount VARCHAR) WITH (FORMAT = 'CSV', skip_header_line_count = 1, EXTERNAL_LOCATION = 's3a://test10gb5/');
Query being run
presto:default> SELECT * FROM nyc_9 WHERE trip_distance > '20' AND fare_amount > '10' AND tip_amount > '2' AND passenger_count = '2' LIMIT 10;
vendorid | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | ratecodeid | store_and_fwd_flag | pulocationid | dolocationid | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_sur
----------+------------------------+------------------------+-----------------+---------------+------------+--------------------+--------------+--------------+--------------+-------------+-------+---------+------------+--------------+----------------
2 | 04/26/2018 08:51:16 AM | 04/26/2018 09:42:03 AM | 2 | 5.06 | 1 | N | 236 | 170 | 1 | 31 | 0 | 0.5 | 6.36 | 0 | 0.3
2 | 04/26/2018 08:14:17 AM | 04/26/2018 08:35:08 AM | 2 | 6.88 | 1 | N | 263 | 45 | 1 | 22 | 0 | 0.5 | 6.84 | 0 | 0.3
1 | 04/26/2018 08:19:47 AM | 04/26/2018 09:17:45 AM | 2 | 9.7 | 1 | N | 138 | 144 | 1 | 39 | 0 | 0.5 | 8 | 0 | 0.3
2 | 04/26/2018 08:38:15 AM | 04/26/2018 09:09:58 AM | 2 | 4.73 | 1 | N | 142 | 144 | 1 | 22 | 0 | 0.5 | 4.56 | 0 | 0.3
2 | 04/26/2018 08:38:26 AM | 04/26/2018 09:22:12 AM | 2 | 5.95 | 1 | N | 239 | 13 | 1 | 29 | 0 | 0.5 | 2.98 | 0 | 0.3
2 | 04/26/2018 08:47:03 AM | 04/26/2018 09:17:02 AM | 2 | 3.27 | 1 | N | 158 | 162 | 1 | 19 | 0 | 0.5 | 3.96 | 0 | 0.3
2 | 04/26/2018 08:21:19 AM | 04/26/2018 08:46:55 AM | 2 | 3.89 | 1 | N | 262 | 107 | 1 | 18.5 | 0 | 0.5 | 3.86 | 0 | 0.3
2 | 04/26/2018 08:35:32 AM | 04/26/2018 09:01:54 AM | 2 | 4.09 | 1 | N | 236 | 137 | 1 | 17.5 | 0 | 0.5 | 3.66 | 0 | 0.3
1 | 04/26/2018 08:43:45 AM | 04/26/2018 09:03:41 AM | 2 | 3 | 1 | N | 163 | 145 | 1 | 15 | 0 | 0.5 | 6 | 0 | 0.3
1 | 04/26/2018 08:01:47 AM | 04/26/2018 08:13:08 AM | 2 | 3.1 | 1 | N | 264 | 137 | 1 | 12 | 0 | 0.5 | 2.55 | 0 | 0.3
(10 rows)
Is there anything else that needs to be done for S3 Select to work?
I have SQL Server data containing available and allocated information, and I need to produce a report showing the Available, Allotted, and Balance data for each resource.
Right now all the information is in a single row, and the balance still needs to be calculated. How can I retrieve the expected data?
Current Data and Output
SQL Fiddle
Available
+---+-------------+------------+---------------------+---------+------------+------------+
| Id| ResourceName| EmployeeId | iGPMResourceGroupId | GroupId | Capacity01 | Capacity02 |
+---+-------------+------------+---------------------+---------+------------+------------+
| 1 | Palanisamy | 24 | 1025135 | 15 | 0.70 | 0.70 |
| 2 | Anil | 20 | 1018707 | 15 | 1.00 | 1.00 |
| 3 | Ravi | 18 | 1025136 | 15 | 0.50 | 0.50 |
| 4 | Manikumar | 9 | 1025164 | 29 | 1.00 | 1.00 |
| 5 | Sakathi | 11 | 1020687 | 29 | 1.00 | 1.00 |
+---+-------------+------------+---------------------+---------+------------+------------+
Demand
+------------+-------------+---------------------+
| Number | ProjectName | iGPMResourceGroupId |
+------------+-------------+---------------------+
| BM-00000001| Project 1 | 1020687 |
| BM-00000002| Project 2 | 1020687 |
| BM-00000002| Project 2 | 1025136 |
| BM-00000003| Project 3 | 1025164 |
| BM-00000002| Project 2 | 1025135 |
| BM-00000003| Project 3 | 1025135 |
| BM-00000003| Project 3 | 1020687 |
| BM-00000002| Project 2 | 1025164 |
+------------+-------------+---------------------+
Allocated
+----+---------------------+----------------+------------+---------------------+------------+-----------+-----------+
| Id | AvailableResourceId | AssociateName | EmployeeId | iGPMResourceGroupId | ProjectId | Staffed01 | Staffed02 |
+----+---------------------+----------------+------------+---------------------+------------+-----------+-----------+
| 1 | 5 | Sakathi | 11 | 1020687 | BM-00000001| 0.30 | 0.30 |
| 2 | 5 | Sakathi | 11 | 1020687 | BM-00000003| 0.30 | 0.30 |
| 3 | 3 | Ravi | 18 | 1025136 | BM-00000002| 0.50 | 0.50 |
+----+---------------------+----------------+------------+---------------------+------------+-----------+-----------+
Query
SELECT ResourceName, iGPMResourceGroupId, ProjectName, StaffedId, Capacity01, Capacity02, Staffed01, Staffed02
FROM
(
    SELECT DISTINCT A.ResourceName, A.EmployeeId, A.[iGPMResourceGroupId], D.[ProjectName], S.[Id] AS StaffedId, A.[Capacity01], A.[Capacity02], S.[Staffed01], S.[Staffed02]
    FROM [AvailableR] A
    JOIN [DemandR] D ON A.[iGPMResourceGroupId] = D.[iGPMResourceGroupId]
    LEFT JOIN [dbo].[AllocatedR] S ON A.Id = S.[AvailableResourceId] AND S.Number = D.Number AND S.[iGPMResourceGroupId] = D.[iGPMResourceGroupId]
) X
ORDER BY EmployeeId
Output
+--------------+---------------------+-------------+-----------+------------+------------+-----------+-----------+
| ResourceName | iGPMResourceGroupId | ProjectName | StaffedId | Capacity01 | Capacity02 | Staffed01 | Staffed02 |
+--------------+---------------------+-------------+-----------+------------+------------+-----------+-----------+
| Sakathi | 1020687 | Project 1 | 1 | 1 | 1 | 0 | 0 |
| Sakathi | 1020687 | Project 2 | (null) | 1 | 1 | (null) | (null) |
| Sakathi | 1020687 | Project 3 | 2 | 1 | 1 | 0 | 0 |
| Ravi | 1025136 | Project 2 | 3 | 0 | 0 | 1 | 1 |
| Palanisamy | 1025135 | Project 2 | (null) | 1 | 1 | (null) | (null) |
| Palanisamy | 1025135 | Project 3 | (null) | 1 | 1 | (null) | (null) |
| Manikumar | 1025164 | Project 2 | (null) | 1 | 1 | (null) | (null) |
| Manikumar | 1025164 | Project 3 | (null) | 1 | 1 | (null) | (null) |
+--------------+---------------------+-------------+-----------+------------+------------+-----------+-----------+
Expected Output:
Left = Available - Sum(Allotted)
+--------------+---------------------+-------------+-----------+------------+------+------+
| ResourceName | iGPMResourceGroupId | ProjectName | Status | StaffedId | 01 | 02 |
+--------------+---------------------+-------------+-----------+------------+------+------+
| Sakathi | | | Available | (null) | 1 | 1 |
| Sakathi | 1020687 | Project 1 | Alloted | 1 | 0.30 | 0.30 |
| Sakathi | 1020687 | Project 2 | Alloted | (null) |(null)|(null)|
| Sakathi | 1020687 | Project 3 | Alloted | 2 | 0.30 | 0.30 |
| Sakathi | | | Left | (null) | 0.40 | 0.40 |
| Ravi | | | Available | (null) | 0.50 | 0.50 |
| Ravi | 1025136 | Project 2 | Alloted | 3 | 0.50 | 0.50 |
| Ravi | | | Left | (null) | 0 | 0 |
| Palanisamy | | | Available | (null) | 1 | 1 |
| Palanisamy | 1025135 | Project 2 | Alloted | (null) |(null)|(null)|
| Palanisamy | 1025135 | Project 3 | Alloted | (null) |(null)|(null)|
| Palanisamy | | | Left | (null) | 1 | 1 |
| Manikumar | | | Available | (null) | 1 | 1 |
| Manikumar | 1025164 | Project 2 | Alloted | (null) |(null)|(null)|
| Manikumar | 1025164 | Project 3 | Alloted | (null) |(null)|(null)|
| Manikumar | | | Left | (null) | 1 | 1 |
+--------------+---------------------+-------------+-----------+------------+------+------+
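To make the formula concrete with the sample data: Sakathi has Capacity01 = Capacity02 = 1.00 and two staffed allocations of 0.30 each per period, so Left = 1.00 - (0.30 + 0.30) = 0.40 per period, matching the Left row above; Ravi has 0.50 capacity and a single 0.50 allocation, leaving 0.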
So, if I understood you correctly, you need an additional "Left" column with the calculated result, plus extra rows that carry the Available and Left values per resource? You can achieve this by redesigning your query around two CTEs:
WITH cte AS (
    SELECT DISTINCT A.ResourceName, A.EmployeeId, A.[iGPMResourceGroupId], NULL AS [ProjectName], N'Available' AS Status, NULL AS StaffedId, A.[Capacity01], A.[Capacity02], NULL AS [Staffed01], NULL AS [Staffed02]
    FROM [AvailableR] A
    JOIN [DemandR] D ON A.[iGPMResourceGroupId] = D.[iGPMResourceGroupId]
    UNION ALL
    SELECT DISTINCT A.ResourceName, A.EmployeeId, A.[iGPMResourceGroupId], D.[ProjectName], N'Alloted' AS Status, S.[Id] AS StaffedId, A.[Capacity01], A.[Capacity02], S.[Staffed01], S.[Staffed02]
    FROM [AvailableR] A
    JOIN [DemandR] D ON A.[iGPMResourceGroupId] = D.[iGPMResourceGroupId]
    LEFT JOIN [dbo].[AllocatedR] S ON A.Id = S.[AvailableResourceId] AND S.Number = D.Number AND S.[iGPMResourceGroupId] = D.[iGPMResourceGroupId]
),
cteLeft AS (
    SELECT ResourceName, EmployeeId, iGPMResourceGroupId, SUM([Capacity01] + [Capacity02]) - SUM(Staffed01 + Staffed02) AS LeftTotal
    FROM cte
    WHERE Status = N'Alloted'
    GROUP BY ResourceName, EmployeeId, iGPMResourceGroupId
)
SELECT *, NULL AS [Left]
FROM cte
UNION ALL
SELECT DISTINCT ResourceName, EmployeeId, [iGPMResourceGroupId], NULL AS [ProjectName], N'Left' AS Status, NULL AS StaffedId, NULL AS [Capacity01], NULL AS [Capacity02], NULL AS [Staffed01], NULL AS [Staffed02], [LeftTotal] AS [Left]
FROM cteLeft
ORDER BY 1
See fiddle for details: http://sqlfiddle.com/#!18/2a30d/16/0
I have a table A:
Create table A(
Name varchar(10),
Number integer,
Exc integer,
D1 date
)
I have inserted 11 rows.
Sel * from A;
+ -----+--------+-----+------------+
| NAME | NUMBER | EXC | D1 |
+ -----+--------+-----+------------+
| a | 1 | 1 | 2020-02-03 |
| a | 1 | 2 | 2020-02-03 |
| a | 1 | 3 | 2020-02-03 |
| a | 1 | 4 | 2020-02-03 |
| a | 1 | 1 | 2020-02-04 |
| a | 1 | 2 | 2020-02-04 |
| a | 1 | 3 | 2020-02-04 |
| a | 1 | 1 | 2020-02-05 |
| a | 1 | 2 | 2020-02-05 |
| a | 1 | 3 | 2020-02-05 |
| a | 1 | 4 | 2020-02-05 |
+ -----+--------+-----+------------+
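For anyone reproducing this, the 11 rows shown above can be loaded with plain INSERTs (ANSI date literals assumed):
INSERT INTO A VALUES ('a', 1, 1, DATE '2020-02-03');
INSERT INTO A VALUES ('a', 1, 2, DATE '2020-02-03');
INSERT INTO A VALUES ('a', 1, 3, DATE '2020-02-03');
INSERT INTO A VALUES ('a', 1, 4, DATE '2020-02-03');
INSERT INTO A VALUES ('a', 1, 1, DATE '2020-02-04');
INSERT INTO A VALUES ('a', 1, 2, DATE '2020-02-04');
INSERT INTO A VALUES ('a', 1, 3, DATE '2020-02-04');
INSERT INTO A VALUES ('a', 1, 1, DATE '2020-02-05');
INSERT INTO A VALUES ('a', 1, 2, DATE '2020-02-05');
INSERT INTO A VALUES ('a', 1, 3, DATE '2020-02-05');
INSERT INTO A VALUES ('a', 1, 4, DATE '2020-02-05');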
Now, when I apply DENSE_RANK like below:
sel A.*, dense_rank() OVER (PARTITION BY Name, Number, Exc ORDER BY D1) AS rn
from A;
Output:
+ -----+--------+-----+------------+----+
| NAME | NUMBER | EXC | D1 | RN |
+ -----+--------+-----+------------+----+
| a | 1 | 1 | 2020-02-03 | 1 |
| a | 1 | 2 | 2020-02-03 | 1 |
| a | 1 | 3 | 2020-02-03 | 1 |
| a | 1 | 4 | 2020-02-03 | 1 |
| a | 1 | 1 | 2020-02-04 | 2 |
| a | 1 | 2 | 2020-02-04 | 2 |
| a | 1 | 3 | 2020-02-04 | 2 |
| a | 1 | 1 | 2020-02-05 | 3 |
| a | 1 | 2 | 2020-02-05 | 3 |
| a | 1 | 3 | 2020-02-05 | 3 |
| a | 1 | 4 | 2020-02-05 | 2 |
+ -----+--------+-----+------------+----+
Expected:
+ -----+--------+-----+------------+----+
| NAME | NUMBER | EXC | D1 | RN |
+ -----+--------+-----+------------+----+
| a | 1 | 1 | 2020-02-03 | 1 |
| a | 1 | 2 | 2020-02-03 | 1 |
| a | 1 | 3 | 2020-02-03 | 1 |
| a | 1 | 4 | 2020-02-03 | 1 |
| a | 1 | 1 | 2020-02-04 | 2 |
| a | 1 | 2 | 2020-02-04 | 2 |
| a | 1 | 3 | 2020-02-04 | 2 |
| a | 1 | 1 | 2020-02-05 | 3 |
| a | 1 | 2 | 2020-02-05 | 3 |
| a | 1 | 3 | 2020-02-05 | 3 |
| a | 1 | 4 | 2020-02-05 | 3 | <-- Difference here
+ -----+--------+-----+------------+----+
With EXC in the PARTITION BY, the rows with EXC = 4 form their own partition containing only two distinct dates (2020-02-03 and 2020-02-05), so the 2020-02-05 row is ranked 2. Removing column EXC from the PARTITION BY gives you the results that you expect:
DENSE_RANK() OVER(PARTITION BY Name, Number ORDER BY D1)
Demo on DB Fiddle:
name | number | exc | d1 | rn
:--- | -----: | --: | :--------- | :-
a | 1 | 1 | 2020-02-03 | 1
a | 1 | 2 | 2020-02-03 | 1
a | 1 | 3 | 2020-02-03 | 1
a | 1 | 4 | 2020-02-03 | 1
a | 1 | 1 | 2020-02-04 | 2
a | 1 | 2 | 2020-02-04 | 2
a | 1 | 3 | 2020-02-04 | 2
a | 1 | 1 | 2020-02-05 | 3
a | 1 | 2 | 2020-02-05 | 3
a | 1 | 3 | 2020-02-05 | 3
a | 1 | 4 | 2020-02-05 | 3
Do you simply want to rank all rows by date? Then remove the PARTITION BY clause entirely, so the ranking is not done inside groups; with only one Name/Number combination in the sample data, this yields the same rn values as above.
select name, number, exc, d1, dense_rank() over (order by d1) as rn
from A
order by d1, name, number, exc;
I have a DataFrame
| ind | A | B |
------------------------
| 1.01 | 10 | -1.734 |
| 1.04 | 10 | -1.244 |
| 1.05 | 10 | 0.016 |
| 1.11 | NaN | -2.737 | <-
| 1.13 | NaN | -4.232 | <-
| 1.19 | 11 | -3.241 | <=
| 1.20 | 12 | -2.832 |
| 1.21 | 10 | -4.277 |
and would like to back-fill the NaN values with a decreasing sequence that ends at the next valid value:
| ind | A | B |
------------------------
| 1.01 | 10 | -1.734 |
| 1.04 | 10 | -1.244 |
| 1.05 | 10 | 0.016 |
| 1.11 | 13 | -2.737 | <-
| 1.13 | 12 | -4.232 | <-
| 1.19 | 11 | -3.241 | <=
| 1.20 | 12 | -2.832 |
| 1.21 | 10 | -4.277 |
Is there a way to do this?
Get the positions where NaNs are found:
positions = df['A'].isna().astype(int)
| positions |
--------------
| 0 |
| 0 |
| 0 |
| 1 |
| 1 |
| 0 |
| 0 |
| 0 |
then numbering each NaN by its distance from the end of its run: reverse the series, take a cumulative sum of the mask, and subtract the count carried in from before each run. Note the mask must stay boolean (no astype(int)) so that ~mask is a valid condition for .where:
mask = df['A'].isna().loc[::-1]   # boolean NaN mask, reversed
cumSum = mask.cumsum()            # cumsum on booleans yields integers
posCumSum = (cumSum - cumSum.where(~mask).ffill().fillna(0).astype(int)).loc[::-1]
| posCumSum |
--------------
| 0 |
| 0 |
| 0 |
| 2 |
| 1 |
| 0 |
| 0 |
| 0 |
adding it to the backfilled original column:
df['A'] = df['A'].bfill() + posCumSum
| ind | A | B |
------------------------
| 1.01 | 10 | -1.734 |
| 1.04 | 10 | -1.244 |
| 1.05 | 10 | 0.016 |
| 1.11 | 13 | -2.737 | <-
| 1.13 | 12 | -4.232 | <-
| 1.19 | 11 | -3.241 | <=
| 1.20 | 12 | -2.832 |
| 1.21 | 10 | -4.277 |
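Putting the steps together, here is a minimal self-contained version (only pandas and numpy are assumed; the frame is rebuilt from the sample above):
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'ind': [1.01, 1.04, 1.05, 1.11, 1.13, 1.19, 1.20, 1.21],
    'A':   [10, 10, 10, np.nan, np.nan, 11, 12, 10],
    'B':   [-1.734, -1.244, 0.016, -2.737, -4.232, -3.241, -2.832, -4.277],
})
mask = df['A'].isna().loc[::-1]                                 # boolean NaN mask, reversed
cum = mask.cumsum()                                             # running NaN count in reversed order
offset = (cum - cum.where(~mask).ffill().fillna(0)).loc[::-1]   # 1, 2, ... counted from the end of each NaN run
df['A'] = df['A'].bfill() + offset                              # next valid value plus offset: 13, 12 ahead of the 11
print(df)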
I have a dataset structured like the one below stored in Hive; call it df:
+-----+-----+----------+--------+
| id1 | id2 | date | amount |
+-----+-----+----------+--------+
| 1 | 2 | 11-07-17 | 0.93 |
| 2 | 2 | 11-11-17 | 1.94 |
| 2 | 2 | 11-09-17 | 1.90 |
| 1 | 1 | 11-10-17 | 0.33 |
| 2 | 2 | 11-10-17 | 1.93 |
| 1 | 1 | 11-07-17 | 0.25 |
| 1 | 1 | 11-09-17 | 0.33 |
| 1 | 1 | 11-12-17 | 0.33 |
| 2 | 2 | 11-08-17 | 1.90 |
| 1 | 1 | 11-08-17 | 0.30 |
| 2 | 2 | 11-12-17 | 2.01 |
| 1 | 2 | 11-12-17 | 1.00 |
| 1 | 2 | 11-09-17 | 0.94 |
| 2 | 2 | 11-07-17 | 1.94 |
| 1 | 2 | 11-11-17 | 1.92 |
| 1 | 1 | 11-11-17 | 0.33 |
| 1 | 2 | 11-10-17 | 1.92 |
| 1 | 2 | 11-08-17 | 0.94 |
+-----+-----+----------+--------+
I wish to partition by id1 and id2, order by date descending within each (id1, id2) grouping, and then rank "amount" within that, where the same "amount" on consecutive days receives the same rank. The ordered and ranked output I'd hope to see is shown here:
+-----+-----+------------+--------+------+
| id1 | id2 | date | amount | rank |
+-----+-----+------------+--------+------+
| 1 | 1 | 2017-11-12 | 0.33 | 1 |
| 1 | 1 | 2017-11-11 | 0.33 | 1 |
| 1 | 1 | 2017-11-10 | 0.33 | 1 |
| 1 | 1 | 2017-11-09 | 0.33 | 1 |
| 1 | 1 | 2017-11-08 | 0.30 | 2 |
| 1 | 1 | 2017-11-07 | 0.25 | 3 |
| 1 | 2 | 2017-11-12 | 1.00 | 1 |
| 1 | 2 | 2017-11-11 | 1.92 | 2 |
| 1 | 2 | 2017-11-10 | 1.92 | 2 |
| 1 | 2 | 2017-11-09 | 0.94 | 3 |
| 1 | 2 | 2017-11-08 | 0.94 | 3 |
| 1 | 2 | 2017-11-07 | 0.93 | 4 |
| 2 | 2 | 2017-11-12 | 2.01 | 1 |
| 2 | 2 | 2017-11-11 | 1.94 | 2 |
| 2 | 2 | 2017-11-10 | 1.93 | 3 |
| 2 | 2 | 2017-11-09 | 1.90 | 4 |
| 2 | 2 | 2017-11-08 | 1.90 | 4 |
| 2 | 2 | 2017-11-07 | 1.94 | 5 |
+-----+-----+------------+--------+------+
I attempted this with the following SQL query:
SELECT
id1,
id2,
date,
amount,
dense_rank() OVER (PARTITION BY id1, id2 ORDER BY date DESC) AS rank
FROM
df
GROUP BY
id1,
id2,
date,
amount
But that query doesn't seem to be doing what I'd like, as I'm not receiving the output I'm looking for.
It seems like a window function using dense_rank with partition by and order by is what I need, but I can't quite get it to produce the sample output I'm after. Any help would be much appreciated! Thanks!
This is quite tricky. I think you need to use lag() to see where the value changes and then do a cumulative sum. The first row of each (id1, id2) partition gets 1 because prev_amount is NULL there (the CASE falls through to the else branch); after that, every change in amount adds 1 to the running total, while a repeat of the previous day's amount adds 0 and keeps the same rank:
select df.*,
sum(case when prev_amount = amount then 0 else 1 end) over
(partition by id1, id2 order by date desc) as rank
from (select df.*,
lag(amount) over (partition by id1, id2 order by date desc) as prev_amount
from df
) df;
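As a quick sanity check against the expected output: for id1 = 1, id2 = 2, the amounts in descending date order are 1.00, 1.92, 1.92, 0.94, 0.94, 0.93, so the change flags from the CASE are 1, 1, 0, 1, 0, 1 and their running sums are 1, 2, 2, 3, 3, 4, exactly the rank column shown above.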