I found lots of drop_duplicates for index when both multi level indices are the same but, I would like to keep the first row of a multi index when the second level of the multi index has duplicates. So here:
| | col_0 | col_1 | col_2 | col_3 | col_4 |
|:-------------------------------|--------:|--------:|--------:|--------:|--------:|
| date | ID
| ('2022-01-01', 'identifier_0') | 26 | 46 | 44 | 21 | 10 |
| ('2022-01-01', 'identifier_1') | 25 | 45 | 83 | 23 | 45 |
| ('2022-01-01', 'identifier_2') | 42 | 79 | 55 | 5 | 78 |
| ('2022-01-01', 'identifier_3') | 32 | 4 | 57 | 19 | 61 |
| ('2022-01-01', 'identifier_4') | 30 | 25 | 5 | 93 | 72 |
| ('2022-01-02', 'identifier_0') | 42 | 14 | 56 | 43 | 42 |
| ('2022-01-02', 'identifier_1') | 90 | 27 | 46 | 58 | 5 |
| ('2022-01-02', 'identifier_2') | 33 | 39 | 53 | 94 | 86 |
| ('2022-01-02', 'identifier_3') | 32 | 65 | 98 | 81 | 64 |
| ('2022-01-02', 'identifier_4') | 48 | 31 | 25 | 58 | 15 |
| ('2022-01-03', 'identifier_0') | 5 | 80 | 33 | 96 | 80 |
| ('2022-01-03', 'identifier_1') | 15 | 86 | 45 | 39 | 62 |
| ('2022-01-03', 'identifier_2') | 98 | 3 | 42 | 50 | 83 |
I'd like to keep first rows with unique ID.
If your index is a MultiIndex:
>>> df.loc[~df.index.get_level_values('ID').duplicated()]
col_0 col_1 col_2 col_3 col_4
date ID
2022-01-01 identifier_0 26 46 44 21 10
identifier_1 25 45 83 23 45
identifier_2 42 79 55 5 78
identifier_3 32 4 57 19 61
identifier_4 30 25 5 93 72
# Or
>>> df.groupby(level='ID').first()
col_0 col_1 col_2 col_3 col_4
ID
identifier_0 26 46 44 21 10
identifier_1 25 45 83 23 45
identifier_2 42 79 55 5 78
identifier_3 32 4 57 19 61
identifier_4 30 25 5 93 72
If your index is an Index:
>>> df.loc[~df.index.str[1].duplicated()]
col_0 col_1 col_2 col_3 col_4
(2022-01-01, identifier_0) 26 46 44 21 10
(2022-01-01, identifier_1) 25 45 83 23 45
(2022-01-01, identifier_2) 42 79 55 5 78
(2022-01-01, identifier_3) 32 4 57 19 61
(2022-01-01, identifier_4) 30 25 5 93 72
>>> df.groupby(df.index.str[1]).first()
col_0 col_1 col_2 col_3 col_4
identifier_0 26 46 44 21 10
identifier_1 25 45 83 23 45
identifier_2 42 79 55 5 78
identifier_3 32 4 57 19 61
identifier_4 30 25 5 93 72
I am using Snowflake SQL, but I guess this can be solved by any sql. So I have data like this:
RA_MEMBER_ID YEAR QUARTER MONTH Monthly_TOTAL_PURCHASE CATEGORY
1000 2020 1 1 105 CAT10
1000 2020 1 1 57 CAT13
1000 2020 1 2 107 CAT10
1000 2020 1 2 59 CAT13
1000 2020 1 3 109 CAT11
1000 2020 1 3 61 CAT14
1000 2020 2 4 111 CAT11
1000 2020 2 4 63 CAT14
1000 2020 2 5 113 CAT12
1000 2020 2 5 65 CAT15
1000 2020 2 6 115 CAT12
1000 2020 2 6 67 CAT15
And I need data like this:
RA_MEMBER_ID YEAR QUARTER MONTH Monthly_TOTAL_PURCHASE CATEGORY Monthly_rank Quarterly_Total_purchase Quarter_category Quarter_rank Yearly_Total_purchase Yearly_category Yearly_rank
1000 2020 1 1 105 CAT10 1 105 CAT10 1 105 CAT10 1
1000 2020 1 1 57 CAT13 2 57 CAT13 2 57 CAT13 2
1000 2020 1 2 107 CAT10 1 212 CAT10 1 212 CAT10 1
1000 2020 1 2 59 CAT13 2 116 CAT13 2 116 CAT13 2
1000 2020 1 3 109 CAT11 1 212 CAT10 1 212 CAT10 1
1000 2020 1 3 61 CAT14 2 116 CAT13 2 116 CAT13 2
1000 2020 2 4 111 CAT11 1 111 CAT11 1 212 CAT10 1
1000 2020 2 4 63 CAT14 2 63 CAT14 2 124 CAT14 2
1000 2020 2 5 113 CAT12 1 113 CAT12 1 212 CAT10 1
1000 2020 2 5 65 CAT15 2 65 CAT15 2 124 CAT14 2
1000 2020 2 6 115 CAT12 1 228 CAT12 1 228 CAT12 1
1000 2020 2 6 67 CAT15 2 132 CAT15 2 132 CAT15 2
So basically, I have the top two categories by purchase amount for the first 6 months. I need the same for quarterly based on which month of the quarter it is. So let's say it is February, then the top 2 categories and amounts should be calculated based on both January and February. For March we have to get the quarter data by taking all three months. From April it will be the same as monthly rank, for May again calculate based on April and May. Similarly for Yearly also.
I have tried a lot of things but nothing seems to give me what I want.
The solution should be generic enough because there can be many other months and years.
I really need help in this.
Not sure if below is what you are after. I assume that everything is category based:
create or replace table test (
ra_member_id int,
year int,
quarter int,
month int,
monthly_purchase int,
category varchar
);
insert into test values
(1000, 2020, 1,1, 105, 'cat10'),
(1000, 2020, 1,1, 57, 'cat13'),
(1000, 2020, 1,2, 107, 'cat10'),
(1000, 2020, 1,2, 59, 'cat13'),
(1000, 2020, 1,3, 109, 'cat11'),
(1000, 2020, 1,3, 61, 'cat14'),
(1000, 2020, 2,4, 111, 'cat11'),
(1000, 2020, 2,4, 63, 'cat14'),
(1000, 2020, 2,5, 113, 'cat12'),
(1000, 2020, 2,5, 65, 'cat15'),
(1000, 2020, 2,6, 115, 'cat12'),
(1000, 2020, 2,6, 67, 'cat15');
WITH BASE as (
select
RA_MEMBER_ID,
YEAR,
QUARTER,
MONTH,
CATEGORY,
MONTHLY_PURCHASE,
LAG(MONTHLY_PURCHASE) OVER (PARTITION BY QUARTER, CATEGORY ORDER BY MONTH) AS QUARTERLY_PURCHASE_LAG,
IFNULL(QUARTERLY_PURCHASE_LAG, 0) + MONTHLY_PURCHASE AS QUARTERLY_PURCHASE,
LAG(MONTHLY_PURCHASE) OVER (PARTITION BY YEAR, CATEGORY ORDER BY MONTH) AS YEARLY_PURCHASE_LAG,
IFNULL(YEARLY_PURCHASE_LAG, 0) + MONTHLY_PURCHASE AS YEARLY_PURCHASE
FROM
TEST
),
BASE_RANK AS (
SELECT
RA_MEMBER_ID,
YEAR,
QUARTER,
MONTH,
CATEGORY,
MONTHLY_PURCHASE,
RANK() OVER (PARTITION BY MONTH ORDER BY MONTHLY_PURCHASE DESC) as MONTHLY_RANK,
QUARTERLY_PURCHASE,
RANK() OVER (PARTITION BY QUARTER ORDER BY QUARTERLY_PURCHASE DESC) as QUARTERLY_RANK,
YEARLY_PURCHASE,
RANK() OVER (PARTITION BY YEAR ORDER BY YEARLY_PURCHASE DESC) as YEARLY_RANK
FROM BASE
),
MAIN AS (
SELECT
RA_MEMBER_ID,
YEAR,
QUARTER,
MONTH,
CATEGORY,
MONTHLY_PURCHASE,
MONTHLY_RANK,
QUARTERLY_PURCHASE,
QUARTERLY_RANK,
YEARLY_PURCHASE,
YEARLY_RANK
FROM BASE_RANK
)
SELECT * FROM MAIN
ORDER BY YEAR, QUARTER, MONTH
;
Result:
+--------------+------+---------+-------+----------+------------------+--------------+--------------------+----------------+-----------------+-------------+
| RA_MEMBER_ID | YEAR | QUARTER | MONTH | CATEGORY | MONTHLY_PURCHASE | MONTHLY_RANK | QUARTERLY_PURCHASE | QUARTERLY_RANK | YEARLY_PURCHASE | YEARLY_RANK |
|--------------+------+---------+-------+----------+------------------+--------------+--------------------+----------------+-----------------+-------------|
| 1000 | 2020 | 1 | 1 | cat10 | 105 | 1 | 105 | 4 | 105 | 9 |
| 1000 | 2020 | 1 | 1 | cat13 | 57 | 2 | 57 | 6 | 57 | 12 |
| 1000 | 2020 | 1 | 2 | cat10 | 107 | 1 | 212 | 1 | 212 | 3 |
| 1000 | 2020 | 1 | 2 | cat13 | 59 | 2 | 116 | 2 | 116 | 6 |
| 1000 | 2020 | 1 | 3 | cat11 | 109 | 1 | 109 | 3 | 109 | 8 |
| 1000 | 2020 | 1 | 3 | cat14 | 61 | 2 | 61 | 5 | 61 | 11 |
| 1000 | 2020 | 2 | 4 | cat11 | 111 | 1 | 111 | 4 | 220 | 2 |
| 1000 | 2020 | 2 | 4 | cat14 | 63 | 2 | 63 | 6 | 124 | 5 |
| 1000 | 2020 | 2 | 5 | cat12 | 113 | 1 | 113 | 3 | 113 | 7 |
| 1000 | 2020 | 2 | 5 | cat15 | 65 | 2 | 65 | 5 | 65 | 10 |
| 1000 | 2020 | 2 | 6 | cat12 | 115 | 1 | 228 | 1 | 228 | 1 |
| 1000 | 2020 | 2 | 6 | cat15 | 67 | 2 | 132 | 2 | 132 | 4 |
+--------------+------+---------+-------+----------+------------------+--------------+--------------------+----------------+-----------------+-------------+
I am new to psql. I have a table of car ids - id and their battery level - battery at every ten-minute interval for 2 weeks.
My goal is to create an output of the total count of unique car ids whose batteries experienced any gain per day. That could mean that at any time over the course of the day where a car's battery level was higher than the previous time stamp. In other words, where the value of battery - the previous value of battery is positive. Records with NA values under battery should be skipped.
I have started with the query but I am unsure how to only select unique id's whose battery levels rose. Any recommendations would be appreciated!
SELECT count(distinct id), TO_CHAR(date_trunc('day', (time::timestamp) AT TIME ZONE 'EST'), 'YYYY-MM-DD') AS day FROM test_db ....
GROUP by day
ORDER by day
Here is a sample of the data :
id| time| battery
54 | 2017-12-12 09:50:04.402775+00 | 100
54 | 2017-12-12 09:40:04.618926+00 | 100
54 | 2017-12-12 09:30:04.11399+00 | 100
54 | 2017-12-12 09:20:03.906716+00 | 100
54 | 2017-12-12 09:10:03.955133+00 | 100
54 | 2017-12-12 09:00:04.678508+00 | 100
54 | 2017-12-12 08:50:03.733471+00 | 100
54 | 2017-12-12 08:40:03.65688+00 | 100
54 | 2017-12-12 08:30:04.260608+00 | 100
54 | 2017-12-12 08:20:03.98387+00 | 100
54 | 2017-12-12 08:10:04.164129+00 | 98
54 | 2017-12-12 08:00:04.597976+00 | 98
54 | 2017-12-12 07:50:04.501231+00 | 98
54 | 2017-12-12 07:40:04.441531+00 | 98
54 | 2017-12-12 07:30:04.310876+00 | 98
54 | 2017-12-12 07:20:04.317241+00 | 98
54 | 2017-12-12 07:10:03.856432+00 | 67
54 | 2017-12-12 07:00:03.628862+00 | 67
54 | 2017-12-12 06:50:03.868495+00 | 67
54 | 2017-12-12 06:40:04.490324+00 | 67
54 | 2017-12-12 06:30:03.83739+00 | 67
54 | 2017-12-12 06:20:03.817014+00 | 67
54 | 2017-12-12 06:10:04.081174+00 | 29
54 | 2017-12-12 06:00:04.178765+00 | 29
data_type
--------------------------
integer
timestamp with time zone
integer
So i have a table containing below columns.
I want to compute an running average from positiondate and for example 3 days back, grouped on dealno.
I know how to do with "case by" but problem is that I have around 200 different DealNo so I do not want to write an own case by clause for every deal.
On dealNo 1 it desired output should be Average(149 243 440 + 149 224 446 + 149 243 451)
DealNo PositionDate MarketValue
1 | 2016-11-27 | 149 243 440
2 | 2016-11-27 | 21 496 418
3 | 2016-11-27 | 32 249 600
1 | 2016-11-26 | 149 243 446
2 | 2016-11-26 | 21 496 418
3 | 2016-11-26 | 32 249 600
1 | 2016-11-25 | 149 243 451
3 | 2016-11-25 | 32 249 600
2 | 2016-11-25 | 21 496 418
3 | 2016-11-24 | 32 249 600
1 | 2016-11-24 | 149 225 582
2 | 2016-11-24 | 21 498 120
1 | 2016-11-23 | 149 256 867
2 | 2016-11-23 | 21 504 181
3 | 2016-11-23 | 32 253 440
1 | 2016-11-22 | 149 256 873
2 | 2016-11-22 | 21 506 840
3 | 2016-11-22 | 32 253 440
1 | 2016-11-21 | 149 234 535
2 | 2016-11-21 | 21 509 179
3 | 2016-11-21 | 32 253 600
I tried below script but it was not very effective since my table contains around 300k rows and approx 200 different dealno.
Is there a more effective way to do this in SQL 2008?
with cte as (
SELECT ROW_NUMBER() over(order by dealno, positiondate desc) as Rownr,
dealno,
positiondate,
Currency,
MvCleanCcy
FROM T1
)
select
rownr, positiondate, DealNo, Currency,
mvcleanavg30d = (select avg(MvCleanCcy) from cte2 where Rownr between c.Rownr and c.Rownr+3)
from cte as c
You don't need window functions. You can do this using outer apply:
select t1.*, tt1.marketvalue_3day
from t1 outer apply
(select avg(tt1.marketvalue) as marketvalue_3day
from (select top 3 tt1.*
from t1 tt1
where tt1.deal1 = t1.deal1 and
tt1.positiondate <= t1.positiondate
order by tt1.positiondate desc
) tt1
) tt1;
I have a sql server table like this:
Value RowID Diff
153 48 1
68 49 1
50 57 NULL
75 58 1
65 59 1
70 63 NULL
66 64 1
79 66 NULL
73 67 1
82 68 1
85 69 1
66 70 1
118 88 NULL
69 89 1
67 90 1
178 91 1
How can I make it like this (note the partition after each null in 3rd column):
Value RowID Diff
153 48 1
68 49 1
50 57 NULL
75 58 2
65 59 2
70 63 NULL
66 64 3
79 66 NULL
73 67 4
82 68 4
85 69 4
66 70 4
118 88 NULL
69 89 5
67 90 5
178 91 5
It looks like you are partitioning over sequential values of RowID. There is a trick to do this directly by grouping on RowID - Row_Number():
select
value,
rowID,
Diff,
RowID - row_number() over (order by RowID) Diff2
from
Table1
Notice how this gets you similar groupings, except with distinct Diff values (in Diff2):
| VALUE | ROWID | DIFF | DIFF2 |
|-------|-------|--------|-------|
| 153 | 48 | 1 | 47 |
| 68 | 49 | 1 | 47 |
| 50 | 57 | (null) | 54 |
| 75 | 58 | 1 | 54 |
| 65 | 59 | 1 | 54 |
| 70 | 63 | (null) | 57 |
| 66 | 64 | 1 | 57 |
| 79 | 66 | (null) | 58 |
| 73 | 67 | 1 | 58 |
| 82 | 68 | 1 | 58 |
| 85 | 69 | 1 | 58 |
| 66 | 70 | 1 | 58 |
| 118 | 88 | (null) | 75 |
| 69 | 89 | 1 | 75 |
| 67 | 90 | 1 | 75 |
| 178 | 91 | 1 | 75 |
Then to get ordered values for Diff, you can use Dense_Rank() to produce a numbering over each separate partition - except when a value is Null:
select
value,
rowID,
case when Diff = 1
then dense_rank() over (order by Diff2)
else Diff end as Diff
from (
select
value,
rowID,
Diff,
RowID - row_number() over (order by RowID) Diff2
from
Table1
) T
The result is the expected result, except keyed off of RowID directly rather than off of the existing Diff column.
| VALUE | ROWID | DIFF |
|-------|-------|--------|
| 153 | 48 | 1 |
| 68 | 49 | 1 |
| 50 | 57 | (null) |
| 75 | 58 | 2 |
| 65 | 59 | 2 |
| 70 | 63 | (null) |
| 66 | 64 | 3 |
| 79 | 66 | (null) |
| 73 | 67 | 4 |
| 82 | 68 | 4 |
| 85 | 69 | 4 |
| 66 | 70 | 4 |
| 118 | 88 | (null) |
| 69 | 89 | 5 |
| 67 | 90 | 5 |
| 178 | 91 | 5 |