I have a group-by-top-n-results query that is shown in the example input data. The subgroups (grouped by ID) are limited to the top 10 results and they are sorted ASC by rank. How do I go from the input example to the output example?
I was thinking it might be some sort of pivot function or a crosstab solution, or maybe it needs to be joined on itself. I'm just not sure exactly how to make this work. It's almost as if we concatenate each subgroup on its own row.
Each subgroup can have a maximum of 10 top results, but may also not have the full 10. In the example, the subgroup 1003 & 1007 do not have results past the top 6 and past the top 3 respectively. The schema in the example output is what I am looking for with all fields of 10 possible top-ranked rows as columns.
Example input data
+------+------+----------------------+---------+---------+-------+---------+---------+
| rank | ID | NAME | FIELD 1 | FIELD 2 | MAIN | FIELD 3 | FIELD 4 |
+------+------+----------------------+---------+---------+-------+---------+---------+
| 1 | 1001 | Backdash | 123053 | 2 | 21.1 | 17.09 | 20 |
| 2 | 1001 | cold Stone Creamery | 115404 | 2 | 19.78 | 1.04 | 0.93 |
| 3 | 1001 | Mado | 97650 | 2 | 16.74 | 0.1 | 0.14 |
| 4 | 1001 | Friendly's | 57638 | 1 | 9.88 | 0.21 | 0.4 |
| 5 | 1001 | Fosters Freeze | 53187 | 2 | 9.12 | 0.24 | 1 |
| 6 | 1001 | Marble Slab Creamery | 51381 | 2 | 8.81 | 15.75 | 28.57 |
| 7 | 1001 | Lappert's | 35929 | 1 | 6.16 | 0.01 | 0.04 |
| 8 | 1001 | Greater's | 23050 | 1 | 3.95 | 0.01 | 0.05 |
| 9 | 1001 | Happy Joe's | 20422 | 1 | 3.5 | 12.73 | 25 |
| 10 | 1001 | Shake Shack | 4260 | 1 | 0.73 | 8.1 | 50 |
| 1 | 1003 | Mauds Ice Cream | 949152 | 11 | 22.29 | 0.98 | 0.75 |
| 2 | 1003 | Mr Whippy | 433590 | 5 | 10.18 | 0.61 | 0.78 |
| 3 | 1003 | New Zeland Natural | 411348 | 7 | 9.66 | 0.03 | 0.12 |
| 4 | 1003 | Tropical Sno | 394558 | 10 | 9.27 | 0.15 | 0.4 |
| 5 | 1003 | Culver's | 375755 | 5 | 8.82 | 3.47 | 2.96 |
| 6 | 1003 | Marble Slab Creamery | 276971 | 7 | 6.5 | 13.05 | 12.07 |
| 1 | 1007 | Greater's | 105866 | 2 | 58.96 | 19.91 | 12.5 |
| 2 | 1007 | Hagan-Daz | 37697 | 3 | 20.99 | 26.17 | 37.5 |
| 3 | 1007 | cold Stone Creamery | 25520 | 1 | 14.21 | 0.23 | 0.47 |
+------+------+----------------------+---------+---------+-------+---------+---------+
Example Output Format
+------+-----------------+---------------+----------------+-------------+----------------+-----------------+---------------------+----------------+----------------+-------------+----------------+-----------------+---------------------+----------------+----------------+-------------+----------------+-----------------+--------------+----------------+----------------+-------------+----------------+-----------------+----------------+----------------+----------------+-------------+----------------+-----------------+----------------------+----------------+----------------+-------------+----------------+-----------------+-------------+----------------+----------------+-------------+----------------+-----------------+-------------+----------------+----------------+-------------+----------------+-----------------+-------------+----------------+----------------+-------------+----------------+-----------------+--------------+-----------------+-----------------+--------------+-----------------+------------------+
| ID | TOP 1 NAME | TOP 1 FIELD 1 | TOP 1 FIELD 2 | TOP 1 MAIN | TOP 1 FIELD 3 | TOP 1 FIELD 4 | TOP 2 NAME | TOP 2 FIELD 1 | TOP 2 FIELD 2 | TOP 2 MAIN | TOP 2 FIELD 3 | TOP 2 FIELD 4 | TOP 3 NAME | TOP 3 FIELD 1 | TOP 3 FIELD 2 | TOP 3 MAIN | TOP 3 FIELD 3 | TOP 3 FIELD 4 | TOP 4 NAME | TOP 4 FIELD 1 | TOP 4 FIELD 2 | TOP 4 MAIN | TOP 4 FIELD 3 | TOP 4 FIELD 4 | TOP 5 NAME | TOP 5 FIELD 1 | TOP 5 FIELD 2 | TOP 5 MAIN | TOP 5 FIELD 3 | TOP 5 FIELD 4 | TOP 6 NAME | TOP 6 FIELD 1 | TOP 6 FIELD 2 | TOP 6 MAIN | TOP 6 FIELD 3 | TOP 6 FIELD 4 | TOP 7 NAME | TOP 7 FIELD 1 | TOP 7 FIELD 2 | TOP 7 MAIN | TOP 7 FIELD 3 | TOP 7 FIELD 4 | TOP 8 NAME | TOP 8 FIELD 1 | TOP 8 FIELD 2 | TOP 8 MAIN | TOP 8 FIELD 3 | TOP 8 FIELD 4 | TOP 9 NAME | TOP 9 FIELD 1 | TOP 9 FIELD 2 | TOP 9 MAIN | TOP 9 FIELD 3 | TOP 9 FIELD 4 | TOP 10 NAME | TOP 10 FIELD 1 | TOP 10 FIELD 2 | TOP 10 MAIN | TOP 10 FIELD 3 | TOP 10 FIELD 4 |
+------+-----------------+---------------+----------------+-------------+----------------+-----------------+---------------------+----------------+----------------+-------------+----------------+-----------------+---------------------+----------------+----------------+-------------+----------------+-----------------+--------------+----------------+----------------+-------------+----------------+-----------------+----------------+----------------+----------------+-------------+----------------+-----------------+----------------------+----------------+----------------+-------------+----------------+-----------------+-------------+----------------+----------------+-------------+----------------+-----------------+-------------+----------------+----------------+-------------+----------------+-----------------+-------------+----------------+----------------+-------------+----------------+-----------------+--------------+-----------------+-----------------+--------------+-----------------+------------------+
| 1001 | Backdash | 123053 | 2 | 21.1 | 17.09 | 20 | cold Stone Creamery | 115404 | 2 | 19.78 | 1.04 | 0.93 | Mado | 97650 | 2 | 16.74 | 0.1 | 0.14 | Friendly's | 57638 | 1 | 9.88 | 0.21 | 0.4 | Fosters Freeze | 53187 | 2 | 9.12 | 0.24 | 1 | Marble Slab Creamery | 51381 | 2 | 8.81 | 15.75 | 28.57 | Lappert's | 35929 | 1 | 6.16 | 0.01 | 0.04 | Greater's | 23050 | 1 | 3.95 | 0.01 | 0.05 | Happy Joe's | 20422 | 1 | 3.5 | 12.73 | 25 | Shake Shack | 4260 | 1 | 0.73 | 8.1 | 50 |
| 1003 | Mauds Ice Cream | 949152 | 11 | 22.29 | 0.98 | 0.75 | Mr Whippy | 433590 | 5 | 10.18 | 0.61 | 0.78 | New Zeland Natural | 411348 | 7 | 9.66 | 0.03 | 0.12 | Tropical Sno | 394558 | 10 | 9.27 | 0.15 | 0.4 | Culver's | 375755 | 5 | 8.82 | 3.47 | 2.96 | Marble Slab Creamery | 276971 | 7 | 6.5 | 13.05 | 12.07 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| 1007 | Greater's | 105866 | 2 | 58.96 | 19.91 | 12.5 | Hagan-Daz | 37697 | 3 | 20.99 | 26.17 | 37.5 | cold Stone Creamery | 25520 | 1 | 14.21 | 0.23 | 0.47 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
+------+-----------------+---------------+----------------+-------------+----------------+-----------------+---------------------+----------------+----------------+-------------+----------------+-----------------+---------------------+----------------+----------------+-------------+----------------+-----------------+--------------+----------------+----------------+-------------+----------------+-----------------+----------------+----------------+----------------+-------------+----------------+-----------------+----------------------+----------------+----------------+-------------+----------------+-----------------+-------------+----------------+----------------+-------------+----------------+-----------------+-------------+----------------+----------------+-------------+----------------+-----------------+-------------+----------------+----------------+-------------+----------------+-----------------+--------------+-----------------+-----------------+--------------+-----------------+------------------+
PS.
I am coming from a coding mindset, and haven't done SQL in a while. In sudocode this could be solved with a simple nested for-loop. Eg.
foreach subgroup i
foreach rank j <= 10
print i, rank[j].name, rank[j].FIELD 1, rank[j].FIELD 2, rank[j].MAIN, rank[j].FIELD 3, rank[j].FIELD 4
print \r\n
You can use conditional aggregation:
select id,
max(case when rank = 1 then name end) as name_1,
max(case when rank = 1 then field1 end) as field1_1,
. . .
max(case when rank = 2 then name end) as name_2,
max(case when rank = 2 then field1 end) as field1_2,
. . .
max(case when rank = 10 then field3 end) as field3_10,
max(case when rank = 10 then field4 end) as field4_10
from inputdata id
group by id;
I have a dataset structured such as the one below stored in Hive, call it df:
+-----+-----+----------+--------+
| id1 | id2 | date | amount |
+-----+-----+----------+--------+
| 1 | 2 | 11-07-17 | 0.93 |
| 2 | 2 | 11-11-17 | 1.94 |
| 2 | 2 | 11-09-17 | 1.90 |
| 1 | 1 | 11-10-17 | 0.33 |
| 2 | 2 | 11-10-17 | 1.93 |
| 1 | 1 | 11-07-17 | 0.25 |
| 1 | 1 | 11-09-17 | 0.33 |
| 1 | 1 | 11-12-17 | 0.33 |
| 2 | 2 | 11-08-17 | 1.90 |
| 1 | 1 | 11-08-17 | 0.30 |
| 2 | 2 | 11-12-17 | 2.01 |
| 1 | 2 | 11-12-17 | 1.00 |
| 1 | 2 | 11-09-17 | 0.94 |
| 2 | 2 | 11-07-17 | 1.94 |
| 1 | 2 | 11-11-17 | 1.92 |
| 1 | 1 | 11-11-17 | 0.33 |
| 1 | 2 | 11-10-17 | 1.92 |
| 1 | 2 | 11-08-17 | 0.94 |
+-----+-----+----------+--------+
I wish to partition by id1 and id2, and then order by date descending within each grouping of id1 and id2, and then rank "amount" within that, where the same "amount" on consecutive days would receive the same rank. The ordered and ranked output I'd hope to see is shown here:
+-----+-----+------------+--------+------+
| id1 | id2 | date | amount | rank |
+-----+-----+------------+--------+------+
| 1 | 1 | 2017-11-12 | 0.33 | 1 |
| 1 | 1 | 2017-11-11 | 0.33 | 1 |
| 1 | 1 | 2017-11-10 | 0.33 | 1 |
| 1 | 1 | 2017-11-09 | 0.33 | 1 |
| 1 | 1 | 2017-11-08 | 0.30 | 2 |
| 1 | 1 | 2017-11-07 | 0.25 | 3 |
| 1 | 2 | 2017-11-12 | 1.00 | 1 |
| 1 | 2 | 2017-11-11 | 1.92 | 2 |
| 1 | 2 | 2017-11-10 | 1.92 | 2 |
| 1 | 2 | 2017-11-09 | 0.94 | 3 |
| 1 | 2 | 2017-11-08 | 0.94 | 3 |
| 1 | 2 | 2017-11-07 | 0.93 | 4 |
| 2 | 2 | 2017-11-12 | 2.01 | 1 |
| 2 | 2 | 2017-11-11 | 1.94 | 2 |
| 2 | 2 | 2017-11-10 | 1.93 | 3 |
| 2 | 2 | 2017-11-09 | 1.90 | 4 |
| 2 | 2 | 2017-11-08 | 1.90 | 4 |
| 2 | 2 | 2017-11-07 | 1.94 | 5 |
+-----+-----+------------+--------+------+
I attempted this with the following SQL query:
SELECT
id1,
id2,
date,
amount,
dense_rank() OVER (PARTITION BY id1, id2 ORDER BY date DESC) AS rank
FROM
df
GROUP BY
id1,
id2,
date,
amount
But that query doesn't seem to be doing what I'd like it to as I'm not receiving the output I'm looking for.
It seems like a window function using dense_rank, partition by and order by is what I need but I can't quite seem to get it to give me that sample output that I desire. Any help would be much appreciated! Thanks!
This is quite tricky. I think you need to use lag() to see where the value changes and then do a cumulative sum:
select df.*,
sum(case when prev_amount = amount then 0 else 1 end) over
(partition by id1, id2 order by date desc) as rank
from (select df.*,
lag(amount) over (partition by id1, id2 order by date desc) as prev_amount
from df
) df;
Given this pandas dataframe with two columns, 'Values' and 'Intervals'. How do I get a third column 'MinMax' indicating whether the value is a maximum or a minimum within that interval? The challenge for me is that the interval length and the distance between intervals are not fixed, therefore I post the question.
import pandas as pd
import numpy as np
data = pd.DataFrame([
[1879.289,np.nan],[1879.281,np.nan],[1879.292,1],[1879.295,1],[1879.481,1],[1879.294,1],[1879.268,1],
[1879.293,1],[1879.277,1],[1879.285,1],[1879.464,1],[1879.475,1],[1879.971,1],[1879.779,1],
[1879.986,1],[1880.791,1],[1880.29,1],[1879.253,np.nan],[1878.268,np.nan],[1875.73,1],[1876.792,1],
[1875.977,1],[1876.408,1],[1877.159,1],[1877.187,1],[1883.164,1],[1883.171,1],[1883.495,1],
[1883.962,1],[1885.158,1],[1885.974,1],[1886.479,np.nan],[1885.969,np.nan],[1884.693,1],[1884.977,1],
[1884.967,1],[1884.691,1],[1886.171,1],[1886.166,np.nan],[1884.476,np.nan],[1884.66,1],[1882.962,1],
[1881.496,1],[1871.163,1],[1874.985,1],[1874.979,1],[1871.173,np.nan],[1871.973,np.nan],[1871.682,np.nan],
[1872.476,np.nan],[1882.361,1],[1880.869,1],[1882.165,1],[1881.857,1],[1880.375,1],[1880.66,1],
[1880.891,1],[1880.377,1],[1881.663,1],[1881.66,1],[1877.888,1],[1875.69,1],[1875.161,1],
[1876.697,np.nan],[1876.671,np.nan],[1879.666,np.nan],[1877.182,np.nan],[1878.898,1],[1878.668,1],[1878.871,1],
[1878.882,1],[1879.173,1],[1878.887,1],[1878.68,1],[1878.872,1],[1878.677,1],[1877.877,1],
[1877.669,1],[1877.69,1],[1877.684,1],[1877.68,1],[1877.885,1],[1877.863,1],[1877.674,1],
[1877.676,1],[1877.687,1],[1878.367,1],[1878.179,1],[1877.696,1],[1877.665,1],[1877.667,np.nan],
[1878.678,np.nan],[1878.661,1],[1878.171,1],[1877.371,1],[1877.359,1],[1878.381,1],[1875.185,1],
[1875.367,np.nan],[1865.492,np.nan],[1865.495,1],[1866.995,1],[1866.672,1],[1867.465,1],[1867.663,1],
[1867.186,1],[1867.687,1],[1867.459,1],[1867.168,1],[1869.689,1],[1869.693,1],[1871.676,1],
[1873.174,1],[1873.691,np.nan],[1873.685,np.nan]
])
In the third column below you can see where the max and min is for each interval.
+-------+----------+-----------+---------+
| index | Value | Intervals | Min/Max |
+-------+----------+-----------+---------+
| 0 | 1879.289 | np.nan | |
| 1 | 1879.281 | np.nan | |
| 2 | 1879.292 | 1 | |
| 3 | 1879.295 | 1 | |
| 4 | 1879.481 | 1 | |
| 5 | 1879.294 | 1 | |
| 6 | 1879.268 | 1 | min |
| 7 | 1879.293 | 1 | |
| 8 | 1879.277 | 1 | |
| 9 | 1879.285 | 1 | |
| 10 | 1879.464 | 1 | |
| 11 | 1879.475 | 1 | |
| 12 | 1879.971 | 1 | |
| 13 | 1879.779 | 1 | |
| 17 | 1879.986 | 1 | |
| 18 | 1880.791 | 1 | max |
| 19 | 1880.29 | 1 | |
| 55 | 1879.253 | np.nan | |
| 56 | 1878.268 | np.nan | |
| 57 | 1875.73 | 1 | |
| 58 | 1876.792 | 1 | |
| 59 | 1875.977 | 1 | min |
| 60 | 1876.408 | 1 | |
| 61 | 1877.159 | 1 | |
| 62 | 1877.187 | 1 | |
| 63 | 1883.164 | 1 | |
| 64 | 1883.171 | 1 | |
| 65 | 1883.495 | 1 | |
| 66 | 1883.962 | 1 | |
| 67 | 1885.158 | 1 | |
| 68 | 1885.974 | 1 | max |
| 69 | 1886.479 | np.nan | |
| 70 | 1885.969 | np.nan | |
| 71 | 1884.693 | 1 | |
| 72 | 1884.977 | 1 | |
| 73 | 1884.967 | 1 | |
| 74 | 1884.691 | 1 | min |
| 75 | 1886.171 | 1 | max |
| 76 | 1886.166 | np.nan | |
| 77 | 1884.476 | np.nan | |
| 78 | 1884.66 | 1 | max |
| 79 | 1882.962 | 1 | |
| 80 | 1881.496 | 1 | |
| 81 | 1871.163 | 1 | min |
| 82 | 1874.985 | 1 | |
| 83 | 1874.979 | 1 | |
| 84 | 1871.173 | np.nan | |
| 85 | 1871.973 | np.nan | |
| 86 | 1871.682 | np.nan | |
| 87 | 1872.476 | np.nan | |
| 88 | 1882.361 | 1 | max |
| 89 | 1880.869 | 1 | |
| 90 | 1882.165 | 1 | |
| 91 | 1881.857 | 1 | |
| 92 | 1880.375 | 1 | |
| 93 | 1880.66 | 1 | |
| 94 | 1880.891 | 1 | |
| 95 | 1880.377 | 1 | |
| 96 | 1881.663 | 1 | |
| 97 | 1881.66 | 1 | |
| 98 | 1877.888 | 1 | |
| 99 | 1875.69 | 1 | |
| 100 | 1875.161 | 1 | min |
| 101 | 1876.697 | np.nan | |
| 102 | 1876.671 | np.nan | |
| 103 | 1879.666 | np.nan | |
| 111 | 1877.182 | np.nan | |
| 112 | 1878.898 | 1 | |
| 113 | 1878.668 | 1 | |
| 114 | 1878.871 | 1 | |
| 115 | 1878.882 | 1 | |
| 116 | 1879.173 | 1 | max |
| 117 | 1878.887 | 1 | |
| 118 | 1878.68 | 1 | |
| 119 | 1878.872 | 1 | |
| 120 | 1878.677 | 1 | |
| 121 | 1877.877 | 1 | |
| 122 | 1877.669 | 1 | |
| 123 | 1877.69 | 1 | |
| 124 | 1877.684 | 1 | |
| 125 | 1877.68 | 1 | |
| 126 | 1877.885 | 1 | |
| 127 | 1877.863 | 1 | |
| 128 | 1877.674 | 1 | |
| 129 | 1877.676 | 1 | |
| 130 | 1877.687 | 1 | |
| 131 | 1878.367 | 1 | |
| 132 | 1878.179 | 1 | |
| 133 | 1877.696 | 1 | |
| 134 | 1877.665 | 1 | min |
| 135 | 1877.667 | np.nan | |
| 136 | 1878.678 | np.nan | |
| 137 | 1878.661 | 1 | max |
| 138 | 1878.171 | 1 | |
| 139 | 1877.371 | 1 | |
| 140 | 1877.359 | 1 | |
| 141 | 1878.381 | 1 | |
| 142 | 1875.185 | 1 | min |
| 143 | 1875.367 | np.nan | |
| 144 | 1865.492 | np.nan | |
| 145 | 1865.495 | 1 | max |
| 146 | 1866.995 | 1 | |
| 147 | 1866.672 | 1 | |
| 148 | 1867.465 | 1 | |
| 149 | 1867.663 | 1 | |
| 150 | 1867.186 | 1 | |
| 151 | 1867.687 | 1 | |
| 152 | 1867.459 | 1 | |
| 153 | 1867.168 | 1 | |
| 154 | 1869.689 | 1 | |
| 155 | 1869.693 | 1 | |
| 156 | 1871.676 | 1 | |
| 157 | 1873.174 | 1 | min |
| 158 | 1873.691 | np.nan | |
| 159 | 1873.685 | np.nan | |
+-------+----------+-----------+---------+
isnull = data.iloc[:, 1].isnull()
minmax = data.groupby(isnull.cumsum()[~isnull])[0].agg(['idxmax', 'idxmin'])
data.loc[minmax['idxmax'], 'MinMax'] = 'max'
data.loc[minmax['idxmin'], 'MinMax'] = 'min'
data.MinMax = data.MinMax.fillna('')
print(data)
0 1 MinMax
0 1879.289 NaN
1 1879.281 NaN
2 1879.292 1.0
3 1879.295 1.0
4 1879.481 1.0
5 1879.294 1.0
6 1879.268 1.0 min
7 1879.293 1.0
8 1879.277 1.0
9 1879.285 1.0
10 1879.464 1.0
11 1879.475 1.0
12 1879.971 1.0
13 1879.779 1.0
14 1879.986 1.0
15 1880.791 1.0 max
16 1880.290 1.0
17 1879.253 NaN
18 1878.268 NaN
19 1875.730 1.0 min
20 1876.792 1.0
21 1875.977 1.0
22 1876.408 1.0
23 1877.159 1.0
24 1877.187 1.0
25 1883.164 1.0
26 1883.171 1.0
27 1883.495 1.0
28 1883.962 1.0
29 1885.158 1.0
.. ... ... ...
85 1877.687 1.0
86 1878.367 1.0
87 1878.179 1.0
88 1877.696 1.0
89 1877.665 1.0 min
90 1877.667 NaN
91 1878.678 NaN
92 1878.661 1.0 max
93 1878.171 1.0
94 1877.371 1.0
95 1877.359 1.0
96 1878.381 1.0
97 1875.185 1.0 min
98 1875.367 NaN
99 1865.492 NaN
100 1865.495 1.0 min
101 1866.995 1.0
102 1866.672 1.0
103 1867.465 1.0
104 1867.663 1.0
105 1867.186 1.0
106 1867.687 1.0
107 1867.459 1.0
108 1867.168 1.0
109 1869.689 1.0
110 1869.693 1.0
111 1871.676 1.0
112 1873.174 1.0 max
113 1873.691 NaN
114 1873.685 NaN
[115 rows x 3 columns]
data.columns=['Value','Interval']
data['Ingroup'] = (data['Interval'].notnull() + 0)
Use data['Interval'].notnull() to separate the groups...
Use cumsum() to number them with `groupno`...
Use groupby(groupno)..
Finally you want something using apply/idxmax/idxmin to label the max/min
But of course a for-loop as you suggested is the non-Pythonic but possibly simpler hack.