Pandas column backfill with a decreasing / increasing sequence

I have a DataFrame:
| ind  | A   | B      |
|------|-----|--------|
| 1.01 | 10  | -1.734 |
| 1.04 | 10  | -1.244 |
| 1.05 | 10  | 0.016  |
| 1.11 | NaN | -2.737 | <-
| 1.13 | NaN | -4.232 | <-
| 1.19 | 11  | -3.241 | <=
| 1.20 | 12  | -2.832 |
| 1.21 | 10  | -4.277 |
and would like to back-fill the NaN values with a decreasing sequence that ends at the next valid value:
| ind  | A   | B      |
|------|-----|--------|
| 1.01 | 10  | -1.734 |
| 1.04 | 10  | -1.244 |
| 1.05 | 10  | 0.016  |
| 1.11 | 13  | -2.737 | <-
| 1.13 | 12  | -4.232 | <-
| 1.19 | 11  | -3.241 | <=
| 1.20 | 12  | -2.832 |
| 1.21 | 10  | -4.277 |
Is there a way to do this?

Get the positions where NaNs are found:
positions = df['A'].isna().astype(int)
| positions |
--------------
| 0 |
| 0 |
| 0 |
| 1 |
| 1 |
| 0 |
| 0 |
| 0 |
then doing a reverse cumulative sum (the mask is kept boolean here, since .where(~mask) needs a boolean condition):
mask = df['A'].isna().loc[::-1]
cumSum = mask.cumsum()
posCumSum = (cumSum - cumSum.where(~mask).ffill().fillna(0)).astype(int).loc[::-1]
| posCumSum |
--------------
| 0 |
| 0 |
| 0 |
| 2 |
| 1 |
| 0 |
| 0 |
| 0 |
Adding it to the back-filled original column:
df['A'] = df['A'].bfill() + posCumSum
| ind  | A   | B      |
|------|-----|--------|
| 1.01 | 10  | -1.734 |
| 1.04 | 10  | -1.244 |
| 1.05 | 10  | 0.016  |
| 1.11 | 13  | -2.737 | <-
| 1.13 | 12  | -4.232 | <-
| 1.19 | 11  | -3.241 | <=
| 1.20 | 12  | -2.832 |
| 1.21 | 10  | -4.277 |
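For reference, the whole approach as a single runnable sketch (the frame below is rebuilt from the example values shown above):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'A': [10, 10, 10, np.nan, np.nan, 11, 12, 10],
     'B': [-1.734, -1.244, 0.016, -2.737, -4.232, -3.241, -2.832, -4.277]},
    index=[1.01, 1.04, 1.05, 1.11, 1.13, 1.19, 1.20, 1.21],
)

mask = df['A'].isna().loc[::-1]                  # reversed boolean NaN mask
cumSum = mask.cumsum()
# distance (in rows) from each NaN to the next valid value below it
posCumSum = (cumSum - cumSum.where(~mask).ffill().fillna(0)).astype(int).loc[::-1]

df['A'] = df['A'].bfill() + posCumSum            # the NaNs become 13 and 12 above the 11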

Related

How do I transpose or pivot subgroups into a single row?

I have a group-by-top-n-results query whose results are shown in the example input data. The subgroups (grouped by ID) are limited to the top 10 results and are sorted ASC by rank. How do I go from the input example to the output example?
I was thinking it might be some sort of pivot function or a crosstab solution, or maybe it needs to be joined on itself. I'm just not sure exactly how to make this work. It's almost as if we concatenate each subgroup on its own row.
Each subgroup can have a maximum of 10 top results, but may not have the full 10. In the example, subgroups 1003 and 1007 do not have results past the top 6 and the top 3, respectively. The schema in the example output is what I am looking for, with the fields of all 10 possible top-ranked rows as columns.
Example input data
+------+------+----------------------+---------+---------+-------+---------+---------+
| rank | ID | NAME | FIELD 1 | FIELD 2 | MAIN | FIELD 3 | FIELD 4 |
+------+------+----------------------+---------+---------+-------+---------+---------+
| 1 | 1001 | Backdash | 123053 | 2 | 21.1 | 17.09 | 20 |
| 2 | 1001 | cold Stone Creamery | 115404 | 2 | 19.78 | 1.04 | 0.93 |
| 3 | 1001 | Mado | 97650 | 2 | 16.74 | 0.1 | 0.14 |
| 4 | 1001 | Friendly's | 57638 | 1 | 9.88 | 0.21 | 0.4 |
| 5 | 1001 | Fosters Freeze | 53187 | 2 | 9.12 | 0.24 | 1 |
| 6 | 1001 | Marble Slab Creamery | 51381 | 2 | 8.81 | 15.75 | 28.57 |
| 7 | 1001 | Lappert's | 35929 | 1 | 6.16 | 0.01 | 0.04 |
| 8 | 1001 | Greater's | 23050 | 1 | 3.95 | 0.01 | 0.05 |
| 9 | 1001 | Happy Joe's | 20422 | 1 | 3.5 | 12.73 | 25 |
| 10 | 1001 | Shake Shack | 4260 | 1 | 0.73 | 8.1 | 50 |
| 1 | 1003 | Mauds Ice Cream | 949152 | 11 | 22.29 | 0.98 | 0.75 |
| 2 | 1003 | Mr Whippy | 433590 | 5 | 10.18 | 0.61 | 0.78 |
| 3 | 1003 | New Zeland Natural | 411348 | 7 | 9.66 | 0.03 | 0.12 |
| 4 | 1003 | Tropical Sno | 394558 | 10 | 9.27 | 0.15 | 0.4 |
| 5 | 1003 | Culver's | 375755 | 5 | 8.82 | 3.47 | 2.96 |
| 6 | 1003 | Marble Slab Creamery | 276971 | 7 | 6.5 | 13.05 | 12.07 |
| 1 | 1007 | Greater's | 105866 | 2 | 58.96 | 19.91 | 12.5 |
| 2 | 1007 | Hagan-Daz | 37697 | 3 | 20.99 | 26.17 | 37.5 |
| 3 | 1007 | cold Stone Creamery | 25520 | 1 | 14.21 | 0.23 | 0.47 |
+------+------+----------------------+---------+---------+-------+---------+---------+
Example Output Format
+------+-----------------+---------------+----------------+-------------+----------------+-----------------+---------------------+----------------+----------------+-------------+----------------+-----------------+---------------------+----------------+----------------+-------------+----------------+-----------------+--------------+----------------+----------------+-------------+----------------+-----------------+----------------+----------------+----------------+-------------+----------------+-----------------+----------------------+----------------+----------------+-------------+----------------+-----------------+-------------+----------------+----------------+-------------+----------------+-----------------+-------------+----------------+----------------+-------------+----------------+-----------------+-------------+----------------+----------------+-------------+----------------+-----------------+--------------+-----------------+-----------------+--------------+-----------------+------------------+
| ID | TOP 1 NAME | TOP 1 FIELD 1 | TOP 1 FIELD 2 | TOP 1 MAIN | TOP 1 FIELD 3 | TOP 1 FIELD 4 | TOP 2 NAME | TOP 2 FIELD 1 | TOP 2 FIELD 2 | TOP 2 MAIN | TOP 2 FIELD 3 | TOP 2 FIELD 4 | TOP 3 NAME | TOP 3 FIELD 1 | TOP 3 FIELD 2 | TOP 3 MAIN | TOP 3 FIELD 3 | TOP 3 FIELD 4 | TOP 4 NAME | TOP 4 FIELD 1 | TOP 4 FIELD 2 | TOP 4 MAIN | TOP 4 FIELD 3 | TOP 4 FIELD 4 | TOP 5 NAME | TOP 5 FIELD 1 | TOP 5 FIELD 2 | TOP 5 MAIN | TOP 5 FIELD 3 | TOP 5 FIELD 4 | TOP 6 NAME | TOP 6 FIELD 1 | TOP 6 FIELD 2 | TOP 6 MAIN | TOP 6 FIELD 3 | TOP 6 FIELD 4 | TOP 7 NAME | TOP 7 FIELD 1 | TOP 7 FIELD 2 | TOP 7 MAIN | TOP 7 FIELD 3 | TOP 7 FIELD 4 | TOP 8 NAME | TOP 8 FIELD 1 | TOP 8 FIELD 2 | TOP 8 MAIN | TOP 8 FIELD 3 | TOP 8 FIELD 4 | TOP 9 NAME | TOP 9 FIELD 1 | TOP 9 FIELD 2 | TOP 9 MAIN | TOP 9 FIELD 3 | TOP 9 FIELD 4 | TOP 10 NAME | TOP 10 FIELD 1 | TOP 10 FIELD 2 | TOP 10 MAIN | TOP 10 FIELD 3 | TOP 10 FIELD 4 |
+------+-----------------+---------------+----------------+-------------+----------------+-----------------+---------------------+----------------+----------------+-------------+----------------+-----------------+---------------------+----------------+----------------+-------------+----------------+-----------------+--------------+----------------+----------------+-------------+----------------+-----------------+----------------+----------------+----------------+-------------+----------------+-----------------+----------------------+----------------+----------------+-------------+----------------+-----------------+-------------+----------------+----------------+-------------+----------------+-----------------+-------------+----------------+----------------+-------------+----------------+-----------------+-------------+----------------+----------------+-------------+----------------+-----------------+--------------+-----------------+-----------------+--------------+-----------------+------------------+
| 1001 | Backdash | 123053 | 2 | 21.1 | 17.09 | 20 | cold Stone Creamery | 115404 | 2 | 19.78 | 1.04 | 0.93 | Mado | 97650 | 2 | 16.74 | 0.1 | 0.14 | Friendly's | 57638 | 1 | 9.88 | 0.21 | 0.4 | Fosters Freeze | 53187 | 2 | 9.12 | 0.24 | 1 | Marble Slab Creamery | 51381 | 2 | 8.81 | 15.75 | 28.57 | Lappert's | 35929 | 1 | 6.16 | 0.01 | 0.04 | Greater's | 23050 | 1 | 3.95 | 0.01 | 0.05 | Happy Joe's | 20422 | 1 | 3.5 | 12.73 | 25 | Shake Shack | 4260 | 1 | 0.73 | 8.1 | 50 |
| 1003 | Mauds Ice Cream | 949152 | 11 | 22.29 | 0.98 | 0.75 | Mr Whippy | 433590 | 5 | 10.18 | 0.61 | 0.78 | New Zeland Natural | 411348 | 7 | 9.66 | 0.03 | 0.12 | Tropical Sno | 394558 | 10 | 9.27 | 0.15 | 0.4 | Culver's | 375755 | 5 | 8.82 | 3.47 | 2.96 | Marble Slab Creamery | 276971 | 7 | 6.5 | 13.05 | 12.07 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| 1007 | Greater's | 105866 | 2 | 58.96 | 19.91 | 12.5 | Hagan-Daz | 37697 | 3 | 20.99 | 26.17 | 37.5 | cold Stone Creamery | 25520 | 1 | 14.21 | 0.23 | 0.47 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
+------+-----------------+---------------+----------------+-------------+----------------+-----------------+---------------------+----------------+----------------+-------------+----------------+-----------------+---------------------+----------------+----------------+-------------+----------------+-----------------+--------------+----------------+----------------+-------------+----------------+-----------------+----------------+----------------+----------------+-------------+----------------+-----------------+----------------------+----------------+----------------+-------------+----------------+-----------------+-------------+----------------+----------------+-------------+----------------+-----------------+-------------+----------------+----------------+-------------+----------------+-----------------+-------------+----------------+----------------+-------------+----------------+-----------------+--------------+-----------------+-----------------+--------------+-----------------+------------------+
PS.
I am coming from a coding mindset and haven't done SQL in a while. In pseudocode this could be solved with a simple nested for-loop, e.g.:
foreach subgroup i
foreach rank j <= 10
print i, rank[j].name, rank[j].FIELD 1, rank[j].FIELD 2, rank[j].MAIN, rank[j].FIELD 3, rank[j].FIELD 4
print \r\n
You can use conditional aggregation:
select id,
       max(case when rank = 1 then name end) as name_1,
       max(case when rank = 1 then field1 end) as field1_1,
       . . .
       max(case when rank = 2 then name end) as name_2,
       max(case when rank = 2 then field1 end) as field1_2,
       . . .
       max(case when rank = 10 then field3 end) as field3_10,
       max(case when rank = 10 then field4 end) as field4_10
from inputdata id
group by id;
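As a side note for anyone doing this outside SQL: the "pivot/crosstab" intuition from the question maps directly onto a pandas pivot. A rough sketch, assuming the example input sits in a DataFrame df with the same column names (an illustration only, not part of the SQL answer):
import pandas as pd

# df has columns ['rank', 'ID', 'NAME', 'FIELD 1', 'FIELD 2', 'MAIN', 'FIELD 3', 'FIELD 4']
wide = df.pivot(index='ID', columns='rank')
wide = wide.sort_index(axis=1, level='rank')              # group the columns as TOP 1, TOP 2, ...
wide.columns = [f'TOP {r} {c}' for c, r in wide.columns]  # flatten to "TOP <rank> <field>"
wide = wide.reset_index()                                 # missing ranks simply come out as NaN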

SQL - Window Functions with dense_rank()

I have a dataset, stored in Hive, structured like the one below; call it df:
+-----+-----+----------+--------+
| id1 | id2 | date | amount |
+-----+-----+----------+--------+
| 1 | 2 | 11-07-17 | 0.93 |
| 2 | 2 | 11-11-17 | 1.94 |
| 2 | 2 | 11-09-17 | 1.90 |
| 1 | 1 | 11-10-17 | 0.33 |
| 2 | 2 | 11-10-17 | 1.93 |
| 1 | 1 | 11-07-17 | 0.25 |
| 1 | 1 | 11-09-17 | 0.33 |
| 1 | 1 | 11-12-17 | 0.33 |
| 2 | 2 | 11-08-17 | 1.90 |
| 1 | 1 | 11-08-17 | 0.30 |
| 2 | 2 | 11-12-17 | 2.01 |
| 1 | 2 | 11-12-17 | 1.00 |
| 1 | 2 | 11-09-17 | 0.94 |
| 2 | 2 | 11-07-17 | 1.94 |
| 1 | 2 | 11-11-17 | 1.92 |
| 1 | 1 | 11-11-17 | 0.33 |
| 1 | 2 | 11-10-17 | 1.92 |
| 1 | 2 | 11-08-17 | 0.94 |
+-----+-----+----------+--------+
I wish to partition by id1 and id2, order by date descending within each (id1, id2) grouping, and then rank "amount" within that, where the same "amount" on consecutive days receives the same rank. The ordered and ranked output I'd hope to see is shown here:
+-----+-----+------------+--------+------+
| id1 | id2 | date | amount | rank |
+-----+-----+------------+--------+------+
| 1 | 1 | 2017-11-12 | 0.33 | 1 |
| 1 | 1 | 2017-11-11 | 0.33 | 1 |
| 1 | 1 | 2017-11-10 | 0.33 | 1 |
| 1 | 1 | 2017-11-09 | 0.33 | 1 |
| 1 | 1 | 2017-11-08 | 0.30 | 2 |
| 1 | 1 | 2017-11-07 | 0.25 | 3 |
| 1 | 2 | 2017-11-12 | 1.00 | 1 |
| 1 | 2 | 2017-11-11 | 1.92 | 2 |
| 1 | 2 | 2017-11-10 | 1.92 | 2 |
| 1 | 2 | 2017-11-09 | 0.94 | 3 |
| 1 | 2 | 2017-11-08 | 0.94 | 3 |
| 1 | 2 | 2017-11-07 | 0.93 | 4 |
| 2 | 2 | 2017-11-12 | 2.01 | 1 |
| 2 | 2 | 2017-11-11 | 1.94 | 2 |
| 2 | 2 | 2017-11-10 | 1.93 | 3 |
| 2 | 2 | 2017-11-09 | 1.90 | 4 |
| 2 | 2 | 2017-11-08 | 1.90 | 4 |
| 2 | 2 | 2017-11-07 | 1.94 | 5 |
+-----+-----+------------+--------+------+
I attempted this with the following SQL query:
SELECT
    id1,
    id2,
    date,
    amount,
    dense_rank() OVER (PARTITION BY id1, id2 ORDER BY date DESC) AS rank
FROM
    df
GROUP BY
    id1,
    id2,
    date,
    amount
But that query doesn't seem to be doing what I'd like it to as I'm not receiving the output I'm looking for.
It seems like a window function using dense_rank, partition by and order by is what I need but I can't quite seem to get it to give me that sample output that I desire. Any help would be much appreciated! Thanks!
This is quite tricky. I think you need to use lag() to see where the value changes and then do a cumulative sum:
select df.*,
       sum(case when prev_amount = amount then 0 else 1 end)
           over (partition by id1, id2 order by date desc) as rank
from (select df.*,
             lag(amount) over (partition by id1, id2 order by date desc) as prev_amount
      from df
     ) df;
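The same lag-then-cumulative-sum idea can be sanity-checked in pandas; a sketch for illustration only, assuming the sample rows are in a DataFrame df with a sortable date column:
import pandas as pd

# sort each (id1, id2) partition by date descending, flag rows where the amount
# differs from the previous (later) day, and take a running count of those flags
df = df.sort_values(['id1', 'id2', 'date'], ascending=[True, True, False])
prev_amount = df.groupby(['id1', 'id2'])['amount'].shift()       # lag()
changed = prev_amount.ne(df['amount']).astype(int)               # 1 on every change (and on the first row)
df['rank'] = changed.groupby([df['id1'], df['id2']]).cumsum()    # cumulative sum per partition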

Find the highest and lowest value locations within an interval on a column?

Given this pandas DataFrame with two columns, 'Values' and 'Intervals', how do I get a third column, 'MinMax', indicating whether the value is the maximum or the minimum within its interval? The challenge for me is that the interval length and the distance between intervals are not fixed, which is why I'm posting the question.
import pandas as pd
import numpy as np
data = pd.DataFrame([
[1879.289,np.nan],[1879.281,np.nan],[1879.292,1],[1879.295,1],[1879.481,1],[1879.294,1],[1879.268,1],
[1879.293,1],[1879.277,1],[1879.285,1],[1879.464,1],[1879.475,1],[1879.971,1],[1879.779,1],
[1879.986,1],[1880.791,1],[1880.29,1],[1879.253,np.nan],[1878.268,np.nan],[1875.73,1],[1876.792,1],
[1875.977,1],[1876.408,1],[1877.159,1],[1877.187,1],[1883.164,1],[1883.171,1],[1883.495,1],
[1883.962,1],[1885.158,1],[1885.974,1],[1886.479,np.nan],[1885.969,np.nan],[1884.693,1],[1884.977,1],
[1884.967,1],[1884.691,1],[1886.171,1],[1886.166,np.nan],[1884.476,np.nan],[1884.66,1],[1882.962,1],
[1881.496,1],[1871.163,1],[1874.985,1],[1874.979,1],[1871.173,np.nan],[1871.973,np.nan],[1871.682,np.nan],
[1872.476,np.nan],[1882.361,1],[1880.869,1],[1882.165,1],[1881.857,1],[1880.375,1],[1880.66,1],
[1880.891,1],[1880.377,1],[1881.663,1],[1881.66,1],[1877.888,1],[1875.69,1],[1875.161,1],
[1876.697,np.nan],[1876.671,np.nan],[1879.666,np.nan],[1877.182,np.nan],[1878.898,1],[1878.668,1],[1878.871,1],
[1878.882,1],[1879.173,1],[1878.887,1],[1878.68,1],[1878.872,1],[1878.677,1],[1877.877,1],
[1877.669,1],[1877.69,1],[1877.684,1],[1877.68,1],[1877.885,1],[1877.863,1],[1877.674,1],
[1877.676,1],[1877.687,1],[1878.367,1],[1878.179,1],[1877.696,1],[1877.665,1],[1877.667,np.nan],
[1878.678,np.nan],[1878.661,1],[1878.171,1],[1877.371,1],[1877.359,1],[1878.381,1],[1875.185,1],
[1875.367,np.nan],[1865.492,np.nan],[1865.495,1],[1866.995,1],[1866.672,1],[1867.465,1],[1867.663,1],
[1867.186,1],[1867.687,1],[1867.459,1],[1867.168,1],[1869.689,1],[1869.693,1],[1871.676,1],
[1873.174,1],[1873.691,np.nan],[1873.685,np.nan]
])
In the third column below you can see where the max and min are for each interval.
+-------+----------+-----------+---------+
| index | Value | Intervals | Min/Max |
+-------+----------+-----------+---------+
| 0 | 1879.289 | np.nan | |
| 1 | 1879.281 | np.nan | |
| 2 | 1879.292 | 1 | |
| 3 | 1879.295 | 1 | |
| 4 | 1879.481 | 1 | |
| 5 | 1879.294 | 1 | |
| 6 | 1879.268 | 1 | min |
| 7 | 1879.293 | 1 | |
| 8 | 1879.277 | 1 | |
| 9 | 1879.285 | 1 | |
| 10 | 1879.464 | 1 | |
| 11 | 1879.475 | 1 | |
| 12 | 1879.971 | 1 | |
| 13 | 1879.779 | 1 | |
| 17 | 1879.986 | 1 | |
| 18 | 1880.791 | 1 | max |
| 19 | 1880.29 | 1 | |
| 55 | 1879.253 | np.nan | |
| 56 | 1878.268 | np.nan | |
| 57 | 1875.73 | 1 | |
| 58 | 1876.792 | 1 | |
| 59 | 1875.977 | 1 | min |
| 60 | 1876.408 | 1 | |
| 61 | 1877.159 | 1 | |
| 62 | 1877.187 | 1 | |
| 63 | 1883.164 | 1 | |
| 64 | 1883.171 | 1 | |
| 65 | 1883.495 | 1 | |
| 66 | 1883.962 | 1 | |
| 67 | 1885.158 | 1 | |
| 68 | 1885.974 | 1 | max |
| 69 | 1886.479 | np.nan | |
| 70 | 1885.969 | np.nan | |
| 71 | 1884.693 | 1 | |
| 72 | 1884.977 | 1 | |
| 73 | 1884.967 | 1 | |
| 74 | 1884.691 | 1 | min |
| 75 | 1886.171 | 1 | max |
| 76 | 1886.166 | np.nan | |
| 77 | 1884.476 | np.nan | |
| 78 | 1884.66 | 1 | max |
| 79 | 1882.962 | 1 | |
| 80 | 1881.496 | 1 | |
| 81 | 1871.163 | 1 | min |
| 82 | 1874.985 | 1 | |
| 83 | 1874.979 | 1 | |
| 84 | 1871.173 | np.nan | |
| 85 | 1871.973 | np.nan | |
| 86 | 1871.682 | np.nan | |
| 87 | 1872.476 | np.nan | |
| 88 | 1882.361 | 1 | max |
| 89 | 1880.869 | 1 | |
| 90 | 1882.165 | 1 | |
| 91 | 1881.857 | 1 | |
| 92 | 1880.375 | 1 | |
| 93 | 1880.66 | 1 | |
| 94 | 1880.891 | 1 | |
| 95 | 1880.377 | 1 | |
| 96 | 1881.663 | 1 | |
| 97 | 1881.66 | 1 | |
| 98 | 1877.888 | 1 | |
| 99 | 1875.69 | 1 | |
| 100 | 1875.161 | 1 | min |
| 101 | 1876.697 | np.nan | |
| 102 | 1876.671 | np.nan | |
| 103 | 1879.666 | np.nan | |
| 111 | 1877.182 | np.nan | |
| 112 | 1878.898 | 1 | |
| 113 | 1878.668 | 1 | |
| 114 | 1878.871 | 1 | |
| 115 | 1878.882 | 1 | |
| 116 | 1879.173 | 1 | max |
| 117 | 1878.887 | 1 | |
| 118 | 1878.68 | 1 | |
| 119 | 1878.872 | 1 | |
| 120 | 1878.677 | 1 | |
| 121 | 1877.877 | 1 | |
| 122 | 1877.669 | 1 | |
| 123 | 1877.69 | 1 | |
| 124 | 1877.684 | 1 | |
| 125 | 1877.68 | 1 | |
| 126 | 1877.885 | 1 | |
| 127 | 1877.863 | 1 | |
| 128 | 1877.674 | 1 | |
| 129 | 1877.676 | 1 | |
| 130 | 1877.687 | 1 | |
| 131 | 1878.367 | 1 | |
| 132 | 1878.179 | 1 | |
| 133 | 1877.696 | 1 | |
| 134 | 1877.665 | 1 | min |
| 135 | 1877.667 | np.nan | |
| 136 | 1878.678 | np.nan | |
| 137 | 1878.661 | 1 | max |
| 138 | 1878.171 | 1 | |
| 139 | 1877.371 | 1 | |
| 140 | 1877.359 | 1 | |
| 141 | 1878.381 | 1 | |
| 142 | 1875.185 | 1 | min |
| 143 | 1875.367 | np.nan | |
| 144 | 1865.492 | np.nan | |
| 145 | 1865.495 | 1 | max |
| 146 | 1866.995 | 1 | |
| 147 | 1866.672 | 1 | |
| 148 | 1867.465 | 1 | |
| 149 | 1867.663 | 1 | |
| 150 | 1867.186 | 1 | |
| 151 | 1867.687 | 1 | |
| 152 | 1867.459 | 1 | |
| 153 | 1867.168 | 1 | |
| 154 | 1869.689 | 1 | |
| 155 | 1869.693 | 1 | |
| 156 | 1871.676 | 1 | |
| 157 | 1873.174 | 1 | min |
| 158 | 1873.691 | np.nan | |
| 159 | 1873.685 | np.nan | |
+-------+----------+-----------+---------+
isnull = data.iloc[:, 1].isnull()  # True on the NaN rows that separate the intervals
# group the non-NaN rows by interval block and take the row label of each block's max and min
minmax = data.groupby(isnull.cumsum()[~isnull])[0].agg(['idxmax', 'idxmin'])
data.loc[minmax['idxmax'], 'MinMax'] = 'max'
data.loc[minmax['idxmin'], 'MinMax'] = 'min'
data.MinMax = data.MinMax.fillna('')
print(data)
0 1 MinMax
0 1879.289 NaN
1 1879.281 NaN
2 1879.292 1.0
3 1879.295 1.0
4 1879.481 1.0
5 1879.294 1.0
6 1879.268 1.0 min
7 1879.293 1.0
8 1879.277 1.0
9 1879.285 1.0
10 1879.464 1.0
11 1879.475 1.0
12 1879.971 1.0
13 1879.779 1.0
14 1879.986 1.0
15 1880.791 1.0 max
16 1880.290 1.0
17 1879.253 NaN
18 1878.268 NaN
19 1875.730 1.0 min
20 1876.792 1.0
21 1875.977 1.0
22 1876.408 1.0
23 1877.159 1.0
24 1877.187 1.0
25 1883.164 1.0
26 1883.171 1.0
27 1883.495 1.0
28 1883.962 1.0
29 1885.158 1.0
.. ... ... ...
85 1877.687 1.0
86 1878.367 1.0
87 1878.179 1.0
88 1877.696 1.0
89 1877.665 1.0 min
90 1877.667 NaN
91 1878.678 NaN
92 1878.661 1.0 max
93 1878.171 1.0
94 1877.371 1.0
95 1877.359 1.0
96 1878.381 1.0
97 1875.185 1.0 min
98 1875.367 NaN
99 1865.492 NaN
100 1865.495 1.0 min
101 1866.995 1.0
102 1866.672 1.0
103 1867.465 1.0
104 1867.663 1.0
105 1867.186 1.0
106 1867.687 1.0
107 1867.459 1.0
108 1867.168 1.0
109 1869.689 1.0
110 1869.693 1.0
111 1871.676 1.0
112 1873.174 1.0 max
113 1873.691 NaN
114 1873.685 NaN
[115 rows x 3 columns]
data.columns=['Value','Interval']
data['Ingroup'] = (data['Interval'].notnull() + 0)
Use data['Interval'].notnull() to separate the groups...
Use cumsum() to number them with `groupno`...
Use groupby(groupno)...
Finally you want something using idxmax/idxmin (or apply) to label the max/min; a sketch along these lines follows below.
But of course a for-loop, as you suggested, is the non-Pythonic but possibly simpler hack.
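Putting those hints together, one possible sketch (illustrative only; it continues from the `data` frame and the 'Value'/'Interval' renaming above):
valid = data['Interval'].notnull()              # rows that belong to some interval
groupno = (~valid).cumsum()                     # constant within each interval block
grouped = data.loc[valid, 'Value'].groupby(groupno[valid])

data['MinMax'] = ''
data.loc[grouped.idxmax(), 'MinMax'] = 'max'    # row label of each interval's maximum
data.loc[grouped.idxmin(), 'MinMax'] = 'min'    # row label of each interval's minimum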

Selecting the maximum value only for another maximum value

If I have two int columns in SQL Server, how can I write a query that returns the maximum value of one column at the maximum value of the other column?
Let me give an example. Let's say I have this table:
| Name | Version | Category | Value | Number | Replication |
|:-----:|:-------:|:--------:|:-----:|:------:|:-----------:|
| File1 | 1.0 | Time | 123 | 1 | 1 |
| File1 | 1.0 | Size | 456 | 1 | 1 |
| File2 | 1.0 | Time | 312 | 1 | 1 |
| File2 | 1.0 | Size | 645 | 1 | 1 |
| File1 | 1.0 | Time | 369 | 1 | 2 |
| File1 | 1.0 | Size | 258 | 1 | 2 |
| File2 | 1.0 | Time | 741 | 1 | 2 |
| File2 | 1.0 | Size | 734 | 1 | 2 |
| File1 | 1.1 | Time | 997 | 2 | 1 |
| File1 | 1.1 | Size | 997 | 2 | 1 |
| File2 | 1.1 | Time | 438 | 2 | 1 |
| File2 | 1.1 | Size | 735 | 2 | 1 |
| File1 | 1.1 | Time | 786 | 2 | 2 |
| File1 | 1.1 | Size | 486 | 2 | 2 |
| File2 | 1.1 | Time | 379 | 2 | 2 |
| File2 | 1.1 | Size | 943 | 2 | 2 |
| File1 | 1.2 | Time | 123 | 3 | 1 |
| File1 | 1.2 | Size | 456 | 3 | 1 |
| File2 | 1.2 | Time | 312 | 3 | 1 |
| File2 | 1.2 | Size | 645 | 3 | 1 |
| File1 | 1.2 | Time | 369 | 3 | 2 |
| File1 | 1.2 | Size | 258 | 3 | 2 |
| File2 | 1.2 | Time | 741 | 3 | 2 |
| File2 | 1.2 | Size | 734 | 3 | 2 |
| File1 | 1.3 | Time | 997 | 4 | 1 |
| File1 | 1.3 | Size | 997 | 4 | 1 |
| File2 | 1.3 | Time | 438 | 4 | 1 |
| File2 | 1.3 | Size | 735 | 4 | 1 |
How could I write a query that selects the maximum Replication value at the maximum Number value? As you can see, in this table the maximum value in Number is 4, but the maximum Replication where Number = 4 is 1.
All I can think to do is this:
SELECT MAX(Replication) FROM Table
WHERE Number IS MAX;
which is obviously wrong and doesn't work.
You can try Group By and Having
select max(Replication)
from Table_Name
group by [Number]
having [Number] = (select max([Number]) from Table_Name)
Just use a subquery to find the max number in the where clause. If you just want one single number as the result there is no need to use group by and having (which would make the query a lot more expensive):
select max([replication]) from tab
where number = (select max(number) from tab)

SQL Performance multiple exclusion from the same table

I have a table with a list of people; let's say there are 100 people listed in that table.
I need to filter these people using different criteria and put them in groups. The problem is that when I start excluding on the 4th or 5th level, performance issues come up and the query becomes slow.
with lst_tous_movements as (
    select
        t1.refid_eClinibase,
        t1.[dthrfinmouvement],
        t1.[unite_service_id],
        t1.[unite_service_suiv_id]
    from sometable t1
)
, lst_patients_hospitalisés as (
    select distinct
        t1.refid_eClinibase
    from lst_tous_movements t1
    where
        t1.[dthrfinmouvement] = '4000-01-01'
)
, lst_patients_admisUIB_transferes as (
    select distinct
        t1.refid_eClinibase
    from lst_tous_movements t1
    left join lst_patients_hospitalisés t2 on t1.refid_eClinibase = t2.refid_eClinibase
    where
        t1.[unite_service_id] = 4
        and t1.[unite_service_suiv_id] <> 0
        and t2.refid_eClinibase is null
)
, lst_patients_admisUIB_nonTransferes as (
    select distinct
        t1.refid_eClinibase
    from lst_tous_movements t1
    left join lst_patients_admisUIB_transferes t2 on t1.refid_eClinibase = t2.refid_eClinibase
    left join lst_patients_hospitalisés t3 on t1.refid_eClinibase = t3.refid_eClinibase
    where
        t1.[unite_service_id] = 4
        and t1.[unite_service_suiv_id] = 0
        and t2.refid_eClinibase is null
        and t3.refid_eClinibase is null
)
, lst_patients_autres as (
    select distinct
        t1.refid_eClinibase
    from lst_patients t1
    left join lst_patients_admisUIB_transferes t2 on t1.refid_eClinibase = t2.refid_eClinibase
    left join lst_patients_hospitalisés t3 on t1.refid_eClinibase = t3.refid_eClinibase
    left join lst_patients_admisUIB_nonTransferes t4 on t1.refid_eClinibase = t4.refid_eClinibase
    where
        t2.refid_eClinibase is null
        and t3.refid_eClinibase is null
        and t4.refid_eClinibase is null
)
As you can see, I have multi-level filtering going on here:
1st, I get the people where t1.[dthrfinmouvement] = '4000-01-01';
2nd, I get the people matching another criterion, EXCLUDING the 1st group;
3rd, I get the people matching yet another criterion, EXCLUDING the 1st and the 2nd groups;
etc.
When I get to the 4th level, my query takes 6-10 seconds to complete.
Is there any way to speed this up?
This is the dataset I'm working with:
+------------------+-------------------------------+------------------+------------------+-----------------------+
| refid_eClinibase | nodossierpermanent_eClinibase | dthrfinmouvement | unite_service_id | unite_service_suiv_id |
+------------------+-------------------------------+------------------+------------------+-----------------------+
| 25611 | P0017379 | 2013-04-27 | 58 | 0 |
| 25611 | P0017379 | 2013-05-02 | 4 | 2 |
| 25611 | P0017379 | 2013-05-18 | 2 | 0 |
| 85886 | P0077918 | 2013-04-10 | 58 | 0 |
| 85886 | P0077918 | 2013-05-06 | 6 | 12 |
| 85886 | P0077918 | 4000-01-01 | 12 | 0 |
| 91312 | P0083352 | 2013-07-24 | 3 | 14 |
| 91312 | P0083352 | 2013-07-24 | 14 | 3 |
| 91312 | P0083352 | 2013-07-30 | 3 | 8 |
| 91312 | P0083352 | 4000-01-01 | 8 | 0 |
| 93835 | P0085879 | 2013-04-30 | 58 | 0 |
| 93835 | P0085879 | 2013-05-07 | 4 | 2 |
| 93835 | P0085879 | 2013-05-16 | 2 | 0 |
| 93835 | P0085879 | 2013-05-22 | 58 | 0 |
| 93835 | P0085879 | 2013-05-24 | 4 | 0 |
| 93835 | P0085879 | 2013-05-31 | 58 | 0 |
| 93836 | P0085880 | 2013-05-20 | 58 | 0 |
| 93836 | P0085880 | 2013-05-22 | 4 | 2 |
| 93836 | P0085880 | 2013-05-31 | 2 | 0 |
| 97509 | P0089576 | 2013-04-09 | 58 | 0 |
| 97509 | P0089576 | 2013-04-11 | 4 | 0 |
| 102787 | P0094886 | 2013-04-08 | 58 | 0 |
| 102787 | P0094886 | 2013-04-11 | 4 | 2 |
| 102787 | P0094886 | 2013-05-21 | 2 | 0 |
| 103029 | P0095128 | 2013-04-04 | 58 | 0 |
| 103029 | P0095128 | 2013-04-10 | 4 | 1 |
| 103029 | P0095128 | 2013-05-03 | 1 | 0 |
| 103813 | P0095922 | 2013-07-02 | 58 | 0 |
| 103813 | P0095922 | 2013-07-03 | 4 | 6 |
| 103813 | P0095922 | 2013-08-14 | 6 | 0 |
| 105106 | P0097215 | 2013-08-09 | 58 | 0 |
| 105106 | P0097215 | 2013-08-13 | 4 | 0 |
| 105106 | P0097215 | 2013-08-14 | 58 | 0 |
| 105106 | P0097215 | 4000-01-01 | 4 | 0 |
| 106223 | P0098332 | 2013-06-11 | 1 | 0 |
| 106223 | P0098332 | 2013-08-01 | 58 | 0 |
| 106223 | P0098332 | 4000-01-01 | 1 | 0 |
| 106245 | P0098354 | 2013-04-02 | 58 | 0 |
| 106245 | P0098354 | 2013-05-24 | 58 | 0 |
| 106245 | P0098354 | 2013-05-29 | 4 | 1 |
| 106245 | P0098354 | 2013-07-12 | 1 | 0 |
| 106280 | P0098389 | 2013-04-07 | 58 | 0 |
| 106280 | P0098389 | 2013-04-09 | 4 | 0 |
| 106416 | P0098525 | 2013-04-19 | 58 | 0 |
| 106416 | P0098525 | 2013-04-23 | 4 | 0 |
| 106444 | P0098553 | 2013-04-22 | 58 | 0 |
| 106444 | P0098553 | 2013-04-25 | 4 | 0 |
| 106609 | P0098718 | 2013-05-08 | 58 | 0 |
| 106609 | P0098718 | 2013-05-10 | 4 | 11 |
| 106609 | P0098718 | 2013-07-24 | 11 | 12 |
| 106609 | P0098718 | 4000-01-01 | 12 | 0 |
| 106616 | P0098725 | 2013-05-09 | 58 | 0 |
| 106616 | P0098725 | 2013-05-09 | 4 | 1 |
| 106616 | P0098725 | 2013-07-27 | 1 | 0 |
| 106698 | P0098807 | 2013-05-16 | 58 | 0 |
| 106698 | P0098807 | 2013-05-22 | 4 | 6 |
| 106698 | P0098807 | 2013-06-14 | 6 | 1 |
| 106698 | P0098807 | 2013-06-28 | 1 | 0 |
| 106714 | P0098823 | 2013-05-20 | 58 | 0 |
| 106714 | P0098823 | 2013-05-21 | 58 | 0 |
| 106714 | P0098823 | 2013-05-24 | 58 | 0 |
| 106729 | P0098838 | 2013-05-21 | 58 | 0 |
| 106729 | P0098838 | 2013-05-23 | 4 | 1 |
| 106729 | P0098838 | 2013-06-03 | 1 | 0 |
| 107038 | P0099147 | 2013-06-25 | 58 | 0 |
| 107038 | P0099147 | 2013-06-28 | 4 | 1 |
| 107038 | P0099147 | 2013-07-04 | 1 | 0 |
| 107038 | P0099147 | 2013-08-13 | 58 | 0 |
| 107038 | P0099147 | 2013-08-15 | 4 | 6 |
| 107038 | P0099147 | 4000-01-01 | 6 | 0 |
| 107082 | P0099191 | 2013-06-29 | 58 | 0 |
| 107082 | P0099191 | 2013-07-04 | 4 | 6 |
| 107082 | P0099191 | 2013-07-19 | 6 | 0 |
| 107157 | P0099267 | 4000-01-01 | 13 | 0 |
| 107336 | P0099446 | 4000-01-01 | 6 | 0 |
+------------------+-------------------------------+------------------+------------------+-----------------------+
thanks.
It is hard to understand exactly what all your rules are from the question, but the general approach should be to add a "Grouping" column to a single query that uses a CASE statement to categorize the people.
The conditions in a CASE are evaluated in order, so if the first criterion is met, the subsequent criteria are not even evaluated for that row.
Here is some code to get you started....
select t1.refid_eClinibase
      ,t1.[dthrfinmouvement]
      ,t1.[unite_service_id]
      ,t1.[unite_service_suiv_id]
      ,CASE WHEN [dthrfinmouvement] = '4000-01-01' THEN 'Group1 Label'
            WHEN condition2 = something THEN 'Group2 Label'
            ....
            WHEN conditionN = something THEN 'GroupN Label'
            ELSE 'Catch All Label'
       END as person_category
from sometable t1
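Purely to illustrate the "first matching condition wins" behaviour of CASE described above, here is the same shape in pandas/NumPy, with placeholder rules rather than the asker's real ones:
import numpy as np
import pandas as pd

# df is assumed to hold the movement rows shown in the question
conditions = [
    df['dthrfinmouvement'] == '4000-01-01',                              # Group 1: still hospitalised
    (df['unite_service_id'] == 4) & (df['unite_service_suiv_id'] != 0),  # Group 2 (placeholder rule)
]
labels = ['Group1 Label', 'Group2 Label']

# like CASE, np.select picks the first condition that is true for each row
df['person_category'] = np.select(conditions, labels, default='Catch All Label')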