S3 Select with Presto - hive

I am trying out S3 Select from Presto using the Hive connector and a MinIO object store. I am able to create an external table and run all the SQL queries, but S3 Select does not seem to be working, even with hive.s3select-pushdown.enabled=true set in the properties file in the catalog folder. I ran a packet trace on the MinIO server and only see GET/LIST calls; I do not see any POST /{Key+}?select&select-type=2 HTTP/1.1 requests being made.
Below is the Hive catalog properties file.
hive.metastore.uri=thrift://hadoop-master:9083
hive.s3.path-style-access=true
hive.s3.endpoint=http://X.X.X.X:9000
hive.s3.aws-access-key=minioadmin
hive.s3.aws-secret-key=minioadmin
hive.non-managed-table-writes-enabled=true
hive.storage-format=ORC
hive.s3select-pushdown.enabled=true
I can see that the same is set in the session parameters in Presto:
minio.s3_select_pushdown_enabled | true | true
minio.projection_pushdown_enabled | true | true
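For reference, the session property can also be set and verified straight from the Presto CLI (a quick sketch; minio is the catalog name used here, so session properties are prefixed with it):
SET SESSION minio.s3_select_pushdown_enabled = true;
SHOW SESSION;  -- lists the session properties so the effective value can be checked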
This is how I am creating the external table from the Presto CLI.
presto:default> CREATE TABLE nyc_9 ( vendorid VARCHAR, tpep_pickup_datetime VARCHAR, tpep_dropoff_datetime VARCHAR, passenger_count VARCHAR, trip_distance VARCHAR, ratecodeid VARCHAR, store_and_fwd_flag VARCHAR, pulocationid VARCHAR, dolocationid VARCHAR, payment_type VARCHAR, fare_amount VARCHAR, extra VARCHAR, mta_tax VARCHAR, tip_amount VARCHAR, tolls_amount VARCHAR, improvement_surcharge VARCHAR, total_amount VARCHAR) WITH (FORMAT = 'CSV', skip_header_line_count = 1, EXTERNAL_LOCATION = 's3a://test10gb5/');
Query being run
presto:default> SELECT * FROM nyc_9 WHERE trip_distance > '20' AND fare_amount > '10' AND tip_amount > '2' AND passenger_count = '2' LIMIT 10;
vendorid | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | ratecodeid | store_and_fwd_flag | pulocationid | dolocationid | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_sur
----------+------------------------+------------------------+-----------------+---------------+------------+--------------------+--------------+--------------+--------------+-------------+-------+---------+------------+--------------+----------------
2 | 04/26/2018 08:51:16 AM | 04/26/2018 09:42:03 AM | 2 | 5.06 | 1 | N | 236 | 170 | 1 | 31 | 0 | 0.5 | 6.36 | 0 | 0.3
2 | 04/26/2018 08:14:17 AM | 04/26/2018 08:35:08 AM | 2 | 6.88 | 1 | N | 263 | 45 | 1 | 22 | 0 | 0.5 | 6.84 | 0 | 0.3
1 | 04/26/2018 08:19:47 AM | 04/26/2018 09:17:45 AM | 2 | 9.7 | 1 | N | 138 | 144 | 1 | 39 | 0 | 0.5 | 8 | 0 | 0.3
2 | 04/26/2018 08:38:15 AM | 04/26/2018 09:09:58 AM | 2 | 4.73 | 1 | N | 142 | 144 | 1 | 22 | 0 | 0.5 | 4.56 | 0 | 0.3
2 | 04/26/2018 08:38:26 AM | 04/26/2018 09:22:12 AM | 2 | 5.95 | 1 | N | 239 | 13 | 1 | 29 | 0 | 0.5 | 2.98 | 0 | 0.3
2 | 04/26/2018 08:47:03 AM | 04/26/2018 09:17:02 AM | 2 | 3.27 | 1 | N | 158 | 162 | 1 | 19 | 0 | 0.5 | 3.96 | 0 | 0.3
2 | 04/26/2018 08:21:19 AM | 04/26/2018 08:46:55 AM | 2 | 3.89 | 1 | N | 262 | 107 | 1 | 18.5 | 0 | 0.5 | 3.86 | 0 | 0.3
2 | 04/26/2018 08:35:32 AM | 04/26/2018 09:01:54 AM | 2 | 4.09 | 1 | N | 236 | 137 | 1 | 17.5 | 0 | 0.5 | 3.66 | 0 | 0.3
1 | 04/26/2018 08:43:45 AM | 04/26/2018 09:03:41 AM | 2 | 3 | 1 | N | 163 | 145 | 1 | 15 | 0 | 0.5 | 6 | 0 | 0.3
1 | 04/26/2018 08:01:47 AM | 04/26/2018 08:13:08 AM | 2 | 3.1 | 1 | N | 264 | 137 | 1 | 12 | 0 | 0.5 | 2.55 | 0 | 0.3
(10 rows)
Is there anything else that needs to be done for S3 Select to work?

Related

Solar-Heating: Data analytics for Grafana, advanced query

I need some help with a very specific use case I have for my homelab.
I have solar panels on my roof, and I extract a lot of data points to my server. I am using a specific app for that (ioBroker), which makes it easy to consume and automate things with that data. The data is saved into a Postgres database. (Please, no questions about why not Influx or TimescaleDB; Postgres is what I have to live with...)
Everything runs on Docker right now and works perfectly. While I was able to create numerous dashboards in Grafana and display everything I like there, there is one specific thing I was unable to do, and after months of trying to get it done I am finally asking for help. I have a device that supports my heating with generated power to warm up the water; it uses energy that would normally be fed back to the grid. The device updates the power it pushes to the heating element pretty much every second, and I am also pulling the data from the device every second. However, I have logging configured so that a value is only stored when it differs from the previous data point.
One example:
Time                | consumption in W
--------------------+-----------------
2018-02-21 12:00:00 | 3500
2018-02-21 12:00:01 | 1470
2018-02-21 12:00:02 | 1470
2018-02-21 12:00:03 | 1470
2018-02-21 12:00:04 | 1600
The second and third entries with the value 1470 (12:00:02 and 12:00:03) would not actually be stored!
So the first issue I have is missing data points. What I would like to achieve is a calculation showing the consumption by individual day, month, year and all-time.
This does not need to happen inside Grafana, and I don't think Grafana can do this at all. There are options to do similar things in Grafana, but they do not provide an accurate result ($__unixEpochGroupAlias(ts,1s,previous)). I have every option needed to create the data and store it back in the DB, so there should not be any obstacle to your ideas.
The data is polled/stored every 1000 ms, i.e. every second. The idea is to work in Ws (watt-seconds) so the numbers stay accurate and can easily be displayed as Wh or kWh.
The DB can only be queried with SQL, but as mentioned, if the calculations need to be done in a different language, that is also fine.
I have tried everything I could think of: SQL queries, searching numerous posts, all available SQL-based Grafana options. I guess I need custom code, but that is above my skill set.
Anything more you'd need to know? Let me know. Thanks in advance!
The data structure looks like this:
id = key the application uses to identify the datapoint, ts = timestamp (Unix epoch in milliseconds),
val = value in Ws
The other columns are not important, but I wanted to show them for completeness.
id | ts | val | ack | _from | q
----+---------------+------+-----+-------+---
23 | 1661439981910 | 1826 | t | 3 | 0
23 | 1661439982967 | 1830 | t | 3 | 0
23 | 1661439984027 | 1830 | t | 3 | 0
23 | 1661439988263 | 1828 | t | 3 | 0
23 | 1661439985088 | 1829 | t | 3 | 0
23 | 1661439987203 | 1829 | t | 3 | 0
23 | 1661439989322 | 1831 | t | 3 | 0
23 | 1661439990380 | 1830 | t | 3 | 0
23 | 1661439991439 | 1827 | t | 3 | 0
23 | 1661439992498 | 1829 | t | 3 | 0
23 | 1661440021097 | 1911 | t | 3 | 0
23 | 1661439993558 | 1830 | t | 3 | 0
23 | 1661440022156 | 1924 | t | 3 | 0
23 | 1661439994624 | 1830 | t | 3 | 0
23 | 1661440023214 | 1925 | t | 3 | 0
23 | 1661439995683 | 1828 | t | 3 | 0
23 | 1661440024273 | 1924 | t | 3 | 0
23 | 1661439996739 | 1830 | t | 3 | 0
23 | 1661440025332 | 1925 | t | 3 | 0
23 | 1661440052900 | 1694 | t | 3 | 0
23 | 1661439997797 | 1831 | t | 3 | 0
23 | 1661440026391 | 1927 | t | 3 | 0
23 | 1661439998855 | 1831 | t | 3 | 0
23 | 1661440027450 | 1925 | t | 3 | 0
23 | 1661439999913 | 1828 | t | 3 | 0
23 | 1661440028509 | 1925 | t | 3 | 0
23 | 1661440029569 | 1927 | t | 3 | 0
23 | 1661440000971 | 1830 | t | 3 | 0
23 | 1661440030634 | 1926 | t | 3 | 0
23 | 1661440002030 | 1838 | t | 3 | 0
23 | 1661440031694 | 1925 | t | 3 | 0
23 | 1661440053955 | 1692 | t | 3 | 0
23 | 1659399542399 | 0 | t | 3 | 0
23 | 1659399543455 | 1 | t | 3 | 0
23 | 1659399544511 | 0 | t | 3 | 0
23 | 1663581880895 | 2813 | t | 3 | 0
23 | 1663581883017 | 2286 | t | 3 | 0
23 | 1663581881952 | 2646 | t | 3 | 0
23 | 1663581884074 | 1905 | t | 3 | 0
23 | 1661440004144 | 1838 | t | 3 | 0
23 | 1661440032752 | 1926 | t | 3 | 0
23 | 1661440005202 | 1839 | t | 3 | 0
23 | 1661440034870 | 1924 | t | 3 | 0
23 | 1661440006260 | 1840 | t | 3 | 0
23 | 1661440035929 | 1922 | t | 3 | 0
23 | 1661440007318 | 1840 | t | 3 | 0
23 | 1661440036987 | 1918 | t | 3 | 0
23 | 1661440008377 | 1838 | t | 3 | 0
23 | 1661440038045 | 1919 | t | 3 | 0
23 | 1661440009437 | 1839 | t | 3 | 0
23 | 1661440039104 | 1900 | t | 3 | 0
23 | 1661440010495 | 1839 | t | 3 | 0
23 | 1661440040162 | 1877 | t | 3 | 0
23 | 1661440011556 | 1838 | t | 3 | 0
23 | 1661440041220 | 1862 | t | 3 | 0
23 | 1661440012629 | 1840 | t | 3 | 0
23 | 1661440042279 | 1847 | t | 3 | 0
23 | 1661440013687 | 1840 | t | 3 | 0
23 | 1661440043340 | 1829 | t | 3 | 0
23 | 1661440014746 | 1833 | t | 3 | 0
23 | 1661440044435 | 1817 | t | 3 | 0
23 | 1661440015804 | 1833 | t | 3 | 0
23 | 1661440045493 | 1789 | t | 3 | 0
23 | 1661440046551 | 1766 | t | 3 | 0
23 | 1661440016862 | 1846 | t | 3 | 0
23 | 1661440047610 | 1736 | t | 3 | 0
23 | 1661440048670 | 1705 | t | 3 | 0
23 | 1661440017920 | 1863 | t | 3 | 0
23 | 1661440049726 | 1694 | t | 3 | 0
23 | 1661440050783 | 1694 | t | 3 | 0
23 | 1661440018981 | 1876 | t | 3 | 0
23 | 1661440051840 | 1696 | t | 3 | 0
23 | 1661440055015 | 1692 | t | 3 | 0
23 | 1661440056071 | 1693 | t | 3 | 0
23 | 1661440322966 | 1916 | t | 3 | 0
23 | 1661440325082 | 1916 | t | 3 | 0
23 | 1661440326142 | 1926 | t | 3 | 0
23 | 1661440057131 | 1693 | t | 3 | 0
23 | 1661440327199 | 1913 | t | 3 | 0
23 | 1661440058189 | 1692 | t | 3 | 0
23 | 1661440328256 | 1915 | t | 3 | 0
23 | 1661440059247 | 1691 | t | 3 | 0
23 | 1661440329315 | 1923 | t | 3 | 0
23 | 1661440060306 | 1692 | t | 3 | 0
23 | 1661440330376 | 1912 | t | 3 | 0
23 | 1661440061363 | 1676 | t | 3 | 0
23 | 1661440331470 | 1913 | t | 3 | 0
23 | 1661440062437 | 1664 | t | 3 | 0
23 | 1663581885133 | 1678 | t | 3 | 0
23 | 1661440332530 | 1923 | t | 3 | 0
23 | 1661440064552 | 1667 | t | 3 | 0
23 | 1661440334647 | 1915 | t | 3 | 0
23 | 1661440335708 | 1913 | t | 3 | 0
23 | 1661440065608 | 1665 | t | 3 | 0
23 | 1661440066665 | 1668 | t | 3 | 0
23 | 1661440336763 | 1912 | t | 3 | 0
23 | 1661440337822 | 1913 | t | 3 | 0
23 | 1661440338879 | 1911 | t | 3 | 0
23 | 1661440068780 | 1664 | t | 3 | 0
23 | 1661440339939 | 1912 | t | 3 | 0
(100 rows)
iobroker=# \d ts_number
Table "public.ts_number"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
id | integer | | not null |
ts | bigint | | not null |
val | real | | |
ack | boolean | | |
_from | integer | | |
q | integer | | |
Indexes:
"ts_number_pkey" PRIMARY KEY, btree (id, ts)
You can do this with a mix of generate_series() and some window functions.
First we use generate_series() to get all the second timestamps in a desired range. Then we join to our readings to find what consumption values we have. Group nulls with their most recent non-null reading. Then set the consumption the same for the whole group.
So: if we have readings like this:
richardh=> SELECT * FROM readings;
id | ts | consumption
----+------------------------+-------------
1 | 2023-02-16 20:29:13+00 | 900
2 | 2023-02-16 20:29:16+00 | 1000
3 | 2023-02-16 20:29:20+00 | 925
(3 rows)
We can get all of the seconds we might want like this:
richardh=> SELECT generate_series(timestamptz '2023-02-16 20:29:13+00', timestamptz '2023-02-16 20:29:30+00', interval '1 second');
generate_series
------------------------
2023-02-16 20:29:13+00
2023-02-16 20:29:14+00
...etc...
2023-02-16 20:29:29+00
2023-02-16 20:29:30+00
(18 rows)
Then we join our complete set of timestamps to our readings:
WITH wanted_timestamps (ts) AS (
    SELECT generate_series(timestamptz '2023-02-16 20:29:13+00', timestamptz '2023-02-16 20:29:30+00', interval '1 second')
)
SELECT
    wt.ts
  , r.consumption
  , sum(CASE WHEN r.consumption IS NOT NULL THEN 1 ELSE 0 END)
        OVER (ORDER BY ts) AS group_num
FROM
    wanted_timestamps wt
    LEFT JOIN readings r USING (ts)
ORDER BY wt.ts;
ts | consumption | group_num
------------------------+-------------+-----------
2023-02-16 20:29:13+00 | 900 | 1
2023-02-16 20:29:14+00 | | 1
2023-02-16 20:29:15+00 | | 1
2023-02-16 20:29:16+00 | 1000 | 2
2023-02-16 20:29:17+00 | | 2
2023-02-16 20:29:18+00 | | 2
2023-02-16 20:29:19+00 | | 2
2023-02-16 20:29:20+00 | 925 | 3
2023-02-16 20:29:21+00 | | 3
2023-02-16 20:29:22+00 | | 3
2023-02-16 20:29:23+00 | | 3
2023-02-16 20:29:24+00 | | 3
2023-02-16 20:29:25+00 | | 3
2023-02-16 20:29:26+00 | | 3
2023-02-16 20:29:27+00 | | 3
2023-02-16 20:29:28+00 | | 3
2023-02-16 20:29:29+00 | | 3
2023-02-16 20:29:30+00 | | 3
(18 rows)
Finally, fill in the missing consumption values:
WITH wanted_timestamps (ts) AS (
    SELECT generate_series(timestamptz '2023-02-16 20:29:13+00', timestamptz '2023-02-16 20:29:30+00', interval '1 second')
), grouped_values AS (
    SELECT
        wt.ts
      , r.consumption
      , sum(CASE WHEN r.consumption IS NOT NULL THEN 1 ELSE 0 END)
            OVER (ORDER BY ts) AS group_num
    FROM wanted_timestamps wt
        LEFT JOIN readings r USING (ts)
)
SELECT
    gv.ts
  , first_value(gv.consumption) OVER (PARTITION BY group_num ORDER BY gv.ts)
        AS consumption
FROM
    grouped_values gv
ORDER BY ts;
ts | consumption
------------------------+-------------
2023-02-16 20:29:13+00 | 900
2023-02-16 20:29:14+00 | 900
2023-02-16 20:29:15+00 | 900
2023-02-16 20:29:16+00 | 1000
2023-02-16 20:29:17+00 | 1000
2023-02-16 20:29:18+00 | 1000
2023-02-16 20:29:19+00 | 1000
2023-02-16 20:29:20+00 | 925
2023-02-16 20:29:21+00 | 925
2023-02-16 20:29:22+00 | 925
2023-02-16 20:29:23+00 | 925
2023-02-16 20:29:24+00 | 925
2023-02-16 20:29:25+00 | 925
2023-02-16 20:29:26+00 | 925
2023-02-16 20:29:27+00 | 925
2023-02-16 20:29:28+00 | 925
2023-02-16 20:29:29+00 | 925
2023-02-16 20:29:30+00 | 925
(18 rows)
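To connect this back to the ts_number table and get consumption per day, month or year, here is a hedged sketch. It assumes the heating datapoint is the one with id = 23, that each filled second contributes its value as watt-seconds, and that the gap-filled output of the query above has been saved as a view named filled_readings(ts, consumption); the object names are illustrative, not prescriptive.
-- Bridge the ts_number rows (bigint Unix epoch in milliseconds) into the
-- timestamptz/consumption shape used by the readings table above.
CREATE VIEW readings AS
SELECT date_trunc('second', to_timestamp(ts / 1000.0)) AS ts,
       val AS consumption
FROM ts_number
WHERE id = 23;          -- assumption: 23 is the heating-power datapoint

-- Roll the filled one-row-per-second series up to energy per day; swap 'day'
-- for 'month' or 'year' for the other granularities.
SELECT date_trunc('day', ts)         AS day,
       sum(consumption)              AS ws,     -- watt-seconds
       sum(consumption) / 3600.0     AS wh,     -- watt-hours
       sum(consumption) / 3600000.0  AS kwh     -- kilowatt-hours
FROM filled_readings
GROUP BY 1
ORDER BY 1;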

Pandas column backfill decreasing / increasing

I have a DataFrame:
| ind | A | B |
------------------------
| 1.01 | 10 | -1.734 |
| 1.04 | 10 | -1.244 |
| 1.05 | 10 | 0.016 |
| 1.11 | NaN | -2.737 | <-
| 1.13 | NaN | -4.232 | <-
| 1.19 | 11 | -3.241 | <=
| 1.20 | 12 | -2.832 |
| 1.21 | 10 | -4.277 |
and would like to back-fill the NaN values with a decreasing sequence that ends at the next valid value:
| ind | A | B |
------------------------
| 1.01 | 10 | -1.734 |
| 1.04 | 10 | -1.244 |
| 1.05 | 10 | 0.016 |
| 1.11 | 13 | -2.737 | <-
| 1.13 | 12 | -4.232 | <-
| 1.19 | 11 | -3.241 | <=
| 1.20 | 12 | -2.832 |
| 1.21 | 10 | -4.277 |
Is there a way to do this?
Get the positions where NaNs are found:
positions = df['A'].isna().astype(int)
| positions |
--------------
| 0 |
| 0 |
| 0 |
| 1 |
| 1 |
| 0 |
| 0 |
| 0 |
then doing a reverse cumulative sum:
mask = df['A'].isna().loc[::-1]   # keep the mask boolean so ~mask is a logical NOT
cumSum = mask.cumsum()
posCumSum = (cumSum - cumSum.where(~mask).ffill().fillna(0).astype(int)).loc[::-1]
| posCumSum |
--------------
| 0 |
| 0 |
| 0 |
| 2 |
| 1 |
| 0 |
| 0 |
| 0 |
adding it to the back-filled original column:
df['A'] = df['A'].bfill() + posCumSum
| ind | A | B |
------------------------
| 1.01 | 10 | -1.734 |
| 1.04 | 10 | -1.244 |
| 1.05 | 10 | 0.016 |
| 1.11 | 13 | -2.737 | <-
| 1.13 | 12 | -4.232 | <-
| 1.19 | 11 | -3.241 | <=
| 1.20 | 12 | -2.832 |
| 1.21 | 10 | -4.277 |

Cumulative sum of multiple window functions V3

I have this table:
id | date | player_id | score | all_games | all_wins | n_games | n_wins
============================================================================================
6747 | 2018-08-10 | 1 | 0 | 1 | | 1 |
6751 | 2018-08-10 | 1 | 0 | 2 | 0 | 2 |
6764 | 2018-08-10 | 1 | 0 | 3 | 0 | 3 |
6783 | 2018-08-10 | 1 | 0 | 4 | 0 | 4 |
6804 | 2018-08-10 | 1 | 0 | 5 | 0 | 5 |
6821 | 2018-08-10 | 1 | 0 | 6 | 0 | 6 |
6828 | 2018-08-10 | 1 | 0 | 7 | 0 | 7 |
17334 | 2018-08-23 | 1 | 0 | 8 | 0 | 8 | 0
17363 | 2018-08-23 | 1 | 0 | 9 | 0 | 9 | 0
17398 | 2018-08-23 | 1 | 0 | 10 | 0 | 10 | 0
17403 | 2018-08-23 | 1 | 0 | 11 | 0 | 11 | 0
17409 | 2018-08-23 | 1 | 0 | 12 | 0 | 12 | 0
33656 | 2018-09-13 | 1 | 0 | 13 | 0 | 13 | 0
33687 | 2018-09-13 | 1 | 0 | 14 | 0 | 14 | 0
45393 | 2018-09-27 | 1 | 0 | 15 | 0 | 15 | 0
45402 | 2018-09-27 | 1 | 0 | 16 | 0 | 16 | 0
45422 | 2018-09-27 | 1 | 1 | 17 | 0 | 17 | 0
45453 | 2018-09-27 | 1 | 0 | 18 | 1 | 18 | 0
45461 | 2018-09-27 | 1 | 0 | 19 | 1 | 19 | 0
45474 | 2018-09-27 | 1 | 0 | 20 | 1 | 20 | 0
57155 | 2018-10-11 | 1 | 0 | 21 | 1 | 21 | 1
57215 | 2018-10-11 | 1 | 0 | 22 | 1 | 22 | 1
57225 | 2018-10-11 | 1 | 0 | 23 | 1 | 23 | 1
69868 | 2018-10-25 | 1 | 0 | 24 | 1 | 24 | 1
The issue that I now need to solve is that I need n_games to be a rolling count of games over the window of recent days (a user can play multiple games per day); at present it is just the same as row_number() OVER all_games.
The other issue is that the n_wins column only sums the wins of the rolling window up to the previous day, so if a user wins a couple of games early in a day, those wins do not show up in n_wins until the next day.
I have the example DEMO:
I have tried this query
SELECT id,
       date,
       player_id,
       score,
       row_number() OVER all_races AS all_games,
       sum(score)   OVER all_races AS all_wins,
       row_number() OVER last_n    AS n_games,
       sum(score)   OVER last_n    AS n_wins
FROM scores
WINDOW
    all_races AS (PARTITION BY player_id ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
    last_n    AS (PARTITION BY player_id ORDER BY date ASC RANGE BETWEEN interval '7 days' PRECEDING AND interval '1 day' PRECEDING);
Ideally I need a query that will output something like this table
id | date | player_id | score | all_games | all_wins | n_games | n_wins
============================================================================================
6747 | 2018-08-10 | 1 | 0 | 1 | | 1 |
6751 | 2018-08-10 | 1 | 0 | 2 | 0 | 2 |
6764 | 2018-08-10 | 1 | 0 | 3 | 0 | 3 |
6783 | 2018-08-10 | 1 | 0 | 4 | 0 | 4 |
6804 | 2018-08-10 | 1 | 0 | 5 | 0 | 5 |
6821 | 2018-08-10 | 1 | 0 | 6 | 0 | 6 |
6828 | 2018-08-10 | 1 | 0 | 7 | 0 | 7 |
17334 | 2018-08-23 | 1 | 0 | 8 | 0 | 1 | 0
17363 | 2018-08-23 | 1 | 0 | 9 | 0 | 2 | 0
17398 | 2018-08-23 | 1 | 0 | 10 | 0 | 3 | 0
17403 | 2018-08-23 | 1 | 0 | 11 | 0 | 4 | 0
17409 | 2018-08-23 | 1 | 0 | 12 | 0 | 5 | 0
33656 | 2018-09-13 | 1 | 1 | 13 | 1 | 6 | 0
33687 | 2018-09-13 | 1 | 0 | 14 | 1 | 7 | 1
45393 | 2018-09-27 | 1 | 0 | 15 | 1 | 1 | 1
45402 | 2018-09-27 | 1 | 0 | 16 | 1 | 2 | 1
45422 | 2018-09-27 | 1 | 1 | 17 | 1 | 3 | 1
45453 | 2018-09-27 | 1 | 0 | 18 | 2 | 4 | 2
45461 | 2018-09-27 | 1 | 0 | 19 | 2 | 5 | 2
45474 | 2018-09-27 | 1 | 0 | 20 | 2 | 6 | 1
57155 | 2018-10-11 | 1 | 0 | 21 | 2 | 7 | 1
57215 | 2018-10-11 | 1 | 0 | 22 | 2 | 1 | 1
57225 | 2018-10-11 | 1 | 0 | 23 | 2 | 2 | 1
69868 | 2018-10-25 | 1 | 0 | 24 | 2 | 3 | 1
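One hedged observation on the query above, not a full solution: row_number() always numbers every row in the partition and ignores the window frame, so n_games can never reset. A frame-aware aggregate such as count(*) over the same last_n window does respect the 7-day RANGE frame, although it may still not reproduce the exact per-day reset shown in the desired output:
-- Sketch only: count(*) and sum(score) honour the RANGE frame, unlike row_number().
SELECT id,
       date,
       player_id,
       score,
       row_number() OVER all_races AS all_games,
       sum(score)   OVER all_races AS all_wins,
       count(*)     OVER last_n    AS n_games,
       sum(score)   OVER last_n    AS n_wins
FROM scores
WINDOW
    all_races AS (PARTITION BY player_id ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
    last_n    AS (PARTITION BY player_id ORDER BY date ASC RANGE BETWEEN interval '7 days' PRECEDING AND interval '1 day' PRECEDING);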

Find the highest and lowest value locations within an interval on a column?

Given this pandas DataFrame with two columns, 'Values' and 'Intervals', how do I get a third column 'MinMax' indicating whether the value is a maximum or a minimum within that interval? The challenge for me is that the interval length and the distance between intervals are not fixed, hence the question.
import pandas as pd
import numpy as np
data = pd.DataFrame([
[1879.289,np.nan],[1879.281,np.nan],[1879.292,1],[1879.295,1],[1879.481,1],[1879.294,1],[1879.268,1],
[1879.293,1],[1879.277,1],[1879.285,1],[1879.464,1],[1879.475,1],[1879.971,1],[1879.779,1],
[1879.986,1],[1880.791,1],[1880.29,1],[1879.253,np.nan],[1878.268,np.nan],[1875.73,1],[1876.792,1],
[1875.977,1],[1876.408,1],[1877.159,1],[1877.187,1],[1883.164,1],[1883.171,1],[1883.495,1],
[1883.962,1],[1885.158,1],[1885.974,1],[1886.479,np.nan],[1885.969,np.nan],[1884.693,1],[1884.977,1],
[1884.967,1],[1884.691,1],[1886.171,1],[1886.166,np.nan],[1884.476,np.nan],[1884.66,1],[1882.962,1],
[1881.496,1],[1871.163,1],[1874.985,1],[1874.979,1],[1871.173,np.nan],[1871.973,np.nan],[1871.682,np.nan],
[1872.476,np.nan],[1882.361,1],[1880.869,1],[1882.165,1],[1881.857,1],[1880.375,1],[1880.66,1],
[1880.891,1],[1880.377,1],[1881.663,1],[1881.66,1],[1877.888,1],[1875.69,1],[1875.161,1],
[1876.697,np.nan],[1876.671,np.nan],[1879.666,np.nan],[1877.182,np.nan],[1878.898,1],[1878.668,1],[1878.871,1],
[1878.882,1],[1879.173,1],[1878.887,1],[1878.68,1],[1878.872,1],[1878.677,1],[1877.877,1],
[1877.669,1],[1877.69,1],[1877.684,1],[1877.68,1],[1877.885,1],[1877.863,1],[1877.674,1],
[1877.676,1],[1877.687,1],[1878.367,1],[1878.179,1],[1877.696,1],[1877.665,1],[1877.667,np.nan],
[1878.678,np.nan],[1878.661,1],[1878.171,1],[1877.371,1],[1877.359,1],[1878.381,1],[1875.185,1],
[1875.367,np.nan],[1865.492,np.nan],[1865.495,1],[1866.995,1],[1866.672,1],[1867.465,1],[1867.663,1],
[1867.186,1],[1867.687,1],[1867.459,1],[1867.168,1],[1869.689,1],[1869.693,1],[1871.676,1],
[1873.174,1],[1873.691,np.nan],[1873.685,np.nan]
])
In the third column below you can see where the max and min are for each interval.
+-------+----------+-----------+---------+
| index | Value | Intervals | Min/Max |
+-------+----------+-----------+---------+
| 0 | 1879.289 | np.nan | |
| 1 | 1879.281 | np.nan | |
| 2 | 1879.292 | 1 | |
| 3 | 1879.295 | 1 | |
| 4 | 1879.481 | 1 | |
| 5 | 1879.294 | 1 | |
| 6 | 1879.268 | 1 | min |
| 7 | 1879.293 | 1 | |
| 8 | 1879.277 | 1 | |
| 9 | 1879.285 | 1 | |
| 10 | 1879.464 | 1 | |
| 11 | 1879.475 | 1 | |
| 12 | 1879.971 | 1 | |
| 13 | 1879.779 | 1 | |
| 17 | 1879.986 | 1 | |
| 18 | 1880.791 | 1 | max |
| 19 | 1880.29 | 1 | |
| 55 | 1879.253 | np.nan | |
| 56 | 1878.268 | np.nan | |
| 57 | 1875.73 | 1 | |
| 58 | 1876.792 | 1 | |
| 59 | 1875.977 | 1 | min |
| 60 | 1876.408 | 1 | |
| 61 | 1877.159 | 1 | |
| 62 | 1877.187 | 1 | |
| 63 | 1883.164 | 1 | |
| 64 | 1883.171 | 1 | |
| 65 | 1883.495 | 1 | |
| 66 | 1883.962 | 1 | |
| 67 | 1885.158 | 1 | |
| 68 | 1885.974 | 1 | max |
| 69 | 1886.479 | np.nan | |
| 70 | 1885.969 | np.nan | |
| 71 | 1884.693 | 1 | |
| 72 | 1884.977 | 1 | |
| 73 | 1884.967 | 1 | |
| 74 | 1884.691 | 1 | min |
| 75 | 1886.171 | 1 | max |
| 76 | 1886.166 | np.nan | |
| 77 | 1884.476 | np.nan | |
| 78 | 1884.66 | 1 | max |
| 79 | 1882.962 | 1 | |
| 80 | 1881.496 | 1 | |
| 81 | 1871.163 | 1 | min |
| 82 | 1874.985 | 1 | |
| 83 | 1874.979 | 1 | |
| 84 | 1871.173 | np.nan | |
| 85 | 1871.973 | np.nan | |
| 86 | 1871.682 | np.nan | |
| 87 | 1872.476 | np.nan | |
| 88 | 1882.361 | 1 | max |
| 89 | 1880.869 | 1 | |
| 90 | 1882.165 | 1 | |
| 91 | 1881.857 | 1 | |
| 92 | 1880.375 | 1 | |
| 93 | 1880.66 | 1 | |
| 94 | 1880.891 | 1 | |
| 95 | 1880.377 | 1 | |
| 96 | 1881.663 | 1 | |
| 97 | 1881.66 | 1 | |
| 98 | 1877.888 | 1 | |
| 99 | 1875.69 | 1 | |
| 100 | 1875.161 | 1 | min |
| 101 | 1876.697 | np.nan | |
| 102 | 1876.671 | np.nan | |
| 103 | 1879.666 | np.nan | |
| 111 | 1877.182 | np.nan | |
| 112 | 1878.898 | 1 | |
| 113 | 1878.668 | 1 | |
| 114 | 1878.871 | 1 | |
| 115 | 1878.882 | 1 | |
| 116 | 1879.173 | 1 | max |
| 117 | 1878.887 | 1 | |
| 118 | 1878.68 | 1 | |
| 119 | 1878.872 | 1 | |
| 120 | 1878.677 | 1 | |
| 121 | 1877.877 | 1 | |
| 122 | 1877.669 | 1 | |
| 123 | 1877.69 | 1 | |
| 124 | 1877.684 | 1 | |
| 125 | 1877.68 | 1 | |
| 126 | 1877.885 | 1 | |
| 127 | 1877.863 | 1 | |
| 128 | 1877.674 | 1 | |
| 129 | 1877.676 | 1 | |
| 130 | 1877.687 | 1 | |
| 131 | 1878.367 | 1 | |
| 132 | 1878.179 | 1 | |
| 133 | 1877.696 | 1 | |
| 134 | 1877.665 | 1 | min |
| 135 | 1877.667 | np.nan | |
| 136 | 1878.678 | np.nan | |
| 137 | 1878.661 | 1 | max |
| 138 | 1878.171 | 1 | |
| 139 | 1877.371 | 1 | |
| 140 | 1877.359 | 1 | |
| 141 | 1878.381 | 1 | |
| 142 | 1875.185 | 1 | min |
| 143 | 1875.367 | np.nan | |
| 144 | 1865.492 | np.nan | |
| 145 | 1865.495 | 1 | max |
| 146 | 1866.995 | 1 | |
| 147 | 1866.672 | 1 | |
| 148 | 1867.465 | 1 | |
| 149 | 1867.663 | 1 | |
| 150 | 1867.186 | 1 | |
| 151 | 1867.687 | 1 | |
| 152 | 1867.459 | 1 | |
| 153 | 1867.168 | 1 | |
| 154 | 1869.689 | 1 | |
| 155 | 1869.693 | 1 | |
| 156 | 1871.676 | 1 | |
| 157 | 1873.174 | 1 | min |
| 158 | 1873.691 | np.nan | |
| 159 | 1873.685 | np.nan | |
+-------+----------+-----------+---------+
isnull = data.iloc[:, 1].isnull()  # True on the NaN rows that separate the intervals
# Number each interval by counting the NaN separators before it, drop the NaN rows
# themselves, then grab the index labels of the max and min value in each interval.
minmax = data.groupby(isnull.cumsum()[~isnull])[0].agg(['idxmax', 'idxmin'])
data.loc[minmax['idxmax'], 'MinMax'] = 'max'
data.loc[minmax['idxmin'], 'MinMax'] = 'min'
data.MinMax = data.MinMax.fillna('')
print(data)
0 1 MinMax
0 1879.289 NaN
1 1879.281 NaN
2 1879.292 1.0
3 1879.295 1.0
4 1879.481 1.0
5 1879.294 1.0
6 1879.268 1.0 min
7 1879.293 1.0
8 1879.277 1.0
9 1879.285 1.0
10 1879.464 1.0
11 1879.475 1.0
12 1879.971 1.0
13 1879.779 1.0
14 1879.986 1.0
15 1880.791 1.0 max
16 1880.290 1.0
17 1879.253 NaN
18 1878.268 NaN
19 1875.730 1.0 min
20 1876.792 1.0
21 1875.977 1.0
22 1876.408 1.0
23 1877.159 1.0
24 1877.187 1.0
25 1883.164 1.0
26 1883.171 1.0
27 1883.495 1.0
28 1883.962 1.0
29 1885.158 1.0
.. ... ... ...
85 1877.687 1.0
86 1878.367 1.0
87 1878.179 1.0
88 1877.696 1.0
89 1877.665 1.0 min
90 1877.667 NaN
91 1878.678 NaN
92 1878.661 1.0 max
93 1878.171 1.0
94 1877.371 1.0
95 1877.359 1.0
96 1878.381 1.0
97 1875.185 1.0 min
98 1875.367 NaN
99 1865.492 NaN
100 1865.495 1.0 min
101 1866.995 1.0
102 1866.672 1.0
103 1867.465 1.0
104 1867.663 1.0
105 1867.186 1.0
106 1867.687 1.0
107 1867.459 1.0
108 1867.168 1.0
109 1869.689 1.0
110 1869.693 1.0
111 1871.676 1.0
112 1873.174 1.0 max
113 1873.691 NaN
114 1873.685 NaN
[115 rows x 3 columns]
data.columns = ['Value', 'Interval']
data['Ingroup'] = data['Interval'].notnull().astype(int)
Use data['Interval'].notnull() to separate the groups...
Use cumsum() to number them with groupno...
Use groupby(groupno)...
Finally, you want something using apply/idxmax/idxmin to label the max/min.
But of course a for-loop, as you suggested, is the non-Pythonic but possibly simpler hack.

How to make a long table wide without aggregation?

I have this table of financial transactions:
PersonID | SeqId | FundId | PortfolioDbu | Date
----------------------------------------------------------
456 | 1 | B | 0.1 | 2012-04-03
456 | 1 | F | 0.5 | 2012-04-03
456 | 1 | H | 0.3 | 2012-04-03
456 | 1 | Z | 0.1 | 2012-04-03
8 | 1 | B | 0.5 | 2012-03-23
8 | 1 | A | 0.5 | 2012-03-23
8 | 2 | C | 0.3 | 2011-03-24
8 | 2 | X | 0.3 | 2011-03-24
8 | 2 | F | 0.4 | 2011-03-24
6001 | 1 | J | 0.5 | 2008-01-01
6001 | 1 | R | 0.5 | 2008-01-01
76 | 1 | A | 0.25 | 2010-09-26
76 | 1 | B | 0.25 | 2010-09-26
76 | 1 | C | 0.25 | 2010-09-26
76 | 1 | D | 0.25 | 2010-09-26
321 | 1 | X | 0.2 | 2012-02-21
321 | 1 | Y | 0.2 | 2012-02-21
321 | 1 | U | 0.2 | 2012-02-21
321 | 1 | P | 0.2 | 2012-02-21
321 | 1 | W | 0.2 | 2012-02-21
456 | 2 | Y | 1 | 2012-11-01
which I need to convert to a "wide" format, like so:
Date | PersonId | SeqId | Fund1 | Fund2 | Fund3 | Fund4 | Fund5 | Dbu1 | Dbu2 | Dbu3 | Dbu4 | Dbu5
----------------------------------------------------------------------------------------------------------
2012-04-03 | 456 | 1 | B | F | H | Z | . | 0.1 | 0.5 | 0.3 | 0.1 | .
2012-03-23 | 8 | 1 | B | A | . | . | . | 0.5 | 0.5 | . | . | .
2012-03-24 | 8 | 2 | C | X | F | . | . | 0.3 | 0.3 | 0.4 | . | .
2008-01-01 | 6001 | 1 | J | R | . | . | . | 0.5 | 0.5 | . | . | .
2010-09-26 | 76 | 1 | A | B | C | D | . | 0.25 | 0.25 | 0.25 | 0.25 | .
2010-02-21 | 321 | 1 | X | Y | U | P | W | 0.2 | 0.2 | 0.2 | 0.2 | 0.2
2012-11-01 | 456 | 2 | Y | . | . | . | . | 1 | . | . | . | .
Is this possible even though I don't want to aggregate the data in any way?
SQL Fiddle
I'm not really good at PIVOT tables, but you can use the following alternative CASE-statement pattern to get the output you're looking for:
WITH T AS (
    SELECT
        personid,
        seqid,
        row_number() OVER (PARTITION BY personid, seqid ORDER BY FundId) AS ROW,
        FundId,
        portfoliodbu,
        date
    FROM
        transactions
)
SELECT
    date,
    personid,
    seqid,
    max(CASE WHEN ROW = 1 THEN fundid END) AS fund1,
    max(CASE WHEN ROW = 2 THEN fundid END) AS fund2,
    max(CASE WHEN ROW = 3 THEN fundid END) AS fund3,
    max(CASE WHEN ROW = 4 THEN fundid END) AS fund4,
    max(CASE WHEN ROW = 5 THEN fundid END) AS fund5,
    max(CASE WHEN ROW = 1 THEN portfoliodbu END) AS dbu1,
    max(CASE WHEN ROW = 2 THEN portfoliodbu END) AS dbu2,
    max(CASE WHEN ROW = 3 THEN portfoliodbu END) AS dbu3,
    max(CASE WHEN ROW = 4 THEN portfoliodbu END) AS dbu4,
    max(CASE WHEN ROW = 5 THEN portfoliodbu END) AS dbu5
FROM
    T
GROUP BY
    date, personid, seqid
Demo: SQL Fiddle
Results:
| DATE | PERSONID | SEQID | FUND1 | FUND2 | FUND3 | FUND4 | FUND5 | DBU1 | DBU2 | DBU3 | DBU4 | DBU5 |
----------------------------------------------------------------------------------------------------------------------------------------------
| January, 01 2008 00:00:00+0000 | 6001 | 1 | J | R | (null) | (null) | (null) | 0.5 | 0.5 | (null) | (null) | (null) |
| September, 26 2010 00:00:00+0000 | 76 | 1 | A | B | C | D | (null) | 0.25 | 0.25 | 0.25 | 0.25 | (null) |
| March, 24 2011 00:00:00+0000 | 8 | 2 | C | F | X | (null) | (null) | 0.3 | 0.4 | 0.3 | (null) | (null) |
| February, 21 2012 00:00:00+0000 | 321 | 1 | P | U | W | X | Y | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 |
| March, 23 2012 00:00:00+0000 | 8 | 1 | A | B | (null) | (null) | (null) | 0.5 | 0.5 | (null) | (null) | (null) |
| April, 03 2012 00:00:00+0000 | 456 | 1 | B | F | H | Z | (null) | 0.1 | 0.5 | 0.3 | 0.1 | (null) |
| November, 01 2012 00:00:00+0000 | 456 | 2 | Y | (null) | (null) | (null) | (null) | 1 | (null) | (null) | (null) | (null) |
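A caveat on the answer above (my note, not part of the original): the ORDER BY FundId inside row_number() numbers the funds alphabetically, which is why the demo shows A before B for person 8, while the desired output keeps the original entry order (B before A). If that order matters, the table needs a column that records it; a hedged sketch with a hypothetical entry_id column:
WITH T AS (
    SELECT
        personid,
        seqid,
        -- 'entry_id' is hypothetical: any column that captures insertion order.
        row_number() OVER (PARTITION BY personid, seqid ORDER BY entry_id) AS ROW,
        FundId,
        portfoliodbu,
        date
    FROM
        transactions
)
SELECT * FROM T;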