How to make a rolling mean in pandas but only for items that have the same id/value?

Columns: datetime | clientid | amounts | *new_column_to_be_implemented* (rolling mean of the previous amounts, but only over rows with the same clientid)
| datetime | clientid | amounts | new column |
| -------- | -------- | ------- | ---------- |
| day 1 | 2 | 50 | (na) |
| day 2 | 2 | 60 | 50 |
| day 3 | 1 | 45 | (na) |
| day 4 | 2 | 45 | 110 |
| day 5 | 3 | 90 | (na) |
| day 6 | 3 | 10 | 90 |
| day 7 | 2 | 10 | 105 |
So, for example, the new column holds the rolling statistic of the last two amounts for the same clientid.
I know it is possible to add a list and append/pop values to remember them, but is there a better way in pandas?

Please make sure to follow the guidelines described in How to make good reproducible pandas examples when asking pandas-related questions; it helps a lot with reproducibility.
The key element of the answer is the pairing of the groupby and rolling methods: groupby groups all the records with the same clientid, and rolling selects the correct number of records for the mean calculation.
import pandas as pd
import numpy as np
# setting up the dataframe
data = [
['day 1', 2, 50],
['day 2', 2, 60],
['day 3', 1, 45],
['day 4', 2, 45],
['day 5', 3, 90],
['day 6', 3, 10],
['day 7', 2, 10]
]
columns = ['date', 'clientid', 'amounts']
df = pd.DataFrame(data=data, columns=columns)
# mean of each client's last two amounts (current row included)
rolling_mean = df.groupby('clientid').rolling(2)['amounts'].mean()
# drop the clientid level so the index lines up with df again
rolling_mean.index = rolling_mean.index.get_level_values(1)
df['client_rolling_mean'] = rolling_mean
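Note that the rolling mean above includes the current row. If the intent is a rolling statistic over only the previous amounts of the same clientid, a shift inside each group is one way to do it. This is a sketch on top of the answer, not part of it; also, the expected numbers in the question (110, 105) match a rolling .sum() of the two previous amounts rather than a mean, so swap .mean() for .sum() if that is what is wanted.
# Sketch: statistic over the two previous amounts per clientid, excluding the current row
df['prev_rolling_mean'] = (
    df.groupby('clientid')['amounts']
      .transform(lambda s: s.shift().rolling(2, min_periods=1).mean())
)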


Regex to get nth occurrence of pattern (Trino)

WITH t(x,y) AS (
VALUES
(1,'[2]'),
(2,'[1, 2]'),
(3,'[2, 1]'),
(4,'[3, 2, 5]'),
(5,'[3, 2, 5, 2, 4]'),
(6,'[3, 2, 2, 0, 4]')
)
--- my wrong answer below
SELECT
REGEXP_EXTRACT(y, '(\d+,\s)?(2)(,\s\d+)?') AS _1st,
REGEXP_EXTRACT(y,'(.*?(2)){1}.*?(\d+,\s(2)(,\s\d+)?)',3) AS _2nd,
REGEXP_EXTRACT(y,'(.*?(2)){2}.*?(\d+,\s(2)(,\s\d+)?)',3) AS _3rd
FROM t
Expected ans:
| x | y | 1st | 2nd | nth |
| - | --------------- | ------- | ------- | ------- |
| 1 | [2] | 2 | | |
| 2 | [1, 2] | 1, 2 | | |
| 3 | [2, 1] | 2, 1 | | |
| 4 | [3, 2, 5] | 3, 2, 5 | | |
| 5 | [3, 2, 5, 2, 4] | 3, 2, 5 | 5, 2, 4 | |
| 6 | [3, 2, 2, 0, 4] | 3, 2, 2 | 2, 2, 0 | |
Need help with the regex for the REGEXP_EXTRACT function in Presto to get the nth occurrence of the number '2' and include the figures before and after it (if any).
Additional info:
The figures in column y are not necessarily single digits.
The order of the numbers is important.
1st, 2nd, 3rd refer to the nth occurrence of the number I am seeking.
I will be looking for a list of numbers, not just 2; 2 is used here for illustration purposes.
Must it be a regular expression?
If you treat the text (VARCHAR) [1,2,3] as an array representation (JSON or the internal Array data type), you have more functions available to solve your task.
See related functions supported by Presto:
JSON functions
Array function
I would recommend casting it to an array of integers: CAST('[1,23,456]' AS ARRAY(INTEGER))
Finding the n-th occurrence
From Array functions, array_position(x, element, instance) → bigint to find the n-th occurrence:
If instance > 0, returns the position of the instance-th occurrence of the element in array x.
If instance < 0, returns the position of the instance-to-last occurrence of the element in array x.
If no matching element instance is found, 0 is returned.
Example:
SELECT CAST('[1,2,23,2,456]' AS ARRAY(INTEGER));
SELECT array_position(CAST('[1,2,23,2,456]' AS ARRAY(INTEGER)), 2, 1); -- found in position 2
Now use the found position to build your slice (relative to it).
Slicing and extracting sub-arrays
either parse it as JSON into a JSON array, then use a JSON path to slice (extract a sub-array) as desired; the array slice operator in JSON path is [start:stop:step]
or cast it as Array and then use slice(x, start, length) → array
Subsets array x starting from index start (or starting from the end if start is negative) with a length of length.
Examples:
SELECT json_extract(json_parse('[1,2,3]'), '$[-2, -1]'); -- the last two elements
SELECT slice(CAST('[1,23,456]' AS ARRAY(INTEGER)), -2, 2); -- [23, 456]
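A sketch putting the pieces together (not part of the answer above; the greatest/least clamping of the window bounds is my own assumption) to get the 2nd occurrence of 2 together with its neighbours:
WITH t(x, y) AS (
    VALUES (5, '[3, 2, 5, 2, 4]')
),
arr AS (
    SELECT x, CAST(y AS ARRAY(INTEGER)) AS a FROM t
),
pos AS (
    SELECT x, a, array_position(a, 2, 2) AS p FROM arr  -- position of the 2nd "2"
)
SELECT
    x,
    CASE WHEN p > 0 THEN
        slice(a,
              greatest(p - 1, 1),                                     -- start one element before
              least(p + 1, cardinality(a)) - greatest(p - 1, 1) + 1)  -- length up to one element after
    END AS around_2nd_2  -- [5, 2, 4]; array_join(..., ', ') turns it back into text
FROM pos;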

Pandas Decile Rank

I just used the pandas qcut function to create a decile ranking, but how do I look at the bounds of each ranking. Basically, how do I know what numbers fall in the range of the ranking of 1 or 2 or 3 etc?
I hope the following Python code with two short examples can help you. For the second example I used the isin method.
import numpy as np
import pandas as pd
df = {'Name': ['Mike', 'Anton', 'Simon', 'Amy',
               'Claudia', 'Peter', 'David', 'Tom'],
      'Score': [42, 63, 75, 97, 61, 30, 80, 13]}
df = pd.DataFrame(df, columns=['Name', 'Score'])
df['decile_rank'] = pd.qcut(df['Score'], 10, labels=False)
print(df)
Output:
Name Score decile_rank
0 Mike 42 2
1 Anton 63 5
2 Simon 75 7
3 Amy 97 9
4 Claudia 61 4
5 Peter 30 1
6 David 80 8
7 Tom 13 0
rank_1 = df[df['decile_rank']==1]
print(rank_1)
Output:
Name Score decile_rank
5 Peter 30 1
rank_1_and_2 = df[df['decile_rank'].isin([1,2])]
print(rank_1_and_2)
Output:
Name Score decile_rank
0 Mike 42 2
5 Peter 30 1
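To actually see the numeric bounds of each decile, which is what the question asks about, qcut can also return the bin edges with retbins=True. This is a small sketch on top of the example above, not part of the original answer:
# labels=False gives the rank; retbins=True additionally returns the 11 bin edges
df['decile_rank'], bins = pd.qcut(df['Score'], 10, labels=False, retbins=True)
print(bins)  # decile i covers the interval (bins[i], bins[i+1]]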

Filtering based on value and creating list in spark dataframe

I am new to spark and I am trying to do the following, using Pyspark:
I have a dataframe with 3 columns, "id", "number1", "number2".
For each value of "id" I have multiple rows and what I want to do is create a list of tuples with all the rows that correspond to each id.
Eg, for the following dataframe
id | number1 | number2 |
a | 1 | 1 |
a | 2 | 2 |
b | 3 | 3 |
b | 4 | 4 |
the desired outcome would be 2 lists as such:
[(1, 1), (2, 2)]
and
[(3, 3), (4, 4)]
I'm not sure how to approach this, since I'm a newbie. I have managed to get a list of the distinct ids doing the following
distinct_ids = [x for x in df.select('id').distinct().collect()]
In pandas, which I'm more familiar with, I would now loop through the dataframe for each distinct id and gather all its rows, but I'm sure this is far from optimal.
Can you give me any ideas? Groupby comes to mind, but I'm not sure how to approach it.
You can use groupby and aggregate using collect_list and array:
import pyspark.sql.functions as F
df2 = df.groupBy('id').agg(F.collect_list(F.array('number1', 'number2')).alias('number'))
df2.show()
+---+----------------+
| id| number|
+---+----------------+
| b|[[3, 3], [4, 4]]|
| a|[[1, 1], [2, 2]]|
+---+----------------+
And if you want to get back a list of tuples,
result = [[tuple(j) for j in i] for i in [r[0] for r in df2.select('number').orderBy('number').collect()]]
which gives result as [[(1, 1), (2, 2)], [(3, 3), (4, 4)]]
If you want a NumPy array, you can do
import numpy as np
result = np.array([r[0] for r in df2.select('number').collect()])
which gives
array([[[3, 3],
[4, 4]],
[[1, 1],
[2, 2]]])

Update row index when all columns of the next row are NaN in a Pandas DataFrame

I have a Pandas DataFrame extracted from a PDF with tabula-py.
The PDF is like this:
+--------------+--------+-------+
| name | letter | value |
+--------------+--------+-------+
| A short name | a | 1 |
+-------------------------------+
| Another | b | 2 |
+-------------------------------+
| A very large | c | 3 |
| name | | |
+-------------------------------+
| other one | d | 4 |
+-------------------------------+
| My name is | e | 5 |
| big | | |
+--------------+--------+-------+
As you can see, A very large name has a line break and, as the original PDF does not have borders, a row with ['name', NaN, NaN] and another with ['A very large', 'c', 3] are created in the DataFrame, when I want only a single one with the content ['A very large name', 'c', 3].
The same happens with My name is big.
As this happens for several rows, what I'm trying to achieve is to concatenate the content of the name cell with the previous one when the rest of the cells in the row are NaN, and then delete the NaN rows.
But any other strategy that obtain the same result is welcome.
import pandas as pd
import numpy as np
data = {
"name": ["A short name", "Another", "A very large", "name", "other one", "My name is", "big"],
"letter": ["a", "b", "c", np.NaN, "d", "e", np.NaN],
"value": [1, 2, 3, np.NaN, 4, 5, np.NaN],
}
df = pd.DataFrame(data)
data_expected = {
"name": ["A short name", "Another", "A very large name", "other one", "My name is big"],
"letter": ["a", "b", "c", "d", "e"],
"value": [1, 2, 3, 4, 5],
}
df_expected = pd.DataFrame(data_expected)
I'm trying code like this, but it is not working:
# Doesn't work and isn't very `pandastonic`
nan_indexes = df[df.iloc[:, 1:].isna().all(axis='columns')].index
df.loc[nan_indexes - 1, "name"] = df.loc[nan_indexes - 1, "name"].str.cat(df.loc[nan_indexes, "name"], ' ')
# remove NaN rows
You can try groupby.agg with join or first, depending on the column. The groups are created by checking where the letter and value columns are notna and taking the cumsum.
print(df.groupby(df[['letter', 'value']].notna().any(axis=1).cumsum())
        .agg({'name': ' '.join, 'letter': 'first', 'value': 'first'}))
name letter value
1 A short name a 1.0
2 Another b 2.0
3 A very large name c 3.0
4 other one d 4.0
5 My name is big e 5.0
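If the goal is to reproduce df_expected exactly, two small finishing touches (a sketch, not part of the answer above) are resetting the index and casting value back to int:
out = (df.groupby(df[['letter', 'value']].notna().any(axis=1).cumsum())
         .agg({'name': ' '.join, 'letter': 'first', 'value': 'first'})
         .reset_index(drop=True))
out['value'] = out['value'].astype(int)  # value became float because of the NaN rows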

How to calculate power consumption from power records?

I have table which contains power values (kW) for devices. Values are read from each device once a minute and inserted into table with timestamp. What I need to do is calculate power consumption (kWh) for given time span and return 10 most power consuming devices. Right now I query results for given time span and do calculation in backend looping all records. This works fine with small amount of devices and with short time span, but in real use case I could have thousands of devices and long time span.
So my question is how could I do this all in PostgreSQL 9.4.4 so that my query would return only 10 most power consuming (device_id, power_consumption) pairs?
Example table:
CREATE TABLE measurements (
id serial primary key,
device_id integer,
power real,
created_at timestamp
);
Simple data example:
| id | device_id | power | created_at |
|----|-----------|-------|--------------------------|
| 1 | 1 | 10 | August, 26 2015 08:23:25 |
| 2 | 1 | 13 | August, 26 2015 08:24:25 |
| 3 | 1 | 12 | August, 26 2015 08:25:25 |
| 4 | 2 | 103 | August, 26 2015 08:23:25 |
| 5 | 2 | 134 | August, 26 2015 08:24:25 |
| 6 | 2 | 2 | August, 26 2015 08:25:25 |
| 7 | 3 | 10 | August, 26 2015 08:23:25 |
| 8 | 3 | 13 | August, 26 2015 08:24:25 |
| 9 | 3 | 20 | August, 26 2015 08:25:25 |
Wanted results for query:
| id | device_id | power_consumption |
|----|-----------|-------------------|
| 1 | 1 | 24.0 |
| 2 | 2 | 186.5 |
| 3 | 3 | 28.0 |
Simplified example (created_at in hours) of how I calculate the kWh value:
data = [
[
{ 'id': 1, 'device_id': 1, 'power': 10.0, 'created_at': 0 },
{ 'id': 2, 'device_id': 1, 'power': 13.0, 'created_at': 1 },
{ 'id': 3, 'device_id': 1, 'power': 12.0, 'created_at': 2 }
],
[
{ 'id': 4, 'device_id': 2, 'power': 103.0, 'created_at': 0 },
{ 'id': 5, 'device_id': 2, 'power': 134.0, 'created_at': 1 },
{ 'id': 6, 'device_id': 2, 'power': 2.0, 'created_at': 2 }
],
[
{ 'id': 7, 'device_id': 3, 'power': 10.0, 'created_at': 0 },
{ 'id': 8, 'device_id': 3, 'power': 13.0, 'created_at': 1 },
{ 'id': 9, 'device_id': 3, 'power': 20.0, 'created_at': 2 }
]
]
# device_id: power_consumption
results = { 1: 0, 2: 0, 3: 0 }
for d in data:
    for i in range(0, len(d)):
        if i < len(d)-1:
            # Area between two records gives us kWh
            # X-axis is time (h)
            # Y-axis is power (kW)
            x1 = d[i]['created_at']
            x2 = d[i+1]['created_at']
            y1 = d[i]['power']
            y2 = d[i+1]['power']
            results[d[i]['device_id']] += ((x2-x1)*(y2+y1))/2
print(results)
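Worked out for device 1, this is ((1-0)*(10+13))/2 + ((2-1)*(13+12))/2 = 11.5 + 12.5 = 24.0, matching the first row of the wanted results; device 2 gives 118.5 + 68.0 = 186.5 and device 3 gives 11.5 + 16.5 = 28.0.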
EDIT: Check this to see how I ended up solving this.
Some of the elements that you'll need in order to do this are:
Sum() aggregations, to calculate the total of a number of records
Lag()/Lead() functions, to calculate for a given record what the "previous" or "next" record's values were.
So where for a given row you can get the current created_at and power records, in SQL you'd probably use a Lead() windowing function to get the created_at and power records for the record for the same device id that has the next highest value for created_at.
Docs for Lead() are here: http://www.postgresql.org/docs/9.4/static/functions-window.html
When for each row you have calculated the power consumption by reference to the "next" record, you can use a Sum() to aggregate up all of the calculated powers for that one device.
When you have calculated the power per device, you can use ORDER BY and LIMIT to select the top n power-consuming devices.
Steps to follow, if you're not confident enough to plunge in and just write the final SQL -- after each step make sure you have SQL that you understand, and which returns just the data you need:
Start small, by selecting the data rows that you want.
Work out the Lead() function, defining the appropriate partition and order clauses to get the next row.
Add the calculation of power per row.
Define the Sum() function, and group by the device id.
Add the ORDER BY and LIMIT clauses.
If you have trouble with any one of these steps, they would each make a decent StackOverflow question.
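As a rough sketch of the shape those steps produce (this is not the accepted code; the self-answer below uses lag() instead of lead(), and the table name and time range here just follow the example above):
SELECT device_id,
       sum(
           extract(epoch FROM next_created_at - created_at) / 3600.0  -- hours between samples
           * (power + next_power) / 2.0                               -- trapezoid: average power
       ) AS power_consumption
FROM (
    SELECT device_id,
           power,
           created_at,
           lead(power)      OVER w AS next_power,
           lead(created_at) OVER w AS next_created_at
    FROM measurements
    WHERE created_at BETWEEN '2015-08-26 08:00:00' AND '2015-08-26 09:00:00'
    WINDOW w AS (PARTITION BY device_id ORDER BY created_at)
) AS per_row
WHERE next_created_at IS NOT NULL
GROUP BY device_id
ORDER BY power_consumption DESC
LIMIT 10;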
If someone happens to wonder about the same thing, here is how I solved it.
I followed the instructions by David and made this:
SELECT
t.device_id,
sum(len_y*(extract(epoch from date_trunc('milliseconds', len_x)))/7200) AS total
FROM (
SELECT
m.id,
m.device_id,
m.power,
m.created_at,
m.power+lag(m.power) OVER (
PARTITION BY device_id
ORDER BY m.created_at
) AS len_y,
m.created_at-lag(m.created_at) OVER (
PARTITION BY device_id
ORDER BY m.created_at
) AS len_x
FROM
measurements AS m
WHERE m.created_at BETWEEN '2015-08-26 13:39:57.834674'::timestamp
AND '2015-08-26 13:43:57.834674'::timestamp
) AS t
GROUP BY t.device_id
ORDER BY total DESC
LIMIT 10;