I have a table with millions of rows, and I'm having trouble building reports on the data.
This is the table I have:
"channel_id" "datetime" "parameter" "raw"
10 "2022-12-02 16:16:00" "Günlük Debi" 3423.89
9 "2022-12-02 16:16:00" "KABIN NEM" 36.27
8 "2022-12-02 16:16:00" "KABIN SICAKLIK" 20.18
7 "2022-12-02 16:16:00" "AKM" 4.54
6 "2022-12-02 16:16:00" "KOi" 24.4
5 "2022-12-02 16:16:00" "AkisHizi" 0.59
4 "2022-12-02 16:16:00" "Sicaklik" 13.53
3 "2022-12-02 16:16:00" "Debi" 3.04
2 "2022-12-02 16:16:00" "CozunmusOksijen" 5.05
1 "2022-12-02 16:16:00" "Iletkenlik" 1125.64
0 "2022-12-02 16:16:00" "pH" 7.09
9 "2022-12-02 16:17:00" "KABIN NEM" 20.22
8 "2022-12-02 16:17:00" "KABIN SICAKLIK" 6.49
7 "2022-12-02 16:17:00" "AKM" 6.36
6 "2022-12-02 16:17:00" "KOi" 30.12
5 "2022-12-02 16:17:00" "AkisHizi" 0.82
4 "2022-12-02 16:17:00" "Sicaklik" 20.36
3 "2022-12-02 16:17:00" "Debi" 16.15
2 "2022-12-02 16:17:00" "CozunmusOksijen" 2.45
1 "2022-12-02 16:17:00" "Iletkenlik" 1570.75
0 "2022-12-02 16:17:00" "pH" 7.48
7 "2022-12-02 16:13:00" "AKM" 16.02
6 "2022-12-02 16:13:00" "KOi" 25.98
5 "2022-12-02 16:13:00" "AkisHizi" 0.83
4 "2022-12-02 16:13:00" "Sicaklik" 17.87
3 "2022-12-02 16:13:00" "Debi" 27.85
2 "2022-12-02 16:13:00" "CozunmusOksijen" 5.91
1 "2022-12-02 16:13:00" "Iletkenlik" 2221.36
0 "2022-12-02 16:13:00" "pH" 7.25
9 "2022-12-02 16:14:00" "KABIN NEM" 62.28
8 "2022-12-02 16:14:00" "KABIN SICAKLIK" 13.99
7 "2022-12-02 16:14:00" "AKM" 6.02
6 "2022-12-02 16:14:00" "KOi" 21.36
5 "2022-12-02 16:14:00" "AkisHizi" 0.56
4 "2022-12-02 16:14:00" "Sicaklik" 21.6
3 "2022-12-02 16:14:00" "Debi" 10.35
2 "2022-12-02 16:14:00" "CozunmusOksijen" 0.32
1 "2022-12-02 16:14:00" "Iletkenlik" 7325.54
0 "2022-12-02 16:14:00" "pH" 7.57
10 "2022-12-02 16:15:00" "Günlük Debi" 5363.51
9 "2022-12-02 16:15:00" "KABIN NEM" 34.65
8 "2022-12-02 16:15:00" "KABIN SICAKLIK" 20.25
7 "2022-12-02 16:15:00" "AKM" 6.52
6 "2022-12-02 16:15:00" "KOi" 12.71
5 "2022-12-02 16:15:00" "AkisHizi" 0.54
4 "2022-12-02 16:15:00" "Sicaklik" 14.41
3 "2022-12-02 16:15:00" "Debi" 5.09
2 "2022-12-02 16:15:00" "CozunmusOksijen" 5.86
1 "2022-12-02 16:15:00" "Iletkenlik" 1933.55
0 "2022-12-02 16:15:00" "pH" 7.24
7 "2022-12-02 16:13:00" "AKM" 38.64
6 "2022-12-02 16:13:00" "KOi" 26.17
5 "2022-12-02 16:13:00" "AkisHizi" 0.52
4 "2022-12-02 16:13:00" "Sicaklik" 12.46
3 "2022-12-02 16:13:00" "Debi" 1.32
2 "2022-12-02 16:13:00" "CozunmusOksijen" 9.06
1 "2022-12-02 16:13:00" "Iletkenlik" 2566.5
0 "2022-12-02 16:13:00" "pH" 7.33
9 "2022-12-02 16:14:00" "KABIN NEM" 21.71
8 "2022-12-02 16:14:00" "KABIN SICAKLIK" 16.5
7 "2022-12-02 16:14:00" "AKM" 12.56
6 "2022-12-02 16:14:00" "KOi" 18.64
5 "2022-12-02 16:14:00" "AkisHizi" 0.63
4 "2022-12-02 16:14:00" "Sicaklik" 12.56
3 "2022-12-02 16:14:00" "Debi" 4.84
2 "2022-12-02 16:14:00" "CozunmusOksijen" 2.15
1 "2022-12-02 16:14:00" "Iletkenlik" 621.05
0 "2022-12-02 16:14:00" "pH" 5.16
9 "2022-12-02 16:14:00" "KABIN NEM" 20.65
8 "2022-12-02 16:14:00" "KABIN SICAKLIK" 21.32
7 "2022-12-02 16:14:00" "AKM" 9.28
6 "2022-12-02 16:14:00" "KOi" 23.24
5 "2022-12-02 16:14:00" "AkisHizi" 0.63
4 "2022-12-02 16:14:00" "Sicaklik" 12.79
3 "2022-12-02 16:14:00" "Debi" 3.09
2 "2022-12-02 16:14:00" "CozunmusOksijen" 2.53
1 "2022-12-02 16:14:00" "Iletkenlik" 1473.54
0 "2022-12-02 16:14:00" "pH" 7.69
10 "2022-12-02 16:14:00" "Günlük Debi" 8453.81
9 "2022-12-02 16:14:00" "KABIN NEM" 32.88
8 "2022-12-02 16:14:00" "KABIN SICAKLIK" 24.88
7 "2022-12-02 16:14:00" "AKM" 6.16
6 "2022-12-02 16:14:00" "KOi" 51.93
5 "2022-12-02 16:14:00" "AkisHizi" 0.54
4 "2022-12-02 16:14:00" "Sicaklik" 17.91
3 "2022-12-02 16:14:00" "Debi" 9.3
2 "2022-12-02 16:14:00" "CozunmusOksijen" 2.69
1 "2022-12-02 16:14:00" "Iletkenlik" 2318.17
0 "2022-12-02 16:14:00" "pH" 7.27
10 "2022-12-02 16:14:00" "Günlük Debi" 3342.46
9 "2022-12-02 16:14:00" "KABIN NEM" 57.81
8 "2022-12-02 16:14:00" "KABIN SICAKLIK" 42.21
7 "2022-12-02 16:14:00" "AKM" 14.7
6 "2022-12-02 16:14:00" "KOi" 38.02
5 "2022-12-02 16:14:00" "AkisHizi" 0.61
4 "2022-12-02 16:14:00" "Sicaklik" 19.88
3 "2022-12-02 16:14:00" "Debi" 3.39
2 "2022-12-02 16:14:00" "CozunmusOksijen" 3.94
1 "2022-12-02 16:14:00" "Iletkenlik" 901.02
0 "2022-12-02 16:14:00" "pH" 7.33
The result I want to achieve is like this:
datetime values
2022-12-02 16:16:00 [{..PULSAR,Günlük Debi,3423.89},{...GENTEK...}...]
2022-12-02 16:17:00 [{..Pi,pH,7.09},{...GENTEK...}...]
.
.
.
I want to group the data recorded on the same date into one row.
How can I achieve this? Is there a way?
I pulled the data by time period and then grouped it with a Python for loop, but this was very slow over large time intervals.
Assuming you meant to group rows with the same value of the datetime column, you can do this:
select datetime,
array_to_json(array_agg(json_build_object(parameter, raw))) as parameters
from a_table
group by 1
order by 1;
Result:
datetime |parameters |
-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2022-12-02 16:13:00.000|[{"AKM" : 16.02},{"KOi" : 25.98},{"AkisHizi" : 0.83},{"Sicaklik" : 17.87},{"Debi" : 27.85},{"CozunmusOksijen" : 5.91},{"Iletkenlik" : 2221.36},{"pH" : 7.25},{"AKM" : 38.64},{"KOi" : 26.17},{"AkisHizi" : 0.52},{"Sicaklik" : 12.46},{"Debi" : 1.32},{"Cozunmu|
2022-12-02 16:14:00.000|[{"KABIN NEM" : 62.28},{"KABIN SICAKLIK" : 13.99},{"AKM" : 6.02},{"KOi" : 21.36},{"AkisHizi" : 0.56},{"Sicaklik" : 21.6},{"Debi" : 10.35},{"CozunmusOksijen" : 0.32},{"Iletkenlik" : 7325.54},{"pH" : 7.57},{"KABIN NEM" : 21.71},{"KABIN SICAKLIK" : 16.5},{"A|
2022-12-02 16:15:00.000|[{"Günlük Debi" : 5363.51},{"KABIN NEM" : 34.65},{"KABIN SICAKLIK" : 20.25},{"AKM" : 6.52},{"KOi" : 12.71},{"AkisHizi" : 0.54},{"Sicaklik" : 14.41},{"Debi" : 5.09},{"CozunmusOksijen" : 5.86},{"Iletkenlik" : 1933.55},{"pH" : 7.24}] |
2022-12-02 16:16:00.000|[{"Günlük Debi" : 3423.89},{"KABIN NEM" : 36.27},{"KABIN SICAKLIK" : 20.18},{"AKM" : 4.54},{"KOi" : 24.4},{"AkisHizi" : 0.59},{"Sicaklik" : 13.53},{"Debi" : 3.04},{"CozunmusOksijen" : 5.05},{"Iletkenlik" : 1125.64},{"pH" : 7.09}] |
2022-12-02 16:17:00.000|[{"KABIN NEM" : 20.22},{"KABIN SICAKLIK" : 6.49},{"AKM" : 6.36},{"KOi" : 30.12},{"AkisHizi" : 0.82},{"Sicaklik" : 20.36},{"Debi" : 16.15},{"CozunmusOksijen" : 2.45},{"Iletkenlik" : 1570.75},{"pH" : 7.48}] |
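If the report still ends up in Python, the grouping can stay on the database side and pandas can read the aggregated rows directly, which avoids the slow Python for loop. A minimal sketch, assuming a SQLAlchemy engine (the connection string below is hypothetical) and that the table is called a_table as in the query above:
import pandas as pd
from sqlalchemy import create_engine

# hypothetical connection string; replace with your own
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/mydb")

query = """
    SELECT datetime,
           array_to_json(array_agg(json_build_object(parameter, raw))) AS parameters
    FROM a_table
    GROUP BY 1
    ORDER BY 1
"""

# one row per datetime, with the JSON array already built by PostgreSQL
report = pd.read_sql(query, engine)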
The following list
[[22415, 7, Timestamp('2022-02-15 00:00:00'), 'KEY', nan], [22415, 7, Timestamp('2022-02-24 00:00:00'), 'MELOXICA', nan], [22415, 7, Timestamp('2022-10-11 00:00:00'), 'CEPFR', 12.0], [22415, 7, Timestamp('2022-10-11 00:00:00'), 'MELOXICA', nan], [25302, 8, Timestamp('2022-06-05 00:00:00'), 'TOX FL', 11.0], [25302, 8, Timestamp('2022-06-05 00:00:00'), 'FLUNIX', nan], [25302, 8, Timestamp('2022-06-07 00:00:00'), 'FLUNIX', nan], [25302, 8, Timestamp('2022-07-07 00:00:00'), 'MAS', nan], [25302, 8, Timestamp('2022-07-07 00:00:00'), 'FLUNIX', nan], [26662, 8, Timestamp('2022-07-08 00:00:00'), 'FR', 12.0], [26662, 8, Timestamp('2022-07-08 00:00:00'), 'FLUNIX', nan], [26662, 8, Timestamp('2022-07-17 00:00:00'), 'SFFR', 12.0], [26662, 8, Timestamp('2022-07-17 00:00:00'), 'MELOXICA', nan]]
Translates to the following dataframe example
ID LACT Date Remark QUARTER
2105 22415 7 2022-02-15 KEY NaN
2106 22415 7 2022-02-24 MELOXICA NaN
4 22415 7 2022-10-11 CEPFR 12.0
2107 22415 7 2022-10-11 MELOXICA NaN
9 25302 8 2022-06-05 TOX FL 11.0
2116 25302 8 2022-06-05 FLUNIX NaN
2117 25302 8 2022-06-07 FLUNIX NaN
10 25302 8 2022-07-07 MAS NaN
2118 25302 8 2022-07-07 FLUNIX NaN
14 26662 8 2022-07-08 FR 12.0
2125 26662 8 2022-07-08 FLUNIX NaN
15 26662 8 2022-07-17 SFFR 12.0
2126 26662 8 2022-07-17 MELOXICA NaN
I would like to forward and backward fill "QUARTER" when it is missing, there is a value for the same ID and LACT, and the interval between the "Date" values is < 7 days.
I have used bfill and ffill with groupby on other data when there was no date constraint (roughly the sketch below).
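A sketch of that unconstrained version, assuming the column names shown above:
# fill within each ID/LACT group, ignoring how far apart the dates are
df['QUARTER'] = df.groupby(['ID', 'LACT'])['QUARTER'].transform(lambda s: s.ffill().bfill())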
The output I am looking for in this example is:
ID LACT Date Remark QUARTER
2105 22415 7 2022-02-15 KEY NaN
2106 22415 7 2022-02-24 MELOXICA NaN
4 22415 7 2022-10-11 CEPFR 12.0
2107 22415 7 2022-10-11 MELOXICA 12.0
9 25302 8 2022-06-05 TOX FL 11.0
2116 25302 8 2022-06-05 FLUNIX 11.0
2117 25302 8 2022-06-07 FLUNIX 11.0
10 25302 8 2022-07-07 MAS 11.0
2118 25302 8 2022-07-07 FLUNIX 11.0
14 26662 8 2022-07-08 FR 12.0
2125 26662 8 2022-07-08 FLUNIX 12.0
15 26662 8 2022-07-17 SFFR 12.0
2126 26662 8 2022-07-17 MELOXICA 12.0
The source dataset is large, with varied intervals between dates for the same ID and LACT.
I'd appreciate any ideas on how to fill NaN values based on the ID/LACT grouping within the 6-day date constraint.
Thanks
Let's group the dataframe by ID and LACT and apply a custom function that forward and back fills the values within 7-day bins anchored at each group's first date (rows that land in a bin with no known QUARTER, such as the 2022-07-07 ones here, stay NaN):
def pad(grp):
    # bin each ID/LACT group into 7-day windows starting at its first date
    g = pd.Grouper(key='Date', freq='7D', origin='start')
    # fill QUARTER forward, then backward, inside each window
    return grp.groupby(g)['QUARTER'].apply(lambda s: s.ffill().bfill())

df['QUARTER'] = df.groupby(['ID', 'LACT'], group_keys=False).apply(pad)
Result
ID LACT Date Remark QUARTER
0 22415 7 2022-02-15 KEY NaN
1 22415 7 2022-02-24 MELOXICA NaN
2 22415 7 2022-10-11 CEPFR 12.0
3 22415 7 2022-10-11 MELOXICA 12.0
4 25302 8 2022-06-05 TOX FL 11.0
5 25302 8 2022-06-05 FLUNIX 11.0
6 25302 8 2022-06-07 FLUNIX 11.0
7 25302 8 2022-07-07 MAS NaN
8 25302 8 2022-07-07 FLUNIX NaN
9 26662 8 2022-07-08 FR 12.0
10 26662 8 2022-07-08 FLUNIX 12.0
11 26662 8 2022-07-17 SFFR 12.0
12 26662 8 2022-07-17 MELOXICA 12.0
Use DataFrameGroupBy.diff to test the intervals between dates; whenever the gap exceeds 6 days, start a new helper group, then forward and back fill the missing QUARTER values per ID, LACT and helper group with ffill and bfill:
# if necessary, convert to datetimes
df['Date'] = pd.to_datetime(df['Date'])
# if necessary, sort by date per ID and LACT
df = df.sort_values(['ID','LACT','Date'])
# start a new helper group whenever the gap to the previous date within ID/LACT exceeds 6 days
groups = df.groupby(['ID','LACT'])['Date'].diff().dt.days.fillna(0).gt(6).cumsum()
f = lambda x: x.ffill().bfill()
df['QUARTER'] = df.groupby(['ID','LACT', groups])['QUARTER'].transform(f)
print (df)
ID LACT Date Remark QUARTER
2105 22415 7 2022-02-15 KEY NaN
2106 22415 7 2022-02-24 MELOXICA NaN
4 22415 7 2022-10-11 CEPFR 12.0
2107 22415 7 2022-10-11 MELOXICA 12.0
9 25302 8 2022-06-05 TOX FL 11.0
2116 25302 8 2022-06-05 FLUNIX 11.0
2117 25302 8 2022-06-07 FLUNIX 11.0
10 25302 8 2022-07-07 MAS NaN
2118 25302 8 2022-07-07 FLUNIX NaN
14 26662 8 2022-07-08 FR 12.0
2125 26662 8 2022-07-08 FLUNIX 12.0
15 26662 8 2022-07-17 SFFR 12.0
2126 26662 8 2022-07-17 MELOXICA 12.0
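For reference, the helper groups just number consecutive runs of rows whose gap to the previous date (within the same ID and LACT) is at most 6 days; printing them next to Date under a throwaway column name (grp here) makes the grouping easy to inspect:
print(df.assign(grp=groups)[['ID', 'LACT', 'Date', 'grp']])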
The ID represents levels of the same thing, which means the dataset has many duplicate values in each sample. I want to keep the longest ID value, as it contains the most information.
df_test=pd.DataFrame({'ID':[
"k__",
"k__|p__|c__|o__",
"k__|p__|c__|o__|f__",
"k__|p__|c__|o__|f__|g_",
"k__|p__|c__|o__|f__|g_|s__",
"k__|p__|c__|o__|f__|g_|s__|a"],
'sample_1':[95,3.64,3.64,3.1,3.1,3.1],
'sample_2':[93,2.45,2.45,4.5,4.5,4.5],
'sample_3':[93,2.45,2.45,4.5,4.5,7.5]})
ID sample_1 sample_2 sample_3
0 k__ 95.00 93.00 93.00
1 k__|p__|c__|o__ 3.64 2.45 2.45
2 k__|p__|c__|o__|f__ 3.64 2.45 2.45
3 k__|p__|c__|o__|f__|g_ 3.10 4.50 4.50
4 k__|p__|c__|o__|f__|g_|s__ 3.10 4.50 4.50
5 k__|p__|c__|o__|f__|g_|s__|a 3.10 4.50 7.50
How I was handling this is to drop the duplicates, keep the last occurrence of the duplicate (which contains the most data in the ID column) and subset by sample:
sample_cols = [col for col in df_test.columns if 'sample' in col]
df_test.drop_duplicates(subset=sample_cols, keep='last')
ID sample_1 sample_2 sample_3
0 k__ 95.00 93.00 93.00
2 k__|p__|c__|o__|f__ 3.64 2.45 2.45
4 k__|p__|c__|o__|f__|g_|s__ 3.10 4.50 4.50
5 k__|p__|c__|o__|f__|g_|s__|a 3.10 4.50 7.50
What is happening, though, at index 4 and 5 for sample_1 and sample_2 is that duplicate values remain whenever another sample column contains a different value.
Is there a way in pandas to check for duplicate values along axis 0 and replace all but the last occurrence with 0:
ID sample_1 sample_2 sample_3
0 k__ 95.00 93.00 93.00
2 k__|p__|c__|o__|f__ 3.64 2.45 2.45
4 k__|p__|c__|o__|f__|g_|s__ 0 0 4.50
5 k__|p__|c__|o__|f__|g_|s__|a 3.10 4.50 7.50
I used df.duplicated (see the pandas documentation on DataFrame.duplicated).
First, remove the duplicates, keeping the last row (works the same as your code, just as a one-liner):
df_test = df_test[~df_test.iloc[:, 1:].duplicated(keep='last')]
df_test
ID sample_1 sample_2 sample_3
0 k__ 95.00 93.00 93.00
2 k__|p__|c__|o__|f__ 3.64 2.45 2.45
4 k__|p__|c__|o__|f__|g_|s__ 3.10 4.50 4.50
5 k__|p__|c__|o__|f__|g_|s__|a 3.10 4.50 7.50
Then for the replacement with zero:
for sample in df_test.iloc[:, 1:]:
    df_test.loc[df_test[sample].duplicated(keep='last'), sample] = 0

df_test
ID sample_1 sample_2 sample_3
0 k__ 95.00 93.00 93.00
2 k__|p__|c__|o__|f__ 3.64 2.45 2.45
4 k__|p__|c__|o__|f__|g_|s__ 0.00 0.00 4.50
5 k__|p__|c__|o__|f__|g_|s__|a 3.10 4.50 7.50
It does come with a warning, which I was not able to avoid, but it works as intended; a possible workaround is sketched below.
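The warning is most likely pandas' SettingWithCopyWarning (an assumption, since the message is not shown here): the filtered df_test can be a view of the original frame, so taking an explicit copy before assigning usually silences it:
# take an explicit copy so the later .loc assignment does not touch a view
df_test = df_test[~df_test.iloc[:, 1:].duplicated(keep='last')].copy()
for sample in df_test.columns[1:]:
    df_test.loc[df_test[sample].duplicated(keep='last'), sample] = 0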
Dear friends, I want to transpose the following dataframe into a single column. I can't figure out a way to transform it, so your help is welcome! I tried pivot_table, but so far no success.
X 0.00 1.25 1.75 2.25 2.99 3.25
X 3.99 4.50 4.75 5.25 5.50 6.00
X 6.25 6.50 6.75 7.50 8.24 9.00
X 9.50 9.75 10.25 10.50 10.75 11.25
X 11.50 11.75 12.00 12.25 12.49 12.75
X 13.25 13.99 14.25 14.49 14.99 15.50
and it should look like this
X
0.00
1.25
1.75
2.25
2.99
3.25
3.99
4.5
4.75
5.25
5.50
6.00
6.25
etc..
This will do it; df.columns[0] is used because I don't know what your headers are:
df = pd.DataFrame({'X': df.set_index(df.columns[0]).stack().reset_index(drop=True)})
df
X
0 0.00
1 1.25
2 1.75
3 2.25
4 2.99
5 3.25
6 3.99
7 4.50
8 4.75
9 5.25
10 5.50
11 6.00
12 6.25
13 6.50
14 6.75
15 7.50
16 8.24
17 9.00
18 9.50
19 9.75
20 10.25
21 10.50
22 10.75
23 11.25
24 11.50
25 11.75
26 12.00
27 12.25
28 12.49
29 12.75
30 13.25
31 13.99
32 14.25
33 14.49
34 14.99
35 15.50
Thank you so much! A follow-up question: is it also possible to stack the df into two columns, X and Y? This is the data set:
1 2 3 4 5 6 7
X 0.00 1.25 1.75 2.25 2.99 3.25
Y -1.08 -1.07 -1.07 -1.00 -0.81 -0.73
X 3.99 4.50 4.75 5.25 5.50 6.00
Y -0.37 -0.20 -0.15 -0.17 -0.15 -0.16
X 6.25 6.50 6.75 7.50 8.24 9.00
Y -0.17 -0.18 -0.24 -0.58 -0.93 -1.24
X 9.50 9.75 10.25 10.50 10.75 11.25
Y -1.38 -1.42 -1.51 -1.57 -1.64 -1.75
X 11.50 11.75 12.00 12.25 12.49 12.75
Y -1.89 -2.00 -2.00 -2.04 -2.04 -2.10
X 13.25 13.99 14.25 14.49 14.99 15.50
Y -2.08 -2.13 -2.18 -2.18 -2.27 -2.46
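No answer to the follow-up appears here; one way to do it (a sketch, assuming the frame's first column holds the alternating X/Y row labels and the remaining columns hold the values):
# select the X rows and the Y rows separately, then flatten each block row by row
x_vals = df.loc[df.iloc[:, 0].eq('X'), df.columns[1:]].to_numpy().ravel()
y_vals = df.loc[df.iloc[:, 0].eq('Y'), df.columns[1:]].to_numpy().ravel()
df_xy = pd.DataFrame({'X': x_vals, 'Y': y_vals})
Each X stays paired with the Y directly below it, provided every X row has a matching Y row of the same length.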
I have this table 'meteorecords' with date, temperature, rh and the meteo station which made the record.
rerowid date temp rh meteostid
1 2019-09-9 28.8 55.6 AITNIA2
2 2019-09-10 30.3 51.3 AITNIA2
3 2019-09-11 28.6 49.0 AITNIA2
4 2019-09-12 26.7 51.9 AITNIA2
5 2019-09-13 25.3 48.1 AITNIA2
6 2019-09-14 25.3 38.5 AITNIA2
7 2019-09-15 25.0 42.2 AITNIA2
8 2019-09-16 24.1 52.1 AITNIA2
9 2019-09-17 23.3 65.2 AITNIA2
10 2019-09-18 22.7 72.2 AITNIA2
11 2019-09-19 23.4 73.9 AITNIA2
12 2019-09-20 23.1 76.7 AITNIA2
13 2019-09-21 22.5 60.3 AITNIA2
14 2019-09-22 20.9 61.6 AITNIA2
15 2019-09-23 21.9 73.9 AITNIA2
16 2019-09-24 23.2 79.6 AITNIA2
17 2019-09-25 21.8 73.6 AITNIA2
18 2019-09-26 22.2 77.6 AITNIA2
19 2019-09-27 22.9 77.1 AITNIA2
20 2019-09-28 22.8 68.4 AITNIA2
21 2019-09-29 22.6 75.5 AITNIA2
...........................
I want to select all the fields plus the average temperature of the last 3 days.
I'm using postgresql because I have some geometric and spatial data in the db.
I tried this with no luck:
SELECT rerowid,redate,retemp,rerh,meteostid,
(SELECT AVG(retemp)
FROM meteorecords m
WHERE meteostid = m.meteostid AND m.redate BETWEEN redate-2 AND redate)
FROM meteorecords
which returns a result like this:
rerowid date temp rh meteostid AVG_Last_3_Days
1 2019-09-09 28.8 55.6 AITNIA2 22.2824
2 2019-09-10 30.3 51.3 AITNIA2 22.2824
3 2019-09-11 28.6 49.0 AITNIA2 22.2824
4 2019-09-12 26.7 51.9 AITNIA2 22.2824
5 2019-09-13 25.3 48.1 AITNIA2 22.2824
6 2019-09-14 25.3 38.5 AITNIA2 22.2824
7 2019-09-15 25.1 42.2 AITNIA2 22.2824
..................
But I want a result like this:
rerowid date temp rh meteostid AVG_Last_3_Days
1 2019-09-09 28.8 55.6 AITNIA2 28.8
2 2019-09-10 30.3 51.3 AITNIA2 29.5
3 2019-09-11 28.6 49.0 AITNIA2 29.2
4 2019-09-12 26.7 51.9 AITNIA2 28.5
5 2019-09-13 25.3 48.1 AITNIA2 26.9
6 2019-09-14 25.3 38.5 AITNIA2 25.8
7 2019-09-15 25.1 42.2 AITNIA2 25.2
..................
Use window functions. If you have one row per date, or you want the window defined by rows in the data (the current row and the two preceding rows):
SELECT rerowid, redate, retemp, rerh, meteostid,
AVG(retemp) OVER (PARTITION BY meteostid ORDER BY redate ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as avg_retemp_3
FROM meteorecords;
If you want 3 chronological days, use RANGE:
SELECT rerowid, redate, retemp, rerh, meteostid,
AVG(retemp) OVER (PARTITION BY meteostid
ORDER BY redate
RANGE BETWEEN INTERVAL '2 days' PRECEDING AND CURRENT ROW) as avg_retemp_3
FROM meteorecords;
My dataframe looks like this:
loc prod period qty
0 Customer10 FG1 2483 200.000000
1 Customer10 FG1 2484 220.000000
2 Customer10 FG1 2485 240.000000
3 Customer10 FG1 2486 260.000000
4 Customer11 FG1 2483 300.000000
5 Customer11 FG1 2484 320.000000
6 Customer11 FG1 2485 340.000000
7 Customer11 FG1 2486 360.000000
8 Customer12 FG1 2483 400.000000
9 Customer12 FG1 2484 420.000000
10 Customer12 FG1 2485 440.000000
11 Customer12 FG1 2486 460.000000
12 Customer13 FG1 2483 500.000000
13 Customer13 FG1 2484 520.000000
14 Customer13 FG1 2485 540.000000
15 Customer13 FG1 2486 560.000000
16 Customer9 FG1 2483 100.000000
17 Customer9 FG1 2484 120.000000
18 Customer9 FG1 2485 140.000000
19 Customer9 FG1 2486 160.000000
I want to reshape the dataframe (distinct period values as columns and prod as rows), grouping by prod and period and summing qty over loc.
Expected O/P
2483 2484 2485 2486
FG1 1500 1600 1700 1800
You can use pivot_table:
In [37]: df.pivot_table(index='prod', columns='period', values='qty', aggfunc='sum')
Out[37]:
period 2483 2484 2485 2486
prod
FG1 1500.0 1600.0 1700.0 1800.0
or
In [39]: df.groupby(['prod','period'])['qty'].sum().unstack()
Out[39]:
period 2483 2484 2485 2486
prod
FG1 1500.0 1600.0 1700.0 1800.0
UPDATE:
How to get the period of max(qty)?
In [69]: pvt = df.pivot_table(index='prod', columns='period', values='qty', aggfunc='sum')
In [70]: pvt
Out[70]:
period 2483 2484 2485 2486
prod
FG1 1500.0 1600.0 1700.0 1800.0
In [71]: pvt.idxmax(axis=1)
Out[71]:
prod
FG1 2486
dtype: int64
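If the qty at that period is also needed, the same pivot gives it directly:
In [72]: pvt.max(axis=1)
Out[72]:
prod
FG1    1800.0
dtype: float64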