Groupby multiple rows and get values in a Pandas DataFrame

I would like to know how to reshape my data so that the Date is in the first column, all station IDs are in the second column with their respective values (average wind speed, sunshine duration, and so on), and the next date follows below with its stations beside it. I have data as below:
Stations_ID Date Average windspeed (Beaufort) sunshine duration average cloud cover
0 102 2016-01-01 6.8 5.733 NaN
1 164 2016-01-01 1.6 0.000 8.0
2 232 2016-01-01 2.0 0.000 7.8
3 282 2016-01-01 1.2 0.000 7.8
4 183 2016-01-01 2.9 0.000 7.8
... ... ... ... ... ...
164035 4271 2021-12-31 6.7 0.000 7.6
164036 4625 2021-12-31 7.1 0.000 7.3
164037 4177 2021-12-31 2.6 4.167 3.9
164038 4887 2021-12-31 4.7 5.333 3.8
164039 4336 2021-12-31 3.4 0.000 7.4
I have used the following line of code:
df_data_3111 = pd.DataFrame(df_data_311.groupby(['Date','Stations_ID'])['Minimum Temperature'])
But this does not work.
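For reference, a minimal sketch of one way to get that shape, assuming the goal is a table indexed by date first and station second, with all measurement columns alongside (wrapping a GroupBy object in pd.DataFrame, as above, does not aggregate or reshape anything):
# Date becomes the outer index level and Stations_ID the inner one;
# sort_index groups each date's stations together.
df_data_3111 = df_data_311.set_index(['Date', 'Stations_ID']).sort_index()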

Using the diff() function with groupby in pandas

I am encountering errors each time I attempt to compute the difference in readings for a meter in my dataset. The dataset structure is this:
id paymenttermid houseid houseid-meterid quantity month year cleaned_quantity
Datetime
2019-02-01 255 water 215 215M201 23.0 2 2019 23.0
2019-02-01 286 water 193 193M181 24.0 2 2019 24.0
2019-02-01 322 water 172 172M162 22.0 2 2019 22.0
2019-02-01 323 water 176 176M166 61.0 2 2019 61.0
2019-02-01 332 water 158 158M148 15.0 2 2019 15.0
I am attempting to generate a new column called consumption that computes the difference in quantities consumed for each house(identified by houseid-meterid) after every month of the year.
The code I am using to implement this is:
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(-1)
After executing this code, the consumption column is filled with NaN values. How can I correctly implement this logic?
The end result looks like this:
id paymenttermid houseid houseid-meterid quantity month year cleaned_quantity consumption
Datetime
2019-02-01 255 water 215 215M201 23.0 2 2019 23.0 NaN
2019-02-01 286 water 193 193M181 24.0 2 2019 24.0 NaN
2019-02-01 322 water 172 172M162 22.0 2 2019 22.0 NaN
2019-02-01 323 water 176 176M166 61.0 2 2019 61.0 NaN
2019-02-01 332 water 158 158M148 15.0 2 2019 15.0 NaN
Many thanks in advance.
I have attempted to use
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(-1)
and
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(0)
and
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff()
All of these commands result in the same behaviour as stated above.
Expected output should be:
Datetime houseid-meterid cleaned_quantity consumption
2019-02-01 215M201 23.0 20
2019-03-02 215M201 43.0 9
2019-04-01 215M201 52.0 12
2019-05-01 215M201 64.0 36
2019-06-01 215M201 100.0 20
What steps should I take?
Grouping by year, month, and houseid-meterid leaves only a single row per group (one reading per meter per month), so diff has nothing to subtract, which is why every attempt filled the column with NaN. Instead, sort the values by Datetime (if needed), group by houseid-meterid alone, compute the diff of the cleaned_quantity values, then shift the result to align each difference with the right row:
df['consumption'] = (df.sort_values('Datetime')
                       .groupby('houseid-meterid')['cleaned_quantity']
                       .transform(lambda x: x.diff().shift(-1)))
print(df)
# Output
Datetime houseid-meterid cleaned_quantity consumption
0 2019-02-01 215M201 23.0 20.0
1 2019-03-02 215M201 43.0 9.0
2 2019-04-01 215M201 52.0 12.0
3 2019-05-01 215M201 64.0 36.0
4 2019-06-01 215M201 100.0 NaN
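Equivalently, the diff-then-shift can be collapsed into a single subtraction inside the lambda, a small stylistic variant with the same result:
df['consumption'] = (df.sort_values('Datetime')
                       .groupby('houseid-meterid')['cleaned_quantity']
                       .transform(lambda x: x.shift(-1) - x))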

Creating a multi-index from data grouped by month in Pandas

Consider this sample data:
Month Location Products Sales Profit
JAN 1 43 32 20
JAN 2 82 54 25
JAN 3 64 43 56
FEB 1 37 28 78
FEB 2 18 15 34
FEB 3 5 2 4
MAR 1 47 40 14
The multi-index transformation I am trying to achieve is this:
JAN FEB MAR
Location Products Sales Profit Products Sales Profit Products Sales Profit
1 43 32 20 37 28 78 47 40 14
2 82 54 25 18 15 34 null null null
3 64 43 56 5 2 4 null null null
I tried versions of this:
df.stack().to_frame().T
It put all the data into one row. So, that's not the goal.
I presume I am close in that it should be a stacking or unstacking, melting or unmelting, but my attempts have all resulted in data oatmeal at this point. Appreciate your time trying to solve this one.
You can use pivot with reorder_levels and sort_index():
df.pivot(index='Location',columns='Month').reorder_levels(order=[1,0],axis=1).sort_index(axis=1)
Month FEB JAN MAR
Products Profit Sales Products Profit Sales Products Profit Sales
Location
1 37.0 78.0 28.0 43.0 20.0 32.0 47.0 14.0 40.0
2 18.0 34.0 15.0 82.0 25.0 54.0 NaN NaN NaN
3 5.0 4.0 2.0 64.0 56.0 43.0 NaN NaN NaN
In case you are interested, this answer elaborates on the difference between swaplevel and reorder_levels.
Use pivot:
>>> df.pivot(index='Location', columns='Month').swaplevel(axis=1).sort_index(axis=1)
Month FEB JAN MAR
Products Profit Sales Products Profit Sales Products Profit Sales
Location
1 37.0 78.0 28.0 43.0 20.0 32.0 47.0 14.0 40.0
2 18.0 34.0 15.0 82.0 25.0 54.0 NaN NaN NaN
3 5.0 4.0 2.0 64.0 56.0 43.0 NaN NaN NaN
To preserve the original month order, you have to convert your Month column to an ordered CategoricalDtype first:
df['Month'] = df['Month'].astype(pd.CategoricalDtype(df['Month'].unique(), ordered=True))
out = df.pivot(index='Location', columns='Month').swaplevel(axis=1).sort_index(axis=1)
print(out)
# Output:
Month JAN FEB MAR
Products Profit Sales Products Profit Sales Products Profit Sales
Location
1 43.0 20.0 32.0 37.0 78.0 28.0 47.0 14.0 40.0
2 82.0 25.0 54.0 18.0 34.0 15.0 NaN NaN NaN
3 64.0 56.0 43.0 5.0 4.0 2.0 NaN NaN NaN
Update 2
Try to force the order of the level-2 columns (passing categories explicitly keeps them in their original appearance order rather than alphabetical order):
df1 = df.set_index(['Month', 'Location'])
df1.columns = pd.CategoricalIndex(df1.columns, categories=df1.columns, ordered=True)
df1 = df1.unstack('Month').swaplevel(axis=1).sort_index(axis=1)
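Alternatively, both column levels can be forced after an ordinary pivot with reindex, avoiding the categorical conversions entirely (month_order and col_order are helper lists introduced here for illustration):
month_order = ['JAN', 'FEB', 'MAR']
col_order = ['Products', 'Sales', 'Profit']
out = (df.pivot(index='Location', columns='Month')
         .swaplevel(axis=1)
         .reindex(month_order, axis=1, level=0)
         .reindex(col_order, axis=1, level=1))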

Merge old and new table and fill values by date

I have df1:
Date        Symbol  Time      Quantity   Price
2020-09-04  AAPL    09:54:48      11.0  115.97
2020-09-16  AAPL    09:30:02     -11.0  115.33
2020-02-24  AMBA    09:30:02      22.0   64.24
2020-02-25  AMBA    14:01:28     -22.0   62.64
2020-07-14  AMGN    09:30:01       5.0  243.90
...         ...     ...            ...     ...
2020-12-08  YUMC    09:30:00     -22.0   56.89
2020-11-18  Z       14:20:01      12.0  100.68
2020-11-20  Z       09:30:01     -12.0  109.25
2020-09-04  ZS      09:45:24       9.0  135.94
2020-09-14  ZS      09:38:23      -9.0  126.41
and df2:
     Date         USD
2    2020-02-01  22.702
3    2020-03-01  22.753
4    2020-06-01  22.601
5    2020-07-01  22.626
6    2020-08-01  22.739
..   ...            ...
248  2020-12-23  21.681
249  2020-12-28  21.482
250  2020-12-29  21.462
251  2020-12-30  21.372
252  2020-12-31  21.387
I want to add a new column "USD" to df1, taking the value from df2 that matches each date. I am trying:
new_df = (dane5.reset_index()
               .merge(kurz2, how='outer')
               .fillna(0)
               .set_index('Date'))
new_df.sort_index(inplace=True)
new_df= new_df[new_df['Symbol'] != 0]
print(new_df.head(50))
But some rows come back with a zero value:
Date        Symbol  Time      Quantity       Price     USD
2020-01-02  GL      10:31:14      13.0  104.550000   0.000
2020-01-02  ATEC    13:35:04     211.0    6.860000   0.000
2020-01-03  IOVA    14:02:32      56.0   25.790000   0.000
2020-01-03  TGNA    09:30:00      90.0   16.080000   0.000
2020-01-03  SCS     09:30:01     -70.0   20.100000   0.000
2020-01-03  SKX     09:30:09      34.0   41.940000   0.000
2020-01-06  IOVA    09:45:19     -56.0   24.490000  24.163
2020-01-06  GL      09:30:02     -13.0  103.430000  24.163
2020-01-06  SKX     15:55:15     -34.0   43.900000  24.163
2020-01-07  TGNA    15:55:16     -90.0   16.945000  23.810
2020-01-07  MRTX    09:46:18     -13.0  101.290000  23.810
2020-01-07  MRTX    09:34:10      13.0  109.430000  23.810
2020-01-08  ITCI    09:30:01      49.0   27.640000   0.000
Could you help me, please? Sorry for my bad English.
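The zeros appear because df2 only has a USD rate for some dates; the outer merge leaves NaN for the dates missing from df2, and fillna(0) turns those into zeros. A minimal sketch of one way around this, assuming the most recent available rate should be carried forward and that both Date columns are datetime-typed (merge_asof matches each row to the last rate on or before its date):
# Both frames must be sorted by the merge key for merge_asof.
new_df = pd.merge_asof(dane5.reset_index().sort_values('Date'),
                       kurz2.sort_values('Date'),
                       on='Date').set_index('Date')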

Pandas df.shift(axis=1) adds extra entries, why?

Here is a sample of the original table.
# z speed dir U_geo V_geo U U[QCC] U[ign] U[siC] U[siD] V
0 40 2.83 181.0 0.05 2.83 -0.20 11 -0.20 2.24 0.95 2.83 11
1 50 2.41 184.8 0.20 2.40 -0.01 11 -0.01 2.47 0.94 2.41 11
2 60 1.92 192.4 0.41 1.88 0.25 11 0.25 2.46 0.94 1.91 11
3 70 1.75 201.7 0.65 1.63 0.50 11 0.50 2.47 0.94 1.68 11
I need to shift the entire table over by 1 column to produce this:
z speed dir U_geo V_geo U U[QCC] U[ign] U[siC] U[siD] V
0 40 2.83 181.0 0.05 2.83 -0.20 11 -0.20 2.24 0.95 2.83
1 50 2.41 184.8 0.20 2.40 -0.01 11 -0.01 2.47 0.94 2.41
2 60 1.92 192.4 0.41 1.88 0.25 11 0.25 2.46 0.94 1.91
3 70 1.75 201.7 0.65 1.63 0.50 11 0.50 2.47 0.94 1.68
Here is the code that ingests the data and tries to shift it over by one column
wind_rass_table_df=pd.read_csv(file_path, header=j+3, engine='python', nrows=77,sep=r'\s{2,}',skip_blank_lines=False,index_col=False)
wind_rass_table_df=wind_rass_table_df.shift(periods=1,axis=1)
df.shift(axis=1) is supposed to shift the dataframe over by one column, but it does more than that. It does this:
# z speed dir U_geo V_geo U U[QCC] U[ign] U[siC]
0 NaN NaN 2.83 181.0 0.05 2.83 40.0 -0.20 -0.20 2.24
1 NaN NaN 2.41 184.8 0.20 2.40 50.0 -0.01 -0.01 2.47
2 NaN NaN 1.92 192.4 0.41 1.88 60.0 0.25 0.25 2.46
3 NaN NaN 1.75 201.7 0.65 1.63 70.0 0.50 0.50 2.47
The shift function has taken the first column, inserted it into the 7th column, shifted the 7th into the 8th, repeated the 8th, shifted the 9th over, and so on.
What is the correct way of shifting a dataframe over by one column?
Many thanks!
You can use iloc and create another dataframe:
df = pd.DataFrame(data=df.iloc[:, :-1].to_numpy(), columns=df.columns[1:], index=df.index)
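Note that .to_numpy() matters here: if you pass the DataFrame itself, the constructor aligns on column names rather than positions, and most of the result comes back as NaN. An equivalent alternative using set_axis, shown as a variant rather than the canonical fix:
df = df.iloc[:, :-1].set_axis(df.columns[1:], axis=1)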

Randomly sum rows

I would like to randomly insert into a new temp_table the records from the initial table below, grouping them under a new PO number (1234-1, 1234-2, etc.), where each group's sum(T.KG) is < 20 and sum(T.VOL) is < 0.1.
INITIAL TABLE
lineID PO Item QTY Weight Volume T.KG T.VOL
1 1234 ABCD 12 0.40 0.0030 4.80 0.036
2 1234 EFGH 8 0.39 0.0050 3.12 0.040
3 1234 IJKL 5 0.48 0.0070 2.40 0.035
4 1234 MNOP 8 0.69 0.0040 5.53 0.032
5 1234 QRST 9 0.58 0.0025 5.22 0.023
6 1234 UVWX 7 0.87 0.0087 6.09 0.061
7 1234 YZAB 10 0.71 0.0064 7.10 0.064
8 1234 CDEF 6 0.69 0.0054 4.14 0.032
9 1234 GHIJ 7 0.65 0.0036 4.55 0.025
10 1234 KLMN 9 0.67 0.0040 6.03 0.036
NEW Temp_Table should look like:
LineID PO Item QTY Weight Volume T.KG T.VOL
1 1234-1 ABCD 12 0.40 0.0030 4.80 0.036
2 1234-1 EFGH 8 0.39 0.0050 3.12 0.040
5 1234-1 QRST 9 0.58 0.0025 5.22 0.023
3 1234-2 IJKL 5 0.48 0.0070 2.40 0.035
4 1234-2 MNOP 8 0.69 0.0040 5.53 0.032
8 1234-2 CDEF 6 0.69 0.0054 4.14 0.032
6 1234-3 UVWX 7 0.87 0.0087 6.09 0.061
10 1234-3 KLMN 9 0.67 0.0040 6.03 0.036
9 1234-4 GHIJ 7 0.65 0.0036 4.55 0.025
7 1234-4 YZAB 10 0.71 0.0064 7.10 0.064
I can't figure out how to code this...
It's probably a job for a cursor.
The algorithm could basically be like this:
Collect the rows from the initial table one by one, accumulating sum(T.KG) and sum(T.VOL):
pick rows out into the temp table while the conditions are still met (skip those that would exceed either sum);
use lineID as the order;
iterate up to the end of the list.
Upon hitting the end of the table, call it a group, then start all over again, omitting the rows that have already been collected into the temp table.
Continue while there are still rows not collected.
But I'm too lazy at the moment to give out the actual code; besides, it's homework, and cursors hate me anyway.
The logic of the 1234-1, 1234-2, etc. is to break the records into groups that each represent a carton. If the order has 100 line items, I may need n cartons (n groups) to pack all the items.
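Although the answer above talks about a SQL cursor, for illustration here is a minimal pandas sketch of the same greedy first-fit pass; pack_into_cartons is a hypothetical helper name, and the initial shuffle is an assumption made to satisfy the "randomly" requirement:
import pandas as pd

def pack_into_cartons(df, max_kg=20.0, max_vol=0.1, seed=None):
    # Shuffle once so the grouping is random, then repeatedly scan the
    # unassigned rows, adding a row to the current carton while both
    # running sums stay under the limits.
    remaining = df.sample(frac=1, random_state=seed)
    cartons, n = [], 1
    while not remaining.empty:
        kg = vol = 0.0
        picked = []
        for idx, row in remaining.iterrows():
            if kg + row['T.KG'] < max_kg and vol + row['T.VOL'] < max_vol:
                kg += row['T.KG']
                vol += row['T.VOL']
                picked.append(idx)
        if not picked:  # a single oversize row gets its own carton
            picked = [remaining.index[0]]
        carton = remaining.loc[picked].copy()
        carton['PO'] = carton['PO'].astype(str) + f'-{n}'  # e.g. 1234-1, 1234-2
        cartons.append(carton)
        remaining = remaining.drop(picked)
        n += 1
    return pd.concat(cartons)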