Changing a pandas dataframe format into another format?

The given dataframe looks like this:
sensorA sensorB deviceA deviceB inputA inputB machineA machineB flagA flagB mainA
Time
2021-11-26 20:20:00 379.0 0.0 0.0 489.0 0.77 35.0 0.0 51.0 -13.0 230.0 1.6
2021-11-26 20:30:00 344.0 0.0 0.0 143.0 0.76 31.0 0.0 50.0 -11.0 230.0 1.8
I want to map this to the following format, separating the individual columns into a combination of field and attribute:
Time                 Type    attribute  Value
2021-11-26 20:20:00  sensor  a          999
I have tried multiple directions to approach this using multi-indexing, groupby, etc., but I can't seem to work out how exactly to implement it.
Any help would be appreciated!

Edit
If your column names contain '_' as a separator, you can use:
df.columns = df.columns.str.split('_', expand=True).rename(['Type', 'Tag'])
out = df.unstack().rename('Value').reset_index(level=['Type', 'Tag']).sort_index()
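As a minimal sketch, assuming hypothetical underscore-separated names such as sensor_A and sensor_B (not the question's actual columns), this reshape works as follows:
import pandas as pd

# Hypothetical '_'-separated columns, just to illustrate the reshape
df = pd.DataFrame({'sensor_A': [379.0], 'sensor_B': [0.0]},
                  index=pd.to_datetime(['2021-11-26 20:20:00']).rename('Time'))

df.columns = df.columns.str.split('_', expand=True).rename(['Type', 'Tag'])
out = df.unstack().rename('Value').reset_index(level=['Type', 'Tag']).sort_index()
# out has one row per (Time, Type, Tag) with the matching Value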
Otherwise, extract the type/tag from the column names with a regular expression:
types = ['sensor', 'device', 'input', 'machine', 'flag', 'main']
pat = fr"({'|'.join(types)})(.*)"
df.columns = pd.MultiIndex.from_frame(df.columns.str.extract(pat),
                                      names=['Type', 'Tag'])
out = df.unstack().rename('Value').reset_index(level=['Type', 'Tag']).sort_index()
Output:
>>> out
Type Tag Value
Time
2021-11-26 20:20:00 sensor A 379.00
2021-11-26 20:20:00 flag B 230.00
2021-11-26 20:20:00 flag A -13.00
2021-11-26 20:20:00 machine B 51.00
2021-11-26 20:20:00 machine A 0.00
2021-11-26 20:20:00 main A 1.60
2021-11-26 20:20:00 input A 0.77
2021-11-26 20:20:00 input B 35.00
2021-11-26 20:20:00 device B 489.00
2021-11-26 20:20:00 device A 0.00
2021-11-26 20:20:00 sensor B 0.00
2021-11-26 20:30:00 input A 0.76
2021-11-26 20:30:00 device A 0.00
2021-11-26 20:30:00 input B 31.00
2021-11-26 20:30:00 machine A 0.00
2021-11-26 20:30:00 sensor B 0.00
2021-11-26 20:30:00 machine B 50.00
2021-11-26 20:30:00 flag A -11.00
2021-11-26 20:30:00 sensor A 344.00
2021-11-26 20:30:00 flag B 230.00
2021-11-26 20:30:00 device B 143.00
2021-11-26 20:30:00 main A 1.80

Related

How to match DatetimeIndex for all but the year?

I have a dataset with missing values and a DatetimeIndex. I would like to fill these values with the mean of other values reported at the same month, day, and hour. If there are no values reported at that specific month/day/hour in any year, I would like to fall back to the interpolated mean of the nearest reported hour. How can I achieve this? Right now my approach is this:
# keep only the missing rows in df_Na and only the reported rows in df_raw
df_Na = df_Na[df_Na['Generation'].isna()]
df_raw = df_raw[~df_raw['Generation'].isna()]
# reduce to month
same_month = df_raw[df_raw.index.month.isin(df_Na.index.month)]
# reduce to same day
same_day = same_month[same_month.index.day.isin(df_Na.index.day)]
# reduce to hour
same_hour = same_day[same_day.index.hour.isin(df_Na.index.hour)]
df_Na contains all the missing values I'd like to fill, and df_raw contains all the reported values from which I'd like to compute the mean. I have a huge dataset, which is why I would like to avoid a for loop at all costs.
My Data looks like this:
df_Na
Generation
2017-12-02 19:00:00 NaN
2021-01-12 00:00:00 NaN
2021-01-12 01:00:00 NaN
..............................
2021-02-12 20:00:00 NaN
2021-02-12 21:00:00 NaN
2021-02-12 22:00:00 NaN
df_raw
Generation
2015-09-12 00:00:00 0.0
2015-09-12 01:00:00 19.0
2015-09-12 02:00:00 0.0
..............................
2021-12-11 21:00:00 0.0
2021-12-11 22:00:00 180.0
2021-12-11 23:00:00 0.0
Use GroupBy.transform with 'mean' for averages per MM-DD HH group, and replace the missing values with DataFrame.fillna:
df = df.fillna(df.groupby(df.index.strftime('%m-%d %H')).transform('mean'))
And then if necessary add DataFrame.interpolate:
df = df.interpolate(method='nearest')
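As a minimal sketch on toy data (hypothetical values, not the question's dataset), the group-wise fill behaves like this:
import numpy as np
import pandas as pd

# Two years sharing the same month/day/hour; the 2021 values are missing
idx = pd.to_datetime(['2020-01-12 00:00:00', '2021-01-12 00:00:00',
                      '2020-01-12 01:00:00', '2021-01-12 01:00:00'])
df = pd.DataFrame({'Generation': [10.0, np.nan, 20.0, np.nan]}, index=idx)

# Rows with the same 'MM-DD HH' key fall into one group, so each NaN
# is replaced by the mean of that calendar hour across all years
df = df.fillna(df.groupby(df.index.strftime('%m-%d %H')).transform('mean'))
# 2021-01-12 00:00 -> 10.0, 2021-01-12 01:00 -> 20.0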

Merge old and new table and fill values by date

I have df1:
Date        Symbol  Time      Quantity   Price
2020-09-04  AAPL    09:54:48      11.0  115.97
2020-09-16  AAPL    09:30:02     -11.0  115.33
2020-02-24  AMBA    09:30:02      22.0   64.24
2020-02-25  AMBA    14:01:28     -22.0   62.64
2020-07-14  AMGN    09:30:01       5.0  243.90
...         ...     ...            ...     ...
2020-12-08  YUMC    09:30:00     -22.0   56.89
2020-11-18  Z       14:20:01      12.0  100.68
2020-11-20  Z       09:30:01     -12.0  109.25
2020-09-04  ZS      09:45:24       9.0  135.94
2020-09-14  ZS      09:38:23      -9.0  126.41
and df2:
     Date        USD
2    2020-02-01  22.702
3    2020-03-01  22.753
4    2020-06-01  22.601
5    2020-07-01  22.626
6    2020-08-01  22.739
..   ...         ...
248  2020-12-23  21.681
249  2020-12-28  21.482
250  2020-12-29  21.462
251  2020-12-30  21.372
252  2020-12-31  21.387
I want to add a new "USD" column from df2 to df1, matched by date.
I tried:
new_df = (dane5.reset_index()
               .merge(kurz2, how='outer')
               .fillna(0)
               .set_index('Date'))
new_df.sort_index(inplace=True)
new_df= new_df[new_df['Symbol'] != 0]
print(new_df.head(50))
But some rows come back with a zero value:
Date        Symbol  Time      Quantity  Price       USD
2020-01-02  GL      10:31:14      13.0  104.550000   0.000
2020-01-02  ATEC    13:35:04     211.0    6.860000   0.000
2020-01-03  IOVA    14:02:32      56.0   25.790000   0.000
2020-01-03  TGNA    09:30:00      90.0   16.080000   0.000
2020-01-03  SCS     09:30:01     -70.0   20.100000   0.000
2020-01-03  SKX     09:30:09      34.0   41.940000   0.000
2020-01-06  IOVA    09:45:19     -56.0   24.490000  24.163
2020-01-06  GL      09:30:02     -13.0  103.430000  24.163
2020-01-06  SKX     15:55:15     -34.0   43.900000  24.163
2020-01-07  TGNA    15:55:16     -90.0   16.945000  23.810
2020-01-07  MRTX    09:46:18     -13.0  101.290000  23.810
2020-01-07  MRTX    09:34:10      13.0  109.430000  23.810
2020-01-08  ITCI    09:30:01      49.0   27.640000   0.000
Could you help me, please?
Sorry for my bad English.
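The zeros appear on dates that have no exact match in df2, so the outer merge leaves NaN there and fillna(0) turns it into 0.000. One possible direction, as a hedged sketch assuming both Date columns are datetime64, is pd.merge_asof, which matches each trade to the most recent available rate instead of requiring an exact date:
import pandas as pd

# merge_asof requires both frames to be sorted by the key column
df1 = df1.sort_values('Date')
df2 = df2.sort_values('Date')

# For each df1 row, take the USD rate from the latest df2 date that is
# <= the trade date (direction='backward' is the default)
new_df = pd.merge_asof(df1, df2, on='Date').set_index('Date').sort_index()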

Output last row of each year

My dataframe looks like this:
Close Volume Dividends
Date
2014-08-07 14.21 4848000 0.00
2014-08-08 13.95 5334000 0.00
2014-08-11 14.07 4057000 0.00
2014-08-12 14.13 2611000 0.00
2014-08-13 14.15 3743000 0.28
... ... ... ...
2020-08-03 19.45 7352600 0.00
2020-08-04 19.69 4250500 0.00
2020-08-05 19.83 3414080 0.00
2020-08-06 20.40 6128100 0.00
2020-08-07 20.60 8295000 0.00
I'd like to output the closing price for the last day of each year. I tried the following:
df = df.groupby(df.index.year)['Close'].tail(1)
Date
2014-12-31 16.39
2015-12-31 13.67
2016-12-30 14.78
2017-12-29 21.83
2018-12-31 21.64
2019-12-31 25.00
2020-08-07 20.60
I want the output to be:
Date
2014 16.39
2015 13.67
2016 14.78
2017 21.83
...
Any help would be very much appreciated. Many Thanks!
Try with last():
df = df.groupby(df.index.year)['Close'].last()
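As a quick sketch on toy data (hypothetical values), the difference is that tail(1) keeps the original dates as the index, while last() aggregates per group, so the year itself becomes the index:
import pandas as pd

idx = pd.to_datetime(['2014-12-30', '2014-12-31', '2015-12-31']).rename('Date')
df = pd.DataFrame({'Close': [16.20, 16.39, 13.67]}, index=idx)

df.groupby(df.index.year)['Close'].tail(1)  # indexed by 2014-12-31, 2015-12-31
df.groupby(df.index.year)['Close'].last()   # indexed by 2014, 2015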

Creating values from datetime objects in certain fixed divisions

I am trying to create a new column in which, e.g., the time 14:02 should be saved as 14.0, whereas 14:16 should be 14.5. This corresponds to half-hour units. Of course, 15-minute units and so on should also be possible. This is my approach for full hours, but I need a higher resolution:
df["Time"] = df.StartDateTime.apply(lambda x: x.hour)
So long as the units evenly divide an hour, you can round to that frequency and then divide by an hour.
import pandas as pd
df = pd.DataFrame({'Time': pd.timedelta_range('14:00:00', freq='4min', periods=10)})
for freq in ['30min', '15min', '20min', '10min']:
    df[freq] = df['Time'].dt.round(freq) / pd.Timedelta('1H')
Time 30min 15min 20min 10min
0 14:00:00 14.0 14.00 14.000000 14.000000
1 14:04:00 14.0 14.00 14.000000 14.000000
2 14:08:00 14.0 14.25 14.000000 14.166667
3 14:12:00 14.0 14.25 14.333333 14.166667
4 14:16:00 14.5 14.25 14.333333 14.333333
5 14:20:00 14.5 14.25 14.333333 14.333333
6 14:24:00 14.5 14.50 14.333333 14.333333
7 14:28:00 14.5 14.50 14.333333 14.500000
8 14:32:00 14.5 14.50 14.666667 14.500000
9 14:36:00 14.5 14.50 14.666667 14.666667
If you start from a datetime64[ns] column you can isolate the time by subtracting off the normalized date. For example:
df = pd.DataFrame({'Time': pd.date_range('2010-01-01 14:00:00', freq='4min', periods=5)})
df['Time_only'] = df['Time'] - df['Time'].dt.normalize()
# Time Time_only
#0 2010-01-01 14:00:00 14:00:00
#1 2010-01-01 14:04:00 14:04:00
#2 2010-01-01 14:08:00 14:08:00
#3 2010-01-01 14:12:00 14:12:00
#4 2010-01-01 14:16:00 14:16:00
print(df.dtypes)
#Time datetime64[ns]
#Time_only timedelta64[ns]
#dtype: object
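Putting the two steps together, as a sketch using the question's StartDateTime column: isolate the time of day as a timedelta, then round and divide as above to get the fractional-hour column directly from a datetime column.
import pandas as pd

df = pd.DataFrame({'StartDateTime': pd.date_range('2010-01-01 14:00:00',
                                                  freq='4min', periods=5)})
# time of day as a timedelta, then rounded to the half hour
time_of_day = df['StartDateTime'] - df['StartDateTime'].dt.normalize()
df['Time'] = time_of_day.dt.round('30min') / pd.Timedelta('1H')
# 14:00 -> 14.0, 14:04 -> 14.0, 14:08 -> 14.0, 14:12 -> 14.0, 14:16 -> 14.5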

Multi-index (time series) slicing error in pandas

I have the dataframe below; date/time form a MultiIndex. When I run this code:
idx = pd.IndexSlice
print(df_per_wday_temp.loc[idx[:, datetime.time(4, 0, 0):datetime.time(7, 0, 0)]])
I get the error 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (1)'. This may be an error in index slicing, but I don't know why it happens. Can anybody solve it?
a b
date time
2018-01-26 19:00:00 25.08 -7.85
19:15:00 24.86 -7.81
19:30:00 24.67 -8.24
19:45:00 NaN -9.32
20:00:00 NaN -8.29
20:15:00 NaN -8.58
20:30:00 NaN -9.48
20:45:00 NaN -8.73
21:00:00 NaN -8.60
21:15:00 NaN -8.70
21:30:00 NaN -8.53
21:45:00 NaN -8.90
22:00:00 NaN -8.55
22:15:00 NaN -8.48
22:30:00 NaN -9.90
22:45:00 NaN -9.70
23:00:00 NaN -8.98
23:15:00 NaN -9.17
23:30:00 NaN -9.07
23:45:00 NaN -9.45
00:00:00 NaN -9.64
00:15:00 NaN -10.08
00:30:00 NaN -8.87
00:45:00 NaN -9.91
01:00:00 NaN -9.91
01:15:00 NaN -9.93
01:30:00 NaN -9.55
01:45:00 NaN -9.51
02:00:00 NaN -9.75
02:15:00 NaN -9.44
... ... ...
03:45:00 NaN -9.28
04:00:00 NaN -9.96
04:15:00 NaN -10.19
04:30:00 NaN -10.20
04:45:00 NaN -9.85
05:00:00 NaN -10.33
05:15:00 NaN -10.18
05:30:00 NaN -10.81
05:45:00 NaN -10.51
06:00:00 NaN -10.41
06:15:00 NaN -10.49
06:30:00 NaN -10.13
06:45:00 NaN -10.36
07:00:00 NaN -10.71
07:15:00 NaN -12.11
07:30:00 NaN -10.76
07:45:00 NaN -10.76
08:00:00 NaN -11.63
08:15:00 NaN -11.18
08:30:00 NaN -10.49
08:45:00 NaN -11.18
09:00:00 NaN -10.67
09:15:00 NaN -10.60
09:30:00 NaN -10.36
09:45:00 NaN -9.39
10:00:00 NaN -9.77
10:15:00 NaN -9.54
10:30:00 NaN -8.99
10:45:00 NaN -9.01
11:00:00 NaN -10.01
Thanks in advance!
If sorting the index is not possible, you need to create a boolean mask and filter by boolean indexing:
from datetime import time

# build a boolean mask from the second index level (the time level)
mask = df1.index.get_level_values(1).to_series().between(time(4, 0, 0), time(7, 0, 0)).values
df = df1[mask]
print (df)
a b
date time
2018-01-26 04:00:00 NaN -9.96
04:15:00 NaN -10.19
04:30:00 NaN -10.20
04:45:00 NaN -9.85
05:00:00 NaN -10.33
05:15:00 NaN -10.18
05:30:00 NaN -10.81
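Alternatively, the error message itself points at the usual fix: the label slice is allowed once the MultiIndex is lexsorted. A hedged sketch:
import datetime
import pandas as pd

# Sorting makes the MultiIndex fully lexsorted, after which
# IndexSlice-based slicing on the time level works
df_sorted = df_per_wday_temp.sort_index()

idx = pd.IndexSlice
out = df_sorted.loc[idx[:, datetime.time(4, 0, 0):datetime.time(7, 0, 0)], :]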