How to groupby in Pandas by datetime range from different DF - pandas

I'm stuck and can't solve this...
I have two dataframes.
One has datetime intervals, the other has datetimes and values.
I need to get the MIN() value for each datetime range.
import pandas as pd

timeseries = pd.DataFrame(
    [
        ['2018-01-01T00:00:00.000000000', '2018-01-01T03:00:00.000000000'],
        ['2018-01-02T00:00:00.000000000', '2018-01-02T03:00:00.000000000'],
        ['2018-01-03T00:00:00.000000000', '2018-01-03T03:00:00.000000000'],
    ], dtype='datetime64[ns]', columns=['Start DT', 'End DT'])

values = pd.DataFrame(
    [
        ['2018-01-01T00:00:00.000000000', 1],
        ['2018-01-01T01:00:00.000000000', 2],
        ['2018-01-01T02:00:00.000000000', 0],
        ['2018-01-02T00:00:00.000000000', -1],
        ['2018-01-02T01:00:00.000000000', 3],
        ['2018-01-02T02:00:00.000000000', 10],
        ['2018-01-03T00:00:00.000000000', 7],
        ['2018-01-03T01:00:00.000000000', 11],
        ['2018-01-03T02:00:00.000000000', 2],
    ], columns=['DT', 'Value'])
Required output:
Start DT End DT Min
0 2018-01-01 2018-01-01 03:00:00 0
1 2018-01-02 2018-01-02 03:00:00 -1
2 2018-01-03 2018-01-03 03:00:00 2
Any ideas?

Use an IntervalIndex created from the timeseries columns, get the positions with Index.get_indexer, aggregate min, and finally join the result back to timeseries as a new column:
values['DT'] = pd.to_datetime(values['DT'])  # DT above is built from strings, so convert first

s = pd.IntervalIndex.from_arrays(timeseries['Start DT'],
                                 timeseries['End DT'],
                                 closed='both')
values['new'] = timeseries.index[s.get_indexer(values['DT'])]
print (values)
DT Value new
0 2018-01-01 00:00:00 1 0
1 2018-01-01 01:00:00 2 0
2 2018-01-01 02:00:00 0 0
3 2018-01-02 00:00:00 -1 1
4 2018-01-02 01:00:00 3 1
5 2018-01-02 02:00:00 10 1
6 2018-01-03 00:00:00 7 2
7 2018-01-03 01:00:00 11 2
8 2018-01-03 02:00:00 2 2
df = timeseries.join(values.groupby('new')['Value'].min().rename('Min'))
print (df)
Start DT End DT Min
0 2018-01-01 2018-01-01 03:00:00 0
1 2018-01-02 2018-01-02 03:00:00 -1
2 2018-01-03 2018-01-03 03:00:00 2
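Note that closed='both' makes both endpoints part of each interval, so a value stamped exactly at an End DT (for example 03:00:00) would still be matched; with closed='left' such a value would fall outside its interval and get_indexer would return -1 for it.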
EDIT: If there is no match, get_indexer returns -1, which would otherwise pick the last index value (here 2), so convert those positions to missing values instead:
timeseries = pd.DataFrame(
    [
        ['2018-01-01T00:00:00.000000000', '2018-01-01T03:00:00.000000000'],
        ['2018-01-02T00:00:00.000000000', '2018-01-02T03:00:00.000000000'],
        ['2018-01-03T00:00:00.000000000', '2018-01-03T03:00:00.000000000'],
    ], dtype='datetime64[ns]', columns=['Start DT', 'End DT'])

values = pd.DataFrame(
    [
        ['2017-12-31T00:00:00.000000000', -10],
        ['2018-01-01T00:00:00.000000000', 1],
        ['2018-01-01T01:00:00.000000000', 2],
        ['2018-01-01T02:00:00.000000000', 0],
        ['2018-01-02T00:00:00.000000000', -1],
        ['2018-01-02T01:00:00.000000000', 3],
        ['2018-01-02T02:00:00.000000000', 10],
        ['2018-01-03T00:00:00.000000000', 7],
        ['2018-01-03T01:00:00.000000000', 11],
        ['2018-01-03T02:00:00.000000000', 2],
    ], columns=['DT', 'Value'])
values['DT'] = pd.to_datetime(values['DT'])
print (values)
DT Value
0 2017-12-31 00:00:00 -10
1 2018-01-01 00:00:00 1
2 2018-01-01 01:00:00 2
3 2018-01-01 02:00:00 0
4 2018-01-02 00:00:00 -1
5 2018-01-02 01:00:00 3
6 2018-01-02 02:00:00 10
7 2018-01-03 00:00:00 7
8 2018-01-03 01:00:00 11
9 2018-01-03 02:00:00 2
s = pd.IntervalIndex.from_arrays(timeseries['Start DT'],
timeseries['End DT'], closed='both')
pos = s.get_indexer(values['DT'])
values['new'] = timeseries.index[pos].where(pos != -1)
print (values)
DT Value new
0 2017-12-31 00:00:00 -10 NaN
1 2018-01-01 00:00:00 1 0.0
2 2018-01-01 01:00:00 2 0.0
3 2018-01-01 02:00:00 0 0.0
4 2018-01-02 00:00:00 -1 1.0
5 2018-01-02 01:00:00 3 1.0
6 2018-01-02 02:00:00 10 1.0
7 2018-01-03 00:00:00 7 2.0
8 2018-01-03 01:00:00 11 2.0
9 2018-01-03 02:00:00 2 2.0
df = timeseries.join(values.dropna(subset=['new']).groupby('new')['Value'].min().rename('Min'))
print (df)
Start DT End DT Min
0 2018-01-01 2018-01-01 03:00:00 0
1 2018-01-02 2018-01-02 03:00:00 -1
2 2018-01-03 2018-01-03 03:00:00 2

One possible solution is to create a key variable on which to join the two datasets (this works here because each interval lies within a single calendar day):
# create 'key' variable
timeseries['key'] = timeseries['Start DT'].astype(str)
values['key'] = pd.to_datetime(values['DT'].str.replace('T', ' '), format='%Y-%m-%d %H:%M:%S.%f').dt.date.astype(str)
# create dataset with minima
mins = values.groupby('key').agg({'Value': 'min'}).reset_index()
# join
timeseries.merge(mins, on='key').drop(columns=['key'])
Start DT End DT Value
0 2018-01-01 2018-01-01 03:00:00 0
1 2018-01-02 2018-01-02 03:00:00 -1
2 2018-01-03 2018-01-03 03:00:00 2
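The same key idea can also be expressed without string manipulation; a small sketch, assuming values['DT'] parses cleanly with pd.to_datetime (like the answer above, it relies on each interval lying within a single calendar day):
values['DT'] = pd.to_datetime(values['DT'])
# group on the day (midnight timestamp) of each value and take the minimum
mins = values.groupby(values['DT'].dt.normalize())['Value'].min().rename('Min')
# 'Start DT' is already a midnight timestamp, so it lines up with the group index
out = timeseries.merge(mins, left_on='Start DT', right_index=True)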

# assumes values['DT'] is already datetime, so astype(str) yields 'YYYY-MM-DD HH:MM:SS'
values['DT'] = values['DT'].astype(str)   # convert to string
s = values['DT'].str.split(' ')           # split on the space
values['day'] = s.str[0]                  # take the day part
df4 = values.groupby(by='day').min()      # group by day and take the min value
df4.reset_index(inplace=True)             # reset index
df4['day'] = pd.to_datetime(df4['day'])   # convert back to datetime for merging
final = pd.merge(timeseries, df4, left_on='Start DT', right_on='day', how='inner')  # merge
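Like the key-based answer above, this groups on the calendar day, so it also assumes every interval stays within a single day.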

Related

Pandas groupby issue after melt bug?

Python version 3.8.12
pandas 1.4.1
Given the following dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': [1000] * 4,
    'date': ['2022-01-01'] * 4,
    'ts': pd.date_range('2022-01-01', freq='5min', periods=4),  # 5-minute steps ('5M' would mean 5 months)
    'A': np.random.randint(1, 6, size=4),
    'B': np.random.rand(4)
})
that looks like this:
   id    date        ts                   A  B
0  1000  2022-01-01  2022-01-01 00:00:00  4  0.98019
1  1000  2022-01-01  2022-01-01 00:05:00  3  0.82021
2  1000  2022-01-01  2022-01-01 00:10:00  4  0.549684
3  1000  2022-01-01  2022-01-01 00:15:00  5  0.0818311
I transposed the columns A and B with pandas melt:
melted = df.melt(
    id_vars=['id', 'date', 'ts'],
    value_vars=['A', 'B'],
    var_name='label',
    value_name='value',
    ignore_index=True
)
that looks like this:
   id    date        ts                   label  value
0  1000  2022-01-01  2022-01-01 00:00:00  A      4
1  1000  2022-01-01  2022-01-01 00:05:00  A      3
2  1000  2022-01-01  2022-01-01 00:10:00  A      4
3  1000  2022-01-01  2022-01-01 00:15:00  A      5
4  1000  2022-01-01  2022-01-01 00:00:00  B      0.98019
5  1000  2022-01-01  2022-01-01 00:05:00  B      0.82021
6  1000  2022-01-01  2022-01-01 00:10:00  B      0.549684
7  1000  2022-01-01  2022-01-01 00:15:00  B      0.0818311
Then I groupby and select the first group:
melted.groupby(['id', 'date']).first()
that gives me this:
ts label value
id date
1000 2022-01-01 2022-01-01 A 4.0
but I would expect this output instead:
ts A B
id date
1000 2022-01-01 2022-01-01 00:00:00 4 0.980190
2022-01-01 2022-01-01 00:05:00 3 0.820210
2022-01-01 2022-01-01 00:10:00 4 0.549684
2022-01-01 2022-01-01 00:15:00 5 0.081831
What am I not getting? Or is this a bug? Also, why is the ts column converted to a date?
My bad! I thought first would return the first group, but it actually returns the first element of each group, as stated in the documentation for the pandas aggregation functions. Sorry folks, I was doing this late at night and could not think straight :/
To select the first group, I needed to use the get_group function.
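A minimal sketch of that fix, assuming the melted frame from above (the group key is the (id, date) tuple):
# returns every row of the chosen group instead of the first row of each group
first_group = melted.groupby(['id', 'date']).get_group((1000, '2022-01-01'))
print(first_group)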

pandas frequency of each entry of a value in a row for a given column

This dataframe is obtained from a timeseries resample operation as shown below
Ticket Priority
Submit Date
2018-01-02 04:00:00 1 P3 - Normal
2018-01-02 08:00:00 18 P3 - NormalP3 - NormalP3 - NormalP3 - NormalP3...
2018-01-02 12:00:00 23 P2 - HighP3 - NormalP3 - NormalP3 - NormalP3 -...
2018-01-02 16:00:00 1 P3 - Normal
2018-01-02 20:00:00 0 0
2018-01-03 00:00:00 0 0
2018-01-03 04:00:00 1 P3 - Normal
2018-01-03 08:00:00 3 P3 - NormalP3 - NormalP3 - Normal
What I'm actually looking to get is something like this:
Ticket Priority
Submit Date
2018-01-02 04:00:00 1 P3 - Normal = 1
2018-01-02 08:00:00 18 P3 - Normal = 4
2018-01-02 12:00:00 23 P2 - High = 1
P3 - Normal = 3
2018-01-02 16:00:00 1 P3 - Normal = 1
2018-01-02 20:00:00 0 0
2018-01-03 00:00:00 0 0
2018-01-03 04:00:00 1 P3 - Normal = 1
2018-01-03 08:00:00 3 P3 - Normal = 3
where the Priority column lists the type of ticket and the count of occurrence of each of those ticket types.
def get_priorities(x):
    types = ['Normal', 'High']
    if x == 0:
        return 0
    else:
        z = []
        for y in types:
            if y in x:
                z.append(str(x[:2] + '-' + '{} = '.format(y) + str(x.count(y))))
        return ' '.join(z)
Use this as your custom function and apply it to your data frame with a lambda.
df['Priority'] = df['Priority'].apply(lambda x: get_priorities(x))
Let me know if this does not work for you.
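Note that get_priorities takes the priority prefix from x[:2], so a row mixing priorities (like the 23-ticket row) would stamp every type with the same prefix. A hedged alternative sketch that instead counts the full 'Pn - Type' substrings, assuming only the 'Normal' and 'High' types from the original list occur:
import re
from collections import Counter

def get_priorities_v2(x):
    # rows with no tickets were filled with 0 by the resample step
    if x == 0:
        return 0
    # count each complete "P<n> - <Type>" pattern in the concatenated string
    counts = Counter(re.findall(r'P\d+ - (?:Normal|High)', str(x)))
    return ' '.join('{} = {}'.format(k, v) for k, v in counts.items())

df['Priority'] = df['Priority'].apply(get_priorities_v2)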

pandas groupby not using for loop (how to make smart)

Suppose you have a pandas Series like this.
a = pd.Series(range(31),index = pd.date_range('2018-01-01','2018-01-31',freq='D'))
Now suppose you want to make a grouped DataFrame with a MultiIndex like this:
data
date
2018-01-01 2018-01-01 0
2018-01-02 1
2018-01-03 2
2018-01-04 3
2018-01-05 4
2018-01-02 2018-01-02 1
2018-01-03 2
2018-01-04 3
2018-01-05 4
2018-01-06 5
2018-01-03 2018-01-03 2
2018-01-04 3
2018-01-05 4
2018-01-06 5
2018-01-07 6
.....
The first level of the MultiIndex is the original datetime index, and the second level holds a 5-day window starting at that date.
For example, if the first level is 2018-01-01, the second level runs from 2018-01-01 to 2018-01-05.
If the first level is 2018-01-15, the second level runs from 2018-01-15 to 2018-01-19 and the data is 14, 15, 16, 17, 18.
How can I make this DataFrame or Series without any loop?
Use:
import numpy as np

first = np.repeat(a.index.values, 5)
second = [i + np.timedelta64(j, 'D') for i in a.index for j in range(5)]
arrays = [first, second]
print(np.shape(second))
d = pd.DataFrame(index=pd.MultiIndex.from_arrays(arrays, names=('date1', 'date2')))
Output (first 10 rows of d):
value
date1 date2
2018-01-01 2018-01-01 0.0
2018-01-02 1.0
2018-01-03 2.0
2018-01-04 3.0
2018-01-05 4.0
2018-01-02 2018-01-02 1.0
2018-01-03 2.0
2018-01-04 3.0
2018-01-05 4.0
2018-01-06 5.0
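The snippet above only builds the MultiIndex, while the printed frame also shows a value column. A small sketch of one way to fill it (an assumption on my part, not part of the original answer): align the original series a on the second index level, which also explains the float dtype, since dates past 2018-01-31 have no match and become NaN.
# map each date2 back onto the original series; unmatched dates give NaN
d['value'] = a.reindex(d.index.get_level_values('date2')).values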

Pandas: filling missing values by the time occurrence of an event

I already asked a similar question (see here), but unfortunately it was not clear enough, so I decided it was better to create a new one with a better example dataset and a new explanation of the desired output - an edit would have been too major a change.
So, I have the following dataset (it's already sorted by date and player):
import numpy as np
import pandas as pd

d = {'player': ['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '3', '3', '3', '3', '3', '3'],
'date': ['2018-01-01 00:17:01', '2018-01-01 00:17:05','2018-01-01 00:19:05', '2018-01-01 00:21:07', '2018-01-01 00:22:09',
'2018-01-01 00:22:17', '2018-01-01 00:25:09', '2018-01-01 00:25:11', '2018-01-01 00:27:28', '2018-01-01 00:29:29',
'2018-01-01 00:30:35', '2018-02-01 00:31:16', '2018-02-01 00:35:22', '2018-02-01 00:38:16',
'2018-02-01 00:38:20', '2018-02-01 00:55:15', '2018-01-03 00:55:22',
'2018-01-03 00:58:16', '2018-01-03 00:58:21', '2018-03-01 01:00:35', '2018-03-01 01:20:16', '2018-03-01 01:31:16'],
'id': [np.nan, np.nan, 'a', 'a', 'b', np.nan, 'b', 'c', 'c', 'c', 'c', 'd', 'd', 'e', 'e', np.nan, 'f', 'f',
'g', np.nan, 'f', 'g']}
#create dataframe
df = pd.DataFrame(data=d)
#change date to datetime
df['date'] = pd.to_datetime(df['date'])
df
player date id
0 1 2018-01-01 00:17:01 NaN
1 1 2018-01-01 00:17:05 NaN
2 1 2018-01-01 00:19:05 a
3 1 2018-01-01 00:21:07 a
4 1 2018-01-01 00:22:09 b
5 1 2018-01-01 00:22:07 NaN
6 1 2018-01-01 00:25:09 b
7 1 2018-01-01 00:25:11 c
8 1 2018-01-01 00:27:28 c
9 1 2018-01-01 00:29:29 c
10 1 2018-01-01 00:30:35 c
11 2 2018-02-01 00:31:16 d
12 2 2018-02-01 00:35:22 d
13 2 2018-02-01 00:38:16 e
14 2 2018-02-01 00:38:20 e
15 2 2018-02-01 00:55:15 NaN
16 3 2018-01-03 00:55:22 f
17 3 2018-01-03 00:58:16 f
18 3 2018-01-03 00:58:21 g
19 3 2018-03-01 01:00:35 NaN
20 3 2018-03-01 01:20:16 f
21 3 2018-03-01 01:31:16 g
So, these are my three columns:
'player' (object)
'id' (object): each session id groups together a set of actions (i.e. the rows in the dataset) that the player performed online.
'date' (datetime): the time at which each action was performed.
The problem with this dataset is that I have the timestamps for each action, but some of the actions are missing their session id. What I want to do is the following: for each player, I want to assign an id label to the missing values, based on the timeline. An action missing its id can be labeled if it falls within the temporal range (first action - last action) of a certain session.
Ok, so here I have my missing values:
df.loc[df.id.isnull(),'date']
0 2018-01-01 00:17:01
1 2018-01-01 00:17:05
5 2018-01-01 00:22:07
15 2018-02-01 00:55:15
19 2018-03-01 01:00:35
Please note that I have the player code for each of them: what I am missing is just the session code. So, I want to compare the timestamp of each missing value with the session timestamps of the corresponding player.
I was thinking of using a groupby to compute the first and last action of each session, for each player (but I do not know if it is the best approach).
my_agg = df.groupby(['player', 'id']).date.agg([min, max])
my_agg
min max
player id
1 a 2018-01-01 00:19:05 2018-01-01 00:21:07
b 2018-01-01 00:22:09 2018-01-01 00:25:09
c 2018-01-01 00:25:11 2018-01-01 00:30:35
2 d 2018-02-01 00:31:16 2018-02-01 00:35:22
e 2018-02-01 00:38:16 2018-02-01 00:38:20
3 f 2018-01-03 00:55:22 2018-03-01 01:20:16
g 2018-01-03 00:58:21 2018-03-01 01:31:16
Then I would like to match the NaNs by player, and compare the timestamp of each missing value with the range of each session of that player.
In the dataset I try to illustrate the three possible scenarios I am interested in:
the action occurred between the first and last date of a certain session. In this case I would like to fill the missing value with the id of that session, as it clearly belongs to it. Row 5 of the dataset should therefore be labeled 'b', as it occurs within the range of b.
I would mark as '0' the cases where the action occurred outside the range of any session - for example the first two NaNs and row 15.
Finally, mark it as '-99' if it is not possible to associate the action with a single session, because it occurred during the time ranges of several sessions. This is the case of row 19, the last NaN.
Desired output:
to sum it up, the outcome should look like this df:
player date id
0 1 2018-01-01 00:17:01 0
1 1 2018-01-01 00:17:05 0
2 1 2018-01-01 00:19:05 a
3 1 2018-01-01 00:21:07 a
4 1 2018-01-01 00:22:09 b
5 1 2018-01-01 00:22:07 b
6 1 2018-01-01 00:25:09 b
7 1 2018-01-01 00:25:11 c
8 1 2018-01-01 00:27:28 c
9 1 2018-01-01 00:29:29 c
10 1 2018-01-01 00:30:35 c
11 2 2018-02-01 00:31:16 d
12 2 2018-02-01 00:35:22 d
13 2 2018-02-01 00:38:16 e
14 2 2018-02-01 00:38:20 e
15 2 2018-02-01 00:55:15 0
16 3 2018-01-03 00:55:22 f
17 3 2018-01-03 00:58:16 f
18 3 2018-01-03 00:58:21 g
19 3 2018-03-01 01:00:35 -99
20 3 2018-03-01 01:20:16 f
21 3 2018-03-01 01:31:16 g
This may not be the best approach, but it does work. Basically I create some new columns using shift and then use the conditions you mentioned with np.select:
df['shift'] = df['id'].shift(1)
df['shift-1'] = df['id'].shift(-1)
df['merge'] = df[['shift','shift-1']].values.tolist()
df.drop(columns=['shift','shift-1'], inplace=True)
alpha = {np.nan:0,'a':1,'b':2,'c':3,'d':4,'e':5,'f':6,'g':7,'h':8}
diff = []
for i in range(len(df)):
    diff.append(alpha[df['merge'][i][1]] - alpha[df['merge'][i][0]])
df['diff'] = diff
conditions = [
    (df['id'].shift(1).eq(df['id'].shift(-1)) & (df['id'].isna()) &
     (df['player'].shift(1).eq(df['player'].shift(-1)))),
    (~df['id'].shift(1).eq(df['id'].shift(-1)) & (df['id'].isna()) &
     (df['player'].shift(1).eq(df['player']) | df['player'].shift(-1).eq(df['player'])) &
     (~df['diff'] < 0)),
    (~df['id'].shift(1).eq(df['id'].shift(-1)) & (df['id'].isna()) &
     (df['player'].shift(1).eq(df['player']) | df['player'].shift(-1).eq(df['player'])) &
     (df['diff'] < 0)),
]
choices = [df['id'].ffill(),
           0,
           -99]
df['id'] = np.select(conditions, choices, default = df['id'])
df.drop(columns=['merge','diff'], inplace=True)
df
out:
player date id
0 1 2018-01-01 00:17:01 0
1 1 2018-01-01 00:17:05 0
2 1 2018-01-01 00:19:05 a
3 1 2018-01-01 00:21:07 a
4 1 2018-01-01 00:22:09 b
5 1 2018-01-01 00:22:07 b
6 1 2018-01-01 00:25:09 b
7 1 2018-01-01 00:25:11 c
8 1 2018-01-01 00:27:28 c
9 1 2018-01-01 00:29:29 c
10 1 2018-01-01 00:30:35 c
11 2 2018-02-01 00:31:16 d
12 2 2018-02-01 00:35:22 d
13 2 2018-02-01 00:38:16 e
14 2 2018-02-01 00:38:20 e
15 2 2018-02-01 00:55:15 0
16 3 2018-01-03 00:55:22 f
17 3 2018-01-03 00:58:16 f
18 3 2018-01-03 00:58:21 g
19 3 2018-03-01 01:00:35 -99
20 3 2018-03-01 01:20:16 f
21 3 2018-03-01 01:31:16 g
In my solution I just had to do a bit of work to correctly apply the function written by #ysearka in a previous stackoverflow question - see here. The basic challenge was to apply his function player by player.
#define a function to sort the missing values (ysearka's function from stackoverflow)
def my_custom_function(time):
    #compare every date event with the range of the sessions
    current_sessions = my_agg.loc[(my_agg['min'] < time) & (my_agg['max'] > time)]
    #store the length, that is the number of matches
    count = len(current_sessions)
    #how many matches are there for this missing id value?
    #if 0, no matches are found: the event lies outside all the possible ranges
    if count == 0:
        return 0
    #if more than one, it is impossible to say to which session the event belongs
    if count > 1:
        return -99
    #equivalent to if count == 1: in this case the event clearly belongs to just one session
    return current_sessions.index[0][1]

#create a list storing all the player ids
plist = list(df.player.unique())
#ignore settingcopywarning: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
pd.options.mode.chained_assignment = None
#create an empty new dataframe where to store the results
final = pd.DataFrame()
#with this for loop, iterate over the part of the dataset corresponding to one player at a time
for i in plist:
    #slice the dataset by player
    players = df.loc[df['player'] == i]
    #for every player, take the dates where the id is missing
    mv_per_player = players.loc[players.id.isnull(), 'date']
    #for each player, group by session id and compute the first and last event
    my_agg = players.groupby(['player', 'id']).date.agg([min, max])
    #apply the function to each chunk of the dataset; the result is a series with the imputed values for the NaNs
    ema = mv_per_player.apply(my_custom_function)
    #now we can substitute the missing ids with the new imputed values...
    players.loc[players.id.isnull(), 'id'] = ema.values
    #append the new values stored in players to the new dataframe
    final = final.append(players)
#...and check the new dataset
final
player date id
0 1 2018-01-01 00:17:01 0
1 1 2018-01-01 00:17:05 0
2 1 2018-01-01 00:19:05 a
3 1 2018-01-01 00:21:07 a
4 1 2018-01-01 00:22:09 b
5 1 2018-01-01 00:22:17 b
6 1 2018-01-01 00:25:09 b
7 1 2018-01-01 00:25:11 c
8 1 2018-01-01 00:27:28 c
9 1 2018-01-01 00:29:29 c
10 1 2018-01-01 00:30:35 c
11 2 2018-02-01 00:31:16 d
12 2 2018-02-01 00:35:22 d
13 2 2018-02-01 00:38:16 e
14 2 2018-02-01 00:38:20 e
15 2 2018-02-01 00:55:15 0
16 3 2018-01-03 00:55:22 f
17 3 2018-01-03 00:58:16 f
18 3 2018-01-03 00:58:21 g
19 3 2018-03-01 01:00:35 -99
20 3 2018-03-01 01:20:16 f
21 3 2018-03-01 01:31:16 g
I do not think my solution is the best, and I would still appreciate other ideas, especially if they are more easily scalable (I have a large dataset).
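Since the question asks for something more scalable, here is a rough loop-free sketch of the same idea (my own assumption, not taken from either answer above): join every unlabeled action to all sessions of its player, test which session ranges contain its timestamp, and apply the 0 / -99 rules to the number of hits.
# session boundaries per (player, session id)
bounds = (df.dropna(subset=['id'])
            .groupby(['player', 'id'])['date'].agg(['min', 'max'])
            .reset_index())
# pair each unlabeled action with every session of the same player
missing = (df.loc[df['id'].isna(), ['player', 'date']]
             .reset_index()
             .merge(bounds, on='player', how='left'))
missing['hit'] = (missing['date'] > missing['min']) & (missing['date'] < missing['max'])

def label(g):
    hits = g.loc[g['hit'], 'id']
    if len(hits) == 0:
        return 0            # outside every session range
    if len(hits) > 1:
        return -99          # ambiguous: inside several session ranges
    return hits.iloc[0]     # exactly one matching session

labels = missing.groupby('index').apply(label)
df.loc[labels.index, 'id'] = labels.values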

Transform data by time & Class

I have a dataframe nf as follows:
DateTime Class Count
0 2017-10-01 00:00:00 1 0
1 2017-10-01 00:00:00 2 240
2 2017-10-01 00:00:00 3 17
3 2017-10-01 00:00:00 4 0
4 2017-10-01 00:00:00 5 1
5 2017-10-01 00:00:00 6 0
6 2017-10-01 00:00:00 7 0
7 2017-10-01 00:00:00 8 0
8 2017-10-01 00:00:00 9 0
9 2017-10-01 00:00:00 10 0
10 2017-10-01 00:00:00 11 0
11 2017-10-01 00:00:00 12 0
12 2017-10-01 00:00:00 13 0
13 2017-10-01 00:00:00 14 0
14 2017-10-01 00:00:00 15 0
..............................
30 2017-10-01 01:00:00 1 0
31 2017-10-01 01:00:00 2 209
32 2017-10-01 01:00:00 3 14
33 2017-10-01 01:00:00 4 0
34 2017-10-01 01:00:00 5 4
35 2017-10-01 01:00:00 6 0
36 2017-10-01 01:00:00 7 0
37 2017-10-01 01:00:00 8 0
38 2017-10-01 01:00:00 9 0
39 2017-10-01 01:00:00 10 0
40 2017-10-01 01:00:00 11 0
41 2017-10-01 01:00:00 12 0
42 2017-10-01 01:00:00 13 0
43 2017-10-01 01:00:00 14 0
44 2017-10-01 01:00:00 15 0
....... and so on
There are 15 classes in total, with a count for each class for each hour.
I want to reshape the data so that each class becomes its own column, with one row per hour, as follows.
Required output:
DateTime Class1 Class2 Class3 Class4.........Class15
2017-10-01 00:00:00 0 240 17 0 ......... 0
2017-10-01 01:00:00 0 209 14 0 ......... 0
....
and so on
You can use pandas to read the data into a pd.DataFrame(), select the counts for each class by slicing the dataframe with conditions, and concatenate the results afterwards using the datetime as index:
import pandas as pd
# create dataframe from file
df = pd.read_csv('fname')
# or from a numpy array
df = pd.DataFrame(data=np_array, columns=['DateTime', 'Class', 'Count'])
# select the counts for each class
df_c1 = df[df.Class == 1]
df_c2 = df[df.Class == 2]
df_c3 = df[df.Class == 3]
df_c4 = df[df.Class == 4]
df_new = pd.DataFrame()
df_new['DateTime'] = df_c1['DateTime']
df_new['Class1'] = df_c1['Count']
# use .values so the differing indexes of the class slices do not misalign the assignment
df_new['Class2'] = df_c2['Count'].values
df_new['Class3'] = df_c3['Count'].values
df_new['Class4'] = df_c4['Count'].values
The code example is really dirty and I'm probably missing a lot, but maybe it gives you some inspiration. I would also recommend checking the pandas documentation for concat() and DataFrame().
I'm going to review and refactor my example code tomorrow, in case the problem is not solved by then. Meanwhile, you could fix the layout of the data in your question; it's not readable.
Try pivot_table:
(df.pivot_table(index='DateTime',columns='Class',
values='Count',
aggfunc='sum')
.add_prefix('Class_'))
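To get exactly the layout asked for (DateTime as a regular column and headers like Class1 rather than Class_1), a small follow-up sketch, assuming df holds the long-format data shown in the question:
out = (df.pivot_table(index='DateTime', columns='Class',
                      values='Count', aggfunc='sum')
         .add_prefix('Class')
         .reset_index())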