Manipulating duplicate rows across a subset of columns in a pandas DataFrame

Suppose I have a dataframe as follows:
df = pd.DataFrame({"user":[11,11,11,21,21,21,21,21,32,32],
"event":[0,0,1,0,0,1,1,1,0,0],
"datetime":['05:29:54','05:32:04','05:32:08',
'15:35:26','15:36:07','15:36:16','15:36:50','15:36:54',
'09:29:12', '09:29:25'] })
I would like to handle the repeated rows that share the same value in the first column (user) so that I end up with the following.
In this case, the 'event' column is replaced by the maximum event value for that user (for example, for user=11 the maximum event value is 1), and the 'datetime' column is replaced by the average of the datetimes.
P.S. Dropping duplicate rows has already been discussed here, but I do not want to drop rows blindly, especially when I am dealing with a dataframe with many attributes.

You want to groupby and aggregate:
df.groupby('user').agg({'event': 'max',
                        'datetime': lambda s: pd.to_timedelta(s).mean()})
If you want, you can also convert your datetime column to timedelta first using pd.to_timedelta and then simply take the mean in the agg, as sketched below.
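For example, a minimal sketch of that pre-conversion variant using the sample df from the question (the intermediate name df_td is just for illustration):
df_td = df.assign(datetime=pd.to_timedelta(df['datetime']))  # convert the HH:MM:SS strings to timedeltas first
df_td.groupby('user').agg({'event': 'max', 'datetime': 'mean'})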
You can use str if you want the result represented as a plain string:
df.groupby('user').agg({'event': 'max',
                        'datetime': lambda s: str(pd.to_timedelta(s).mean().to_pytimedelta())})

You can convert the datetimes to integer nanoseconds, aggregate with mean, then convert back; for HH:MM:SS strings use strftime:
import numpy as np

df['datetime'] = pd.to_datetime(df['datetime']).astype(np.int64)
df1 = df.groupby('user', as_index=False).agg({'event': 'max', 'datetime': 'mean'})
df1['datetime'] = pd.to_datetime(df1['datetime']).dt.strftime('%H:%M:%S')
print(df1)
   user  event  datetime
0    11      1  05:31:22
1    21      1  15:36:18
2    32      0  09:29:18
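As a quick arithmetic check of the first row: user 11's three times are 19794 s, 19924 s and 19928 s past midnight, whose mean is 19882 s, i.e. 05:31:22, which matches the aggregated output above.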

Related

Changing Excel Dates (As integers) mixed with timestamps in single column - Have tried str.extract

I have a dataframe with a column of dates; unfortunately my import (using read_excel) brought some dates in as datetimes and others as Excel serial integers.
What I am seeking is a column with dates only, in the format %Y-%m-%d.
From research, Excel counts days from 1900-01-00, so I could add these integers to that origin. I have tried to use str.extract with a regex in order to separate the column into two, one of datetimes, the other of integers. However the result is NaN.
Here is an input code example
df = pd.DataFrame({'date_from': [pd.Timestamp('2022-09-10 00:00:00'),44476, pd.Timestamp('2021-02-16 00:00:00')], 'date_to': [pd.Timestamp('2022-12-11 00:00:00'),44455, pd.Timestamp('2021-12-16 00:00:00')]})
An attempt to first separate the columns by extracting the integers (dates imported from MS Excel):
df.date_from.str.extract(r'(\d\d\d\d\d)')
however this gives NaN.
The reason I have tried to separate the integers out of the column is that I get an error when trying to act on the Excel dates within the mixed column (in other words, an error when using the following code):
def convert_excel_time(excel_time):
    return pd.to_datetime('1900-01-01') + pd.to_timedelta(excel_time, 'D')
Any guidance on how I might get a column of dates only? I find the datetime modules and aspects of pandas and python the most frustrating of all to get to grips with!
thanks
You can convert the values to timedeltas with to_timedelta and errors='coerce' (so non-integers become NaT) and add the Excel origin Timestamp d, then convert the datetimes with errors='coerce', and finally combine the two with Series.fillna in a custom function:
def f(x):
    # https://stackoverflow.com/a/9574948/2901002
    d = pd.Timestamp(1899, 12, 30)
    timedeltas = pd.to_timedelta(x, unit='d', errors='coerce')
    dates = pd.to_datetime(x, errors='coerce')
    return (timedeltas + d).fillna(dates)

cols = ['date_from', 'date_to']
df[cols] = df[cols].apply(f)
print(df)
date_from date_to
0 2022-09-10 2022-12-11
1 2021-10-07 2021-09-16
2 2021-02-16 2021-12-16
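As a quick sanity check of the 1899-12-30 origin used in f (Excel's effective day zero once its 1900 leap-year quirk is accounted for), converting the serial 44476 from the question by hand reproduces the 2021-10-07 seen in the output above:
pd.Timestamp(1899, 12, 30) + pd.to_timedelta(44476, unit='d')
# Timestamp('2021-10-07 00:00:00')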

Pandas groupby year filtering the dataframe by n largest values

I have a dataframe at hourly level with several columns. I want to extract the entire rows (containing all columns) of the 10 top values of a specific column for every year in my dataframe.
So far I have run the following code:
df = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10))
The problem here is that I only get the top 10 values of that specific column for each year and I lose the other columns. How can I do this operation while keeping the values of the other columns that correspond to the top 10 values per year of my 'totaldemand' column?
We usually do head after sort_values; sort first, then group the sorted frame by the year of its index, so that all columns are kept:
df_sorted = df.sort_values('totaldemand', ascending=False)
df_top10 = df_sorted.groupby(df_sorted.index.year).head(10)
nlargest can be applied to each group, passing the column to look for largest values.
So run:
df.groupby([df.index.year]).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
Of course, in the final version replace 3 with your actual value.
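For instance, a small self-contained sketch on synthetic hourly data (the values are made up; only the column name 'totaldemand' comes from the question, and 'othercol' stands in for the other attributes):
import numpy as np
import pandas as pd

rng = pd.date_range('2018-01-01', periods=24 * 365 * 2, freq='H')  # two years of hourly timestamps
df = pd.DataFrame({'totaldemand': np.random.default_rng(0).random(len(rng)),
                   'othercol': np.arange(len(rng))},
                  index=rng)

# top 3 rows per year by 'totaldemand', with all columns kept
top = df.groupby(df.index.year).apply(lambda grp: grp.nlargest(3, 'totaldemand'))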
Get the index of your query and use it as a mask on your original df:
idx = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10)).index.get_level_values(-1)
df.loc[idx]
(or something to that extent, I can't test now without any test data)

Datetime column coerced to int when setting with .loc and slice

I have a column of datetimes and need to change several of these values to new datetimes. When I set the values using df.loc[indices, 'col'] = new_datetimes, the unaffected values are coerced to int while the new set values are in datetime. If I set the values one at a time, no type coercion occurs.
For illustration I created a sample df with just one column.
import datetime as dt
import numpy as np
import pandas as pd

df = pd.DataFrame([dt.datetime(2019,1,1)]*5)
df.loc[[1,3,4]] = [dt.datetime(2019,1,2)]*3
df
This produces the following:
[screenshot omitted: rows 1, 3 and 4 show the new datetimes, while the untouched rows 0 and 2 are displayed as large integers]
If I change indices 1,3,4 individually:
df = pd.DataFrame([dt.datetime(2019,1,1)]*5)
df.loc[1] = dt.datetime(2019,1,2)
df.loc[3] = dt.datetime(2019,1,2)
df.loc[4] = dt.datetime(2019,1,2)
df
I get the correct output:
[screenshot omitted: all five rows display as datetimes, as expected]
A suggestion was to turn the list into a numpy array before setting, which does resolve the issue. However, if you try to set multiple columns (some of which are not datetime) using a numpy array, the issue arises again.
In this example the dataframe has two columns and I try to set both columns.
df = pd.DataFrame({'dt':[dt.datetime(2019,1,1)]*5, 'value':[1,1,1,1,1]})
df.loc[[1,3,4]] = np.array([[dt.datetime(2019,1,2)]*3, [2,2,2]]).T
df
This gives the following output:
[screenshot omitted: the coercion reappears; untouched values in the 'dt' column show as integers]
Can someone please explain what is causing the coercion and how to prevent it? The code I wrote that uses this was written over a month ago and used to work just fine; could it be one of those warnings about future versions of pandas deprecating certain functionality?
An explanation of what is going on would be greatly appreciated, because I have written other code that likely relies on similar functionality and I want to make sure everything works as intended.
The solution proposed by w-m has one awkward detail: the result column now also carries a time part (it didn't have one before).
I also have a general remark: DataFrames are tables, not Series, so they have columns, each with its own name, and it is a bad habit to rely on default column names (consecutive numbers).
So I propose another solution, addressing both of the above issues:
To create the source DataFrame I executed:
df = pd.DataFrame([dt.datetime(2019, 1, 1)]*5, columns=['c1'])
Note that I provided a name for the only column.
Then I created another DataFrame:
df2 = pd.DataFrame([dt.datetime(2019,1,2)]*3, columns=['c1'], index=[1,3,4])
It contains your "new" dates and the numbers which you used in loc
I set as the index (again with the same column name).
Then, to update df, use (not surprisingly) df.update:
df.update(df2)
This function performs in-place update, so if you print(df), you will get:
c1
0 2019-01-01
1 2019-01-02
2 2019-01-01
3 2019-01-02
4 2019-01-02
As you can see, under indices 1, 3 and 4 you have new dates
and there is no time part, just like before.
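The same update-based idea can also be tried on the two-column frame from the question. A hedged sketch (not part of the original answer): update aligns df2 on df's index, and the NaN introduced by that alignment will typically upcast the integer 'value' column to float, so check the dtypes afterwards:
import datetime as dt
import pandas as pd

df = pd.DataFrame({'dt': [dt.datetime(2019, 1, 1)] * 5, 'value': [1, 1, 1, 1, 1]})
df2 = pd.DataFrame({'dt': [dt.datetime(2019, 1, 2)] * 3, 'value': [2, 2, 2]}, index=[1, 3, 4])

df.update(df2)  # 'dt' stays datetime64[ns]; 'value' may come back as float64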
[dt.datetime(2019,1,2)]*3 is a Python list of objects. This particular list happens to contain only datetimes, but Pandas does not seem to recognize that, and treats it as exactly what it is: a list of arbitrary objects.
If you convert it into a typed array, then Pandas will keep the original dtype of the column intact:
df.loc[[1,3,4]] = np.asarray([dt.datetime(2019,1,2)]*3)
I hope this workaround helps you, but you may still want to file a bug with Pandas. I don't have an explanation as to why the datetime objects should be coerced to ints in the first output example.
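If the goal is simply to keep both dtypes intact when setting several rows of the two-column frame, one workaround in the same spirit (again a sketch, not from the original answers) is to avoid the mixed-dtype object array entirely and set each column separately:
import datetime as dt
import numpy as np
import pandas as pd

df = pd.DataFrame({'dt': [dt.datetime(2019, 1, 1)] * 5, 'value': [1, 1, 1, 1, 1]})

# one typed array (or plain list of ints) per column, so no object-dtype intermediate is created
df.loc[[1, 3, 4], 'dt'] = np.array([dt.datetime(2019, 1, 2)] * 3, dtype='datetime64[ns]')
df.loc[[1, 3, 4], 'value'] = [2, 2, 2]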

Compare columns from two different data frames and one column value

I have two different data frames named df1 and df2.
df1 has columns date1 and value1.
df2 has date2 and val (initially it contains 0).
The val column in df2 needs to be updated to 1 when a matching date is found in df1.
I achieved this by looping over both data frames with two for loops, but as the volume is very high, it takes a long time.
Is there a better way to do this?
You probably need something like this:
import numpy as np
import pandas as pd

common = np.intersect1d(df1.date1.values, df2.date2.values)
df2.loc[df2.date2.isin(common), 'val'] = 1
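A small end-to-end example with made-up data (the dates and values are purely illustrative; only the column names come from the question):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date1': ['2021-01-01', '2021-01-03'], 'value1': [10, 20]})
df2 = pd.DataFrame({'date2': ['2021-01-01', '2021-01-02', '2021-01-03'], 'val': 0})

common = np.intersect1d(df1.date1.values, df2.date2.values)
df2.loc[df2.date2.isin(common), 'val'] = 1
# df2['val'] is now [1, 0, 1]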

Group DataFrame by binning a column::Float64, in Julia

Say I have a DataFrame with a column of Float64s, I'd like to group the dataframe by binning that column. I hear the cut function might help, but it's not defined over dataframes. Some work has been done (https://gist.github.com/tautologico/3925372), but I'd rather use a library function rather than copy-pasting code from the Internet. Pointers?
EDIT Bonus karma for finding a way of doing this by month over UNIX timestamps :)
You could bin dataframes based on a column of Float64s like this. Here my bins are increments of 0.1 from 0.0 to 1.0, binning the dataframe based on a column of 100 random numbers between 0.0 and 1.0.
using DataFrames  # load DataFrames

# Make a DataFrame with some random Float64 numbers
df = DataFrame(index = rand(Float64, 100))

# For each bin (a tuple x of lower and upper bound, built by zipping 0.0:0.1:0.9 with 0.1:0.1:1.0),
# keep the rows whose :index column falls inside the bin.
df_array = map(x -> df[(df[:index] .>= x[1]) .& (df[:index] .< x[2]), :],
               zip(0.0:0.1:0.9, 0.1:0.1:1.0))
This will produce an array of 10 dataframes, each one with a different 0.1-sized bin.
As for the UNIX timestamp question, I'm not as familiar with that side of things, but after playing around a bit maybe something like this could work:
using Dates

# Make a dataframe with floats containing pretend unix time stamps
df = DataFrame(unixtime = rand(1E9:1:1.1E9, 100))

# Convert those timestamps to DateTime values
df[:date] = Dates.unix2datetime.(df[:unixtime])

# Make a "year month" string for every row in your time range
df[:year_month] = map(date -> string(Dates.Year(date)) * " " * string(Dates.Month(date)), df[:date])

# Bin based on each unique year_month string
df_array = map(ym -> df[df[:year_month] .== ym, :], unique(df[:year_month]))