Compute mean for groups of pandas dataframe rows - pandas

From a pandas dataframe with these columns:
DAT_MESURE datetime64[ns]
MES_TEMPERATURE object
I want to compute the mean temperature per hour and get a new df.
For example, I want to create a new df with DAT_MESURE floored to the start of the hour and the mean of the four values measured in that hour.
I want to get:
DAT_MESURE MES_TEMPERATURES
2020-08-01 00:00:00 21,xx
2020-08-01 01:00:00 22,xx
How can I code this in Python with pandas, please?

Use:
# MES_TEMPERATURE is stored as text with a decimal comma, so convert it to float first
df['MES_TEMPERATURE'] = df['MES_TEMPERATURE'].str.replace(',', '.', regex=False).astype(float)
# resample into hourly bins and take each bin's mean
df1 = df.resample('H', on='DAT_MESURE')['MES_TEMPERATURE'].mean()
Or:
# equivalent groupby: floor each timestamp to the start of its hour
df2 = df.groupby(df['DAT_MESURE'].dt.floor('H'))['MES_TEMPERATURE'].mean()
If you need rounding to the nearest hour instead:
df3 = df.groupby(df['DAT_MESURE'].dt.round('H'))['MES_TEMPERATURE'].mean()
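For example, a minimal sketch with made-up quarter-hour readings (the timestamps and temperatures below are assumptions for illustration, not the asker's data):
import pandas as pd

# hypothetical quarter-hour readings with decimal-comma temperatures
df = pd.DataFrame({
    'DAT_MESURE': pd.to_datetime(['2020-08-01 00:00', '2020-08-01 00:15',
                                  '2020-08-01 00:30', '2020-08-01 00:45',
                                  '2020-08-01 01:00', '2020-08-01 01:15']),
    'MES_TEMPERATURE': ['21,1', '21,3', '21,5', '21,7', '22,0', '22,4'],
})
df['MES_TEMPERATURE'] = df['MES_TEMPERATURE'].str.replace(',', '.', regex=False).astype(float)
print(df.resample('H', on='DAT_MESURE')['MES_TEMPERATURE'].mean())
# DAT_MESURE
# 2020-08-01 00:00:00    21.4
# 2020-08-01 01:00:00    22.2
# Freq: H, Name: MES_TEMPERATURE, dtype: float64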

Related

How do I offset a dataframe with values in another dataframe?

I have two dataframes. One holds the base values (df) and the other an offset (df2).
How do I create a third dataframe that is the first dataframe offset by matching values (the ID) in the second dataframe?
This post doesn't seem to do the offset... Update only some values in a dataframe using another dataframe
import pandas as pd
# initialize list of lists
data = [['1092', 10.02], ['18723754', 15.76], ['28635', 147.87]]
df = pd.DataFrame(data, columns = ['ID', 'Price'])
offsets = [['1092', 100.00], ['28635', 1000.00], ['88273', 10.]]
df2 = pd.DataFrame(offsets, columns = ['ID', 'Offset'])
print (df)
print (df2)
>>> print (df)
ID Price
0 1092 10.02
1 18723754 15.76 # no offset to affect it
2 28635 147.87
>>> print (df2)
ID Offset
0 1092 100.00
1 28635 1000.00
2 88273 10.00 # < no match
This is what I want to produce (the price has been offset where the ID matches):
ID Price
0 1092 110.02
1 18723754 15.76
2 28635 1147.87
I've also looked at Pandas Merging 101
I don't want to add columns to the dataframe, and I don't want to just replace column values with values from another dataframe.
What I want is to add (sum) column values from the other dataframe to this dataframe, where the IDs match.
The closest I come is df_add=df.reindex_like(df2) + df2 but the problem is that it sums all columns - even the ID column.
Try this:
# A left merge lines Offset up next to Price (NaN where no ID matches);
# sum(axis=1) then adds the two columns, treating NaN as 0
df['Price'] = pd.merge(df, df2, on=["ID"], how="left")[['Price','Offset']].sum(axis=1)
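To see why this works, here is the intermediate frame the left merge produces (row order follows df):
>>> pd.merge(df, df2, on=["ID"], how="left")
         ID   Price   Offset
0      1092   10.02   100.00
1  18723754   15.76      NaN
2     28635  147.87  1000.00
sum(axis=1) over ['Price', 'Offset'] skips the NaN, so unmatched rows keep their original price and matched rows get Price + Offset.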

Flightradar24 pandas groupby and vectorize. A no looping solution

I am looking to perform a fast operation on flightradar data to check whether the speed implied by the distance travelled matches the reported speed. I have multiple flights and was told not to run double loops on pandas dataframes. Here is a sample dataframe:
import pandas as pd
from datetime import datetime
from shapely.geometry import Point
from geopy.distance import distance
dates = ['2020-12-26 15:13:01', '2020-12-26 15:13:07','2020-12-26 15:13:19','2020-12-26 15:13:32','2020-12-26 15:13:38']
datetimes = [datetime.fromisoformat(date) for date in dates]
data = {'UTC': datetimes,
'Callsign': ["1", "1","2","2","2"],
'Position':[Point(30.542175,-91.13999200000001), Point(30.546204,-91.14020499999999),Point(30.551443,-91.14417299999999),Point(30.553909,-91.15136699999999),Point(30.554489,-91.155075)]
}
df = pd.DataFrame(data)
What I want to do is add a new column called "dist". This column should be 0 for the first row of each callsign; otherwise it should hold the distance between the row's point and the previous point.
The resulting df should look like this:
df1 = df
dist = [0,0.27783309075379214,0,0.46131362750613436,0.22464461718704595]
df1['dist'] = dist
What I have tried is to first assign a group index:
df['group_index'] = df.groupby('Callsign').cumcount()
Then groupby and try to apply the function:
df['dist'] = df.groupby('Callsign').apply(lambda g: 0 if g.group_index == 0 else distance((g.Position.x , g.Position.y),
(g.Position.shift().x , g.Position.shift().y)).miles)
I was hoping this would give me the 0 for the first index of each group and then run the distance function on all others and return a value in miles. However it does not work.
The code errors out for at least one reason: the .x and .y attributes of the shapely object are being called on the Series rather than on each individual Point.
Any ideas on how to fix this would be much appreciated.
1. Sort df by callsign, then timestamp.
2. Compute distances between adjacent rows using a temporary column of shifted points.
3. For the first row of each new callsign, set the distance to 0.
4. Drop the temporary column.
df = df.sort_values(by=['Callsign', 'UTC'])
# Shift points down one row; backfill so the first row holds a valid (dummy) point
df['Position_prev'] = df['Position'].shift().bfill()

def get_dist(row):
    return distance((row['Position'].x, row['Position'].y),
                    (row['Position_prev'].x, row['Position_prev'].y)).miles

df['dist'] = df.apply(get_dist, axis=1)
# Flag rows whose callsign differs from the previous row's callsign
new_callsign_rows = df['Callsign'] != df['Callsign'].shift()
# Zero out the first distance of each callsign group
df.loc[new_callsign_rows, 'dist'] = 0.0
# Drop the temporary column
df = df.drop(columns='Position_prev')
print(df)
print(df)
UTC Callsign Position dist
0 2020-12-26 15:13:01 1 POINT (30.542175 -91.13999200000001) 0.000000
1 2020-12-26 15:13:07 1 POINT (30.546204 -91.14020499999999) 0.277833
2 2020-12-26 15:13:19 2 POINT (30.551443 -91.14417299999999) 0.000000
3 2020-12-26 15:13:32 2 POINT (30.553909 -91.15136699999999) 0.461314
4 2020-12-26 15:13:38 2 POINT (30.554489 -91.155075) 0.224645
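A variant worth noting (a sketch, not the code above): shifting within each callsign group via groupby makes the bfill and the flag step unnecessary, because the first row of every group gets NaN automatically:
df = df.sort_values(['Callsign', 'UTC'])
# shift within each group: the first row of each callsign gets NaN
df['Position_prev'] = df.groupby('Callsign')['Position'].shift()

def get_dist(row):
    if pd.isna(row['Position_prev']):
        return 0.0  # first row of its callsign group
    return distance((row['Position'].x, row['Position'].y),
                    (row['Position_prev'].x, row['Position_prev'].y)).miles

df['dist'] = df.apply(get_dist, axis=1)
df = df.drop(columns='Position_prev')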

Creating a Pandas Dataframe and Assigning Values Based on Another Dataframe

I have a dataframe that looks like this:
df1
ticker period calendarDate updated dateKey assetsAverage
WMT Q 2021-01-01 2021-03-31 2021-04-02 100000000
What I want to do is take these values and put them into another dataframe that looks like this:
df2
ticker period Calendar Date Updated Date Key Assets Average
WMT Q 2021-01-01 2021-03-31 2021-04-02 100000000
I'm using the 2nd dataframe as my output and using my 1st dataframe as temporary storage.
Any suggestions?
I tried doing something like this:
df2 = pd.DataFrame(
{
"Ticker":df1["ticker"],
"Period":df1["period"],
"Calendar Date":df1["calendarDate"],
"Updated":df1["updated"],
"Date Key":df1["dateKey"],
"Assets Average":df1["assetsAverage"]
}
)
The error message I got was:
TypeError: __init__() takes from 1 to 6 positional arguments but 112 were given
(I'm actually using more columns, but getting my point across only required a few.)
Edit #1:
This is what I am trying to do now:
df2 = df1.copy()
df2 = df2.rename(columns = {
"ticker":"Ticker",
"period":"Period",
"calendarDate":"Calendar Date",
"updated":"Updated",
"dateKey":"Date Key",
"assetsAverage":"Assets Average"
}
)
Unfortunately, I got the same error message as before, any suggestions?
Do this:
df2 = df1.copy()
# assign the new header directly, in column order
df2.columns = ['ticker', 'period', 'Calendar Date', 'Updated', 'Date Key', 'Assets Average']
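For what it's worth, the rename approach from your Edit #1 is also valid pandas; a minimal sketch (assuming the six columns shown are the only ones, and keeping ticker and period lowercase as in your desired output) would be:
df2 = df1.rename(columns={
    'calendarDate': 'Calendar Date',
    'updated': 'Updated',
    'dateKey': 'Date Key',
    'assetsAverage': 'Assets Average',
})
rename with a mapping dict does not raise that positional-arguments TypeError, so the error most likely originated from a pd.DataFrame(...) call elsewhere in your code.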

Groupby two columns, one of them a datetime

I have a dataframe that I want to group by two columns, one of which is of datetime type. How can I do this?
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a':np.random.randn(6),
    'b':np.random.choice( [5,7,np.nan], 6),
    'c':np.random.choice( ['panda','python','shark'], 6),
    # some ways to create systematic groups for indexing or groupby
    # this is similar to R's expand.grid()
    'd':np.repeat( range(3), 2 ),
    'e':np.tile( range(2), 3 ),
    # a date range and a set of random dates
    'f':pd.date_range('1/1/2011', periods=6, freq='D'),
    'g':np.random.choice( pd.date_range('1/1/2011', periods=365,
                                        freq='D'), 6, replace=False)
})
You can use pd.Grouper to specify groupby instructions. Combined with a pd.DatetimeIndex, it groups the data at a specified frequency via the freq parameter.
Assuming that you have this dataframe:
df = pd.DataFrame(dict(
a=dict(date=pd.Timestamp('2020-05-01'), category='a', value=1),
b=dict(date=pd.Timestamp('2020-06-01'), category='a', value=2),
c=dict(date=pd.Timestamp('2020-06-01'), category='b', value=6),
d=dict(date=pd.Timestamp('2020-07-01'), category='a', value=1),
e=dict(date=pd.Timestamp('2020-07-27'), category='a', value=3),
)).T
You can set the index to the date column, which turns it into a pd.DatetimeIndex. Then you can use pd.Grouper along with other columns; the following example uses the category column.
The freq='M' parameter groups the index at month-end frequency. There are a number of offset string aliases (e.g. 'D', 'W', 'M', 'Q') that can be passed to pd.Grouper.
df.set_index('date').groupby([pd.Grouper(freq='M'), 'category'])['value'].sum()
Result:
date category
2020-05-31 a 1
2020-06-30 a 2
b 6
2020-07-31 a 4
Name: value, dtype: int64
Another example with your mcve:
df.set_index('g').groupby([pd.Grouper(freq='M'), 'c']).d.sum()
Result:
g c
2011-01-31 panda 0
2011-04-30 shark 2
2011-06-30 panda 2
2011-07-31 panda 0
2011-09-30 panda 1
2011-12-31 python 1
Name: d, dtype: int32

Understanding resampling of datetime in pandas

I have a question regarding resampling of DataFrames.
import pandas as pd
df = pd.DataFrame([['2005-01-20', 10], ['2005-01-21', 20],
['2005-01-27', 40], ['2005-01-28', 50]],
columns=['date', 'num'])
# Convert the column to datetime
df['date'] = pd.to_datetime(df['date'])
# Resample and aggregate results by week
df = df.resample('W', on='date')['num'].sum().reset_index()
print(df.head())
# OUTPUT:
# date num
# 0 2005-01-23 30
# 1 2005-01-30 90
Everything works as expected, but I would like to better understand what exactly resample(), ['num'] and sum() do here.
QUESTION #1
Why the following happens:
The result of df.resample('W', on='date') is DatetimeIndexResampler.
The result of df.resample('W', on='date')['num'] is pandas.core.groupby.SeriesGroupBy.
The result of df.resample('W', on='date')['num'].sum() is
date
2005-01-23 30
2005-01-30 90
Freq: W-SUN, Name: num, dtype: int64
QUESTION #2
Is there a way to produce the same results without resampling? For example, using groupby.
Answer 1
As the docs say, .resample returns a Resampler object; you get a DatetimeIndexResampler because date holds datetime values.
Selecting ['num'] then gives you pandas.core.groupby.SeriesGroupBy, because you are picking a single Series out of the dataframe through the Resampler object.
Incidentally,
df.groupby([pd.Grouper(key='date', freq='W-SUN')])['num']
would return pandas.core.groupby.SeriesGroupBy as well.
Finally, when you call .sum(), you get the sum within each group. The result is a Series because the aggregation runs over a pandas.core.series.Series.
Answer 2
You can achieve the same result using groupby with the help of Grouper, as follows:
df.groupby([pd.Grouper(key='date', freq='W-SUN')])['num'].sum()
Output:
date
2005-01-23 30
2005-01-30 90
Name: num, dtype: int64
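As a quick sanity check (a sketch, assuming df is still the original frame with the datetime 'date' column, i.e. before the reassignment above), the two approaches agree:
r1 = df.resample('W', on='date')['num'].sum()
r2 = df.groupby(pd.Grouper(key='date', freq='W-SUN'))['num'].sum()
assert r1.equals(r2)  # same weekly bins, same sums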