My dataframe is presented above. The dtypes are
weekday int64
date datetime64[ns]
time object
customers int64
dtype: object
I'd like the customers column to hold the count of customers who arrived in the past 2 hours (based on the date column). However, using the pandas rolling functionality, I can only write
df['customers'] = df['date'].rolling(2).count()
This only counts the previous two rows, completely disregarding the datetime values. I'd like to write
df['customers'] = df['date'].rolling('2H').count() #desired: 2H
to get the correct result. However, I'm getting ValueError: window must be an integer. Reading the pandas rolling documentation (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html), a datetime column should be able to take a time-based rolling window. I'm completely clueless as to why my datetime column cannot use this functionality.
A time-based window such as '2H' needs a sorted DatetimeIndex, so create one first:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').sort_index()
df['customers'] = df['customers'].rolling('2H').count()
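Alternatively, a sketch that keeps the original index by pointing rolling at the date column through its on= parameter (the column still has to be sorted):
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date')
# 'on' tells rolling to use the datetime column instead of the index
df['customers'] = df.rolling('2H', on='date')['customers'].count()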
I have two sets of dates, startdate and enddate, inside a dataframe.
For each startdate, I want to find the smallest enddate that's greater than the startdate.
My minimal example code is below, but it is very slow: it takes about 20 seconds per run. Note that in my example the date ranges are the same, so a "shift" would be possible here, but not in my real data.
Is there any way to speed up the code?
import pandas as pd
from datetime import datetime  # needed for the default= fallback below

dates = pd.DataFrame({'startdate': pd.date_range(start='2000-11-03', end='2021-10-01'),
                      'enddate': pd.date_range(start='2000-11-03', end='2021-10-01')})
dates['mindate_after_startdate'] = dates['startdate'].apply(
    lambda x: min(dates['enddate'][dates['enddate'] > x], default=datetime.today().date()))
Figured it out using pd.merge_asof with the direction='forward' argument.
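A sketch of that approach, using the column names from the example above (allow_exact_matches=False makes the match strictly greater than startdate):
import pandas as pd

dates = pd.DataFrame({'startdate': pd.date_range(start='2000-11-03', end='2021-10-01'),
                      'enddate': pd.date_range(start='2000-11-03', end='2021-10-01')})

# Both sides must be sorted on their merge keys for merge_asof.
result = pd.merge_asof(
    dates.sort_values('startdate'),
    dates[['enddate']].sort_values('enddate'),
    left_on='startdate',
    right_on='enddate',
    direction='forward',        # nearest enddate at or after each startdate
    allow_exact_matches=False,  # require enddate strictly greater than startdate
    suffixes=('', '_min_after'))
Rows with no later enddate come back as NaT here, rather than the datetime.today() fallback used in the apply version.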
So I have a table with a date column (example below)
Date1
2020-11-08
2020-12-03
2020-11-21
I am trying to calculate the difference between the column and a specific date with the following code:
df['diff'] = pd.to_datetime('2020-12-31') - df['DELDATED']
I wanted to get the difference as a number of days; however, I obtained the following:
diff
454579200000000000
2419200000000000
3456000000000000
Why am I getting this and how can I get what I anticipate?
Try Series.dt.days:
df['diff'] = (pd.to_datetime('2020-12-31') - df['DELDATED']).dt.days
Series.rsub works the same way, subtracting from the right side, but it is less clear in my opinion:
df['diff'] = df['DELDATED'].rsub(pd.to_datetime('2020-12-31')).dt.days
I can inject a timestamp into a dataframe column, but I want the timestamp column to hold unique values (or at least values that increase, even by a millisecond). What I currently have:
from datetime import datetime
from pyspark.sql.functions import lit

df = spark.createDataFrame(["10", "11", "13"], "string").toDF("age")
# lit(datetime.now()) is evaluated once on the driver, so every row gets the same timestamp
df = df.withColumn("ts", lit(datetime.now()))
display(df)
You cannot get a per-row timestamp that is unique across the DataFrame based on when Spark processes each row: the data is distributed, so you never have control over when a given row is processed. That being said:
If you want the current timestamp to be added as a column, you’ll get better mileage if you use pyspark.sql.functions.current_timestamp.
If you want a column that provides increasing indices, use pyspark.sql.functions.monotonically_increasing_id().
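A minimal sketch combining both, assuming a SparkSession named spark as in the question:
from pyspark.sql import functions as F

df = spark.createDataFrame(["10", "11", "13"], "string").toDF("age")
# current_timestamp() stamps rows with the query's start time (one value per query),
# while monotonically_increasing_id() gives each row a unique, increasing 64-bit id.
df = (df.withColumn("ts", F.current_timestamp())
        .withColumn("row_id", F.monotonically_increasing_id()))
display(df)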
I am adding a column to a dataframe that calculates the number of days between consecutive dates for each customer with the following formula, but I end up with an out-of-memory error:
lapsed['Days']=lapsed[['Customer Number','GL Date']].groupby(['Customer Number']).diff()
The dataframe contains more than 1 million records.
Customer Number is an int64, and I was thinking of running the above statement within ranges of numbers, but I do not know if this is the best approach.
Any suggestions?
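For reference, a sketch of the same per-customer difference that selects only the date column before diffing and converts the result to whole days (assuming GL Date is already a datetime column); whether it helps with memory depends on the data:
# diff within each customer, then convert the timedeltas to whole days
lapsed['Days'] = lapsed.groupby('Customer Number')['GL Date'].diff().dt.days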
I have a db table containing a datetime column with values stretching over 24 hours. If I use the pandas dataframe groupby function to get a minute-by-minute aggregation, it throws everything into 0-59 buckets regardless of which hour they were in.
How do I get minute-by-minute aggregations spread over the timeframe of the table, in this case 24 hours? Also, for those minutes in which there are no values in the table, how do I insert a zero count for that minute into the dataframe?
Try using pd.Grouper with a frequency (pd.TimeGrouper did the same thing but has been removed from newer pandas versions):
import pandas as pd
df = pd.DataFrame(index=pd.date_range("11:00", "21:30", freq="100ms"))
df['x'] = 1
# group on the DatetimeIndex in 1-second bins, like resample
g = df.groupby(pd.Grouper(freq='S')).sum()
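For the minute-by-minute case in the question, a resample sketch on the same frame (assuming the datetime values are the index, as above); minutes with no rows come back with a count of 0:
per_minute = df['x'].resample('T').count()  # 'T' = minute frequency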