Make new subsets for each month - pandas

I am trying to split my data frame into one subset per month or per quarter. I have already tried several Stack Overflow suggestions that use the datetime package, without success. My data frame is a pandas DataFrame whose date column holds Timestamp objects. Any suggestion that works would be much appreciated.
The most promising Stack Overflow suggestion I tried is:
date_format = "%Y-%m-%d %H:%M:%S"
df['datetime'] = [datetime.strptime(dt, date_format) for dt in df['date']]
df['quarter'] = [dt.quarter for dt in df['datetime']]
dfQ1 = df[df.quarter == 1]
# for each quarter the same
I created the timestamps with the following code:
time_stamps = []
for i in data['event_timestamp']:
    time_stamps.append(datetime.datetime.strptime(i, '%Y-%m-%d %H:%M:%S'))
You can find a picture of the head of the dataframe in the following link:
head of the data frame
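One way this could be approached, as a minimal sketch: let pandas parse the strings with pd.to_datetime and then group on the .dt accessor instead of looping. The sample rows and the 'value' column below are made up for illustration; only the 'event_timestamp' column name and the date format come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column name 'event_timestamp' is from the question.
data = pd.DataFrame({
    "event_timestamp": ["2021-01-15 08:00:00", "2021-02-03 12:30:00",
                        "2021-04-20 17:45:00", "2021-04-21 09:10:00"],
    "value": [1, 2, 3, 4],
})

# Vectorised parsing replaces the strptime loop.
data["datetime"] = pd.to_datetime(data["event_timestamp"], format="%Y-%m-%d %H:%M:%S")

# One sub-frame per quarter number (1-4), keyed by the quarter.
by_quarter = {int(q): g for q, g in data.groupby(data["datetime"].dt.quarter)}

# One sub-frame per calendar month, keyed by strings such as '2021-01'.
by_month = {str(p): g for p, g in data.groupby(data["datetime"].dt.to_period("M"))}
```

by_quarter[1] then plays the role of dfQ1 in the attempt above. Note that grouping on .dt.quarter alone merges the same quarter from different years; grouping on .dt.to_period("Q") would keep them apart.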

Related

How to categorize a range of hours in Pandas?

In my project I am trying to create a new column that categorizes records by hour of day. My dataframe has a column called 'TowedTime' containing times, and I want another column holding the full hour without minutes. For example, if the value in 'TowedTime' is 09:32:10 it should be categorized as 9 AM, if it is 12:45:10 it should be categorized as 12 PM, and so on for all other values. I've read about pd.cut and bins, but I can't get the result I want.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
df = pd.read_excel("Baltimore Towing Division.xlsx",sheet_name="TowingData")
df['Month'] = pd.DatetimeIndex(df['TowedDate']).strftime("%b")
df['Week day'] = pd.DatetimeIndex(df['TowedDate']).strftime("%a")
monthOrder = ['Jan', 'Feb', 'Mar', 'Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
dayOrder = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
pivotHours = pd.pivot_table(df, values='TowedDate', index='TowedTime',
                            columns='Week day',
                            fill_value=0,
                            aggfunc='count',
                            margins=False, margins_name='Total').reindex(dayOrder, axis=1)
print(pivotHours)
First, make sure the type of the column 'TowedTime' is datetime. Second, you can easily extract the hour from this data type.
df['TowedTime'] = pd.to_datetime(df['TowedTime'],format='%H:%M:%S')
df['hour'] = df['TowedTime'].dt.hour
Hope this answers your question.
With the help of @Fabien C I was able to solve the problem.
First, I checked the data type of the 'TowedTime' column with the dtypes attribute and found that it was object.
I then converted 'TowedTime' to datetime and kept only the time part:
df['TowedTime'] = pd.to_datetime(df['TowedTime'],format='%H:%M:%S').dt.time
Then I created a new column in the dataframe holding only the hour:
df['Hour'] = pd.to_datetime(df['TowedTime'],format='%H:%M:%S').dt.hour
And the result was this:
In the image, the 'TowedTime' column is still of type object, but the new 'Hour' column correctly holds the hour value.
Originally, the dataset already had the date and time in separate columns. I think some method was used in Excel to separate them, which left the time ('TowedTime') stored as an object. I could not convert it, or at least that is what dtypes shows me.
I tried all of these pandas methods for converting the object column to datetime:
df['TowedTime'] = pd.to_datetime(df['TowedTime'])
df['TowedTime'] = df['TowedTime'].astype('datetime64[ns]')
df['TowedTime'] = pd.to_datetime(df['TowedTime'], format='%H:%M:%S')
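A likely explanation for the object dtype: .dt.time returns plain datetime.time objects, and pandas has no dedicated dtype for those, so the column is reported as object even though the values are valid times. The sketch below, on made-up sample values, shows that behaviour and also builds the "9 AM" / "12 PM" labels the question asked for. The helper hour_label and the 'HourLabel' column are names invented here for illustration.

```python
import pandas as pd

# Hypothetical sample values in the same HH:MM:SS layout as 'TowedTime'.
df = pd.DataFrame({"TowedTime": ["09:32:10", "12:45:10", "00:05:00", "17:00:59"]})

# astype(str) also copes with a column that already holds datetime.time objects.
parsed = pd.to_datetime(df["TowedTime"].astype(str), format="%H:%M:%S")

# Keeping only the time part yields datetime.time objects, stored as dtype object.
as_time = parsed.dt.time

df["Hour"] = parsed.dt.hour


def hour_label(h: int) -> str:
    """Turn an hour 0-23 into a 12-hour label such as '9 AM' or '12 PM'."""
    return f"{h % 12 or 12} {'AM' if h < 12 else 'PM'}"


df["HourLabel"] = df["Hour"].map(lambda h: hour_label(int(h)))
```

The label is built arithmetically rather than with strftime("%I %p") so that it does not depend on the locale.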

Pyspark: how to fix 'could not parse datatype: interval' error

I'm trying to add a new column to a PySpark dataframe by subtracting the values of two existing columns.
I already had a date_of_birth column available, so I inserted a current_date column with the following code:
import datetime
from pyspark.sql.functions import lit
currentdate = "14-12-2021"
day, month, year = currentdate.split('-')
today = datetime.date(int(year), int(month), int(day))
df = df.withColumn("current_date", lit(today))
Displaying my df confirms that it worked. Looks a little something like this:
id  date_of_birth  current_date
01  1995-01-01     2021-12-14
02  1987-02-16     2021-12-14
I inserted the age column by subtracting date_of_birth from current_date:
df = df.withColumn('age', (df['current_date'] - df['date_of_birth']))
Cell runs without a problem.
Here's where I'm stuck:
Once I try to display my dataframe again in order to verify that everything went smoothly, the following error occurs:
'could not parse datatype: interval'
I used df.dtypes to check what's happening, and apparently my newly inserted age column is of interval type.
How can I fix this?
Is there a way to display the age in years (int) in this particular scenario?
PS: both the date_of_birth and current_date cols have a date type.
Solved it. Mike's comment helped tons. Thank you!
Here's how I solved it:
from pyspark.sql import functions as f
from pyspark.sql.functions import lit
# insert new column current_date with dummy data (in this case, 1s)
df = df.withColumn("current_date", lit(1))
# overwrite it with the current_date() function
df = df.withColumn("current_date", f.current_date())
# insert new column age with dummy data (in this case, 1s)
df = df.withColumn("age", lit(1))
# overwrite it with months_between(), divided by 12 to obtain years
df = df.withColumn("age", f.months_between(df.current_date, df.date_of_birth) / 12)
# round and cast to integer to get rid of decimals
df = df.withColumn("age", f.round(df["age"]).cast('integer'))
I would use one of the PySpark functions for calculating the difference between dates:
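One thing to watch in the last step: rounding goes to the nearest year, so someone who is 26 years and 11 months old would be shown as 27. If age in completed years is wanted, swapping f.round for f.floor should do it, though I have not run that against this dataframe. For comparison, here is a plain-Python sketch (not PySpark) of the completed-years rule, using the two birth dates and the current date from the question. The function name completed_years is made up for illustration.

```python
from datetime import date


def completed_years(born: date, today: date) -> int:
    """Age in completed years: one less if this year's birthday has not happened yet."""
    before_birthday = (today.month, today.day) < (born.month, born.day)
    return today.year - born.year - int(before_birthday)


today = date(2021, 12, 14)  # the current date used in the question
ages = [
    completed_years(date(1995, 1, 1), today),
    completed_years(date(1987, 2, 16), today),
]
print(ages)
```

Both sample birthdays have already passed by 14 December, so the completed-years ages are 26 and 34.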
pyspark.sql.functions.datediff
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.datediff.html
pyspark.sql.functions.months_between
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.months_between.html

How do I find the intersection of row and column in Pandas or Openpyxl

I am trying to create an attendance sheet from the following data in Excel.
My plan is to write Python code using tkinter, the datetime module and pandas.
The datetime module will give me today's date.
Using a tkinter Entry widget and today's date, I want to find the cell where the name's row and the date's column intersect.
For example:
For A and 11-01-2021 the intersection would be cell B2, and recording the attendance would write "Present" into that cell.
I am finding it really difficult to work with dates here.
Kindly help.
Read the file into a DataFrame with the names as the index, then select by name and date to set a single cell. This only works if the column names are datetimes:
df = pd.read_excel(file, index_col=['Name'])
#if necessary
#df.columns = pd.to_datetime(df.columns, dayfirst=True)
name = 'A'
today = pd.to_datetime('today').floor('D')
df.loc[name, today] = "Present"
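To try the idea without an Excel file, here is a small self-contained sketch. The in-memory sheet, its two names and two dates are made up for illustration, and a fixed date stands in for pd.to_datetime('today').floor('D') so the result is reproducible.

```python
import pandas as pd

# Hypothetical attendance sheet: names as the index, dates as column labels.
dates = pd.to_datetime(["11-01-2021", "12-01-2021"], format="%d-%m-%Y")
df = pd.DataFrame(index=pd.Index(["A", "B"], name="Name"), columns=dates, dtype=object)

name = "A"                        # would come from the tkinter Entry widget
day = pd.Timestamp("2021-01-11")  # stand-in for pd.to_datetime('today').floor('D')

# Row label and column label together address exactly one cell.
df.loc[name, day] = "Present"
```

Writing the frame back with df.to_excel would then persist the change; the other cells stay empty (NaN) until they are filled in the same way.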

Plotly Bar Chart Based on Pandas dataframe grouped by year

I have a pandas dataframe that I've tried to group by year on 'Close Date' and then plot 'ARR (USD)' on the y-axis against the year on the x-axis.
All seems fine after grouping:
sumyr = brandarr.groupby(brandarr['Close Date'].dt.year,as_index=True).sum()
              ARR (USD)
Close Date
2017        17121174.33
2018        15383130.32
But when I try to plot:
trace = [go.Bar(
    x=sumyr['Close Date'],
    y=sumyr['ARR (USD)']
)]
I get the error: KeyError: 'Close Date'
I'm sure it's something stupid, I'm a newbie, but I've been messing with it for an hour and well, here I am. Thanks!
In your groupby call you used as_index=True, so Close Date is now the index rather than a column, which is why sumyr['Close Date'] raises a KeyError. To select rows through the index, use pandas .loc or .iloc.
To have access to the index values directly, use:
sumyr.index.tolist()
Check here: Pandas - how to get the data frame index as an array
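A minimal sketch of the fix on made-up numbers: the column names 'Close Date' and 'ARR (USD)' are from the question, while the three sample rows are hypothetical. Only pandas is used here, and the resulting lists are what would be handed to go.Bar.

```python
import pandas as pd

# Hypothetical sample data with the question's column names.
brandarr = pd.DataFrame({
    "Close Date": pd.to_datetime(["2017-03-01", "2017-09-15", "2018-02-20"]),
    "ARR (USD)": [100.0, 250.0, 400.0],
})

# Group on the year; the years end up in the index of the result.
sumyr = brandarr.groupby(brandarr["Close Date"].dt.year)["ARR (USD)"].sum()

x = sumyr.index.tolist()  # the years, taken from the index
y = sumyr.tolist()        # the summed ARR per year

# These can then be passed to Plotly, e.g. go.Bar(x=x, y=y).
```

Selecting the 'ARR (USD)' column before calling sum() also avoids trying to sum the datetime column itself.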

how to group pandas timestamps plot several plots in one figure and stack them together in matplotlib?

I have a data frame with well-organised timestamps, like below:
It's a web log, and the timestamps span the whole year. I want to split them by day, count the visits within each hour, and plot every day in the same figure, stacked together, as in the picture below:
I can split them into days and plot the visits of a single day, but I am having trouble plotting the days together and stacking them. The main tools I am using are pandas and Matplotlib.
Any advice or suggestions? Much appreciated!
Edited:
My Code is as below:
The timestamps are: https://gist.github.com/adamleo/04e4147cc6614820466f7bc05e088ac5
And the dataframe looks like this:
I plotted the timestamp density over the whole period using the code below:
timestamps_series_all = pd.DatetimeIndex(pd.Series(unique_visitors_df.time_stamp))
timestamps_series_all_toBePlotted = pd.Series(1, index=timestamps_series_all)
timestamps_series_all_toBePlotted.resample('D').sum().plot()
and got the result:
I plotted timestamps within one day using the code:
timestamps_series_oneDay = pd.DatetimeIndex(pd.Series(unique_visitors_df.time_stamp.loc[unique_visitors_df["date"] == "2014-08-01"]))
timestamps_series_oneDay_toBePlotted = pd.Series(1, index=timestamps_series_oneDay)
timestamps_series_oneDay_toBePlotted.resample('H').sum().plot()
and the result:
And now I am stuck.
I'd really appreciate all of your help!
I think you need pivot:
# load the timestamps from https://gist.github.com/adamleo/04e4147cc6614820466f7bc05e088ac5 into a list L
df = pd.DataFrame({'date':L})
print (df.head())
                  date
0  2014-08-01 00:05:46
1  2014-08-01 00:14:47
2  2014-08-01 00:16:05
3  2014-08-01 00:20:46
4  2014-08-01 00:23:22
#convert to datetime if necessary
df['date'] = pd.to_datetime(df['date'] )
#resample by Hours, get count and create df
df = df.resample('H', on='date').size().to_frame('count')
#extract date and hour
df['days'] = df.index.date
df['hours'] = df.index.hour
#pivot and plot
#maybe check parameter kind='density' from http://stackoverflow.com/a/33474410/2901002
#df.pivot(index='days', columns='hours', values='count').plot(rot=90)
#edit: last line changed to the one below (hours on the x-axis, one line per day):
df.pivot(index='hours', columns='days', values='count').plot(rot=90)
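To see what the reshaped table looks like before plotting, here is a small sketch on four made-up timestamps. It uses groupby with unstack instead of resample with pivot, which should give the same hours-by-days layout. The name table is invented here.

```python
import pandas as pd

# Hypothetical stand-in for the list L of log timestamps.
L = ["2014-08-01 00:05:46", "2014-08-01 00:14:47",
     "2014-08-01 01:20:46", "2014-08-02 00:23:22"]
df = pd.DataFrame({"date": pd.to_datetime(L)})

# Split each timestamp into its calendar day and its hour of day.
df["days"] = df["date"].dt.date
df["hours"] = df["date"].dt.hour

# Count visits per (hour, day), then spread the days into columns.
table = df.groupby(["hours", "days"]).size().unstack(fill_value=0)

# table.plot() would then draw one line per day over the hours of the day.
print(table)
```

Each column is one day and each row one hour, so plotting the table overlays all days on a shared hour axis.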