Pandas grouping values and getting most recent date

I have a large csv file that I read into pandas, giving me a DataFrame with "Community_Name" and "Date" columns: about 186k lines, with about 120 unique community names and a range of dates. I would like to group the data by community and find the most recent date for each one in the file. I will use this later on to pull data from each community up to that most recent date.
I am struggling to get the most recent date value for each community. I thought .max() would work, but it returns the greatest value overall rather than the greatest for each community...
import csv
import datetime
import pandas as pd

dates_list = []
with open('communitydates.csv', 'r', newline='', encoding='utf-8') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for line in csv_reader:
        date = line['Date'] + " " + line['Year']
        date = datetime.datetime.strptime(date, '%B %d %Y').strftime('%Y %d %m')
        community_name = line['Community']
        entry = community_name, date
        dates_list.append(entry)

df = pd.DataFrame(dates_list)
df.columns = ["Community", "Date"]
df["Date"] = pd.to_datetime(df["Date"], format='%Y %d %m').max()  # assigns one overall value to every row
grouped_by_community = df.groupby("Community")
recent_date_by_community = grouped_by_community.first()
Ideally I want to convert the DataFrame into a dictionary or list to do the check later on.
max_dates = recent_date_by_community.to_dict('index')
for k in max_dates:
    print(k, max_dates[k]['Date'])
Which currently gives me this... but the date is the same for all 102 communities instead of the actual date in the file:
Addison 2019-10-09 00:00:00
I assume I am using the .max() statement incorrectly, but I have not been able to figure out how to change it.
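A minimal sketch of the per-group approach (assuming the df built above): convert the column without calling .max() on it, then take the max inside the groupby so each community keeps its own latest date; .to_dict() then gives the lookup structure described.
# Convert once, without collapsing the column to a single value.
df["Date"] = pd.to_datetime(df["Date"], format='%Y %d %m')

# One most-recent date per community, as a {community: date} dictionary.
max_dates = df.groupby("Community")["Date"].max().to_dict()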

Related

setting pandas series row values to multiple column values

I have one DataFrame object, df, holding data read from an Excel sheet, to which I have added certain date columns. This df also contains stock tickers from Yahoo Finance. I now try to get two months of price history for these tickers from Yahoo Finance (which will be 60 rows) and assign those values to the columns for the relevant dates in df, but I am not able to do so.
In the last line of the code below, I am trying to set the "Volume" values, which arrive in different rows, into the column values for the respective dates in df, but it does not work. Need help. Thanks.
import pandas as pd
import yfinance as yf
from datetime import date

df = pd.read_excel(r"D:\Volume Trading\python\excel"
                   r"\Nifty-sector-cap.xlsx")
start_date = date(2022, 3, 1)  # Date YYYY MM DD
end_date = date(2022, 4, 25)

# Downloading this series just to get the dates which will become columns of df.
temp_data = yf.download("HDFCBANK.NS", start_date, end_date, interval='1d')["Adj Close"]
temp_data.index = temp_data.index.date

# Setting the dates as column headers in df.
df = df.reindex(columns=df.columns.tolist() + temp_data.index.tolist())

# Putting the volume for each ticker on each date in df.
for i in range(len(df)):
    temp_vol = yf.download(df["Yahoo_Symbol"].iloc[i], start_date, end_date, interval="1d")["Volume"]
    temp_vol.index = temp_vol.index.date
    df[temp_vol.index.tolist()].iloc[i] = temp_vol.transpose()  # this assignment does not stick
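A minimal sketch of one likely remedy (an assumption about the failure, not a confirmed fix): df[cols].iloc[i] = ... is chained indexing, so the assignment lands on a temporary copy; writing through a single .loc call targets df itself.
# Assign the row and columns in one indexing step so df is modified in place.
date_cols = temp_vol.index.tolist()
df.loc[df.index[i], date_cols] = temp_vol.values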

Changing Excel Dates (As integers) mixed with timestamps in single column - Have tried str.extract

I have a dataframe with a column of dates; unfortunately my import (using read_excel) brought some dates in as datetimes and others as Excel serial integers.
What I am seeking is a column with dates only, in the format %Y-%m-%d.
From research, Excel's serial dates start at 1900-01-00, so I could add these integers as day offsets. I have tried str.extract with a regex in order to separate the column into two, one of datetimes, the other of integers. However the result is NaN.
Here is an input code example
df = pd.DataFrame({'date_from': [pd.Timestamp('2022-09-10 00:00:00'),44476, pd.Timestamp('2021-02-16 00:00:00')], 'date_to': [pd.Timestamp('2022-12-11 00:00:00'),44455, pd.Timestamp('2021-12-16 00:00:00')]})
An attempt to first separate the columns by extracting the integers (the dates imported from MS Excel):
df.date_from.str.extract(r'(\d\d\d\d\d)')
however this gives NaN, since the .str accessor only operates on actual strings and these values are Timestamps and integers.
The reason I have tried to separate the integers out of the column is that I get an error when trying to act on the Excel dates within the mixed column, in other words an error using the following code:
def convert_excel_time(excel_time):
    return pd.to_datetime('1900-01-01') + pd.to_timedelta(excel_time, 'D')
Any guidance on how I might get a column of dates only? I find the datetime modules and aspects of pandas and python the most frustrating of all to get to grips with!
thanks
You can convert the values to timedeltas with to_timedelta and errors='coerce' (giving NaT where the value is not an integer), add the Excel origin Timestamp called d (1899-12-30, which accounts for Excel's day zero and its phantom 1900 leap day), then convert the original values to datetimes with errors='coerce', and finally combine the two with Series.fillna in a custom function:
def f(x):
    # https://stackoverflow.com/a/9574948/2901002
    d = pd.Timestamp(1899, 12, 30)
    timedeltas = pd.to_timedelta(x, unit='d', errors='coerce')
    dates = pd.to_datetime(x, errors='coerce')
    return (timedeltas + d).fillna(dates)

cols = ['date_from', 'date_to']
df[cols] = df[cols].apply(f)
print(df)
   date_from    date_to
0 2022-09-10 2022-12-11
1 2021-10-07 2021-09-16
2 2021-02-16 2021-12-16

Pandas: How do I convert a column of time duration in minutes:seconds into datetime when some of minutes are greater than 60?

I have a dataframe column that contains the durations of videos in minutes:seconds. Unfortunately some of the rows are formatted incorrectly, with minutes greater than 60 (e.g. 94:36). When I try to run pd.to_datetime with the format string %M:%S, it raises an error saying that the aforementioned time is incorrectly formatted.
How do I fix this so that the times are correct for all rows, e.g. converting the extra minutes into hours (94:36 -> 1:34:36)?
Here is one way of going about it. It currently writes into a new column, but you can make it overwrite the original by changing 'new time' to just 'time'.
import pandas as pd

data = {'time': ['15:48', '84:52', '77:10', '10:03']}
df = pd.DataFrame(data, columns=['time'])

# Split into integer minutes and seconds.
mins = df['time'].str.split(':').str[0].astype(int)
secs = df['time'].str.split(':').str[1].astype(int)

# Carry the overflow minutes into hours.
hrs = mins // 60
mins = mins % 60

# Reassemble as H:MM:SS, zero-padding so 10:03 becomes 0:10:03 rather than 0:10:3.
df['new time'] = (hrs.astype(str) + ':' + mins.astype(str).str.zfill(2)
                  + ':' + secs.astype(str).str.zfill(2))
print(df)
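An alternative sketch, assuming these values are durations rather than clock times: pd.to_timedelta tolerates minute fields above 59, so prefixing a zero hours field avoids the manual carry entirely.
import pandas as pd

df = pd.DataFrame({'time': ['15:48', '84:52', '77:10', '10:03']})

# '0:' + '84:52' parses as 0 hours, 84 minutes, 52 seconds,
# which to_timedelta normalizes to 0 days 01:24:52.
df['duration'] = pd.to_timedelta('0:' + df['time'])
print(df)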

Make new subsets for each month

I am trying to make subsets of each month or quarter from my data frame. I have already tried some StackOverflow suggestions with the datetime package, but without success. My data frame is a pandas data frame whose date column consists of Timestamp objects. If someone has a suggestion that could work, that would be lovely.
One of the best options from StackOverflow I already tried is:
from datetime import datetime

date_format = "%Y-%m-%d %H:%M:%S"
df['datetime'] = [datetime.strptime(dt, date_format) for dt in df['date']]
df['quarter'] = [dt.quarter for dt in df['datetime']]
dfQ1 = df[df.quarter == 1]
# the same for each quarter
I made timestamps of my data by the use of the following code:
time_stamps = []
for i in data['event_timestamp']:
    time_stamps.append(datetime.datetime.strptime(i, '%Y-%m-%d %H:%M:%S'))
You can find a picture of the head of the dataframe in the following link:
head of the data frame
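Since the column already holds Timestamps, a minimal sketch of one direct route (assuming the date column is named 'date'; the frame below is a hypothetical stand-in for the pictured one): the .dt accessor exposes quarter and month directly, so no strptime pass is needed.
import pandas as pd

# Hypothetical stand-in for the real data frame.
df = pd.DataFrame({'date': pd.date_range('2022-01-01', periods=6, freq='MS'),
                   'value': range(6)})

# Subset one quarter.
df['quarter'] = df['date'].dt.quarter
dfQ1 = df[df['quarter'] == 1]

# Or build one subset per month in a single pass.
monthly = {month: subset for month, subset in df.groupby(df['date'].dt.to_period('M'))}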

ArcPy & Python - Get Latest TWO dates, grouped by Value

I've looked around for the last week for an answer but only found partial answers. Being new to Python, I could really use some assistance.
I have two fields in a table, [number] and [date]. The date format is date and time, e.g. 07/09/2018 3:30:30 PM. The [number] field is just an integer, and several rows may share the same number.
I have tried a few options to get at the LATEST date, and I can do that using pandas:
myarray = arcpy.da.FeatureClassToNumPyArray (fc, ['number', 'date'])
mydf = pd.DataFrame(myarray)
date_index = mydf.groupby(['number'])['date'].transform(max)==mydf['date']
However, I need the latest TWO dates. I've moved on to trying an IF statement, because I feel arcpy.da.UpdateCursor is better suited to look through the records and update another field, grouping by NUMBER and returning the rows with the latest TWO dates.
The end result, grouped by number with the latest two dates for each, would look like this (as an example):
Number  Date
1       7/29/2018 4:30:44 PM
1       7/30/2018 5:55:34 PM
2       8/2/2018  5:45:23 PM
2       8/3/2018  6:34:32 PM
Try this.
import pandas as pd
import numpy as np
# Some data.
data = pd.DataFrame({'number': np.random.randint(3, size = 15), 'date': pd.date_range('2018-01-01', '2018-01-15')})
# Look at the data.
data
Which gives some sample data like this:
So in our output we'd expect to see number 0 with the 5th and the 9th, number 1 with the 14th and the 15th, and number 2 with the 6th and the 12th.
Then we group by number, grab the last two rows, and set and sort the index.
# Group and label the index.
last_2 = data.groupby('number').tail(2).set_index('number').sort_index()
last_2
Which gives us what we expect.
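A caveat on this approach: tail(2) takes the last two rows per group by position, so it matches the latest two dates here only because date_range yields data already sorted by date. A minimal sketch that holds for arbitrary row order:
# Sort by date so tail(2) really returns the two latest dates per number.
last_2 = (data.sort_values('date')
              .groupby('number').tail(2)
              .set_index('number').sort_index())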