Date is not working even when date column is set to index - pandas

I have a dictionary of multiple dataframes where the index is set to 'Date', but I am having trouble capturing the specific day in a search.
Dictionary created as per link:
Call a report from a dictionary of dataframes
Then I tried to add the following column to hold the specific day for each row:
df_dict[k]['Day'] = pd.DatetimeIndex(df['Date']).day
It's not working. The idea is to extract only the day of the month (from 1 to 31) for each row, so that when I call the report it gives me the day of the month of that occurrence.
I can provide more details if needed.
Regards and thanks!

In the case of your code, there is no 'Date' column, because it's set as the index.
df_dict = {f.stem: pd.read_csv(f, parse_dates=['Date'], index_col='Date') for f in files}
To extract the day from the index, use the following code:
df_dict[k]['Day'] = df.index.day
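For illustration, here is a minimal, self-contained sketch with made-up prices (the 'Close' column name and the dates are just for the example), showing that .day can be read straight off a DatetimeIndex:
import pandas as pd

# hypothetical frame indexed by a DatetimeIndex
df = pd.DataFrame({'Close': [300.35, 296.24, 299.82]},
                  index=pd.to_datetime(['2020-04-01', '2020-04-02', '2020-04-20']))
df['Day'] = df.index.day
print(df['Day'].tolist())  # [1, 2, 20]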
Pulling the code from this question:
# here you can see the Date column is set as the index
df_dict = {f.stem: pd.read_csv(f, parse_dates=['Date'], index_col='Date') for f in files}

data_dict = dict()  # create an empty dict here
for k, df in df_dict.items():
    df_dict[k]['Return %'] = df.iloc[:, 0].pct_change(-1)*100
    # create a day column; this may not be needed
    df_dict[k]['Day'] = df.index.day
    # aggregate the max and min of Return %
    mm = df_dict[k]['Return %'].agg(['max', 'min'])
    # get the day of the month on which the max and min returns occurred
    date_max = df.Day[df['Return %'] == mm.max()].values[0]
    date_min = df.Day[df['Return %'] == mm.min()].values[0]
    # add it to the dict, with ticker as the key
    data_dict[k] = {'max': mm.max(), 'min': mm.min(), 'max_day': date_max, 'min_day': date_min}

# print(data_dict)
[out]:
{'aapl': {'max': 8.702843218147871,
          'max_day': 2,
          'min': -4.900700398891522,
          'min_day': 20},
 'msft': {'max': 6.603769278967109,
          'max_day': 2,
          'min': -4.084428935702855,
          'min_day': 8}}

Related

sort dataframe by string and set a new id

Is there a possibility to adjust the strings according to their order, for example 1.wav, 2.wav, 3.wav, etc., and the ID accordingly (ID: 1, 2, 3, etc.)?
I have already tried several sorting options. Do any of you have any ideas?
Thank you in advance
(screenshot of the dataframe output)
def createSampleDF(audioPath):
    data = []
    for file in Path(audioPath).glob('*.wav'):
        print(file)
        data.append([os.path.basename(file), file])
    df_dataSet = pd.DataFrame(data, columns=['audio_name', 'filePath'])
    df_dataSet['ID'] = df_dataSet.index + 1
    df_dataSet = df_dataSet[['ID', 'audio_name', 'filePath']]
    df_dataSet.sort_values(by=['audio_name'], inplace=True)
    return df_dataSet

def createSamples(myAudioPath, savePath, sampleLength, overlap=0):
    cutSamples(myAudioPath=myAudioPath, savePath=savePath, sampleLength=sampleLength)
    df_dataSet = createSampleDF(audioPath=savePath)
    return df_dataSet
You can split the string, convert it to an integer, and then sort on multiple columns. See pandas.DataFrame.sort_values for more info. If your file names are more complicated, you may need to design a regex to pull out the integers you want to sort on, using pandas.Series.str.extract.
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'audio_name': ['1.wav', '10.wav', '96.wav', '3.wav', '55.wav']})

(df
 .assign(audio_name=lambda df_: df_.audio_name.str.split('.', expand=True).iloc[:, 0].astype('int'))
 .sort_values(by=['audio_name', 'ID']))
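If the file names are messier than a plain number plus extension, here is a hedged sketch of the str.extract route mentioned above, continuing with the df defined just before (the regex r'(\d+)' assumes each name contains a single run of digits to sort on):
# pull the digits out of audio_name, sort on them, then renumber the ID column
sorted_df = (df
             .assign(file_num=df['audio_name'].str.extract(r'(\d+)', expand=False).astype(int))
             .sort_values(by='file_num')
             .drop(columns='file_num')
             .reset_index(drop=True))
sorted_df['ID'] = sorted_df.index + 1
print(sorted_df)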

Recursively update the dataframe

I have a dataframe called datafe in which I want to combine the hyphenated words.
For example, the input dataframe looks like this:
,author_ex
0,Marios
1,Christodoulou
2,Intro-
3,duction
4,Simone
5,Speziale
6,Exper-
7,iment
And the output dataframe should be like:
,author_ex
0,Marios
1,Christodoulou
2,Introduction
3,Simone
4,Speziale
5,Experiment
I have written sample code to achieve this, but I am not able to get out of the recursion safely.
def rm_actual(datafe, index):
    stem1 = datafe.iloc[index]['author_ex']
    stem2 = datafe.iloc[index + 1]['author_ex']
    fixed_token = stem1[:-1] + stem2
    datafe.drop(index=index + 1, inplace=True, axis=0)
    newdf = datafe.reset_index(drop=True)
    newdf.iloc[index]['author_ex'] = fixed_token
    return newdf

def remove_hyphens(datafe):
    for index, row in datafe.iterrows():
        flag = False
        token = row['author_ex']
        if token[-1:] == '-':
            datafe = rm_actual(datafe, index)
            flag = True
            break
    if flag == True:
        datafe = remove_hyphens(datafe)
    if flag == False:
        return datafe

datafe = remove_hyphens(datafe)
print(datafe)
Is there any possibilities I can get out of this recursion with expected output?
Another option:
Given/Input:
author_ex
0 Marios
1 Christodoulou
2 Intro-
3 duction
4 Simone
5 Speziale
6 Exper-
7 iment
Code:
import pandas as pd
# read/open file or create dataframe
df = pd.DataFrame({'author_ex':['Marios', 'Christodoulou', 'Intro-', \
'duction', 'Simone', 'Speziale', 'Exper-', 'iment']})
# check input format
print(df)
# create new column 'Ending': True if the previous row's 'author_ex' ends with '-'
df['Ending'] = df['author_ex'].shift(1).str.contains('-$', na=False, regex=True)
# remove the trailing '-' from the 'author_ex' column
df['author_ex'] = df['author_ex'].str.replace('-$', '', regex=True)
# create new column with values of 'author_ex' and shifted 'author_ex' concatenated together
df['author_ex_combined'] = df['author_ex'] + df.shift(-1)['author_ex']
# create a series true/false but shifted up
index = (df['Ending'] == True).shift(-1)
# set the last row to 'False' after it was shifted
index.iloc[-1] = False
# replace 'author_ex' with 'author_ex_combined' based on true/false of index series
df.loc[index,'author_ex'] = df['author_ex_combined']
# remove rows that have the 2nd part of the 'author_ex' string and are no longer required
df = df[~df.Ending]
# remove the extra columns
df.drop(['Ending', 'author_ex_combined'], axis = 1, inplace=True)
# output final dataframe
print('\n\n')
print(df)
# notice index 3 and 7 are missing
Outputs:
author_ex
0 Marios
1 Christodoulou
2 Introduction
4 Simone
5 Speziale
6 Experiment
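If a consecutive 0-to-5 index is wanted, matching the expected output in the question, a final reset can be appended to the code above:
# renumber the surviving rows 0..5
df = df.reset_index(drop=True)
print(df)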

Dataframe column filter from a list of tuples

I'm trying to create a function to filter a dataframe from a list of tuples. I've created the function below but it doesn't seem to be working.
Each tuple holds a dataframe column name, a min value, and a max value to filter on.
eg:
eg_tuple = [('colname1', 10, 20), ('colname2', 30, 40), ('colname3', 50, 60)]
My attempted function is below:
def col_cut(df, cutoffs):
    for c in cutoffs:
        df_filter = df[(df[c[0]] >= c[1]) & (df[c[0]] <= c[2])]
    return df_filter
Note that the function should not filter out rows where the value is equal to the max or min. Appreciate the help.
The problem is that each iteration takes the original df as the source to filter, so only the last cutoff is applied. You should filter with:
def col_cut(df, cutoffs):
    df_filter = df
    for col, mn, mx in cutoffs:
        dfcol = df_filter[col]
        df_filter = df_filter[(dfcol >= mn) & (dfcol <= mx)]
    return df_filter
Note that you can use .between(..) [pandas-doc] here:
def col_cut(df, cutoffs):
    df_filter = df
    for col, mn, mx in cutoffs:
        df_filter = df_filter[df_filter[col].between(mn, mx)]
    return df_filter
Or use np.logical_and.reduce over all the masks created by a list comprehension with Series.between:
def col_cut(df, cutoffs):
    mask = np.logical_and.reduce([df[col].between(min1, max1) for col, min1, max1 in cutoffs])
    return df[mask]
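A quick usage sketch for any of these variants (the data values below are made up just to exercise the function; numpy is only needed for the last version):
import numpy as np
import pandas as pd

df = pd.DataFrame({'colname1': [9, 10, 15, 21],
                   'colname2': [35, 30, 40, 31],
                   'colname3': [55, 50, 60, 61]})
eg_tuple = [('colname1', 10, 20), ('colname2', 30, 40), ('colname3', 50, 60)]
print(col_cut(df, eg_tuple))  # keeps rows 1 and 2; the bounds are inclusive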

SparkR. SQL. Count records satisfying criteria within rolling time window using timestamps

I have a dataset with a structure similar to the df you get from this:
dates<- base::seq.POSIXt(from=as.POSIXlt(as.Date("2018-01-01"), format="%Y-%m-%d"),
                         to=as.POSIXlt(as.Date("2018-01-03"), format="%Y-%m-%d"), by = "hour")
possible_statuses<- c('moving', 'stopped')
statuses4demo<- base::sample(possible_statuses, size=98, replace = TRUE, prob = c(.75, .25))
hours_back<- 5
hours_back_milliseconds<- hours_back*3600 * 1000
# Generate dataframe
df<- data.frame(date=rep(dates,2), user_id=c(rep("user_1", 49), rep("user_2", 49)), status=statuses4demo)
df$row_id<- seq(from=1,to=nrow(df), by=1)
df$eventTimestamp<- as.numeric(format(df$date, "%s"))*1000
df$hours_back_timestamp<- df$eventTimestamp - hours_back_milliseconds
df$num_stops_within_past_5_hours<- 0
I would like to get a dataframe with rolling counts of the number of observations with a status of "stopped" for each row. To do this in R, I just wrote a couple of nested loops, i.e., I ran this:
for(i in 1:length(unique(df$user_id))){
  the_user<- unique(df$user_id)[i]
  filtered_data<- df[which(df$user_id == the_user),]
  for(j in 1:nrow(filtered_data)){
    the_row_id<- filtered_data$row_id[j]
    the_time<- filtered_data$eventTimestamp[j]
    the_past_time<- filtered_data$hours_back_timestamp[j]
    num_stops_in_past_interval<- base::nrow(filtered_data[filtered_data$eventTimestamp >= the_past_time & filtered_data$eventTimestamp < the_time & filtered_data$status == "stopped",])
    df$num_stops_within_past_5_hours[which(df$row_id==the_row_id)]<- num_stops_in_past_interval
  }
}
View(df)
I am trying to do the same thing, either by using the built-in functions in SparkR or (I think more likely) an SQL statement. Does anyone know how I could reproduce the output of the df, but inside a Spark context? Any help is much appreciated. Thank you in advance. --Nate
Start with this data:
sdf<- SparkR::createDataFrame(df[, c("date", "eventTimestamp", "status", "user_id", "row_id")])
This solution works for the sample data as you have it set up, but isn't a general solution for observations with arbitrary timestamps.
ddf <- as.DataFrame(df)
ddf$count <- ifelse(ddf$status == "stopped", 1, 0)
# Create a windowSpec partitioning by user_id and ordered by date
ws <- orderBy(windowPartitionBy("user_id"), "date")
# Get the cumulative sum of the count variable by user id
ddf$count <- over(sum(ddf$count), ws)
# Get the lagged value of the cumulative sum from 5hrs ago
ddf$lag_count <- over(lag(ddf$count, offset = 5, default = 0), ws)
# The count of stops in the last 5hrs is the difference between the two
ddf$num_stops_within_past_5_hours <- ddf$count - ddf$lag_count
Edited to add a more general solution that can handle inconsistent time breaks
# Using a sampled version of the original df to create inconsistent time breaks
ddf <- as.DataFrame(df[base::sample(nrow(df), nrow(df) - 20), ])
ddf$count <- ifelse(ddf$status == "stopped", 1, 0)
to_join <- ddf %>% select("count", "eventTimestamp", "user_id") %>% rename(eventTimestamp_ = .$eventTimestamp, user_id_ = .$user_id)
ddf$count <- NULL
# join in each row where the event timestamp is within the interval
ddf_new <- join(ddf, to_join, ddf$hours_back_timestamp <= to_join$eventTimestamp_ & ddf$eventTimestamp >= to_join$eventTimestamp_ & ddf$user_id == to_join$user_id_, joinType = "left")
ddf_new <- ddf_new %>% groupBy(
    'date',
    'eventTimestamp',
    'user_id',
    'status',
    'row_id',
    'hours_back_timestamp') %>%
  agg(num_stops_within_past_5_hours = sum(ddf_new$count))

Pandas DataFrame expand existing dataset to finer timestamp

I am trying to make this piece of code faster; it is failing on conversion of ~120K rows to ~1.7m.
Essentially, I am trying to convert each date-stamped entry into 14 rows, representing each day from PayPeriodEndingDate back to T-14.
Does anyone have a better suggestion than itertuples to do this loop?
Thanks!!
df_Final = pd.DataFrame()
for row in merge4.itertuples():
    listX = []
    listX.append(row)
    df = pd.DataFrame(listX*14)
    df = df.reset_index().drop('Index', axis=1)
    df['Hours'] = df['Hours']/14
    df['AmountPaid'] = df['AmountPaid']/14
    df['PayPeriodEnding'] = np.arange(df.loc[:, 'PayPeriodEnding'][0] - np.timedelta64(14, 'D'), df.loc[:, 'PayPeriodEnding'][0], dtype='datetime64[D]')
    frames = [df_Final, df]
    df_Final = pd.concat(frames, axis=0)
df_Final
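One possible vectorized sketch using Index.repeat instead of itertuples; merge4 and the column names Hours, AmountPaid and PayPeriodEnding are taken from the snippet above, and a datetime64 dtype for PayPeriodEnding is assumed:
import numpy as np
import pandas as pd

# repeat every source row 14 times (one copy per day of the pay period)
df_Final = merge4.loc[merge4.index.repeat(14)].copy()
df_Final['Hours'] = df_Final['Hours'] / 14
df_Final['AmountPaid'] = df_Final['AmountPaid'] / 14
# day offsets -14 .. -1 within each repeated block, mirroring the np.arange above
offsets = np.tile(np.arange(-14, 0), len(merge4))
df_Final['PayPeriodEnding'] = df_Final['PayPeriodEnding'] + pd.to_timedelta(offsets, unit='D')
df_Final = df_Final.reset_index(drop=True)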