Pandas DataFrame timeseries

I want to build a dataframe with a datetime stamp (down to minutes) as the index and keep adding columns as I get data for each new column. For example, for Col-A, I aggregate another dataset by day, hour, and minute down to a value 'k'. I want to insert this value 'k' into the dataframe at the right row index. The problem I am facing is that the current row identifier comes from a groupby object on date, hour, and minute, and I am not sure how to 'concatenate' these three into a proper time-series index.
This is what I have currently (output of my groupby object):
currGroupedData = cData.groupby(['DATE', 'HOUR', 'MINUTE'])
numUniqValuesPerDayHrMin = currGroupedData['UID'].nunique()
print(numUniqValuesPerDayHrMin)
Computing Values for A:
DATE HOUR MINUTE
2015-08-15 6 38 65
Name: UID, dtype: int64
To form a new dataframe to hold many columns (A, B, ..., Z), I am doing this:
index = pd.date_range('2015-10-05 10:00:00', '2015-11-10 10:00:00', freq='1min')
df = pd.DataFrame(index=index)
Now, I want to somehow take the value 65 and populate it into my dataframe. How do I do this? I must somehow convert the "date, hour, minute" form of the groupby object to a timestamp index.
Also, I will have a series of values for Col-A for many minutes of that day. I want to populate an entire column with those values in one shot, fill the rest with 0s, and then move on to processing/filling the next column.
Can I do this:
ts_str = '2015-10-10 06:10:00'
ts_str
Out[362]: '2015-10-10 06:10:00'
pd.to_datetime(ts_str, format='%Y-%m-%d %H:%M:%S', errors='coerce')
Out[363]: Timestamp('2015-10-10 06:10:00')
row_idx = pd.to_datetime(ts_str, format='%Y-%m-%d %H:%M:%S', errors='coerce')
type(row_idx)
Out[365]: pandas.tslib.Timestamp
data = pd.DataFrame({'Col-A': 65}, index=[row_idx])
df.add(data)
Any thoughts?

You almost got it figured out in your code; a few changes get the trick done:
1. Initialize the dataframe without data and with the time index (you can always append more rows later).
2. Initialize the new column with all values set to 0.
3. Set the value for the column at the target time.
import pandas as pd
index = pd.date_range('2015-10-05 10:00:00', '2015-11-10 10:00:00', freq='1min')
df = pd.DataFrame(index=index)
# initialize the column with all values set to 0.
df['first_column'] = 0
# parse the target time into a Timestamp
target_time = pd.to_datetime('2015-10-15 6:38')
# set the value for the target time to 65 (use .loc, not chained indexing,
# so the assignment is applied to the dataframe itself)
df.loc[target_time, 'first_column'] = 65
# output the value at the target time
df.loc[target_time, 'first_column']
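For the "one shot" fill described in the question, a sketch (reusing the hypothetical numUniqValuesPerDayHrMin result shown above) is to rebuild minute timestamps from the (DATE, HOUR, MINUTE) MultiIndex and reindex against the dataframe's full minute index, so unmatched minutes become 0:
# Flatten the groupby result so DATE/HOUR/MINUTE become ordinary columns.
s = numUniqValuesPerDayHrMin.reset_index()
# Rebuild one Timestamp per row from the three components.
timestamps = (pd.to_datetime(s['DATE'])
              + pd.to_timedelta(s['HOUR'], unit='h')
              + pd.to_timedelta(s['MINUTE'], unit='m'))
# Align the values to the dataframe's minute index in one shot; minutes with
# no data are filled with 0, and timestamps outside the index range are dropped.
df['Col-A'] = pd.Series(s['UID'].values, index=timestamps.values).reindex(df.index, fill_value=0)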

Related

Interpolate values based on date in pandas

I have the following datasets
import pandas as pd
import numpy as np
df = pd.read_excel("https://github.com/norhther/datasets/raw/main/ncp1b.xlsx",
sheet_name="Sheet1")
df2 = pd.read_excel("https://github.com/norhther/datasets/raw/main/ncp1b.xlsx",
sheet_name="Sheet2")
df2.dropna(inplace = True)
For each group of values in the first df, X-Axis Value and Y-Axis Value, where the first is a date and the second a value, I would like to create rows with the same date. For instance, at df.iloc[0,0] the timestamp is Timestamp('2020-08-25 23:14:12'). However, in the following columns of the same row there may be other dates with a different Y-Axis Value associated; the first one in that specific row is X-Axis Value NCVE-064 HPNDE with a timestamp of 2020-08-25 23:04:12 and an associated Y-Axis Value of 0.952.
What I want to accomplish is to interpolate those values over a time interval, maybe 10 minutes, and then merge the results so each row has the same date.
For df2 it is more or less the same: interpolate the values over a time interval and add them to the original dataframe. Is there any way to do this?
The trick is to realize that datetimes can be represented as seconds elapsed with respect to some reference time.
Without further context, the hardest part is deciding at what times you want the interpolated values.
import pandas as pd
import numpy as np
from scipy.interpolate import interp1d
df = pd.read_excel(
    "https://github.com/norhther/datasets/raw/main/ncp1b.xlsx",
    sheet_name="Sheet1",
)
x_columns = [col for col in df.columns if col.startswith("X-Axis")]

# What times do we want to align the columns to?
# You can use anything else here, e.g. equally spaced time points.
target_times = df[x_columns].min(axis=1)

def interpolate_column(target_times, x_times, y_values):
    ref_time = x_times.min()
    # For interpolation we need to represent the values as floats. One option is
    # to compute the delta in seconds between a reference time and the "current" time.
    deltas = (x_times - ref_time).dt.total_seconds()
    # repeat for our target times
    target_times_seconds = (target_times - ref_time).dt.total_seconds()
    return interp1d(deltas, y_values, bounds_error=False, fill_value="extrapolate")(target_times_seconds)

output_df = pd.DataFrame()
output_df["Times"] = target_times
output_df["Y-Axis Value NCVE-063 VPNDE"] = interpolate_column(
    target_times,
    df["X-Axis Value NCVE-063 VPNDE"],
    df["Y-Axis Value NCVE-063 VPNDE"],
)
# repeat for the other columns, better in a loop
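The per-column call can be wrapped in a loop over every Y-Axis column, a sketch assuming each "Y-Axis ..." column has a matching "X-Axis ..." column with the same suffix:
# Pair each X-Axis column with the Y-Axis column sharing the same suffix.
for x_col in x_columns:
    y_col = x_col.replace("X-Axis", "Y-Axis")
    output_df[y_col] = interpolate_column(target_times, df[x_col], df[y_col])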

Setting pandas series row values to multiple column values

I have one dataframe object, df, in which I have some data from an Excel sheet. I then added certain date columns to this df object. The df also has stock tickers from Yahoo Finance. Now I try to get the price history for these tickers for two months from Yahoo Finance (which will be 60 rows) and then try to assign these price values to the column headers, with the relevant dates, in the df object. However, I am not able to do so.
In the last line of the code, I am trying to set the values of "Volume", which will be in different rows, to the column values for the respective dates in df, but I am not able to do so. Need help. Thanks.
df = pd.read_excel(r"D:\Volume Trading\python\excel"
r"\Nifty-sector-cap.xlsx")
start_date = date(2022,3,1) # Date YYYY MM DD
end_date = date(2022,4,25)
## downloading below data just to get dates which will be columns of df.
temp_data = yf.download("HDFCBANK.NS", start_date, end_date, interval = '1d', index = False)["Adj Close"]
temp_data.index = temp_data.index.date
# setting the dates as columns header in df
df = df.reindex(columns = df.columns.tolist() + temp_data.index.tolist())
# putting the volume for each ticker on each date in df
for i in range(0, len(df)):
    temp_vol = yf.download(df["Yahoo_Symbol"].iloc[i], start_date, end_date, interval="1d")["Volume"]
    temp_vol.index = temp_vol.index.date
    df[temp_vol.index.tolist()].iloc[i] = temp_vol.transpose()
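A sketch of a working assignment, assuming every ticker returns the same dates that were added as columns: go through .loc with the row's label, since chained indexing like df[cols].iloc[i] = ... writes to a temporary copy and is silently discarded.
for i in range(len(df)):
    temp_vol = yf.download(df["Yahoo_Symbol"].iloc[i], start_date, end_date,
                           interval="1d")["Volume"]
    temp_vol.index = temp_vol.index.date
    # .loc with the row label and the list of date columns assigns in place.
    df.loc[df.index[i], temp_vol.index.tolist()] = temp_vol.values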

Pyspark: how to fix 'could not parse datatype: interval' error

I'm trying to add a new column to a PySpark df by subtracting the values of two existing columns.
I already had a date_of_birth column available, so I inserted a current_date column with the following code:
import datetime
from pyspark.sql.functions import lit

currentdate = "14-12-2021"
day, month, year = currentdate.split('-')
today = datetime.date(int(year), int(month), int(day))
df = df.withColumn("current_date", lit(today))
Displaying my df confirms that it worked. It looks a little something like this:
id  date_of_birth  current_date
01  1995-01-01     2021-12-14
02  1987-02-16     2021-12-14
I inserted the age column by subtracting the values of date_of_birth and current_date.
df = df.withColumn('age', (df['current_date'] - df['date_of_birth']))
Cell runs without a problem.
Here's where I'm stuck:
Once I try to display my dataframe again in order to verify that everything went smoothly, the following error occurs:
'could not parse datatype: interval'
I used df.dtypes to check what's happening, and apparently my newly inserted age column is of interval type.
How can I fix this? Is there a way to display the age in years (int) in this particular scenario?
PS: both the date_of_birth and current_date cols have a date type.
Solved it. Mike's comment helped tons. Thank you!
Here's how I solved it:
from pyspark.sql import functions as f
from pyspark.sql.functions import lit

# insert new column current_date with dummy data (in this case, 1s)
df = df.withColumn("current_date", lit(1))
# update data with the current_date() function
df = df.withColumn("current_date", f.current_date())
# insert new column age with dummy data (in this case, 1s)
df = df.withColumn("age", lit(1))
# update data with the months_between() function, divided by 12 to obtain years
df = df.withColumn("age", f.months_between(df.current_date, df.date_of_birth) / 12)
# round and cast as integer to get rid of decimals
df = df.withColumn("age", f.round(df["age"]).cast('integer'))
I would use one of the PySpark functions for calculating the difference between dates:
pyspark.sql.functions.datediff
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.datediff.html
pyspark.sql.functions.months_between
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.months_between.html
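A minimal sketch of that route, assuming date-typed columns as in the question:
from pyspark.sql import functions as f

# Days between the two dates.
df = df.withColumn("days_old", f.datediff(f.current_date(), f.col("date_of_birth")))
# Age in whole years: months between the dates, divided by 12 and floored.
df = df.withColumn(
    "age",
    f.floor(f.months_between(f.current_date(), f.col("date_of_birth")) / 12),
)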

Filter a dataframe by different timestamp formats

I have a dataframe with a timestamp column that has two different formats, for example:
created name
2020-04-30T20:06:00.000Z Back
--T::00.000Z Summary
2020-04-30T20:05:00.000Z Recalculate
2020-04-30T20:05:00.000Z Recalculate
--T::00.000Z Recalculate
I would like to filter this dataframe in order to get only good formatted timestamps 'yyyy-mm-ddTHH:MM:SS.000Z', i.e. to get the dataframe
created name
2020-04-30T20:06:00.000Z Back
2020-04-30T20:05:00.000Z Recalculate
2020-04-30T20:05:00.000Z Recalculate
How can I filter by timestamp format?
Use pd.to_datetime with the optional parameter errors='coerce' to convert the created column to a series of pandas datetimes, then use Series.notna to create a boolean mask and filter the dataframe with it:
m = pd.to_datetime(df['created'], errors='coerce').notna()
df = df[m]
# print(df)
created name
0 2020-04-30T20:06:00.000Z Back
2 2020-04-30T20:05:00.000Z Recalculate
3 2020-04-30T20:05:00.000Z Recalculate
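An alternative sketch, if you want to enforce the literal 'yyyy-mm-ddTHH:MM:SS.000Z' layout rather than general parseability, is a regex match on the string column:
# Keep only rows whose 'created' string matches the exact layout.
pattern = r'^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z$'
df = df[df['created'].str.match(pattern, na=False)]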

Extract value from Series

I want to add one-hot 'hour' columns to my dataframe.
First I get all the hour values from the column:
coun_ = set(train_df['time1'].dt.hour)
Then I add new columns to data frame and fill there default values:
for i in coun_:
    train_df['hour' + str(i)] = 0
Now I want to get the hour from time1 and set 1 in the right column. For example, if the hour of time1 equals 10, I put 1 in hour10. I have tried several ways without success; one of them:
for hour in [train_df]:
    hour['hour' + hour['time1'].dt.hour.to_string()] = 1
The question is: how can I extract just the value from the Series and concatenate it?
Use get_dummies with DataFrame.add_prefix and append to the original with DataFrame.join:
train_df = train_df.join(pd.get_dummies(train_df['time1'].dt.hour).add_prefix('hour'))
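A minimal sketch of what this produces, with made-up data:
import pandas as pd

train_df = pd.DataFrame({'time1': pd.to_datetime(['2020-01-01 10:00:00',
                                                  '2020-01-01 11:30:00'])})
train_df = train_df.join(pd.get_dummies(train_df['time1'].dt.hour).add_prefix('hour'))
# train_df now has hour10 and hour11 columns, with a 1 in the row
# whose time1 falls in that hour and 0 elsewhere.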