Get day of year from a string date in pandas dataframe - pandas

I want to turn my date string into day of year... I try this code..
import pandas as pd
import datetime
data = pd.DataFrame()
data = pd.read_csv(xFilename, sep=",")
and get this DataFrame
Index Date Tmin Tmax
0 1950-01-02 -16.508 -2.096
1 1950-01-03 -6.769 0.875
2 1950-01-04 -1.795 8.859
3 1950-01-05 1.995 9.487
4 1950-01-06 -17.738 -9.766
I try this...
convert = lambda x: x.DatetimeIndex.dayofyear
data['Date'].map(convert)
with this error:
AttributeError: 'str' object has no attribute 'DatetimeIndex'
I expect to get new date to match 1950-01-02 = 2, 1950-01-03 = 3...
Thank for your help... and sorry Im new on python

I think need pass parameter parse_dates to read_csv and then call Series.dt.dayofyear:
data = pd.read_csv(xFilename, parse_dates=["Date"])
data['dayofyear'] = data['Date'].dt.dayofyear

Related

Increment a time and add it in data frame column

Hi I am new to python and I am looking for below result.
I have From_Curr(3), To_Curr(3) and making currency pairs and adding new column in my data frame as time.
3*3 = 9 currency pairs created So I want same time for currency pairs and then increment by 1 hr again for same pairs as shown below.
Problem statement is time gets incremented after every row.
Actual df:
Expected df:
Thanks for any help and appreciate your time.
`
import pandas as pd
import datetime
from datetime import timedelta
data = pd.DataFrame({'From':["EUR","GBP",'USD'],
'To':["INR","SGD",'HKD'],
'time':''})
init_date = datetime.datetime(1, 1, 1)
for index, row in data.iterrows():
row['time'] = str(init_date)[11:19]
init_date = init_date + timedelta(hours=1.0)
`
I'm not understanding why you are repeating the combinations, and incrementing in one hour in the last half.
But for this case, you can do something like this:
import pandas as pd
data = pd.DataFrame({'From':["EUR","GBP",'USD'],
'To':["INR","SGD",'HKD'],
'time':''})
outlist = [ (i, j) for i in data["From"] for j in data["To"] ]*2 # Create double combinations
data = pd.DataFrame(data=outlist,columns=["From","To"])
data["time"] = "00:00:00"
data["time"].iloc[int(len(data)/2):len(data)] = "01:00:00" # Assign 1 hour to last half
data["time"] = pd.to_datetime(data["time"]).dt.time
Update: After some clarifications
import pandas as pd
data = pd.DataFrame(
{"From": ["EUR", "GBP", "USD"], "To": ["INR", "SGD", "HKD"], "time": ""}
)
outlist = [
(i, j) for i in data["From"] for j in data["To"]
] * 2 # Create double combinations, i think that for your case it would be 24 instead of two
data = pd.DataFrame(data=outlist, columns=["From", "To"])
data["time"] = data.groupby(["From", "To"]).cumcount() # Get counts of pairs values
data["time"] = data["time"] * pd.to_datetime("01:00:00").value # Multiply occurrences by the value of 1 hour
data["time"] = pd.to_datetime(data["time"]).dt.time # Pass to time
I think this script covers all your needs, happy coding :)
Regards,

Dataframe index with isclose function

I have a dataframe with numerical values between 0 and 1. I am trying to create simple summary statistics (manually). I when using boolean I can get the index but when I try to use math.isclose the function does not work and gives an error.
For example:
import pandas as pd
df1 = pd.DataFrame({'col1':[0,.05,0.74,0.76,1], 'col2': [0,
0.05,0.5, 0.75,1], 'x1': [1,2,3,4,5], 'x2':
[5,6,7,8,9]})
result75 = df1.index[round(df1['col2'],2) == 0.75].tolist()
value75 = df1['x2'][result75]
print(value75.mean())
This will give the correct result but occasionally the value result is NAN so I tried:
result75 = df1.index[math.isclose(round(df1['col2'],2), 0.75, abs_tol = 0.011)].tolist()
value75 = df1['x2'][result75]
print(value75.mean())
This results in the following error message:
TypeError: cannot convert the series to <class 'float'>
Both are type "bool" so not sure what is going wrong here...
This works:
rows_meeting_condition = df1[(df1['col2'] > 0.74) & (df1['col2'] < 0.76)]
print(rows_meeting_condition['x2'].mean())

Pandas - get_loc nearest for whole column

I have a df with date and price.
Given a datetime, I would like to find the price at the nearest date.
This works for one input datetime:
import requests, xlrd, openpyxl, datetime
import pandas as pd
file = "E:/prices.csv" #two columns: Timestamp (UNIX epoch), Price (int)
df = pd.read_csv(file, index_col=None, names=["Timestamp", "Price"])
df['Timestamp'] = pd.to_datetime(df['Timestamp'],unit='s')
df = df.drop_duplicates(subset=['Timestamp'], keep='last')
df = df.set_index('Timestamp')
file = "E:/input.csv" #two columns: ID (string), Date (dd-mm-yyy hh:ss:mm)
dfinput = pd.read_csv(file, index_col=None, names=["ID", "Date"])
dfinput['Date'] = pd.to_datetime(dfinput['Date'], dayfirst=True)
exampledate = pd.to_datetime("20-3-2020 21:37", dayfirst=True)
exampleprice = df.iloc[df.index.get_loc(exampledate, method='nearest')]["Price"]
print(exampleprice) #price as output
I have another dataframe with the datetimes ("dfinput") I want to lookup prices of and save in a new column "Price".
Something like this which is obviously not working:
dfinput['Date'] = pd.to_datetime(dfinput['Date'], dayfirst=True)
dfinput['Price'] = df.iloc[df.index.get_loc(dfinput['Date'], method='nearest')]["Price"]
dfinput.to_csv('output.csv', index=False, columns=["Hash", "Date", "Price"])
Can I do this for a whole column or do I need to iterate over all rows?
I think you need merge_asof (cannot test, because no sample data):
df = df.sort_index('Timestamp')
dfinput = dfinput.sort_values('Date')
df = pd.merge_asof(df, dfinput, left_index=True, right_on='Date', direction='nearest')

TypeError: ufunc add cannot use operands with types dtype('<M8[ns]') and dtype('<M8[ns]')

I am trying to set an ARIMA model to some data, for this, I used 'autocorrelation_plot()' with my time series. It's generates however the error in the title.
I have an attribute table composed, among others, of a Date and time fiels.
I extracted them (after transforming the attribute table into a numpy table), put them in a 'datetime' variable and appended them all in a list:
O,A = [],[]
dt = datetime.strptime(dt1, "%Y/%m/%d %H:%M")
A.append(dt)
I tried then to create time series and printed them to be sure of the results:
data2 = pd.Series(A, O)
print data2
The results were satisfying, until I decided to auto-correlate :
Auto-correlation command :
autocorrelation_plot(data2)
After this command, it returns:
TypeError: ufunc add cannot use operands with types dtype('M8[ns]') and dtype('M8[ns]')
I guess it's due to the conversion of the datetime.strptime to a numpy ?
I tried to follow some suggestions from previous questions
index.to_pydatetime() , dtype, M8[ns] error ..., in vain.
Minimal reproducible example:
from pandas import datetime
from pandas import DataFrame
import pandas as pd
from matplotlib import pyplot as plt
from pandas.tools.plotting import autocorrelation_plot
arr = arcpy.da.TableToNumPyArray(inTable ,("PROVINCE","ZONE_CODE","MEAN", "Datetime","Time"))
arr_length = len(arr)
j = 1
O,A = [],[]
while j<=55: #I have 55 provinces
i = 0
while i<arr_length:
if arr[i][1]== j:
O.append(arr[i][2])
c = str(arr[i][3])
d = str(c[0:4]+"/"+c[5:7]+"/"+c[8:10])
t = str(arr[i][4])
if t=="10":
dt1 = str(d+" 10:00")
else:
dt1 = str(d+" 14:00")
dt = datetime.strptime(dt1, "%Y/%m/%d %H:%M")
A.append(dt)
i = i+1
data2 = pd.Series(A, O)
print data2
autocorrelation_plot(data2)
del A[:]
del O[:]
j += 1
Screenshot of the results:
results
I used this to solve my issue:
import matplotlib.dates as mpl_dates
df.reset_index(inplace=True)
df['Date']=df['Date'].apply(mpl_dates.date2num)
df = df.astype(float)
I found a solution, it can look barbaric, but it works!
I've just "recreated" pd.Series() with the pd.Series I had:
data2 = pd.Series(O, A)
autocorrelation_plot(pd.Series(data2))
plt.show()

Pandas DataFrame expand existing dataset to finer timestamp

I am trying to make this piece of code faster, it is failing on conversion of ~120K rows to ~1.7m.
Essentially, I am trying to convert each date stamped entry into 14, representing each DOW from PayPeriodEndingDate to T-14
Does anyone have a better suggestion other than iteruples to do this loop?
Thanks!!
df_Final = pd.DataFrame()
for row in merge4.itertuples():
listX = []
listX.append(row)
df = pd.DataFrame(listX*14)
df = df.reset_index().drop('Index',axis=1)
df['Hours'] = df['Hours']/14
df['AmountPaid'] = df['AmountPaid']/14
df['PayPeriodEnding'] = np.arange(df.loc[:,'PayPeriodEnding'][0] - np.timedelta64(14,'D'), df.loc[:,'PayPeriodEnding'][0], dtype='datetime64[D]')
frames = [df_Final,df]
df_Final = pd.concat(frames,axis=0)
df_Final