Pandas DataFrame expand existing dataset to finer timestamp

I am trying to make this piece of code faster; it fails when expanding ~120K rows to ~1.7M.
Essentially, I am trying to expand each date-stamped entry into 14 rows, one for each day from PayPeriodEnding back to T-14.
Does anyone have a better suggestion than itertuples to do this loop?
Thanks!!
df_Final = pd.DataFrame()
for row in merge4.itertuples():
    df = pd.DataFrame([row] * 14)
    df = df.reset_index().drop('Index', axis=1)
    df['Hours'] = df['Hours'] / 14
    df['AmountPaid'] = df['AmountPaid'] / 14
    df['PayPeriodEnding'] = np.arange(df.loc[0, 'PayPeriodEnding'] - np.timedelta64(14, 'D'),
                                      df.loc[0, 'PayPeriodEnding'], dtype='datetime64[D]')
    df_Final = pd.concat([df_Final, df], axis=0)
df_Final
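For what it's worth, a vectorized sketch that drops the per-row loop entirely; untested, and it assumes merge4 has a unique index and the three columns used above:
import numpy as np
import pandas as pd

# Repeat every row of merge4 fourteen times in one operation.
expanded = merge4.loc[merge4.index.repeat(14)].reset_index(drop=True)

# Spread the totals evenly over the 14 days.
expanded['Hours'] = expanded['Hours'] / 14
expanded['AmountPaid'] = expanded['AmountPaid'] / 14

# Shift each repeated row by -14 ... -1 days from its PayPeriodEnding,
# mirroring the np.arange(end - 14 days, end) range in the loop above.
offsets = np.tile(np.arange(-14, 0), len(merge4))
expanded['PayPeriodEnding'] = expanded['PayPeriodEnding'] + offsets * np.timedelta64(1, 'D')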

Related

Pandas - get_loc nearest for whole column

I have a df with date and price.
Given a datetime, I would like to find the price at the nearest date.
This works for one input datetime:
import requests, xlrd, openpyxl, datetime
import pandas as pd
file = "E:/prices.csv" #two columns: Timestamp (UNIX epoch), Price (int)
df = pd.read_csv(file, index_col=None, names=["Timestamp", "Price"])
df['Timestamp'] = pd.to_datetime(df['Timestamp'],unit='s')
df = df.drop_duplicates(subset=['Timestamp'], keep='last')
df = df.set_index('Timestamp')
file = "E:/input.csv" #two columns: ID (string), Date (dd-mm-yyy hh:ss:mm)
dfinput = pd.read_csv(file, index_col=None, names=["ID", "Date"])
dfinput['Date'] = pd.to_datetime(dfinput['Date'], dayfirst=True)
exampledate = pd.to_datetime("20-3-2020 21:37", dayfirst=True)
exampleprice = df.iloc[df.index.get_loc(exampledate, method='nearest')]["Price"]
print(exampleprice) #price as output
I have another dataframe ("dfinput") with the datetimes I want to look up prices for and save in a new column "Price".
Something like this, which is obviously not working:
dfinput['Date'] = pd.to_datetime(dfinput['Date'], dayfirst=True)
dfinput['Price'] = df.iloc[df.index.get_loc(dfinput['Date'], method='nearest')]["Price"]
dfinput.to_csv('output.csv', index=False, columns=["Hash", "Date", "Price"])
Can I do this for a whole column or do I need to iterate over all rows?
I think you need merge_asof (cannot test, because no sample data):
df = df.sort_index()
dfinput = dfinput.sort_values('Date')
dfinput = pd.merge_asof(dfinput, df, left_on='Date', right_index=True, direction='nearest')
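Since there is no sample data, here is a minimal made-up example of the idea (column names match the question):
import pandas as pd

# Made-up price history indexed by Timestamp.
df = pd.DataFrame({'Price': [10, 20, 30]},
                  index=pd.to_datetime(['2020-03-20 21:00',
                                        '2020-03-20 22:00',
                                        '2020-03-20 23:00']))

# Made-up lookup rows.
dfinput = pd.DataFrame({'ID': ['a'], 'Date': pd.to_datetime(['2020-03-20 21:37'])})

out = pd.merge_asof(dfinput.sort_values('Date'), df.sort_index(),
                    left_on='Date', right_index=True, direction='nearest')
print(out)  # 21:37 is nearest to 22:00, so Price is 20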

df.groupby('columns').apply(''.join()), join all the cells to a string

I want to group by key and join all the cells under each key into one string. This is a task for a junior data processor; I've tried many ways in the past.
import pandas as pd

data = {'key': ['a', 'b', 'c', 'a', 'b', 'c', 'a'],
        'profit': [12, 3, 4, 5, 6, 7, 9],
        'income': ['j', 'd', 'd', 'g', 'd', 't', 'd']}
df = pd.DataFrame(data)
df = df.set_index('key')

# df2 is the expected result: one joined string per key
data2 = {'key': ['a', 'b', 'c'], 'joined': ['12j5g9d', '3d6d', '4d7t']}
df2 = pd.DataFrame(data2)
df2 = df2.set_index('key')
Here's a simple solution: first translate the integers to strings and concatenate profit with income, then concatenate all the strings under the same key:
data = {'key': ['a', 'b', 'c', 'a', 'b', 'c', 'a'],
        'profit': [12, 3, 4, 5, 6, 7, 9],
        'income': ['j', 'd', 'd', 'g', 'd', 't', 'd']}
df = pd.DataFrame(data)
df['profit_income'] = df['profit'].apply(str) + df['income']
res = df.groupby('key')['profit_income'].agg(''.join)
print(res)
output:
key
a 12j5g9d
b 3d6d
c 4d7t
Name: profit_income, dtype: object
This question can be solved a couple of different ways:
First add an extra column by concatenating the profit and income columns.
import pandas as pd

data = {'key': ['a', 'b', 'c', 'a', 'b', 'c', 'a'],
        'profit': [12, 3, 4, 5, 6, 7, 9],
        'income': ['j', 'd', 'd', 'g', 'd', 't', 'd']}
df = pd.DataFrame(data)
df = df.set_index('key')
df['profinc'] = df['profit'].astype(str) + df['income']
1) Using sum
df2=df.groupby('key').profinc.sum()
2) Using apply and join
df2=df.groupby('key').profinc.apply(''.join)
Results from both of the above would be the same:
key
a 12j5g9d
b 3d6d
c 4d7t
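Note that sum only happens to work here because summing an object-dtype column falls back to string concatenation; ''.join states the intent more directly. If you would rather skip the helper column entirely, a one-line variant (assuming key is still a regular column, as in the first answer) could be:
res = (df['profit'].astype(str) + df['income']).groupby(df['key']).agg(''.join)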

Merging DataFrames Causing Int Value Error

I apologize beforehand as I'm new to programming, Python, and Pandas. I am trying to merge five dataframes in steps, based on datetime indexes and using an outer join. The data is water level gauging station data that I've attached in the following link: https://drive.google.com/open?id=15t7wkU0Sl17WgIS6CAgJpkWONdppdhvH . The data comes from a variety of Canadian and American water level gauging station systems and is all fairly similar, except that the American data in the link has to be reduced to a datum by simply subtracting the elevation of the station. I've added the American stations into their own module to handle the processing. The merging works fine except when merging the processed American data, which throws the following error:
ValueError: can not merge DataFrame with instance of type <type 'int'>
Could someone please give me some insight into this error?
I've searched the dataframes for ints and tried converting the dataframe to strings, but I still have had no luck merging them.
import glob
import os
import numpy as np
import pandas as pd

def readNOAA():
    noaaDFs = {}
    filenames = glob.glob(r"C:\Users\Andrew\Documents\CHS\SeasonalGaugeAnalysis\Input\CO-OPS_*_wl.csv")
    dataframes = [pd.read_csv(filename) for filename in filenames]
    for df, filename in zip(dataframes, filenames):
        df['filename'] = os.path.basename(filename)
        df['Station ID'] = df['filename'].str[7:14]
        del df['filename']
    combined_df = pd.concat(dataframes, ignore_index=True)
    df = pd.DataFrame.from_records(combined_df)
    df['Date Time (UTC)'] = pd.to_datetime(df['Date'] + ' ' + df['Time (GMT)'])
    #df['Date Time (UTC)'] = df['Date Time (UTC)'].tz_localize('utc')
    df.set_index('Date Time (UTC)', inplace=True)
    df.index = df.index.to_datetime().tz_localize('utc')
    df['WL'] = df['Preliminary (m)'].where(df['Verified (m)'].isnull(), df['Preliminary (m)'])
    df['WL'] = df['WL'].replace({'-': np.nan})
    df.drop(['Date', 'Time (GMT)', 'Predicted (m)', 'Preliminary (m)', 'Verified (m)'], axis=1, inplace=True)
    Station3 = df[df['Station ID'].str.match('9052030')]  # Oswego, NY
    #Station3['Temp Station ID'] = '13771'
    Station3['WL'] = Station3['WL'].astype(float) - 74.2
    Station3.rename(columns={'WL': '9052030'}, inplace=True)
    del Station3['Station ID']
    noaaDFs["9052030"] = 9052030
hydrometDF = readHydromet()
pcwlDF = readPCWL()
noaaDF = readNOAA()
# Create 3 minute time series to merge station data to so data gaps can be identified
idx = pd.date_range('06-21-2019 20:00:00', periods=25000, freq='3 min')
ts = pd.Series(range(len(idx)), index=idx)
cts = ts.to_frame()
cts.index = cts.index.to_datetime().tz_localize('utc')
cts.drop(0, axis=1, inplace=True)
final13771 = pd.merge(hydrometDF["Station11985"], pcwlDF["station13988"], how='outer', left_index=True,right_index=True)
final13771 = pd.merge(final13771, pcwlDF["station13590"], how='outer', left_index=True,right_index=True)
final13771 = pd.merge(final13771, pcwlDF["station14400"], how='outer', left_index=True,right_index=True)
final13771 = pd.merge(final13771, noaaDF["9052030"], how='outer', left_index=True,right_index=True)
final13771 = pd.merge(final13771, cts, how='outer', left_index=True,right_index=True)
final13771.to_excel("13771.xlsx")
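One likely culprit is visible in readNOAA() itself: the line noaaDFs["9052030"] = 9052030 stores the literal station number instead of the Station3 DataFrame, and the function never returns noaaDFs, so noaaDF["9052030"] is not a DataFrame by the time it reaches pd.merge. A minimal, self-contained reproduction of that failure:
import pandas as pd

left = pd.DataFrame({'a': [1]})

# Merging a DataFrame with a plain int fails; the exact message varies by
# pandas version, and older versions raise
# "ValueError: can not merge DataFrame with instance of type <type 'int'>".
pd.merge(left, 9052030, how='outer', left_index=True, right_index=True)
A sketch of a fix, assuming Station3 is what was meant to be stored: end the function with noaaDFs["9052030"] = Station3 followed by return noaaDFs.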

Dataframe Split query

I have a dataframe column in which every row contains text like the following:
Symbol(id=15351, ticker=VXX, market=US, currency=USD, type=EQUITY,tick_size=0.010000, lot_size=100, contract_size=0, rate=None)
I am trying to extract only the part after ticker=, which gives VXX.
I tried
df['symbolcolumn'] = df['symbolcolumn'].str.split(',market', expand=True)
But it does not extract only the ticker symbol.
Looking for df['symbolcolumn'] = VXX
Can you advise me please?
OK, I managed to do it with:
df['symbol'] = df['symbol'].astype(str)
df['symbol'] = df['symbol'].str.split(', market', expand=True)[0]
df['symbol'] = df['symbol'].apply(lambda x: x.split("=")[-1])
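For reference, a one-step alternative using a regex capture group should give the same result (untested against the exact data):
# Grab whatever follows "ticker=" up to the next comma.
df['symbol'] = df['symbol'].astype(str).str.extract(r'ticker=([^,]+)', expand=False)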

Assigning values to dataframe columns

In the code below, the dataframe df5 is not getting populated. I am just assigning values to the dataframe's columns, and I specified the columns beforehand. When I print the dataframe, it returns an empty dataframe. I'm not sure what I am missing.
Any help would be appreciated.
import math
import pandas as pd

columns = ['ClosestLat', 'ClosestLong']
df5 = pd.DataFrame(columns=columns)

def distance(pt1, pt2):
    return math.sqrt((pt1[0] - pt2[0])**2 + (pt1[1] - pt2[1])**2)

for pt1 in df1:
    closestPoints = [pt1, df2[0]]
    for pt2 in df2:
        if distance(pt1, pt2) < distance(closestPoints[0], closestPoints[1]):
            closestPoints = [pt1, pt2]
    df5['ClosestLat'] = closestPoints[1][0]
    df5['ClosestLong'] = closestPoints[1][1]
    print("Point: " + str(closestPoints[0]) + " is closest to " + str(closestPoints[1]))
From the look of your code, you're trying to populate df5 with a list of latitudes and longitudes. However, you're making a couple of mistakes.
The columns of pandas dataframes are Series and hold some type of sequential data. So df5['ClosestLat'] = closestPoints[1][0] attempts to assign a single numerical value to the entire column, which here results in an empty column.
Even if the dataframe weren't ignoring your attempts to assign a real number to the column, you would lose data because you are overwriting the column on each iteration of the loop.
The solution: build lists of lats and longs, then insert them into the dataframe.
import math
import pandas as pd

columns = ['ClosestLat', 'ClosestLong']
df5 = pd.DataFrame(columns=columns)

def distance(pt1, pt2):
    return math.sqrt((pt1[0] - pt2[0])**2 + (pt1[1] - pt2[1])**2)

lats, lngs = [], []
for pt1 in df1:
    closestPoints = [pt1, df2[0]]
    for pt2 in df2:
        if distance(pt1, pt2) < distance(closestPoints[0], closestPoints[1]):
            closestPoints = [pt1, pt2]
    lats.append(closestPoints[1][0])
    lngs.append(closestPoints[1][1])

df5['ClosestLat'] = pd.Series(lats)
df5['ClosestLong'] = pd.Series(lngs)
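A quick sanity check of the pattern, with hypothetical df1 and df2 as plain lists of (lat, long) tuples; min() with a key function stands in for the inner loop:
import math
import pandas as pd

df1 = [(43.65, -79.38), (45.42, -75.70)]  # hypothetical query points
df2 = [(43.70, -79.42), (45.50, -73.57)]  # hypothetical candidate points

def distance(pt1, pt2):
    return math.sqrt((pt1[0] - pt2[0])**2 + (pt1[1] - pt2[1])**2)

lats, lngs = [], []
for pt1 in df1:
    # min() with a key does the same job as the inner for-loop above.
    closest = min(df2, key=lambda pt2: distance(pt1, pt2))
    lats.append(closest[0])
    lngs.append(closest[1])

df5 = pd.DataFrame({'ClosestLat': lats, 'ClosestLong': lngs})
print(df5)  # one row per point in df1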