Merging DataFrames Causing Int Value Error - pandas

I apologize before hand as I'm new to programming, Python, and Pandas. I am trying to merge five dataframes in steps based on datetime indexs and using an outer join. The data is water level gauging station data that I've attached in the following link: https://drive.google.com/open?id=15t7wkU0Sl17WgIS6CAgJpkWONdppdhvH . The data comes from a variety of Canadian and American water level gauging station systems and is all fairly similar except that the American data in the link has to be reduced to a datum by simply subtracting the elevation of the station. I've added the American stations into their own module to handle the processing. The merging works fine except when merging the processed American data which throws the following error:
ValueError: can not merge DataFrame with instance of type <type 'int'>
Could someone please give me some insight into this error?
I've searched the dataframes for ints and tried converting the dataframe to a string or dataframe but still have had no luck merging them.
def readNOAA():
noaaDFs = {}
filenames = glob.glob(r"C:\Users\Andrew\Documents\CHS\SeasonalGaugeAnalysis\Input\CO-OPS_*_wl.csv")
dataframes = [pd.read_csv(filename) for filename in filenames]
for df, filename in zip(dataframes, filenames):
df['filename'] = os.path.basename(filename)
df['Station ID'] = df['filename'].str[7:14]
del df['filename']
combined_df = pd.concat(dataframes, ignore_index=True)
df = pd.DataFrame.from_records(combined_df)
df['Date Time (UTC)'] = pd.to_datetime(df['Date'] + ' ' + df['Time (GMT)'])
#df['Date Time (UTC)'] = df['Date Time (UTC)'].tz_localize('utc')
df.set_index('Date Time (UTC)', inplace=True)
df.index = df.index.to_datetime().tz_localize('utc')
df['WL'] = df['Preliminary (m)'].where(df['Verified (m)'].isnull(), df['Preliminary (m)'])
df['WL'] = df['WL'].replace({'-': np.nan})
df.drop(['Date', 'Time (GMT)', 'Predicted (m)', 'Preliminary (m)', 'Verified (m)'], axis=1, inplace=True)
Station3 = df[df['Station ID'].str.match('9052030')] #Oswego, NY
#Station3['Temp Station ID'] = '13771'
Station3['WL'] = Station3['WL'].astype(float) - 74.2
Station3.rename(columns={'WL':'9052030'}, inplace=True)
del Station3['Station ID']
noaaDFs["9052030"]=9052030
hydrometDF = readHydromet()
pcwlDF = readPCWL()
noaaDF = readNOAA()
# Create 3 minute time series to merge station data to so data gaps can be identified
idx = pd.date_range('06-21-2019 20:00:00', periods=25000, freq='3 min')
ts = pd.Series(range(len(idx)), index=idx)
cts = ts.to_frame()
cts.index = cts.index.to_datetime().tz_localize('utc')
cts.drop(0, axis=1, inplace=True)
final13771 = pd.merge(hydrometDF["Station11985"], pcwlDF["station13988"], how='outer', left_index=True,right_index=True)
final13771 = pd.merge(final13771, pcwlDF["station13590"], how='outer', left_index=True,right_index=True)
final13771 = pd.merge(final13771, pcwlDF["station14400"], how='outer', left_index=True,right_index=True)
final13771 = pd.merge(final13771, noaaDF["9052030"], how='outer', left_index=True,right_index=True)
final13771 = pd.merge(final13771, cts, how='outer', left_index=True,right_index=True)
final13771.to_excel("13771.xlsx")

Related

Pandas - get_loc nearest for whole column

I have a df with date and price.
Given a datetime, I would like to find the price at the nearest date.
This works for one input datetime:
import requests, xlrd, openpyxl, datetime
import pandas as pd
file = "E:/prices.csv" #two columns: Timestamp (UNIX epoch), Price (int)
df = pd.read_csv(file, index_col=None, names=["Timestamp", "Price"])
df['Timestamp'] = pd.to_datetime(df['Timestamp'],unit='s')
df = df.drop_duplicates(subset=['Timestamp'], keep='last')
df = df.set_index('Timestamp')
file = "E:/input.csv" #two columns: ID (string), Date (dd-mm-yyy hh:ss:mm)
dfinput = pd.read_csv(file, index_col=None, names=["ID", "Date"])
dfinput['Date'] = pd.to_datetime(dfinput['Date'], dayfirst=True)
exampledate = pd.to_datetime("20-3-2020 21:37", dayfirst=True)
exampleprice = df.iloc[df.index.get_loc(exampledate, method='nearest')]["Price"]
print(exampleprice) #price as output
I have another dataframe with the datetimes ("dfinput") I want to lookup prices of and save in a new column "Price".
Something like this which is obviously not working:
dfinput['Date'] = pd.to_datetime(dfinput['Date'], dayfirst=True)
dfinput['Price'] = df.iloc[df.index.get_loc(dfinput['Date'], method='nearest')]["Price"]
dfinput.to_csv('output.csv', index=False, columns=["Hash", "Date", "Price"])
Can I do this for a whole column or do I need to iterate over all rows?
I think you need merge_asof (cannot test, because no sample data):
df = df.sort_index('Timestamp')
dfinput = dfinput.sort_values('Date')
df = pd.merge_asof(df, dfinput, left_index=True, right_on='Date', direction='nearest')

df.groupby('columns').apply(''.join()), join all the cells to a string

df.groupby('columns').apply(''.join()), join all the cells to a string.
This is for a junior dataprocessor. In the past, I've tried many ways.
import pandas as pd
data = {'key':['a','b','c','a','b','c','a'], 'profit':
[12,3,4,5,6,7,9],'income':['j','d','d','g','d','t','d']}
df = pd.DataFrame(data)
df = df.set_index(‘key’)
#df2 is expected result
data2 = {'a':['12j5g9d'],'b':['3d6d'],'c':['4d7t']}
df2 = pd.DataFrame(data2)
df2 = df2.set_index(‘key’)
Here's a simple solution, where we first translate the integers to strings and then concatenate profit and income, then finally we concatenate all strings under the same key:
data = {'key':['a','b','c','a','b','c','a'], 'profit':
[12,3,4,5,6,7,9],'income':['j','d','d','g','d','t','d']}
df = pd.DataFrame(data)
df['profit_income'] = df['profit'].apply(str) + df['income']
res = df.groupby('key')['profit_income'].agg(''.join)
print(res)
output:
key
a 12j5g9d
b 3d6d
c 4d7t
Name: profit_income, dtype: object
This question can be solved couple different ways:
First add an extra column by concatenating the profit and income columns.
import pandas as pd
data = {'key':['a','b','c','a','b','c','a'], 'profit':
[12,3,4,5,6,7,9],'income':['j','d','d','g','d','t','d']}
df = pd.DataFrame(data)
df = df.set_index('key')
df['profinc']=df['profit'].astype(str)+df['income']
1) Using sum
df2=df.groupby('key').profinc.sum()
2) Using apply and join
df2=df.groupby('key').profinc.apply(''.join)
Results from both of the above would be the same:
key
a 12j5g9d
b 3d6d
c 4d7t

How to apply a rolling Kalman Filter to a column in a DataFrame?

How to apply a rolling Kalman Filter to a DataFrame column (without using external data)?
That is, pretending that each row is a new point in time and therefore requires for the descriptive statistics to be updated (in a rolling manner) after each row.
For example, how to apply the Kalman Filter to any column in the below DataFrame?
n = 2000
index = pd.date_range(start='2000-01-01', periods=n)
data = np.random.randn(n, 4)
df = pd.DataFrame(data, columns=list('ABCD'), index=index)
I've seen previous responses (1 and 2) however they are not applying it to a DataFrame column (and they are not vectorized).
How to apply a rolling Kalman Filter to a column in a DataFrame?
Exploiting some good features of numpy and using pykalman library, and applying the Kalman Filter on column D for a rolling window of 3, we can write:
import pandas as pd
from pykalman import KalmanFilter
import numpy as np
def rolling_window(a, step):
shape = a.shape[:-1] + (a.shape[-1] - step + 1, step)
strides = a.strides + (a.strides[-1],)
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
def get_kf_value(y_values):
kf = KalmanFilter()
Kc, Ke = kf.em(y_values, n_iter=1).smooth(0)
return Kc
n = 2000
index = pd.date_range(start='2000-01-01', periods=n)
data = np.random.randn(n, 4)
df = pd.DataFrame(data, columns=list('ABCD'), index=index)
wsize = 3
arr = rolling_window(df.D.values, wsize)
zero_padding = np.zeros(shape=(wsize-1,wsize))
arrst = np.concatenate((zero_padding, arr))
arrkalman = np.zeros(shape=(len(arrst),1))
for i in range(len(arrst)):
arrkalman[i] = get_kf_value(arrst[i])
kalmandf = pd.DataFrame(arrkalman, columns=['D_kalman'], index=index)
df = pd.concat([df,kalmandf], axis=1)
df.head() should yield something like this:
A B C D D_kalman
2000-01-01 -0.003156 -1.487031 -1.755621 -0.101233 0.000000
2000-01-02 0.172688 -0.767011 -0.965404 -0.131504 0.000000
2000-01-03 -0.025983 -0.388501 -0.904286 1.062163 0.013633
2000-01-04 -0.846606 -0.576383 -1.066489 -0.041979 0.068792
2000-01-05 -1.505048 0.498062 0.619800 0.012850 0.252550

Pandas DataFrame expand existing dataset to finer timestamp

I am trying to make this piece of code faster, it is failing on conversion of ~120K rows to ~1.7m.
Essentially, I am trying to convert each date stamped entry into 14, representing each DOW from PayPeriodEndingDate to T-14
Does anyone have a better suggestion other than iteruples to do this loop?
Thanks!!
df_Final = pd.DataFrame()
for row in merge4.itertuples():
listX = []
listX.append(row)
df = pd.DataFrame(listX*14)
df = df.reset_index().drop('Index',axis=1)
df['Hours'] = df['Hours']/14
df['AmountPaid'] = df['AmountPaid']/14
df['PayPeriodEnding'] = np.arange(df.loc[:,'PayPeriodEnding'][0] - np.timedelta64(14,'D'), df.loc[:,'PayPeriodEnding'][0], dtype='datetime64[D]')
frames = [df_Final,df]
df_Final = pd.concat(frames,axis=0)
df_Final

Assigning values to dataframe columns

In the below code, the dataframe df5 is not getting populated. I am just assigning the values to dataframe's columns and I have specified the column beforehand. When I print the dataframe, it returns an empty dataframe. Not sure whether I am missing something.
Any help would be appreciated.
import math
import pandas as pd
columns = ['ClosestLat','ClosestLong']
df5 = pd.DataFrame(columns=columns)
def distance(pt1, pt2):
return math.sqrt((pt1[0] - pt2[0])**2 + (pt1[1] - pt2[1])**2)
for pt1 in df1:
closestPoints = [pt1, df2[0]]
for pt2 in df2:
if distance(pt1, pt2) < distance(closestPoints[0], closestPoints[1]):
closestPoints = [pt1, pt2]
df5['ClosestLat'] = closestPoints[1][0]
df5['ClosestLat'] = closestPoints[1][0]
df5['ClosestLong'] = closestPoints[1][1]
print ("Point: " + str(closestPoints[0]) + " is closest to " + str(closestPoints[1]))
From the look of your code, you're trying to populate df5 with a list of latitudes and longitudes. However, you're making a couple mistakes.
The columns of pandas dataframes are Series, and hold some type of sequential data. So df5['ClosestLat'] = closestPoints[1][0] attempts to assign the entire column a single numerical value, and results in an empty column.
Even if the dataframe wasn't ignoring your attempts to assign a real number to the column, you would lose data because you are overwriting the column with each loop.
The Solution: Build a list of lats and longs, then insert into the dataframe.
import math
import pandas as pd
columns = ['ClosestLat','ClosestLong']
df5 = pd.DataFrame(columns=columns)
def distance(pt1, pt2):
return math.sqrt((pt1[0] - pt2[0])**2 + (pt1[1] - pt2[1])**2)
lats, lngs = [], []
for pt1 in df1:
closestPoints = [pt1, df2[0]]
for pt2 in df2:
if distance(pt1, pt2) < distance(closestPoints[0], closestPoints[1]):
closestPoints = [pt1, pt2]
lats.append(closestPoints[1][0])
lngs.append(closestPoints[1][1])
df['ClosestLat'] = pd.Series(lats)
df['ClosestLong'] = pd.Series(lngs)