How to compare elements of two pandas DataFrames - pandas

I have two pandas DataFrames of unequal length. I want to find their common indices and then compare the values in a particular column between the two DataFrames.

print(forecast.shape[0], df.shape[0])
# 468 448
# Truncate forecast to the index labels it shares with df
forecast_truncated_index = forecast.index.intersection(df.index)
forecast_truncated = forecast.loc[forecast_truncated_index]
print(forecast_truncated.shape[0], df.shape[0])
# Output: 448 448
# Do element-wise comparison
indices = m.history[m.history['y'] > forecast_truncated['yhat_upper']].index
for index in indices:
    print("Values greater than threshold", m.history['y'][index], "--", m.history['ds'][index])
Note: I am new to pandas and not sure whether there is a more efficient way.
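One possible vectorized approach is sketched below. It assumes both frames share comparable index labels; the frames `history` and `forecast` and their values are made-up stand-ins for `m.history` and `forecast` in the question, not the real data.

```python
import pandas as pd

# Hypothetical stand-ins for m.history and forecast (values are made up).
history = pd.DataFrame(
    {"ds": pd.date_range("2020-01-01", periods=4), "y": [1.0, 5.0, 2.0, 9.0]}
)
forecast = pd.DataFrame({"yhat_upper": [2.0, 3.0, 4.0, 8.0, 7.0, 6.0]})

# Restrict both frames to the index labels they have in common.
common = history.index.intersection(forecast.index)

# Compare the two columns row by row on the shared labels.
mask = history.loc[common, "y"] > forecast.loc[common, "yhat_upper"]

# Keep only the rows where the observed value exceeds the upper bound.
outliers = history.loc[mask[mask].index, ["ds", "y"]]
print(outliers)
```

Selecting both Series with the same `common` labels keeps them identically labeled, so the comparison does not depend on the two frames having the same length.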

Related

Split a large pandas df into n equal parts and store them in a list?

I have a very large df which I am trying to transpose (rows->columns), but I get the following error: 'Unstacked DataFrame is too big, causing int32 overflow'
Instead, I am trying to split my df into n equal-sized parts so that I can transpose them and then concatenate them.
My df is a multi-index df with two levels, so when I split it I would like the split to follow the first index level (level=0). I have tried np.array_split(), but I am unsure whether it splits by row length or index length.
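A minimal sketch of one way to split along the first index level: split the unique level-0 keys into n chunks, then select the rows belonging to each chunk. The frame, the level names `outer`/`inner`, and `n_parts` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical two-level frame: 5 outer keys, 2 inner keys each.
idx = pd.MultiIndex.from_product(
    [list("abcde"), [1, 2]], names=["outer", "inner"]
)
df = pd.DataFrame({"value": np.arange(10)}, index=idx)

n_parts = 3  # assumed number of parts

# Split the unique level-0 keys, not the rows, so no group is cut in half.
outer_keys = df.index.get_level_values(0).unique().to_numpy()
key_chunks = np.array_split(outer_keys, n_parts)

# Select the rows whose level-0 key falls in each chunk.
level0 = df.index.get_level_values(0)
parts = [df[level0.isin(chunk)] for chunk in key_chunks]

print([len(p) for p in parts])
```

Because the split happens on keys, the parts are as equal as possible in number of level-0 groups rather than in number of rows.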

How do I append a column from a numpy array to a pd dataframe?

I have a numpy array of 100 predicted values called first_100. If I convert these to a dataframe they are indexed as 0,1,2 etc. However, the predictions are for values that are in random indexed order, 66,201,32 etc. I want to be able to put the actual values and the predictions in the same dataframe, but I'm really struggling.
The real values are in a dataframe called first_100_train.
I've tried the following:
pd.concat([first_100, first_100_train], axis=1)
This doesn't work and for some reason returns the entire dataframe and indexed from 0 so there are lots of NaNs...
first_100_train['Prediction'] = first_100[0]
This is almost what I want, but again because the indexes are different the data doesn't match up. I'd really appreciate any suggestions.
EDIT: After managing to join the dataframes I now have this:
I'd like to be able to drop the final column...
Here is first_100.head()
and first_100_train.head()
The problem is that index 2 from first_100 actually corresponds to index 480 of first_100_train
Reset both indexes to the default with DataFrame.reset_index and drop=True so that the rows align by position:
pd.concat([first_100.reset_index(drop=True),
           first_100_train.reset_index(drop=True)], axis=1)
Or, if the first DataFrame already has the default RangeIndex, the solution simplifies to:
pd.concat([first_100,
           first_100_train.reset_index(drop=True)], axis=1)
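The positional alignment described above can be sketched as follows. The values, the column name `actual`, and the three-row size are made up for illustration; the second variant, which keeps the original index by assigning the NumPy array directly, is an alternative not mentioned in the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data: predictions as a NumPy array, actual values with a shuffled index.
first_100 = np.array([0.1, 0.2, 0.3])
first_100_train = pd.DataFrame({"actual": [1.0, 2.0, 3.0]}, index=[66, 201, 32])

# Variant 1: drop the original index so both frames align by position.
combined = pd.concat(
    [
        pd.DataFrame(first_100, columns=["Prediction"]),
        first_100_train.reset_index(drop=True),
    ],
    axis=1,
)

# Variant 2: a NumPy array carries no index, so it is assigned by position
# and the original index labels of first_100_train are preserved.
with_pred = first_100_train.assign(Prediction=first_100)

print(combined)
print(with_pred)
```

Both variants rely on the predictions being in the same row order as the training frame.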

What is non-concatenation axis in Pandas?

I am appending a dataframe df1 to a dataframe df2 with both having the same columns but not necessarily in the same order.
df = df1.append(df2)
I am seeing this warning as a result of the above operation.
"FutureWarning: Sorting because non-concatenation axis is not aligned. A future version of pandas will change to not sort by default."
I understand that the resulting dataframe has its columns in alphabetical order after this operation, but I am trying to understand the definition of the non-concatenation axis mentioned in the warning. What is it, and where is it significant in pandas?
It is the other axis—the axis along which you do not concatenate. If you're concatenating along the index (axis=0), then the non-concatenation axis would be 1 (i.e., the columns), and vice versa.
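As a small illustration, the sketch below concatenates two made-up frames along the index (axis=0), so the columns are the non-concatenation axis. It uses pd.concat with an explicit sort argument, which controls whether that axis is sorted.

```python
import pandas as pd

# Hypothetical frames with the same columns in a different order.
df1 = pd.DataFrame({"b": [1], "a": [2]})
df2 = pd.DataFrame({"a": [3], "b": [4]})

# Concatenating along axis=0 stacks rows; the columns are the other axis.
unsorted_cols = pd.concat([df1, df2], axis=0, sort=False)
sorted_cols = pd.concat([df1, df2], axis=0, sort=True)

print(list(unsorted_cols.columns))  # column order of the first frame is kept
print(list(sorted_cols.columns))    # columns are sorted
```

Passing sort explicitly states which behavior is wanted for the column order.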

Resampling/interpolating/extrapolating columns of a pandas dataframe

I am interested in knowing how to interpolate/resample/extrapolate columns of a pandas dataframe for both purely numerical and datetime indices. I'd like to do this with either straightforward linear interpolation or spline interpolation.
Consider first a simple pandas data frame that has a numerical index (signifying time) and a couple of columns:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10,2), index=np.arange(0,20,2))
print(df)
           0         1
0   0.937961  0.943746
2   1.687854  0.866076
4   0.410656 -0.025926
6  -2.042386  0.956386
8   1.153727 -0.505902
10 -1.546215  0.081702
12  0.922419  0.614947
14  0.865873 -0.014047
16  0.225841 -0.831088
18 -0.048279  0.314828
I would like to resample the columns of this dataframe over some denser grid of time indices which possibly extend beyond the last time index (thus requiring extrapolation).
Denote the denser grid of indices as, for example:
t = np.arange(0,40,.6)
The interpolate method of a pandas dataframe seems to fill in only NaNs, and thus requires the new indices (which may or may not coincide with the original indices) to already be part of the dataframe. I guess I could append a dataframe of NaNs at the new indices to the original dataframe (excluding any indices that appear in both), call interpolate, and then remove the original time indices. Or I could do everything in scipy and create a new dataframe at the desired time indices.
Is there a more direct way to do this?
In addition, I'd like to know how to do this same thing when the indices are, in fact, datetimes. That is, when, for example:
df.index = np.array('2015-07-04 02:12:40', dtype=np.datetime64) + np.arange(0,20,2)
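A minimal sketch of the reindex-then-interpolate idea described in the question, under stated assumptions: the data are made up, interpolation is linear with respect to the index values, and the new grid stays inside the original range, so this does not attempt extrapolation.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric-index frame.
df = pd.DataFrame(
    {"a": [0.0, 2.0, 4.0], "b": [10.0, 8.0, 6.0]},
    index=np.array([0.0, 2.0, 4.0]),
)
t = np.array([0.0, 1.0, 3.0, 4.0])  # denser grid inside the original range

# Add the new grid points as NaN rows, interpolate on index values,
# then keep only the rows on the new grid.
union = df.index.union(t)
resampled = df.reindex(union).interpolate(method="index").reindex(t)

# Same idea with a datetime index, using method="time".
start = pd.Timestamp("2015-07-04 02:12:40")
df_dt = pd.DataFrame(
    {"a": [0.0, 2.0, 4.0]},
    index=start + pd.to_timedelta([0, 2, 4], unit="s"),
)
new_times = start + pd.to_timedelta([1, 3], unit="s")
resampled_dt = (
    df_dt.reindex(df_dt.index.union(new_times))
    .interpolate(method="time")
    .reindex(new_times)
)

print(resampled)
print(resampled_dt)
```

Spline interpolation or extrapolation beyond the last index would need a different tool, for example building the new frame from scipy interpolators as the question suggests.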

Does a DataFrame with a single row have all the attributes of a DataFrame?

I am slicing a single row out of a large DataFrame, so the daughter df has only one row. Does a daughter df with a single row have the same attributes as the parent df?
import numpy as np
import pandas as pd
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,2),index=dates,columns=['col1','col2'])
df1 = df.iloc[1]
type(df1)
>> pandas.core.series.Series
df1.columns
>> AttributeError: 'Series' object has no attribute 'columns'
Is there a way I can use all the attributes of pd.DataFrame on a pd.Series?
Possibly what you are looking for is a dataframe with one row:
>>> pd.DataFrame(df1).T  # T -> transpose
                col1      col2
2013-01-02 -0.428913  1.265936
What happens when you do df.iloc[1] is that pandas converts the result to a Series, which is one-dimensional, and the columns become the index. You can still do df1['col1'], but you can't do df1.columns, because a Series is basically a single column, and hence the old columns are now the new index.
As a result, you can retrieve the former columns like this:
>>> df1.index.tolist()
['col1', 'col2']
This used to confuse me quite a bit. I also expected df.iloc[1] to be a dataframe with one row, but it has always been the default behavior of pandas to automatically convert any one-dimensional dataframe slice (whether row or column) to a Series. It's pretty natural for a row, but less so for a column (since the columns become the index), but it really is not a problem once you understand what is happening.
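If a one-row DataFrame is what you want from the start, a sketch of one option is to index with a list, which keeps the result two-dimensional. The frame below mirrors the one in the question; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 2), index=dates, columns=["col1", "col2"])

# A scalar position returns a Series; the former columns become its index.
row_as_series = df.iloc[1]

# A list of positions returns a DataFrame, even when it holds a single row.
row_as_frame = df.iloc[[1]]

# Converting an existing Series back into a one-row DataFrame.
row_via_to_frame = row_as_series.to_frame().T

print(type(row_as_series).__name__, type(row_as_frame).__name__)
print(list(row_as_frame.columns))
```

With `row_as_frame`, attributes such as `.columns` are available as on the parent DataFrame.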