Resolving error when merging dataframes on two columns - pandas

I am trying to merge two dataframes (D1 & R1) on two columns (Date & Symbol) but I'm receiving this error "You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat".
I've been using pd.merge and I've tried different dtypes. I don't want to concatenate these because I just want to add D1 to the right side of R1.
D2 = pd.merge(D1, R1, on=['Date','Symbol'])
D1.dtypes
Date object
Symbol object
High float64
Low float64
Open float64
Close float64
Volume float64
Adj Close float64
pct_change_1D float64
Symbol_above object
NE bool
R1.dtypes
gvkey int64
datadate int64
fyearq int64
fqtr int64
indfmt object
consol object
popsrc object
datafmt object
tic object
curcdq object
datacqtr object
datafqtr object
rdq int64
costat object
ipodate float64
Report_Today int64
Symbol object
Date int64
Ideally, the columns of R1 that are not merge keys (gvkey through Report_Today) will end up to the right of the columns in D1.
Any help is appreciated. Thanks.

From the dtypes you posted we can see that:
In the D1 DataFrame, column Date has type "object".
In the R1 DataFrame, column Date has type "int64".
Make the types of these columns the same and the merge will work.
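A minimal sketch of one way to align them before merging (that R1's integer dates look like 20020518 is an assumption; adjust the %Y%m%d format to your data):
import pandas as pd
# Assumption: R1['Date'] holds integers like 20020518 and D1['Date'] holds
# parseable date strings; convert both to datetime64 so the keys match.
R1['Date'] = pd.to_datetime(R1['Date'].astype(str), format='%Y%m%d')
D1['Date'] = pd.to_datetime(D1['Date'])
D2 = pd.merge(D1, R1, on=['Date', 'Symbol'])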

How do I filter a dataframe using a boolean mask

The name of the dataframe variable is df
df.dtypes
when run gives this output:
tx_price int64
beds int64
baths int64
lot_size int64
property_type object
exterior_walls object
roof object
basement float64
restaurants int64
groceries int64
nightlife int64
How do I filter df.dtypes using a boolean mask so that I get the following output?
property_type object
exterior_walls object
roof object
dtype: object
You could use a lambda inside .loc:
df.dtypes.loc[lambda x: x == object]
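For example, a small self-contained sketch (values are made up; column names mirror the question's dtypes):
import pandas as pd
df = pd.DataFrame({'tx_price': [295850], 'beds': [1],
                   'property_type': ['Condo'], 'roof': ['Shingle']})
# The lambda receives the dtypes Series; the mask keeps only object columns
print(df.dtypes.loc[lambda x: x == object])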

Missing information in correlation matrix

I created a correlation matrix using pandas.DataFrame.corr(method = 'spearman') and got a result in which the RAIN column does not show any correlation with the other columns.
My question is: why are the correlations between RAIN and the other columns blank?
My dataset contains the following columns with their respective datatypes -
PM2.5 float64
PM10 float64
SO2 float64
NO2 float64
CO float64
O3 float64
TEMP float64
PRES float64
DEWP float64
RAIN float64
WSPM float64
dtype: object
A first debugging step would be to check with df.isna().all() whether the RAIN column is all NaN; a column with no valid values (or one with zero variance) has undefined correlations, which show up as blanks in the matrix.
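A quick check along those lines (a sketch, with df as the dataset from the question):
# All-NaN column: every pairwise correlation with it is undefined (NaN)
print(df['RAIN'].isna().all())
# Zero-variance column: Spearman correlation is also undefined (NaN)
print(df['RAIN'].nunique(dropna=True) <= 1)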

Pandas set column value to 1 if other column value is NaN

I've looked everywhere and tried .loc, .apply, and lambdas, but I still cannot figure this out.
I have the UCI congressional vote dataset in a pandas dataframe and some votes are missing for votes 1 to 16 for each Democrat or Republican Congressperson.
So I inserted 16 columns called abs, one for each vote column.
I want each abs column to be 1 if the corresponding vote column is NaN.
None of the methods I read on this site worked for me.
So I have this snippet below that also does not work, but it might give a hint as to my current attempt using basic iterative Python syntax.
for i in range(16):
    for j in range(len(cvotes['v1'])):
        if cvotes['v{}'.format(i+1)][j] == np.nan:
            cvotes['abs{}'.format(i+1)][j] = 1
        else:
            cvotes['abs{}'.format(i+1)][j] = 0
Any suggestions?
The above currently gives me 1 for abs when the vote value is NaN or 1.
Edit:
I saw the given answer, so I tried this with just one column:
cols = ['v1']
for col in cols:
    cvotes = cvotes.join(cvotes[col].add_prefix('abs').isna().astype(int))
but it's giving me an error:
ValueError: columns overlap but no suffix specified: Index(['v1'], dtype='object')
My dtypes are:
party object
v1 float64
v2 float64
v3 float64
v4 float64
v5 float64
v6 float64
v7 float64
v8 float64
v9 float64
v10 float64
v11 float64
v12 float64
v13 float64
v14 float64
v15 float64
v16 float64
abs1 int64
abs2 int64
abs3 int64
abs4 int64
abs5 int64
abs6 int64
abs7 int64
abs8 int64
abs9 int64
abs10 int64
abs11 int64
abs12 int64
abs13 int64
abs14 int64
abs15 int64
abs16 int64
dtype: object
Let us just do join with add_prefix (assuming df does not already contain the abs columns):
cols = ['v{}'.format(i) for i in range(1, 17)]  # v1 .. v16
s = pd.DataFrame(df[cols].values.tolist(), index=df.index)
s.columns = s.columns + 1  # integer columns 0..15 become 1..16
df = df.join(s.add_prefix('abs').isna().astype(int))
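Equivalently, a more direct sketch using the v1..v16 / abs1..abs16 names from the question:
for i in range(1, 17):
    # 1 where the vote is missing, 0 otherwise
    cvotes['abs{}'.format(i)] = cvotes['v{}'.format(i)].isna().astype(int)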

How to count the number of categorical features with Pandas?

I have a pd.DataFrame which contains columns of different dtypes. I would like to have the count of columns of each type. I use Pandas 0.24.2.
I tried:
dataframe.dtypes.value_counts()
It worked fine for the other dtypes (float64, object, int64), but strangely it doesn't aggregate the 'category' features, and I get a separate count for each category column (as if they were counted as different dtype values).
I also tried:
dataframe.dtypes.groupby(by=dataframe.dtypes).agg(['count'])
But that raises a
TypeError: data type not understood.
Reproducible example:
import pandas as pd
df = pd.DataFrame([['A','a',1,10], ['B','b',2,20], ['C','c',3,30]], columns = ['col_1','col_2','col_3','col_4'])
df['col_1'] = df['col_1'].astype('category')
df['col_2'] = df['col_2'].astype('category')
print(df.dtypes.value_counts())
Expected result:
int64 2
category 2
dtype: int64
Actual result:
int64 2
category 1
category 1
dtype: int64
Use DataFrame.get_dtype_counts:
print (df.get_dtype_counts())
category 2
int64 2
dtype: int64
But if you use the latest version of pandas, your solution is recommended instead, because get_dtype_counts is:
Deprecated since version 0.25.0.
Use .dtypes.value_counts() instead.
As @jezrael mentioned, it is deprecated in 0.25.0, and dtypes.value_counts() gives two separate category entries (CategoricalDtypes with different categories compare unequal), so to fix it cast the dtypes to str:
print(df.dtypes.astype(str).value_counts())
Output:
int64 2
category 2
dtype: int64
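To see why the plain value_counts splits them (a small sketch continuing the example above): CategoricalDtypes with different category sets compare unequal, so they count as distinct values, while their string form is 'category' for both.
print(df['col_1'].dtype == df['col_2'].dtype)            # False: categories differ
print(str(df['col_1'].dtype) == str(df['col_2'].dtype))  # True: both 'category'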

Querying an HDF store

I created an HDF5 file by
hdf=pandas.HDFStore(pfad)
hdf.append('df', df, data_columns=True)
I have a list called expirations that contains numpy.datetime64 values, and I am trying to read into a dataframe the portion of the HDF5 table whose values in column "expiration" lie between expirations[0] and expirations[1]. Entries in the expiration column have the format Timestamp('2002-05-18 00:00:00').
I use the following command:
df=hdf.select('df', where=('expiration<expirations[1] & expiration>=expirations[0]'))
However, I get ValueError: Unable to parse x
How should this be correctly done?
df.dtypes
Out[37]:
adjusted stock close price float64
expiration datetime64[ns]
strike int64
call put object
ask float64
bid float64
volume int64
open interest int64
unadjusted stock price float64
df.info
Out[36]:
<bound method DataFrame.info of
            adjusted stock close price  expiration  strike call put      ask
date
2002-05-16                     5047.00  2002-05-18    4300        C  802.000 ...>
There are more columns, but they aren't of interest for the query.
Problem solved!
I had obtained expirations by
df_expirations = df.drop_duplicates(subset='expiration')
expirations = df_expirations['expiration'].values
The .values accessor changed the entries from pandas Timestamps to NumPy datetime64 values, which the where parser could not handle. I reverted this by keeping the pandas Series instead:
expirations = df_expirations['expiration']
Now this query is working:
del df
df = hdf.select('df', where=('expiration=expirations[1]'))
Thanks for pointing me to the datetime format problem.
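For the original range query, one possible sketch (the where parser can resolve variables defined in the local namespace, so start and end can be referenced by name; the variable names are assumptions):
start = pandas.Timestamp(expirations[0])  # lower bound of the expiration range
end = pandas.Timestamp(expirations[1])    # upper bound
df = hdf.select('df', where='expiration >= start & expiration < end')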