I've looked everywhere and tried .loc, .apply, and lambda functions, but I still can't figure this out.
I have the UCI congressional voting dataset in a pandas DataFrame, and some of votes 1 to 16 are missing for each Democratic or Republican Congressperson.
So I inserted 16 columns, one after each vote column, called abs.
I want each abs column to be 1 if the corresponding vote column is NaN.
None of the methods above that I read on this site worked for me.
So I have this snippet below, which also does not work, but it might give a hint as to my current attempt using basic iterative Python syntax:
for i in range(16):
    for j in range(len(cvotes['v1'])):
        if cvotes['v{}'.format(i+1)][j] == np.nan:
            cvotes['abs{}'.format(i+1)][j] = 1
        else:
            cvotes['abs{}'.format(i+1)][j] = 0
Any suggestions?
The above currently gives me 1 for abs when the vote value is NaN or 1.
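For reference, `x == np.nan` is always False (NaN compares unequal to everything, including itself), which is likely why the loop never matches the missing values. A vectorized sketch using isna(), with a small made-up frame standing in for cvotes:

```python
import numpy as np
import pandas as pd

# Small stand-in for cvotes, with two vote columns instead of sixteen
cvotes = pd.DataFrame({'v1': [1.0, np.nan, 0.0],
                       'v2': [np.nan, 1.0, 1.0]})

# NaN != NaN, so `== np.nan` never matches; isna() is the vectorized test
for i in range(1, 3):
    cvotes['abs{}'.format(i)] = cvotes['v{}'.format(i)].isna().astype(int)
```

This avoids the row-by-row loop entirely and assigns each abs column in one vectorized step.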
Edit:
I saw the given answer, so I tried this with just one column:
cols = ['v1']
for col in cols:
cvotes = cvotes.join(cvotes[col].add_prefix('abs').isna().astype(int))
but it's giving me an error:
ValueError: columns overlap but no suffix specified: Index(['v1'], dtype='object')
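If it helps: cvotes[col] returns a Series, and Series.add_prefix renames the index rather than the name, so the joined result still carries the column name 'v1' and join() reports the overlap. Selecting with a list, cvotes[[col]], returns a DataFrame, where add_prefix renames the columns. A minimal sketch (note the prefixed name comes out as 'absv1' here, not 'abs1'):

```python
import numpy as np
import pandas as pd

cvotes = pd.DataFrame({'v1': [1.0, np.nan]})

for col in ['v1']:
    # cvotes[[col]] is a DataFrame, so add_prefix renames the column,
    # avoiding the "columns overlap" ValueError on join
    cvotes = cvotes.join(cvotes[[col]].isna().astype(int).add_prefix('abs'))
```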
My dtypes are:
party object
v1 float64
v2 float64
v3 float64
v4 float64
v5 float64
v6 float64
v7 float64
v8 float64
v9 float64
v10 float64
v11 float64
v12 float64
v13 float64
v14 float64
v15 float64
v16 float64
abs1 int64
abs2 int64
abs3 int64
abs4 int64
abs5 int64
abs6 int64
abs7 int64
abs8 int64
abs9 int64
abs10 int64
abs11 int64
abs12 int64
abs13 int64
abs14 int64
abs15 int64
abs16 int64
dtype: object
Let us just do join with add_prefix:
col=[c1,c2...]
s=pd.DataFrame(df[col].values.tolist(),index=df.index)
s.columns=s.columns+1
df=df.join(s.add_prefix('abs').isna().astype(int))
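A runnable version of the above on a small made-up frame, with v1 and v2 standing in for the full v1..v16 list:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': [1.0, np.nan], 'v2': [np.nan, 0.0]})
col = ['v1', 'v2']

s = pd.DataFrame(df[col].values.tolist(), index=df.index)
s.columns = s.columns + 1   # shift 0,1 to 1,2 so the names become abs1, abs2
df = df.join(s.add_prefix('abs').isna().astype(int))
```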
I have no idea why, but when my column in pandas has dtype Float64, I can't run this command:
df['column'].round()
The following error occurs:
AttributeError: 'FloatingArray' object has no attribute 'round'
If I set the dtype to float64 instead, everything works. Could you please explain why?
You may need to check your version of pandas or your data:
df = pd.DataFrame({'column': pd.array([1.1, 2.2])})
>>> type(pd.array([1.1, 2.2]))
pandas.core.arrays.floating.FloatingArray
>>> hasattr(pd.array([1.1, 2.2]), 'round')
True
>>> df
column
0 1.1
1 2.2
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 column 2 non-null Float64 # <- Float64
dtypes: Float64(1)
memory usage: 146.0 bytes
>>> df['column'].round()
0 1.0
1 2.0
Name: column, dtype: Float64
>>> df['column'].astype(float).round()
0 1.0
1 2.0
Name: column, dtype: float64
GitHub links:
https://github.com/pandas-dev/pandas/issues/38844
https://github.com/pandas-dev/pandas/pull/39751
Minimum Pandas version requirement: 1.3.0
Try this:
df['column'].apply(lambda x: round(x))
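Both workarounds side by side, on the assumption that the nullable Float64 dtype is the culprit in an affected pandas version:

```python
import pandas as pd

s = pd.Series([1.1, 2.2], dtype='Float64')   # nullable (capital F) dtype

r1 = s.astype('float64').round()             # cast to the NumPy dtype first
r2 = s.apply(lambda x: round(x))             # elementwise Python round
```

The astype route keeps everything vectorized; apply falls back to Python-level iteration but sidesteps the array method entirely.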
I created a correlation matrix using pandas.DataFrame.corr(method='spearman') and got this result. The RAIN column does not show any correlation with the other columns.
My question is: why are the correlations between RAIN and the other columns blank?
My dataset contains the following columns with their respective datatypes -
PM2.5 float64
PM10 float64
SO2 float64
NO2 float64
CO float64
O3 float64
TEMP float64
PRES float64
DEWP float64
RAIN float64
WSPM float64
dtype: object
A first debugging step would be to check with df.isna().all() whether the RAIN column is all NaN; an all-NaN column produces only NaN correlations and is effectively dropped from the matrix.
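A quick sketch of that check on made-up data, assuming RAIN really is all NaN (a constant, zero-variance column would produce the same blank correlations):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'TEMP': [20.0, 21.0, 19.5],
                   'RAIN': [np.nan, np.nan, np.nan]})

print(df.isna().all())              # RAIN -> True: the column is all NaN
print(df.corr(method='spearman'))   # RAIN's row/column comes out as NaN
```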
I am trying to merge two dataframes (D1 & R1) on two columns (Date & Symbol), but I'm receiving this error: "You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat".
I've been using pd.merge and I've tried different dtypes. I don't want to concatenate these because I just want to add D1 to the right side of R1.
D2 = pd.merge(D1, R1, on=['Date','Symbol'])
D1.dtypes
Date object
Symbol object
High float64
Low float64
Open float64
Close float64
Volume float64
Adj Close float64
pct_change_1D float64
Symbol_above object
NE bool
R1.dtypes
gvkey int64
datadate int64
fyearq int64
fqtr int64
indfmt object
consol object
popsrc object
datafmt object
tic object
curcdq object
datacqtr object
datafqtr object
rdq int64
costat object
ipodate float64
Report_Today int64
Symbol object
Date int64
Ideally, the columns not in the index of R1 (gvkey - Report_Today) will be on the right side of the columns in D1.
Any help is appreciated. Thanks.
From your description of the DataFrames we can see:
In D1, the Date column has type "object".
In R1, the Date column has type "int64".
Make the types of these columns the same and everything will be OK.
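A minimal sketch with made-up Date values (assuming R1's Date is an integer like 20200101 that should be a string to match D1):

```python
import pandas as pd

D1 = pd.DataFrame({'Date': ['20200101'], 'Symbol': ['AAPL'], 'Close': [300.0]})
R1 = pd.DataFrame({'Date': [20200101], 'Symbol': ['AAPL'], 'gvkey': [1690]})

# Cast R1's int64 Date to str so both merge keys have object dtype
R1['Date'] = R1['Date'].astype(str)
D2 = pd.merge(D1, R1, on=['Date', 'Symbol'])
```

After the cast, R1's non-key columns land to the right of D1's columns, as desired.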
I have a dataframe with four columns, with dtypes set up like this (hat tip to ryanjdillon!)
dtypes = np.dtype([
('size', int),
('sum', float),
('mean', float),
('std', float),
])
data = np.empty(0, dtype=dtypes)
df = pd.DataFrame(data)
At this stage, df.dtypes looks like this:
size int64
sum float64
mean float64
std float64
dtype: object
Great so far. But the first time I assign an int value to the 'size' column, e.g.
df.loc['foo', 'size'] = 1
it flips the dtype of the column to float64, and the value is cast, to 1.0 in this case.
size float64
sum float64
mean float64
std float64
dtype: object
Wazzup here?
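For what it's worth, setting a single cell on a not-yet-existing row enlarges the frame, and that enlargement path fills the other columns with NaN and upcasts along the way, which appears to be why 'size' comes out as float64. One workaround (a sketch, not necessarily the only fix) is to assign the full row and then restore the intended dtype with astype:

```python
import numpy as np
import pandas as pd

dtypes = np.dtype([('size', int), ('sum', float),
                   ('mean', float), ('std', float)])
df = pd.DataFrame(np.empty(0, dtype=dtypes))

# Assign the whole row, then put 'size' back to int explicitly;
# the enlargement itself still upcasts, astype undoes it afterwards
df.loc['foo'] = [1, 2.0, 2.0, 0.0]
df = df.astype({'size': int})
```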
I have a pd.DataFrame which contains different dtypes columns. I would like to have the count of columns of each type. I use Pandas 0.24.2.
I tried:
dataframe.dtypes.value_counts()
It worked fine for the other dtypes (float64, object, int64), but for some weird reason it doesn't aggregate the 'category' features, and I get a separate count for each category column (as if they were counted as different dtype values).
I also tried:
dataframe.dtypes.groupby(by=dataframe.dtypes).agg(['count'])
But that raises a
TypeError: data type not understood.
Reproductible example:
import pandas as pd
df = pd.DataFrame([['A','a',1,10], ['B','b',2,20], ['C','c',3,30]], columns = ['col_1','col_2','col_3','col_4'])
df['col_1'] = df['col_1'].astype('category')
df['col_2'] = df['col_2'].astype('category')
print(df.dtypes.value_counts())
Expected result:
int64 2
category 2
dtype: int64
Actual result:
int64 2
category 1
category 1
dtype: int64
Use DataFrame.get_dtype_counts:
print(df.get_dtype_counts())
category 2
int64 2
dtype: int64
But if you use the latest version of pandas, your own solution is recommended instead, since get_dtype_counts is deprecated:
Deprecated since version 0.25.0.
Use .dtypes.value_counts() instead.
As @jezrael mentioned, get_dtype_counts is deprecated in 0.25.0, and dtypes.value_counts() counts the two category dtypes separately (each column's CategoricalDtype compares unequal to the other's), so to fix it do:
print(df.dtypes.astype(str).value_counts())
Output:
int64 2
category 2
dtype: int64