I have a loop that, on each iteration, creates a dataframe (DF) of this form:
DF
ID LCAR RCAR ... LPCA1 LPCA2 RPCA2
0 d0129 312.255859 397.216797 ... 1.098888 1.101905 1.152332
and then adds that dataframe to an existing dataframe (main_exl_df) with this form:
main_exl_df
ID Date ... COGOTH3 COGOTH3X COGOTH3F
0 d0129 NaN ... NaN NaN NaN
1 d0757 NaN ... 0.0 NaN NaN
2 d2430 NaN ... NaN NaN NaN
3 d3132 NaN ... 0.0 NaN NaN
4 d0371 NaN ... 0.0 NaN NaN
... ... ... ... ... ... ...
2163 d0620 NaN ... 0.0 NaN NaN
2164 d2410 NaN ... 0.0 NaN NaN
2165 d0752 NaN ... NaN NaN NaN
2166 d0407 NaN ... 0.0 NaN NaN
At each iteration, main_exl_df is saved and then loaded again for the next iteration.
I tried
main_exl_df = pd.concat([main_exl_df, DF], axis=1)
but this appends the columns to the right side of main_exl_df on every iteration and does not align the rows on the 'ID' column.
How can I specify that the new dataframe (DF) should be added at the row with the matching ID and into the right columns?
Merge is the way to go for combining columns in such cases. When you use pd.merge, you need to specify how the join behaves: inner, left, right or outer. Assuming that in this case you want to keep all the rows in main_exl_df, you should merge using:
main_exl_df = main_exl_df.merge(DF, how='left', on='ID')
If you want to keep the rows from both dataframes, use outer instead:
main_exl_df = main_exl_df.merge(DF, how='outer', on='ID')
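For illustration, here is a minimal toy example (the IDs and values are made up, not taken from the real data) showing the difference between the two options:
import pandas as pd
left = pd.DataFrame({'ID': ['d0129', 'd0757'], 'COGOTH3': [None, 0.0]})
right = pd.DataFrame({'ID': ['d0129', 'd9999'], 'LCAR': [312.26, 400.0]})
left.merge(right, how='left', on='ID')   # keeps d0129 and d0757; d9999 from the right frame is dropped
left.merge(right, how='outer', on='ID')  # keeps all three IDs, with NaN where a frame has no data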
This is what solved the problem in the end (with the help of this answer):
I used the merge function, but merge created duplicate columns with _x and _y suffixes. To get rid of the columns carrying the _x suffix I used this function:
def drop_x(df):
    # list comprehension of the cols that end with '_x'
    to_drop = [x for x in df if x.endswith('_x')]
    df.drop(to_drop, axis=1, inplace=True)
and then merged the two dataframes, replacing the _y suffix with an empty string so the columns coming from DF keep their original names:
col_to_use = list(DF.columns.difference(main_exl_df.columns)) + ['ID']  # DF columns not already in main_exl_df, plus the 'ID' key
main_exl_df = main_exl_df.merge(DF[col_to_use], on='ID', how='outer', suffixes=('_x', ''))
drop_x(main_exl_df)
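A different way to attack the original loop problem, not part of this answer, is DataFrame.combine_first, which aligns on the index and fills the NaN cells of main_exl_df with the values from DF, adding any new columns along the way; this avoids the duplicate-column issue entirely. A sketch, assuming main_exl_df is kept in a hypothetical Excel file 'main_exl.xlsx' between iterations:
import pandas as pd
main_exl_df = pd.read_excel('main_exl.xlsx')          # hypothetical file holding main_exl_df
main_exl_df = (main_exl_df.set_index('ID')
                          .combine_first(DF.set_index('ID'))
                          .reset_index())
main_exl_df.to_excel('main_exl.xlsx', index=False)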
I have some unknown DataFrame that can be of any size and shape, for example:
first1 first2 first3 first4
a NaN 22 56.0 65
c 380.0 40 NaN 66
b 390.0 50 80.0 64
My objective is to delete every column and row that contains a NaN value.
In this specific case, the output should be:
first2 first4
b 50 64
Also, I need to preserve the option to use "all" like in pandas.DataFrame.dropna, meaning that when the argument "all" is passed, a column or a row must be dropped only if all of its values are missing.
When I tried the following code:
def dropna_mta_style(df, how='any'):
    new_df = df.dropna(axis=0, how=how).dropna(axis=1, how=how)
    return new_df
It obviously didn't work: it drops the rows first, so by the time it searches for columns with NaNs, those NaNs have already been removed along with the dropped rows.
Thanks in advance!
P.S.: for and while loops, Python built-in functions that act on iterables (all, any, map, ...), and list and dictionary comprehensions shouldn't be used.
Solution intended for readability:
rows = df.dropna(axis=0).index
cols = df.dropna(axis=1).columns
df = df.loc[rows, cols]
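If you also need the how='all' behaviour asked for in the question, the same idea extends naturally; a minimal sketch:
def dropna_mta_style(df, how='any'):
    # compute both selections from the original frame, then apply them together
    rows = df.dropna(axis=0, how=how).index
    cols = df.dropna(axis=1, how=how).columns
    return df.loc[rows, cols]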
Would something like this work?
df.dropna(axis=1,how='any').loc[df.dropna(axis=0,how='any').index]
(Meaning we take the index of all rows that contain no NaN, via df.dropna(axis=0, how='any').index, and then use it to select those rows from the frame in which every column containing at least one NaN has been dropped.)
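Stepping through it on the example frame from the question:
df.dropna(axis=0, how='any').index   # Index(['b'], dtype='object'), the only row without NaN
df.dropna(axis=1, how='any')         # keeps only first2 and first4, the columns without NaN
df.dropna(axis=1, how='any').loc[df.dropna(axis=0, how='any').index]
#    first2  first4
# b      50      64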
This should remove all such rows and columns dynamically:
df['Check'] = df.isna().any(axis=1)      # flag rows that contain at least one NaN
df = df.dropna(axis=1)                   # drop columns that contain any NaN ('Check' itself has none)
df = df.loc[df['Check'] == False]        # keep only the rows that were not flagged
df.drop('Check', axis=1, inplace=True)   # remove the helper column
df
def dropna_mta_style(df, how='any'):
    if how == 'all':
        # columns where every value is NaN
        null_col = df.isna().all(axis=0).to_frame(name='col')
        col_names = null_col[null_col['col'] == True].index
        # rows where every value is NaN
        null_row = df.isna().all(axis=1).to_frame(name='row')
        row_index = null_row[null_row['row'] == True].index
        if len(row_index) > 0:
            new_df = df.drop(columns=col_names)
        else:
            new_df = df
    else:
        new_df = df.dropna(axis=0, how=how).dropna(axis=1, how=how)
    return new_df
Here is a breakdown of the change made to the function.
BEFORE:
first1 first2 first3 first4 first5
a NaN 22.0 NaN 65.0 NaN
c 380.0 40.0 NaN 66.0 NaN
b 390.0 50.0 NaN 64.0 NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN
Find the columns that are entirely null:
null_col = df.isna().all(axis=0).to_frame(name='col')
col_names = null_col[null_col['col'] == True].index
col_names
Index(['first3', 'first5'], dtype='object')
Find the rows that are entirely null:
null_row = df.isna().all(axis=1).to_frame(name='row')
row_index = null_row[null_row['row'] == True].index
row_index
Index([3, 4, 5, 6], dtype='object')
if len(row_index) > 0:
    df2 = df.drop(columns=col_names)
df2
AFTER:
first1 first2 first4
a NaN 22.0 65.0
c 380.0 40.0 66.0
b 390.0 50.0 64.0
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
Incorporating this into your method gives the function shown above.
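A more compact variant, not part of the answer above, that handles both how='any' and how='all' and also drops the all-NaN rows, so it mirrors dropna semantics for rows and columns alike; a sketch built on boolean masks of the original frame:
def dropna_mta_style(df, how='any'):
    if how == 'all':
        row_mask = df.notna().any(axis=1)   # keep rows that have at least one value
        col_mask = df.notna().any(axis=0)   # keep columns that have at least one value
    else:
        row_mask = df.notna().all(axis=1)   # keep rows with no NaN at all
        col_mask = df.notna().all(axis=0)   # keep columns with no NaN at all
    return df.loc[row_mask, col_mask]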
I have a set of data with inconsistent step sizes between the x values. The idea is to upsample the dataframe so that x has a step size of 0.5, and to interpolate the y values.
I have the following df:
x y
10183.2 -40.1
10187.1 -41.0
10191.0 -41.2
10195.0 -41.5
10198.9 -42.0
10202.8 -42.4
10206.8 -42.9
10210.7 -43.4
10214.6 -43.8
10218.6 -44.2
10222.5 -44.4
10226.4 -44.6
10230.4 -44.8
10234.3 -44.9
10238.2 -45.0
10242.2 -45.1
10246.1 -45.2
10250.0 -45.2
10253.9 -45.3
10257.9 -45.4
10261.8 -45.5
10265.7 -45.5
10269.7 -45.6
What I want to achieve is:
x y
10185 NaN
10186.5 NaN
10187 -40.00
10187.5 NaN
10188 NaN
10188.5 NaN
10189 NaN
10189.5 NaN
10190 NaN
10190.5 NaN
10191 -41.2
10191.5 NaN
10192 NaN
10192.5 NaN
10193 NaN
10193.5 NaN
10194 NaN
10194.5 NaN
10195 NaN
10195.5 NaN
10196 NaN
10196.5 NaN
10197 NaN
Where the NaN values will be interpolated based on the existing points in the original df.
Is there a way to create a new df where the x points are spaced by 0.5 based on the original x data?
I have been looking into resample, but that is only meant for time series.
Could anyone point me in the right direction?
You can create your new index, reindex on the combination, interpolate, and subset the new rows only:
new_index = np.arange(10185, 10270, 0.5)
(df.set_index('x')
.reindex(sorted(list(df['x'])+list(new_index)))
.interpolate()
.loc[new_index]
.reset_index()
)
output:
x y
0 10185.0 -40.25
1 10185.5 -40.40
2 10186.0 -40.55
3 10186.5 -40.70
4 10187.0 -40.85
...
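A closely related variant, in case some of the original x values already fall on the 0.5 grid (which would otherwise produce duplicated index labels), builds the combined index with np.union1d and interpolates by index value rather than by position. A sketch, assuming the same df and new_index as above:
import numpy as np
combined = np.union1d(df['x'].to_numpy(), new_index)   # sorted union, no duplicates
out = (df.set_index('x')
         .reindex(combined)
         .interpolate(method='index')   # interpolate in proportion to the x spacing
         .loc[new_index]
         .reset_index())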
I have the following dataframe in pandas
datadate fyear ebit glp ibc ... ind status year month a_date
gvkey ...
7767 20130831 NaN NaN NaN NaN ... 0 1 2013.0 8.0 0
10871 20110930 NaN NaN NaN NaN ... 0 1 2011.0 9.0 0
15481 20110930 NaN NaN NaN NaN ... 0 1 2011.0 9.0 0
15582 19821031 NaN NaN NaN NaN ... 1 1 1982.0 10.0 0
15582 19831031 NaN NaN NaN NaN ... 1 1 1983.0 10.0 0
... ... ... ... ... ... ... ... ... ... ...
282553 20071231 NaN NaN NaN NaN ... 0 1 2007.0 12.0 0
282553 20081231 NaN NaN NaN NaN ... 0 1 2008.0 12.0 0
282553 20091231 NaN NaN NaN NaN ... 0 1 2009.0 12.0 0
294911 20150930 NaN NaN NaN NaN ... 0 1 2015.0 9.0 0
321467 20161231 NaN NaN NaN NaN ... 0 1 2016.0 12.0 0
I want to run the following command to assign the year value to the column a_date when month is at least 6 (please disregard the NaNs in the dataframe):
df.iloc[(df['month']>=6).values,-1]=df.iloc[(df['month']>=6).values,-3]
but I get the error
ValueError: Must have equal len keys and value when setting with an iterable
How do I proceed, then? I really cannot see why I get this error. I googled and found some solutions to the same ValueError, but they do not apply to my case. I would like to avoid using dictionaries and keep everything in one line if possible. I know I could solve this with a loop, but I am looking for a more efficient solution.
I think the error comes from the iloc call on the right-hand side of your line (after the =), because it returns a Series and not a single value. So you are assigning a Series where scalar values are expected, which to me is the source of the error. Using pandas, I would write the code as:
df.loc[df['month'] >= 6, 'a_date'] = df['year']
The loc function lets you select a group of rows according to a condition (here df['month'] >= 6), the column to change (here 'a_date'), and the value you want to assign (here another column of the dataframe, df['year']).
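If you prefer to keep the original iloc form, the assignment also works once the right-hand side is turned into a plain array, so that pandas does not try to align a Series; a sketch assuming, as in the question, that the last column is a_date and the third-from-last is year:
mask = (df['month'] >= 6).values
df.iloc[mask, -1] = df.iloc[mask, -3].values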
I found an efficient solution myself using np.where:
df['a_date'] = np.where(df['month'] >= 6, df['year'], df['year'] - 1)
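A small variant of the same idea, in case you want to leave a_date untouched for the rows where the condition is false, is to pass the existing column as the third argument:
df['a_date'] = np.where(df['month'] >= 6, df['year'], df['a_date'])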
The following command replaces all values in the matching rows with NaN.
ndf.iloc[np.where(ndf.path3=='sys_bck_20190101.tar.gz')] = np.nan
What I really need is to replace the value of a single column (path4) when it matches column path3. This does not work:
ndf.iloc[np.where(ndf.path3==ndf.path4), ndf.path3] = np.nan
Update:
There is a pandas method fillna that can be used with axis='columns'.
Is there a similar method to write NA values to the duplicate columns?
I can do this, but it does not look pythonic:
ndf.loc[ndf.path1==ndf.path2, 'path1'] = np.nan
ndf.loc[ndf.path2==ndf.path3, 'path2'] = np.nan
ndf.loc[ndf.path3==ndf.path4, 'path3'] = np.nan
ndf.loc[ndf.path4==ndf.filename, 'path4'] = np.nan
Update 2
Let me explain the issue:
Assuming this dataframe:
ndf = pd.DataFrame({
'path1':[4,5,4,5,5,4],
'path2':[4,5,4,5,5,4],
'path3':list('abcdef'),
'path4':list('aaabef'),
'col':list('aaabef')
})
The expected results:
path1 path2 path3 path4 col
0 NaN 4.0 NaN NaN a
1 NaN 5.0 b NaN a
2 NaN 4.0 c NaN a
3 NaN 5.0 d NaN b
4 NaN 5.0 NaN NaN e
5 NaN 4.0 NaN NaN f
As you can see, this is the reverse of fillna, and I guess there is no easy way to do it in pandas. I have already mentioned the commands I can use; I would like to know if there is a better way to achieve this.
Use:
for c1, c2 in zip(ndf.columns, ndf.columns[1:]):
    # set c1 to NaN wherever its value equals the value of the next column c2
    ndf.loc[ndf[c1] == ndf[c2], c1] = np.nan
print(ndf)
path1 path2 path3 path4 col
0 NaN 4.0 NaN NaN a
1 NaN 5.0 b NaN a
2 NaN 4.0 c NaN a
3 NaN 5.0 d NaN b
4 NaN 5.0 NaN NaN e
5 NaN 4.0 NaN NaN f
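The same effect can be had without an explicit loop by comparing the frame with a copy of itself shifted one column to the left; a sketch (the dtypes of untouched columns may come out slightly different from the output above):
# NaN-out every cell that equals the cell in the next column to its right
ndf = ndf.mask(ndf.eq(ndf.shift(-1, axis=1)))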
I've been browsing around but I cannot find the answer to my particular question.
I have a DataFrame with hundreds of columns and hundreds of rows. I want to replace the NaN values occurring only in the first row with an empty string. This has been answered for changing a column or an entire dataframe, but not for a particular row. I also don't want to modify the NaNs occurring in other rows.
I've tried the following:
dataframe.loc[0].replace(np.nan, '', regex=True)
and I also tried with:
dataframe.update(dataframe.loc[0].fillna(''))
but when I call the dataframe, it is not modified. Any help would be greatly appreciated!
Consider the data frame df
np.random.seed([3, 1415])
df = pd.DataFrame(
np.random.choice([1, np.nan], size=(4, 4)),
list('WXYZ'), list('ABCD')
)
df
A B C D
W 1.0 NaN 1.0 NaN
X 1.0 1.0 NaN 1.0
Y NaN NaN NaN NaN
Z 1.0 NaN NaN 1.0
If we use a non-scalar, namely something array-like, to select the first row, we get a pd.DataFrame object back, and we can conveniently fillna it and pass the result to pd.DataFrame.update:
df.update(df.iloc[[0]].fillna(''))
df
A B C D
W 1.0 1.0
X 1.0 1 NaN 1
Y NaN NaN NaN NaN
Z 1.0 NaN NaN 1
Notice that I use [0] instead of 0 within the iloc: the list selector returns a one-row DataFrame rather than a Series, which lets update align on both the index and the columns.