pandas: update specific row with NaN values

I've been browsing around but I cannot find the answer to my particular question.
I have a DataFrame with hundreds of columns and hundreds of rows. I want to replace the NaN values in the first row only with an empty string; the NaNs occurring in other rows should not be modified. This has been answered for changing a column or an entire dataframe, but not a particular row.
I've tried the following:
dataframe.loc[0].replace(np.nan, '', regex=True)
and I also tried with:
dataframe.update(dataframe.loc[0].fillna(''))
but when I call the dataframe, it is not modified. Any help would be greatly appreciated!

Consider the data frame df
np.random.seed([3, 1415])
df = pd.DataFrame(
    np.random.choice([1, np.nan], size=(4, 4)),
    list('WXYZ'), list('ABCD')
)
df
A B C D
W 1.0 NaN 1.0 NaN
X 1.0 1.0 NaN 1.0
Y NaN NaN NaN NaN
Z 1.0 NaN NaN 1.0
If we use a non-scalar, namely something array-like, to select the first row, we get a pd.DataFrame object back; we can then conveniently fillna it and pass the result to pd.DataFrame.update:
df.update(df.iloc[[0]].fillna(''))
df
     A    B    C    D
W  1.0       1.0
X  1.0    1  NaN    1
Y  NaN  NaN  NaN  NaN
Z  1.0  NaN  NaN    1
Notice that I use [0] instead of 0 within the iloc.
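To make the distinction concrete, here is a quick check (my own illustration, using the sample df above) of what each selection returns; update needs the two-dimensional, one-row frame so it can align on both the row label and the columns:

print(type(df.iloc[0]))    # <class 'pandas.core.series.Series'> - a 1-D row, which update cannot align by row label
print(type(df.iloc[[0]]))  # <class 'pandas.core.frame.DataFrame'> - a one-row frame, aligned on index and columns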

Related

Pandas data frame upsampling with interpolation on a non-time series

I have a set of data with inconsistent step sizes between the x values. The idea is to upsample the dataframe so that x has an increment step size of 0.5, and to interpolate the y values.
I have the following df:
x y
10183.2 -40.1
10187.1 -41.0
10191.0 -41.2
10195.0 -41.5
10198.9 -42.0
10202.8 -42.4
10206.8 -42.9
10210.7 -43.4
10214.6 -43.8
10218.6 -44.2
10222.5 -44.4
10226.4 -44.6
10230.4 -44.8
10234.3 -44.9
10238.2 -45.0
10242.2 -45.1
10246.1 -45.2
10250.0 -45.2
10253.9 -45.3
10257.9 -45.4
10261.8 -45.5
10265.7 -45.5
10269.7 -45.6
What I want to achieve is:
x        y
10185    NaN
10186.5  NaN
10187    -40.00
10187.5  NaN
10188    NaN
10188.5  NaN
10189    NaN
10189.5  NaN
10190    NaN
10190.5  NaN
10191    -41.2
10191.5  NaN
10192    NaN
10192.5  NaN
10193    NaN
10193.5  NaN
10194    NaN
10194.5  NaN
10195    NaN
10195.5  NaN
10196    NaN
10196.5  NaN
10197    NaN
Here the NaN values will be interpolated based on the existing points in the original df.
Is there a way to create a new df where the x points are spaced by 0.5 based on the original x data?
I have been looking into resample but this is only used for time series.
Could anyone point me in the right direction?
You can create your new index, reindex on the combination, interpolate, and subset the new rows only:
new_index = np.arange(10185, 10270, 0.5)

(df.set_index('x')
   .reindex(sorted(list(df['x']) + list(new_index)))
   .interpolate()
   .loc[new_index]
   .reset_index()
)
output:
x y
0 10185.0 -40.25
1 10185.5 -40.40
2 10186.0 -40.55
3 10186.5 -40.70
4 10187.0 -40.85
...
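A variant worth noting (my own tweak, not part of the answer above): by default interpolate treats all rows as equally spaced, while method='index' weights the interpolation by the actual x distances, which may be closer to "interpolated based on the existing points". Building the combined index as a deduplicated union also avoids duplicate labels where a new x value coincides with an original one:

import numpy as np
import pandas as pd

# Sketch only; assumes df holds the 'x' and 'y' columns shown above.
new_index = np.arange(10185, 10270, 0.5)
combined = pd.Index(np.union1d(df['x'], new_index), name='x')

upsampled = (
    df.set_index('x')
      .reindex(combined)
      .interpolate(method='index')  # linear in x, not in row position
      .loc[new_index]
      .reset_index()
)
print(upsampled.head())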

Empty copy of Pandas DataFrame

I'm looking for an efficient idiom for creating a new Pandas DataFrame with the same columns and types as an existing DataFrame, but with no rows. The following works, but is presumably much less efficient than it could be, because it has to create a long indexing structure and then evaluate it for each row. I'm assuming that's O(n) in the number of rows, and I would like to find an O(1) solution (that's not too bad to look at).
out = df.loc[np.repeat(False, df.shape[0])].copy()
I have the copy() in there because I honestly have no idea under what circumstances I'm getting a copy or getting a view into the original.
For comparison in R, a nice idiom is to do df[0,], because there's no zeroth row. df[NULL,] also works.
I think the equivalent in pandas would be slicing using iloc
df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [4, 5, 6, 7]})
print(df)
A B
0 0 4
1 1 5
2 2 6
3 3 7
df1 = df.iloc[:0].copy()
print(df1)
Empty DataFrame
Columns: [A, B]
Index: []
df1, the existing DataFrame:
df1 = pd.DataFrame({'x1':[1,2,3], 'x2':[4,5,6]})
df2, the new one, based on the columns in df1:
df2 = pd.DataFrame({}, columns=df1.columns)
For setting the dtypes of the different columns:
for x in df1.columns:
    df2[x] = df2[x].astype(df1[x].dtype)
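If the goal is simply "same columns, same dtypes, no rows", the loop can be collapsed into a single astype call (a shorter equivalent I would suggest, not from the original answer):

df2 = pd.DataFrame(columns=df1.columns).astype(df1.dtypes.to_dict())
print(df2.dtypes)  # x1 int64, x2 int64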
Update: for a copy with no rows, use reindex:
dfcopy = pd.DataFrame().reindex(columns=df.columns)
print(dfcopy)
Output:
Empty DataFrame
Columns: [a, b, c, d, e]
Index: []
We can use reindex_like.
dfcopy = pd.DataFrame().reindex_like(df)
MCVE:
#Create dummy source dataframe
df = pd.DataFrame(np.arange(25).reshape(5,-1), index=[*'ABCDE'], columns=[*'abcde'])
dfcopy = pd.DataFrame().reindex_like(df)
print(dfcopy)
Output:
a b c d e
A NaN NaN NaN NaN NaN
B NaN NaN NaN NaN NaN
C NaN NaN NaN NaN NaN
D NaN NaN NaN NaN NaN
E NaN NaN NaN NaN NaN
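One caveat (my observation, not part of the answer): reindex_like keeps df's rows, just filled with NaN, and the columns come out as float64. If a genuinely row-less copy is wanted, the rows can be sliced off again:

dfcopy = pd.DataFrame().reindex_like(df).iloc[:0]  # same columns, zero rows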
Alternatively, deep copy the original df and drop the index:
# df1 = df.copy(deep=True).drop(df.index)  # if df is small
df1 = df.drop(df.index).copy()             # if df is large and you don't want to copy everything just to discard it
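As a quick comparison of the approaches in this thread (an illustration, not from any single answer): slicing off the rows or dropping the index preserves the original dtypes, while rebuilding from only the column names gives all-object columns:

import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [4.0, 5.0, 6.0, 7.0]})

print(df.iloc[:0].dtypes)                       # A int64, B float64
print(df.drop(df.index).dtypes)                 # A int64, B float64
print(pd.DataFrame(columns=df.columns).dtypes)  # A object, B object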

Adding a dataframe to an existing dataframe at specific rows and columns

I have a loop that on each iteration creates a dataframe (DF) of this form:
DF
ID LCAR RCAR ... LPCA1 LPCA2 RPCA2
0 d0129 312.255859 397.216797 ... 1.098888 1.101905 1.152332
and then adds that dataframe to an existing dataframe (main_exl_df), which has this form:
main_exl_df
ID Date ... COGOTH3 COGOTH3X COGOTH3F
0 d0129 NaN ... NaN NaN NaN
1 d0757 NaN ... 0.0 NaN NaN
2 d2430 NaN ... NaN NaN NaN
3 d3132 NaN ... 0.0 NaN NaN
4 d0371 NaN ... 0.0 NaN NaN
... ... ... ... ... ... ...
2163 d0620 NaN ... 0.0 NaN NaN
2164 d2410 NaN ... 0.0 NaN NaN
2165 d0752 NaN ... NaN NaN NaN
2166 d0407 NaN ... 0.0 NaN NaN
At each iteration, main_exl_df is saved and then loaded again for the next iteration.
I tried
main_exl_df = pd.concat([main_exl_df, DF], axis=1)
but this adds the columns to the right side of main_exl_df each time and does not match rows on the 'ID' column.
How can I add the new dataframe (DF) at the rows with the correct ID and in the right columns?
Merge is the way to go for combining columns in such cases. When you use pd.merge, you need to specify whether the merge is inner, left, right, or outer. Assuming that in this case you want to keep all the rows in main_exl_df, you should merge using:
main_exl_df = main_exl_df.merge(DF, how='left', on='ID')
If you want to keep rows from both the dataframes, use outer as argument value:
main_exl_df = main_exl_df.merge(DF, how='outer', on='ID')
This is what solved the problem in the end (with the help of this answer):
I used the merge function; however, merge created duplicate columns with _x and _y suffixes. To get rid of the _x columns I used this function:
def drop_x(df):
    # list comprehension of the cols that end with '_x'
    to_drop = [x for x in df if x.endswith('_x')]
    df.drop(to_drop, axis=1, inplace=True)
and then merged the two dataframes while replacing the _y suffix with an empty string (selecting the DF columns that are not already in main_exl_df, plus the 'ID' key):
col_to_use = DF.columns.difference(main_exl_df.columns).tolist() + ['ID']
main_exl_df = main_exl_df.merge(DF[col_to_use], on='ID', how='outer', suffixes=('_x', ''))
drop_x(main_exl_df)
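If the only goal is to write DF's values into the rows with the matching ID, and main_exl_df already contains all of DF's columns, a suffix-free alternative could be to align both frames on 'ID' and let DataFrame.update fill in the matching cells (a sketch under that assumption, not the accepted fix):

main_exl_df = main_exl_df.set_index('ID')
main_exl_df.update(DF.set_index('ID'))  # columns of DF missing from main_exl_df are ignored
main_exl_df = main_exl_df.reset_index()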

Compare 2 columns and replace with None if found equal

The following command replaces all values in the matching rows with NaN.
ndf.iloc[np.where(ndf.path3=='sys_bck_20190101.tar.gz')] = np.nan
What I really need to do is to replace the value of a single column called path4 if it matches with column path3. This does not work:
ndf.iloc[np.where(ndf.path3==ndf.path4), ndf.path3] = np.nan
Update:
There is a pandas method "fillna" that can be used with axis = 'columns'.
Is there a similar method to write "NA" values to the duplicate columns?
I can do this, but it does not look pythonic:
ndf.loc[ndf.path1==ndf.path2, 'path1'] = np.nan
ndf.loc[ndf.path2==ndf.path3, 'path2'] = np.nan
ndf.loc[ndf.path3==ndf.path4, 'path3'] = np.nan
ndf.loc[ndf.path4==ndf.filename, 'path4'] = np.nan
Update 2
Let me explain the issue:
Assuming this dataframe:
ndf = pd.DataFrame({
    'path1': [4, 5, 4, 5, 5, 4],
    'path2': [4, 5, 4, 5, 5, 4],
    'path3': list('abcdef'),
    'path4': list('aaabef'),
    'col': list('aaabef')
})
The expected results:
  path1  path2 path3 path4 col
0   NaN    4.0   NaN   NaN   a
1   NaN    5.0     b   NaN   a
2   NaN    4.0     c   NaN   a
3   NaN    5.0     d   NaN   b
4   NaN    5.0   NaN   NaN   e
5   NaN    4.0   NaN   NaN   f
As you can see, this is the reverse of fillna, and I guess there is no easy way to do it in pandas. I have already mentioned the commands I can use; I would like to know if there is a better way to achieve this.
Use:
for c1, c2 in zip(ndf.columns, ndf.columns[1:]):
    ndf.loc[ndf[c1] == ndf[c2], c1] = np.nan

print(ndf)
  path1  path2 path3 path4 col
0   NaN    4.0   NaN   NaN   a
1   NaN    5.0     b   NaN   a
2   NaN    4.0     c   NaN   a
3   NaN    5.0     d   NaN   b
4   NaN    5.0   NaN   NaN   e
5   NaN    4.0   NaN   NaN   f
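The same idea wrapped in a small helper that returns a copy instead of mutating ndf in place (an optional refactor of the loop above, not part of the original answer):

import numpy as np

def nan_out_adjacent_duplicates(df):
    # Blank out a cell when it equals the cell in the column directly to its right.
    out = df.copy()
    for c1, c2 in zip(out.columns, out.columns[1:]):
        out.loc[out[c1] == out[c2], c1] = np.nan
    return out

print(nan_out_adjacent_duplicates(ndf))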

In pandas, how can all columns that do not contain at least one NaN be dropped from a DataFrame?

I have a DataFrame in which some columns have NaN values. I want to drop all columns that do not have at least one NaN value in them.
I am able to identify the NaN values by creating a DataFrame filled with Boolean values (True in place of NaN values, False otherwise):
data.isnull()
Then, I am able to identify the columns that contain at least one NaN value by creating a series of column names with associated Boolean values (True if the column contains at least one NaN value, False otherwise):
data.isnull().any(axis = 0)
When I attempt to use this series to drop the columns that do not contain at least one NaN value, I run into a problem: the columns that do not contain NaN values are dropped:
data = data.loc[:, data.isnull().any(axis = 0)]
How should I do this?
Consider the dataframe df
df = pd.DataFrame([
    [1, 2, None],
    [3, None, 4],
    [5, 6, None]
], columns=list('ABC'))
df
A B C
0 1 2.0 NaN
1 3 NaN 4.0
2 5 6.0 NaN
IIUC:
pandas
dropna with thresh parameter
df.dropna(axis=1, thresh=2)
A B
0 1 2.0
1 3 NaN
2 5 6.0
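Note that thresh is the minimum number of non-NaN values a column needs in order to survive, so thresh=2 is tied to this 3-row example. A form that does not hard-code the row count (my rewording of the same call) would be:

df.dropna(axis=1, thresh=len(df) - 1)  # keep columns with at most one NaN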
loc + boolean indexing
df.loc[:, df.isnull().sum() < 2]
A B
0 1 2.0
1 3 NaN
2 5 6.0
I used the sample DF from @piRSquared's answer.
If you want to "to drop the columns that do not contain at least one NaN value":
In [19]: df
Out[19]:
A B C
0 1 2.0 NaN
1 3 NaN 4.0
2 5 6.0 NaN
In [26]: df.loc[:, df.isnull().any()]
Out[26]:
B C
0 2.0 NaN
1 NaN 4.0
2 6.0 NaN
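For completeness, the same boolean mask can be flipped to get the opposite selection (an illustration on the sample df above):

df.loc[:, df.isnull().any()]  # keep only the columns containing at least one NaN -> B, C
df.loc[:, df.notna().all()]   # keep only the columns containing no NaN at all    -> A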