Empty copy of a Pandas DataFrame

I'm looking for an efficient idiom for creating a new Pandas DataFrame with the same columns and types as an existing DataFrame, but with no rows. The following works, but is presumably much less efficient than it could be, because it has to create a long indexing structure and then evaluate it for each row. I'm assuming that's O(n) in the number of rows, and I would like to find an O(1) solution (that's not too bad to look at).
import numpy as np

out = df.loc[np.repeat(False, df.shape[0])].copy()
I have the copy() in there because I honestly have no idea under what circumstances I'm getting a copy or getting a view into the original.
For comparison, in R a nice idiom is df[0,], because R is 1-indexed and there is no zeroth row. df[NULL,] also works.

I think the equivalent in pandas would be slicing using iloc:
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [4, 5, 6, 7]})
print(df)
   A  B
0  0  4
1  1  5
2  2  6
3  3  7
df1 = df.iloc[:0].copy()
print(df1)
Empty DataFrame
Columns: [A, B]
Index: []
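One advantage of the iloc[:0] slice over rebuilding the frame from scratch is that the column dtypes come along for free. A minimal check, reusing df and df1 from above:
print(df.dtypes)   # A    int64
                   # B    int64
print(df1.dtypes)  # same dtypes as df, just with zero rows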

df1 is the existing DataFrame:
df1 = pd.DataFrame({'x1': [1, 2, 3], 'x2': [4, 5, 6]})
df2 is the new one, based on the columns in df1:
df2 = pd.DataFrame({}, columns=df1.columns)
For setting the dtypes of the different columns:
for x in df1.columns:
    df2[x] = df2[x].astype(df1[x].dtype)
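As a more compact sketch of the same idea (assuming a pandas version where astype accepts a mapping of column names to dtypes), the loop collapses into a single call:
df2 = pd.DataFrame(columns=df1.columns).astype(df1.dtypes.to_dict())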

Update: for a copy with the same columns but no rows, use reindex:
dfcopy = pd.DataFrame().reindex(columns=df.columns)
print(dfcopy)
Output:
Empty DataFrame
Columns: [a, b, c, d, e]
Index: []
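One caveat worth checking (a sketch, not a guarantee across pandas versions): the columns here are created from scratch on an empty frame, so the original dtypes are not carried over:
print(dfcopy.dtypes)  # likely object for every column, unlike the iloc[:0] approach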
We can use reindex_like.
dfcopy = pd.DataFrame().reindex_like(df)
MCVE:
# Create dummy source dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape(5, -1), index=[*'ABCDE'], columns=[*'abcde'])
dfcopy = pd.DataFrame().reindex_like(df)
print(dfcopy)
Output:
    a   b   c   d   e
A NaN NaN NaN NaN NaN
B NaN NaN NaN NaN NaN
C NaN NaN NaN NaN NaN
D NaN NaN NaN NaN NaN
E NaN NaN NaN NaN NaN
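Note that reindex_like keeps the full shape, with every cell set to NaN. If the goal is an empty frame, one way to combine it with the slicing idiom from above (a sketch):
dfcopy = pd.DataFrame().reindex_like(df).iloc[:0]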

Alternatively, deep copy the original df and drop its index:
# df1 = df.copy(deep=True).drop(df.index)  # if df is small
df1 = df.drop(df.index).copy()  # if df is large, avoids copying rows only to discard them
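Either way, the result keeps df's columns with zero rows; a quick check (reusing the MCVE df above):
print(df1)         # Empty DataFrame, Columns: [a, b, c, d, e], Index: []
print(df1.dtypes)  # column dtypes carried over from df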

Related

How can a list of pandas DataFrames be merged while keeping duplicate indices?

I have a list of many DataFrames. Each DataFrame is a set of various measurements corresponding to a timestamp. Since many measurements can correspond to the same moment in time, there are many duplicate index entries in the time indices of the DataFrames.
I want to merge this list of DataFrames and obviously to keep the duplicate indices. How can this be done? I have checked this question but the solutions are applicable to the case of merging only two DataFrames, not a list of many DataFrames. The concat functionality apparently cannot handle duplicate indices.
See @HarvIpan's comment: that is correct. You can concat a list of pandas DataFrames:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
df.set_index('a', inplace=True)
df2 = pd.DataFrame({'a': [1, 2, 3], 'b': ['d', 'e', 'f']})
df2.set_index('a', inplace=True)
df3 = pd.DataFrame({'a': [1, 2, 3], 'c': ['g', 'e', 'h']})
df3.set_index('a', inplace=True)

list_of_dfs = [df, df2, df3]
pd.concat(list_of_dfs, sort=False)
     b    c
a
1    a  NaN
2    b  NaN
3    c  NaN
1    d  NaN
2    e  NaN
3    f  NaN
1  NaN    g
2  NaN    e
3  NaN    h
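If you later need to know which frame each row came from, a small extension of the same call (a sketch using concat's keys parameter):
pd.concat(list_of_dfs, keys=['df', 'df2', 'df3'], sort=False)
This adds an outer index level naming the source frame while still keeping the duplicate inner indices.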

pandas update specific row with nan values

I've been browsing around but I cannot find the answer to my particular question.
I have a Dataframe with hundreds of columns and hundreds of rows. I want to change the occurring NaN values only for the first row and replace them with an empty string. This has been answered for changing a column or an entire dataframe, but not a particular row. I also don't want to modify the NaNs occurring in other rows.
I've tried the following:
dataframe.loc[0].replace(np.nan, '', regex=True)
and I also tried with:
dataframe.update(dataframe.loc[0].fillna(''))
but when I call the dataframe, it is not modified. Any help would be greatly appreciated!
Consider the data frame df:
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(
    np.random.choice([1, np.nan], size=(4, 4)),
    list('WXYZ'), list('ABCD')
)
df
     A    B    C    D
W  1.0  NaN  1.0  NaN
X  1.0  1.0  NaN  1.0
Y  NaN  NaN  NaN  NaN
Z  1.0  NaN  NaN  1.0
If we use a non-scalar, i.e. an array-like, to select the first row, we get a pd.DataFrame object back and can conveniently fillna it and pass the result to pd.DataFrame.update:
df.update(df.iloc[[0]].fillna(''))
df
     A    B    C    D
W  1.0       1.0
X  1.0    1  NaN    1
Y  NaN  NaN  NaN  NaN
Z  1.0  NaN  NaN    1
Notice that I use [0] instead of 0 within the iloc.
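The distinction matters because of what each selection returns (a sketch of the difference, reusing df from above):
row = df.iloc[0]    # pd.Series named 'W'; update would treat that name as a column label and find no match
row = df.iloc[[0]]  # one-row pd.DataFrame; update aligns on the row index 'W' and the column labels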

Python Pandas: add two different data frames

I am trying to sum different data frames, say dataframe a, dataframe b, and dataframe c.
Dataframe a is defined within the python code like this:
a = pd.DataFrame(index=range(0, 8), columns=[0])
a.iloc[:, 0] = 0
(the a.iloc[:, 0] = 0 assignment is there to enable arithmetic operations, i.e. to replace the NaN values with zeros)
Dataframe b and Dataframe c are called from an excel sheet like this:
b=pd.read_excel("Test1.xlsx")
c=pd.read_excel("Test2.xlsx")
The excel sheets contain the same number of rows as Dataframe a. The sample is:
10
11
12
13
14
15
16
17
18
19
Now when I try to add, b+c gives fine output, but a+b or a+c give this:
    0  10
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
Why is this happening, even after assigning numbers to DataFrame a? Please help.
Pandas will take care of the indexing for you. You should be able to generate and add dataframes as shown here:
import pandas as pd

a = pd.DataFrame(list(range(8)))
b = pd.DataFrame(list(range(9, 17)))
c = a + b  # elementwise sums: 9, 11, 13, ..., 23
Using the code you provided to generate data produces a dataframe with only zeroes. Nevertheless, even if you generate two of those and add them, you will again get a dataframe with all zeroes.
a = pd.DataFrame(index=range(0,8), columns=[0])
a.iloc[:,0] = 0
b = pd.DataFrame(index=range(0,8), columns=[0])
b.iloc[:,0] = 0
c = a + b # All zeroes
I am also able to add all combinations such as b+c.
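That said, the all-NaN output in the question points at label alignment rather than the values: the printed result has two columns, 0 and 10, which suggests read_excel consumed the first cell (10) as the header row, so b's column label never matches a's column 0, and a + b aligns on labels and yields NaN everywhere. A hedged sketch of the likely fix (assuming the sheets really have no header row):
b = pd.read_excel("Test1.xlsx", header=None)  # columns become 0, 1, ... and align with a
c = pd.read_excel("Test2.xlsx", header=None)
result = a + b  # rows beyond a's 8-row index will still be NaN, since b has 10 rows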

pandas df after fillna() is still NaN

I have 40,000 rows and 1 column of data saved as an Excel file. There are hundreds of null values in it, for example around row 361.
When I carry out df.fillna(method='bfill'), the NaN values are still NaN.
If I slice a df fragment containing null values, it is processed as expected.
I tried but still could not fill the NaN cells.
So what's wrong?
df = pd.read_excel('npp.xlsx')
df.fillna(method='bfill')
print(df.iloc[360:370, ])
Out[122]:
           0
t360     NaN
t361     NaN
t362     NaN
t363     NaN
t364  220.50
t365  228.59
t366     NaN
t367     NaN
t368     NaN
t369     NaN
When fillna() is applied to the sliced df, the NaN values are replaced (t366 to t369 stay NaN only because the next non-NaN value lies outside the slice, so there is nothing to backfill from):
print(df.iloc[360:370, ].fillna(method='bfill'))
           0
t360  220.50
t361  220.50
t362  220.50
t363  220.50
t364  220.50
t365  228.59
t366     NaN
t367     NaN
t368     NaN
t369     NaN
You need to assign the output:
df = pd.read_excel('npp.xlsx')
df = df.fillna(method='bfill')
df = df[df[0].isnull()]  # keep only the rows that are still NaN, to verify the fill worked
print(df)
Empty DataFrame
Columns: [0]
Index: []
Or use the inplace=True parameter:
df = pd.read_excel('npp.xlsx')
df.fillna(method='bfill', inplace=True)
df = df[df[0].isnull()]
print(df)
Empty DataFrame
Columns: [0]
Index: []
Or shorter, since bfill is an alias for fillna(method='bfill'):
df = df.bfill()
# or, in place:
df.bfill(inplace=True)
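As an aside (based on recent pandas behavior, worth verifying for your version): newer releases deprecate the method= argument to fillna, so the bfill spelling is the safer long-term form:
df = pd.read_excel('npp.xlsx')
df = df.bfill()
print(df.isnull().sum())  # any remaining NaN are trailing ones, with nothing after them to fill from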

In pandas, how can all columns that do not contain at least one NaN be dropped from a DataFrame?

I have a DataFrame in which some columns have NaN values. I want to drop all columns that do not have at least one NaN value in them.
I am able to identify the NaN values by creating a DataFrame filled with Boolean values (True in place of NaN values, False otherwise):
data.isnull()
Then, I am able to identify the columns that contain at least one NaN value by creating a series of column names with associated Boolean values (True if the column contains at least one NaN value, False otherwise):
data.isnull().any(axis = 0)
When I attempt to use this series to drop the columns that do not contain at least one NaN value, I run into a problem: the columns that do not contain NaN values are dropped:
data = data.loc[:, data.isnull().any(axis = 0)]
How should I do this?
Consider the dataframe df:
df = pd.DataFrame([
    [1, 2, None],
    [3, None, 4],
    [5, 6, None]
], columns=list('ABC'))
df
   A    B    C
0  1  2.0  NaN
1  3  NaN  4.0
2  5  6.0  NaN
IIUC:
pandas dropna with the thresh parameter:
df.dropna(axis=1, thresh=2)
   A    B
0  1  2.0
1  3  NaN
2  5  6.0
loc + boolean indexing:
df.loc[:, df.isnull().sum() < 2]
   A    B
0  1  2.0
1  3  NaN
2  5  6.0
I used the sample DF from @piRSquared's answer.
If you want to "to drop the columns that do not contain at least one NaN value":
In [19]: df
Out[19]:
   A    B    C
0  1  2.0  NaN
1  3  NaN  4.0
2  5  6.0  NaN

In [26]: df.loc[:, df.isnull().any()]
Out[26]:
     B    C
0  2.0  NaN
1  NaN  4.0
2  6.0  NaN
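For completeness, the complementary selection (keeping only the columns with no NaN at all) is the same pattern with the condition flipped; a small sketch:
df.loc[:, df.notnull().all()]  # keeps only A, since B and C each contain a NaN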