Python Pandas: add two different data frames

I am trying to sum different data frames, say dataframe a, dataframe b, and dataframe c.
Dataframe a is defined within the python code like this:
a=pd.DataFrame(index=range(0,8), columns=[0])
a.iloc[:,0]=0
(the assignment a.iloc[:,0] = 0 is there to enable arithmetic operations, i.e., it replaces the NaN values with zeros)
Dataframe b and Dataframe c are called from an excel sheet like this:
b=pd.read_excel("Test1.xlsx")
c=pd.read_excel("Test2.xlsx")
The excel sheets contain the same number of rows as Dataframe a. The sample is:
10
11
12
13
14
15
16
17
18
19
Now when I add them, b+c gives fine output, but a+b or a+c gives this:
0 10
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
Why is this happening, even after assigning numbers to dataframe a? Please help.

Pandas will take care of the indexing for you. You should be able to generate and add dataframes as shown here:
import pandas as pd
a = pd.DataFrame(list(range(8)))
b = pd.DataFrame(list(range(9,17)))
c = a + b
Using the code you provided to generate data produces a dataframe with only zeroes. Moreover, even if you generate two of those and add them, you will again get a dataframe with all zeroes rather than NaN.
a = pd.DataFrame(index=range(0,8), columns=[0])
a.iloc[:,0] = 0
b = pd.DataFrame(index=range(0,8), columns=[0])
b.iloc[:,0] = 0
c = a + b # All zeroes
I am also able to add all combinations such as b+c.
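A likely cause (an assumption, since the Excel files themselves aren't shown): pd.read_excel treats the first row as the header, so b's single column ends up labeled 10 rather than 0. Addition aligns on both index and column labels, so a (column 0) plus b (column 10) produces only NaN columns, which matches the "0 10" header in the output above. A minimal sketch of the mismatch and a fix:

```python
import pandas as pd

a = pd.DataFrame(index=range(8), columns=[0])
a.iloc[:, 0] = 0

# Simulate what read_excel does without header=None: the first
# cell (10) becomes the column label, leaving 9 data rows.
b = pd.DataFrame({10: range(11, 20)})

misaligned = a + b                      # columns 0 and 10 never match
print(misaligned.isna().all().all())    # True

# Fix: give b the same column label (with a real file you would
# instead pass header=None to read_excel).
b.columns = [0]
print((a + b).iloc[0, 0])               # 0 + 11 -> 11
```

With an actual file, `pd.read_excel("Test1.xlsx", header=None)` keeps the first row as data and labels the column 0, so alignment works without renaming.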

Related

How to keep all values from a dataframe except where NaN is present in another dataframe?

I am new to Pandas and I am stuck at this specific problem where I have 2 DataFrames in Pandas, e.g.
>>> df1
A B
0 1 9
1 2 6
2 3 11
3 4 8
>>> df2
A B
0 NaN 0.05
1 NaN 0.05
2 0.16 NaN
3 0.16 NaN
What I am trying to achieve is to retain all values from df1 except where there is a NaN in df2 i.e.
>>> df3
A B
0 NaN 9
1 NaN 6
2 3 NaN
3 4 NaN
I am talking about dfs with 10,000 rows each so I can't do this manually. Also indices and columns are the exact same in each case. I also have no NaN values in df1.
As far as I understand df.update() will either overwrite all values including NaN or update only those that are NaN.
You can use boolean masking with DataFrame.notna:
# df2 = df2.astype(float)  # This is needed if your dtypes are not floats.
m = df2.notna()
df1[m]
A B
0 NaN 9.0
1 NaN 6.0
2 3.0 NaN
3 4.0 NaN
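A self-contained version of the masking approach, using the column names and values from the question:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [9, 6, 11, 8]})
df2 = pd.DataFrame({'A': [np.nan, np.nan, 0.16, 0.16],
                    'B': [0.05, 0.05, np.nan, np.nan]})

# Keep df1's values only where df2 is not NaN; everywhere df2 is
# NaN the mask is False, so df1's value is masked out to NaN.
df3 = df1[df2.notna()]
print(df3)
```

Indexing a DataFrame with a same-shaped boolean DataFrame keeps values where the mask is True and inserts NaN elsewhere, which is exactly the requested behavior.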

Empty copy of Pandas DataFrame

I'm looking for an efficient idiom for creating a new Pandas DataFrame with the same columns and types as an existing DataFrame, but with no rows. The following works, but is presumably much less efficient than it could be, because it has to create a long indexing structure and then evaluate it for each row. I'm assuming that's O(n) in the number of rows, and I would like to find an O(1) solution (that's not too bad to look at).
out = df.loc[np.repeat(False, df.shape[0])].copy()
I have the copy() in there because I honestly have no idea under what circumstances I'm getting a copy or getting a view into the original.
For comparison in R, a nice idiom is to do df[0,], because there's no zeroth row. df[NULL,] also works.
I think the equivalent in pandas would be slicing with iloc:
df = pd.DataFrame({'A' : [0,1,2,3], 'B' : [4,5,6,7]})
print(df)
A B
0 0 4
1 1 5
2 2 6
3 3 7
df1 = df.iloc[:0].copy()
print(df1)
Empty DataFrame
Columns: [A, B]
Index: []
df1, the existing DataFrame:
df1 = pd.DataFrame({'x1':[1,2,3], 'x2':[4,5,6]})
df2, the new one, based on the columns in df1:
df2 = pd.DataFrame({}, columns=df1.columns)
For setting the dtypes of the different columns:
for x in df1.columns:
    df2[x] = df2[x].astype(df1[x].dtype)
Update: to create it with no rows, use reindex:
dfcopy = pd.DataFrame().reindex(columns=df.columns)
print(dfcopy)
Output:
Empty DataFrame
Columns: [a, b, c, d, e]
Index: []
We can use reindex_like.
dfcopy = pd.DataFrame().reindex_like(df)
MCVE:
#Create dummy source dataframe
df = pd.DataFrame(np.arange(25).reshape(5,-1), index=[*'ABCDE'], columns=[*'abcde'])
dfcopy = pd.DataFrame().reindex_like(df)
print(dfcopy)
Output:
a b c d e
A NaN NaN NaN NaN NaN
B NaN NaN NaN NaN NaN
C NaN NaN NaN NaN NaN
D NaN NaN NaN NaN NaN
E NaN NaN NaN NaN NaN
You can also deep-copy the original df and drop all of its rows by index:
df1 = df.drop(df.index).copy()  # if df is large and you don't want to copy rows only to discard them
# df1 = df.copy(deep=True).drop(df.index)  # alternative if df is small
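Of the approaches above, df.iloc[:0].copy() and df.drop(df.index) preserve the original column dtypes, while building a fresh DataFrame from just the column names starts every column as object; a quick check (a sketch with made-up columns):

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [4.0, 5.0, 6.0, 7.0]})

empty_slice = df.iloc[:0].copy()               # keeps int64/float64 dtypes
empty_drop = df.drop(df.index)                 # also keeps dtypes
empty_cols = pd.DataFrame(columns=df.columns)  # dtypes fall back to object

print(list(empty_slice.dtypes) == list(df.dtypes))    # True
print(all(dt == object for dt in empty_cols.dtypes))  # True
```

This is why the dtype-copying loop in the answer above is needed when you build the empty frame from `df1.columns` alone.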

What's wrong with the data? It shows NaN values despite the cells having data, and doesn't display the proper labels

I have the following dataset in my CSV file, shown in the following picture:
[picture: the dataset I am working with]
I am reading the file with pandas as given below.
import pandas as pd
data = pd.read_csv('train.csv',encoding='latin',low_memory=False)
print(data.head(10))
And it gives this output..
id ... Unnamed: 685
0 0 ... NaN
1 1 ... NaN
2 2 ... NaN
3 3 ... NaN
4 4 ... NaN
5 5 ... NaN
6 6 ... NaN
7 7 ... NaN
8 8 ... NaN
9 9 ... NaN
[10 rows x 686 columns]
Process finished with exit code 0
I don't know what I am doing wrong.
Does your dataset really have 686 columns? If not, the file probably contains trailing delimiters or blank columns. Rectify the format, if necessary.
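If the extra columns come from stray trailing commas in each row, pandas names them Unnamed: N; one way (a sketch, assuming the real columns have proper headers) to drop them after reading:

```python
import io
import pandas as pd

# Simulate a CSV whose header and rows end with stray trailing commas.
csv_text = "id,value,,\n0,10,,\n1,11,,\n"
data = pd.read_csv(io.StringIO(csv_text))

print(list(data.columns))  # ['id', 'value', 'Unnamed: 2', 'Unnamed: 3']

# Drop the auto-generated placeholder columns.
data = data.loc[:, ~data.columns.str.startswith('Unnamed')]
print(list(data.columns))  # ['id', 'value']
```

Alternatively, `usecols` in `pd.read_csv` lets you name exactly the columns you want, so the placeholders are never created.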
If the file is really an Excel workbook that was just renamed to .csv, try reading it as Excel instead (note there is no pd.read_xlsx; the function is pd.read_excel, and it does not accept the encoding or low_memory arguments). Hope this helps:
import pandas as pd
data = pd.read_excel('train.xlsx')  # assumes the file is re-saved as a real .xlsx
print(data.head(10))
Or, try saving the file in proper CSV format. You have to take care of escape characters.

Pandas rolling sum of prior n elements with NaN values

I have a pandas Series named df which looks like:
NaN
2
3
NaN
NaN
4
6
4
8
I would like to calculate the rolling sum only when there are five prior elements; if there are fewer than five prior elements, the output should be NaN.
When the five-element window contains some NaN elements, the NaNs should be treated as zeros.
I tried
df.rolling(window=5).sum()
But I get only NaN, which is not what I am looking for.
I also tried min_periods=1 (suggested in many Stack Overflow posts), but it does not produce the result I want either.
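One way to get the described behavior (a sketch): fill the NaNs with zero first, then take a plain 5-element rolling sum. rolling leaves the first four positions NaN on its own, because a full window isn't available there yet:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2, 3, np.nan, np.nan, 4, 6, 4, 8])

# NaN counts as zero inside the window; windows shorter than 5 stay NaN.
result = s.fillna(0).rolling(window=5).sum()
print(result.tolist())
# [nan, nan, nan, nan, 5.0, 9.0, 13.0, 14.0, 22.0]
```

If "prior" means strictly before the current element (excluding it), shift the result down by one position afterwards with `result.shift(1)`.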

Pandas DataFrame + object type + HDF + PyTables 'table'

(Editing to clarify my application, sorry for any confusion)
I run an experiment broken up into trials. Each trial can produce invalid data or valid data. When there is valid data the data take the form of a list of numbers which can be of zero length.
So an invalid trial produces None and a valid trial can produce [] or [1,2] etc etc.
Ideally, I'd like to be able to save this data as a frame_table (call it data). I have another table (call it trials) that is easily converted into a frame_table and which I use as a selector to extract rows (trials). I would then like to pull up my data using select_as_multiple.
Right now, I'm saving the data structure as a regular table as I'm using an object array. I realize folks are saying this is inefficient, but I can't think of an efficient way to handle the variable length nature of data.
I understand that I can use NaNs and make a (potentially very wide) table whose max width is the maximum length of my data array, but then I need a different mechanism to flag invalid trials. A row with all NaNs is confusing - does it mean that I had a zero length data trial or did I have an invalid trial?
I think there is no good solution to this using Pandas. The NaN solution leads to potentially extremely wide tables and requires an additional column marking valid/invalid trials.
If I used a database I would make the data a binary blob column. With Pandas my current working solution is to save data as an object array in a regular frame and load it all in and then pull out the relevant indexes based on my trials table.
This is slightly inefficient, since I'm reading my whole data table in one go, but it's the most workable/extendable scheme I have come up with.
But I welcome most enthusiastically a more canonical solution.
Thanks so much for all your time!
EDIT: Adding code (Jeff's suggestion)
import pandas as pd, numpy
mydata = [numpy.empty(n) for n in range(1,11)]
df = pd.DataFrame(mydata)
In [4]: df
Out[4]:
0
0 [1.28822975392e-231]
1 [1.28822975392e-231, -2.31584192385e+77]
2 [1.28822975392e-231, -1.49166823584e-154, 2.12...
3 [1.28822975392e-231, 1.2882298313e-231, 2.1259...
4 [1.28822975392e-231, 1.72723381477e-77, 2.1259...
5 [1.28822975392e-231, 1.49166823584e-154, 1.531...
6 [1.28822975392e-231, -2.68156174706e+154, 2.20...
7 [1.28822975392e-231, -2.68156174706e+154, 2.13...
8 [1.28822975392e-231, -1.3365130604e-315, 2.222...
9 [1.28822975392e-231, -1.33651054067e-315, 2.22...
In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
0 10 non-null values
dtypes: object(1)
df.to_hdf('test.h5','data')
--> OK
df.to_hdf('test.h5','data1',table=True)
--> ...
TypeError: Cannot serialize the column [0] because
its data contents are [mixed] object dtype
Here's a simple example along the lines of what you have described
In [17]: df = DataFrame(randn(10,10))
In [18]: df.iloc[5:10,7:9] = np.nan
In [19]: df.iloc[7:10,4:9] = np.nan
In [22]: df.iloc[7:10,-1] = np.nan
In [23]: df
Out[23]:
0 1 2 3 4 5 6 7 8 9
0 -1.671523 0.277972 -1.217315 -1.390472 0.944464 -0.699266 0.348579 0.635009 -0.330561 -0.121996
1 0.239482 -0.050869 0.488322 -0.668864 0.125534 -0.159154 1.092619 -0.638932 -0.091755 0.291824
2 0.432216 -1.101879 2.082755 -0.500450 0.750278 -1.960032 -0.688064 -0.674892 3.225115 1.035806
3 0.775353 -1.320165 -0.180931 0.342537 2.009530 0.913223 0.581071 -1.111551 1.118720 -0.081520
4 -0.255524 0.143255 -0.230755 -0.306252 0.748510 0.367886 -1.032118 0.232410 1.415674 -0.420789
5 -0.850601 0.273439 -0.272923 -1.248670 0.041129 0.506832 0.878972 NaN NaN 0.433333
6 -0.353375 -2.400167 -1.890439 -0.325065 -1.197721 -0.775417 0.504146 NaN NaN -0.635012
7 -0.241512 0.159100 0.223019 -0.750034 NaN NaN NaN NaN NaN NaN
8 -1.511968 -0.391903 0.257445 -1.642250 NaN NaN NaN NaN NaN NaN
9 -0.376762 0.977394 0.760578 0.964489 NaN NaN NaN NaN NaN NaN
In [24]: df['stop'] = df.apply(lambda x: x.last_valid_index(), 1)
In [25]: df
Out[25]:
0 1 2 3 4 5 6 7 8 9 stop
0 -1.671523 0.277972 -1.217315 -1.390472 0.944464 -0.699266 0.348579 0.635009 -0.330561 -0.121996 9
1 0.239482 -0.050869 0.488322 -0.668864 0.125534 -0.159154 1.092619 -0.638932 -0.091755 0.291824 9
2 0.432216 -1.101879 2.082755 -0.500450 0.750278 -1.960032 -0.688064 -0.674892 3.225115 1.035806 9
3 0.775353 -1.320165 -0.180931 0.342537 2.009530 0.913223 0.581071 -1.111551 1.118720 -0.081520 9
4 -0.255524 0.143255 -0.230755 -0.306252 0.748510 0.367886 -1.032118 0.232410 1.415674 -0.420789 9
5 -0.850601 0.273439 -0.272923 -1.248670 0.041129 0.506832 0.878972 NaN NaN 0.433333 9
6 -0.353375 -2.400167 -1.890439 -0.325065 -1.197721 -0.775417 0.504146 NaN NaN -0.635012 9
7 -0.241512 0.159100 0.223019 -0.750034 NaN NaN NaN NaN NaN NaN 3
8 -1.511968 -0.391903 0.257445 -1.642250 NaN NaN NaN NaN NaN NaN 3
9 -0.376762 0.977394 0.760578 0.964489 NaN NaN NaN NaN NaN NaN 3
Note that in 0.12 you should use table=True, rather than fmt (this is in the process of changing)
In [26]: df.to_hdf('test.h5','df',mode='w',fmt='t')
In [27]: pd.read_hdf('test.h5','df')
Out[27]:
0 1 2 3 4 5 6 7 8 9 stop
0 -1.671523 0.277972 -1.217315 -1.390472 0.944464 -0.699266 0.348579 0.635009 -0.330561 -0.121996 9
1 0.239482 -0.050869 0.488322 -0.668864 0.125534 -0.159154 1.092619 -0.638932 -0.091755 0.291824 9
2 0.432216 -1.101879 2.082755 -0.500450 0.750278 -1.960032 -0.688064 -0.674892 3.225115 1.035806 9
3 0.775353 -1.320165 -0.180931 0.342537 2.009530 0.913223 0.581071 -1.111551 1.118720 -0.081520 9
4 -0.255524 0.143255 -0.230755 -0.306252 0.748510 0.367886 -1.032118 0.232410 1.415674 -0.420789 9
5 -0.850601 0.273439 -0.272923 -1.248670 0.041129 0.506832 0.878972 NaN NaN 0.433333 9
6 -0.353375 -2.400167 -1.890439 -0.325065 -1.197721 -0.775417 0.504146 NaN NaN -0.635012 9
7 -0.241512 0.159100 0.223019 -0.750034 NaN NaN NaN NaN NaN NaN 3
8 -1.511968 -0.391903 0.257445 -1.642250 NaN NaN NaN NaN NaN NaN 3
9 -0.376762 0.977394 0.760578 0.964489 NaN NaN NaN NaN NaN NaN 3
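Along the same lines, the ambiguity from the question (a zero-length trial vs an invalid trial, both showing as all-NaN rows) can be resolved with an explicit length column next to the NaN-padded data. A sketch with made-up trial data; the -1 flag for invalid trials is an arbitrary convention, not anything pandas-specific:

```python
import numpy as np
import pandas as pd

# Hypothetical trials: None = invalid, a (possibly empty) list = valid.
trials = [None, [], [1.0, 2.0], [3.0]]

width = max((len(t) for t in trials if t is not None), default=0)
rows, lengths = [], []
for t in trials:
    if t is None:
        rows.append([np.nan] * width)
        lengths.append(-1)                  # invalid trial
    else:
        rows.append(list(t) + [np.nan] * (width - len(t)))
        lengths.append(len(t))              # 0 means valid but empty

df = pd.DataFrame(rows, dtype=float)
df['length'] = lengths                      # disambiguates all-NaN rows

# Every column is now numeric, so the frame can be stored as a
# PyTables 'table' (format='table' in current pandas versions):
# df.to_hdf('trials.h5', 'data', format='table')
```

This keeps the frame queryable via the trials selector table, at the cost of the padded width the question already anticipated.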