How do you filter out rows with NaN in a pandas DataFrame - pandas

I have a few entries in a pandas DataFrame that are NaN. How would I remove any row containing a NaN?

Just use df.dropna():
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame(np.random.randn(5, 2))
In [4]: df.iloc[0, 1] = np.nan
In [5]: df.iloc[4, 0] = np.nan
In [6]: print(df)
          0         1
0  2.264727       NaN
1  0.229321  1.615272
2 -0.901608 -1.407787
3 -0.198323  0.521726
4       NaN  0.692340
In [7]: df2 = df.dropna()
In [8]: print(df2)
          0         1
1  0.229321  1.615272
2 -0.901608 -1.407787
3 -0.198323  0.521726
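If you only want to drop rows under certain conditions, dropna takes a few keyword arguments; a quick sketch (the column labels here are just the integer defaults from the example above):
df.dropna(how='all')    # drop a row only if every value in it is NaN
df.dropna(thresh=2)     # keep rows that have at least 2 non-NaN values
df.dropna(subset=[0])   # only consider column 0 when deciding what to drop
df.dropna(axis=1)       # drop columns containing NaN instead of rows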

Related

Separating lists into a new dataframe

I have a DataFrame with one column that contains lists of lists. All the basic lists contain two strings. The lists that contain these basic lists have a variable number of lists in them.
Example:
                     A
0        [[1,1],[1,1]]
1              [[1,1]]
2  [[1,1],[1,1],[1,1]]
I want a new dataframe that has two columns. The first has the first item in each basic list, the second column has the second item. I solved it this way:
df = pd.DataFrame(data={'A': [[[1,1],[1,1]], [[1,1]], [[1,1],[1,1],[1,1]]]})
df2 = pd.DataFrame(columns=['A', 'B'])
for x in df.A:
    for i in x:
        n = pd.DataFrame([i], columns=('A', 'B'))
        df2 = df2.append(n)
   A  B
0  1  1
0  1  1
0  1  1
0  1  1
0  1  1
0  1  1
I know it is not good to loop through a DataFrame, but I couldn't figure out how to avoid it. Here are some failed attempts:
for x in df1:
    df2 = [df2.append(pd.DataFrame([i], columns=('A', 'B'))) for i in x]

df2 = df1.apply(lambda x: df2.append(pd.DataFrame([x[0]], columns=['name', 'tid'])))
If I had gotten the first list comprehension to work, I would have tried to move the for loop to the end of it.
Thank you in advance for your help!
Does this do the trick?
import pandas as pd
import itertools

df = pd.DataFrame(data={'A': [[[1,1],[1,1]], [[1,1]], [[1,1],[1,1],[1,1]]]})
a = []
b = []
for k in range(len(df)):
    a.append([x[0] for x in df.iloc[k].A])
    b.append([x[1] for x in df.iloc[k].A])
df2 = pd.DataFrame(data={'A': list(itertools.chain(*a)), 'B': list(itertools.chain(*b))})
Result:
>>> df2
   A  B
0  1  1
1  1  1
2  1  1
3  1  1
4  1  1
5  1  1
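For what it's worth, a loop-free variant of the same flattening (just a sketch; it assumes each inner list has exactly two items, and it avoids the repeated DataFrame.append, which newer pandas versions remove):
# flatten the column of lists of pairs into one list of pairs
rows = [pair for sublist in df['A'] for pair in sublist]
df2 = pd.DataFrame(rows, columns=['A', 'B'])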
Quick solution:
obj = df['A'].explode()
df1 = pd.DataFrame(obj.tolist(), index=obj.index, columns=['A', 'B'])
Demo data:
df = pd.Series([
    [[1,2],[1,2]], [[3,4]], [[1,2],[1,3],[1,4]]
], name='A').to_frame()
print(df)
                           A
0          [[1, 2], [1, 2]]
1                  [[3, 4]]
2  [[1, 2], [1, 3], [1, 4]]
Use explode to transform each element of a list-like to a row, replicating index values.
obj = df['A'].explode()
df1 = pd.DataFrame(obj.tolist(), index=obj.index, columns=['A', 'B'])
df_result = df1.groupby(level=0)[['A', 'B']].agg(list)
df_result
           A          B
0     [1, 1]     [2, 2]
1        [3]        [4]
2  [1, 1, 1]  [2, 3, 4]
df1
   A  B
0  1  2
0  1  2
1  3  4
2  1  2
2  1  3
2  1  4
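Note: Series.explode was added in pandas 0.25. On older versions, something along these lines rebuilds df1 directly (a sketch, assuming every inner list has exactly two items):
import numpy as np
lengths = df['A'].str.len()                # number of pairs in each row
idx = np.repeat(df.index.values, lengths)  # replicate index values, like explode does
df1 = pd.DataFrame(np.concatenate(df['A'].values).tolist(),
                   index=idx, columns=['A', 'B'])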
Use .apply(pd.Series) to convert a column of lists into a DataFrame.
df2 = df.A.apply(pd.Series)
print(df2)
        0       1       2
0  [1, 2]  [1, 2]     NaN
1  [3, 4]     NaN     NaN
2  [1, 2]  [1, 3]  [1, 4]

Pandas: Memory error when using apply to split single column array into columns

I am wondering if anybody has a quick fix for a memory error that appears when doing the same thing as in the example below on larger data?
Example:
import pandas as pd
import numpy as np

nRows = 2
nCols = 3
df = pd.DataFrame(index=range(nRows), columns=range(1))
df2 = df.apply(lambda row: [np.random.rand(nCols)], axis=1)
df3 = pd.concat(df2.apply(pd.DataFrame, columns=range(nCols)).tolist())
It is when creating df3 that I get the memory error.
The DataFrames in the example:
df
    0
0 NaN
1 NaN

df2
0    [[0.6704675101784022, 0.41730480236712697, 0.5...
1    [[0.14038693859523377, 0.1981014890848788, 0.8...
dtype: object

df3
          0         1         2
0  0.670468  0.417305  0.558690
0  0.140387  0.198101  0.800745
First, I think working with lists in pandas is not a good idea; avoid it if you can.
So I believe you can simplify your code a lot:
nRows = 2
nCols = 3
np.random.seed(2019)
df3 = pd.DataFrame(np.random.rand(nRows, nCols))
print (df3)
          0         1         2
0  0.903482  0.393081  0.623970
1  0.637877  0.880499  0.299172
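If memory is the real constraint, you can also generate straight into a compact dtype, so the large intermediate object column never exists; a sketch:
# build the float16 frame directly instead of going through an object column
df3 = pd.DataFrame(np.random.rand(nRows, nCols).astype('float16'))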
Here's an example with a solution to the problem (note that this example uses arrays in the column rather than lists; that I cannot avoid, since my original problem comes with lists or arrays in a column).
import pandas as pd
import numpy as np
import time

np.random.seed(1)
nRows = 25000
nCols = 10000
numberOfChunks = 5

df = pd.DataFrame(index=range(nRows), columns=range(1))
df2 = df.apply(lambda row: np.random.rand(nCols), axis=1)

chunkSize = int(round(nRows / float(numberOfChunks)))
for start, stop in zip(np.arange(0, nRows, chunkSize),
                       np.arange(chunkSize, nRows + chunkSize, chunkSize)):
    df2tmp = df2.iloc[start:stop]
    if start == 0:
        df3 = pd.DataFrame(df2tmp.tolist(), index=df2tmp.index).astype('float16')
        continue
    df3tmp = pd.DataFrame(df2tmp.tolist(), index=df2tmp.index).astype('float16')
    df3 = pd.concat([df3, df3tmp])
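An alternative sketch of the same chunking with np.array_split, which works out the chunk boundaries for you (assumes df2 from the snippet above):
# split the Series into roughly equal parts, convert each, and concatenate
parts = np.array_split(df2, numberOfChunks)
df3 = pd.concat(pd.DataFrame(p.tolist(), index=p.index).astype('float16')
                for p in parts)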

Swap certain subset of column data

I'm trying to swap a subset of the data in two columns, but all the methods that I have found on SO give a full swap, or also swap the column names. This is what I would like:
df =
   a  b  c
0  1  2  3
1  1  2  3
2  1  2  3
3  1  2  3
Then I create a random mask:
mask = np.random.choice([False, True], len(df), p=[0.5, 0.5])
Applying the mask and the swap, I want the result to look like this if I swap df[mask]['a'] and df[mask]['b']:
df =
   a  b  c
0  1  2  3
1  2  1  3
2  1  2  3
3  2  1  3
What is the best way to achieve this result? I am using pandas 0.18.1.
In one line:
mask = np.random.choice([False, True], len(df), p=[0.5, 0.5])
df.loc[mask, ['a', 'b']] = df.loc[mask, ['b', 'a']].values
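The .values on the right-hand side is what makes this work: without it, pandas aligns the assignment on column labels, so 'a' would just be written back to 'a' and nothing would swap.
# Without .values, label alignment turns the assignment into a no-op:
df.loc[mask, ['a', 'b']] = df.loc[mask, ['b', 'a']]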
Solution with numpy.where:
mask = np.random.choice([False, True], len(df), p=[0.5, 0.5])
df[['b', 'a']] = np.where(mask[:, None], df[['b', 'a']], df[['a', 'b']])
print (df)
   a  b  c
0  1  2  3
1  2  1  3
2  2  1  3
3  2  1  3
You can try this:
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1] * 4, "b": [2] * 4})
mask = np.random.choice([False, True], len(df), p=[0.5, 0.5])
df["a_bk"] = df["a"].copy()                    # back up 'a' before overwriting it
df["a"] = np.where(mask, df["b"], df["a"])
df["b"] = np.where(mask, df["a_bk"], df["b"])  # use the backup, not the already-swapped 'a'
del df["a_bk"]

Assign a list with a missing value to a Pandas Series in Python

Something weird happens when I try to assign a list containing the missing value np.nan to a pandas Series.
Below is code to reproduce it.
import numpy as np
import pandas as pd
S = pd.Series(0, index = list('ABCDE'))
>>> S
A    0
B    0
C    0
D    0
E    0
dtype: int64
ind = [True, False, True, False, True]
x = [1, np.nan, 2]
>>> S[ind]
A    0
C    0
E    0
dtype: int64
Assign x to S[ind]:
S[ind] = x
Something weird is now in S:
>>> S
A      1
B      0
C      2
D      0
E    NaN
dtype: float64
I was expecting S to be:
>>> S
A      1
B      0
C    NaN
D      0
E      2
dtype: float64
Can anyone give an explanation for this?
You can try this:
S[S[ind].index] = x
or
S[S.index[ind]] = x
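A quick check of the second workaround (a sketch; the Series starts as float here so NaN fits without surprises):
S = pd.Series(0.0, index=list('ABCDE'))
ind = [True, False, True, False, True]
x = [1, np.nan, 2]
S[S.index[ind]] = x   # label-based assignment, values land in mask order
print(S)
A    1.0
B    0.0
C    NaN
D    0.0
E    2.0
dtype: float64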

Python 3.4 Pandas DataFrame Structuring

QUESTION
How can I get rid of the repeated column labels for each line of data?
CODE
req = urllib.request.Request(newIsUrl)
resp = urllib.request.urlopen(req)
respData = resp.read()
dRespData = respData.decode('utf-8')
df = pd.DataFrame(columns=['Ticker', 'GW', 'RE', 'OE', 'NI', 'CE'])
df = df.append({'Ticker': ticker,
                'GW': gw,
                'RE': rt,
                'OE': oe,
                'NI': netInc,
                'CE': capExp}, ignore_index=True)
print(df)

yhooKeyStats()
acquireData()
OUTCOME
  Ticker           GW            RE            OE           NI             CE
0    MMM  [7,050,000]  [34,317,000]  [13,109,000]  [4,956,000]  [(1,493,000)]
  Ticker           GW            RE            OE           NI             CE
0    ABT [17,501,000]   [7,412,000]  [12,156,000]  [2,437,000]
NOTES
all of the headers and data line up respectively
headers are repeated in the dataframe for each line of data
You can skip every other line with a slice and iloc:
In [11]: df = pd.DataFrame({0: ['A', 1, 'A', 3], 1: ['B', 2, 'B', 4]})
In [12]: df
Out[12]:
   0  1
0  A  B
1  1  2
2  A  B
3  3  4
In [13]: df.iloc[1::2]
Out[13]:
   0  1
1  1  2
3  3  4
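If you go this route, you will probably also want to promote one header row to the column names and reset the index; a sketch:
data = df.iloc[1::2].reset_index(drop=True)
data.columns = df.iloc[0]   # take the column names from the first header row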