Condensing Wide Data Based on Column Name - pandas

Is there an elegant way to do what I'm trying to do in Pandas? My data looks something like:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'alpha': [1, np.nan, np.nan, np.nan],
    'bravo': [np.nan, np.nan, np.nan, -1],
    'charlie': [np.nan, np.nan, np.nan, np.nan],
    'delta': [np.nan, 1, np.nan, np.nan],
})
print(df)
   alpha  bravo  charlie  delta
0    1.0    NaN      NaN    NaN
1    NaN    NaN      NaN    1.0
2    NaN    NaN      NaN    NaN
3    NaN   -1.0      NaN    NaN
and I want to transform that into something like:
  position  value
0    alpha      1
1    delta      1
2      NaN    NaN
3    bravo     -1
So for each row in the original data I want to find the non-NaN value and retrieve the name of the column it was found in. Then I'll store the column and value in new columns called 'position' and 'value'.
I can guarantee that each row in the original data contains exactly zero or one non-NaN values.
My only idea is to iterate over each row but I know that idea is bad and there must be a more pandorable way to do it. I'm not exactly sure how to word my problem so I'm having trouble Googling for ideas. Thanks for any advice!

We can use DataFrame.melt to unpivot your data, then use sort_values and drop_duplicates:
df = (
    df.melt(var_name='position')
      .sort_values('value')
      .drop_duplicates('position', ignore_index=True)
)
  position  value
0    bravo   -1.0
1    alpha    1.0
2    delta    1.0
3  charlie    NaN
Another option would be to use DataFrame.bfill over the column axis. Since you noted that you "can guarantee that each row in the original data contains exactly zero or one non-NaN values", the first column after back-filling holds each row's value, and the column that value came from can be looked up with idxmax on the notna mask:
values = df.bfill(axis=1).iloc[:, 0]
positions = df.notna().idxmax(axis=1).where(values.notna())
dfn = pd.DataFrame({'positions': positions, 'values': values})
  positions  values
0     alpha     1.0
1     delta     1.0
2       NaN     NaN
3     bravo    -1.0
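If you prefer a row-wise lookup instead of idxmax, first_valid_index does the same job; this is a small sketch, not part of the answer above:
# first_valid_index() returns the label of the first non-NaN entry in a row,
# or None when the whole row is NaN
positions_alt = df.apply(lambda row: row.first_valid_index(), axis=1)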

Another way to do this. Actually, I just noticed that it is quite similar to Erfan's first proposal:
# get the index as a column
df2 = df.reset_index(drop=False)
# melt the columns, keeping the index as the id column,
# and sort the result so NaNs appear at the end
df3 = df2.melt(id_vars=['index'])
df3.sort_values('value', ascending=True, inplace=True)
# now take the first row of each index group
df3.groupby('index')[['variable', 'value']].agg('first')
Or shorter:
(
    df.reset_index(drop=False)
      .melt(id_vars=['index'])
      .sort_values('value')
      .groupby('index')[['variable', 'value']].agg('first')
)
The result is:
      variable  value
index
0        alpha    1.0
1        delta    1.0
2        alpha    NaN
3        bravo   -1.0
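To match the asker's desired column names exactly, a rename can be tacked onto the end of the chain (my addition, not part of the answer above):
result = (
    df.reset_index(drop=False)
      .melt(id_vars=['index'])
      .sort_values('value')
      .groupby('index')[['variable', 'value']].agg('first')
      .rename(columns={'variable': 'position'})
      .rename_axis(None)  # drop the 'index' label from the index
)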

Related

How to split dictionary column in dataframe and make a new columns for each key values

I have a dataframe which has a column containing multiple values, separated by ",".
id  data
0   {'1':A, '2':B, '3':C}
1   {'1':A}
2   {'0':0}
How can I split up the key-value pairs of the 'data' column and make a new column for each key present in it, without removing the original 'data' column?
Desired output:
id  data                    1    2    3    0
0   {'1':A, '2':B, '3':C}   A    B    C    NaN
1   {'1':A}                 A    NaN  NaN  NaN
2   {'0':0}                 NaN  NaN  NaN  0
Thank you in advance :).
You'll need a regular expression to quote the bare values so each string can be parsed as a Python dict literal. Then ast.literal_eval plus pd.json_normalize will do the job nicely:
import ast

# quote the bare values (A, B, C, 0) so each string becomes a valid dict literal
df['data'] = df['data'].str.replace(r'(["\'])\s*:(.+?)\s*(,?\s*["\'}])', '\\1:\'\\2\'\\3', regex=True)
# parse the strings into real dicts
df['data'] = df['data'].apply(ast.literal_eval)
# expand each dict into its own columns and concatenate with the original frame
df = pd.concat([df, pd.json_normalize(df['data'])], axis=1)
Output:
>>> df
                             data    1    2    3    0
0  {'1': 'A', '2': 'B', '3': 'C'}    A    B    C  NaN
1                      {'1': 'A'}    A  NaN  NaN  NaN
2                      {'0': '0'}  NaN  NaN  NaN    0
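As an aside, if the column already held real dict objects rather than strings (an assumption, not the asker's situation), the regex and ast.literal_eval steps could be skipped entirely:
import pandas as pd

# sketch under the assumption that 'data' contains dicts, not strings
df2 = pd.DataFrame({'data': [{'1': 'A', '2': 'B', '3': 'C'}, {'1': 'A'}, {'0': 0}]})
df2 = pd.concat([df2, pd.json_normalize(df2['data'])], axis=1)
print(df2)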

Check if a Pandas Series has 6+ Continuous Missing Values

I know it is easy to check how many missing values are in a pandas Series. What if I want to check whether a pandas Series has 6+ continuous missing entries?
mask = temp_df.loc[:, i].isna()
max_missing_val = temp_df.loc[:, i][mask].groupby((~mask).cumsum()[mask]).agg(['size'])
if len(max_missing_val) == 0:
    max_missing_val = 0
else:
    max_missing_val = max_missing_val.max()[0]
Reference: Counting continuous nan values in panda Time series
You can make use of cumsum to create groups of continuous NaN values:
s = pd.Series(
    [np.nan, 1, 2, np.nan, np.nan, np.nan, 3, 4, np.nan, np.nan] * 2
)
# create groups of consecutive NaN / non-NaN values
group = s.isna().ne(s.shift().isna()).cumsum()
# set threshold for minimum group size, here 3 instead of 6
threshold = 3
group_size = s.groupby(group).transform('size')
# check for rows with 3+ continuous NaN values
# (even group numbers are the NaN runs here because the series starts with NaN)
print(s[(group % 2 == 0) & (group_size.ge(threshold))])
# output
3     NaN
4     NaN
5     NaN
8     NaN
9     NaN
10    NaN
13    NaN
14    NaN
15    NaN
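For the original yes/no question (6+ consecutive NaNs), the same idea can be wrapped in a small helper; this is only a sketch and the function name is my own:
def has_consecutive_nans(s, n=6):
    """Return True if s contains a run of at least n consecutive NaN values."""
    na = s.isna()
    run_id = na.ne(na.shift()).cumsum()              # label each run of equal values
    run_len = na.groupby(run_id).transform('size')   # length of the run each row belongs to
    return bool((na & run_len.ge(n)).any())

# with the example series above:
# has_consecutive_nans(s, 3) -> True, has_consecutive_nans(s, 6) -> False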

How to find column names in pandas dataframe that contain all unique values except NaN?

I want to find the columns of a pandas DataFrame whose values are all unique (no duplicates), ignoring NaN.
     x    y    z
a    1    2    A
b    2    2    B
c  NaN    3    D
d    4  NaN  NaN
e  NaN  NaN  NaN
The columns "x" and "z" have non-duplicate values except NaN, so I want to pick them out and create a new data frame.
Let us use nunique. Since nunique ignores NaN by default, a column's non-NaN values are all unique exactly when its nunique equals its count of non-null entries:
m = df.nunique() == df.notnull().sum()
subdf = df.loc[:, m]
     x    z
a  1.0    A
b  2.0    B
c  NaN    D
d  4.0  NaN
e  NaN  NaN
m.index[m].tolist()
['x', 'z']
Compare the number of unique values with the number of values left after applying dropna(). Try this code:
import pandas as pd
import numpy as np

df = pd.DataFrame({"x": [1, 2, np.nan, 4, np.nan],
                   "y": [2, 2, 3, np.nan, np.nan],
                   "z": ["A", "B", "D", np.nan, np.nan]})

for col in df.columns:
    if len(df[col].dropna()) == len(df[col].dropna().unique()):
        print(col)
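To go from the printed names to the new DataFrame the asker wants, the same test can be folded into a list comprehension (my addition; it is equivalent to df.loc[:, m] from the first answer):
# collect the qualifying column names and select them in one step
unique_cols = [col for col in df.columns
               if len(df[col].dropna()) == len(df[col].dropna().unique())]
subdf = df[unique_cols]
print(subdf)   # columns x and z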

Pandas: Number of rows with missing data

How do I find out the total number of rows that have missing data in a Pandas DataFrame?
I have tried this:
df.isnull().sum().sum()
But this is for the total missing fields.
I need to know how many rows are affected.
You can use .any(axis=1). For each row, it returns True if any element in that row is True, and False otherwise.
df = pd.DataFrame({'a': [0, np.nan, 1], 'b': [np.nan, np.nan, 'c']})
print(df)
outputs
     a    b
0  0.0  NaN
1  NaN  NaN
2  1.0    c
and
df.isnull().any(axis=1).sum() # returns 2
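A couple of related one-liners beyond what was asked (a small sketch): grabbing the affected rows themselves, or counting the complete ones.
affected_rows = df[df.isnull().any(axis=1)]   # the rows containing at least one NaN
n_complete = df.notna().all(axis=1).sum()     # rows with no missing data (here: 1)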

In pandas, how can all columns that do not contain at least one NaN be dropped from a DataFrame?

I have a DataFrame in which some columns have NaN values. I want to drop all columns that do not have at least one NaN value in them.
I am able to identify the NaN values by creating a DataFrame filled with Boolean values (True in place of NaN values, False otherwise):
data.isnull()
Then, I am able to identify the columns that contain at least one NaN value by creating a series of column names with associated Boolean values (True if the column contains at least one NaN value, False otherwise):
data.isnull().any(axis = 0)
When I attempt to use this series to drop the columns that do not contain at least one NaN value, I run into a problem: the columns that do not contain NaN values are dropped:
data = data.loc[:, data.isnull().any(axis = 0)]
How should I do this?
Consider the dataframe df
df = pd.DataFrame([
    [1, 2, None],
    [3, None, 4],
    [5, 6, None]
], columns=list('ABC'))

df
   A    B    C
0  1  2.0  NaN
1  3  NaN  4.0
2  5  6.0  NaN
IIUC:
dropna with the thresh parameter:
df.dropna(axis=1, thresh=2)
   A    B
0  1  2.0
1  3  NaN
2  5  6.0
loc + boolean indexing
df.loc[:, df.isnull().sum() < 2]
   A    B
0  1  2.0
1  3  NaN
2  5  6.0
I used the sample DF from @piRSquared's answer.
If you want "to drop the columns that do not contain at least one NaN value":
In [19]: df
Out[19]:
   A    B    C
0  1  2.0  NaN
1  3  NaN  4.0
2  5  6.0  NaN

In [26]: df.loc[:, df.isnull().any()]
Out[26]:
     B    C
0  2.0  NaN
1  NaN  4.0
2  6.0  NaN
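An equivalent spelling, not from either answer (just a sketch): compute the names of the columns that have no NaN at all and drop them explicitly.
# columns where every value is non-NaN, i.e. the ones to remove
cols_without_nan = df.columns[df.notna().all()]
result = df.drop(columns=cols_without_nan)   # keeps only B and C here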