Delete all rows with an empty cell anywhere in the table at once in pandas

I have googled it and found lots of questions on Stack Overflow. So suppose I have a dataframe like this
A    B
-----
1
2
4    4
The first 3 rows should be deleted. And suppose I have not 2 but 200 columns; how can I do that?

As per your request, first replace the empty (or whitespace-only) cells with NaN, then drop the rows:
import numpy as np

df = df.replace(r'^\s*$', np.nan, regex=True)
df = df.dropna()
If you only want to drop rows based on specific columns, pass those column names to dropna via the subset argument, e.g. df.dropna(subset=['A']).
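A minimal, self-contained sketch of the approach (the sample values are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

# Sample frame with empty and whitespace-only cells.
df = pd.DataFrame({'A': ['1', '2', ' ', '4'],
                   'B': ['',  ' ', '',  '4']})

# Replace empty / whitespace-only cells with NaN, then drop any row containing NaN.
df = df.replace(r'^\s*$', np.nan, regex=True)
df = df.dropna()

print(df)  # only the row with A='4', B='4' survives
```

This scales to 200 columns unchanged, since both replace and dropna operate on the whole frame at once.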

Related

Python: compare 2 dataframes and find rows not in the 2nd dataframe

I want to compare 2 dataframes (df): if a row of exat_merge (df1) is not in df_ss_cpd2 (df2), I want it in df_missing (df3).
code:
df_missing = exat_merge.loc[exat_merge[df_ss_cpd2.columns.to_list()].isnull().all(axis = 1), df_ss_cpd2.columns.to_list()]
Each dataframe has an index column plus 8 columns, and all column names are identical in both dataframes. Nothing works. What do you think I am doing incorrectly in this code? Thanks.
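Without the actual data it is hard to say what fails, but a common way to get the rows of one dataframe that are absent from another is an indicator merge. A minimal sketch with made-up frames (the column names id and val are illustrative, not from the question):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'val': ['a', 'b', 'c']})
df2 = pd.DataFrame({'id': [2, 3], 'val': ['b', 'c']})

# Left merge with an indicator column, then keep the rows found only in df1.
merged = df1.merge(df2, how='left', indicator=True)
df_missing = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

print(df_missing)  # the row id=1, val='a'
```

By default merge joins on all columns the two frames share, which matches the "all column names are identical" setup described above.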

Convert transactions with several products from columns to row [duplicate]

I'm having a very tough time trying to figure out how to do this with python. I have the following table:
NAMES VALUE
john_1 1
john_2 2
john_3 3
bro_1 4
bro_2 5
bro_3 6
guy_1 7
guy_2 8
guy_3 9
And I would like to go to:
NAMES VALUE1 VALUE2 VALUE3
john 1 2 3
bro 4 5 6
guy 7 8 9
I have tried with pandas: I first split the index (NAMES) and can create the new columns, but I have trouble getting the values into the right column.
Can someone at least give me a direction where the solution to this problem is? I don't expect a full code (I know that this is not appreciated) but any help is welcome.
After splitting the NAMES column, use .pivot to reshape your DataFrame.
# Split Names and Pivot.
df['NAME_NBR'] = df['NAMES'].str.split('_').str.get(1)
df['NAMES'] = df['NAMES'].str.split('_').str.get(0)
df = df.pivot(index='NAMES', columns='NAME_NBR', values='VALUE')
# Rename columns and reset the index.
df.columns = ['VALUE{}'.format(c) for c in df.columns]
df.reset_index(inplace=True)
If you want to be slick, you can do the split in a single line:
df['NAMES'], df['NAME_NBR'] = zip(*[s.split('_') for s in df['NAMES']])
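Putting the steps above together on the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({'NAMES': ['john_1', 'john_2', 'john_3', 'bro_1', 'bro_2',
                             'bro_3', 'guy_1', 'guy_2', 'guy_3'],
                   'VALUE': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

# Split NAMES into the base name and the trailing number, then pivot.
df['NAME_NBR'] = df['NAMES'].str.split('_').str.get(1)
df['NAMES'] = df['NAMES'].str.split('_').str.get(0)
df = df.pivot(index='NAMES', columns='NAME_NBR', values='VALUE')

# Rename columns and reset the index.
df.columns = ['VALUE{}'.format(c) for c in df.columns]
df.reset_index(inplace=True)

print(df)
```

Note that pivot sorts the index, so the rows come out in the order bro, guy, john rather than the original order.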

Joining two data frames on column name and comparing result side by side

I have two data frames which look like df1 and df2 below and I want to create df3 as shown.
I could do this using a left join to have all the rows in one dataframe and then did a numpy.where to see if they are matching or not.
I could get what I want but I feel there should be an elegant way of doing this which will eliminate renaming columns, reshuffling columns in dataframe and then using np.where.
Is there a better way to do this?
code to reproduce dataframes:
import pandas as pd
df1=pd.DataFrame({'product':['apples','bananas','oranges','pineapples'],'price':[1,2,3,7],'quantity':[5,7,11,4]})
df2=pd.DataFrame({'product':['apples','bananas','oranges'],'price':[2,2,4],'quantity':[5,7,13]})
df3=pd.DataFrame({'product':['apples','bananas','oranges'],'price_df1':[1,2,3],'price_df2':[2,2,4],'price_match':['No','Yes','No'],'quantity':[5,7,11],'quantity_df2':[5,7,13],'quantity_match':['Yes','Yes','No']})
An elegant way to do your task is to:
generate "partial" DataFrames from each source column,
and then concatenate them.
The first step is to define a function to join 2 source columns and append "match" column:
import numpy as np

def myJoin(s1, s2):
    rv = s1.to_frame().join(s2.to_frame(), how='inner',
                            lsuffix='_df1', rsuffix='_df2')
    rv[s1.name + '_match'] = np.where(rv.iloc[:, 0] == rv.iloc[:, 1], 'Yes', 'No')
    return rv
Then, from df1 and df2, generate 2 auxiliary DataFrames setting product as the index:
wrk1 = df1.set_index('product')
wrk2 = df2.set_index('product')
And the final step is:
result = pd.concat([myJoin(wrk1[col], wrk2[col]) for col in wrk1.columns],
                   axis=1).reset_index()
Details:
for col in wrk1.columns - generates names of columns to join.
myJoin(wrk1[col], wrk2[col]) - generates the partial result for this column from both source DataFrames.
[…] - a list comprehension, collecting the above partial results in a list.
pd.concat(…) - concatenates these partial results into the final result.
reset_index() - converts the index (product names) into a regular column.
For your source data, the result is:
product price_df1 price_df2 price_match quantity_df1 quantity_df2 quantity_match
0 apples 1 2 No 5 5 Yes
1 bananas 2 2 Yes 7 7 Yes
2 oranges 3 4 No 11 13 No

Why does Python not drop all duplicates?

This is my original data frame.
I want to remove the duplicates for the columns 'head_x' and 'head_y' and the columns 'cost_x' and 'cost_y'.
This is my code:
df=df.astype(str)
df.drop_duplicates(subset={'head_x','head_y'}, keep=False, inplace=True)
df.drop_duplicates(subset={'cost_x','cost_y'}, keep=False, inplace=True)
print(df)
This is the output dataframe; as you can see, the first row is a duplicate on both subsets. So why is this row still there?
I do not just want to remove the first row but all duplicates. This is another output where there is also a duplicate for Index/Node 6.
Take a look at the first 2 rows:
      head_x  cost_x  head_y  cost_y
Node
1          2       6       2       3
1          2       6       3       4
Start from head_x and head_y:
the values in the first row are 2 and 2,
the values in the second row are 2 and 3,
so these two pairs are different.
Then look at cost_x and cost_y:
the values in the first row are 6 and 3,
the values in the second row are 6 and 4,
so these two pairs are also different.
Conclusion: These 2 rows are not duplicates, taking into account both column subsets.
df = df.astype(str)
df = df.drop_duplicates(subset=['head_x', 'head_y'], keep=False)
df = df.drop_duplicates(subset=['cost_x', 'cost_y'], keep=False)
(Note: do not assign the result back to df while also passing inplace=True; drop_duplicates then returns None.)
I assume that cost_x should be replaced with head_y; otherwise there are no duplicates.
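A quick check of the keep=False behaviour on just the two rows shown above (a minimal sketch reproducing them):

```python
import pandas as pd

df = pd.DataFrame({'head_x': [2, 2], 'cost_x': [6, 6],
                   'head_y': [2, 3], 'cost_y': [3, 4]},
                  index=[1, 1]).astype(str)

# keep=False removes every member of a duplicated group, but only
# when the subset values match exactly.
out = df.drop_duplicates(subset=['head_x', 'head_y'], keep=False)
out = out.drop_duplicates(subset=['cost_x', 'cost_y'], keep=False)

print(len(out))  # 2 - neither pair of subset values is duplicated
```

Since (2, 2) differs from (2, 3) and (6, 3) differs from (6, 4), neither call finds a duplicate, and both rows survive.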

pandas get columns without copy

I have a data frame with multiple columns, and I want to keep some of them and drop the others, without copying a new dataframe.
I suppose it should be
df = df['col_a','col_b']
but I'm not sure whether it copies a new one or not. Is there any better way to do this?
Your approach should work, apart from one minor issue:
df = df['col_a','col_b']
should be:
df = df[['col_a','col_b']]
Because you assign the subset back to df, it is essentially equivalent to dropping the other columns.
If you would like to drop other columns in place, you can do:
df.drop(columns=df.columns.difference(['col_a','col_b']),inplace=True)
Let me know if this is what you want.
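A short sketch of the in-place drop (col_a and col_b are from the question; col_c is an illustrative extra column):

```python
import pandas as pd

df = pd.DataFrame({'col_a': [1, 2], 'col_b': [3, 4], 'col_c': [5, 6]})

# Drop every column except col_a and col_b without rebinding df.
df.drop(columns=df.columns.difference(['col_a', 'col_b']), inplace=True)

print(list(df.columns))  # ['col_a', 'col_b']
```

columns.difference computes the set of columns to discard, so you only have to list the columns you want to keep.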
Say you have a dataframe df with multiple columns a, b, c, d and e, and you want to select, say, a and b and store them back in df. To achieve this, you can do:
df=df[['a', 'b']]
Input dataframe df:
a b c d e
1 1 1 1 1
3 2 3 1 4
When you do:
df=df[['a', 'b']]
the output will be:
a b
1 1
3 2