comparing two DataFrames, specific questions - pandas

I was reading Andy's answer to the question Outputting difference in two Pandas dataframes side by side - highlighting the difference.
I have two questions regarding the code; unfortunately I don't yet have 50 rep to comment on the answer, so I hope I can get some help here.
What does changed = ne_stacked[ne_stacked] do? I'm not sure what the df1 = df[df] pattern does, and I can't seem to find an answer in the pandas docs. Could someone explain this to me, please?
Is np.where(df1 != df2) the same as df.where(df1 != df2)? If not, what is the difference?

Question 1
ne_stacked is a pd.Series that consists of True and False values that indicate where df1 and df2 are not equal.
ne_stacked[boolean_array] is a way to filter the series ne_stacked by eliminating the rows of ne_stacked where boolean_array is False and keeping the rows of ne_stacked where boolean_array is True.
It so happens that ne_stacked is itself a boolean array, so it can be used to filter itself. Why would we want to do this? So we can see what the values of the index are after we've filtered.
So ne_stacked[ne_stacked] is a subset of ne_stacked with only True values.
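To make this concrete, here is a minimal sketch with made-up values (the real ne_stacked comes from comparing df1 and df2 in the linked answer):
import pandas as pd

# A boolean Series standing in for ne_stacked (values are hypothetical)
ne_stacked = pd.Series(
    [False, True, False, True],
    index=pd.MultiIndex.from_tuples(
        [(0, 'a'), (0, 'b'), (1, 'a'), (1, 'b')], names=['id', 'col']
    ),
)

changed = ne_stacked[ne_stacked]  # keeps only the rows whose value is True
print(changed)
# id  col
# 0   b      True
# 1   b      True
# dtype: bool
The result's index then tells you exactly which (row, column) pairs differ.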
Question 2
np.where
np.where does two things. If you only pass a condition, as in np.where(df1 != df2), you get back a tuple of two arrays: the first holds the row positions and the second the column positions of every cell where the condition is True. I usually use it like this
i, j = np.where(df1 != df2)
Now I can get at all the elements of df1 or df2 where the two frames differ, like
df1.values[i, j]
Or I can assign to those cells
df1.values[i, j] = -99
Or lots of other useful things.
You can also use np.where as an if, then, else for arrays
np.where(df1 != df2, -99, 99)
To produce an array the same size as df1 or df2 where you have -99 in all the places where df1 != df2 and 99 in the rest.
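Here is a minimal runnable sketch of both uses, with made-up frames (the actual df1 and df2 from the linked answer aren't reproduced here):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'a': [1, 9, 3], 'b': [4, 5, 8]})

# Condition only: row and column positions of the differing cells
i, j = np.where(df1 != df2)
print(df1.values[i, j])  # [2 6] -> the df1 values where the frames differ
print(df2.values[i, j])  # [9 8] -> the corresponding df2 values

# Condition plus two choices: element-wise if/then/else
print(np.where(df1 != df2, -99, 99))
# [[ 99  99]
#  [-99  99]
#  [ 99 -99]]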
df.where
On the other hand, df.where evaluates its first argument as a boolean mask and returns an object of the same size as df, in which the cells where the mask is True keep their original values and the rest become either np.nan or the value passed as the second argument of df.where
df1.where(df1 != df2)
Or
df1.where(df1 != df2, -99)
Are they the same?
Clearly they are not the "same", but you can use them similarly:
np.where(df1 != df2, df1, -99)
Should be the same as
df1.where(df1 != df2, -99).values
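A quick check with the same toy frames as above (again, assumed data):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'a': [1, 9, 3], 'b': [4, 5, 8]})

left = np.where(df1 != df2, df1, -99)      # numpy's if/then/else
right = df1.where(df1 != df2, -99).values  # pandas equivalent, as an array
print(np.array_equal(left, right))         # True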

Related

How do I combine multiple dataframes using a repeating index system

I have multiple dataframes that I want to combine and only want to use the indexing system of the first dataframe. The problem is the indices I want to use are repeating and I want to keep it that way.
df = pd.concat([df1, df2, df3], axis=1, join='inner')
This gives me InvalidIndexError: Reindexing only valid with uniquely valued Index objects
Just so it's clear, df1 has repeating indices (0-9, repeated multiple times), whereas df2 and df3 are single-column dataframes with non-repeating indices. The number of rows does match, though.
From what I understand, the index on df1 repeats itself. That is what causes the error InvalidIndexError: Reindexing only valid with uniquely valued Index objects: because the index loops through the values 0-9, pandas can never tell which row should be aligned with which, since the labels are not unique. My approach would be to just use join, but if you want to use concat for your own reasons, there is a workaround.
A few ways to do this:
Just using the join function
df1.join([df2, df3])
But if you insist on using concat, I would do
x = df1.index                     # save the original (repeating) index
df1 = df1.reset_index(drop=True)  # give df1 a unique 0..n-1 index so concat can align rows
df = pd.concat([df1, df2, df3], axis=1, join='inner')
df.index = x                      # restore the repeating index afterwards
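As a minimal, hedged sketch with made-up frames (assuming df2 and df3 already carry a default 0..n-1 index, so positional alignment is what you want):
import pandas as pd

df1 = pd.DataFrame({'a': range(20)}, index=list(range(10)) * 2)  # repeating 0-9 index
df2 = pd.DataFrame({'b': range(20)})                             # default unique index
df3 = pd.DataFrame({'c': range(20)})

x = df1.index
df1 = df1.reset_index(drop=True)
df = pd.concat([df1, df2, df3], axis=1, join='inner')
df.index = x  # df now has columns a, b, c and the repeating 0-9 index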

How to count specific value occurancies for x amount of rows in Pandas DataFrame?

I have Pandas DataFrame:
df = pd.read_csv("file.csv")
I need to count occurrences in the column 'gender', but not all of them, only the first 5 entries (I'd love to see code for any interval of rows, say from the 10th to the 20th, etc.)
Currently I am familiar only with this:
df[df['gender'] == 1].shape[0]
and a more complicated version with a lambda:
A2_object = df.apply(lambda x: True if x['gender'] == 1 else False, axis=1)
A2 = len(A2_object[A2_object == True].index)
I am learning, and I see that loops don't work on dataframes the same way as on lists or dictionaries.
I am trying something like this:
df[df['gender'] == 1 and df.index < 5].shape[0]
I love this post, but can't get my head around the examples.
As Mr. @Quang Hoang posted, I needed to use slicing on the indices, and in a dataframe positional slicing is done with .iloc (as opposed to .loc). Thank you, Sir.
Answer: df.iloc[start:end]['gender'].eq(1).sum()
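A small sketch with made-up data showing the same idea for any row interval (here rows 0-4 and 10-19, counted by position):
import pandas as pd

# Hypothetical data; the real file.csv isn't shown
df = pd.DataFrame({'gender': [1, 0, 1, 1, 0] * 5})

print(df.iloc[0:5]['gender'].eq(1).sum())    # 3  -> gender == 1 among the first 5 rows
print(df.iloc[10:20]['gender'].eq(1).sum())  # 6  -> same count for rows 10 through 19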

Remove rows from multiple dataframe that contain bad data

Say I have n dataframes, df1, df2...dfn.
Finding the rows that contain "bad" values in a given dataframe is done by, e.g.,
index1 = df1[df1.isin([np.nan, np.inf, -np.inf])]
index2 = df2[df2.isin([np.nan, np.inf, -np.inf])]
Now, dropping these bad rows from the offending dataframe is done with:
df1 = df1.replace([np.inf, -np.inf], np.nan).dropna()
df2 = df2.replace([np.inf, -np.inf], np.nan).dropna()
The problem is that any function that expects the two (n) dataframes columns to be of the same length may give an error if there is bad data in one df but not the other.
How do I drop not just the bad row from the offending dataframe, but the same row from a list of dataframes?
So in the two dataframe case, if in df1 date index 2009-10-09 contains a "bad" value, that same row in df2 will be dropped.
[Possible "ugly"? solution?]
I suspect that one way to do it is to merge the two (n) dataframes on date and then apply the cleanup function; dropping "bad" values becomes automatic, since the entire row gets dropped. But what happens if a date is missing from one dataframe and not the other? [And they still happen to be the same length?]
First, do your replace:
df1 = df1.replace([np.inf, -np.inf], np.nan)
df2 = df2.replace([np.inf, -np.inf], np.nan)
Then concatenate with an inner join and drop the bad rows in one go:
newdf = pd.concat([df1, df2], axis=1, keys=[1, 2], join='inner').dropna()
And split it back into two dfs; here we use combine_first with dropna on the original df:
df1, df2 = [
    s[1].loc[:, s[0]].combine_first(x.dropna())
    for x, s in zip([df1, df2], newdf.groupby(level=0, axis=1))
]
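If you prefer a more explicit route, here is a hedged sketch of one alternative (not the answer above): build a single mask of rows that are clean in every dataframe and apply it everywhere. This assumes the dataframes share the same date index, as in the question.
import numpy as np
import pandas as pd

dfs = [df1, df2]  # could be any number of dataframes with aligned indices

# Replace infs with NaN, then mark rows that are clean in every frame
cleaned = [d.replace([np.inf, -np.inf], np.nan) for d in dfs]
good = pd.concat([d.notna().all(axis=1) for d in cleaned], axis=1).all(axis=1)

# Keep only the rows that are good in all dataframes
df1, df2 = [d[good] for d in cleaned]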

Repeat elements in pandas dataframe so equal number of each unique element

I have a pandas dataframe with multiple different feature columns. I have one particular column which can take on a variety of integer values. I want to manipulate the dataframe in such a way that there is an equal number of each of these integer values.
Before:
df['key'] = [1,1,1,3,4,5,5]
After:
df['key'] = [1,1,1,3,3,3,4,4,4,5,5,5]
I want this to be applied to every key in the dataframe.
So here's an ugly way I've coded up a solution, but I feel like it goes against the entire point of using pandas dataframes.
counts = data['key'].value_counts()
for key, count in counts.items():
    if count == counts.max():
        pass
    else:
        # Number of extra copies needed to bring this key up to (roughly) the max count
        scaling = (counts.max() // count) - 1
        data2 = pd.concat([data[data['key'] == key]] * scaling, ignore_index=True)
        data = pd.concat([data, data2], ignore_index=True)
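For what it's worth, a shorter hedged sketch that gives every key exactly the same count by upsampling each group with replacement (so every key appears counts.max() times, matching the desired "After"; rows are repeated by random sampling rather than in order):
import pandas as pd

counts = data['key'].value_counts()
target = counts.max()

data = (
    data.groupby('key', group_keys=False)
        .apply(lambda g: g.sample(target, replace=True, random_state=0))
        .reset_index(drop=True)
)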

Pandas: How can I check if a pandas dataframe contains a specific value?

Specifically, I want to see if my pandas dataframe contains a False.
It's an nxn dataframe, where the indices are labels.
So if all values were True except for even one cell, then I would want to return False.
Your question is a bit confusing, but assuming that you want to know whether there is at least one False in your dataframe, you could simply use
mask = df.mycol == False
mask.value_counts()
mask.sum()
mask.sum() > 0
Any of these will tell you the truth.
If you just want to scan your whole dataframe looking for a value, check df.values. It returns an array of all the rows in the dataframe.
value = False  # This is the value you're searching for
df_contains_value = any([value in row for row in df.values])
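As a hedged aside, for an all-boolean frame the same check can be done without a Python-level loop; a minimal sketch with made-up data:
import pandas as pd

df = pd.DataFrame([[True, True], [True, False]],
                  index=['r1', 'r2'], columns=['c1', 'c2'])

contains_false = not df.all().all()  # True if any cell is False
# equivalently: (df == False).any().any()
print(contains_false)  # True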