Pandas duplicated() returns some non-duplicate values? - pandas

I am trying to remove duplicates from a dataset.
Before using df.drop_duplicates(), I run df[df.duplicated()] to check which rows are treated as duplicates. Rows that I don't consider to be duplicates are returned, see the example below. All columns are checked.
How can I get accurate duplicate results and drop only the real duplicates?
city price year manufacturer cylinders fuel odometer
whistler 26880 2016.0 chrysler NaN gas 49000.0
whistler 17990 2010.0 toyota NaN hybrid 117000.0
whistler 15890 2010.0 audi NaN gas 188000.0
whistler 8800 2007.0 nissan NaN gas 163000.0

Encountered the same problem.
At first, it looks like
df.duplicated(subset='my_column_of_interest')
returns results which actually have unique values in the my_column_of_interest field.
This is not the case, though. The documentation shows that duplicated uses the keep parameter to decide which occurrences get flagged: keep='first' (the default) marks every occurrence except the first, keep='last' marks every occurrence except the last, and keep=False marks all of them.
This means that if a value is present twice in this column, df.duplicated(subset='my_column_of_interest') flags only the second occurrence (the first one is kept), so the rows it returns can look like unique values even though they are genuine duplicates.
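A minimal sketch of how keep changes what you see (the frame here is made up just for illustration):

import pandas as pd

df = pd.DataFrame({"city": ["whistler", "whistler", "squamish"],
                   "price": [26880, 26880, 15890]})

# Default keep='first': only the second occurrence is flagged as a duplicate.
print(df[df.duplicated()])

# keep=False flags every member of a duplicate group,
# which is usually more useful when inspecting duplicates.
print(df[df.duplicated(keep=False)])

# drop_duplicates() keeps the first occurrence of each group and drops the rest.
print(df.drop_duplicates())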

Related

SQL Impala convert NaN values to NULL values

I have the following column in my table:
col 1
1
3
NULL
NaN
5
"Bad" aggregations return NaN instead of NULL and the variable is type DOUBLE in the end.
I want to have one type of missing values only, hence I need to convert NULL to NaN or the other way around.
My problem is that when I partition with a window function, it does not treat NaNs as equal to NULLs and creates separate subgroups, which is something I do not want.
Any suggestions on how to convert them?

filter values in a dataframe column based on null values in a different column python dataframe

I've been stuck on this for a bit so hopefully someone has better guidance.
I currently have a dataframe that looks something like this (only with way more rows):
|"released_date"| "status" |
+-------------+--------+
| 12/12/20 |released|
+-------------+--------+
| 10/01/20 | NaN |
+-------------+--------+
| NaN | NaN |
+-------------+--------+
| NaN. |released|
+-------------+--------+
I wanted to do df['status'].fillna('released' if df.released_date.notnull())
i.e., fill any NaN value in the status column of df with "released" as long as df.released_date isn't a null value.
I keep getting error messages when I try this in different variations; the code above raises a syntax error, which I imagine is because notnull() returns a boolean array?
I feel like there is a simple answer for this and I somehow am not seeing it. I haven't found any questions where someone organizes something based on the null values in a dataframe, which makes me wonder whether my approach is ideal in the first place. How can I filter values in a dataframe column based on null values in a different column without using isnull() or notnull(), if those only return boolean arrays anyway? Using == Null doesn't seem to work either...
Try:
idx = df[(df['status'].isnull()) & (~df['released_date'].isnull())].index
df.loc[idx,'status'] = 'released'
First get the index of all rows where 'status' is null and 'released_date' is not null. Then use df.loc to update the status column.
Prints:
released_date status
0 12/12/20 released
1 10/01/20 released
2 NaN NaN
3 NaN released
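An equivalent sketch that skips the intermediate index and assigns through a boolean mask directly (the toy frame below just mirrors the question's columns):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "released_date": ["12/12/20", "10/01/20", np.nan, np.nan],
    "status": ["released", np.nan, np.nan, "released"],
})

# Rows where status is missing but released_date is present.
mask = df["status"].isnull() & df["released_date"].notnull()

# Assign through .loc with the boolean mask; no separate index needed.
df.loc[mask, "status"] = "released"
print(df)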

Unable to create new features in Machine learning

I have a dataset. I am using a pandas dataframe and named it df.
The dataset has 50,000 rows; here are the first 5:
Name_Restaurant cuisines_available Average cost
Food Heart Japnese, chinese 60$
Spice n Hungary Indian, American, mexican 42$
kfc, Lukestreet Thai, Japnese 29$
Brown bread shop American 11$
kfc, Hypert mall Thai, Japnese 40$
I want to create a column which contains the number of cuisines available.
I am trying this code:
df['no._of_cuisines_available']=df['cuisines_available'].str.len()
But instead of showing the number of cuisines, it shows the number of characters.
For example, for the first row the output should be 2, but it shows 17.
I also need a new column that contains the number of stores for each restaurant. For example, here kfc has 2 stores: kfc, Lukestreet and kfc, Hypert mall. I have no idea how to code this.
i)
df['cuisines_available'].str.split(',').apply(len)
ii)
df['Name_Restaurant'].str.split(',', expand=True).melt()['value'].str.strip().value_counts()
What ii) does: split the column at ',' and store the resulting strings in separate columns (expand=True). Then use melt to stack them into one long column, strip away surrounding spaces, and count the individual entries.
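A self-contained sketch of both steps, assuming a small frame with the same column names as in the question:

import pandas as pd

df = pd.DataFrame({
    "Name_Restaurant": ["Food Heart", "Spice n Hungary", "kfc, Lukestreet",
                        "Brown bread shop", "kfc, Hypert mall"],
    "cuisines_available": ["Japnese, chinese", "Indian, American, mexican",
                           "Thai, Japnese", "American", "Thai, Japnese"],
})

# i) number of cuisines: split on ',' and count the list items,
#    instead of counting characters with str.len().
df["no._of_cuisines_available"] = df["cuisines_available"].str.split(",").apply(len)

# ii) number of stores per name: split the name column on ',', stack all the
#     pieces into one long column, strip spaces and count the occurrences.
store_counts = (df["Name_Restaurant"].str.split(",", expand=True)
                .melt()["value"].str.strip().value_counts())

print(df)
print(store_counts)  # e.g. kfc appears 2 times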

Need explanation on how pandas.drop is working here

I have a dataframe, let's say xyz. I have written code to find out the percentage of null values each column has in the dataframe. My code is below:
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)
Let's say I got the following results:
abc 26.63
def 36.58
ghi 78.46
I want to drop column ghi because it has more than 70% of null values.
I achieved it using the following code:
xyz = xyz.drop(xyz.loc[:,round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70].columns, 1)
but I did not understand how this code works; can anyone please explain it?
the code is doing the following:
xyz.drop( [...], 1)
removes the specified elements for a given axis, either by row or by column. In this particular case, xyz.drop( ..., 1) means you're dropping by axis 1, i.e., columns.
xyz.loc[:, ... ].columns
will return a list with the column names resulting from your slicing condition
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70
this instruction counts the nulls in each column, sums them and normalizes by the number of rows, effectively computing the percentage of NaN in each column. The amount is then rounded to 2 decimal places, and finally it returns True if the percentage of NaN is more than 70%. Hence, you get a mapping between columns and a True/False array.
Putting everything together: you first produce a Boolean array that marks which columns have more than 70% NaN; then, using .loc, you apply Boolean indexing to look only at the columns you want to drop (NaN % > 70%); then, using .columns, you recover the names of those columns, which are finally passed to the .drop instruction.
Hopefully this clears things up!
If the code is hard to understand, you can just check dropna with thresh, since pandas already covers this case.
df=df.dropna(axis=1,thresh=round(len(df)*0.3))
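A small sketch that walks through the same expression step by step on a toy frame (the column names are invented for illustration):

import numpy as np
import pandas as pd

xyz = pd.DataFrame({
    "abc": [1, np.nan, 3, 4],        # 25% null -> kept
    "ghi": [np.nan] * 4,             # 100% null -> dropped
})

# Percentage of nulls per column, rounded to 2 decimals.
null_pct = round(100 * (xyz.isnull().sum() / len(xyz.index)), 2)

# Boolean Series indexed by column name: True where more than 70% is null.
too_sparse = null_pct > 70

# Boolean indexing on the columns, recover their names, then drop them.
xyz = xyz.drop(xyz.loc[:, too_sparse].columns, axis=1)

# The dropna shortcut: keep only columns with at least ~30% non-null values.
# xyz = xyz.dropna(axis=1, thresh=round(len(xyz) * 0.3))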

Replacing NaN values with group mean

I have a dataframe made of countries, years and many other features. There are many years for a single country:
country year population..... etc.
1 2000 5000
1 2001 NaN
1 2002 4800
2 2000
Now there are many NaNs in the dataframe.
I want to replace each NaN corresponding to a specific country, in every column, with that country's average for the column.
So, for example, for the NaN in the population column corresponding to country 1, year 2001, I want to use the average population for country 1 over all the years = (5000 + 4800) / 2.
Now I am using the groupby().mean() method to find the means for each country, but I am running into the following difficulties:
1. Some means come out as NaN even when I know for sure there is a value for them. Why is that?
2. How can I access specific values from the groupby result? In other words, how can I replace every NaN with its correct group average?
Thanks a lot.
Using combine_first with groupby mean
df.combine_first(df.groupby('country').transform('mean'))
Or
df.fillna(df.groupby('country').transform('mean'))
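A minimal sketch of the second option on a frame shaped like the question's data; transform('mean') is computed per country, so country 1's NaN is filled with (5000 + 4800) / 2 = 4900 (on recent pandas versions you may need to restrict the transform to numeric columns, as done here):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "country": [1, 1, 1, 2, 2],
    "year": [2000, 2001, 2002, 2000, 2001],
    "population": [5000, np.nan, 4800, 3000, np.nan],
})

# Per-country mean of the numeric column, broadcast back to the original shape.
group_means = df.groupby("country")[["population"]].transform("mean")

# Fill each NaN with the matching country's mean.
df = df.fillna(group_means)
print(df)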