I want to replace outliers instead of completely removing them... Any suggestions? - pandas

I have a dataset in which one column has outliers, and these outliers depend on another column that has 12 different categories.
So, I want to replace each outlier with the mean of its category.
For example:
column A has market_01, market_02, ..., market_12 and
column B has int values 984, 678, 1326, 887, ....., 710, .....
Here, I want to replace the outlier value 1326 with the mean of its corresponding category (market_02) rather than the overall values.mean().

Try:
via mask() + groupby() + transform():
# First, compute the per-market mean:
m = df.groupby('market')['values'].transform('mean').round(2)
# Then replace the outlier with it:
df['values'] = df['values'].mask(df['values'].eq(1326), m)
OR
via np.where() with groupby() + transform():
import numpy as np
m = df.groupby('market')['values'].transform('mean').round(2)
df['values'] = np.where(df['values'].eq(1326), m, df['values'])
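For context, a minimal runnable sketch of the mask() route with made-up data (column names 'market' and 'values' as in the question); transform('mean') broadcasts each market's mean back onto its rows, so mask() only swaps the flagged value:
import pandas as pd

# Toy data: 1326 is the outlier and belongs to market_02
df = pd.DataFrame({
    'market': ['market_01', 'market_02', 'market_02', 'market_03'],
    'values': [984, 678, 1326, 887],
})

# Per-row mean of that row's own market
m = df.groupby('market')['values'].transform('mean').round(2)

# Replace only the outlier; every other row keeps its original value
df['values'] = df['values'].mask(df['values'].eq(1326), m)
print(df)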

Step 1: Calculate the mean of the values corresponding to the market.
mean_val = df[df['market'] == 'market_02']['values'].mean()
Step 2: Now replace all values greater than (or equal to) the value that you believe is the outlier with the mean calculated above.
df['values'] = df['values'].apply(lambda x: mean_val if x >= 1326 else x)

Thank you #Pratik and #Anurag for your answers.
I have solved this with the approach below; this way all of my outliers are replaced.
t = df.groupby('market')['values']
df['values'] = (
    df['values'].where(
        t.transform('quantile', q=0.75) > df['values'],
        t.transform('median')))
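A minimal runnable sketch of that quantile/median rule on made-up data (the 75th-percentile cutoff and the median replacement are the question author's choices, not a general recipe):
import pandas as pd

df = pd.DataFrame({
    'market': ['market_01'] * 4 + ['market_02'] * 4,
    'values': [984, 990, 1001, 975, 678, 700, 1326, 710],
})

g = df.groupby('market')['values']

# Keep values below their market's 75th percentile; replace the rest
# with that market's median. Note that a group's maximum is always at or
# above its 75th percentile, so this rule rewrites at least one value per group.
df['values'] = df['values'].where(
    g.transform('quantile', q=0.75) > df['values'],
    g.transform('median'))
print(df)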

Related

Finding the mean of a column; but excluding a singular value

Imagine I have a dataset that is like so:
ID birthyear weight
0 619040 1962 0.1231231
1 600161 1963 0.981742
2 25602033 1963 1.3123124
3 624870 1987 10,000
and I want to get the mean of the weight column, but the obvious 10,000 outlier is skewing the actual mean. In this situation I cannot change the value and must work around it. This is what I've got so far, but obviously it still includes that last value:
avg_num_items = df_cleaned['trans_quantity'].mean()
translist = df_cleaned['trans_quantity'].tolist()
My dataframe is df_cleaned and the column I'm actually working with is 'trans_quantity', so how do I compute the mean while working around that value?
Since you added SQL to your tags: in SQL you'd exclude it in the WHERE clause:
SELECT AVG(trans_quantity)
FROM your_data_base
WHERE trans_quantity <> 10000
In Pandas:
avg_num_items = df_cleaned[df_cleaned["trans_quantity"] != 10000]["trans_quantity"].mean()
You can also replace the value with NaN (np.nan from NumPy) and skip it when taking the mean:
avg_num_items = df_cleaned["trans_quantity"].replace(10000, np.nan).mean(skipna=True)
With pandas, ensure you have numeric data (10,000 is a string), filter out the values at or above the threshold, and take the mean:
(pd.to_numeric(df['weight'], errors='coerce')
   .loc[lambda x: x < 10000]
   .mean()
)
output: 0.8057258333333334
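As a self-contained check of that approach, a sketch built from the question's sample rows (errors='coerce' turns the unparseable '10,000' into NaN, the filter guards against a genuinely numeric 10000, and .mean() skips NaN by default):
import pandas as pd

df = pd.DataFrame({
    'ID': [619040, 600161, 25602033, 624870],
    'birthyear': [1962, 1963, 1963, 1987],
    'weight': ['0.1231231', '0.981742', '1.3123124', '10,000'],
})

clean = pd.to_numeric(df['weight'], errors='coerce').loc[lambda x: x < 10000]
print(clean.mean())  # 0.8057258333333334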

How to calculate the difference between row values based on another column value without filtering the values in between

I want to calculate the difference between row values based on another column's value, without filtering out the rows in between. Specifically, I want the difference between the seconds values for rows where turn_marker == 1, but when I use the following method it filters out all the zero rows, and I need those because I need the entire data set.
My data set has a column called turn_marker that holds the values zero and 1, and another column with seconds. I want to calculate the time between those rows where turn_marker is equal to 1.
dataframe = main_dataframe.query("turn_marker=='1;'")
main_dataframe["seconds_diff"] = dataframe["seconds"].diff()
main_dataframe
I would be grateful if you could help me.
You can do this:
main_dataframe['indx'] = main_dataframe.index
main_dataframe['diff'] = main_dataframe.sort_values(by=['turn_marker', 'indx'], ascending=[False, True])['seconds'].diff()
main_dataframe.loc[main_dataframe.turn_marker == '0;', 'diff'] = np.nan
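An alternative sketch on made-up data (assuming, as in the query above, that turn_marker is stored as the strings '1;' and '0;'): take the diff only over the marked rows and assign it back, so every row of the original frame is kept and the unmarked rows simply get NaN:
import pandas as pd

main_dataframe = pd.DataFrame({
    'turn_marker': ['0;', '1;', '0;', '0;', '1;', '1;'],
    'seconds': [0.0, 4.0, 6.0, 9.0, 13.0, 20.0],
})

mask = main_dataframe['turn_marker'] == '1;'

# diff() runs only on the marked rows; index alignment writes the results
# back onto those same rows and leaves NaN everywhere else
main_dataframe['seconds_diff'] = main_dataframe.loc[mask, 'seconds'].diff()
print(main_dataframe)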

Pandas identifying if any element is in a row

I have a data frame with a single row of numerical values, and I want to know if any of those values is greater than 2; if so, I want to create a new column containing the word 'Diff'.
Col_,F_1,F_2
1,5,0
My dataframe is diff_df. Here is one thing I tried:
c = diff_df > 2
if c.any():
    diff_df['difference'] = 'Difference'
If I were to print c, it would be:
Col_,F_1,F_2
False,True,False
I have tried c.all() and many iterations of other things. Clearly my inexperience is holding me back, and Google is not helping in this regard. Everything I try gives either "The truth value of a Series (or DataFrame) is ambiguous; use a.any(), a.all()...." Any help would be appreciated.
Since it is only one row, take the .max().max() of the dataframe. The first .max() gives you the max of each column; the second .max() takes the max across all of those columns.
if diff_df.max().max() > 2: diff_df['difference'] = 'Difference'
output:
Col_ F_1 F_2 difference
0 1 5 0 Difference
Use the .loc accessor with .gt() to query and, at the same time, create and populate the new column:
df.loc[df.gt(2).any(axis=1), "difference"] = 'Difference'
Col_ F_1 F_2 difference
0 1 5 0 Difference
In addition to David's response you may also try this (counting how many values exceed 2 and checking that there is at least one):
if ((df > 2).astype(int)).sum(axis=1).values[0] >= 1:
    df['difference'] = 'Difference'
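For reference, a minimal runnable sketch of the .gt(...).any(axis=1) route on the question's one-row frame:
import pandas as pd

diff_df = pd.DataFrame({'Col_': [1], 'F_1': [5], 'F_2': [0]})

# True for each row that contains at least one value greater than 2
row_has_big_value = diff_df.gt(2).any(axis=1)

diff_df.loc[row_has_big_value, 'difference'] = 'Difference'
print(diff_df)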

Conditional If Statement: If value in row starts with letter in string … set another column with some corresponding value

I have the 'Field_Type' column filled with strings and I want to derive the values in the 'Units' column using an if statement.
The Units column shows the desired result; essentially I want to call out what type of activity is occurring.
I tried to do this with the code below, but it won't run (the error is shown underneath). Any help is greatly appreciated!
create_table['Units'] = pd.np.where(create_table['Field_Name'].str.startswith("W"), "MW",
    pd.np.where(create_table['Field_Name'].str.contains("R"), "MVar",
        pd.np.where(create_table['Field_Name'].str.contains("V"), "Per Unit")))
ValueError: either both or neither of x and y should be given
You can write a function to define your conditionals, then use apply on the dataframe and pass the function:
def unit_mapper(row):
    if row['Field_Type'].startswith('W'):
        return 'MW'
    elif 'R' in row['Field_Type']:
        return 'MVar'
    elif 'V' in row['Field_Type']:
        return 'Per Unit'
    else:
        return 'N/A'
And then
create_table['Units'] = create_table.apply(unit_mapper, axis=1)
In your text you talk about Field_Type, but you are using Field_Name in your example. Which one is correct?
You want to do something like:
create_table.loc[create_table['Field_Type'].str.startswith('W'), 'Units'] = 'MW'
create_table.loc[create_table['Field_Type'].str.startswith('R'), 'Units'] = 'MVar'
create_table.loc[create_table['Field_Type'].str.startswith('V'), 'Units'] = 'Per Unit'
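A vectorized alternative, sketched with hypothetical sample data and the Field_Type name from the question text: numpy.select takes one choice per condition plus a default, which sidesteps the "either both or neither of x and y should be given" error that the nested np.where raised.
import numpy as np
import pandas as pd

create_table = pd.DataFrame({'Field_Type': ['W_gen', 'R_load', 'V_bus', 'other']})

conditions = [
    create_table['Field_Type'].str.startswith('W'),
    create_table['Field_Type'].str.contains('R'),
    create_table['Field_Type'].str.contains('V'),
]
choices = ['MW', 'MVar', 'Per Unit']

# Conditions are checked in order; rows matching none of them get the default
create_table['Units'] = np.select(conditions, choices, default='N/A')
print(create_table)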

Need explanation on how pandas.drop is working here

I have a data frame, let's say xyz. I have written code to find out the % of null values each column possesses in the dataframe. My code is below:
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)
Let's say I got the following results:
abc 26.63
def 36.58
ghi 78.46
I want to drop column ghi because it has more than 70% null values.
I achieved it using the following code:
xyz = xyz.drop(xyz.loc[:,round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70].columns, 1)
but I did not understand how this code works; can anyone please explain it?
The code is doing the following:
xyz.drop( [...], 1)
removes the specified elements along a given axis, either by row or by column. In this particular case, df.drop(..., 1) means you're dropping along axis 1, i.e., columns.
xyz.loc[:, ... ].columns
returns the column names resulting from your slicing condition.
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70
This instruction counts the nulls in each column and normalizes by the number of rows, effectively computing the percentage of NaN in each column. The result is rounded to 2 decimal places, and the comparison returns True where the share of NaN is more than 70%. Hence, you get a mapping between columns and a True/False array.
Putting everything together: you first produce a Boolean array that marks which columns have more than 70% NaN; then, using .loc, you apply Boolean indexing to look only at the columns you want to drop (NaN % > 70%); then .columns recovers the names of those columns, which are finally passed to .drop.
Hopefully this clears things up!
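To make that decomposition concrete, a small sketch on hypothetical data that prints each intermediate piece of the expression:
import numpy as np
import pandas as pd

xyz = pd.DataFrame({
    'abc': [1, 2, np.nan, 4],
    'def': [1, np.nan, np.nan, 4],
    'ghi': [np.nan, np.nan, np.nan, 4],
})

pct_null = round(100 * (xyz.isnull().sum() / len(xyz.index)), 2)
print(pct_null)                            # abc 25.0, def 50.0, ghi 75.0
print(pct_null > 70)                       # Boolean Series: only ghi is True
print(xyz.loc[:, pct_null > 70].columns)   # Index(['ghi'], dtype='object')

xyz = xyz.drop(xyz.loc[:, pct_null > 70].columns, axis=1)
print(xyz.columns)                         # 'ghi' is gone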
If the code is hard to understand, you can just use dropna with thresh, since pandas already covers this case:
df = df.dropna(axis=1, thresh=round(len(df) * 0.3))
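A quick sketch of the thresh behaviour on hypothetical data: thresh is the minimum number of non-NaN values a column must have to survive, so requiring 30% non-NaN drops any column that is more than 70% NaN.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'abc': [1, 2, 3, np.nan, 5, 6, 7, 8, 9, 10],   # 10% NaN -> kept
    'ghi': [1] + [np.nan] * 9,                      # 90% NaN -> dropped
})

df = df.dropna(axis=1, thresh=round(len(df) * 0.3))
print(df.columns)  # Index(['abc'], dtype='object')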