How to plot outliers with regard to unique ids - pandas

I have an item_code column in my data and another column, sales, which represents the sales quantity for the particular item.
The data can contain a particular item id many times. Other columns tell these entries apart.
I want to plot only the outlier sales for each item (because data has thousands of different item ids, plotting every entry can be difficult).
Since I'm very new to this, what is the right way and tool to do this?

You can use pandas. You will need to choose a method for detecting outliers; here is one example.
If you want outliers across all sales (not per group), you can use apply with a function (a lambda here) to build a boolean mask that selects the outlier rows.
import numpy as np
import pandas as pd
# IPython / Jupyter only
%matplotlib inline
df = pd.DataFrame({'item_id': [1, 1, 2, 1, 2, 1, 2],
                   'sales': [0, 2, 30, 3, 30, 30, 55]})
# keep the rows whose sales are more than one standard deviation from the mean
df[df.apply(lambda x: np.abs(x.sales - df.sales.mean()) / df.sales.std() > 1, axis=1)
   ].set_index('item_id').plot(style='.', color='red')
In this example we generate a sample data set and select the points that lie more than one standard deviation from the mean, i.e. where |sales - mean| / std > 1 (you can try another method). We then plot them with the item id on the x axis and the sales quantity on the y axis. This method flags the points with sales 0 and 55. If you want to search for outliers within groups, group the data first.
df.groupby('item_id').apply(lambda data: data.loc[
    data.apply(lambda x: np.abs(x.sales - data.sales.mean()) / data.sales.std() > 1, axis=1)
]).set_index('item_id').plot(style='.', color='red')
In this example we get the points 30 and 55: 0 is not an outlier within the group where item_id = 1, but 30 is.
Is this what you want to do? I hope it helps you get started.
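For thousands of item ids, the nested apply calls above may get slow. A possible alternative, sketched here under the same "more than one standard deviation from the group mean" rule, is to compute the per-group mean and std with groupby(...).transform and filter with a boolean mask. The helper name group_outliers and its parameters are made up for this illustration; they are not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({'item_id': [1, 1, 2, 1, 2, 1, 2],
                   'sales': [0, 2, 30, 3, 30, 30, 55]})


def group_outliers(frame, key='item_id', value='sales', n_std=1.0):
    """Return rows whose value is more than n_std group-stds from the group mean.

    Hypothetical helper for illustration only.
    """
    grouped = frame.groupby(key)[value]
    # transform broadcasts each group's statistic back onto the original rows
    mean = grouped.transform('mean')
    std = grouped.transform('std')
    z = (frame[value] - mean).abs() / std
    # groups with a single row have std = NaN, so they are never flagged
    return frame[z > n_std]


outliers = group_outliers(df)
```

The result can then be plotted the same way as before, for example with outliers.set_index('item_id')['sales'].plot(style='.', color='red').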

Related

Sample random row from df.groupby("column1")["column2"].max() and not first one if multiple candidates

What would be the correct way to return n random max values from a groupby?
I have a dataframe containing audio events, with the following columns:
audio
start_time
end_time
duration
labelling confidence (1 to 5)
label ("Ambulance", "Engine", ...)
I have multiple events/rows for each label and I have 26 labels in total.
What I would like to achieve is to get one event per label with max confidence.
Let's say we have 7 events that have label "Ambulance" and they have the following labelling confidence: 2, 5, 5, 4, 4, 3, 5.
The max confidence is 5 in this case, which gives us 3 selectable events.
I would like to get one of the three at random.
Doing the following with pandas: df.groupby("label").max() will return the first row with max labelling confidence. I would like it to be a random selection.
Many thanks in advance
Cheers
Antoine
Edit: following a comment from the OP, the simplest solution is to shuffle the data frame before picking the max rows:
import numpy as np
import pandas as pd
# Some random data
labels = list('ABCDE')
repeats = np.random.randint(1, 6, len(labels))
df = pd.DataFrame({
    'label': np.repeat(labels, repeats),
    'confidence': np.random.randint(1, 6, repeats.sum())
})
# Shuffle the data frame. For each `label` get the first row,
# which we can be sure has the max `confidence` because we
# sorted it
(
    df.sample(frac=1)
    .sort_values(['label', 'confidence'], ascending=[True, False])
    .groupby('label')
    .head(1)
)
If you are running this in IPython / Jupyter Notebook, watch the index of the resulting data frame to see the randomness of the result.
Here is how I finally managed to do it:
shuffled_df = df.sample(frac=1)
filtered_df = shuffled_df.loc[shuffled_df.groupby("label")["confidence"].idxmax()]
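As a small self-contained check of the shuffle-then-idxmax idea, here is a sketch using the "Ambulance" confidences from the question. The function name pick_random_max and the "Engine" rows are invented for the example, and it assumes the data frame has a unique index (idxmax returns index labels, which are then looked up with .loc).

```python
import pandas as pd


def pick_random_max(df, group_col, value_col, random_state=None):
    """Pick one row per group among those with the group's max value.

    Hypothetical helper; assumes df has a unique index.
    """
    # Shuffling first means idxmax (first occurrence of the max)
    # lands on a random one of the tied rows.
    shuffled = df.sample(frac=1, random_state=random_state)
    return shuffled.loc[shuffled.groupby(group_col)[value_col].idxmax()]


events = pd.DataFrame({
    'label': ['Ambulance'] * 7 + ['Engine'] * 3,   # 'Engine' rows are made up
    'confidence': [2, 5, 5, 4, 4, 3, 5, 1, 3, 3],
})
picked = pick_random_max(events, 'label', 'confidence')
```

Running it several times should return index 1, 2 or 6 for "Ambulance"; passing random_state makes the choice reproducible.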

How to bin a numerical pandas Series into n groups of approximately the same size without qcut?

I would like to split my series into exactly n groups (assuming there are at least n distinct values in the series), where the group sizes are approximately equal.
The code needs to be generic, so I cannot know the distribution of the data in advance, hence using pd.cut with pre-defined bins is not an option for me.
I tried using pd.qcut or pd.cut with pd.Series.quantile but they all fall short when some value is repeated very often in the series.
For instance, if I want exactly 3 groups:
series = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 4, 4, 4, 4])
pd.qcut(series, q=3, duplicates="drop")
creates only 2 categories: Categories (2, interval[float64]): [(0.999, 3.0] < (3.0, 4.0]], whereas I would like to get something like [(0.999, 1.0] < (1.0, 3.0] < (3.0, 4.0]].
Is there any way to do this easily with pandas' built-in methods?
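To make the collapse visible, the quantile edges that the 3-way split is based on can be printed directly. This is only a sketch reproducing the behaviour described above, not a solution; the variable names are chosen for the illustration.

```python
import pandas as pd

series = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 4, 4, 4, 4])
n = 3

# Quantiles at 0, 1/3, 2/3 and 1: the first two coincide because
# more than a third of the values are equal to 1.
edges = series.quantile([i / n for i in range(n + 1)]).tolist()

# With duplicates="drop" the repeated edge is removed, leaving 2 bins.
binned = pd.qcut(series, q=n, duplicates="drop")
n_categories = len(binned.cat.categories)
```

Here edges contains the value 1 twice, which is why only two intervals survive.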

Rolling means in Pandas dataframe

I am trying to run some computations on DataFrames. I want to compute the average difference between two sets of rolling means. To be more specific, the average of the difference between a long-term mean (windows in lst) and a shorter-term one (windows in lst_2). I am trying to combine the calculation with a double for loop as follows:
import pandas as pd
import numpy as np
def main(df):
    df=df.pct_change()
    lst=[100,150,200,250,300]
    lst_2=[5,10,15,20]
    result=pd.DataFrame(np.sum([calc(df,T,t) for T in lst for t in lst_2]))/(len(lst)+len(lst_2))
    return result
def calc(df,T,t):
    roll=pd.DataFrame(np.sign(df.rolling(t).mean()-df.rolling(T).mean()))
    return roll
Overall I should have 20 differences (5 and 100, 10 and 100, 15 and 100 ... 20 and 300); I take the sign of the difference and I want the average of these differences at each point in time. Ideally the result would be a dataframe result.
I got the error: cannot copy sequence with size 3951 to array axis with dimension 1056 when it runs the double for loops. Obviously I understand that due to rolling of different T and t, the dimensions of the dataframes are not equal when it comes to the array conversion (with np.sum), but I thought it would put "NaN" to align the dimensions.
Hope I have been clear enough. Thank you.
As requested in the comments, here is an example. Let's suppose the following
dataframe:
df = pd.DataFrame({'A': [100,101.4636,104.9477,106.7089,109.2701,111.522,113.3832,113.8672,115.0718,114.6945,111.7446,108.8154]},index=[0, 1, 2, 3,4,5,6,7,8,9,10,11])
df=df.pct_change()
and I have the following 2 sets of mean I need to compute:
lst=[8,10]
lst_1=[3,4]
Then I follow these steps:
1/
I want to compute the rolling mean(3) - rolling mean(8), and get the sign of it:
roll=np.sign(df.rolling(3).mean()-df.rolling(8).mean())
This should return the following:
roll = pd.DataFrame({'A': [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,-1,-1,-1,-1]},index=[0, 1, 2, 3,4,5,6,7,8,9,10,11])
2/
I redo step 1 with the combination of differences 3-10 ; 4-8 ; 4-10. So I get overall 4 roll dataframes.
roll_3_8 = pd.DataFrame({'A': [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,-1,-1,-1,-1]},index=[0, 1, 2, 3,4,5,6,7,8,9,10,11])
roll_3_10 = pd.DataFrame({'A': [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,-1,-1]},index=[0, 1, 2, 3,4,5,6,7,8,9,10,11])
roll_4_8 = pd.DataFrame({'A': [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,-1,-1,-1,-1]},index=[0, 1, 2, 3,4,5,6,7,8,9,10,11])
roll_4_10 = pd.DataFrame({'A': [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,-1,-1]},index=[0, 1, 2, 3,4,5,6,7,8,9,10,11])
3/
Now that I have all the diffs, I simply want the average of them, so I sum all the 4 rolling dataframes, and I divide it by 4 (number of differences computed). The results should be (before dropping all N/A values):
result = pd.DataFrame({'A': [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,-1,-1]},index=[0, 1, 2, 3,4,5,6,7,8,9,10,11])
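One way the three steps above could be written is sketched below. It keeps every rolling result as a DataFrame and adds them with Python's built-in sum, so pandas aligns on the index and NaN propagates wherever any window is still incomplete. The function name average_sign is invented for this sketch, and the divisor is the number of (long, short) pairs, as described in step 3.

```python
import numpy as np
import pandas as pd


def average_sign(df, long_windows, short_windows):
    """Average of sign(short rolling mean - long rolling mean) over all pairs.

    Hypothetical helper for illustration only.
    """
    returns = df.pct_change()
    signs = [
        np.sign(returns.rolling(t).mean() - returns.rolling(T).mean())
        for T in long_windows
        for t in short_windows
    ]
    # Built-in sum adds the DataFrames index-aligned; a NaN in any
    # term makes the total NaN at that row.
    return sum(signs) / len(signs)


prices = pd.DataFrame(
    {'A': [100, 101.4636, 104.9477, 106.7089, 109.2701, 111.522,
           113.3832, 113.8672, 115.0718, 114.6945, 111.7446, 108.8154]},
    index=range(12),
)
result = average_sign(prices, long_windows=[8, 10], short_windows=[3, 4])
```

On this sample the result is NaN for rows 0 to 9 and -1 for rows 10 and 11, matching the expected output in step 3.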

Find subgroups of a numpy array

I have a numpy array like this one:
A = ([249, 250, 3016, 3017, 5679, 5680, 8257, 8258,
10756, 10757, 13178, 13179, 15531, 15532, 17824, 17825,
20058, 20059, 22239, 22240, 24373, 24374, 26455, 26456,
28491, 28492, 30493, 30494, 32452, 32453, 34377, 34378,
36264, 36265, 38118, 38119, 39939, 39940, 41736, 41737,
43501, 43502, 45237, 45238, 46950, 46951, 48637, 48638])
I would like to write a small script that finds the subgroups of values in the array for which the difference is smaller than a certain threshold, let's say 3, and that returns the highest value of each subgroup. For the array A the output should be:
A_out =([250,3017,5680,8258,10757,13179,...])
Is there a numpy function for that?
Here's a vectorized Numpy approach.
First, the data (in a numpy array) and the threshold:
In [41]: A = np.array([249, 250, 3016, 3017, 5679, 5680, 8257, 8258,
10756, 10757, 13178, 13179, 15531, 15532, 17824, 17825,
20058, 20059, 22239, 22240, 24373, 24374, 26455, 26456,
28491, 28492, 30493, 30494, 32452, 32453, 34377, 34378,
36264, 36265, 38118, 38119, 39939, 39940, 41736, 41737,
43501, 43502, 45237, 45238, 46950, 46951, 48637, 48638])
In [42]: threshold = 3
The following produces the array delta. It is almost the same as delta = np.diff(A), but I want to include one more value that is greater than the threshold at the end of delta.
In [43]: delta = np.hstack((np.diff(A), threshold + 1))
Now the group maxima are simply A[delta > threshold]:
In [46]: A[delta > threshold]
Out[46]:
array([ 250, 3017, 5680, 8258, 10757, 13179, 15532, 17825, 20059,
22240, 24374, 26456, 28492, 30494, 32453, 34378, 36265, 38119,
39940, 41737, 43502, 45238, 46951, 48638])
Or, if you want, A[delta >= threshold]. That gives the same result for this example:
In [47]: A[delta >= threshold]
Out[47]:
array([ 250, 3017, 5680, 8258, 10757, 13179, 15532, 17825, 20059,
22240, 24374, 26456, 28492, 30494, 32453, 34378, 36265, 38119,
39940, 41737, 43502, 45238, 46951, 48638])
There is a case where this answer differs from @DrV's answer. From your description, it isn't clear to me how a set of values such as 1, 2, 3, 4, 5, 6 should be handled. The consecutive differences are all 1, but the difference between the first and last is 5. The numpy calculation above will treat these as a single group. @DrV's answer will create two groups.
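The calculation above can be wrapped in a small function to try it on the 1, 2, 3, 4, 5, 6 case. This is just a sketch; the name group_maxima is made up here, and it assumes a non-empty array sorted in ascending order.

```python
import numpy as np


def group_maxima(values, threshold):
    """Last (largest) element of each run whose consecutive gaps are <= threshold.

    Hypothetical helper; assumes values is non-empty and sorted ascending.
    """
    values = np.asarray(values)
    # Append a gap larger than the threshold so the final element
    # is always reported as the end of a group.
    delta = np.append(np.diff(values), threshold + 1)
    return values[delta > threshold]


pairs = group_maxima([249, 250, 3016, 3017, 5679, 5680], 3)
chain = group_maxima([1, 2, 3, 4, 5, 6], 3)
```

For the chain 1 to 6 this returns a single value, 6, because no consecutive gap exceeds the threshold.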
Interpretation 1: The value of an item in a group must not differ more than 3 units from that of the first item of the group
This is one of the things where NumPy's capabilities are at their limits. As you will have to iterate through the list, I suggest a pure Python approach:
first_of_group = A[0]
previous = A[0]
group_lasts = []
for a in A[1:]:
    # if this item no longer belongs to the group
    if abs(a - first_of_group) > 3:
        group_lasts.append(previous)
        first_of_group = a
    previous = a
# add the last item separately, because it is always a last of the group
group_lasts.append(previous)
Now you have the group lasts in group_lasts.
Using any NumPy array functionality here does not seem to provide much help.
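For comparison, here is the loop for Interpretation 1 wrapped in a function and applied to the values 1 to 6. The function name and the threshold parameter are added for this sketch only; it assumes a non-empty, ascending sequence.

```python
def group_lasts_from_first(values, threshold=3):
    """Last item of each group, where a group spans at most `threshold`
    units measured from its first item.

    Hypothetical helper; assumes values is non-empty and sorted ascending.
    """
    first_of_group = values[0]
    previous = values[0]
    lasts = []
    for a in values[1:]:
        # this item is too far from the group's first item: close the group
        if abs(a - first_of_group) > threshold:
            lasts.append(previous)
            first_of_group = a
        previous = a
    # the final item always closes the last group
    lasts.append(previous)
    return lasts


chain_groups = group_lasts_from_first([1, 2, 3, 4, 5, 6])
pair_groups = group_lasts_from_first([249, 250, 3016, 3017])
```

Here the chain is split into two groups ending at 4 and 6, whereas comparing only consecutive items would keep it as one group.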
Interpretation 2: The value of an item in a group must not differ more than 3 units from the previous item
This is easier, as we can easily form a list of group breaks as in Warren Weckesser's answer. Here NumPy is of a lot of help.

Need help plot 1 to n y-series with matplotlib

My problem is that the user of my script wants to be able to plot 1 to n graphs, one per account (e.g. 1930, 1940, etc.), showing the sum for each account for every year.
The graph I want to plot should look like this (in this ex 2 accounts(1930 and 1940) and sum for every account for every year):
The input for the graph printing is like this (The user of the script should be able to choose as many accounts as the user wants 1-n):
How many accounts to print graphs for? 2
Account 1 :
1930
Account 2 :
1940
The system will store the accounts in an array (accounts = [1930,1940]) and look up the sum for every account for every year. The years and sums for the accounts are placed in a matrix ([[2008, 1, 12], [2009, 7, 30], [2010, 13, 48], [2011, 19, 66], [2012, 25, 84], [2013, 31, 102]]).
When this is done I want to plot 1 - n graphs (in this case 2 graphs). But I can't figure out how to plot with 1 - n accounts...
For the moment I just use this code to print the graph and it's just static :(:
#fix the x serie
x_years = []
for i in range (nrOfYearsInXML):
    x_years.append(matrix[x][0])
    x = x + 1
plt.xticks(x_years, map(str,x_years))
#fix the y series, how to solve the problem if the user shows 1 - n accounts?
konto1_summa = [1, 7, 13, 19, 25, 31]
konto2_summa = [12, 30, 48, 66, 84, 102]
plt.plot(x_years, konto1_summa, marker='o', label='1930')
plt.plot(x_years, konto2_summa, marker='o', label='1940')
plt.xlabel('Year')
plt.ylabel('Summa')
plt.title('Sum for account per year')
plt.legend()
plt.show()
OK, so I have tried with for loops etc., but I have not been able to figure out how to handle 1 to n accounts with a unique label for each account.
My scenario is that the user chooses 1 to n accounts and specifies them (e.g. 1930, 1940, 1950...). The accounts are stored in an array. The system calculates the sum for each account for every year and places this data in the matrix. The system then reads from the accounts array and the matrix and plots 1 to n graphs, each with its account label.
A shorter version of the problem...
For example if I have the x values (the years 2008-2013) and the y values (the sum for the accounts for every year) in a matrix and the accounts(should also be used as label) in an array like this:
accounts = [1930,1940]
matrix = [[2008, 1, 12], [2009, 7, 30], [2010, 13, 48], [2011, 19, 66], [2012, 25, 84], [2013, 31, 102]]
Or I can explain x and y like this:
x y1(1930 graph1) y2(1940 graph2)
2008 1 12
2009 7 30
2010 13 48
etc etc etc
The problem for me is that the user can choose one to many accounts (accounts [1..n]) and this will result in 1 to many account graphs.
Any idea how to solve it.. :)?
BR/M
I don't quite understand what you are asking, but I think this is what you want:
# set up axes
fig, ax = plt.subplots(1, 1)
ax.set_xlabel('xlab')
ax.set_ylabel('ylab')
# loop and plot
for j in range(n):
    x, y = get_data(j) # whatever function you use to get your data
    lab = get_label(j)
    ax.plot(x, y, label=lab)
ax.legend()
plt.show()
More concretely, assuming you have the matrix structure you posted above:
# first, use numpy, you have it installed anyway if matplotlib is working
# and it will make your life much nicer
import numpy as np
data = np.array(data_list_of_lists)
x = data[:,0]
for j in range(len(accounts)):
    y = data[:, j+1]
    ax.plot(x, y, label=accounts[j])
A better way to do this is to store your data in a dict:
data_dict = {}
data_dict[1940] = (x_data_1940, y_data_1940)
data_dict[1930] = (x_data_1930, y_data_1930)
# ...
for k in accounts:
    x,y = data_dict[k]
    ax.plot(x, y, label=k)
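Putting the pieces together for the accounts and matrix layout from the question, a minimal sketch could look like the following. The function names series_by_account and plot_accounts are invented for this example; plot_accounts only assumes it is handed a matplotlib Axes object, so the block itself does not import matplotlib.

```python
def series_by_account(accounts, matrix):
    """Split rows of [year, sum_1, ..., sum_n] into per-account series.

    Hypothetical helper: column j + 1 of the matrix belongs to accounts[j].
    """
    years = [row[0] for row in matrix]
    sums = {account: [row[j + 1] for row in matrix]
            for j, account in enumerate(accounts)}
    return years, sums


def plot_accounts(ax, accounts, matrix):
    """Draw one labelled line per account on a matplotlib Axes `ax`."""
    years, sums = series_by_account(accounts, matrix)
    for account in accounts:
        ax.plot(years, sums[account], marker='o', label=str(account))
    ax.set_xticks(years)
    ax.set_xlabel('Year')
    ax.set_ylabel('Summa')
    ax.set_title('Sum for account per year')
    ax.legend()


accounts = [1930, 1940]
matrix = [[2008, 1, 12], [2009, 7, 30], [2010, 13, 48],
          [2011, 19, 66], [2012, 25, 84], [2013, 31, 102]]
years, sums = series_by_account(accounts, matrix)
```

To draw it, create the axes with fig, ax = plt.subplots(), call plot_accounts(ax, accounts, matrix) and finish with plt.show(). Because the loop runs over whatever is in accounts, the same code should handle 1 to n accounts as long as the matrix has one column per account.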