Outliers in data - pandas

I have a dataset like this:
15643, 14087, 12020, 8402, 7875, 3250, 2688, 2654, 2501, 2482, 1246, 1214, 1171, 1165, 1048, 897, 849, 579, 382, 285, 222, 168, 115, 92, 71, 57, 56, 51, 47, 43, 40, 31, 29, 29, 29, 29, 28, 22, 20, 19, 18, 18, 17, 15, 14, 14, 12, 12, 11, 11, 10, 9, 9, 8, 8, 8, 8, 7, 6, 5, 5, 5, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
Based on domain knowledge, I know that the larger values are the only ones we want to include in our analysis. How do I determine where to cut off the analysis? Should I exclude values of 15 and lower, or 50 and lower, and so on?

You can check the distribution with NumPy's quantile function and then remove the values below the 1st or 2nd percentile. For example:
import numpy as np
data = np.array(data)
print(np.quantile(data, (.01, .02)))
Another method is to calculate the interquartile range (IQR) and set the lower bound for the analysis to Q1 - 1.5 * IQR:
Q1, Q3 = np.quantile(data, (0.25, 0.75))
data_floor = Q1 - 1.5 * (Q3 - Q1)
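To compare the two suggestions side by side, it can help to count how many values each cutoff would actually keep. The following is a minimal sketch, not a definitive recipe; the array is a shortened, hypothetical sample with the same right-skewed shape as the data in the question, and keeping values at or above the cutoff is an assumption about how "remove values below" is meant.

```python
import numpy as np

# Shortened, hypothetical sample with the same right-skewed shape as the data above.
data = np.array([15643, 8402, 2482, 1048, 285, 71, 40, 22, 14, 9, 5, 3, 2, 1, 1, 1])

# Low-quantile cutoffs, as in the first suggestion.
q01, q02 = np.quantile(data, (0.01, 0.02))

# IQR-based floor, as in the second suggestion.
q1, q3 = np.quantile(data, (0.25, 0.75))
data_floor = q1 - 1.5 * (q3 - q1)

# Report how many values each candidate cutoff would keep.
for name, cutoff in [("1% quantile", q01), ("2% quantile", q02), ("IQR floor", data_floor)]:
    kept = data[data >= cutoff]
    print(f"{name}: cutoff={cutoff:.2f}, kept {kept.size} of {data.size}")
```

On this sample both quantile cutoffs equal the minimum value and the IQR floor is negative, so none of them removes anything. With data this skewed, a cutoff chosen from domain knowledge (such as the 15 or 50 mentioned in the question) may be more useful than either rule.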


Error message More than one column has the same display name

I keep getting the error message
More than one column has the same display name
but I cannot find the root cause. Any help is greatly appreciated!
SELECT
gl_ap_details.ledger_name,
gl_ap_details.company_code,
gl_ap_details.location_code,
gl_ap_details.cost_center,
gl_ap_details.account_number,
gl_ap_details.account_name,
gl_ap_details.product_code,
gl_ap_details.channel_code,
gl_ap_details.journal_name,
gl_ap_details.line_description,
gl_ap_details.gl_posted_date,
gl_ap_details.currency,
gl_ap_details.je_source,
gl_ap_details.je_category,
gl_ap_details.effective_date,
gl_ap_details.created_by,
gl_ap_details.invoice_num,
gl_ap_details.invoice_id,
gl_ap_details.invoice_date,
gl_ap_details.vendor_name,
gl_ap_details.vendor_number,
gl_ap_details.invoice_image,
gl_ap_details.po_number,
gl_ap_details.po_requestor,
gl_ap_details.period_name,
gl_ap_details.amount,
gl_ap_details.gl_posted_date,
gl_ap_details.project_code
FROM
wbr_global.gl_ap_details
WHERE
wbr_global.gl_ap_details.ledger_name = 'Amazon.com, Inc.'
AND cost_center IN ('1172')
AND period_name = 'JUL-21'
AND wbr_global.gl_ap_details.account_number = '60820'
GROUP BY
1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28;
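One candidate cause is visible in the select list itself: gl_posted_date is selected twice, once after line_description and once after amount. Whether that is what the reporting tool objects to is an assumption, but repeated names are easy to check mechanically. A minimal Python sketch, assuming the column names have been copied by hand from the query (the helper name find_duplicate_columns is hypothetical):

```python
from collections import Counter

# Column names copied by hand from the select list above (table prefix dropped).
select_columns = [
    "ledger_name", "company_code", "location_code", "cost_center",
    "account_number", "account_name", "product_code", "channel_code",
    "journal_name", "line_description", "gl_posted_date", "currency",
    "je_source", "je_category", "effective_date", "created_by",
    "invoice_num", "invoice_id", "invoice_date", "vendor_name",
    "vendor_number", "invoice_image", "po_number", "po_requestor",
    "period_name", "amount", "gl_posted_date", "project_code",
]

def find_duplicate_columns(columns):
    """Hypothetical helper: return the names that occur more than once, in first-seen order."""
    counts = Counter(columns)
    return [name for name, n in counts.items() if n > 1]

print(find_duplicate_columns(select_columns))
```

If the duplicate is the culprit, removing the second gl_posted_date (and adjusting the GROUP BY positions to match) or giving it a distinct alias would be the things to try.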

Increasing the label size in matplotlib in pie chart

I have the following dictionary
{'Electronic Arts': 66,
'GT Interactive': 1,
'Palcom': 1,
'Fox Interactive': 1,
'LucasArts': 5,
'Bethesda Softworks': 9,
'SquareSoft': 3,
'Nintendo': 142,
'Virgin Interactive': 4,
'Atari': 7,
'Ubisoft': 28,
'Konami Digital Entertainment': 11,
'Hasbro Interactive': 1,
'MTV Games': 1,
'Sega': 11,
'Enix Corporation': 4,
'Capcom': 13,
'Warner Bros. Interactive Entertainment': 7,
'Acclaim Entertainment': 1,
'Universal Interactive': 1,
'Namco Bandai Games': 7,
'Eidos Interactive': 9,
'THQ': 7,
'RedOctane': 1,
'Sony Computer Entertainment Europe': 3,
'Take-Two Interactive': 24,
'Square Enix': 5,
'Microsoft Game Studios': 22,
'Disney Interactive Studios': 2,
'Vivendi Games': 2,
'Sony Computer Entertainment': 52,
'Activision': 45,
'505 Games': 4}
The problem is that the labels are extremely small and practically unreadable.
Can anyone suggest how to increase the label size?
I have tried the code below:
plt.figure(figsize=(80,80))
plt.pie(vg_dict.values(),labels=vg_dict.keys())
plt.show()
Add the textprops argument to the plt.pie call:
plt.figure(figsize=(80,80))
plt.pie(vg_dict.values(), labels=vg_dict.keys(), textprops={'fontsize': 30})
plt.show()
Any property of the matplotlib Text object can be set this way; see the matplotlib.text.Text documentation for the full list.
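The label size can also be changed after the pie has been drawn, because pie returns the label Text objects. A minimal sketch, assuming a shortened, hypothetical subset of the dictionary, a smaller figure than in the question, and the non-interactive Agg backend so that it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so the sketch runs headless
import matplotlib.pyplot as plt

# Shortened, hypothetical subset of the dictionary from the question.
vg_dict = {"Nintendo": 142, "Electronic Arts": 66, "Activision": 45, "Ubisoft": 28}

fig, ax = plt.subplots(figsize=(8, 8))

# Without autopct, pie() returns the wedge patches and the label Text objects.
wedges, texts = ax.pie(list(vg_dict.values()), labels=list(vg_dict.keys()))

# Resize every label individually; other Text setters work the same way.
for text in texts:
    text.set_fontsize(30)

fig.canvas.draw()  # render once; use plt.show() in an interactive session
```

This form is convenient when only some labels need a different size or colour, since each Text object can be adjusted on its own.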
Updated
I don't know whether the order of your labels matters. To avoid overlapping labels, you can change the start angle (matplotlib draws the pie counterclockwise, starting from the x-axis) and reorder the "crowded" labels:
vg_dict = {
'Palcom': 1,
'Electronic Arts': 66,
'GT Interactive': 1,
'LucasArts': 5,
'Bethesda Softworks': 9,
'SquareSoft': 3,
'Nintendo': 142,
'Virgin Interactive': 4,
'Atari': 7,
'Ubisoft': 28,
'Hasbro Interactive': 1,
'Konami Digital Entertainment': 11,
'MTV Games': 1,
'Sega': 11,
'Enix Corporation': 4,
'Capcom': 13,
'Acclaim Entertainment': 1,
'Warner Bros. Interactive Entertainment': 7,
'Universal Interactive': 1,
'Namco Bandai Games': 7,
'Eidos Interactive': 9,
'THQ': 7,
'RedOctane': 1,
'Sony Computer Entertainment Europe': 3,
'Take-Two Interactive': 24,
'Vivendi Games': 2,
'Square Enix': 5,
'Microsoft Game Studios': 22,
'Disney Interactive Studios': 2,
'Sony Computer Entertainment': 52,
'Fox Interactive': 1,
'Activision': 45,
'505 Games': 4}
plt.figure(figsize=(80,80))
plt.pie(vg_dict.values(), labels=vg_dict.keys(), textprops={'fontsize': 35}, startangle=-35)
plt.show()

Appending numpy arrays using numpy.insert

I have a numpy array (inputs) of shape (30,1). I want to insert a 31st value (e.g. x = 2). I am trying to use the np.insert function, but it gives me an out-of-bounds error.
np.insert(inputs,b+1,x)
IndexError: index 31 is out of bounds for axis 0 with size 30
Short answer: you need to insert it at index b, not b+1.
The index you pass to np.insert is the position at which the new element will be placed. Indexes are zero-based, so an array with 30 elements has 29 as its last index, and inserting at index 30 puts the new element last:
>>> a
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
>>> np.insert(a,30,42)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 42])
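The example above uses a 1-D array, while the array in the question has shape (30, 1). A minimal sketch of how that case behaves, assuming a stand-in array built with np.arange: without an axis argument np.insert flattens its input, so axis=0 is needed to keep the column shape.

```python
import numpy as np

inputs = np.arange(30).reshape(30, 1)  # stand-in for the (30, 1) array in the question
x = 2

# Without axis, np.insert flattens the input first, so the result is 1-D.
flat = np.insert(inputs, inputs.shape[0], x)

# With axis=0 a new row is inserted and the column dimension is kept.
appended = np.insert(inputs, inputs.shape[0], x, axis=0)

# np.append with a matching (1, 1) row gives the same result.
appended_alt = np.append(inputs, [[x]], axis=0)

print(flat.shape, appended.shape, appended_alt.shape)
```

Using inputs.shape[0] as the index avoids hard-coding the length, which is where the off-by-one in the question came from.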

How to print more than 32 values?

Anyone know how to print more than 32 values? My output looks like this, and I'm trying to make it show the rest of the array:
Value of: model.GetOutput(0)
Expected: contains 64 values, where each value and its corresponding value in { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, ... } are an almost-equal pair
Actual: { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... }, where the value pair (1, 2) at index #1 don't match, which is 1 from 1
It's hard-coded in the Google Test sources (kMaxCount = 32). To change it, you have to modify the code and rebuild Google Test. You might be able to define your own printer if the type is specific enough.

MultiPoint crossover using Numpy

I am trying to do crossover on a Genetic Algorithm population using numpy.
I have sliced the population using parent 1 and parent 2.
population = np.random.randint(2, size=(4,8))
p1 = population[::2]
p2 = population[1::2]
But I have not been able to find a lambda or numpy command that does a multi-point crossover over the parents.
The idea is to take the i-th row of p1 and randomly swap some of its bits with the i-th row of p2.
I think you want to select from p1 and p2 at random, cell by cell.
To make this easier to follow, I've changed p1 to hold values from 10 to 15 and p2 to hold values from 20 to 25. Both were generated at random within those ranges.
In [66]: p1
Out[66]:
array([[15, 15, 13, 14, 12, 13, 12, 12],
[14, 11, 11, 10, 12, 12, 10, 12],
[12, 11, 14, 15, 14, 10, 13, 10],
[11, 12, 10, 13, 14, 13, 12, 13]])
In [67]: p2
Out[67]:
array([[23, 25, 24, 21, 24, 20, 24, 25],
[21, 21, 20, 20, 25, 22, 24, 22],
[24, 22, 25, 20, 21, 22, 21, 22],
[22, 20, 21, 22, 25, 23, 22, 21]])
In [68]: sieve=np.random.randint(2, size=(4,8))
In [69]: sieve
Out[69]:
array([[0, 1, 0, 1, 1, 0, 1, 0],
[1, 1, 1, 0, 0, 1, 1, 1],
[0, 1, 1, 0, 0, 1, 1, 0],
[0, 0, 0, 1, 1, 1, 1, 1]])
In [70]: not_sieve=sieve^1 # Complement of sieve
In [71]: pn = p1*sieve + p2*not_sieve
In [72]: pn
Out[72]:
array([[23, 15, 24, 14, 12, 20, 12, 25],
[14, 11, 11, 20, 25, 12, 10, 12],
[24, 11, 14, 20, 21, 10, 13, 22],
[22, 20, 21, 13, 14, 13, 12, 13]])
The numbers in the teens come from p1, where sieve is 1.
The numbers in the twenties come from p2, where sieve is 0.
This could probably be made more efficient, but is this the output you expect?
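The same cell-by-cell selection can be written with np.where, and a mask that flips only at a few cut points gives a crossover closer to the multi-point form asked about. This is a minimal sketch under stated assumptions: the seed, the parent ranges, and the helper multipoint_mask are hypothetical and chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Parents in distinct ranges, as in the answer above, so each cell's origin is visible.
p1 = rng.integers(10, 16, size=(4, 8))
p2 = rng.integers(20, 26, size=(4, 8))

# Uniform crossover: a random 0/1 sieve picks p1 where it is 1 and p2 where it is 0.
sieve = rng.integers(0, 2, size=p1.shape)
child_uniform = np.where(sieve == 1, p1, p2)

# Same result as the arithmetic form used above.
assert np.array_equal(child_uniform, p1 * sieve + p2 * (sieve ^ 1))

def multipoint_mask(n_rows, n_cols, n_points, rng):
    """Hypothetical helper: 0/1 mask that flips at n_points random cut positions per row."""
    flips = np.zeros((n_rows, n_cols), dtype=int)
    for row in range(n_rows):
        # Cuts fall between columns, so position 0 is excluded.
        cuts = rng.choice(np.arange(1, n_cols), size=n_points, replace=False)
        flips[row, cuts] = 1
    # Cumulative count of cuts passed; an odd count means "take from p2".
    return np.cumsum(flips, axis=1) % 2

mask = multipoint_mask(*p1.shape, n_points=2, rng=rng)
child_multipoint = np.where(mask == 0, p1, p2)
print(child_multipoint)
```

With n_points=2 each child row starts with p1, switches to p2 at the first cut, and switches back at the second; np.where also avoids the two multiplications and the complement array.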