Replace outlier values in multiple columns based on different conditions using pandas? - pandas

I want to find the outliers in several columns at once and replace each outlier with another value, depending on two conditions.
Sample dataset:
day phone_calls received
1 11 11
2 12 12
3 10 0
4 13 12
5 170 2
6 9 9
7 67 1
8 180 150
9 8 1
10 10 10
First determine the acceptable range; let's say it is 8 to 50. Then replace the values: if a column value is less than 8, replace it with 8, and if it is greater than 50, replace it with 50.
Please help, I am new to pandas.

I think you need set_index with clip:
df = df.set_index('day').clip(8,50)
print (df)
phone_calls received
day
1 11 11
2 12 12
3 10 8
4 13 12
5 50 8
6 9 9
7 50 8
8 50 50
9 8 8
10 10 10
Or, similarly, use iloc to select all columns except the first:
df.iloc[:, 1:] = df.iloc[:, 1:].clip(8,50)
print (df)
day phone_calls received
0 1 11 11
1 2 12 12
2 3 10 8
3 4 13 12
4 5 50 8
5 6 9 9
6 7 50 8
7 8 50 50
8 9 8 8
9 10 10 10
EDIT: You can also specify the columns in a list:
cols = ['phone_calls','received']
df[cols] = df[cols].clip(8,50)
print (df)
day phone_calls received
0 1 11 11
1 2 12 12
2 3 10 8
3 4 13 12
4 5 50 8
5 6 9 9
6 7 50 8
7 8 50 50
8 9 8 8
9 10 10 10
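The title also mentions different conditions per column. As a minimal sketch of that case, clip can take Series thresholds aligned to the column labels with axis=1. The per-column bounds in `lower` and `upper` below are hypothetical values chosen for illustration, not taken from the question:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "day": range(1, 11),
    "phone_calls": [11, 12, 10, 13, 170, 9, 67, 180, 8, 10],
    "received": [11, 12, 0, 12, 2, 9, 1, 150, 1, 10],
})
cols = ["phone_calls", "received"]

# Same bounds for every column, as in the answer above.
same_bounds = df[cols].clip(8, 50)

# Hypothetical per-column bounds (illustrative values only).
lower = pd.Series({"phone_calls": 8, "received": 5})
upper = pd.Series({"phone_calls": 50, "received": 20})

# axis=1 aligns the Series labels with the column names.
per_column = df[cols].clip(lower=lower, upper=upper, axis=1)

print(per_column)
```

Assigning the result back with df[cols] = per_column keeps the day column untouched, as in the list-based variant above.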

Related

Pandas: How to extract data that has been grouped by

Here is an example code to demonstrate my problem:
import numpy as np
import pandas as pd
np.random.seed(10)
df = pd.DataFrame(np.random.randint(0,10,size=(100, 2)), columns=list('xy'))
df
x y
0 9 4
1 0 1
2 9 0
3 1 8
4 9 0
... ... ...
95 0 4
96 6 4
97 9 8
98 0 7
99 1 7
groups = df.groupby(['x'])
groups.size()
x
0 11
1 12
2 15
3 13
4 14
5 5
6 6
7 9
8 5
9 10
dtype: int64
How can I access the x-values as a column and the aggregated y-values as a second column to plot x versus y?
Two options.
Use reset_index():
groups = df.groupby(['x']).size().reset_index(name='size')
Add as_index=False to groupby:
groups = df.groupby(['x'], as_index=False).size()
Output for both:
>>> groups
x size
0 0 16
1 1 9
2 2 9
3 3 5
4 4 7
5 5 10
6 6 10
7 7 7
8 8 12
9 9 15
IIUC, use as_index=False:
groups = df.groupby(['x'], as_index=False)
out = groups.size()
out.plot(x='x', y='size')
If you only want to plot, you can also keep x as the index:
df.groupby(['x']).size().plot()
output:
x size
0 0 16
1 1 9
2 2 9
3 3 5
4 4 7
5 5 10
6 6 10
7 7 7
8 8 12
9 9 15
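For reference, here is a small deterministic sketch of both options on made-up data (the six-row frame below is an assumption for illustration, not the random data from the question). It also shows one possible way to get an aggregated y column, here the mean, which is only a guess at what "aggregated y-values" means. Note that as_index=False combined with size() returns a DataFrame only in reasonably recent pandas versions (1.1 or later, as far as I know):

```python
import pandas as pd

# Small hand-made frame so the result is easy to check by eye.
df = pd.DataFrame({"x": [0, 0, 1, 2, 2, 2], "y": [5, 3, 8, 1, 4, 9]})

# Option 1: turn the group keys back into a column after aggregating.
via_reset = df.groupby("x").size().reset_index(name="size")

# Option 2: ask groupby not to move the keys into the index.
via_flag = df.groupby("x", as_index=False).size()

# An aggregate of y per x value (mean chosen only as an example).
mean_y = df.groupby("x", as_index=False)["y"].mean()

print(via_reset)
print(mean_y)
```

Either frame can then be passed to plot(x='x', y='size') or plot(x='x', y='y').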

Keep only the first value on duplicated column (set 0 to others)

Supposing I have the following situation:
A dataframe where the first column ['ID'] may contain duplicated values.
import pandas as pd
df = pd.DataFrame({"ID": [1,2,3,4,4,5,5,5,6,6],
"l_1": [10,12,32,45,45,20,20,20,20,20],
"l_2": [11,12,32,11,21,27,38,12,9,6],
"l_3": [5,9,32,12,21,21,18,12,8,1],
"l_4": [6,21,12,77,77,2,2,2,8,8]})
ID l_1 l_2 l_3 l_4
1 10 11 5 6
2 12 12 9 21
3 32 32 32 12
4 45 11 12 77
4 45 21 21 77
5 20 27 21 2
5 20 38 18 2
5 20 12 12 2
6 20 9 8 8
6 20 6 1 8
When duplicated IDs occur:
I need to keep only the first value in columns l_1 and l_4 (the other duplicated rows must be set to zero).
Columns 'l_2' and 'l_3' must stay the same.
When duplicated IDs occur, the values in columns l_1 and l_4 on those rows are also duplicated.
Expected output:
ID l_1 l_2 l_3 l_4
1 10 11 5 6
2 12 12 9 21
3 32 32 32 12
4 45 11 12 77
4 0 21 21 0
5 20 27 21 2
5 0 38 18 0
5 0 12 12 0
6 20 9 8 8
6 0 6 1 0
Is there a straightforward way to accomplish this using pandas or numpy?
I could only accomplish it with all of these steps:
x1 = df[df.duplicated(subset=['ID'], keep=False)].copy()
x1.loc[x1.groupby('ID')['l_1'].apply(lambda x: (x.shift(1) == x)), 'l_1'] = 0
x1.loc[x1.groupby('ID')['l_4'].apply(lambda x: (x.shift(1) == x)), 'l_4'] = 0
df = df.drop_duplicates(subset=['ID'], keep=False)
df = pd.concat([df, x1])
Isn't this just:
df.loc[df.duplicated('ID'), ['l_1','l_4']] = 0
Output:
ID l_1 l_2 l_3 l_4
0 1 10 11 5 6
1 2 12 12 9 21
2 3 32 32 32 12
3 4 45 11 12 77
4 4 0 21 21 0
5 5 20 27 21 2
6 5 0 38 18 0
7 5 0 12 12 0
8 6 20 9 8 8
9 6 0 6 1 0
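As a self-contained check of the one-liner, the sketch below rebuilds the frame from the question and applies it. The intermediate name `is_repeat` is introduced here only for readability. duplicated('ID') uses keep='first' by default, so the first row of each ID is False and every later repeat is True:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 4, 5, 5, 5, 6, 6],
    "l_1": [10, 12, 32, 45, 45, 20, 20, 20, 20, 20],
    "l_2": [11, 12, 32, 11, 21, 27, 38, 12, 9, 6],
    "l_3": [5, 9, 32, 12, 21, 21, 18, 12, 8, 1],
    "l_4": [6, 21, 12, 77, 77, 2, 2, 2, 8, 8],
})

# True for every row whose ID has already appeared above it.
is_repeat = df.duplicated("ID")

# Zero out only the two columns that should keep their first value.
df.loc[is_repeat, ["l_1", "l_4"]] = 0

print(df)
```

This relies on the row order: "first" means the first occurrence from the top, so sort the frame beforehand if a different row should be kept.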

How to create an OD matrix from a pandas Data Frame only with specific columns

I have the data frame shown below. I need to create an origin-destination (OD) matrix in which the rows are indexed by the date and the values of "From municipality code", the columns are the values of "To municipality code", and the cells are filled with the values of the "count" column. How do I get such a matrix from the pandas data frame?
result_final.head()
ODMatrix= pd.DataFrame(0, index=list(range(0,202708)), columns = list(range(0,202708))
).add(df.pivot_table(values='count', index="from_municipality_code",
columns='to_municipality_code', aggfunc=len),fill_value=0).astype('int')
I tried to convert the pandas data frame into numpy array but it did not work.
result_final[['date', 'from_municipality_code','to_municipality_code','count','Lng_x','Lat_x','Lng_y','Lat_y',]].to_numpy()
This is the final matrix I want if this helps to visualize:
You can use the pivot_table method. Here is a working example:
import pandas as pd
import numpy as np
# Some example data
df = pd.DataFrame({"from": np.random.randint(0, 10, (1000,)), "to": np.random.randint(0, 10, (1000,))})
# Remove examples where from == to
df = df.loc[df["from"] != df["to"]].copy()
# The key operation
matrix = (
    df.assign(count=1)
    .pivot_table(index="from", columns="to", values="count", aggfunc="count")
    .fillna(0)
    .astype(int)
)
print(matrix)
to 0 1 2 3 4 5 6 7 8 9
from
0 0 10 14 7 9 14 18 6 11 8
1 11 0 12 7 4 12 9 11 6 13
2 6 14 0 12 13 8 5 15 11 10
3 10 9 12 0 14 10 8 14 9 11
4 10 14 14 11 0 8 4 10 11 4
5 15 10 10 18 8 0 15 15 8 12
6 9 7 10 13 10 8 0 11 12 10
7 9 12 4 6 9 9 8 0 8 12
8 8 8 11 12 15 10 11 4 0 6
9 10 13 11 16 14 18 11 9 4 0
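The example above counts rows. The question, however, already has a count column and wants the date on the row axis as well, so a variant that sums count and uses a two-level row index may be closer to what was asked. This is only a sketch: the four rows of `trips` are invented, and summing is an assumption about how repeated origin-destination pairs on the same date should be combined:

```python
import pandas as pd

# Invented sample rows using the column names from the question.
trips = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02"],
    "from_municipality_code": [101, 101, 102, 101],
    "to_municipality_code": [102, 102, 101, 103],
    "count": [3, 2, 5, 7],
})

# Rows: (date, origin); columns: destination; cells: summed count.
od = trips.pivot_table(
    index=["date", "from_municipality_code"],
    columns="to_municipality_code",
    values="count",
    aggfunc="sum",
    fill_value=0,  # pairs that never occur become 0 instead of NaN
)

print(od)
```

Calling od.to_numpy() afterwards gives the plain array, and od.loc["2020-01-01"] selects the matrix for a single date.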

RODBC: Columns and values don't match

I came across this behavior in RODBC (using SQL Server driver):
df1 = data.frame(matrix(c(1:20), nrow=10))
df1
which outputs
X1 X2
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
which makes sense. Then I save the table using RODBC
sqlSave(conout, df1, 'TEST')
Then I switch the two created columns:
df2 = df1[,c(2,1)]
df2
which outputs
X2 X1
1 11 1
2 12 2
3 13 3
4 14 4
5 15 5
6 16 6
7 17 7
8 18 8
9 19 9
10 20 10
which also makes sense.
Seeing those two tables, I see that X1 only contains 1:10 and X2 only contains 11:20. Now, when I do
sqlSave(conout, df2, 'TEST', append=TRUE, fast=FALSE)
sqlQuery(conout, 'SELECT * FROM TEST')
rownames X1 X2
1 1 1 11
2 2 2 12
3 3 3 13
4 4 4 14
5 5 5 15
6 6 6 16
7 7 7 17
8 8 8 18
9 9 9 19
10 10 10 20
11 1 11 1
12 2 12 2
13 3 13 3
14 4 14 4
15 5 15 5
16 6 16 6
17 7 17 7
18 8 18 8
19 9 19 9
20 10 20 10
which is definitely not what I saved. Now three questions:
How is this possible?
Where is this behavior explained in the RODBC manual?
How can I prevent this behavior without reordering my columns (the real case behind this example has more than 300 columns)?

Update Query in SQL with numeric pattern in MS Access

Good Day All,
I need assistance in creating an update query that groups my data.
The data in my table is actually spatial in nature and can be thought of as a matrix that is 10 columns by 5 rows. I have the ObjectID, Row and Column, but I want the column DesiredResult, which is a 2x2 grouping of the rows and columns.
So the R,C pairs 1,1 1,2 2,1 and 2,2 will have a DesiredResult of 1, while 1,3 1,4 2,3 and 2,4 will have a DesiredResult of 2, and so on (see below for an example).
I was able to create the R and C columns using a combination of Quotient and Mod, so I assume I would do something similar, but I am stuck. How would I go about this query in MS Access?
ObjectID R C DesiredResult
1 1 1 1
2 1 2 1
3 1 3 2
4 1 4 2
5 1 5 3
6 1 6 3
7 1 7 4
8 1 8 4
9 1 9 5
10 1 10 5
11 2 1 1
12 2 2 1
13 2 3 2
14 2 4 2
15 2 5 3
16 2 6 3
17 2 7 4
18 2 8 4
19 2 9 5
20 2 10 5
21 3 1 6
22 3 2 6
23 3 3 7
24 3 4 7
25 3 5 8
26 3 6 8
27 3 7 9
28 3 8 9
29 3 9 10
30 3 10 10
31 4 1 6
32 4 2 6
33 4 3 7
34 4 4 7
35 4 5 8
36 4 6 8
37 4 7 9
38 4 8 9
39 4 9 10
40 4 10 10
41 5 1 11
42 5 2 11
43 5 3 12
44 5 4 12
45 5 5 13
46 5 6 13
47 5 7 14
48 5 8 14
49 5 9 15
50 5 10 15
Something like ... ?
SELECT a.Row, a.Col, Col\2 AS D1, Col Mod 2 AS D2, [D1]+[D2] AS Desired
FROM table AS a
ORDER BY a.Row, a.Col;
Remou had a close approximation but it turns out this gives me what I need. I needed both a row and a column index.
SELECT ObjectID, R, C,
Int(([C]-1)/2) AS ColIndex,
Int(([R]-1)/2) AS RowIndex,
[RowIndex]*5+[ColIndex]+1 AS DesiredResult
FROM Testing
ORDER BY ObjectID
The key to the query is the number 2, which appears in both the column and row index expressions and is the grouping size, and the number 5 in DesiredResult, which is the number of column groups in each row (10 columns / 2).
Thanks !
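For anyone who wants to check the arithmetic outside Access, here is the same formula as a small Python sketch. The function name `block_id` and its parameters are made up for this illustration; the defaults match the 10-column grid and 2x2 grouping from the question:

```python
def block_id(row, col, n_cols=10, block=2):
    """Return the 1-based group number for a 1-based (row, col) cell."""
    # Number of column groups per row, rounded up (10 / 2 = 5 here).
    groups_per_row = -(-n_cols // block)
    row_index = (row - 1) // block   # same as Int(([R]-1)/2)
    col_index = (col - 1) // block   # same as Int(([C]-1)/2)
    return row_index * groups_per_row + col_index + 1


# Print the grid of group numbers for the 5 x 10 table in the question.
for r in range(1, 6):
    print([block_id(r, c) for c in range(1, 11)])
```

Changing `block` or `n_cols` adapts the grouping to other grid sizes, which corresponds to changing the 2 and the 5 in the query.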