Finding significant values from a series - pandas

I have a series with an index, and the counts can range from 0 to 1000.
I can select all the entries where the value is greater than 3.
But after looking at the data, I decided to select all the entries where the value is more than 10, because some values are significantly higher than the others!
s[s > 3].dropna()
-PB-[variable][variable] 8.0
-[variable] 15.0
-[variable][variable] 6.0
A-[variable][variable] 5.0
B 5.0
B-[variable][variable] 5.0
Book 4.0
Bus 8.0
Date 5.0
Dear 1609.0
MR 4.0
Man[variable] 4.0
Number[variable] 5.0
PM[variable] 4.0
Pickup 12.0
Pump[variable] 5.0
RJ 9.0
RJ-[variable]-PB-[variable][variable] 6.0
Time[variable] 6.0
[variable] 103.0
[variable][variable] 15.0
I have refined my query to something like this...
s[s > 10].dropna()
-[variable] 15.0
Dear 1609.0
Pickup 12.0
[variable] 103.0
[variable][variable] 15.0
Is there any function in pandas to return the significant entries? I can sort in descending order and select the first 5 or 10, but there is no guarantee that those entries will be very high compared to the average; in that case I would prefer to select all such entries.
In other words, I decided on the threshold of 10 in this case after looking at the data. Is there any way to select that value programmatically?

Selecting a threshold value with the quantile method might be a better solution, but it is still not an exact answer.
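One way to pick that cutoff programmatically, as a rough sketch (the 0.95 quantile and the two-standard-deviation rule below are arbitrary choices, and the Series values are made up for illustration), is to derive the threshold from the data itself:

import pandas as pd

# Hypothetical Series standing in for s from the question.
s = pd.Series({"Dear": 1609.0, "Pickup": 12.0, "Bus": 8.0, "Book": 4.0, "RJ": 9.0})

# Option 1: keep everything above the 95th percentile of the counts.
threshold = s.quantile(0.95)
print(s[s > threshold])

# Option 2: keep values more than two standard deviations above the mean.
threshold = s.mean() + 2 * s.std()
print(s[s > threshold])

Either rule adapts to the data, but the choice of quantile or multiplier is still a judgement call.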

You can use .sort_values to sort the Series in descending order and .head to take the top rows (5 by default; pass 10 if you want the top 10).
Simply call:
s[s > 10].sort_values(ascending=False).head(10)

Related

How to compare a value against a column value containing csv in Postgres?

I have a table called device_info that looks like below (only a sample provided)
device_ip      cpu        memory
100.33.1.0     10.0       29.33
110.35.58.2    3.0, 2.0   20.47
220.17.58.3    4.0, 3.0   23.17
30.13.18.8     -1         26.47
70.65.18.10    -1         20.47
10.25.98.11    5.0, 7.0   19.88
12.15.38.10    7.0        22.45
Now I need to compare a number, say 3, against the cpu column values and get the rows whose cpu values are greater than it. Since the cpu column values are stored as csv strings, I am not sure how to do the comparison.
I found that Postgres has a string_to_array function that converts a csv string to an array, and accordingly tried the query below, but it didn't work out:
select device_ip, cpu, memory
from device_info
where 3 > any(string_to_array(cpu, ',')::float[]);
What am I doing wrong?
Expected output
device_ip      cpu        memory
100.33.1.0     10.0       29.33
220.17.58.3    4.0, 3.0   23.17
10.25.98.11    5.0, 7.0   19.88
12.15.38.10    7.0        22.45
The statement as-is is saying "3 is greater than my array value". What I think you want is "3 is less than my array value".
Switch > to <.
select device_ip, cpu
from device_info
where 3 < any(string_to_array(cpu, ',')::float[]);

filter file in .csv format with pandas [duplicate]

I tried to filter my data with pandas but have not succeeded. I converted my data to a .csv file and did the following:
import pandas as pd
data = pd.read_csv("test3.csv")
print (type(data))
print(data)
By doing this I get my table:
    C1   C2   C3   C4   C5
0  1.0  2.0  3.0  4.0  5.0
1  2.0  3.0  4.0  5.0  6.0
2  3.0  4.0  5.0  6.0  7.0
3  4.0  5.0  6.0  7.0  8.0
<class 'pandas.core.frame.DataFrame'>
Now I need Python to print the rows in which the columns meet a condition, for example the rows for which all the columns are < 4.0; the idea is that I have a condition for each column. I tried this, but it does not work:
for item in data:
    fil_C1 = (data["C1"]) == 4.0
    print(fil_C1)
please help me!!!
If you need to retain rows where the value of column C1 is less than 4, try the following:
less_than_4 = data[data['C1'] < 4]
print(less_than_4)
If you have multiple conditions say C1 less than 4 and C5 greater than 5 try this:
mul_conditions = data[(data['C1'] < 4) & (data['C5'] > 5)]
print(mul_conditions)
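For the specific example in the question (rows where every column is below 4.0), a rough sketch reusing the data frame loaded above is a row-wise all() over a boolean comparison:

# Keep only the rows in which every column is strictly less than 4.0.
all_below_4 = data[(data < 4.0).all(axis=1)]
print(all_below_4)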
Let me know how it goes.

Replacing NaN values with group mean

I have a dataframe made of countries, years and many other features. There are many years for a single country:
country  year  population  ..... etc.
1        2000  5000
1        2001  NaN
1        2002  4800
2        2000
Now there are many NaN values in the dataframe.
I want to replace each NaN corresponding to a specific country, in every column, with that country's average for the column.
So, for example, for the NaN in the population column corresponding to country 1, year 2001, I want to use the average population for country 1 over all the years = (5000 + 4800) / 2.
I am using the groupby().mean() method to find the means for each country, but I am running into the following difficulties:
1- Some means come out as NaN when I know for sure there is a value for them. Why is that?
2- How can I get access to specific values from the groupby result? In other words, how can I replace every NaN with its correct average?
Thanks a lot.
Using combine_first with groupby mean
df.combine_first(df.groupby('country').transform('mean'))
Or
df.fillna(df.groupby('country').transform('mean'))
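As a minimal sketch of what the fillna version does, here it is applied to a frame built from the sample in the question (the missing entries are written as NaN):

import numpy as np
import pandas as pd

# Sample frame mirroring the question; country 2 has no population values at all.
df = pd.DataFrame({
    "country": [1, 1, 1, 2],
    "year": [2000, 2001, 2002, 2000],
    "population": [5000, np.nan, 4800, np.nan],
})

# Each NaN is filled with its own country's column mean, e.g. (5000 + 4800) / 2 for country 1.
filled = df.fillna(df.groupby("country").transform("mean"))
print(filled)

A group whose values are all NaN (country 2 here) stays NaN, which is also why some of the groupby means can come out as NaN.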

Assign titles to a column according to percentiles in SQL

I am trying to translate Python problems into SQL code.
I would like to assign titles according to the grade in a new column.
For example:
A for the top 0.9% of the column
B for next 15% of the column
C for next 25% of the column
D for next 30% of the column
E for next 13% of the column
F for rest of the column
There is this column:
Grades
2.3
3
2
3.3
3.5
3.6
3.2
2.1
2.3
3.7
3.3
3.1
4.4
4.3
1.4
4.5
3.5
I don't know how SQLite can handle this, since it doesn't have a function like the quantile function that languages like R have.
Something I tried, but that is not even close, is this:
SELECT x
FROM MyTable
ORDER BY x
LIMIT 1
OFFSET (SELECT COUNT(*)
FROM MyTable) / 2
to get at half of the column.
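Since the question mentions translating from Python, here is a rough sketch of the intended percentile-to-letter mapping in pandas (the grades are copied from the question and the cut points follow the cumulative shares 0.9%, 15%, 25%, 30%, 13%); the same bucketing could be expressed in SQLite 3.25+ with a CASE over the CUME_DIST() window function:

import pandas as pd

# Grades column from the question.
grades = pd.Series([2.3, 3, 2, 3.3, 3.5, 3.6, 3.2, 2.1, 2.3, 3.7,
                    3.3, 3.1, 4.4, 4.3, 1.4, 4.5, 3.5], name="Grades")

# Fraction of rows at or below each grade (close to 0 = lowest, 1 = highest).
pct = grades.rank(pct=True)

# Cumulative shares counted from the top: A 0.9%, B +15%, C +25%, D +30%, E +13%, F the rest.
bins = [0.0, 1 - 0.839, 1 - 0.709, 1 - 0.409, 1 - 0.159, 1 - 0.009, 1.0]
labels = ["F", "E", "D", "C", "B", "A"]
titles = pd.cut(pct, bins=bins, labels=labels, include_lowest=True)

print(pd.concat([grades, titles.rename("Title")], axis=1))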

SSRS - column calculations

In SQL Server Reporting Services within Visual Studio, I created a report which has detail lines and a total line. I try to subtract the value in the total line from the value in the detail line, and I get a result of zero, which is incorrect. See the example below:
              Col A   Col B
Detail        4.7     4.7 – 4.0
lines         3.7     3.7 – 4.0
              3.5     3.5 – 4.0
Total/AVG     4.0
In column B, I take the figure from the detail line in column A and subtract the Total line value from it, and I get zero instead of 0.7, etc.
You need to include the scope for calculating the average within your detail row. If you are doing this at the group level, aggregate over the table's group:
=Fields!MyField.Value - AVG(Fields!MyField.Value, "table1_Group1")
If it is at the dataset level, you can do the same with the dataset:
=Fields!MyField.Value - AVG(Fields!MyField.Value, "MyDataset")