I'm trying to sort a data frame based on groups meeting conditions.
I'm getting a syntax error for the way I'm sorting the groups, and I'm losing the initial order of the data frame before attempting the above.
This is the order of sorting that I'm trying to achieve:
1) Sort on the First and Test columns.
2) For Test==1 groups, sort on the Secondary then the Final column.
   For Test==0 groups, sort on the Final column only.
import pandas as pd

df = pd.DataFrame({"First": [100,100,100,100,100,100,200,200,200,200,200], "Test": [1,1,1,0,0,0,0,1,1,1,0],
                   "Secondary": [.1,.1,.1,.2,.2,.3,.3,.3,.3,.4,.4], "Final": [1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,10.10,11.11]})

def sorter(x):
    if x["Test"] == 1:
        x.sort_values(['Secondary','Final'], inplace=True)
    else:
        x = x.sort_values('Final', inplace=True)

df = df.sort_values(["First","Test"], ascending=[False, False]).reset_index(drop=True)
df.groupby(['First','Test']).apply(lambda x: sorter(x))
df
Expected result:
First Test Secondary Final
200 1 0.4 10.1
200 1 0.3* 9.9*
200 1 0.3* 8.8*
200 0 0.4 11.11*
200 0 0.3 7.7*
100 1 0.5 2.2
100 1 0.1* 3.3*
100 1 0.1* 1.1*
100 0 0.3 6.6*
100 0 0.2 5.5*
100 0 0.2 4.4*
You can try sorting in descending order; with respect to the sequence you gave, the order of sorting will change. Will it work for you?
df = pd.DataFrame({"First": [100,100,100,100,100,100,200,200,200,200,200], "Test": [1,1,1,0,0,0,0,1,1,1,0],
                   "Secondary": [.1,.5,.1,.9,.4,.1,.3,.3,.3,.4,.4], "Final": [1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,10.10,11.11]})
df = df.groupby(['First','Test']).apply(
    lambda x: x.sort_values(['First','Test','Secondary','Final'], ascending=False)
    if x.iloc[0]['Test'] == 1
    else x.sort_values(['First','Test','Final'], ascending=False)
).reset_index(drop=True)
df.sort_values(['First','Test'],ascending=[True,False])
Out:
Final First Secondary Test
3 2.20 100 0.5 1
4 3.30 100 0.1 1
5 1.10 100 0.1 1
0 6.60 100 0.1 0
1 5.50 100 0.4 0
2 4.40 100 0.9 0
8 10.10 200 0.4 1
9 9.90 200 0.3 1
10 8.80 200 0.3 1
6 11.11 200 0.4 0
7 7.70 200 0.3 0
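If the lambda gets hard to read, the same conditional sort can be written as a named function. A sketch along the lines of the code above (sort_group is a hypothetical name; kind='mergesort' makes the final sort stable, so the within-group order survives):

def sort_group(g):
    # Choose sort keys based on the group's Test value; First and Test
    # are constant within a group, so they can be dropped from the keys.
    by = ['Secondary', 'Final'] if g.iloc[0]['Test'] == 1 else ['Final']
    return g.sort_values(by, ascending=False)

df = (df.groupby(['First', 'Test'], group_keys=False)
        .apply(sort_group)
        .sort_values(['First', 'Test'], ascending=[True, False], kind='mergesort'))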
The trick was to sort subsets separately and replace the values in the original df.
This came up in other solutions to pandas sorting problems.
import pandas as pd

df = pd.DataFrame({"First": [100,100,100,100,100,100,200,200,200,200,200], "Test": [1,1,1,0,0,0,0,1,1,1,0],
                   "Secondary": [.1,.5,.1,.9,.4,.1,.3,.3,.3,.4,.4], "Final": [1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,10.10,11.11]})
# Global sort first: this already puts the Test==1 groups in their final order.
df.sort_values(['First','Test','Secondary','Final'], ascending=False, inplace=True)
# Re-sort only the Test==0 rows by Final and write the values back
# into the same row positions of the original frame.
index_subset = df[df["Test"]==0].index
sorted_subset = df[df["Test"]==0].sort_values(['First','Final'], ascending=False)
df.loc[index_subset,:] = sorted_subset.values
print(df)
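If this comes up often, the replacement trick can be wrapped in a small helper. A sketch using the column names above (resort_subset is a hypothetical name):

def resort_subset(frame, mask, by):
    # Sort only the masked rows and overwrite them in place, keeping
    # every other row where it already is.
    idx = frame[mask].index
    frame.loc[idx, :] = frame[mask].sort_values(by, ascending=False).values
    return frame

df = resort_subset(df, df["Test"] == 0, ['First', 'Final'])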
I'm trying to determine the cause of some high latency I'm seeing on my ElastiCache Redis node (cache.m3.medium). I gathered some data using the redis-cli latency test, running it from an EC2 instance in the same region/availability-zone as the ElastiCache node.
I see that the latency is quite good on average (~0.5 ms), but there are some pretty high outliers. I don't believe the outliers are due to network latency, as ping tests between two EC2 instances don't exhibit these high spikes.
The Redis node is not under any load, and the metrics seem to look fine.
My questions are:
What might be causing the high max latencies?
Are these max latencies expected?
What other steps/tests/tools would you use to further diagnose the issue?
user#my-ec2-instance:~/redis-3.2.8$ ./src/redis-cli -h redis-host --latency-history -i 1
min: 0, max: 12, avg: 0.45 (96 samples) -- 1.01 seconds range
min: 0, max: 1, avg: 0.33 (96 samples) -- 1.00 seconds range
min: 0, max: 3, avg: 0.33 (96 samples) -- 1.01 seconds range
min: 0, max: 2, avg: 0.29 (96 samples) -- 1.01 seconds range
min: 0, max: 2, avg: 0.26 (96 samples) -- 1.01 seconds range
min: 0, max: 1, avg: 0.34 (96 samples) -- 1.00 seconds range
min: 0, max: 4, avg: 0.34 (96 samples) -- 1.01 seconds range
min: 0, max: 1, avg: 0.26 (96 samples) -- 1.00 seconds range
min: 0, max: 5, avg: 0.33 (96 samples) -- 1.01 seconds range
min: 0, max: 1, avg: 0.31 (96 samples) -- 1.00 seconds range
min: 0, max: 1, avg: 0.33 (96 samples) -- 1.00 seconds range
min: 0, max: 1, avg: 0.28 (96 samples) -- 1.00 seconds range
min: 0, max: 1, avg: 0.30 (96 samples) -- 1.00 seconds range
min: 0, max: 4, avg: 0.35 (96 samples) -- 1.01 seconds range
min: 0, max: 15, avg: 0.52 (95 samples) -- 1.01 seconds range
min: 0, max: 4, avg: 0.48 (94 samples) -- 1.00 seconds range
min: 0, max: 2, avg: 0.54 (94 samples) -- 1.00 seconds range
min: 0, max: 1, avg: 0.38 (96 samples) -- 1.01 seconds range
min: 0, max: 8, avg: 0.55 (94 samples) -- 1.00 seconds range
I ran tests with several different node types and found that bigger nodes performed much better. I'm now using the cache.m3.xlarge type, which has provided more consistent network latency.
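If you want to capture the outliers from application code rather than redis-cli, here is a minimal sketch using the redis-py client (assuming the redis package is installed; "redis-host" is a placeholder for the ElastiCache endpoint):

import time
import redis

# Measure PING round-trips and report percentiles, so occasional
# spikes show up instead of being averaged away.
r = redis.Redis(host="redis-host", port=6379)
samples = []
for _ in range(1000):
    start = time.perf_counter()
    r.ping()
    samples.append((time.perf_counter() - start) * 1000.0)  # milliseconds

samples.sort()
print("min: %.2f ms" % samples[0])
print("p50: %.2f ms" % samples[len(samples) // 2])
print("p99: %.2f ms" % samples[int(len(samples) * 0.99)])
print("max: %.2f ms" % samples[-1])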
Thanks to Origineil for his help in modifying the awk script. The awk script works perfectly when the interval is a whole number, but if I use an interval of less than one, like 0.2, it gives wrong output. I have a file "sss" containing this data:
H34 5.0856 5.45563
H39 5.0857 5.45573
H26 6.4822 6.81033
H30 6.4822 6.81033
H32 6.4823 6.81043
H40 6.4824 6.81053
H33 7.6729 7.96531
H27 7.673 7.96541
H31 7.6731 7.96551
H38 7.6731 7.96551
H29 8.5384 8.80485
H28 8.5387 8.80514
H35 8.5387 8.80514
H37 8.5387 8.80514
H41 9.9078 10.1332
H36 9.9087 10.134
If I then run the awk command
awk '!e{e=$2+0.2;} $2-e>0{print "Range " ++i , c " entries. min: " min " max: " max ; e+=0.2; c=0; min=""} {if(!min)min=$2; c++; max=$2} END{print "Range " ++i , c " entries. min: " min " max: " max} ' sss
It puts values into separate ranges even when the difference between them is less than the 0.2 interval specified in the script:
Range 1 2 entries. min: 5.0856 max: 5.0857
Range 2 1 entries. min: 6.4822 max: 6.4822
Range 3 1 entries. min: 6.4822 max: 6.4822
Range 4 1 entries. min: 6.4823 max: 6.4823
Range 5 1 entries. min: 6.4824 max: 6.4824
Range 6 1 entries. min: 7.6729 max: 7.6729
Range 7 1 entries. min: 7.673 max: 7.673
Range 8 1 entries. min: 7.6731 max: 7.6731
Range 9 1 entries. min: 7.6731 max: 7.6731
Range 10 1 entries. min: 8.5384 max: 8.5384
Range 11 1 entries. min: 8.5387 max: 8.5387
Range 12 1 entries. min: 8.5387 max: 8.5387
Range 13 1 entries. min: 8.5387 max: 8.5387
Range 14 1 entries. min: 9.9078 max: 9.9078
Range 15 1 entries. min: 9.9087 max: 9.9087
Can somebody help me out on this?
Thanks in advance.
By minimum and maximum I assume you mean the first and last entry seen within the range.
For the provided input I changed 18 to 17 so that not all "max" values were also the upper bounds of the range.
Script:
awk '!e{e=$1+4;} $1-e>0{print "Range " ++i , c " entries. min: " min " max: " max ; e+=4; c=0; min=""} {if(!min)min=$1; c++; max=$1} END{print "Range " ++i , c " entries. min: " min " max: " max} ' file
I introduced two variables min and max to track the entries.
Output:
Range 1 4 entries. min: 2 max: 6
Range 2 3 entries. min: 7 max: 10
Range 3 3 entries. min: 12 max: 14
Range 4 2 entries. min: 16 max: 17
Range 5 1 entries. min: 19 max: 19
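The fractional-interval symptom in the question appears to come from e advancing by only one step per new range: once a gap between values spans several intervals, e lags behind and every later row opens its own range. A sketch of the same bucketing in Python that advances the range end past any gap (assuming the column layout of sss, with the value in the second column):

step = 0.2
end = None
count, lo, hi = 0, None, None
results = []
with open("sss") as f:
    for line in f:
        val = float(line.split()[1])
        if end is None:
            end = val + step
        if val > end:
            results.append((count, lo, hi))   # close the current range
            while val > end:                  # skip intervals with no data
                end += step
            count, lo = 0, None
        if lo is None:
            lo = val
        count += 1
        hi = val
results.append((count, lo, hi))
for i, (c, mn, mx) in enumerate(results, 1):
    print("Range %d %d entries. min: %s max: %s" % (i, c, mn, mx))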
I have a table that looks like this at the moment:
Day Limit Price
1 52 0.3
1 4 70
1 44 200
1 9 0.01
1 0 0.03
1 0 0.03
2 52 0.4
2 10 70
2 44 200
2 5 0.01
2 0 0.55
2 2 50
Is there a way I can use SQL to pivot the result into a table with price categories as columns, selecting the maximum Limit value within each category?
Day 0-10 10-100 100+
1 52 4 44
2 52 10 44
You can use CASE and MAX:
SELECT Day,
       MAX(CASE WHEN Price >= 0  AND Price < 10  THEN Limit ELSE 0 END) AS ZeroToTen,
       MAX(CASE WHEN Price >= 10 AND Price < 100 THEN Limit ELSE 0 END) AS TenToHundred,
       MAX(CASE WHEN Price >= 100 THEN Limit ELSE 0 END) AS HundredPlus
FROM YourTable
GROUP BY Day
Here is the Fiddle.
BTW, if you're using MySQL, add backticks around Limit since LIMIT is a keyword.
Good luck.
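For comparison, the same conditional-aggregation pivot can be sketched in pandas (assuming the Day/Limit/Price table above; the bins are left-inclusive to match the CASE ranges):

import pandas as pd

df = pd.DataFrame({"Day":   [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                   "Limit": [52, 4, 44, 9, 0, 0, 52, 10, 44, 5, 0, 2],
                   "Price": [0.3, 70, 200, 0.01, 0.03, 0.03, 0.4, 70, 200, 0.01, 0.55, 50]})

# Left-inclusive bins [0,10), [10,100), [100,inf), matching the CASE ranges.
bins = pd.cut(df["Price"], [0, 10, 100, float("inf")],
              right=False, labels=["0-10", "10-100", "100+"])
out = df.groupby(["Day", bins], observed=False)["Limit"].max().unstack(fill_value=0)
print(out)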