Slicing numpy with condition

I have a NumPy array with a shape of 178 rows x 14 columns, like this:
0 1 2 3 4 5 6 7 8 9 10 11 \
0 1.0 14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29 5.64 1.04
1 1.0 13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28 4.38 1.05
2 1.0 13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81 5.68 1.03
3 1.0 14.37 1.95 2.50 16.8 113.0 3.85 3.49 0.24 2.18 7.80 0.86
4 1.0 13.24 2.59 2.87 21.0 118.0 2.80 2.69 0.39 1.82 4.32 1.04
.. ... ... ... ... ... ... ... ... ... ... ... ...
173 3.0 13.71 5.65 2.45 20.5 95.0 1.68 0.61 0.52 1.06 7.70 0.64
174 3.0 13.40 3.91 2.48 23.0 102.0 1.80 0.75 0.43 1.41 7.30 0.70
175 3.0 13.27 4.28 2.26 20.0 120.0 1.59 0.69 0.43 1.35 10.20 0.59
176 3.0 13.17 2.59 2.37 20.0 120.0 1.65 0.68 0.53 1.46 9.30 0.60
177 3.0 14.13 4.10 2.74 24.5 96.0 2.05 0.76 0.56 1.35 9.20 0.61
12 13
0 3.92 1065.0
1 3.40 1050.0
2 3.17 1185.0
3 3.45 1480.0
4 2.93 735.0
.. ... ...
173 1.74 740.0
174 1.56 750.0
175 1.56 835.0
176 1.62 840.0
177 1.60 560.0
[178 rows x 14 columns]
I tried to print it as a DataFrame with all the rows and only the first (index 0) column, and the output worked like this:
0
0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
.. ...
173 3.0
174 3.0
175 3.0
176 3.0
177 3.0
[178 rows x 1 columns]
Using the same logic, I want to take the rows whose first-column value is at most 2. I tried to do it like this and it doesn't work:
reduced = data[data[:,0:1]<=2]
I got an IndexError like this:
IndexError Traceback (most recent call last)
<ipython-input-159-7eab0abd8f99> in <module>()
----> 1 reduced = data[data[:,0:1]<=2]
IndexError: boolean index did not match indexed array along dimension 1; dimension is 14 but corresponding boolean dimension is 1.
Can anybody help me?
Thanks in advance.

Solved it.
It is as simple as converting the NumPy array to a DataFrame and then selecting rows based on a condition on the DataFrame:
reduced = data[data['class'] <= 2]
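For reference, the same selection can also be done directly on the NumPy array. The IndexError comes from data[:, 0:1], which produces a (178, 1) boolean mask, while boolean row indexing wants a 1-D mask of length 178. A minimal sketch, assuming data is the original 178 x 14 array:
reduced = data[data[:, 0] <= 2]              # 1-D mask selects whole rows
# or keep the 0:1 slice and flatten the (178, 1) mask:
reduced = data[(data[:, 0:1] <= 2).ravel()]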

Related

Pandas - Reading excel file from O365

I have an Excel file in O365 which has been shared and that I want to import into pandas, but I failed. The file is shared so that anyone with the link can edit. I've gone through many related posts but always get different error messages. Below is a link to the shared Excel file. Please help me import it! Thanks in advance!
import urllib.request
url = 'https://botizrt-my.sharepoint.com/:x:/g/personal/bauko_botizrt_onmicrosoft_com/ETu8P8uY5EJChvYhi4tMaqABnOLyU97Ijw2v-pEmM-jXaA'
urllib.request.urlretrieve(url, "test.xlsx")
You have to append &download=1 to your URL:
import pandas as pd
import requests
from io import BytesIO
url = 'https://botizrt-my.sharepoint.com/:x:/g/personal/bauko_botizrt_onmicrosoft_com/ETu8P8uY5EJChvYhi4tMaqABnOLyU97Ijw2v-pEmM-jXaA?rtime=GhHyWm4O20g&download=1'
response = requests.get(url)
with BytesIO() as buf:
    buf.write(response.content)
    df = pd.read_excel(buf)
Output:
>>> df
date tavg tmin tmax prcp snow wdir wspd wpgt pres tsun
0 2023-01-01 9.1 5.6 13.5 0.0 NaN 152 12.6 25.9 1029.6 NaN
1 2023-01-02 7.2 1.9 12.9 0.0 NaN 148 10.2 27.8 1027.4 NaN
2 2023-01-03 5.2 1.8 6.7 0.3 NaN 207 5.1 20.4 1027.7 NaN
3 2023-01-04 5.5 3.4 6.9 0.4 NaN 213 7.5 25.9 1029.5 NaN
4 2023-01-05 5.9 3.4 8.6 0.7 NaN 200 17.6 35.2 1019.1 NaN
5 2023-01-06 6.2 2.7 9.0 0.0 NaN 241 13.7 35.2 1022.7 NaN
6 2023-01-07 5.6 1.2 11.1 0.0 NaN 145 7.8 27.8 1023.0 NaN
7 2023-01-08 4.8 -0.5 8.5 0.0 NaN 156 12.0 27.8 1018.6 NaN
8 2023-01-09 7.9 6.5 9.4 3.4 NaN 146 14.7 33.3 1009.3 NaN
9 2023-01-10 6.7 5.2 7.8 5.8 NaN 76 13.7 42.6 1008.0 NaN
10 2023-01-11 5.6 2.8 8.5 2.1 NaN 347 14.4 40.8 1017.8 NaN
11 2023-01-12 5.0 1.5 8.9 0.0 NaN 203 7.7 22.2 1022.4 NaN
12 2023-01-13 4.1 1.6 5.7 0.0 NaN 148 9.8 27.8 1021.7 NaN
13 2023-01-14 3.3 1.0 5.4 2.9 NaN 211 5.6 18.5 1022.8 NaN
14 2023-01-15 5.0 3.6 8.5 0.0 NaN 158 12.0 31.5 1018.0 NaN
15 2023-01-16 5.7 3.7 7.6 2.4 NaN 156 15.0 33.3 1004.5 NaN
16 2023-01-17 8.1 5.5 10.6 11.1 NaN 152 18.9 44.5 995.2 NaN
17 2023-01-18 9.7 6.8 12.0 3.4 NaN 159 19.5 44.5 995.5 NaN
18 2023-01-19 7.8 3.2 11.8 10.5 NaN 182 15.8 40.8 1002.4 NaN
19 2023-01-20 2.5 0.2 4.0 14.1 NaN 312 15.2 35.2 1007.3 NaN
20 2023-01-21 2.4 0.9 3.9 0.7 NaN 35 11.8 25.9 1013.5 NaN
21 2023-01-22 5.4 1.0 11.0 0.3 NaN 10 9.6 22.2 1020.7 NaN
22 2023-01-23 6.5 2.2 12.0 0.0 NaN 10 10.4 24.1 1025.9 NaN
23 2023-01-24 3.7 2.7 5.4 0.0 NaN 254 6.9 20.4 1031.5 NaN
24 2023-01-25 2.6 1.8 3.5 0.0 NaN 287 6.2 22.2 1030.8 NaN
25 2023-01-26 2.0 0.9 3.5 2.1 NaN 41 7.1 24.1 1021.5 NaN
26 2023-01-27 1.0 -0.4 2.6 0.5 NaN 348 14.7 24.1 1014.6 NaN
27 2023-01-28 1.9 0.9 3.2 1.7 NaN 342 15.1 25.9 1018.0 NaN
28 2023-01-29 0.7 -0.9 1.7 0.0 NaN 328 13.1 25.9 1024.5 NaN
29 2023-01-30 -0.2 -4.2 2.5 0.4 NaN 207 14.6 33.3 1018.7 NaN
30 2023-01-31 2.0 -2.4 6.3 0.3 NaN 260 11.8 33.3 1016.1 NaN
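A slightly shorter variant of the same idea, assuming the same share link, is to hand the downloaded bytes straight to pandas (pd.read_excel accepts any binary buffer):
import pandas as pd
import requests
from io import BytesIO

# same SharePoint link as above, with &download=1 appended
url = 'https://botizrt-my.sharepoint.com/:x:/g/personal/bauko_botizrt_onmicrosoft_com/ETu8P8uY5EJChvYhi4tMaqABnOLyU97Ijw2v-pEmM-jXaA?rtime=GhHyWm4O20g&download=1'
df = pd.read_excel(BytesIO(requests.get(url).content))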

Depth profile visual

Hi there, I was wondering whether someone might assist with combining plots generated using the example provided on the Depth Profiling visualization page. I have analyzed data for salinity and depth; however, I have a categorical variable dividing three estuaries based on whether the mouth is "closed", "open", or "semi-closed". I used the code from Depth Profiling Visualization, but each plot has its own salinity legend scale.
Here is the data.
State Distance Depth pH DO Chla Salinity Max.depth
1 Closed 0.60 0.0 8.66 10.64 0.8880000 18.49 -1.3
2 Closed 0.60 0.5 8.68 10.79 1.4800000 18.51 -1.3
3 Closed 0.60 1.3 8.73 11.26 1.1840000 18.51 -1.3
4 Closed 1.00 0.0 8.48 9.07 5.3280000 18.18 -0.8
5 Closed 1.00 0.8 8.47 8.30 2.9600000 18.35 -0.8
6 Closed 1.60 0.0 8.38 9.70 1.1840000 18.38 -2.0
7 Closed 1.60 0.5 8.40 9.33 NA 18.39 -2.0
8 Closed 1.60 1.0 8.40 9.27 1.1840000 18.39 -2.0
9 Closed 1.60 1.5 8.41 9.27 NA 18.41 -2.0
10 Closed 1.60 2.0 8.47 9.23 1.4800000 18.57 -2.0
11 Closed 2.15 0.0 8.40 9.85 2.6640000 18.26 -3.5
12 Closed 2.15 0.5 8.41 9.95 NA 18.27 -3.5
13 Closed 2.15 1.0 8.42 9.16 1.1840000 18.28 -3.5
14 Closed 2.15 2.0 8.42 9.82 NA 18.28 -3.5
15 Closed 2.15 3.5 8.38 9.17 0.5920000 18.30 -3.5
16 Closed 3.50 0.0 8.30 9.82 2.0720000 17.71 -5.0
17 Closed 3.50 0.5 8.31 9.78 NA 17.71 -5.0
18 Closed 3.50 1.0 8.32 9.75 1.4800000 17.72 -5.0
19 Closed 3.50 2.0 8.32 9.73 NA 17.78 -5.0
20 Closed 3.50 3.0 8.30 9.20 NA 17.95 -5.0
21 Closed 3.50 4.0 8.29 8.80 NA 18.00 -5.0
22 Closed 3.50 5.0 8.24 7.47 1.4800000 18.06 -5.0
23 Closed 4.85 0.0 8.21 10.10 2.9600000 17.33 -1.6
24 Closed 4.85 0.5 8.21 9.90 2.0720000 17.33 -1.6
25 Closed 4.85 1.0 8.21 9.73 NA 17.32 -1.6
26 Closed 4.85 1.6 8.22 9.60 1.1840000 17.32 -1.6
27 Closed 6.00 0.0 8.07 9.07 4.4400000 16.65 -1.5
28 Closed 6.00 0.5 8.06 8.98 5.6240000 16.65 -1.5
29 Closed 6.00 1.0 8.06 8.81 NA 16.67 -1.5
30 Closed 6.00 1.5 8.10 8.80 4.1440000 16.67 -1.5
31 Closed 6.70 0.0 7.83 9.25 0.0000000 13.90 -0.5
32 Open 0.60 0.0 7.56 8.42 1.1840000 1.62 -0.5
33 Open 0.60 0.5 7.62 8.40 1.9733333 1.79 -0.5
34 Open 1.00 0.0 7.67 8.55 1.1840000 1.10 -0.4
35 Open 1.00 0.4 7.62 8.49 1.5786667 1.10 -0.4
36 Open 1.60 0.0 7.48 8.40 1.5786667 0.98 -1.0
37 Open 1.60 0.5 7.47 8.33 NA 0.98 -1.0
38 Open 1.60 1.0 7.45 8.33 2.7626667 0.99 -1.0
39 Open 2.15 0.0 7.19 7.99 1.1840000 0.85 -1.0
40 Open 2.15 0.5 7.19 7.96 NA 0.86 -1.0
41 Open 2.15 1.0 7.18 7.98 1.1840000 0.89 -1.0
42 Open 3.50 0.0 7.14 7.56 0.3946667 0.55 -4.8
43 Open 3.50 0.5 7.20 7.50 NA 0.55 -4.8
44 Open 3.50 1.0 7.28 7.38 1.9733333 0.55 -4.8
45 Open 3.50 2.0 7.38 7.10 NA 0.55 -4.8
46 Open 3.50 3.0 7.56 6.15 NA 0.56 -4.8
47 Open 3.50 4.0 7.20 4.43 NA 2.65 -4.8
48 Open 3.50 4.8 6.93 2.25 1.9733333 6.76 -4.8
49 Open 4.85 0.0 6.90 8.29 1.1840000 0.26 -1.2
50 Open 4.85 0.5 6.77 8.20 0.7893333 0.27 -1.2
51 Open 4.85 1.2 6.55 8.20 0.7893333 0.39 -1.2
52 Open 6.00 0.0 6.49 9.53 1.1840000 0.13 -1.0
53 Open 6.00 0.5 6.59 9.53 NA 0.13 -1.0
54 Open 6.00 1.0 6.79 9.53 1.1840000 0.13 -1.0
55 Open 6.70 0.0 6.48 9.48 0.7893333 0.11 -0.5
56 Semi-closed 0.60 0.0 8.05 6.30 19.7300000 18.86 -1.4
57 Semi-closed 0.60 0.5 8.04 5.19 19.7300000 24.07 -1.4
58 Semi-closed 0.60 1.0 8.00 5.98 NA 30.50 -1.4
59 Semi-closed 0.60 1.4 7.87 6.19 5.1300000 31.18 -1.4
60 Semi-closed 1.00 0.0 7.99 5.75 22.8900000 18.81 -0.9
61 Semi-closed 1.00 0.5 7.95 5.10 NA 19.08 -0.9
62 Semi-closed 1.00 0.9 7.86 3.42 11.8400000 26.60 -0.9
63 Semi-closed 1.60 0.0 7.88 6.05 11.4500000 17.29 -1.7
64 Semi-closed 1.60 0.5 7.87 5.78 NA 17.32 -1.7
65 Semi-closed 1.60 1.0 7.86 4.74 8.6800000 17.44 -1.7
66 Semi-closed 1.60 1.5 7.84 3.90 NA 19.65 -1.7
67 Semi-closed 1.60 1.7 7.91 3.75 9.0800000 21.07 -1.7
68 Semi-closed 2.15 0.0 7.91 6.95 22.8900000 16.50 -1.3
69 Semi-closed 2.15 0.5 7.92 6.76 26.4400000 16.50 -1.3
70 Semi-closed 2.15 1.0 7.91 5.99 NA 17.40 -1.3
71 Semi-closed 2.15 1.3 7.97 4.10 7.1000000 18.79 -1.3
72 Semi-closed 3.50 0.0 7.75 6.13 18.5500000 15.86 -4.5
73 Semi-closed 3.50 0.5 7.72 5.90 NA 15.86 -4.5
74 Semi-closed 3.50 1.0 7.65 4.38 9.0800000 16.38 -4.5
75 Semi-closed 3.50 1.5 7.56 1.59 NA 20.09 -4.5
76 Semi-closed 3.50 2.0 7.55 0.38 NA 22.11 -4.5
77 Semi-closed 3.50 3.0 7.53 0.42 NA 30.36 -4.5
78 Semi-closed 3.50 4.0 7.52 0.52 NA 31.50 -4.5
79 Semi-closed 3.50 4.5 7.54 0.68 1.1800000 31.84 -4.5
80 Semi-closed 4.85 0.0 7.66 6.31 21.7100000 15.41 -1.6
81 Semi-closed 4.85 0.5 7.65 6.18 NA 15.44 -1.6
82 Semi-closed 4.85 1.0 7.65 5.57 21.3100000 15.54 -1.6
83 Semi-closed 4.85 1.6 7.52 0.76 6.7100000 22.60 -1.6
84 Semi-closed 6.00 0.0 7.74 8.50 87.6200000 13.11 -1.0
85 Semi-closed 6.00 0.5 7.66 7.38 NA 13.92 -1.0
86 Semi-closed 6.00 1.0 7.60 3.20 7.5000000 15.42 -1.0
87 Semi-closed 6.70 0.0 8.55 6.94 0.0000000 0.25 -0.5
I was hoping someone might be able to help unify the scales of the three legends from the three estuary mouth conditions, so that a single legend describing salinity applies to all plots.
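The general approach, whichever plotting library the linked example uses, is to compute one global salinity range, apply it to every panel, and draw a single colorbar. A minimal matplotlib sketch of that idea, assuming the data above is loaded into a pandas DataFrame df with the columns shown:
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

# one normalization shared by all three panels
norm = Normalize(vmin=df['Salinity'].min(), vmax=df['Salinity'].max())

fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
for ax, state in zip(axes, ['Closed', 'Open', 'Semi-closed']):
    sub = df[df['State'] == state]
    sc = ax.scatter(sub['Distance'], -sub['Depth'], c=sub['Salinity'],
                    norm=norm, cmap='viridis')
    ax.set_title(state)
    ax.set_xlabel('Distance')
axes[0].set_ylabel('Depth')
fig.colorbar(sc, ax=axes, label='Salinity')  # one legend for all panels
plt.show()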

Jedis benchmarking on local Redis server

I'm using JMH to test the performance of Jedis on a local Redis server (Jedis version 2.9.0, Redis version 6.2.6, CPU Quad-Core Intel Core i5). I use 200 threads to send SET commands through a connection pool.
import java.io.IOException;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.RunnerException;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

@State(Scope.Benchmark)
public class CommonClientBenchmark {
    private JedisPool jedisPool;
    private final String host = "127.0.0.1";
    private final int port = 6379;

    @Setup
    public void setup() {
        JedisPoolConfig jedisPoolConfig = new JedisPoolConfig();
        // pool sized to match the 200 benchmark threads
        jedisPoolConfig.setMaxTotal(200);
        jedisPoolConfig.setMaxIdle(200);
        jedisPool = new JedisPool(jedisPoolConfig, host, port, 30000);
    }

    @TearDown
    public void tearDown() {
        jedisPool.close();
    }

    @Threads(200)
    @Fork(1)
    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @Warmup(iterations = 1, time = 30, timeUnit = TimeUnit.SECONDS)
    @Measurement(iterations = 2, time = 30, timeUnit = TimeUnit.SECONDS)
    public void jedisSet() {
        try (Jedis jedis = jedisPool.getResource()) {
            jedis.set("jedis", "jedis");
        }
    }

    public static void main(String[] args) throws IOException, RunnerException {
        CommonClientBenchmark commonClientBenchmark = new CommonClientBenchmark();
        commonClientBenchmark.setup();
        org.openjdk.jmh.Main.main(args);
    }
}
With the code above, I obtain about 25000+ QPS. However, when I decrease the maxTotal and maxIdle parameters of the connection pool from 200 to 100, the resulting QPS is much higher: it reaches about 75000. Could anyone explain this phenomenon? Thanks a lot!
EDIT: I've changed the Jedis version to 4.1.1 and run multiple benchmark tests; the results are similar. When the connection pool size is set to 100 (both maxTotal and maxIdle), I obtain about 25000 ~ 50000 QPS. When I increase the size (both maxTotal and maxIdle) to 200, the QPS rises to 60000 ~ 75000.
I've also used iostat 1 to monitor CPU usage while running the tests, and I found that when the pool size is set to 200, the %system is often much higher than when it is set to 100.
connection pool size set to 200:
disk0 cpu load average
KB/t tps MB/s us sy id 1m 5m 15m
4.09 929 3.71 7 87 6 39.58 14.44 8.06
4.00 902 3.52 6 89 5 39.58 14.44 8.06
4.50 8 0.04 5 88 6 38.33 14.60 8.15
4.39 145 0.62 6 89 6 38.33 14.60 8.15
28.00 11 0.30 6 88 5 38.33 14.60 8.15
8.00 1 0.01 5 88 6 38.33 14.60 8.15
0.00 0 0.00 5 88 7 38.33 14.60 8.15
4.00 5 0.02 5 88 7 38.94 15.12 8.37
0.00 0 0.00 5 89 6 38.94 15.12 8.37
0.00 0 0.00 5 88 7 38.94 15.12 8.37
0.00 0 0.00 5 89 6 38.94 15.12 8.37
8.68 222 1.88 5 88 7 38.94 15.12 8.37
5.60 10 0.05 5 87 8 45.20 16.81 9.01
29.65 46 1.33 11 82 7 45.20 16.81 9.01
52.57 7 0.36 8 85 7 45.20 16.81 9.01
28.00 2 0.05 5 87 8 45.20 16.81 9.01
223.33 6 1.31 6 87 7 45.20 16.81 9.01
4.19 1344 5.49 8 85 7 44.54 17.15 9.17
4.61 952 4.29 6 89 5 44.54 17.15 9.17
4.00 690 2.69 6 89 5 44.54 17.15 9.17
connection pool size set to 100:
disk0 cpu load average
KB/t tps MB/s us sy id 1m 5m 15m
4.31 13 0.05 16 59 26 6.55 7.86 7.49
750.67 3 2.20 30 53 17 6.58 7.85 7.48
9.14 225 2.01 23 54 23 6.58 7.85 7.48
37.00 8 0.29 23 56 21 6.58 7.85 7.48
32.00 6 0.19 18 55 26 6.58 7.85 7.48
145.20 10 1.41 22 56 22 6.58 7.85 7.48
0.00 0 0.00 22 56 22 6.46 7.80 7.47
4.00 2660 10.39 24 58 18 6.46 7.80 7.47
4.00 1952 7.62 19 56 25 6.46 7.80 7.47
4.00 1 0.00 19 56 24 6.46 7.80 7.47
4.00 5 0.02 18 56 27 6.46 7.80 7.47
0.00 0 0.00 15 57 28 6.10 7.71 7.44
0.00 0 0.00 18 57 25 6.10 7.71 7.44
256.00 10 2.50 18 56 25 6.10 7.71 7.44
6.29 7 0.04 20 57 23 6.10 7.71 7.44
4.00 5 0.02 20 56 24 6.10 7.71 7.44
17.71 7 0.12 20 56 24 6.01 7.66 7.42
23.00 4 0.09 20 58 23 6.01 7.66 7.42
5.00 4 0.02 23 55 22 6.01 7.66 7.42
4.00 1 0.00 20 56 24 6.01 7.66 7.42

Replace closest values by average (or min/max) and keep exactly two rows by id if all values by id are equal

I have a dataframe that looks like this:
df_segments =
id seg_length
15 000b994d-1a6b-4698-a270-b0f671b1e612 16.3
11 000b994d-1a6b-4698-a270-b0f671b1e612 1.1
3 000b994d-1a6b-4698-a270-b0f671b1e612 1.1
7 000b994d-1a6b-4698-a270-b0f671b1e612 16.3
31 016490a8-8740-4205-bfe4-c9fe45e853d3 1.0
27 016490a8-8740-4205-bfe4-c9fe45e853d3 1.4
19 016490a8-8740-4205-bfe4-c9fe45e853d3 1.4
23 016490a8-8740-4205-bfe4-c9fe45e853d3 1.0
39 05290fe1-ead2-462b-bbec-a7669eed7883 1.1
35 05290fe1-ead2-462b-bbec-a7669eed7883 1.4
47 05290fe1-ead2-462b-bbec-a7669eed7883 1.1
43 05290fe1-ead2-462b-bbec-a7669eed7883 1.4
63 0537a9e3-09c4-459c-a6e4-25694cfbacbd 1.1
59 0537a9e3-09c4-459c-a6e4-25694cfbacbd 1.4
51 0537a9e3-09c4-459c-a6e4-25694cfbacbd 1.4
55 0537a9e3-09c4-459c-a6e4-25694cfbacbd 1.1
71 05577c2e-da7d-4753-bba6-66762385e159 1.0
67 05577c2e-da7d-4753-bba6-66762385e159 5.4
79 05577c2e-da7d-4753-bba6-66762385e159 1.0
75 05577c2e-da7d-4753-bba6-66762385e159 5.4
1475 5a104c86-327e-466f-b14a-6953cacddcbb 0.5
1479 5a104c86-327e-466f-b14a-6953cacddcbb 0.5
1487 5a104c86-327e-466f-b14a-6953cacddcbb 0.5
1483 5a104c86-327e-466f-b14a-6953cacddcbb 0.5
2287 8e853797-a7f3-4605-8848-f6b211f9b055 2.1
2283 8e853797-a7f3-4605-8848-f6b211f9b055 2.1
2279 8e853797-a7f3-4605-8848-f6b211f9b055 2.1
2275 8e853797-a7f3-4605-8848-f6b211f9b055 2.1
3351 c1120018-c626-4c1b-81a5-476ce38f346b 0.6
3347 c1120018-c626-4c1b-81a5-476ce38f346b 1.2
3359 c1120018-c626-4c1b-81a5-476ce38f346b 0.5
3355 c1120018-c626-4c1b-81a5-476ce38f346b 1.2
All ids have four rows. For most ids, dropping duplicates results in two rows. But for a few ids one of two things can happen:
Either all rows are equal, in which case drop_duplicates() will result in a single row.
Or drop_duplicates() will result in three or four rows because the seg_length values are all different.
However, all seg_length values are the lengths of the sides of a rectangle (or very close to it) or a square. So, what I would like to do is the following:
A. If all rows by id have the same seg_length value, keep two rows.
B. Replace the two largest (resp. smallest) values (by id) with their average. In other words:
df_segments =
id seg_length
3351 c1120018-c626-4c1b-81a5-476ce38f346b 0.6
3347 c1120018-c626-4c1b-81a5-476ce38f346b 1.2
3359 c1120018-c626-4c1b-81a5-476ce38f346b 0.5
3355 c1120018-c626-4c1b-81a5-476ce38f346b 1.2
would become:
df_segments =
id seg_length
3351 c1120018-c626-4c1b-81a5-476ce38f346b 0.55
3347 c1120018-c626-4c1b-81a5-476ce38f346b 1.2
3359 c1120018-c626-4c1b-81a5-476ce38f346b 0.55
3355 c1120018-c626-4c1b-81a5-476ce38f346b 1.2
Alternatively, use min/max if it is easier:
df_segments =
id seg_length
3351 c1120018-c626-4c1b-81a5-476ce38f346b 0.6
3347 c1120018-c626-4c1b-81a5-476ce38f346b 1.2
3359 c1120018-c626-4c1b-81a5-476ce38f346b 0.6
3355 c1120018-c626-4c1b-81a5-476ce38f346b 1.2
I have tried to use np.where and define conditions, but without any luck. I also tried to create a separate dataframe with the ids whose count was not 2 after dropping duplicates from the original dataframe, df_segments, but I ended up in the same situation.
If anyone has an idea, I would be thankful for insights.
If I understand correctly, you want to average the values two by two within each id. This also happens to do what you want when all four values are the same.
>>> averages = df.groupby('id')['seg_length'].apply(
... lambda s: s.sort_values().groupby([0, 0, 1, 1]).mean()
... )
>>> averages
id
000b994d-1a6b-4698-a270-b0f671b1e612 0 1.10
1 16.30
016490a8-8740-4205-bfe4-c9fe45e853d3 0 1.00
1 1.40
05290fe1-ead2-462b-bbec-a7669eed7883 0 1.10
1 1.40
0537a9e3-09c4-459c-a6e4-25694cfbacbd 0 1.10
1 1.40
05577c2e-da7d-4753-bba6-66762385e159 0 1.00
1 5.40
5a104c86-327e-466f-b14a-6953cacddcbb 0 0.50
1 0.50
8e853797-a7f3-4605-8848-f6b211f9b055 0 2.10
1 2.10
c1120018-c626-4c1b-81a5-476ce38f346b 0 0.55
1 1.20
Name: seg_length, dtype: float64
If you want to keep the original shape, you can use transform (on both groupbys):
>>> replaced_seglengths = df.groupby('id')['seg_length'].transform(
... lambda s: s.sort_values().groupby([0, 0, 1, 1]).transform('mean')
... )
>>> replaced_seglengths
15 1.10
11 1.10
3 16.30
7 16.30
31 1.00
27 1.00
19 1.40
23 1.40
39 1.10
35 1.10
47 1.40
43 1.40
63 1.10
59 1.10
51 1.40
55 1.40
71 1.00
67 1.00
79 5.40
75 5.40
1475 0.50
1479 0.50
1487 0.50
1483 0.50
2287 2.10
2283 2.10
2279 2.10
2275 2.10
3351 0.55
3347 0.55
3359 1.20
3355 1.20
Finally replace the column in the dataframe:
>>> df['seg_length'] = replaced_seglengths
>>> df
id seg_length
15 000b994d-1a6b-4698-a270-b0f671b1e612 1.10
11 000b994d-1a6b-4698-a270-b0f671b1e612 1.10
3 000b994d-1a6b-4698-a270-b0f671b1e612 16.30
7 000b994d-1a6b-4698-a270-b0f671b1e612 16.30
31 016490a8-8740-4205-bfe4-c9fe45e853d3 1.00
27 016490a8-8740-4205-bfe4-c9fe45e853d3 1.00
19 016490a8-8740-4205-bfe4-c9fe45e853d3 1.40
23 016490a8-8740-4205-bfe4-c9fe45e853d3 1.40
39 05290fe1-ead2-462b-bbec-a7669eed7883 1.10
35 05290fe1-ead2-462b-bbec-a7669eed7883 1.10
47 05290fe1-ead2-462b-bbec-a7669eed7883 1.40
43 05290fe1-ead2-462b-bbec-a7669eed7883 1.40
63 0537a9e3-09c4-459c-a6e4-25694cfbacbd 1.10
59 0537a9e3-09c4-459c-a6e4-25694cfbacbd 1.10
51 0537a9e3-09c4-459c-a6e4-25694cfbacbd 1.40
55 0537a9e3-09c4-459c-a6e4-25694cfbacbd 1.40
71 05577c2e-da7d-4753-bba6-66762385e159 1.00
67 05577c2e-da7d-4753-bba6-66762385e159 1.00
79 05577c2e-da7d-4753-bba6-66762385e159 5.40
75 05577c2e-da7d-4753-bba6-66762385e159 5.40
1475 5a104c86-327e-466f-b14a-6953cacddcbb 0.50
1479 5a104c86-327e-466f-b14a-6953cacddcbb 0.50
1487 5a104c86-327e-466f-b14a-6953cacddcbb 0.50
1483 5a104c86-327e-466f-b14a-6953cacddcbb 0.50
2287 8e853797-a7f3-4605-8848-f6b211f9b055 2.10
2283 8e853797-a7f3-4605-8848-f6b211f9b055 2.10
2279 8e853797-a7f3-4605-8848-f6b211f9b055 2.10
2275 8e853797-a7f3-4605-8848-f6b211f9b055 2.10
3351 c1120018-c626-4c1b-81a5-476ce38f346b 0.55
3347 c1120018-c626-4c1b-81a5-476ce38f346b 0.55
3359 c1120018-c626-4c1b-81a5-476ce38f346b 1.20
3355 c1120018-c626-4c1b-81a5-476ce38f346b 1.20
Use np.select([conditions], [solutions]).
Conditions:
condition1 = df2.groupby('id')['seg_length'].apply(lambda x: x.duplicated(keep=False))
condition2 = df2.groupby('id')['seg_length'].apply(lambda x: ~x.duplicated(keep=False))
Solutions:
sol1 = df2['seg_length']
sol2 = (df2.loc[condition2, 'seg_length'].sum(0)) / 2
df2['newseg_length'] = np.select([condition1, condition2], [sol1, sol2])
id seg_length newseg_length
3351 c1120018-c626-4c1b-81a5-476ce38f346b 0.6 0.55
3347 c1120018-c626-4c1b-81a5-476ce38f346b 1.2 1.20
3359 c1120018-c626-4c1b-81a5-476ce38f346b 0.5 0.55
3355 c1120018-c626-4c1b-81a5-476ce38f346b 1.2 1.20
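For completeness, a more general variant of the same idea (my own sketch, not from either answer): rank the four lengths within each id and average the two smallest and the two largest, which also handles the all-equal case:
import pandas as pd

def pair_means(s):
    # label the two smallest lengths 0 and the two largest 1, then average per label
    pair = s.rank(method='first').sub(1).floordiv(2)
    return s.groupby(pair).transform('mean')

df_segments['seg_length'] = df_segments.groupby('id')['seg_length'].transform(pair_means)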

Extracting the second last line from a table using a specific number followed by an asterisk (e.g. xy.z*)

I'm looking to extract and print a specific line from a table I have in a long log file. It looks something like this:
******************************************************************************
XSCALE (VERSION July 4, 2012) 4-Jun-2013
******************************************************************************
Author: Wolfgang Kabsch
Copy licensed until 30-Jun-2013 to
academic users for non-commercial applications
No redistribution.
******************************************************************************
CONTROL CARDS
******************************************************************************
MAXIMUM_NUMBER_OF_PROCESSORS=16
RESOLUTION_SHELLS= 20 10 6 4 3 2.5 2.0 1.9 1.8 1.7 1.6 1.5 1.4 1.3 1.2 1.1 1.0 0.9 0.8
MINIMUM_I/SIGMA=4.0
OUTPUT_FILE=fae-ip.ahkl
INPUT_FILE= /dls/sci-scratch/Sam/FC59251/fr6_1/XDS_ASCII.HKL
THE DATA COLLECTION STATISTICS REPORTED BELOW ASSUMES:
SPACE_GROUP_NUMBER= 97
UNIT_CELL_CONSTANTS= 128.28 128.28 181.47 90.000 90.000 90.000
***** 16 EQUIVALENT POSITIONS IN SPACE GROUP # 97 *****
If x',y',z' is an equivalent position to x,y,z, then
x'=x*ML(1)+y*ML( 2)+z*ML( 3)+ML( 4)/12.0
y'=x*ML(5)+y*ML( 6)+z*ML( 7)+ML( 8)/12.0
z'=x*ML(9)+y*ML(10)+z*ML(11)+ML(12)/12.0
# 1 2 3 4 5 6 7 8 9 10 11 12
1 1 0 0 0 0 1 0 0 0 0 1 0
2 -1 0 0 0 0 -1 0 0 0 0 1 0
3 -1 0 0 0 0 1 0 0 0 0 -1 0
4 1 0 0 0 0 -1 0 0 0 0 -1 0
5 0 1 0 0 1 0 0 0 0 0 -1 0
6 0 -1 0 0 -1 0 0 0 0 0 -1 0
7 0 -1 0 0 1 0 0 0 0 0 1 0
8 0 1 0 0 -1 0 0 0 0 0 1 0
9 1 0 0 6 0 1 0 6 0 0 1 6
10 -1 0 0 6 0 -1 0 6 0 0 1 6
11 -1 0 0 6 0 1 0 6 0 0 -1 6
12 1 0 0 6 0 -1 0 6 0 0 -1 6
13 0 1 0 6 1 0 0 6 0 0 -1 6
14 0 -1 0 6 -1 0 0 6 0 0 -1 6
15 0 -1 0 6 1 0 0 6 0 0 1 6
16 0 1 0 6 -1 0 0 6 0 0 1 6
ALL DATA SETS WILL BE SCALED TO /dls/sci-scratch/Sam/FC59251/fr6_1/XDS_ASCII.HKL
******************************************************************************
READING INPUT REFLECTION DATA FILES
******************************************************************************
DATA MEAN REFLECTIONS INPUT FILE NAME
SET# INTENSITY ACCEPTED REJECTED
1 0.1358E+03 1579957 0 /dls/sci-scratch/Sam/FC59251/fr6_1/XDS_ASCII.HKL
******************************************************************************
CORRECTION FACTORS AS FUNCTION OF IMAGE NUMBER & RESOLUTION
******************************************************************************
RECIPROCAL CORRECTION FACTORS FOR INPUT DATA SETS MERGED TO
OUTPUT FILE: fae-ip.ahkl
THE CALCULATIONS ASSUME FRIEDEL'S_LAW= TRUE
TOTAL NUMBER OF CORRECTION FACTORS DEFINED 720
DEGREES OF FREEDOM OF CHI^2 FIT 357222.9
CHI^2-VALUE OF FIT OF CORRECTION FACTORS 1.024
NUMBER OF CYCLES CARRIED OUT 4
CORRECTION FACTORS for visual inspection by XDS-Viewer DECAY_001.cbf
XMIN= 0.6 XMAX= 1799.3 NXBIN= 36
YMIN= 0.00049 YMAX= 0.44483 NYBIN= 20
NUMBER OF REFLECTIONS USED FOR DETERMINING CORRECTION FACTORS 396046
******************************************************************************
CORRECTION FACTORS AS FUNCTION OF X (fast) & Y(slow) IN THE DETECTOR PLANE
******************************************************************************
RECIPROCAL CORRECTION FACTORS FOR INPUT DATA SETS MERGED TO
OUTPUT FILE: fae-ip.ahkl
THE CALCULATIONS ASSUME FRIEDEL'S_LAW= TRUE
TOTAL NUMBER OF CORRECTION FACTORS DEFINED 7921
DEGREES OF FREEDOM OF CHI^2 FIT 356720.6
CHI^2-VALUE OF FIT OF CORRECTION FACTORS 1.023
NUMBER OF CYCLES CARRIED OUT 3
CORRECTION FACTORS for visual inspection by XDS-Viewer MODPIX_001.cbf
XMIN= 5.4 XMAX= 2457.6 NXBIN= 89
YMIN= 40.0 YMAX= 2516.7 NYBIN= 89
NUMBER OF REFLECTIONS USED FOR DETERMINING CORRECTION FACTORS 396046
******************************************************************************
CORRECTION FACTORS AS FUNCTION OF IMAGE NUMBER & DETECTOR SURFACE POSITION
******************************************************************************
RECIPROCAL CORRECTION FACTORS FOR INPUT DATA SETS MERGED TO
OUTPUT FILE: fae-ip.ahkl
THE CALCULATIONS ASSUME FRIEDEL'S_LAW= TRUE
TOTAL NUMBER OF CORRECTION FACTORS DEFINED 468
DEGREES OF FREEDOM OF CHI^2 FIT 357286.9
CHI^2-VALUE OF FIT OF CORRECTION FACTORS 1.022
NUMBER OF CYCLES CARRIED OUT 3
CORRECTION FACTORS for visual inspection by XDS-Viewer ABSORP_001.cbf
XMIN= 0.6 XMAX= 1799.3 NXBIN= 36
DETECTOR_SURFACE_POSITION= 1232 1278
DETECTOR_SURFACE_POSITION= 1648 1699
DETECTOR_SURFACE_POSITION= 815 1699
DETECTOR_SURFACE_POSITION= 815 858
DETECTOR_SURFACE_POSITION= 1648 858
DETECTOR_SURFACE_POSITION= 2174 1673
DETECTOR_SURFACE_POSITION= 1622 2230
DETECTOR_SURFACE_POSITION= 841 2230
DETECTOR_SURFACE_POSITION= 289 1673
DETECTOR_SURFACE_POSITION= 289 884
DETECTOR_SURFACE_POSITION= 841 326
DETECTOR_SURFACE_POSITION= 1622 326
DETECTOR_SURFACE_POSITION= 2174 884
NUMBER OF REFLECTIONS USED FOR DETERMINING CORRECTION FACTORS 396046
******************************************************************************
CORRECTION PARAMETERS FOR THE STANDARD ERROR OF REFLECTION INTENSITIES
******************************************************************************
The variance v0(I) of the intensity I obtained from counting statistics is
replaced by v(I)=a*(v0(I)+b*I^2). The model parameters a, b are chosen to
minimize the discrepancies between v(I) and the variance estimated from
sample statistics of symmetry related reflections. This model implicates
an asymptotic limit ISa=1/SQRT(a*b) for the highest I/Sigma(I) that the
experimental setup can produce (Diederichs (2010) Acta Cryst D66, 733-740).
Often the value of ISa is reduced from the initial value ISa0 due to systematic
errors showing up by comparison with other data sets in the scaling procedure.
(ISa=ISa0=-1 if v0 is unknown for a data set.)
a b ISa ISa0 INPUT DATA SET
1.086E+00 1.420E-03 25.46 29.00 /dls/sci-scratch/Sam/FC59251/fr6_1/XDS_ASCII.HKL
FACTOR TO PLACE ALL DATA SETS TO AN APPROXIMATE ABSOLUTE SCALE 0.4178E+04
(ASSUMING A PROTEIN WITH 50% SOLVENT)
******************************************************************************
STATISTICS OF SCALED OUTPUT DATA SET : fae-ip.ahkl
FILE TYPE: XDS_ASCII MERGE=FALSE FRIEDEL'S_LAW=TRUE
186 OUT OF 1579957 REFLECTIONS REJECTED
1579771 REFLECTIONS ON OUTPUT FILE
******************************************************************************
DEFINITIONS:
R-FACTOR
observed = (SUM(ABS(I(h,i)-I(h))))/(SUM(I(h,i)))
expected = expected R-FACTOR derived from Sigma(I)
COMPARED = number of reflections used for calculating R-FACTOR
I/SIGMA = mean of intensity/Sigma(I) of unique reflections
(after merging symmetry-related observations)
Sigma(I) = standard deviation of reflection intensity I
estimated from sample statistics
R-meas = redundancy independent R-factor (intensities)
Diederichs & Karplus (1997), Nature Struct. Biol. 4, 269-275.
CC(1/2) = percentage of correlation between intensities from
random half-datasets. Correlation significant at
the 0.1% level is marked by an asterisk.
Karplus & Diederichs (2012), Science 336, 1030-33
Anomal = percentage of correlation between random half-sets
Corr of anomalous intensity differences. Correlation
significant at the 0.1% level is marked.
SigAno = mean anomalous difference in units of its estimated
standard deviation (|F(+)-F(-)|/Sigma). F(+), F(-)
are structure factor estimates obtained from the
merged intensity observations in each parity class.
Nano = Number of unique reflections used to calculate
Anomal_Corr & SigAno. At least two observations
for each (+ and -) parity are required.
SUBSET OF INTENSITY DATA WITH SIGNAL/NOISE >= -3.0 AS FUNCTION OF RESOLUTION
RESOLUTION NUMBER OF REFLECTIONS COMPLETENESS R-FACTOR R-FACTOR COMPARED I/SIGMA R-meas CC(1/2) Anomal SigAno Nano
LIMIT OBSERVED UNIQUE POSSIBLE OF DATA observed expected Corr
20.00 557 66 74 89.2% 2.7% 3.0% 557 58.75 2.9% 100.0* 45 1.674 25
10.00 5018 417 417 100.0% 2.4% 3.1% 5018 75.34 2.6% 100.0* 2 0.812 276
6.00 18352 1583 1584 99.9% 2.8% 3.3% 18351 65.55 2.9% 100.0* 11* 0.914 1248
4.00 59691 4640 4640 100.0% 3.2% 3.5% 59690 64.96 3.4% 100.0* 4 0.857 3987
3.00 112106 8821 8822 100.0% 4.4% 4.4% 112102 50.31 4.6% 99.9* -3 0.844 7906
2.50 147954 11023 11023 100.0% 8.7% 8.6% 147954 29.91 9.1% 99.8* 0 0.829 10096
2.00 332952 24698 24698 100.0% 21.4% 21.6% 332949 14.32 22.3% 99.2* 1 0.804 22992
1.90 106645 8382 8384 100.0% 56.5% 57.1% 106645 5.63 58.8% 94.7* -2 0.767 7886
1.80 138516 10342 10343 100.0% 86.8% 87.0% 138516 3.64 90.2% 87.9* -2 0.762 9741
1.70 175117 12897 12899 100.0% 140.0% 140.1% 175116 2.15 145.4% 69.6* -2 0.732 12188
1.60 209398 16298 16304 100.0% 206.1% 208.5% 209397 1.35 214.6% 48.9* -2 0.693 15466
1.50 273432 20770 20893 99.4% 333.4% 342.1% 273340 0.80 346.9% 23.2* -1 0.644 19495
1.40 33 27 27248 0.1% 42.6% 112.7% 12 0.40 60.3% 88.2 0 0.000 0
1.30 0 0 36205 0.0% -99.9% -99.9% 0 -99.00 -99.9% 0.0 0 0.000 0
1.20 0 0 49238 0.0% -99.9% -99.9% 0 -99.00 -99.9% 0.0 0 0.000 0
1.10 0 0 68746 0.0% -99.9% -99.9% 0 -99.00 -99.9% 0.0 0 0.000 0
1.00 0 0 98884 0.0% -99.9% -99.9% 0 -99.00 -99.9% 0.0 0 0.000 0
0.90 0 0 147505 0.0% -99.9% -99.9% 0 -99.00 -99.9% 0.0 0 0.000 0
0.80 0 0 230396 0.0% -99.9% -99.9% 0 -99.00 -99.9% 0.0 0 0.000 0
total 1579771 119964 778303 15.4% 12.8% 13.1% 1579647 14.33 13.4% 99.9* -1 0.755 111306
========== STATISTICS OF INPUT DATA SET ==========
R-FACTORS FOR INTENSITIES OF DATA SET /dls/sci-scratch/Sam/FC59251/fr6_1/XDS_ASCII.HKL
RESOLUTION R-FACTOR R-FACTOR COMPARED
LIMIT observed expected
20.00 2.7% 3.0% 557
10.00 2.4% 3.1% 5018
6.00 2.8% 3.3% 18351
4.00 3.2% 3.5% 59690
3.00 4.4% 4.4% 112102
2.50 8.7% 8.6% 147954
2.00 21.4% 21.6% 332949
1.90 56.5% 57.1% 106645
1.80 86.8% 87.0% 138516
1.70 140.0% 140.1% 175116
1.60 206.1% 208.5% 209397
1.50 333.4% 342.1% 273340
1.40 42.6% 112.7% 12
1.30 -99.9% -99.9% 0
1.20 -99.9% -99.9% 0
1.10 -99.9% -99.9% 0
1.00 -99.9% -99.9% 0
0.90 -99.9% -99.9% 0
0.80 -99.9% -99.9% 0
total 12.8% 13.1% 1579647
******************************************************************************
WILSON STATISTICS OF SCALED DATA SET: fae-ip.ahkl
******************************************************************************
Data is divided into resolution shells and a straight line
A - 2*B*SS is fitted to log<I>, where
RES = mean resolution (Angstrom) in shell
SS = mean of (sin(THETA)/LAMBDA)**2 in shell
<I> = mean reflection intensity in shell
BO = (A - log<I>)/(2*SS)
# = number of reflections in resolution shell
WILSON LINE (using all data) : A= 14.997 B= 29.252 CORRELATION= 0.99
# RES SS <I> log(<I>) BO
1667 8.445 0.004 2.3084E+06 14.652 49.2
2798 5.260 0.009 1.5365E+06 14.245 41.6
3547 4.106 0.015 2.0110E+06 14.514 16.3
4147 3.480 0.021 1.2910E+06 14.071 22.4
4688 3.073 0.026 7.3586E+05 13.509 28.1
5154 2.781 0.032 4.6124E+05 13.042 30.3
5568 2.560 0.038 3.1507E+05 12.661 30.6
5966 2.384 0.044 2.4858E+05 12.424 29.2
6324 2.240 0.050 1.8968E+05 12.153 28.5
6707 2.119 0.056 1.3930E+05 11.844 28.3
7030 2.016 0.062 9.1378E+04 11.423 29.0
7331 1.926 0.067 5.4413E+04 10.904 30.4
7664 1.848 0.073 3.5484E+04 10.477 30.9
7934 1.778 0.079 2.4332E+04 10.100 31.0
8193 1.716 0.085 1.8373E+04 9.819 30.5
8466 1.660 0.091 1.4992E+04 9.615 29.7
8743 1.609 0.097 1.1894E+04 9.384 29.1
9037 1.562 0.102 9.4284E+03 9.151 28.5
9001 1.520 0.108 8.3217E+03 9.027 27.6
HIGHER ORDER MOMENTS OF WILSON DISTRIBUTION OF CENTRIC DATA
AS COMPARED WITH THEORETICAL VALUES. (EXPECTED: 1.00)
# RES <I**2>/ <I**3>/ <I**4>/
3<I>**2 15<I>**3 105<I>**4
440 8.445 0.740 0.505 0.294
442 5.260 0.762 0.733 0.735
442 4.106 0.888 0.788 0.717
439 3.480 1.339 1.733 2.278
438 3.073 1.168 1.259 1.400
440 2.781 1.215 1.681 2.269
438 2.560 1.192 1.603 2.405
450 2.384 1.117 1.031 0.891
432 2.240 1.214 1.567 2.173
438 2.119 0.972 0.992 0.933
445 2.016 1.029 1.019 0.986
441 1.926 1.603 1.701 1.554
440 1.848 1.544 1.871 2.076
436 1.778 0.927 0.661 0.435
444 1.716 1.134 1.115 1.197
440 1.660 1.271 1.618 2.890
436 1.609 1.424 1.045 0.941
448 1.562 1.794 1.447 1.423
426 1.520 2.517 1.496 2.099
8355 overall 1.253 1.255 1.455
HIGHER ORDER MOMENTS OF WILSON DISTRIBUTION OF ACENTRIC DATA
AS COMPARED WITH THEORETICAL VALUES. (EXPECTED: 1.00)
# RES <I**2>/ <I**3>/ <I**4>/
2<I>**2 6<I>**3 24<I>**4
1227 8.445 1.322 1.803 2.340
2356 5.260 1.167 1.420 1.789
3105 4.106 1.010 1.046 1.100
3708 3.480 1.055 1.262 1.592
4250 3.073 0.999 1.083 1.375
4714 2.781 1.061 1.232 1.591
5130 2.560 1.049 1.178 1.440
5516 2.384 1.025 1.117 1.290
5892 2.240 1.001 1.058 1.230
6269 2.119 1.060 1.140 1.233
6585 2.016 1.109 1.344 1.709
6890 1.926 1.028 1.100 1.222
7224 1.848 1.060 1.150 1.348
7498 1.778 1.143 1.309 1.655
7749 1.716 1.182 1.299 1.549
8026 1.660 1.286 1.376 1.538
8307 1.609 1.419 1.481 1.707
8589 1.562 1.663 1.750 2.119
8575 1.520 2.271 2.172 5.088
111610 overall 1.253 1.354 1.804
======= CUMULATIVE INTENSITY DISTRIBUTION =======
DEFINITIONS:
<I> = mean reflection intensity
Na(Z)exp = expected number of acentric reflections with I <= Z*<I>
Na(Z)obs = observed number of acentric reflections with I <= Z*<I>
Nc(Z)exp = expected number of centric reflections with I <= Z*<I>
Nc(Z)obs = observed number of centric reflections with I <= Z*<I>
Nc(Z)obs/Nc(Z)exp versus resolution and Z (0.1-1.0)
# RES 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
440 8.445 0.75 0.95 0.98 1.00 0.98 0.99 1.00 1.00 1.02 1.02
442 5.260 1.18 1.11 1.09 1.09 1.07 1.08 1.08 1.08 1.07 1.06
442 4.106 0.97 1.01 0.98 0.97 0.96 0.94 0.92 0.91 0.92 0.94
439 3.480 0.91 0.88 0.91 0.91 0.89 0.90 0.90 0.89 0.89 0.93
438 3.073 0.92 0.92 0.90 0.93 0.94 0.99 1.02 0.99 0.96 0.96
440 2.781 0.98 1.01 1.02 1.05 1.04 1.03 1.04 1.02 1.01 1.01
438 2.560 1.02 1.10 1.05 1.03 1.01 1.03 1.04 1.01 1.04 1.02
450 2.384 0.78 0.93 0.92 0.93 0.89 0.89 0.92 0.95 0.96 0.95
432 2.240 0.69 0.82 0.84 0.86 0.91 0.92 0.93 0.94 0.95 0.95
438 2.119 0.75 0.87 0.95 1.02 1.09 1.09 1.12 1.12 1.10 1.08
445 2.016 0.86 0.86 0.87 0.90 0.91 0.93 0.98 0.99 1.00 1.00
441 1.926 0.88 0.79 0.79 0.81 0.82 0.84 0.85 0.85 0.86 0.86
440 1.848 1.00 0.89 0.85 0.83 0.85 0.85 0.88 0.90 0.90 0.92
436 1.778 1.03 0.87 0.79 0.79 0.80 0.84 0.85 0.87 0.90 0.92
444 1.716 1.09 0.85 0.81 0.78 0.80 0.80 0.81 0.81 0.84 0.85
440 1.660 1.27 1.01 0.93 0.88 0.85 0.84 0.84 0.85 0.88 0.91
436 1.609 1.34 1.00 0.89 0.83 0.80 0.80 0.80 0.81 0.80 0.83
448 1.562 1.39 1.09 0.93 0.86 0.81 0.78 0.77 0.79 0.78 0.78
426 1.520 1.38 1.03 0.88 0.83 0.82 0.80 0.78 0.76 0.75 0.74
8355 overall 1.01 0.95 0.92 0.91 0.91 0.91 0.92 0.92 0.93 0.93
Na(Z)obs/Na(Z)exp versus resolution and Z (0.1-1.0)
# RES 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
1227 8.445 1.10 1.22 1.21 1.21 1.14 1.10 1.12 1.10 1.11 1.09
2356 5.260 1.15 1.10 1.09 1.03 1.03 1.03 1.01 1.01 1.01 1.00
3105 4.106 0.91 0.96 0.99 1.01 1.02 1.00 1.00 0.99 0.99 1.00
3708 3.480 0.93 0.97 1.00 1.06 1.05 1.04 1.04 1.04 1.04 1.05
4250 3.073 0.94 1.02 1.01 1.00 1.01 1.00 1.00 1.01 1.02 1.02
4714 2.781 1.11 1.04 1.02 1.02 1.02 1.01 1.01 1.01 1.00 1.00
5130 2.560 1.00 1.10 1.06 1.03 1.01 1.02 1.01 1.01 1.01 1.02
5516 2.384 1.09 1.08 1.05 1.04 1.04 1.02 1.01 1.01 1.01 1.01
5892 2.240 0.98 0.99 1.00 1.01 1.01 1.01 1.00 1.00 1.00 1.00
6269 2.119 1.14 1.04 1.02 1.00 1.00 1.00 1.01 1.02 1.02 1.01
6585 2.016 1.17 1.02 1.01 1.02 1.02 1.03 1.02 1.02 1.02 1.02
6890 1.926 1.35 1.07 1.00 0.99 1.00 1.01 1.01 1.00 1.00 1.01
7224 1.848 1.52 1.11 1.01 0.97 0.96 0.98 0.98 0.98 0.98 0.99
7498 1.778 1.80 1.22 1.03 0.97 0.95 0.94 0.95 0.95 0.95 0.96
7749 1.716 2.01 1.28 1.07 0.99 0.94 0.92 0.92 0.92 0.93 0.93
8026 1.660 2.31 1.41 1.13 1.01 0.95 0.92 0.90 0.89 0.89 0.89
8307 1.609 2.62 1.54 1.19 1.04 0.95 0.90 0.88 0.87 0.86 0.87
8589 1.562 2.94 1.69 1.29 1.10 1.00 0.93 0.89 0.86 0.85 0.85
8575 1.520 3.14 1.78 1.34 1.13 1.01 0.93 0.88 0.85 0.83 0.83
111610 overall 1.73 1.24 1.09 1.03 0.99 0.97 0.96 0.96 0.96 0.96
List of 33 reflections *NOT* obeying Wilson distribution (Z> 10.0)
h k l RES Z Intensity Sigma
72 11 61 1.52 17.34 0.2886E+06 0.2367E+05 "alien"
67 53 6 1.50 15.85 0.2638E+06 0.1128E+06 "alien"
35 10 25 3.17 14.39 0.2118E+08 0.2364E+06 "alien"
46 17 99 1.50 14.16 0.2357E+06 0.9588E+05 "alien"
34 32 2 2.75 13.44 0.1239E+08 0.1279E+06 "alien"
79 6 15 1.60 13.10 0.3117E+06 0.2477E+05 "alien"
61 20 33 1.88 12.54 0.8900E+06 0.3054E+05 "alien"
44 4 48 2.30 12.38 0.4695E+07 0.6072E+05 "alien"
66 25 19 1.79 11.89 0.5788E+06 0.2739E+05 "alien"
66 25 11 1.81 11.88 0.5781E+06 0.2771E+05 "alien"
60 43 61 1.50 11.77 0.1959E+06 0.9769E+05 "alien"
72 11 17 1.74 11.64 0.4278E+06 0.2619E+05 "alien"
80 24 26 1.50 11.41 0.1899E+06 0.9793E+05 "alien"
41 21 26 2.59 11.09 0.6988E+07 0.7945E+05 "alien"
44 18 20 2.59 11.08 0.6982E+07 0.7839E+05 "alien"
23 3 62 2.59 11.06 0.6971E+07 0.9154E+05 "alien"
69 7 22 1.80 11.06 0.5383E+06 0.2564E+05 "alien"
73 10 15 1.72 10.98 0.4036E+06 0.2356E+05 "alien"
70 17 35 1.68 10.96 0.3286E+06 0.2415E+05 "alien"
57 24 41 1.88 10.91 0.7746E+06 0.2842E+05 "alien"
82 24 6 1.50 10.74 0.1787E+06 0.1019E+06 "alien"
69 25 62 1.50 10.67 0.1775E+06 0.8689E+05 "alien"
24 20 44 2.91 10.45 0.9641E+07 0.1017E+06 "alien"
66 43 5 1.63 10.37 0.2468E+06 0.2294E+05 "alien"
81 4 29 1.53 10.36 0.1725E+06 0.2364E+05 "alien"
60 40 26 1.72 10.32 0.3792E+06 0.2578E+05 "alien"
39 18 57 2.18 10.24 0.3885E+07 0.5573E+05 "alien"
70 41 15 1.57 10.19 0.1922E+06 0.2281E+05 "alien"
55 36 41 1.79 10.16 0.4942E+06 0.2967E+05 "alien"
37 4 81 1.88 10.15 0.7202E+06 0.3357E+05 "alien"
56 27 5 2.06 10.14 0.1854E+07 0.3569E+05 "alien"
44 39 29 2.06 10.09 0.1844E+07 0.3805E+05 "alien"
65 46 29 1.56 10.06 0.1898E+06 0.2270E+05 "alien"
List of 33 reflections *NOT* obeying Wilson distribution (sorted by resolution)
Ice rings could occur at (Angstrom):
3.897,3.669,3.441, 2.671,2.249,2.072, 1.948,1.918,1.883,1.721
h k l RES Z Intensity Sigma
82 24 6 1.50 10.74 0.1787E+06 0.1019E+06
67 53 6 1.50 15.85 0.2638E+06 0.1128E+06
80 24 26 1.50 11.41 0.1899E+06 0.9793E+05
60 43 61 1.50 11.77 0.1959E+06 0.9769E+05
69 25 62 1.50 10.67 0.1775E+06 0.8689E+05
46 17 99 1.50 14.16 0.2357E+06 0.9588E+05
72 11 61 1.52 17.34 0.2886E+06 0.2367E+05
81 4 29 1.53 10.36 0.1725E+06 0.2364E+05
65 46 29 1.56 10.06 0.1898E+06 0.2270E+05
70 41 15 1.57 10.19 0.1922E+06 0.2281E+05
79 6 15 1.60 13.10 0.3117E+06 0.2477E+05
66 43 5 1.63 10.37 0.2468E+06 0.2294E+05
70 17 35 1.68 10.96 0.3286E+06 0.2415E+05
73 10 15 1.72 10.98 0.4036E+06 0.2356E+05
60 40 26 1.72 10.32 0.3792E+06 0.2578E+05
72 11 17 1.74 11.64 0.4278E+06 0.2619E+05
66 25 19 1.79 11.89 0.5788E+06 0.2739E+05
55 36 41 1.79 10.16 0.4942E+06 0.2967E+05
69 7 22 1.80 11.06 0.5383E+06 0.2564E+05
66 25 11 1.81 11.88 0.5781E+06 0.2771E+05
61 20 33 1.88 12.54 0.8900E+06 0.3054E+05
57 24 41 1.88 10.91 0.7746E+06 0.2842E+05
37 4 81 1.88 10.15 0.7202E+06 0.3357E+05
56 27 5 2.06 10.14 0.1854E+07 0.3569E+05
44 39 29 2.06 10.09 0.1844E+07 0.3805E+05
39 18 57 2.18 10.24 0.3885E+07 0.5573E+05
44 4 48 2.30 12.38 0.4695E+07 0.6072E+05
44 18 20 2.59 11.08 0.6982E+07 0.7839E+05
41 21 26 2.59 11.09 0.6988E+07 0.7945E+05
23 3 62 2.59 11.06 0.6971E+07 0.9154E+05
34 32 2 2.75 13.44 0.1239E+08 0.1279E+06
24 20 44 2.91 10.45 0.9641E+07 0.1017E+06
35 10 25 3.17 14.39 0.2118E+08 0.2364E+06
cpu time used by XSCALE 25.9 sec
elapsed wall-clock time 28.1 sec
I would like to extract the second-to-last line where the 11th column has a number followed by an asterisk (xy.z*). E.g. in this table the line I'm looking for would contain "23.2*" in the 11th column (CC(1/2)). I want the second-to-last because the last would be the line that starts with "total", and that one was a lot easier to extract with a simple grep command.
So the expected output for the code in this case would be to print the line:
1.50 273432 20770 20893 99.4% 333.4% 342.1% 273340 0.80 346.9% 23.2* -1 0.644 19495
In a different file, the second-to-last value in the 11th column with an asterisk may correspond to 1.6 in the first column, so the expected output would be:
1.60 216910 5769 5769 100.0% 207.5% 214.7% 216910 1.72 210.4% 26.0* -3 0.654 5204
And so on for all the different possible positions of the asterisk in the table.
I've tried using things like grep "[0-9, 0-9, ., 0-9*]" file.name and various other grep and fgrep invocations, but I'm pretty new to this and can't get it to work.
Any help would be greatly appreciated.
Sam
GNU sed
(for your updated script)
sed -n '/LIMIT/,/=/{/^\s*\(\S*\s*\)\{10\}[0-9.-]*\*/H;x;s/^.*\n\(.*\n.*\)$/\1/;x;/=/{x;P;q}}' file
.. output is:
1.50 273432 20770 20893 99.4% 333.4% 342.1% 273340 0.80 346.9% 23.2* -1 0.644 19495
To print the entire second-to-last line which matches that regex, you can do something like this:
awk '$11~/[0-9.]+\*/{secondlast=last;last=$0}END{print secondlast}' logFile
This one liner can do it:
$ awk '{if ($11 ~ /\*/) {i++; a[i]=$0}} END {print a[i -1]}' file
1.50 274090 20781 20874 99.6% 333.7% 341.9% 274015 0.80 347.1% 24.8* 0 0.645 19516
Explanation
It adds to the array a[] all lines that contain * in the 11th field, then prints not the last one but the one before it.
Update
Since your log is very big and asterisks appear all over it, I updated my code to:
$ awk '{if ($11 == /[0-9]*.[0-9]*\*/) {i++; a[i]=$0}} END {print a[i -1]}' a
0.90 0 0 147505 0.0% -99.9% -99.9% 0 -99.00 -99.9% 0.0 0 0.000 0
so it looks for lines with NNN.XXX* format.
awk '$11~/^[0-9.]+\*$/ {prev=val; val=$11+0} END {print prev}' log
I add 0 to the value of $11 to convert the string "23.2*" to the number 23.2.
Alternatively, when I hear "nth from the end", I think: reverse it and take the nth from the top:
tac log | awk '$11~/^[0-9.]+\*$/ && ++n == 2 {print $11+0; exit}'
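If a shell one-liner is not a requirement, the same logic is easy to express in Python (a sketch; file.name stands for your log file, as in the question):
import re

pattern = re.compile(r'^[0-9.]+\*$')
matching = []
with open('file.name') as fh:
    for line in fh:
        fields = line.split()
        # keep lines whose 11th whitespace-separated field looks like "23.2*"
        if len(fields) >= 11 and pattern.match(fields[10]):
            matching.append(line.rstrip('\n'))
if len(matching) >= 2:
    print(matching[-2])  # second-to-last matching line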