price.txt file has two columns: (name and value)
Mary 134
Lucy 56
Jack 88
range.txt file has three columns: (fruit and min_value and max_value)
apple 57 136
banana 62 258
orange 88 99
blueberry 98 121
My aim is to test whether each value in price.txt is between the min_value and max_value in range.txt. If yes, output 1; if not, output "x".
I tried:
awk 'FNR == NR { name=$1; price[name]=$2; next} {
for (name in price) {
if ($2<=price[name] && $3>=price[name]) {print 1} else {print "x"}
}
}' price.txt range.txt
But my results are all in one column, as follows:
1
1
x
x
x
x
x
x
1
1
1
x
Actually, I want my result to be like: (Each name has one column)
1 x 1
1 x 1
x x 1
x x x
Because I need to use paste to add the output file and range.txt file together. The final result should be like:
apple 57 136 1 x 1
banana 62 258 1 x 1
orange 88 99 x x 1
blueberry 98 121 x x x
So, how can I get the result of each loop iteration in a different column? And is there any way to output the final result without paste, based on my current code? Thank you.
This builds on what you provided,
# load prices by index to maintain read order
FNR == NR {
price[names++]=$2
next
}
# names now holds the number of prices read, so the loop below
# does not need the non-standard length(array)
{
l = $1 " " $2 " " $3
for (i=0; i < names; i++) {
if ($2 <= price[i] && $3 >= price[i]) {
l = l " 1"
} else {
l = l " x"
}
}
print l
}
and generates output,
apple 57 136 1 x 1
banana 62 258 1 x 1
orange 88 99 x x 1
blueberry 98 121 x x x
However, the output does not carry the person's name for each result (anonymous results). Maybe that's intentional?
The change here is to explicitly index the array populated in the first block, so that the read order is maintained.
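For comparison, the same idea (keep the prices in read order, then append one flag per price to each range row) can be sketched in Python. This is only an illustration: `range_flags` and its arguments are hypothetical names, and the data is typed in rather than read from price.txt and range.txt.

```python
def range_flags(prices, ranges):
    """Build one output row per range, with a 1/x flag for every price.

    prices: list of numbers, in the order they were read (hypothetical input).
    ranges: list of (fruit, min_value, max_value) tuples (hypothetical input).
    """
    rows = []
    for fruit, lo, hi in ranges:
        # A list keeps the prices in read order, like the indexed awk array.
        flags = ["1" if lo <= p <= hi else "x" for p in prices]
        rows.append(" ".join([fruit, str(lo), str(hi)] + flags))
    return rows


prices = [134, 56, 88]  # Mary, Lucy, Jack
ranges = [("apple", 57, 136), ("banana", 62, 258),
          ("orange", 88, 99), ("blueberry", 98, 121)]
for row in range_flags(prices, ranges):
    print(row)
```

A plain list is used because, as in awk, iterating over an associative structure does not promise the order in which the names were read.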
I have a data set: (file.txt)
X Y
1 a
2 b
3 c
10 d
11 e
12 f
15 g
20 h
25 i
30 j
35 k
40 l
41 m
42 n
43 o
46 p
I want to add two columns, Up10 and Down10:
Up10: the number of rows whose X lies between X-10 and X (inclusive).
Down10: the number of rows whose X lies between X and X+10 (inclusive).
For example:
X Y Up10 Down10
35 k 3 5
For Up10: 35-10 = 25, so X=35, X=30 and X=25 are counted. Total = 3 rows.
For Down10: 35+10 = 45, so X=35, X=40, X=41, X=42 and X=43 are counted. Total = 5 rows.
Desired Output:
X Y Up10 Down10
1 a 1 5
2 b 2 5
3 c 3 4
10 d 4 5
11 e 5 4
12 f 5 3
15 g 4 3
20 h 5 3
25 i 3 3
30 j 3 3
35 k 3 5
40 l 3 5
41 m 3 4
42 n 4 3
43 o 5 2
46 p 5 1
This is Pierre François' solution (thanks again, Pierre François):
awk '
BEGIN{OFS="\t"; print "X\tY\tUp10\tDown10"}
(NR == FNR) && (FNR > 1){a[$1] = $1 + 0}
(NR > FNR) && (FNR > 1){
up = 0; upl = $1 - 10
down = 0; downl = $1 + 10
for (i in a) { i += 0 # tricky: convert i to integer
if ((i >= upl) && (i <= $1)) {up++}
if ((i >= $1) && (i <= downl)) {down++}
}
print $1, $2, up, down;
}
' file.txt file.txt > file-2.txt
But when I use this command on 13GB of data, it takes too long.
I also tried this approach on the same 13GB of data:
awk 'BEGIN{ FS=OFS="\t" }
NR==FNR{a[NR]=$1;next} {x=y=FNR;while(--x in a&&$1-10<a[x]){} while(++y in a&&$1+10>a[y]){} print $0,FNR-x,y-FNR}
' file.txt file.txt > file-2.txt
When file-2.txt reaches 1.1GB, it freezes. I have waited several hours, but the command never finishes and I never get the final output file.
Note: I am working on Google Cloud. Machine type:
e2-highmem-8 (8 vCPUs, 64 GB memory)
A single-pass awk that keeps a sliding window of the last 10 records and uses it to count the ups and downs. For symmetry's sake there should be deletes in the END block, but I guess a few extra array elements in memory aren't going to make a difference:
$ awk '
BEGIN {
FS=OFS="\t"
}
NR==1 {
print $1,$2,"Up10","Down10"
}
NR>1 {
a[NR]=$1
b[NR]=$2
for(i=NR-9;i<=NR;i++) {
if(a[i]>=a[NR]-10&&i>=2)
up[NR]++
if(a[i]<=a[NR-9]+10&&i>=2)
down[NR-9]++
}
}
NR>10 {
print a[NR-9],b[NR-9],up[NR-9],down[NR-9]
delete a[NR-9]
delete b[NR-9]
delete up[NR-9]
delete down[NR-9]
}
END {
for(nr=NR+1;nr<=NR+9;nr++) {
for(i=nr-9;i<=nr;i++)
if(a[i]<=a[nr-9]+10&&i>=2&&i<=NR)
down[nr-9]++
print a[nr-9],b[nr-9],up[nr-9],down[nr-9]
}
}' file
Output:
X Y Up10 Down10
1 a 1 5
2 b 2 5
...
35 k 3 5
...
43 o 5 2
46 p 5 1
Another single-pass approach with a sliding window:
awk '
NR == 1 { next } # skip the header
NR == 2 { min = max = cur = 1; X[cur] = $1; Y[cur] = $2; next }
{ X[++max] = $1; Y[max] = $2
if (X[cur] >= $1 - 10) next
for (; X[cur] + 10 < X[max]; ++cur) {
for (; X[min] < X[cur] - 10; ++min) {
delete X[min]
delete Y[min]
}
print X[cur], Y[cur], cur - min + 1, max - cur
}
}
END {
for (; cur <= max; ++cur) {
for (; X[min] < X[cur] - 10; ++min);
for (i = max; i > cur && X[cur] + 10 < X[i]; --i);
print X[cur], Y[cur], cur - min + 1, i - cur + 1
}
}
' file
The script assumes the X column is ordered numerically.
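The sliding-window idea can also be sketched with two moving indices. The Python below is a minimal illustration, not a translation of either awk script. It assumes the X values are already sorted and distinct and, unlike the awk versions, it holds the whole column in memory. `window_counts` is a hypothetical name.

```python
def window_counts(xs, width=10):
    """Return a list of (up, down) pairs for a sorted list of distinct values.

    up   = number of rows with value in [x - width, x]
    down = number of rows with value in [x, x + width]
    """
    n = len(xs)
    result = []
    lo = 0  # first index whose value is >= x - width
    hi = 0  # first index whose value is >  x + width
    for i, x in enumerate(xs):
        # Both indices only ever move forward, so the total work is linear.
        while xs[lo] < x - width:
            lo += 1
        while hi < n and xs[hi] <= x + width:
            hi += 1
        result.append((i - lo + 1, hi - i))
    return result


xs = [1, 2, 3, 10, 11, 12, 15, 20, 25, 30, 35, 40, 41, 42, 43, 46]
for x, (up, down) in zip(xs, window_counts(xs)):
    print(x, up, down)
```

Because neither index is ever reset, each row is visited a constant number of times, which is what makes this kind of approach scale better than rescanning every row for every record.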
So I created a pandas data frame showing the coordinates for an event and number of times those coordinates appear, and the coordinates are shown in a string like this.
Coordinates Occurrences x
0 (76.0, -8.0) 1 0
1 (-41.0, -24.0) 1 1
2 (69.0, -1.0) 1 2
3 (37.0, 30.0) 1 3
4 (-60.0, 1.0) 1 4
.. ... ... ..
63 (-45.0, -11.0) 1 63
64 (80.0, -1.0) 1 64
65 (84.0, 24.0) 1 65
66 (76.0, 7.0) 1 66
67 (-81.0, -5.0) 1 67
I want to create a new data frame that shows the x and y coordinates individually and shows their occurrences as well like this--
x Occurrences y Occurrences
76 ... -8 ...
-41 ... -24 ...
69 ... -1 ...
37 ... 30 ...
-60 ... 1 ...
I have tried to split the string, but I don't think I am doing it correctly, and I don't know how to add the result to the table anyway. I think I would need something like a for loop later in my code. I scraped the data from an API; here is the code that sets up the data frame shown above.
for key in contents['liveData']['plays']['allPlays']:
    # for plays in key['result']['event']:
    # print(key)
    if (key['result']['event'] == "Shot"):
        #print(key['result']['event'])
        scoordinates = (key['coordinates']['x'], key['coordinates']['y'])
        if scoordinates not in shots:
            shots[scoordinates] = 1
        else:
            shots[scoordinates] += 1
    if (key['result']['event'] == "Goal"):
        #print(key['result']['event'])
        gcoordinates = (key['coordinates']['x'], key['coordinates']['y'])
        if gcoordinates not in goals:
            goals[gcoordinates] = 1
        else:
            goals[gcoordinates] += 1
#create data frame using pandas
gdf = pd.DataFrame(list(goals.items()),columns = ['Coordinates','Occurences'])
print(gdf)
sdf = pd.DataFrame(list(shots.items()),columns = ['Coordinates','Occurences'])
print()
Try this (the pattern is written as a raw string so that `\.` is not treated as a string escape):
import re
df[['x', 'y']] = df.Coordinates.apply(lambda c: pd.Series(dict(zip(['x', 'y'], re.findall(r'-?[0-9]+\.[0-9]+', c.strip())))))
Using the built-in string methods should be performant:
df[["x", "y"]] = df["Coordinates"].str.strip("()").str.split(",", expand=True).astype(float)
(This also converts x and y to float values, which was not requested but is probably desired. The original used np.float, an alias that has been removed from recent NumPy releases; the plain float does the same job.)
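If the goal is the per-axis occurrence counts from the question, one option is to tally them straight from the dictionary before any data frame is built. The sketch below uses only the standard library and assumes, as in the question's code, a dictionary keyed by (x, y) tuples. `axis_occurrences` is a hypothetical helper and the sample values are made up.

```python
from collections import Counter


def axis_occurrences(coords):
    """Tally occurrences per x value and per y value.

    coords: dict mapping (x, y) tuples to occurrence counts,
            shaped like the shots/goals dicts in the question.
    """
    x_counts, y_counts = Counter(), Counter()
    for (x, y), n in coords.items():
        x_counts[x] += n
        y_counts[y] += n
    return x_counts, y_counts


# Made-up sample in the same shape as the scraped data.
shots = {(76.0, -8.0): 1, (76.0, 7.0): 2, (69.0, -1.0): 1, (80.0, -1.0): 1}
x_counts, y_counts = axis_occurrences(shots)
print(sorted(x_counts.items()))
print(sorted(y_counts.items()))
```

Each Counter can then be turned into its own two-column frame with the same `pd.DataFrame(list(d.items()), columns=[...])` pattern the question already uses.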
I have sequences of characters, and I would like to split each sequence into 3-character classes from beginning to end and then get the count of each class. Here is a small example with sequences for 2 IDs.
>ID1
ATGTCCAAGGGGATCCTGCAGGTGCATCCTCCGATCTGCGACTGCCCGGGCTGCCGAATA
TCCTCCCCGGTGAACCGGGGGCGGCTGGCAGACAAGAGGACAGTCGCCCTGCCTGCCGCC
>ID2
ATGAAACTTTCACCTGCGCTCCCGGGAACAGTTTCTGCTCGGACTCCTGATCGTTCACCT
CCCTGTTTTCCCGACAGCGAGGACTGTCTTTTCCAACCCGACATGGATGTGCTCCCAATG
ACCTGCCCGCCACCACCAGTTCCAAAGTTTGCACTCCTTAAGGATTATAGGCCTTCAGCT
And here is a small example of the output for ID1. I want the same output for every ID in the input file (the sequence lines belonging to each ID follow its header line). The counts for the next ID should come right after those of the first, and so on.
ID1_3nt count
ATG 1
TCC 3
AAG 2
GGG 2
ATC 2
CTG 3
CAG 1
GTG 2
CAT 1
CCT 2
CCG 3
TGC 3
GAC 2
GGC 1
CGA 1
ATA 1
AAC 1
CGG 2
GCA 1
AGG 1
GCC 3
ACA 1
GTC 1
I tried this code:
awk '{i=0; printf ">%s\n",$2; while(i<=length($1)) {printf "%s\n", substr($1,i,3);i+=3}} /,substr,/ {count++}' | awk 'END { printf(" ID_3nt: %d",count)}
But it did not return what I want. Do you know how to improve it?
How about this patsplit()-based implementation? (patsplit() is a GNU awk extension, so it needs gawk.)
#! /usr/bin/awk -f
# initialize publicly scoped vars...
function init() {
split("", idx) # index of our class (for ordering)
split("", cls) # our class name
split("", cnt) # num of classes we have seen
sz = 0 # number of classes for this ID
}
# process a class record
function proc( i, n, x) {
# split on each 3 characters
n = patsplit($0, a, /.../)
for (i=1; i<=n; ++i) {
x = a[i]
if (x in idx) {
# if this cls exists, just increment the count
++cnt[idx[x]]
} else {
# if this cls doesn't exist, index it in
cls[sz] = x
cnt[sz] = 1
idx[x] = sz++
}
}
}
# spit out class summary
function flush( i) {
if(!sz)
return
for(i=0; i<sz; ++i)
print cls[i], cnt[i]
init()
}
BEGIN {
init()
}
/^>ID/ {
flush()
sub(/^>/, "")
print $0 "_3nt count"
next
}
{
# we could have just inlined proc(), but using a function
# provides us with locally scoped variables
proc()
}
END {
flush()
}
$ cat tst.awk
sub(/^>/,"") { if (NR>1) prt(); name=$0; next }
{ rec = rec $0 }
END { prt() }
function prt( cnt, class) {
while ( rec != "" ) {
cnt[substr(rec,1,3)]++
rec = substr(rec,4)
}
print name "_3nt count"
for (class in cnt) {
print class, cnt[class]
}
}
$ awk -f tst.awk file
ID1_3nt count
ACA 1
AAC 1
CGA 1
CAT 1
GTG 2
CAG 1
GGG 2
CCG 3
CCT 2
GCA 1
ATA 1
GAC 2
AAG 2
GCC 3
ATC 2
TCC 3
CGG 2
CTG 3
GTC 1
AGG 1
GGC 1
TGC 3
ATG 1
ID2_3nt count
AAA 1
CCC 3
ACA 1
GTG 1
TTT 2
TGT 2
GTT 2
ACC 1
CCG 2
CTC 3
CCT 4
GCA 1
AAG 2
GAC 3
TCA 3
AGC 1
ACT 1
CGT 1
CGG 1
CTT 3
TAT 1
CAA 1
GAG 1
GAT 3
GGA 1
AGG 1
TGC 1
CCA 5
TTC 1
GCT 2
TCT 1
GCG 1
ATG 3
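For readers who want to cross-check the counts outside awk, here is a minimal Python sketch of the same technique: join the lines of each record, then count non-overlapping 3-character chunks. It assumes FASTA-like input with header lines starting with ">", and it drops any trailing 1 or 2 leftover characters. `triplet_counts` is a hypothetical name.

```python
from collections import Counter


def triplet_counts(lines):
    """Map each record name to a Counter of its non-overlapping 3-char chunks."""
    sequences = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]          # header line: start a new record
            sequences[name] = []
        elif name is not None:
            sequences[name].append(line)

    result = {}
    for name, parts in sequences.items():
        seq = "".join(parts)
        usable = len(seq) - len(seq) % 3   # ignore an incomplete final chunk
        result[name] = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    return result


sample = [">ID1", "ATGTCC", "TCCATG", ">ID2", "AAAA"]
for name, counts in triplet_counts(sample).items():
    print(name + "_3nt count")
    for chunk, n in counts.items():
        print(chunk, n)
```

A Counter keeps keys in first-seen order on current Python versions, so the output follows the order of first appearance, as the desired output in the question does.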
I'm fighting with awk again for pulling out data from a log file. The area in question of my log file looks like this, however there are a few thousand lines above and below this block:
4C*DJ - (B-C)*DJK + 2*(2A+B+C)*D1 - 4*(4A+B-3C)*D2 = 0
Value = 0.5293955920D-22
Alpha Matrix in cm-1
Axis Mode Inertia Coriol. Anharm. Total
x 1 -0.37699D-03 -0.36413D-02 0.10830D-01 0.68121D-02
x 2 -0.83656D-03 -0.53163D-02 0.14483D-01 0.83306D-02
x 3 -0.15253D-02 -0.10512D-01 0.20064D-01 0.80264D-02
x 4 -0.17103D-03 -0.73492D-03 0.14953D-01 0.14047D-01
x 5 -0.96312D-03 -0.11748D-01 0.15825D-02 -0.11128D-01
x 6 -0.46095D-03 -0.94225D-02 0.44165D-02 -0.54669D-02
x 7 -0.26926D-01 -0.10167D-01 0.29406D-01 -0.76866D-02
x 8 -0.17827D-02 -0.21079D-01 0.74564D-02 -0.15405D-01
x 9 -0.55840D-02 0.84897D-01 -0.29596D-02 0.76354D-01
x 10 -0.50287D-24 0.36312D-01 -0.44078D-02 0.31904D-01
x 11 -0.48777D-24 -0.63320D-01 0.18876D-02 -0.61432D-01
x 12 -0.35364D-24 0.42877D-01 0.62352D-03 0.43500D-01
y 1 -0.23141D-05 -0.13777D-03 0.53278D-03 0.39270D-03
y 2 -0.62128D-05 -0.87905D-04 0.36602D-03 0.27190D-03
y 3 -0.55613D-05 -0.33722D-04 0.28874D-03 0.24946D-03
y 4 -0.47995D-04 -0.60863D-03 0.17426D-02 0.10860D-02
y 5 -0.36076D-04 -0.20493D-03 0.12026D-03 -0.12075D-03
y 6 -0.12725D-03 -0.61930D-03 -0.15830D-03 -0.90485D-03
y 7 -0.19917D-03 -0.55423D-04 0.10520D-02 0.79740D-03
y 8 -0.48978D-03 -0.13733D-02 0.54899D-03 -0.13141D-02
y 9 -0.11432D-02 0.62058D-03 -0.20074D-04 -0.54272D-03
y 10 -0.16078D-24 0.20852D-02 -0.88466D-04 0.19967D-02
y 11 -0.63877D-25 0.18274D-03 -0.13682D-03 0.45922D-04
y 12 -0.43257D-25 0.92039D-03 -0.61669D-03 0.30370D-03
z 1 -0.69174D-07 -0.23737D-03 0.59290D-03 0.35547D-03
z 2 -0.60773D-05 -0.18704D-03 0.53271D-03 0.33960D-03
z 3 -0.46425D-05 -0.29722D-03 0.57403D-03 0.27217D-03
z 4 -0.22234D-04 -0.47670D-03 0.15748D-02 0.10759D-02
z 5 -0.20254D-04 0.24124D-03 0.11848D-03 0.33947D-03
z 6 -0.42788D-04 0.99264D-04 -0.40246D-04 0.16230D-04
z 7 -0.10941D-03 0.30020D-03 0.13135D-02 0.15043D-02
z 8 -0.19997D-03 0.32196D-03 0.54501D-03 0.66699D-03
z 9 -0.20819D-03 0.45666D-03 -0.67765D-04 0.18071D-03
z 10 -0.55249D-25 0.00000D+00 -0.14491D-03 -0.14491D-03
z 11 -0.55828D-26 0.00000D+00 -0.69139D-04 -0.69139D-04
z 12 -0.26265D-26 0.00000D+00 -0.45200D-03 -0.45200D-03
Vibro-Rot alpha Matrix (cm-1)
a(z) b(x) c(y)
Q( 1) 0.00681 0.00039 0.00036
I need to extract the data from (in this case) " x 1 -0.37..." through "z 12 -0.262..."
I can head and tail the file if I can just get awk to extract the data to some known point. I have about 300 of these files, each has a different number of lines so I can't just count lines, but they all start with "Axis Mode Inertia..." and end with "Vibro-Rot alpha Matrix".
I'm currently trying to use:
awk '$1=="Axis"&&$2=="Mode"{t=1};t;/[0-9]+ "Vibro-Rot alpha Matrix"/{exit}' file.log
Which works to get the start of the file (though it includes the header which I can subsequently cut off). But the end part of the awk command doesn't work. I've tried to end it with ^Vib/{exit} and other things, but nothing seems to work, I just get a few thousand lines of the log file when I do it.
As I'm sure it matters, there is a single space before "axis" at the top, and before "Vibro-Rot" at the bottom of the file. Though the " $1=="Axis"&&$2=="Mode" " part doesn't seem to care about a single white space.
What am I missing to cut until the line that has "Vibro-Rot alpha Matrix" in it?
Thanks in advance!
Ben
It worked for me:
awk '$1 == "Axis" && $2 == "Mode" {t = 1;} $1 == "Vibro-Rot" && $2 == "alpha" && $3 == "Matrix" {t = 0;} t == 1 && NF == 6 {print $0}' file.log
In case you do not want the header, try:
awk '$1 == "Vibro-Rot" && $2 == "alpha" && $3 == "Matrix" {t = 0;} t == 1 && NF == 6 {print $0} $1 == "Axis" && $2 == "Mode" {t = 1;}' file.log
Try something like:
awk '!NF{p=0}p; /Axis Mode/{p=1}' file.log
Or, using your original approach:
awk '/Vibro-Rot alpha Matrix/{exit}t; $1=="Axis"&&$2=="Mode"{t=1}' file.log
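Huh? Use grep (this assumes no other lines in the log start with x, y or z, and that the data lines have no leading whitespace):
grep -E "^x|^y|^z" yourfile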
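The flag-based awk answers above all follow the same pattern: switch printing on after the "Axis Mode" header and stop at the "Vibro-Rot alpha Matrix" line. A minimal Python sketch of that pattern follows. `extract_block` is a hypothetical name and the sample lines are shortened stand-ins for the real log.

```python
def extract_block(lines, end_marker="Vibro-Rot alpha Matrix"):
    """Return the lines between the 'Axis Mode ...' header and end_marker.

    The header and the end line are excluded, and blank lines are skipped.
    """
    block = []
    inside = False
    for line in lines:
        if not inside:
            # Comparing fields ignores leading and repeated whitespace.
            if line.split()[:2] == ["Axis", "Mode"]:
                inside = True
            continue
        if end_marker in line:
            break                      # stop reading at the end marker
        if line.strip():
            block.append(line.rstrip("\n"))
    return block


sample = [
    " Value = 0.5293955920D-22\n",
    " Axis Mode Inertia Coriol. Anharm. Total\n",
    " x 1 -0.37699D-03 -0.36413D-02 0.10830D-01 0.68121D-02\n",
    " z 12 -0.26265D-26 0.00000D+00 -0.45200D-03 -0.45200D-03\n",
    " Vibro-Rot alpha Matrix (cm-1)\n",
    " Q( 1) 0.00681 0.00039 0.00036\n",
]
for row in extract_block(sample):
    print(row)
```

Testing the end marker before printing is the same ordering fix as in the awk one-liners: the exit rule has to come before the print rule, otherwise the marker line itself is emitted.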
How can I read only lines 3, 9, 12 and 15 from the file containing the following lines?
The idea is that whenever I get a run of consecutive lines containing x and y, I want to print the last line of that run.
What I mean is, for example, if I have an awk script like: BEGIN { name = $2; value=$3; } { if(name == x && value==y && the scan reaches at lines 3, 9, 12 and 15) printf("hello world") }, what expression can I use instead of "the scan reaches at lines 3, 9, 12 and 15"?
1 x y
2 x y
3 x y
4 a d
5 e f
6 x y
7 x y
8 x y
9 x y
10 g f
11 x y
12 x y
13 p r
14 w c
15 x y
16 a z
One way with awk:
$ awk '/^[0-9]+ x y$/{a=$0;f=1;next}f{print a;f=0}' file
3 x y
9 x y
12 x y
15 x y
One way without awk:
$ tac file | uniq -f1 | fgrep -w 'x y' | tac
3 x y
9 x y
12 x y
15 x y
Something like this?
awk 'a=="xy" && $2$3!="xy" {print b} {a=$2$3;b=$0}' file
3 x y
9 x y
12 x y
15 x y
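You need two while loops here: one to find the start of a run and another to iterate to its end. Something like this (a line counts as a match when its last two fields are x and y, so it works with or without the leading number). Hope that helps.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LastXY {
    // True when the last two whitespace-separated fields are "x" and "y".
    static boolean isXY(String line) {
        String[] f = line.trim().split("\\s+");
        return f.length >= 2 && f[f.length - 2].equals("x") && f[f.length - 1].equals("y");
    }

    public static void main(String[] args) {
        String line;
        int i = 0; // number of lines read so far
        try (BufferedReader in = new BufferedReader(new FileReader("D:\\readline.txt"))) {
            while ((line = in.readLine()) != null) {
                i++;
                if (isXY(line)) {
                    String searchline = line;
                    int lastNo = i; // line number of the latest match in this run
                    // Iterate until the run of x/y lines ends (or the file does)
                    while ((line = in.readLine()) != null) {
                        i++;
                        if (!isXY(line)) {
                            break;
                        }
                        searchline = line;
                        lastNo = i;
                    }
                    System.out.println("Line " + lastNo + " is the last of a run containing x and y: " + searchline);
                }
            }
        } catch (IOException e) {
            System.out.println("Exception caught: " + e.getMessage());
        }
    }
}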
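The same "remember the last match, emit it when the run ends" logic fits in one loop. The Python sketch below is only an illustration of that technique. `last_of_each_run` is a hypothetical name, and a line is treated as a match when its last two fields are x and y.

```python
def last_of_each_run(lines):
    """Return the last line of every run of consecutive 'x y' lines."""
    result = []
    previous = None  # most recent matching line of the current run, if any
    for line in lines:
        if line.split()[-2:] == ["x", "y"]:
            previous = line
        elif previous is not None:
            result.append(previous)   # the run just ended
            previous = None
    if previous is not None:
        result.append(previous)       # the input ended inside a run
    return result


data = ["1 x y", "2 x y", "3 x y", "4 a d", "5 e f", "6 x y", "7 x y",
        "8 x y", "9 x y", "10 g f", "11 x y", "12 x y", "13 p r",
        "14 w c", "15 x y", "16 a z"]
print(last_of_each_run(data))
```

The final check after the loop covers an input that ends in the middle of a run, a case that a flag-only awk one-liner misses unless it also has an END rule.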