Selecting one output when regex matches more than once - awk

Given an input file containing multiple rows matching my target:
...
100 100 100 100
Expression: out1
200 200 200 200
300 300 300 300
Expression: out2
400 400 400 400
500 500 500 500
Expression: out3
...
If I do
awk '/Expression:/ { printf " %s ", $2 }' "$file"
I get multiple outputs
out1 out2 out3
How could I choose only one of the outputs based on the position at which it is found in the file, for instance out3?

If n rows match the regex and you want to select the last one:
awk '/Expression:/ { last_found = $2 } END{ print last_found }' file
To display the nth match for the given regex:
awk '/Expression:/ { if(++i==3){ print $2; exit } }' file
Input
akshay@db-3325:~$ cat f
...
100 100 100 100
Expression: out1
200 200 200 200
300 300 300 300
Expression: out2
400 400 400 400
500 500 500 500
Expression: out3
...
Output
# For the 3rd one
akshay@db-3325:~$ awk '/Expression:/ { if(++i==3){ print $2; exit } }' f
out3
# For the 2nd one
akshay@db-3325:~$ awk '/Expression:/ { if(++i==2){ print $2; exit } }' f
out2
# For the last one
akshay@db-3325:~$ awk '/Expression:/ { last_found = $2 } END{ print last_found }' f
out3

Simply count the occurrences and print the stored value once the counter reaches the threshold (the exit makes sure it prints only once):
awk '/Expression/{c+=1;s=$2} c==3{print s; exit}' file
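If you would rather not hard-code the match number, a minimal sketch that passes it in as a variable (same input assumed; -v is standard awk):
awk -v n=3 '/Expression:/ && ++i == n { print $2; exit }' file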

Divide largest value by second largest value

I have a file in the following format. Column 1 has ~20,000 unique entries, column 2 has ~120,000 distinct entries, and column 3 has a count associated with column 2. A single entry in column 1 can have multiple entries in column 2. For each unique entry in column 1, I am trying to get the ratio of the maximum value to the second-maximum value of column 3.
F1.txt
S1 S2 C1
A A1 1
A AA 10
A A6 5
A A0 4
B BB 12
B BC 11
B B1 19
B B9 4
Expected Output
S1 S2 C1
B B1 19 1.58333
A AA 10 2
I can do it in steps like below, but is there a smarter way of doing it in one script?
awk 'NR==1; NR > 1 {print $0 | "sort -k3 -n -r "}' F1.txt | awk '!seen[$1]++' >del1.txt
awk 'FNR==NR{a[$2]=1; next}FNR==1{print $0;}!a[$2]' del1.txt F1.txt | awk 'NR==1; NR > 1 {print $0 | "sort -k3 -n -r"}' | awk '!seen[$1]++' >del2.txt
awk 'FNR==NR{a[$1]=$3; next}FNR==1{print $0"\t";"RT"}FNR>1 a[$1]{print $0"\t"$3/a[$1]}' del2.txt del1.txt
With GNU awk, you can collect each group into a multi-dimensional array and sort it in END:
#!/usr/bin/awk -f
NR == 1 { print $1, $2, $3; next }    # pass the header through
{ data[$1][$3] = $2 }                 # data[key][count] = subkey
END {
    for (key in data) {
        # sort this key's counts in descending numeric order
        asorti(data[key], s, "@ind_num_desc")
        print key, data[key][s[1]], s[1], s[1] / s[2]
    }
}
This assumes an arbitrary ordering of the lines (and requires gawk (which is pretty common) or another implementation with native multi-dimensional "arrays").
If you can make more assumptions about the input (e.g. that it is always grouped by the first column), then you can make it more memory-efficient and get rid of the multi-dimensional arrays, by not delaying the evaluation until END and instead computing the result in a per-line block each time the first column's value changes (and then one last time in END).
To get a different handling of equal numeric values (e.g. to report the “subkey” (column 2) of the first (instead of last) encountered occurrence of a value), you could add if (!($3 in data[$1])) ... or the like.
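For example, a minimal sketch of that tweak, replacing the per-line block above (keeps the first subkey seen for each count):
{ if (!($3 in data[$1])) data[$1][$3] = $2 }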
Whenever you find yourself creating a pipeline containing awk, there is a very good chance that what you are trying to do can be done in a single call to awk much more efficiently.
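As a trivial illustration of the point, a hypothetical pipeline like
grep 'pattern' file | awk '{print $2}'
collapses into
awk '/pattern/{print $2}' file
with one fewer process and no extra pass over the data.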
A non-GNU awk approach that presumes all field1 'A' records are together and all 'B' records are together (as you show in your sample data) could be:
awk '
FNR==1 { print; next }              ## 1st line, output heading
$1 != n {                           ## 1st field changed
    if (n) {                        ## if n set, output result of last block
        printf "%s\t%s\n", rec, max / nextmax
    }
    rec = $0                        ## initialize vars for next block
    n = $1
    max = $3
    nextmax = 1
    next                            ## skip to next record
}
{
    if ($3 > max) {                 ## check if 3rd field > max
        rec = $0                    ## save record
        nextmax = max               ## update nextmax
        max = $3                    ## update max
    }
    else if ($3 > nextmax) {        ## if 3rd field > nextmax
        nextmax = $3                ## update nextmax
    }
}
END { printf "%s\t%s\n", rec, max / nextmax }   ## output final block results
' file
Example Use/Output
With your data in a file named file, you would have:
$ awk ' ...same script as above... ' file
S1 S2 C1
A AA 10 2
B B1 19 1.58333
Using any awk in any shell on every Unix box and using almost no memory (important since your input file would be huge given your description of it):
$ cat tst.awk
BEGIN { OFS="\t" }
NR == 1 { print; next }
$1 != prev {                        # 1st field changed: finish previous group
    if ( prev != "" ) {
        print prev, val, max, (preMax ? max/preMax : 0)
    }
    prev = $1
    max = preMax = ""
}
(max == "") || ($3 > max) {         # new max: old max becomes second max
    val = $2
    preMax = max
    max = $3
    next
}
(preMax == "") || ($3 > preMax) {   # new second max
    preMax = $3
}
END { print prev, val, max, (preMax ? max/preMax : 0) }
$ awk -f tst.awk F1.txt
S1 S2 C1
A AA 10 2
B B1 19 1.58333

How to replace data in pandas by using values in dict?

I have a series which contains several numbers. I want to replace them with other, string-type data using dictionary values, but I don't know how to do that...
GDP_group['GdpForYearPer$1M'].head(5)
0 46.919625
1 47.515189
2 47.737955
3 54.832578
4 56.338028
5 63.101272
This is the dict that I made to do the replacement:
range_GDP = {'$0 ~ $100M': np.arange(0,100), '$100M ~ $1B': np.arange(100.0000001,1000), '$1B ~ $10B': np.arange(1000.000001, 10000), '$10B ~ $100B': np.arange(10000.000001, 100000),
'$100B ~ $1T': np.arange(100000.000001, 1000000), '$1T ~': np.arange(1000000.000001, 20000000)}
You can use pd.cut to segment your data in ranges and apply labels.
(re)generate dummy data sampled uniformly in log space:
import numpy as np
import pandas as pd
GdpForYearPer1M = pd.Series(10**np.random.randint(0, 8, 100))
"""
0 1
1 1000
2 100
3 10
4 100
...
95 1000000
96 100
97 100000
98 10000
99 10
"""
solution:
# generate "cuts" (bins) and associated labels from `range_GDP`.
cut_data = [(np.min(v), k) for k, v in range_GDP.items()]
bins, labels = zip(*cut_data)
# bins required to have one more value than labels
bins = list(bins) + [np.inf]
pd.cut(GdpForYearPer1M, bins=bins, labels=labels)
output:
0 $0 ~ $100M
1 $100M ~ $1B
2 $0 ~ $100M
3 $0 ~ $100M
4 $0 ~ $100M
...
95 $100B ~ $1T
96 $0 ~ $100M
97 $10B ~ $100B
98 $1B ~ $10B
99 $0 ~ $100M
Length: 100, dtype: category
Categories (6, object): [$0 ~ $100M < $100M ~ $1B < $1B ~ $10B < $10B ~ $100B < $100B ~ $1T < $1T ~]

Process multiple files with awk

I would like to count the number of points in each interval. I have the positions of the points in the first file and the intervals in the second. First I store the point attributes in two arrays (pos and name), and then I want to loop over them in order to determine whether each point belongs to the given interval ($1 is the name, $2 is the start and $3 is the end of the interval). I have the following code:
awk 'NR==FNR{name[NR]=$1;pos[NR]=$2;next}; {for (i in name) if (name[i] == $1 && pos[i] >= $2 && pos[i] <= $3) {sum[NR] += 1;}} END {for (i = 1; i <=length(sum); i++) {print sum[i]}} file1 file2 > out'
I get a syntax error: "syntax error near unexpected token `i"
I am a beginner in awk. Any help is highly appreciated. Thanks
awk '
NR==FNR {
    name[NR]=$1
    pos[NR]=$2
    next
}
{
    for (i in name) {
        if (name[i] == $1 && pos[i] >= $2 && pos[i] <= $3) { sum[FNR] += 1; }
    }
}
END {
    for (i = 1; i <= FNR; i++) {
        print sum[i]
    }
}
' points windows > output
points:
chr1 52
chr1 65
chr2 120
chr2 101
chr2 160
chr3 20
chr4 50
windows:
chr1 0 100
chr1 100 200
chr2 0 100
chr2 100 200
chr3 0 100
chr3 100 200
chr4 0 100
chr5 0 100
chr6 0 100
chr6 100 200
chr7 0 100
chr8 0 100
gave me the desired output:
2
3
1
1
Thank You
Your closing ' is in the wrong place, so the awk command is not terminated properly (the file names end up inside the quoted program). Could you please try the following. I couldn't test it since no samples were given.
awk 'NR==FNR{name[NR]=$1;pos[NR]=$2;next}; {for (i in name) if (name[i] == $1 && pos[i] >= $2 && pos[i] <= $3) {sum[NR] += 1;}} END {for (i = 1; i <=length(sum); i++) {print sum[i]}}' file1 file2
Non-one-liner form of the above solution:
awk '
NR==FNR {
    name[NR]=$1
    pos[NR]=$2
    next
}
{
    for (i in name) {
        if (name[i] == $1 && pos[i] >= $2 && pos[i] <= $3) { sum[NR] += 1 }
    }
}
END {
    for (i = 1; i <= length(sum); i++) {
        print sum[i]
    }
}
' file1 file2 > out
As per @Ed Morton's comment, here are a few recommendations. Again, these are untested since no samples were given, but you can try to apply them.
sum[NR] should be sum[FNR] in case you want the index to be the line number of the second Input_file. The difference between NR and FNR is that NR keeps increasing until all Input_file(s) are read, while FNR is RESET to 1 whenever a new Input_file is being read.
Then, length(sum) could be replaced by the value of FNR, because you are basically looking for the total number of times the loop has to run, and the FNR value gives you exactly that.
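A quick way to see the NR vs FNR difference for yourself, a minimal sketch using the two sample files from above:
$ awk '{ print FILENAME, "NR=" NR, "FNR=" FNR }' points windows
points NR=1 FNR=1
...
points NR=7 FNR=7
windows NR=8 FNR=1
windows NR=9 FNR=2
...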

rearrange from specific string into respective column

I'm trying to rearrange specific strings into their respective columns.
e.g.:
126N (will be sorted into the "Normal" column)
Value 1 (the integer will be concatenated with 126)
Resulting in:
N=Normal
126 # 1
Here is the input
(N=Normal, W=Weak)
Value 1
126N,
Value 3
18N,
Value 4
559N, 562N, 564N,
Value 6
553W, 565A, 553N,
Value 5
490W,
Value 9
564N,
And the output should be
W=Weak
490 # 5
553 # 6
A=Absolute
565 # 6
N=Normal
126 # 1
18 # 3
559 # 4
562 # 4
564 # 4
553 # 6
564 # 9
Let me know your thoughts on this.
I've tried this script, but I'm still figuring out how to concatenate the value:
cat input.txt | sed '/^\s*$/d' | awk 'BEGIN{RS=","};match($0,/N/){print $3"c"$2}' | sed ':a;N;$!ba;s/\n/;/g' | sed 's/W//g;s/N//g;s/S//g'
And some of the values are missing.
This should give you what you want using GNU awk.
It will work with any number of letters, not just A, N and W.
awk -F, '
!/Value/ {                              # data lines, e.g. "559N, 562N, 564N,"
    for (i=1;i<NF;i++) {                # i<NF: the trailing comma leaves an empty last field
        hd=substr($i,length($i),1)      # the last character is the group letter
        arr[hd][++cnt[hd]]=($i+0" # "f) # numeric prefix plus the current Value number
    }
}
{split($0,b," ");f=b[2];}               # on "Value n" lines this captures n (harmless elsewhere)
END {
    for (i in arr) {
        print "\n"i"\n---"
        for (j in arr[i])
            print arr[i][j]
    }
}' file
A
---
565 # 6
N
---
562 # 4
564 # 4
553 # 6
564 # 9
126 # 1
18 # 3
559 # 4
W
---
553 # 6
490 # 5
Another alternative in awk would be:
awk -F',| ' '
$1 == "Value" {value = $2; next}
{ for (i=1; i<=NF; i++) {
if ($i~"N$")
N[substr($i, 1, length($i) - 1)] = value
if ($i~"W$")
W[substr($i, 1, length($i) - 1)] = value
}
}
END {
print "W=Weak"
for (i in W)
print i, "#", W[i]
print "\nN=Normal"
for (i in N)
print i, "#", N[i]
}
' file
(Note: this relies on knowing the wanted headers are W=Weak and N=Normal; it would take a few additional expressions if the headers are subject to change. Also, because the arrays are keyed by the number alone, a number that occurs under several Values, such as 564, keeps only the last one, as the output below shows.)
Output
$ awk ' ...same script as above... ' file
W=Weak
490 # 5
N=Normal
18 # 3
126 # 1
559 # 4
562 # 4
564 # 9
$ cat tst.awk
NR%2 { val = $NF; next }                # odd lines: remember the Value number
{
    for (i=1; i<=NF; i++) {             # even lines: one entry per field
        num = $i+0                      # numeric prefix, e.g. 126 from "126N,"
        abbr = $i
        gsub(/[^[:alpha:]]/,"",abbr)    # strip everything but the letter
        list[abbr] = list[abbr] num " # " val ORS
    }
}
END {
    n = split("Weak Absolute Normal",types)
    for (i=1; i<=n; i++) {
        name = types[i]
        abbr = substr(name,1,1)
        print abbr "=" name ORS list[abbr]
    }
}
$ awk -f tst.awk file
W=Weak
553 # 6
490 # 5
A=Absolute
565 # 6
N=Normal
126 # 1
18 # 3
559 # 4
562 # 4
564 # 4
553 # 6
564 # 9

awk setting variables to make a range

I have the following two files:
File 1:
1 290 rs1345
2 450 rs5313
1 1120 rs4523
2 790 rs4325
File 2:
1 201 LDLR
2 714 APOA5
1 818 NOTCH5
1 514 TTN
I wish to isolate only the rows in file 2 in which the second field is within 100 units of the second field in file 1 (if field 1 matches):
Desired output: (note the last field is from the matching line in file1).
1 201 LDLR rs1345
2 714 APOA5 rs4325
I tried using the following code:
for i in {1..4} #there are 4 lines in file2
do
chr=$(awk 'NR=="'${i}'" { print $1 }' file2)
pos=$(awk 'NR=="'${i}'" { print $2 }' file2)
gene=$(awk 'NR=="'${i}'" { print $3 }' file2)
start=$(echo $pos | awk '{print $1-100}') #start and end variables for 100 unit range
end=$(echo $pos | awk '{print $1+100}')
awk '{if ($1=="'$chr'" && $2 > "'$start'" && $2 < "'$end'") print "'$chr'","'$pos'","'$gene'"$3}' file1
done
The code is not working; I believe something is wrong with my start and end variables, because when I echo $start I get 414, which doesn't make sense to me, and I get 614 when I echo $end.
I understand this question might be difficult to understand so please ask me if any clarification is necessary.
Thank you.
The difficulty is that $1 is not a unique key, so some care needs to be taken with the data structure used to store the data from file 1.
With GNU awk, you can use arrays of arrays:
gawk '
NR==FNR {f1[$1][$2] = $3; next}    # file1: f1[name][pos] = id
$1 in f1 {
    for (val in f1[$1])
        if (val-100 <= $2 && $2 <= val+100)
            print $0, f1[$1][val]
}
' file1 file2
Otherwise, you have to use a one-dimensional array and stuff 2 pieces of information into the key:
awk '
NR==FNR {f1[$1,$2] = $3; next}     # file1: key is "name SUBSEP pos"
{
    for (key in f1) {
        split(key, a, SUBSEP)      # a[1] = name, a[2] = pos
        if (a[1] == $1 && a[2]-100 <= $2 && $2 <= a[2]+100)
            print $0, f1[key]
    }
}
' file1 file2
That works with mawk and nawk (and gawk)
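Either version, run against the sample files, prints the desired output:
1 201 LDLR rs1345
2 714 APOA5 rs4325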
#!/usr/bin/env python3
import pandas as pd
from io import StringIO
file1 = """
1 290 rs1345
2 450 rs5313
1 1120 rs4523
2 790 rs4325
"""
file2 = """
1 201 LDLR
2 714 APOA5
1 818 NOTCH5
1 514 TTN
"""
sio = StringIO(file1)
df1 = pd.read_table(sio, sep=" ", header=None)
df1.columns = ["a", "b", "c"]
sio = StringIO(file2)
df2 = pd.read_table(sio, sep=" ", header=None)
df2.columns = ["a", "b", "c"]
df = pd.merge(df2, df1, left_on="a", right_on="a", how="outer")
# the query syntax is intuitive
r = df.query("b_y - 100 < b_x < b_y + 100")
print(r[["a", "b_x", "c_x", "c_y"]])
output:
a b_x c_x c_y
0 1 201 LDLR rs1345
7 2 714 APOA5 rs4325
pandas is the right tool for this kind of tabular data manipulation.