Extract substring from grep/awk results? - awk

I have a grep command that finds rows in a file, passes those to awk and prints out the 1st and 15th columns.
grep String1 /path/to/file.txt | grep string2 | awk -F ' ' '{print $1, $15}'
So far, so good. This results in a list like this:
2023-01-20 [text1]>
2023-01-22 [text2]>
2023-01-23 [text3]>
2023-01-25 [text4]>
Ideally, I'd like to add some regex to the awk command so that I get this:
2023-01-20 text1
2023-01-22 text2
2023-01-23 text3
2023-01-25 text4
My searches have only returned how to use regex with awk to identify fields but not to extract a substring from the results. Is this possible with awk or some other command?

One awk idea that combines the current code with the new requirement:
awk -v s1="String1" -v s2="string2" ' # feed both search strings in as awk variables "s1" and "s2"
$0~s1 && $0~s2 { print $1,substr($15,2,index($15,"]")-2) } # if s1 and s2 are both present in the current line then print 1st field and 15th field (sans the "[" "]" wrappers)
' /path/to/file.txt
A nonsensical demo file:
$ cat file.txt
a b c d e f g h i j k l m n o p q r s t u v w x y z
a string2 c d e f g h i j k l m n [old]> p q r s t u v String1 x y z
a b c d e f g h i j k l m n o p q r s t u v w x y z
a String1 c d e f g h i j k l m n [older]> p q r s t u v string2 x y z
Running the awk script against this file generates:
a old
a older

If you basically just want to delete the characters [, ] and >, you can simply use tr -d for this, something like:
... | tr -d "[]>"
Linux prompt>echo "2023-01-20 [text1]>" | tr -d "[]>"
2023-01-20 text1
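One thing to keep in mind is that tr -d works on the whole line, so a bracket or > in any other column would be deleted as well. A minimal sketch that limits the cleanup to the second column of the two-column list, assuming whitespace-separated input (strip_second_field is a made-up helper name, not something from the question):

```shell
# strip_second_field: hypothetical helper; reads the "date [text]>" list on stdin.
strip_second_field() {
  # Remove every "[", "]" and ">" from field 2 only, then print the record.
  awk '{ gsub(/\[|\]|>/, "", $2); print }'
}

printf '%s\n' '2023-01-20 [text1]>' | strip_second_field
```

Because field 2 is modified, awk rebuilds the line with single spaces between fields.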

Another option removing the leading [ and trailing ]> with gsub and an alternation:
awk '/String1/ && /string2/ {
gsub(/^\[|\]>$/, "", $15)
{print $1, $15}
}' file.txt
In gnu-awk you could use gensub:
awk '/String1/ && /string2/ {
{print $1, gensub(/^\[|\]>$/, "", "g", $15)}
}' file
Or find the occurrence of the string using index:
awk 'index($0, "String1") && index($0, "string2"){
gsub(/^\[|\]>$/, "", $15)
{print $1, $15}
}' file

Related

How can I use awk to calculate sum and replace column in file

I'm new to the site and to the programming world and I hope you have time to help me.
My problem is as follows: I have a file with several columns. In the 2nd column there are values. I'm trying to add a given number to each of those values and to replace the second column with a new column containing the results.
Here is an example of my input:
A B C
x 1 t
y 2 u
z 3 v
I want to add 5 to the values in column B and obtain an output like the one below:
A B C
x 6 t
y 7 u
z 8 v
The code I tried unsuccessfully is
zcat my_file.vcf.gz| tail -n +49 | awk 'BEGIN{FS=OFS="\t"} {print $0, $2+5}'>my.output.vcf
Thanks in advance
We can avoid tail, since awk itself can start printing at line 49. You also need to add the value to the 2nd field in place, and then print the whole line with print.
One important point: going by the OP's sample, if the 2nd field contains letters then 5 must NOT be added to it, so that condition is handled here too.
zcat my_file.vcf.gz |
awk '
BEGIN{ FS=OFS="\t" }
FNR>=49{
$2=($2~/[a-zA-Z]/?$2:$2+5)
print
}
' > my.output.vcf
You can use
awk 'BEGIN{FS=OFS="\t"} {$2+=5}1'
Here, $2+=5 adds 5 to the Field 2 value, and 1 triggers the display of the record (row, line; same as print $0).
See an online awk demo:
#!/bin/bash
s='A B C
x 1 t
y 2 u
z 3 v'
awk 'BEGIN{FS=OFS="\t"} {$2+=5}1' <<< "$s"
Output:
A 5 C
x 6 t
y 7 u
z 8 v
Another form for clarity:
awk 'BEGIN{FS=OFS="\t"} {print $1, $2+5, $3}'
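Both forms above also add 5 to the header line, which is why the demo prints A 5 C (a non-numeric B counts as 0). A small sketch that leaves the first line alone and takes the number to add as a parameter, assuming tab-separated input with exactly one header line (add_to_col2 and the awk variable n are names made up for this example):

```shell
# add_to_col2 N: hypothetical helper; adds N to field 2 of every line except the first.
add_to_col2() {
  awk -v n="$1" 'BEGIN { FS = OFS = "\t" } NR > 1 { $2 += n } { print }'
}

printf 'A\tB\tC\nx\t1\tt\ny\t2\tu\nz\t3\tv\n' | add_to_col2 5
```

For the VCF case, the header rule would need adapting (for example FNR >= 49 instead of NR > 1).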
You can use:
awk 'BEGIN {FS=OFS="\t"} NR == 1 {print $0} NR > 1 {print $1,($2+5),$3;}'
Output:
A B C
x 6 t
y 7 u
z 8 v
Maybe this can help you:
awk 'NR > 1 {$2 += 5} {print}' file
Applied to your code:
zcat my_file.vcf.gz | tail -n +49 | awk 'NR > 1 {$2 += 5} {print}' > my.output.vcf
cat boo
A B C
x 1 t
y 2 u
z 3 v
cat boo | awk 'BEGIN{FS=OFS="\t"} $2 ~ /^[0-9]+$/ {print $1, $2+5, $3} $2 !~ /^[0-9]+$/ {print} '
A B C
x 6 t
y 7 u
z 8 v

How to filter empty line with a 'cut' command?

I have a tab delimited file with a few fields:
f1 f2 f3
a b c
a c
d e
f g a
I want to extract the 3rd column with a 'cut' command:
cut -f3 t
This works. However, how can I filter out the empty lines in the output? As can be seen, the 2nd and 3rd lines are empty after they are extracted.
To remove empty output:
$ cut -f3 file | grep .
f3
c
a
Or:
$ awk -F'\t' '$3 {print $3}' file
f3
c
a
To replace the missing output with a filler:
$ awk -F'\t' '{if ($3) print $3; else print "FILL"}' file
f3
c
FILL
FILL
a
Or, for people who like the more compact ternary expression:
$ awk -F'\t' '{print ($3?$3:"FILL")}' file
f3
c
FILL
FILL
a
Example with multiple words in field 3
$ cat file2
f1 f2 f3
f g a b c d
$ cut -f3 file2 | grep .
f3
a b c d
$ awk -F'\t' '$3 {print $3}' file2
f3
a b c d
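One caveat with the bare $3 pattern: awk treats a field that looks like the number 0 as false, so a row whose third column is 0 would be dropped along with the empty ones. A minimal sketch that compares against the empty string instead, assuming tab-separated input (nonempty_col3 is just an illustrative name):

```shell
# nonempty_col3: hypothetical helper; prints field 3 whenever it is not the empty string.
nonempty_col3() {
  awk -F'\t' '$3 != "" { print $3 }' "$@"
}

# The second data row holds a literal 0 in column 3, the third row has no column 3.
printf 'f1\tf2\tf3\na\tb\t0\na\tc\nf\tg\ta\n' | nonempty_col3
```

It can be given file names as arguments or, as here, read from standard input.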

matching rows and fields from two files

I would like to match the record number in one file with the same field number in another file:
file1:
1
3
5
4
3
1
5
file2:
A B C D E F G
H I J J K L M
N O P Q R S T
I would like to use the record numbers corresponding to 5 in the first file to obtain the corresponding fields in the second file. Desired output:
C G
J M
P T
So far, I've done:
awk '{ if ($1=="5") print NR }' file1 > temp
for i in $(cat temp); do
awk '{ print $"'${i}'" }' file2
done
But get the output:
C
J
P
G
M
T
I would like to have this in the format of the desired output above, but can't get it to work. Perhaps using printf or an awk for-loop might work, but I have had no success.
Thank you all.
awk 'NR==FNR{if($1==5)a[NR];next}{for(i in a){printf "%s ", $i}print ""}' file1 file2
C G
J M
P T
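Note that awk does not guarantee the order in which for (i in a) visits the array, so the columns could in principle come out swapped. A sketch that keeps them in file order by storing the matching line numbers in a counted array (pick_fields, want, col and n are all names chosen for this example):

```shell
# pick_fields VALUE INDEX_FILE DATA_FILE  (hypothetical helper name)
pick_fields() {
  awk -v want="$1" '
    # First file: remember, in order, the line numbers whose first field equals "want".
    NR == FNR { if ($1 == want) col[++n] = FNR; next }
    # Second file: print those fields side by side, separated by single spaces.
    n { for (i = 1; i <= n; i++) printf "%s%s", $(col[i]), (i < n ? " " : "\n") }
  ' "$2" "$3"
}

# Demo with the data from the question, written to a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 1 3 5 4 3 1 5 > "$tmp/file1"
printf '%s\n' 'A B C D E F G' 'H I J J K L M' 'N O P Q R S T' > "$tmp/file2"
pick_fields 5 "$tmp/file1" "$tmp/file2"
rm -r "$tmp"
```

This variant also avoids the trailing space at the end of each output line.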

How to merge two files based on the first three columns using awk

I wanted to merge two files into a single one line by line using the first three columns as a key. Example:
file1.txt
a b c 1 4 7
x y z 2 5 8
p q r 3 6 9
file2.txt
p q r 11
a b c 12
x y z 13
My desired output for the above two files is:
a b c 1 4 7 12
x y z 2 5 8 13
p q r 3 6 9 11
The number of columns in each file is not fixed; it can vary from line to line. Also, I have more than 27K lines in each file.
They are not ordered. The only thing is that the first three fields are the same for both files.
You could also use join. It requires sorted input and joins on a single field, so the first 3 fields have to be merged into one. The example below sorts each file and lets sed merge and then separate the fields:
join <(sort file1.txt | sed 's/ /-/; s/ /-/') \
<(sort file2.txt | sed 's/ /-/; s/ /-/') |
sed 's/-/ /; s/-/ /'
Output:
a b c 1 4 7 12
p q r 3 6 9 11
x y z 2 5 8 13
Join on the first three fields where the number of fields is variable (four or more):
{
    # collect the fourth field through the last
    for (i=4;i<=NF;i++)
        f=f$i" "
    # concat fields
    arr[$1OFS$2OFS$3]=arr[$1OFS$2OFS$3]f;
    # reset field string
    f=""
}
END {
    for (key in arr)
        print key, arr[key]
}
Run like:
$ awk -f script.awk file1 file2
a b c 1 4 7 12
p q r 3 6 9 11
x y z 2 5 8 13
Try this:
awk 'NR==FNR{a[$1$2$3]=$4;next}$1$2$3 in a{print $0, a[$1$2$3]}' file2 file1
If the key columns have varying lengths, plain concatenation can produce ambiguous keys, so you could try something like this using SUBSEP:
awk 'NR==FNR{A[$1,$2,$3]=$4; next}($1,$2,$3) in A{print $0, A[$1,$2,$3]}' file2 file1
For varying columns in file1 and sorted output, try:
awk '{$1=$1; i=$1 FS $2 FS $3 FS; sub(i,x)} NR==FNR{A[i]=$0; next}i in A{print i $0, A[i]}' file2 file1 | sort
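Since the question says the number of columns can vary in both files, here is a sketch along the same lines that keeps everything after the third field of the lookup file instead of only $4, and prints in the order of the main file. It assumes whitespace-separated input and that each key appears once in the lookup file; merge_by_key3, extra and rest are names invented for this example:

```shell
# merge_by_key3 LOOKUP_FILE MAIN_FILE  (hypothetical helper name)
merge_by_key3() {
  awk '
    # Lookup file: store fields 4..NF under the three-field key.
    NR == FNR {
      rest = ""
      for (i = 4; i <= NF; i++) rest = rest OFS $i
      extra[$1, $2, $3] = rest
      next
    }
    # Main file: append the stored fields when the key is known.
    ($1, $2, $3) in extra { print $0 extra[$1, $2, $3] }
  ' "$1" "$2"
}

# Demo with the data from the question, written to a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'a b c 1 4 7' 'x y z 2 5 8' 'p q r 3 6 9' > "$tmp/file1.txt"
printf '%s\n' 'p q r 11' 'a b c 12' 'x y z 13' > "$tmp/file2.txt"
merge_by_key3 "$tmp/file2.txt" "$tmp/file1.txt"
rm -r "$tmp"
```

Only one of the two files is held in memory, which should be comfortable for 27K lines.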

Number of fields returned by awk

Is there a way to get awk to return the number of fields that met a field-separator criteria? Say, for instance, my file contains
a b c d
so, awk --field-separator=" " | <something> should return 4
The NF variable is set to the total number of fields in the input record. So:
echo "a b c d" | awk --field-separator=" " "{ print NF }"
will display
4
Note, however, that:
echo -e "a b c d\na b" | awk --field-separator=" " "{ print NF }"
will display:
4
2
Hope this helps, and happy awking
NF gives the number of fields for a given record:
$ echo "a b c d" | gawk '{print NF}'
4
If you would like to know the set of distinct field counts across multi-line content, you can run:
X | awk '{print NF}' | sort -n | uniq
where X is a command that writes content to standard output: cat, echo, etc. Example:
With file.txt:
a b
b c
c d
e t a
e u
The command cat file.txt | awk '{print NF}' | sort -n | uniq will print:
2
3
And with file2.txt:
a b
b c
c d
e u
The command cat file2.txt | awk '{print NF}' | sort -n | uniq will print:
2
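If it is also useful to know how many lines have each field count, the tally can be done inside awk itself. A minimal sketch, assuming the default whitespace field splitting (field_count_histogram and count are names made up here):

```shell
# field_count_histogram: hypothetical helper; prints "<number of fields> <number of lines>".
field_count_histogram() {
  # for-in order is unspecified in awk, so sort the result numerically afterwards.
  awk '{ count[NF]++ } END { for (n in count) print n, count[n] }' "$@" | sort -n
}

# Same content as file.txt above: four lines with 2 fields, one line with 3.
printf '%s\n' 'a b' 'b c' 'c d' 'e t a' 'e u' | field_count_histogram
```

The first column of the result is the same list that sort -n | uniq produces.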
awk(1) on FreeBSD does not recognize --field-separator. Use -v instead:
echo "a b c d" | awk -v FS=" " "{ print NF }"
It is a portable, POSIX way to define the field separator.
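The short -F option is part of POSIX awk as well, so it should work the same way on FreeBSD and with gawk. A small sketch using a comma as the separator (count_fields is a hypothetical wrapper name; note that a separator longer than one character is treated as a regular expression):

```shell
# count_fields SEP: hypothetical helper; prints the number of SEP-separated fields per input line.
count_fields() {
  awk -F"$1" '{ print NF }'
}

printf '%s\n' 'a,b,c,d' 'a,b' | count_fields ','
```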