Getting the numerical substring of fields using awk

I was wondering how I can get the numerical substring of fields using awk in a text file like the one shown below. I am already familiar with the substr() function. However, since the lengths of the fields are not fixed, I have no idea how to separate the text from the numerical part.
A.txt
"Asd.1"
"bcdujcd.2"
"mshde.3333"
"deuhdue.777"
P.S. All the numbers are separated from the text part by a single dot (.).

You could try this:
$ echo "bcdujcd.2" | awk -F'[^0-9]+' '{print $2}'

If you don't care about any non-digit parts of the line and only want to see the digit parts as output you could use:
awk '{gsub(/[^[:digit:]]+/, " ")}7' A.txt
which will generate:
1
2
3333
777
as output (for the record, each line has a leading space, and with the quoted sample input a trailing one as well).
If there can only be one number field per line, then the replacement above can be "" instead of " " in the gsub and the leading space will go away. The replacement with the space keeps multiple numerical fields separated by a space if they occur on a single line (i.e. "foo.88.bar.11" becomes 88 11 instead of 8811).
If you just need the second (period-delimited) field of each line, then awk -F. '{print $2}' will do that, although with the quoted sample input the closing double quote stays attached to the number.

$ awk -F'[".]' '{print $3}' file
1
2
3333
777
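As a further sketch (not taken from any of the answers above), the number can also be isolated without relying on a fixed field position. It assumes, as the question states, that the number follows the last dot on the line; the sample lines are recreated from the question and fed through a pipe instead of A.txt.

```shell
# Sample lines recreated from the question, quotes included.
result=$(printf '%s\n' '"Asd.1"' '"bcdujcd.2"' '"mshde.3333"' '"deuhdue.777"' |
  awk '{
    sub(/.*\./, "")      # drop everything up to and including the last dot
    gsub(/[^0-9]/, "")   # drop any remaining non-digits (the closing quote)
    print
  }')
printf '%s\n' "$result"
```

Because the first sub() is greedy, text that itself contains dots before the final one is removed as well.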

Related

How to use awk to count the occurrences of a word beginning with something?

I have a file that looks like this:
FID IID
1 RQ50131-0
2 469314
3 469704
4 469712
5 RQ50135-2
6 469720
7 470145
I want to use awk to count the occurrences of IDs beginning with 'RQ' in column 2.
So for this little snapshot, it should be 2. The numbers after RQ differ, so I want a count of anything that begins with RQ.
I am using this code
awk -F '\t' '{if(match("^RQ$",$2))print}'|wc -l ID.txt > RQ.txt
But I don't get an output.
Tabs are used as field delimiters by default (same as spaces), so you can omit -F '\t'.
You can use
awk '$2 ~ /^RQ/{cnt++} END{print cnt}' ID.txt > RQ.txt
Whenever field 2 starts with RQ, cnt is incremented; once the whole file has been processed, cnt is printed.
You did
{if(match("^RQ$",$2))print}
but the arguments to the match function must be given in the order string, regexp. Also, do not use $ if you want to find strings that start with RQ, because $ anchors the match to the end of the string. With those issues fixed, the code would be
{if(match($2,"^RQ"))print}
Disclaimer: this answer only describes how to fix the problems in your current code; it does not suggest other ways to improve it.
Also apart from the reversed parameters for match, the file ID.txt should come right after the closing single quote.
As you want to print the whole line, you can omit the if statement and the print statement because match returns the index at which that substring begins, or 0 if there is no match.
awk 'match($2,"^RQ")' ID.txt | wc -l > RQ.txt
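A small sketch of the counting approach, using tab-separated sample rows recreated from the question rather than the real ID.txt. The `cnt + 0` in the END block is an addition of this sketch: it forces a numeric 0 when nothing matches, where a bare `print cnt` would print an empty line.

```shell
# Count rows whose second field starts with RQ (header row included in input).
count=$(printf 'FID\tIID\n1\tRQ50131-0\n2\t469314\n5\tRQ50135-2\n' |
  awk '$2 ~ /^RQ/ { cnt++ } END { print cnt + 0 }')
echo "$count"

# Same program on input with no RQ IDs at all.
none=$(printf '1\t469314\n2\t469704\n' |
  awk '$2 ~ /^RQ/ { cnt++ } END { print cnt + 0 }')
echo "$none"
```

The header row is harmless here because IID does not start with RQ.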

Find an exact match from a patterns file for another file using awk (patterns contain regex symbols to be ignored)

I have a file which has the following patterns.
NO_MATCH
NO_MATCH||NO_MATCH
NO_MATCH||NO_MATCH||NO_MATCH
NO_MATCH||NO_MATCH||NO_MATCH||NO_MATCH
These should be matched exactly with the 5th column of the target csv. I have tried:
awk 'NR==FNR{a[$0]=$0; next;} NR>FNR{if($5==a[$0])print $0}' pattern.csv input.csv > final_out.csv
But the || in the patterns file result in bad matches. The 5th column in the target csv looks something like this:
"AAAA||AAAA"
"BBBB||BBBB"
"NO_MATCH"
"NO_MATCH||NO_MATCH||NO_MATCH"
"NO_MATCH||BBBB"
I need to extract the 3rd and 4th lines.
Edit: I need exact match such as line 3 & 4. Hope this clears up the issue. The columns in the csv are double quoted as shown, and the quotes around fifth column should be removed.
awk 'BEGIN{FS=OFS=","} NR==FNR{a["\""$0"\""];next} ($5 in a){gsub(/^"|"$/,"",$5);print}' pattern.csv input.csv > final_out.csv
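This stores each line of pattern.csv, wrapped in double quotes, as a key of the array a. For each line of input.csv, if the fifth column exists as a key in the array, the quotes around it are removed and the line is printed. Because the lookup is a plain string comparison, the || in the patterns is never interpreted as a regular expression.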
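The lookup can be tried on a small sample. In this sketch the first four columns of the input are hypothetical placeholders (the question only shows the fifth column), and temporary files stand in for pattern.csv and input.csv. It also assumes no quoted field contains a comma, since FS is a plain comma.

```shell
patterns=$(mktemp)
input=$(mktemp)

# Two of the patterns from the question.
printf '%s\n' 'NO_MATCH' 'NO_MATCH||NO_MATCH||NO_MATCH' > "$patterns"

# Columns 1-4 are made up for illustration; column 5 follows the question.
printf '%s\n' \
  '"1","a","b","c","AAAA||AAAA"' \
  '"2","a","b","c","NO_MATCH"' \
  '"3","a","b","c","NO_MATCH||BBBB"' \
  '"4","a","b","c","NO_MATCH||NO_MATCH||NO_MATCH"' > "$input"

# Keys are stored with surrounding quotes so they compare equal to raw $5.
result=$(awk 'BEGIN { FS = OFS = "," }
  NR == FNR { a["\"" $0 "\""]; next }
  ($5 in a) { gsub(/^"|"$/, "", $5); print }' "$patterns" "$input")
printf '%s\n' "$result"

rm -f "$patterns" "$input"
```

Only rows 2 and 4 are printed, with the quotes around the fifth column removed.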

Filter fields with multiple delimiters

I've done extensive searching for a solution but can't quite find what I need. Have a file like this:
aaa|bbb|ccc|ddd~eee^fff^ggg|hhh|iii
111|222|333|444~555^666^777|888|999
AAA|BBB|CCC||EEE|FFF
What I want to do is use awk or something else to return lines from this file with a change to field 4(pipe delimited). Field 4 has a tilde and caret as delimiters which is where I'm struggling. We want the lines returned as this:
aaa|bbb|ccc|eee|hhh|iii
111|222|333|555|888|999
AAA|BBB|CCC||EEE|FFF
If field 4 is empty, it's returned as is. But when field 4 has multiple values, we want the first value right after the tilde returned only.
awk -F "[|^~]" 'BEGIN{OFS="|"}NF==6{print} NF==9{print $1,$2,$3,$5,$8,$9}' tmp.txt
aaa|bbb|ccc|eee|hhh|iii
111|222|333|555|888|999
AAA|BBB|CCC||EEE|FFF
Use a regular expression as your delimiter.
Count the fields to decide what to do.
Set the output delimiter to pipe.
$ awk -F'|' '{sub(/^[^~]*~/, "", $4); sub(/\^.*/, "", $4)} 1' OFS='|' file
aaa|bbb|ccc|eee|hhh|iii
111|222|333|555|888|999
AAA|BBB|CCC||EEE|FFF
This approach makes no assumption about the contents of fields other than field 4. The other fields may, for example, contain ~ or ^ characters and that will not affect the results.
How it works
-F'|'
This sets the field delimiter on input to |.
sub(/^[^~]*~/, "", $4)
If field 4 contains a ~, this removes the first ~ and everything before the first ~.
sub(/\^.*/, "", $4)
If field 4 contains ^, this removes the first ^ and everything after it.
1
This is awk's cryptic shorthand for print-the-line.
OFS='|'
This sets the field separator on output to |.
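To illustrate the point about other fields, here is a sketch with a made-up first line in which fields 1 and 5 also contain ~ and ^. Those sample values are hypothetical, and OFS is passed with -v here so the program can read from a pipe.

```shell
# Only field 4 is rewritten; ~ and ^ elsewhere are left alone.
result=$(printf '%s\n' 'a~x|bbb|ccc|ddd~eee^fff^ggg|h^h|iii' 'AAA|BBB|CCC||EEE|FFF' |
  awk -F'|' -v OFS='|' '{
    sub(/^[^~]*~/, "", $4)   # remove text up to and including the first ~
    sub(/\^.*/, "", $4)      # remove the first ^ and everything after it
  } 1')
printf '%s\n' "$result"
```

The first line comes out as a~x|bbb|ccc|eee|h^h|iii, whereas an approach that splits on all three delimiters would count extra fields on it.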

awk to ignore double quote and compare two files

I have two input files
FILE 1
123
125
123
129
and file 2
"a"|"123"|"anc"
"b"|"124"|"ind"
"c"|"123"|"su"
"d"|"122"|"aus"
OUTPUT:
"b"|"124"|"ind"
"d"|"122"|"aus"
How can I compare $1 from file1 with $2 from file2 and print the lines of file2 whose value does not occur in file1? I'm having trouble because of the double quotes (").
So how can I do the comparison while ignoring the double quotes?
$ awk 'FNR==NR{a[$1]=1;next} a[$3]==0' file1 FS='["|]+' file2
"b"|"124"|"ind"
"d"|"122"|"aus"
How it works:
file1 FS='["|]+' file2
This list of files tells awk to read file1 first, then change the field separator to any combination of double-quotes and vertical bars and then read file2.
FNR==NR{a[$1]=1;next}
FNR is the number of lines that awk has read from the current file and NR is the total number of lines read. Consequently, FNR==NR is true only while reading the first file. The commands which follow in braces are only executed for the first file.
This creates an associative array a whose keys are the first fields of file1 and whose values are 1. The next command tells awk to skip the rest of the commands and start over on the next line.
a[$3]==0
This is true only if the number in field 3 did not occur in file1. If it is true, then the default action is taken which is to print the line. (With the field separator that we have chosen, the number you are interested in is in field 3.)
Alternative
$ awk 'FNR==NR{a[$1]=1;next} a[substr($2,2,length($2)-2)]==0' file1 FS='|' file2
"b"|"124"|"ind"
"d"|"122"|"aus"
This is similar to the above except that the field separator is just a vertical bar. In this case, the number that you are interested in is in field 2. We use substr to remove one character from either end of field 2 which has the effect of removing the double-quotes.
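A variant sketch of the first command that tests key membership with the in operator instead of comparing a[$3] with 0. The array name seen and the temporary files are choices made for this sketch; the data is copied from the question. Unlike a[$3]==0, the in test does not create a new array element for every value it looks up.

```shell
file1=$(mktemp)
file2=$(mktemp)
printf '%s\n' 123 125 123 129 > "$file1"
printf '%s\n' '"a"|"123"|"anc"' '"b"|"124"|"ind"' '"c"|"123"|"su"' '"d"|"122"|"aus"' > "$file2"

# First file: remember each number. Second file: print lines whose
# third field (with this FS, the number) was never seen.
result=$(awk 'FNR == NR { seen[$1]; next } !($3 in seen)' \
  "$file1" FS='["|]+' "$file2")
printf '%s\n' "$result"

rm -f "$file1" "$file2"
```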

How do I tell awk to use = as a separator (with white spaces removed too)

Suppose I have the following file.
John=good
Tom = ok
Tim = excellent
I know the following lets me use = as a separator.
awk -F= '{print $1,$2}' file
This gives me the following results.
John good
Tom   ok
Tim   excellent
I would like the white spaces to be ignored, so that only the names and their performances are printed out.
One way to get around this is run another awk on the results.
awk -F= '{print$1,$2}' file | awk '{print $1,$2}'
But I wanted to know if I could do this in one awk?
Include them in the separator definition; it's a regexp.
$ awk -F' *= *' '{print $1, $2}' foo.ds
John good
Tom ok
Tim excellent
The FS variable can be set to a regular expression.
From the AWK manual
The following table summarizes how fields are split, based on the value of FS.
FS == " "
Fields are separated by runs of whitespace. Leading and trailing whitespace are ignored. This is the default.
FS == any single character
Fields are separated by each occurrence of the character. Multiple successive occurrences delimit empty fields, as do leading and trailing occurrences.
FS == regexp
Fields are separated by occurrences of characters that match regexp. Leading and trailing matches of regexp delimit empty fields.
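The three cases in the table can be checked with a few one-liners. This is only a sketch; the input strings are made up for illustration.

```shell
# Default FS: runs of whitespace separate fields, edges are ignored.
default_nf=$(printf '  a   b  \n' | awk '{ print NF }')

# Single-character FS: every comma delimits, so leading, trailing and
# doubled commas produce empty fields.
single_nf=$(printf ',a,,b,\n' | awk -F, '{ print NF }')

# Regexp FS: the spaces around = are part of the separator.
regex_out=$(printf 'Tom = ok\n' | awk -F' *= *' '{ print $1 "|" $2 }')

echo "$default_nf $single_nf $regex_out"
```

This prints 2 5 Tom|ok: two fields with the default separator, five (three of them empty) with the comma, and no stray spaces with the regular expression.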