How to print out a specific field in AWK? - awk

A very simple question, which I found no answer to. How do I print out a specific field in awk?
awk '/word1/' prints the whole line, when I need just word1. Or I need only a chain of patterns (word1 + word2) to be printed out from a text.

Well, if the pattern is a single word (which you want to print and which can't contain FS, the input field separator), why not:
awk -v MYPATTERN="INSERT_YOUR_PATTERN" '$0 ~ MYPATTERN { print MYPATTERN }' INPUTFILE
If your pattern is a regex (GNU awk only, because gensub is a gawk extension):
awk -v MYPATTERN="INSERT_YOUR_PATTERN" '$0 ~ MYPATTERN { print gensub(".*(" MYPATTERN ").*","\\1","1",$0) }' INPUTFILE
If your pattern must be checked in every single field:
awk -v MYPATTERN="INSERT_YOUR_PATTERN" '$0 ~ MYPATTERN {
for (i=1;i<=NF;i++) {
if ($i ~ MYPATTERN) { print "Field " i " in " NR " row matches: " MYPATTERN }
}
}' INPUTFILE
Modify any of the above to your taste.

The fields in awk are represented by $1, $2, etc:
$ echo this is a string | awk '{ print $2 }'
is
$0 is the whole line, $1 is the first field, $2 is the next field ( or blank ),
$NF is the last field, $( NF - 1 ) is the 2nd to last field, etc.
EDIT (in response to comment).
You could try:
awk '/crazy/{ print substr( $0, match( $0, "crazy" ), RLENGTH )}'
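The field variables and the match()/substr() approach above can be tried together in a minimal sketch. The sample sentence and the function name extract_word are made up for illustration, and only POSIX awk features are assumed.

```shell
# Hypothetical sample line, used only for illustration.
line='this is a crazy string'

# First field, last field, second-to-last field, one per output line.
printf '%s\n' "$line" | awk '{ print $1; print $NF; print $(NF - 1) }'

# Print only the matched text: match() sets RSTART and RLENGTH,
# and substr() cuts that span out of the record.
# extract_word is a hypothetical helper name.
extract_word() {
    awk -v pat="$1" 'match($0, pat) { print substr($0, RSTART, RLENGTH) }'
}

printf '%s\n' "$line" | extract_word 'cr[a-z]+'
```

Lines that do not match the pattern print nothing, because match() returns 0 for them.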

I know you can do this with awk, but an alternative would be:
sed -nr "s/.*(PATTERN_TO_MATCH).*/\1/p" file
or you can use grep -o
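For comparison, a rough awk approximation of what grep -o does (print each match on its own line) might look like the sketch below. The helper name print_matches and the sample text are assumptions for illustration, and the pattern is assumed not to be anchored with ^.

```shell
# Print every occurrence of a regex on its own line, roughly like grep -o.
# print_matches is a hypothetical helper name.
print_matches() {
    awk -v pat="$1" '{
        rest = $0
        # match() finds the leftmost match; consume the line piece by piece.
        while (match(rest, pat)) {
            if (RLENGTH == 0) break      # guard against empty matches looping forever
            print substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)
        }
    }'
}

printf 'word1 foo word2 bar word1\n' | print_matches 'word[0-9]'
```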

Something like this perhaps:
awk '{split("bla1 bla2 bla3",a," "); print a[1], a[2], a[3]}'


awk to add prefix if not present in field

I am trying to add a prefix to a field in awk if it is not already present. That is if chr isn't present before the number it is inserted. However, if it is there it is skipped.
The first awk adds the prefix to each $2 even if it is already present, and the second awk does skip the $2 with chr in them, but does not add chr to the $2 without it. Thank you :).
file
ASPA,17:3483575-3483585
ATM,11:108289609-108289613
ATP7B,13:51937469-51937480
ATR,chr3:142562768-142562773
BAG3,chr10:119670120-119670123
desired
ASPA,chr17:3483575-3483585
ATM,chr11:108289609-108289613
ATP7B,chr13:51937469-51937480
ATR,chr3:142562768-142562773
BAG3,chr10:119670120-119670123
awk
awk -F, '{$2="chr"$2; print}' file
awk 2
awk -F, '$2 !~/chr/{gsub("chr","chr",$2)}1' file
You can use:
awk 'BEGIN {FS=OFS=","} $2 !~ /^chr/ {$2="chr" $2} 1' file
ASPA,chr17:3483575-3483585
ATM,chr11:108289609-108289613
ATP7B,chr13:51937469-51937480
ATR,chr3:142562768-142562773
BAG3,chr10:119670120-119670123
Or without using any regex:
awk 'BEGIN {FS=OFS=","} index($2 , "chr") != 1 {$2="chr" $2} 1' file
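Both commands above set OFS as well as FS. The reason is that assigning to a field makes awk rebuild the record, joining the fields with OFS, which defaults to a single space. A small sketch using two rows from the question shows the difference:

```shell
# Two sample rows taken from the question.
sample='ASPA,17:3483575-3483585
ATR,chr3:142562768-142562773'

# Only FS is set: modified records are rejoined with the default OFS (a space).
printf '%s\n' "$sample" | awk -F, '$2 !~ /^chr/ { $2 = "chr" $2 } 1'

# FS and OFS are both a comma: the separator survives the rebuild.
printf '%s\n' "$sample" | awk 'BEGIN { FS = OFS = "," } $2 !~ /^chr/ { $2 = "chr" $2 } 1'
```

In the first command only the modified row loses its comma; rows that already start with chr are never reassigned, so they pass through unchanged.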
Another solution that might be shortest of all:
awk '{sub(/,(chr)?/, ",chr")} 1' file
1st solution: With your shown samples, please try following awk code.
awk '
BEGIN{FS=OFS=":"}
{
split($1,arr,",")
if(int(arr[2]) || arr[2]==0){
$1=arr[1] ",chr" arr[2]
}
}
1
' Input_file
2nd solution: With GNU awk using its match function which captures values into an array from capturing groups try following code.
awk '
match($0,/^([^,]*,)([^:]*)(:.*)/,arr){
if(int(arr[2]) || arr[2]==0){
arr[2]="chr" arr[2]
}
print arr[1] arr[2] arr[3]
}
' Input_file
3rd solution(Bonus one): Just in case your 2nd field is having Negative values(integers) and you want to change it Eg: from -11 to -chr11 then you can try following GNU awk code.
awk '
match($0,/^([^,]*,)(-)?([^:]*)(:.*)/,arr){
if(int(arr[3]) || arr[3]==0){
if(arr[2]=="-"){
arr[3]="-chr" arr[3]
}
else{
arr[3]="chr" arr[3]
}
$0=arr[1] arr[3] arr[4]
}
print
}
' Input_file
mawk NF=NF FS=',(chr)?' OFS=',chr'
ASPA,chr17:3483575-3483585
ATM,chr11:108289609-108289613
ATP7B,chr13:51937469-51937480
ATR,chr3:142562768-142562773
BAG3,chr10:119670120-119670123

How to fetch a particular string using a sed command

I have an input string like below:
VAL:1|b:2|c:3|VAL:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|VAl:1234|name:mnp|VAL:91987654321
Like this, there are more than 1000 rows.
I want to fetch the value of the first parameter, i.e., the a field and d field, but for the d field I want only har:919876543210#abc.com.
I tried like this:
cat $filename | grep -v Orig |sed -e 's/['a:','d:']//g' |awk -F'|' -v OFS=',' '{print $1 "," $4}' >> $NGW_DATA_FILE
The output I got is below:
1,<har919876543210#abc.com>; tag=vy6r5BpcvQ
I want it like this,
1,har:919876543210#abc.com
Where did I make the mistake and how do I solve it?
EDIT: As per OP's change of Input_file and OP's comments, adding following now.
awk '
BEGIN{ FS="|"; OFS="," }
{
sub(/[^:]*:/,"",$1)
gsub(/^[^<]*|; .*/,"",$4)
gsub(/^<|>$/,"",$4)
print $1,$4
}' Input_file
With shown samples, could you please try following, written and tested with shown samples in GNU awk.
awk '
BEGIN{
FS="|"
OFS=","
}
{
val=""
for(i=1;i<=NF;i++){
split($i,arr,":")
if(arr[1]=="a" || arr[1]=="d"){
gsub(/^[^:]*:|; .*/,"",$i)
gsub(/^<|>$/,"",$i)
val=(val?val OFS:"")$i
}
}
print val
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS="|" ##Setting FS as pipe here.
OFS="," ##Setting OFS as comma here.
}
{
val="" ##Nullify val here(to avoid conflicts of its value later).
for(i=1;i<=NF;i++){ ##Traversing through all fields here
split($i,arr,":") ##Splitting current field into arr with delimiter by :
if(arr[1]=="a" || arr[1]=="d"){ ##Checking condition if first element of arr is either a OR d
gsub(/^[^:]*:|; .*/,"",$i) ##Globally substituting from starting till 1st occurrence of colon OR from semi colon to everything with NULL in $i.
gsub(/^<|>$/,"",$i) ##Removing the leading < and the trailing > from $i.
val=(val?val OFS:"")$i ##Creating variable val which has current field value and keep adding in it.
}
}
print val ##printing val here.
}
' Input_file ##Mentioning Input_file name here.
You may also try this AWK script:
cat file
VAL:1|b:2|c:3|VAL:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|VAl:1234|name:mnp|VAL:91987654321
awk -F '[|;]' '{
s=""
for (i=1; i<=NF; ++i)
if ($i ~ /^VAL:/) {
gsub(/^[^:]+:|[<>]*/, "", $i)
s = (s == "" ? "" : s "," ) $i
}
print s
}' file
1,har:919876543210#abc.com
You can do the same thing with sed rather easily using Extended Regex, two capture groups and two back-references, e.g.
sed -E 's/^[^:]*:(\w+)[^<]*[<]([^>]+).*$/\1,\2/'
Explanation
's/find/replace/' standard substitution, where the find is;
^[^:]*: from the beginning skip through the first ':', then
(\w+) capture one or more word characters ([a-zA-Z0-9_]), then
[^<]*[<] consume zero or more characters not a '<', then the '<', then
([^>]+) capture everything not a '>', and
.*$ discard all remaining chars in line, then the replace is
\1,\2 reinsert the captured groups separated by a comma.
Example Use/Output
$ echo 'a:1|b:2|c:3|d:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|' |
sed -E 's/^[^:]*:(\w+)[^<]*[<]([^>]+).*$/\1,\2/'
1,har:919876543210#abc.com
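Since \w is not part of POSIX regular expressions, a variant that spells the class out as [[:alnum:]_] may be more portable. This is a sketch under that assumption, run against the input line from the question; the variable name input is only for illustration.

```shell
# Input line from the question, stored in a hypothetical variable.
input='VAL:1|b:2|c:3|VAL:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|VAl:1234|name:mnp|VAL:91987654321'

# Same two capture groups as above, with [[:alnum:]_] in place of \w.
printf '%s\n' "$input" |
    sed -E 's/^[^:]*:([[:alnum:]_]+)[^<]*<([^>]+).*$/\1,\2/'
```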

Merge lines based on first column without delimiter

I need to merge all the lines that have the same value on the first column.
The input file is the following:
34600000031|(1|1|0|1|1|20190114180000|20191027185959)
34600000031|(2|2|0|2|2|20190114180000|20191027185959)
34600000031|(3|3|0|3|3|20190114180000|20191027185959)
34600000031|(4|4|0|4|4|20190114180000|20191027185959)
34600000015|(1|1|100|1|8|20190114180000|20191027185959)
34600000015|(2|2|100|2|9|20190114180000|20191027185959)
34600000015|(3|3|100|3|10|20190114180000|20191027185959)
34600000015|(4|4|100|4|11|20190114180000|20191027185959)
I was able to partially achieve it using the following:
awk -F'|' '$1!=p{if(p)print s; p=$1; s=$0; next}{sub(p,x); s=s $0} END{print s}' INPUT
The output is the following:
34600000031|(1|1|0|1|1|20190114180000|20191027185959)|(2|2|0|2|2|20190114180000|20191027185959)|(3|3|0|3|3|20190114180000|20191027185959)|(4|4|0|4|4|20190114180000|20191027185959)
34600000015|(1|1|100|1|8|20190114180000|20191027185959)|(2|2|100|2|9|20190114180000|20191027185959)|(3|3|100|3|10|20190114180000|20191027185959)|(4|4|100|4|11|20190114180000|20191027185959)
What I need (and i cannot find how) is the following:
34600000031|(1|1|0|1|1|20190114180000|20191027185959)(2|2|0|2|2|20190114180000|20191027185959)(3|3|0|3|3|20190114180000|20191027185959)(4|4|0|4|4|20190114180000|20191027185959)
34600000015|(1|1|100|1|8|20190114180000|20191027185959)(2|2|100|2|9|20190114180000|20191027185959)(3|3|100|3|10|20190114180000|20191027185959)(4|4|100|4|11|20190114180000|20191027185959)
I could do a sed after the initial awk but I don't believe that this is the proper way to do it.
You need to substitute the separator in the values too. Your fixed awk would look like this:
awk -F'|' '$1!=p{if(p)print s; p=$1; s=$0; next}{sub(p "\\|",x); s=s $0} END{print s}'
but it's also good to match beginning of the string:
awk -F'|' '$1!=p{if(p)print s; p=$1; s=$0; next}{sub("^" p "\\|",x); s=s $0} END{print s}'
I would do it somewhat simpler, which uses more memory (as it stores everything in an array) but doesn't need the file to be sorted:
awk -F'|' '{ k=$1; sub("^" $1 "\\|", ""); a[k] = a[k] $0 } END{ for (i in a) print i "|" a[i] }'
For each line, remember the first field, strip the first field and its trailing | from the line, then append the remainder to an array element indexed by that field. At the end, print each key, the separator, and the accumulated value. The order in which for (i in a) visits the keys is unspecified.
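If the first-seen order of the keys matters, the array approach can be extended with a second array that records that order. The following is a minimal sketch, not the answerer's code: merge_by_key is a hypothetical name, the sample rows are shortened versions of the question's data, and substr() is used instead of a dynamic regex so the key is never interpreted as a pattern.

```shell
# merge_by_key is a hypothetical helper; it reads files given as arguments or stdin.
merge_by_key() {
    awk -F'|' '{
        key = $1
        # Everything after the first "|" is the value part.
        val = substr($0, length(key) + 2)
        if (!(key in a)) order[++n] = key   # remember first-seen order
        a[key] = a[key] val
    }
    END {
        for (i = 1; i <= n; i++) print order[i] "|" a[order[i]]
    }' "$@"
}

# Shortened, unsorted sample rows for illustration.
printf '%s\n' '34600000031|(1|1)' '34600000015|(1|8)' '34600000031|(2|2)' | merge_by_key
```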
$ awk -F'|' '
{
curr = $1
sub(/^[^|]+\|/,"")
printf "%s%s", (curr==prev ? "" : ors curr FS), $0
ors = ORS
prev = curr
}
END { print "" }
' file
34600000031|(1|1|0|1|1|20190114180000|20191027185959)(2|2|0|2|2|20190114180000|20191027185959)(3|3|0|3|3|20190114180000|20191027185959)(4|4|0|4|4|20190114180000|20191027185959)
34600000015|(1|1|100|1|8|20190114180000|20191027185959)(2|2|100|2|9|20190114180000|20191027185959)(3|3|100|3|10|20190114180000|20191027185959)(4|4|100|4|11|20190114180000|20191027185959)

awk - remove new line after printing all columns

I am running the following awk script:
awk 'BEGIN { FS="|" ; OFS="|" }; { printf $0, $1 "_" $2 }' .someFile
Unfortunately, the concatenation of fields 1 and 2 is printed on a new line; it looks like the last field contains a newline character.
How can I trim it?
If you want to use printf (which may have been accidental), I think you can use this:
awk 'BEGIN { FS = OFS = "|" } { printf "%s%s%s_%s\n", $0, OFS, $1, $2 }' .someFile
printf should always be used with a format string. printf doesn't add the Output Record Separator to the end of what it prints, so you have to do that yourself using \n in the format string or by adding %s and passing ORS as the last argument to printf.
In this case, I think you can just use print though:
awk 'BEGIN { FS = OFS = "|" } { print $0, $1 "_" $2 }' .someFile
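The difference between print and printf described above can be checked with a small sketch. The sample record is made up for illustration; all three commands are expected to produce the same line.

```shell
# One sample record; the values are made up for illustration.
record='a|b|c'

# print appends ORS (a newline by default) automatically.
printf '%s\n' "$record" | awk 'BEGIN { FS = OFS = "|" } { print $0, $1 "_" $2 }'

# printf does not, so the newline is written explicitly in the format string...
printf '%s\n' "$record" | awk 'BEGIN { FS = OFS = "|" } { printf "%s%s%s_%s\n", $0, OFS, $1, $2 }'

# ...or ORS is passed as the final argument.
printf '%s\n' "$record" | awk 'BEGIN { FS = OFS = "|" } { printf "%s%s%s_%s%s", $0, OFS, $1, $2, ORS }'
```

Keeping the data out of the format string also means a literal % in the input is printed as-is instead of being treated as a format specifier.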

print unique lines based on field

I would like to print unique lines based on the first field, keeping the first occurrence of each line and removing later duplicates.
Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
10,09-10-2014,def
40,06-10-2014,ghi
10,15-10-2014,abc
Desired Output:
10,15-10-2014,abc
20,12-10-2014,bcd
40,06-10-2014,ghi
I have tried the command below, but it is incomplete:
awk 'BEGIN { FS = OFS = "," } { !seen[$1]++ } END { for ( i in seen) print $0}' Input.csv
Looking for your suggestions ...
You put your test for "seen" in the action part of the script instead of the condition part. Change it to:
awk -F, '!seen[$1]++' Input.csv
Yes, that's the whole script:
$ cat Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
10,09-10-2014,def
40,06-10-2014,ghi
10,15-10-2014,abc
$
$ awk -F, '!seen[$1]++' Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
40,06-10-2014,ghi
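To see why the one-liner works, it may help to write the same logic out in long form. This is an illustrative sketch: dedupe_first_field is a hypothetical name and the three rows are taken from the sample input.

```shell
# Long form of '!seen[$1]++'; dedupe_first_field is a hypothetical helper.
dedupe_first_field() {
    awk -F, '{
        # seen[$1] is unset (compares equal to 0) the first time a key appears.
        if (seen[$1] == 0) print
        # Count the key so later lines with the same first field are skipped.
        seen[$1]++
    }' "$@"
}

printf '%s\n' '10,15-10-2014,abc' '20,12-10-2014,bcd' '10,09-10-2014,def' | dedupe_first_field
```

In the short version the post-increment returns the old count, the ! negates it, and a true condition with no action prints the line.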
This should give you what you want:
awk -F, '{ if (!($1 in a)) a[$1] = $0 } END { for (i in a) print a[i] }' Input.csv
Note that for (i in a) visits the keys in an unspecified order, so the lines may not come out in input order.