awk to add prefix if not present in field - awk

I am trying to add a prefix to a field in awk if it is not already present. That is, if chr isn't present before the number, it should be inserted; if it is already there, the field should be left alone.
The first awk adds the prefix to every $2 even if it is already present. The second awk does skip the $2 values that contain chr, but does not add chr to the $2 values without it. Thank you :).
file
ASPA,17:3483575-3483585
ATM,11:108289609-108289613
ATP7B,13:51937469-51937480
ATR,chr3:142562768-142562773
BAG3,chr10:119670120-119670123
desired
ASPA,chr17:3483575-3483585
ATM,chr11:108289609-108289613
ATP7B,chr13:51937469-51937480
ATR,chr3:142562768-142562773
BAG3,chr10:119670120-119670123
awk
awk -F, '{$2="chr"$2; print}' file
awk 2
awk -F, '$2 !~/chr/{gsub("chr","chr",$2)}1' file

You can use:
awk 'BEGIN {FS=OFS=","} $2 !~ /^chr/ {$2="chr" $2} 1' file
ASPA,chr17:3483575-3483585
ATM,chr11:108289609-108289613
ATP7B,chr13:51937469-51937480
ATR,chr3:142562768-142562773
BAG3,chr10:119670120-119670123
Or without using any regex:
awk 'BEGIN {FS=OFS=","} index($2 , "chr") != 1 {$2="chr" $2} 1' file
Another solution that might be shortest of all:
awk '{sub(/,(chr)?/, ",chr")} 1' file
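As a small sketch of the anchored test in use (the helper name add_chr is made up for illustration, and the two sample lines are taken from the question):

```shell
# add_chr: hypothetical helper; reads comma-separated lines on stdin and
# prefixes field 2 with "chr" unless the field already starts with it.
add_chr() {
  awk 'BEGIN { FS = OFS = "," } $2 !~ /^chr/ { $2 = "chr" $2 } 1'
}

printf '%s\n' 'ASPA,17:3483575-3483585' 'ATR,chr3:142562768-142562773' | add_chr
# ASPA,chr17:3483575-3483585
# ATR,chr3:142562768-142562773
```

This also shows why the attempts in the question fall short. The first one has no condition and never sets OFS, so every $2 is prefixed and the rebuilt line is joined with a space instead of a comma. The second one only runs on fields without chr, where gsub("chr","chr",$2) finds nothing to replace.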

1st solution: With your shown samples, please try following awk code.
awk '
BEGIN{FS=OFS=":"}
{
split($1,arr,",")
if(int(arr[2]) || arr[2]==0){
$1=arr[1] ",chr" arr[2]
}
}
1
' Input_file
2nd solution: With GNU awk, whose match function can store the capturing groups in an array, try the following code.
awk '
match($0,/^([^,]*,)([^:]*)(:.*)/,arr){
if(int(arr[2]) || arr[2]==0){
arr[2]="chr" arr[2]
}
print arr[1] arr[2] arr[3]
}
' Input_file
3rd solution (bonus): In case your 2nd field can hold negative integers and you want to change, e.g., -11 to -chr11, you can try the following GNU awk code.
awk '
match($0,/^([^,]*,)(-)?([^:]*)(:.*)/,arr){
if(int(arr[3]) || arr[3]==0){
if(arr[2]=="-"){
arr[3]="-chr" arr[3]
}
else{
arr[3]="chr" arr[3]
}
$0=arr[1] arr[3] arr[4]
}
print
}
' Input_file
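Two caveats apply to the three solutions above: the int() test only recognizes numeric names, so a value such as X would stay unprefixed, and the three-argument match() is specific to GNU awk. A hedged alternative sketch for POSIX awk follows, assuming non-numeric names should be prefixed as well (the question does not say); the helper name prefix_chr_any and the GENE1 line are made up for illustration:

```shell
# prefix_chr_any: hypothetical helper; works in any POSIX awk.
# It checks the first three characters of field 2 instead of testing for a number.
prefix_chr_any() {
  awk 'BEGIN { FS = OFS = "," }
       substr($2, 1, 3) != "chr" { $2 = "chr" $2 }   # prefix only when missing
       1'
}

printf '%s\n' 'ASPA,17:3483575-3483585' 'ATR,chr3:142562768-142562773' 'GENE1,X:100-200' |
  prefix_chr_any
# ASPA,chr17:3483575-3483585
# ATR,chr3:142562768-142562773
# GENE1,chrX:100-200
```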

mawk NF=NF FS=',(chr)?' OFS=',chr'
ASPA,chr17:3483575-3483585
ATM,chr11:108289609-108289613
ATP7B,chr13:51937469-51937480
ATR,chr3:142562768-142562773
BAG3,chr10:119670120-119670123
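The one-liner above works because the field separator itself swallows an optional chr: FS=',(chr)?' splits on a comma plus any chr that follows it, and NF=NF forces the record to be rebuilt with OFS=',chr'. A more explicit sketch of the same idea, which should run in any POSIX awk (the function name is an assumption for illustration):

```shell
# readd_chr: hypothetical helper spelling out the FS/OFS trick.
readd_chr() {
  awk 'BEGIN { FS = ",(chr)?"; OFS = ",chr" }
       { $1 = $1 }   # assigning to a field rebuilds $0 using OFS
       1'
}

printf '%s\n' 'ASPA,17:3483575-3483585' 'ATR,chr3:142562768-142562773' | readd_chr
# ASPA,chr17:3483575-3483585
# ATR,chr3:142562768-142562773
```

Note that every comma in the line becomes ",chr", so this approach only suits input with exactly two comma-separated columns.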

Related

Grouping duplicated fields with awk

I have the following file:
ID|2018-04-29
ID|2018-04-29
ID|2018-04-29
ID1|2018-06-26
ID1|2018-06-26
ID1|2018-08-07
ID1|2018-08-22
and using awk, I want to add $3 that groups the duplicated IDs based on $1 and $2 so that the output would be
ID|2018-04-29|group1
ID|2018-04-29|group1
ID|2018-04-29|group1
ID1|2018-06-26|group2
ID1|2018-06-26|group2
ID1|2018-08-07|group3
ID1|2018-08-22|group4
I tried the following code but it does not give me the desired output. Also, I am not sure if I can apply it to a column with date in it.
awk -F"|" '{print $0,"group"++seen[$1,$3]}' OFS="|"
Any hints on how to achieve it using awk (one-liner, if possible) would be highly appreciated.
With your shown samples, please try following awk code.
awk -v OFS="|" '!arr[$0]++{count++} {print $0,"group"count}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
OFS="|" ##Setting OFS to | here.
}
!arr[$0]++{ ##Checking if current line is NOT present in array then do following.
count++ ##Increasing count with 1 here.
}
{
print $0,"group"count ##Printing current line with group and count value here.
}
' Input_file ##Mentioning Input_file name here.
Quoting the question: "and using awk, I want to add $3 that groups the duplicated IDs based on $1 and $2 so that the output would be"
Using $1 and $2 as the key:
If the input file is sorted:
$ awk 'BEGIN{FS=OFS="|"}{print $0, "group" (!a[$1,$2]++?++c:c)}' file
ID|2018-04-29|group1
ID|2018-04-29|group1
ID|2018-04-29|group1
ID1|2018-06-26|group2
ID1|2018-06-26|group2
ID1|2018-08-07|group3
ID1|2018-08-22|group4
If the file is not sorted:
$ awk 'BEGIN{FS=OFS="|"}{k=$1 SUBSEP $2}!(k in a){a[k]=++c}{print $0, "group" a[k]}' file
ID|2018-04-29|group1
ID|2018-04-29|group1
ID|2018-04-29|group1
ID1|2018-06-26|group2
ID1|2018-06-26|group2
ID1|2018-08-07|group3
ID1|2018-08-22|group4
A more readable version:
awk 'BEGIN{
FS=OFS="|"
}
{
k=$1 SUBSEP $2
}
!(k in a){
a[k]=++c
}
{
print $0, "group" a[k]
}' file
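To make the difference between the sorted and unsorted variants concrete, the sketch below feeds a reordered version of the sample (the ordering is made up for illustration) to the array-based variant, wrapped in a hypothetical function. Because each key's number is stored in a[k] the first time the key is seen, a repeated key gets its original group back even when other keys appear in between:

```shell
# group_by_key: hypothetical wrapper around the unsorted-input variant.
group_by_key() {
  awk 'BEGIN { FS = OFS = "|" }
       { k = $1 SUBSEP $2 }          # composite key from the first two fields
       !(k in a) { a[k] = ++c }      # first sighting: assign the next group number
       { print $0, "group" a[k] }'
}

printf '%s\n' 'ID|2018-04-29' 'ID1|2018-06-26' 'ID|2018-04-29' | group_by_key
# ID|2018-04-29|group1
# ID1|2018-06-26|group2
# ID|2018-04-29|group1
```

With the same input, the sorted-only one-liner would print group2 for the third line, since it reuses the current counter for any key it has already seen.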
BEGIN {OFS = FS = "|"}
{ if ($0 != prev) { #new item
prev = $0
print $1, $2, "group" ++g
}
else {
print $1, $2, "group" g
}
}
Note that the list has to be sorted (from your example, I assume it is).
This is my first time posting an answer here. I hope the code is readable and that it helps.

How to fetch a particular string using a sed command

I have an input string like below:
VAL:1|b:2|c:3|VAL:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|VAl:1234|name:mnp|VAL:91987654321
Like this, there are more than 1000 rows.
I want to fetch the value of the first parameter, i.e., the a field and d field, but for the d field I want only har:919876543210#abc.com.
I tried like this:
cat $filename | grep -v Orig |sed -e 's/['a:','d:']//g' |awk -F'|' -v OFS=',' '{print $1 "," $4}' >> $NGW_DATA_FILE
The output I got is below:
1,<har919876543210#abc.com>; tag=vy6r5BpcvQ
I want it like this,
1,har:919876543210#abc.com
Where did I make the mistake and how do I solve it?
EDIT: As per OP's change of Input_file and OP's comments, adding following now.
awk '
BEGIN{ FS="|"; OFS="," }
{
sub(/[^:]*:/,"",$1)
gsub(/^[^<]*|; .*/,"",$4)
gsub(/^<|>$/,"",$4)
print $1,$4
}' Input_file
With the shown samples, please try the following (written and tested in GNU awk).
awk '
BEGIN{
FS="|"
OFS=","
}
{
val=""
for(i=1;i<=NF;i++){
split($i,arr,":")
if(arr[1]=="a" || arr[1]=="d"){
gsub(/^[^:]*:|; .*/,"",$i)
gsub(/^<|>$/,"",$i)
val=(val?val OFS:"")$i
}
}
print val
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS="|" ##Setting FS as pipe here.
OFS="," ##Setting OFS as comma here.
}
{
val="" ##Nullify val here(to avoid conflicts of its value later).
for(i=1;i<=NF;i++){ ##Traversing through all fields here
split($i,arr,":") ##Splitting current field into arr with delimiter by :
if(arr[1]=="a" || arr[1]=="d"){ ##Checking condition if first element of arr is either a OR d
gsub(/^[^:]*:|; .*/,"",$i) ##Globally substituting from starting till 1st occurrence of colon OR from semi colon to everything with NULL in $i.
gsub(/^<|>$/,"",$i) ##Globally substituting a leading < OR a trailing > with NULL in $i.
val=(val?val OFS:"")$i ##Creating variable val which has current field value and keep adding in it.
}
}
print val ##printing val here.
}
' Input_file ##Mentioning Input_file name here.
You may also try this AWK script:
cat file
VAL:1|b:2|c:3|VAL:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|VAl:1234|name:mnp|VAL:91987654321
awk -F '[|;]' '{
s=""
for (i=1; i<=NF; ++i)
if ($i ~ /^VAL:/) {
gsub(/^[^:]+:|[<>]*/, "", $i)
s = (s == "" ? "" : s "," ) $i
}
print s
}' file
1,har:919876543210#abc.com,91987654321
You can do the same thing with sed rather easily using Extended Regex, two capture groups and two back-references, e.g.
sed -E 's/^[^:]*:(\w+)[^<]*[<]([^>]+).*$/\1,\2/'
Explanation
's/find/replace/' standard substitution, where the find is;
^[^:]*: from the beginning skip through the first ':', then
(\w+) capture one or more word characters ([a-zA-Z0-9_]), then
[^<]*[<] consume zero or more characters not a '<', then the '<', then
([^>]+) capture everything not a '>', and
.*$ discard all remaining chars in line, then the replace is
\1,\2 reinsert the captured groups separated by a comma.
Example Use/Output
$ echo 'a:1|b:2|c:3|d:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|' |
sed -E 's/^[^:]*:(\w+)[^<]*[<]([^>]+).*$/\1,\2/'
1,har:919876543210#abc.com
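\w is a GNU extension to regular expressions. A hedged, more portable sketch of the same substitution swaps it for the POSIX bracket expression [[:alnum:]_], assuming a sed that accepts -E (GNU and BSD sed both do); the function name is made up for illustration:

```shell
# extract_fields: hypothetical wrapper; same two capture groups as above,
# with \w+ replaced by [[:alnum:]_]+.
extract_fields() {
  sed -E 's/^[^:]*:([[:alnum:]_]+)[^<]*<([^>]+).*$/\1,\2/'
}

echo 'a:1|b:2|c:3|d:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|' | extract_fields
# 1,har:919876543210#abc.com
```

Lines that contain no '<' do not match the pattern and are printed unchanged.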

awk search pattern in a specific field and replace its content

I need to find password fields that are empty or contain only a space or a tab, and replace them with x (in the /etc/passwd file).
I found this awk syntax, which shows the users whose second field (using : as the delimiter) is empty or contains a space or a tab:
awk -F":" '($2 == "" || $2 == " " || $2 == "\t") {print $0}' $file
and the result is as follows:
user1::53556:100::/home/user1:/bin/bash
user2: :53557:100::/home/user2:/bin/bash
user3: :53558:100::/home/user3:/bin/bash
How can I tell awk to replace this 2nd field (empty, or containing a space or a tab) with another character (for example x)?
Could you please try following.
awk 'BEGIN{FS=OFS=":"} {$2=$2=="" || $2~/^[[:space:]]+$/?"X":$2} 1' Input_file
Explanation: Adding explanation of above code.
awk ' ##Starting awk program here.
BEGIN{ ##Starting BEGIN section here which will be executed before Input_file is being read.
FS=OFS=":" ##Setting FS and OFS as colon here for all lines of Input_file.
} ##Closing BEGIN section block here.
{
$2=$2=="" || $2~/^[[:space:]]+$/?"X":$2 ##Checking condition: if $2 (2nd field) of the current line is either empty or consists only of whitespace then set its value to X, otherwise keep $2 as it is.
}
1 ##mentioning 1 will print edited/non-edited current line.
' Input_file ##Mentioning Input_file name here.
EDIT: Per the OP, the last line of Input_file must NOT be touched, so adding the following solution now.
tac Input_file | awk 'BEGIN{FS=OFS=":"} FNR==1{print;next} {$2=$2=="" || $2~/^[[:space:]]+$/?"X":$2} 1' | tac
EDIT2: In case you want to do it in a single awk, try the following.
awk '
BEGIN{
FS=OFS=":"
}
prev{
num=split(prev,array,":")
array[2]=array[2]=="" || array[2]~/^[[:space:]]+$/?"X":array[2]
for(i=1;i<=num;i++){
val=(val?val OFS array[i]:array[i])
}
print val
val=""
}
{
prev=$0
}
END{
if(prev){
print prev
}
}' Input_file
In case you want to change Input_file itself, append > temp_file && mv temp_file Input_file to the above code.
$ awk 'BEGIN{FS=OFS=":"} (NF>1) && ($2~/^[[:space:]]*$/){$2="x"} 1' file
user1:x:53556:100::/home/user1:/bin/bash
user2:x:53557:100::/home/user2:/bin/bash
user3:x:53558:100::/home/user3:/bin/bash
To change the original file using GNU awk:
awk -i inplace 'BEGIN{FS=OFS=":"} (NF>1) && ($2~/^[[:space:]]*$/){$2="x"} 1' file
or with any awk:
awk 'BEGIN{FS=OFS=":"} (NF>1) && ($2~/^[[:space:]]*$/){$2="x"} 1' file > tmp && mv tmp file
The test for NF>1 ensures we only operate on lines that already have at least 2 fields and so we don't create a line like :x in the output when there's an empty line in the input file. The rest is hopefully obvious.
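A quick sketch of the NF>1 guard in action, using inline input that includes a tab-only password field and an empty line (the function name and the user4 line are assumptions for illustration):

```shell
# fix_empty_password: hypothetical wrapper around the command above.
fix_empty_password() {
  awk 'BEGIN { FS = OFS = ":" } (NF > 1) && ($2 ~ /^[[:space:]]*$/) { $2 = "x" } 1'
}

# printf expands \t and \n in its format string.
printf 'user1::53556:100::/home/user1:/bin/bash\nuser3:\t:53558:100::/home/user3:/bin/bash\n\nuser4:x:53559:100::/home/user4:/bin/bash\n' |
  fix_empty_password
# user1:x:53556:100::/home/user1:/bin/bash
# user3:x:53558:100::/home/user3:/bin/bash
#
# user4:x:53559:100::/home/user4:/bin/bash
```

The empty line has NF equal to 0, so it passes through untouched instead of turning into :x.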

How to not remove the header while executing awk

I have a file named file like this:
k_1_1
k_1_3
k_1_6
...
I have another file, file2:
0,1,2,3,...
k_1_1,17,16,15,...
k_1_2,17,89,15,...
k_1_3,10,26,45,...
k_1_4,17,16,15,...
k_1_5,10,26,45,...
k_1_6,17,16,15,...
...
I want to print the lines of file2 that match an entry in file. The desired output is:
0,1,2,3,...
k_1_1,17,16,15,...
k_1_3,10,26,45,...
k_1_6,17,16,15,...
I tried
awk 'BEGIN{FS=OFS=","}NR==FNR{a[$1];next}$1 in a {print $0}' file file2 > result
But the header line is missing from result:
k_1_1,17,16,15,...
k_1_3,10,26,45,...
k_1_6,17,16,15,...
How can I keep it? Thank you.
Always print the first line, unconditionally.
awk 'BEGIN{FS=OFS=","}
NR==FNR{a[$1];next}
FNR==1 || $1 in a' file file2 > result
Notice also how { print $0 } is not necessary because it's the default action.
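A self-contained sketch of that approach, using temporary files with an abbreviated copy of the sample data (the function name, the temp directory and the shortened rows are assumptions for illustration):

```shell
# filter_keep_header: hypothetical wrapper.
#   $1: file with one key per line
#   $2: comma-separated file whose first line is a header
filter_keep_header() {
  awk 'BEGIN { FS = OFS = "," }
       NR == FNR { a[$1]; next }   # first file: remember the keys
       FNR == 1 || $1 in a         # second file: print the header or a known key
      ' "$1" "$2"
}

tmpdir=$(mktemp -d)
printf '%s\n' k_1_1 k_1_3 > "$tmpdir/file"
printf '%s\n' '0,1,2,3' 'k_1_1,17,16,15' 'k_1_2,17,89,15' 'k_1_3,10,26,45' > "$tmpdir/file2"
filter_keep_header "$tmpdir/file" "$tmpdir/file2"
# 0,1,2,3
# k_1_1,17,16,15
# k_1_3,10,26,45
rm -rf "$tmpdir"
```

FNR restarts at 1 for each input file while NR keeps counting, so once the first file has been consumed by the NR == FNR rule, FNR == 1 can only be true for the header of the second file.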
A very ad-hoc solution to your problem could be to compose the output in a command group:
{ head -1 file2; awk 'BEGIN{FS=OFS=","}NR==FNR{a[$1];next}$1 in a {print $0}' file file2; } > result
Could you please try following.
awk -F, 'FNR==NR{a[$1]=$0;next} FNR==1 && ++count==1{print;next} a[$1]' Input_file Input_file2
OR
awk -F, 'FNR==NR{a[$1]=$0;next} FNR==1{print;next} a[$1]' Input_file Input_file2

How to print out a specific field in AWK?

A very simple question, which I found no answer to. How do I print out a specific field in awk?
awk '/word1/' will print out the whole line, when I need just word1. Or I need only a chain of patterns (word1 + word2) to be printed out from a text.
Well, if the pattern is a single word (which you want to print, and which can't contain FS, the input field separator), why not:
awk -v MYPATTERN="INSERT_YOUR_PATTERN" '$0 ~ MYPATTERN { print MYPATTERN }' INPUTFILE
If your pattern is a regex (gensub is specific to GNU awk):
awk -v MYPATTERN="INSERT_YOUR_PATTERN" '$0 ~ MYPATTERN { print gensub(".*(" MYPATTERN ").*","\\1","1",$0) }' INPUTFILE
If your pattern must be checked in every single field:
awk -v MYPATTERN="INSERT_YOUR_PATTERN" '$0 ~ MYPATTERN {
for (i=1;i<=NF;i++) {
if ($i ~ MYPATTERN) { print "Field " i " in " NR " row matches: " MYPATTERN }
}
}' INPUTFILE
Modify any of the above to your taste.
The fields in awk are represented by $1, $2, etc:
$ echo this is a string | awk '{ print $2 }'
is
$0 is the whole line, $1 is the first field, $2 is the next field ( or blank ),
$NF is the last field, $( NF - 1 ) is the 2nd to last field, etc.
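A short sketch of these field references side by side (the function name and output labels are made up for illustration):

```shell
# show_fields: hypothetical helper printing a few field references for each line.
show_fields() {
  awk '{ print "first=" $1, "last=" $NF, "second_to_last=" $(NF - 1), "count=" NF }'
}

echo 'this is a string' | show_fields
# first=this last=string second_to_last=a count=4
```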
EDIT (in response to comment).
You could try:
awk '/crazy/{ print substr( $0, match( $0, "crazy" ), RLENGTH )}'
I know you can do this with awk; an alternative would be:
sed -nr "s/.*(PATTERN_TO_MATCH).*/\1/p" file
or you can use grep -o
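For comparison, a rough awk sketch of what grep -o does, printing every match on a line rather than only the first, can be built from match(), RSTART and RLENGTH. The function name is an assumption; the pattern is passed as a string through -v, so backslashes in it would need doubling:

```shell
# print_matches: hypothetical helper; $1 is an extended regular expression.
print_matches() {
  awk -v re="$1" '{
    line = $0
    # RLENGTH > 0 guards against looping forever on an empty match
    while (match(line, re) && RLENGTH > 0) {
      print substr(line, RSTART, RLENGTH)     # the matched text
      line = substr(line, RSTART + RLENGTH)   # continue after the match
    }
  }'
}

echo 'word1 then word2 and word1 again' | print_matches 'word[0-9]'
# word1
# word2
# word1
```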
Something like this perhaps:
awk '{split("bla1 bla2 bla3",a," "); print a[1], a[2], a[3]}'