awk/sed - replace column with pattern using variables from other columns - awk

I have a tab delimited text file:
#CHROM	POS	ID	REF	ALT
1	188277	rs434	C	T
20	54183975	rs5321	CTAAA	C
and I am trying to replace the "ID" column with the pattern $CHROM_$POS_$REF_$ALT using sed or awk:
#CHROM	POS	ID	REF	ALT
1	188277	1_188277_C_T	C	T
20	54183975	20_54183975_CTAAA_C	CTAAA	C
Unfortunately, I only managed to delete the ID column with:
sed -i -r 's/\S+//3'
and none of the patterns I tried work in all cases. To be honest, I am lost in the documentation and am looking for examples that could help me solve this problem.

Using awk, you can set the value of the 3rd field concatenating field 1,2,4 and 5 with an underscore except for the first line. Using column -t to present the output as a table:
awk '
BEGIN{FS=OFS="\t"}
NR>1 {
$3 = $1"_"$2"_"$4"_"$5
}1' file | column -t
Output
#CHROM POS ID REF ALT
1 188277 1_188277_C_T C T
20 54183975 20_54183975_CTAAA_C CTAAA C
Or writing all fields, with a custom value for the 3rd field:
awk '
BEGIN{FS=OFS="\t"}
NR==1{print;next}
{print $1, $2, $1"_"$2"_"$4"_"$5, $4, $5}
' file | column -t
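If the real file carries more than one header line (for example, VCF-style meta lines that begin with #), keying on NR>1 would rewrite all but the first of them. A minimal sketch, assuming every header or comment line starts with # and that set_id is just an illustrative name:

```shell
# Hypothetical helper: rewrite column 3 as CHROM_POS_REF_ALT, leaving '#' lines untouched.
set_id() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { print; next }                          # header/comment lines pass through unchanged
       { $3 = $1 "_" $2 "_" $4 "_" $5; print }' "$@"
}
```

Called as set_id file | column -t, it should give the same table as above for the sample input.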

GNU sed solution
sed '2,$s/\(\S*\)\t\(\S*\)\t\(\S*\)\t\(\S*\)\t\(\S*\)/\1\t\2\t\1_\2_\4_\5\t\4\t\5/' file.txt
Explanation: from line 2 to the last line, capture five \t-separated columns (each holding zero or more non-whitespace characters) into groups. Then write them back joined by \t, except for the third one, which is replaced by the _-join of the 1st, 2nd, 4th and 5th columns.
(tested in sed (GNU sed) 4.2.2)
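\S and \t inside the pattern are not guaranteed by POSIX sed. For a sed that lacks them, a rough equivalent can be sketched by putting a literal tab into a shell variable. The names replace_id and tab are made up for this example, and the third column is matched but deliberately not captured:

```shell
tab=$(printf '\t')   # literal tab character, since \t is not portable in sed

# Hypothetical helper: replace column 3 with col1_col2_col4_col5 from line 2 onward.
replace_id() {
  sed "2,\$s/^\([^${tab}]*\)${tab}\([^${tab}]*\)${tab}[^${tab}]*${tab}\([^${tab}]*\)${tab}\([^${tab}]*\)/\1${tab}\2${tab}\1_\2_\3_\4${tab}\3${tab}\4/" "$@"
}
```

Here groups 3 and 4 hold REF and ALT, because the old ID column is skipped rather than grouped.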

awk -v OFS='\t' 'NR==1 {print $0}; NR>1 {print $1, $2, $1"_"$2"_"$4"_"$5, $4, $5}' inputfile.txt

Related

Countif like function in AWK with field headers

I am looking for a way of counting the number of times a value in a field appears in a range of fields in a csv file much the same as countif in excel although I would like to use an awk command if possible.
So column 6 holds the range of values, and column 7 should hold the number of times that value appears in column 6, as shown below:
>awk -F, '{print $0}' file3
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ
>awk -F, '{print $6}' file3
test
SBCD
AWER
ASDF
ASDQ
ASDQ
What i want is:
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2
#adds field name count that I want:
awk -F, -v OFS=, 'NR==1{ print $0, "count"}
NR>1{ print $0}' file3
How do I get the output I want?
I have tried this from previous/similar question but no joy,
>awk -F, 'NR>1{c[$6]++;l[NR>1]=$0}END{for(i=0;i++<NR;){split(l[i],s,",");print l[i]","c[s[1]]}}' file3
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,
,
,
,
,
,
very similar question to this one
similar python related Q, for my ref
I would use GNU AWK for this task in the following way. Let the content of file.txt be
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ
then
awk 'BEGIN{FS=OFS=","}NR==1{print $0,"count";next}FNR==NR{arr[$6]+=1;next}FNR>1{print $0,arr[$6]}' file.txt file.txt
gives output
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2
Explanation: this is a two-pass approach, hence file.txt appears twice. I tell GNU AWK that , is both the field separator (FS) and the output field separator (OFS). For the first line (the header) I print it followed by count and instruct GNU AWK to go to the next line, so nothing else is done with the 1st line. During the first pass, i.e. while the global line number (NR) equals the line number within the current file (FNR), I count the occurrences of each value in the 6th field, store the counts in array arr, and again go to the next line, so nothing else is done in this pass. During the second pass, for all lines after the 1st (FNR>1), I print the whole line ($0) followed by the corresponding value from array arr.
(tested in GNU Awk 5.0.1)
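The same two-pass idea can be wrapped so that the counted column becomes a parameter. This is only a sketch: add_count and col are hypothetical names, and the input must be a regular file because it is read twice.

```shell
# Hypothetical helper. usage: add_count FILE COLUMN
add_count() {
  awk -v col="$2" 'BEGIN { FS = OFS = "," }
    NR == FNR { if (FNR > 1) cnt[$col]++; next }   # pass 1: tally values, skipping the header
    FNR == 1  { print $0, "count"; next }          # pass 2: extend the header
    { print $0, cnt[$col] }                        # pass 2: append the tally to each row
  ' "$1" "$1"
}
```

For the sample data, add_count file.txt 6 should reproduce the output shown above.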
You did not copy the code from the linked question properly. Why change l[NR] to l[NR>1] at all? On the other hand, you should change s[1] to s[6] since it's the sixth field that has the key you're counting. Because the header line is never stored in l[], the loop should also start at line 2, otherwise a lone "," is printed first:
awk -F, 'NR>1{c[$6]++;l[NR]=$0}END{for(i=1;i++<NR;){split(l[i],s,",");print l[i]","c[s[6]]}}' file3
You can also output the header with the new field name:
awk -F, -v OFS=, 'NR==1{print $0,"count"}NR>1{c[$6]++;l[NR]=$0}END{for(i=1;i++<NR;){split(l[i],s,",");print l[i],c[s[6]]}}' file3
One awk idea:
awk '
BEGIN { FS=OFS="," }        # define input/output field delimiters as comma
      { lines[NR]=$0
        if (NR==1) next
        col6[NR]=$6         # copy field 6 so we do not have to parse the contents of lines[] in the END block
        cnt[$6]++
      }
END   { for (i=1;i<=NR;i++)
            print lines[i], (i==1 ? "count" : cnt[col6[i]] )
      }
' file3
This generates:
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2

awk conditional statement based on a value between colon

I was just introduced to awk and I'm trying to retrieve rows from my file based on the value on column 10.
I need to filter the data on the third value of column 10 (the last column) when that column is split on ":".
Here is an example data in column 10. 0/1:1,9:10:15:337,0,15.
I was able to extract the third value using this command awk '{print $10}' file.txt | awk -F ":" '/1/ {print $3}'
This returns the value 10 but how can I return other rows (not just the value in column 10) if this third value is less than or greater than a specific number?
I tried this awk '{if($10 -F ":" "/1/ ($3<10))" print $0;}' file.txt but it returns a syntax error.
Thanks!
Your code:
awk '{print $10}' file.txt | awk -F ":" '/1/ {print $3}'
should be just 1 awk script:
awk '$10 ~ /1/ { split($10,f,/:/); print f[3] }' file.txt
but I'm not sure that code is doing what you think it does. If you want to print the 3rd value of all $10s that contain :s, as it sounds like from your text, that'd be:
awk 'split($10,f,/:/) > 1 { print f[3] }' file.txt
and to print the rows where that value is less than 7 would be:
awk '(split($10,f,/:/) > 1) && (f[3] < 7)' file.txt
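Since the question asks about values below or above "a specific number", the threshold can be passed in as a variable instead of being hard-coded. A minimal sketch for the greater-than case, where rows_above and t are invented names and column 10 is assumed to hold the ":"-separated values:

```shell
# Hypothetical helper: print rows whose 3rd ":"-separated value in column 10 exceeds a threshold.
# usage: rows_above THRESHOLD [FILE...]
rows_above() {
  threshold=$1
  shift
  awk -v t="$threshold" '
    { n = split($10, f, ":") }        # n is the number of ":"-separated parts
    n >= 3 && (f[3] + 0) > (t + 0)    # numeric comparison; matching rows are printed whole
  ' "$@"
}
```

Flipping > to < gives the less-than variant; rows whose column 10 has fewer than three parts are skipped either way.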

awk to remove text and split on two delimiters

I am trying to use awk to remove the text after the last digit and split on the :. That part is common to both lines, and I believe the first portion of the awk below will do it. If there is no _ in the line, then $2 is repeated in $3, and I believe the split will do that. What I am not sure how to do is the case where there is an _ in the line: then the number to the left of the _ is $2 and the number to the right of the _ is $3. Thank you :).
input
chr7:140453136A>T 
chr7:140453135_140453136delCAinsTT
desired
chr7 140453136 140453136 
chr7 140453135 140453136
awk
awk '{sub(/[^0-9]+$/, "", $1); {split($0,a,":"); print a[1],a[2]a[2]} 1' input
Here is one:
$ awk '
BEGIN {
    FS="[:_]"                   # using field separation for the job
    OFS="\t"
}
{
    sub(/[^0-9]*$/,"",$NF)      # strip non-digits off the end of last field
    if(NF==2)                   # if only 2 fields
        $3=$2                   # copy $2 into $3
}1' file                        # output
Output:
chr7 140453136 140453136
chr7 140453135 140453136
Tested on GNU awk, mawk, Busybox awk and awk version 20121220.
Using GNU awk:
awk -v FPAT='[0-9]+|chr[0-9]*' -v OFS='\t' 'NF==2{$3=$2}{$1=$1}1'
This relies on the field pattern FPAT that is a regex representing a number or the string chr with a number.
The statement NF==2{$3=$2} duplicates the second field if there are only 2 fields in the record.
The last statement is to force awk to rebuild the record to have the wanted formatting.
$ awk -F'[:_]' '{print $1, $2+0, $NF+0}' file
chr7 140453136 140453136
chr7 140453135 140453136
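The one-liner above leans on awk's string-to-number conversion: adding 0 to a field keeps its leading digits and drops whatever follows. A small sketch of that behaviour, where leading_number is a made-up helper name:

```shell
# Hypothetical helper: print the leading numeric part of the first field of each line.
leading_number() {
  awk '{ print $1 + 0 }'
}

printf '140453136A>T\n140453136delCAinsTT\n' | leading_number
```

A field with no leading digits converts to 0, so this shortcut assumes every coordinate starts with a digit.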
Could you please try the following? It is a more generic solution that avoids hard-coding which field is copied where: set the maximum number of fields in the awk variable max, and for each line every empty field from 2 up to max is filled with the value of the previous field, with all non-digit characters removed from each value.
awk -F'[:_]' -v max="3" '
{
for(i=2;i<=max;i++){
if($i==""){
$i=$(i-1)
}
gsub(/[^0-9]+/,"",$i)
}
}
1
' Input_file
To get the output aligned in columns, append | column -t to the above command.

How to remove field separators in awk when printing $0?

eg, each row of the file is like :
1, 2, 3, 4,..., 1000
How can print out
1 2 3 4 ... 1000
?
If you just want to delete the commas, you can use tr:
$ tr -d ',' <file
1 2 3 4 1000
If it is something more general, you can set FS and OFS (read about FS and OFS) in your begin block:
awk 'BEGIN{FS=","; OFS=""} ...' file
You need to set OFS (the output field separator). Unfortunately, this has no effect unless you also modify the record, leading to the rather cryptic:
awk '{$1=$1}1' FS=, OFS=
Although, if you are happy with some additional space being added, you can leave OFS at its default value (a single space), and do:
awk -F, '{$1=$1}1'
and if you don't mind omitting blank lines (and any line whose first field is empty or evaluates to 0) in the output, you can simplify further to:
awk -F, '$1=$1'
You could also remove the field separators:
awk -F, '{gsub(FS,"")} 1'
Set FS to the input field separator. Assigning to $1 will then rebuild the record using the output field separator, which defaults to a space (note that \s is a GNU awk regexp operator; [[:space:]] is the POSIX spelling):
awk -F',\s*' '{$1 = $1; print}'
See the GNU Awk Manual for an explanation of $1 = $1
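To see why the assignment matters, the following sketch prints each record twice, once untouched and once after $1 = $1. The helper name show_rebuild is invented for the example, and the separator regex ', *' is an assumption about the input:

```shell
# Hypothetical helper: show a record before and after awk rebuilds it.
show_rebuild() {
  awk -F', *' -v OFS=' ' '{
    print          # $0 is still the original text, commas included
    $1 = $1        # touching a field makes awk rebuild $0 with OFS
    print
  }'
}

printf '1, 2, 3\n' | show_rebuild
```

The first printed line is still 1, 2, 3 and the second is 1 2 3, which is the effect the answers above rely on.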

awk command to change field separator from tilde to tab

I want to replace the tilde delimiter with a tab using an awk command. The input and the output I expect are shown below.
input
~1~2~3~
Output
1 2 3
This won't work for me:
awk -F"~" '{ OFS ="\t"; print }' inputfile
It's really a job for tr:
tr '~' '\t'
but in awk you just need to force the record to be recompiled by assigning one of the fields to its own value:
awk -F'~' -v OFS='\t' '{$1=$1}1'
awk NF=NF FS='~' OFS='\t'
Result
1 2 3
Code for sed:
$ echo ~1~2~3~ | sed 'y/~/\t/'
1 2 3
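All of the commands above turn the leading and trailing tildes into tabs as well, so the result starts and ends with a tab. If that is not wanted, one possible sketch (assuming the outer tildes carry no data; tilde_to_tab is a made-up name) strips them before translating:

```shell
# Hypothetical helper: convert inner tildes to tabs and drop the outer ones.
tilde_to_tab() {
  awk '{
    gsub(/^~+|~+$/, "")   # drop tildes at the start and end of the line
    gsub(/~/, "\t")       # every remaining tilde becomes a tab
    print
  }' "$@"
}
```

It can be used as tilde_to_tab inputfile or at the end of a pipeline.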