Using awk gsub with \1 to replace chars with a section of the original characters - awk

This is what I'm doing (I just want to get rid of the leading numbers in the fourth column)
cat text.txt | awk 'BEGIN {OFS="\t"} {gsub(/[0-9XY][0-9]?([pq])/,"\1",$4); print}'
This is my input
AADDC 4902 3 21q11.3-p11.1 4784 4793
DEEDA 4023 6 9p21.31|22.3-p22.1 2829 2832
ZWTEF 3920 10 8q21-q22 5811 5812
This is my output
AADDC 4902 3 11.3-p11.1 4784 4793
DEEDA 4023 6 21.31|22.3-p22.1 2829 2832
ZWTEF 3920 10 21-q22 5811 5812
But I want this to be my output
AADDC 4902 3 q11.3-p11.1 4784 4793
DEEDA 4023 6 p21.31|22.3-p22.1 2829 2832
ZWTEF 3920 10 q21-q22 5811 5812

If you use GNU awk, you can use gensub which, unlike gsub, supports backreferences:
awk 'BEGIN {OFS="\t"} {$4=gensub(/[0-9XY][0-9]?([pq])/,"\\1",1,$4); print}' text.txt
Some explanations:
What is the extra "\" for before the 1?
Because otherwise "\1" would be interpreted as the character with ASCII code 1.
Why does the 1 need to be placed between "\\1" and $4?
To tell gensub to replace only the first occurrence of the pattern.
Is there a reason why you must put $4= as well as $4?
Yes, unlike gsub, gensub doesn't modify the field in place; it returns the updated string, so you have to assign it back.
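If you are not on GNU awk, gensub (and hence backreferences in the replacement) is not available at all. A minimal POSIX-awk sketch using the standard match() and substr() functions, assuming (as in the sample data) that the unwanted digits only ever appear at the start of $4:
awk 'BEGIN {OFS="\t"}
     {if (match($4, /^[0-9XY][0-9]?[pq]/)) $4 = substr($4, RSTART + RLENGTH - 1); print}' text.txt
match() sets RSTART and RLENGTH, so substr() can keep everything from the trailing p/q onwards.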

Related

Awk function to delete rows with an empty field in the 21st column

I am trying to delete rows which have an empty field in the 21st column. For some reason this code works on other files (with fewer columns) but not this particular one. I've tried converting the file to space-separated, comma-separated, and tab-delimited formats; nothing seems to work.
I've tried these 2 different methods:
awk -F'\t' '$21!=""'
awk -F'\t' '$21{print $0}'
For example, here is a smaller version of my tab-delimited file. I want to remove the rows that have "" in the "Gene" column:
"Gene_ID"
"Sample_1"
"Sample_x"
"Sample_19"
"Gene"
"ENSG00000223972"
12
2
1
"DDX11L1"
"ENSG00000227232"
6
12
45
"WASH7P"
"ENSG00000278267"
0
4
542
"MIR6859-1"
"ENSG00000186092"
4
2
34
"OR4F5"
"ENSG00000239945"
7
67
22
""
"ENSG00000233750"
9
4356
22
"CICP27"
"ENSG00000241599"
55
4
55
""
This should work; your field is not blank, it contains empty quotes:
$ awk -F'\t' '$21!="\"\""'
or perhaps easier to read
$ awk -F'\t' -v empty='""' '$21!=empty'
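Applied to the smaller sample above, which only has 5 columns (so "Gene" is field 5 rather than 21), the same test would look like this; sample.tsv is just a placeholder name for that file:
$ awk -F'\t' -v empty='""' '$5 != empty' sample.tsv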

Extract all numbers from string in list

Given some string 's', I would like to extract only the numbers from that string. I would like the output numbers to each be separated by a single space.
Example input -> output
IN:  1,2,3
OUT: 1 2 3
IN:  1 2 a b c 3
OUT: 1 2 3
IN:  ab#35jh71 1,2,3 kj$d3kjl23
OUT: 35 71 1 2 3 3 23
I have tried combinations of grep -o [0-9] and grep -v [a-z] -v [A-Z], but the issue is that other chars like - and # can appear between the numbers. Regardless of how many non-numeric characters sit between the numbers, I need them to be replaced with a single space.
I have also been experimenting with awk and sed but have had little luck.
Not sure about the spaces in your expected output, but based on your samples, you could try the following.
awk '{gsub(/[^0-9]+/," ")} 1' Input_file
Explanation: this globally substitutes every run of non-digit characters with a single space. The trailing 1 is an always-true condition, so awk prints the (modified) line.
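For instance, on the third sample line this prints (note the leading space left where the non-digit prefix was, which is exactly what the next variant removes):
$ echo 'ab#35jh71 1,2,3 kj$d3kjl23' | awk '{gsub(/[^0-9]+/," ")} 1'
 35 71 1 2 3 3 23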
If you also want to remove the leading and trailing spaces from the output, then try the following.
awk '{gsub(/[^0-9]+/," ");gsub(/^ +| +$/,"")} 1' Input_file
Explanation: the first gsub replaces every run of non-digits with a space; the second gsub strips any leading and trailing spaces. The trailing 1 prints the (possibly edited) line.
$ echo 'ab#35jh71 1,2,3 kj$d3kjl23' | grep -o '[[:digit:]]*'
35
71
1
2
3
3
23
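If you prefer the grep approach but want the numbers on one space-separated line, as in the OUT: examples above, one option is to join its output afterwards, e.g. with paste:
$ echo 'ab#35jh71 1,2,3 kj$d3kjl23' | grep -o '[[:digit:]]*' | paste -sd' ' -
35 71 1 2 3 3 23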
$ echo 'ab#35jh71 1,2,3 kj$d3kjl23' | tr -sc '[:digit:]' ' '
35 71 1 2 3 3 23

How do I split a merged field in a tab-delimited file using awk on Windows?

The fields in columns 5 and 17 were merged together, and I want to split the merged values into separate fields.
My data looks like this:
326502010-12-10 320100807
368902010-12-14 420100716
But I want to see it like this:
32650 2010-12-10 3 20100807
36890 2010-12-14 4 20100716
Using awk,
$ awk -vOFS="\t" '{sub(/.{5}/, "&\t", $1); sub(/./, "&\t", $2)}1' file
32650 2010-12-10 3 20100807
36890 2010-12-14 4 20100716
sub(/.{5}/, "&\t", $1) replaces the first 5 characters of the first field with themselves (&) followed by a \t.
sub(/./, "&\t", $2) does the same for the second field, inserting a tab after its first character.
1 always evaluates to true, so awk prints the line as its default action.
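If the & is unfamiliar: in sub() and gsub(), & in the replacement text stands for whatever the regex matched. A quick standalone illustration (any string would do):
$ echo abcdef | awk '{sub(/.{3}/, "&\t")} 1'
abc	def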
In case the length of the number preceding the date varies, use this:
$ awk -v OFS='\t' '{sub(/....-..-../,"\t&",$1); sub(/^./,"&\t",$2)} 1' file
32650 2010-12-10 3 20100807
36890 2010-12-14 4 20100716
sub replaces the date part with a tab (\t) followed by the matched text (&), i.e. the date itself. The second sub does about the same for the first character of $2.
Better to use sed to split by character position:
$ sed -r 's/^(.{5})(.{12})/\1\t\2\t/' file
32650 2010-12-10 3 20100807
36890 2010-12-14 4 20100716
This captures the given character ranges and prints them back with tabs in between them.
You can also use cut for this:
$ cut --output-delimiter=$'\t' -c 1-5,6-17,18- file
32650 2010-12-10 3 20100807
36890 2010-12-14 4 20100716
With the -c option you give a list of character ranges to select from each line. Each comma in that list is replaced in the output by the --output-delimiter, which is set to a tab here.
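Since the question mentions running awk on Windows, where sed and GNU cut may not be installed, a pure-awk sketch of the same fixed-width split (assuming the widths 5 and 12 seen in the sample) is:
$ awk 'BEGIN {OFS="\t"} {print substr($0,1,5), substr($0,6,12), substr($0,18)}' file
Like the cut version, this keeps whatever separator already sits between the date and the single digit.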

Incorrect count of unique text in awk

I am getting the wrong counts using the awk below. The unique text in $5 before the - is supposed to be counted.
input
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 15
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 2 16
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 3 16
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3-46|gc=68.2 553 567
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3-46|gc=68.2 554 569
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 46 203
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 47 206
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 48 206
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 49 207
current output
1
desired output (AGRN, TAS1R3, and PIK3CD are unique and counted)
3
awk
awk -F '[- ]' '!seen[$6]++ {n++} END {print n}' file
Try
awk -F '-| +' '!seen[$6]++ {n++} END {print n}' file
Your problem is that when ' ' (a space) is included as part of a regex to form FS (via -F) it loses its special default-value behavior, and only matches spaces individually as separators.
That is, the default behavior of recognizing runs of whitespace (any mix of spaces and tabs) as a single separator no longer applies.
Thus, [- ] won't do as the field separator: every single space then acts as a separator, and adjacent spaces produce empty fields between them.
You can verify this by printing the field count - based on your intended parsing, you're expecting 9 fields:
$ awk -F '[- ]' '{ print NF }' file
17 # !! 8 extra fields - empty fields
$ awk -F '-| +' '{ print NF }' file
9 # OK, thanks to modified regex
You need alternation -| + to ensure that runs of spaces are treated as a single separator; if tabs should also be matched, use '-|[[:blank:]]+'
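A minimal standalone illustration of that difference (the two spaces between a and b produce an extra, empty field once a space is part of an explicit FS regex):
$ printf 'a  b\n' | awk -F'[- ]' '{print NF}'
3
$ printf 'a  b\n' | awk '{print NF}'
2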
Including "-" in FS might be fine in some cases, but in general if the actual field separator is something else (e.g. whitespace, as seems to be the case here, or perhaps a tab), it would be far better to set FS according to the specification of the file format. In any case, it's easy to extract the subfield of interest. In the following, I'll assume the FS is whitespace.
awk '{split($5, a, "-"); if (!(count[a[1]]++)) n++ }
END {print n}'
If you want the details:
awk '{split($5, a, "-"); count[a[1]]++}
END { for(i in count) {print i, count[i]}}'
Output of the second incantation:
AGRN 3
PIK3CD 4
TAS1R3 2
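An equivalent sketch that keeps awk's default whitespace splitting and simply strips the suffix from $5 before counting (it prints 3 for the sample input):
awk '{sub(/-.*/, "", $5)} !seen[$5]++ {n++} END {print n}' file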

Confusion about awk command when dealing with if statement

$ cat awk.txt
12 32 45
5 2 3
33 11 33
$ cat awk.txt | awk '{FS='\t'} $1==5 {print $0}'
5 2 3
$ cat awk.txt | awk '{FS='\t'} $1==33 {print $0}'
Nothing is returned when testing whether the first field is 33. It's confusing.
By saying
awk '{FS='\t'} $1==5 {print}' file
You are defining the field separator incorrectly. To make it a tab, you need to say "\t" (with double quotes). Further reading: awk not capturing first line / separator.
Also, you are setting it every line, so it does not affect the first one. You want to use:
awk 'BEGIN{FS="\t"} $1==5' file
Yes, but why did it work in one case but not in the other?
awk '{FS='\t'} $1==5' file # it works
awk '{FS='\t'} $1==33' file # it does not work
You're using single quotes around '\t', which means that you're actually concatenating 3 strings together: '{FS=', \t and '} $1==5' to produce your awk command. The shell interprets the \t as t, so your awk script is actually:
awk '{FS=t} $1==5'
The awk variable t is unset, so you're setting the field separator to the empty string "". This means the line is split into as many fields as it has characters. You can see it by running awk 'BEGIN{FS='\t'} {print NF}' file (with the same broken quoting), which will show how many fields each record has.
Then, for the line 33 11 33, $1 is just 3 and $2 contains the second 3, so $1==33 never matches. The $1==5 test still works because the line 5 2 3 starts with a lone 5, which becomes its first one-character field.
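You can see exactly what string the shell hands to awk by echoing the same quoted arguments:
$ echo '{FS='\t'} $1==33'
{FS=t} $1==33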
First of all, could you explain better what you really want to do before you ask? Look:
more awk.txt
12 32 45
5 2 3
33 11 33
awk -F"[ \t]" '$1 == 5 { print $0}' awk.txt
5 2 3
awk -F"[ \t]" '$1 == 33 { print $0}' awk.txt
33 11 33
awk -F"[ \t]" '$1 == 12 { print $0}' awk.txt
12 32 45
http://www.staff.science.uu.nl/~oostr102/docs/nawk/nawk_23.html