awk: delete rows with an empty field in the 21st column

I am trying to delete rows which have an empty field in the 21st column. For some reason this code works on other files (with fewer columns) but not on this particular one. I've tried converting the file to space-separated, comma-separated, and tab-delimited; nothing seems to work.
I've tried these two methods:
awk -F'\t' '$21!=""'
awk -F'\t' '$21{print $0}'
For example, here is a smaller version of my tab-delimited file. I want to remove the rows whose "Gene" column is "":
"Gene_ID"
"Sample_1"
"Sample_x"
"Sample_19"
"Gene"
"ENSG00000223972"
12
2
1
"DDX11L1"
"ENSG00000227232"
6
12
45
"WASH7P"
"ENSG00000278267"
0
4
542
"MIR6859-1"
"ENSG00000186092"
4
2
34
"OR4F5"
"ENSG00000239945"
7
67
22
""
"ENSG00000233750"
9
4356
22
"CICP27"
"ENSG00000241599"
55
4
55
""

This should work; your field is not blank, it contains a pair of empty quotes:
$ awk -F'\t' '$21!="\"\""'
or perhaps easier to read
$ awk -F'\t' -v empty='""' '$21!=empty'
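As a quick sanity check on the smaller sample, where "Gene" is column 5 rather than 21 (the file name genes.tsv is invented for the demo; on the real file use $21):

```shell
# Build a 5-column slice of the sample (tab-separated).
printf '"Gene_ID"\t"Sample_1"\t"Sample_x"\t"Sample_19"\t"Gene"\n'  > genes.tsv
printf '"ENSG00000239945"\t7\t67\t22\t""\n'                       >> genes.tsv
printf '"ENSG00000233750"\t9\t4356\t22\t"CICP27"\n'               >> genes.tsv

# Keep only rows whose Gene field is not a pair of empty quotes;
# this drops the ENSG00000239945 row and keeps the header and CICP27.
awk -F'\t' '$5!="\"\""' genes.tsv
```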

Related

extract specific rows with numbers over N

I have a dataframe like this
1 3 MAPQ=0;CT=3to5;SRMAPQ=60
2 34 MAPQ=60;CT=3to5;SRMAPQ=67
4 56 MAPQ=67;CT=3to5;SRMAPQ=50
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Using awk (or another tool), I want to extract only the rows with SRMAPQ over 60.
This means the output is
2 34 MAPQ=60;CT=3to5;SRMAPQ=67
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Update: the SRMAPQ entry can be anywhere in the line, e.g.
MAPQ=44;CT=3to5;SRMAPQ=61;DT=3to5
You don't have to extract the value out of SRMAPQ separately and do the comparison. If the format is fixed as above, just use = as the field separator and access the last field with $NF:
awk -F= '$NF > 60' file
Or if SRMAPQ can occur anywhere in the line (as updated in the comments), use a generic approach
awk 'match($0, /SRMAPQ=([0-9]+)/){ l = length("SRMAPQ="); v = substr($0, RSTART+l, RLENGTH-l); if (v+0 > 60) print }' file
(Doing the comparison inside the action prevents a value left over from an earlier matching line from printing lines that don't contain SRMAPQ at all.)
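Both commands can be sanity-checked on the sample lines (the file name data.txt is invented):

```shell
# Sample rows, including one where SRMAPQ is not the last pair.
cat > data.txt <<'EOF'
1 3 MAPQ=0;CT=3to5;SRMAPQ=60
2 34 MAPQ=60;CT=3to5;SRMAPQ=67
5 7 MAPQ=44;CT=3to5;SRMAPQ=61;DT=3to5
EOF

# Fixed format: compare the final '='-separated field.
# Prints only the SRMAPQ=67 line; the third line's last field is DT's value.
awk -F= '$NF > 60' data.txt

# SRMAPQ anywhere: extract the digits right after "SRMAPQ=" (7 chars long).
# Prints the SRMAPQ=67 and SRMAPQ=61 lines.
awk 'match($0, /SRMAPQ=[0-9]+/){ v = substr($0, RSTART+7, RLENGTH-7); if (v+0 > 60) print }' data.txt
```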
I would use GNU AWK in the following way. Let file.txt contain:
1 3 MAPQ=0;CT=3to5;SRMAPQ=60
2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2
4 56 MAPQ=67;CT=3to5;SRMAPQ=50
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
then
awk 'BEGIN{FS="SRMAPQ="}$2>60' file.txt
output
2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Note: I added SOMETHING to test that it works when SRMAPQ is not the last entry. Explanation: I set FS to SRMAPQ=, so whatever precedes it becomes the first field ($1) and whatever follows becomes the second field ($2). In the second line $2 is 67;SOMETHING=2, which GNU AWK copes with by converting its longest numeric prefix, in this case 67; the other lines have plain numbers. Disclaimer: this solution assumes that every field but the last has a trailing semicolon; if that does not hold, please test the solution fully before use.
(tested in gawk 4.2.1)
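The longest-numeric-prefix behaviour is easy to check in isolation:

```shell
# With FS="SRMAPQ=", $2 on this line is "67;SOMETHING=2"; forcing a
# numeric context with +0 shows awk converts its numeric prefix, 67.
printf '%s\n' '2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2' |
  awk 'BEGIN{FS="SRMAPQ="} {print $2+0}'
# prints 67
```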

Extract all numbers from string in list

Given some string 's', I would like to extract only the numbers from it, with each number separated by a single space.
Example input -> output:
IN:  1,2,3
OUT: 1 2 3
IN:  1 2 a b c 3
OUT: 1 2 3
IN:  ab#35jh71 1,2,3 kj$d3kjl23
OUT: 35 71 1 2 3 3 23
I have tried combinations of grep -o [0-9] and grep -v [a-z] -v [A-Z] but the issue is that other chars like - and # could be used between the numbers. Regardless of the number of non-numeric characters between the numbers I need them to be replaced with a single space.
I have also been experimenting with awk and sed but have had little luck.
I'm not sure about the spaces in your expected output, but based on your samples, please try the following.
awk '{gsub(/[^0-9]+/," ")} 1' Input_file
Explanation: globally substitute anything that is not a digit with a space; the trailing 1 prints the current line.
If you also want to remove leading and trailing spaces from the output, try the following.
awk '{gsub(/[^0-9]+/," ");gsub(/^ +| +$/,"")} 1' Input_file
Explanation: globally substitute everything apart from digits with a space in the current line, then globally substitute leading and trailing spaces with nothing. The trailing 1 prints the (edited or unedited) current line.
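Both variants can be checked against the third sample string:

```shell
s='ab#35jh71 1,2,3 kj$d3kjl23'

# Every run of non-digits collapses to a single space (a stray space
# may remain at either edge).
printf '%s\n' "$s" | awk '{gsub(/[^0-9]+/," ")} 1'

# Same, with leading/trailing spaces trimmed.
printf '%s\n' "$s" | awk '{gsub(/[^0-9]+/," "); gsub(/^ +| +$/,"")} 1'
# prints: 35 71 1 2 3 3 23
```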
$ echo 'ab#35jh71 1,2,3 kj$d3kjl23' | grep -o '[[:digit:]]*'
35
71
1
2
3
3
23
$ echo 'ab#35jh71 1,2,3 kj$d3kjl23' | tr -sc '[:digit:]' ' '
35 71 1 2 3 3 23

Adding numbers from the fourth column of the first file to the second file using awk

Hoping somebody can help me with this task. I have two input files:
File1
# name length av.qual #-reads mx.cov. av.cov GC% CnIUPAC CnFunny CnN CnX CnGap CnNoCov
10-1_rep_c1 1406 80 8017 4637 1641.26 31.98 1 0 4 0 7 0
10-1_rep_c2 832 80 1641 1462 557.34 32.13 1 0 0 0 5 0
10-1_rep_c3 1284 83 4674 2338 1040.80 24.75 7 0 0 0 8 0
10-1_rep_c4 750 83 2335 2017 886.31 24.73 2 0 0 0 3 0
10-1_rep_c5 1180 78 2326 1486 572.51 19.76 1 0 0 0 7 0
File2
>10-1_rep_c1
ttttttttttttttacaataaaatgcrccattattcctttcgtactaaacaatgccttat
ggccaccagatagaaaccaatctgactcacgtcgattttaactcaaatcatgtaaaattc
>10-1_rep_c2
aacagcagaattaatattgttcacaggtttttataaaacgacctattaatgaatttccat
cccctaaaaatggtcggcttacttgatgtaaccaccccctctagttaataataattgtat
>10-1_rep_c3
aattataaaaagaatttttaaagcataaattattagtaattttaagagaaattaaaggta
ttataaaagagtaatagtactgacaaggaaaaacttttatataaaaaaaagaaaatttaa
The output file I would like is:
>10-1_rep_c1_8017
ttttttttttttttacaataaaatgcrccattattcctttcgtactaaacaatgccttat
ggccaccagatagaaaccaatctgactcacgtcgattttaactcaaatcatgtaaaattc
>10-1_rep_c2_1641
aacagcagaattaatattgttcacaggtttttataaaacgacctattaatgaatttccat
cccctaaaaatggtcggcttacttgatgtaaccaccccctctagttaataataattgtat
>10-1_rep_c3_4674
aattataaaaagaatttttaaagcataaattattagtaattttaagagaaattaaaggta
ttataaaagagtaatagtactgacaaggaaaaacttttatataaaaaaaagaaaatttaa
So the fourth column of the first file is appended to the header of each DNA sequence in the second file.
This is an alternative using awk:
awk 'FNR==NR{a[">"$1]="_"$4;next}{print $0a[$0]}' File1 File2
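A quick check with trimmed-down copies of the two files:

```shell
cat > File1 <<'EOF'
10-1_rep_c1 1406 80 8017 4637
10-1_rep_c2 832 80 1641 1462
EOF
cat > File2 <<'EOF'
>10-1_rep_c1
ttttacaataaaatgc
>10-1_rep_c2
aacagcagaattaat
EOF

# Headers gain the _<#-reads> suffix; sequence lines pass through
# unchanged because a[$0] is empty for them.
awk 'FNR==NR{a[">"$1]="_"$4;next}{print $0 a[$0]}' File1 File2
```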
Try the following and let me know if it helps you:
awk 'FNR==NR{a[$1]=$4;next} ($2 in a){print ">" $2 "_" a[$2];next} 1' file1 FS=">" file2
Explanation: the first condition, FNR==NR, is true only while the first file (file1) is being read. FNR and NR both count input lines; the difference is that NR keeps increasing across all input files while FNR is reset at the start of each new file. So while reading file1 we create an array named a whose index is $1 and whose value is $4 (as per your requirement); the next keyword then makes sure no further statements run for those lines.
The second condition checks whether $2 of file2 is present in array a (file2's field separator is set to > to remove it from the mix; note that awk lets you set different field separators for different input files).
If it is present, we print file2's second field, an underscore, and the value a[$2], then use next to skip the remaining statements. Finally, the lone 1 prints all other lines unchanged: awk works on a pattern-action model, and a pattern that is true (1) with no action defaults to printing the current line.
The arguments at the end name the first input file (file1), then set FS to ">" (explained above; it takes effect before the next file is read), then name the second input file (file2).

Cut column from multiple files with the same name in different directories and paste into one

I have multiple files with the same name (3pGtoA_freq.txt), but all located in different directories.
Each file looks like this:
pos 5pG>A
1 0.162421557770395
2 0.0989643268124281
3 0.0804131316857248
4 0.0616563298066399
5 0.0577551761714493
6 0.0582450832072617
7 0.0393129770992366
8 0.037037037037037
9 0.0301016419077404
10 0.0327510917030568
11 0.0301598837209302
12 0.0309050772626932
13 0.0262089331856774
14 0.0254612546125461
15 0.0226130653266332
16 0.0206971677559913
17 0.0181280059193489
18 0.0243993993993994
19 0.0181347150259067
20 0.0224429727740986
21 0.0175690211545357
22 0.0183916336098089
23 0.0196078431372549
24 0.0187983781791375
25 0.0173192771084337
I want to cut column 2 from each file and paste the columns side by side in one output file.
I tried running:
for s in results_Sample_*_hg19/results_MapDamage_Sample_*/results_Sample_*_bwa_LongSeed_sorted_hg19_noPCR/3pGtoA_freq.txt; do awk '{print $2}' $s >> /home/users/istolarek/aDNA/3pGtoA_all; done
but it's not pasting the columns next to each other.
Also, I wanted to name each column after the '*', which is the only string that changes in the path.
Any help with that?
for i in $(find your_file_dir -name 3pGtoA_freq.txt); do awk '{print $2 >> "NewFile"}' $i; done
I would do this by processing all files in parallel in awk:
awk 'BEGIN{
    printf "pos ";
    for (i = 1; i < ARGC; ++i)
        printf "%-19s", gensub("^results_Sample_", "", 1, gensub("_hg19.*", "", 1, ARGV[i]));
    printf "\n";
    while ((getline < ARGV[1]) > 0) {
        printf "%-4s%-19s", $1, $2;
        for (i = 2; i < ARGC; ++i) {
            getline < ARGV[i];
            printf "%-19s", $2
        }
        printf "\n"
    }
}
{exit}' \
results_Sample_*_hg19/results_MapDamage_Sample_*/results_Sample_*_bwa_LongSeed_sorted_hg19_noPCR/3pGtoA_freq.txt
If your awk doesn't have gensub (it's a GNU awk extension; I'm using cygwin), you can remove the header part (the first four lines, printf through printf "\n"); headers just won't be printed in that case.
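Since the goal is side-by-side columns, a paste-based sketch may also do the job (dir1/dir2 stand in for the long result directories in the question):

```shell
# Toy copies of 3pGtoA_freq.txt in two directories.
mkdir -p dir1 dir2
printf 'pos 5pG>A\n1 0.16\n2 0.09\n' > dir1/3pGtoA_freq.txt
printf 'pos 5pG>A\n1 0.31\n2 0.25\n' > dir2/3pGtoA_freq.txt

# paste glues the files line by line; awk then keeps the first "pos"
# column plus every second field (the value columns).
paste dir1/3pGtoA_freq.txt dir2/3pGtoA_freq.txt |
  awk '{printf "%s", $1; for (i = 2; i <= NF; i += 2) printf "\t%s", $i; print ""}'
```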

move certain columns to end using awk

I have a large tab-delimited file with 1000 columns. I want to rearrange it so that certain columns are moved to the end.
Could anyone help using awk?
Example input:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Move columns 5,6,7,8 to the end.
Output:
1 2 3 4 9 10 11 12 13 14 15 16 17 18 19 20 5 6 7 8
This prints columns 1 to a, then b to the last, and then columns a+1 to b-1:
$ awk -v a=4 -v b=9 '{for (i=1;i<=NF;i+=i==a?b-a:1) {printf "%s\t",$i};for (i=a+1;i<b;i++) {printf "%s\t",$i};print""}' file
1 2 3 4 9 10 11 12 13 14 15 16 17 18 19 20 5 6 7 8
The columns are moved in this way for every line in the input file, however many lines there are.
How it works
-v a=4 -v b=9
This defines the variables a and b which determine the limits on which columns will be moved.
for (i=1;i<=NF;i+=i==a?b-a:1) {printf "%s\t",$i}
This prints all columns except the ones from a+1 to b-1.
In this loop, i is incremented by one except when i==a in which case it is incremented by b-a so as to skip over the columns to be moved. This is done with awk's ternary statement:
i += i==a ? b-a : 1
+= simply means "add to." i==a ? b-a : 1 is the ternary statement. The value that it returns depends on whether i==a is true or false. If it is true, the value before the colon is returned. If it is false, the value after the colon is returned.
for (i=a+1;i<b;i++) {printf "%s\t",$i}
This prints columns a+1 to b-1.
print""
This prints a newline character to end the line.
Alternative solution that avoids printf
This approach assembles the output into the variable out and then prints with a plain print command, avoiding printf and the need for percent signs:
awk -v a=4 -v b=9 '{out="";for (i=1;i<=NF;i+=i==a?b-a:1) out=out $i"\t";for (i=a+1;i<b;i++) out=out $i "\t";print out}' file
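Either variant can be sanity-checked on a generated 20-field line; here is the out-variable version with the trailing tab trimmed (seq and paste only build the test input):

```shell
# Build "1<TAB>2<TAB>...<TAB>20", then move columns 5-8 to the end.
seq 20 | paste -s - |
  awk -v a=4 -v b=9 '{
    out = ""
    for (i = 1; i <= NF; i += (i == a ? b - a : 1)) out = out $i "\t"
    for (i = a + 1; i < b; i++)                     out = out $i "\t"
    sub(/\t$/, "", out)      # drop the trailing tab
    print out
  }'
# fields come out as: 1 2 3 4 9 10 ... 20 5 6 7 8
```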
One way to swap two columns ($5 becomes $20 and $20 becomes $5) while the rest stay unchanged:
$ awk '{x=$5; $5=$20; $20=x; print}' file.txt
and for four columns (two swaps, e.g. 5<->20 and 9<->10):
$ awk '{
    x=$5; $5=$20; $20=x;
    y=$9; $9=$10; $10=y;
    print
}' file.txt
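A quick check of the two-column swap; note that assigning to any field makes awk rebuild $0 with OFS (a space by default), so tab-separated input comes out space-separated:

```shell
# 20 tab-separated fields in; $5 and $20 swapped, space-separated out.
seq 20 | paste -s - | awk '{x=$5; $5=$20; $20=x; print}'
# prints: 1 2 3 4 20 6 7 8 9 10 11 12 13 14 15 16 17 18 19 5
```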
My approach:
awk 'BEGIN{ f[5];f[6];f[7];f[8] } \
{ for(i=1;i<=NF;i++) if(!(i in f)) printf "%s\t", $i; \
for(c in f) printf "%s\t", $c; printf "\n"} ' file
It is split into three parts:
The BEGIN{} part determines which fields should be moved to the end: the indexes of the array f are the ones moved. In the example they are 5, 6, 7 and 8.
Cycle through every field (it doesn't matter if there are 1000 fields or more) and check whether it is in the array. If not, print it.
Now we need the skipped fields: cycle through the f array and print those values. Note that for (c in f) visits indices in an unspecified order; if the moved columns must appear in a fixed order, iterate i from 5 to 8 instead, or in gawk set PROCINFO["sorted_in"] = "@ind_num_asc".
Another way in awk
Swap fields A through B with the last B-A+1 fields:
awk -vA=4 -vB=8 '{x=B-A;for(i=A;i<=B;i++){y=$i;$i=$(t=(NF-x--));$t=y}}1' file
Put N fields from the end into position A:
awk -vA=3 -vB=8 '{split($0,a," ");x=A++;while(x++<B)$x=a[NF-(B-x)];while(B++<NF)$B=a[A++]}1' file