Remove dots but not decimals in a tab-delimited file - awk

I have these data:
#chr pos ref alt af_alt filter an
22 10510033 T C . AC0;AS_VQSR 0
22 10510061 A T 0 AC0 2
(non-dot lines snipped)
22 10510286 G A 0 AC0;AS_VQSR 1598
22 10510286 G T 0.00125156 AS_VQSR 1598
I want to remove lone dots in every column (e.g. the af_alt column) but leave decimals intact (e.g. the last row).
I tried this solution, but it does not seem to change the file in any way:
awk 'BEGIN {OFS=FS=" "} {gsub(/^\.$/,"",$1)}1'

In awk you can do:
awk -v OFS="\t" '/\./{for (i=1;i<=NF;i++) if ($i==".") $i=""} 1' file
This works on any field (where a regex relying on a leading or trailing space would not) and lets awk handle the surrounding field separators, so a simple string equality test for "." is enough.
With your example (with the runs of spaces replaced with tabs), it prints:
#chr pos ref alt af_alt filter an
22 10510033 T C AC0;AS_VQSR 0
22 10510061 A T 0 AC0 2
22 10510077 C A 0 AC0 18
22 10510103 A T 0 AC0;AS_VQSR 64
22 10510105 T A 0 AC0;AS_VQSR 70
22 10510113 C T 0 AC0;AS_VQSR 94
22 10510119 A G 0 AC0;AS_VQSR 120
22 10510130 A G 0 AC0;AS_VQSR 164
22 10510138 CATA C 0 AC0;AS_VQSR 218
22 10510143 T A 0 AC0;AS_VQSR 264
22 10510161 T A 0 AC0;AS_VQSR 430
22 10510164 A T 0 AC0;AS_VQSR 468
22 10510169 G A 0 AS_VQSR 502
22 10510171 C T 0 AC0;AS_VQSR 530
22 10510183 A G 0 AS_VQSR 718
22 10510193 G C 0 AC0;AS_VQSR 804
22 10510200 C T 0 AC0;AS_VQSR 936
22 10510212 A T 0 AS_VQSR 1070
22 10510228 G T 0 AC0;AS_VQSR 1318
22 10510232 A G 0 AS_VQSR 1364
22 10510233 G A 0 AC0 1370
22 10510235 C A 0 AC0;AS_VQSR 1376
22 10510236 G A 0 AC0 1394
22 10510250 C T 0 AC0;AS_VQSR 1434
22 10510258 C T 0 AS_VQSR 1442
22 10510263 A T 0 AC0;AS_VQSR 1486
22 10510276 G A 0 AC0;AS_VQSR 1550
22 10510277 A G 0 AC0;AS_VQSR 1570
22 10510286 G A 0 AC0;AS_VQSR 1598
22 10510286 G T 0.00125156 AS_VQSR 1598

You might harness GNU sed for this task in the following way. Let file.txt be tab-separated with the following content
1.1 2.2 3.3
1.1 2.2 .
1.1 . 3.3
1.1 . .
. 2.2 3.3
. 2.2 .
. . 3.3
. . .
then
sed -e 's/^\.\t/\t/' -e 's/\t\.$/\t/' -e 's/\t\.\t/\t\t/g' file.txt
gives this output (the final all-dot row becomes a line of TABs only, which displays as blank)
1.1 2.2 3.3
1.1 2.2
1.1 3.3
1.1
2.2 3.3
2.2
3.3
Explanation: there are 3 cases: . might be at the beginning of the line, at the end of the line, or in the middle. The 1st and 2nd are handled using ^ (start of line) and $ (end of line), and it is sufficient to do each once; the 3rd might require global (g) replacement. Each match is replaced with the number of TAB characters it contained (1, 1 and 2 respectively). Observe that . needs to be escaped to mean a literal dot, not any character.
(tested in GNU sed 4.7)
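One caveat: with four or more columns, adjacent lone dots overlap (the trailing TAB of one match is the leading TAB of the next), so the middle substitution may need a second pass. A quick check, assuming the same GNU sed:
$ printf '%s\t%s\t%s\t%s\n' 1.1 . . 2.2 | sed -e 's/^\.\t/\t/' -e 's/\t\.$/\t/' -e 's/\t\.\t/\t\t/g' -e 's/\t\.\t/\t\t/g'
1.1			2.2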

You can use sed to replace any dot between spaces by a space:
sed 's/ \. / /'
For the last column, you might need a $ instead of the final space.
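A quick check of that, assuming space-separated input as written (add the g flag if a line can contain several lone dots):
$ printf '%s %s %s\n' 1.1 . 2.2 | sed 's/ \. / /'
1.1 2.2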

For a regex solution, I'd use perl
perl -pe 's/(^|\t)\.(\t|$)/$1$2/g'
Demo:
Some tab-separated text:
$ printf '%s\t%s\t%s\n' 1 2 3 4 . 5 . 7.5 .
1 2 3
4 . 5
. 7.5 .
with the perl filter
$ printf '%s\t%s\t%s\n' 1 2 3 4 . 5 . 7.5 . | perl -pe 's/(^|\t)\.(\t|$)/$1$2/g'
1 2 3
4 5
7.5
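The same adjacency caveat as with the sed approach applies here: each match consumes its trailing tab, so two lone dots side by side need a second pass, e.g.
$ printf '%s\t%s\t%s\t%s\n' 4 . . 5 | perl -pe 's/(^|\t)\.(\t|$)/$1$2/g; s/(^|\t)\.(\t|$)/$1$2/g'
4			5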

If you don't want to deal with loops but don't mind an extra downstream pipe:
{m,g,n}awk 'gsub("\23[.]\23", "\23_\23", $!(NF = NF))^_' OFS='\23' |
column -s$'\23' -t
#chr pos ref alt af_alt filter an
22 10510033 T C _ AC0;AS_VQSR 0
22 10510061 A T 0 AC0 2
(non-dot lines snipped)
22 10510286 G A 0 AC0;AS_VQSR 1598
22 10510286 G T 0.00125156 AS_VQSR 1598
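For reference, the same idea in plainer awk — a sketch that, like the one-liner above, assumes a lone dot never sits in the first or last column of this data:
awk 'BEGIN { OFS = "\23" }          # an unlikely control character as separator
     { $1 = $1                      # force a rebuild of $0 with OFS
       gsub("\23[.]\23", "\23_\23") # lone "." fields become "_"
     } 1' file |
column -s$'\23' -t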

Related

for loop passing variables to awk cmd

I'm trying to make this loop work
for i in 5 10 15; do awk -v var=${i} '$2>var' file.txt > par${i}.pos; done
where file.txt is a tab-delimited file like
A 2
B 4
EE 5
F2 7
FF 12
C 5
D 13
GAG 15
so that I can collect the lines of file.txt whose 2nd column is > 5 in par5.pos, and so on, but it doesn't work. The expected par5.pos:
F2 7
FF 12
D 13
GAG 15
and the expected par10.pos:
FF 12
D 13
GAG 15
This might work for you (GNU parallel and awk):
parallel awk \''$2>{}'\' file.txt '>' par{}.pos ::: 5 10 15
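For what it's worth, the loop from the question is essentially sound; with the expansions quoted it runs as expected in bash. A minimal sketch:
for i in 5 10 15; do
    awk -v var="$i" '$2 > var' file.txt > "par${i}.pos"
done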

Using awk to select rows with a specific value in column greater than x

I tried to use awk to select all rows with a value greater than 98 in the third column. In the output, only lines with values between 98 and 98.99... were selected; lines with values of 99 and above were not.
I would like to extract all lines with a value greater than 98 including 99, 100 and so on.
Here my code and my input format:
for i in *input.file; do awk '$3>98' $i >{i/input./output.}; done
A chr11 98.80 83 1 0 2 84
B chr7 95.45 22 1 0 40 61
C chr7 88.89 27 0 1 46 72
D chr6 100.00 20 0 0 1 20
Expected Output
A chr11 98.80 83 1 0 2 84
D chr6 100.00 20 0 0 1 20
Okay, if you have a series of files, *input.file, and you want to select the lines where $3 > 98 and write them to output files with the same prefix but ending in output.file, you can use:
awk '$3 > 98 {
    match(FILENAME, /input\.file$/)
    print $0 > (substr(FILENAME, 1, RSTART-1) "output.file")
}' *input.file
This uses match() to find the index where input.file begins, then substr() to take the part of the filename before that point, appending "output.file" to form the final output filename.
match() sets RSTART to the index where input.file begins in the current filename, which substr() then uses to truncate the filename at that point. See GNU awk String Functions for complete details.
For example, if you had input files:
$ ls -1 *input.file
v1input.file
v2input.file
Both with your example content:
$ cat v1input.file
A chr11 98.80 83 1 0 2 84
B chr7 95.45 22 1 0 40 61
C chr7 88.89 27 0 1 46 72
D chr6 100.00 20 0 0 1 20
Running the awk command above would result in two output files:
$ ls -1 *output.file
v1output.file
v2output.file
Containing the records where the third field was greater than 98:
$ cat v1output.file
A chr11 98.80 83 1 0 2 84
D chr6 100.00 20 0 0 1 20
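As an aside, the shell loop in the question fails for a separate reason: the redirection target is missing its $, so everything goes to a literal file named {i/input./output.}. A minimal fix, assuming bash for the ${i/input./output.} substitution:
for i in *input.file; do
    awk '$3 > 98' "$i" > "${i/input./output.}"
done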

extract specific lines based on another file

I have a folder containing text files. I need to extract specific lines from the files in this folder based on another file, input.txt. I tried the following code, but it doesn't work.
awk '
NR==FNR{
if(NF>1)f=$3
else A[f,$1]
next
}
(FILENAME,$3) in A {
print > ( todir "/" FILENAME )
}
' todir=/home/alan/Desktop/output FS=\* /home/alan/Desktop/input.txt FS=" " *
file1
PHE .0 10 .0 0
ASP 59.8 11 59.8 0
LEU 66.8 15 66.8 0
ARG 21.0 16 21.0 0
file2
SER 57.9 43 57.9 0
PHE 2.4 44 2.4 0
GLN 86.1 49 83.2 2.9
THR 33.2 50 33.2 0
SER 10.2 51 .9 9.3
input.txt
*file1
10
16
*file2
43
44
49
Desired output
file1
PHE 0 10 0 0
ARG 21.0 16 21.0 0
file2
SER 57.9 43 57.9 0
PHE 2.4 44 2.4 0
GLN 86.1 49 83.2 2.9
On line 3 of the script, $3 needs to be changed to $2.
Since asterisk is the field separator in input.txt, the (empty, non-existent) string before it is counted as $1 and the file name that comes after it as $2.
awk '
NR==FNR{
if(NF>1)f=$2
else A[f,$1]
next
}
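For completeness, the full command with that single change applied (same paths as in the question):
awk '
NR==FNR{                 # first file: input.txt
    if(NF>1)f=$2         # "*file1" line: the name after the asterisk is $2
    else A[f,$1]         # bare number: mark (file, key) as wanted
    next
}
(FILENAME,$3) in A {     # data files: keep lines whose $3 was listed
    print > ( todir "/" FILENAME )
}
' todir=/home/alan/Desktop/output FS=\* /home/alan/Desktop/input.txt FS=" " *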

merge similar files in awk

I have the following files
File A
Kmax Event File - Text Format
1 6 1000
1 4143 9256 13645 16426 20490
49 4144 8820 14751 16529 20505
45 4146 8308 12303 16912 22715
75 4139 9049 14408 16447 20480
23 4137 8449 13223 16511 20498
22 4142 8795 14955 16615 20493
File B
Kmax Event File - Text Format
1 6 1000
42 4143 9203 13401 16475 20480
44 4140 8251 12302 16932 21872
849 6283 8455 12301 16415 20673
18 4148 8238 12757 16597 20484
19 4144 8268 12306 17110 21868
50 4134 8331 12663 16501 20606
988 5682 8296 12306 16577 20592
61 4147 8330 12307 16945 22497
0 4138 8333 12310 16871 22749
File C, File D, ... and all those files have exactly the same format. In addition, the files are named as follows: run, run%1, run%2, run%3, run%4, etc. There can be up to 30 of them, i.e. up to run%30.
What I'd like to do is to merge the files in the following way
Kmax Event File - Text Format
1 6 1000
1 4143 9256 13645 16426 20490
49 4144 8820 14751 16529 20505
45 4146 8308 12303 16912 22715
75 4139 9049 14408 16447 20480
23 4137 8449 13223 16511 20498
22 4142 8795 14955 16615 20493
42 4143 9203 13401 16475 20480
44 4140 8251 12302 16932 21872
849 6283 8455 12301 16415 20673
18 4148 8238 12757 16597 20484
19 4144 8268 12306 17110 21868
50 4134 8331 12663 16501 20606
988 5682 8296 12306 16577 20592
61 4147 8330 12307 16945 22497
0 4138 8333 12310 16871 22749
I believe I can do it using
awk '{i=$1;sub(i,x);A[i]=A[i]$0} FILENAME==ARGV[ARGC-1]{print i A[i]}'
but in this way the first two lines of the second file will also be present. In addition, I don't know if the above line will work. The problem is that I will need to merge many files at the same time. Any idea how to merge these almost identical files?
Using grouping braces in the shell
{ cat run; sed '1,2d' run%*; } > c
You can use cat and tail:
cat A > C && tail -n +3 B >> C
This will merge files A and B into a new file named C.
Using awk:
awk 'FNR==NR{print; next} FNR>2' A B > C
If you have more than one file to merge into one, list them after A B in the awk version, e.g. A B D. In the awk version, C is the output file containing the merged data.
In the cat and tail version, you can repeat the tail part of the command for the other files, e.g.
cat A > C && tail -n +3 B >> C && tail -n +3 D >> C
or write a loop to iterate over the files, for example:
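A minimal sketch of such a loop, using the run, run%1, ... naming from the question (merged is an assumed output name):
cat run > merged
for f in run%*; do
    tail -n +3 "$f" >> merged
done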
Print all lines from the first file (NR==FNR) and only lines 3 onward from the rest of the files (FNR>2):
awk 'NR==FNR||FNR>2' run*

How do I sum the second column based on column 1 in awk? For example, I used the following script

zcat *.gz | awk '{print $1}' |sort| uniq -c | sed 's/^[ ]\+//g' | cut -d' ' -f1 | sort | uniq -c | sort -k1n
I get the following output:
3 648
3 655
3 671
3 673
3 683
3 717
4 18
4 29
4 31
4 34
4 652
5 12
6 24
6 33
7 13
12 10
13 9
14 8
33 7
73 6
166 5
383 4
1178 3
3945 2
26692 1
I don't want repetitions in my 1st column. Example: if my first column is 3, I should add up all the values in the second column that are associated with 3. Thank you
A solution using arrays in awk (saved as num.awk, as used below):
{
    a[$1] = a[$1] + $2            # accumulate column 2 per distinct column-1 key
}
END {
    for (i in a)                  # print each key and its sum, tab-separated
        printf("%d\t%d\n", i, a[i])
}
Pipe the output through sort -n once more to have it in ascending order
$ awk -f num.awk numbers | sort -n
3 4047
4 764
5 12
6 57
7 13
12 10
13 9
14 8
33 7
73 6
166 5
383 4
1178 3
3945 2
26692 1
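The same logic as a one-liner, for reference:
$ awk '{a[$1]+=$2} END{for (i in a) print i "\t" a[i]}' numbers | sort -n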
awk 'NF == 1 {c=$1; print $0} NF>1 {if (c==$1) {print "\t" $2} else {c=$1; print $0}}'
can do it too, but please note that the indentation may be incorrect, as I used a simple tab (\t) above.
HTH