extract specific lines based on another file - awk

I have a folder containing text files. I need to extract specific lines from the files in this folder based on another file, input.txt. I tried the following code, but it doesn't work.
awk '
NR==FNR{
if(NF>1)f=$3
else A[f,$1]
next
}
(FILENAME,$3) in A {
print > ( todir "/" FILENAME )
}
' todir=/home/alan/Desktop/output FS=\* /home/alan/Desktop/input.txt FS=" " *
file1
PHE .0 10 .0 0
ASP 59.8 11 59.8 0
LEU 66.8 15 66.8 0
ARG 21.0 16 21.0 0
file2
SER 57.9 43 57.9 0
PHE 2.4 44 2.4 0
GLN 86.1 49 83.2 2.9
THR 33.2 50 33.2 0
SER 10.2 51 .9 9.3
input.txt
*file1
10
16
*file2
43
44
49
Desired output
file1
PHE 0 10 0 0
ARG 21.0 16 21.0 0
file2
SER 57.9 43 57.9 0
PHE 2.4 44 2.4 0
GLN 86.1 49 83.2 2.9

On line 3, $3 needs to be changed to $2.
Since the asterisk is the field separator in input.txt, the (empty, non-existent) string before it is counted as $1 and the file name that comes after it as $2.
awk '
NR==FNR{
if(NF>1)f=$2
else A[f,$1]
next
}
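For completeness, this is the full command from the question with just that change applied ($3 in the second block stays as it is, because the line numbers to match sit in field 3 of the data files):
awk '
NR==FNR{
if(NF>1)f=$2
else A[f,$1]
next
}
(FILENAME,$3) in A {
print > ( todir "/" FILENAME )
}
' todir=/home/alan/Desktop/output FS=\* /home/alan/Desktop/input.txt FS=" " *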

Remove dots but not decimals in tab delimited file

I have these data:
#chr pos ref alt af_alt filter an
22 10510033 T C . AC0;AS_VQSR 0
22 10510061 A T 0 AC0 2
(non-dot lines snipped)
22 10510286 G A 0 AC0;AS_VQSR 1598
22 10510286 G T 0.00125156 AS_VQSR 1598
I want to remove lone dots in every column (e.g. the af_alt column) but leave decimals (e.g. the last row).
I tried this solution, but it does not seem to change the file in any way:
awk 'BEGIN {OFS=FS=" "} {gsub(/^\.$/,"",$1)}1'
In awk you can do:
awk -v OFS="\t" '/\./{for (i=1;i<=NF;i++) if ($i==".") $i=""} 1' file
This works on any field (where a regex relying on a leading or trailing space does not) and allows awk to handle the surrounding field separators. That allows simple string equality to be used to test for ".".
With your example (with the runs of spaces replaced with tabs), this prints:
#chr pos ref alt af_alt filter an
22 10510033 T C AC0;AS_VQSR 0
22 10510061 A T 0 AC0 2
22 10510077 C A 0 AC0 18
22 10510103 A T 0 AC0;AS_VQSR 64
22 10510105 T A 0 AC0;AS_VQSR 70
22 10510113 C T 0 AC0;AS_VQSR 94
22 10510119 A G 0 AC0;AS_VQSR 120
22 10510130 A G 0 AC0;AS_VQSR 164
22 10510138 CATA C 0 AC0;AS_VQSR 218
22 10510143 T A 0 AC0;AS_VQSR 264
22 10510161 T A 0 AC0;AS_VQSR 430
22 10510164 A T 0 AC0;AS_VQSR 468
22 10510169 G A 0 AS_VQSR 502
22 10510171 C T 0 AC0;AS_VQSR 530
22 10510183 A G 0 AS_VQSR 718
22 10510193 G C 0 AC0;AS_VQSR 804
22 10510200 C T 0 AC0;AS_VQSR 936
22 10510212 A T 0 AS_VQSR 1070
22 10510228 G T 0 AC0;AS_VQSR 1318
22 10510232 A G 0 AS_VQSR 1364
22 10510233 G A 0 AC0 1370
22 10510235 C A 0 AC0;AS_VQSR 1376
22 10510236 G A 0 AC0 1394
22 10510250 C T 0 AC0;AS_VQSR 1434
22 10510258 C T 0 AS_VQSR 1442
22 10510263 A T 0 AC0;AS_VQSR 1486
22 10510276 G A 0 AC0;AS_VQSR 1550
22 10510277 A G 0 AC0;AS_VQSR 1570
22 10510286 G A 0 AC0;AS_VQSR 1598
22 10510286 G T 0.00125156 AS_VQSR 1598
You might harness GNU sed for this task in the following way. Let file.txt be tab-separated with the following content:
1.1 2.2 3.3
1.1 2.2 .
1.1 . 3.3
1.1 . .
. 2.2 3.3
. 2.2 .
. . 3.3
. . .
then
sed -e 's/^\.\t/\t/' -e 's/\t\.$/\t/' -e 's/\t\.\t/\t\t/g' file.txt
gives output
1.1 2.2 3.3
1.1 2.2
1.1 3.3
1.1
2.2 3.3
2.2
3.3
Explanation: there are 3 cases: . might be at the beginning of the line, at the end of the line, or in the middle of the line. The 1st and 2nd are handled using ^ (start of line) and $ (end of line), and it is sufficient to do each once; the 3rd might require a global (g) replacement. Each match is replaced with as many TAB characters as it contained (1, 1 and 2 respectively). Observe that . needs to be escaped to mean a literal dot, not any character.
(tested in GNU sed 4.7)
You can use sed to replace any dot between spaces by a space:
sed 's/ \. / /'
For the last column, you might need a $ instead of the final space.
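A fuller sketch along those lines, assuming space-separated input and a hypothetical input file named file (note that runs of consecutive lone dots may need a second pass, since each match consumes the surrounding separator):
sed -e 's/^\. / /' -e 's/ \.$/ /' -e 's/ \. / /g' file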
For a regex solution, I'd use perl
perl -pe 's/(^|\t)\.(\t|$)/$1$2/g'
Demo:
Some tab-separated text:
$ printf '%s\t%s\t%s\n' 1 2 3 4 . 5 . 7.5 .
1 2 3
4 . 5
. 7.5 .
with the perl filter
$ printf '%s\t%s\t%s\n' 1 2 3 4 . 5 . 7.5 . | perl -pe 's/(^|\t)\.(\t|$)/$1$2/g'
1 2 3
4 5
7.5
If you don't want to deal with loops but also don't mind an extra downstream pipe:
{m,g,n}awk 'gsub("\23\\456", "\23_", $!(NF = NF))^_' OFS='\23' |
column -s$'\23' -t
#chr pos ref alt af_alt filter an
22 10510033 T C _ AC0;AS_VQSR 0
22 10510061 A T 0 AC0 2
(non-dot lines snipped)
22 10510286 G A 0 AC0;AS_VQSR 1598
22 10510286 G T 0.00125156 AS_VQSR 1598

Using awk to select rows with a specific value in column greater than x

I tried to use awk to select all rows with a value greater than 98 in the third column. In the output, only lines with values between 98 and 98.99... were selected, and lines with a value of more than 98.99 were not.
I would like to extract all lines with a value greater than 98 including 99, 100 and so on.
Here is my code and my input format:
for i in *input.file; do awk '$3>98' $i >{i/input./output.}; done
A chr11 98.80 83 1 0 2 84
B chr7 95.45 22 1 0 40 61
C chr7 88.89 27 0 1 46 72
D chr6 100.00 20 0 0 1 20
Expected Output
A chr11 98.80 83 1 0 2 84
D chr6 100.00 20 0 0 1 20
Okay, if you have a series of files matching *input.file and you want to select the lines where $3 > 98 and write them to files with the same prefix, but with output.file as the rest of the filename, you can use:
awk '$3 > 98 {
    match(FILENAME, /input.file$/)
    print $0 > substr(FILENAME,1,RSTART-1) "output.file"
}' *input.file
This uses match to find the index where input.file begins, then uses substr to get the part of the filename before that and appends "output.file" to the substring to form the final output filename.
match() sets RSTART to the index where input.file begins in the current filename, which is then used by substr to truncate the current filename at that index. See GNU awk String Functions for complete details.
For example, if you had input files:
$ ls -1 *input.file
v1input.file
v2input.file
Both with your example content:
$ cat v1input.file
A chr11 98.80 83 1 0 2 84
B chr7 95.45 22 1 0 40 61
C chr7 88.89 27 0 1 46 72
D chr6 100.00 20 0 0 1 20
Running the awk command above would result in two output files:
$ ls -1 *output.file
v1output.file
v2output.file
Containing the records where the third field was greater than 98:
$ cat v1output.file
A chr11 98.80 83 1 0 2 84
D chr6 100.00 20 0 0 1 20
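If you would rather keep the shell loop from the question, the redirection just needs valid parameter-expansion syntax (a sketch, assuming bash):
for i in *input.file; do awk '$3 > 98' "$i" > "${i/input./output.}"; done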

AWK print next line of match between matches

Let's presume I have file test.txt with following data:
0.0
41
0.0
42
0.0
43
0.0
44
0.0
45
0.0
46
0.0
START
90
34
17
34
10
100
20
2056
30
0.0
10
53
20
2345
30
0.0
10
45
20
875
30
0.0
END
0.0
48
0.0
49
0.0
140
0.0
With AWK, how would I print the lines after 10 and 20 between START and END?
So the output would be:
100
2056
53
2345
45
875
I was able to get the lines with 10 and 20 with
awk '/START/,/END/ {if($0==10 || $0==20) print $0}' test.txt
but how would I get the next lines?
I actually got what I wanted with
awk '/^START/,/^END/ {if($0==10 || $0==20) {getline; print} }' test.txt
Range in awk works fine, but is less flexible than using flags.
awk '/^START/ {f=1} /^END/ {f=0} f && /^(1|2)0$/ {getline;print}' file
100
2056
53
2345
45
875
Don't use ranges as they make trivial things slightly briefer but require a complete rewrite or duplicate conditions when things get even slightly more complicated.
Don't use getline unless it's an appropriate application and you have read and fully understand http://awk.info/?tip/getline.
Just let awk read your lines as designed:
$ cat tst.awk
/START/ { inBlock=1 }
/END/ { inBlock=0 }
foundTgt { print; foundTgt=0 }
inBlock && /^[12]0$/ { foundTgt=1 }
$ awk -f tst.awk file
100
2056
53
2345
45
875
Feel free to use single-character variable names and cram it all onto one line if you find that useful:
awk '/START/{b=1} /END/{b=0} f{print;f=0} b&&/^[12]0$/{f=1}' file

How to insert two lines for every data frame using awk?

I have repeating data as follows
....
4 4 4 66 79 169 150 0 40928 40938 40923 40921 40789 40000 40498
5 4 3 16 22 247 0 40168 40911 40944 40205 40000 40562
6 4 4 17 154 93 309 0 40930 40919 40903 40917 40852 40000 40419
7 3 2 233 311 0 40936 40932 40874 40000 40807
....
This data is made up of 115 data blocks, and each data block has 4000 lines in that format.
Here, I want to put two new lines (the number of lines per data block, i.e. 4000, and an empty line) at the beginning of each data block, so it looks like this:
4000
1 4 4 244 263 704 952 0 40936 40930 40934 40921 40820 40000 40570
2 4 4 215 172 305 33 0 40945 40942 40937 40580 40687 40000 40410
3 4 4 344 279 377 1945 0 40933 40915 40907 40921 40839 40000 40437
4 4 4 66 79 169 150 0 40928 40938 40923 40921 40789 40000 40498
...
3999 2 2 4079 4081 0 40873 40873 40746 40000 40634
4000 1 1 4080 0 40873 40923 40000 40345
4000
1 4 4 244 263 704 952 0 40936 40930 40934 40921 40820 40000 40570
2 4 4 215 172 305 33 0 40945 40942 40937 40580 40687 40000 40410
3 4 4 344 279 377 1945 0 40933 40915 40907 40921 40839 40000 40437
4 4 4 66 79 169 150 0 40928 40938 40923 40921 40789 40000 40498
...
Can I do this with awk or any other unix command?
My solution is more general, since the blocks can be of unequal length, as long as you restart the 1st-field counter to denote the beginning of a new block:
% cat mark_blocks
$1<count { print count; print "";
for(i=1;i<=count;i++) print l[i]; }
# executed for each line
{ l[$1] = $0; count=$1}
END { print count; print "";
for(i=1;i<=count;i++) print l[i]; }
% awk -f mark_blocks your_data > marked_data
%
The way it works is simple: awk accumulates lines in memory and prints the header lines and the accumulated data when it reaches a new block or EOF.
The (modest) trick is that the output action must take place before the usual per-line action.
A simple awk one-liner can do the job.
awk 'NR%4000==1{print "4000\n"} {print$0}' file
What it does:
print $0 prints every line.
NR%4000==1 matches the first line of each 4000-line block (lines 1, 4001, 8001, ...). When it matches, it prints 4000 followed by \n, that is, the count line plus an empty line.
NR is the number of records, which is effectively the number of lines read so far.
A simple test, inserting the header every 5 lines:
awk 'NR%5==1{print "4000\n"} {print$0}'
output:
4000
1
2
3
4
5
4000
6
7
8
9
10
4000
11
12
13
14
15
4000
16
17
18
19
20
4000
You can do it all in bash:
cat $FILE | ( let countmax=4000; let count=countmax; while read lin ; do if [ $count == $countmax ]; then let count=0; echo -e "$countmax\n" ; fi ; echo $lin ; let count=count+1 ; done )
Here we assume you are reading this data from $FILE. Then all we are doing is reading from the file and piping it into our little bash script.
The bash script reads lines one by one (with the while read lin) and increments the counter count for each line. When starting, or when the counter count reaches the value countmax (set to 4000), it prints out the 2 lines you asked for.

merge similar files in awk

I have the following files
File A
Kmax Event File - Text Format
1 6 1000
1 4143 9256 13645 16426 20490
49 4144 8820 14751 16529 20505
45 4146 8308 12303 16912 22715
75 4139 9049 14408 16447 20480
23 4137 8449 13223 16511 20498
22 4142 8795 14955 16615 20493
File B
Kmax Event File - Text Format
1 6 1000
42 4143 9203 13401 16475 20480
44 4140 8251 12302 16932 21872
849 6283 8455 12301 16415 20673
18 4148 8238 12757 16597 20484
19 4144 8268 12306 17110 21868
50 4134 8331 12663 16501 20606
988 5682 8296 12306 16577 20592
61 4147 8330 12307 16945 22497
0 4138 8333 12310 16871 22749
File C, File D, ... and all the other files have exactly the same format. In addition, the file names are: run, run%1, run%2, run%3, run%4, etc. The file number could reach even up to 30, i.e. run%30.
What I'd like to do is to merge the files in the following way
Kmax Event File - Text Format
1 6 1000
1 4143 9256 13645 16426 20490
49 4144 8820 14751 16529 20505
45 4146 8308 12303 16912 22715
75 4139 9049 14408 16447 20480
23 4137 8449 13223 16511 20498
22 4142 8795 14955 16615 20493
42 4143 9203 13401 16475 20480
44 4140 8251 12302 16932 21872
849 6283 8455 12301 16415 20673
18 4148 8238 12757 16597 20484
19 4144 8268 12306 17110 21868
50 4134 8331 12663 16501 20606
988 5682 8296 12306 16577 20592
61 4147 8330 12307 16945 22497
0 4138 8333 12310 16871 22749
I believe I can do it using
awk '{i=$1;sub(i,x);A[i]=A[i]$0} FILENAME==ARGV[ARGC-1]{print i A[i]}'
but in this way the first two lines of the second file will also be present. In addition, I don't know whether the above line will work. The problem is that I will need to merge many files at the same time. Any idea how to merge these almost identical files?
Using grouping braces in the shell
{ cat run; sed '1,2d' run%*; } > c
You can use cat and tail:
cat A > C && tail -n +3 B >> C
This will merge files A and B into a new file named C.
Using awk:
awk 'FNR==NR{print; next} FNR>2' A B > C
If you have more than one file to merge into one, you can list them after A B in the awk version, e.g. A B D; C is still the output file that receives the merged data.
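For instance, with a third file D as mentioned above:
awk 'FNR==NR{print; next} FNR>2' A B D > C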
In the cat and tail version you can repeat the tail part of the command for the other files, e.g.
cat A > C && tail -n +3 B >> C && tail -n +3 D >> C
or create some kind of loop to iterate over files.
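A minimal loop along those lines (a sketch, assuming bash and the A/B/D file names used above):
cat A > C; for f in B D; do tail -n +3 "$f" >> C; done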
Print all lines from the first file (NR==FNR) and only lines 3 onward from the rest of the files (FNR>2):
awk 'NR==FNR||FNR>2' run*