using awk to handle 2 text files - awk

I have a text file like this small example:
chr12 2904300 2904315 peak_8 167 . 8.41241 21.74573 16.71985 65
chr1 3663184 3663341 peak_9 77 . 7.86961 12.16321 7.70843 37
chr1 6284759 6285189 peak_10 220 . 13.85268 27.34231 22.06610 332
chr1 6653468 6653645 peak_11 196 . 13.59296 24.85586 19.68392 117
chr1 8934964 8935095 peak_12 130 . 8.82937 17.84867 13.03453 36
and I have another file like the 2nd example:
ENSG00000004478|12|2904119|2904309
ENSG00000002933|7|150498624|150498638
ENSG00000173153|11|64073050|64073208
ENSG00000001626|7|117120017|117120148
ENSG00000003249|16|90085750|90085881
ENSG00000003056|12|9102084|9102551
The first example is tab separated and the 2nd example is "|" separated.
I want to select only the rows from the 1st example where the average of
columns 2 and 3 in the first example lies between the 3rd and 4th columns
of the 2nd example, and the number in the first column of the 1st example
(e.g. 12 in chr12) is equal to the 2nd column of the 2nd example.
For example, the output from these 2 examples would be:
chr12 2904300 2904315 peak_8 167 . 8.41241 21.74573 16.71985 65
I am trying to do that using awk:
awk 'FNR==NR{a[FNR]=($2+$3)/2;b[FNR]=$0;next} (FNR in a) && ($3<=a[FNR] && $4>=a[FNR]){print b[FNR]}' file1 FS="|" file2
but it does not work and returns nothing. Do you know how I can correct the code?

Solution
awk 'NR==FNR ? !((a[NR]=$1)&&(z[NR]=$0)&&(avr[NR]=($3+$2)/2)) : (($4>=avr[FNR] && avr[FNR]>=$3)&&(a[FNR]=="chr"$2)){print z[FNR]}' file1 FS="|" file2
file1 is read with the default FS (whitespace); FS="|" takes effect for file2.
Description of the awk ?: (ternary) expression, part by part:
1) NR==FNR - checks whether the record being parsed comes from the first or the second file.
NR - the total number of input records seen so far.
FNR - the input record number in the current input file.
2) While working on the first file: !((a[NR]=$1)&&(z[NR]=$0)&&(avr[NR]=($3+$2)/2))
! - the assignments are chained with && and the result is negated, so the pattern is false for the first file and nothing is printed.
a[NR]=$1 - array holds the first field of the first file.
z[NR]=$0 - array holds the lines of the first file.
avr[NR]=($3+$2)/2 - array holds the requested average from the first file.
3) Check the conditions on the second file to decide which lines to print:
a) (($4>=avr[FNR] && avr[FNR]>=$3)&&(a[FNR]=="chr"$2)){print z[FNR]}
b) ($4>=avr[FNR] && avr[FNR]>=$3) - checks that the average from the first file lies between the values of fields 3 and 4 of the second file.
c) (a[FNR]=="chr"$2) - checks that the number in field 1 (first file) and field 2 (second file) are the same.
d) If both conditions are true, print the corresponding line from the first file (z[FNR]); a written-out version of the same one-liner is sketched below.
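For readability, here is the same logic written out as two rules instead of a ternary (an equivalent sketch, not a change to the logic). Note that, like the one-liner, it pairs record N of the first file with record N of the second file:
awk '
NR==FNR {                        # first file (whitespace separated)
    a[NR]   = $1                 # chromosome name, e.g. "chr12"
    z[NR]   = $0                 # whole line, kept for printing later
    avr[NR] = ($2+$3)/2          # average of columns 2 and 3
    next
}
$4 >= avr[FNR] && avr[FNR] >= $3 && a[FNR] == "chr"$2 {   # second file ("|" separated)
    print z[FNR]
}' file1 FS="|" file2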

Related

awk - no output after subtracting two matching columns in two files

I'm learning awk and I'd like to use it to get the difference between two columns in two files.
If an entry in file_2 column-2 exists in file_1 column-4, I want to subtract file_2 column-3 from file_1 column-2.
file_1.txt
chrom_1 1000 2000 gene_1
chrom_2 3000 4000 gene_2
chrom_3 5000 6000 gene_3
chrom_4 7000 8000 gene_4
file_2.txt
chrom_1 gene_1 114 252
chrom_9 gene_5 24 183
chrom_2 gene_2 117 269
Here's my code but I get no output:
awk -F'\t' 'NR==FNR{key[$1]=$4;file1col1[$1]=$2;next} $2 in key {print file1col1[$1]-$3}' file_1.txt file_2.txt
You are close. But indexing key by the gene name (field 4) and storing the value from the 2nd field will allow you to simply subtract key[$2] - $3 to get your result, e.g.
awk 'NR==FNR {key[$4] = $2; next} $2 in key {print key[$2] - $3}' file1 file2
886
2883
(note: gene_5 does not exist in file_1, so its line is skipped; the test $2 in key conditions the 2nd rule to only execute if the gene is present in key, and without that test key[gene_5] would be taken as 0)
Write the Rules Out
Sometimes it helps to write the rules out rather than trying to make a 1-liner of the script. This allows for better readability. For example:
awk '
NR==FNR {                 # Rule1 conditioned by NR==FNR (file_1)
    key[$4] = $2          # Store value from field 2 indexed by field 4
    next                  # Skip to next record
}
$2 in key {               # Rule2 conditioned by $2 in key (file_2)
    print key[$2] - $3    # Output value from file_1 - field 3
}
' file_1.txt file_2.txt
Further Explanation
awk will read each line of input (record) from the file(s) and it will apply each rule to the record in the order the rules appear. Here, when the record number equals the file record number (only true for file_1), the first rule is applied and then the next command tells awk to skip everything else and go read the next record.
Rule 2 is conditioned by $2 in key which tests whether the gene name from file 2 exists as an index in key. (the value in array test does not create a new element in the array -- this is a useful benefit of this test). If the gene name exists in the key array filled from file_1, then field 3 from file_2 is subtracted from that value and the difference is output.
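As a small aside (not part of the original answer), this property is easy to verify: testing membership with in leaves the array untouched, while merely referencing an element creates it:
awk 'BEGIN {
    if ("x" in a) print "never printed"   # the in test does not create a["x"]
    n = 0; for (k in a) n++; print n      # prints 0 -- the array is still empty
    t = a["x"]                            # simply referencing a["x"] creates the element
    n = 0; for (k in a) n++; print n      # prints 1
}'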
One of the best references to use when learning awk is The GNU Awk User's Guide. It provides an excellent reference for awk, and any gawk-only features are clearly marked with '#'.

Print first column of a file and the subtraction of two columns plus a number, changing the separator

I am trying to print the first column of this file as well as the fifth column minus the fourth column plus 1. In addition, I want to change the separator from a space to a tab.
This is the file:
A gene . 200 500 y
H gene . 1000 2000 j
T exon 1 550 650 m
U intron . 300 400 o
My expected output is:
A 301
H 1001
T 101
U 101
I've tried:
awk '{print $1'\t'$5-$4+1}' myFile
But my output is not tab separated, in fact, columns are not even separated by spaces.
I also tried:
awk OFS='\t' '{print $1 $5-$4+1}' myFile
But then I get a syntax error
Do you know how can I solve this?
Thanks!
Could you please try the following, written with the shown samples.
awk 'BEGIN{OFS="\t"} {print $1,(($5-$4)+1)}' Input_file
Explanation: your output is not tab separated because you haven't used a , (comma) in print to emit the output separator, so the fields are simply concatenated and printed like A301 and so on. Also, in case you want to set OFS at the variable level in awk, you should use awk -v OFS='\t' '{print $1,(($5-$4)+1)}' Input_file, where -v is important to let awk know that you are defining the variable's value as a TAB here. I have also used parentheses around the subtraction and addition to make the expression clearer.
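For completeness, a small sketch (using the same myFile sample) of three equivalent ways to get the tab-separated output:
awk 'BEGIN{OFS="\t"} {print $1, $5-$4+1}' myFile    # set OFS in a BEGIN block
awk -v OFS='\t' '{print $1, $5-$4+1}' myFile        # set OFS with -v on the command line
awk '{printf "%s\t%d\n", $1, $5-$4+1}' myFile       # or format explicitly with printf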

how to extract lines which have no duplicated values in first column?

For some statistics research, I want to separate out the lines of my data that have duplicated values in the first column. I work with vim.
Suppose that a part of my data is like this:
Item_ID Customer_ID
123 200
104 134
734 500
123 345
764 347
1000 235
734 546
As you can see, some lines have equal values in the first column.
I want to generate two separate files: one of them containing just the non-repeated values and the other containing the lines with equal first-column values.
For the above example I want to have these two files:
The first one contains:
Item_ID Customer_ID
123 200
734 500
123 345
734 546
and the second one contains:
Item_ID Customer_ID
104 134
764 347
1000 235
can anybody help me?
I think awk would be a better option here.
$ awk 'FNR == NR { seen[$1]++; next } seen[$1] == 1' input.txt input.txt > uniq.txt
$ awk 'FNR == NR { seen[$1]++; next } seen[$1] > 1' input.txt input.txt > dup.txt
Prettier version of awk code:
FNR == NR {
    seen[$1]++;
    next
}
seen[$1] == 1
Overview
We loop over the text twice. By supplying the same file to our awk script twice we are effectively looping over the text twice. The first time through, we count the number of times we see the first field's value. The second time through, we output only the records whose field value count is 1. For the duplicate case we output only the lines whose field value count is greater than 1.
Awk primer
awk loops over lines (or records) in a text file/input and splits each line into fields: $1 for the first field, $2 for the second field, etc. By default fields are separated by whitespace (this can be configured).
awk runs each line through a series of rules in the form of condition { action }. Any time a condition matches, the action is taken.
Example of printing the first field of each line that matches foo:
awk '/foo/ { print $1 }' input.txt
Glory of Details
Let's take a look at finding only the lines whose first field appears exactly once.
$ awk 'FNR == NR { seen[$1]++; next } seen[$1] == 1' input.txt input.txt > uniq.txt
Prettier version for readability:
FNR == NR {
    seen[$1]++;
    next
}
seen[$1] == 1
awk 'code' input > output - run code over the input file, input, and then redirect the output to file, output
awk can take more than one input. e.g. awk 'code' input1.txt input2.txt.
Use the same input file, input.txt, twice to loop over the input twice
awk 'FNR == NR { code1; next } code2' file1 file2 is a common awk idiom which will run code1 for file1 and run code2 for file2
NR is the current record (line) number. This increments after each record
FNR is the current file's record number. e.g. FNR will be 1 for the first line in each file
next will stop executing any more actions and go to the next record/line
FNR == NR will only be true for the first file
$1 is the first field's data
seen[$1]++ - seen is an array/dictionary where we use the first field, $1, as our key and increment the value so we can get a count
$0 is the entire line
print ... prints out the given fields
print $0 will print out the entire line
just print is short for print $0
condition { print $0 } can be shortened to condition { print }, which can be shortened further to just condition
seen[$1] == 1 checks whether the first field's value count is equal to 1 and, if so, prints the line (see the equivalent forms sketched below)
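To illustrate that shortening with a hypothetical filter (not the solution itself), the following three commands print exactly the same lines:
awk '$1 == 123 { print $0 }' input.txt
awk '$1 == 123 { print }' input.txt
awk '$1 == 123' input.txt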
Here is an awk solution:
awk 'NR>1{a[$1]++;b[NR]=$1;c[NR]=$2} END {for (i=2;i<=NR;i++) print b[i],c[i] > (a[b[i]]==1?"single":"multiple")}' file
cat single
104 134
764 347
1000 235
cat multiple
123 200
734 500
123 345
734 546
PS: I skipped the header line, but handling it could be added.
This way you get one file for single hits, one for double, one for triple etc.
awk 'NR>1{a[$1]++;b[NR]=$1;c[NR]=$2} END {for (i=2;i<=NR;i++) print b[i],c[i] > "file"a[b[i]]}'
That would require some filtering of the list of lines in the buffer. If you're really into statistics research, I'd go search for a tool that is better suited than a general-purpose text editor, though.
That said, my PatternsOnText plugin has some commands that can do the job:
:2,$DeleteUniqueLinesIgnoring /\s\+\d\+$/
:w first
:undo
:2,$DeleteAllDuplicateLinesIgnoring /\s\+\d\+$/
:w second
As you want to filter on the first column, the commands' /{pattern}/ has to filter out the second column; \s\+\d\+$ matches the final number and its preceding whitespace.
:DeleteUniqueLinesIgnoring (from the plugin) gives you just the duplicates, :DeleteAllDuplicateLinesIgnoring just the unique lines. I simply :write them to separate files and :undo in between.

Select current and previous line if values are the same in 2 columns

Check the values in columns 2 and 3; if the values are the same in the previous line and the current line (for example lines 2-3 and 6-7), then print both lines separated by ,
Input file
1 1 2 35 1
2 3 4 50 1
2 3 4 75 1
4 7 7 85 1
5 8 6 100 1
8 6 9 125 1
4 6 9 200 1
5 3 2 156 2
Desired output
2,3,4,50,1,2,3,4,75,1
8,6,9,125,1,4,6,9,200,1
I tried to modify this code, but got no results:
awk '{$6=$2 $3 - $p2 $p3} $6==0{print p0; print} {p0=$0;p2=p2;p3=$3}'
Thanks in advance.
$ awk -v OFS=',' '{$1=$1; cK=$2 FS $3} pK==cK{print p0, $0} {pK=cK; p0=$0}' file
2,3,4,50,1,2,3,4,75,1
8,6,9,125,1,4,6,9,200,1
With your own code and its mechanism updated:
awk '(($2=$2) $3) - (p2 p3)==0{printf "%s", p0; print} {p0=$0;p2=$2;p3=$3}' OFS="," file
2,3,4,50,12,3,4,75,1
8,6,9,125,14,6,9,200,1
But it has an underlying problem, so it is better to use this simplified/improved way:
awk '($2=$2) FS $3==cp{print p0,$0} {p0=$0; cp=$2 FS $3}' OFS=, file
The FS is needed, check the comments under Mr. Morton's answer.
Why your code fails:
The minus - has higher precedence than concatenation (what the space does), so $2 $3 - $p2 $p3 does not group as ($2 $3) - ($p2 $p3) (see the quick check after this list).
You used $6 to save the value you want to compare, but assigning to it makes it part of the line $0 (as a new last column). You can use a temporary variable name instead.
You have a typo (p2=p2 should be p2=$2), and you used $p2 and $p3, which means "take p2's value and fetch the corresponding field"; so if p2==3 then $p2 equals $3.
You didn't set OFS, so even if your code worked, the output would be separated by spaces.
print adds a trailing newline \n, so even without the above problems, you would get 4 lines instead of the 2 lines of output you wanted.
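A quick check of the precedence point above (an illustration with literal numbers, not from the original answer):
awk 'BEGIN { print 3 4 - 3 4 }'      # prints 314: parsed as 3 (4-3) 4, i.e. the strings 3, 1 and 4 concatenated
awk 'BEGIN { print (3 4) - (3 4) }'  # prints 0: the parentheses force the concatenations first, giving 34 - 34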
Could you please try the following too.
awk 'prev_2nd==$2 && prev_3rd==$3{$1=$1;print prev_line,$0} {prev_2nd=$2;prev_3rd=$3;$1=$1;prev_line=$0}' OFS=, Input_file
Explanation of the above code:
awk '
prev_2nd==$2 && prev_3rd==$3{   ##Check whether prev_2nd and prev_3rd (saved from the previous line) equal the current line 2nd and 3rd fields; if yes, do the following.
  $1=$1                         ##Reassign $1 to itself so that $0 is rebuilt with the comma output field separator.
  print prev_line,$0            ##Print the previous line and the current line.
}                               ##Close this condition block.
{
  prev_2nd=$2                   ##Save the current line $2 in the prev_2nd variable.
  prev_3rd=$3                   ##Save the current line $3 in the prev_3rd variable.
  $1=$1                         ##Reassign $1 to itself so the comma separator is applied to this line too.
  prev_line=$0                  ##Save the current (comma-separated) line in prev_line.
}
' OFS=, Input_file              ##Set OFS (output field separator) to comma and name the Input_file.
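A minimal illustration (not part of the answer above) of why the $1=$1 reassignment matters: awk only rebuilds $0 with OFS when some field is assigned to:
echo 'a b c' | awk 'BEGIN{OFS=","} {print}'          # a b c  -- $0 untouched, OFS never applied
echo 'a b c' | awk 'BEGIN{OFS=","} {$1=$1; print}'   # a,b,c  -- assigning a field rebuilds $0 with OFS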

incorrect count of unique text in awk

I am getting the wrong counts using the awk below. The unique text in $5 before the - is supposed to be counted.
input
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 15
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 2 16
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 3 16
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3-46|gc=68.2 553 567
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3-46|gc=68.2 554 569
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 46 203
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 47 206
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 48 206
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 49 207
current output
1
desired output (AGRN, TAS1R3 and PIK3CD are unique and counted)
3
awk
awk -F '[- ]' '!seen[$6]++ {n++} END {print n}' file
Try
awk -F '-| +' '!seen[$6]++ {n++} END {print n}' file
Your problem is that when ' ' (a space) is included as part of a regex to form FS (via -F) it loses its special default-value behavior, and only matches spaces individually as separators.
That is, the default behavior of recognizing runs of whitespace (any mix of spaces and tabs) as a single separator no longer applies.
Thus, [- ] won't do as the field separator, because it recognizes the empty strings between adjacent spaces as empty fields.
You can verify this by printing the field count - based on your intended parsing, you're expecting 9 fields:
$ awk -F '[- ]' '{ print NF }' file
17 # !! 8 extra fields - empty fields
$ awk -F '-| +' '{ print NF }' file
9 # OK, thanks to modified regex
You need the alternation -| + to ensure that runs of spaces are treated as a single separator; if tabs should also be matched, use '-|[[:blank:]]+'.
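A minimal illustration of that difference on a tiny assumed input (two spaces between a and b):
printf 'a  b\n' | awk '{ print NF }'             # 2 -- default FS: a run of blanks is one separator
printf 'a  b\n' | awk -F'[- ]' '{ print NF }'    # 3 -- each space separates, leaving an empty middle field
printf 'a  b\n' | awk -F'-| +' '{ print NF }'    # 2 -- the alternation restores "one or more spaces"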
Including "-" in FS might be fine in some cases, but in general if the actual field separator is something else (e.g. whitespace, as seems to be the case here, or perhaps a tab), it would be far better to set FS according to the specification of the file format. In any case, it's easy to extract the subfield of interest. In the following, I'll assume the FS is whitespace.
awk '{split($5, a, "-"); if (!(count[a[1]]++)) n++ }
END {print n}'
If you want the details:
awk '{split($5, a, "-"); count[a[1]]++}
END { for(i in count) {print i, count[i]}}'
Output of the second incantation:
AGRN 3
PIK3CD 4
TAS1R3 2