awk to use header field to count fields - awk

I am trying to use awk to count the headers and use those as field numbers. The awk below is close, but I need some expert help to make it better. My problem is twofold:
First, the awk as is ignores the field headers and defines the fields by their text (sometimes field 5 starts with NM_, other times with LRG_), as RefSeqGene.txt illustrates. I think that is because not all the fields have text, but what is consistent are the headers.
Second, I only want to pull the rows where $10 = "reference standard".
Thank you :).
awk
awk 'FNR==NR {E[$1]; next }$3 in E {print $3, $5}' panel_genes.txt RefSeqGene.txt > update.txt
example of panel genes.txt (used to search RefSeqGene.txt)
ACTA2
BRAF
BHLHB9
example of RefSeqGene.txt
#tax_id GeneID Symbol RSG LRG RNA t Protein p Category
9606 59 ACTA2 NG_011541.1 NM_001613.2 NP_001604.1 reference standard
9606 59 ACTA2 NG_011541.1 NM_001141945.1 NP_001135417.1 reference standard
9606 673 BRAF NG_007873.3 LRG_299 NM_004333.4 t1 NP_004324.2 p1 reference standard
9606 80823 BHLHB9 NG_021340.1 NM_001142524.1 NP_001135996.1 aligned
9606 80823 BHLHB9 NG_021340.1 NM_001142525.1 NP_001135997.1 aligned
9606 80823 BHLHB9 NG_021340.1 NM_001142526.1 NP_001135998.1 aligned
desired output
ACTA2 NM_001613.2
ACTA2 NM_001141945.1
BRAF NM_004333.4

This one-liner gives you the desired output:
awk 'FNR==NR{a[$0];next}
$(NF-1)$NF=="referencestandard" && $3 in a{print $3, ($5~/^NM_/?$5:$6)}' file1 file2
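$(NF-1)$NF=="referencestandard" concatenates the last two fields and compares the result with "referencestandard". With the default whitespace splitting, "reference standard" occupies the last two fields, so this stands in for the check on your $10.
If $5 begins with NM_ we take it; otherwise we take $6.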
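The question also asks about using the header names themselves as field numbers. A minimal sketch of that idea follows, under a loud assumption: the file is tab-separated, so empty cells survive as empty fields and every row has the same ten columns as the header. The tab-separated sample rows and the `col` and `want` array names are made up for illustration; they are not part of the original files.

```shell
#!/bin/sh
# Assumption: a tab-separated copy of the sample, with empty cells kept as empty fields.
tmp=$(mktemp -d)
printf 'ACTA2\nBRAF\nBHLHB9\n' > "$tmp/panel_genes.txt"
{
  printf '#tax_id\tGeneID\tSymbol\tRSG\tLRG\tRNA\tt\tProtein\tp\tCategory\n'
  printf '9606\t59\tACTA2\tNG_011541.1\t\tNM_001613.2\t\tNP_001604.1\t\treference standard\n'
  printf '9606\t673\tBRAF\tNG_007873.3\tLRG_299\tNM_004333.4\tt1\tNP_004324.2\tp1\treference standard\n'
  printf '9606\t80823\tBHLHB9\tNG_021340.1\t\tNM_001142524.1\t\tNP_001135996.1\t\taligned\n'
} > "$tmp/RefSeqGene.txt"

result=$(awk -F'\t' '
  FNR == NR { want[$1]; next }             # first file: remember the panel genes
  FNR == 1  {                               # header row of the second file
    for (i = 1; i <= NF; i++) col[$i] = i   # map header name -> column number
    next
  }
  # keep rows whose Symbol is in the panel and whose Category matches
  $(col["Symbol"]) in want && $(col["Category"]) == "reference standard" {
    print $(col["Symbol"]), $(col["RNA"])
  }
' "$tmp/panel_genes.txt" "$tmp/RefSeqGene.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

If the real file is space-separated with missing cells, this mapping does not apply, because the column positions shift from row to row; the one-liner above is the better fit in that case.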

Related

awk - no output after subtracting two matching columns in two files

I'm learning awk and I'd like to use it to get the difference between two columns in two files
If an entry in file_2 column-2 exists in file_1 column-4, I want to subtract file_2 column-3 from file_1 column-2
file_1.txt
chrom_1 1000 2000 gene_1
chrom_2 3000 4000 gene_2
chrom_3 5000 6000 gene_3
chrom_4 7000 8000 gene_4
file_2.txt
chrom_1 gene_1 114 252
chrom_9 gene_5 24 183
chrom_2 gene_2 117 269
Here's my code but I get no output:
awk -F'\t' 'NR==FNR{key[$1]=$4;file1col1[$1]=$2;next} $2 in key {print file1col1[$1]-$3}' file_1.txt file_2.txt
You are close. Indexing key by the gene name (the 4th field of file_1) and storing the value from the 2nd field will allow you to simply subtract key[$2] - $3 to get your result, e.g.
awk 'NR==FNR {key[$4] = $2; next} $2 in key {print key[$2] - $3}' file1 file2
886
2883
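(note: there is no gene_5 in file_1, so the chrom_9 line produces no output. The test $2 in key conditions the 2nd rule to execute only if the gene is present in key; without it, key["gene_5"] would be taken as 0 and the script would print -24.)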
Write the Rules Out
Sometimes it helps to write the rules for the script out rather than trying to make a 1-liner out of the script. This allows for better readability. For example:
awk '
    NR==FNR {               # Rule1 conditioned by NR==FNR (file_1)
        key[$4] = $2        # Store value from field 2 indexed by field 4
        next                # Skip to next record
    }
    $2 in key {             # Rule2 conditioned by $2 in key (file_2)
        print key[$2] - $3  # Output value from file_1 - field 3
    }
' file_1.txt file_2.txt
Further Explanation
awk will read each line of input (record) from the file(s) and it will apply each rule to the record in the order the rules appear. Here, when the record number equals the file record number (only true for file_1), the first rule is applied and then the next command tells awk to skip everything else and go read the next record.
Rule 2 is conditioned by $2 in key, which tests whether the gene name from file_2 exists as an index in key. (The value in array test does not create a new element in the array, which is a useful benefit of this test.) If the gene name exists in the key array filled from file_1, then field 3 from file_2 is subtracted from that value and the difference is output.
One of the best references to use when learning awk is The GNU Awk User's Guide. It provides an excellent reference for awk and any gawk only features are clearly marked with '#'.
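The difference between testing membership and referencing an element can be seen in a small sketch. The `count` helper and the sample key are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
result=$(awk '
  # Hypothetical helper: count the elements of an array with a portable loop.
  function count(arr,   k, n) { for (k in arr) n++; return n + 0 }
  BEGIN {
    key["gene_1"] = 1000
    if ("gene_5" in key) print "found"   # membership test: creates nothing
    after_in = count(key)
    x = key["gene_5"] + 0                # plain reference: creates key["gene_5"]
    after_ref = count(key)
    print after_in, after_ref, x
  }
')
printf '%s\n' "$result"
```

The array still has one element after the in test and two after the plain reference, and the unset element evaluates to 0 in numeric context.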

Print first column of a file and the subtraction of two columns plus a number changing the separator

I am trying to print the first column of this file as well as the difference between the fifth and fourth columns plus 1. In addition, I want to change the separator from a space to a tab.
This is the file:
A gene . 200 500 y
H gene . 1000 2000 j
T exon 1 550 650 m
U intron . 300 400 o
My expected output is:
A 301
H 1001
T 101
U 101
I've tried:
awk '{print $1'\t'$5-$4+1}' myFile
But my output is not tab separated, in fact, columns are not even separated by spaces.
I also tried:
awk OFS='\t' '{print $1 $5-$4+1}' myFile
But then I get a syntax error
Do you know how can I solve this?
Thanks!
Could you please try the following, written for the samples shown.
awk 'BEGIN{OFS="\t"} {print $1,(($5-$4)+1)}' Input_file
Explanation: Your output is not tab separated because you did not put a , (comma) between the items in print; without it awk concatenates them and prints A301 and so on. Also, if you want to set OFS on the command line, use awk -v OFS='\t' '{print $1,(($5-$4)+1)}' Input_file, where -v tells awk that you are assigning a variable, here setting OFS to a TAB. I have also used parentheses around the subtraction and addition to make the expression clearer.
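A short sketch comparing the two forms on two of the sample rows may help. The variable names `tabbed` and `joined` are chosen here only for illustration.

```shell
#!/bin/sh
rows='A gene . 200 500 y
T exon 1 550 650 m'

# With a comma, print joins its arguments with OFS (set to a tab via -v).
tabbed=$(printf '%s\n' "$rows" | awk -v OFS='\t' '{ print $1, $5 - $4 + 1 }')

# Without a comma, $1 is concatenated directly with the arithmetic result.
joined=$(printf '%s\n' "$rows" | awk '{ print $1 $5 - $4 + 1 }')

printf '%s\n' "$tabbed"
printf '%s\n' "$joined"
```

The first command prints A and 301 separated by a tab; the second prints A301 with nothing in between, which matches the behaviour described in the question.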

replace strings in column with matching value from another file using awk

I'm having a little issue trying to use awk to replace some strings in a column using another file as a reference for the replacement.
I want the strings in the third column of my File2 to be replaced by the strings in the second column of File1 when they match the string of the first column of File1.
Here are the files and the desired outcome to be more clear.
File1
AAA XZA
AAB XSZ
AAC XWQ
BAA XCD
File2
ADZ-4 128720 AAA 451351351 5135 jhgt 215
SZQ-2 036051 AAB 55654 grt
KFD-9 036266 AAC
ODS-10 036267 AAA 57321
POS-11 036268 AAC 8435435 764 frd
desired output :
ADZ-4 128720 XZA 451351351 5135 jhgt 215
SZQ-2 036051 XSZ 55654 grt
KFD-9 036266 XWQ
ODS-10 036267 XZA 57321
POS-11 036268 XWQ 8435435 764 frd
I tried the following command line.
awk 'FNR==NR{a[$1]=$2;next} {if ($3 in a){$3=a[1]}; print $0}' File1 File2
but I'm pretty sure I'm not doing something right in the second set of curly braces, since it prints out a file with the third column removed.
If I only had a few, I would happily use sed, but I have 500+ substitutions to do...
Any help would be appreciated and if you can explain so I can learn from my mistake, I would be immensely grateful.
You didn't reference the associative array in the right way. Please change:
...{if ($3 in a){$3=a[1]}; print $0...
into:
...{if ($3 in a){$3=a[$3]}; print $0
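The keys of your array a are AAA, AAB, ... rather than 1, 2, 3, .... Since a[1] was never set, it evaluates to an empty string, which is why the third column came out blank.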
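Put together, the corrected command can be sketched as below on a few of the sample rows. The XYZ-1 row is a hypothetical addition, included only to show that a line without a match passes through unchanged.

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf 'AAA XZA\nAAB XSZ\nAAC XWQ\nBAA XCD\n' > "$tmp/File1"
printf '%s\n' \
  'ADZ-4 128720 AAA 451351351 5135 jhgt 215' \
  'KFD-9 036266 AAC' \
  'XYZ-1 000001 ZZZ 42' > "$tmp/File2"   # last row is hypothetical (no match in File1)

result=$(awk '
  FNR == NR { a[$1] = $2; next }   # File1: map old string -> replacement
  $3 in a   { $3 = a[$3] }         # File2: look up by the value of $3, not by 1
  { print }                        # print every File2 line, changed or not
' "$tmp/File1" "$tmp/File2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Assigning to $3 makes awk rebuild the line with single spaces between fields, so any original multi-space or tab alignment is not preserved in the changed lines.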

unix - compare columns of two files

I have two files. The first file is a masterlist of IDs. The second file is a normal input file.
I'm trying to print only the records of input whose ID (column 3) is NOT in masterlist (column 1).
sample:
masterlist.txt
111
222
333
input.txt
col1,col2,col3,col4,col5,col6
abc,abc,111,xyz,xyz,xyz
abc,abc,222,xyz,xyz,xyz
abc,abc,333,xyz,xyz,xyz
abc,abc,444,xyz,xyz,xyz
desired output:
col3,col4,col5,col6
abc,abc,444,xyz,xyz,xyz
I have come up with this code so far but I'm not getting the correct output.
awk -F\| '!b{a[$0]; next}$3 in a {true; next} {print $3","$4","$11","$12}' masterlist.txt b=1 input.txt
Could you please try the following awk and let us know if this helps you.
awk 'FNR==NR{a[$1];next} !($3 in a)' masterlist.txt FS="," input.txt
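The FS="," placed between the two file names is an operand assignment: awk applies it only when it reaches that point in the argument list, so masterlist.txt is read with the default separator and input.txt with a comma. A minimal sketch on a shortened copy of the sample data:

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf '111\n222\n333\n' > "$tmp/masterlist.txt"
printf '%s\n' \
  'col1,col2,col3,col4,col5,col6' \
  'abc,abc,111,xyz,xyz,xyz' \
  'abc,abc,444,xyz,xyz,xyz' > "$tmp/input.txt"

result=$(awk '
  FNR == NR { a[$1]; next }   # masterlist: remember each ID
  !($3 in a)                  # input: print lines whose 3rd field is not a known ID
' "$tmp/masterlist.txt" FS=',' "$tmp/input.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

The header line is printed as well, because the literal text col3 is not in the masterlist.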

Print rows that has numbers in it

This is my data; I have more than 1000 rows. How do I get only the records whose Num column contains nothing but numbers?
Records | Num
123 | 7 Y1 91
7834 | 7PQ34-102
AB12AC|87 BWE 67
5690278| 80505312
7ER| 998
Output has to be
7ER| 998
5690278| 80505312
I'm new to linux programming, any help would be highly useful to me. thanks all
I would use awk:
awk -F'[[:space:]]*[|][[:space:]]*' '$2 ~ /^[[:digit:]]+$/' file
If you want to print the number of lines deleted as you've been asking in comments, you may use this:
awk -F'[[:space:]]*[|][[:space:]]*' '
{
    if ($2 ~ /^[[:digit:]]+$/) { print } else { c++ }
}
END { printf "%d lines deleted\n", c }' file
A short and simple GNU awk (gawk) script to filter lines with numbers in the second column (field), assuming a one-word field (e.g. 1234, or 12AB):
awk -F'|' '$2 ~ /\y[0-9]+\y/' file
We use the GNU extension for regexp operators, i.e. \y for matching the word boundary. Other than that, pretty straightforward: we split fields on | and look for isolated digits in the second field.
Edit: Since the question has been updated, and now explicitly allows for multiple words in the second field (e.g. 12 AB, 12-34, 12 34), to get lines with numbers and separators only in the second field:
awk -F'|' '$2 ~ /^[- 0-9]+$/' file
Alternatively, if we say only letters are forbidden in the second field, we can use:
awk -F'|' '$2 ~ /^[^a-zA-Z]+$/' file
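To compare a strict digits-only filter with the digits-and-separators filter, here is a rough sketch on the sample data. Two things are assumptions made for this illustration: the X1 row is hypothetical, added to show where the two filters differ, and plain bracket expressions ([ \t] and [0-9]) are used in place of the POSIX character classes.

```shell
#!/bin/sh
data='Records | Num
123 | 7 Y1 91
7834 | 7PQ34-102
AB12AC|87 BWE 67
5690278| 80505312
7ER| 998
X1| 12-34'   # last row is hypothetical

# Strict: split on | with surrounding blanks; the second field must be digits only.
strict=$(printf '%s\n' "$data" | awk -F'[ \t]*[|][ \t]*' '$2 ~ /^[0-9]+$/')

# Loose: split on | alone; the second field may hold digits, spaces and hyphens.
loose=$(printf '%s\n' "$data" | awk -F'|' '$2 ~ /^[- 0-9]+$/')

printf 'strict:\n%s\n' "$strict"
printf 'loose:\n%s\n' "$loose"
```

Both filters keep the 5690278 and 7ER rows; only the loose one also keeps the 12-34 row.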