awk not printing expected result from file - awk

In the awk below I expected that adding NR==2 would check only the 2nd line's $1 value (the 1) and ensure that it is a number. If it is, print Index is a number; otherwise print
Index is not a number. It seems close, but the result is not what I expected... maybe I used the wrong variable? Thank you :).
file.txt
Index Chr Start End Ref Alt Quality Freq Score HGMD
1 1 10 100 A - GOOD .002 2 .
2 1 100 1000 - C STRAND BIAS .036 10 .
3 5 50 500 AA T GOOD 1 5 .
awk
awk -F'\t' 'NR==2 $1 ~ /^[[:digit:]]/ {print "Index is a number"} ELSE {print "Index is not a number"}' file.txt
Index is a number
Index is a number
Index is a number
Index is a number
Index is a number
desired output
Index is a number

awk 'NR==2 {print "Index is "($1~/^[0-9]+$/?"":"not ") "a number";exit}' file
If you only want to check line 2, exit after processing it so that awk does not read the rest of the file.
If you need a different field separator, add the -F option.

I think you are looking for something like below,
awk 'BEGIN{FS="\t"} NR==2 { if (match($1,/^[[:digit:]]/)) { print "Index is a number" } else { print "Index is not a number" } }' file
Index is a number
You can of course extend this to every line by dropping NR==2, or use NR>1 instead to skip only the header.
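As a minimal sketch of the NR>1 variant, the following checks every data line and reports the line number. The tab-separated sample rows (including the deliberately non-numeric x3) are made up for illustration, and the pattern is anchored at both ends so the whole field must be digits:

```shell
# Hypothetical sample data: a header plus three rows; the last Index is not numeric.
data=$(mktemp)
printf 'Index\tChr\n1\t1\n2\t1\nx3\t5\n' > "$data"

# NR>1 skips the header; every remaining line is tested.
result=$(awk -F'\t' 'NR>1 {
    if ($1 ~ /^[0-9]+$/) { print "line " NR ": Index is a number" }
    else                 { print "line " NR ": Index is not a number" }
}' "$data")

printf '%s\n' "$result"
rm -f "$data"
```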

Use the following approach:
awk 'NR==2{printf("Index is%s a number\n", ($1~/^[0-9]+$/)? "":" not")}' file.txt
The output:
Index is a number
$1~/^[0-9]+$/ - tests whether the first field consists only of digits
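The difference between a pattern anchored only at the start and one anchored at both ends can be seen with a small sketch. The helper name check and the test values are hypothetical, chosen only to show where the two patterns disagree:

```shell
# Hypothetical helper: runs both patterns against a single value.
check() {
    printf '%s\n' "$1" | awk '{
        loose  = ($1 ~ /^[0-9]/)   ? "yes" : "no"   # first character is a digit
        strict = ($1 ~ /^[0-9]+$/) ? "yes" : "no"   # every character is a digit
        print $1 ": loose=" loose " strict=" strict
    }'
}

r1=$(check 12)
r2=$(check 1a)
r3=$(check Index)
printf '%s\n' "$r1" "$r2" "$r3"
```

A value such as 1a passes the loose test but not the strict one, which is why the anchored form is the safer check for "is a number".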

Related

awk - no output after subtracting two matching columns in two files

I'm learning awk and I'd like to use it to get the difference between two columns in two files
If an entry in file_2 column-2 exists in file_1 column-4, I want to subtract file_2 column-3 from file_1 column-2
file_1.txt
chrom_1 1000 2000 gene_1
chrom_2 3000 4000 gene_2
chrom_3 5000 6000 gene_3
chrom_4 7000 8000 gene_4
file_2.txt
chrom_1 gene_1 114 252
chrom_9 gene_5 24 183
chrom_2 gene_2 117 269
Here's my code but I get no output:
awk -F'\t' 'NR==FNR{key[$1]=$4;file1col1[$1]=$2;next} $2 in key {print file1col1[$1]-$3}' file_1.txt file_2.txt
You are close. But indexing key by the gene name (the 4th field) and storing the value from the 2nd field will allow you to simply subtract key[$2] - $3 to get your result, e.g.
awk 'NR==FNR {key[$4] = $2; next} $2 in key {print key[$2] - $3}' file1 file2
886
2883
(note: there is no gene_5 in file_1, so without a guard key["gene_5"] would be taken as 0. The test $2 in key conditions the 2nd rule to only execute if the gene is present in key, so that line is skipped)
Write the Rules Out
Sometimes it helps to write the rules for the script out rather than trying to make a 1-liner out of the script. This allows for better readability. For example:
awk '
NR==FNR {                   # Rule1 conditioned by NR==FNR (file_1)
    key[$4] = $2            # Store value from field 2 indexed by field 4
    next                    # Skip to next record
}
$2 in key {                 # Rule2 conditioned by $2 in key (file_2)
    print key[$2] - $3      # Output value from file_1 - field 3
}
' file_1.txt file_2.txt
Further Explanation
awk will read each line of input (record) from the file(s) and it will apply each rule to the record in the order the rules appear. Here, when the record number equals the file record number (only true for file_1), the first rule is applied and then the next command tells awk to skip everything else and go read the next record.
Rule 2 is conditioned by $2 in key which tests whether the gene name from file 2 exists as an index in key. (the value in array test does not create a new element in the array -- this is a useful benefit of this test). If the gene name exists in the key array filled from file_1, then field 3 from file_2 is subtracted from that value and the difference is output.
One of the best references to use when learning awk is The GNU Awk User's Guide. It provides an excellent reference for awk and any gawk-only features are clearly marked with '#'.
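The point about the in test not creating array elements can be checked with a short sketch. The count function below is a hypothetical helper written for this illustration (it is not part of the answer), used so the example does not rely on length(array):

```shell
result=$(awk 'BEGIN {
    key["gene_1"] = 1000

    # Membership test: does not add "gene_5" to the array.
    if (!("gene_5" in key)) print "after in-test: " count(key)

    # A plain reference creates an empty element as a side effect.
    if (key["gene_5"] == "") print "after reference: " count(key)
}
# Hypothetical helper: counts elements portably with a for-in loop.
function count(arr,    k, n) { n = 0; for (k in arr) n++; return n }')
printf '%s\n' "$result"
```

The array holds one element after the in test and two after the plain reference, which is the side effect the in test avoids.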

how to extract lines which have no duplicated values in first column?

For some statistics research, I want to separate my data which have duplicated value in first column. I work with vim.
suppose that a part of my data is like this:
Item_ID Customer_ID
123 200
104 134
734 500
123 345
764 347
1000 235
734 546
As you can see, some lines have equal values in the first column.
I want to generate two separate files: one containing the lines whose first-column value is repeated, and the other containing the lines whose first-column value occurs only once.
For the above example I want to have these two files:
first one contains:
Item_ID Customer_ID
123 200
734 500
123 345
734 546
and second one contains:
Item_ID Customer_ID
104 134
764 347
1000 235
can anybody help me?
I think awk would be a better option here.
$ awk 'FNR == NR { seen[$1]++; next } seen[$1] == 1' input.txt input.txt > uniq.txt
$ awk 'FNR == NR { seen[$1]++; next } seen[$1] > 1' input.txt input.txt > dup.txt
Prettier version of awk code:
FNR == NR {
    seen[$1]++;
    next
}
seen[$1] == 1
Overview
We loop over the text twice. By supplying the same file to our awk script twice we are effectively looping over the text twice. The first time through the loop we count the number of times we see our field's value. The second time through the loop we output only the records which have a field value count of 1. For the duplicate line case we only output lines which have field value counts greater than 1.
Awk primer
awk loops over lines (or records) in a text file/input and splits each line into fields. $1 for the first field, $2 for the second field, etc. By default fields are separated by whitespaces (this can be configured).
awk runs each line through a series of rules in the form of condition { action }. Any time a condition matches then action is taken.
Example of printing the first field of every line that matches foo:
awk '/foo/ { print $1 }' input.txt
Glory of Details
Let's take a look at finding only the lines whose first field appears exactly once.
$ awk 'FNR == NR { seen[$1]++; next } seen[$1] == 1' input.txt input.txt > uniq.txt
Prettier version for readability:
FNR == NR {
    seen[$1]++;
    next
}
seen[$1] == 1
awk 'code' input > output - run code over the input file, input, and then redirect the output to file, output
awk can take more than one input. e.g. awk 'code' input1.txt input2.txt.
Use the same input file, input.txt, twice to loop over the input twice
awk 'FNR == NR { code1; next } code2' file1 file2 is a common awk idiom which will run code1 for file1 and run code2 for file2
NR is the current record (line) number. This increments after each record
FNR is the current file's record number. e.g. FNR will be 1 for the first line in each file
next will stop executing any more actions and go to the next record/line
FNR == NR will only be true for the first file
$1 is the first field's data
seen[$1]++ - seen is an array/dictionary where we use the first field, $1, as our key and increment the value so we can get a count
$0 is the entire line
print ... prints out the given fields
print $0 will print out the entire line
just print is short for print $0
condition { print $0 } can be shortened to condition { print } which can be shortened further to just condition
seen[$1] == 1 checks whether the first field's value count is equal to 1 and, if so, prints the line
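The two commands above can also be folded into one run. This is a minimal sketch, assuming the header should be copied to both files (as in the question's desired output); the awk variables dup and uniq, the file names and the temporary directory are only illustrative:

```shell
work=$(mktemp -d)
cat > "$work/input.txt" <<'EOF'
Item_ID Customer_ID
123 200
104 134
734 500
123 345
764 347
1000 235
734 546
EOF

awk -v dup="$work/dup.txt" -v uniq="$work/uniq.txt" '
FNR == NR { if (FNR > 1) seen[$1]++; next }     # pass 1: count keys, skipping the header
FNR == 1  { print > dup; print > uniq; next }   # pass 2: copy the header to both files
          { print > (seen[$1] > 1 ? dup : uniq) }
' "$work/input.txt" "$work/input.txt"

dup_rows=$(cat "$work/dup.txt")
uniq_rows=$(cat "$work/uniq.txt")
printf '%s\n\n%s\n' "$dup_rows" "$uniq_rows"
rm -rf "$work"
```

The file is still read twice, but awk is started only once and each record is routed to one of the two outputs by its count.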
Here is an awk solution:
awk 'NR>1{a[$1]++;b[NR]=$1;c[NR]=$2} END {for (i=2;i<=NR;i++) print b[i],c[i] > (a[b[i]]==1?"single":"multiple")}' file
cat single
104 134
764 347
1000 235
cat multiple
123 200
734 500
123 345
734 546
PS: I skipped the header line, but handling it could be added.
This way you get one file for single hits, one for double, one for triple etc.
awk 'NR>1{a[$1]++;b[NR]=$1;c[NR]=$2} END {for (i=2;i<=NR;i++) print b[i],c[i] > ("file" a[b[i]])}' file
That would require some filtering of the list of lines in the buffer. If you're really into statistics research, I'd go search for a tool that is better suited than a general-purpose text editor, though.
That said, my PatternsOnText plugin has some commands that can do the job:
:2,$DeleteUniqueLinesIgnoring /\s\+\d\+$/
:w first
:undo
:2,$DeleteAllDuplicateLinesIgnoring /\s\+\d\+$/
:w second
As you want to filter on the first column, the commands' /{pattern}/ has to filter out the second column; \s\+\d\+$ matches the final number and its preceding whitespace.
:DeleteUniqueLinesIgnoring (from the plugin) gives you just the duplicates, :DeleteAllDuplicateLinesIgnoring just the unique lines. I simply :write them to separate files and :undo in between.

How to find the difference between two files using multiple conditions?

I have two files file1.txt and file2.txt like below -
cat file1.txt
2016-07-20-22 4343250019 1003116 001 data45343 25-JUL-16 11-MAR-16 1 N 0 0 N
2016-06-20-22 654650018 1003116 001 data45343 25-JUL-17 11-MAR-16 1 N 0 0 N
cat file2.txt
2016-07-20-22|9|1003116|001|data45343|25-JUL-16 11-MAR-16|1|N|0|0|N|hello|table|one
2016-06-20-22|8|1003116|001|data45343|25-JUL-17 11-MAR-16|1|N|0|0|N|hi|this|kill
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one
Requirement is to fetch the records which are not available in
file1.txt using below condition.
file1.txt file2.txt
col1(date) col1(Date)
col2(number: 4343250019 ) col2(last value of number: 9)
col3(number) col3(number)
col5(alphanumeric) col5(alphanumeric)
Expected Output :
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|NULL|0|0|N|kill|boll|one
This output line is not available in file1.txt but is available in
file2.txt according to the matching criteria.
I was trying below steps to achieve this output -
###Replacing the space/tab from the file1.txt with pipe
awk '{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10}' OFS="|" file1.txt > file1.txt1
### Looping on a combination of four column of file1.txt1 with combination of modified column of file2.txt and output in output.txt
awk 'BEGIN{FS=OFS="|"} {a[$1FS$2FS$3FS$5];next} {(($1 FS substr($2,length($2),1) FS $3 FS $5) in a) print $0}' file2.txt file1.txt1 > output.txt
###And finally, replace the "N" from column 8th and put "NULL" if the value is "N".
awk -F'|' '{ gsub ("N","NULL",$8);print}' OFS="|" output.txt > output.txt1
What is the issue?
My 2nd operation is not working, and I would like to combine all 3 operations into one.
awk -F'[|]|[[:blank:]]+' 'FNR==NR{E[$1($2%10)$3$5]++;next}!($1$2$3$5 in E)' file1.txt file2.txt
and your sample output is wrong, it should be (last field if different: data45333)
2016-07-20-22|9|1003116|001|data45333|25-JUL-16 11-MAR-16|1|N|0|0|N|hello|table|one
2017-06-22-22|8|1003116|001|data45343|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one
Commented code
# separator for both file first with blank, second with `|`
awk -F'[|]|[[:blank:]]+' '
# for first file
FNR==NR{
# create an index entry based on the 4 fields. The format of the fields allows them to be concatenated directly without a separator (the key stays unambiguous)
E[ $1 ( $2 % 10 ) $3 $5 ]++
# for this line (file) don't go further
next
}
# for next file lines
# if not in the index list of entry, print the line (default action)
! ( ( $1 $2 $3 $5 ) in E ) { print }
' file1.txt file2.txt
Input
$ cat f1
2016-07-20-22 4343250019 1003116 001 data45343 25-JUL-16 11-MAR-16 1 N 0 0 N
2016-06-20-22 654650018 1003116 001 data45343 25-JUL-17 11-MAR-16 1 N 0 0 N
$ cat f2
2016-07-20-22|9|1003116|001|data45343|25-JUL-16 11-MAR-16|1|N|0|0|N|hello|table|one
2016-06-20-22|8|1003116|001|data45343|25-JUL-17 11-MAR-16|1|N|0|0|N|hi|this|kill
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one
Output
$ awk 'FNR==NR{a[$1,substr($2,length($2)),$3,$5];next}!(($1,$2,$3,$5) in a)' f1 FS="|" f2
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|N|0|0|N|kill|boll|one
Explanation
awk ' # call awk.
FNR==NR{ # This is true when awk reads first file
a[$1,substr($2,length($2)),$3,$5] # array a where index being $1(field1), last char from $2, $3 and $5
next # stop processing go to next line
}
!(($1,$2,$3,$5) in a) # true when index $1,$2,$3,$5 (read from f2) does not exist in array a
' f1 FS="|" f2 # Read f1 then
# set FS and then read f2
FNR==NR is true when the number of records read so far in the current file equals the number of records read so far across all files, which (assuming the first file is not empty) can only happen while reading the first file.
a[$1,substr($2,length($2)),$3,$5] populates array "a" with an index built from the first field, the last character of the second field, the third field and the fifth field of the current record of file1.
next moves on to the next record, so the processing intended for records from the second file is skipped.
!(($1,$2,$3,$5) in a) is true if the index built from fields $1, $2, $3 and $5 of the current record of file2 does not exist in array a. (! is the logical NOT operator: it reverses the logical state of its operand.) awk then performs its default action, print $0, on the file2 record.
f1 FS="|" f2 reads file1 (f1), sets the field separator to "|" after the first file has been read, and then reads file2 (f2).
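The comma-separated subscripts used in this answer are joined with awk's SUBSEP character, which keeps the parts of the key apart. A small sketch, using made-up string values rather than the question's data, shows why that can matter compared with plain concatenation:

```shell
result=$(awk 'BEGIN {
    # Plain concatenation: two different field pairs collapse into one key.
    plain["ab" "c"]
    if (("a" "bc") in plain) print "concatenated keys collide"

    # Comma subscripts are joined with SUBSEP, so the pairs stay distinct.
    multi["ab", "c"]
    if (!(("a", "bc") in multi)) print "SUBSEP keys stay distinct"
}')
printf '%s\n' "$result"
```

With the fixed-format fields in the question, plain concatenation happens to be safe; the SUBSEP form simply removes the need to reason about it.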
--edit--
When the file is huge, around 60GB (900 million rows), it is not a good
idea to process it twice. The 3rd operation (replacing "N" with
"NULL" in column 8, awk -F'|' '{ gsub ("N","NULL",$8);print}'
OFS="|" output.txt) can be done in the same pass:
$ awk 'FNR==NR{
a[$1,substr($2,length($2)),$3,$5];
next
}
!(($1,$2,$3,$5) in a){
sub(/N/,"NULL",$8);
print
}' f1 FS="|" OFS="|" f2
2017-06-22-22|8|1003116|001|data45333|25-JUL-17 11-MAR-16|1|NULL|0|0|N|kill|boll|one
You can try this awk:
awk -F'[ |]+' 'NR==FNR{su=substr($2,length($2),1); a[$1":"su":"$3":"$5]=1;next} !a[$1":"$2":"$3":"$5]{print $0}' f1 f2
Here,
a[] - an associative array
$1":"su":"$3":"$5 - this forms the key for the array index. su is the last digit of field $2 (su=substr($2,length($2),1)). The value 1 is then assigned to this key.
NR==FNR{...;next} - this block works for processing f1.
Update:
awk 'NR==FNR{$2=substr($2,length($2),1); a[$1":"$2":"$3":"$5]=1;next} !a[$1":"$2":"$3":"$5]{gsub(/^N$/,"NULL",$8);print}' f1 FS="|" OFS='|' f2

awk script for finding smallest value from column

I am a beginner in AWK, so please help me to learn it. I have a text file named snd and its values are
1 0 141
1 2 223
1 3 250
1 4 280
I want to print the entire row where the third column value is the minimum.
This should do it:
awk 'NR == 1 {line = $0; min = $3}
NR > 1 && $3 < min {line = $0; min = $3}
END{print line}' file.txt
EDIT:
What this does is:
Remember the 1st line and its 3rd field.
For the other lines, if the 3rd field is smaller than the min found so far, remember the line and its 3rd field.
At the end of the script, print the line.
Note that the test NR > 1 can be skipped, as for the 1st line, $3 < min will be false. The NR == 1 ... test, however, is needed for data like this: min is unset at the beginning of the script and compares as zero, so with non-negative values $3 < min would never be true. It could only be skipped if the 3rd column were always negative.
EDIT2:
This is shorter:
awk 'NR == 1 || $3 < min {line = $0; min = $3}END{print line}' file.txt
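An unset awk variable compares as zero in a numeric comparison, so the NR == 1 seeding matters for data like the sample, where every value is positive. A minimal sketch using the question's rows (the temporary file and the variable names unseeded and seeded are only for illustration):

```shell
# The sample data from the question, written to a temporary file.
snd=$(mktemp)
printf '1 0 141\n1 2 223\n1 3 250\n1 4 280\n' > "$snd"

# Without seeding: min is unset, compares as 0, and no line is ever stored.
unseeded=$(awk '$3 < min {line = $0; min = $3} END {print line}' "$snd")

# With NR == 1 seeding, the minimum row is found.
seeded=$(awk 'NR == 1 || $3 < min {line = $0; min = $3} END {print line}' "$snd")

printf 'unseeded: [%s]\nseeded:   [%s]\n' "$unseeded" "$seeded"
rm -f "$snd"
```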
You don't need awk to do what you want. Use sort
sort -nk 3 file.txt | head -n 1
Results:
1 0 141
I think sort is an excellent answer, unless for some reason what you're looking for is the awk logic to do this in a larger script, or you want to avoid the extra pipes, or the purpose of this question is to learn more about awk.
$ awk 'NR==1{x=$3;line=$0} $3<x{x=$3;line=$0} END{print line}' snd
Broken out into pieces, this is:
NR==1 {x=$3;line=$0} -- On the first line, set an initial value for comparison and store the line.
$3<x{x=$3;line=$0} -- On each line, compare the third field against our stored value, and if the condition is true, store the new value and the line. (We could make this run only on NR>1, but it doesn't matter.)
END{print line} -- At the end of our input, print whatever line we've stored.
You should read man awk to learn about any parts of this that don't make sense.
a short answer for this would be:
sort -k3,3n temp|head -1
since you have asked for awk:
awk '{if(min>$3||NR==1){min=$3;a[$3]=$0}}END{print a[min]}' your_file
But i prefer the shorter one always.
To calculate the smallest value in any column, say the last column:
awk '(FNR==1){a=$NF} {a=$NF < a?$NF:a} END {print a}'
this will only print the smallest value of the column.
If the complete line is needed, it is better to use sort (with -r the line with the smallest value ends up last):
sort -r -n -t [delimiter] -k[column] [file name]
awk -F ";" '(NR==1){a=$NF;b=$0} {a=$NF<a?$NF:a;b=$NF>a?b:$0} END {print b}' filename
This will print the line with the smallest value; if several lines share that value, the one encountered last is printed.
awk 'BEGIN {OFS=FS=","}{if ( a[$1]>$2 || a[$1]=="") {a[$1]=$2;} if (b[$1]<$2) {b[$1]=$2;} } END {for (i in a) {print i,a[i],b[i]}}' input_file
We use || a[$1]=="" because when 1st value of field 1 is encountered it will have null in a[$1].
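A sketch of the same per-key idea on made-up comma-separated input. Here the maximum array b gets the same empty-string guard as a, which is an addition for this illustration, and the output is piped through sort because the order of for (i in a) is unspecified:

```shell
result=$(printf 'a,5\nb,2\na,3\nb,9\na,7\n' |
    awk 'BEGIN {OFS = FS = ","}
    {
        if (a[$1] == "" || $2 < a[$1]) a[$1] = $2   # first sighting, or a new minimum
        if (b[$1] == "" || $2 > b[$1]) b[$1] = $2   # first sighting, or a new maximum
    }
    END {for (i in a) print i, a[i], b[i]}' |
    sort)   # for-in order is unspecified, so sort for stable output
printf '%s\n' "$result"
```

Each output line has the form key,min,max.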

Counting and matching process

I have a matching problem with awk :(
I want to count how often each first-column value occurs in main.file and, if the count is more than 2, print the first and the second column.
main.file
1725009 7211378
3353866 11601802
3353866 8719104
724973 3353866
3353866 7211378
For example, "3353866" occurs 3 times in the first column, so output.file will look like this:
output.file
3353866 11601802
3353866 8719104
3353866 7211378
How can I do this in awk?
If you mean items with at least 3 occurrences, you can collect occurrences in one array and the collected values as a preformatted or delimited string in another.
awk '{o[$1]++;v[$1]=v[$1] "\n" $0}
END{for(k in o){if(o[k]<3)continue;
print substr(v[k],2)}}' main.file
Untested, not at my computer. The output order will be essentially random; you'll need another variable to keep track of line numbers if you require the order to be stable.
This would be somewhat less hackish in Perl or Python, where a hash/dict can contain a structured value, such as a list.
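One way to make the output order stable, sketched under the assumption that groups should appear in the order their key was first seen: the extra array order and the counter n are hypothetical additions to the o and v arrays used above.

```shell
main=$(mktemp)
cat > "$main" <<'EOF'
1725009 7211378
3353866 11601802
3353866 8719104
724973 3353866
3353866 7211378
EOF

result=$(awk '
    !($1 in o) { order[++n] = $1 }                 # hypothetical extra array: keys in first-seen order
    { o[$1]++; v[$1] = v[$1] "\n" $0 }
    END {
        for (i = 1; i <= n; i++) {
            k = order[i]
            if (o[k] >= 3) print substr(v[k], 2)   # drop the leading newline
        }
    }' "$main")
printf '%s\n' "$result"
rm -f "$main"
```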
Another approach is to run through the file twice: it's a little bit slower, but the code is very neat:
awk '
NR==FNR {count[$1]++; next}
count[$1] > 2 {print}
' main.file main.file
awk '{store[$1"-"lines[$1]] = $0; lines[$1]++;}
END {for (l in store) {
split(l, pair, "-"); if (lines[pair[1]] > 2) { print store[l] } } }'
One approach is to track all the records seen, the corresponding key $1 for each record, and how often each key occurs. Once you've recorded those for all the lines, you can then iterate through all the records stored, only printing those for which the count of the key is greater than two.
awk '{
    record[NR] = $0;
    key[$0] = $1;
    count[$1]++
}
END {
    for (n=1; n <= NR; n++) {
        if (count[key[record[n]]] > 2) {
            print record[n]
        }
    }
}'
Sort first, and then use awk to print only when you have 3 times or more the 1st field:
cat your_file | sort -n | awk 'prev == $1 {count++; p0=p1; p1=p2; p2=$2}
prev != $1 {prev=$1; count=1; p2=$2}
count == 3 {print $1 " " p0; print $1 " " p1; print $1 " " p2}
count > 3 {print $1 " " $2}'
This keeps awk from using too much memory when the input file is big.
Based on how the question looks and the Ray Toal edit, I'm guessing you mean based on count, so something like this works:
awk '!y[$1] {y[$1] = 1} x[$1] {if(y[$1]==1) {y[$1]=2; print $1, x[$1]}; print} {x[$1] = $2}'