I'm trying to insert lines into a file, where the data being inserted is based on data stored in another file. I've tried this in both Sed and Awk, but can't figure out in either how to access the second file. This is probably a stretch for Sed, perhaps less so for Awk?
The main file:
# alpha --
some data
some more data
# beta --
some data
some more data
# gamma --
some data
some more data
The lookup file:
alpha This is a description of alpha
gamma This guys description
delta And a third description
The result should look like this:
# alpha --
Description = This is a description of alpha
some data
some more data
# beta --
some data
some more data
# gamma --
Description = This guys description
some data
some more data
Notice that the lookup file may not have a description for the item, and that's ok; the "Description = " line will just be omitted.
I figured this much out in Awk, but don't know how to reference the lookup file:
awk '{
if ($0 ~ /^# [^ ]* --/) {
print $0;
print "Description = "; # How to lookup $2's description??
} else {
print $0;
}
}' <file1.txt
How can I obtain the description from the second file using Awk? Or is there a better tool for this? Thanks!
Another similar awk (note that \S and \s are GNU awk extensions):
$ awk 'NR==FNR {k=$1; sub(/^\S+\s+/,"Description = "); dict[k]=$0; next}
1;
/^#/ {if($2 in dict) print dict[$2]}' dict file
You could do something like this, supplying both files on the awk command line in the logical order (descriptions first, so it can read and store them, followed by the data that needs them inserted):
$ awk '(NR == FNR) {
desc[$1]=$2;
for (i=3;i<=NF;i++) {
desc[$1]=desc[$1]" "$i
};
}
(NR > FNR) {
print;
if (/^#/) {
print "Description = "desc[$2];
}
}' desc.txt main.txt
Which produces this output given your sample file contents:
# alpha --
Description = This is a description of alpha
some data
some more data
# beta --
Description =
some data
some more data
# gamma --
Description = This guys description
some data
some more data
Explanation:
The awk variable NR contains the Number of Records seen so far. Normally, a record is a line - although you can change the record separator - so this is effectively the current line number, counted continuously across all the files being processed. With the sample files above (a 3-line lookup file and a 9-line main file), its value will run from 1 to 12.
The variable FNR (File Number of Records) works the same way, but resets to 1 at the start of each new file. So in this case its value will run from 1 to 3 and then from 1 to 9.
By comparing these two values, the program can determine which file is currently being processed. If NR and FNR are the same, we know we're in the first file, and use the contents of the line to populate the associative array desc. The first field ($1) is the key; we concatenate the rest of the fields together to form the value.
If NR is not equal to FNR (it can only be greater, never less), we know we're in the second file. In that case, we first print the line (which we always do, so we just make it unconditional instead of repeating the statement). Then we check to see if we need to append the description. If we do, look it up in the desc array - using $2 (the second whitespace-separated field on the line, the first being the "#") as the lookup key.
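As the question notes, the "Description = " line should be omitted entirely when the lookup file has no entry for an item. A small variation of the script above guards the print with an "in" membership test (a sketch, run here against a shortened version of the sample files):

```shell
# Recreate (shortened) sample files from the question.
cat > desc.txt <<'EOF'
alpha This is a description of alpha
gamma This guys description
delta And a third description
EOF
cat > main.txt <<'EOF'
# alpha --
some data
# beta --
some data
EOF

# Only print the Description line if the key exists in the array.
out=$(awk '(NR == FNR) {
    key = $1
    sub(/^[^ ]+ +/, "")      # strip the key, keep the description text
    desc[key] = $0
    next
}
{
    print
    if (/^#/ && ($2 in desc)) print "Description = " desc[$2]
}' desc.txt main.txt)
printf '%s\n' "$out"
```

Since beta has no entry in desc.txt, no Description line is emitted for it, matching the requested output.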
Related
I've tried to use Awk to do the following:
I have a large txt file with first column the name of a gene and different values, essentially numeric, in each column.
Now I have a file with a list of genes (not all genes, just a subset) that I want to modify.
Initially I just removed lines using something I found in a forum
awk -F '\t' ' FILENAME=="gene_list" {arr[$1]; next} # create an array without values
!($1 in arr)' gene_list original_file.txt > modified_file.txt
This worked great but now I need to keep all rows (in the same order) but modify these genes to do something like:
if ($1 in arr) {print $1, $2, $3-($4/10), $4}
else {print $0}
So you see, this time, if it is different (the gene is not in my list), I want to keep the whole line, otherwise I want to keep the whole line but modify the value in one column by a given number.
If you could include something so that the value remains an integer, that would be great. I'll also have to replace the value by 0 if it becomes negative, but this I know how to do, at least in a separate command.
Edit: minimal example:
list of genes in a txt file, one under the other:
ccl5
cxcr4
setx
File to modify: (I put comas as field separator here, but there should be tab to separate the fields)
ccl4,3,18000,50000
ccl5,4,400,5000
cxcr4,5,300,2500
apoe,4,100,90
setx,3,200,1903
Expected output: (I subtract a tenth of the 4th column from the 3rd column when the gene in the first column matches a gene in my separate txt file; otherwise I keep the full line unchanged)
ccl4,3,18000,50000
ccl5,4,0,5000
cxcr4,5,50,2500
apoe,4,100,90
setx,3,10,1903
Just spell out the arithmetic constraints.
The following is an attempt to articulate it in idiomatic Awk.
if (something) { print } can be rearticulated as just something. So a bare 1 (which is always true) is a common idiom for "print all lines" (if you reach this point in the script before hitting next).
Rounding a floating-point number can be done with sprintf("%1.0f", n), which rounds to the nearest integer (whereas int(n) simply truncates the fraction).
awk 'BEGIN { FS=OFS="\t" }
FILENAME=="gene_list" {arr[$1]; next}
$1 in arr { x=sprintf("%1.0f", $3-($4/10));
if (x<0) x=0; print $1, $2, x, $4; next }
1' gene_list original_file.txt > modified_file.txt
Demo: https://ideone.com/oDjKhf
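The real data is tab-separated, but the minimal example above uses commas, so here is the same script with FS/OFS switched to "," so it can be run directly against that sample:

```shell
# Recreate the minimal example (comma-separated, as in the question).
cat > gene_list <<'EOF'
ccl5
cxcr4
setx
EOF
cat > original_file.txt <<'EOF'
ccl4,3,18000,50000
ccl5,4,400,5000
cxcr4,5,300,2500
apoe,4,100,90
setx,3,200,1903
EOF

# Same script as above, with the separators set to "," for the sample.
out=$(awk 'BEGIN { FS=OFS="," }
FILENAME=="gene_list" {arr[$1]; next}
$1 in arr { x=sprintf("%1.0f", $3-($4/10));
            if (x<0) x=0; print $1, $2, x, $4; next }
1' gene_list original_file.txt)
printf '%s\n' "$out"
```

This reproduces the expected output from the question, including the rounding (setx: 200 - 190.3 rounds to 10) and the clamping to 0 (ccl5: 400 - 500 becomes 0).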
I want to compare the first column of two csv files. I found this answer and tried to adapt it minimally (I want the first column, not the second and I want a print out on any mismatch, regardless of whether the value was present in a control column).
I thought this would be the way to go:
BEGIN { FS = "," }
{
if(FNR==NR) {a[$1]=$1}
else {if (a[$1] != $1) {print}}
}
[Here I have already removed one Syntax Error thanks to comment by RavinderSingh13]
The first line was supposed to set the separator to comma.
The second line was supposed to fill the array exactly for as long as I am still reading the first file.
The third line was to compare the elements of the first column of the second file elementwise to said array. Then print the entire line with a mismatch.
However, if I apply this to the following tiny files, which differ in the first non-header entry:
output2.csv:
#ID,COU,YEA,VOT#
4238,"CHN",2000,1
4239,"CHN",2000,1
4239,"CHN",2000,1
4240,"CHN",2000,1
and output.csv:
#ID,COU,YEA,VOT#
4237,"CHN",2000,1
4238,"CHN",2000,1
4239,"CHN",2000,1
4240,"CHN",2000,1
I don't get any printout. I call it like this:
ludi#ludi-M17xR4:~/Jason$ gawk -f compare_col_print_diff.awk output.csv output2.csv
ludi#ludi-M17xR4:~/Jason$
For line-by-line comparison, it's easier to match the records first:
$ paste -d, file1 file2 | awk -F, '$1!=(f=$(NF/2+1)){print NR":",$1, f}'
will print values for which the first fields don't agree.
With your input files, this will give
2: 4238 4237
3: 4239 4238
The comment by Luuk made me realise a huge fundamental error in my original script, which I think should be recorded. The instruction
a[$1]=$1
does not produce an array entry per line, but an array entry per distinct ID. Hence, such an array is no basis for a general strict comparison of the files. To remedy this, I wrote the following, which works on the example, but may still contain traps, as I am still learning:
BEGIN { FS = "," }
{
if(FNR==NR) {a[NR]=$1}
else {if (a[FNR] != $1) {print FNR, $0}}
}
Producing:
$ gawk -f compare_col_print_diff.awk output.csv output2.csv
2 4238,"CHN",2000,1
3 4239,"CHN",2000,1
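For reference, the fixed script can be run end-to-end against the question's two sample files. This sketch condenses the if/else into two pattern-action rules and indexes the array consistently by FNR (equivalent, since FNR equals NR while reading the first file):

```shell
# Recreate the two sample CSVs from the question.
cat > output.csv <<'EOF'
#ID,COU,YEA,VOT#
4237,"CHN",2000,1
4238,"CHN",2000,1
4239,"CHN",2000,1
4240,"CHN",2000,1
EOF
cat > output2.csv <<'EOF'
#ID,COU,YEA,VOT#
4238,"CHN",2000,1
4239,"CHN",2000,1
4239,"CHN",2000,1
4240,"CHN",2000,1
EOF

# Index the first file's first column by line number,
# then compare the second file positionally.
out=$(awk 'BEGIN { FS = "," }
FNR==NR    { a[FNR] = $1; next }
a[FNR] != $1 { print FNR, $0 }' output.csv output2.csv)
printf '%s\n' "$out"
```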
I'm trying to use GAWK from CYGWIN to process a csv file. Pass 1 finds the max value, and pass 2 prints the records that match the max value. I'm using a .awk file as input. When I use the text in the manual, it matches on both passes. I can use the IF form as a workaround, but that forces me to use IF inside every pattern match, which is kind of a pain. Any idea what I'm doing wrong?
Here's my .awk file:
pass == 1
{
print "pass1 is", pass;
}
pass == 2
{
if(pass == 2)
print "pass2 is", pass;
}
Here's my output (input file is just "hello"):
hello
pass1 is 1
pass1 is 2
hello
pass2 is 2
Here's my command line:
gawk -F , -f test.awk pass=1 x.txt pass=2 x.txt
I'd appreciate any help.
An (g)awk solution might look like this:
awk 'FNR == NR{print "1st pass"; next}
{print "second pass"}' x.txt x.txt
(Please replace awk by gawk if necessary.)
Let's say you wanted to search for the maximum value in the first column of file x.txt and then print all lines which have this value in the first column. Your program might look like this (thanks to Ed Morton for a tip, see comment):
awk -F"," 'FNR==NR {max = ( (FNR==1) || ($1 > max) ? $1 : max ); next}
$1==max' x.txt x.txt
The output for x.txt:
6,5
2,6
5,7
6,9
is
6,5
6,9
How does this work? The variable NR keeps increasing with every record, whereas FNR is reset to 1 when reading a new file. Therefore, FNR==NR is only true for the first file processed.
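The two-pass max example above can be verified end-to-end with the sample data:

```shell
# The sample x.txt from the answer.
cat > x.txt <<'EOF'
6,5
2,6
5,7
6,9
EOF

# Pass 1 (FNR==NR): find the max of column 1.
# Pass 2: print lines whose first field equals that max.
out=$(awk -F"," 'FNR==NR {max = ( (FNR==1) || ($1 > max) ? $1 : max ); next}
$1==max' x.txt x.txt)
printf '%s\n' "$out"
```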
So... F.Knorr answered your question accurately and concisely, and he deserves a big green checkmark. NR==FNR is exactly the secret sauce you're looking for.
But here is a different approach, just in case the multi-pass thing proves to be problematic. (Perhaps you're reading the file from a slow drive, a USB stick, across a network, DAT tape, etc.)
awk -F, '$1>m{delete l;n=0;m=$1}m==$1{l[++n]=$0}END{for(i=1;i<=n;i++)print l[i]}' inputfile
Or, spaced out for easier reading:
BEGIN {
FS=","
}
$1 > max {
delete list # empty the array
n=0 # reset the array counter
max=$1 # set a new max
}
max==$1 {
list[++n]=$0 # record the line in our array
}
END {
for(i=1;i<=n;i++) { # print the array in order of found lines.
print list[i]
}
}
With the same input data that F.Knorr tested with, I get the same results.
The idea here is to go through the file in ONE pass. We record every line that matches our max in an array, and if we come across a value that exceeds the max, we clear the array and start collecting lines afresh.
This approach is heavier on CPU and memory (depending on the size of your dataset), but being single-pass, it is likely to be lighter on I/O.
The issue here is that newlines matter to awk.
# This does what I should have done:
pass==1 {print "pass1 is", pass;}
pass==2 {if (pass==2) print "pass2 is", pass;}
# This is the code in my question:
# When pass == 1, print the line (a bare pattern with no
# action gets awk's default action: print the record)
pass==1
# On every line, regardless of pass, do this
{print "pass1 is", pass;}
# When pass == 2, print the line (default action again)
pass==2
# On every line, regardless of pass, do this
{if (pass==2) print "pass2 is", pass;}
Using pass==1, pass==2 isn't as elegant, but it works.
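The effect of the newline placement is easy to demonstrate side by side on the one-line input file from the question:

```shell
printf 'hello\n' > x.txt

# Pattern and action on one line: the action belongs to the pattern.
one=$(awk 'pass == 1 { print "pass1 is", pass }' pass=1 x.txt pass=2 x.txt)

# Pattern alone on its own line: the default action prints the record,
# and the brace block on the next line is a separate, unconditional rule.
two=$(awk 'pass == 1
{ print "pass1 is", pass }' pass=1 x.txt pass=2 x.txt)

printf '%s\n---\n%s\n' "$one" "$two"
```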
The following awk code outputs what is required, except that it outputs two blank lines after each block of data. Only one blank line needs to be inserted. (Without the last {print "\n"} statement, no blank lines are output. With the statement, there are two blank lines. I need only one blank line.)
/Reco/ {for(i=0; i<=2; i++) {getline; print} {print "\n"}}
Based on your comment below that you actually want the line that matches /Reco/, the 2 subsequent lines, and a blank line inserted after that, here's how to do that based on idiom "g" below:
awk '/Reco/{c=3} c&&c--{print; if(!c)print ""}' file
As for an explanation - just remember that awk provides this functionality for you:
WHILE read line from file
DO
execute the users script (/Reco/{c=3} c&&c--{print; if(!c)print ""})
DONE
and that the body of an awk script is made up of:
<condition> { <action> }
statements with the default condition being TRUE and the default action being to print the current record/line.
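For instance, each default on its own:

```shell
printf 'a\nb\nc\n' > t.txt
# A bare pattern uses the default action: print the matching line.
m=$(awk '/b/' t.txt)
# A bare action uses the default condition: run for every line.
n=$(awk '{ print NR, $0 }' t.txt)
printf '%s\n---\n%s\n' "$m" "$n"
```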
The posted awk script above does the following:
/Reco/ { # IF the pattern "Reco" is present on the current line THEN
c=3 # Set the count of the number of lines to print to 3
} # ENDIF
c&&c-- { # IF c is non-zero THEN decrement c and THEN
print; # print the current line
if(!c) # IF c is now zero (i.e. this is the 3rd line) THEN
print "" # print a blank line
# ENDIF
} # ENDIF
so the whole execution of parsing the input file is:
WHILE read line from file
DO
/Reco/ { # IF the pattern "Reco" is present on the current line THEN
c=3 # Set the count of the number of lines to print to 3
} # ENDIF
c&&c-- { # IF c is non-zero THEN decrement c and THEN
print; # print the current line
if(!c) # IF c is now zero (i.e. this is the 3rd line) THEN
print "" # print a blank line
# ENDIF
} # ENDIF
DONE
Maybe it'd be a little clearer if the script was written as something like:
awk '/Reco/{c=3} c{c--; print; if(c == 0)print ""}' file
You got the answer you were looking for but here's how to really print the N lines after some pattern in awk:
c&&c--;/pattern/{c=N}
which in your case would be:
c&&c--;/Reco/{c=3}
and if you want to add that extra newline then it becomes:
c&&c--{print; if(!c)print ""} /Reco/{c=3}
If you're considering using getline make sure you read http://awk.info/?tip/getline first and understand all of the caveats so you know what you're getting yourself into.
P.S. The following idioms describe how to select a range of records given
a specific pattern to match:
a) Print all records from some pattern:
awk '/pattern/{f=1}f' file
b) Print all records after some pattern:
awk 'f;/pattern/{f=1}' file
c) Print the Nth record after some pattern:
awk 'c&&!--c;/pattern/{c=N}' file
d) Print every record except the Nth record after some pattern:
awk 'c&&!--c{next}/pattern/{c=N}1' file
e) Print the N records after some pattern:
awk 'c&&c--;/pattern/{c=N}' file
f) Print every record except the N records after some pattern:
awk 'c&&c--{next}/pattern/{c=N}1' file
g) Print the N records from some pattern:
awk '/pattern/{c=N}c&&c--' file
I changed the variable name from "f" for "found" to "c" for "count" where
appropriate as that's more expressive of what the variable actually IS.
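The difference between idioms (e) and (g) is easiest to see on a tiny sample: (e) starts printing on the record after the match, while (g) counts the matching record itself as the first of the N.

```shell
cat > f.txt <<'EOF'
x
pattern
1
2
3
EOF

# (e) prints the N records AFTER the matching record...
e=$(awk 'c&&c--;/pattern/{c=2}' f.txt)
# ...while (g) includes the matching record itself.
g=$(awk '/pattern/{c=2} c&&c--' f.txt)
printf '%s\n---\n%s\n' "$e" "$g"
```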
@Kevin's post provides the specific answer (use print "" or, as suggested by @BMW, printf ORS), but here's some background:
In awk,
print
is the same as:
print $0
i.e., it prints the current input line followed by the output record separator - which defaults to \n and is stored in the special ORS variable.
You can pass arguments to print to print something other than (or in addition to) $0, but the record separator is invariably appended.
Note that if you pass multiple arguments separated with , to print, they will be output separated by the output field separator - which defaults to a space and is stored in the special variable OFS.
By contrast, the - more flexible - printf function takes a format string (as in its C counterpart) and as many arguments as are needed to instantiate the placeholders (fields) in the format string.
An output record separator is NOT appended to the result.
For instance, the printf equivalent of what print without arguments does is:
printf "%s\n", $0 # assumes that \n is the output record separator
Or, more generally:
printf "%s%s", $0, ORS
Note that, as the names suggest, the output field/record separators (OFS/ORS) have input counterparts (FS/RS); their respective default values are identical (a single space / \n), though with the default FS, runs of adjacent whitespace on input are treated as a single field separator.
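These equivalences can be checked directly:

```shell
# print with no arguments is the same as print $0...
a=$(printf 'hello world\n' | awk '{ print }')
# ...which is the same as printf with $0 and ORS appended by hand.
b=$(printf 'hello world\n' | awk '{ printf "%s%s", $0, ORS }')
# Multiple print arguments are joined with OFS; set it to see the effect.
c=$(printf 'hello world\n' | awk 'BEGIN { OFS="-" } { print $1, $2 }')
printf '%s|%s|%s\n' "$a" "$b" "$c"
```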
print already includes the newline. Just use print "".
I have an awk command in a script I am trying to make work, and I don't understand the meaning of 'a':
awk 'FNR==NR{ a[$1]=$0;next } ($2 in a)' FILELIST.TXT FILEIN.* > FILEOUT.*
I'm quite new to using command line, so I'm just trying to figure things out, thanks.
a is an associative array.
a[$1] = $0;
takes the first word $1 on the line as the index in the array, and stores the whole line $0 as the value. It does this for the first file (while the file record number is equal to the overall record number). The next command means it doesn't process the rest of the script while it is processing the first file.
For the rest of the data files, it evaluates:
($2 in a)
and prints the line if the word in $2 is found. This makes storing $0 in a relatively expensive, because it keeps a copy of the whole first file in memory. Since the script only tests membership with in, it is more conventional and sufficient to do a[$1]++ or even a[$1] = 1.
Given FILELIST.TXT
ABC The rest
DEF And more
Given FILEIN.1 containing:
Word ABC and so on
Grow FED won't be shown
This DEF will be shown
The XYZ will be missing
The output will be:
Word ABC and so on
This DEF will be shown
Here a is not a command but an awk array; it could just as well be named arr:
awk 'FNR==NR {arr[$1]=$0;next} ($2 in arr)' FILELIST.TXT FILEIN.* > FILEOUT.*
a is nothing but an array. In your code, the rule
FNR==NR{ a[$1]=$0;next }
creates an array called "a" with indexes taken from the first column of the first input file.
All element values are set to the current record.
The next statement forces awk to immediately stop processing the current record and go on to the next record.
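Putting the sample files from the answer above together (reducing the FILEIN.* glob to a single file for this sketch):

```shell
cat > FILELIST.TXT <<'EOF'
ABC The rest
DEF And more
EOF
cat > FILEIN.1 <<'EOF'
Word ABC and so on
Grow FED won't be shown
This DEF will be shown
The XYZ will be missing
EOF

# First file: remember each first word. Later files: print lines
# whose second word was seen as a first word in FILELIST.TXT.
out=$(awk 'FNR==NR{ a[$1]=$0; next } ($2 in a)' FILELIST.TXT FILEIN.1)
printf '%s\n' "$out"
```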