weird awk outputs in reading/writing file - awk

I'm working on a Kaldi project based on the existing example that uses the Tedlium dataset. Every step works well until the clean-up stage, where I hit a length mismatch issue. After examining all the scripts, I found the problem is in lattice_oracle_align.sh.
Reference: https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/steps/cleanup/lattice_oracle_align.sh
I believe the issue is line 142:
awk '{if ($2 == "#csid") print $1" "($4+$5+$6)}' $dir/analysis/per_utt_details.txt > $dir/edits.txt
The above line should read per_utt_details.txt line by line; every time it encounters a #csid line, it should write a line to edits.txt.
The text in per_utt_details.txt looks like this:
ref
hyp
op
#csid 0 0 0 0
...and these four lines repeat throughout the file.
There are 1073046 lines in per_utt_details.txt. I expect 268262 lines in edits.txt. However, only 48746 lines exist in edits.txt.

Judging by your samples, I believe you want to compare the 1st field, NOT the 2nd field (which is what your shown code tests). If that is the case, try running the following, where I have changed $2 to $1 to compare against the 1st field:
awk '($1 == "#csid"){print $1,($4+$5+$6)}' per_utt_details.txt > edits.txt
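As a quick sanity check (a sketch, assuming the file really follows the four-line pattern shown), you can count how many lines carry #csid in the 1st vs. the 2nd field before deciding which test the script needs:
awk '$1 == "#csid"{c1++} $2 == "#csid"{c2++} END{print "field1:", c1+0, "field2:", c2+0}' per_utt_details.txt
Whichever count matches your expected 268262 tells you which field the comparison should use.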

Related

combine two awk commands NF and BEGIN (.csv processing)

How to combine two awk commands:
awk NF file_name.csv > file_name_after_NF.csv the output is used in next step:
awk 'BEGIN{f=""}{if($0!=f){print $0}if(NR==1){f=$0}}' file_name_after_NF.csv > file_name_postprocess.csv
Assuming the intermediate file file_name_after_NF.csv was written solely to feed the 'no blank lines' version of the .csv file into the 'remove repeat lines' command, the two procedures can be combined by making NF the condition pattern for the main awk code block:
awk 'BEGIN{f=""} NF{if($0!=f){print $0}if(NR==1){f=$0}}' file_name.csv > file_name_postprocess.csv
In the above procedure, the main awk block is only applied where there are one or more fields in the record.
If you need a file copy of file_name_after_NF.csv, this can be created by adding a file-write block within the main block of the previous awk procedure (note that the output filename must be a quoted string inside awk):
awk 'BEGIN{f=""} NF{if($0!=f){print $0}if(NR==1){f=$0}{print $0 > "file_name_after_NF.csv"}}' file_name.csv > file_name_postprocess.csv
{print $0 > "file_name_after_NF.csv"} does the file writing from within awk. As this block is within the main block processed according to the NF condition pattern, only lines with fields are written to file_name_after_NF.csv.
More generally, if a little more cumbersomely, awk procedures can be combined by piping their output into successive awk procedures. In your case this could be achieved using:
awk NF file_name.csv | awk 'BEGIN{f=""}{if($0!=f){print $0}if(NR==1){f=$0}}' > file_name_postprocess.csv
or, if you need the intermediate file, again include an awk file print block:
awk NF file_name.csv | awk 'BEGIN{f=""}{if($0!=f){print $0}if(NR==1){f=$0}{print $0 > "file_name_after_NF.csv"}}' > file_name_postprocess.csv
Edit: dealing with cases where the header line comes after one or more blank lines
@glennjackman raised a point relating to an example where blank lines exist before the first/header row: the NR==1 condition in the first two examples above would then fail to set f to the header. The last two examples, where the awk procedures are joined by pipes, would still work, but the first two would not.
To fix this, a further variable can be added to the BEGIN awk block that is updated as soon as the first non-empty line is seen in the main block. This allows for that line to be identified so that prior empty lines do not matter:
awk 'BEGIN{f="";headerFlag="none"} NF{if($0!=f){print $0}if(headerFlag=="none"){f=$0;headerFlag="yes"}}' file_name.csv > file_name_postprocess.csv
The NR==1 conditional in the original script is changed here to check whether the headerFlag set in BEGIN has been changed. If not, f is set to track repeats of that line and headerFlag is changed, so the block only runs on the first encounter of a non-empty record.
The same change can be used in the second solution above.
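For reference, here is a sketch of that second solution (the one that also writes the intermediate file) with the same headerFlag change applied:
awk 'BEGIN{f="";headerFlag="none"} NF{if($0!=f){print $0}if(headerFlag=="none"){f=$0;headerFlag="yes"}{print $0 > "file_name_after_NF.csv"}}' file_name.csv > file_name_postprocess.csv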
I wasn't planning to try to answer this until after you posted sample input and expected output, but since you already have answers, here's my guess at what you might be asking how to write:
awk 'BEGIN{f=""} !NF{next} $0!=f; f==""{f=$0}' file_name.csv > file_name_postprocess.csv
but without sample input/output it's untested.
I'd recommend you start using white space in your code to improve readability, btw.
Also, testing NF for a CSV without setting FS to "," is probably the wrong thing to do, but I don't know if you're trying to skip lines of all blanks, lines of all commas, or something else, so I don't know what the right thing to do is. Maybe it's this:
awk 'BEGIN{FS=","; f=""} $0~("^"FS"*$"){next} $0!=f; f==""{f=$0}' file_name.csv > file_name_postprocess.csv

Comparing column of two files

I want to compare the first column of two csv files. I found this answer and tried to adapt it minimally (I want the first column, not the second, and I want a printout on any mismatch, regardless of whether the value is present in a control column).
I thought this would be the way to go:
BEGIN { FS = "," }
{
if(FNR==NR) {a[$1]=$1}
else {if (a[$1] != $1) {print}}
}
[Here I have already removed one syntax error, thanks to a comment by RavinderSingh13.]
The first line was supposed to set the separator to comma.
The second line was supposed to fill the array exactly for as long as I am still reading the first file.
The third line was to compare the elements of the first column of the second file elementwise to said array. Then print the entire line with a mismatch.
However, if I apply this to the following tiny files, which differ in the first non-header entry:
output2.csv:
#ID,COU,YEA,VOT#
4238,"CHN",2000,1
4239,"CHN",2000,1
4239,"CHN",2000,1
4240,"CHN",2000,1
and output.csv:
#ID,COU,YEA,VOT#
4237,"CHN",2000,1
4238,"CHN",2000,1
4239,"CHN",2000,1
4240,"CHN",2000,1
I don't get any printout. I call it like this:
ludi#ludi-M17xR4:~/Jason$ gawk -f compare_col_print_diff.awk output.csv output2.csv
ludi#ludi-M17xR4:~/Jason$
For line-by-line comparison, it's easier to match up the records first:
$ paste -d, file1 file2 | awk -F, '$1!=(f=$(NF/2+1)){print NR":",$1, f}'
This will print the values wherever the first fields don't agree.
With your input files, this will give
2: 4238 4237
3: 4239 4238
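To see why $(NF/2+1) picks out the first field of the second file, it helps to look at one pasted record (here the second line, assuming file1 is output2.csv and file2 is output.csv, which matches the printed output above):
$ paste -d, output2.csv output.csv | sed -n 2p
4238,"CHN",2000,1,4237,"CHN",2000,1
The pasted record has NF=8 comma-separated fields, so $(NF/2+1) is $5, i.e. 4237, the first field of the second file.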
The comment by Luuk made me realise a huge fundamental error in my original script, which I think should be recorded. The instruction
a[$1]=$1
does not produce an array entry per line, but an array entry per distinct ID: with output2.csv above, for example, the two 4239 lines collapse into the single entry a["4239"]. Hence such an array is no basis for a general strict comparison of the files. To remedy this, I wrote the following, which works on the example but may still contain traps, as I am still learning:
BEGIN { FS = "," }
{
if(FNR==NR) {a[NR]=$1}
else {if (a[FNR] != $1) {print FNR, $0}}
}
Producing:
$ gawk -f compare_col_print_diff.awk output.csv output2.csv
2 4238,"CHN",2000,1
3 4239,"CHN",2000,1

Breakdown of one line of code involving awk

I am currently working on a project which involves comparing data from two different files. I had been looking for a command which would compare one line of file1 with each line of file2, and print out a '1' if there is a match and a '0' if not. Then it would repeat this for the second line of file1, again against each line of file2.
I found this bit of code online which seems to work for my project, but I was hoping someone would help to break it down for me and provide more of an explanation.
awk 'FNR==NR{a[$1]; next} {print $1, ($1 in a) ? "1":"0"}' file1.txt file2.txt
Additionally, I'm new to this so any resources which may guide me to my answer would be really helpful. Thank you.
Here's what this awk is saying:
awk 'FNR==NR{a[$1]; next} {print $1, ($1 in a) ? "1":"0"}' file1.txt file2.txt
IF the record number within the specific file being processed (FNR) is the same as the overall record number being processed (NR), execute the {a[$1]; next} block. We can safely assume that when this condition is true we are processing the first file.
{a[$1]; next} adds the first column as a key in the array named a and then goes to the next line without processing any more of the awk script. Once the first file is fully processed, we have an array with a key for every distinct value found in the first column of the first file.
{print $1, ($1 in a) ? "1":"0"} Since we must now be on the second file, we print every line/record we encounter: first the first column, then 1 if that column's value is a key in the array, otherwise 0.
In short, this prints the first column of every line of the second file, stating with a 1 or 0 whether that value also exists in the first file.
Repeated here for locality of reference, and in case the question gets edited:
awk 'FNR==NR{a[$1]; next} {print $1, ($1 in a) ? "1":"0"}' file1.txt file2.txt
You should really read a basic awk primer first. Basically, the clause FNR==NR is a common idiom to check whether we're reading the first file: NR is the overall record number (the line number across all input), while FNR is the record number within the current file, so the two are equal only while the first file is being processed. The action then stores the first column (not the entire line) as a key in the array named a. So the first thing this program does is read the first column of the first file into the array a. Then it starts reading the second file and prints the first column of each line, followed by "1" or "0" depending on whether the value in the first column is a key in the array.
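A tiny made-up example (file names and contents are hypothetical, purely to illustrate the idiom):
$ cat file1.txt
apple 1
banana 2
$ cat file2.txt
apple x
cherry y
$ awk 'FNR==NR{a[$1]; next} {print $1, ($1 in a) ? "1":"0"}' file1.txt file2.txt
apple 1
cherry 0
apple appears in the first column of file1.txt, so it gets a 1; cherry does not, so it gets a 0.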

awk repeat reading for all lines

I am using the following command, which reads a file with 90289 columns and begins reading after 90307 lines, but the results I am getting are only for the first such line, the 90307th. I also want to read lines 90308, 90309, etc., but skip the first 90307 lines only once.
awk '{if (FNR==90307) for(i=2;i<=90289;i+=3) print x=$i, y=$(i+1), z=$(i+2)}'
I need a script which:
1. skips the first 90307 lines only one time
2. reads the 90289 columns at EVERY line after the first 90307
3. repeats no. 2 for all the lines
Is it possible?
Surely you could have changed == to > yourself:
awk 'NR>90307{for(i=2;i<=90289;i+=3) print $i, $(i+1), $(i+2) }'
I'm not sure why you are assigning to x, y and z either. Is your actual script larger, and does it use these values? Also, do you actually want to print sets of 3 fields? You don't mention this. You should edit your question with a clean description, a simple example, and the expected output.

The meaning of "a" in an awk command?

I have an awk command in a script I am trying to make work, and I don't understand the meaning of 'a':
awk 'FNR==NR{ a[$1]=$0;next } ($2 in a)' FILELIST.TXT FILEIN.* > FILEOUT.*
I'm quite new to using the command line, so I'm just trying to figure things out. Thanks.
a is an associative array.
a[$1] = $0;
takes the first word $1 on the line as the index in the array, and stores the whole line $0 as the value. It does this for the first file (while the per-file record number FNR is equal to the overall record number NR). The next command means it doesn't process the rest of the script while it is processing the first file.
For the rest of the data files, it evaluates:
($2 in a)
and prints the line if the word in $2 is found as a key. Note that storing $0 in a is relatively expensive, because it keeps a copy of the whole first file in memory (possibly twice, if there's only one word on each line). Since the values are never used here, it is more conventional and sufficient to do a[$1]++ or even a[$1] = 1.
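For example, this sketch stores only a flag; the output is unchanged because the ($2 in a) test asks whether the key exists, never what the value is:
awk 'FNR==NR{ a[$1]=1; next } ($2 in a)' FILELIST.TXT FILEIN.* > FILEOUT.*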
Given FILELIST.TXT
ABC The rest
DEF And more
Given FILEIN.1 containing:
Word ABC and so on
Grow FED won't be shown
This DEF will be shown
The XYZ will be missing
The output will be:
Word ABC and so on
This DEF will be shown
Here a is not a command but an awk array; it could just as well be called arr:
awk 'FNR==NR {arr[$1]=$0;next} ($2 in arr)' FILELIST.TXT FILEIN.* > FILEOUT.*
a is nothing but an array. In your code,
FNR==NR{ a[$1]=$0;next }
creates an array called a with indexes taken from the first column of the first input file.
All element values are set to the current record.
The next statement forces awk to immediately stop processing the current record and go on to the next record.
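Spreading the one-liner over several lines with comments can make the flow easier to follow (a formatting change only; the behaviour is identical):
awk '
  FNR==NR {        # true only while reading the first file
    a[$1] = $0     # key: first word of the line; value: whole line
    next           # skip the pattern below for this record
  }
  ($2 in a)        # later files: print lines whose 2nd word was a key
' FILELIST.TXT FILEIN.* > FILEOUT.*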