I am using the following command, which reads a file with 90289 columns and begins reading after 90307 lines, but I only get results for a single line, the 90307th. I also want to read lines 90308, 90309, etc., and skip the first 90307 lines only once.
awk '{if (FNR==90307) for(i=2;i<=90289;i+=3) print x=$i, y=$(i+1), z=$(i+2)}'
I need a script which:
1. skips 90307 lines only once
2. reads the 90289 columns on EVERY line after the first 90307
3. repeats step 2 for all the remaining lines
Is it possible?
Surely you could have changed == to > yourself:
awk 'NR>90307{for(i=2;i<=90289;i+=3) print $i, $(i+1), $(i+2) }'
I'm not sure why you are assigning to x, y and z either. Is your actual script larger, and does it use these values? Also, do you actually want to print sets of 3 fields? You don't mention this. You should edit your question to give a clean description with a simple example and the expected output.
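To make the difference between == and > concrete, here is a scaled-down sketch. The numbers are stand-ins chosen for illustration (skip 2 lines instead of 90307, read fields 2..7 instead of 2..90289), and the variable names skip and last are hypothetical; the input is made up.

```shell
# Scaled-down stand-in for the real file (assumed values, not the real data):
# skip 2 lines instead of 90307, read fields 2..7 instead of 2..90289.
skip=2
last=7
result=$(printf '%s\n' \
  'h1 junk' \
  'h2 junk' \
  'id1 1 2 3 4 5 6' \
  'id2 7 8 9 10 11 12' |
  awk -v skip="$skip" -v last="$last" '
    NR > skip {                       # true for every line after the skipped ones
      for (i = 2; i + 2 <= last; i += 3)
        print $i, $(i+1), $(i+2)      # one triple of fields per output line
    }')
printf '%s\n' "$result"
```

With NR == skip the action would fire for exactly one line; NR > skip fires for every later line, which is what the question asks for.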
I'm working on a Kaldi project based on the existing example that uses the Tedlium dataset. Every step works well until the clean-up stage, where I hit a length mismatch. After examining all the scripts, I found that the issue is in lattice_oracle_align.sh
reference:https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/steps/cleanup/lattice_oracle_align.sh
I believe the issue is line 142.
awk '{if ($2 == "#csid") print $1" "($4+$5+$6)}' $dir/analysis/per_utt_details.txt > $dir/edits.txt
The above line should read per_utt_details.txt line by line; every time it reads a #csid line, it should write a line to edits.txt.
The text in per_utt_details.txt looks like this:
ref
hyp
op
#csid 0 0 0 0
...repeat the above 4 lines.
There are 1073046 lines in per_utt_details.txt. I expect 268262 lines in edits.txt. However, only 48746 lines exist in edits.txt.
Judging by your samples, I believe you want to compare the 1st field, NOT the 2nd field (which is what your code does). If that is the case, try the following, where I have changed $2 to $1 so that the comparison uses the 1st field:
awk '($1 == "#csid"){print $1,($4+$5+$6)}' per_utt_details.txt > edits.txt
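Before changing the field number, it may help to check which field actually holds "#csid" in the real file. The following is only a diagnostic sketch on made-up input that mixes both layouts (with and without a leading utterance ID); the sample lines are assumptions, not taken from per_utt_details.txt.

```shell
# Count, per field position, how often the literal token "#csid" appears.
# The input lines below are invented for illustration.
result=$(printf '%s\n' \
  'utt1 ref a b' \
  'utt1 hyp a b' \
  'utt1 op C C' \
  'utt1 #csid 2 0 0 0' \
  '#csid 1 0 0 0' |
  awk '{ for (i = 1; i <= NF; i++) if ($i == "#csid") count[i]++ }
       END { for (i in count) print "field " i ": " count[i] }' |
  sort)
printf '%s\n' "$result"
```

Run against the real file (replace the printf with the file name as awk's argument), the counts show whether the token is consistently in $1, in $2, or split between the two.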
I want to compare the first column of two csv files. I found this answer and tried to adapt it minimally (I want the first column, not the second and I want a print out on any mismatch, regardless of whether the value was present in a control column).
I thought this would be the way to go:
BEGIN { FS = "," }
{
if(FNR==NR) {a[$1]=$1}
else {if (a[$1] != $1) {print}}
}
[Here I have already removed one Syntax Error thanks to comment by RavinderSingh13]
The first line was supposed to set the separator to comma.
The second line was supposed to fill the array exactly for as long as I am still reading the first file.
The third line was to compare the elements of the first column of the second file elementwise to said array. Then print the entire line with a mismatch.
However, if I apply this to the following tiny files, which differ in the first non-header entry:
output2.csv:
#ID,COU,YEA,VOT#
4238,"CHN",2000,1
4239,"CHN",2000,1
4239,"CHN",2000,1
4240,"CHN",2000,1
and output.csv:
#ID,COU,YEA,VOT#
4237,"CHN",2000,1
4238,"CHN",2000,1
4239,"CHN",2000,1
4240,"CHN",2000,1
I don't get any output. I call it like this:
ludi#ludi-M17xR4:~/Jason$ gawk -f compare_col_print_diff.awk output.csv output2.csv
ludi#ludi-M17xR4:~/Jason$
For a line-by-line comparison, it's easier to pair up the records first:
$ paste -d, file1 file2 | awk -F, '$1!=(f=$(NF/2+1)){print NR":",$1, f}'
This will print the values for which the first fields don't agree.
With your input files, this will give:
2: 4238 4237
3: 4239 4238
The comment by Luuk made me realise a huge fundamental error in my original script, which I think should be recorded. The instruction
a[$1]=$1
does not produce an array entry per line, but an array entry per distinct ID. Hence, such an array is no basis for a strict line-by-line comparison of the files. To remedy this, I wrote the following, which works on the example but may still contain traps, as I am still learning:
BEGIN { FS = "," }
{
if(FNR==NR) {a[NR]=$1}
else {if (a[FNR] != $1) {print FNR, $0}}
}
Producing:
$ gawk -f compare_col_print_diff.awk output.csv output2.csv
2 4238,"CHN",2000,1
3 4239,"CHN",2000,1
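The difference between keying the array by value and keying it by line number can be reproduced side by side. This sketch recreates the two sample files from the question in a temporary directory; the variable names by_value and by_line are just labels for this illustration.

```shell
tmp=$(mktemp -d)
cat > "$tmp/output.csv" <<'EOF'
#ID,COU,YEA,VOT#
4237,"CHN",2000,1
4238,"CHN",2000,1
4239,"CHN",2000,1
4240,"CHN",2000,1
EOF
cat > "$tmp/output2.csv" <<'EOF'
#ID,COU,YEA,VOT#
4238,"CHN",2000,1
4239,"CHN",2000,1
4239,"CHN",2000,1
4240,"CHN",2000,1
EOF

# Keyed by value: one entry per distinct ID, so this is a membership test.
by_value=$(awk -F, 'FNR == NR { a[$1] = $1; next } a[$1] != $1' \
  "$tmp/output.csv" "$tmp/output2.csv")

# Keyed by line number: line N of file 2 is compared with line N of file 1.
by_line=$(awk -F, 'FNR == NR { a[NR] = $1; next } a[FNR] != $1 { print FNR, $0 }' \
  "$tmp/output.csv" "$tmp/output2.csv")

printf 'by value: [%s]\n' "$by_value"
printf 'by line:\n%s\n' "$by_line"
rm -rf "$tmp"
```

Every ID in output2.csv also occurs somewhere in output.csv, so the value-keyed version prints nothing, while the line-keyed version reports lines 2 and 3.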
I've been sitting on this one for quite a while:
I would like to search for a pattern in a sample.file using awk and print the index:
>sample
ATGCGAAAAGATGAACGA
GTGACAGACAGACAGACA
GATAAACTGACGATAAAA
...
Let's say I want to find the index of the following pattern: "AAAA" (occurs twice), so the result should be 6 and 51.
EDIT:
I was able to use the following script:
cat ./sample.fasta |\
awk '{
s=$0
o=0
m="AAAA"
l=length(m)
i=index(s,m)
while (i>0) {
o+=i
print o
s=substr(s,i+l)
o+=l-1
i=index(s,m)
}
}'
However, it restarts the index on every new line, so the result is 6 and 15. I can always concatenate all lines into one single line, but maybe there's a more elegant way.
Thanks in advance
awk reads files line-by-line so it would never be a problem to find "all" indices in a multi-line file. Your problem is that you're trying to use a BEGIN block which, as its name suggests, only runs at the beginning of the program. As well, the index() function takes two arguments.
For your sample data, this should work:
awk '/AAAA/{print index($0,"AAAA")+l} NR>1{l+=length}' sample.file
The first block of code only runs when AAAA is matched, the second runs for every line after the first, incrementing the counter with the length of the line.
For the case where you have multiple matches per line, this should work:
awk -v pat=AAAA 'BEGIN{for(n=0;n<length(pat);n++) rep=rep"x"} NR>1{while(i=index($0,pat)){print i+l; sub(pat,rep);} l+=length}' sample.file
The pattern is passed as a variable; when the program starts a replacement text is generated based on the length of the pattern. Then each line after the first is looped over, getting the index of the pattern and replacing it so the next iteration returns the next instance.
It's worth mentioning that both these methods will match AAAAAA.
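As a variation, the following sketch keeps a running offset across lines and walks each line with substr() instead of overwriting matches in $0. It is an alternative under stated assumptions, not the method above: header lines are assumed to start with ">", matches are reported non-overlapping, and the names offset and pos are hypothetical. Only the three sequence lines shown in the question are used as input.

```shell
result=$(printf '%s\n' \
  '>sample' \
  'ATGCGAAAAGATGAACGA' \
  'GTGACAGACAGACAGACA' \
  'GATAAACTGACGATAAAA' |
  awk -v pat=AAAA '
    /^>/ { next }                      # skip FASTA header lines
    {
      s = $0; pos = 0                  # pos = characters of this line already consumed
      while ((i = index(s, pat)) > 0) {
        print offset + pos + i         # 1-based position in the concatenated sequence
        pos += i + length(pat) - 1     # resume right after this match
        s = substr(s, i + length(pat))
      }
      offset += length($0)             # carry this line length into the next line
    }')
printf '%s\n' "$result"
```

A pattern that straddles a line break is still missed by this approach, as it is by the one-liners above; joining the lines first is the simple way to catch those.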
AWK indexes of course:
awk '{ l=index($0, "AAAA"); if (l) print l+i; i+=length(); }' dna.txt
6
51
If you're fine with zero-based indices, this may be simpler:
$ sed 1d file | tr -d '\n' | grep -ob AAAA
5:AAAA
50:AAAA
This assumes you have the header row as posted; if not, remove the sed command. Note that it also assumes single-byte characters, as shown. For extended character sets it reports the byte offset, not the character position.
I need an awk command that compares the lines in a file and prints only the first line when the other lines just add new words to it.
For example, file.txt contains:
i am going
i am going today
i am going with my friend
The output should be:
i am going
This will work for the sample input but may fail on the actual one; without a representative input we can't know.
$ awk 'NR>1 && $0~p {if(!f) print p; f=1; next} {p=$0; f=0}' file
i am going
You may want to play with p=$0 to restrict the number of fields that are matched, if the line lengths are not in increasing order.
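One possible variation, sketched here as an assumption rather than a fix: because $0~p treats the previous line as a regular expression, lines containing characters such as . or * could match unexpectedly. Testing for a literal prefix with index() avoids that. The input is the sample from the question.

```shell
result=$(printf '%s\n' \
  'i am going' \
  'i am going today' \
  'i am going with my friend' |
  awk '
    # index($0, p) == 1 means the current line starts with the literal text p
    NR > 1 && index($0, p) == 1 { if (!f) print p; f = 1; next }
    { p = $0; f = 0 }                  # otherwise this line becomes the new base line
  ')
printf '%s\n' "$result"
```

Like the original, this prints a base line only when at least one following line extends it.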
I have an awk command in a script I am trying to make work, and I don't understand the meaning of 'a':
awk 'FNR==NR{ a[$1]=$0;next } ($2 in a)' FILELIST.TXT FILEIN.* > FILEOUT.*
I'm quite new to using the command line, so I'm just trying to figure things out. Thanks.
a is an associative array.
a[$1] = $0;
takes the first word $1 on the line as the index in the array, and stores the whole line $0 as the value. It does this for the first file (while the file record number is equal to the overall record number). The next command means it doesn't process the rest of the script while it is processing the first file.
For the rest of the data files, it evaluates:
($2 in a)
and prints the line if the word in $2 is an index of a. Storing $0 in a is relatively expensive because it keeps a copy of the whole first file in memory (possibly twice if there's only one word on each line of the file). It is more conventional, and sufficient, to do a[$1]++ or even a[$1] = 1.
Given FILELIST.TXT containing:
ABC The rest
DEF And more
Given FILEIN.1 containing:
Word ABC and so on
Grow FED won't be shown
This DEF will be shown
The XYZ will be missing
The output will be:
Word ABC and so on
This DEF will be shown
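The example above can be reproduced with the cheaper a[$1] = 1 form. This is a minimal sketch that recreates the two sample files in a temporary directory; the array name seen is an arbitrary choice.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'ABC The rest' 'DEF And more' > "$tmp/FILELIST.TXT"
printf '%s\n' \
  'Word ABC and so on' \
  "Grow FED won't be shown" \
  'This DEF will be shown' \
  'The XYZ will be missing' > "$tmp/FILEIN.1"

# First file: remember each first word as an array index (the value is irrelevant).
# Later files: print the line when its second word is one of those indexes.
result=$(awk 'FNR == NR { seen[$1] = 1; next } $2 in seen' \
  "$tmp/FILELIST.TXT" "$tmp/FILEIN.1")
printf '%s\n' "$result"
rm -rf "$tmp"
```

One caveat with the FNR == NR idiom: if the first file is empty, the condition is also true for the second file, so its lines are loaded into the array instead of being filtered.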
Here a is not a command but an awk array; it could just as well be named arr:
awk 'FNR==NR {arr[$1]=$0;next} ($2 in arr)' FILELIST.TXT FILEIN.* > FILEOUT.*
a is nothing but an array. In your code,
FNR==NR{ a[$1]=$0;next }
creates an array called "a" with indexes taken from the first column of the first input file.
Each element's value is set to the record it was read from.
The next statement forces awk to immediately stop processing the current record and go on to the next record.