How to improve the speed of this awk script - awk

I have a large file, say file1.log, that looks like this:
1322 a#gmail.com
2411 b#yahoo.com
and a smaller file, say file2.log, that looks like this:
a#gmail.com
c#yahoo.com
In fact, file1.log contains about 6500000 lines and file2.log contains about 140000.
I want to find all lines in file2.log that do not appear in file1.log. I wrote this awk command:
awk 'NR==FNR{c[$2]++} NR!=FNR && c[$1]==0 {print $0}' file1.log file2.log > result.log
After half an hour or so the command was still running, and less result.log showed that result.log was empty.
I am wondering whether there is something I can do to get the job done more quickly.

Hash the smaller file, file2, into memory. Remember the Tao of Programming, 1.3: How could it be otherwise?
$ awk '
NR==FNR {             # hash file2 since it's smaller
    a[$0]
    next
}
($2 in a) {           # if a file1 entry is found in the hash
    delete a[$2]      # remove it
}
END {                 # in the end
    for(i in a)       # print the ones that remain in the hash
        print i
}' file2 file1        # mind the order
Output:
c#yahoo.com

If you sort the inputs, you can use comm to print only the lines that are present in the second file but not the first:
comm -13 <(awk '{ print $2 }' file1.log | sort) <(sort file2.log)
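With the sample files from the question, a quick sketch of the expected result (assuming file1.log and file2.log contain exactly the lines shown above):
$ comm -13 <(awk '{ print $2 }' file1.log | sort) <(sort file2.log)
c#yahoo.com
comm -13 suppresses the lines unique to the first input and the lines common to both, leaving only the lines unique to file2.log.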

I believe the easiest approach is a simple grep pipeline:
grep -Fwof file2 file1 | grep -Fwovf - file2
You can also just extract the second column of file1 and use the last part of the above command again:
awk '{print $2}' file1 | grep -Fwovf - file2
Or everything in a single awk:
awk '(NR==FNR){a[$2]; next}!($1 in a)' file1 file2
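For reference, a sketch of what the single-awk version should print on the small sample from the question (assuming file1 and file2 hold the same lines as file1.log and file2.log above): it hashes the second column of file1, then prints only those file2 lines whose first field is not in that hash.
$ awk '(NR==FNR){a[$2]; next}!($1 in a)' file1 file2
c#yahoo.com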

Related

Removing content of a column based on number of occurrences

I have a file (;-separated) with data like this:
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3;000-200.2
211111121;000-000.1;000-000.2
I would like to remove any $3 that has fewer than 3 occurrences, so the outcome would be like this:
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3
211111121;000-000.1;000-000.2
That is, only that line's $3 got deleted, as its value had only a single occurrence.
Sadly I am not really sure whether (and thus how) this could be done relatively easily (doing the =COUNT.IF matching and manual deletion in Excel feels quite embarrassing).
$ awk -F';' 'NR==FNR{cnt[$3]++;next} cnt[$3]<3{sub(/;[^;]+$/,"")} 1' file file
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3
211111121;000-000.1;000-000.2
or if you prefer:
$ awk -F';' 'NR==FNR{cnt[$3]++;next} {print (cnt[$3]<3 ? $1 FS $2 : $0)}' file file
This awk one-liner can help; it processes the file twice:
awk -F';' -v OFS=';' 'NR==FNR{a[$3]++;next}a[$3]<3{NF--}7' file file
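Note that decrementing NF makes gawk rebuild $0 using OFS, which is why OFS=';' is set above. A minimal sketch of that rebuild effect (gawk behaviour; the input line is just an illustration):
$ echo 'a;b;c' | awk -F';' '{NF--} 1'
a b
$ echo 'a;b;c' | awk -F';' -v OFS=';' '{NF--} 1'
a;b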
Though the awk solutions are the best in terms of performance, your goal could also be achieved with something like this:
while IFS=" " read a b; do
    if [[ "$a" -lt 3 ]]; then
        sed -i "s/;$b\$//" b.txt
    fi
done <<<"$(cut -d";" -f3 b.txt | sort | uniq -c)"
The loop is driven by the output of cut | sort | uniq -c, which counts the occurrences of each third field:
$ cut -d";" -f3 b.txt | sort | uniq -c
7 000-000.2
1 000-200.2
3 020-000.8
The above edits the source file in place, so keep a backup for testing.
You can feed the file twice to awk. In the first pass you gather statistics that you use in the second pass:
script.awk
FNR == NR { stats[$3]++
            next
          }
{ if (stats[$3] < 3) print $1 FS $2
  else               print
}
Run it like this: awk -F\; -f script.awk yourfile yourfile
The condition FNR == NR is true while the first filename given to awk is being processed; the next statement then skips the second block.
Thus the second block is only used for processing the second filename given to awk (which here is the same as the first filename).
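If the FNR == NR idiom is unfamiliar: NR counts records across all input files while FNR restarts at 1 for each file, so the two are only equal during the first pass. A minimal sketch (small.txt is a made-up two-line file, used purely for illustration):
$ printf 'x\ny\n' > small.txt
$ awk '{ print FILENAME, NR, FNR, (FNR == NR ? "first pass" : "second pass") }' small.txt small.txt
small.txt 1 1 first pass
small.txt 2 2 first pass
small.txt 3 1 second pass
small.txt 4 2 second pass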

awk printing the second-to-last record of a file

I have a file set up like
Words on
many line
%
More Words
on many lines
%
Even More Words
on many lines
%
and I would like to output the second-to-last record of this file, where records are delimited by a % after each block of text.
I have used:
awk -v RS=\% ' END{ print NR }' $f
to find the number of records (1136). Then I did
awk -v RS=\% ' { print $(NR-1) }' $f
and
awk -v RS=\% ' { print $(NR=1135) }' $f
.
Neither of these worked; instead, they displayed a record towards the beginning of the file and many blank lines.
OUTPUT:
"You know, of course, that the Tasmanians, who never committed adultery, are
now extinct."
-- M. Somerset Maugham
"The
is
what
that
This output had many, many more blank lines and contained a record near the middle of the file.
awk -v RS=\% 'END{ print $(NR-1) }' $f
returns a blank line. The same command with different $(NR-x) values also returns a blank line.
Can someone help me print the second-to-last record in this case?
Thanks
You can do:
awk '{this=last;last=$0} END{print this}' file
Or, if you don't mind having the entire file in memory:
awk '{a[NR]=$0} END{print a[NR-1]}' file
Or, if it is just line count (or record count) based, you can keep a rolling deletion going so you are not too piggish on memory:
$ seq 999999 | tail -2
999998
999999
$ seq 999999 | awk '{a[NR]=$0; delete a[NR-3]} END{print a[NR-1]}'
999998
If they are blocks of text, the same method works as long as you can separate the blocks into delimited records.
Given:
$ echo "$txt"
Words on
many line
%
More Words
on many lines
%
Even More Words
on many lines
%
You can do:
$ echo "$txt" | awk -v RS=\% '{a[NR]=$0} END{print a[NR-1]}'
Even More Words
on many lines
$ echo "$txt" | awk -v RS=\% '{a[NR]=$0} END{print a[NR-2]}'
More Words
on many lines
If you don't want to print the leading and trailing \n, you can do:
$ echo "$txt" | awk 'BEGIN{RS="%\n"} {a[NR]=$0} END{printf a[NR-2]}'
Words on
many line
Finally, if you know the specific record you want to print, do it this way in awk:
$ seq 999999 | awk -v mrk=1135 'NR==mrk{print; exit}'
1135
If you want a random record, you can do:
$ awk -v min=1 -v max=1135 'BEGIN{srand()
RS="%\n"
tgt=int(min+rand()*(max-min+1))
}
NR==tgt{print; exit}' file
Does the solution have to be with awk? Just using head and tail would be simpler.
tail -2 file.txt | head -1 > justthatline.txt
The best way for this would be to use the BEGIN construct.
awk 'BEGIN{RS="%\n"; ORS="%\n"}(NR>=2){print}' file
RS and ORS set the input and output record separators, respectively.

awk field separator not working for first line

echo 'NODE_1_length_317516_cov_18.568_ID_4005' | awk 'FS="_length" {print $1}'
Obtained output:
NODE_1_length_317516_cov_18.568_ID_4005
Expected output:
NODE_1
How is that possible? I'm missing something.
When you are going through lines using Awk, the field separator is interpreted before processing the record. Awk reads the record according to the current values of FS and RS and then performs the operations you ask it for.
This means that if you set the value of FS while reading a record, it won't have an effect on that specific record. Instead, the new FS takes effect when reading the next one, and so on.
So if you have a file like this:
$ cat file
1,2 3,4
5,6 7,8
and set the field separator while reading a record, it only takes effect from the next line:
$ awk '{FS=","} {print $1}' file
1,2 # FS is still the space!
5
So what you want to do is set FS before starting to read the file; that is, set it in a BEGIN block or via a parameter:
$ awk 'BEGIN{FS=","} {print $1}' file
1,2 # now, FS is the comma
5
$ awk -F, '{print $1}' file
1
5
There is also another way: make Awk recompute the full record with {$0=$0}. With this, Awk will take into account the current FS and act accordingly:
$ awk '{FS=","} {$0=$0;print $1}' file
1
5
The awk statement is used incorrectly. The correct way is:
awk 'BEGIN { FS = "#{delimiter}" } ; { print $1 }'
In your case you can use:
awk 'BEGIN { FS = "_length" } ; { print $1 }'
Built-in variables like FS and ORS must be set within a context, i.e. in one of the following blocks: BEGIN, a condition block, or END.
$ echo 'NODE_1_length_317516_cov_18.568_ID_4005' | awk 'BEGIN{FS="_length"} {print $1}'
NODE_1
$
You can also pass the delimiter using the -F switch, like this:
$ echo 'NODE_1_length_317516_cov_18.568_ID_4005' | awk -F "_length" '{print $1}'
NODE_1
$

Reading from 2 text files one line at a time in UNIX

I have 2 files, file1 and file2. I am trying to read one line from file1 and another line from file2 and insert HTML
tags to make it usable in an HTML file. I have been trying to work with awk with little success. Can someone please help?
File1:
SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem
SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes
File2:
FlatFileConnection.DBConnection_OLAP.SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem.txt
FlatFileConnection.DBConnection_OLAP.SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes.txt
Desired output:
<ParameterFile>
<workflow>SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem</workflow>
<File>FlatFileConnection.DBConnection_OLAP.SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem.txt</File>
<ParameterFile>
<workflow>SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes</workflow>
<File>FlatFileConnection.DBConnection_OLAP.SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes.txt</File>
Using bash:
printItem() { printf "<%s>%s</%s>\n" "$1" "${!1}" "$1"; }
paste file1 file2 |
while read workflow File; do
    echo "<ParameterFile>"
    printItem workflow
    printItem File
done
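As a side note, printItem relies on bash indirect expansion: ${!1} expands the variable whose name is passed in $1. A small sketch of what one call produces (the value here is just the first sample line from the question):
$ workflow="SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem"
$ printItem workflow
<workflow>SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem</workflow>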
With awk, it would be:
awk '
NR==FNR {workflow[FNR]=$1; next}
{
print "<ParameterFile>"
printf "<workflow>%s</workflow>\n", workflow[FNR]
printf "<File>%s</File>\n", $1
}
' file1 file2
Another approach that does not require storing the first file in memory:
awk '{
    print "<ParameterFile>"
    print "<workflow>" $0 "</workflow>"
    getline < "file2"
    print "<File>" $0 "</File>"
}' file1
If you don't mind mixing in some shell:
$ paste -d$'\n' file1 file2 |
awk '{ printf (NR%2 ? "<ParameterFile>\n<workflow>%s</workflow>\n" : "<File>%s</File>\n"), $0 }'
<ParameterFile>
<workflow>SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem</workflow>
<File>FlatFileConnection.DBConnection_OLAP.SILOS.SIL_Stage_GroupAccountNumberDimension_FinStatementItem.txt</File>
<ParameterFile>
<workflow>SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes</workflow>
<File>FlatFileConnection.DBConnection_OLAP.SDE_ORA11510_Adaptor.SDE_ORA_Stage_GLAccountDimension_FinSubCodes.txt</File>
Otherwise, see @GlennJackman's solution for the pure-awk way to do it.

Using each line of awk output as grep pattern

I want to find every line of a file that contains any of the strings held in a column of a different file.
I have tried
grep "$(awk '{ print $1 }' file1.txt)" file2.txt
but that just outputs file2.txt in its entirety.
I know I've done this before with a pattern I found on this site, but I can't find that question anymore.
I see in the OP's comment that maybe the question is no longer a question. However, the following slight modification will handle the blank line situation. Just add a check to make sure the line has at least one field:
grep "$(awk '{if (NF > 0) print $1}' file1)" file2
And if the file with the patterns is simply a set of patterns per line, then a much simpler version of it is:
grep -f file1 file2
That causes grep to use the lines in file1 as the patterns.
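If the entries in that column are plain strings rather than regular expressions, adding -F keeps regex metacharacters such as dots literal; a sketch that also pulls out just the first column, assuming bash for the process substitution:
grep -Ff <(awk '{ print $1 }' file1) file2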
There is no need to use grep when you have awk:
awk 'FNR==NR&&NF{a[$0];next}($1 in a)' file2 file1
awk '{ print $1 }' file1.txt | grep -f - file2.txt > file.txt