ZGrep for first occurrence of pattern *after* given line - awk

so I know that to find the line number of the first occurrence of a pattern in a file I do:
zgrep -n -m 1 "pattern" big_file.txt.gz
But what if I want to skip the first 500K lines?
(I can't decompress the file. It's too large.)

You may use this gzcat | awk command:
gzcat big_file.txt.gz |
awk 'NR > 500000 && /pattern/ {print NR ":" $0; exit}'
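If the offset or pattern changes often, you can pass them in as awk variables instead of hard-coding them (a minimal sketch; the names skip and pat are just illustrative). Once awk executes exit, gzcat is stopped by the broken pipe, so the rest of the archive does not have to be decompressed:
gzcat big_file.txt.gz |
awk -v skip=500000 -v pat="pattern" 'NR > skip && $0 ~ pat {print NR ":" $0; exit}'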

Related

How to improve the speed of this awk script

I have a large file, say file1.log, that looks like this:
1322 a#gmail.com
2411 b#yahoo.com
and a smaller file, say file2.log, that looks like this:
a#gmail.com
c#yahoo.com
In fact, file1.log contains about 6500000 lines and file2.log contains about 140000.
I want to find all lines in file2.log that do not appear in file1.log. I wrote this awk command:
awk 'NR==FNR{c[$2]++} NR!=FNR && c[$1]==0 {print $0}' file1.log file2.log > result.log
After half an hour or so the command was still running, and less result.log showed that result.log was empty.
Is there something I can do to get the job done more quickly?
Hash the smaller file, file2, into memory. Remember the Tao of Programming, 1.3: How could it be otherwise?
$ awk '
NR==FNR { # hash file2 since it's smaller
a[$0]
next
}
($2 in a) { # if file1 entry found in hash
delete a[$2] # remove it
}
END { # in the end
for(i in a) # print the ones that remain in the hash
print i
}' file2 file1 # mind the order
Output:
c#yahoo.com
If you sort the files, you can use comm to print only those lines present in the second file with:
comm -13 <(awk '{ print $2 }' file1.log | sort) <(sort file2.log)
I believe the easiest is just a simple grep pipeline:
grep -Fwof file2 file1 | grep -Fwovf - file2
You can also just extract the second column of file1 and use the last part of the above command again:
awk '{print $2}' file1 | grep -Fwovf - file2
Or everything in a single awk:
awk '(NR==FNR){a[$2]; next}!($1 in a)' file1 file2
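For reference, running that last one-liner on the sample data from the question leaves only the address that never appears in file1.log:
$ awk '(NR==FNR){a[$2]; next}!($1 in a)' file1.log file2.log
c#yahoo.com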

Why does awk not filter the first column in the first line of my files?

I've got a file with the following records:
depots/import/HDN1YYAA_15102018.txt;1;CAB001
depots/import/HDN1YYAA_20102018.txt;2;CLI001
depots/import/HDN1YYAA_20102018.txt;32;CLI001
depots/import/HDN1YYAA_25102018.txt;1;CAB001
depots/import/HDN1YYAA_50102018.txt;1;CAB001
depots/import/HDN1YYAA_65102018.txt;1;CAB001
depots/import/HDN1YYAA_80102018.txt;2;CLI001
depots/import/HDN1YYAA_93102018.txt;2;CLI001
When I execute the following awk one-liner:
cat lignes_en_erreur.txt | awk 'FS=";"{ if(NR==1){print $1}}END {}'
the output is not what I expected:
depots/import/HDN1YYAA_15102018.txt;1;CAB001
while I am supposed to get only the first column.
If I run it through all the records:
cat lignes_en_erreur.txt | awk 'FS=";"{ if(NR>0){print $1}}END {}'
then it starts filtering only from the second line onwards, and I get the following output:
depots/import/HDN1YYAA_15102018.txt;1;CAB001
depots/import/HDN1YYAA_20102018.txt
depots/import/HDN1YYAA_20102018.txt
depots/import/HDN1YYAA_25102018.txt
depots/import/HDN1YYAA_50102018.txt
depots/import/HDN1YYAA_65102018.txt
depots/import/HDN1YYAA_80102018.txt
depots/import/HDN1YYAA_93102018.txt
Does anybody know why awk is skipping the first line only?
I tried deleting the first record, but the behaviour is the same: it skips the first line.
First, it should be
awk 'BEGIN{FS=";"}{ if(NR==1){print $1}}END {}' filename
You can omit the END block if it is empty:
awk 'BEGIN{FS=";"}{ if(NR==1){print $1}}' filename
You can use the -F command line argument to set the field delimiter:
awk -F';' '{if(NR==1){print $1}}' filename
Furthermore, awk programs consist of a sequence of CONDITION [{ACTIONS}] elements, so you can omit the if:
awk -F';' 'NR==1 {print $1}' filename
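On the sample data from the question, that last form prints just the first field of the first record:
$ awk -F';' 'NR==1 {print $1}' lignes_en_erreur.txt
depots/import/HDN1YYAA_15102018.txt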
You need to specify the delimiter either in a BEGIN block or as a command-line option:
awk 'BEGIN{FS=";"}{ if(NR==1){print $1}}'
awk -F ';' '{ if(NR==1){print $1}}'
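As for why the original command only starts splitting at the second line: FS=";" used as a pattern is an assignment expression that is evaluated after the current record has already been split (with the default whitespace FS), so the new separator only takes effect from the next record onwards. A tiny demonstration on throw-away input:
$ printf 'a;b\nc;d\n' | awk 'FS=";" {print $1}'
a;b
c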
cut might be better suited here. For all lines:
$ cut -d';' -f1 file
to skip the first line
$ sed 1d file | cut -d';' -f1
to get the first line only
$ sed 1q file | cut -d';' -f1
however at this point it's better to switch to awk
if you have a large file and are only interested in the first line, it's better to exit early
$ awk -F';' '{print $1; exit}' file

Removing content of a column based on number of occurrences

I have a file (;-separated) with data like this:
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3;000-200.2
211111121;000-000.1;000-000.2
I would like to remove any $3 that has fewer than 3 occurrences, so the outcome would be like:
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3
211111121;000-000.1;000-000.2
That is, only that line's $3 got deleted, as it had only a single occurrence.
Sadly, I am not really sure whether (and thus how) this could be done relatively easily (doing the =COUNT.IF matching and manual deletion in Excel feels quite embarrassing).
$ awk -F';' 'NR==FNR{cnt[$3]++;next} cnt[$3]<3{sub(/;[^;]+$/,"")} 1' file file
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3
211111121;000-000.1;000-000.2
or if you prefer:
$ awk -F';' 'NR==FNR{cnt[$3]++;next} {print (cnt[$3]<3 ? $1 FS $2 : $0)}' file file
This awk one-liner can help; it processes the file twice (the trailing 7 is just a true condition, so the default action, print, runs):
awk -F';' -v OFS=';' 'NR==FNR{a[$3]++;next}a[$3]<3{NF--}7' file file
Though the awk solutions are the best in terms of performance, your goal could also be achieved with something like this:
while IFS=" " read a b; do
  if [[ "$a" -lt "3" ]]; then
    sed -i "s/;$b//" b.txt
  fi
done <<<"$(cut -d";" -f3 b.txt | sort | uniq -c)"
The operation is based on the output of cut | sort | uniq -c, which counts the occurrences:
$ cut -d";" -f3 b.txt | sort | uniq -c
7 000-000.2
1 000-200.2
3 020-000.8
The above edits the source file in place, so keep a backup for testing.
You can feed the file to awk twice. On the first pass you gather statistics that you use in the second pass:
script.awk
FNR == NR { stats[ $3 ]++
            next
          }
{ if( stats[$3] < 3 ) print $1 FS $2
  else print
}
Run it like this: awk -F\; -f script.awk yourfile yourfile.
The condition FNR == NR is true during processing of the first filename given to awk. The next statement skips the second block.
Thus the second block is only used for processing the second filename given to awk (which is here the same as the first filename).

awk printing the second to last record of a file

I have a file set up like
Words on
many line
%
More Words
on many lines
%
Even More Words
on many lines
%
and I would like to output the second to last record of this file where the record is delimited by % after each block of text.
I have used:
awk -v RS=\% ' END{ print NR }' $f
to find the number of records (1136). Then I did
awk -v RS=\% ' { print $(NR-1) }' $f
and
awk -v RS=\% ' { print $(NR=1135) }' $f
Neither of these worked; instead, they displayed a record towards the beginning of the file and many blank lines.
OUTPUT:
"You know, of course, that the Tasmanians, who never committed adultery, are
now extinct."
-- M. Somerset Maugham
"The
is
what
that
This output had many, many more blank lines and contained a record near the middle of the file.
awk -v RS=\% 'END{ print $(NR-1) }' $f
returns a blank line. The same command with different $(NR-x) values also returns a blank line.
Can someone help me to print the second to last record in this case?
Thanks
The $(NR-1) attempts fail because $(expr) selects field number expr of the current record, not an earlier record, which is why you mostly got blank lines. To remember the previous record instead, you can do:
awk '{this=last; last=$0} END{print this}' file
Or, if you don't mind having the entire file in memory:
awk '{a[NR]=$0} END{print a[NR-1]}' file
Or, if it is just line count (or record count) based, you can keep a rolling deletion going so you are not too piggish on memory:
$ seq 999999 | tail -2
999998
999999
$ seq 999999 | awk '{a[NR]=$0; delete a[NR-3]} END{print a[NR-1]}'
999998
If they are blocks of text the same method works if you can separate the blocks into delimited records.
Given:
$ echo "$txt"
Words on
many line
%
More Words
on many lines
%
Even More Words
on many lines
%
You can do:
$ echo "$txt" | awk -v RS=\% '{a[NR]=$0} END{print a[NR-1]}'
Even More Words
on many lines
$ echo "$txt" | awk -v RS=\% '{a[NR]=$0} END{print a[NR-2]}'
More Words
on many lines
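The rolling-deletion trick from above carries over unchanged to %-delimited records (a sketch that keeps only the last three records in memory):
$ echo "$txt" | awk -v RS=\% '{a[NR]=$0; delete a[NR-3]} END{print a[NR-1]}'
Even More Words
on many lines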
If you don't want to print the leading and trailing \n you can do:
$ echo "$txt" | awk 'BEGIN{RS="%\n"} {a[NR]=$0} END{printf a[NR-2]}'
Words on
many line
Finally, if you know the specific record you want to print, do it this way in awk:
$ seq 999999 | awk -v mrk=1135 'NR==mrk{print; exit}'
1135
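The same idea carries over to the %-delimited records from the question; with 1136 records, record 1135 is the second-to-last:
$ awk -v mrk=1135 'BEGIN{RS="%\n"} NR==mrk{print; exit}' file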
If you want a random record, you can do:
$ awk -v min=1 -v max=1135 'BEGIN{srand()
RS="%\n"
tgt=int(min+rand()*(max-min+1))
}
NR==tgt{print; exit}' file
Does the solution have to be with awk? Just using head and tail would be simpler.
tail -n 2 file.txt | head -n 1 > justthatline.txt
The best way for this would be to use the BEGIN construct.
awk 'BEGIN{RS="%\n"; ORS="%\n"}(NR>=2){print}' file
RS and ORS set the input and output record separators, respectively.

Difference of files from Nth line

I am trying to get the difference of two text files. However, the first line can always change. For this reason I was executing these commands from a Python script:
tail -n +2 file1
tail -n +2 file2
Then, to compare, I match the results from the two outputs.
However, I would like to use awk or sed if possible.
What I have found so far is:
awk 'NR == FNR { A[$0]=3; next } !A[$0]' file2 file1
but this compares from the first line.
How can I diff from the second line?
You can use diff together with process substitution:
diff <(tail -n +2 file1) <(tail -n +2 file2)
You can write something like
awk 'NR == FNR { A[$0]=3; next } !A[$0]&&FNR>1' file2 file1
FNR>1: the FNR value is reset to 1 for each file read, so FNR>1 selects all lines from the second line onwards.
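A tiny illustration of NR versus FNR on two throw-away files (names and contents are arbitrary):
$ printf 'a\nb\n' > f1; printf 'c\nd\n' > f2
$ awk '{print FILENAME, NR, FNR}' f1 f2
f1 1 1
f1 2 2
f2 3 1
f2 4 2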
None of the current awk answers will show differences between the files; they simply show whether lines from one file are missing from the other, with no respect to order or number of occurrences.
An awk way that compares line by line.
awk 'NR==FNR{A[FNR]=$0}FNR>1&&!(A[FNR]==$0)' file1 file2
If you want both lines to be output (similar to diff):
awk 'NR==FNR{A[FNR]=$0}
FNR>1&&!(A[FNR]==$0){
print "Line:",FNR"\n"ARGV[1]":"A[FNR]"\n->\n"ARGV[2]":"$0"\n"
}' file file2
Explanation
The first block stores each line of the first file in an array keyed by its file record number (FNR).
The second block checks whether the line in the second file is the same as the line with the same FNR in the first file.
If it isn't, it prints.
The second command is mostly just formatting for the output:
it prints the FNR, the first argument to awk (filename1), the line from the array, an arrow, the second argument to awk (filename2), and the line from file2.
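A quick illustration of the line-by-line comparison on two made-up files:
$ printf 'header A\nsame\nchanged\n' > file1
$ printf 'header B\nsame\nCHANGED\n' > file2
$ awk 'NR==FNR{A[FNR]=$0}FNR>1&&!(A[FNR]==$0)' file1 file2
CHANGED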
In addition to nu11p01n73R's solution, you can always use <(...) for the input files:
awk 'NR == FNR { A[$0]=3; next } !A[$0]' <(tail -n+2 f2) <(tail -n+2 f1)