AWK taking very long to process file - awk

I have a large file with about 6 million records. I need to split this file into smaller files based on the first 17 characters, so that records sharing the same first 17 characters are grouped into a file named after that prefix.
The command I use for this is:
awk -v FIELDWIDTHS="17" '{print > $1".txt"}' $file_name
The problem is that this is painfully slow. For a file with 800K records it took about an hour to complete.
sample input would be :-
AAAAAAAAAAAAAAAAAAAAAAAAAAAA75838458
AAAAAAAAAAAAAAAAAAAAAAAAAAAA48234283
BBBBBBBBBBBBBBBBBBBBBBBBBBBB34723643
AAAAAAAAAAAAAAAAAAAAAAAAAAAA64734987
BBBBBBBBBBBBBBBBBBBBBBBBBBBB18741274
CCCCCCCCCCCCCCCCCCCCCCCCCCCC38123922
Is there a faster solution to this problem?
I read that Perl can also be used to split files, but I couldn't find an option like FIELDWIDTHS in Perl.
Any help will be greatly appreciated.
uname : Linux
bash-4.1$ ulimit -n
1024

sort file |
awk '{out=substr($0,1,17)".txt"} out != prev{close(prev); prev=out} {print > out}'
Performance improvements included:
By not referring to any field, it lets awk skip field splitting.
By sorting first and changing the output file name only when the key part of the input changes, it lets awk use just 1 output file at a time instead of having to manage opening and closing potentially thousands of output files.
It's also portable to all awks, since it doesn't use a gawk-specific extension like FIELDWIDTHS.
If the lines in each output file have to retain their original relative order after sorting then it'd be something like this (assuming no white space in the input just like in the example you provided):
awk '{print substr($0,1,17)".txt", NR, $0}' file |
sort -k1,1 -k2,2n |
awk '$1 != prev{close(prev); prev=$1} {print $3 > $1}'
After borrowing @dawg's script (perl -le 'for (1..120000) {print map { (q(A)..q(Z))[rand(26)] } 1 .. 17} ' | awk '{for (i=1; i<6; i++) printf ("%s%05i\n", $0, i)}' | awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}'| sort -n | cut -c8- > /tmp/test/file - thanks!) to generate the same type of sample input file he has, here are the timings for the above:
$ time sort ../file | awk '{out=substr($0,1,17)".txt"} out != prev{close(prev); prev=out} {print > out}'
real 0m45.709s
user 0m15.124s
sys 0m34.090s
$ time awk '{print substr($0,1,17)".txt", NR, $0}' ../file | sort -k1,1 -k2,2n | awk '$1 != prev{close(prev); prev=$1} {print $3 > $1}'
real 0m49.190s
user 0m11.170s
sys 0m34.046s
And for comparison, here is @dawg's script running on the same machine with the same input. I killed it after it had been running for 14+ minutes:
$ time awk -v FIELDWIDTHS="17" '{of=$1 ".txt"; if (of in seen){ print >>of } else {print >of; seen[of]; } close(of);}' ../file
real 14m23.473s
user 0m7.328s
sys 1m0.296s
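If the input can contain white space, the decorate-sort-undecorate pipeline above can be adapted to use tabs as the delimiter. This is only a sketch under assumptions that are not in the question: the function name split_keep_order and its optional width argument are made up for illustration, the key (the first 17 characters) is assumed to contain no tab, and LC_ALL=C is added so that sort compares bytes rather than using locale collation.

```shell
# Hypothetical helper: split_keep_order FILE [WIDTH]
# Writes each line of FILE to <first WIDTH chars>.txt in the current
# directory, keeping the original relative order inside each output file.
# Assumption: the key itself contains no tab character.
# Stage 1 decorates each line with its key and line number (tab-separated),
# stage 2 groups by key and restores input order within each key,
# stage 3 strips the decoration and keeps only one output file open at a time.
split_keep_order() {
    tab=$(printf '\t')
    awk -v w="${2:-17}" -v OFS='\t' '{ print substr($0, 1, w), NR, $0 }' "$1" |
    LC_ALL=C sort -t "$tab" -k1,1 -k2,2n |
    awk -F'\t' '
        {
            out = $1 ".txt"
            if (out != prev) {
                if (prev != "") close(prev)
                prev = out
            }
            line = $0
            sub(/^[^\t]*\t[^\t]*\t/, "", line)   # drop key and line number
            print line > out
        }'
}
```

For example, split_keep_order "$file_name" would write one .txt file per 17-character prefix into the current directory.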

I created a test file of this form:
% head file
SXXYTTLDCNKRTDIHE00004
QAMKKMCOUHJFSGFFA00001
XGHCCGLVASMIUMVHS00002
MICMHWQSJOKDVGJEO00005
AIDKSTWRVGNMQWCMQ00001
OZQDJAXYWTLXSKAUS00003
XBAUOLWLFVVQSBKKC00005
ULRVFNKZIOWBUGGVL00004
NIXDTLKKNBSUMITOA00003
WVEEALFWNCNLWRAYR00001
% wc -l file
600000 file
i.e., 120,000 different 17-letter prefixes, each with 00001 to 00005 appended, in random order.
If you want a version for yourself, here is that test script:
perl -le 'for (1..120000) {print map { (q(A)..q(Z))[rand(26)] } 1 .. 17} ' | awk '{for (i=1; i<6; i++) printf ("%s%05i\n", $0, i)}' | awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}'| sort -n | cut -c8- > /tmp/test/file
If I run this:
% time awk -v FIELDWIDTHS="17" '{print > $1".txt"}' file
Well I gave up after about 15 minutes.
You can do this instead:
% time awk -v FIELDWIDTHS="17" '{of=$1 ".txt"; if (of in seen){ print >>of } else {print >of; seen[of]; } close(of);}' file
You asked about Perl, and here is a similar program in Perl that is quite fast:
perl -lne '$p=unpack("A17", $_); if ($seen{$p}) { open(fh, ">>", "$p.txt"); print fh $_;} else { open(fh, ">", "$p.txt"); print fh $_; $seen{$p}++; }close fh' file
Here is a little script that compares Ed's awk to these:
#!/bin/bash
# run this in a clean directory Luke!
perl -le 'for (1..12000) {print map { (q(A)..q(Z))[rand(26)] } 1 .. 17} ' |
awk '{for (i=1; i<6; i++) printf ("%s%05i\n", $0, i)}' |
awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' |
sort -n |
cut -c8- > file.txt
wc -l file.txt
#awk -v FIELDWIDTHS="17" '{cnt[$1]++} END{for (e in cnt) print e, cnt[e]}' file
echo "abd awk"
time awk -v FIELDWIDTHS="17" '{of=$1 ".txt"; if (of in seen){ print >>of } else {print >of; seen[of]; } close(of);}' file.txt
echo "abd Perl"
time perl -lne '$p=unpack("A17", $_); if ($seen{$p}) { open(fh, ">>", "$p.txt"); print fh $_;} else { open(fh, ">", "$p.txt"); print fh $_; $seen{$p}++; }close fh' file.txt
echo "Ed 1"
time sort file.txt |
awk '{out=substr($0,1,17)".txt"} out != prev{close(prev); prev=out} {print > out}'
echo "Ed 2"
time sort file.txt | awk '{out=substr($0,1,17)".txt"} out != prev{close(prev); prev=out} {print > out}'
echo "Ed 3"
time awk '{print substr($0,1,17)".txt", NR, $0}' file.txt | sort -k1,1 -k2,2n | awk '$1 != prev{close(prev); prev=$1} {print $3 > $1}'
Which prints:
60000 file.txt
abd awk
real 0m3.058s
user 0m0.329s
sys 0m2.658s
abd Perl
real 0m3.091s
user 0m0.332s
sys 0m2.600s
Ed 1
real 0m1.158s
user 0m0.174s
sys 0m0.992s
Ed 2
real 0m1.069s
user 0m0.175s
sys 0m0.932s
Ed 3
real 0m1.174s
user 0m0.275s
sys 0m0.946s

Related

Pad a file using awk

I have this working on a server with awk 3.1.7, but when I try it on a server with awk 4.0.2 it does not seem to work. The goal is to zero-pad a filename to 14 digits. In case it is brought up, I don't have the Perl version of rename.
Anyhow, this is my code:
ls 800001.1.pull | awk -F'.pull' '{ printf "%s %014s.pull%s\n", $0, $1, $2; }' | xargs -n 2 mv
I get this error mv: ‘800001.1.pull’ and ‘800001.1.pull’ are the same file
ls 800001.1.pull | awk -F'.pull' '{ printf "%s %014s.pull%s\n", $0, $1, $2; }' | xargs -n 2 mv
mv: ‘800001.1.pull’ and ‘800001.1.pull’ are the same file
The same command has been running on the other server for a few years without an error, and there I get a file named:
ls 800001.1.pull | awk -F'.pull' '{ printf "%s %014s.pull%s\n", $0, $1, $2; }' | xargs -n 2 mv
000000800001.1.pull
This is why I ask:
I take it that the newer awk handles delimiters differently?
The expected output is 000000800001.1.pull
The original file needs to be 0 padded to 14.
Thanks!
$ echo 800001.1.pull | awk -F'.pull' '{ new=sprintf("%14s%s",$1,FS); gsub(/ /,0,new); print $0, new }'
800001.1.pull 000000800001.1.pull
$ echo 800001.12.345.pull | awk -F'.pull' '{ new=sprintf("%14s%s",$1,FS); gsub(/ /,0,new); print $0, new }'
800001.12.345.pull 0800001.12.345.pull
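A likely reason for the difference between the two servers is that the 0 flag is only defined for numeric conversions in C-style printf, so %014s may pad with zeros on one awk build and with spaces on another. The sketch below avoids the flag entirely by padding in a loop. The function name pad_name and the default width of 14 are illustrative, and it assumes every name ends in a literal .pull.

```shell
# Hypothetical helper: pad_name NAME [WIDTH]
# Prints NAME with the part before the trailing ".pull" left-padded
# with zeros to WIDTH characters (default 14).
# Assumption: NAME ends in a literal ".pull".
pad_name() {
    printf '%s\n' "$1" | awk -v w="${2:-14}" '{
        base = $0
        sub(/\.pull$/, "", base)                  # strip the literal suffix
        while (length(base) < w) base = "0" base  # pad without relying on %0s
        print base ".pull"
    }'
}
```

A rename could then look like mv -- "$f" "$(pad_name "$f")", which is worth previewing with echo first.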

Need to retrieve a value from an HL7 file using awk

In a Linux script program, I've got the following awk command for other purposes and to rename the file.
cat $edifile | awk -F\| '
{ OFS = "|"
print $0
} ' | tr -d "\012" > $newname.hl7
While this is happening, I'd like to grab the 5th field of the MSH segment and save it for later use in the script. Is this possible?
If no, how could I do it later or earlier on?
Example of the segment.
MSH|^~\&|business1|business2|/u/tmp/TR0049-GE-1.b64|routing|201811302126||ORU^R01|20181130212105810|D|2.3
What I want to do is retrieve the path and file name in MSH 5 and concatenate it to the end of the new file.
I've used this to capture the data but no luck. If fpth is getting set, there is no evidence of it and I don't have the right syntax for an echo within the awk phrase.
cat $edifile | awk -F\| '
{ OFS = "|"
{fpth=$(5)}
print $0
} ' | tr -d "\012" > $newname.hl7
any suggestions?
Thank you!
Try
filename=`awk -F'|' '{print $5}' $edifile | head -1`
You can skip the piping through head if the file is a single line
First of all, note that the awk command in your first piece of code has no effect:
$ cat $edifile | awk -F\| ' { OFS = "|"; print $0 }' | tr -d "\012" > $newname.hl7
This is totally equivalent to
$ cat $edifile | tr -d "\012" > $newname.hl7
because OFS is only used to rebuild $0 when you assign to a field.
Example:
$ echo "a|b|c" | awk -F\| '{OFS="/"; print $0}'
a|b|c
$ echo "a|b|c" | awk -F\| '{OFS="/"; $1=$1; print $0}'
a/b/c
I understand that you have an HL7 file containing a single line that starts with the string "MSH", and that you want to store the 5th field of that line. This can be done as follows:
fpth=$(awk -v outputfile="${newname}.hl7" '
BEGIN{FS="|"; ORS="" }
($1 == "MSH"){ print $5 }
{ print $0 > outputfile }' $edifile)
I have set ORS to an empty string, which is equivalent to tr -d "\012". The above works well if your file contains only a single MSH segment.
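If only the path is needed and the joined copy is produced separately, a smaller sketch is to stop at the first MSH line. The function name msh_field5 is made up for this example, and it assumes segments are one per line and separated into fields by |.

```shell
# Hypothetical helper: msh_field5 FILE
# Prints field 5 of the first line whose first field is exactly "MSH",
# then stops reading so later MSH segments are ignored.
msh_field5() {
    awk -F'|' '$1 == "MSH" { print $5; exit }' "$1"
}
```

In the script this could be used as fpth=$(msh_field5 "$edifile").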

awk split with asterix

I am trying to split a variable as follows. Is there an efficient way to do this, preferably using awk?
echo 262146*10,69636*32 |awk -F, 'split($1, DCAP,"\\*") {print DCAP[1]}; split($2, DCAP,"\\*"){print DCAP[1]}'
echo '262146*10,69636*32' | awk -F '[,*]' '{print $1; print $3}'
or
echo '262146*10,69636*32' | awk -F '[,*]' '{printf("%d\n%d\n",$1,$3)}'
Output:
262146
69636
If you have a longer sequence you could try:
echo '262146*10,69636*32,10*3' | awk 'BEGIN {FS="*"; RS=","} {print $1}'
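For a list of any length, the same result can be had with split() on one record. This is a sketch only: the function name first_factors is invented here, and the bracket expression [*] is used so that the asterisk is taken literally regardless of how a given awk treats a single-character separator.

```shell
# Hypothetical helper: first_factors 'A*B,C*D,...'
# Prints the part before the "*" of every comma-separated pair, one per line.
first_factors() {
    printf '%s\n' "$1" | awk '{
        n = split($0, pairs, ",")
        for (i = 1; i <= n; i++) {
            split(pairs[i], parts, "[*]")   # [*] matches a literal asterisk
            print parts[1]
        }
    }'
}
```

The argument should be quoted when calling it, so that the shell does not try to expand the asterisks as a glob.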

How to grep the outputs of awk, line by line?

Let's say I have the following text file:
$ cat file1.txt outputs
MarkerName Allele1 Allele2 Freq1 FreqSE P-value Chr Pos
rs2326918 a g 0.8510 0.0001 0.5255 6 130881784
rs2439906 c g 0.0316 0.0039 0.8997 10 6870306
rs10760160 a c 0.5289 0.0191 0.8107 9 123043147
rs977590 a g 0.9354 0.0023 0.8757 7 34415290
rs17278013 t g 0.7498 0.0067 0.3595 14 24783304
rs7852050 a g 0.8814 0.0006 0.7671 9 9151167
rs7323548 a g 0.0432 0.0032 0.4555 13 112320879
rs12364336 a g 0.8720 0.0015 0.4542 11 99515186
rs12562373 a g 0.7548 0.0020 0.6151 1 164634379
Here is an awk command which prints MarkerName if Pos >= 11000000
$ awk '{ if($8 >= 11000000) { print $1 }}' file1.txt
This command outputs the following:
MarkerName
rs2326918
rs10760160
rs977590
rs17278013
rs7323548
rs12364336
rs12562373
Question: I would like to feed this into a grep statement to parse another text file, textfile2.txt. Somehow, one pipes the output from the previous awk command into grep AWKOUTPUT textfile2.txt
I would like each row of the awk command above to be grepped against textfile2.txt, i.e.
grep "rs2326918" textfile2.txt
## and then
grep "rs10760160" textfile2.txt
### and then
...
Naturally, I would save all resulting rows from textfile2.txt into a final file, i.e.
$ awk '{ if($8 >= 11000000) { print $1 }}' file1.txt | grep PIPE_OUTPUT_BY_ROW textfile2.txt > final.txt
How does one grep from a pipe line by line?
EDIT: To clarify, the one constraint I have is that file1.txt is actually the output of a previous pipe. (I'm trying to simplify the question somewhat.) How would that change the answer?
awk + grep solution:
grep -f <(awk '$8 >= 11000000{ print $1 }' file1.txt) textfile2.txt > final.txt
-f file - obtain patterns from file, one per line
You can use bash to do this:
bash-3.1$ echo "rs2326918" > filename2.txt
bash-3.1$ (for i in `awk '{ if($8 >= 11000000) { print $1 }}' file1.txt |
grep -v MarkerName`; do grep $i filename2.txt; done) > final.txt
bash-3.1$ cat final.txt
rs2326918
Alternatively,
bash-3.1$ cat file1.txt | (for i in `awk '{ if($8 >= 11000000) { print $1 }}' |
grep -v MarkerName`; do grep $i filename2.txt; done) > final.txt
The switch grep -v tells grep to reverse its usual activity and print all lines that do not match the pattern. This switch "inVerts" the match.
awk alone can do this for you:
$ awk 'NR>1 && NR==FNR {if ($8 >= 11000000) a[$1]++;next} \
{ for(i in a){if($0~i) print}}' file1.txt textfile2.txt > final.txt
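Regarding the EDIT in the question: when file1.txt is really the output of an earlier pipe, awk can read it from standard input by naming - as the first file. The sketch below is one way to do that, under assumptions that are not in the question: the function name filter_markers is made up, and the marker name is assumed to be the first field of the second file, so lines are matched exactly on that field rather than by regex anywhere in the line. The pass variable is set on the command line between the two inputs so that the two stages stay distinct even if the piped input is empty.

```shell
# Hypothetical helper: filter_markers MIN_POS TARGET_FILE
# Reads the marker table (with its header line) on stdin, then prints the
# lines of TARGET_FILE whose first field is a marker with Pos >= MIN_POS.
# Assumption: the marker name is the first field in TARGET_FILE.
filter_markers() {
    awk -v min="$1" '
        pass == 1 { if (FNR > 1 && $8 + 0 >= min + 0) want[$1]; next }
        ($1 in want)
    ' pass=1 - pass=2 "$2"
}
```

Usage would be along the lines of previous_command | filter_markers 11000000 textfile2.txt > final.txt.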

Tab separated values in awk

How do I select the first column from the TAB separated string?
# echo "LOAD_SETTLED LOAD_INIT 2011-01-13 03:50:01" | awk -F'\t' '{print $1}'
The above will return the entire line and not just "LOAD_SETTLED" as expected.
Update:
I need to change the third column in the tab separated values.
The following does not work.
echo $line | awk 'BEGIN { -v var="$mycol_new" FS = "[ \t]+" } ; { print $1 $2 var $4 $5 $6 $7 $8 $9 }' >> /pdump/temp.txt
This however works as expected if the separator is comma instead of tab.
echo $line | awk -v var="$mycol_new" -F'\t' '{print $1 "," $2 "," var "," $4 "," $5 "," $6 "," $7 "," $8 "," $9}' >> /pdump/temp.txt
You need to set the OFS variable (output field separator) to be a tab:
echo "$line" |
awk -v var="$mycol_new" -F'\t' 'BEGIN {OFS = FS} {$3 = var; print}'
(make sure you quote the $line variable in the echo statement)
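The quoting matters because an unquoted $line is word-split by the shell, so echo rejoins the pieces with single spaces and the tabs are gone before awk ever sees them. Here is a minimal sketch of the same idea as a reusable function. The name replace_field is made up for illustration, and note that awk -v interprets backslash escapes in the value it is given.

```shell
# Hypothetical helper: replace_field N VALUE
# Reads tab-separated lines on stdin and prints them with field N set to
# VALUE, keeping tab as the output separator.
replace_field() {
    awk -v n="$1" -v val="$2" 'BEGIN { FS = OFS = "\t" } { $n = val; print }'
}
```

With the variables from the question this would be printf '%s\n' "$line" | replace_field 3 "$mycol_new" >> /pdump/temp.txt.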
Make sure they're really tabs! In bash, you can insert a tab using C-v TAB
$ echo "LOAD_SETTLED LOAD_INIT 2011-01-13 03:50:01" | awk -F$'\t' '{print $1}'
LOAD_SETTLED
Use:
awk -v FS='\t' -v OFS='\t' ...
Example from one of my scripts.
I use the FS and OFS variables to manipulate BIND zone files, which are tab delimited:
awk -v FS='\t' -v OFS='\t' \
-v record_type="$record_type" \
-v hostname="$hostname" \
-v ip_address="$ip_address" '
$1==hostname && $3==record_type {$4=ip_address}
{print}
' "$zone_file" > "$temp"
This is a clean and easy to read way to do this.
You can set the Field Separator:
... | awk 'BEGIN {FS="\t"}; {print $1}'
Excellent read:
https://docs.freebsd.org/info/gawk/gawk.info.Field_Separators.html
echo "LOAD_SETTLED LOAD_INIT 2011-01-13 03:50:01" | awk -v var="test" 'BEGIN { FS = "[ \t]+" } ; { print $1 "\t" var "\t" $3 }'
If your fields are separated by tabs - this works for me in Linux.
awk -F'\t' '{print $1}' < tab_delimited_file.txt
I use this to process data generated by mysql, which generates tab-separated output in batch mode.
From awk man page:
-F fs
--field-separator fs
Use fs for the input field separator (the value of the FS predefined variable).
1st column only
— awk NF=1 FS='\t'
LOAD_SETTLED
First 3 columns
— awk NF=3 FS='\t' OFS='\t'
LOAD_SETTLED LOAD_INIT 2011-01-13
Except first 2 columns
— {g,n}awk NF=NF OFS= FS='^([^\t]+\t){2}'
— {m}awk NF=NF OFS= FS='^[^\t]+\t[^\t]+\t'
2011-01-13 03:50:01
Last column only
— awk '($!NF=$NF)^_' FS='\t', or
— awk NF=NF OFS= FS='^.*\t'
03:50:01
Should this not work?
echo "LOAD_SETTLED LOAD_INIT 2011-01-13 03:50:01" | awk '{print $1}'