Pad a file using awk

So I have this working on a server with awk 3.1.7, but when I try it on a server with awk 4.0.2 it does not seem to work. The goal is just to zero-pad the filename to 14 digits. I don't have the Perl version of rename, in case that is brought up.
Anyhow, this is my code:
ls 800001.1.pull | awk -F'.pull' '{ printf "%s %014s.pull%s\n", $0, $1, $2; }' | xargs -n 2 mv
I get this error:
mv: ‘800001.1.pull’ and ‘800001.1.pull’ are the same file
This same thing has been running on the other server for a few years, and there I don't get an error; I get a file named:
ls 800001.1.pull | awk -F'.pull' '{ printf "%s %014s.pull%s\n", $0, $1, $2; }' | xargs -n 2 mv
000000800001.1.pull
and this is why I said this:
I take it that the newer awk handles delimiters differently?
The expected output is 000000800001.1.pull
The original filename needs to be zero-padded to 14 characters.
Thanks!

The likely culprit is printf rather than the field splitting: the 0 flag is undefined for %s, and newer gawk ignores it there, so %014s pads with spaces instead of zeros; xargs then splits on those spaces and hands mv the same name twice, hence the "same file" error. Build the padding explicitly instead:
$ echo 800001.1.pull | awk -F'.pull' '{ new=sprintf("%14s%s",$1,FS); gsub(/ /,0,new); print $0, new }'
800001.1.pull 000000800001.1.pull
$ echo 800001.12.345.pull | awk -F'.pull' '{ new=sprintf("%14s%s",$1,FS); gsub(/ /,0,new); print $0, new }'
800001.12.345.pull 0800001.12.345.pull
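If the goal is just the rename, a plain shell loop sidesteps the ls | xargs word-splitting trap entirely. A minimal sketch, assuming bash and that every file to rename ends in .pull:
for f in *.pull; do
    stem=${f%.pull}                  # strip the .pull suffix
    printf -v new '%14s' "$stem"     # right-align the stem in a 14-character field
    new=${new// /0}.pull             # turn the pad spaces into zeros, restore the suffix
    [ "$f" = "$new" ] || mv -- "$f" "$new"
done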

Related

AWK taking very long to process file

I have a large file with about 6 million records. I need to chunk this file into smaller files based on the first 17 characters: records whose first 17 characters are the same will be grouped into a file of the same name.
The command I use for this is:
awk -v FIELDWIDTHS="17" '{print > $1".txt"}' $file_name
The problem is that this is painfully slow. For a file with 800K records it took about an hour to complete.
Sample input would be:
AAAAAAAAAAAAAAAAAAAAAAAAAAAA75838458
AAAAAAAAAAAAAAAAAAAAAAAAAAAA48234283
BBBBBBBBBBBBBBBBBBBBBBBBBBBB34723643
AAAAAAAAAAAAAAAAAAAAAAAAAAAA64734987
BBBBBBBBBBBBBBBBBBBBBBBBBBBB18741274
CCCCCCCCCCCCCCCCCCCCCCCCCCCC38123922
Is there a faster solution to this problem?
I read that Perl can also be used to split files, but I couldn't find an option like FIELDWIDTHS in Perl.
Any help will be greatly appreciated.
uname : Linux
bash-4.1$ ulimit -n
1024
sort file |
awk '{out=substr($0,1,17)".txt"} out != prev{close(prev); prev=out} {print > out}'
Performance improvements include:
By not referring to any field, it lets awk skip field splitting entirely.
By sorting first and changing output file names only when the key part of the input changes, awk only needs one output file open at a time instead of having to manage opening/closing potentially thousands of output files.
And it's portable to all awks, since it uses no gawk-specific extensions like FIELDWIDTHS.
(A commented version of the one-liner follows below.)
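Here it is, expanded and annotated; identical code, just commented:
sort file |
awk '
    { out = substr($0, 1, 17) ".txt" }  # output name derived from the 17-char key
    out != prev {                       # key changed: the previous group is finished,
        close(prev)                     # so release its file descriptor
        prev = out
    }
    { print > out }                     # at most one output file open at a time
'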
If the lines in each output file have to retain their original relative order after sorting, then it'd be something like this (assuming no whitespace in the input, just like in the example you provided):
awk '{print substr($0,1,17)".txt", NR, $0}' file |
sort -k1,1 -k2,2n |
awk '$1 != prev{close(prev); prev=$1} {print $3 > $1}'
After borrowing @dawg's script (perl -le 'for (1..120000) {print map { (q(A)..q(Z))[rand(26)] } 1 .. 17} ' | awk '{for (i=1; i<6; i++) printf ("%s%05i\n", $0, i)}' | awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' | sort -n | cut -c8- > /tmp/test/file - thanks!) to generate the same type of sample input file he used, here are the timings for the above:
$ time sort ../file | awk '{out=substr($0,1,17)".txt"} out != prev{close(prev); prev=out} {print > out}'
real 0m45.709s
user 0m15.124s
sys 0m34.090s
$ time awk '{print substr($0,1,17)".txt", NR, $0}' ../file | sort -k1,1 -k2,2n | awk '$1 != prev{close(prev); prev=$1} {print $3 > $1}'
real 0m49.190s
user 0m11.170s
sys 0m34.046s
and, for comparison, @dawg's version running on the same machine as the above with the same input; I killed it after it had been running for 14+ minutes:
$ time awk -v FIELDWIDTHS="17" '{of=$1 ".txt"; if (of in seen){ print >>of } else {print >of; seen[of]; } close(of);}' ../file
real 14m23.473s
user 0m7.328s
sys 1m0.296s
I created a test file of this form:
% head file
SXXYTTLDCNKRTDIHE00004
QAMKKMCOUHJFSGFFA00001
XGHCCGLVASMIUMVHS00002
MICMHWQSJOKDVGJEO00005
AIDKSTWRVGNMQWCMQ00001
OZQDJAXYWTLXSKAUS00003
XBAUOLWLFVVQSBKKC00005
ULRVFNKZIOWBUGGVL00004
NIXDTLKKNBSUMITOA00003
WVEEALFWNCNLWRAYR00001
% wc -l file
600000 file
i.e., 120,000 different 17-letter prefixes, each with 00001 through 00005 appended in random order.
If you want a version for yourself, here is that test script:
perl -le 'for (1..120000) {print map { (q(A)..q(Z))[rand(26)] } 1 .. 17} ' | awk '{for (i=1; i<6; i++) printf ("%s%05i\n", $0, i)}' | awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}'| sort -n | cut -c8- > /tmp/test/file
If I run this:
% time awk -v FIELDWIDTHS="17" '{print > $1".txt"}' file
Well I gave up after about 15 minutes.
You can do this instead:
% time awk -v FIELDWIDTHS="17" '{of=$1 ".txt"; if (of in seen){ print >>of } else {print >of; seen[of]; } close(of);}' file
You asked about Perl, and here is a similar program in Perl that is quite fast:
perl -lne '$p=unpack("A17", $_); if ($seen{$p}) { open(fh, ">>", "$p.txt"); print fh $_;} else { open(fh, ">", "$p.txt"); print fh $_; $seen{$p}++; } close fh' file
Here is a little script that compares Ed's awk to these:
#!/bin/bash
# run this in a clean directory Luke!
# each line ends with | so bash continues the pipeline onto the next line
perl -le 'for (1..12000) {print map { (q(A)..q(Z))[rand(26)] } 1 .. 17} ' |
awk '{for (i=1; i<6; i++) printf ("%s%05i\n", $0, i)}' |
awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' |
sort -n |
cut -c8- > file.txt
wc -l file.txt
#awk -v FIELDWIDTHS="17" '{cnt[$1]++} END{for (e in cnt) print e, cnt[e]}' file
echo "abd awk"
time awk -v FIELDWIDTHS="17" '{of=$1 ".txt"; if (of in seen){ print >>of } else {print >of; seen[of]; } close(of);}' file.txt
echo "abd Perl"
time perl -lne '$p=unpack("A17", $_); if ($seen{$p}) { open(fh, ">>", "$p.txt"); print fh $_;} else { open(fh, ">", "$p.txt"); print fh $_; $seen{$p}++; } close fh' file.txt
echo "Ed 1"
time sort file.txt |
awk '{out=substr($0,1,17)".txt"} out != prev{close(prev); prev=out} {print > out}'
echo "Ed 2"
time sort file.txt | awk '{out=substr($0,1,17)".txt"} out != prev{close(prev); prev=out} {print > out}'
echo "Ed 3"
time awk '{print substr($0,1,17)".txt", NR, $0}' file.txt | sort -k1,1 -k2,2n | awk '$1 != prev{close(prev); prev=$1} {print $3 > $1}'
Which prints:
60000 file.txt
abd awk
real 0m3.058s
user 0m0.329s
sys 0m2.658s
abd Perl
real 0m3.091s
user 0m0.332s
sys 0m2.600s
Ed 1
real 0m1.158s
user 0m0.174s
sys 0m0.992s
Ed 2
real 0m1.069s
user 0m0.175s
sys 0m0.932s
Ed 3
real 0m1.174s
user 0m0.275s
sys 0m0.946s

Need to retrieve a value from an HL7 file using awk

In a Linux script, I've got the following awk command, used for other purposes and to rename the file.
cat $edifile | awk -F\| '
{ OFS = "|"
print $0
} ' | tr -d "\012" > $newname.hl7
While this is happening, I'd like to grab the 5th field of the MSH segment and save it for later use in the script. Is this possible?
If not, how could I do it earlier or later in the script?
Example of the segment:
MSH|^~\&|business1|business2|/u/tmp/TR0049-GE-1.b64|routing|201811302126||ORU^R01|20181130212105810|D|2.3
What I want to do is retrieve the path and file name in MSH 5 and concatenate it to the end of the new file.
I've used this to capture the data, but no luck. If fpth is getting set, there is no evidence of it, and I don't have the right syntax for an echo within the awk block.
cat $edifile | awk -F\| '
{ OFS = "|"
{fpth=$(5)}
print $0
} ' | tr -d "\012" > $newname.hl7
Any suggestions?
Thank you!
Try
filename=`awk -F'|' '{print $5}' $edifile | head -1`
You can skip the piping through head if the file is a single line. Note that a variable assigned inside awk exists only in the awk process; to get a value out to the shell, you have to print it and capture the command's output, as above.
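If MSH is not guaranteed to be the first line (often it is, but that's an assumption), match the segment explicitly and stop at the first hit:
filename=$(awk -F'|' '$1 == "MSH" { print $5; exit }' "$edifile")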
First of all, it must be mentioned that the awk line in your first piece of code serves no purpose:
$ cat $edifile | awk -F\| ' { OFS = "|"; print $0 }' | tr -d "\012" > $newname.hl7
This is totally equivalent to
$ cat $edifile | tr -d "\012" > $newname.hl7
because OFS is only applied when $0 is rebuilt, which happens when you assign to a field.
Example:
$ echo "a|b|c" | awk -F\| '{OFS="/"; print $0}'
a|b|c
$ echo "a|b|c" | awk -F\| '{OFS="/"; $1=$1; print $0}'
a/b/c
I understand that you have an HL7 file in which a single line starts with the string "MSH". From this line you want to store the 5th field; this is achieved in the following way:
fpth=$(awk -v outputfile="${newname}.hl7" '
BEGIN{FS="|"; ORS="" }
($1 == "MSH"){ print $5 }
{ print $0 > outputfile }' $edifile)
I have set ORS to the empty string, as that is equivalent to tr -d "\012". The above will work very nicely if you only have a single MSH segment in your file.
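Since the goal was to concatenate the MSH-5 path onto the end of the new file, a hypothetical follow-up once awk has finished could be as simple as:
printf '%s' "$fpth" >> "${newname}.hl7"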

awk split with asterisk

I am trying to split a variable as follows. Is there any efficient way to do this, preferably using awk?
echo 262146*10,69636*32 |awk -F, 'split($1, DCAP,"\\*") {print DCAP[1]}; split($2, DCAP,"\\*"){print DCAP[1]}'
echo '262146*10,69636*32' | awk -F '[,*]' '{print $1; print $3}'
or
echo '262146*10,69636*32' | awk -F '[,*]' '{printf("%d\n%d\n",$1,$3)}'
Output:
262146
69636
If you have a longer sequence, you could try (quoting the argument so the shell cannot glob-expand the asterisks):
echo '262146*10,69636*32,10*3' | awk 'BEGIN {FS="*"; RS=","} {print $1}'
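For completeness, the split() approach from the question also works once the calls are moved out of the pattern position and into the action; "\\*" escapes the asterisk so it is treated as a literal character rather than a regex quantifier:
echo '262146*10,69636*32' | awk -F, '{ split($1, a, "\\*"); print a[1]; split($2, b, "\\*"); print b[1] }'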

Convert a decimal data field to hexadecimal using sed or awk

Can someone correct this command to get the desired output?
Input: "1|2|30|4"
echo "1|2|30|4" | awk -F, -v OFS=| '{print $1,$2; printf "%04X", $3; print $4}'
Expected output:
1|2|001E|4
Best regards.
$ echo "1|2|30|4" |
awk -F'|' -v OFS='|' '{print $1, $2, sprintf("%X", $3), $4}'
1|2|1E|4
(Use "%04X" instead of "%X" if you want the zero-padded 001E.)
echo "1|2|30|4" | awk -F"|" '{printf "%s|%s|%04X|%s", $1, $2, $3, $4}'
Output:
1|2|001E|4
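A quick sanity check of the conversion, assuming bash (its printf understands 0x-prefixed input):
$ printf '%d\n' 0x001E
30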

Tab separated values in awk

How do I select the first column from the TAB separated string?
# echo "LOAD_SETTLED LOAD_INIT 2011-01-13 03:50:01" | awk -F'\t' '{print $1}'
The above will return the entire line and not just "LOAD_SETTLED" as expected.
Update:
I need to change the third column in the tab separated values.
The following does not work.
echo $line | awk 'BEGIN { -v var="$mycol_new" FS = "[ \t]+" } ; { print $1 $2 var $4 $5 $6 $7 $8 $9 }' >> /pdump/temp.txt
This however works as expected if the separator is comma instead of tab.
echo $line | awk -v var="$mycol_new" -F'\t' '{print $1 "," $2 "," var "," $4 "," $5 "," $6 "," $7 "," $8 "," $9 }' >> /pdump/temp.txt
You need to set the OFS variable (output field separator) to be a tab:
echo "$line" |
awk -v var="$mycol_new" -F'\t' 'BEGIN {OFS = FS} {$3 = var; print}'
(make sure you quote the $line variable in the echo statement)
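A quick way to convince yourself the tabs survive the rebuild (hypothetical three-field input):
$ printf 'a\tb\tc\n' | awk -F'\t' 'BEGIN {OFS = FS} {$2 = "X"; print}'
a	X	c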
Make sure they're really tabs! In bash, you can insert a tab using C-v TAB
$ echo "LOAD_SETTLED LOAD_INIT 2011-01-13 03:50:01" | awk -F$'\t' '{print $1}'
LOAD_SETTLED
Use:
awk -v FS='\t' -v OFS='\t' ...
Example from one of my scripts.
I use the FS and OFS variables to manipulate BIND zone files, which are tab delimited:
awk -v FS='\t' -v OFS='\t' \
-v record_type=$record_type \
-v hostname=$hostname \
-v ip_address=$ip_address '
$1==hostname && $3==record_type {$4=ip_address}
{print}
' $zone_file > $temp
This is a clean and easy-to-read way to do this.
You can set the Field Separator:
... | awk 'BEGIN {FS="\t"}; {print $1}'
Excellent read:
https://docs.freebsd.org/info/gawk/gawk.info.Field_Separators.html
echo "LOAD_SETTLED LOAD_INIT 2011-01-13 03:50:01" | awk -v var="test" 'BEGIN { FS = "[ \t]+" } ; { print $1 "\t" var "\t" $3 }'
If your fields are separated by tabs, this works for me in Linux:
awk -F'\t' '{print $1}' < tab_delimited_file.txt
I use this to process data generated by mysql, which generates tab-separated output in batch mode.
From awk man page:
-F fs
--field-separator fs
Use fs for the input field separator (the value of the FS predefined variable).
1st column only
— awk NF=1 FS='\t'
LOAD_SETTLED
First 3 columns
— awk NF=3 FS='\t' OFS='\t'
LOAD_SETTLED LOAD_INIT 2011-01-13
Except first 2 columns
— {g,n}awk NF=NF OFS= FS='^([^\t]+\t){2}'
— {m}awk NF=NF OFS= FS='^[^\t]+\t[^\t]+\t'
2011-01-13 03:50:01
Last column only
— awk '($!NF=$NF)^_' FS='\t', or
— awk NF=NF OFS= FS='^.*\t'
03:50:01
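For anyone puzzled by the assignment-as-program style above: in awk NF=1 FS='\t', the program text is just NF=1 (assigning to NF rebuilds $0 with a single field and yields a true value, so the default print fires), while FS='\t' is a command-line variable assignment. A more conventional spelling of the first one:
awk 'BEGIN {FS = "\t"} {NF = 1; print}'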
Should this not work?
echo "LOAD_SETTLED LOAD_INIT 2011-01-13 03:50:01" | awk '{print $1}'