Internal piping with awk - awk

Let's say I have this input line:
input:
{x:y} abc det uyt llu
How can I process it to get the expected output?
output:
{x:y} abc%det%uyt%llu
The question is how to concatenate fields 2 through the end of the line and, in that string, replace each space with %.
The field separator is a space.
I need the first part, {x:y}, left unchanged, with the piping applied only to fields 2 through the end of the line.

Here is another awk approach:
awk '{$1=$1;sub(/%/," ")}1' OFS="%" file
echo '{x:y} abc det uyt llu' | awk '{$1=$1;sub(/%/," ")}1' OFS="%"
{x:y} abc%det%uyt%llu
This changes every space to % by setting OFS and forcing the record to be rebuilt with $1=$1, then changes the first % back to a space.

You can use this awk:
s='{x:y} abc det uyt llu'
awk '{printf "%s%s", $1, OFS; for (i=2; i<=NF; i++) printf "%s%s", $i, (i==NF)?RS:"%"}' <<< "$s"
{x:y} abc%det%uyt%llu
Another awk:
awk '{printf "%s%s", $1, OFS; OFS="%"; $1=""; print substr($0, 2)}' <<< "$s"
{x:y} abc%det%uyt%llu
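A further variation, sketched here rather than taken from the answers above, leaves OFS alone: it cuts the line at the first space with index() and substr(), then runs gsub() on the remainder only. The variable names head and tail are made up for illustration, and the sketch assumes a single space follows the first field.

```shell
s='{x:y} abc det uyt llu'
out=$(printf '%s\n' "$s" | awk '{
    p = index($0, " ")                 # position of the first space, 0 if none
    if (p == 0) { print; next }        # single-field line: print unchanged
    head = substr($0, 1, p - 1)        # the fixed first part
    tail = substr($0, p + 1)           # everything after the first space
    gsub(/ /, "%", tail)               # replace every space in the rest
    print head " " tail
}')
printf '%s\n' "$out"
```

For the sample line this should print {x:y} abc%det%uyt%llu. Unlike the field-based versions, a run of several spaces comes out as a run of %, because gsub() works on the raw text rather than on fields.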

Related

AWK taking very long to process file

I have a large file with about 6 million records. I need to split this file into smaller files based on the first 17 characters, so that records sharing the same first 17 characters end up in one file named after those characters.
The command I use for this is:
awk -v FIELDWIDTHS="17" '{print > $1".txt"}' $file_name
The problem is that this is painfully slow. For a file with 800K records it took about an hour to complete.
Sample input would be:
AAAAAAAAAAAAAAAAAAAAAAAAAAAA75838458
AAAAAAAAAAAAAAAAAAAAAAAAAAAA48234283
BBBBBBBBBBBBBBBBBBBBBBBBBBBB34723643
AAAAAAAAAAAAAAAAAAAAAAAAAAAA64734987
BBBBBBBBBBBBBBBBBBBBBBBBBBBB18741274
CCCCCCCCCCCCCCCCCCCCCCCCCCCC38123922
Is there a faster solution to this problem?
I read that Perl can also be used to split files, but I couldn't find an option like FIELDWIDTHS in Perl.
Any help will be greatly appreciated.
uname : Linux
bash-4.1$ ulimit -n
1024
sort file |
awk '{out=substr($0,1,17)".txt"} out != prev{close(prev); prev=out} {print > out}'
Performance improvements include:
By not referring to any field, it lets awk skip field splitting.
By sorting first and changing the output file name only when the key part of the input changes, it lets awk use just 1 output file at a time instead of having to manage opening/closing potentially thousands of output files.
It's also portable to all awks, since it doesn't use gawk-specific extensions like FIELDWIDTHS.
If the lines in each output file have to retain their original relative order after sorting then it'd be something like this (assuming no white space in the input just like in the example you provided):
awk '{print substr($0,1,17)".txt", NR, $0}' file |
sort -k1,1 -k2,2n |
awk '$1 != prev{close(prev); prev=$1} {print $3 > $1}'
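As a small check of the order-preserving variant, the sketch below runs the same three-stage pipeline on four made-up records in a scratch directory. The keys (built with printf '%017d'), the suffixes zzz/aaa/mmm/kkk, and the names workdir, key1, key2 and order are all hypothetical, chosen only so that a plain sort would visibly reorder the lines.

```shell
workdir=$(mktemp -d)                       # scratch directory for the demo
key1=$(printf '%017d' 1)                   # two made-up 17-character keys
key2=$(printf '%017d' 2)
printf '%s\n' "${key1}zzz" "${key1}aaa" "${key2}mmm" "${key1}kkk" > "$workdir/input"

# decorate each line with "output-name line-number", sort by name and then
# by line number, and strip the decoration again while writing the files
awk '{print substr($0,1,17)".txt", NR, $0}' "$workdir/input" |
sort -k1,1 -k2,2n |
(cd "$workdir" && awk '$1 != prev{close(prev); prev=$1} {print $3 > $1}')

# suffixes of the records that ended up in the first key's file, in file order
order=$(cut -c18- "$workdir/$key1.txt" | paste -sd' ' -)
printf '%s\n' "$order"
```

The suffixes should come out as zzz aaa kkk, which is their original relative order; the plain sort | awk version would store them as aaa kkk zzz.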
After borrowing @dawg's script (perl -le 'for (1..120000) {print map { (q(A)..q(Z))[rand(26)] } 1 .. 17} ' | awk '{for (i=1; i<6; i++) printf ("%s%05i\n", $0, i)}' | awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}'| sort -n | cut -c8- > /tmp/test/file - thanks!) to generate the same type of sample input file he has, here are the timings for the above:
$ time sort ../file | awk '{out=substr($0,1,17)".txt"} out != prev{close(prev); prev=out} {print > out}'
real 0m45.709s
user 0m15.124s
sys 0m34.090s
$ time awk '{print substr($0,1,17)".txt", NR, $0}' ../file | sort -k1,1 -k2,2n | awk '$1 != prev{close(prev); prev=$1} {print $3 > $1}'
real 0m49.190s
user 0m11.170s
sys 0m34.046s
and for @dawg's, for comparison, running on the same machine as the above with the same input ... I killed it after it had been running for 14+ minutes:
$ time awk -v FIELDWIDTHS="17" '{of=$1 ".txt"; if (of in seen){ print >>of } else {print >of; seen[of]; } close(of);}' ../file
real 14m23.473s
user 0m7.328s
sys 1m0.296s
I created a test file of this form:
% head file
SXXYTTLDCNKRTDIHE00004
QAMKKMCOUHJFSGFFA00001
XGHCCGLVASMIUMVHS00002
MICMHWQSJOKDVGJEO00005
AIDKSTWRVGNMQWCMQ00001
OZQDJAXYWTLXSKAUS00003
XBAUOLWLFVVQSBKKC00005
ULRVFNKZIOWBUGGVL00004
NIXDTLKKNBSUMITOA00003
WVEEALFWNCNLWRAYR00001
% wc -l file
600000 file
i.e., 120,000 different 17-letter prefixes, each with 00001 through 00005 appended, in random order.
If you want a version for yourself, here is that test script:
perl -le 'for (1..120000) {print map { (q(A)..q(Z))[rand(26)] } 1 .. 17} ' | awk '{for (i=1; i<6; i++) printf ("%s%05i\n", $0, i)}' | awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}'| sort -n | cut -c8- > /tmp/test/file
If I run this:
% time awk -v FIELDWIDTHS="17" '{print > $1".txt"}' file
Well I gave up after about 15 minutes.
You can do this instead:
% time awk -v FIELDWIDTHS="17" '{of=$1 ".txt"; if (of in seen){ print >>of } else {print >of; seen[of]; } close(of);}' file
You asked about Perl, and here is a similar program in Perl that is quite fast:
perl -lne '$p=unpack("A17", $_); if ($seen{$p}) { open(fh, ">>", "$p.txt"); } else { open(fh, ">", "$p.txt"); $seen{$p}++; } print fh $_; close fh' file
Here is a little script that compares Ed's awk to these:
#!/bin/bash
# run this in a clean directory Luke!
perl -le 'for (1..12000) {print map { (q(A)..q(Z))[rand(26)] } 1 .. 17} ' |
awk '{for (i=1; i<6; i++) printf ("%s%05i\n", $0, i)}' |
awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' |
sort -n |
cut -c8- > file.txt
wc -l file.txt
#awk -v FIELDWIDTHS="17" '{cnt[$1]++} END{for (e in cnt) print e, cnt[e]}' file
echo "abd awk"
time awk -v FIELDWIDTHS="17" '{of=$1 ".txt"; if (of in seen){ print >>of } else {print >of; seen[of]; } close(of);}' file.txt
echo "abd Perl"
time perl -lne '$p=unpack("A17", $_); if ($seen{$p}) { open(fh, ">>", "$p.txt"); } else { open(fh, ">", "$p.txt"); $seen{$p}++; } print fh $_; close fh' file.txt
echo "Ed 1"
time sort file.txt |
awk '{out=substr($0,1,17)".txt"} out != prev{close(prev); prev=out} {print > out}'
echo "Ed 2"
time sort file.txt | awk '{out=substr($0,1,17)".txt"} out != prev{close(prev); prev=out} {print > out}'
echo "Ed 3"
time awk '{print substr($0,1,17)".txt", NR, $0}' file.txt | sort -k1,1 -k2,2n | awk '$1 != prev{close(prev); prev=$1} {print $3 > $1}'
Which prints:
60000 file.txt
abd awk
real 0m3.058s
user 0m0.329s
sys 0m2.658s
abd Perl
real 0m3.091s
user 0m0.332s
sys 0m2.600s
Ed 1
real 0m1.158s
user 0m0.174s
sys 0m0.992s
Ed 2
real 0m1.069s
user 0m0.175s
sys 0m0.932s
Ed 3
real 0m1.174s
user 0m0.275s
sys 0m0.946s

Understanding how OFS works in AWK

This is a follow-up to my question to understand more about the OFS in AWK.
My understanding is, set it once in the beginning and it will be used in "print" to separate the fields. However, it didn't work as expected, as explained in my original question.
My File: someone.txt
LN_A,FN_A<aa#xyz.com>;
LN_B,FN_B<bb#xyz.com>;
Expected output:
FN_A,LN_A,aa
FN_B,LN_B,bb
I have tried the following:
awk -F'[,<#]' -v OFS=',' '{print $2 $1 $3}' someone.txt
awk -F'[,<#]' -v OFS=',' 'NF=3 {print $2 $1 $3}' someone.txt
awk -F'[,<#]' -v OFS=',' 'NF=3; {print $2 $1 $3}' someone.txt
awk -F'[,<#]' -v OFS=',' '{$1=$1} {print $2 $1 $3}' someone.txt
awk -F'[,<#]' -v OFS=',' '{$1=$1} {print $0}' someone.txt
Finally, I managed to get the required output with the following:
awk -F'[,<#]' '{print $2 "," $1 "," $3}' someone.txt
Consider these cases:
a) $ echo '1 2 3' | awk '{print}'
1 2 3
b) $ echo '1 2 3' | awk '{print $1, $2, $3}'
1 2 3
c) $ echo '1 2 3' | awk -v OFS=',' '{print}'
1 2 3
d) $ echo '1 2 3' | awk -v OFS=',' '{print $1, $2, $3}'
1,2,3
e) $ echo '1 2 3' | awk -v OFS=',' '{$1=$1; print}'
1,2,3
The above show OFS being used in "b" and "d" (when individual fields are being printed in a comma-separated list) and in "e" (when the record $0 is being reconstructed as a result of a value being assigned to a field before the record is printed).
Those are the only 2 times when OFS is used implicitly - when printing a comma-separated list of values and when reconstructing the record.
When you print the record (e.g. by print or print $0) as in "a" and "c" above or print any other string you are not using OFS. OFS may have been used earlier to reconstruct the record as in "e" above but the act of printing anything that's not a comma-separated list is not using OFS, it's just printing any old string which just happens to be $0 in this case.
Note:
1. Explicitly changing a field reconstructs $0 from the existing fields using OFS between the fields; it does not resplit $0 into fields again, so FS is not used in this process. So $1=$1 or sub(/1/,2,$1) uses OFS but not FS.
2. Explicitly changing $0 (i.e. not implicitly as a result of 1 above) resplits $0 into fields using FS as the separator; it does not use OFS in any way. So $0=$0 or sub(/1/,2) uses FS but not OFS.
Understanding how FS and OFS work together and how they affect assignments to fields and $0 is very important. If you can explain this behavior then you've got it:
f) $ echo 'a b' | awk -v OFS=',' '{print NF, $0, $1, $2}'
2,a b,a,b
g) $ echo 'a b' | awk -v OFS=',' '{$1=$1; print NF, $0, $1, $2}'
2,a,b,a,b
h) $ echo 'a b' | awk -v OFS=',' '{$1=$1; $0=$0; print NF, $0, $1, $2}'
1,a,b,a,b,
i) $ echo 'a b' | awk -v OFS=',' '{$1=$1; $0=$0; FS=OFS; print NF, $0, $1, $2}'
1,a,b,a,b,
j) $ echo 'a b' | awk -v OFS=',' '{$1=$1; $0=$0; FS=OFS; $1=$1; print NF, $0, $1, $2}'
1,a,b,a,b,
k) $ echo 'a b' | awk -v OFS=',' '{$1=$1; $0=$0; FS=OFS; $1=$1; $0=$0; print NF, $0, $1, $2}'
2,a,b,a,b
If not then feel free to ask questions.
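One way to work through cases "f" to "k" is to print NF and $0 after every step of the last case. The sketch below does that; the step labels are made up for illustration, and the comments give my reading of what each assignment does.

```shell
trace=$(echo 'a b' | awk -v OFS=',' '{
    printf "initial: NF=%d record=%s\n", NF, $0        # split on blanks
    $1 = $1                                            # rebuild $0 with OFS
    printf "rebuild: NF=%d record=%s\n", NF, $0
    $0 = $0                                            # resplit using FS=" "
    printf "resplit: NF=%d record=%s\n", NF, $0
    FS = OFS                                           # affects later splits only
    printf "FS=OFS: NF=%d record=%s\n", NF, $0
    $0 = $0                                            # resplit using FS=","
    printf "resplit again: NF=%d record=%s\n", NF, $0
}')
printf '%s\n' "$trace"
```

The field count should drop to 1 after the first $0=$0, because "a,b" contains no blank to split on, and return to 2 only once $0 is assigned again after FS has been set to the comma.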
It is simple: you set OFS="," at the start of your awk command, but you then print the fields without either modifying the record or separating the fields with commas in the print statement. In that case OFS never comes into play, which is why your output has no separator.
awk -F'[,<#]' -v OFS=',' '{print $2,$1,$3}' Input_file
If you use the command above, where I put a , between the fields being printed, OFS appears in the output. That is how it works.
If you want to see OFS in use another way, you could use this (the solution above is the better one; I am adding this only to aid understanding):
awk -F'[,<#]' -v OFS=',' '{$0=$2 OFS $1 OFS $3} 1' Input_file
Example to understand OFS by printing whole lines: let us see the difference more clearly by printing whole lines with and without the effect of OFS.
Let us run this code:
awk -F'[,<#]' -v OFS=',' 'FNR==1{$1=$1} 1' Input_file
On line number 1, I reassign $1 to itself, as mentioned above, so that OFS comes into play: the record is rebuilt with OFS wherever a field separator was matched. This happens only for the first line; the rest of the lines are left untouched. Let us see what the output looks like:
LN_A,FN_A,aa,xyz.com>;
LN_B,FN_B<bb#xyz.com>;
See the difference? The first line has , in the output and the second line is printed as is, because only on the first line did we modify the first field, which brought OFS into play.
As I just found an unused copy of Aho, Kernighan, Weinberger: The AWK Programming language from 1988, I(t)'ll take you to the source (pages 35-36):
"Field Variables. The fields of the current input line are called $1, $2,
through $NF; $0 refers to the whole line. Fields share the properties of other
variables — they may be used in arithmetic or string operations, and may be
assigned to. - -
One can assign a new string to a field:
BEGIN { FS = OFS = "\t" }
$4 == "North America" { $4 = "NA" }
$4 == "South America" { $4 = "SA" }
{ print }
In this program, the BEGIN action sets FS, the variable that controls the input
field separator, and OFS, the output field separator, both to a tab. The print
statement in the fourth line prints the value of $0 after it has been modified by
previous assignments. This is important: when $0 is changed by assignment or
substitution, $1, $2, etc., and NF will be recomputed; likewise, when one of $1, $2, etc., is changed, $0 is reconstructed using OFS to separate fields."

awk to format file using a specific order

I am trying to format a tab-delimited file using awk; the command runs but produces no output. The output should also be tab-delimited. The format of the output is $1 $2 $2 $3 REF=$4;OBS=$5 $6. Maybe awk is not the best approach, although it seems like it should work. Thank you :).
file (~370 lines all in the below format)
chr4 70501545 rs28560191 C A UGT2A1;UGT2A2
desired output
chr4 70501545 70501545 rs28560191 REF=C;OBS=A UGT2A1;UGT2A2
awk
awk -F'\t' -v OFS='\t' '{print $1,$2,$2,$3,"REF="$4";""OBS="$5,$6}' file
You are forgetting the print statement.
awk '{ print $1 "\t" $2 "\t" $2 "\t" $3 "\t" "REF="$4";""OBS="$5 "\t" $6}' file
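To test the formatting in isolation, one option is to feed awk a single line built with printf, so that the separators are known to be real tabs. The sketch below reuses the sample record from the question; the variable name formatted is made up for illustration.

```shell
formatted=$(printf 'chr4\t70501545\trs28560191\tC\tA\tUGT2A1;UGT2A2\n' |
    awk -F'\t' -v OFS='\t' '{
        # $2 is printed twice; REF/OBS are built by string concatenation
        print $1, $2, $2, $3, "REF=" $4 ";OBS=" $5, $6
    }')
printf '%s\n' "$formatted"
```

If this prints the desired line but the same program prints nothing useful for the real file, the difference probably lies in the file (for example in its separators) rather than in the awk program.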

Print default value if index is not in awk array

$ cat file1 #It contains ID:Name
5:John
4:Michel
$ cat file2 #It contains ID
5
4
3
I want to replace the IDs in file2 with the names from file1. Required output:
John
Michel
NO MATCH FOUND
I need to extend the code below so that it prints the NO MATCH FOUND text.
awk -F":" 'NR==FNR {a[$1]=$2;next} {print a[$1]}' file1 file2
My current result:
John
Michel
<< empty line
Thanks,
You can use a ternary operator for this: print ($1 in a)?a[$1]:"NO MATCH FOUND". That is, if $1 is an index of the array, print the value stored under it; otherwise, print the text "NO MATCH FOUND".
All together:
$ awk -F":" 'NR==FNR {a[$1]=$2;next} {print ($1 in a)?a[$1]:"NO MATCH FOUND"}' f1 f2
John
Michel
NO MATCH FOUND
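A self-contained way to try this is sketched below. It recreates the two sample files in a scratch directory (the names tmp and names are made up) and wraps the whole ternary in one extra pair of parentheses, which I believe is the safer spelling because some awk implementations may otherwise treat print (...) as a complete statement.

```shell
tmp=$(mktemp -d)                               # scratch directory for the demo
printf '5:John\n4:Michel\n' > "$tmp/file1"     # ID:Name pairs
printf '5\n4\n3\n' > "$tmp/file2"              # IDs to look up

names=$(awk -F':' '
    NR==FNR { a[$1] = $2; next }               # first file: remember ID -> name
    { print (($1 in a) ? a[$1] : "NO MATCH FOUND") }
' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$names"
```

Testing with ($1 in a) rather than a[$1] == "" also avoids creating empty array entries for IDs that are missing.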
You can test whether the index occurs in the array:
$ awk -F":" 'NR==FNR {a[$1]=$2;next} $1 in a {print a[$1]; next} {print "NOT FOUND"}' file1 file2
John
Michel
NOT FOUND
If file2 contains only the ID on each line (no trailing space):
awk -F ':' '$1 in A {print A[$1];next}{if($2~/^$/) print "NOT FOUND";else A[$1]=$2}' file1 file2
If not:
awk -F '[:[:blank:]]' '$1 in A {print A[$1];next}{if($2~/^$/) print "NOT FOUND";else A[$1]=$2}' file1 file2

Tab separated values in awk

How do I select the first column from the TAB separated string?
# echo "LOAD_SETTLED LOAD_INIT 2011-01-13 03:50:01" | awk -F'\t' '{print $1}'
The above will return the entire line and not just "LOAD_SETTLED" as expected.
Update:
I need to change the third column in the tab separated values.
The following does not work.
echo $line | awk 'BEGIN { -v var="$mycol_new" FS = "[ \t]+" } ; { print $1 $2 var $4 $5 $6 $7 $8 $9 }' >> /pdump/temp.txt
This however works as expected if the separator is comma instead of tab.
echo $line | awk -v var="$mycol_new" -F'\t' '{print $1 "," $2 "," var "," $4 "," $5 "," $6 "," $7 "," $8 "," $9}' >> /pdump/temp.txt
You need to set the OFS variable (output field separator) to be a tab:
echo "$line" |
awk -v var="$mycol_new" -F'\t' 'BEGIN {OFS = FS} {$3 = var; print}'
(make sure you quote the $line variable in the echo statement)
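A quick way to check this is sketched below, with a record built by printf so that the separators are definitely tabs. The four-column sample row and the replacement value 2012-02-14 are assumptions made for the demo, not data from the question.

```shell
line=$(printf 'LOAD_SETTLED\tLOAD_INIT\t2011-01-13\t03:50:01')   # hypothetical row
mycol_new='2012-02-14'                                           # hypothetical new value

changed=$(printf '%s\n' "$line" |
    awk -v var="$mycol_new" -F'\t' '
        BEGIN { OFS = FS }        # write tabs back out when $0 is rebuilt
        { $3 = var; print }       # assigning a field rebuilds $0 with OFS
    ')
printf '%s\n' "$changed"
```

Without the BEGIN block the assignment to $3 would still happen, but the record would be rebuilt with the default OFS, a single space, and the tabs would be lost.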
Make sure they're really tabs! In bash, you can insert a literal tab by pressing Ctrl-V and then TAB:
$ echo "LOAD_SETTLED LOAD_INIT 2011-01-13 03:50:01" | awk -F$'\t' '{print $1}'
LOAD_SETTLED
Use:
awk -v FS='\t' -v OFS='\t' ...
Example from one of my scripts.
I use the FS and OFS variables to manipulate BIND zone files, which are tab delimited:
awk -v FS='\t' -v OFS='\t' \
-v record_type="$record_type" \
-v hostname="$hostname" \
-v ip_address="$ip_address" '
$1==hostname && $3==record_type {$4=ip_address}
{print}
' "$zone_file" > "$temp"
This is a clean and easy to read way to do this.
You can set the Field Separator:
... | awk 'BEGIN {FS="\t"}; {print $1}'
Excellent read:
https://docs.freebsd.org/info/gawk/gawk.info.Field_Separators.html
echo "LOAD_SETTLED LOAD_INIT 2011-01-13 03:50:01" | awk -v var="test" 'BEGIN { FS = "[ \t]+" } ; { print $1 "\t" var "\t" $3 }'
If your fields are separated by tabs - this works for me in Linux.
awk -F'\t' '{print $1}' < tab_delimited_file.txt
I use this to process data generated by mysql, which generates tab-separated output in batch mode.
From awk man page:
-F fs
--field-separator fs
Use fs for the input field separator (the value of the FS predefined variable).
1st column only
— awk NF=1 FS='\t'
LOAD_SETTLED
First 3 columns
— awk NF=3 FS='\t' OFS='\t'
LOAD_SETTLED LOAD_INIT 2011-01-13
Except first 2 columns
— {g,n}awk NF=NF OFS= FS='^([^\t]+\t){2}'
— {m}awk NF=NF OFS= FS='^[^\t]+\t[^\t]+\t'
2011-01-13 03:50:01
Last column only
— awk '($!NF=$NF)^_' FS='\t', or
— awk NF=NF OFS= FS='^.*\t'
03:50:01
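The one-liners above are very terse. As far as I can tell, in a command such as awk NF=1 FS='\t' the first argument (NF=1) is the whole awk program, a pattern whose side effect truncates the record, while FS='\t' is a variable assignment given as an operand. The sketch below spells out more conventional equivalents. The four-column sample row is an assumption, and first, first3, rest and last are names made up for the demo.

```shell
row=$(printf 'LOAD_SETTLED\tLOAD_INIT\t2011-01-13\t03:50:01')   # assumed sample row

# 1st column only
first=$(printf '%s\n' "$row" | awk -F'\t' '{print $1}')

# first 3 columns, joined with tabs again
first3=$(printf '%s\n' "$row" | awk -F'\t' -v OFS='\t' '{print $1, $2, $3}')

# everything except the first 2 columns
rest=$(printf '%s\n' "$row" |
    awk -F'\t' -v OFS='\t' '{for (i = 3; i <= NF; i++) printf "%s%s", $i, (i < NF ? OFS : ORS)}')

# last column only
last=$(printf '%s\n' "$row" | awk -F'\t' '{print $NF}')

printf '%s\n' "$first" "$first3" "$rest" "$last"
```

These are longer than the NF-assignment versions but do not depend on how a particular awk rebuilds the record when NF is assigned.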
Should this not work?
echo "LOAD_SETTLED LOAD_INIT 2011-01-13 03:50:01" | awk '{print $1}'