Using awk to split a record into multiple fields

I have a file with records that are not separated by any delimiter. A sample is shared below:
XXXXXYYYYZZZ
XXXXXYYYYZZZ
XXXXXYYYYZZZ
XXXXXYYYYZZZ
XXXXXYYYYZZZ
I have been given a DDL for the file such that field 1 lies in positions 1-5, field 2 in positions 6-9, and field 3 in positions 10-12.
How can I use an awk command to print the below output?
field1,field2,field3
XXXXX,YYYY,ZZZ
XXXXX,YYYY,ZZZ
XXXXX,YYYY,ZZZ
XXXXX,YYYY,ZZZ
XXXXX,YYYY,ZZZ

In GNU awk, using FIELDWIDTHS:
$ awk '
BEGIN {
    FIELDWIDTHS="5 4 3"                # here you state the field widths
    OFS=","                            # output field separator
    print "field1","field2","field3"   # print the header in BEGIN
}
{
    print $1,$2,$3                     # print the first 3 fields; you could also use {$1=$1; print} or even {$1=$1}1
}
' file
field1,field2,field3
XXXXX,YYYY,ZZZ
XXXXX,YYYY,ZZZ
XXXXX,YYYY,ZZZ
XXXXX,YYYY,ZZZ
XXXXX,YYYY,ZZZ
If you don't have GNU awk, use f1=substr($0,1,5); f2=substr($0,6,4); ... print f1,f2,f3.
Edit:
$ awk '
BEGIN {
    OFS=","
    print "field1","field2","field3"
}
{
    f1=substr($0,1,5)
    f2=substr($0,6,4)
    f3=substr($0,10,3)
    print f1,f2,f3
}
' file
The latter as a one-liner with semicolons inserted:
$ awk 'BEGIN {OFS=","; print "field1","field2","field3"}{f1=substr($0,1,5); f2=substr($0,6,4); f3=substr($0,10,3); print f1,f2,f3}' file
The former as a one-liner:
$ awk 'BEGIN{FIELDWIDTHS="5 4 3"; OFS=","; print "field1","field2","field3"}{print $1,$2,$3}' file
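For a portable variant that works in any POSIX awk, the same fixed-width split can be driven by a widths string and substr() in a loop. This is a sketch: the widths string "5 4 3" mirrors the DDL above, and the sample input is piped in for demonstration instead of reading a file.

```shell
printf 'XXXXXYYYYZZZ\nXXXXXYYYYZZZ\n' |
awk -v widths="5 4 3" '
BEGIN {
    n = split(widths, w, " ")          # w[1..n] holds the field widths
    OFS = ","
    print "field1", "field2", "field3"
}
{
    pos = 1
    line = ""
    for (i = 1; i <= n; i++) {         # carve the record width by width
        line = line (i > 1 ? OFS : "") substr($0, pos, w[i])
        pos += w[i]
    }
    print line
}'
```

This generalizes to any number of fields: change the widths string and the header line to match.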

This might work for you (GNU sed):
sed -e '1i\field1,field2,field3' -e 's/[^,]/,&/6;s//,&/10' file
The first substitution inserts a comma before the 6th non-comma character; the empty regex in the second reuses the same pattern to insert another comma before what is then the 10th non-comma character.

Without using substr(), here's a not-so-elegant awk way to get around it:
echo 'XXXXXYYYYZZZ
XXXXXYYYYZZZ
XXXXXYYYYZZZ
XXXXXYYYYZZZ
XXXXXYYYYZZZ' |
mawk 'BEGIN { OFS = ","
print (__ = "field")(++_), (__)(++_), (__)(++_)
_ = (_ = (_ = ".")_)_
__ = "&,"
} sub("." _,__) + sub("," _,__)^_'
field1,field2,field3
XXXXX,YYYY,ZZZ
XXXXX,YYYY,ZZZ
XXXXX,YYYY,ZZZ
XXXXX,YYYY,ZZZ
XXXXX,YYYY,ZZZ
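For readers puzzled by the golfed version above: it builds a four-dot regex and uses two sub() calls to inject the commas. An equivalent plain-awk sketch of the same idea (sample input piped in for demonstration):

```shell
printf 'XXXXXYYYYZZZ\nXXXXXYYYYZZZ\n' |
awk 'BEGIN { print "field1,field2,field3" }   # header
     {
       sub(/^...../, "&,")   # insert a comma after the first 5 characters
       sub(/,..../, "&,")    # and after the next 4 (the comma anchors the match)
       print
     }'
```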

Related

Compare and print 2 columns from 2 files in awk or perl

I have 2 files with 2 million lines each.
I need to compare 2 columns in 2 different files, and I want to print the lines of the 2 files where there are equal items.
This awk code works, but it does not print lines from the 2 files:
awk 'NR == FNR {a[$3]; next}$3 in a' file1.txt file2.txt
file1.txt
0001 00000001 084010800001080
0001 00000010 041140000100004
file2.txt
2451 00000009 401208008004000
2451 00000010 084010800001080
desired output:
file1[$1]-file2[$1] file1[$2]-file2[$2] $3 (same in both files)
0001-2451 00000001-00000010 084010800001080
How can I do this in awk or perl?
Assuming your $3 values are unique within each input file as shown in your sample input/output:
$ cat tst.awk
NR==FNR {
    foos[$3] = $1
    bars[$3] = $2
    next
}
$3 in foos {
    print foos[$3] "-" $1, bars[$3] "-" $2, $3
}
$ awk -f tst.awk file1.txt file2.txt
0001-2451 00000001-00000010 084010800001080
I named the arrays foos[] and bars[] as I don't know what the first 2 columns of your input actually represent - choose a more meaningful name.
With your shown samples, please try the following awk code. Fair warning:
I haven't tested it yet with millions of lines.
awk '
FNR == NR{
    arr1[$3]=$0
    next
}
($3 in arr1){
    split(arr1[$3],arr2)
    print (arr2[1]"-"$1,arr2[2]"-"$2,$3)
    delete arr2
}
' file1.txt file2.txt
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR == NR{ ##checking condition which will be TRUE when first Input_file is being read.
arr1[$3]=$0 ##Creating arr1 array indexed by $3, with the whole line ($0) as its value.
next ##next will skip all further statements from here.
}
($3 in arr1){ ##checking if $3 is present in arr1 then do following.
split(arr1[$3],arr2) ##Splitting value of arr1 into arr2.
print (arr2[1]"-"$1,arr2[2]"-"$2,$3) ##printing values as per requirement of OP.
delete arr2 ##Deleting arr2 array here.
}
' file1.txt file2.txt ##Mentioning Input_file names here.
If you have two massive files, you may want to use sort, join and awk to produce your output without having to have the first file mostly in memory.
Based on your example, this pipe would do that (after the join, $1 is the key and $2..$5 are the remaining fields of file1 and then file2):
join -1 3 -2 3 <(sort -k3 -n file1) <(sort -k3 -n file2) | awk '{printf("%s-%s %s-%s %s\n",$2,$4,$3,$5,$1)}'
Prints:
0001-2451 00000001-00000010 084010800001080
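To see why the printf picks $2, $4, $3 and $5 around $1, it helps to look at the raw join output: the join field comes first, followed by the remaining fields of the file1 line and then of the file2 line. A quick sketch with the sample data (the <(...) process substitutions require bash):

```shell
# recreate the sample files from the question
printf '0001 00000001 084010800001080\n0001 00000010 041140000100004\n' > file1
printf '2451 00000009 401208008004000\n2451 00000010 084010800001080\n' > file2

join -1 3 -2 3 <(sort -k3 -n file1) <(sort -k3 -n file2)
# → 084010800001080 0001 00000001 2451 00000010
```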
If your files are that big, you might want to avoid storing the data in memory. The trade-off is a whole lot of comparisons: 2 million lines times 2 million lines = 4 * 10^12 comparisons.
use strict;
use warnings;
use feature 'say';

my $file1 = shift;
my $file2 = shift;

open my $fh1, "<", $file1 or die "Cannot open '$file1': $!";
while (<$fh1>) {
    my @F = split;
    # for each line of file1, file2 is reopened and read again
    open my $fh2, "<", $file2 or die "Cannot open '$file2': $!";
    while (my $cmp = <$fh2>) {
        my @C = split ' ', $cmp;
        if ($F[2] eq $C[2]) {    # check string equality
            say "$F[0]-$C[0] $F[1]-$C[1] $F[2]";
        }
    }
}
With your rather limited test set, I get the following output:
0001-2451 00000001-00000010 084010800001080
Python: tested with 2,000,000 rows in each file
d = {}
with open('1.txt', 'r') as f1, open('2.txt', 'r') as f2:
    for line in f1:
        if not line: break
        c0, c1, c2 = line.split()
        d[c2] = (c0, c1)
    for line in f2:
        if not line: break
        c0, c1, c2 = line.split()
        if c2 in d:
            print("{}-{} {}-{} {}".format(d[c2][0], c0, d[c2][1], c1, c2))
$ time python3 comapre.py
1001-2001 10000001-20000001 224010800001084
1042-2013 10000042-20000013 224010800001096
real 0m3.555s
user 0m3.234s
sys 0m0.321s

join 2 files with different number of rows

Good morning,
I have 2 files and I want to join them.
I am using awk, but I can use other commands in bash.
The problem is that when I try it with awk, some records that are not in both files do not appear in the final file.
file1
supply_DBReplication, 27336
test_after_upgrade, 0
test_describe_topic, 0
teste2e_funcional, 0
test_latency, 0
test_replication, 0
ticket_dl, 90010356798
ticket_dl.replica_cloudera, 0
traza_auditoria_eventos, 0
Ezequiel1,473789563
Ezequiel2,526210437
Ezequiel3,1000000000
file2
Domimio2,supply_bdsupply-stock-valorado-sherpa
Domimio8,supply_DBReplication
Domimio9,test_after_upgrade
Domimio7,test_describe_topic
Domimio3,teste2e_funcional
,test_latency
,test_replication
,ticket_dl
,ticket_dl.replica_cloudera
,traza_auditoria_eventos
And I wish:
file3
Domimio2,0
Domimio8,27336
Domimio9,0
Domimio7,0
Domimio3,0
NoDomain,0
NoDomain,0
NoDomain,90010356798
NoDomain,0
NoDomain,0
NoDomain,473789563
NoDomain,526210437
NoDomain,1000000000
I am executing this
awk 'NR==FNR {T[$1]=FS $2; next} {print $1 T[$2]}' FS="," file1 file2
But I got:
Domimio2, 0
Domimio8, 27336
Domimio9, 0
Domimio7, 0
Domimio3, 0
, 0
, 0
, 90010356798
, 0
, 23034
, 0
How can I do it?
Thank you
Assumptions:
join criteria: file1.field#1 == file2.field#2
output format: file2.field#1 , file1.field#2
file2 - if field#1 is blank then replace with NoDomain
file2.field#2 - if no match in file1.field#1 then output file2.field#1 + 0
file1.field#1 - if no match in file2.field#2 then output NoDomain + file1.field#2 (sorted by field#2 values)
One GNU awk idea:
awk '
BEGIN { FS=OFS="," }
NR==FNR { gsub(" ","",$2) # strip blanks from field #2
a[$1]=$2
next
}
{ $1 = ($1 == "") ? "NoDomain" : $1 # if file2.field#1 is missing then set to "NoDomain"
print $1,a[$2]+0
delete a[$2] # delete file1 entry so we do not print again in the END{} block
}
END { PROCINFO["sorted_in"]="#val_num_asc" # any entries leftover from file1 (ie, no matches) then sort by value and ...
for (i in a)
print "NoDomain",a[i] # print to stdout
}
' file1 file2
NOTE: GNU awk is required for the use of PROCINFO["sorted_in"]; if sorting of the file1 leftovers is not required then PROCINFO["sorted_in"]="#val_num_asc" can be removed from the code
This generates:
Domimio2,0
Domimio8,27336
Domimio9,0
Domimio7,0
Domimio3,0
NoDomain,0
NoDomain,0
NoDomain,90010356798
NoDomain,0
NoDomain,0
NoDomain,473789563
NoDomain,526210437
NoDomain,1000000000
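If GNU awk isn't available, a portable sketch of the same idea pipes the leftover file1 entries through an external sort in the END block instead of using PROCINFO["sorted_in"] (assumption: your awk has fflush(), which gawk, mawk and BWK awk all provide; abbreviated sample files are created inline for demonstration):

```shell
# abbreviated versions of the question's file1 and file2
printf 'supply_DBReplication, 27336\ntest_latency, 0\nEzequiel1,473789563\n' > file1
printf 'Domimio8,supply_DBReplication\n,test_latency\n' > file2

awk '
BEGIN { FS=OFS="," }
NR==FNR { gsub(" ","",$2); a[$1]=$2; next }   # file1: key -> value
{
    print ($1=="" ? "NoDomain" : $1), a[$2]+0
    delete a[$2]                              # leftovers are printed in END
}
END {
    fflush()                                  # flush matched lines before the sort pipe writes
    for (i in a) print "NoDomain", a[i] | "sort -t, -k2,2n"
}
' file1 file2
```

With the full file1/file2 from the question this reproduces the file3 output shown above.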

print part of a field with awk

I run:
ss -atp | grep -vi state | awk '{ print $2" "$3" "$4" "$5" "$6 }'
output:
0 0 192.168.1.14:49254 92.222.106.156:http users:(("firefox-esr",pid=696,fd=95))
From the last column, I want to strip everything but firefox-esr (in this case); more precisely, I want to fetch only what's between the double quotes.
I have tried:
ss -atp | grep -vi state | awk '{ sub(/users\:\(\("/,"",$6); print $2" "$3" "$4" "$5" "$6 }'
0 0 192.168.1.14:49254 92.222.106.156:http firefox-esr",pid=696,fd=95))
There is still the last part to strip; the problem is that the pid and fd are not a constant value and keep changing.
You might harness gensub's backreference ability for that. For simplicity, let file.txt's content be
users:(("firefox-esr",pid=696,fd=95))
then
awk '{print gensub(/.*"(.+)".*/,"\\1",1,$1)}' file.txt
outputs:
firefox-esr
Keep in mind that gensub does not alter the string it gets as its 4th argument but returns a new string, so I print it.
You can use
awk '{ gsub(/^[^\"]*\"|\".*/, "", $6); print $2" "$3" "$4" "$5" "$6 }'
Here, gsub(/^[^\"]*\"|\".*/, "", $6) will take Field 6 as input, and remove all chars from start till the first " including it (see the ^[^\"]*\" part) and then the next " and all text after it (using \".*).
See this online awk demo:
s='0 0 0 192.168.1.14:49254 92.222.106.156:http users:(("firefox-esr",pid=696,fd=95))'
awk '{gsub(/^[^\"]*\"|\".*/, "",$6); print $2" "$3" "$4" "$5" "$6 }' <<< "$s"
# => 0 0 192.168.1.14:49254 92.222.106.156:http firefox-esr

Awk output formatting

I have 2 .po files where some words have 2 different meanings,
and I want to use awk to turn them into some kind of translator.
For example
in .po file 1
msgid "example"
msgstr "something"
in .po file 2
msgid "example"
msgstr "somethingelse"
I came up with this
awk -F'"' 'match($2, /^example$/) {printf "%s", $2": ";getline; printf "%s", $2}' file1.po file2.po
The output will be
example:something example:somethingelse
How do I make it into this kind of format
example : something, somethingelse.
Reformatting
example:something example:somethingelse
into
example : something, somethingelse
can be done with this one-liner:
awk -F":| " -v OFS="," '{printf "%s:", $1; for (i=1;i<=NF;i++) if (i % 2 == 0)printf("%s%s%s", ((i==2)?"":OFS), $i, ((i==NF)?"\n":""))}'
Testing:
$ echo "example:something example:somethingelse example:something3 example:something4" | \
  awk -F":| " -v OFS="," '{ \
    printf "%s:", $1; \
    for (i=1;i<=NF;i++) \
      if (i % 2 == 0) \
        printf("%s%s%s", ((i==2)?"":OFS), $i, ((i==NF)?"\n":""))}'
example:something,somethingelse,something3,something4
Explanation:
$ cat tst.awk
BEGIN{FS=":| ";OFS=","} # define field sep and output field sep
{ printf "%s:", $1 # print header line "example:"
for (i=1;i<=NF;i++) # loop over all fields
if (i % 2 == 0) # we're only interested in all "even" fields
printf("%s%s%s", ((i==2)?"":OFS), $i, ((i==NF)?"\n":""))
}
But you could have done the whole thing in one go with something like this:
$ cat tst.awk
BEGIN{OFS=","} # set output field sep to ","
NF{ # if NF (i.e. number of fields) > 0
# - to skip empty lines -
if (match($0,/msgid "(.*)"/,a)) id=a[1] # if line matches 'msgid "something"',
# set "id" to "something" (the 3-arg match() is GNU awk)
if (match($0,/msgstr "(.*)"/,b)) str=b[1] # same here for 'msgstr'
if (id && str){ # if both "id" and "str" are set
r[id]=(id in r)?r[id] OFS str:str # save "str" in array r with index "id".
# if index "id" already exists,
# add "str" preceded by OFS (i.e. "," here)
id=str=0 # after saving, reset "id" and "str"
}
}
END { for (i in r) printf "%s : %s\n", i, r[i] } # print array "r"
and call this like:
awk -f tst.awk *.po
$ awk -F'"' 'NR%2{k=$2; next} NR==FNR{a[k]=$2; next} {print k" : "a[k]", "$2}' file1 file2
example : something, somethingelse
(Odd-numbered lines carry the msgid into k; even lines either store the msgstr while reading file1 or print the combined result while reading file2.)

shell script to return value

I have the below shell script, which produces output as desired.
RuleNum=$1
cat input.txt |awk -v var=$RuleNum '$1==var {out=$1; for(i=NF;i >=0;i--)if($i~/bps/){sub("bps","",$i);out=out" "$i} print out;out=""}'
./downup.sh 20
20 BW-IN:2560000 BW-OUT:2048000
I want output as below:
./downup.sh 20
256000 2048000
./downup.sh 36
2560000 2048000
Below is input.txt:
20 name:abc addr:203.45.247.247/255.255.255.255 WDW-THRESH:12 BW-OUT:10000000bps BW-IN:15000000bps STATSDEVICE:test247 STATS:Enabled (4447794/0) <IN OUT>
25 name:xyz160 addr:203.45.233.160/255.255.255.224 STATSDEVICE:test160 STATS:Enabled priority:pass-thru (1223803328/0) <IN OUT>
37 name:testgrp2 <B> WDW-THRESH:8 BW-BOTH:192000bps STATSDEVICE:econetgrp2 STATS:Enabled (0/0) <Group> START:NNNNNNN-255-0 STOP:NNNNNNN-255-0
62 name:blahblahl54 addr:203.45.225.54/255.255.255.255 WDW-THRESH:5 BWLINK:cbb256 BW-BOTH:256000bps STATSDEVICE:hellol54 STATS:Enabled (346918/77) <IN OUT>
Add sub("BW.*:", "", $i) after the existing sub().
And cat isn't necessary. Just put the filename at the end of the line:
awk ... input.txt
To eliminate the rule number from the output, remove out = $1;.
Here is the result, with an addition to avoid printing a space at the beginning of each line:
awk -v var="$RuleNum" '$1==var {for(i = NF; i >= 1; i--) if ($i ~ /bps/) {sub("bps","",$i); sub("BW.*:", "", $i); out = out delim $i; delim = OFS} print out; out = delim = ""}' input.txt
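Putting those changes together, downup.sh reduces to something like this sketch; here the rule-20 line from input.txt is piped in for demonstration instead of reading the file:

```shell
printf '20 name:abc addr:203.45.247.247/255.255.255.255 WDW-THRESH:12 BW-OUT:10000000bps BW-IN:15000000bps STATSDEVICE:test247 STATS:Enabled (4447794/0) <IN OUT>\n' |
awk -v var=20 '
$1 == var {
    for (i = NF; i >= 1; i--)       # scan fields right-to-left, as in the original
        if ($i ~ /bps/) {
            sub("bps", "", $i)      # drop the trailing "bps"
            sub("BW.*:", "", $i)    # drop the "BW-IN:" / "BW-OUT:" / "BW-BOTH:" label
            out = out delim $i
            delim = OFS             # add a separator only from the second value on
        }
    print out
    out = delim = ""
}'
# → 15000000 10000000
```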