AWK - Working with two files

I have these two csv files:
File A:
veículo;carro;sust
automóvel;carro;sust
viatura;carro;sust
breve;rápido;adj
excepcional;excelente;adj
maravilhoso;excelente;adj
amistoso;simpático;adj
amigável;simpático;adj
...
File B:
"A001","carro","sust","excelente","adj","ocorrer","adv","bom","adj"
...
In file A, $1 (word) is a synonym for $2 (word), and $3 is the part of speech.
In the lines of file B we can skip $1; the remaining columns are words paired with their part of speech.
What I need to do is look up, line by line, each (word, part-of-speech) pair from file B in file A and generate a line for each synonym. It is difficult to explain.
Desired Output:
"A001","carro","sust","excelente","adj","ocorrer","adv","bom","adj"
"A001","viatura","sust","excelente","adj","ocorrer","adv","bom","adj"
"A001","veículo","sust","excelente","adj","ocorrer","adv","bom","adj"
"A001","automóvel","sust","excelente","adj","ocorrer","adv","bom","adj"
"A001","carro","sust","excepcional","adj","ocorrer","adv","bom","adj"
"A001","viatura","sust","excepcional","adj","ocorrer","adv","bom","adj"
"A001","veículo","sust","excepcional","adj","ocorrer","adv","bom","adj"
"A001","automóvel","sust","excepcional","adj","ocorrer","adv","bom","adj"
"A001","carro","sust","maravilhoso","adj","ocorrer","adv","bom","adj"
"A001","viatura","sust","maravilhoso","adj","ocorrer","adv","bom","adj"
"A001","veículo","sust","maravilhoso","adj","ocorrer","adv","bom","adj"
"A001","automóvel","sust","maravilhoso","adj","ocorrer","adv","bom","adj"
What I have done so far:
BEGIN {
    FS = "[,;]"
    OFS = ";"
}
FNR==NR {
    sinonim[$1 "," $2 "," $3]++
    next
}
{
    s1 = split($0, AX, "\n")
    for (i=1; i<=s1; i++) {
        s2 = split(AX[i], BX, ",")
        for (j=2; j<=NF; j+=2) {
            lineX = BX[j] "," BX[j+1]
            gsub(/\"/, "", lineX)
            for (item in sinonim) {
                s3 = split(item, CX, ",")
                lineS = CX[2] "," CX[3]
                if (lineX == lineS) {
                    BX[j] = CX[1]
                    lineD = ""
                    for (t=1; t<=s2; t++) {
                        lineD = lineD BX[t] ","
                    }
                    lineF = lineF lineD "\n"
                }
            }
        }
    }
    print lineF
}

$ cat tst.awk
BEGIN { FS=";" }
NR==FNR { synonyms[$2,$3][$2]; synonyms[$2,$3][$1]; next }
FNR==1  { FS=OFS="\",\""; $0=$0 }
{
    gsub(/^"|"$/,"")
    for (i=2;i<NF;i+=2) {
        if ( ($i,$(i+1)) in synonyms ) {
            for (synonym in synonyms[$i,$(i+1)]) {
                $i = synonym
                for (j=2;j<NF;j+=2) {
                    if ( ($j,$(j+1)) in synonyms ) {
                        for (synonym in synonyms[$j,$(j+1)]) {
                            orig = $0
                            $j = synonym
                            if (!seen[$0]++) {
                                print "\"" $0 "\""
                            }
                            $0 = orig
                        }
                    }
                }
            }
        }
    }
}
$ awk -f tst.awk fileA fileB
"A001","carro","sust","excelente","adj","ocorrer","adv","bom","adj"
"A001","veículo","sust","excelente","adj","ocorrer","adv","bom","adj"
"A001","automóvel","sust","excelente","adj","ocorrer","adv","bom","adj"
"A001","viatura","sust","excelente","adj","ocorrer","adv","bom","adj"
"A001","carro","sust","maravilhoso","adj","ocorrer","adv","bom","adj"
"A001","carro","sust","excepcional","adj","ocorrer","adv","bom","adj"
"A001","veículo","sust","maravilhoso","adj","ocorrer","adv","bom","adj"
"A001","veículo","sust","excepcional","adj","ocorrer","adv","bom","adj"
"A001","automóvel","sust","maravilhoso","adj","ocorrer","adv","bom","adj"
"A001","automóvel","sust","excepcional","adj","ocorrer","adv","bom","adj"
"A001","viatura","sust","maravilhoso","adj","ocorrer","adv","bom","adj"
"A001","viatura","sust","excepcional","adj","ocorrer","adv","bom","adj"
The above uses GNU awk for true multi-dimensional arrays. With other awks it's a simple tweak: store each synonym set in a string, e.g. synonyms[$2,$3] = synonyms[$2,$3] " " $2 and so on, and split() it later, instead of using synonyms[$2,$3][$2] and in.
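The portable tweak mentioned above might look like the sketch below. The file and script names (fileA, fileB, portable.awk) are mine, and file A is trimmed to a few rows; the nested-loop logic mirrors the gawk version, only the array representation changes.

```shell
cat > fileA <<'EOF'
veículo;carro;sust
automóvel;carro;sust
viatura;carro;sust
excepcional;excelente;adj
maravilhoso;excelente;adj
EOF
cat > fileB <<'EOF'
"A001","carro","sust","excelente","adj","ocorrer","adv","bom","adj"
EOF
cat > portable.awk <<'EOF'
BEGIN { FS=";" }
NR==FNR {                         # build "word word ..." strings instead of subarrays
    k = $2 SUBSEP $3
    if (!(k in synonyms)) synonyms[k] = $2
    synonyms[k] = synonyms[k] " " $1
    next
}
FNR==1 { FS=OFS="\",\""; $0=$0 }
{
    gsub(/^"|"$/,"")
    for (i=2;i<NF;i+=2) {
        ki = $i SUBSEP $(i+1)
        if (ki in synonyms) {
            ni = split(synonyms[ki], si, " ")
            for (a=1; a<=ni; a++) {
                $i = si[a]
                for (j=2;j<NF;j+=2) {
                    kj = $j SUBSEP $(j+1)
                    if (kj in synonyms) {
                        nj = split(synonyms[kj], sj, " ")
                        for (b=1; b<=nj; b++) {
                            orig = $0; $j = sj[b]
                            if (!seen[$0]++) print "\"" $0 "\""
                            $0 = orig
                        }
                    }
                }
            }
        }
    }
}
EOF
awk -f portable.awk fileA fileB
```

With the trimmed file A this prints the 12 cross-product lines (4 noun synonyms × 3 adjective synonyms), deduplicated by the seen[] array.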

BEGIN { FS="[,;]"; OFS="," }
NR == FNR { key = "\"" $2 "\""; synonym[key] = synonym[key] "," $1; next }
{
    print
    if ($2 in synonym) {
        count = split(substr(synonym[$2], 2), choices)
        for (i = 1; i <= count; i++) {
            $2 = "\"" choices[i] "\""
            print
        }
    }
}
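A quick run of this variant (script saved as syn2.awk; file and script names are mine, and the sample files are trimmed) shows that it expands synonyms for field 2 only, so it covers only part of the desired output:

```shell
cat > syn2.awk <<'EOF'
BEGIN { FS="[,;]"; OFS="," }
NR == FNR { key = "\"" $2 "\""; synonym[key] = synonym[key] "," $1; next }
{
    print
    if ($2 in synonym) {
        count = split(substr(synonym[$2], 2), choices)
        for (i = 1; i <= count; i++) {
            $2 = "\"" choices[i] "\""
            print
        }
    }
}
EOF
cat > fileA <<'EOF'
veículo;carro;sust
automóvel;carro;sust
viatura;carro;sust
EOF
cat > fileB <<'EOF'
"A001","carro","sust","excelente","adj"
EOF
awk -f syn2.awk fileA fileB
```

This prints the original line plus one line per synonym of the second field (4 lines total for the sample).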

Related

how to optimize this awk script?

I process two files with awk: I read the first file and store the columns I need in arrays, then use those arrays to compare against column 8 of the second file.
My script runs very slowly. Is there a way to optimize it?
FNR==NR
{
a[$1];
ip[NR]=$1;
site[NR]=$2;
next
}
BEGIN{
FS="[\t,=]";
OFS="|";
}
sudo awk -f{
l=length(ip);
if($8 in a)
{
for(k=0;k<=l;k++)
{
if(ip[k]== $8)
{
if(NF <= 70)
{
print "siteID Ipam: "site[k],"siteID zsc: "$14,"date: " $4,"src: "$8,"dst: "$10,"role: "$22,"urlcategory: "$36, "urlsupercategory: "$38,"urlclass: "$40;
}
else
{
print "siteID Ipam: "site[k], "siteID zsc: "$14,"date: " $4, "src: " $8, "dst: " $10, "role: "$22, "urlcategory: " $37, "urlsupercategory: "$39, "urlclass: $41;
}
break;
}
}
}
else
{
print $8 " is not in referentiel ";
}
}
Here is a better formatted same code with the initial typo.
BEGIN {
FS = "[\t,=]";
OFS = "|";
}
FNR == NR {
a[$1];
ip[NR] = $1;
site[NR] = $2;
next;
}
sudo awk -f {
l = length(ip);
if($8 in a) {
for(k = 0; k <= l; k++) {
if(ip[k] == $8) {
if(NF <= 70) {
print "siteID Ipam: "site[k],"siteID zsc: "$14,"date: " $4,"src: "$8,"dst: "$10,"role: "$22,"urlcategory: "$36, "urlsupercategory: "$38,"urlclass: "$40;
}
else {
print "siteID Ipam: "site[k], "siteID zsc: "$14,"date: " $4, "src: " $8, "dst: " $10, "role: "$22, "urlcategory: " $37, "urlsupercategory: "$39, "urlclass: $41;
}
break;
}
}
} else {
print $8 " is not in referentiel ";
}
}
Suggestions:
Fix the sudo awk -f typo.
a[$1]; --> a[$1] = 1;
($8 in a) --> (a[$8])
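Beyond those fixes, the main cost is the linear scan over ip[] for every log line. A sketch of the idea (sample data and file names are mine): key the site array by IP, so each line costs one hash lookup instead of a loop over every known IP.

```shell
cat > opt.awk <<'EOF'
BEGIN { FS = "[\t,=]" }
FNR == NR { site[$1] = $2; next }   # first file: map IP directly to its site ID
{
    if ($8 in site)
        print "siteID Ipam: " site[$8], "src: " $8
    else
        print $8 " is not in referentiel"
}
EOF
# hypothetical sample data: tab-separated referentiel (IP, site) and a log line
printf '10.0.0.1\tsiteA\n10.0.0.2\tsiteB\n' > referentiel
printf 'f1\tf2\tf3\t2024-01-01\tf5\tf6\tf7\t10.0.0.2\n' > log
awk -f opt.awk referentiel log
```

The same print statements from the original script (with $14, $22, etc.) would drop in unchanged; only the site[k] scan is replaced by site[$8].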

merge file on the basis of 2 fields

file1
session=1|w,eventbase=4,operation=1,rule=15
session=1|e,eventbase=5,operation=2,rule=14
session=2|t,eventbase=,operation=1,rule=13
file2
field1,field2,field3,session=1,fieldn,operation=1,fieldn
field1,field2,field3,session=1,fieldn,operation=2,fieldn
field1,field2,field3,session=2,fieldn,operation=2,fieldn
field1,field2,field3,session=2,fieldn,operation=1,fieldn
Output
field1,field2,field3,session=1,fieldn,operation=1,fieldn,eventbase=4,rule=15
field1,field2,field3,session=1,fieldn,operation=2,fieldn,eventbase=5,rule=14
field1,field2,field3,session=2,fieldn,operation=2,fieldn,NOMATCH
field1,field2,field3,session=2,fieldn,operation=1,fieldn,eventbase=,rule=13
I have tried:
BEGIN { FS = OFS = "," }
FNR == NR {
split($1,s,"|")
session=s[1];
a[session,$3] = session","$2","$3","$4;
next
}
{
split($4,x,"|");
nsession=x[1];
if(nsession in a)print $0 a[nsession,$6];
else print $0",NOMATCH";
}
The issue is that I am not able to find nsession in the 2D array a with if(nsession in a), when matching the two files on the combination of session and operation.
Thanks, it helped. Now I am learning :) Thanks, team.
BEGIN { FS = OFS = "," }
FNR == NR {
split($1,s,"|")
session=s[1];
a[session,$3] = session","$2","$3","$4;
next
}
{
split($4,x,"|");
nsession=x[1];
key=nsession SUBSEP $6
if(key in a)print $0 a[nsession,$6];
else print $0",NOMATCH";
}
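The fix works because awk joins multiple array subscripts with SUBSEP, so membership has to be tested with the same compound key. A minimal demonstration of why the session alone is never found:

```shell
awk 'BEGIN {
    a["1", "op1"] = "x"                  # stored under the key "1" SUBSEP "op1"
    r1 = ("1" in a)                      # 0: the session alone is not a key
    r2 = (("1", "op1") in a)             # 1: paired-subscript membership test
    r3 = (("1" SUBSEP "op1") in a)       # 1: the equivalent explicit form
    print r1, r2, r3
}'
```

So either key = nsession SUBSEP $6 with (key in a), or the built-in form ((nsession, $6) in a), works; (nsession in a) alone never will.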
You can try
awk -f merge.awk file1 file2
where merge.awk is (note: the three-argument match() used below is a GNU awk extension):
NR==FNR {
    sub(/[[:blank:]]*$/,"")
    getSessionInfo(1)
    ar[ses,op] = ",eventbase=" evb ",rule=" rule
    next
}
{
    sub(/[[:blank:]]*$/,"")
    getSessionInfo(0)
    if ((ses,op) in ar)
        print $0 ar[ses,op]
    else
        print $0 ",NOMATCH"
}
function getSessionInfo(f, a) {
    match($0,/session=([^|,]*)[|,]/,a)
    ses=a[1]
    match($0,/operation=([^,]*),/,a)
    op=a[1]
    if (f) {
        match($0,/eventbase=([^,]*),/,a)
        evb=a[1]
        match($0,/rule=(.*)$/,a)
        rule=a[1]
    }
}
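In POSIX awk (no gawk), the same extraction can be sketched with plain match() plus RSTART/RLENGTH and substr(). The helper name extract is mine, not part of any standard API:

```shell
awk '
function extract(s, re,    v) {        # return the value after "name=" in the first match of re
    if (!match(s, re)) return ""
    v = substr(s, RSTART, RLENGTH)     # e.g. "session=1|"
    sub(/^[^=]*=/, "", v)              # drop the "name=" prefix
    sub(/[|,]$/, "", v)                # drop the trailing delimiter
    return v
}
BEGIN {
    line = "session=1|w,eventbase=4,operation=1,rule=15"
    print extract(line, "session=[^|,]*[|,]")
    print extract(line, "eventbase=[^,]*,")
}'
```

Regexes are passed as strings here because a /re/ constant used as a function argument would match against $0 instead of being passed along.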

awk: Print strings in file1 not matching strings in file2

I have two files with comma-separated values. I want to remove all the strings in file1 that match strings in file2.
file1:
soap,cosmetics,june,hello,good
file2:
june,hello
output:
soap,cosmetics,good
I tried this, but it's not working. I'm not sure where I'm going wrong. Any help is appreciated.
BEGIN {
    FS = ","
}
NR==FNR {
    a[NR] = $0
    next
}
{
    for (j=1; j<=NF; j++) {
        split($0, d, ",")
        if (d[j] in a == 0) {
            line = (line ? line "," : "") d[j]
        }
    }
    print line
    line = ""
}
Here's one way using awk. Run like:
awk -f script.awk file2 file1
Contents of script.awk:
BEGIN {
    FS = ","
}
FNR==NR {
    for (i=1; i<=NF; i++) {
        a[$i]
    }
    next
}
{
    for (j=1; j<=NF; j++) {
        if (!($j in a)) {
            r = (r ? r FS : "") $j
        }
    }
}
END {
    print r
}
Results:
soap,cosmetics,good
Alternatively, here's the one-liner:
awk -F, 'FNR==NR { for(i=1;i<=NF;i++) a[$i]; next } { for(j=1;j<=NF;j++) if (!($j in a)) r = (r ? r FS : "") $j } END { print r }' file2 file1
$ gawk -v RS='[,\n]' 'NR==FNR{a[$0];next} !($0 in a){o=o s $0;s=","} END{print o}' file2 file1
soap,cosmetics,good
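Both solutions above accumulate into a single variable and print once, which assumes file1 has a single line. A per-record variant (a sketch; the second sample line is mine) resets the buffer on every line of file1:

```shell
printf 'june,hello\n' > file2
printf 'soap,cosmetics,june,hello,good\nlather,june,rinse\n' > file1
awk -F, '
FNR==NR { for (i=1;i<=NF;i++) a[$i]; next }   # file2: remember the words to drop
{
    r = ""
    for (j=1;j<=NF;j++)
        if (!($j in a)) r = (r ? r FS : "") $j
    print r
}' file2 file1
```

This prints one filtered line per input line (soap,cosmetics,good then lather,rinse for the sample).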

awk nesting curly brackets

I have the following awk script where I seem to need to nest curly brackets, but this is not allowed in awk. How can I fix this issue in my script?
The problem is in the if(inqueued == 1) part.
BEGIN {
print "Log File Analysis Sequencing for " + FILENAME;
inqueued=0;
connidtext="";
thisdntext="";
}
/message EventQueued/ {
inqueued=1;
print $0;
}
if(inqueued == 1) {
/AttributeConnID/ { connidtext = $0; }
/AttributeThisDN / { thisdntext = $2; } #space removes DNRole
}
#if first chars are a timetamp we know we are out of queued text
/\#?[0-9]+:[0-9}+:[0-9]+/
{
if(thisdntext != 0) {
print connidtext;
print thisdntext;
}
inqueued = 0; connidtext=""; thisdntext="";
}
Try changing
if(inqueued == 1) {
/AttributeConnID/ { connidtext = $0; }
/AttributeThisDN / { thisdntext = $2; } #space removes DNRole
}
to
inqueued == 1 {
    if ($0 ~ /AttributeConnID/)  { connidtext = $0; }
    if ($0 ~ /AttributeThisDN /) { thisdntext = $2; } # space removes DNRole
}
or
inqueued == 1 && /AttributeConnID/  { connidtext = $0; }
inqueued == 1 && /AttributeThisDN / { thisdntext = $2; } # space removes DNRole
awk is made up of <condition> { <action> } segments. Within an <action> you can specify conditions just as you would in C, with if or while constructs. You have a few other problems too; just re-write your script as:
BEGIN {
    print "Log File Analysis Sequencing for", FILENAME
}
/message EventQueued/ {
    inqueued = 1
    print
}
inqueued == 1 {
    if (/AttributeConnID/)  { connidtext = $0 }
    if (/AttributeThisDN /) { thisdntext = $2 } # trailing space excludes DNRole
}
# if the first chars are a timestamp we know we are out of queued text
/#?[0-9]+:[0-9]+:[0-9]+/ {
    if (thisdntext != "") {
        print connidtext
        print thisdntext
    }
    inqueued = connidtext = thisdntext = ""
}
I don't know if that'll do what you want or not, but it's syntactically correct at least.
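The key point is that inside an action, a bare /regex/ is just an expression equivalent to $0 ~ /regex/, so it belongs in ordinary if statements rather than in a nested pattern-action rule. A tiny demonstration (sample input is mine):

```shell
printf 'AttributeConnID 42\nother line\n' |
awk '{ if (/AttributeConnID/) print "matched:", $2 }'
```

Only the line matching the regex triggers the print; no nested { pattern { action } } is needed.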

awk '/range start/,/range end/' within script

How do I use the awk range pattern '/begin regex/,/end regex/' within a self-contained awk script?
To clarify, given program csv.awk:
#!/usr/bin/awk -f
BEGIN {
FS = "\""
}
/TREE/,/^$/
{
line="";
for (i=1; i<=NF; i++) {
if (i != 2) line=line $i;
}
split(line, v, ",");
if (v[5] ~ "FOAM") {
print NR, v[5];
}
}
and file chunk:
TREE
10362900,A,INSTL - SEAL,Revise
,10362901,A,ASSY / DETAIL - PANEL,Revise
,,-203,ASSY - PANEL,Qty -,Add
,,,-309,PANEL,Qty 1,Add
,,,,"FABRICATE FROM TEKLAM NE1G1-02-250 PER TPS-CN-500, TYPE A"
,,,-311,PANEL,Qty 1,Add
,,,,"FABRICATE FROM TEKLAM NE1G1-02-750 PER TPS-CN-500, TYPE A"
,,,-313,FOAM SEAL,1.00 X 20.21 X .50 THK,Qty 1,Add
,,,,"BMS1-68, GRADE B, FORM II, COLOR BAC706 (BLACK)"
,,,-315,FOAM SEAL,1.50 X 8.00 X .25 THK,Qty 1,Add
,,,,"BMS1-68, GRADE B, FORM II, COLOR BAC706 (BLACK)"
,PN HERE,Dual Lock,Add
,
10442900,IR,INSTL - SEAL,Update (not released)
,10362901,A,ASSY / DETAIL - PANEL,Revise
,PN HERE,Dual Lock,Add
I want to have this output:
27 FOAM SEAL
29 FOAM SEAL
What is the syntax for adding the command-line form '/begin regex/,/end regex/' to the script so it operates on those lines only? All my attempts lead to syntax errors, and googling only gives me the CLI form.
why not use 2 steps:
% awk '/start/,/end/' < input.csv | awk -f csv.awk
Simply do:
#!/usr/bin/awk -f
BEGIN {
    FS = "\""
}
/from/,/to/ {
    line = ""
    for (i = 1; i <= NF; i++) {
        if (i != 2) line = line $i
    }
    split(line, v, ",")
    if (v[5] ~ "FOAM") {
        print NR, v[5]
    }
}
If the from to regexes are dynamic:
#!/usr/bin/awk -f
BEGIN {
    FS = "\""
    FROM = ARGV[1]
    TO   = ARGV[2]
    if (ARGC == 3) { # only the two patterns were given, so force reading from standard input
        ARGV[1] = "-"
    } else {
        ARGV[1] = ARGV[3]
    }
    ARGC = 2 # keep awk from treating the pattern arguments themselves as input files
}
{ if ($0 ~ FROM) { p = 1; l = 0 } }
{ if ($0 ~ TO)   { p = 0; l = 1 } }
{
    if (p == 1 || l == 1) {
        line = ""
        for (i = 1; i <= NF; i++) {
            if (i != 2) line = line $i
        }
        split(line, v, ",")
        if (v[5] ~ "FOAM") {
            print NR, v[5]
        }
        l = 0
    }
}
Now you have to call it like: ./scriptname.awk "FROM_REGEX" "TO_REGEX" INPUTFILE. The last param is optional, if missing STDIN can be used.
HTH
You need to show us what you have tried. Is there something about /begin regex/ or /end regex/ you're not telling us? Otherwise your script with the additions should work, i.e.
#!/usr/bin/awk -f
BEGIN {
    FS = "\""
}
/begin regex/,/end regex/ {
    line = ""
    for (i = 1; i <= NF; i++) {
        if (i != 2) line = line $i
    }
    split(line, v, ",")
    if (v[5] ~ "FOAM") {
        print NR, v[5]
    }
}
Or are you using an old Unix, where old awk is /usr/bin/awk and new awk is /usr/bin/nawk? Also check whether you have /usr/xpg4/bin/awk or gawk (the path could be anything).
Finally, show us the error messages you are getting.
I hope this helps.