How do I dump the contents of SYMTAB in gawk? I've tried things like the following, which displays scalars just fine. It also displays the array names and indices, but it doesn't display the value of each array element.
for (i in SYMTAB) {
    if (isarray(SYMTAB[i])) {
        for (j in SYMTAB[i]) {
            printf "%s[%s] = %s\r\n", i, j, SYMTAB[i, j]
        }
    } else {
        printf "%s = %s\r\n", i, SYMTAB[i]
    }
}
which gives results like:
OFS =
ARGC = 1
PREC = 53
ARGIND = 0
ERRNO =
ARGV[0] =
For example, I would expect to see a value after ARGV[0], but I don't.
Use SYMTAB[i][j] instead of SYMTAB[i,j] - you're already using gawk's array-of-arrays syntax in the loops to access the indices (for (j in SYMTAB[i])), so just keep using that syntax to access the values.
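Applied to the loop from the question (wrapped in a BEGIN block here so it runs standalone), that one change looks like this. Note that a second-level entry can itself be an array - PROCINFO["argv"] in the dump further down is one - so it still needs an isarray() test (or the recursive function below) to avoid a fatal "array in a scalar context" error:
BEGIN {
    for (i in SYMTAB) {
        if (isarray(SYMTAB[i])) {
            for (j in SYMTAB[i]) {
                if (!isarray(SYMTAB[i][j]))                      # guard against deeper nesting, e.g. PROCINFO["argv"]
                    printf "%s[%s] = %s\r\n", i, j, SYMTAB[i][j] # was SYMTAB[i, j]
            }
        } else {
            printf "%s = %s\r\n", i, SYMTAB[i]
        }
    }
}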
Here's a recursive function to dump SYMTAB or any other array or scalar:
$ cat tst.awk
function dump(name,val,   i) {      # "i" is listed as an extra argument so it is local to the function
    if ( isarray(val) ) {
        printf "%*s%s %s%s", indent, "", name, "{", ORS
        indent += 3                 # indent 3 more spaces for each level of nesting
        for (i in val) {
            dump(i,val[i])
        }
        indent -= 3
        printf "%*s%s %s%s", indent, "", name, "}", ORS
    }
    else {
        printf "%*s%s = <%s>%s", indent, "", name, val, ORS
    }
}
BEGIN {
    dump("SYMTAB",SYMTAB)
}
$ awk -f tst.awk
SYMTAB {
   ARGV {
      0 = <awk>
   ARGV }
   ROUNDMODE = <N>
   ORS = <
>
   OFS = < >
   LINT = <0>
   FNR = <0>
   ERRNO = <>
   NR = <0>
   IGNORECASE = <0>
   TEXTDOMAIN = <messages>
   NF = <0>
   ARGIND = <0>
   indent = <3>
   ARGC = <1>
   PROCINFO {
      argv {
         0 = <awk>
         1 = <-f>
         2 = <tst.awk>
      argv }
      group9 = <15>
      ppid = <2212>
      ...
      strftime = <%a %b %e %H:%M:%S %Z %Y>
      group8 = <11>
   PROCINFO }
   FIELDWIDTHS = <>
   CONVFMT = <%.6g>
   SUBSEP = <>
   PREC = <53>
   ENVIRON {
      SHLVL = <1>
      ENV = <.env>
      ...
      INFOPATH = </usr/local/info:/usr/share/info:/usr/info>
      TEMP = </tmp>
      ProgramData = <C:\ProgramData>
   ENVIRON }
   RS = <
>
   FPAT = <[^[:space:]]+>
   RT = <>
   RLENGTH = <0>
   OFMT = <%.6g>
   FS = < >
   RSTART = <0>
   FILENAME = <>
   BINMODE = <0>
SYMTAB }
Massage to suit...
Thank you Ed Morton. Looks like a recursive process would be required if I needed to support arbitrary levels of nested arrays, but for now this code dumps my gawk SYMTAB without errors:
for (i in SYMTAB) {
    if (!isarray(SYMTAB[i])) {
        printf "%s = %s\r\n", i, SYMTAB[i]
    } else {
        for (j in SYMTAB[i]) {
            if (!isarray(SYMTAB[i][j])) {
                printf "%s[%s] = %s\r\n", i, j, SYMTAB[i][j]
            } else {
                for (k in SYMTAB[i][j]) {
                    if (!isarray(SYMTAB[i][j][k])) {
                        printf "%s[%s][%s] = %s\r\n", i, j, k, SYMTAB[i][j][k]
                    } else {
                        printf "Skipping highly nested array.\r\n"
                    }
                }
            }
        }
    }
}
Thanks again!
I have these two csv files:
File A:
veículo;carro;sust
automóvel;carro;sust
viatura;carro;sust
breve;rápido;adj
excepcional;excelente;adj
maravilhoso;excelente;adj
amistoso;simpático;adj
amigável;simpático;adj
...
File B:
"A001","carro","sust","excelente","adj","ocorrer","adv","bom","adj"
...
In file A, $1 (a word) is a synonym for $2 (a word), and $3 is the part of speech.
In the lines of file B we can skip $1; the remaining columns are pairs of a word and its part of speech.
What I need to do is look up each (word, part-of-speech) pair from file B in file A, line by line, and generate a line for each synonym. It is difficult to explain.
Desired Output:
"A001","carro","sust","excelente","adj","ocorrer","adv","bom","adj"
"A001","viatura","sust","excelente","adj","ocorrer","adv","bom","adj"
"A001","veículo","sust","excelente","adj","ocorrer","adv","bom","adj"
"A001","automóvel","sust","excelente","adj","ocorrer","adv","bom","adj"
"A001","carro","sust","excepcional","adj","ocorrer","adv","bom","adj"
"A001","viatura","sust","excepcional","adj","ocorrer","adv","bom","adj"
"A001","veículo","sust","excepcional","adj","ocorrer","adv","bom","adj"
"A001","automóvel","sust","excepcional","adj","ocorrer","adv","bom","adj"
"A001","carro","sust","maravilhoso","adj","ocorrer","adv","bom","adj"
"A001","viatura","sust","maravilhoso","adj","ocorrer","adv","bom","adj"
"A001","veículo","sust","maravilhoso","adj","ocorrer","adv","bom","adj"
"A001","automóvel","sust","maravilhoso","adj","ocorrer","adv","bom","adj"
Here is what I have done:
BEGIN {
    FS="[,;]";
    OFS=";";
}
FNR==NR{
    sinonim[$1","$2","$3]++;
    next;
}
{
    s1=split($0,AX,"\n");
    for (i=1;i<=s1;i++)
    {
        s2=split(AX[i],BX,",");
        for (j=2;j<=NF;j+=2)
        {
            lineX=BX[j]","BX[j+1];
            gsub(/\"/,"",lineX);
            for (item in sinonim)
            {
                s3=split(item,CX,",");
                lineS=CX[2]","CX[3];
                if (lineX == lineS)
                {
                    BX[j]=CX[1];
                    lineD=""
                    for (t=1;t<=s2;t++)
                    {
                        lineD=lineD BX[t]",";
                    }
                    lineF=lineF lineD"\n";
                }
            }
        }
    }
    print lineF
}
$ cat tst.awk
BEGIN { FS=";" }
# fileA: for each (word,pos) pair, remember the word itself plus every synonym as array indices
NR==FNR { synonyms[$2,$3][$2]; synonyms[$2,$3][$1]; next }
# first line of the second file (fileB): switch FS/OFS to the literal "," separator and re-split this record
FNR==1 { FS=OFS="\",\""; $0=$0 }
{
    gsub(/^"|"$/,"")
    for (i=2;i<NF;i+=2) {
        if ( ($i,$(i+1)) in synonyms ) {
            for (synonym in synonyms[$i,$(i+1)]) {
                $i = synonym
                for (j=2;j<NF;j+=2) {
                    if ( ($j,$(j+1)) in synonyms ) {
                        for (synonym in synonyms[$j,$(j+1)]) {
                            orig = $0
                            $j = synonym
                            if (!seen[$0]++) {
                                print "\"" $0 "\""
                            }
                            $0 = orig
                        }
                    }
                }
            }
        }
    }
}
$ awk -f tst.awk fileA fileB
"A001","carro","sust","excelente","adj","ocorrer","adv","bom","adj"
"A001","veículo","sust","excelente","adj","ocorrer","adv","bom","adj"
"A001","automóvel","sust","excelente","adj","ocorrer","adv","bom","adj"
"A001","viatura","sust","excelente","adj","ocorrer","adv","bom","adj"
"A001","carro","sust","maravilhoso","adj","ocorrer","adv","bom","adj"
"A001","carro","sust","excepcional","adj","ocorrer","adv","bom","adj"
"A001","veículo","sust","maravilhoso","adj","ocorrer","adv","bom","adj"
"A001","veículo","sust","excepcional","adj","ocorrer","adv","bom","adj"
"A001","automóvel","sust","maravilhoso","adj","ocorrer","adv","bom","adj"
"A001","automóvel","sust","excepcional","adj","ocorrer","adv","bom","adj"
"A001","viatura","sust","maravilhoso","adj","ocorrer","adv","bom","adj"
"A001","viatura","sust","excepcional","adj","ocorrer","adv","bom","adj"
The above uses GNU awk for true multi-dimensional arrays (arrays of arrays). With other awks it's a simple tweak to build a string instead, e.g. synonyms[$2,$3] = synonyms[$2,$3] " " $2 or similar, and then split() that string later, instead of using synonyms[$2,$3][$2] and in; see the sketch below.
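Here is a rough sketch of that tweak (same logic as the gawk script above, just storing each synonym list as a space-separated string; the variable names are mine, it is lightly tested at best, and it assumes the words themselves never contain spaces):
BEGIN { FS=";" }
NR==FNR {
    if ( !(($2,$3) in synonyms) ) {
        synonyms[$2,$3] = $2                      # seed the list with the target word itself
    }
    synonyms[$2,$3] = synonyms[$2,$3] " " $1      # then append each synonym
    next
}
FNR==1 { FS=OFS="\",\""; $0=$0 }
{
    gsub(/^"|"$/,"")
    for (i=2; i<NF; i+=2) {
        if ( ($i,$(i+1)) in synonyms ) {
            ni = split(synonyms[$i,$(i+1)], isyns, " ")
            for (si=1; si<=ni; si++) {
                $i = isyns[si]
                for (j=2; j<NF; j+=2) {
                    if ( ($j,$(j+1)) in synonyms ) {
                        nj = split(synonyms[$j,$(j+1)], jsyns, " ")
                        for (sj=1; sj<=nj; sj++) {
                            orig = $0
                            $j = jsyns[sj]
                            if (!seen[$0]++) print "\"" $0 "\""
                            $0 = orig
                        }
                    }
                }
            }
        }
    }
}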
BEGIN { FS="[,;]"; OFS="," }
NR == FNR { key = "\"" $2 "\""; synonym[key] = synonym[key] "," $1; next }
{
    print;
    if ($2 in synonym) {
        count = split(substr(synonym[$2], 2), choices)
        for (i = 1; i <= count; i++) {
            $2 = "\"" choices[i] "\""
            print
        }
    }
}
How do I use the awk range pattern '/begin regex/,/end regex/' within a self-contained awk script?
To clarify, given program csv.awk:
#!/usr/bin/awk -f
BEGIN {
    FS = "\""
}
/TREE/,/^$/
{
    line="";
    for (i=1; i<=NF; i++) {
        if (i != 2) line=line $i;
    }
    split(line, v, ",");
    if (v[5] ~ "FOAM") {
        print NR, v[5];
    }
}
and file chunk:
TREE
10362900,A,INSTL - SEAL,Revise
,10362901,A,ASSY / DETAIL - PANEL,Revise
,,-203,ASSY - PANEL,Qty -,Add
,,,-309,PANEL,Qty 1,Add
,,,,"FABRICATE FROM TEKLAM NE1G1-02-250 PER TPS-CN-500, TYPE A"
,,,-311,PANEL,Qty 1,Add
,,,,"FABRICATE FROM TEKLAM NE1G1-02-750 PER TPS-CN-500, TYPE A"
,,,-313,FOAM SEAL,1.00 X 20.21 X .50 THK,Qty 1,Add
,,,,"BMS1-68, GRADE B, FORM II, COLOR BAC706 (BLACK)"
,,,-315,FOAM SEAL,1.50 X 8.00 X .25 THK,Qty 1,Add
,,,,"BMS1-68, GRADE B, FORM II, COLOR BAC706 (BLACK)"
,PN HERE,Dual Lock,Add
,
10442900,IR,INSTL - SEAL,Update (not released)
,10362901,A,ASSY / DETAIL - PANEL,Revise
,PN HERE,Dual Lock,Add
I want to have this output:
27 FOAM SEAL
29 FOAM SEAL
What is the syntax for adding the command-line form '/begin regex/,/end regex/' to the script so that it operates on those lines only? All my attempts lead to syntax errors, and googling only gives me the command-line form.
Why not use 2 steps:
% awk '/start/,/end/' < input.csv | awk -f csv.awk
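With the regexes from the question that would be, for example (input.csv is a placeholder name):
% awk '/TREE/,/^$/' < input.csv | awk -f csv.awk
Note that NR inside csv.awk then counts lines of the filtered stream, not of the original file, so the printed line numbers will differ from the one-script approach.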
Simply do:
#!/usr/bin/awk -f
BEGIN {
    FS = "\""
}
/from/,/to/ {
    line="";
    for (i=1; i<=NF; i++) {
        if (i != 2) line=line $i;
    }
    split(line, v, ",");
    if (v[5] ~ "FOAM") {
        print NR, v[5];
    }
}
If the from/to regexes are dynamic:
#!/usr/bin/awk -f
BEGIN {
    FS = "\""
    FROM = ARGV[1]
    TO   = ARGV[2]
    if (ARGC == 3) { # only the two regexes were given, so force a read from standard input
        ARGV[1] = "-"
    } else {         # an input file was given too: shift its name down so the regexes aren't opened as files
        ARGV[1] = ARGV[3]
    }
    ARGC = 2         # process only ARGV[1]
}
{ if ($0 ~ FROM) { p = 1 ; l = 0 } }   # the range starts here
{ if ($0 ~ TO)   { p = 0 ; l = 1 } }   # the range ends here; l keeps this closing line in play
{
    if (p == 1 || l == 1) {
        line="";
        for (i=1; i<=NF; i++) {
            if (i != 2) line=line $i;
        }
        split(line, v, ",");
        if (v[5] ~ "FOAM") {
            print NR, v[5];
        }
        l = 0
    }
}
Now you have to call it like: ./scriptname.awk "FROM_REGEX" "TO_REGEX" INPUTFILE. The last parameter is optional; if it is missing, STDIN is used.
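For the sample data in the question (with the script saved as range.awk and the data as parts.csv, both names of my choosing), the call would look like:
$ ./range.awk "TREE" "^$" parts.csv
or, reading from standard input:
$ ./range.awk "TREE" "^$" < parts.csv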
HTH
You need to show us what you have tried. Is there something about /begin regex/ or /end regex/ that you're not telling us? Otherwise your script with the additions should work, i.e.
#!/usr/bin/awk -f
BEGIN {
    FS = "\""
}
/begin regex/,/end regex/{
    line="";
    for (i=1; i<=NF; i++) {
        if (i != 2) line=line $i;
    }
    split(line, v, ",");
    if (v[5] ~ "FOAM") {
        print NR, v[5];
    }
}
OR are you using an old Unix, where there is old awk as /usr/bin/awk and new awk as /usr/bin/nawk? Also see if you have /usr/xpg4/bin/awk or gawk (the path could be anything).
Finally, show us the error messages you are getting.
I hope this helps.
If I had a string with escaped commas like so:
a,b,{c\,d\,e},f,g
How might I use awk to parse that into the following items?
a
b
{c\,d\,e}
f
g
{
    split("", b)                   # clear results left over from any previous record
    split($0, a, /,/)
    j=1
    for(i=1; i<=length(a); ++i) {
        if(match(b[j], /\\$/)) {   # previous item ends in '\': the comma was escaped,
            b[j]=b[j] "," a[i]     # so glue this piece back on and restore the comma
        } else {
            b[++j] = a[i]          # otherwise start a new item
        }
    }
    for(k=2; k<=length(b); ++k) {
        print b[k]
    }
}
Split the record into array a, using ',' as the delimiter
Build array b from a, merging each piece that ends in '\' with the piece that follows it
Print array b (note: the loop starts at 2 since the first item is blank)
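For example, saving the block above as split_escaped.awk (a file name of my choosing) and feeding it the sample string should print:
$ printf '%s\n' 'a,b,{c\,d\,e},f,g' | awk -f split_escaped.awk
a
b
{c\,d\,e}
f
g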
This solution presumes (for now) that ',' is the only character that is ever escaped with '\'--that is, there is no need to handle any \\ in the input, nor weird combinations such as \\\,\\,\\\\,,\,.
{
    gsub("\\\\,", "!Q!")            # hide each escaped comma behind a placeholder
    n = split($0, a, ",")           # now the remaining commas are real separators
    for (i = 1; i <= n; ++i) {
        gsub("!Q!", "\\,", a[i])    # put the escaped commas back
        print a[i]
    }
}
I don't think awk has any built-in support for something like this. Here's a solution that's not nearly as short as DigitalRoss's, but should have no danger of ever accidentally hitting your made-up string (!Q!). Since it tests with an if, you could also extend it to be careful about whether you actually have \\, at the end of your field, which should be an escaped backslash rather than an escaped comma (see the sketch after the code).
BEGIN {
    FS = ","
}
{
    split("", fields)        # clear results left over from any previous record
    curfield=1
    for (i=1; i<=NF; i++) {
        if (substr($i,length($i)) == "\\") {
            # field ends in '\': drop the backslash and glue the next field back on with a comma
            fields[curfield] = fields[curfield] substr($i,1,length($i)-1) FS
        } else {
            fields[curfield] = fields[curfield] $i
            curfield++
        }
    }
    nf = curfield - 1
    for (i=1; i<=nf; i++) {
        printf("%d: %s ",i,fields[i])
    }
    printf("\n")
}