Extracting multiple columns without loop - awk

I am writing an awk script that takes the output of grep and formats it as an HTML table. The delimiter is the ":" character; the problem is that this character can also appear in the text. So if I just use $1, $2, and $3 for the filename, line number, and comment respectively, I lose anything after the first : in the comment.
Is there a way to say $1, $2, and then $3..NF without explicitly looping over the columns and concatenating them together?
Here's the script so far:
#!/usr/bin/awk -f
BEGIN {
    FS=":"
    print "<html><body>"
    print "<table>"
    print "<tr><td>File name</td><td>Line number</td><td>Comment</td></tr>"
}
{
    print "<tr><td>" $1 "</td><td>" $2 "</td><td>" $3 "</td></tr>"
}
END {
    print "</table>"
    print "</body></html>"
}
And some sample input:
./mysql-connector-java-5.0.8/src/com/mysql/jdbc/BlobFromLocator.java:177: // TODO: Make fetch size configurable
./mysql-connector-java-5.0.8/src/com/mysql/jdbc/CallableStatement.java:243: // TODO Auto-generated method stub
./mysql-connector-java-5.0.8/src/com/mysql/jdbc/CallableStatement.java:836: // TODO: Do this with less memory allocation

BEGIN { FS=":"; OFS=":" }
{ name=$1; number=$2; $1=""; $2=""; comment=substr($0,3); }

{ print gensub(/^[^:]*:[^:]*:/,"","g") }
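Note that gensub() is a GNU awk extension. A minimal portable sketch of the same idea, assuming the grep output arrives on stdin: copy the record, strip the leading "file:line:" from the copy with sub(), and keep everything else (including later colons) as the comment. The function name format_rows is made up for this example.

```shell
# Hypothetical helper: reads "file:line:comment" records on stdin
# and prints one HTML table row per record.
format_rows() {
  awk -F: '{
    name = $1
    number = $2
    comment = $0
    # Remove "file:line:" from the copy; colons inside the comment survive.
    sub(/^[^:]*:[^:]*:/, "", comment)
    print "<tr><td>" name "</td><td>" number "</td><td>" comment "</td></tr>"
  }'
}

printf '%s\n' './mysql-connector-java-5.0.8/src/com/mysql/jdbc/BlobFromLocator.java:177: // TODO: Make fetch size configurable' | format_rows
```

The row body could replace the single print line in the question's main block; the BEGIN and END sections stay as they are.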

Related

Truncation of strings after running awk script

I have this code
BEGIN { FS=OFS=";" }
{ key = $(NF-1) }
NR == FNR {
    for (i=1; i<(NF-1); i++) {
        if ( !seen[key,$i]++ ) {
            map[key] = (key in map ? map[key] OFS : "") $i
        }
    }
    next
}
{ print $0 map[key] }
I run it like this:
awk -f tst.awk 2.txt 1.txt
I have two text files.
1.txt
AA;BB;
2.txt
CC;DD;BB;AA;
I am trying to generate this output as 3.txt:
AA;BB;CC;DD;
but the script above cannot do it: it returns only AA;BB;
Logic: the script uses literal strings as array indices in a hash lookup, so it does not care which characters appear in the input. About the sample output: if 2.txt contains fields that are also in 1.txt (for example BB;AA;), they need to be merged into a single row, i.e. AA;BB;CC;DD;. Ordering does not matter, so BB;AA;DD;CC; would be equally fine. The only requirement is to avoid duplicates, which my script already does.
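Could you please try the following. As per the OP's comment, both files have only one line, so this uses the paste command to combine the two files and then processes the result with awk. Empty fields (produced by the trailing ";" in each file) are skipped, and the trailing ";" is added back when printing.
paste -d';' 1.txt 2.txt |
awk '
BEGIN{
  FS=OFS=";"
}
{
  for(i=1;i<=NF;i++){
    # skip empty fields and values that were already seen
    if($i!="" && !seen[$i]++){ val=(val?val OFS:"")$i }
  }
  print val OFS
  delete seen
  val=""
}'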

awk problems remembering previous line

I have the following awk file;
BEGIN { FS=":" };
{if (NR%2==1) { host=$1 }};
{if (NR%2==0) { print $host ":" $0 }};
I want to do the following:
If the line number is odd, store the 1st field.
If the line number is even, print the previously stored field, a colon, and the current line.
Currently this outputs the even-numbered line twice: "evenline:evenline".
I'm not sure what I'm doing wrong.
It should be:
BEGIN { FS=":" };
{if (NR%2==1) { host=$1 }};
{if (NR%2==0) { print host ":" $0 }};
The fix is $host -> host.
Why?
Field references in awk start with a dollar sign $. You can access fields statically, like $1 or $2, or dynamically, like $variable. A variable used this way is converted to a number, because fields are identified by number. Here host holds a non-numeric string, which converts to 0, so $host is $0 and awk prints the whole line twice.
Note that you can simplify this:
BEGIN { FS=":" }
NR%2==1 { host=$1 };
NR%2==0 { print host ":" $0 }
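To see the $host behaviour described above in isolation, here is a small sketch. The two-line sample input and the function names buggy and fixed are made up for the demonstration.

```shell
# Buggy: host holds "web01", which converts to 0, so $host is $0.
buggy() { awk -F: 'NR%2==1 { host=$1 } NR%2==0 { print $host ":" $0 }'; }
# Fixed: host is used as a plain variable.
fixed() { awk -F: 'NR%2==1 { host=$1 } NR%2==0 { print host ":" $0 }'; }

printf 'web01:up\n10.0.0.1:22\n' | buggy   # prints 10.0.0.1:22:10.0.0.1:22
printf 'web01:up\n10.0.0.1:22\n' | fixed   # prints web01:10.0.0.1:22
```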

Sum up from line "A" to line "B" from a big file using awk

aNumber|bNumber|startDate|timeZone|duration|currencyType|cost|
22677512549|778|2014-07-02 10:16:35.000|NULL|NULL|localCurrency|0.00|
22675557361|76457227|2014-07-02 10:16:38.000|NULL|NULL|localCurrency|10.00|
22677521277|778|2014-07-02 10:16:42.000|NULL|NULL|localCurrency|0.00|
22676099496|77250331|2014-07-02 10:16:42.000|NULL|NULL|localCurrency|1.00|
22667222160|22667262389|2014-07-02 10:16:43.000|NULL|NULL|localCurrency|10.00|
22665799922|70110055|2014-07-02 10:16:45.000|NULL|NULL|localCurrency|20.00|
22676239633|433|2014-07-02 10:16:48.000|NULL|NULL|localCurrency|0.00|
22677277255|76919167|2014-07-02 10:16:51.000|NULL|NULL|localCurrency|1.00|
This is the input I have in a CSV file (a sample of millions of lines).
I want to sum up the duration by date.
I only want to sum over the first 1000000 lines.
The awk program I'm using is:
test.awk
BEGIN { FS = "|" }
NR>1 && NR<=1000000
FNR == 1{ next }
{
sub(/ .*/,"",$3)
key=sprintf("%10s",$3)
duration[key] += $5 } END {
printf "%-10s %16s,"dAccused","Duration"
for (i in duration) {
printf "%-4s %16.2f i,duration[i]
}}
I run my script as
$ awk -f test.awk file
The output does not respect my condition NR>1 && NR<=1000000.
Any suggestions, please?
You're looking for this:
BEGIN { FS = "|" }
1 < NR && NR <= 1000000 {
    sub(/ .*/, "", $3)
    key = sprintf("%10s",$3)
    duration[key] += $5
}
END {
    printf "%-10s %16s\n", "dAccused", "Duration"
    for (i in duration) {
        printf "%-4s %16.2f\n", i, duration[i]
    }
}
A lot of errors become obvious with proper indentation.
The reason you saw 1,000,000 lines was due to this:
NR>1 && NR<=1000000
That is a condition with no action block. The default action is to print the current record if the condition is true. That's why you see a lot of awk one-liners end with the number 1: it is an always-true condition, so every record gets printed.
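A quick sketch of that default-print behaviour, using made-up input and hypothetical function names:

```shell
# A bare condition prints every record for which it is true.
bare_condition() { awk 'NR>1 && NR<=3'; }
# A trailing 1 is an always-true condition, so each (modified) record is printed.
trailing_one() { awk '{ sub(/a/, "A") } 1'; }

printf 'l1\nl2\nl3\nl4\n' | bare_condition   # prints l2 and l3
printf 'apple\nbanana\n' | trailing_one      # prints Apple and bAnana
```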
You didn't post any expected output and your duration field is always NULL so it's still not clear what you really want output, but this is probably the right approach:
$ cat tst.awk
BEGIN { FS = "|" }
NR==1 { for (i=1;i<NF;i++) f[$i] = i; next }
{
    sub(/ .*/,"",$(f["startDate"]))
    sum[$(f["startDate"])] += $(f["duration"])
}
NR==1000000 { exit }
END { for (date in sum) print date, sum[date] }
$ awk -f tst.awk file
2014-07-02 0
Instead of discarding your header line, it uses it to build an array f[] that maps each field name to its position in the line. So instead of hard-coding that duration is field 5, you just reference it as $(f["duration"]).
Any time your input file has a header line, don't discard it. Use it, so your script is not coupled to the order of fields in your input file.

Convert rows into columns using awk

Not all columns (and data) are present for all records, so whenever a field is missing it should be replaced with a null (empty) value.
My Input format:
.set 1000
EMP_NAME="Rob"
EMP_DES="Developer"
EMP_DEP="Sales"
EMP_DOJ="20-10-2010"
EMR_MGR="Jack"
.set 1001
EMP_NAME="Koster"
EMP_DEP="Promotions"
EMP_DOJ="20-10-2011"
.set 1002
EMP_NAME="Boua"
EMP_DES="TA"
EMR_MGR="James"
My desired output Format:
Rob~Developer~Sales~20-10-2010~Jack
Koster~~Promotions~20-10-2011~
Boua~TA~~~James
I tried the below:
awk 'NR>1{printf "%s"(/^\.set/?RS:"~"),a} {a=substr($0,index($0,"=")+1)} END {print a}' $line
This is printing:
Rob~Developer~Sales~20-10-2010~Jack
Koster~Promotions~20-10-2011~
Boua~TA~James~
This awk script produces the desired output:
BEGIN { FS = "[=\"]+"; OFS = "~" }
/\.set/ { ++records; next }
NR > 1 { f[records,$1] = $2 }
END {
    for (i = 1; i <= records; ++i) {
        print f[i,"EMP_NAME"], f[i,"EMP_DES"], f[i,"EMP_DEP"], f[i,"EMP_DOJ"], f[i,"EMR_MGR"]
    }
}
A two-dimensional array stores all of the values that are defined for each record.
After the whole file has been processed, the loop goes through each row of the array and prints all of the values. Elements that are undefined evaluate to an empty string.
Specifying the elements explicitly allows you to control the order in which they are printed. Using print rather than printf lets you make proper use of the OFS variable, which has been set to ~, as well as ORS, which is a newline by default.
Thanks to @Ed for his helpful comments that pointed out some flaws in my original script.
Output:
Rob~Developer~Sales~20-10-2010~Jack
Koster~~Promotions~20-10-2011~
Boua~TA~~~James
$ cat tst.awk
BEGIN{ FS="[=\"]+"; OFS="~" }
/\.set/ { ++numRecs; next }
{ name2val[numRecs,$1] = $2 }
!seen[$1]++ { names[++numNames] = $1 }
END {
    for (recNr=1; recNr<=numRecs; recNr++)
        for (nameNr=1; nameNr<=numNames; nameNr++)
            printf "%s%s", name2val[recNr,names[nameNr]], (nameNr<numNames?OFS:ORS)
}
$ awk -f tst.awk file
Rob~Developer~Sales~20-10-2010~Jack
Koster~~Promotions~20-10-2011~
Boua~TA~~~James
If you want a pre-defined order of fields in your output, rather than building it on the fly from the names as they are read, just populate the names[] array explicitly in the BEGIN section. If you have that situation AND don't want to hold the whole file in memory:
$ cat tst.awk
BEGIN{
    FS="[=\"]+"; OFS="~";
    numNames=split("EMP_NAME EMP_DES EMP_DEP EMP_DOJ EMR_MGR",names,/ /)
}
function prtName2val(    nameNr, i) {
    if ( length(name2val) ) {
        for (nameNr=1; nameNr<=numNames; nameNr++)
            printf "%s%s", name2val[names[nameNr]], (nameNr<numNames?OFS:ORS)
        delete name2val
    }
}
/\.set/ { prtName2val(); next }
{ name2val[$1] = $2 }
END { prtName2val() }
$ awk -f tst.awk file
Rob~Developer~Sales~20-10-2010~Jack
Koster~~Promotions~20-10-2011~
Boua~TA~~~James
The above uses GNU awk for length(name2val) and delete name2val. If you don't have that, use for (i in name2val) { do stuff; break } and split("",name2val) instead.
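As a rough sketch of that portable variant (the shell wrapper name rows_portable is made up here, and the input is assumed to arrive on stdin): the for-in loop with break stands in for the length() test, and split("", name2val) empties the array.

```shell
# Hypothetical wrapper: reads the ".set" records on stdin.
rows_portable() {
  awk '
  BEGIN {
    FS = "[=\"]+"; OFS = "~"
    numNames = split("EMP_NAME EMP_DES EMP_DEP EMP_DOJ EMR_MGR", names, / /)
  }
  function prtName2val(    nameNr, i) {
    # for-in plus break replaces length(name2val): the body runs once
    # if the array has any element and not at all if it is empty.
    for (i in name2val) {
      for (nameNr = 1; nameNr <= numNames; nameNr++)
        printf "%s%s", name2val[names[nameNr]], (nameNr < numNames ? OFS : ORS)
      break
    }
    # split("", arr) empties the array without "delete arr".
    split("", name2val)
  }
  /\.set/ { prtName2val(); next }
  { name2val[$1] = $2 }
  END { prtName2val() }'
}
```

It would be run as rows_portable < file, with file being the input shown in the question.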
This is all I can suggest:
awk '{ t = $0; sub(/^[^"]*"/, "", t); gsub(/"[^"]*"/, "~", t); sub(/".*/, "", t); print t }' file
Or sed:
sed -re 's|^[^"]*"||; s|"[^"]*"|~|g; s|".*||' file
Output:
Rob~Developer~Sales~20-10-2010~Jack~Koster~Promotions~20-10-2011~Boua~TA~James

Awk syntax error pattern matching

I want to match every empty line and, when a line is empty, skip to the next line.
The problem is that when I type
/^$/ { next }
it always gives me a syntax error, which always points to the first '{'. I thought this was the correct syntax. Can anyone help me, please?
My script:
BEGIN{
    FS=" "
    matching=0
    num=0
}
{
    /^$/ { next }
    if(matching==1 && num>=NF){
        print("")
        matching=0
    }
    if(match($NF,type)>0){
        for(i=1;i<=NF;i++){
            printf($i)
        }
        printf("\n")
        num=NF
        matching=1
        next
    }
    if(matching==1){
        for(i=1;i<=NF-num;i++){
            printf(" ")
        }
        printf($NF)
        printf("\n")
    }
}
END{
}
You are trying to use the pattern { action } syntax from within an { action } block there.
You need to move your /^$/ { next } line outside of the action block that starts on line 6 (by moving it above that opening {) to get what you want. Or use an if-style test inside the action block.
Your script:
BEGIN{
    FS=" "
    matching=0
    num=0
}
{ # <---- This starts an action block.
    /^$/ { next } # <----- This is a pattern/action pair.
    if(matching==1 && num>=NF){
        print("")
        matching=0
    }
    if(match($NF,type)>0){
        for(i=1;i<=NF;i++){
            printf($i)
        }
        printf("\n")
        num=NF
        matching=1
        next
    }
    if(matching==1){
        for(i=1;i<=NF-num;i++){
            printf(" ")
        }
        printf($NF)
        printf("\n")
    }
}
END {
}
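Both fixes mentioned above can be sketched on a stripped-down script. The function names and the three-line sample input are made up; the main block only prints the record number and the line, to keep the example short.

```shell
# Variant 1: the pattern/action pair sits at the top level, before the main block.
skip_blank_toplevel() {
  awk '
  /^$/ { next }
  { print NR ": " $0 }'
}

# Variant 2: an if statement does the same test inside the action block.
skip_blank_if() {
  awk '{
    if (/^$/) { next }
    print NR ": " $0
  }'
}

printf 'a\n\nb\n' | skip_blank_toplevel   # prints "1: a" and "3: b"
printf 'a\n\nb\n' | skip_blank_if         # prints "1: a" and "3: b"
```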
Your script has several issues, commented below:
BEGIN{
    FS=" "
    matching=0 # No need to init variables to zero, this is default behavior.
    num=0 # Ditto.
}
{
    /^$/ { next } # You can't just use "condition { action }" when you're already
                  # inside an awk action block. Move this outside of the action
                  # block or change it to "if (/^$/) { next }"
    if(matching==1 && num>=NF){
        print("") # print is a builtin not a function. Just do print "".
        matching=0
    }
    if(match($NF,type)>0){
        for(i=1;i<=NF;i++){
            printf($i) # printf is a builtin, not a function and NEVER put input data
                       # where the printf formatting string should be. Change this to
                       # printf "%s", $i
        }
        printf("\n") # print ""
        num=NF
        matching=1
        next
    }
    if(matching==1){
        for(i=1;i<=NF-num;i++){
            printf(" ") # printf " "
        }
        printf($NF) # printf "%s", $NF
        printf("\n") # print ""
    }
}
END{ # unused and unnecessary, remove this section.
}
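On the printf comments above: a small sketch of the safe form, where the format string is a constant and the input is only ever passed as an argument. The function name and the sample line are made up for illustration.

```shell
# Safe: "%s" is the format string, the field is just data.
print_fields_safe() {
  awk '{
    for (i = 1; i <= NF; i++) printf "%s", $i
    print ""
  }'
}

# Input containing % characters is printed literally.
printf '%s\n' '100% done %s' | print_fields_safe   # prints 100%done%s
```

With printf($i), a field such as %s would be interpreted as a format specification rather than printed as text.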
I suspect that if you posted some sample input and expected output, we could help you write a better (more concise and more idiomatic) script.