AWK complains about number of fields when extracting variables - awk

I have a script to parse a TeamCity directory map file. The script works, but I want to know why refactoring it into using variables breaks it with a seemingly unrelated error message and how I can still have it work using variables.
MAP=/opt/TeamCity/buildAgent/work/directory.map
sed -n -e '1,3d;1,/#/{/#/!p}' $MAP | \
awk ' {
n=split($0, array, "->");
printf(substr(array[1], 6) substr(array[2],2,16) "\n");
}
'
This prints
nicecorp::Master 652293808ace4eb5
nicecorp::Reset Database 652293808ace4eb5
nicecorp::test-single-steps 652293808ace4eb5
nicecorp::Develop 652293808ace4eb5
nicecorp::Pull Requests 652293808ace4eb5
Which is pretty much what I want.
The refactoring that breaks
But then I was trying to extract the sub strings into variables, and the script broke. I changed the last printf statement into this
proj=substr(array[1], 6);
tcdir=substr(array[2],2,16);
printf($proj" " $tcdir);
That just prints this error, although I thought it was more or less the same?
awk: program limit exceeded: maximum number of fields size=32767
FILENAME="-" FNR=1 NR=1
This error seems a bit weird, given that my total input is about 500 bytes, 60 times less than the limit they complain about with regards to fields.
AWK version: mawk (1994)
Data format ($ head -10 directory.map):
#Don't edit this file!
#Nov 5, 2019 1:49:26 PM UTC
--version=2
bt30=nicecorp::Master -> 652293808ace4eb5 |?| Oct 29, 2019 4:14:27 PM UTC |:| default
bt32=nicecorp::Reset Database -> 652293808ace4eb5 |?| Oct 30, 2019 1:01:48 PM UTC |:| default
bt33=nicecorp::test-single-steps -> b96874cc9acaf874 |?| Nov 4, 2019 4:20:13 PM UTC |:| default
bt33=nicecorp::test-single-steps -> 652293808ace4eb5 |?| Nov 5, 2019 9:00:37 AM UTC |:| default
bt28=nicecorp::Develop -> 652293808ace4eb5 |?| Nov 5, 2019 1:07:53 PM UTC |:| default
bt29=nicecorp::Pull Requests -> 652293808ace4eb5 |?| Nov 5, 2019 1:18:08 PM UTC |:| default
#

The source of the problem is that the print statement in the refactor uses shell notation for variables ($proj instead of proj, $tcdir instead of tcdir).
In awk, $var means "the field whose number is the value of var". When that value looks numeric (e.g., tcdir=652293808ace4eb5 for the first line, which awk converts to 652293808), awk (mawk in this case) tries to print field number 652293808. Current versions of gawk will not fail here: they realise there are only a few fields and return an empty string for the non-existent ones (or the full line for $0, if the value is not numeric at all).
Older or less forgiving versions may attempt to extend the field array to the requested size, resulting in the "maximum number of fields" limit error.
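You can see the field-dereference behaviour in isolation with a throwaway one-liner (a quick illustration, not part of the original script):
$ echo "a b c" | awk '{ v = "2"; print $v }'
b
$ echo "a b c" | awk '{ v = "2x"; print $v }'   # "2x" is coerced to the number 2
b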
Also note two minor issues: the refactored code uses proj as the printf format string, so it will get confused if the data ever contains '%', and it prints no trailing newline. Did you really mean printf and not print?
Fix:
proj=substr(array[1], 6);
tcdir=substr(array[2], 2, 16);
# print avoids treating the data as a printf format string:
print proj, tcdir
# or, if you prefer printf, pass the data as arguments rather than as the format:
# printf("%s %s\n", proj, tcdir);

The problem was syntax. I was using the shell-style $tcdir to insert the value of the variable instead of simply tcdir. That way the tcdir part of $tcdir is resolved to a numeric field index, meaning I was trying to print the value of a field, not the variable tcdir.

Related

Using an awk while-loop in Conky

I'm struggling with a while() loop in a Conky script.
Here's what I want to do :
I'm tunneling a command output to awk, extracting and formatting data.
The problem is : the output could contain 1 to n sections, and I want to get values from each one of them.
Here's the output sent to awk :
1) -----------
name: wu_1664392603_228876_0
WU name: wu_1664392603_228876
project URL: https://boinc.loda-lang.org/loda/
received: Thu Oct 6 15:31:40 2022
report deadline: Thu Oct 13 15:31:40 2022
ready to report: no
state: downloaded
scheduler state: scheduled
active_task_state: EXECUTING
app version num: 220917
resources: 1 CPU
estimated CPU time remaining: 1379.480287
elapsed task time: 5858.009798
slot: 1
PID: 2221366
CPU time at last checkpoint: 5690.500000
current CPU time: 5712.920000
fraction done: 0.809000
swap size: 1051 MB
working set size: 973 MB
2) -----------
name: wu_1664392603_228908_0
WU name: wu_1664392603_228908
project URL: https://boinc.loda-lang.org/loda/
received: Thu Oct 6 15:31:53 2022
report deadline: Thu Oct 13 15:31:53 2022
ready to report: no
state: downloaded
scheduler state: scheduled
active_task_state: EXECUTING
app version num: 220917
resources: 1 CPU
estimated CPU time remaining: 1393.925106
elapsed task time: 5849.961764
slot: 7
PID: 2221367
CPU time at last checkpoint: 5654.640000
current CPU time: 5682.160000
fraction done: 0.807000
swap size: 802 MB
working set size: 728 MB
...
And here's the final output I want :
boinc.loda wu_1664392603_2288 80.9 07/10 01h37
boinc.loda wu_1664392603_2289 80.7 07/10 02h38
I managed to get the data I want ("WU name", "project URL", "estimated CPU time remaining" AND "fraction done") from one particular section using this code:
${execi 60 boinccmd --get_tasks | awk -F': |://|/' '\
/URL/ && ++i==1 {u=$3}\
/WU/ && ++j==1 {w=$2}\
/fraction/ && ++k==1 {p=$2}\
/estimated/ && ++l==1 {e=strftime("%d/%m %Hh%M",$2+systime())}\
END {printf "%.10s %.18s %3.1f %s", u, w, p*100, e}\
'}
This is quite inelegant, as I must repeat this code n times, increasing the i, j, k, l values to get the whole dataset (n is related to CPU threads; my PC has 8 threads, so I repeat the code 8 times).
I'd like the script to adapt to other CPUs, where n could be anything from 1 to ...
The obvious solution is to use a while() loop, parsing the whole dataset.
But nesting a conditional loop into an awk sequence calling an external command seems too tricky for me, and Conky scripts aren't really easy to debug, as Conky may hang without any error output or log if the script's syntax is bad.
Any help will be appreciated :)
Assumptions:
the sample input shows two values for estimated that are ~14.5 seconds apart (1379.480287 and 1393.925106), but the expected output shows the estimated values as being ~61 minutes apart (07/10 01h37 and 07/10 02h38); for now I'm going to assume this is due to the OP's repeated runs of execi returning widely varying values for the estimated lines
each section of execi output always contains 4 matching lines (URL, WU, fraction, estimated), and these 4 strings only occur once within a section of execi output
I don't have execi installed on my system, so to emulate the OP's execi I've cut-n-pasted the OP's sample execi results into a local file named execi.dat.
Tweaking the OP's current awk script so that it also eliminates the need for a bash loop (one that repeatedly calls execi | awk):
cat execi.dat | awk -F': |://|/' '
FNR==NR { st=systime() }                                   # st = time of this run (systime() is a gawk extension)
/URL/ { found++; u=$3 }                                    # host part of the project URL
/WU/ { found++; w=$2 }                                     # WU name
/fraction/ { found++; p=$2 }                               # fraction done
/estimated/ { found++; e=strftime("%d/%m %Hh%M",$2+st) }   # remaining seconds -> estimated finish date/time
found==4 { printf "%.10s %.18s %3.1f %s\n", u, w, p*100, e; found=0 }
'
This generates:
boinc.loda wu_1664392603_2288 80.9 06/10 17h47
boinc.loda wu_1664392603_2289 80.7 06/10 17h47
NOTE: the last value appears to be duplicated but that's due to the sample estimated values only differing by ~14.5 seconds
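Dropped back into the Conky config, following the same layout as the original ${execi} block, it might look like this (a sketch only; I have not tested it inside Conky itself):
${execi 60 boinccmd --get_tasks | awk -F': |://|/' '\
FNR==NR {st=systime()}\
/URL/ {found++; u=$3}\
/WU/ {found++; w=$2}\
/fraction/ {found++; p=$2}\
/estimated/ {found++; e=strftime("%d/%m %Hh%M",$2+st)}\
found==4 {printf "%.10s %.18s %3.1f %s\n", u, w, p*100, e; found=0}\
'}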

Using awk to replace and add text

I have the following .txt file:
##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=G,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##bcftools_viewVersion=1.12-57-g0c2765b+htslib-1.12-45-g1830551
##bcftools_viewCommand=view -h 20Perc.SNPs.mergedAll.vcf; Date=Tue Sep 28 09:46:59 2021
I would like to make two edits to this text. First, in the line:
##FORMAT=<ID=PL,Number=.,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
I would like to replace "Number=." with "Number=G"
And immediately after the line:
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
I would like to add a new line of text (and a line break):
##INFO=<ID=QualityScore,Number=.,Type=Float,Description="Quality score">
I was wondering if this could be done with one or two awk commands.
Thanks for any suggestions!
My solution is similar to @Daweo's. Consider this script, replace.awk:
/^##FORMAT/ { sub(/Number=\./, "Number=G") }
/##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">/ {
print
print "##INFO=<ID=QualityScore,Number=.,Type=Float,Description=\"Quality score\">"
next
}
1
Run it:
awk -f replace.awk file.txt
Notes
The first line is easy to understand: it is a straight replacement.
The next group of lines deals with your second requirement. First, the print statement prints out the current line.
The next print statement prints out your new line of text.
The next command skips to the next input line.
Finally, the pattern 1 tells awk to print every line.
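If you want the edits written back to file.txt rather than printed to stdout, and you have GNU AWK 4.1 or later, its inplace extension can do that (keeping a backup first is prudent; plain awk/mawk do not support this option):
cp file.txt file.txt.bak                   # keep a backup copy
gawk -i inplace -f replace.awk file.txt    # rewrites file.txt in place (gawk >= 4.1 only)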
I would do it with GNU AWK the following way; let file.txt content be
##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=G,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##bcftools_viewVersion=1.12-57-g0c2765b+htslib-1.12-45-g1830551
##bcftools_viewCommand=view -h 20Perc.SNPs.mergedAll.vcf; Date=Tue Sep 28 09:46:59 2021
then
awk '/##FORMAT=<ID=PL/{gsub("Number=\\.","Number=G")}/##INFO=<ID=AF/{print;print "##INFO=<ID=QualityScore,Number=.,Type=Float,Description=\x22Quality score\x22>";next}{print}' file.txt
output
##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=G,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=QualityScore,Number=.,Type=Float,Description="Quality score">
##bcftools_viewVersion=1.12-57-g0c2765b+htslib-1.12-45-g1830551
##bcftools_viewCommand=view -h 20Perc.SNPs.mergedAll.vcf; Date=Tue Sep 28 09:46:59 2021
Explanation: if the current line contains ##FORMAT=<ID=PL, change Number=\\. to Number=G (the backslashes are required so the . is matched literally rather than meaning "any character"). If the current line contains ##INFO=<ID=AF, print it, then print ##INFO=<ID=QualityScore,Number=.,Type=Float,Description=\x22Quality score\x22> (\x22 is the hex escape code for ", since a literal " cannot be used inside a "-delimited string), and go to the next line. The final print handles all lines except those containing ##INFO=<ID=AF, as these have their own printing.
(tested in gawk 4.2.1)

Appending the datetime to the end of every line in a 600 million row file

I have a 680 million row (19 GB) file, and I need the datetime appended onto every line. I get this file every night and I have to add the time that I processed it to the end of each line. I have tried many ways to do this including sed/awk and loading it into a SQL database with the last column being defaulted to the current timestamp.
I was wondering if there is a fast way to do this? My fastest way so far takes two hours and that is just not fast enough given the urgency of the information in this file. It is a flat CSV file.
edit1:
Here's what I've done so far:
awk -v date="$(date +"%Y-%m-%d %r")" '{ print $0","date}' lrn.ae.txt > testoutput.txt
Time = 117 minutes
perl -ne 'chomp; printf "%s.pdf\n", $_' EXPORT.txt > testoutput.txt
Time = 135 minutes
mysql load data local infile '/tmp/input.txt' into table testoutput
Time = 211 minutes
You don't specify if the timestamps have to be different for each of the lines. Would a "start of processing" time be enough?
If so, a simple solution is to use the paste command, with a pre-generated file of timestamps, exactly the same length as the file you're processing. Then just paste the whole thing together. Also, if the whole process is I/O bound, as others are speculating, then maybe running this on a box with an SSD drive would help speed up the process.
I just tried it locally on a 6 million row file (roughly 1% of yours), and it's actually able to do it in less than one second, on a MacBook Pro with an SSD drive.
~> date; time paste file1.txt timestamps.txt > final.txt; date
Mon Jun 5 10:57:49 MDT 2017
real 0m0.944s
user 0m0.680s
sys 0m0.222s
Mon Jun 5 10:57:49 MDT 2017
I'm going to now try a ~500 million row file, and see how that fares.
Updated:
OK, the results are in. paste is blazing fast compared to your solution: it took just over 90 seconds total to process the whole thing, 600M rows of simple data.
~> wc -l huge.txt
600000000 huge.txt
~> wc -l hugetimestamps.txt
600000000 hugetimestamps.txt
~> date; time paste huge.txt hugetimestamps.txt > final.txt; date
Mon Jun 5 11:09:11 MDT 2017
real 1m35.652s
user 1m8.352s
sys 0m22.643s
Mon Jun 5 11:10:47 MDT 2017
You still need to prepare the timestamps file ahead of time, but that's a trivial bash loop. I created mine in less than one minute.
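For completeness, a minimal sketch of one way to pre-generate that timestamps file (assuming the data file is huge.txt, as above; it simply writes the same run timestamp once per input line):
awk -v ts="$(date +'%Y-%m-%d %r')" '{print ts}' huge.txt > hugetimestamps.txt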
A solution that simplifies mjuarez' helpful approach:
yes "$(date +"%Y-%m-%d %r")" | paste -d',' file - | head -n "$(wc -l < file)" > out-file
Note that, as with the approach in the linked answer, you must know the number of input lines in advance - here I'm using wc -l to count them, but if the number is fixed, simply use that fixed number.
yes keeps repeating its argument indefinitely, each on its own output line, until it is terminated.
paste -d',' file - pastes a corresponding pair of lines from file and stdin (-) on a single output line, separated with ,
Since yes produces "endless" output, head -n "$(wc -l < file)" ensures that processing stops once all input lines have been processed.
The use of a pipeline acts as a memory throttle, so running out of memory shouldn't be a concern.
Another alternative to test is
$ date +"%Y-%m-%d %r" > timestamp
$ join -t, -j9999 file timestamp | cut -d, -f2-
or time stamp can be generated in place as well <(date +"%Y-%m-%d %r")
join creates a cross product of the first and second files using the non-existent field (9999); since the second file is only one line, this practically appends it to every line of the first file. The cut is needed to get rid of the empty key field that join generates.
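Putting those two pieces together with bash process substitution, for illustration:
join -t, -j9999 file <(date +"%Y-%m-%d %r") | cut -d, -f2-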
If you want to add the same (current) datetime to each row in the file, you might as well leave the file as it is, and put the datetime in the filename instead. Depending on the use later, the software that processes the file could then first get the datetime from the filename.
To put the same datetime at the end of each row, some simple code could be written:
Make a string containing a separator and the datetime.
Read the lines from the file, append the above string and write back to a new file.
This way a conversion from datetime to string is only done once, and converting the file should not take much longer than copying the file on disk.
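As a rough sketch of that idea, using the file name from the question (lrn.ae.txt) purely as an example:
# stamp the processing time into the file name instead of every row
mv lrn.ae.txt "lrn.ae.$(date +'%Y-%m-%d_%H%M%S').txt"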

Using AWK to Retrieve an Error in a Cryptic Log File

I need to grab an occurrence of an error for the current time, ignoring earlier occurrences. The problem is that the date is a few lines above (not on the same line as the error code). How do I return the information starting from
***begin ibmdb error message***
which has the date & time, so that I can compare it to the current time, and include all of this error log data:
*** begin ibmdb error message ***
Sun Dec 18 21:50:57 2016 - program 'execjob', User 'OSID:root', RMId 'root' Driver Version '9.0.1.14.865 2015-01-20 04:00:00'
DELETEDBREC() error on file 'USERRPT' in 'GEN'
DeleteSqlRec(lawson."USERRPT", 1)
DB2 FATAL ERROR for SQLExecute - Code: 40001/-911
[IBM][CLI Driver][DB2/AIX64] SQL0911N The current transaction has been rolled
back because of a deadlock or timeout. Reason code "68". SQLSTATE=40001
awk 'BEGIN{FS="begin ibmdb error message"}' captures the beginning; how do I encapsulate the ending with Reason code "68"?
FS tells awk that the fields on your line will be separated by 'begin ibmdb error message'
You probably want to do something like
awk '/begin ibmdb error message/,/Reason code "68"/'
Something like this? I have started from the time point just for testing, instead of begin ibmdb error message, since I thought there might be more sections starting with the same text.
$ awk '/21:50/,/Reason code "68"/' file11
Sun Dec 18 21:50:57 2016 - program 'execjob', User 'OSID:root', RMId 'root' Driver Version '9.0.1.14.865 2015-01-20 04:00:00'
DELETEDBREC() error on file 'USERRPT' in 'GEN'
DeleteSqlRec(lawson."USERRPT", 1)
DB2 FATAL ERROR for SQLExecute - Code: 40001/-911
[IBM][CLI Driver][DB2/AIX64] SQL0911N The current transaction has been rolled
back because of a deadlock or timeout. Reason code "68". SQLSTATE=40001
Tip: you can read about awk's pattern-matching capabilities here: https://www.gnu.org/software/gawk/manual/html_node/Expression-Patterns.html
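To also honour the "current time" part of the question, one option is to pass today's date into awk and only print blocks whose timestamp line contains it. A rough sketch (untested against the real log, and it assumes the log's day-of-month formatting matches the output of date's %a %b %e):
awk -v today="$(date '+%a %b %e')" '
  /begin ibmdb error message/, /Reason code "68"/ { block = block $0 ORS }
  /Reason code "68"/ { if (index(block, today)) printf "%s", block; block = "" }
' file11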

Processing apache logs quickly

I'm currently running an awk script to process a large (8.1GB) access-log file, and it's taking forever to finish. In 20 minutes, it wrote 14MB of the (1000 +- 500)MB I expect it to write, and I wonder if I can process it much faster somehow.
Here is the awk script:
#!/bin/bash
awk '{t=$4" "$5; gsub("[\[\]\/]"," ",t); sub(":"," ",t);printf("%s,",$1);system("date -d \""t"\" +%s");}' $1
EDIT:
For non-awkers, the script reads each line, gets the date information, modifies it to a format the utility date recognizes and calls it to represent the date as the number of seconds since 1970, finally returning it as a line of a .csv file, along with the IP.
Example input: 189.5.56.113 - - [22/Jan/2010:05:54:55 +0100] "GET (...)"
Returned output: 189.5.56.113,124237889
@OP, your script is slow mainly due to the call to the external date command for every line in the file, and it's a big file as well (in the GB range). If you have gawk, use its internal mktime() function to do the date-to-epoch-seconds conversion:
awk 'BEGIN{
m=split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",d,"|")
for(o=1;o<=m;o++){
date[d[o]]=sprintf("%02d",o)
}
}
{
gsub(/\[/,"",$4); gsub(":","/",$4); gsub(/\]/,"",$5)
n=split($4, DATE,"/")
day=DATE[1]
mth=DATE[2]
year=DATE[3]
hr=DATE[4]
min=DATE[5]
sec=DATE[6]
MKTIME= mktime(year" "date[mth]" "day" "hr" "min" "sec)
print $1,MKTIME
}' file
output
$ more file
189.5.56.113 - - [22/Jan/2010:05:54:55 +0100] "GET (...)"
$ ./shell.sh
189.5.56.113 1264110895
If you really really need it to be faster, you can do what I did. I rewrote an Apache log file analyzer using Ragel. Ragel allows you to mix regular expressions with C code. The regular expressions get transformed into very efficient C code and then compiled. Unfortunately, this requires that you are very comfortable writing code in C. I no longer have this analyzer. It processed 1 GB of Apache access logs in 1 or 2 seconds.
You may have limited success removing unnecessary printfs from your awk statement and replacing them with something simpler.
If you are using gawk, you can massage your date and time into a format that mktime (a gawk function) understands. It will give you the same timestamp you're using now and save you the overhead of repeated system() calls.
This little Python script handles ~400MB worth of copies of your example line in about 3 minutes on my machine, producing ~200MB of output (keep in mind your sample line was quite short, so that's a handicap):
import time
src = open('x.log', 'r')
dest = open('x.csv', 'w')
for line in src:
    ip = line[:line.index(' ')]
    date = line[line.index('[') + 1:line.index(']') - 6]
    t = time.mktime(time.strptime(date, '%d/%b/%Y:%X'))
    dest.write(ip)
    dest.write(',')
    dest.write(str(int(t)))
    dest.write('\n')
src.close()
dest.close()
A minor problem is that it doesn't handle timezones (strptime() problem), but you could either hardcode that or add a little extra to take care of it.
But to be honest, something as simple as that should be just as easy to rewrite in C.
gawk '{
  dt=substr($4,2);                # strip the leading "[" -> 22/Jan/2010:05:54:55
  gsub(/\//," ",dt);              # -> 22 Jan 2010:05:54:55
  sub(/:/," ",dt);                # -> 22 Jan 2010 05:54:55
  cmd="date -d \""dt"\" +%s";
  cmd|getline ts;
  close(cmd);                     # close the pipe, or each distinct date leaks an open file handle
  print $1, ts
}' yourfile