I was reading an article showing how to migrate markdown files from Pelican to Hugo. I'm trying to understand what the awk script is doing:
# begin block, executed once,
# to set field separator, output field separator & print 3 dashes
BEGIN { FS = ":"; OFS = ":"; print "---" }
# ???
!c && /^$/ { print "---\n"; c = 1 }
# user defined function?
c { print; next }
# user defined function?
!c {
# lower first field
$1 = tolower($1)
# if first field is "date"
if ($1 == "date") {
# transform second field
$2 = gensub(/ ([^.]+)\.([^.]+).([^.]+)/, " \\3-\\2-\\1", 1, $2)
$2 = gensub(/-([0-9])-/, "-0\\1-", 1, $2)
}
if ($1 == "tags")
$2 = " [" gensub(/[-a-z]+/, "'\\0'", "g", substr($2, 2)) "]"
print
}
I don't really understand what c and !c are. Are they user-defined functions, without the function keyword and without parameters? What exactly is the meaning of c=1?
c is a variable. c = 1 sets the value of c to 1.
c used as a pattern is a test of the variable c; it is true when c is set to anything other than 0 or the empty string.
!c is the opposite test; it is true if c is not set, is empty, or is 0.
c { print; next } means: if c is set to something other than nothing or 0, then print (this prints the whole line, since nothing else is specified). next stops processing the current line, skips to the next line and starts over from the first rule.
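A minimal sketch of the same flag idiom, using made-up input and made-up "meta:"/"body:" prefixes (neither is from the article): the variable c starts out unset, so it tests false until the first empty line sets it.

```shell
# Hypothetical input: two header lines, an empty line, then one body line.
out=$(printf '%s\n' 'Title: Hello' 'Date: 2020-01-02' '' 'Body line' | awk '
!c && /^$/ { c = 1; next }              # first empty line: raise the flag, print nothing
c          { print "body: " $0; next }  # flag is set: every later line is body
           { print "meta: " $0 }        # flag still unset: header lines
')
printf '%s\n' "$out"
```

Each rule is just a pattern followed by an action in braces; no function is being defined anywhere.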
I've updated the question with additional information.
I have a .fastq file formatted in the following way
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 (sequence name)
CATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC.. (sequence)
+
ACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFF.. (sequence quality)
For each sequence the format is the same (repetition of 4 lines)
What I am trying to do is search for a specific regex pattern ([A-Z]{5,}ACA[A-Z]{5,}ACA[A-Z]{5,}) in a window of n=35 characters of the 2nd line, cut it if found, and report it at the end of the previous line.
So far I've written a bunch of code that does almost what I want. I thought of using the match function together with substr on my window of interest, but I didn't achieve my goal. I report the script.awk below:
match(substr($0,0,35),/regexp/,a) {
print p,a[0] #print the previous line respect to the matched one
print #print the current line
for(i=0;i<=1;i++) { # print the 2 lines following
getline
print
}
}
# store previous line
{ p = $0 }
Starting from a file like this:
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
AACATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
GGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
I would like to obtain an output like this:
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 TATTCACATATAGACATGAAA #is the string that matched the regexp WITHOUT initial AA that doesn't match my expression
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC #without initial AA
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF # without "GGGGGGGGDGGGFGGGGGGFGGG" that is the same number of characters removed in the 2nd line
$ cat tst.awk
BEGIN {
tgtStr = "pattern"
tgtLgth = length(tgtStr)
winLgth = 35
numLines = 4
}
{
lineNr = ( (NR-1) % numLines ) + 1
rec[lineNr] = $0
}
lineNr == numLines {
if ( idx = index(substr(rec[2],1,winLgth),tgtStr) ) {
rec[1] = rec[1] " " tgtStr
rec[2] = substr(rec[2],idx+tgtLgth)
rec[4] = substr(rec[4],idx+tgtLgth)
}
for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
print rec[lineNr]
}
}
$ awk -f tst.awk file
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 pattern
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
wrt the code you posted:
substr($0,0,35) - strings, fields, line numbers, and arrays in awk start at 1 not 0 so that should be substr($0,1,35). Awk will compensate for your mistake and treat it as if you had written 1 instead of 0 in this case but get used to starting everything at 1 to avoid mistakes when it matters.
for(i=0;i<=1;i++) - should be for(i=1;i<=2;i++) for the same reason.
getline - not an appropriate use here, and syntactically fragile.
Update - per your comment below that pattern is actually a regexp rather than a string:
$ cat tst.awk
BEGIN {
tgtRegexp = "[A-Z]{5,}ACA[A-Z]{5,}ACA[A-Z]{5,}"
winLgth = 35
numLines = 4
}
{
lineNr = ( (NR-1) % numLines ) + 1
rec[lineNr] = $0
}
lineNr == numLines {
if ( match(substr(rec[2],1,winLgth),tgtRegexp) ) {
rec[1] = rec[1] " " substr(rec[2],RSTART,RLENGTH)
rec[2] = substr(rec[2],RSTART+RLENGTH)
rec[4] = substr(rec[4],RSTART+RLENGTH)
}
for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
print rec[lineNr]
}
}
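For reference, a small sketch of what match() leaves behind in RSTART and RLENGTH, on a made-up string rather than a real read. It deliberately uses a simpler regexp than the one in the question, because interval expressions such as {5,} may need a reasonably recent awk.

```shell
# match() returns the 1-based start of the leftmost match (0 if none)
# and sets RSTART and RLENGTH as a side effect.
out=$(printf '%s\n' 'xxabc123yy' | awk '
{
    if (match($0, /[0-9]+/)) {
        # start, length, matched text, everything after the match
        print RSTART, RLENGTH, substr($0, RSTART, RLENGTH), substr($0, RSTART + RLENGTH)
    }
}')
printf '%s\n' "$out"    # prints: 6 3 123 yy
```

substr($0, RSTART + RLENGTH) is the same "cut everything up to and including the match" step that the script applies to rec[2] and rec[4].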
I warn you, I wanted to have some fun and it is twisted.
awk -v pattern=pattern -v window=15 '
BEGIN{RS="#";FS=OFS="\n"}
{pos = match($2, pattern); n_del=pos+length(pattern)}
pos && (n_del<=window){$1 = $1 " " pattern; $2=substr($2, n_del); $4=substr($4, n_del)}
NR!=1{printf "%s%s", RS, $0}
' file
Input :
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACGCpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGDGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
Output :
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 pattern
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACGCpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGDGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
Second block is not updated because window is 15 and it cannot find the pattern within this window.
I used the variable RS to handle each 4-line block as a single record, with $0, $1, $2, $3 and $4. Because the input file starts with RS and does not end with RS, I preferred not to set ORS and to use printf instead of print.
I want to awk a list of data files. All records before a given marker record (there is an unknown number of them), e.g.,
/10-12-2014 06:47:59/{p=1}
are to be skipped.
A brief template of one data file looks like this:
data_file_001
0; n records to be skipped
1;10-12-2014 06:47:59;
2;12-12-2014 10:17:44;
3;12-12-2014 10:37:44;
4;14-12-2014 10:00:32;
5;;movefield
6;16-12-2014 04:15:39;
needed Output ($2 datefield reformatted and $3 moved to $4):
colnum;date;col3;col4;col5
2;12.12.14;;
3;12.12.14;;
4;14.12.14;;
5;;;movefield;moved
6;16.12.14;;
My source file is this at the moment:
BEGIN { OFS=FS=";" ; print "colnum;date;col3;col4;col5"}
FNR == 1 { p=0 }
$3 == "movefield" { $4 = $3; $5 = "moved"; $3 = ""}
#(x=index($2," ") > 0) {DDMMYY = substr($2,1,x-1)}
$2=substr($2,1,11)
p!=0{print};
/10-12-2014 06:47:59/{p=1}
I have problems reformatting the date fields: neither the pattern-action (x=index($2," ") > 0) {DDMMYY = substr($2,1,x-1)} nor $2=substr($2,1,11) works in conjunction with the movefield action. Notice that the record where the movefield field appears has no date field.
Please keep in mind that the awk script is meant to be used on a bunch of files (loop).
With GNU awk for in-place editing, no loop required:
awk -i inplace '
BEGIN { OFS=FS=";" ; print "colnum","date","col3","col4","col5" }
FNR==1 { next }
$3 == "movefield" { $4 = $3; $5 = "moved"; $3 = ""; print; next }
{ sub(/ .*/,"",$2); gsub(/-/,".",$2); print $0, ""}
' file*
Another in GNU awk:
$ awk '
function refmt(str) { # reformat date for comparing
split(str,d,"[ :-]")
return mktime(d[3] " " d[2] " " d[1] " " d[4] " " d[5] " " d[6])
}
BEGIN {
FS=OFS=";"
start=refmt("10-12-2014 06:47:59") # reformat the threshold date
print "colnum","date","col3","col4" # print header (why 5?)
}
refmt($2)>start || $2=="" { # if date > start or empty
sub(/ .*/,"",$2) # delete time part
gsub(/-/,".",$2) # replace - by .
$4=$3; $3="" # or $3 = OFS $3
print # output
}' file
colnum;date;col3;col4
2;12.12.2014;;
3;12.12.2014;;
4;14.12.2014;;
5;;;movefield
6;16.12.2014;;
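The question asked for a two-digit year (12.12.14), which neither script above produces. One possible way to get it, as a sketch only and assuming every non-empty date field looks like DD-MM-YYYY followed by a space and a time, is to split the field and rebuild it:

```shell
# Sample records copied from the question; d is a scratch array introduced here.
out=$(printf '%s\n' '2;12-12-2014 10:17:44;' '5;;movefield' | awk '
BEGIN { FS = OFS = ";" }
$2 != "" {
    split($2, d, /[- ]/)                      # d[1]=DD, d[2]=MM, d[3]=YYYY, d[4]=time
    $2 = d[1] "." d[2] "." substr(d[3], 3)    # keep only the last two digits of the year
}
{ print }
')
printf '%s\n' "$out"
```

Records with an empty date field, like the movefield one, skip the first rule and are printed unchanged.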
I have a dictionary dict with records separated by ":" and data fields by new lines, for example:
:one
1
:two
2
:three
3
:four
4
Now I want awk to substitute all occurrences of each record in the input
file, eg
onetwotwotwoone
two
threetwoone
four
My first awk script looked like this and works just fine:
BEGIN { RS = ":" ; FS = "\n"}
NR == FNR {
rep[$1] = $2
next
}
{
for (key in rep)
gsub(key,rep[key])
print
}
giving me:
12221
2
321
4
Unfortunately another dict file contains some character used by regular expressions, so I have to substitute escape characters in my script. By moving key and rep[key] into a string (which can then be parsed for escape characters), the script will only substitute the second record in the dict. Why? And how to solve?
Here's the current second part of the script:
{
for (key in rep)
orig=key
trans=rep[key]
gsub(/[\]\[^$.*?+{}\\()|]/, "\\\\&", orig)
gsub(orig,trans)
print
}
All scripts are run by awk -f translate.awk dict input
Thanks in advance!
Your fundamental problem is using strings in regexp and backreference contexts when you don't want them and then trying to escape the metacharacters in your strings to disable the characters that you're enabling by using them in those contexts. If you want strings, use them in string contexts, that's all.
You don't want this:
gsub(regexp,backreference-enabled-string)
You want something more like this:
index(...,string) substr(string)
I think this is what you're trying to do:
$ cat tst.awk
BEGIN { FS = ":" }
NR == FNR {
if ( NR%2 ) {
key = $2
}
else {
rep[key] = $0
}
next
}
{
for ( key in rep ) {
head = ""
tail = $0
while ( start = index(tail,key) ) {
head = head substr(tail,1,start-1) rep[key]
tail = substr(tail,start+length(key))
}
$0 = head tail
}
print
}
$ awk -f tst.awk dict file
12221
2
321
4
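To see why the string functions matter, here is the same index()/substr() loop wrapped in a function and run on a key and a replacement that both contain characters gsub() would treat specially. replace_all and the sample strings are made up for this illustration.

```shell
out=$(printf '%s\n' 'a.b axb a.b' | awk '
# Replace every literal occurrence of key in str with val.
# head and start are local variables (extra parameters, by awk convention).
function replace_all(str, key, val,    head, start) {
    head = ""
    while ((start = index(str, key)) > 0) {
        head = head substr(str, 1, start - 1) val
        str = substr(str, start + length(key))
    }
    return head str
}
{ print replace_all($0, "a.b", "X&") }
')
printf '%s\n' "$out"    # prints: X& axb X&
```

With gsub(/a.b/, "X&") the "." would also match the "x" in "axb", and the "&" in the replacement would expand to the matched text.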
Never mind my asking....
Just some missing braces...?!
{
for (key in rep)
{
orig=key
trans=rep[key]
gsub(/[\]\[^$.*?+{}\\()|]/, "\\\\&", orig)
gsub(orig,trans)
}
print
}
works like a charm.
I need to create a script that uses awk to sum the values of several columns that are output by an LSF command. I also need the script to skip the headers in the first line. This is what I have so far; will it work? I'm not sure that it will properly skip the first line and add the others. I would test it, but I do not have access to the LSF machines.
bhosts | awk '
BEGIN { running=suspended=reserved=0; }
NR < 2 { next }
(running = running + $6)
(a = a + $7)
(b = b + $8)
(suspended = a + b)
(reserved = reserved + $9)
END {
...
...
}'
exit
I can't test either. This would be better asked on http://codereview.stackexchange.com, but if you want to do some calculations on every line except the first one:
bhosts | awk '
NR >= 2 {
running += $6
a += $7
b += $8
suspended = a + b
reserved += $9
}
END {
...
}
'
Uninitialized variables are automatically treated as zero in numeric context, so it's not strictly necessary to initialize them.
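Since neither of us has an LSF machine, the logic can at least be checked against fake input. The header and column positions below are invented for the test and are not the real bhosts layout:

```shell
# Made-up table: one header line, then one row per host.
out=$(printf '%s\n' 'HOST RUN SSUSP USUSP' 'h1 3 1 0' 'h2 4 0 2' | awk '
NR >= 2 { running += $2; suspended += $3 + $4 }   # skip the header, sum the rest
END     { print running + 0, suspended + 0 }      # + 0 prints 0 rather than "" if no rows
')
printf '%s\n' "$out"    # prints: 7 3
```

Once this behaves as expected, only the field numbers need to change for the real command output.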
This is the evolution of these two questions, here, and here.
For my own learning, I'm trying to accomplish two (more) things with the code below:
Instead of invoking my script with # myscript -F "," file_to_process, how can I fold in the '-F ","' part into the script itself?
How can I initialize a variable so that I only assign a value once (ignoring subsequent matches)? You can see from the script that I parse seconds and microseconds in each rule; I'd like to keep the first assignment of sec around so I could subtract it from subsequent matches in the printf() statement.
#!/usr/bin/awk -f
/DIAG:/ {
lbl = $3;
sec = $5;
usec = $6;
/Test-S/ {
stgt = $7;
s1 = $30;
s2 = $31;
}
/Test-A/ {
atgt = $7;
a = $8;
}
/Test-H/ {
htgt = $7;
h = $8;
}
/Test-C/ {
ctgt = $7;
c = $8;
}
}
/WARN:/ {
sec = $4;
usec = $5;
m1 = $2;
m2 = $3
}
{
printf("%16s,%17d.%06d,%7.2f,%7.2f,%7.2f,%7.2f,%7.2f,%7.2f,%7.2f,%7.2f,%7.2f,%5d,%5d\n", lbl, sec, usec, stgt, s1, s2, atgt, a, htgt, h, ctgt, c, m1, m2)
}
Use a BEGIN clause:
BEGIN { FS = ","
var1 = "text"
var2 = 3
# etc.
}
This is run before the line-processing statements. FS is the field separator.
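A quick way to convince yourself that setting FS in BEGIN behaves like -F "," on the command line, using throwaway input:

```shell
# FS is set before the first line is read, so every record is split on commas.
out=$(printf '%s\n' 'a,b,c' 'd,e,f' | awk 'BEGIN { FS = "," } { print $2 }')
printf '%s\n' "$out"
```

In a script file with an awk -f shebang line, the BEGIN block simply goes at the top of the file and the -F option can be dropped from the invocation.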
If you want to parse a value and keep it, it depends on whether you want only the first one or you want each previous one.
To keep the first one:
FNR==1 {
keep=$1
}
To keep the previous one:
BEGIN {
prevone = "initial value"
}
/regex/ {
# do stuff with $1
# do stuff with prevone
prevone = $1
}
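If "the first one" means the first matching line rather than the first line of the file, a flag variable does the job. A sketch with made-up log lines; have_first and first are names introduced here, and the field number is an assumption to adapt to the real data:

```shell
out=$(printf '%s\n' 'DIAG: 100' 'noise' 'DIAG: 105' 'DIAG: 112' | awk '
/DIAG:/ {
    if (!have_first) { first = $2; have_first = 1 }   # assigned on the first match only
    print $2 - first                                  # offset from the first match
}')
printf '%s\n' "$out"
```

This prints 0, 5 and 12 on separate lines: later matches never overwrite first, so it stays available for the subtraction.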