GAWK Script using special characters - gawk

I am having an issue using special characters. I am parsing a text file separated by tabs. I want to have the program add a "*" to the first word in the line if a certain parameter is true.
if ($Var < $3) $1 = \*$1
Now every time I run it I get the error that it is not the end of the line.

Two things, though without more context to test with we can't help you much.
In awk, $Var means "the field whose number is stored in Var", so it only has meaning if you have set Var earlier, e.g. Var=3, in which case gawk evaluates $Var to the value of $3. (If Var is unset, $Var is $0, the whole line.) The other side of that expression, < $3, WILL expand to the value of the 3rd field. If you're getting $Var from the shell environment, you need to let the gawk script 'see' that value, i.e.
awk '{ ..... if ('"$Var"' < $3) $1= "*" $1 .....}'
If you want the string literal '*' prepended, you're better off doing $1 = "*" $1
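As a rough sketch of the same idea, the shell value can also be handed to awk with -v instead of splicing it into the quoted program. The function name star_rows, the awk variable limit and the sample rows are made up for illustration; the tab separators follow the question's description of the input.

```shell
# Hypothetical helper: star_rows THRESHOLD, reading tab-separated lines on stdin.
star_rows() {
    awk -F'\t' -v OFS='\t' -v limit="$1" '
        # Prepend a literal "*" to the first field when limit is less than field 3.
        limit < $3 { $1 = "*" $1 }
        # Print every line, modified or not.
        { print }
    '
}

# Made-up sample input: only the first row has a third field above 3.
printf 'a\tb\t5\nc\td\t1\n' | star_rows 3
```

Assigning to $1 makes awk rebuild the line with OFS, which is why OFS is set to a tab here as well.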
Without sample inputs, sample expected output, actual output and error messages, we'll be playing 20 questions here. If these comments don't solve your problem, please edit your question above to include these items.

Related

Return lines with at least n consecutive occurrences of the pattern in bash [duplicate]

This might be a naive question, but I can't find an answer.
Given a text file, I'd like to find lines with at least a defined number of occurrences of a certain pattern, say, AT[GA]CT.
For example, for n=2 and the file:
ATGCTTTGA
TAGATGCTATACTTGA
TAGATGCTGTATACTTGA
Only the second line should be returned.
I know how to use grep/awk to search for at least one instance of this degenerate pattern, and for some defined number of pattern instances occurring non-consecutively. But the issue is the pattern occurrences MUST be consecutive, and I can't figure out how to achieve that.
Any help appreciated, thank you very much in advance!
I would use GNU AWK for this task in the following way. Let the content of file.txt be
ATGCTTTGA
TAGATGCTATACTTGA
TAGATGCTGTATACTTGA
then
awk 'BEGIN{p="AT[GA]CT";n=2;for(i=1;i<=n;i+=1){pat=pat p}}$0~pat' file.txt
output
TAGATGCTATACTTGA
Explanation: I use a for loop to repeat p n times, then keep only the lines ($0) that match the resulting pattern.
Alternatively you might use string formatting function sprintf as follows:
awk 'BEGIN{n=2}$0~sprintf("(AT[GA]CT){%s}",n)' file.txt
Explanation: I used the sprintf function; %s in its first argument marks where n is inserted, so for n=2 the resulting regex is (AT[GA]CT){2}. If you want to know more about what may be used in the first argument of printf and sprintf, read Format Modifiers in the GNU Awk manual.
(both solutions tested in GNU Awk 5.0.1)
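A possible variation on the loop-based solution, sketched here with the pattern and the repeat count passed in from the shell. The function name consecutive is made up, and each copy of the pattern is wrapped in parentheses on the assumption that the pattern might contain alternation.

```shell
# Hypothetical helper: consecutive PATTERN N, reading lines on stdin.
consecutive() {
    awk -v p="$1" -v n="$2" '
        # Build PATTERN repeated N times back to back.
        BEGIN { for (i = 1; i <= n; i++) pat = pat "(" p ")" }
        # Print only the lines that contain the repeated pattern.
        $0 ~ pat
    '
}

# Same three lines as file.txt above.
printf '%s\n' ATGCTTTGA TAGATGCTATACTTGA TAGATGCTGTATACTTGA | consecutive 'AT[GA]CT' 2
```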

Parsing and creating new arguments with getline AWK code

I am writing a pretty long AWK program (NOT terminal script) to parse through a network trace file. I have a situation where the next line in the trace file is ALWAYS a certain type of 'receive' (3 possible types) - however, I only want AWK to handle/print on one type. In short, I want to tell AWK if the next line contains a certain receive type, do not include it. It is my understanding that getline is the best way to go about this.
I have tried a couple of different variations of getline and getline VAR from the manual, but I still cannot seem to search through and reference fields in the next line the way I want. Updated from edit:
if ((event=="r") && (hopSource == hopDest)) {
    getline x
    if ((x $31 =="arp") || (x $35 =="AODV")) {
        #printf("Badline %s %s \n", $31, $35)
    }
    else {
        macLinkRec++;
        #printf("MAC Link Recieved from HEAD - %d to MEMBER %d \n", messageSource, messageDest)
    }
}
I am using the print(badline) as just a marker to see what is going on. I fully understand how to restructure the code once I get the search and reference correct. I am also able to print the correct 'next' lines. However, I would expect to be able to search through the next line and create new arguments based on what is contained in the next line. How do I search a 'next line' based on an argument in AWK? How do I reference fields in that line to create new arguments?
Final note, the 'next line' number of fields (NF) varies, but I feel that the $35 field reference should handle any problems there.
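For what it's worth, here is a minimal sketch of one way to look at fields of the next line. It relies on two things: getline x stores the next record in x without re-splitting $0, so $31 and $35 still refer to the current line, and x $31 is string concatenation rather than a field reference. Calling split() on x gives field access. The two-column trace layout, the function names and the array name nextf are made up for illustration; the real trace would use positions 31 and 35.

```shell
# Made-up toy trace: field 1 is the event, field 2 the packet type.
trace() { printf '%s\n' 'r data' 'r arp' 'r data' 'r tcp' 's data'; }

count_links() {
    awk '
        $1 == "r" {
            # Read the following line into x; $0 and its fields stay unchanged.
            if ((getline x) > 0) {
                split(x, nextf, " ")    # nextf[1], nextf[2], ... are the fields of the next line
                if (nextf[2] == "arp" || nextf[2] == "AODV")
                    next                # unwanted receive type follows: do not count
            }
            macLinkRec++
        }
        END { print macLinkRec + 0 }
    '
}

trace | count_links
```

Note that getline consumes the line it reads, so the main rules never see that line on their own.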

Is a /start/,/end/ range expression ever useful in awk?

I've always contended that you should never use a range expression like:
/start/,/end/
in awk because although it makes the trivial case where you only want to print matching text including the start and end lines slightly briefer than the alternative*:
/start/{f=1} f{print; if (/end/) f=0}
when you want to tweak it even slightly to do anything else, it requires a complete re-write or results in duplicated or otherwise undesirable code. e.g. if you want to print the matching text excluding the range delimiters using the second form above you'd just tweak it to move the components around:
f{if (/end/) f=0; else print} /start/{f=1}
but if you started with /start/,/end/ you'd need to abandon that approach in favor of what I just posted or you'd have to write something like:
/start/,/end/{ if (!/start|end/) print }
i.e. duplicate the conditions which is undesirable.
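To make the comparison concrete, here is a small sketch that runs the three forms above on a made-up six-line input. The function names are only for illustration.

```shell
# Made-up sample input.
sample() { printf '%s\n' a start b c end d; }

# Flag form, inclusive: prints the start line, the body and the end line.
flag_inclusive() { awk '/start/{f=1} f{print; if (/end/) f=0}'; }

# Flag form, exclusive: the same parts reordered so the delimiters are skipped.
flag_exclusive() { awk 'f{if (/end/) f=0; else print} /start/{f=1}'; }

# Range form, exclusive: the delimiter conditions have to be repeated.
range_exclusive() { awk '/start/,/end/{ if (!/start|end/) print }'; }

sample | flag_inclusive
sample | flag_exclusive
sample | range_exclusive
```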
Then I saw a question posted that required identifying the LAST end in a file and where a range expression was used in the solution and I thought it seemed like that might have some value (see https://stackoverflow.com/a/21145009/1745001).
Now, though, I'm back to thinking that it's just not worth bothering with range expressions at all and a solution that doesn't use range expressions would have worked just as well for that case.
So - does anyone have an example where a range expression actually adds noticeable value to a solution?
*I used to use:
/start/{f=1} f; /end/{f=0}
but too many times I found I had to do something additional when f is true and /end/ is found (or to put it another way ONLY do something when /end/ is found IF f were true) so now I just try to stick to the slightly less brief but much more robust and extensible:
/start/{f=1} f{print; if (/end/) f=0}
Interesting. I also often start with a range expression and then later on switch to using a variable.
I think a situation where this could be useful, aside from the pure range-only situations is if you want to print a match, but only if it lies in a certain range. Also because it is immediately obvious what it does. For example:
awk '/start/,/end/{if(/ppp/)print}' file
with this input:
start
dfgd gd
ppp 1
gfdg
fd gfd
end
ppp 2
ppp 3
start
ppp 4
ppp 5
end
ppp 6
ppp 7
gfdgdgd
will produce:
ppp 1
ppp 4
ppp 5
One could of course also use:
awk '/start/{f=1} /ppp/ && f; /end/{f=0}' file
But it is longer and somewhat less readable.
While you are right that the /start/,/end/ range expression can easily be reimplemented with a conditional, it has many interesting use cases where it is used on its own. As you observe, it might have little value for processing tabular data, which is the main but not the only use case of awk.
So - does anyone have an example where a range expression actually adds noticeable value to a solution?
In the use cases mentioned, the range expression improves legibility. Here are a few examples where the range expression accurately selects the text to be processed. These are only a handful of examples, but there are countless similar applications, demonstrating the incredible versatility of awk.
Filter logs within a time range
Assuming each log line starts with an ISO timestamp, the filter below selects all events in a given range of 1 hour:
awk '/^2015-06-30T12:00:00Z/,/^2015-06-30T13:00:00Z/'
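A quick way to try this out, using made-up log lines. Keep in mind that the range only opens on a line that actually matches the first pattern and only closes on a line that matches the second, so both timestamps have to occur in the log. The function names are hypothetical.

```shell
# Made-up log excerpt; each line starts with an ISO timestamp.
logs() {
    printf '%s\n' \
        '2015-06-30T11:59:58Z boot' \
        '2015-06-30T12:00:00Z job started' \
        '2015-06-30T12:30:10Z job running' \
        '2015-06-30T13:00:00Z job done' \
        '2015-06-30T13:00:01Z idle'
}

# Print everything from the 12:00:00 line through the 13:00:00 line.
hour_window() { awk '/^2015-06-30T12:00:00Z/,/^2015-06-30T13:00:00Z/'; }

logs | hour_window
```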
Extract a document from a file
awk '/---- begin file.data ----/,/---- end file.data ----/'
This can be used to bundle resources with shell scripts (with cat), to extract parts of GPG-signed messages (prepared with --clearsign) or more generally of MIME-messages.
Process LaTeX files
The range pattern can be used to match LaTeX environments, so for instance we can select the abstracts of all articles in our directory:
awk '/begin{abstract}/,/end{abstract}/' *.tex
or all the theorems, to prepare a theorem database!
awk '/begin{theorem}/,/end{theorem}/' *.tex
or write a linter ensuring that theorems do not contain citations (if we regard this as bad style):
awk '
/begin{theorem}/,/end{theorem}/ { if(/\\cite{/) { c+= 1 } }
END { printf("There were %d bad-style citations.\n", c) }
'
or preprocess tables, etc.

How do I use awk split file to multiline records?

On OSX, I've converted a Powerpoint deck to ASCII text, and now want to process this with awk.
I want to split the file into multiline records corresponding to slides in the deck.
Treating any line beginning with a capital Latin letter as the start of a new record provides a good approximation, but I can't figure out how to do this in awk.
I've tried resetting the record separator, RS = "\n^[A-Z]" and RS = "\n^[[:alnum:]][[:upper:]]", and various permutations, but none differentiate. That is, awk keeps treating each individual line as a record, rather than grouping them as I want.
The cleaned text looks like this:
Welcome
++ Class will focus on:
– Basics of SQL syntax
– SQL concepts analogous to Excel concepts
Who Am I
++ Self-taught on LAMP(ython) stack
++ Plus some DNS, bash scripting, XML / XSLT
++ Prior professional experience:
– Office of Management and Budget
– Investment banking (JP Morgan, UBS, boutique)
– MBA, University of Chicago
Roadmap
+ Preliminaries
+ What is SQL
+ Excel vs SQL
+ Moving data from Excel to SQL and back
+ Query syntax basics
- Running queries
- Filtering, grouping
- Functions
- Combining tables
+ Using queries for analysis
Some 'slides' have blank lines, some don't.
Once past these hurdles I plan to wrap each record in an tag for use in deck.js. But getting the record definitions right is killing me.
How do I do those things?
EDIT: The question initially asked also about converting Unicode bullet characters to ASCII, but I've figured that out. Some remarks in comments focus on that stuff.
In awk you could try to collect records using:
/^[[:upper:]]/ {
    if (r>0) print rec
    r=1; rec=$0 RS; next
}
{
    rec=rec $0 RS
}
END {
    print rec
}
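Here is a small sketch of the same collecting idea wrapped in a shell function, so it can be tried on a few lines. The "--- slide N ---" marker, the function names and the shortened sample text are made up for illustration; any lines before the first capitalised line are dropped in this version.

```shell
# Shortened, made-up excerpt of the cleaned text.
slides() {
    printf '%s\n' 'Welcome' '++ Class will focus on:' '' 'Roadmap' '+ Preliminaries' '+ What is SQL'
}

split_slides() {
    awk '
        # A line starting with an upper-case letter opens a new record:
        # flush the previous record, then start a new one with a marker line.
        /^[[:upper:]]/ { if (n) printf "%s", rec; n++; rec = "--- slide " n " ---\n" }
        # Every line (including the opening one) is appended to the current record.
        { rec = rec $0 "\n" }
        # Flush the last record.
        END { if (n) printf "%s", rec }
    '
}

slides | split_slides
```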
To remove bullets you could use
gsub (/•/,"++",rec)
You might try using the "textutil" utility built into OSX to convert the file within a script to save you doing it all by hand. Try typing the following into Terminal window and pressing to move to the next page:
man textutil
Once you have got some converted text, try posting that so people can see what the inputs look like, then maybe someone can help you split it up how you want.

Reorganizing named fields with AWK

I have to deal with various input files with a number of fields, arbitrarily arranged, but all consistently named and labelled with a header line. These files need to be reformatted such that all the desired fields are in a particular order, with irrelevant fields stripped and missing fields accounted for. I was hoping to use AWK to handle this, since it has done me so well when dealing with field-related dilemmata in the past.
After a bit of mucking around, I ended up with something much like the following (writing from memory, untested):
# imagine a perfectly-functional BEGIN {} block here
NR==1 {
    fldname[1] = "first_name"
    fldname[2] = "last_name"
    fldname[3] = "middle_name"
    maxflds = 3
    # this is just a sample -- my real script went through forty-odd fields
    for (i=1;i<=NF;i++) for (j=1;j<=maxflds;j++) if ($i == fldname[j]) fldpos[j]=i
}
NR!=1 {
    for (j=1;j<=maxflds;j++) {
        if (fldpos[j]) printf "%s",$fldpos[j]
        printf "%s","\t"
    }
    print ""
}
Now this solution works fine. I run it, I get my output exactly how I want it. No complaints there. However, for anything longer than three fields or so (such as the forty-odd fields I had to work with), it's a lot of painfully redundant code which always has and always will bother me. And the thought of having to insert a field somewhere else into that mess makes me shudder.
I die a little inside each time I look at it.
I'm sure there must be a more elegant solution out there. Or, if not, perhaps there is a tool better suited for this sort of task. AWK is awesome in its own domain, but I fear I may be stretching its limits some with this.
Any insight?
The only suggestion that I can think of is to move the initial array setup into the BEGIN block and read the ordered field names from a separate template file in a loop. Then your awk program consists only of loops with no embedded data. Your external template file would be a simple newline-separated list.
BEGIN {while ((getline < "fieldfile") > 0) fldname[++maxflds] = $0}
You would still read the header line in the same way you are now, of course. However, it occurs to me that you could use an associative array and reduce the nested for loops to a single for loop. Something like (untested):
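BEGIN {while ((getline < "fieldfile") > 0) fldname[$0] = ++maxflds}
NR==1 {
    # fldname[$i] is the output slot for this header; remember which input column fills it
    for (i=1;i<=NF;i++) if ($i in fldname) fldpos[fldname[$i]] = i
}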
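Putting the pieces together, a complete version might look roughly like the sketch below. To keep it self-contained, the ordered field names come from a -v string instead of the template file; that substitution, the function name reorder, the array names and the sample data are all assumptions made for illustration. Missing fields come out as empty columns.

```shell
# Hypothetical helper: reads a tab-separated file with a header line on stdin.
reorder() {
    awk -F'\t' -v OFS='\t' -v order='first_name last_name middle_name' '
        BEGIN {
            # want[1..maxflds] holds the desired order; slot[name] is its output position.
            maxflds = split(order, want, " ")
            for (j = 1; j <= maxflds; j++) slot[want[j]] = j
        }
        # Header line: record which input column feeds each output position.
        NR == 1 { for (i = 1; i <= NF; i++) if ($i in slot) fldpos[slot[$i]] = i; next }
        {
            line = ""
            for (j = 1; j <= maxflds; j++)
                line = line (j > 1 ? OFS : "") ((j in fldpos) ? $(fldpos[j]) : "")
            print line
        }
    '
}

# Made-up input: columns out of order, one irrelevant field, middle_name missing.
printf 'last_name\tage\tfirst_name\nDoe\t42\tJane\n' | reorder
```

Adding a field then means editing only the order string (or the template file), not the loops.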