How to skip records that turn on/off the range pattern? - gawk

gawk '/<Lexer>/,/<\/Lexer>/' file
This works, but it prints the first and last records, which I'd like to omit. How can I do that?
The gawk manual says: "The record that turns on the range pattern and the one that turns it off both match the range pattern. If you don't want to operate on these records, you can write if statements in the rule's action to distinguish them from the records you are interested in." but it gives no example.
I tried something like
gawk '/<Lexer>/,/<\/Lexer>/' {1,FNR-1} file
but it doesn't work.
If you have a better way to do this, without using awk, say so.

You can do it with two separate match statements and a variable:
gawk '/<Lexer>/{p=1; next} /<\/Lexer>/ {p=0} p==1 {print}' file
This matches <Lexer>, sets p to 1 and then skips to the next line. While p is 1, each line is printed. When </Lexer> matches, p is set to 0; because the print rule comes after that and only fires when p is 1, neither the closing line nor anything after it is printed.
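As a quick check, with a made-up input (the tag contents here are hypothetical):
printf '<Lexer>\na\nb\n</Lexer>\nc\n' | gawk '/<Lexer>/{p=1; next} /<\/Lexer>/ {p=0} p==1 {print}'
prints only the lines between the tags:
a
b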

Recursively search directory for occurrences of each string from one column of a .csv file

I have a CSV file--let's call it search.csv--with three columns. For each row, the first column contains a different string. As an example (punctuation of the strings is intentional):
Col 1,Col 2,Col 3
string1,valueA,stringAlpha
string 2,valueB,stringBeta
string'3,valueC,stringGamma
I also have a set of directories contained within one overarching parent directory, each of which have a subdirectory we'll call source, such that the path to source would look like this: ~/parentDirectory/directoryA/source
What I would like to do is search the source subdirectories for any occurrences--in any file--of each of the strings in Col 1 of search.csv. Some of these strings will need to be manually edited, while others can be categorically replaced. I run the following command:
awk -F "," '{print $1}' search.csv | xargs -I# grep -Frli # ~/parentDirectory/*/source/*
What I would want is a list of files that match the criteria described above.
My awk call gets a few hits, followed by xargs: unterminated quote. There are some single quotes in some of the strings in the first column that I suspect may be the problem. The larger issue, however, is that when I did a sanity check on the results I got (which seemed far too few to be right), there was a vast discrepancy. I ran the following:
ag -l "searchTerm" ~/parentDirectory
Where searchTerm is a substring of many (but not all) of the strings in the first column of search.csv. In contrast to my above awk-based approach which returned 11 files before throwing an error, ag found 154 files containing that particular substring.
Additionally, my current approach is too low-resolution even if it didn't error out, in that it wouldn't distinguish between which results are for which strings, which would be key to selectively auto-replacing certain strings. Am I mistaken in thinking this should be doable entirely in awk? Any advice would be much appreciated.
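One sketch of a workaround, assuming GNU xargs (its -d option is a GNU extension) and that the Col 1 strings never contain commas or newlines: splitting the xargs input on newlines only renders the embedded single quotes harmless, and skipping the header row keeps "Col 1" itself out of the search:
awk -F, 'NR>1{print $1}' search.csv | xargs -d '\n' -I% grep -Frli -- % ~/parentDirectory/*/source/
To keep the results distinguishable per string, a plain shell loop can label each batch of hits instead:
tail -n +2 search.csv | while IFS=, read -r s rest; do
    printf '== %s ==\n' "$s"
    grep -Frli -- "$s" ~/parentDirectory/*/source/
done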

Return lines with at least n consecutive occurrences of the pattern in bash [duplicate]

Might be a naive question, but I can't find an answer.
Given a text file, I'd like to find lines with at least (defined number) of occurrences of a certain pattern, say, AT[GA]CT.
For example, with n=2, from the file:
ATGCTTTGA
TAGATGCTATACTTGA
TAGATGCTGTATACTTGA
Only the second line should be returned.
I know how to use grep/awk to search for at least one instance of this degenerate pattern, and for some defined number of pattern instances occurring non-consecutively. But the issue is the pattern occurrences MUST be consecutive, and I can't figure out how to achieve that.
Any help appreciated, thank you very much in advance!
I would use GNU AWK for this task in the following way. Let file.txt content be
ATGCTTTGA
TAGATGCTATACTTGA
TAGATGCTGTATACTTGA
then
awk 'BEGIN{p="AT[GA]CT";n=2;for(i=1;i<=n;i+=1){pat=pat p}}$0~pat' file.txt
output
TAGATGCTATACTTGA
Explanation: I use a for loop to build pat by repeating p n times (with n=2, pat becomes AT[GA]CTAT[GA]CT), then keep each line by checking whether the line ($0) matches the pattern built earlier.
Alternatively you might use the string-formatting function sprintf as follows:
awk 'BEGIN{n=2}$0~sprintf("(AT[GA]CT){%s}",n)' file.txt
Explanation: I used the sprintf function; %s in the first argument marks where to put n. If you want to know more about what may be used in the first argument of printf and sprintf, read Format Modifiers in the gawk manual.
(both solutions tested in GNU Awk 5.0.1)
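For comparison, since the question mentions grep: the same interval-repetition idea works in plain grep -E, with the shell splicing n into the pattern:
n=2; grep -E "(AT[GA]CT){$n}" file.txt
which likewise prints only TAGATGCTATACTTGA.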

Is a /start/,/end/ range expression ever useful in awk?

I've always contended that you should never use a range expression like:
/start/,/end/
in awk because although it makes the trivial case where you only want to print matching text including the start and end lines slightly briefer than the alternative*:
/start/{f=1} f{print; if (/end/) f=0}
when you want to tweak it even slightly to do anything else, it requires a complete re-write or results in duplicated or otherwise undesirable code. For example, if you want to print the matching text excluding the range delimiters, then with the second form above you'd just tweak it to move the components around:
f{if (/end/) f=0; else print} /start/{f=1}
but if you started with /start/,/end/ you'd need to abandon that approach in favor of what I just posted or you'd have to write something like:
/start/,/end/{ if (!/start|end/) print }
i.e. duplicate the conditions which is undesirable.
Then I saw a question posted that required identifying the LAST end in a file and where a range expression was used in the solution and I thought it seemed like that might have some value (see https://stackoverflow.com/a/21145009/1745001).
Now, though, I'm back to thinking that it's just not worth bothering with range expressions at all and a solution that doesn't use range expressions would have worked just as well for that case.
So - does anyone have an example where a range expression actually adds noticeable value to a solution?
*I used to use:
/start/{f=1} f; /end/{f=0}
but too many times I found I had to do something additional when f is true and /end/ is found (or, to put it another way, ONLY do something when /end/ is found IF f were true), so now I just try to stick to the slightly less brief but much more robust and extensible:
/start/{f=1} f{print; if (/end/) f=0}
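As a hypothetical illustration of that point: counting only the blocks that actually close requires acting on /end/ exactly when f is already set, which falls out naturally from this form:
awk '/start/{f=1} f{print; if (/end/){f=0; n++}} END{print n+0, "complete blocks"}' file
(the n+0 guards against n never being set when no block closes).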
Interesting. I also often start with a range expression and then later on switch to using a variable.
I think a situation where this could be useful, aside from the pure range-only situations, is if you want to print a match, but only if it lies in a certain range. It also helps that it is immediately obvious what it does. For example:
awk '/start/,/end/{if(/ppp/)print}' file
with this input:
start
dfgd gd
ppp 1
gfdg
fd gfd
end
ppp 2
ppp 3
start
ppp 4
ppp 5
end
ppp 6
ppp 7
gfdgdgd
will produce:
ppp 1
ppp 4
ppp 5
--
One could of course also use:
awk '/start/{f=1} /ppp/ && f; /end/{f=0}' file
But it is longer and somewhat less readable.
While you are right that the /start/,/end/ range expression can easily be reimplemented with a conditional, it has many interesting use-cases where it is used on its own. As you observe, it may have little value for processing tabular data, the main but not the only use case of awk.
So - does anyone have an example where a range expression actually adds noticeable value to a solution?
In the mentioned use-cases, the range expression improves legibility. Here are a few examples, where the range expression accurately selects the text to be processed. These are only a handful of examples, but there are countless similar applications, demonstrating the incredible versatility of awk.
Filter logs within a time range
Assuming each log line starts with an ISO timestamp, the filter below selects all events in a given range of 1 hour:
awk '/^2015-06-30T12:00:00Z/,/^2015-06-30T13:00:00Z/'
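One caveat: the range only opens and closes if lines bearing exactly those boundary timestamps actually occur in the log. If they might be absent, comparing the (lexically sortable) ISO timestamp directly is safer; this sketch assumes the timestamp is the first whitespace-separated field and log is a placeholder filename:
awk '$1 >= "2015-06-30T12:00:00Z" && $1 < "2015-06-30T13:00:00Z"' log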
Extract a document from a file
awk '/---- begin file.data ----/,/---- end file.data ----/'
This can be used to bundle resources with shell scripts (with cat), to extract parts of GPG-signed messages (prepared with --clearsign) or more generally of MIME-messages.
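As a sketch of the bundling idea (the script name and markers are made up): a script can carry its own payload between marker lines and extract it at run time, trimming the marker lines themselves as in the first question above:
awk '/---- begin file.data ----/,/---- end file.data ----/' bundle.sh | sed '1d;$d' > file.data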
Process LaTeX files
The range pattern can be used to match LaTeX environments, so for instance we can select the abstracts of all articles in our directory:
awk '/begin{abstract}/,/end{abstract}/' *.tex
or all the theorems, to prepare a theorem database!
awk '/begin{theorem}/,/end{theorem}/' *.tex
or write a linter ensuring that theorems do not contain citations (if we regard this as bad style):
awk '
/begin{theorem}/,/end{theorem}/ { if (/\\cite{/) { c += 1 } }
END { printf("There were %d bad-style citations.\n", c) }
' *.tex
or preprocess tables, etc.
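For the table case, a hypothetical sketch: summing the first column of every tabular environment, with & as the field separator so that only genuine table rows (NF > 1) contribute (paper.tex and the column layout are assumptions):
awk -F'&' '/begin{tabular}/,/end{tabular}/ { if (NF > 1) s += $1 } END { print "column-1 total:", s }' paper.tex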

Awk Sum skipping Special Character Row

I am trying to take the sum of a particular column in a file, i.e. column 18.
I am using an awk command along with printf to display it in the proper decimal format.
SUM=`cat ${INF_TARGET_FILE_PATH}/${EXTRACT_NAME}_${CURRENT_DT}.txt|awk -F"" '{s+=$18}END{printf("%24.2f\n", s)}'
The above command skips the rows of the file that have a special character in column 5, e.g. RÉPARATIONS, so their values are never added to the sum. Please help me resolve this issue so the sum covers all rows.
There is a backtick missing in your example; it should be:
SUM=`cat ${INF_TARGET_FILE_PATH}/${EXTRACT_NAME}_${CURRENT_DT}.txt|awk -F"" '{s+=$18}END{printf("%24.2f\n", s)}'`
But you should not use backticks; you should use $(code) instead.
Using cat to feed data into awk is also the wrong way to do it; put the path after awk:
SUM=$(awk -F"" '{s+=$18} END {printf "%24.2f\n",s}' ${INF_TARGET_FILE_PATH}/${EXTRACT_NAME}_${CURRENT_DT}.txt)
This may not resolve your problem, but it gives more correct code.
If you showed us your input file, it would help us understand the problem.
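A guess, offered without seeing the input: -F"" is a gawk extension that makes every character its own field, so $18 is the 18th character of each line, and depending on the locale a character like É may count as one field or two, shifting $18 onto something non-numeric (which contributes 0 to the sum). If the file really has a delimiter, naming it avoids per-character splitting entirely; the | below is purely an assumption:
SUM=$(awk -F'|' '{s+=$18} END {printf "%24.2f\n", s}' "${INF_TARGET_FILE_PATH}/${EXTRACT_NAME}_${CURRENT_DT}.txt")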

How do associative arrays work in awk?

I wanted to remove duplicate lines from a file based on a column. A quick search led me to this page, which had the following solution:
awk '!x[$1]++' filename
It works, but I am not sure how it works. I know it uses associative arrays in awk, but I am not able to infer anything beyond that.
Update:
Thanks everyone for the explanation. With my new knowledge, I have written a blog post with further explanation of how it works.
That awk script !x[$1]++ fills an array named x. Suppose the first word in a line of text ($1 refers to the first field of the line) is line1. It effectively results in this operation on the array:
x["line1"]++
The "index" (the key) of the array is the text encountered in the file (line1 in this example), and the value associated with that key is an integer that is incremented by 1.
When a key is seen for the first time, its array value is uninitialized, which awk treats as zero; the post-increment hands back that zero and then stores 1. The not operator ! turns the returned zero into true, so the line is printed. The next time the same value is encountered, the stored count is non-zero, the not operation yields false, and the line is not printed.
A less "clever" way of writing the same thing (but possibly more clear and less fun) would be this:
{
    if (x[$1] == 0)
        print
    x[$1]++
}
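A quick demonstration (the sample lines are made up): only the first line for each distinct first field survives:
printf 'a 1\nb 2\na 3\n' | awk '!x[$1]++'
prints:
a 1
b 2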