Grep for Multiple instances of string between a substring and a character? - awk

Can you please tell me how to grep for every instance of a substring that occurs multiple times on multiple lines within a file?
I've looked at
https://unix.stackexchange.com/questions/131399/extract-value-between-two-search-patterns-on-same-line
and How to use sed/grep to extract text between two words?
But my problem is slightly different - each substring will be immediately preceded by the string name"> and will be terminated by a < character immediately after the last character of the substring I want.
So one line might be
<"name">Bob<125><adje></name><"name">Dave<123><adfe></name><"name">Fred<125><adfe></name>
And I would like the output to be:
Bob
Dave
Fred

Although awk is not the best tool for XML processing, it will help if your XML structure and data are simple enough.
$ awk -F"[<>]" '{for(i=1;i<NF;i++) if($i=="\"name\"") print $(++i)}' file
Bob
Dave
Fred
I doubt that the tag is <"name"> though. If it's <name> without the quotes, change the condition in the script to $i=="name".
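Since the question asks for grep specifically: with GNU grep, -o plus a Perl-compatible lookbehind produces the same output (a sketch, assuming the literal name"> prefix from the sample line):
$ grep -oP '(?<=name">)[^<]+' file
Bob
Dave
Fred
Here -o prints only the matched text, (?<=name">) anchors each match right after name">, and [^<]+ takes everything up to the terminating <.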

gawk
With gawk you can instead split records on a regex: setting RS to <"name"> or < makes each name its own record, and printing only records that start with an uppercase letter leaves just the names (POSIX awk treats RS as a single character, so this needs gawk):
awk -vRS='<"name">|<' '/^[A-Z]/' file
Bob
Dave
Fred

Related

Recursively search directory for occurrences of each string from one column of a .csv file

I have a CSV file--let's call it search.csv--with three columns. For each row, the first column contains a different string. As an example (punctuation of the strings is intentional):
Col 1,Col 2,Col 3
string1,valueA,stringAlpha
string 2,valueB,stringBeta
string'3,valueC,stringGamma
I also have a set of directories contained within one overarching parent directory, each of which has a subdirectory we'll call source, such that the path to source would look like this: ~/parentDirectory/directoryA/source
What I would like to do is search the source subdirectories for any occurrences--in any file--of each of the strings in Col 1 of search.csv. Some of these strings will need to be manually edited, while others can be categorically replaced. I ran the following command:
awk -F "," '{print $1}' search.csv | xargs -I# grep -Frli # ~/parentDirectory/*/source/*
What I would want is a list of files that match the criteria described above.
My awk call gets a few hits, followed by xargs: unterminated quote. There are some single quotes in some of the strings in the first column that I suspect may be the problem. The larger issue, however, is that when I did a sanity check on the results I got (which seemed far too few to be right), there was a vast discrepancy. I ran the following:
ag -l "searchTerm" ~/parentDirectory
where searchTerm is a substring of many (but not all) of the strings in the first column of search.csv. In contrast to my awk-based approach above, which returned 11 files before throwing an error, ag found 154 files containing that particular substring.
Additionally, my current approach is too low-resolution even if it didn't error out: it wouldn't distinguish which results belong to which strings, which would be key to selectively auto-replacing certain strings. Am I mistaken in thinking this should be doable entirely in awk? Any advice would be much appreciated.
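One way around the unterminated-quote failure is to drop xargs and read the strings line by line; a sketch, assuming the header row should be skipped and that no Col 1 string contains a comma or newline, which also labels each string's hits (addressing the resolution problem):
awk -F, 'NR>1 {print $1}' search.csv |
while IFS= read -r str; do
    printf '== %s ==\n' "$str"
    grep -rFli -e "$str" ~/parentDirectory/*/source/
done
Since grep -F treats the string literally and read -r leaves the line untouched, the embedded single quote in string'3 is harmless here.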

Return lines with at least n consecutive occurrences of the pattern in bash [duplicate]

This question already has answers here:
Specify the number of pattern repeats in JavaScript Regex
(2 answers)
Closed 1 year ago.
Might be naive question, but I can't find an answer.
Given a text file, I'd like to find lines with at least a given number of consecutive occurrences of a certain pattern, say, AT[GA]CT.
For example, with n=2, from the file:
ATGCTTTGA
TAGATGCTATACTTGA
TAGATGCTGTATACTTGA
Only the second line should be returned.
I know how to use grep/awk to search for at least one instance of this degenerate pattern, and for some defined number of pattern instances occurring non-consecutively. But the issue is the pattern occurrences MUST be consecutive, and I can't figure out how to achieve that.
Any help appreciated, thank you very much in advance!
I would use GNU AWK for this task in the following way. Let file.txt's content be
ATGCTTTGA
TAGATGCTATACTTGA
TAGATGCTGTATACTTGA
then
awk 'BEGIN{p="AT[GA]CT";n=2;for(i=1;i<=n;i+=1){pat=pat p}}$0~pat' file.txt
output
TAGATGCTATACTTGA
Explanation: I use a for loop to repeat p n times, then filter lines by checking whether the line ($0) matches the pattern built earlier.
Alternatively, you might use the string-formatting function sprintf as follows:
awk 'BEGIN{n=2}$0~sprintf("(AT[GA]CT){%s}",n)' file.txt
Explanation: sprintf builds the pattern dynamically; the %s in the first argument marks where to put n. If you want to know more about what might be used in the first argument of printf and sprintf, read Format Modifiers.
(both solutions tested in GNU Awk 5.0.1)
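For completeness, the same filter fits in a single grep call (a sketch, assuming extended-regex support via -E, which the {n,} interval requires):
grep -E '(AT[GA]CT){2,}' file.txt
The interval {2,} demands at least two back-to-back matches of the group, so scattered, non-consecutive occurrences do not qualify.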

Awk - How to escape the | in sub?

I'd like to substitute a string which contains a |.
My STDIN:
13|Test|123|6232
14|Move|126|6692
15|Test|123|6152
I'd like to obtain :
13|Essai|666|6232
14|Move|126|6692
15|Essai|666|6152
I tried like this
{sub("|Test|123","|Essai|666") ;} {print;}
But I think the | is what's bothering me... I really need to replace the complete string WITH the |.
What should I do to get this result?
Many thanks for your precious help
You can use
awk '{sub(/\|Test\|123\|/,"|Essai|666|")}1' file
Note:
/\|Test\|123\|/ is a regex that matches |Test|123| substring
sub(/\|Test\|123\|/,"|Essai|666|") - replaces the first occurrence of the regex pattern in the whole record (since the third argument is omitted, $0 is assumed)
1 triggers the default print action, no need to explicitly call print here.
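If you pass the pattern as a string instead of a /.../ literal, awk treats it as a dynamic regex, so each backslash must be doubled - once for the string, once for the regex. A sketch of the equivalent call:
awk '{sub("\\|Test\\|123\\|","|Essai|666|")}1' file
This also explains the original attempt: in sub("|Test|123",...) the unescaped | reaches the regex engine as alternation, so the pattern matches the empty string rather than the literal text.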

Escape all commas in line except first and last

I have a CSV file which I'm trying to import to a SQL Server table. The file contains lines of 3 columns each, separated by a comma. The only problem is that some of the data in the second column contains an arbitrary number of commas. For example:
1281,I enjoy hunting, fishing, and boating,smith317
I would like to escape all occurrences of commas in each line except the first and the last, such that the result of this line would be:
1281,I enjoy hunting\, fishing\, and boating,smith317
I know I will need some type of regular expression to accomplish this task, but my knowledge of regular expressions is very limited. Currently, I'm trying to use Notepad++ find/replace with regex, but I am open to other ideas.
Any help would be greatly appreciated :-)
Okay, this can be done manually. Do this:
Normal find all the , and replace them with \,. Escape everything.
Regex find ^(.*)(\\,) and replace it with $1, - the greedy .* makes this un-escape the last comma.
Regex find (\\,)(.*)$ and replace it with ,$2 - this un-escapes the first comma.
Worked for me in Sublime Text 2.
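The same three steps can be scripted; a sketch with GNU sed, assuming no field already contains a backslash:
sed -E 's/,/\\,/g; s/\\,/,/; s/(.*)\\,/\1,/' file.csv
The first substitution escapes every comma, the second un-escapes the first one, and the third (thanks to the greedy .*) un-escapes the last one.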

Unable to Remove Special Characters In Pig

I have a text file that I want to load into my Pig engine.
The text file has names in separate rows, but the data has errors in it... special characters... something like this:
Ja##$s000on
J##a%^ke
T!!ina
Mel#ani
I want to remove the special characters from all the names using a regex in Pig, so that the output ends up as:
Jason
Jake
Tina
Melani
Can someone please tell me the regex that will do this job in Pig.
Also write the command that will do it, as I am unable to use the REGEX_EXTRACT and REGEX_EXTRACT_ALL functions.
Also, can someone explain the significance of the number 1 that we pass to this function as an argument after defining the regex.
Any help would be highly appreciated.
You can use REPLACE with a regex to solve this problem.
input.txt
Ja##$s000on
J##a%^ke T!!ina Mel#ani
PigScript:
A = LOAD 'input.txt' AS (line:chararray);
B = FOREACH A GENERATE REPLACE(line,'([^a-zA-Z\\s]+)','');
dump B;
Output:
(Jason)
(Jake Tina Melani)
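On the side question about the trailing 1: in REGEX_EXTRACT(string, regex, index), the number selects which capture group to return - 1 is the first parenthesised group, 2 the second, and so on. A doc-style illustration, unrelated to this data:
C = FOREACH A GENERATE REGEX_EXTRACT('192.168.1.5:8020', '(.*):(.*)', 1); -- yields '192.168.1.5'; index 2 would yield '8020'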
There is no way to escape these characters when they are part of the values in a tuple, bag, or map, but there is no problem whatsoever in loading these characters when they are part of a string. Just specify that field as type chararray.