Use awk to print variables and static text

I have a text file containing two values per line separated by a space. Field 1 is a URI and field 2 is a full URL.
My account doesn't have enough reputation to post links so please ignore the space after http in each of the examples.
Here's an example:
/uri1 http ://www.example.com/uri1
/uri2 http ://www.example.com/uri2
/uri3/test http ://www.example.com/uri3/index.html
/uri4?id=123 http ://www.example.com/uri4/page1.html
I'd like to create a new file where field 1 is the same, but I modify field 2 to be a combination of field2 with some static text then field 1.
Below is my example:
/uri1 http ://www.example.abc/uri1/#/www.test.abc/uri1
/uri2 http ://www.example.abc/uri2/#/www.test.abc/uri2
/uri3/test http ://www.example.abc/uri3/index.html/#/www.test.abc/uri3/test
/uri4?id=123 http ://www.example.abc/uri4/page1.html/#/www.test.abc/uri4?id=123
I thought awk would be the easiest way to do this, but it's not working like I expected.
Below is the command I'm trying to use, but it prints the fields over the top of each other and the output is like it's combined.
awk '{print $1, $2"/#/www.test.com"$1}' file.txt
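Fields printing over the top of each other is the usual symptom of Windows (CRLF) line endings: the carriage return at the end of $2 sends the cursor back to the start of the line before the static text is written. That is only an assumption about file.txt; `od -c file.txt | head` would show a `\r` before each `\n` if it holds. A minimal sketch that strips a trailing carriage return before printing (the function name add_suffix and the output name new.txt are hypothetical):

```shell
# Strip a trailing carriage return, then print field 1 followed by
# field 2 + static text + field 1.
# Assumes CRLF line endings are the cause; harmless on LF-only files.
add_suffix() {
    awk '{ sub(/\r$/, ""); print $1, $2 "/#/www.test.com" $1 }' "$1"
}

# Usage: add_suffix file.txt > new.txt
```

The sub() call edits $0, so awk re-splits the fields and $2 no longer carries the carriage return.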

Recursively search directory for occurrences of each string from one column of a .csv file

I have a CSV file--let's call it search.csv--with three columns. For each row, the first column contains a different string. As an example (punctuation of the strings is intentional):
Col 1,Col 2,Col 3
string1,valueA,stringAlpha
string 2,valueB,stringBeta
string'3,valueC,stringGamma
I also have a set of directories contained within one overarching parent directory, each of which has a subdirectory we'll call source, such that the path to source would look like this: ~/parentDirectory/directoryA/source
What I would like to do is search the source subdirectories for any occurrences--in any file--of each of the strings in Col 1 of search.csv. Some of these strings will need to be manually edited, while others can be categorically replaced. I run the following command . . .
awk -F "," '{print $1}' search.csv | xargs -I# grep -Frli # ~/parentDirectory/*/source/*
What I would want is a list of files that match the criteria described above.
My awk call gets a few hits, followed by xargs: unterminated quote. There are some single quotes in some of the strings in the first column that I suspect may be the problem. The larger issue, however, is that when I did a sanity check on the results I got (which seemed far too few to be right), there was a vast discrepancy. I ran the following:
ag -l "searchTerm" ~/parentDirectory
Where searchTerm is a substring of many (but not all) of the strings in the first column of search.csv. In contrast to my above awk-based approach which returned 11 files before throwing an error, ag found 154 files containing that particular substring.
Additionally, my current approach is too low-resolution even if it didn't error out, in that it wouldn't distinguish between which results are for which strings, which would be key to selectively auto-replacing certain strings. Am I mistaken in thinking this should be doable entirely in awk? Any advice would be much appreciated.
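By default xargs gives quotes and backslashes in its input a special meaning, which would explain the "unterminated quote" error on a term such as string'3. A minimal sketch that avoids xargs and labels every hit with the term that produced it is below. It assumes column 1 never contains a comma or an embedded newline; search_terms is a hypothetical helper name.

```shell
# Print "term<TAB>file" for every file under the given directories that
# contains a term from column 1 of the CSV (the header row is skipped).
# Assumes column 1 never contains a comma or an embedded newline.
search_terms() {
    csv=$1; shift
    awk -F ',' 'NR > 1 && $1 != "" { print $1 }' "$csv" |
    while IFS= read -r term; do
        # -F fixed string, -r recursive, -l file names only, -i ignore case;
        # "--" keeps a term that starts with "-" from being read as an option.
        grep -Frli -- "$term" "$@" |
        while IFS= read -r file; do
            printf '%s\t%s\n' "$term" "$file"
        done
    done
}

# Usage: search_terms search.csv ~/parentDirectory/*/source
```

Because each output line carries its term, the result can be filtered afterwards to decide which strings to replace automatically and which to edit by hand.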

replacing fasta headers gives mismatch

This is probably a simple issue, but I cannot seem to solve it.
I want to replace the headers of a FASTA file. This file is a subset of a larger file, but headers were adjusted in the process. I want to restore the original headers since they include essential information.
I selected the headers from the subset (subset.fasta) using grep, and used this to match and extract the headers from the original file, giving 'correct.headers'. There are the same number of headers and they are in the same order, so this should be ok.
I found the code below which should do what I want according to the description. I've only started learning awk, so I can't really control it, though.
awk 'NR == FNR { o[n++] = $0; next } /^>/ && i < n { $0 = ">" o[i++] } 1' correct.headers subset.fasta > subset.correct.fasta
(source: Replace fasta headers using sed command)
However, there are some 100 more output lines than expected, and there's a shift starting after a couple of million lines.
My workflow was like this:
I had a subsetted fasta-file (created by a program extracting certain taxa) where the headers were missing info:
>header_1
read_1
>header_2
read_2
...
>header_n
read_n
I extracted the headers from this subsetted file using grep, giving the subset headers file:
>header_1
>header_2
...
>header_n
I matched the first part of the header to extract the original headers from the non-subsetted file using grep:
>header_1 info1
>header_2 info2
...
>header_n info_n
giving the same number of headers, matching order, etc.
I then used this file to replace the headers in the subset with the original ones using the awk line above, but this gives a mismatch at a certain point and adds additional lines.
result
>header_1 info1
read_1
>header_2 info2
read_2
...
>header_x info_x
read_n
Where/why does it go wrong?
Thanks!
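The one-liner above pairs headers purely by position and prints exactly one output line per line of subset.fasta, so a shift or a changed line count points at the two inputs not lining up (for example, a grep pattern for header_1 that also matches header_10) rather than at awk itself. One way to stop depending on order is to look headers up by ID. This is a minimal sketch, assuming the first whitespace-delimited word of each header (such as ">header_1") is a unique ID present in both files; restore_headers is a hypothetical name.

```shell
# Replace each header in the subset with the full header that shares its
# first word. Assumes that first word (e.g. ">header_1") is a unique ID.
restore_headers() {
    awk '
        NR == FNR { full[$1] = $0; next }      # first file: ID -> full header
        /^>/ && ($1 in full) { $0 = full[$1] } # second file: swap in full header
        1                                      # print every line
    ' "$1" "$2"
}

# Usage: restore_headers correct.headers subset.fasta > subset.correct.fasta
```

Headers with no match in correct.headers pass through unchanged, which makes any missing entries easy to spot afterwards.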

How to forward logs with Splunk Forwarder for files with no header, with the logs in key/value form

I already have a Splunk forwarder set up on my host.
I have certain files in the folder (/tom/mike/). The file names start with Back*.
The content of a file may be on one line or on multiple lines. Each line holds several fixed-position values separated by some spaces, with no header.
Content (example; read each "-" as one space):
Tom---516-----RTYUI------45678
Mik---345-----XYXFF------56789
I need a Splunk log entry for each line, like:
Key1=Tom Key2=516 Key3=RTYUI Key4=45678
Key1=Mik Key2=345 Key3=XYXFF Key4=56789
I know inputs.conf changes would be like below:
[monitor:///tom/mike/Back*]
index=myIndex
blacklist=\.(gz|zip|bkz|arch|etc)$
sourcetype = BackFileData
Please suggest changes which can be made in props.conf. Please keep in mind that the delimiter is fixed for each value in a line, but it is not the same (for example, 2 spaces) for all column values. There are no headers in these files either.
You can use kvdelims if you want a search-time extraction, or you can write a transforms.conf rule and apply it in props.conf, in which case the fields are extracted at index time.
Here's a good article covering all those scenarios:
https://www.splunk.com/blog/2008/02/12/delimiter-based-key-value-pair-extraction.html
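As a rough illustration of the search-time option, an inline regex extraction in props.conf could look like the fragment below. It is a sketch, not a tested configuration: it assumes each line holds exactly four whitespace-separated values with no spaces inside a value, that the field names Key1 to Key4 are the ones wanted, and "backfields" is a hypothetical class name. Search-time settings normally belong on the search head, not on the forwarder.

```
# props.conf (search-time extraction; class name "backfields" is hypothetical)
[BackFileData]
# Treat every line as its own event
SHOULD_LINEMERGE = false
# One named capture group per column; \s+ absorbs the varying space runs
EXTRACT-backfields = ^(?<Key1>\S+)\s+(?<Key2>\S+)\s+(?<Key3>\S+)\s+(?<Key4>\S+)
```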

Find duplicate records with only text case difference

I have a log file with 8M entries/records with URLs. I'd like to find duplicate URLs (same URLs) with the only difference being their type / text case.
Example:
origin-www.example.com/this/is/hard.html
origin-www.example.com/this/is/HARD.html
origin-www.example.com/this/is/Hard.html
In this case, there are three duplicates that differ only in case.
The output should be just the count (as with -c) and a new file containing the duplicates.
Use the typical awk '!seen[$0]++' file trick combined with tolower() or toupper() to make all lines be in the same case:
$ awk '!seen[tolower($0)]++' file
origin-www.example.com/this/is/hard.html
For a different output, or for counters, please show the exact output you expect.
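Note that the command above keeps the first line of each case-insensitive group; it does not list the duplicates. If the goal is a file of every line that has a case-insensitive twin, plus a count, a two-pass sketch along these lines may be closer. The names case_dups and duplicates.txt are hypothetical, and lines repeated with identical case are reported as well.

```shell
# Print every line whose lower-cased form occurs more than once.
# The file is read twice: pass 1 counts, pass 2 prints the repeated lines.
case_dups() {
    awk 'NR == FNR { count[tolower($0)]++; next }
         count[tolower($0)] > 1' "$1" "$1"
}

# Usage:
#   case_dups file > duplicates.txt
#   wc -l < duplicates.txt
```

Reading the file twice keeps the original line order and spelling in the output, at the cost of holding one counter per distinct lower-cased line in memory.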

Mark duplicate headers in a fasta file

I have a big FASTA file which I want to modify. It basically consists of many sequences with headers that start with ">". My problem is that some of the headers are not unique, even though the sequences are unique.
Example:
>acrdi|AD19082
STSTAFPLLTQFYGCAIIILVLAMCCSCLVYAMYFMNSSGLQTHESTVTQKVKDFSLQ
WLQPILFGCSWRHRLIAKSRRNRSKIQPMTGTEPPWNESKDAFENLKTWALNKQNRNCLL
EINFLEAKDFIVMCKDVVCFEEDDKDERNLNLCLKTLTEAFRFLRNCCAETPKNQSFVIS
SGVAKQAIEVILILLRPVFQEREKGTEVITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCP
QFFLDVLLSRHHSIQDCLCMIIFNCLNQQRRLQLVNNPKIISQIVHLCADKSLLEWGYFI
LDCLICEGLFPDLYQGMEFDPLARIILLDLFQVKITDALDESSERTERTETPKELYASSL
NYLAEQFETYFIDIIQRLQQLDYSSNDFFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASL
LETCVDLLRETSKPEAKAAFKRPGTSYWEYVLPTFP
>acrdi|AD19082
MLRQSEPPWNESKDAFENLKTWALNKQNRNCLLEINFLEAKDFIVMCKDVVCFEEDDKDE
RNLNLCLKTLTEAFRFLRNCCAETPKNQSFVISSGVAKQAIEVILILLRPVFQEREKGTE
VITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCPQFFLDVLLSRHHSIQDCLCMIIFNCLN
QQRRLQLVNNPKIISQIVHLCADKSLLEWGYFILDCLICEGLFPDLYQGMEFDPLARIIL
LDLFQVKITDALDESSERTERTETPKELYASSLNYLAEQFETYFIDIIQRLQQLDYSSND
FFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASLLETCVDLLRETSKPEAKAAFSNVSSFP
HSVDSGRISPSHGFQRDLVRVIGNMCYQHFPNQEKVRELDGIPLLLDHCNIDDHNPYICQ
WAIFAIRNVLENNKENQDIVASIHPLGLADMSRLQQFGVDAVEFDGEKI
Now I want to find all duplicates in my big Fasta File and append numbers to the duplicates, so that I know which duplicate it is (1,2,3,...,x). When a new duplicate is found (one with another header), the counter should start from the beginning.
The output should be something like this:
>acrdi|AD19082
STSTAFPLLTQFYGCAIIILVLAMCCSCLVYAMYFMNSSGLQTHESTVTQKVKDFSLQ
WLQPILFGCSWRHRLIAKSRRNRSKIQPMTGTEPPWNESKDAFENLKTWALNKQNRNCLL
EINFLEAKDFIVMCKDVVCFEEDDKDERNLNLCLKTLTEAFRFLRNCCAETPKNQSFVIS
SGVAKQAIEVILILLRPVFQEREKGTEVITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCP
QFFLDVLLSRHHSIQDCLCMIIFNCLNQQRRLQLVNNPKIISQIVHLCADKSLLEWGYFI
LDCLICEGLFPDLYQGMEFDPLARIILLDLFQVKITDALDESSERTERTETPKELYASSL
NYLAEQFETYFIDIIQRLQQLDYSSNDFFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASL
LETCVDLLRETSKPEAKAAFKRPGTSYWEYVLPTFP
>acrdi|AD19082-1
MLRQSEPPWNESKDAFENLKTWALNKQNRNCLLEINFLEAKDFIVMCKDVVCFEEDDKDE
RNLNLCLKTLTEAFRFLRNCCAETPKNQSFVISSGVAKQAIEVILILLRPVFQEREKGTE
VITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCPQFFLDVLLSRHHSIQDCLCMIIFNCLN
QQRRLQLVNNPKIISQIVHLCADKSLLEWGYFILDCLICEGLFPDLYQGMEFDPLARIIL
LDLFQVKITDALDESSERTERTETPKELYASSLNYLAEQFETYFIDIIQRLQQLDYSSND
FFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASLLETCVDLLRETSKPEAKAAFSNVSSFP
HSVDSGRISPSHGFQRDLVRVIGNMCYQHFPNQEKVRELDGIPLLLDHCNIDDHNPYICQ
WAIFAIRNVLENNKENQDIVASIHPLGLADMSRLQQFGVDAVEFDGEKI
I would prefer a method with awk or sed, so that I can easily modify the code to run on all files in a directory.
I have to admit that I am just starting to learn programming and parsing, but I hope this is not a stupid question.
Thanks in advance for the help.
An awk script:
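BEGIN {
    RS = ">";     # one record per FASTA entry: the header plus its sequence lines
    OFS = "\n";   # sequence lines are rejoined with newlines once $1 is cleared
}
NF {              # skip the empty record before the first ">"
    name = $1;
    $1 = "";
    # names[name] holds how many times this header has been seen so far
    suffix = names[name] ? "-" names[name] : "";
    print ">" name suffix $0;
    names[name]++;
}
The above uses ">" as the record separator, so each record is one header plus its sequence lines, and the first field is the header name that can be duplicated. The NF pattern skips the empty record before the first ">". For each record it prints, it appends a suffix to the header name for every additional time that name appears ("-1" for the first duplicate, "-2" for the second, and so on). Because the record is rebuilt with OFS="\n", this assumes the headers contain no spaces.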
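A line-oriented variant is also possible and leaves the sequence lines untouched, which may be easier to adapt. This is a minimal sketch: it treats the whole header line as the key, so headers containing spaces are handled too. The function name number_dups and the output file naming in the usage comment are hypothetical.

```shell
# Append "-1", "-2", ... to the second, third, ... occurrence of a header.
# The first occurrence is left as it is; sequence lines pass through unchanged.
number_dups() {
    awk '/^>/ { n = seen[$0]++; if (n) $0 = $0 "-" n } 1' "$1"
}

# Usage on every FASTA file in a directory:
#   for f in *.fasta; do number_dups "$f" > "${f%.fasta}.numbered.fasta"; done
```

Each distinct header gets its own counter in the seen array, so numbering restarts for every new duplicated header.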