awk search for a paragraph by a word from a file using shell variables

I am trying to filter a file by name, and I previously asked how to use a shell variable in awk:
How to force awk command search for a previously given word
Using the recommendations given there, I now get only the name itself, not the whole paragraph. My question is: how can I get the paragraphs that contain the given name? Paragraphs are separated by \n\n.
awk '{print var}' var="$NAME" RS="\n\n" ORS="\n\n" literatur.bib
The following worked perfectly, but I had to write the name directly into the awk program, which doesn't match the task:
awk '/name/' RS="\n\n" ORS="\n\n" literatur.bib
Input should look like this: Newton
The output should look like this, but with all the paragraphs containing the name Newton:
#Book{Eilam:2005:reversing,
author = {Eldad Newton}, title =
{Reversing: Secrets of Reverse Engineering},
publisher = {Wiley},
year = 2005 }
#Book{Glatz:2006:Betriebssysteme,
author = {Eduard Newton},
title = "{Betriebssysteme}", publisher = {dpunkt}, year = 2006 }
The input file looks exactly like this example, but consists of hundreds of similar paragraphs.
When I search, for example, for the name Baruah, I get this:
Baruah
Baruah
Baruah
And so on, many more times... It just prints the word without searching, because even if I give a word that does not appear in the file at all, I still get the same result: the word repeated many times.
SOLVED:
Just by using double quotes "" instead of single quotes '', so that the shell expands the variable inside the awk program:
awk "/$NAME/" RS="\n\n" ORS="\n\n" literatur.bib


Replacing FASTA headers gives mismatch

This is probably a simple issue, but I cannot seem to solve it.
I want to replace the headers of a FASTA file. This file is a subset of a larger file, but the headers were adjusted in the process. I want to restore the original headers, since they include essential information.
I selected the headers from the subset (subset.fasta) using grep, and used these to match and extract the headers from the original file, giving 'correct.headers'. There are the same number of headers, in the same order, so this should be OK.
I found the code below, which should do what I want according to its description. I've only just started learning awk, though, so I don't really understand it.
awk 'NR == FNR { o[n++] = $0; next } /^>/ && i < n { $0 = ">" o[i++] } 1' correct.headers subset.fasta > subset.correct.fasta
(source: Replace fasta headers using sed command)
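For reference, here is the same one-liner spread out with comments describing what each part does (no change in behaviour):
awk '
    # first file (correct.headers): remember every line, in order, in the array o
    NR == FNR { o[n++] = $0; next }
    # second file (subset.fasta): replace each header line with ">" followed by the next remembered line
    /^>/ && i < n { $0 = ">" o[i++] }
    # print every line, changed or not
    1
' correct.headers subset.fasta > subset.correct.fasta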
However, there are some 100 more output lines than expected, and there's a shift starting after a couple of million lines.
My workflow was like this:
I had a subsetted fasta-file (created by a program extracting certain taxa) where the headers were missing info:
>header_1
read_1
>header_2
read_2
...
>header_n
read_n
I extracted the headers from this subsetted file using grep, giving the subset headers file:
>header_1
>header_2
...
>header_n
I matched the first part of the header to extract the original headers from the non-subsetted file using grep:
>header_1 info1
>header_2 info2
...
>header_n info_n
giving the same number of headers, matching order, etc.
I then used this file to replace the headers in the subset with the original ones using the awk line above, but this gives a mismatch at a certain point and adds extra lines.
Result:
>header_1 info1
read_1
>header_2 info2
read_2
...
>header_x info_x
read_n
Where/why does it go wrong?
Thanks!

How to use awk delimiters in a large csv with text fields with commas [closed]

I have a .csv with 470 columns and tens of thousands of rows of products. Many of the fields are text strings that contain commas, which cause my awk statement to split on the wrong columns and corrupt my data. Here is what I'm working with:
Input example:
LDB0010-300,"TIMELESS DESIGN: Classic, Elegant, Beautiful",Yes,1,Live,...
LDB0010-400,"CLASSIC DESIGN: Contemporary look",No,0,Not Live,...
LDB0010-500,"Everyone should wear this, almost!",Yes,0,Not Live,...
Code:
cat products.csv | sed -e 's/, /#/g' | awk -F, 'NR>1 {$308="LIVE" ; $310="LIVE" ; $467=0 ; print $0}' OFS=, | sed -e 's/#/, /g'
Current output, which is wrong, with data written into the wrong columns:
LDB0010-300,"TIMELESS DESIGN: Classic",LIVE, Beautiful",Yes,1,Live,...
LDB0010-400,"CLASSIC DESIGN: Contemporary look",No,0,0,...
LDB0010-500,"Everyone should wear this",LIVE,Yes,0,Not Live,...
When studying the data more closely, I noticed that in the cells with text descriptions, commas were always followed by a space, whereas commas used as delimiters had no space after them. So the approach I took was to substitute comma-space with '#', run my awk statement to set the values of those columns, then substitute '#' back to comma-space. This all looked pretty good until I opened the spreadsheet and noticed that many rows had values written into the wrong columns. Does anyone know a better way to do this that will prevent these blow-outs?
The sample data you posted does not reproduce the symptoms you report with the code you provided. The simplest explanation is that your observation (that commas followed by a space are always field-internal, and other commas never are) is in fact incorrect. This should be easy enough to check:
sed 's/, /#/g' products.csv | awk -F, '{ a[NF]++ } END { for (n in a) print n, a[n] }'
If you don't get a single line of output with exactly the correct number of columns and rows, you can tell that your sed trick is not working correctly. (Notice also the fix for the useless cat.)
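If you have GNU Awk, another option worth knowing is its FPAT variable, which defines fields by content rather than by separator; the pattern below is the one from the gawk manual. Treat it as a sketch only: it does not cope with empty fields, embedded quotes or newlines inside quoted fields, so the Python approach below remains the more robust choice.
gawk -v FPAT='([^,]+)|("[^"]+")' -v OFS=, 'NR > 1 { $308 = "LIVE"; $310 = "LIVE"; $467 = 0 } 1' products.csv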
Anyway, here is a simple Python refactoring which should hopefully be obvious enough. The Python CSV library knows how to handle quoted fields so it will only split on commas which are outside double quotes.
#!/usr/bin/env python3
import csv
import sys

w = csv.writer(sys.stdout)
for file in sys.argv[1:]:
    with open(file, newline='') as inputfile:
        r = csv.reader(inputfile)
        for row in r:
            row[307] = "LIVE"
            row[309] = "LIVE"
            row[466] = 0
            w.writerow(row)
Notice how Python's indexing is zero-based, whereas Awk counts fields starting at one.
You'd run it like
python3 this_script.py products.csv
See also the Python csv module documentation for various options you might want to use.
The above reads all the input files and writes the output to standard output. If you just want to read a single input file and write to a different file, that could be simplified to
#!/usr/bin/env python3
import csv
import sys

with open(sys.argv[1], newline='') as inputfile, open(sys.argv[2], 'w', newline='') as outputfile:
    r = csv.reader(inputfile)
    w = csv.writer(outputfile)
    header = True
    for row in r:
        if not header:  # don't muck with first line
            row[307] = "LIVE"
            row[309] = "LIVE"
            row[466] = 0
        w.writerow(row)
        header = False
You'd run this as
python3 thisscript.py input.csv output.csv
I absolutely hate specifying the output file as a command-line argument (we should have an option instead) but for a quick one-off, I guess this is acceptable.

Extracting a specific value from a text file

I am running a script that outputs data.
I am specifically trying to extract one number. However, each time I run the script, the number I am interested in ends up in a different position in the output file (due to the log nature of the output).
I have tried several awk, sed and grep commands, but I can't get any of them to work, as many rely on the position of the word or number remaining constant.
This is what I am dealing with. The value I require is the bold one:
Energy initial, next-to-last, final =
-5.96306582435 -5.96306582435 -5.96349956298
You can try
awk '{print $(i++%3+6)}' infile
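If the value you need is the last number on the line that follows the "Energy initial, next-to-last, final =" label (an assumption, since the bold marking is lost here), a sketch that does not depend on field positions staying constant:
awk '/Energy initial, next-to-last, final/ { getline; print $NF }' infile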

Mark duplicate headers in a fasta file

I have a big FASTA file which I want to modify. It basically consists of many sequences with headers that start with ">". My problem is that some of the headers are not unique, even though the sequences are unique.
Example:
>acrdi|AD19082
STSTAFPLLTQFYGCAIIILVLAMCCSCLVYAMYFMNSSGLQTHESTVTQKVKDFSLQ
WLQPILFGCSWRHRLIAKSRRNRSKIQPMTGTEPPWNESKDAFENLKTWALNKQNRNCLL
EINFLEAKDFIVMCKDVVCFEEDDKDERNLNLCLKTLTEAFRFLRNCCAETPKNQSFVIS
SGVAKQAIEVILILLRPVFQEREKGTEVITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCP
QFFLDVLLSRHHSIQDCLCMIIFNCLNQQRRLQLVNNPKIISQIVHLCADKSLLEWGYFI
LDCLICEGLFPDLYQGMEFDPLARIILLDLFQVKITDALDESSERTERTETPKELYASSL
NYLAEQFETYFIDIIQRLQQLDYSSNDFFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASL
LETCVDLLRETSKPEAKAAFKRPGTSYWEYVLPTFP
>acrdi|AD19082
MLRQSEPPWNESKDAFENLKTWALNKQNRNCLLEINFLEAKDFIVMCKDVVCFEEDDKDE
RNLNLCLKTLTEAFRFLRNCCAETPKNQSFVISSGVAKQAIEVILILLRPVFQEREKGTE
VITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCPQFFLDVLLSRHHSIQDCLCMIIFNCLN
QQRRLQLVNNPKIISQIVHLCADKSLLEWGYFILDCLICEGLFPDLYQGMEFDPLARIIL
LDLFQVKITDALDESSERTERTETPKELYASSLNYLAEQFETYFIDIIQRLQQLDYSSND
FFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASLLETCVDLLRETSKPEAKAAFSNVSSFP
HSVDSGRISPSHGFQRDLVRVIGNMCYQHFPNQEKVRELDGIPLLLDHCNIDDHNPYICQ
WAIFAIRNVLENNKENQDIVASIHPLGLADMSRLQQFGVDAVEFDGEKI
Now I want to find all duplicates in my big FASTA file and append numbers to the duplicates, so that I know which duplicate it is (1, 2, 3, ..., x). When a new duplicate is found (one with a different header), the counter should start from the beginning.
The output should be something like this:
>acrdi|AD19082
STSTAFPLLTQFYGCAIIILVLAMCCSCLVYAMYFMNSSGLQTHESTVTQKVKDFSLQ
WLQPILFGCSWRHRLIAKSRRNRSKIQPMTGTEPPWNESKDAFENLKTWALNKQNRNCLL
EINFLEAKDFIVMCKDVVCFEEDDKDERNLNLCLKTLTEAFRFLRNCCAETPKNQSFVIS
SGVAKQAIEVILILLRPVFQEREKGTEVITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCP
QFFLDVLLSRHHSIQDCLCMIIFNCLNQQRRLQLVNNPKIISQIVHLCADKSLLEWGYFI
LDCLICEGLFPDLYQGMEFDPLARIILLDLFQVKITDALDESSERTERTETPKELYASSL
NYLAEQFETYFIDIIQRLQQLDYSSNDFFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASL
LETCVDLLRETSKPEAKAAFKRPGTSYWEYVLPTFP
>acrdi|AD19082-1
MLRQSEPPWNESKDAFENLKTWALNKQNRNCLLEINFLEAKDFIVMCKDVVCFEEDDKDE
RNLNLCLKTLTEAFRFLRNCCAETPKNQSFVISSGVAKQAIEVILILLRPVFQEREKGTE
VITDTIRSGLQLLGNTVVKNIDTQEFIWNCCCPQFFLDVLLSRHHSIQDCLCMIIFNCLN
QQRRLQLVNNPKIISQIVHLCADKSLLEWGYFILDCLICEGLFPDLYQGMEFDPLARIIL
LDLFQVKITDALDESSERTERTETPKELYASSLNYLAEQFETYFIDIIQRLQQLDYSSND
FFQVLVVTRLLSLLSTSTGLKSSMTGLQDRASLLETCVDLLRETSKPEAKAAFSNVSSFP
HSVDSGRISPSHGFQRDLVRVIGNMCYQHFPNQEKVRELDGIPLLLDHCNIDDHNPYICQ
WAIFAIRNVLENNKENQDIVASIHPLGLADMSRLQQFGVDAVEFDGEKI
I would prefer a method with awk or sed, so that I can easily modify the code to run on all files in a directory.
I have to admit that I am just starting to learn programming and parsing, but I hope this is not a stupid question.
Thanks in advance for the help.
An awk script:
BEGIN {
    OFS = "\n";
    ORS = RS = ">";
}
{
    name = $1;
    $1 = "";
    suffix = names[name] ? "-" names[name] : "";
    print name suffix $0, "\n";
    names[name]++;
}
The above uses ">" as the record separator and checks the first field (which is the header name that can be duplicated). For each record it prints, it appends a suffix to the header name counting how many times that name has already appeared (i.e. '-1' for the first duplicate, '-2' for the second, and so on).
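To run it on every FASTA file in a directory, a possible sketch (the script above saved as markdups.awk, and the .fasta extension, are assumptions about your setup):
for f in *.fasta; do
    awk -f markdups.awk "$f" > "${f%.fasta}.marked.fasta"
done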

How to find the same words in two different text files and print those lines using bash?

I have two text files. The first contains just a single column of words, hundreds of them, one word per line. The second contains several columns per row.
I need to find the words from the first file that appear in the second file and print the entire line from the second file where the word occurs, using awk, grep or another command-line program. For example:
Text file #1:
car
house
notebook
Text file #2:
32233: FTD laptop
24342: TGD car
2424: jdj notebook
Output:
24342: TGD car
2424: jdj notebook
try this:
grep -Fwf file1 file2
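Here -F treats the patterns as fixed strings rather than regexes, -w matches whole words only, and -f file1 reads the patterns from file1. If you would rather use awk, a roughly equivalent sketch (it compares whitespace-separated fields instead of using word boundaries):
awk 'NR == FNR { words[$1]; next } { for (i = 1; i <= NF; i++) if ($i in words) { print; next } }' file1 file2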