How to use awk delimiters in a large csv with text fields with commas [closed]

I have a .csv with 470 columns and tens of thousands of rows of products, many with text strings containing commas, which cause my awk statements to blow out and write to the wrong columns, corrupting my data. Here are the statements I'm using:
Input example:
LDB0010-300,"TIMELESS DESIGN: Classic, Elegant, Beautiful",Yes,1,Live,...
LDB0010-400,"CLASSIC DESIGN: Contemporary look",No,0,Not Live,...
LDB0010-500,"Everyone should wear this, almost!",Yes,0,Not Live,...
Code:
cat products.csv | sed -e 's/, /#/g' | awk -F, 'NR>1 {$308="LIVE" ; $310="LIVE" ; $467=0 ; print $0}' OFS=, | sed -e 's/#/, /g'
Current (incorrect) output, with data written to the wrong columns:
LDB0010-300,"TIMELESS DESIGN: Classic",LIVE, Beautiful",Yes,1,Live,...
LDB0010-400,"CLASSIC DESIGN: Contemporary look",No,0,0,...
LDB0010-500,"Everyone should wear this",LIVE,Yes,0,Not Live,...
When studying the data more closely, I noticed that in the cells with text descriptions, commas were always followed by a space, whereas commas used as delimiters had no space after them. So the approach I took was to substitute comma-space with '#', run my awk statement to set the values of those columns, then substitute back from '#' to comma-space. This all looked pretty good until I opened the spreadsheet and noticed that there were many rows with values written into the wrong columns. Does anyone know a better way to do this that will prevent these blow-outs?

The sample data you posted does not reproduce the symptoms you report with the code you provided. The simplest explanation is that your observation that commas followed by a space are always field-internal, and other commas are not, is in fact incorrect. This should be easy enough to check:
sed 's/, /#/g' products.csv | awk -F, '{ a[NF]++ } END { for (n in a) print n, a[n] }'
If you don't get a single line of output with exactly the correct number of columns and rows, you can tell that your sed trick is not working correctly. (Notice also the fix for the useless cat.)
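If the counts are off, a variant of the same check (my sketch, still assuming the 470 columns you mention) prints the record number and field count of each offending line so you can inspect them directly:
sed 's/, /#/g' products.csv | awk -F, 'NF != 470 { print NR, NF }'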
Anyway, here is a simple Python refactoring which should hopefully be obvious enough. The Python CSV library knows how to handle quoted fields so it will only split on commas which are outside double quotes.
#!/usr/bin/env python3
import csv
import sys

w = csv.writer(sys.stdout)
for file in sys.argv[1:]:
    with open(file, newline='') as inputfile:
        r = csv.reader(inputfile)
        for row in r:
            row[307] = "LIVE"
            row[309] = "LIVE"
            row[466] = 0
            w.writerow(row)
Notice how Python's indexing is zero-based, whereas Awk counts fields starting at one.
You'd run it like
python3 this_script.py products.csv
See also the Python csv module documentation for various options you might want to use.
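For instance (an illustrative sketch only, not something this particular file needs), you can set the delimiter and quoting behaviour explicitly when constructing the reader and writer:
import csv
import sys

# Illustrative only: copy CSV from stdin to stdout with explicit delimiter/quoting options.
r = csv.reader(sys.stdin, delimiter=',', quotechar='"')
w = csv.writer(sys.stdout, quoting=csv.QUOTE_MINIMAL)
for row in r:
    w.writerow(row)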
The above reads all the input files and writes the output to standard output. If you just want to read a single input file and write to a different file, that could be simplified to
#!/usr/bin/env python3
import csv
import sys

with open(sys.argv[1], newline='') as inputfile, open(sys.argv[2], 'w', newline='') as outputfile:
    r = csv.reader(inputfile)
    w = csv.writer(outputfile)
    header = True
    for row in r:
        if not header:  # don't muck with first line
            row[307] = "LIVE"
            row[309] = "LIVE"
            row[466] = 0
        w.writerow(row)
        header = False
You'd run this as
python3 thisscript.py input.csv output.csv
I absolutely hate specifying the output file as a command-line argument (we should have an option instead) but for a quick one-off, I guess this is acceptable.
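For what it's worth, here is roughly how that option could look (a sketch using argparse; the --output flag, its default, and the header handling are my own choices, not part of the answer above):
#!/usr/bin/env python3
import argparse
import csv
import sys

parser = argparse.ArgumentParser()
parser.add_argument("inputfile")
parser.add_argument("-o", "--output", default="-", help="output file (default: standard output)")
args = parser.parse_args()

# "-" means standard output, so naming an output file becomes optional.
outputfile = sys.stdout if args.output == "-" else open(args.output, "w", newline="")
with open(args.inputfile, newline="") as inputfile:
    w = csv.writer(outputfile)
    for i, row in enumerate(csv.reader(inputfile)):
        if i > 0:  # leave the header row alone
            row[307] = "LIVE"
            row[309] = "LIVE"
            row[466] = 0
        w.writerow(row)
if outputfile is not sys.stdout:
    outputfile.close()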


Return lines with at least n consecutive occurrences of the pattern in bash [duplicate]

Might be a naive question, but I can't find an answer.
Given a text file, I'd like to find lines with at least a given number n of occurrences of a certain pattern, say, AT[GA]CT.
For example, with n=2, from the file:
ATGCTTTGA
TAGATGCTATACTTGA
TAGATGCTGTATACTTGA
Only the second line should be returned.
I know how to use grep/awk to search for at least one instance of this degenerate pattern, and for some defined number of pattern instances occurring non-consecutively. But the issue is the pattern occurrences MUST be consecutive, and I can't figure out how to achieve that.
Any help appreciated, thank you very much in advance!
I would use GNU AWK for this task in the following way. Let file.txt content be
ATGCTTTGA
TAGATGCTATACTTGA
TAGATGCTGTATACTTGA
then
awk 'BEGIN{p="AT[GA]CT";n=2;for(i=1;i<=n;i+=1){pat=pat p}}$0~pat' file.txt
output
TAGATGCTATACTTGA
Explanation: I use a for loop to repeat p n times, then filter lines by checking whether the line ($0) matches the pattern built earlier.
Alternatively, you might use the string-formatting function sprintf as follows:
awk 'BEGIN{n=2}$0~sprintf("(AT[GA]CT){%s}",n)' file.txt
Explanation: I used the sprintf function; %s in the first argument marks where to put n. If you want to know more about what can be used in the first argument of printf and sprintf, read Format Modifiers.
(both solutions tested in GNU Awk 5.0.1)
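For completeness, the same idea also works as a plain grep one-liner (a sketch; the {2} interval needs an ERE-capable grep such as GNU grep):
grep -E '(AT[GA]CT){2}' file.txt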

Returning the position of pattern matches in a text file with multiple lines [closed]

I have a long text file with the following format:
>foo_bar
TATGTTCTGCAACTGTATAATGGTATAAAAACATTGCAAAATGTAATGAAACTTGTTATTTTGTGAAATACATTCTATAAATATCACTATTTCATGAAAA
ATATTGAAAATCATTTATTTTCGACAAGTAGAACCATAGGTTCTGTAATTGTAAATAGTTCTGCAAACTTAACCTGTTTTGCAGAAGAATATGTTTTCAC
TAGTTAACTTGTAGAATGTTTAGGATTGTTAAAATTTTTAACAAAATAAGATTTTATAGAACATGATTTGCAAAATAACACATTTTGCAATATTTTTATA
CCATATATAGTTGCAGAACATATGGGGACTACGGGCAGCCGGTAAATATGTGGACTACATGGAACTTGTTCAGATACATCTGGAGCAAAGAGCCACCGCT
CTAAATTATCTCTTCTCATTTCCAGTATTATATCTCTCATGCTAAATTATCTCTACAAATCATGACCTCTCTTAGCAATCTCCCTGAGCATCTCCGTAGG
GAGCAGATATTCACCCGTCTTCCGATGAAAGACCTAATGGTCCTCGCATCTGCAAGTCATGTCTTGCGTTAATCTTTCTCTCTCTTTTTGTGGAATCCCA
TCTCTCCTCTTATCAACTAAACCAGATACAGTTTGCACCAACTTTCTTCACTCCCCTGTTACATGAGAAGGCCAGACTTAGGTAGCTTCTGAATCAGAAC
CCGGTCATTCCAAGCATGGGATTTCTTGTTGATCTCTTGTTTTTATGTAATAGTGATCATTTGATATCTGGTGTTGATGGGAATTCAGATGTATGGGACT
TTGTTTATTGTTGATGTGGAATTCTTATATTTTACTGTGTACTATAAAATTTTAGTGATACCTACTATCTATTGTATAAATTGATTAATTGATGTTCTTA
>bar_foo
TATGTTCTGCAACTGTATAATGGTATAAAAACATTGCAAAATGTAATGAAACTTGTTATTTTGTGAAATACATTCTATAAATATCACTATTTCATGAAAA
ATATTGAAAATCATTTATTTTCGACAAGTAGAACCATAGGTTCTGTAATTGTAAATAGTTCTGCAAACTTAACCTGTTTTGCAGAAGAATATGTTTTCAC
TAGTTAACTTGTAGAATGTTTAGGATTGTTAAAATTTTTAACAAAATAAGATTTTATAGAACATGATTTGCAAAATAACACATTTTGCAATATTTTTATA
CCATATATAGTTGCAGAACATATGGGGACTACGGTACTACGGTAAATATGTGGACTACATGGAACTTGTTCAGATACATCTGGAGCAAAGAGCCACCGCT
CTAAATTATCTCTTCTCATTTCCAGCTGCATATCTCTCATGCTAAATTATCTCTACAAATCATGACCTCTCTTAGCAATCTCCCTGAGCATCTCCGTAGG
GAGCAGATATTCACCCGTCTTCCGATGAAAGACCTAATGGTCCTCGCATCTGCAAGTCATGTCTTGCGTTAATCTTTCTCTCTCTTTTTGTGGAATCCCA
TCTCTCCTCTTATCAACTAAACCAGATACAGTTTGCACCAACTTTCTTCACTCCCCTGTTACATGAGAAGGCCAGACTTAGGTAGCTTCTGAATCAGAAC
CCGGTCATTCCAAGCATGGGATTTCTTGTTGATCTCTTGTTTTTATGTAATAGTGATCATTTGATATCTGGTGTTGATGGGAATTCAGATGTATGGGACT
TTGTTTATTGTTGATGTGGAATTCTTATATTTTACTGTGTACTATAAAATTTTAGTGATACCTACTATCTATTGTATAAATTGATTAATTGATGTTCTTA
I.e., there is a header line which begins with a ">", and then an arbitrary number of lines with no more than 100 letters in them. I would like to find the positions within the non-header lines that match either "GCAGC" or "GCTGC". Overlapping match sites should each be recorded individually.
An example output would be a three-column text file where the first column contains the header line for that block of text minus the ">", the second column contains the start position of a pattern match (i.e., the number of characters into the text block, excluding line-break characters), and the third column records which of the two patterns was matched. E.g.:
foo_bar 109 GCAGC
bar_foo 58289 GCTGC
Not sure how complex this task is, and in particular whether there is a memory-efficient way to perform this operation in a streaming fashion. awk or sed seem like two utilities which might work, but the required command is beyond my limited understanding of the programs.
A tiny tweak on yesterday's answer:
sub(/^>/,"") {
hdr = $0
next
}
{
while ( match($0,/GC[AT]GC/) ) {
print hdr, RSTART, substr($0,RSTART,RLENGTH)
$0 = substr($0,1,RSTART-1) " " substr($0,RSTART+1)
}
}
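Note that RSTART above is the position within the current line, while the example output in the question counts characters from the start of the whole block. A hedged variant (my sketch, not part of the original answer) keeps a running offset per block:
sub(/^>/,"") {
    hdr = $0
    offset = 0                      # characters seen so far in this block
    next
}
{
    line = $0
    while ( match(line, /GC[AT]GC/) ) {
        print hdr, offset + RSTART, substr(line, RSTART, RLENGTH)
        # blank out the first matched character so overlapping sites are still found
        line = substr(line, 1, RSTART-1) " " substr(line, RSTART+1)
    }
    offset += length($0)
}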
Please get the book Effective AWK Programming, 5th Edition, by Arnold Robbins to learn the basics of awk.

awk search for a paragraph by a word from a file using shell variables

I am trying to filter a file by name, and previously asked how to use a shell variable in awk:
How to force awk command search for a previously given word
Using the recommendations given there, I now can't get the whole paragraph, only the given name... My question is: how can I get the paragraphs which contain the given name? Paragraphs are separated by \n\n.
awk '{print var}' var="$NAME" RS="\n\n" ORS="\n\n" literatur.bib
It worked perfectly this way, but I had to write the name directly in awk, which doesn't match the task:
awk '/name/' RS="\n\n" ORS="\n\n" literatur.bib
Input should look like this: Newton
Output should look like this, but with all the paragraphs containing the name Newton:
@Book{Eilam:2005:reversing,
author = {Eldad Newton}, title =
{Reversing: Secrets of Reverse Engineering},
publisher = {Wiley},
year = 2005 }
@Book{Glatz:2006:Betriebssysteme,
author = {Eduard Newton},
title = "{Betriebssysteme}", publisher = {dpunkt}, year = 2006 }
The input file looks exactly like this example, but consists of hundreds of similar paragraphs.
When I search, for example, for the name Baruah, I get this:
Baruah
Baruah
Baruah
And so on, many more times... It just prints the word without actually searching, because even if I give a word that is not in the file at all, I still get the same result: the word printed many times.
SOLVED:
Just by using "" instead of ''
awk "/name/" RS="\n\n" ORS="\n\n" literatur.bib

Need help using awk or similar to print/output a partial line of a JSON file [closed]

In the following example, I need to read just the content within the 2nd set of quotes on line 5, up to, but not beyond, the decimal point.
The contents of the quotes vary, so everything between " and . must be captured; it cannot be matched with a search string based on the contents in between.
It is also possible that the line number may change in the future; however, the line can always be found by searching for "Item".
The process should utilize awk, grep, cat, sed or a combination of them due to the limitations of the proprietary environment/OS. I have searched around but wasn't able to find anything that would work as desired.
filename: data.json
{
"Brand": "Marketside",
"Price": "3.97",
"SKU": "48319448",
"Item": "12-ct_Large_Grade_A(Brown_Organic).48319448",
}
An Example of a successful output would be:
12-ct_Large_Grade_A(Brown_Organic)
The requirement to rely exclusively on line-oriented tools to manipulate JSON seems extremely misdirected. When manipulating structured formats, use tools which understand the structured format.
jq '.Item|split(".")[0]' data.json
to extract up to the first dot; or
jq '.Item|sub("[.][^.]*$";"")' data.json
to discard the text from the last dot until the end of the field.
(jq doesn't like the superfluous last comma after the Item in your pseudo-JSON, though.)
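If you want the value without the surrounding JSON string quotes, add jq's -r (raw output) flag, e.g.:
jq -r '.Item|split(".")[0]' data.json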
There is no doubt in anyone's mind that your acute problem as stated can be solved with a simple Awk or sed script. What happens then - what already happened here - is that you discover additional requirements which were not obvious from the toy example you posted. A proper, portable solution copes with JSON samples with strings with embedded commas and escaped double quotes, and continues to work when the superficial JSON format changes because a component somewhere upstream is updated to put all the JSON on a single line or whatever.
Here is an awk solution:
awk -F'.' '/Item/{split(substr($0,1,L=length($0)-length($NF)-1),a,"\"");print a[4]}' data.json
12-ct_Large_Grade_A(Brown_Organic)
It searches for Item and then prints from the opening " of the value up to the last .
Split the string by .
Find the length of the last part after the split: length($NF)
Subtract this length from the total to find the position of the last .: length($0)-length($NF)
Then split the first part by " and print the 4th part.
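A sed alternative in the same spirit (a sketch, with the caveat the other answer mentions: it depends on the superficial layout of the JSON staying as shown):
sed -n 's/.*"Item": "\([^"]*\)\..*/\1/p' data.json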

Extracting a specific value from a text file

I am running a script that outputs data.
I am specifically trying to extract one number. However, each time I run the script and get the output file, the number I am interested in is in a different position (due to the log nature of the output file).
I have tried several awk, sed, grep commands but I can't get any to work as many of them rely on the position of the word or number remaining constant.
This is what I am dealing with. The value I require is the bold one:
Energy initial, next-to-last, final =
-5.96306582435 -5.96306582435 -5.96349956298
You can try
awk '{print $(i++%3+6)}' infile
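If it is the last of the three values (the "final" energy) that you need, and the two-line layout shown above is stable, a hedged alternative (my assumption, since the bold marking in the question did not survive) is to anchor on the label line and print the last field of the line that follows:
# Assumption: the "final" (last) value on the line after the label is the one wanted.
awk '/Energy initial, next-to-last, final/ { getline; print $NF; exit }' infile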