Using awk to filter a CSV file with quotes in it - awk

I have a text file with comma separated values.
A sample line can be something like
"Joga","Bonito",7,"Machine1","Admin"
The " seen are part of the text and are needed when this csv gets converted back to a java object.
I want to filter out some lines from this file based on some field in the csv.
The following statement doesnt work.
awk -F "," '($2== "Bonito") {print}' filename.csv
I am guessing that this has something to do with the " appearing in the text.
I saw an example like:
awk -F "\"*,\"*"
I am not sure how this works. It looks like a regex, but the use of the last * flummoxed me.
Is there a better option than the last awk statement I wrote?
How does it work?

Since some parameters have double quotes and other not, you can filter with a quoted parameter:
awk -F, '$2 == "\"Bonito\""' filename.csv
To filter on parameter that do not have double quote, just do:
awk -F, '$3 == 7' filename.csv
Another way is to use the double quote in the regex (the command ? that make the double quote optional):
awk -F '"?,"?' '$2 == "Bonito"' filename.csv
But this has a drawback of also matching the following line:
"Joga",Bonito",7,"Machine1","Admin"

First a bit more through test file:
$ cat file
"Joga","Bonito",7,"Machine1","Admin"
"Joga",Bonito,7,"Machine1","Admin"
Using regex ^\"? ie. starts with or without a double quote:
$ awk -F, '$2~/^\"?Bonito\"?$/' file
"Joga","Bonito",7,"Machine1","Admin"
"Joga",Bonito,7,"Machine1","Admin"

Related

how to use "," as field delimiter [duplicate]

This question already has answers here:
Escaping separator within double quotes, in awk
(3 answers)
Closed 1 year ago.
i have a file like this:
"1","ab,c","def"
so only use comma a field delimiter will get wrong result, so i want to use "," as field delimiter, i tried like this:
awk -F "," '{print $0}' file
or like this:
awk -F "","" '{print $0}' file
or like this:
awk -F '","' '{print $0}' file
but the result is incorrect, don't know how to include "" as part of the field delimiter itself,
If you can handle GNU awk, you could use FPAT:
$ echo '"1","ab,c","def"' | # echo outputs with double quotes
gawk ' # use GNU awk
BEGIN {
FPAT="([^,]*)|(\"[^\"]+\")" # because FPAT
}
{
for(i=1;i<=NF;i++) # loop all fields
gsub(/^"|"$/,"",$i) # remove leading and trailing double quotes
print $2 # output for example the second field
}'
Output:
ab,c
FPAT cannot handle RS inside the quotes.
What you are attempting seems misdirected anyway. How about this instead?
awk '/^".*"$/{ sub(/^\"/, ""); sub(/\"$/, ""); gsub(/\",\", ",") }1'
The proper solution to handling CSV files with quoting in them is to use a language which has an actual CSV parser. My thoughts go to Python, which includes a csv module in its standard library.
In GNU AWK
{print $0}
does print whole line, if no change were made original line is printed, no matter what field separator you set you will get original lines if only action is print $0. Use $1=$1 to trigger string rebuild.
If you must do it via FS AT ANY PRICE, then you might do it as follows: let file.txt content be
"1","ab,c","def"
then
BEGIN{FS="\x22,?\x22?"}{$1=$1;print $0}
output
1 ab,c def
Note leading space (ab,c is $3). Explanation: I inform GNU AWK that field separator is literal " (\x22, " is 22(hex) in ASCII) followed by zero or one (?) , followed by zero or one (?) literal " (\x22). $1=$1 trigger line rebuilt as mentioned earlier. Disclaimer: this solution assume that you never have escaped " inside your string,
(tested in gawk 4.2.1)

Grep exact match string with spaces as variable

I have:
file.csv
Which contains
2,1,"string with spaces",3,4,5
2,1,"some other string",3,4,5
2,1,"string with spaces more than this",3,4,5
2,1,"yet another",3,4,5
2,1,"string with spaces too",3,4,5
When I do this:
grep '"string with spaces",' file.csv
It produces the desired out come which is:
2,1,"string with spaces",3,4,5
Now I need to do this in a while loop:
while read p; do
grep '"$p",' file.csv
done < list.txt
Where:
list.txt
contains:
string with spaces
yet another
And my desired output is:
2,1,"string with spaces",3,4,5
2,1,"yet another",3,4,5
The problem is that my while loop comes back empty, or matches partially. How do I loop through list.txt & get my desired output?
If you are ok with awk this should be an easy one for it.
awk 'FNR==NR{a[$0];next} ($4 in a)' list.txt FS="[,\"]" file.csv
OR(as per Ed sir's comment to make field separator as comma and keep it clearer, one could try following)
awk -F, 'FNR==NR{a["\""$0"\""];next} $3 in a' list.txt file.csv
Output will be as follows.
2,1,"string with spaces",3,4,5
2,1,"yet another",3,4,5
Your string quoting is all using single quotes ' which does not do any interpolation of the $p variable. Changing it to grep '"'"$p"'",' file.csv will solve the problem. The key is that here the variable interpolation is done inside of double quotes " and then concatenated with the strings containing actual double quote " characters.
A more (or less, depending on your point of view) readable version could look like this: grep "\"$p\"," file.csv
grep -Ff strings.txt file.csv
This should get you far enough.

Grep part of string after symbol and shuffle columns

I would like to take the number after the - sign and put is as column 2 in my matrix. I know how to grep the string but not how to print it after the text string.
in:
1-967764 GGCTGGTCCGATGGTAGTGGGTTATCAGAACT
3-425354 GCATTGGTGGTTCAGTGGTAGAATTCTCGCC
4-376323 GGCTGGTCCGATGGTAGTGGGTTATCAGAAC
5-221398 GGAAGAGCACACGTCTGAACTCCAGTCACGTGAAAATCTCGTATGCCGTCT
6-180339 TCCCTGGTGGTCTAGTGGTTAGGATTCGGCGCT
out:
GGCTGGTCCGATGGTAGTGGGTTATCAGAACT 967764
GCATTGGTGGTTCAGTGGTAGAATTCTCGCC 425354
GGCTGGTCCGATGGTAGTGGGTTATCAGAAC 376323
GGAAGAGCACACGTCTGAACTCCAGTCACGTGAAAATCTCGTATGCCGTCT 221398
TCCCTGGTGGTCTAGTGGTTAGGATTCGGCGCT 180339
awk -F'[[:space:]-]+' '{print $3,$2}' file
Seems like a simple substitution should do the job:
sed -E 's/[0-9]+-([0-9]+)[[:space:]]*(.*)/\2 \1/' file
Capture the parts you're interested in and use them in the replacement.
Alternatively, using awk:
awk 'sub(/^[0-9]+-/, "") { print $2, $1 }' file
Remove the leading digits and - from the start of the line. When this is successful, sub returns true, so the action is performed, printing the second field, followed by the first.
Using regex ( +|-) as field separator:
$ awk -F"( +|-)" '{print $3,$2}' file
GGCTGGTCCGATGGTAGTGGGTTATCAGAACT 967764
GCATTGGTGGTTCAGTGGTAGAATTCTCGCC 425354
GGCTGGTCCGATGGTAGTGGGTTATCAGAAC 376323
GGAAGAGCACACGTCTGAACTCCAGTCACGTGAAAATCTCGTATGCCGTCT 221398
TCCCTGGTGGTCTAGTGGTTAGGATTCGGCGCT 180339
here is another awk
$ awk 'split($1,a,"-") {print $2,a[2]}' file
awk '{sub(/.-/,"");print $2,$1}' file
GGCTGGTCCGATGGTAGTGGGTTATCAGAACT 967764
GCATTGGTGGTTCAGTGGTAGAATTCTCGCC 425354
GGCTGGTCCGATGGTAGTGGGTTATCAGAAC 376323
GGAAGAGCACACGTCTGAACTCCAGTCACGTGAAAATCTCGTATGCCGTCT 221398
TCCCTGGTGGTCTAGTGGTTAGGATTCGGCGCT 180339

Enclosing a single quote in Awk

I currently have this line of code, that needs to be increased by one every-time in run this script. I would like to use awk in increasing the third string (570).
'set t 570'
I currently have this to change the code, however I am missing the closing quotation mark. I would also desire that this only acts on this specific (above) line, however am unsure about where to place the syntax that awk uses to do that.
awk '/set t /{$3+=1} 1' file.gs >file.tmp && mv file.tmp file.gs
Thank you very much for your input.
Use sub() to perform a replacement on the string itself:
$ awk '/set t/ {sub($3+0,$3+1,$3)} 1' file
'set t 571'
This looks for the value in $3 and replaces it with itself +1. To avoid replacing all of $3 and making sure the quote persists in the string, we say $3+0 so that it evaluates to just the number, not the quote:
$ echo "'set t 570'" | awk '{print $3}'
570'
$ echo "'set t 570'" | awk '{print $3+0}'
570
Note this would fail if the value in $3 happens more times in the same line, since it will replace all of them.

Unable to match regex in string using awk

I am trying to fetch the lines in which the second part of the line contains a pattern from the first part of the line.
$ cat file.txt
String1 is a big string|big
$ awk -F'|' ' { if ($2 ~ /$1/) { print $0 } } ' file.txt
But it is not working.
I am not able to find out what is the mistake here.
Can someone please help?
Two things: No slashes, and your numbers are backwards.
awk -F\| '$1~$2' file.txt
I guess what you meant is part of the string in the first part should be a part of the 2nd part.if this is what you want! then,
awk -F'|' '{n=split($1,a,' ');for(i=1,i<=n;i++){if($2~/a[i]/)print $0}}' your_file
There are surprisingly many things wrong with your command line:
1) You aren't using the awk condition/action syntax but instead needlessly embedding a condition within an action,
2) You aren't using the default awk action but instead needlessly hand-coding a print $0.
3) You have your RE operands reversed.
4) You are using RE comparison but it looks like you really want to match strings.
You can fix the first 3 of the above by modifying your command to:
awk -F'|' '$1~$2' file.txt
but I think what you really want is "4" which would mean you need to do this instead:
awk -F'|' 'index($1,$2)' file.txt