Grep exact match string with spaces as variable - awk

I have:
file.csv
Which contains
2,1,"string with spaces",3,4,5
2,1,"some other string",3,4,5
2,1,"string with spaces more than this",3,4,5
2,1,"yet another",3,4,5
2,1,"string with spaces too",3,4,5
When I do this:
grep '"string with spaces",' file.csv
It produces the desired outcome, which is:
2,1,"string with spaces",3,4,5
Now I need to do this in a while loop:
while read p; do
grep '"$p",' file.csv
done < list.txt
Where:
list.txt
contains:
string with spaces
yet another
And my desired output is:
2,1,"string with spaces",3,4,5
2,1,"yet another",3,4,5
The problem is that my while loop comes back empty, or matches partially. How do I loop through list.txt & get my desired output?

If you are ok with awk this should be an easy one for it.
awk 'FNR==NR{a[$0];next} ($4 in a)' list.txt FS="[,\"]" file.csv
Or, as per Ed's comment, use a plain comma as the field separator to keep it clearer:
awk -F, 'FNR==NR{a["\""$0"\""];next} $3 in a' list.txt file.csv
Output will be as follows.
2,1,"string with spaces",3,4,5
2,1,"yet another",3,4,5
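For readers new to the two-file idiom, here is a commented sketch of the second command. The scratch directory in tmp and the array name wanted are illustrative choices, not part of the answer, and the comma split assumes that no quoted field contains a comma.

```shell
# Recreate a reduced version of the sample files in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file.csv" <<'EOF'
2,1,"string with spaces",3,4,5
2,1,"string with spaces more than this",3,4,5
2,1,"yet another",3,4,5
EOF
printf '%s\n' 'string with spaces' 'yet another' > "$tmp/list.txt"

result=$(awk -F, '
  # FNR == NR holds only while the first file (list.txt) is being read:
  # store each entry, wrapped in double quotes, as an array key.
  FNR == NR { wanted["\"" $0 "\""]; next }
  # Second file: print the line when field 3 is exactly one of the keys.
  $3 in wanted
' "$tmp/list.txt" "$tmp/file.csv")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Because the comparison is on the whole third field, the longer "string with spaces more than this" row is not printed.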

Your string quoting is all using single quotes ' which does not do any interpolation of the $p variable. Changing it to grep '"'"$p"'",' file.csv will solve the problem. The key is that here the variable interpolation is done inside of double quotes " and then concatenated with the strings containing actual double quote " characters.
A more (or less, depending on your point of view) readable version could look like this: grep "\"$p\"," file.csv
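Put together, the corrected loop might look like the sketch below. The IFS= and -r on read, and the -F and -e on grep, are additions that go beyond this answer: they keep each list entry literal (no trimmed blanks, no backslash processing, no regex metacharacters). The scratch directory in tmp is only there to make the example self-contained.

```shell
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file.csv" <<'EOF'
2,1,"string with spaces",3,4,5
2,1,"some other string",3,4,5
2,1,"string with spaces more than this",3,4,5
2,1,"yet another",3,4,5
2,1,"string with spaces too",3,4,5
EOF
printf '%s\n' 'string with spaces' 'yet another' > "$tmp/list.txt"

result=$(
  while IFS= read -r p; do
    # $p is expanded inside double quotes; the escaped quotes are literal.
    grep -F -e "\"$p\"," "$tmp/file.csv"
  done < "$tmp/list.txt"
)
printf '%s\n' "$result"
rm -rf "$tmp"
```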

grep -Ff strings.txt file.csv
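This should get you far enough. Here strings.txt plays the role of list.txt. Note that -F matches substrings, so an entry such as string with spaces also matches the longer fields that contain it.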
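If exact field matches are needed without a loop, one possible refinement is to wrap each list entry in quotes plus the trailing comma (as in the question's own grep) before handing it to grep -F. This is a sketch under that assumption; patterns.txt and the tmp directory are hypothetical names used only for the example.

```shell
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file.csv" <<'EOF'
2,1,"string with spaces",3,4,5
2,1,"some other string",3,4,5
2,1,"string with spaces more than this",3,4,5
2,1,"yet another",3,4,5
2,1,"string with spaces too",3,4,5
EOF
printf '%s\n' 'string with spaces' 'yet another' > "$tmp/list.txt"

# Plain -F matching is substring matching: count the lines it selects.
loose=$(grep -c -F -f "$tmp/list.txt" "$tmp/file.csv")

# Turn each entry into "entry", so only a complete quoted field can match.
sed 's/.*/"&",/' "$tmp/list.txt" > "$tmp/patterns.txt"
result=$(grep -F -f "$tmp/patterns.txt" "$tmp/file.csv")

printf 'loose matches: %s\n%s\n' "$loose" "$result"
rm -rf "$tmp"
```

The unwrapped list selects four of the five lines, while the wrapped patterns select only the two desired ones.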

Related

Issue with field separator in AWK script

I have a very large file with lines like those shown below. Each line has two fields, name and revision, separated by a colon. I need to print only the second column.
sam:7.[0:6]
Ram:8.[6:6]_rev[2:4] h_ack[2:6]
vincent:58
I tried this code:
#!/bin/bash
awk -F: '{print $2}'
Actual output:
7.[0
8.[6
58
Output should be:
7.[0:6]
8.[6:6]_rev[2:4] h_ack[2:6]
58
What went wrong in my code?
The problem in your awk expression is that you are splitting on all :.
Instead, you want to split only on the first : from the start.
$ awk -F'^[^:]+:' '{print $2}' file
The regex pattern matches the start of the string (^), one or more characters other than : ([^:]+), and finally a :. Everything after that first separator ends up in $2.
If you specify : as the field separator, it is normal for awk to print only part of the value for $2, e.g.
7.[0, because the rest of the value lands in the columns after $2.
cut suits the requirement better here:
cut -d: -f2- file
Could you please try the following:
awk '
match($0,/:.*/){
print substr($0,RSTART+1,RLENGTH-1)
}
' Input_file
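All of these answers amount to "print everything after the first colon". A minimal sketch of the same idea with index and substr, compared against the cut command, is shown here. The tmp directory and the variable names are assumptions made for the example.

```shell
# Recreate the sample input in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file" <<'EOF'
sam:7.[0:6]
Ram:8.[6:6]_rev[2:4] h_ack[2:6]
vincent:58
EOF

result=$(awk '
  # index() returns the position of the first ":" (0 when there is none),
  # so lines without a colon are skipped.
  (pos = index($0, ":")) { print substr($0, pos + 1) }
' "$tmp/file")

# cut -f2- keeps field 2 and everything after it, delimiters included.
with_cut=$(cut -d: -f2- "$tmp/file")

printf '%s\n' "$result"
rm -rf "$tmp"
```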

Using awk to filter a CSV file with quotes in it

I have a text file with comma separated values.
A sample line can be something like
"Joga","Bonito",7,"Machine1","Admin"
The " seen are part of the text and are needed when this csv gets converted back to a java object.
I want to filter out some lines from this file based on some field in the csv.
The following statement doesn't work.
awk -F "," '($2== "Bonito") {print}' filename.csv
I am guessing that this has something to do with the " appearing in the text.
I saw an example like:
awk -F "\"*,\"*"
I am not sure how this works. It looks like a regex, but the use of the last * flummoxed me.
Is there a better option than the last awk statement I wrote?
How does it work?
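On how a separator such as "\"*,\"*" works: it is a regex in which each * means "zero or more of the preceding double quote", so a separator is a comma together with any quotes directly before or after it. The last * is what absorbs the quotes that follow the comma. A small sketch that prints every field makes this visible (the numbered output format is just for illustration):

```shell
result=$(printf '%s\n' '"Joga","Bonito",7,"Machine1","Admin"' |
  awk -F '"*,"*' '{ for (i = 1; i <= NF; i++) print i ": " $i }')
printf '%s\n' "$result"
```

This prints:

```
1: "Joga
2: Bonito
3: 7
4: Machine1
5: Admin"
```

The inner fields lose their quotes, so $2 == "Bonito" matches, but the first and last fields keep their outer quote because no comma sits next to it.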
Since some fields have double quotes and others do not, you can filter with a quoted value:
awk -F, '$2 == "\"Bonito\""' filename.csv
To filter on a field that does not have double quotes, just do:
awk -F, '$3 == 7' filename.csv
Another way is to include the double quote in the field-separator regex (the ? quantifier makes the double quote optional):
awk -F '"?,"?' '$2 == "Bonito"' filename.csv
But this has the drawback of also matching the following line:
"Joga",Bonito",7,"Machine1","Admin"
First, a slightly more thorough test file:
$ cat file
"Joga","Bonito",7,"Machine1","Admin"
"Joga",Bonito,7,"Machine1","Admin"
Using the regex ^\"?, i.e. the field starts with or without a double quote:
$ awk -F, '$2~/^\"?Bonito\"?$/' file
"Joga","Bonito",7,"Machine1","Admin"
"Joga",Bonito,7,"Machine1","Admin"

Split rows to multiple line based on comma : one liner solution

I want to split the following format to unique lines
Input:
17:79412041:C:T,CGGATGTCAT
17:79412059:C:G,T
17:79412138:G:A,C
17:79412192:C:G,T,A
Desired output
17:79412041:C:T
17:79412041:C:CGGATGTCAT
17:79412059:C:G
17:79412059:C:T
17:79412138:G:A
17:79412138:G:C
17:79412192:C:G
17:79412192:C:T
17:79412192:C:A
Basically, split the input into unique rows of the form firstID:secondID:thirdID:fourthID. Multiple rows may share the same firstID:secondID:thirdID, and the fourthID (the values separated by "," in the input) is what makes each row unique.
Thanks in advance
Shams
awk one-liner
$ awk -F":" '{gsub(/,/,":"); a=$1FS$2FS$3; for(i=4; i<=NF; i++) print a FS $i;}' f1
17:79412041:C:T
17:79412041:C:CGGATGTCAT
17:79412059:C:G
17:79412059:C:T
17:79412138:G:A
17:79412138:G:C
17:79412192:C:G
17:79412192:C:T
17:79412192:C:A
We are first replacing all , with : to keep a common delimiter i.e. :
We are then traversing from 4th field to end and printing each field by prefixing first three fields.
This one-liner here:
$ awk -F':' '{ split($4,a,","); for (i in a) { print $1":"$2":"$3":"a[i] } }' data.txt
Produces:
17:79412041:C:T
17:79412041:C:CGGATGTCAT
17:79412059:C:G
17:79412059:C:T
17:79412138:G:A
17:79412138:G:C
17:79412192:C:G
17:79412192:C:T
17:79412192:C:A
Explanation:
split(string, array, delimiter)
splits the string by the delimiter, and saves the pieces into the array.
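The for-in loop prints every piece in the array, prefixed with the first three fields. Note that awk does not guarantee the iteration order of for-in, so the pieces may come out in a different order than in the input.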
The -F':' part defines the top-level delimiter.
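When the output order has to follow the input, a counted loop over the return value of split is a safer variant of the same idea. This is a sketch, with the array name alt and the scratch file chosen for the example rather than taken from the answer.

```shell
# A reduced sample: one row with two alternatives, one with three.
tmp=$(mktemp -d)
cat > "$tmp/data.txt" <<'EOF'
17:79412041:C:T,CGGATGTCAT
17:79412192:C:G,T,A
EOF

result=$(awk -F: '{
  n = split($4, alt, ",")        # split returns the number of pieces
  for (i = 1; i <= n; i++)       # indices 1..n follow the input order
    print $1 FS $2 FS $3 FS alt[i]
}' "$tmp/data.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```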
Another awk, which should work for any number of fields:
$ awk -F: '{split($NF,a,","); for(i in a) {sub($NF"$",a[i]); print}}' file
The following awk with gsub may also help:
awk -F":" '{gsub(",",ORS $1 OFS $2 OFS $3 "&");gsub(/,/,":")} 1' OFS=":" Input_file
This might work for you (GNU sed):
sed 's/^\(\(.*:\)[^:,]*\),/\1\n\2/;P;D' file
Insert a newline and the key for each comma in a line.
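An alternative using a loop and extended regex syntax (excluding \n keeps each pass on the last line of the pattern space, which matters when a row has more than two alternatives):
sed -r ':a;s/(([^\n]*:)[^:,\n]*),/\1\n\2/;ta' file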

Grep part of string after symbol and shuffle columns

I would like to take the number after the - sign and put it in column 2 of my matrix. I know how to grep the string, but not how to print it after the text string.
in:
1-967764 GGCTGGTCCGATGGTAGTGGGTTATCAGAACT
3-425354 GCATTGGTGGTTCAGTGGTAGAATTCTCGCC
4-376323 GGCTGGTCCGATGGTAGTGGGTTATCAGAAC
5-221398 GGAAGAGCACACGTCTGAACTCCAGTCACGTGAAAATCTCGTATGCCGTCT
6-180339 TCCCTGGTGGTCTAGTGGTTAGGATTCGGCGCT
out:
GGCTGGTCCGATGGTAGTGGGTTATCAGAACT 967764
GCATTGGTGGTTCAGTGGTAGAATTCTCGCC 425354
GGCTGGTCCGATGGTAGTGGGTTATCAGAAC 376323
GGAAGAGCACACGTCTGAACTCCAGTCACGTGAAAATCTCGTATGCCGTCT 221398
TCCCTGGTGGTCTAGTGGTTAGGATTCGGCGCT 180339
awk -F'[[:space:]-]+' '{print $3,$2}' file
Seems like a simple substitution should do the job:
sed -E 's/[0-9]+-([0-9]+)[[:space:]]*(.*)/\2 \1/' file
Capture the parts you're interested in and use them in the replacement.
Alternatively, using awk:
awk 'sub(/^[0-9]+-/, "") { print $2, $1 }' file
Remove the leading digits and - from the start of the line. When this is successful, sub returns true, so the action is performed, printing the second field, followed by the first.
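To illustrate how the return value of sub gates the action, here is a small sketch with one made-up line that lacks the digits-and-dash prefix. The shortened sequences and the "no prefix here" line are invented for the example.

```shell
result=$(printf '%s\n' '1-967764 GGCT' 'no prefix here' '3-425354 GCAT' |
  awk '
    # sub returns the number of substitutions made (0 or 1), so the
    # action only runs for lines that started with digits and a dash.
    sub(/^[0-9]+-/, "") { print $2, $1 }
  ')
printf '%s\n' "$result"
```

The line without the prefix is silently skipped instead of being printed with its fields swapped.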
Using regex ( +|-) as field separator:
$ awk -F"( +|-)" '{print $3,$2}' file
GGCTGGTCCGATGGTAGTGGGTTATCAGAACT 967764
GCATTGGTGGTTCAGTGGTAGAATTCTCGCC 425354
GGCTGGTCCGATGGTAGTGGGTTATCAGAAC 376323
GGAAGAGCACACGTCTGAACTCCAGTCACGTGAAAATCTCGTATGCCGTCT 221398
TCCCTGGTGGTCTAGTGGTTAGGATTCGGCGCT 180339
Here is another awk:
$ awk 'split($1,a,"-") {print $2,a[2]}' file
awk '{sub(/^[^-]*-/,"");print $2,$1}' file
GGCTGGTCCGATGGTAGTGGGTTATCAGAACT 967764
GCATTGGTGGTTCAGTGGTAGAATTCTCGCC 425354
GGCTGGTCCGATGGTAGTGGGTTATCAGAAC 376323
GGAAGAGCACACGTCTGAACTCCAGTCACGTGAAAATCTCGTATGCCGTCT 221398
TCCCTGGTGGTCTAGTGGTTAGGATTCGGCGCT 180339

AWK get specific pattern

I have lines like this:
Volume.Free_IBM_LUN59_28D: 2072083693568
I would like to get only IBM_LUN59_28D from this line using awk.
Thanks
You can use sub to do substitutions on each input line, as per the following transcript:
pax> echo 'Volume.Free_IBM_LUN59_28D: 2072083693568' | awk '
...> {
...> sub (".*Free_", "");
...> sub (":.*", "");
...> print
...> }'
IBM_LUN59_28D
That command crosses multiple lines for readability but, if you're operating on a file and not too concerned about readability, you can just use the compressed version:
awk '{sub(".*Free_","");sub(":.*","");print}' inputFile
If you're amenable to non-awk solutions, you could also use sed:
sed -e 's/.*Free_//' -e 's/:.*//' inputFile
Note that both those solutions rely on your (somewhat sparse) test data. If your definition of "like" includes preceding textual segments other than Free_ or subsequent characters other than :, some more work may be needed.
For example, if you wanted the string between the first _ and the first :, you could use:
awk '{sub("[^_]*_","");sub(":.*","");print}'
With sed:
sed 's/[^_]*_\(.*\):.*/\1/'
Search for sequence of non _ characters followed by _ (this will match Volume.Free_), then another sequence of characters (this will match IBM_LUN59_28D, we group this for future use), followed by : and any char sequence. Substitute with the saved pattern (\1). That's it.
Sample:
$ echo "Volume.Free_IBM_LUN59_28D: 2072083693568" | sed 's/[^_]*_\(.*\):.*/\1/'
IBM_LUN59_28D
Here is one awk
awk -F"Free_" 'NF>1{split($2,a,":");print a[1]}'
Example:
echo "Volume.Free_IBM_LUN59_28D: 2072083693568" | awk -F"Free_" 'NF>1{split($2,a,":");print a[1]}'
IBM_LUN59_28D
It divides the line by Free_.
If the line then has more than one field (NF>1):
split the second field by : and print the first part, a[1].
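A possible one-pass alternative uses match, which sets RSTART and RLENGTH for the matched text. The sketch below assumes, as above, that the wanted part sits between Free_ and the first colon. The second input line is invented to show that non-matching lines are skipped.

```shell
result=$(printf '%s\n' \
    'Volume.Free_IBM_LUN59_28D: 2072083693568' \
    'Volume.Used_IBM_LUN59_28D: 1' |
  awk '
    # Match "Free_" plus everything up to (not including) the next colon,
    # then drop the 5 characters of "Free_" from the front of the match.
    match($0, /Free_[^:]*/) { print substr($0, RSTART + 5, RLENGTH - 5) }
  ')
printf '%s\n' "$result"
```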
With awk:
echo "$val" | awk -F: '{print $1}' | awk -F. '{print $2}' | awk '{print substr($0,6)}'
where the given string is in $val.