AWK FPAT not working as expected for string parsing - sql

I have to parse a very long string (from stdin). It is basically a .sql file, and I have to get data from it so that I can convert it into csv. For this I am using awk. A sample snippet (of two records) for my case is as follows:
b="(abc#xyz.com,www.example.com,'field2,(2)'),(dfr#xyz.com,www.example.com,'field0'),"
echo $b|awk 'BEGIN {FPAT = "([^\\)]+)|('\''[^'\'']+'\'')"}{print $1}'
In my regex I am saying: split on the ")" bracket, or, if a single quote is found, ignore all text until the closing quote. But my output is as follows:
(abc#xyz.com,www.example.com,'field2,(2
I am expecting this output
(abc#xyz.com,www.example.com,'field2,(2)'
Where is the problem in my code? I have searched a lot and checked the awk manual, but without success.

My first answer below was wrong, there is an ERE for what you're trying to do:
$ echo "$b" | awk -v FPAT="[(]([^)]|'[^']*')*)" '{for (i=1; i<=NF; i++) print $i}'
(abc#xyz.com,www.example.com,'field2,(2)')
(dfr#xyz.com,www.example.com,'field0')
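Spelled out, that FPAT describes what a field looks like rather than what separates fields (the breakdown comments below are my own reading of the pattern, not part of the original answer):
# [(]              a literal opening "("
# ( [^)]           then, any number of times, either a character that is not ")"
#   | '[^']*'      or a complete single-quoted string (so a ")" hidden inside
# )*               quotes can never end the field early)
# )                and finally the record's closing ")"
echo "$b" |
awk -v FPAT="[(]([^)]|'[^']*')*)" '{ for (i = 1; i <= NF; i++) print $i }'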
Original answer, left as a different approach:
You need a 2-pass approach: first replace all )s within quoted fields with something that can't already exist in the input (e.g. RS), then identify the (...) fields and turn the RSs back into )s before printing them:
$ echo "$b" |
awk -F"'" -v OFS= '
{
    for (i=2; i<=NF; i+=2) {
        gsub(/)/,RS,$i)
        $i = FS $i FS
    }
    FPAT = "[(][^)]*)"
    $0 = $0
    for (i=1; i<=NF; i++) {
        gsub(RS,")",$i)
        print $i
    }
    FS = FS
}
'
(abc#xyz.com,www.example.com,'field2,(2)')
(dfr#xyz.com,www.example.com,'field0')
The above is gawk-only due to FPAT (or we could have used gawk's patsplit()); with other awks you'd use a while-match()-substr() loop:
$ echo "$b" |
awk -F"'" -v OFS= '
{
    for (i=2; i<=NF; i+=2) {
        gsub(/)/,RS,$i)
        $i = FS $i FS
    }
    while ( match($0,/[(][^)]*)/) ) {
        field = substr($0,RSTART,RLENGTH)
        gsub(RS,")",field)
        print field
        $0 = substr($0,RSTART+RLENGTH)
    }
}
'
(abc#xyz.com,www.example.com,'field2,(2)')
(dfr#xyz.com,www.example.com,'field0')

Written and tested with your shown samples in GNU awk. This could be done by simply setting the field separator; try the following, where b is your shell variable which holds your shown value.
echo "$b" | awk -F'\\),\\(' '{print $1}'
(abc#xyz.com,www.example.com,'field2,(2)'
Explanation: simply set the field separator of the awk program to \\),\\( for your input and print its first field.
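If you need every record rather than just the first one, a minimal sketch along the same lines (assuming the ),( sequence never occurs inside the quoted parts) is to loop over the fields and put back the delimiters that the split consumed:
echo "$b" | awk -F'\\),\\(' '{
    for (i = 1; i <= NF; i++) {
        f = $i
        gsub(/^\(|\),?$/, "", f)    # strip the leftover "(" at the start / ")," at the end
        print "(" f ")"
    }
}'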

A similar regex approach to the one Ed suggested, but I usually prefer using RS and RT over FPAT:
b="(abc#xyz.com,www.example.com,'field2,(2)'),(dfr#xyz.com,www.example.com,'field0'),"
awk -v RS="[(]('[^']*'|[^)])*[)]" 'RT {print RT}' <<< "$b"
(abc#xyz.com,www.example.com,'field2,(2)')
(dfr#xyz.com,www.example.com,'field0')

If you want to do it in close to one pass, maybe try this:
{mawk/mawk2/gawk} 'BEGIN { OFS = FS = "\047"; ORS = RS = "\n";
                           XFS = "\376\004\377";
                           XRS = "\051" ORS; }
    ! /[\051]/ { print; next; }
    { for (x=1; x <= NF; x += 2) { gsub(/[\051][^\050]*/, XFS, $(x)); } }
    gsub(XFS, XRS) || 1'
I did it this way with 2 gsubs just in case a single one starts sending rows below with unintended consequences. \051 = ")", \050 is the open one.
I further enhanced it by telling it to instantly print and move on if no close brackets are found at all (so there is nothing to split).
It only loops over odd-numbered fields once I split by the single quote \047 (because the even-numbered ones are precisely the ones within a pair of single quotes that you want to avoid chopping at).
As for XFS, just pick any combination of your choice using bytes that are almost impossible to encounter. If you want to play it safe, you can test for whether XFS exists in that row and use some alternative combo. It's basically there to insert a delimiter into the middle of the row that won't run afoul of actual input data. It's not foolproof per se, but the likelihood of running into a combination of a UTF-16 byte order mark and ASCII control characters is reasonably low.
(And if you do encounter XFS, it's likely you already have corrupted data to begin with, since a 300-series octal byte must be followed by 200-series ones to be valid UTF-8.)
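A hedged sketch of that safety check (the message text is just illustrative), added as an extra rule before the main ones:
# bail out loudly if the sentinel already occurs in the data, so you can
# switch XFS to some other byte combination instead of corrupting the row
index($0, XFS) { print "XFS sentinel collision on record " NR > "/dev/stderr"; exit 1 }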
This way, I wouldn't need FPAT at all.
*Updated with " || 1" towards the end as a safety catch-all, but it shouldn't really be needed.

Related

awk remove endings of words ending with patterns

I have a large dataset and am trying to lemmatize a column ($14) with awk; that is, I need to remove 'ing', 'ed', 's' from words that end with one of those patterns. So asked, asks, asking would all become just 'ask'.
Let's say I have this dataset (the column I want to modify is $2):
onething This is a string that is tested multiple times.
twoed I wanted to remove words ending with many patterns.
threes Reading books is good thing.
With that, expected output is:
onething Thi i a str that i test multiple time.
twoed I want to remove word end with many pattern.
threes Read book i good th.
I have tried the following regex with awk, but it didn't work.
awk -F'\t' '{gsub(/\(ing|ed|s\)\b/," ",$2); print}' file.txt
#this replaces some of the words ending with ing and ed, but not all, and words ending with s stay the same (which I don't want)
Please help, I'm new to awk and still exploring it.
Using GNU awk for gensub() and \> for word boundaries:
$ awk 'BEGIN{FS=OFS="\t"} {$2=gensub(/(ing|ed|s)\>/,"","g",$2)} 1' file
onething Thi i a str that i test multiple time.
twoed I want to remove word end with many pattern.
threes Read book i good th.
Using any awk with gsub you could do:
awk -F'\t' -v OFS="\t" '
{ gsub(/(s|ed|ing)[.[:blank:]]/," ",$2)
match($2,/[.]$/) || sub(/[[:blank:]]$/,".",$2)
}1
' file
Example Input File
$ cat file
onething This is a string that is tested multiple times.
twoed I wanted to remove words ending with many patterns.
threes Reading books is good thing.
four Just a normal sentence.
Example Use/Output
$ awk -F'\t' -v OFS="\t" '
> { gsub(/(s|ed|ing)[.[:blank:]]/," ",$2)
> match($2,/[.]$/) || sub(/[[:blank:]]$/,".",$2)
> }1
> ' file
onething Thi i a str that i test multiple time.
twoed I want to remove word end with many pattern.
threes Read book i good th.
four Just a normal sentence.
(note: the last line was added as an example of a sentence left unchanged. The second statement in the script is needed because the gsub eats the sentence-ending period when the last word matches, e.g. "times." becomes "time "; it puts the "." back whenever $2 no longer ends with one.)
If you use GNU awk you were not far from it:
$ awk -F'\t' -v OFS='\t' '{gsub(/(ing|ed|s)\>/,"",$2); print}' file.txt
onething Thi i a str that i test multiple time.
twoed I want to remove word end with many pattern.
threes Read book i good th.
Note -v OFS='\t' to use tabs also as output field separators.
But if your awk uses the kind of obsolete regular expressions without word boundaries (like the default awk that comes with macOS) things are more complicated. One option is to iteratively use match and substr. Example:
# foo.awk
BEGIN {
    n = split(prefix, word, /,/)
    for(i = 1; i <= n; i++) {
        len[i] = length(word[i])
    }
}
{
    for(i = 1; i <= n; i++) {
        re = word[i] "[^[:alnum:]]"
        while(m = match($2, re)) {
            if(m == 1) {
                $2 = substr($2, len[i]+1, length($2))
            } else {
                $2 = substr($2, 1, m-1) substr($2, m+len[i], length($2))
            }
        }
    }
    print
}
And then:
$ awk -F'\t' -v OFS='\t' -v prefix="ing,ed,s" -f foo.awk file.txt
onething Thi i a str that i test multiple time.
twoed I want to remove word end with many pattern.
threes Read book i good th.

Can I delete a field in awk?

This is test.txt:
0x01,0xDF,0x93,0x65,0xF8
0x01,0xB0,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0xB2,0x00,0x76
If I run
awk -F, 'BEGIN{OFS=","}{$2="";print $0}' test.txt
the result is:
0x01,,0x93,0x65,0xF8
0x01,,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,,0x00,0x76
The $2 wasn't deleted, it just became empty.
What I want when printing $0 is this result:
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
All the existing solutions are good though this is actually a tailor made job for cut:
cut -d, -f 1,3- file
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
If you want to remove 3rd field then use:
cut -d, -f 1,2,4- file
To remove 4th field use:
cut -d, -f 1-3,5- file
I believe the simplest would be to use the sub function to replace the first occurrence of the consecutive ,, (created after you set the 2nd field to null) with a single comma. But this assumes that you don't have any embedded commas within the field values.
awk 'BEGIN{FS=OFS=","}{$2="";sub(/,,/,",");print $0}' Input_file
2nd solution: Or you could use the match function to match the stretch from the first comma to the next comma, and print the parts of the line before and after the matched string.
awk '
match($0,/,[^,]*,/){
print substr($0,1,RSTART-1)","substr($0,RSTART+RLENGTH)
}' Input_file
It's a bit heavy-handed, but this moves each field after field 2 down a place, and then changes NF so the unwanted field is not present:
$ awk -F, -v OFS=, '{ for (i = 2; i < NF; i++) $i = $(i+1); NF--; print }' test.txt
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01
0x01,0x00,0x76
$
Tested with both GNU Awk 4.1.3 and BSD Awk ("awk version 20070501" on macOS Mojave 10.14.6 — don't ask; it frustrates me too, but sometimes employers are not very good at forward thinking). Setting NF may or may not work on older versions of Awk — I was a little surprised it did work, but the surprise was a pleasant one, for a change.
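If you need this in more than one place, the same shift-and-shrink idea fits into a small helper (a sketch; the function name is my own invention):
awk '
# delete_field(n): shift fields n+1..NF down one place and drop the last one
function delete_field(n,    i) {
    for (i = n; i < NF; i++)
        $i = $(i + 1)
    NF--
}
BEGIN { FS = OFS = "," }
{ delete_field(2); print }
' test.txt
(The same caveat about assigning to NF applies here, of course.)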
If Awk is not an absolute requirement, and the input is indeed as trivial as in your example, sed might be a simpler solution.
sed 's/,[^,]*//' test.txt
This is especially elegant if you want to remove the second field. A more generic approach to remove the nth field would require you to put in a regex which matches the first n - 1 fields followed by the nth, then replace that with just the first n - 1.
So for n = 4 you'd have
sed 's/\([^,]*,[^,]*,[^,]*,\)[^,]*,/\1/' test.txt
or more generally, if your sed dialect understands braces for specifying repetitions
sed 's/\(\([^,]*,\)\{3\}\)[^,]*,/\1/' test.txt
Some sed dialects allow you to lose all those pesky backslashes with an option like -r or -E but again, this is not universally supported or portable.
In case it's not obvious, [^,] matches a single character which is not (newline or) comma; and \1 recalls the text from first parenthesized match (back reference; \2 recalls the second, etc).
Also, this is completely unsuitable for escaped or quoted fields (though I'm not saying it can't be done). Every comma acts as a field separator, no matter what.
With GNU sed you can add a number modifier to substitute nth match of non-comma characters followed by comma:
sed -E 's/[^,]*,//2' file
Using awk in a regex-free way, with the option to choose which field will be deleted:
awk '{ col = 2; n = split($0,arr,","); line = ""; for (i = 1; i <= n; i++) line = line ( i == col ? "" : ( line == "" ? "" : "," ) arr[i] ); print line }' test.txt
Step by step:
{
    col = 2                   # defines which column will be deleted
    n = split($0,arr,",")     # each line is split into an array
                              # n is the number of elements in the array
    line = ""                 # this will be the new line
    for (i = 1; i <= n; i++)  # roaming through all elements in the array
        line = line ( i == col ? "" : ( line == "" ? "" : "," ) arr[i] )
                              # appends a comma (except if line is still empty)
                              # and the current array element to the line
                              # (except when on the selected column)
    print line                # prints line
}
Another solution:
You can just pipe the awk output to sed and squeeze the delimiters.
$ awk -F, 'BEGIN{OFS=","}{$2=""}1 ' edward.txt | sed 's/,,/,/g'
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
$
Commenting on the first solution of @RavinderSingh13 using the sub() function:
awk 'BEGIN{FS=OFS=","}{$2="";sub(/,,/,",");print $0}' Input_file
The gnu-awk manual: https://www.gnu.org/software/gawk/manual/html_node/Changing-Fields.html
"It is important to note that making an assignment to an existing field changes the value of $0 but does not change the value of NF, even when you assign the empty string to a field." (4.4 Changing the Contents of a Field)
So, following the first solution of @RavinderSingh13 but without using, in this case, sub(): "The field is still there; it just has an empty value, delimited by the two colons":
awk 'BEGIN {FS=OFS=","} {$2="";print $0}' file
0x01,,0x93,0x65,0xF8
0x01,,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,,0x00,0x76
My solution:
awk -F, '
{
    regex = "^"$1","$2
    sub(regex, $1, $0);
    print $0;
}'
or one line code:
awk -F, '{regex="^"$1","$2;sub(regex, $1, $0);print $0;}' test.txt
I found that OFS="," was not necessary
I would do it the following way; let file.txt content be:
0x01,0xDF,0x93,0x65,0xF8
0x01,0xB0,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0xB2,0x00,0x76
then
awk 'BEGIN{FS=",";OFS=""}{for(i=2;i<=NF;i+=1){$i="," $i};$2="";print}' file.txt
output
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
Explanation: I set OFS to nothing (an empty string), then for the 2nd and following columns I add a , at the start of each. Finally I set the 2nd field, which is now a comma followed by its value, to nothing. Keep in mind this solution would need rework if you wished to remove the 1st column.

AWK: Convert columns to rows with condition (create list) [duplicate]

I have a tab-delimited file with three columns (excerpt):
AC147602.5_FG004 IPR000146 Fructose-1,6-bisphosphatase class 1/Sedoheputulose-1,7-bisphosphatase
AC147602.5_FG004 IPR023079 Sedoheptulose-1,7-bisphosphatase
AC148152.3_FG001 IPR002110 Ankyrin repeat
AC148152.3_FG001 IPR026961 PGG domain
and I'd like to get this using bash:
AC147602.5_FG004 IPR000146 Fructose-1,6-bisphosphatase class 1/Sedoheputulose-1,7-bisphosphatase IPR023079 Sedoheptulose-1,7-bisphosphatase
AC148152.3_FG001 IPR023079 Sedoheptulose-1,7-bisphosphatase IPR002110 Ankyrin repeat IPR026961 PGG domain
So if the ID in the first column is the same in several lines, it should produce one line for each ID with all the other parts of the lines joined. In the example it will give a two-row file.
give this one-liner a try:
awk -F'\t' -v OFS='\t' '{x=$1;$1="";a[x]=a[x]$0}END{for(x in a)print x,a[x]}' file
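The same one-liner spread out, with comments, in case the compact form is hard to follow (this is just a restatement, not a different approach):
awk -F'\t' -v OFS='\t' '
{
    x  = $1              # remember the key from column 1
    $1 = ""              # blank it out, so $0 becomes "<tab>rest-of-line"
    a[x] = a[x] $0       # append that remainder to what we already hold for this key
}
END {
    for (x in a)         # note: for-in order is arbitrary
        print x, a[x]
}' file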
For whatever reason, the awk solution does not work for me in cygwin, so I used Perl instead. It joins around a tab character and separates lines by \n:
cat FILENAME | perl -e 'foreach $Line (<STDIN>) { @Cols=($Line=~/^\s*(\d+)\s*(.*?)\s*$/); push(@{$Link{$Cols[0]}}, $Cols[1]); } foreach $List (values %Link) { print join("\t", @{$List})."\n"; }'
This will depend on the file size (and awk limitations).
If the file is too big, this reduces the memory awk needs by sorting the file first and keeping only one label in memory for printing.
A classical version with post-print, using a modification of the whole line:
sort YourFile \
| awk '
    Last == $1 { sub( /^[^[:blank:]]*[[:blank:]]+/, ""); C = C " " $0; next }
    NR > 1     { print C }
               { Last = $1; C = $0 }
    END        { print C }
    '
Another version using fields and pre-print, but less "human readable":
sort YourFile \
| awk '
    Last != $1 { printf( "%s%s", (NR > 1 ? "\n" : ""), Last = $1) }
    Last == $1 { for( i=2; i<=NF; i++) printf( " %s", $i) }
    END        { print "" }
    '
A pure bash version. It has no additional dependencies, but requires bash 4.0 or above (2009) for associative array support.
All on one line:
{ declare -A merged; merged=(); while IFS=$'\t' read -r key value; do merged[$key]="${merged[$key]}"$'\t'"$value"; done; for key in "${!merged[@]}"; do echo "$key${merged[$key]}"; done; } < INPUT_FILE.tsv
Readable and commented equivalent:
{
    # Define `merged` as an empty associative array.
    declare -A merged
    merged=()
    # Read tab-separated lines. Any leftover fields also end up in `value`.
    while IFS=$'\t' read -r key value
    do
        # Append to any value that's already there, separated by a tab.
        merged[$key]="${merged[$key]}"$'\t'"$value"
    done
    # Loop over the input keys. Note that the order is arbitrary;
    # pipe through `sort` if you want a predictable order.
    for key in "${!merged[@]}"
    do
        # Each value is prefixed with a tab, so no need for a tab here.
        echo "$key${merged[$key]}"
    done
} < INPUT_FILE.tsv

AWK, Comma delimited fields enclosed in quotes [duplicate]

The intent of this question is to provide a canonical answer.
Given a CSV as might be generated by Excel or other tools with embedded newlines and/or double quotes and/or commas in fields, and empty fields like:
$ cat file.csv
"rec1, fld1",,"rec1"",""fld3.1
"",
fld3.2","rec1
fld4"
"rec2, fld1.1
fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4
"""""","""rec3,fld2""",
What's the most robust and efficient way using awk to identify the separate records and fields:
Record 1:
$1=<rec1, fld1>
$2=<>
$3=<rec1","fld3.1
",
fld3.2>
$4=<rec1
fld4>
----
Record 2:
$1=<rec2, fld1.1
fld1.2>
$2=<rec2 fld2.1"fld2.2"fld2.3>
$3=<>
$4=<rec2 fld4>
----
Record 3:
$1=<"">
$2=<"rec3,fld2">
$3=<>
----
so it can be used as those records and fields internally by the rest of the awk script.
A valid CSV would be one that conforms to RFC 4180 or can be generated by MS-Excel.
The solution must tolerate the end of record just being LF (\n) as is typical for UNIX files rather than CRLF (\r\n) as that standard requires and Excel or other Windows tools would generate. It will also tolerate unquoted fields mixed with quoted fields. It will specifically not need to tolerate escaping "s with a preceding backslash (i.e. \" instead of "") as some other CSV formats allow - if you have that then adding a gsub(/\\"/,"\"\"") up front would handle it and trying to handle both escaping mechanisms automatically in one script would make the script unnecessarily fragile and complicated.
If your CSV cannot contain newlines then all you need is (with GNU awk for FPAT):
$ echo 'foo,"field,""with"",commas",bar' |
awk -v FPAT='[^,]*|("([^"]|"")*")' '{for (i=1; i<=NF;i++) print i " <" $i ">"}'
1 <foo>
2 <"field,""with"",commas">
3 <bar>
or the equivalent using any awk:
$ echo 'foo,"field,""with"",commas",bar' |
awk -v fpat='[^,]*|("([^"]|"")*")' -v OFS=',' '{
    rec = $0
    $0 = ""
    i = 0
    while ( (rec!="") && match(rec,fpat) ) {
        $(++i) = substr(rec,RSTART,RLENGTH)
        rec = substr(rec,RSTART+RLENGTH+1)
    }
    for (i=1; i<=NF;i++) print i " <" $i ">"
}'
1 <foo>
2 <"field,""with"",commas">
3 <bar>
See https://www.gnu.org/software/gawk/manual/gawk.html#More-CSV for info on the specific FPAT setting I use above.
If all you actually want to do is convert your CSV to individual lines by, say, replacing newlines with blanks and commas with semi-colons inside quoted fields then all you need is this, again using GNU awk for multi-char RS and RT:
$ awk -v RS='"([^"]|"")*"' -v ORS= '{gsub(/\n/," ",RT); gsub(/,/,";",RT); print $0 RT}' file.csv
"rec1; fld1",,"rec1"";""fld3.1 ""; fld3.2","rec1 fld4"
"rec2; fld1.1 fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4
"""""","""rec3;fld2""",
Otherwise, though, the general, robust, portable solution to identify the fields that will work with any modern awk* is:
$ cat decsv.awk
function buildRec(      fpat,fldNr,fldStr,done) {
    CurrRec = CurrRec $0
    if ( gsub(/"/,"&",CurrRec) % 2 ) {
        # The string built so far in CurrRec has an odd number
        # of "s and so is not yet a complete record.
        CurrRec = CurrRec RS
        done = 0
    }
    else {
        # If CurrRec ended with a null field we would exit the
        # loop below before handling it so ensure that cannot happen.
        # We use a regexp comparison using a bracket expression here
        # and in fpat so it will work even if FS is a regexp metachar
        # or a multi-char string like "\\\\" for \-separated fields.
        CurrRec = CurrRec ( CurrRec ~ ("[" FS "]$") ? "\"\"" : "" )
        $0 = ""
        fpat = "([^" FS "]*)|(\"([^\"]|\"\")+\")"
        while ( (CurrRec != "") && match(CurrRec,fpat) ) {
            fldStr = substr(CurrRec,RSTART,RLENGTH)
            # Convert <"foo"> to <foo> and <"foo""bar"> to <foo"bar>
            if ( gsub(/^"|"$/,"",fldStr) ) {
                gsub(/""/, "\"", fldStr)
            }
            $(++fldNr) = fldStr
            CurrRec = substr(CurrRec,RSTART+RLENGTH+1)
        }
        CurrRec = ""
        done = 1
    }
    return done
}

# If your input has \-separated fields, use FS="\\\\"; OFS="\\"
BEGIN { FS=OFS="," }
!buildRec() { next }
{
    printf "Record %d:\n", ++recNr
    for (i=1;i<=NF;i++) {
        # To replace newlines with blanks add gsub(/\n/," ",$i) here
        printf " $%d=<%s>\n", i, $i
    }
    print "----"
}
$ awk -f decsv.awk file.csv
Record 1:
$1=<rec1, fld1>
$2=<>
$3=<rec1","fld3.1
",
fld3.2>
$4=<rec1
fld4>
----
Record 2:
$1=<rec2, fld1.1
fld1.2>
$2=<rec2 fld2.1"fld2.2"fld2.3>
$3=<>
$4=<rec2 fld4>
----
Record 3:
$1=<"">
$2=<"rec3,fld2">
$3=<>
----
The above assumes UNIX line endings of \n. With Windows \r\n line endings it's much simpler as the "newlines" within each field will actually just be line feeds (i.e. \ns) and so you can set RS="\r\n" (using GNU awk for multi-char RS) and then the \ns within fields will not be treated as line endings.
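For example (a hedged sketch, assuming every record really is terminated by \r\n and the only embedded newlines are bare \ns; the file name is illustrative), the simple FPAT approach from the top of this answer then works per record without any buildRec() machinery:
gawk -v RS='\r\n' -v FPAT='[^,]*|("([^"]|"")*")' '{
    printf "Record %d:\n", NR
    for (i = 1; i <= NF; i++)
        printf " $%d=<%s>\n", i, $i      # surrounding quotes are left as-is in this sketch
    print "----"
}' excel.csv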
It works by simply counting how many "s are present so far in the current record whenever it encounters the RS - if it's an odd number then the RS (presumably \n but doesn't have to be) is mid-field and so we keep building the current record but if it's even then it's the end of the current record and so we can continue with the rest of the script processing the now complete record.
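A minimal sketch of just that record-assembly step in isolation (no field splitting), in case it helps to see the counting idea on its own:
{
    rec = (rec == "" ? $0 : rec RS $0)   # append this physical line to the pending record
    tmp = rec
    if (gsub(/"/, "&", tmp) % 2)         # odd number of "s => a quoted field is still open
        next
    print rec                            # even => the logical record is complete
    print "----"
    rec = ""
}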
*I say "modern awk" above because there are apparently extremely old (i.e. circa 2000) versions of tawk and mawk1 still around which have bugs in their gsub() implementation such that gsub(/^"|"$/,"",fldStr) would not remove the start/end "s from fldStr. If you're using one of those then get a new awk, preferably gawk, as there could be other issues with them too; but if that's not an option then I expect you can work around that particular bug by changing this:
if ( gsub(/^"|"$/,"",fldStr) ) {
to this:
if ( sub(/^"/,"",fldStr) && sub(/"$/,"",fldStr) ) {
Thanks to the following people for identifying and suggesting solutions to the stated issues with the original version of this answer:
@mosvy for escaped double quotes within fields.
@datatraveller1 for multiple contiguous pairs of escaped quotes in a field and null fields at the end of records.
Related: also see How do I use awk under cygwin to print fields from an excel spreadsheet? for how to generate CSVs from Excel spreadsheets.
An improvement upon @EdMorton's FPAT solution, which should be able to handle double quotes (") escaped by doubling ("" -- as allowed by the CSV standard).
gawk -v FPAT='[^,]*|("[^"]*")+' ...
This STILL isn't able to handle newlines inside quoted fields (which are perfectly legit in standard CSV files), and it still assumes GNU awk (gawk); a standard awk won't do.
Example:
$ echo 'a,,"","y""ck","""x,y,z"," ",12' |
gawk -v OFS='|' -v FPAT='[^,]*|("[^"]*")+' '{$1=$1}1'
a||""|"y""ck"|"""x,y,z"|" "|12
$ echo 'a,,"","y""ck","""x,y,z"," ",12' |
gawk -v FPAT='[^,]*|("[^"]*")+' '{
    for(i=1; i<=NF;i++){
        if($i~/"/){ $i = substr($i, 2, length($i)-2); gsub(/""/,"\"", $i) }
        print "<"$i">"
    }
}'
<a>
<>
<>
<y"ck>
<"x,y,z>
< >
<12>
This is exactly what csvquote is for - it makes things simple for awk and other command line data processing tools.
Some things are difficult to express in awk. Instead of running a single awk command and trying to get awk to handle the quoted fields with embedded commas and newlines, the data gets prepared for awk by csvquote, so that awk can always interpret the commas and newlines it finds as field separators and record separators. This makes the awk part of the pipeline simpler. Once awk is done with the data, it goes back through csvquote -u to restore the embedded commas and newlines inside quoted fields.
csvquote file.csv | awk -f my_awk_script | csvquote -u
EDIT:
For a complete description of csvquote, see: How it works. This also explains the `` characters which are shown in places where there was a carriage return.
csvquote file.csv | awk -f decsv.awk | csvquote -u
(for the source of decsv.awk see the answer from Ed Morton above)
output:
Record 1:
$1=<rec1 fld1>
$2=<>
$3=<rec1","fld3.1",
fld3.2>
$4=<rec1
fld4>
----
Record 2:
$1=<rec2, fld1.1
fld1.2>
$2=<rec2 fld2.1"fld2.2"fld2.3>
$3=<>
$4=<rec2 fld4>
----
Record 3:
$1=<"">
$2=<"rec3fld2">
$3=<>
----
I have found csvkit a really useful toolkit for handling csv files on the command line.
line='test,t2,t3,"t5,"'
echo $line | csvcut -c 4
"t5,"
echo 'foo,"field,""with"",commas",bar' | csvcut -c 3
bar
It also contains csvstat, csvstack etc. tools which are also very handy.
cat file.csv
"rec1, fld1",,"rec1"",""fld3.1
"",
fld3.2","rec1
fld4"
"rec2, fld1.1
fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4
"""""","""rec3,fld2""",
csvcut -c 1 file.csv
"rec1, fld1"
"rec2, fld1.1
fld1.2"
""""""
csvcut -c 3 file.csv
"rec1"",""fld3.1
"",
fld3.2"
""
""
Awk (gawk) actually provides extensions, one of which is CSV processing, and that is the most robust way to do this with gawk in my opinion. The extension takes care of many gotchas and parses the csv for you.
Assuming that extension is installed, you can use awk to show all lines where a specific csv field matches 123.
Assuming test.csv contains the following:
Name,Phone
"Woo, John",425-555-1212
"James T. Kirk",123
The following will print all lines where the Phone (aka the second field) is equal to 123:
gawk -l csv 'csvsplit($0,a) && a[2] == 123 {print a[1]}' test.csv
The output is:
James T. Kirk
How does it work?
-l csv asks gawk to load the csv extension by looking for it in $AWKLIBPATH;
csvsplit($0, a) splits the current line, and stores each field into a new array named a
&& a[2] == 123 checks that the second field is 123
if both conditions are true, it runs { print a[1] }, i.e. prints the first csv field of the line.
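In the same spirit, a hedged sketch (assuming csvsplit() returns the number of fields, as its truthy use above suggests) that simply reprints every record with its parsed fields pipe-separated:
gawk -l csv '
{
    n = csvsplit($0, a)          # parse the CSV record into array a
    line = a[1]
    for (i = 2; i <= n; i++)     # re-join the parsed fields with "|"
        line = line "|" a[i]
    print line
}' test.csv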
If you're using one of the common AWK interpreters (Gawk, onetrueawk, mawk), the other solutions are your best bet. However, if you're able to use a different interpreter, frawk and GoAWK have proper CSV support built-in.
frawk is a very fast AWK implementation written in Rust. Use -i csv to process input in CSV mode. Note that frawk is not quite POSIX compatible (see differences).
GoAWK is a POSIX-compatible AWK implementation written in Go. It also supports -i csv mode, as well as -H (parse header row) with @"named_field" syntax (read more). Disclaimer: I'm the author of GoAWK.
With file.csv as per the question, you can simply use an AWK script with a regular for loop over the fields as follows:
$ cat records.awk
{
    printf "Record %d:\n", NR
    for (i=1; i<=NF; i++)
        printf " $%d=<%s>\n", i, $i
    print "----"
}
Then use either frawk -i csv or goawk -i csv to get the expected output. For example:
$ frawk -i csv -f records.awk file.csv
Record 1:
$1=<rec1, fld1>
$2=<>
$3=<rec1","fld3.1
",
fld3.2>
$4=<rec1
fld4>
----
Record 2:
$1=<rec2, fld1.1
fld1.2>
$2=<rec2 fld2.1"fld2.2"fld2.3>
$3=<>
$4=<rec2 fld4>
----
Record 3:
$1=<"">
$2=<"rec3,fld2">
$3=<>
----
$ goawk -i csv -f records.awk file.csv
Record 1:
... same as above ...
----

How to print fields for repeated key column in one line

I'd like to transform a table in such a way that for duplicated values in column #2 it would list the corresponding values from column #1.
I.e. something like this...
MZ00024296 AC148152.3_FG005
MZ00047079 AC148152.3_FG006
MZ00028122 AC148152.3_FG008
MZ00032922 AC148152.3_FG008
MZ00048218 AC148152.3_FG008
MZ00024680 AC148167.6_FG001
MZ00013456 AC149475.2_FG003
to
AC148152.3_FG005 MZ00024296
AC148152.3_FG006 MZ00047079
AC148152.3_FG008 MZ00028122|MZ00032922|MZ00048218
AC148167.6_FG001 MZ00024680
AC149475.2_FG003 MZ00013456
As I need it for computations in R, I tried to use:
x=aggregate(mz_grmz,by=list(mz_grmz[,2]),FUN=paste(mz_grmz[,1],sep="|"))
but it don't work (wrong function)
Error in match.fun(FUN) :
'paste(mz_grmz[, 1], sep = "|")' is not a function, character or symbol
I also thought of the unstack() function, but it isn't what I need.
I tried to do it using awk; based on my basic knowledge I reworked the code given here:
site1
#! /bin/sh
for y do
    awk -v FS="\t" '{
        for (x=1;x<=NR;x++) {
            if (NR>2 && x=x+1) {
                print $2"\t"x
            }
            else {print NR}
        }
    }' $y > $y.2
done
Unfortunately it doesn't work; it only produces an enormous file with field #2 and some numbers.
I suppose it is an easy task, but it is above my skills right now.
Could somebody give me a hint? Maybe just a function to use in aggregate in R.
Thanks
You could do it in awk like this:
awk '
{
    if ($2 in a)
        a[$2] = a[$2] "|" $1
    else
        a[$2] = $1
}
END {
    for (i in a)
        print i, a[i]
}' INFILE > OUTFILE
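If the order of the keys in the output matters, note that for (i in a) visits them in an arbitrary order; a small sketch along the same lines that remembers the order in which keys were first seen:
awk '
{
    if (!($2 in a)) {
        order[++n] = $2              # record first-seen order of each key
        a[$2] = $1
    } else {
        a[$2] = a[$2] "|" $1
    }
}
END {
    for (i = 1; i <= n; i++)
        print order[i], a[order[i]]
}' INFILE > OUTFILE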
To keep the output the same as the text in your question (empty lines etc.):
awk '{if($0 &&($2 in a))a[$2]=a[$2]"|"$1;else if ($0) a[$2]=$1;}\
END{for(x in a){print x,a[x];print ""}}' inputFile
test:
kent$ echo "MZ00024296 AC148152.3_FG005
MZ00047079 AC148152.3_FG006
MZ00028122 AC148152.3_FG008
MZ00032922 AC148152.3_FG008
MZ00048218 AC148152.3_FG008
MZ00024680 AC148167.6_FG001
MZ00013456 AC149475.2_FG003"|awk '{if($0 &&($2 in a))a[$2]=a[$2]"|"$1;else if ($0) a[$2]=$1;}END{for(x in a){print x,a[x];print ""}}'
AC149475.2_FG003 MZ00013456
AC148152.3_FG005 MZ00024296
AC148152.3_FG006 MZ00047079
AC148152.3_FG008 MZ00028122|MZ00032922|MZ00048218
AC148167.6_FG001 MZ00024680
This GNU sed solution might work for you:
sed -r '1{h;d};H;${x;s/(\S+)\s+(\S+)/\2\t\1/g;:a;s/(\S+\t)([^\n]*)(\n+)\1([^\n]*)\n*/\1\2|\4\3/;ta;p};d' input_file
Explanation: Use the extended regex option-r to make regex's more readable. Read the whole file into the hold space (HS). Then on end-of-file, switch to the HS and firstly swap and tab separate fields. Then compare the first fields in adjacent lines and if they match, tag the second field from the second record to the first line separated by a |. Repeated until no further adjacent lines have duplicate first fields then print the file out.