I have to parse a very long string (from stdin). It is basically a .sql file from which I have to extract data and convert it into CSV. For this, I am using awk. A sample snippet (of two records) is as follows:
b="(abc#xyz.com,www.example.com,'field2,(2)'),(dfr#xyz.com,www.example.com,'field0'),"
echo $b|awk 'BEGIN {FPAT = "([^\\)]+)|('\''[^'\'']+'\'')"}{print $1}'
In my regex I am saying: split on the ")" bracket, or, if a single quote is found, ignore all text until the closing quote. But my output is as follows:
(abc#xyz.com,www.example.com,'field2,(2
I am expecting this output
(abc#xyz.com,www.example.com,'field2,(2)'
Where is the problem in my code? I have searched a lot and checked the awk manual, but without success.
My first answer below was wrong, there is an ERE for what you're trying to do:
$ echo "$b" | awk -v FPAT="[(]([^)]|'[^']*')*)" '{for (i=1; i<=NF; i++) print $i}'
(abc#xyz.com,www.example.com,'field2,(2)')
(dfr#xyz.com,www.example.com,'field0')
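If the end goal is CSV (one record per line, parentheses stripped), the same FPAT match extends naturally. A minimal sketch building on the above, with the trailing ) written as [)]:

$ echo "$b" | awk -v FPAT="[(]([^)]|'[^']*')*[)]" '{
    for (i = 1; i <= NF; i++) {
        rec = $i
        gsub(/^\(|\)$/, "", rec)    # strip the enclosing ( and )
        print rec                   # the inside is already comma-separated
    }
}'
abc#xyz.com,www.example.com,'field2,(2)'
dfr#xyz.com,www.example.com,'field0'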
Original answer, left as a different approach:
You need a 2-pass approach: first replace all )s within quoted fields with something that can't already exist in the input (e.g. RS), then identify the (...) fields and turn the RSs back into )s before printing them:
$ echo "$b" |
awk -F"'" -v OFS= '
{
for (i=2; i<=NF; i+=2) {
gsub(/)/,RS,$i)
$i = FS $i FS
}
FPAT = "[(][^)]*)"
$0 = $0
for (i=1; i<=NF; i++) {
gsub(RS,")",$i)
print $i
}
FS = FS
}
'
(abc#xyz.com,www.example.com,'field2,(2)')
(dfr#xyz.com,www.example.com,'field0')
The above is gawk-only due to FPAT (or we could have used gawk's patsplit()); with other awks you'd use a while-match()-substr() loop:
$ echo "$b" |
awk -F"'" -v OFS= '
{
for (i=2; i<=NF; i+=2) {
gsub(/)/,RS,$i)
$i = FS $i FS
}
while ( match($0,/[(][^)]*)/) ) {
field = substr($0,RSTART,RLENGTH)
gsub(RS,")",field)
print field
$0 = substr($0,RSTART+RLENGTH)
}
}
'
(abc#xyz.com,www.example.com,'field2,(2)')
(dfr#xyz.com,www.example.com,'field0')
Written and tested with your shown samples in GNU awk. This can be done simply by setting the field separator; try the following, where b is your shell variable holding the shown value.
echo "$b" | awk -F'\\),\\(' '{print $1}'
(abc#xyz.com,www.example.com,'field2,(2)'
Explanation: simply set the awk program's field separator to \\),\\( for your input and print its first field.
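To print every record rather than just the first, the delimiters that -F consumed can be restored afterwards. A sketch, assuming ),( never occurs inside the quoted fields:

echo "$b" | awk -F'\\),\\(' '{
    sub(/,$/, "", $NF)               # drop the trailing comma of the input
    for (i = 1; i <= NF; i++) {
        rec = $i
        if (i > 1)  rec = "(" rec    # restore the ( eaten by the separator
        if (i < NF) rec = rec ")"    # restore the ) eaten by the separator
        print rec
    }
}'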
Similar regex approach to the one Ed has suggested, but I usually prefer using RS and RT over FPAT:
b="(abc#xyz.com,www.example.com,'field2,(2)'),(dfr#xyz.com,www.example.com,'field0'),"
awk -v RS="[(]('[^']*'|[^)])*[)]" 'RT {print RT}' <<< "$b"
(abc#xyz.com,www.example.com,'field2,(2)')
(dfr#xyz.com,www.example.com,'field0')
If you want to do it in close to one pass, maybe try this:
{mawk/mawk2/gawk} 'BEGIN { OFS = FS = "\047"; ORS = RS = "\n";
    XFS = "\376\004\377";                  # synthetic separator from unlikely bytes
    XRS = "\051" ORS;                      # ")" followed by the output record sep
}
! /[\051]/ { print; next; }                # no ")" at all, so nothing to split
{
    for (x=1; x <= NF; x += 2) {           # odd fields are outside single quotes
        gsub(/[\051][^\050]*/, XFS, $(x)); }
}
gsub(XFS, XRS) || 1'
I did it this way with 2 gsubs just in case a single one starts emitting rows with unintended consequences. \051 = ")", \050 is the open one.
I further enhanced it by telling it to instantly print and move on if no close brackets are found at all (so there is nothing to split).
It only loops over the odd-numbered fields once the row is split by the single quote \047 (because the even-numbered ones are precisely the ones within a pair of single quotes that you want to avoid chopping).
As for XFS, just pick any combination of bytes that is almost impossible to encounter. If you want to play it safe, you can test whether XFS already exists in the row and use an alternative combo. It basically inserts a delimiter into the middle of the row that won't collide with actual input data. It's not foolproof per se, but the likelihood of running into that combination of a UTF-16 byte-order mark and ASCII control characters is reasonably low.
(And if you do encounter XFS, it's likely you already have corrupted data to begin with, since an octal \3xx byte must be followed by \2xx bytes to be valid UTF-8.)
This way, I wouldn't need FPAT at all.
*Updated with " || 1" towards the end as a safety catch-all; it shouldn't really be needed.
This is test.txt:
0x01,0xDF,0x93,0x65,0xF8
0x01,0xB0,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0xB2,0x00,0x76
If I run
awk -F, 'BEGIN{OFS=","}{$2="";print $0}' test.txt
the result is:
0x01,,0x93,0x65,0xF8
0x01,,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,,0x00,0x76
The $2 wasn't deleted, it just became empty.
I was hoping that, when printing $0, the result would be:
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
All the existing solutions are good, though this is actually a tailor-made job for cut:
cut -d, -f 1,3- file
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
If you want to remove 3rd field then use:
cut -d, -f 1,2,4- file
To remove 4th field use:
cut -d, -f 1-3,5- file
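If your cut is GNU cut, the --complement option names the field to drop instead of listing the fields to keep (not portable to BSD cut):

cut -d, --complement -f 2 file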
I believe the simplest would be to use the sub function to replace the first occurrence of the consecutive ,, (created after you make the 2nd field NULL) with a single ,. But this assumes you don't have any commas within field values.
awk 'BEGIN{FS=OFS=","}{$2="";sub(/,,/,",");print $0}' Input_file
2nd solution: OR you could use the match function to match the regex from the first comma to the next comma's occurrence, and print what comes before and after the matched string.
awk '
match($0,/,[^,]*,/){
    print substr($0,1,RSTART-1)","substr($0,RSTART+RLENGTH)
}' Input_file
It's a bit heavy-handed, but this moves each field after field 2 down a place, and then changes NF so the unwanted field is not present:
$ awk -F, -v OFS=, '{ for (i = 2; i < NF; i++) $i = $(i+1); NF--; print }' test.txt
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01
0x01,0x00,0x76
$
Tested with both GNU Awk 4.1.3 and BSD Awk ("awk version 20070501" on macOS Mojave 10.14.6 — don't ask; it frustrates me too, but sometimes employers are not very good at forward thinking). Setting NF may or may not work on older versions of Awk — I was a little surprised it did work, but the surprise was a pleasant one, for a change.
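The same shift-and-truncate idea can be parameterized; a sketch where the column to delete is passed in as a (hypothetical) del variable:

awk -F, -v OFS=, -v del=2 '{ for (i = del; i < NF; i++) $i = $(i+1); NF--; print }' test.txt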
If Awk is not an absolute requirement, and the input is indeed as trivial as in your example, sed might be a simpler solution.
sed 's/,[^,]*//' test.txt
This is especially elegant if you want to remove the second field. A more generic approach to remove the nth field would require you to write a regex which matches the first n - 1 fields followed by the nth, then replace that with just the first n - 1.
So for n = 4 you'd have
sed 's/\([^,]*,[^,]*,[^,]*,\)[^,]*,/\1/' test.txt
or more generally, if your sed dialect understands braces for specifying repetitions
sed 's/\(\([^,]*,\)\{3\}\)[^,]*,/\1/' test.txt
Some sed dialects allow you to lose all those pesky backslashes with an option like -r or -E but again, this is not universally supported or portable.
In case it's not obvious, [^,] matches a single character which is not (newline or) comma; and \1 recalls the text from first parenthesized match (back reference; \2 recalls the second, etc).
Also, this is completely unsuitable for escaped or quoted fields (though I'm not saying it can't be done). Every comma acts as a field separator, no matter what.
With GNU sed you can add a number modifier to substitute nth match of non-comma characters followed by comma:
sed -E 's/[^,]*,//2' file
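The occurrence number can also come from a shell variable, giving a generic field remover; a sketch that works for any field except the last (which has no trailing comma):

n=3
sed -E "s/[^,]*,//$n" file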
Using awk in a regex-free way, with the option to choose which line will be deleted:
awk '{ col = 2; n = split($0,arr,","); line = ""; for (i = 1; i <= n; i++) line = line ( i == col ? "" : ( line == "" ? "" : "," ) arr[i] ); print line }' test.txt
Step by step:
{
col = 2 # defines which column will be deleted
n = split($0,arr,",") # each line is split into an array
# n is the number of elements in the array
line = "" # this will be the new line
for (i = 1; i <= n; i++) # roaming through all elements in the array
line = line ( i == col ? "" : ( line == "" ? "" : "," ) arr[i] )
# appends a comma (except if line is still empty)
# and the current array element to the line (except when on the selected column)
print line # prints line
}
Another solution:
You can just pipe the awk output to sed and squeeze the delimiters.
$ awk -F, 'BEGIN{OFS=","}{$2=""}1 ' edward.txt | sed 's/,,/,/g'
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
$
Commenting on the first solution of @RavinderSingh13, which uses the sub() function:
awk 'BEGIN{FS=OFS=","}{$2="";sub(/,,/,",");print $0}' Input_file
The gnu-awk manual: https://www.gnu.org/software/gawk/manual/html_node/Changing-Fields.html
"It is important to note that making an assignment to an existing field changes the value of $0 but does not change the value of NF, even when you assign the empty string to a field." (4.4 Changing the Contents of a Field)
So, following the first solution of RavinderSingh13 but without using sub() in this case: "the field is still there; it just has an empty value, delimited by the two colons" (commas, in our case):
awk 'BEGIN {FS=OFS=","} {$2="";print $0}' file
0x01,,0x93,0x65,0xF8
0x01,,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,,0x00,0x76
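A quick way to see that, printing NF alongside each record (illustrative sketch):

awk 'BEGIN {FS=OFS=","} {$2=""; print NF": "$0}' file
5: 0x01,,0x93,0x65,0xF8
10: 0x01,,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
4: 0x01,,0x00,0x76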
My solution:
awk -F, '
{
    regex = "^"$1","$2
    sub(regex, $1, $0);
    print $0;
}'
or one line code:
awk -F, '{regex="^"$1","$2;sub(regex, $1, $0);print $0;}' test.txt
I found that OFS="," was not necessary, since sub() edits $0 as a string and awk never rebuilds the record from its fields.
I would do it following way, let file.txt content be:
0x01,0xDF,0x93,0x65,0xF8
0x01,0xB0,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0xB2,0x00,0x76
then
awk 'BEGIN{FS=",";OFS=""}{for(i=2;i<=NF;i+=1){$i="," $i};$2="";print}' file.txt
output
0x01,0x93,0x65,0xF8
0x01,0x01,0x03,0x02,0x00,0x64,0x06,0x01,0xB0
0x01,0x00,0x76
Explanation: I set OFS to nothing (an empty string), then for the 2nd and following columns I prepend , to each value. Finally I set the 2nd field (now a comma plus its value) to nothing. Keep in mind this solution would need rework if you wish to remove the 1st column.
How could I replace one character with another in the last word of each of the last two lines of a text file in shell, using only a single command? In my case, replacing every occurrence of a with E in the last word only.
Like, from a text file containing this:
tree;apple;another
mango.banana.half
monkey.shelf.karma
to this:
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
I tried using sed -n 'tail -2 'mytext.txt' -r 's/[a]+/E/*$//' but it doesn't work (my error: sed expression #1, char 10: unknown option to 's).
Could you please try the following tac + awk solution? It is completely based on OP's samples only.
tac Input_file |
awk 'FNR<=2{if(/;/){FS=OFS=";"};if(/\./){FS=OFS="."};gsub(/a/,"E",$NF)} 1' |
tac
Output with shown samples is:
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
NOTE: Change gsub to sub in case you want to substitute only very first occurrence of character a in last field.
This might work for you (GNU sed):
sed -E 'N;${:a;s/a([^a.]*)$/E\1/mg;ta};P;D' file
Open a two-line window throughout the length of the file by using the N command to append the next line to the previous one, and the P and D commands to print then delete the first of them. Thus at the end of the file, signified by the $ address, the last two lines will be present in the pattern space.
Using the m multiline flag on the substitution command, as well as the g global flag and a loop between :a and ta, replace any a in the last word (delimited by .) by an E.
Thus the first pass of the substitution command will replace the a in half and the last a in karma. The next pass will match nothing in the penultimate line and replace the remaining a in karmE. The third pass will match nothing, thus the ta command will fail and the last two lines will be printed with the required changes.
If you want to use Sed, here's a solution:
tac input_file | sed -E '1,2{h;s/.*[^a-zA-Z]([a-zA-Z]+)/\1/;s/a/E/;x;s/(.*[^a-zA-Z]).*/\1/;G;s/\n//}' | tac
One tiny detail. In your question you say you want to replace a letter, but then you transform karma into kErmE, so which is it? If you meant to write kErma, the command above will work; if you meant kErmE, then you have to change it just a bit: s/a/E/ should become s/a/E/g.
With tac+perl
$ tac ip.txt | perl -pe 's/\w+\W*$/$&=~tr|a|E|r/e if $.<=2' | tac
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
\w+\W*$ match last word in the line, \W* allows any possible trailing non-word characters to be matched as well. Change \w and \W accordingly if numbers and underscores shouldn't be considered as word characters - for ex: [a-zA-Z]+[^a-zA-Z]*$
$&=~tr|a|E|r change all a to E only for the matched portion
e flag to enable use of Perl code in replacement section instead of string
To do it in one command, you can slurp the entire input as single string (assuming this'll fit available memory):
perl -0777 -pe 's/\w+\W*$(?=(\n.*)?\n\z)/$&=~tr|a|E|r/gme'
Using GNU awk for split()'s 4th arg, since per the comments on another solution the field delimiter is any sequence of non-alphanumeric characters:
$ gawk '
BEGIN {
    pc=2                                       # previous counter, ie how many lines are affected
}
{
    for(i=pc;i>=1;i--)                         # buffer to p hash, a FIFO
        if(i==pc && (i in p))                  # when full, output
            print p[i]
        else if(i in p)                        # and keep filling
            p[i+1]=p[i]                        # above could be done using mod also
    p[1]=$0
}
END {
    for(i=pc;i>=1;i--) {
        n=split(p[i],t,/[^a-zA-Z0-9\r]+/,seps) # split on non-alnum
        gsub(/a/,"E",t[n])                     # replace
        for(j=1;j<=n;j++) {
            p[i]=(j==1?"":p[i] seps[j-1]) t[j] # pack it up
        }
        print p[i]                             # output
    }
}' file
Output:
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
Would this help you? Using GNU awk:
$ cat file
tree;apple;another
mango.banana.half
monkey.shelf.karma
$ tac file | awk 'NR<=2{s=gensub(/(.*)([.;])(.*)$/,"\\3",1);gsub(/a/,"E",s); print gensub(/(.*)([.;])(.*)$/,"\\1\\2",1) s;next}1' | tac
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
More readable version:
$ tac file | awk 'NR<=2{
s=gensub(/(.*)([.;])(.*)$/,"\\3",1);
gsub(/a/,"E",s);
print gensub(/(.*)([.;])(.*)$/,"\\1\\2",1) s;
next
}1' | tac
With GNU awk you can set FS to the two separators, then gsub for the replacement in $3, the third field, when NR>1:
awk -v FS=";|[.]" 'NR>1 {gsub("a", "E",$3)}1' OFS="." file
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
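Note that NR>1 works only because the sample has exactly three lines; for an arbitrary file the total line count must be known up front. A sketch using wc, still assuming the modified lines use . as their separator:

awk -v n="$(wc -l < file)" 'NR > n-2 { gsub("a", "E", $NF) } 1' FS=";|[.]" OFS="." file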
With GNU awk for the 3rd arg to match() and gensub():
$ awk -v n=2 '
NR>n { print p[NR%n] }
{ p[NR%n] = $0 }
END {
for (i=0; i<n; i++) {
match(p[i],/(.*[^[:alnum:]])(.*)/,a)
print a[1] gensub(/a/,"E","g",a[2])
}
}
' file
tree;apple;another
mango.banana.hElf
monkey.shelf.kErmE
or with any awk:
awk -v n=2 '
NR>n { print p[NR%n] }
{ p[NR%n] = $0 }
END {
for (i=0; i<n; i++) {
match(p[i],/.*[^[:alnum:]]/)
lastWord = substr(p[i],1+RLENGTH)
gsub(/a/,"E",lastWord )
print substr(p[i],1,RLENGTH) lastWord
}
}
' file
If you want to do it for the last 50 lines of a file instead of the last 2 lines just change -v n=2 to -v n=50.
The above assumes there are at least n lines in your input.
You can let sed repeatedly change an a into E in the last word only, by using a label:
tac mytext.txt| sed -r ':a; 1,2s/a(\w*)$/E\1/; ta' | tac
Fairly new to the awk command and still playing with it, I am trying to display multiple lines of a file, let's say lines 3-5, with each line reversed word by word. So with the given file:
Hello World
How are you
I love computer science,
I am using awk,
And it is hard.
And it should output:
science, computer love I
awk, using am I
hard. is it And
Any step in the correct direction will be beneficial!!
The following awk may help you, where I am using start and end variables to select only those lines that need to be printed.
awk -v start=3 -v end=5 'FNR>=start && FNR<=end{for(;NF>0;NF--){printf("%s%s",$NF,NF==1?RS:FS)}}' Input_file
Output will be as follows.
science, computer love I
awk, using am I
hard. is it And
Explanation: now adding an explanation of the solution too.
awk -v start=3 -v end=5 ' ##Setting variables named start and end, denoting the first and last lines that we want to print.
FNR>=start && FNR<=end{ ##Checking whether FNR (a built-in awk variable) is greater than or equal to start AND less than or equal to end. If the condition is TRUE then do the following:
for(;NF>0;NF--){ ##Starting a for loop which runs from the value of NF (number of fields, a built-in awk variable) down to 1.
printf("%s%s",$NF,NF==1?RS:FS)} ##Printing the value of the last field $NF, followed by a field separator, or a newline once we reach the first field.
}
' Input_file ##Mentioning the Input_file name here.
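If the lines outside the range should pass through unchanged instead of being dropped, a small variation of the same idea (a sketch):

awk -v start=3 -v end=5 '
FNR < start || FNR > end { print; next } ##Print lines outside the range untouched.
{ for (i = NF; i > 0; i--) printf("%s%s", $i, (i > 1 ? FS : RS)) } ##Otherwise print the fields in reverse order.
' Input_file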
$ cat tst.awk
NR>2 && NR<6 {
for (i=NF; i>0; i--) {
printf "%s%s", $i, (i>1?OFS:ORS)
}
}
$ awk -f tst.awk file
science, computer love I
awk, using am I
hard. is it And
You can use the following awk command to reach your goal:
input:
$ cat text
Hello World
How are you
I love computer science,
I am using awk,
And it is hard.
output:
$ awk 'NR<3{print}NR>=3{for(i=0; i<NF; i++){printf "%s ",$(NF-i);} printf "\n";}' text
Hello World
How are you
science, computer love I
awk, using am I
hard. is it And
Explanations:
NR<3{print} will print the first 2 lines in the correct order
NR>=3{for(i=0; i<NF; i++){printf "%s ",$(NF-i);} printf "\n";} from the 3rd line on, there is a loop over all the fields (NF of them) which prints them one after another from the last to the first ($NF is the last one, $1 is the first one), separating each field with a space. Last but not least, after the loop it prints an EOL char.
Now, if you do not need to print the first 2 lines use:
$ awk 'NR>=3{for(i=0; i<NF; i++){printf "%s ",$(NF-i);} printf "\n";}' text
science, computer love I
awk, using am I
hard. is it And
For files with more lines for which you want to print only a range (3-5) use:
$ awk 'NR>=3 && NR<=5{for(i=0; i<NF; i++){printf "%s ",$(NF-i);} printf "\n";}' text
I'd like to transform a table in such a way that for duplicated values in column #2 it would collect the corresponding values from column #1.
I.e. something like this:
MZ00024296 AC148152.3_FG005
MZ00047079 AC148152.3_FG006
MZ00028122 AC148152.3_FG008
MZ00032922 AC148152.3_FG008
MZ00048218 AC148152.3_FG008
MZ00024680 AC148167.6_FG001
MZ00013456 AC149475.2_FG003
to
AC148152.3_FG005 MZ00024296
AC148152.3_FG006 MZ00047079
AC148152.3_FG008 MZ00028122|MZ00032922|MZ00048218
AC148167.6_FG001 MZ00024680
AC149475.2_FG003 MZ00013456
As I need it for computations in R, I tried to use:
x=aggregate(mz_grmz,by=list(mz_grmz[,2]),FUN=paste(mz_grmz[,1],sep="|"))
but it doesn't work (wrong function):
Error in match.fun(FUN) :
'paste(mz_grmz[, 1], sep = "|")' is not a function, character or symbol
I also remembered the unstack() function, but it isn't what I need.
I tried to do it using awk; based on my basic knowledge I reworked the code given here:
site1
#! /bin/sh
for y do
awk -v FS="\t" '{
for (x=1;x<=NR;x++) {
if (NR>2 && x=x+1) {
print $2"\t"x
}
else {print NR}
}
}' $y > $y.2
done
Unfortunately it doesn't work; it only produces an enormous file with field #2 and some numbers.
I suppose it is an easy task, but it is above my skills right now.
Could somebody give me a hint? Maybe just function to use in aggregate in R.
Thanks
You could do it in awk like this:
awk '
{
    if ($2 in a)
        a[$2] = a[$2] "|" $1
    else
        a[$2] = $1
}
END {
    for (i in a)
        print i, a[i]
}' INFILE > OUTFILE
To keep the output the same as the text in your question (empty lines etc.):
awk '{if($0 &&($2 in a))a[$2]=a[$2]"|"$1;else if ($0) a[$2]=$1;}\
END{for(x in a){print x,a[x];print ""}}' inputFile
test:
kent$ echo "MZ00024296 AC148152.3_FG005
MZ00047079 AC148152.3_FG006
MZ00028122 AC148152.3_FG008
MZ00032922 AC148152.3_FG008
MZ00048218 AC148152.3_FG008
MZ00024680 AC148167.6_FG001
MZ00013456 AC149475.2_FG003"|awk '{if($0 &&($2 in a))a[$2]=a[$2]"|"$1;else if ($0) a[$2]=$1;}END{for(x in a){print x,a[x];print ""}}'
AC149475.2_FG003 MZ00013456
AC148152.3_FG005 MZ00024296
AC148152.3_FG006 MZ00047079
AC148152.3_FG008 MZ00028122|MZ00032922|MZ00048218
AC148167.6_FG001 MZ00024680
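Note that for (i in a) visits keys in an unspecified order, which is why AC149475.2_FG003 comes out first in the test above. If the output should follow the input's first-seen order, one portable option is to record that order as you go (a sketch):

awk '
NF && !($2 in a) { order[++n] = $2 }           # remember keys in first-seen order
NF { a[$2] = ($2 in a ? a[$2] "|" : "") $1 }   # append $1, |-separated
END { for (i = 1; i <= n; i++) print order[i], a[order[i]] }
' INFILE > OUTFILE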
This GNU sed solution might work for you:
sed -r '1{h;d};H;${x;s/(\S+)\s+(\S+)/\2\t\1/g;:a;s/(\S+\t)([^\n]*)(\n+)\1([^\n]*)\n*/\1\2|\4\3/;ta;p};d' input_file
Explanation: Use the extended regex option -r to make the regexes more readable. Read the whole file into the hold space (HS). Then, on end-of-file, switch to the HS and first swap the two fields, separating them with a tab. Then compare the first fields of adjacent lines and, if they match, append the second field of the second record to the first line, separated by a |. Repeat until no further adjacent lines have duplicate first fields, then print the file out.
Explanation: Use the extended regex option-r to make regex's more readable. Read the whole file into the hold space (HS). Then on end-of-file, switch to the HS and firstly swap and tab separate fields. Then compare the first fields in adjacent lines and if they match, tag the second field from the second record to the first line separated by a |. Repeated until no further adjacent lines have duplicate first fields then print the file out.