awk: first, split a line into separate lines; second, use those new lines as new input

Let's say I have this line:
foo|bar|foobar
I want to split it at | and then use those 3 new lines as the input for further processing (let's say, replace bar with xxx).
Sure, I can pipe two awk instances, like this:
echo "foo|bar|foobar" | awk '{gsub(/\|/, "\n"); print}' | awk '/bar/ {gsub(/bar/, "xxx"); print}'
But how can I achieve this in one script? First, do one operation on some input, and then treat the result as the new input for the second operation?
I tried something like this:
echo "foo|bar|foobar" | awk -v c=0 '{
{
gsub(/\|/, "\n");
sprintf("%s", $0);
}
{
if ($0 ~ /bar/) {
c+=1;
gsub(/bar/, "xxx");
print c;
print
}
}
}'
Which results in this:
1
foo
xxx
fooxxx
Thanks to the counter c, it's clear that the subsequent if doesn't treat the multi-line input it receives as several new records, but just as one multi-line record.
Thus, my question is: how to tell awk to treat this new multi-line record it receives as many single-line records?
The desired output for this example should be something like this:
1
xxx
2
fooxxx
But this is just an example, the question is more about the mechanics of such a transition.

I would suggest an alternative approach using split(): split the line on the delimiter into an array and iterate over its elements, instead of working on a single multi-line string.
echo "foo|bar|foobar" |\
awk '{
count = 0
n = split($0, arr, "|")
for ( i = 1; i <= n; i++ )
{
if ( arr[i] ~ /bar/ )
{
count += sub(/bar/, "xxx", arr[i])
print count
print arr[i]
}
}
}'
Also, you don't need an explicit increment of the count variable: sub() returns the number of substitutions made (0 or 1 here), so you can simply add its return value to count.
As a further simplification, you can drop the ~ match in the if condition and use the return value of sub() directly:
if (sub(/bar/, "xxx", arr[i])) {
    count++
    print count
    print arr[i]
}
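With the sample input, either version should print the desired output:
1
xxx
2
fooxxx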

If you set the record separator (RS) to the pipe character, you almost get the desired effect, e.g.:
echo 'foo|bar|foobar' | awk -v RS='|' 1
Output:
foo
bar
foobar
(followed by an empty line)
Except that the trailing newline character becomes part of the last record, so there is an extra empty line at the end of the output. You can work around this either by including the newline in the RS variable, making it less portable, or by avoiding sending newlines to awk at all.
For example using the less portable way:
echo 'foo|bar|foobar' | awk -v RS='\\||\n' '{ sub(/bar/, "baz") } 1'
Output:
foo
baz
foobaz
Note that the empty record at the end is ignored.
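The other workaround is to not send a trailing newline in the first place, e.g. by using printf instead of echo, which keeps RS a plain single character:
printf 'foo|bar|foobar' | awk -v RS='|' '{ sub(/bar/, "baz") } 1'
The output is the same, with no empty record at the end.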

With GNU awk:
$ awk -v RS='[|\n]' 'gsub(/bar/,"xxx"){print ++c ORS $0}' file
1
xxx
2
fooxxx
With any awk:
$ awk -F'|' '{c=0; for (i=1;i<=NF;i++) if ( gsub(/bar/,"xxx",$i) ) print ++c ORS $i }' file
1
xxx
2
fooxxx

Related

assigning a var inside AWK for use outside awk

I am using ksh on AIX.
I have a file with multiple comma delimited fields. The value of each field is read into a variable inside the script.
The last field in the file may contain multiple | delimited values. I need to test each value and keep the first one that doesn't begin with R, then stop testing the values.
sample value of $principal_diagnosis0
R65.20|A41.9|G30.9|F02.80
I've tried:
echo $principal_diagnosis0 | awk -F"|" '{for (i = 1; i<=NF; i++) {if ($i !~ "R"){echo $i; primdiag = $i}}}'
but I get this message : awk: Field $i is not correct.
My goal is to have a variable that I can use outside of the awk statement that gets assigned the first non-R code (in this case it would be A41.9).
echo $principal_diagnosis0 | awk -F"|" '{for (i = 1; i<=NF; i++) {if ($i !~ "R"){print $i}}}'
gets me the output of :
A41.9
G30.9
F02.80
So I know it's reading the values and evaluating properly. But I need to stop after the first match and be able to use that value outside of awk.
Thanks!
To answer your specific question:
$ principal_diagnosis0='R65.20|A41.9|G30.9|F02.80'
$ foo=$(echo "$principal_diagnosis0" | awk -v RS='|' '/^[^R]/{sub(/\n/,""); print; exit}')
$ echo "$foo"
A41.9
The above will work with any awk, you can do it more briefly with GNU awk if you have it:
foo=$(echo "$principal_diagnosis0" | awk -v RS='[|\n]' '/^[^R]/{print; exit}')
You can make FS and OFS do all the hard work:
echo "${principal_diagnosis0}" |
mawk NF=NF FS='^(R[^|]+[|])+|[|].+$' OFS=
A41.9
——————————————————————————————————————————
Another slightly different variation of the same concept, overwriting fields but leaving OFS as is:
gawk -F'^.*R[^|]+[|]|[|].+$' '$--NF=$--NF'
A41.9
This works because, when you break it out:
gawk -F'^.*R[^|]+[|]|[|].+$' '
{ print NF
} $(_=--NF)=$(__=--NF) { print _, __, NF, $0 }'
3
1 2 1 A41.9
you'll notice you start with NF = 3, and the two decrements make the assignment equivalent to $1 = $2;
but since NF ends up reduced to just 1, a single copy is printed instead of two.
Which means you can also make it $0 = $2, as such:
gawk -F'^.*R[^|]+[|]|[|].+$' '$-_=$--NF'
A41.9
——————————————————————————————————————————
A third variation, this time using RS instead of FS (a regular-expression RS is a gawk/mawk extension, not POSIX):
mawk NR==2 RS='^.*R[^|]+[|]|[|].+$'
A41.9
——————————————————————————————————————————
And if you really don't want to mess with FS/OFS/RS, use gsub() instead:
nawk 'gsub("^.*R[^|]+[|]|[|].+$",_)'
A41.9

Keep current and previous line only if current line fulfills a given condition

I have a file which looks like this:
>4RYF_1
MAENTKNENITNILTQKLIDTRTVLIYGEINQELAEDVSKQLLLLESISNDPITIFINSQGGHVEAGDTIHDMIKFIKPTVKVVGTGWVASAGITIYLAAEKENRFSLPNTRYMIHQPAGGVQGQSTEIEIEAKEIIRMRERINRLIAEATGQSYEQISKDTDRNFWLSVNEAKDYGIVNEIIENRDGLKMASWSHPQFEK
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDIFLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEIMIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEAKDYGLIDDIIINKSGLKGHHHHHH
I want to keep the sequence and previous line only if the sequence has a given length. For selecting only lines with that condition I use:
awk 'length($0) > 50 && length($0) <= 800' sample.txt
But how can I keep lines starting with > as well if this condition is met?
Yet another awk solution:
awk '/^>/ { header = $0; next } length > 50 && length <= 800 { print header ORS $0 }' sample.txt
Would you please try the following:
awk -v RS='>' -F'\n' '
length($2) > 50 && length($2) <= 800 {printf ">%s", $0}
' sample.txt
Assigning RS to '>' tells awk to split the file on > into records,
putting the header line and the sequence line in the same record.
Assigning FS to '\n' splits the record into the header and the
sequence, assigning the header to $1 and the sequence to $2. (The
empty first record before the leading > is skipped, since its $2 has length 0.)
As the leading > is consumed as the record separator, we need to prepend it
when printing the matched records.
Here is a grep one-liner:
LANG=C grep -B1 '^.\{51,800\}$' < sample.txt
The command was really slow with LANG=en_US.UTF-8 (my default), so I use LANG=C instead.
man grep tells you that -B NUM means "print NUM lines of leading context before matching lines".
'^' means start of line,
'.' means any character,
'\{51,800\}' means between 51 and 800 repetitions of the previous thing,
'$' means end of line.
In other words, we match lines that are between 51 and 800 characters long, and print each one together with the previous line.
A potential solution with AWK (note that the unconditional getline assumes header and sequence lines strictly alternate) is:
awk '!/^>/ {next}; {getline s}; length(s) > 50 && length(s) <= 800 { print $0 "\n" s }' example.fasta
e.g. if example.fasta contains
>4RYF_1
WLSVNEAKDYGIVNEIIENRDGLKMASWSHPQFEK
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDIFLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEIMIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEAKDYGLIDDIIINKSGLKGHHHHHH
>1000_chars
YiJOgeCApTkcJWxIuvooOxuqVnPdSLtOQmUfnzpBvcpYKyCvelFwKgMchYFnlvuZwVxNcnSvGcACsMywDQVvYBAiaIesQkLkYNsExRbqKPZIPnCRMAFHLmIzxIBqLwoNEPSKMZCTpwbbQCNrHSrbDMtCksTjvQsMeAkoudRGUJnPpQTEzwwnKoZBHtpMSIQBfYSPDYHwKktvCiFpewrsdDTQpqBajOWZkKURaKszEqDmdYMkzSAkMtlkXPfHroiTbyxZwzvrrMSXMRSavrBdgVYZanudjacRHWfpErJMkomXpzagXIzwbaeFgAgFnMxLuQHsdvZysqAsngkCZILvVLaFpkWnOpuYensROwkhwqUdngvlTsXBoCBwJUENUFgVdnSnxVOvfksyiabglFPqmSwhGabjNZiWGyvktzSDOQNGlEvoxhJCAOhxVAtZfyimzsziakpzfIszSWYVgKZTHatWSfttHYTkvgafcsVmitfEfQDuyyDAAAoTKpuhLrnHVFKgmEsSgygqcNLQYkpnhOosKiZJKpDolXcxAKHABtALqVXoVcSHpskrpWPrkkZLTpUXkENhnesmoQjonLWxkpcuJrOosXKNTDNuZaWIEtrDILXsIFTjAnrnwJBoirgNHcDURwDIzAXJSLPLmWkurOhWSLPrIOyqNvADBdIFaCGoZeewKleBHUGmKFWFcGgZIGUdOHwwINZqcOClPAjYaLNdLgDsUNCPwKMrOXJEyPvMRLaTJGgxzeoLCggJYTVjlJpyMsoCRZBDrBDckNMhJSQWBAxYBlqSpXnpmLeEJYirwjfCqZGBZdgkHzWGoAMxgNKHOAvGXsIbbuBjeeORhZaIrruBwDfzgTICuwWCAhCPqMqkHrxkQMZbXUIavknNhuIycoDssXlOtbSWsxVXQhWMyDQZWDlEtewXWKBPUcHDYWWgyOerbnoAxrnpsCulOxqxdywFJFoeWNpVGIPMUJSWwvlVDWNkjIBMlXPi
It will only print
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDIFLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEIMIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEAKDYGLIDDIIINKSGLKGHHHHHH
Edit
The method I would recommend, to handle edge cases better, is to use purpose-built bioinformatics software, e.g. seqkit:
seqkit seq -m 50 -M 800 example.fasta
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDI
FLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEI
MIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEA
KDYGLIDDIIINKSGLKGHHHHHH
Is perl an option?
perl -nle '$prev && print if length() > 50 and length() <= 800 && print $prev; $prev = $_' input_file
$prev - a variable holding the previous line. When the length condition is met and there is a previous line, it prints $prev and then prints the current line.
$prev = $_ assigns the current line to $prev for the next iteration.
If the upper limit 800 is not essential, could sed be an option?
$ sed -En '/>/ {N;/[a-zA-Z0-9]{50,}/p}' input_file
/>/ - match > and read the line into the pattern space
N; - append the next line to the pattern space
{50,} - match if the preceding character class repeats 50 or more times
p - print the pattern space
Output
>4RYF_2
MNLIPTVIEQTSRGERAYDIYSRLLKDRIIMLGSAIDDNVANSIVSQLLFLDAQDPEKDIFLYINSPGGSISAGMAIYDTMNFVKADVQTIGMGMAASMGSFLLTAGANGKRFALPNAEIMIHQPLGGAQGQATEIEIAARHILKIKERMNTIMAEKTGQPYEVIARDTDRDNFMTAQEAKDYGLIDDIIINKSGLKGHHHHHH
With your shown samples, please try the following awk code, written and tested with GNU awk.
awk -v RS= '
{
    val = ""
    delete arr
    while (match($0, />[^\n]*\n*[^\n]*/)) {
        val = substr($0, RSTART, RLENGTH)
        split(val, arr, "\n")
        if (length(arr[2]) > 50 && length(arr[2]) <= 800) {
            print val
        }
        $0 = substr($0, RSTART+RLENGTH)
    }
}
' Input_file
If only the next line should meet the length restrictions, you can match and store the line that starts with > in a variable, for example previous.
Then, for the next line, check the length and whether the previous line is non-empty.
If it is not empty, print the previous and the current line.
At the end, reset previous to an empty string.
awk '{
    if (/^>/) {
        previous = $0
        next
    }
    if (length(previous) != 0 && length($0) > 50 && length($0) <= 800) {
        print previous ORS $0
    }
    previous = ""
}' sample.txt
With the sample file this prints both records, as both sequences fall within the length bounds.

selecting columns in awk discarding corresponding header

How to properly select columns in awk after some processing. My file here:
cat foo
A;B;C
9;6;7
8;5;4
1;2;3
I want to add a first column with line numbers and then extract some columns of the result. For the example let's get the new first (line numbers) and third columns. This way:
awk -F';' 'FNR==1{print "linenumber;"$0;next} {print FNR-1,$1,$3}' foo
gives me this unexpected output:
linenumber;A;B;C
1 9 7
2 8 4
3 1 3
but expected is (note B is now the third column as we added linenumber as first):
linenumber;B
1;6
2;5
3;2
To get your expected output, use:
$ awk 'BEGIN {
    FS=OFS=";"
}
{
    print (FNR==1?"linenumber":FNR-1), $2
}' file
Output:
linenumber;B
1;6
2;5
3;2
To add a column with line number and extract first and last columns, use:
$ awk 'BEGIN {
FS=OFS=";"
}
{
print (FNR==1?"linenumber":FNR-1),$1,$NF
}' file
Output this time:
linenumber;A;C
1;9;7
2;8;4
3;1;3
Why do you print $0 (the complete record) in your header line? And if you want only two columns in your output, why do you print three values (FNR-1, $1 and $3)? Finally, the reason your output field separators are spaces instead of the expected ; is simply that you did not specify the output field separator (OFS). You can set it with a command-line variable assignment (OFS=\;), as shown in the second and third versions below, with the -v option (-v OFS=\;), or in a BEGIN block (BEGIN {OFS=";"}), as you wish (there are differences between these three methods but they don't matter here).
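For reference, the three equivalent ways of setting OFS look like this (where ... stands for the program body):
$ awk -v OFS=\; '...' foo
$ awk 'BEGIN {OFS=";"} ...' foo
$ awk '...' OFS=\; foo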
[EDIT]: see a generic solution at the end.
If the field you want to keep is the second of the input file (the B column), try:
$ awk -F\; 'FNR==1 {print "linenumber;" $2; next} {print FNR-1 ";" $2}' foo
linenumber;B
1;6
2;5
3;2
or
$ awk -F\; 'FNR==1 {print "linenumber",$2; next} {print FNR-1,$2}' OFS=\; foo
linenumber;B
1;6
2;5
3;2
Note that, as long as you don't want to keep the first field of the input file ($1), you could as well overwrite it with the line number:
$ awk -F\; '{$1=FNR==1?"linenumber":FNR-1; print $1,$2}' OFS=\; foo
linenumber;B
1;6
2;5
3;2
Finally, here is a more generic solution to which you can pass the list of indexes of the columns of the input file you want to print (1 and 3 in this example):
$ awk -F\; -v cols='1;3' '
BEGIN { OFS = ";"; n = split(cols, c); }
{ printf("%s", FNR == 1 ? "linenumber" : FNR - 1);
for(i = 1; i <= n; i++) printf("%s", OFS $(c[i]));
printf("\n");
}' foo
linenumber;A;C
1;9;7
2;8;4
3;1;3
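And passing cols='2' should reproduce exactly the expected output from the question:
$ awk -F\; -v cols='2' '
BEGIN { OFS = ";"; n = split(cols, c); }
{ printf("%s", FNR == 1 ? "linenumber" : FNR - 1);
for(i = 1; i <= n; i++) printf("%s", OFS $(c[i]));
printf("\n");
}' foo
linenumber;B
1;6
2;5
3;2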

how to use sed/awk to do math arithmetic from a file

I have a file test.txt with multiple lines sharing the same pattern:
a:1;qty=2;px=3;d=4;
a:5;qty=6;px=7;d=8;
a:9;qty=10;px=11;d=12;
And I would like to write a simple terminal linux cmd using sed/awk to calculate (2*3+6*7+10*11)/(2+6+10), which is sum(qty*px)/sum(qty).
How can I retrieve the value of qty and px in each line, and then use awk to store the values and do the final calculation?
Thanks,
One way if no empty lines:
awk -F"[=;]" '{x+=$3;y+=$3*$5}END{print y/x}' file
If empty lines are present:
awk -F"[=;]" '!/^$/{x+=$3;y+=$3*$5}END{print y/x}' file
If that's the general pattern, then the following one-liner should suffice:
cat test.txt | sed 's/[a-zA-Z]*[:=]//g' | awk -F';' '{ s1 += $2*$3; s2 += $2; }; END { print s1/s2; }'
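To see what the sed step feeds to awk, run it on its own; it strips the alphabetic keys and leaves only the numbers:
$ sed 's/[a-zA-Z]*[:=]//g' test.txt
1;2;3;4;
5;6;7;8;
9;10;11;12;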
In case the keys are not always in the same order, you can do
awk -F "[=: ]*" '{ for( i=2; i<=NF;i+=2) a[$i]=$(i+1) }
{ num += a["px"]*a["qty"]; den+=a["qty"]}
END { print num/den }' file
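As a sanity check: for the sample file all of these compute (2*3 + 6*7 + 10*11) / (2 + 6 + 10) = 158/18 ≈ 8.77778.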

Is there a way to completely delete fields in awk, so that extra delimiters do not print?

Consider the following command:
$ gawk -F"\t" "BEGIN{OFS=\"\t\"}{$2=$3=\"\"; print $0}" Input.tsv
When I set $2 = $3 = "", the intended effect is to get the same effect as writing:
print $1,$4,$5...$NF
However, what actually happens is that I get two empty fields, with the extra field delimiters still printing.
Is it possible to actually delete $2 and $3?
Note: If this was on Linux in bash, the correct statement above would be the following, but Windows does not handle single quotes well in cmd.exe.
$ gawk -F'\t' 'BEGIN{OFS="\t"}{$2=$3=""; print $0}' Input.tsv
This is an oldie but goodie.
As Jonathan points out, you can't delete fields in the middle, but you can replace their contents with the contents of other fields. And you can make a reusable function to handle the deletion for you.
$ cat test.awk
function rmcol(col, i) {
    # shift every field after "col" one position to the left...
    for (i=col; i<NF; i++) {
        $i = $(i+1)
    }
    # ...then drop the now-duplicated last field
    NF--
}
{
    rmcol(3)
}
1   # always-true pattern: print the modified record
$ printf 'one two three four\ntest red green blue\n' | awk -f test.awk
one two four
test red blue
You can't delete fields in the middle, but you can delete fields at the end, by decrementing NF.
So you can shift all the later fields down to overwrite $2 and $3 then decrement NF by two, which erases the last two fields:
$ echo 1 2 3 4 5 6 7 | awk '{for(i=2; i<NF-1; ++i) $i=$(i+2); NF-=2; print $0}'
1 4 5 6 7
If you are just looking to remove columns, you can use cut:
$ cut -f 1,4- file.txt
To emulate cut:
$ awk -F "\t" '{ for (i=1; i<=NF; i++) if (i != 2 && i != 3) { if (i == NF) printf $i"\n"; else printf $i"\t" } }' file.txt
Similarly:
$ awk -F "\t" '{ delim =""; for (i=1; i<=NF; i++) if (i != 2 && i != 3) { printf delim $i; delim = "\t"; } printf "\n" }' file.txt
HTH
The only way I can think to do it in Awk without using a loop is to use gsub on $0 to combine adjacent FS:
$ echo {1..10} | awk '{$2=$3=""; gsub(FS"+",FS); print}'
1 4 5 6 7 8 9 10
One way could be to blank the fields as you do, and then squeeze the resulting whitespace runs into single tabs with gsub (note that \s is a GNU awk extension):
$ awk 'BEGIN { FS = "\t" } { $2 = $3 = ""; gsub( /\s+/, "\t" ); print }' input-file
In addition to the answer by Suicidal Steve, I'd like to suggest one more solution, using sed instead of awk.
It looks more complicated than the cut suggested by Steve, but it can be the better choice because sed -i allows editing in place.
$ sed -i 's/\(.*,\).*,.*,\(.*\)/\1\2/' FILENAME
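Note that the greedy .* groups make this pattern sensitive to the number of fields: on a four-field CSV line it removes fields 2 and 3, for example:
$ echo 'a,b,c,d' | sed 's/\(.*,\).*,.*,\(.*\)/\1\2/'
a,d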
To remove fields 2 and 3 from a given input file (assuming a tab field separator), you can remove the fields from $0 using gensub and regenerate it as follows:
awk -F '\t' 'BEGIN{OFS="\t"}\
{$0=gensub(/[^\t]*\t/,"",3);\
$0=gensub(/[^\t]*\t/,"",2);\
print}' Input.tsv
The method presented in ghoti's answer has some problems:
every assignment $i = $(i+1) forces awk to rebuild the record $0. This implies that if you have 100 fields and you want to delete field 10, you rebuild the record 90 times.
manually decrementing NF is not POSIX-compliant and leads to undefined behaviour (as is mentioned in the comments).
A somewhat more cumbersome but robust way to delete a set of columns would be:
a single column:
awk -v del=3 '
BEGIN { FS=fs; OFS=ofs }  # pass your separators with -v fs=... -v ofs=...
{ b=""; for(i=1;i<=NF;++i) if(i!=del) b=(b?b OFS:"") $i; $0=b }
# do whatever you want to do
' file
multiple columns:
awk -v del=3,5,7 '
BEGIN { FS=fs; OFS=ofs; del="," del "," }  # pass your separators with -v fs=... -v ofs=...
{ b=""; for(i=1;i<=NF;++i) if (del !~ ","i",") b=(b?b OFS:"") $i; $0=b }
# do whatever you want to do
' file
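For example, to drop columns 3, 5 and 7 from a semicolon-separated line (with a final 1 added in place of the comment to print the result):
$ echo 'a;b;c;d;e;f;g' | awk -v fs=';' -v ofs=';' -v del=3,5,7 '
BEGIN { FS=fs; OFS=ofs; del="," del "," }
{ b=""; for(i=1;i<=NF;++i) if (del !~ ","i",") b=(b?b OFS:"") $i; $0=b }
1'
a;b;d;f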
Well, if the goal is to remove the extra delimiters, then you can use tr on Linux. Example:
$ echo "1,2,,,5" | tr -s ','
1,2,5
echo one two three four five six | awk '{
    print $0     # the original record
    is3 = $3     # save the third field
    $3 = ""      # blank it; the surrounding OFS separators remain
    print $0     # note the doubled space where $3 used to be
    print is3
}'
one two three four five six
one two  four five six
three