printf adds mysterious trailing % character in zsh

I want to print repeated * characters in zsh. This answer has a solution that works in bash:
printf '*%.0s' {1..50}
However when I run that in zsh, I get this output:
**************************************************%
where the trailing % sign has inverted colors. This is mysterious to me and I want to know why that happens, and how do I avoid it?

That's not a character, it's the lack of a character.
If the last line of output is not terminated (i.e. does not end with a newline character, \n), zsh shows a reverse-video % sign. See http://zsh.sourceforge.net/Doc/Release/Options.html#Prompting.
The fix is to just output a terminating newline:
printf '*%.0s' {1..50}; echo
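As a minimal sketch of the same fix in a form that also runs in shells without brace expansion, the line can be built first and then printed with its terminating newline. The variable name `stars` and the use of `seq` are choices made for this illustration, not part of the original answer; the commented-out `PROMPT_EOL_MARK` line is a zsh-only setting that hides the marker instead of terminating the line.

```shell
# Build 50 asterisks: '%.0s' consumes one argument and prints nothing for it,
# so the format string is reused once per argument.
stars=$(printf '*%.0s' $(seq 1 50))

# Print the line WITH a trailing newline, so zsh sees a complete line
# and has no reason to show the reverse-video % marker.
printf '%s\n' "$stars"

# zsh only (assumption: you would rather hide the marker than add a newline):
# PROMPT_EOL_MARK=''
```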

Attempting to pass an escape char to awk as a variable

I am using this command;
awk -v regex1='new[[:blank:]]+File\(' 'BEGIN{print "Regex1 =", regex1}'
which warns me;
awk: warning: escape sequence `\(' treated as plain `('
which prints;
new[[:blank:]]+File(
I would like the value to be;
new[[:blank:]]+File\(
I've tried amending the command to account for escape chars but always get the same warning
When you run:
$ awk -v regex1='new[[:blank:]]+File\(' 'BEGIN{print "Regex1 =", regex1}'
awk: warning: escape sequence `\(' treated as plain `('
Regex1 = new[[:blank:]]+File(
you're in shell assigning a string to an awk variable. When you use -v in awk you're asking awk to interpret escape sequences in such an assignment so that \t can become a literal tab char, \n a newline, etc. but the ( in your string has no special meaning when escaped and so \( is exactly the same as (, hence the warning message.
If you want to get a literal \ character you'd need to escape it so that \\ gets interpreted as just \:
$ awk -v regex1='new[[:blank:]]+File\\(' 'BEGIN{print "Regex1 =", regex1}'
Regex1 = new[[:blank:]]+File\(
You seem to be trying to pass a regexp to awk and in my opinion once you get to needing 2 escapes your code is clearer and simpler if you put the target character into a bracket expression instead:
$ awk -v regex1='new[[:blank:]]+File[(]' 'BEGIN{print "Regex1 =", regex1}'
Regex1 = new[[:blank:]]+File[(]
If you want to assign an awk variable the value of a literal string with no interpretation of escape sequences then there are other ways of doing so without using -v, see How do I use shell variables in an awk script?.
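One such way, sketched here under the assumption that passing the value through the environment is acceptable, is to read it from ENVIRON, which does not interpret escape sequences. The variable names `with_v` and `literal` exist only for this illustration.

```shell
# -v interprets escape sequences: the two characters \t become one tab.
with_v=$(awk -v s='a\tb' 'BEGIN { print s }')

# ENVIRON hands the string to awk untouched, so the backslash survives
# and no warning is printed.
literal=$(regex1='new[[:blank:]]+File\(' \
  awk 'BEGIN { print "Regex1 =", ENVIRON["regex1"] }')

printf '%s\n' "$with_v" "$literal"
```

The second line printed is `Regex1 = new[[:blank:]]+File\(`, which is the value the question asked for.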
If you use GNU awk then you can use a regexp literal with @/.../ format and avoid double escaping:
awk -v regex1='@/new[[:blank:]]+File\(/' 'BEGIN{print "Regex1 =", regex1}'
Regex1 = new[[:blank:]]+File\(
i think gawk and mawk 1/2 are also okay with the hideous but fool-proof octal method like
-v regex1="new[[:blank:]]+File[\\050]" # note the double quotes
once the engine takes out the first \\ layer, the regex being tested against is equivalent to
/new[[:blank:]]+File[\050]/
which is as safe as it gets. Reason why it matters is that something like
/new[[:blank:]]+File[\(]/
is something mawk/mawk2 are totally cool with but gawk will give an annoying warning message. octals (or [\x28]) get rid of that cross-awk weirdness and allow the same custom string regex to be deployed across all 3
(haven't tested against less popular variants like BWK original or NAWK etc).
ps : since i'm on the subject of octal caveats, mawk/mawk2 and gawk in binary mode are cool with square bracket octals for all bytes, meaning
"[\\302-\\364][\\200-\\277]+" # this happens to be a *very* rough proxy for UTF-8
is valid for all 3. if you really want to be the hex guy, that same regex becomes
"[\\xC2-\\xF4][\\x80-\\xBF]+"
however, gawk in unicode mode will scream about locale whenever you attempt to put squares around any non-ASCII byte. To circumvent that, you'll have to just list them out with a bunch of or's like :
(\302|\303|\304.....|\364)(\200|\201......|\277)+
this way you can get gawk unicode mode to handle any arbitrary byte and also handle binary input data (whatever the circumstances caused that to happen), and perform full base64 or URI plus encoding/decoding from within (plus anything else you want, like SHA256 or LZMA etc).... So far I've even managed to get gawk in unicode mode to base64 encode an MP4 file input without gawk spitting out the "illegal multi byte" error message.
.....and also get gawk and mawk in binary modes to become mostly UTF-8 aware and safe.
The "mostly" caveat being I haven't implemented the minute details like directly doing normalization form conversions from within instead of dumping out to python3 and getting results back via getline, or keeping modifier linguistics marks with its intended character if i do a UC-safe-substring string-reversal.

Remove pattern and everything before using AWK in fasta file

I searched a lot but could not find a solution to my issue. I have a file that looks like:
>HEADER1
AACTGGTTACGTGGTTCTCT
>HEADER2
GGTTTCTC
>HEADER3
CCAGGTTTCGAGGGGTTACGGGGTA
I want to remove GGTT pattern and everything before it. So basically there are several of these patterns in some of the lines so I want to remove all of them including everything before the pattern or among them.
The desired output should look like:
>HEADER1
CTCT
>HEADER2
TCTC
>HEADER3
ACGGGGTA
I tried suggested ways but could not be able to adjust it to my data.
Thank you in Advance for your help.
If it's not possible for your headers to include GGTT, I suppose the easiest would be:
$ sed 's/.*GGTT//' file
>HEADER1
CTCT
>HEADER2
TCTC
>HEADER3
ACGGGGTA
If your headers might contain GGTT, then awk would probably be better:
$ awk '!/^>/ {sub(/.*GGTT/, "")}1' file
>HEADER1
CTCT
>HEADER2
TCTC
>HEADER3
ACGGGGTA
In both cases, the .*GGTT is "greedy", so it doesn't matter if there are multiple instances of GGTT, it will always match up to and remove everything through the last occurrence.
In the awk version, the pattern !/^>/ makes sure that the substitution is only done on lines that do not start with >.
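To see the difference between the two commands, here is a small sketch with a made-up header that happens to contain GGTT. The header name `HEADER_GGTT` and the variable names are hypothetical and used only for this illustration.

```shell
# Two-line test input: a header containing GGTT, then a sequence with
# several GGTT occurrences.
input=$(printf '>HEADER_GGTT\nCCAGGTTTCGAGGGGTTACGGGGTA\n')

# sed applies the substitution to every line, so the header is wiped out too.
sed_out=$(printf '%s\n' "$input" | sed 's/.*GGTT//')

# awk skips lines starting with ">", so the header is left intact.
awk_out=$(printf '%s\n' "$input" | awk '!/^>/ { sub(/.*GGTT/, "") } 1')

printf 'sed:\n%s\nawk:\n%s\n' "$sed_out" "$awk_out"
```

In both results the sequence line becomes ACGGGGTA, because the greedy .* runs up to the last GGTT; only the awk result keeps the header.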
Note that in general, sequences in fasta format as shown in the question may span multiple lines (= they are often wrapped to 80 or 100 nucleotides per line). This answer handles such cases correctly as well, unlike some other answers in this thread.
Use these two Perl one-liners connected by a pipe. The first one-liner does all of the common reformatting of the fasta sequences that is necessary in this and similar cases. It removes newlines and whitespace in the sequence (which also unwraps the sequence), but does not change the sequence header lines. It also properly handles leading and trailing whitespace/newlines in the file. The second one-liner actually removes everything up to and including the last GGTT in the sequence, in a case-insensitive manner.
Note: If GGTT is at the end of the sequence, the output will be a header plus an empty sequence. See seq4 in the example below. This may cause issues with some bioinformatics tools used downstream.
# Create the input for testing:
cat > in.fa <<EOF
>seq1 with blanks
ACGT GGTT ACGT
>seq2 with newlines
ACGT
GGTT
ACGT
>seq3 without blanks or newlines
ACGTGGTTACGT
>seq4 everything should be deleted, with empty sequence in the output
ACGTGGTTACGTGGTT
>seq5 lowercase
acgtggttacgt
EOF
# Reformat to single-line fasta, then delete subsequences:
perl -ne 'chomp; if ( /^>/ ) { print "\n" if $n; print "$_\n"; $n++; } else { s/\s+//g; print; } END { print "\n"; }' in.fa | \
perl -pe 'next if /^>/; s/.*GGTT//i;' > out.fa
Output in file out.fa:
>seq1 with blanks
ACGT
>seq2 with newlines
ACGT
>seq3 without blanks or newlines
ACGT
>seq4 everything should be deleted, with empty sequence in the output
>seq5 lowercase
acgt
The Perl one-liners use these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
chomp : Remove the input line separator (\n on *NIX).
if ( /^>/ ) : Test if the current line is a sequence header line.
$n : This variable is undefined (false) at the beginning, and true after seeing the first sequence header, in which case we print an extra newline. This newline goes at the end of each sequence, starting from the first sequence.
END { print "\n"; } : Print the final newline after the last sequence.
s/\s+//g; print; : If the current line is sequence (not header), remove all the whitespace and print without the terminal newline.
next if /^>/; : Skip the header lines.
s/.*GGTT//i; : Replace everything (.*) up to and including the last GGTT with nothing (= delete it). The /i modifier means case-insensitive match.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
Remove line breaks in a FASTA file

Generating csv from text file in Linux command line with sed, awk or other

I have a file with thousands of lines that I would like to have it as a csv, for later processing.
The original file looks like this:
cc_1527 (ILDO_I173_net9 VSSA) capacitor_mis c=9.60713e-16
cc_1526 (VDD_MAIN Istartupcomp_I115_G7) capacitor_mis \
c=4.18106e-16
cc_1525 (VDD_MAIN Istartupcomp_I7_net025) capacitor_mis \
c=9.71462e-16
cc_1524 (VDD_MAIN Istartupcomp_I7_ST_net14) \
capacitor_mis c=4.6011e-17
cc_1523 (VDD_MAIN Istartupcomp_I7_ST_net15) \
capacitor_mis c=1.06215e-15
cc_1522 (VDD_MAIN ILDO_LDO_core_Istartupcomp_I7_ST_net16) \
capacitor_mis c=1.37289e-15
cc_1521 (VDD_MAIN ILDO_LDO_core_Istartupcomp_I7_I176_G4) capacitor_mis \
c=6.81758e-16
The problem here, is that some of the lines continue to the next one, indicated by the symbol "\".
The final csv format for the first 5 lines of the original text should be:
cc_1527,(ILDO_I173_net9 VSSA),capacitor_mis c=9.60713e-16
cc_1526,(VDD_MAIN Istartupcomp_I115_G7),capacitor_mis,c=4.18106e-16
cc_1525,(VDD_MAIN Istartupcomp_I7_net025),capacitor_mis,c=9.71462e-16
So, now everything is in one line only and the "\" characters have been removed.
Please note that there may be spaces at the beginning of each line, so these should be trimmed before anything else is done.
Any idea on how to accomplish this. ?
Thanks in advance.
Best regards,
Pedro
Using some of the more obscure features of sed (It can do more than s///):
$ sed -E ':line /\\$/ {s/\\$//; N; b line}; s/[[:space:]]+/,/g' demo.txt
cc_1527,(ILDO_I173_net9,VSSA),capacitor_mis,c=9.60713e-16
cc_1526,(VDD_MAIN,Istartupcomp_I115_G7),capacitor_mis,c=4.18106e-16
cc_1525,(VDD_MAIN,Istartupcomp_I7_net025),capacitor_mis,c=9.71462e-16
cc_1524,(VDD_MAIN,Istartupcomp_I7_ST_net14),capacitor_mis,c=4.6011e-17
cc_1523,(VDD_MAIN,Istartupcomp_I7_ST_net15),capacitor_mis,c=1.06215e-15
cc_1522,(VDD_MAIN,ILDO_LDO_core_Istartupcomp_I7_ST_net16),capacitor_mis,c=1.37289e-15
cc_1521,(VDD_MAIN,ILDO_LDO_core_Istartupcomp_I7_I176_G4),capacitor_mis,c=6.81758e-16
Basically:
Read a line into the pattern space.
:line /\\$/ {s/\\$//; N; b line}: If the pattern space ends in a \, remove that backslash, read the next line and append it to the pattern space, and repeat this step.
s/[[:space:]]+/,/g: Convert every case of 1 or more whitespace characters to a single comma.
Print the result, and go back to the beginning with a new line.
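The steps above can be sketched on a tiny input as follows. This variant splits the script into separate -e expressions, which some sed implementations require for labels and closing braces, and adds a step that trims leading whitespace as the question requested; the sample input and the variable name `joined` are made up for this illustration.

```shell
# Input: "  a \" continued by "  b", then an ordinary line "c d".
# Steps: join continuation lines, trim leading whitespace, then turn each
# run of whitespace (including the embedded newline) into one comma.
joined=$(printf '  a \\\n  b\nc d\n' |
  sed -E -e ':line' \
         -e '/\\$/ { s/\\$//; N; b line' \
         -e '}' \
         -e 's/^[[:space:]]+//' \
         -e 's/[[:space:]]+/,/g')

printf '%s\n' "$joined"
```

Like the original command, this also turns the space inside the parentheses into a comma, so it is only a sketch of the joining logic, not of the exact CSV layout asked for.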
The answer by @Shawn has been accepted by the OP, and I'm not sure
whether my answer is worth posting, but allow me to add it for information.
If Perl is an option, please try the following script, which preserves
the whitespace within parentheses instead of replacing it with commas:
perl -0777 -ne '
    s/\\\n//g;
    foreach $line (split(/\n/)) {
        while ($line =~ /(\([^)]+\))|(\S+)/g) {
            push(@ary, $&);
        }
        print join(",", @ary), "\n";
        @ary = ();
    }
' input.txt
Output:
cc_1527,(ILDO_I173_net9 VSSA),capacitor_mis,c=9.60713e-16
cc_1526,(VDD_MAIN Istartupcomp_I115_G7),capacitor_mis,c=4.18106e-16
cc_1525,(VDD_MAIN Istartupcomp_I7_net025),capacitor_mis,c=9.71462e-16
cc_1524,(VDD_MAIN Istartupcomp_I7_ST_net14),capacitor_mis,c=4.6011e-17
cc_1523,(VDD_MAIN Istartupcomp_I7_ST_net15),capacitor_mis,c=1.06215e-15
cc_1522,(VDD_MAIN ILDO_LDO_core_Istartupcomp_I7_ST_net16),capacitor_mis,c=1.37289e-15
cc_1521,(VDD_MAIN ILDO_LDO_core_Istartupcomp_I7_I176_G4),capacitor_mis,c=6.81758e-16
[How it works]
First of all, -0777 -ne option tells Perl to slurp all lines
into the Perl's default variable $_.
Next, s/\\\n//g; removes trailing backslashes by merging lines.
Then split(/\n/) splits the lines on newlines back again.
The regex /(\([^)]+\))|(\S+)/g will be the most important part
which divides each line into fields. The field pattern is defined as:
"substring surrounded by parens OR substring which does not include whitespaces." It works as FPAT in awk and preserves whitespaces
between parens without dividing the line on them.
I've tested with approx. 10,000 line input and the execution time
is less than a second.
Hope this helps.

How to escape a percent sign in AWK printf?

I'm making an awk statement that will allow me to print a number of unicode nop's to the screen (in testing, 18 of them). It currently looks like the following:
awk 'BEGIN {while (c++<18) printf "%u9090"}'
When this executes this returns a run time error:
awk: run time error: not enough arguments passed to printf("%u9090")
I realised that I then had to escape my % character since I'm not passing any variables to awk, and it's expecting them. I revised to the following:
awk 'BEGIN {while (c++<18) printf "\%u9090"}'
However I'm still being presented with the same error? The gnu documentation suggests that I should be escaping using \ so I'm a bit amiss at what else to try.
All printfs I know (and in C as per the C Standard) allow you to specify a literal percent with %%.
The GNU docs you reference tell you about how to escape special characters in string literals. However, printf's first arg is interpreted as a format string, so the string literal escape mechanism is the wrong place to look. The proper place to look up is the printf specification (either for awk, or if all else fails, the C language).
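Applied to the command from the question, a minimal sketch looks like this. The trailing `print ""` only adds a final newline, and the variable name `nops` is a choice made for this illustration.

```shell
# %% in the format string produces one literal percent sign, so printf
# no longer expects an argument for a %u conversion.
nops=$(awk 'BEGIN { while (c++ < 18) printf "%%u9090"; print "" }')

printf '%s\n' "$nops"
```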

Using awk how do I reprint a found pattern with a new line character?

I have a text file in the format of:
aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;
Where "bcd" can be any length of any characters, excluding ; or :
What I want to do is print the text file in the format of:
aaa: bcd;bcd;bcddd;
aaa: bcd;bcd;bcd;
-etc-
My method of approach to this problem was to isolate a pattern of ";...:" and then reprint this pattern without the initial ;
I concluded I would have to use awk's 'gsub' to do this, but have no idea how to replicate the pattern nor how to print the pattern again with this added new line character 1 character into my pattern.
Is this possible?
If not, can you please direct me in a way of tackling it?
We can't quite be sure of the variability in the aaa or bcd parts; presumably, each one could be almost anything.
You should probably be looking for:
a series of one or more non-colon, non-semicolon characters followed by colon,
with one or more repeats of:
a series of one or more non-colon, non-semicolon characters followed by a semi-colon
That makes up the unit you want to match.
/[^:;]+:([^:;]+;)+/
With that, you can substitute what was found by the same followed by a newline, and then print the result. The only trick is avoiding superfluous newlines.
Example script:
{
echo "aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;"
echo "aaz: xcd;ycd;bczdd;baa:bed;bid;bud;"
} |
awk '{ gsub(/[^:;]+:([^:;]+;)+/, "&\n"); sub(/\n+$/, ""); print }'
Example output
aaa: bcd;bcd;bcddd;
aaa:bcd;bcd;bcd;
aaz: xcd;ycd;bczdd;
baa:bed;bid;bud;
Paraphrasing the question in a comment:
Why does the regular expression not include the characters before a colon (which is what it's intended to do, but I don't understand why)? I don't understand what "breaks" or ends the regex.
As I tried to explain at the top, you're looking for what we can call 'words', meaning sequences of characters that are neither a colon nor a semicolon. In the regex, that is [^:;]+, meaning one or more (+) of the negated character class — one or more non-colon, non-semicolon characters.
Let's pretend that spaces in a regex are not significant. We can space out the regex like this:
/ [^:;]+ : ( [^:;]+ ; ) + /
The slashes simply mark the ends, of course. The first cluster is a word; then there's a colon. Then there is a group enclosed in parentheses, tagged with a + at the end. That means that the contents of the group must occur at least once and may occur any number of times more than that. What's inside the group? Well, a word followed by a semicolon. It doesn't have to be the same word each time, but there does have to be a word there. If something can occur zero or more times, then you use a * in place of the +, of course.
The key to the regex stopping is that the aaa: in the middle of the first line does not consist of a word followed by a semicolon; it is a word followed by a colon. So, the regex has to stop before that because the aaa: doesn't match the group. The gsub() therefore finds the first sequence, and replaces that text with the same material and a newline (that's the "&\n", of course). It (gsub()) then resumes its search directly after the end of the replacement material, and — lo and behold — there is a word followed by a colon and some words followed by semicolons, so there's a second match to be replaced with its original material plus a newline.
After the gsub(), $0 ends with a newline, because the replacement appends one after every match, including the last match on the line (the record itself does not include its terminating newline). Therefore, without the sub() to remove trailing newlines, the print (implicitly of $0 followed by a newline) generated a blank line I didn't want in the output, so I removed the extraneous newline(s).
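A quick way to check the effect of the sub() call is to count the output lines with and without it. This is only a sketch: the variable names are made up for this illustration, and `tr -d ' '` is there because some `wc` implementations pad their output.

```shell
line='aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;'

# Without sub(): gsub() appends "\n" after each match, including the last,
# and print adds one more newline, which shows up as a blank line.
without=$(printf '%s\n' "$line" |
  awk '{ gsub(/[^:;]+:([^:;]+;)+/, "&\n"); print }' | wc -l | tr -d ' ')

# With sub(): the trailing newline is stripped before print adds its own.
with=$(printf '%s\n' "$line" |
  awk '{ gsub(/[^:;]+:([^:;]+;)+/, "&\n"); sub(/\n+$/, ""); print }' | wc -l | tr -d ' ')

printf 'lines without sub(): %s\nlines with sub(): %s\n' "$without" "$with"
```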
This might work for you:
awk '{gsub(/[^;:]*:/,"\n&");sub(/^\n/,"");gsub(/: */,": ")}1' file
Prepend a newline (\n) to any string not containing a ; or a : followed by a :
Remove any newline prepended to the start of line.
Replace any : followed by none or many spaces with a : followed by a single space.
Print all lines.
Or this:
sed 's/;\([^;:]*: *\)/;\n\1 /g' file
Not sure how to do it in awk, but with sed this does what I think you want:
$ nl='
'
$ sed "s/\([^;]*:\)/\\${nl}\1/g" input
The first command sets the shell variable $nl to the string containing a single new line. Some versions of sed allow you to use \n inside the replacement string, but not all allow that. This keeps any whitespace that appears after the final ; and puts it at the start of the line. To get rid of that, you can do
$ sed "s/\([^;]*:\)/\\${nl}\1/g; s/\n */\\$nl/g" input
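Ordinary awk gsub() and sub() don't allow you to refer to captured groups in the replacement string (only & for the whole match). GNU awk ("gawk") supplies gensub(), which does, for example: gensub(/(;) (.+:)/, "\\1\n\\2", "g"). Note that gensub() returns the modified string rather than changing its target in place.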