Align trailing backslashes vertically - awk

I have a file that looks like this:
abc \
d \
efgh \
i
jklmnop \
q \rst \
uv
wx
y \
z
For each group of consecutive lines that end with a backslash, I want to arrange those backslashes in a straight vertical line. So the expected output for the sample above is:
abc  \
d    \
efgh \
i
jklmnop \
q \rst  \
uv
wx
y \
z
I managed to align all backslashes with this program:
$ awk '/\\$/ { sub(/\\$/,""); printf "%-20s\\\n",$0; next} 1' file
abc                 \
d                   \
efgh                \
i
jklmnop             \
q \rst              \
uv
wx
y                   \
z
But I have no idea how to proceed from here, so I'm asking for guidance. I tried searching on SO, but top results were all about removing trailing backslashes.
Details about the actual input:
Lines may contain any character including backslash and consist of any number of characters, there is no limit.
There might be multiple blanks and tabs before and after the last backslash.
There is always at least one blank or tab preceding the last backslash.
There is no line that consists of only a backslash and zero or more blanks/tabs around it.
P.S. I'm not looking for a Perl solution.

If you don't have any tabs before the spaces-then-backslash-then-spaces at the end of each line:
$ cat tst.awk
FNR == 1 { prevHasEsc = blockNr = 0 }
hasEsc = match($0,/[[:space:]]*\\[[:space:]]*$/) {
    $0 = substr($0,1,RSTART-1)
    if ( ! prevHasEsc ) {
        ++blockNr
    }
}
{ prevHasEsc = hasEsc }
NR == FNR {
    if ( hasEsc ) {
        lgth[blockNr] = (lgth[blockNr] > length($0) ? lgth[blockNr] : length($0))
    }
    next
}
hasEsc {
    $0 = sprintf("%-*s \\", lgth[blockNr], $0)
}
{ print }
$ awk -f tst.awk file file
abc  \
d    \
efgh \
i
jklmnop \
q       \
r
If you do then I'd suggest running the text through pr -e -t first to convert tabs to the corresponding number of blanks.

Here is one attempt using gnu-awk with a custom RS that breaks the input on each line that doesn't end with a backslash:
awk -v RS='[^\n]*[^\\\\[:space:]][[:blank:]]*(\n|$)' '
{
    sub(/\n$/, "", RT)
}
n = split($0, lines, /[[:blank:]]+\\[[:blank:]]*\n/) {
    lines[n] = RT
    mlen = 0
    # determine max length of a block
    for (i=1; i<=n; i++)
        if (mlen < length(lines[i]))
            mlen = length(lines[i])
    # print each segment with backslash at the end of max length
    for (i=1; i<n; i++)
        printf "%-" (mlen+1) "s\\\n", lines[i]
}
RT {
    print RT
}' file
abc  \
d    \
efgh \
i
jklmnop \
q \rst  \
uv
wx
y \
z
Details:
-v RS='[^\n]*[^\\\\[:space:]][[:blank:]]*(\n|$)': Sets the record separator to a regex that matches a line that doesn't end with a \. As a result, each record holds a run of contiguous lines that end with \.
split($0, lines, /[[:blank:]]+\\[[:blank:]]*\n/): Splits each record on the trailing \ and the line break that follows it.
In the first for loop, we loop through each array element and determine longest length of the line i.e. mlen
In the second for loop, we just print each line segment using mlen+1 as length to place trailing \
Finally we print RT which is the substring we capture as a result of RS=...
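For comparison, the same idea can be sketched in a single pass without gawk-specific features, by buffering each block until a line without a trailing backslash arrives. This is only a sketch under the assumptions stated in the question (at least one blank or tab before the final backslash, no tabs inside the text that would disturb the column count); the function name align_backslashes and the helper flush_block are made up for illustration.

```shell
align_backslashes() {
    awk '
    # Print the buffered block, padding every line to the widest one.
    function flush_block(   i) {
        for (i = 1; i <= n; i++)
            printf "%-" width "s \\\n", buf[i]
        n = width = 0
    }
    # A line ending in blanks, a backslash and optional blanks joins the block.
    match($0, /[ \t]+\\[ \t]*$/) {
        buf[++n] = substr($0, 1, RSTART - 1)
        if (length(buf[n]) > width) width = length(buf[n])
        next
    }
    # Any other line closes the current block and is printed unchanged.
    { flush_block(); print }
    END { flush_block() }
    ' "$@"
}
```

Run it as align_backslashes file. Memory use is bounded by the longest block rather than by the whole file.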

Related

Concatenate columns and adds digits awk

I have a csv file:
number1;number2;min_length;max_length
"40";"1801";8;8
"40";"182";8;8
"42";"32";6;8
"42";"4";6;6
"43";"691";9;9
I want the output be:
4018010000;4018019999
4018200000;4018299999
42320000;42329999
423200000;423299999
4232000000;4232999999
42400000;42499999
43691000000;43691999999
So the new file will consist of:
column_1 = old_column_1 + old_column_2 + a number of "0"s equal to (old_column_3 - length of old_column_2)
column_2 = old_column_1 + old_column_2 + a number of "9"s equal to (old_column_3 - length of old_column_2), when min_length = max_length.
When min_length is not equal to max_length, I need to take all the possible lengths into account. So for the line "42";"32";6;8, the lengths are 6, 7 and 8.
Also, I need to delete the quotation marks everywhere.
I tried with paste and cut like that:
paste -d ";" <(cut -f1,2 -d ";" < file1) > file2
for the concatenation of the first 2 columns, but I think it's easier with awk. However, I can't figure out how to do it. Any help is appreciated. Thanks!
Edit: Actually, added column 4 in input.
You may use this awk:
awk 'function padstr(ch, len,    s) {
    s = sprintf("%*s", len, "")
    gsub(/ /, ch, s)
    return s
}
BEGIN {
    FS = OFS = ";"
}
NR > 1 {    # skip the header line
    gsub(/"/, "")
    for (i = 0; i <= ($4 - $3); i++) {
        d = $3 - length($2) + i
        print $1 $2 padstr("0", d), $1 $2 padstr("9", d)
    }
}' file
4018010000;4018019999
4018200000;4018299999
42320000;42329999
423200000;423299999
4232000000;4232999999
42400000;42499999
43691000000;43691999999
With GNU awk (gensub() is a gawk extension). Note that this only handles min_length, not the range up to max_length:
awk '
BEGIN { FS = OFS = ";" }          # set field separator and output field separator to ";"
NR > 1 {                          # skip the header line
    $0 = gensub("\"", "", "g")    # drop double quotes
    s = $1 $2                     # the range prefix
    l = $3 - length($2)           # number of 0s or 9s to be appended
    l = 10^l                      # 10 raised to that number
    print s*l, (s+1)*l - 1        # appending n zeros is multiplication by 10^n;
                                  # appending n nines is multiplication by 10^n, plus (10^n - 1)
}' input.txt
Explanation inline as comments.
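The arithmetic idea can also be extended to cover every length from min_length to max_length. The following is a hedged sketch, not a tested drop-in: it assumes the first line of the file is the header shown in the question, and it uses printf "%.0f" because some awk implementations print large integers in exponent notation with plain print. The function name expand_ranges is hypothetical.

```shell
expand_ranges() {
    awk 'BEGIN { FS = ";" }
    NR > 1 {                                   # assumes the first line is a header
        gsub(/"/, "")                          # drop the double quotes
        prefix = $1 $2
        for (len = $3; len <= $4; len++) {     # every length from min to max
            scale = 10 ^ (len - length($2))
            # prefix followed by zeros ; prefix followed by nines
            printf "%.0f;%.0f\n", prefix * scale, (prefix + 1) * scale - 1
        }
    }' "$@"
}
```

Called as expand_ranges file, this should produce the seven lines listed in the question for the sample input.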

Awk if else expression not printing correct results for mathematical operation

So I have an input file that looks like this:
atom Comp
C1 45.7006
H40 30.0407
N41 148.389
S44 502.263
F45 365.162
I also have some variables that I have called in from another file, which I know are defined correctly, as the correct values print when I call them using echo.
These values are
Hslope=-1.1120
Hint=32.4057
Cslope=-1.0822
Cint=196.4234
What I am trying to do is, for all lines with C in the first column, print (column 2 - Cint)/Cslope. The same goes for all lines with H in the first column, using the appropriate variables, and all lines that have neither C nor H should print "NA".
The first line should be skipped.
Currently, my code reads
awk -v Hslope=$Hslope -v Hint=$Hint -v Cslope=$Cslope -v Cint=$Cint '{for(i=2; i<=NR; i++)
{
if($1 ~ /C/)
{ shift = (($2-Cint)/Cslope); print shift }
else if($1 ~ /H/)
{ shift = (($2-Hint)/Hslope); print shift }
else
{ print "NA" }
} }' avRNMR >> vgRNMR
Where avRNMR is the input file and vgRNMR is the output file, which is already created with the contents "shift" by another line.
I have also tried a version where print is just set to the mathematical expression instead using "shift" as a variable. Another attempt was putting $ in front of every variable. Neither of these have produced any different results.
The output I get is
shift
139.274
2.1268
2.1268
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
NA
Which is not the correct answer, particularly considering that my input file only has the six lines shown above. Note that the number of lines with C, H, and other letters is variable.
What I should get is
shift
139.27
2.13
NA
NA
NA
EDIT
As suggested, exchanging "for(i=2; i<=NR; i++)" for FNR>1 gives the following output
shift
NA
C1 45.7006
139.274
H40 30.0407
2.1268
N41 148.389
NA
S44 502.263
NA
F45 365.162
NA
Which is almost the correct output for the math answers, but not in the desired format. That first NA also means that a line is getting read to print that, which, if it is truly skipping the first line, shouldn't happen.
Remove the for loop on i=2. Add pattern FNR>1 before the action. Anchor the two patterns to the beginning of the field:
awk -v Hslope="$Hslope" -v Hint="$Hint" -v Cslope="$Cslope" -v Cint="$Cint" '
FNR > 1 { # skip first record
    if ($1 ~ /^C/) print (($2-Cint)/Cslope)
    else if ($1 ~ /^H/) print (($2-Hint)/Hslope)
    else print "NA"
}' avRNMR >> vgRNMR
Warning: I didn't test that code.
EDIT: I have now tested the code:
$ cat avRNMR
atom Comp
C1 45.7006
H40 30.0407
N41 148.389
S44 502.263
F45 365.162
$ awk -v Hslope=-1.1120 -v Hint=32.4057 -v Cslope=-1.0822 -v Cint=196.4234 '
> FNR > 1 { # skip first record
> if($1 ~ /^C/) print (($2-Cint)/Cslope)
> else if($1 ~ /^H/) print (($2-Hint)/Hslope)
> else print "NA"
> }' avRNMR
139.274
2.1268
NA
NA
NA
That looks to me like what you want. Please tell me what you are seeing.
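As background on why the original for loop printed too many lines: awk already runs each action once per input record, so a loop bounded by NR repeats the same work NR-1 times for the current record. A small sketch with made-up function names (loop_demo, pattern_demo) shows the difference on the first three lines of the sample:

```shell
# The loop version prints record 2 once and record 3 twice.
loop_demo() {
    printf 'atom Comp\nC1 45.7006\nH40 30.0407\n' |
    awk '{ for (i = 2; i <= NR; i++) print NR ": " $1 }'
}

# The pattern version prints each record after the header exactly once.
pattern_demo() {
    printf 'atom Comp\nC1 45.7006\nH40 30.0407\n' |
    awk 'NR > 1 { print NR ": " $1 }'
}
```

This matches the counts in the question's output: one result for the second line, two for the third, and then 3 + 4 + 5 = 12 NA lines.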
Try this:
$ awk 'NR==FNR{v[$1]=$2} NR<=FNR||FNR==1{next} /^[CH]/{c=substr($0, 1, 1); print ($2-v[c"int"])/v[c"slope"];next} {print "NA"}' FS="=" vars FS=" " file
139.274
2.1268
NA
NA
NA
The first pattern/action pair reads variables from file vars into an array v. The second skips further processing and also skips the first line for the second file file. The third will match lines with C and H and do the calculations.
You'll need to change the file names and redirect the output to your outfile.
$ cat tst.awk
{ shift = "NA" }
/^C/ { shift = ($2 - Cint) / Cslope }
/^H/ { shift = ($2 - Hint) / Hslope }
NR>1 { print shift }
$ awk -v Hslope="$Hslope" -v Hint="$Hint" -v Cslope="$Cslope" -v Cint="$Cint" -f tst.awk file
139.274
2.1268
NA
NA
NA
or if this is what you really want:
$ cat tst.awk
{ shift = (NR==1 ? "shift" : "NA") }
/^C/ { shift = ($2 - Cint) / Cslope }
/^H/ { shift = ($2 - Hint) / Hslope }
{ print shift }
$ awk -v Hslope="$Hslope" -v Hint="$Hint" -v Cslope="$Cslope" -v Cint="$Cint" -f tst.awk file
shift
139.274
2.1268
NA
NA
NA

awk duplicated lines with starting with # symbol

In the awk below, is there a way to process only the lines after the #CHROM pattern, but still print every line in the output? The problem I am having is that if I ignore all lines starting with #, they do print in the output, but the other lines without the # get duplicated. My data file has thousands of lines, but only the one format below is updated by the awk. Thank you :).
file tab-delimited
##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 948797 . C . 0 PASS DP=159;END=948845;MAX_DP=224;MIN_DP=95 GT:DP:MIN_DP:MAX_DP 0/0:159:95:224
awk
awk '!/^#/
BEGIN {FS = OFS = "\t"
}
NF == 10 {
split($8, a, /[=;]/)
$11 = $12 = $13 = $14 = $15 = $18 = "."
$16 = (a[1] == "DP") ? a[2] : "DP=num_Missing"
$17 = "homref"
}
1' out > ref
current output tab-delimited
##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 948797 . C . 0 PASS DP=159;END=948845;MAX_DP=224;MIN_DP=95 GT:DP:MIN_DP:MAX_DP 0/0:159:95:224 --- duplicated line ---
chr1 948797 . C . 0 PASS DP=159;END=948845;MAX_DP=224;MIN_DP=95 GT:DP:MIN_DP:MAX_DP 0/0:159:95:224 . . . . . 159 homref . --- this line is correct ---
desired output tab-delimited
##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -m-both -o genome_split.vcf genome.vcf.gz
##bcftools_normCommand=norm -f /home/cmccabe/Desktop/NGS/picard-tools-1.140/resources/ucsc.hg19.fasta -o genome_annovar.vcf genome_split.vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 948797 . C . 0 PASS DP=159;END=948845;MAX_DP=224;MIN_DP=95 GT:DP:MIN_DP:MAX_DP 0/0:159:95:224 . . . . . 159 homref .
Your first statement:
!/^#/
says "print every line that doesn't start with #" and your last:
1
says "print every line". Hence the duplicate lines in the output.
To only modify lines that don't start with # but print all lines would be:
!/^#/ { do stuff }
1
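Put together with the action from the question, the whole script might look like the sketch below. The wrapper function fix_vcf is a made-up name, and the field logic is copied from the question rather than verified against real VCF data.

```shell
fix_vcf() {
    awk '
    BEGIN { FS = OFS = "\t" }
    # Only data lines (not starting with #) with exactly 10 fields are modified.
    !/^#/ && NF == 10 {
        split($8, a, /[=;]/)
        $11 = $12 = $13 = $14 = $15 = $18 = "."
        $16 = (a[1] == "DP") ? a[2] : "DP=num_Missing"
        $17 = "homref"
    }
    # Print every line exactly once, modified or not.
    1' "$@"
}
```

Usage would be fix_vcf out > ref, mirroring the file names in the question.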

Awk - Substring comparison

Working native bash code :
while read line
do
    a=${line:112:7}
    b=${line:123:7}
    if [[ $a != "0000000" || $b != "0000000" ]]
    then
        echo "$line" >> FILE_OT_YHAV
    else
        echo "$line" >> FILE_OT_NHAV
    fi
done <$FILE_IN
I have the following file (it's a dummy); the substrings being checked are both in the 4th field, so never mind the exact numbers.
AAAAAAAAAAAAAA XXXXXX BB CCCCCCC 12312312443430000000
BBBBBBB AXXXXXX CC DDDDDDD 10101010000000000000
CCCCCCCCCC C C QWEQWEE DDD AAAAAAA A12312312312312310000
I'm trying to write an awk script that compares two specific substrings: if either one is not 0000000 it outputs the line into File A, and if both of them are 0000000 it outputs the line into File B. This is the code I have so far:
# Before first line.
BEGIN {
print "Awk Started"
FILE_OT_YHAV="FILE_OT_YHAV.test"
FILE_OT_NHAV="FILE_OT_NHAV.test"
FS=""
}
# For each line of input.
{
fline=$0
# print "length = #" length($0) "#"
print "length = #" length(fline) "#"
print "##" substr($0,112,7) "##" substr($0,123,7) "##"
if ( (substr($0,112,7) != "0000000") || (substr($0,123,7) != "0000000") )
print $0 > FILE_OT_YHAV;
else
print $0 > FILE_OT_NHAV;
}
# After last line.
END {
print "Awk Ended"
}
The problem is that when I run it, it:
a) Treats every line as having a different length
b) Therefore applies the substrings to different parts of each line (that is why I added the print length statements before the if, to check on it).
This is a sample output of the line length awk reads and the different substrings :
Awk Started
length = #130#
## ## ##
length = #136#
##0000000##22016 ##
length = #133#
##0000001##16 ##
length = #129#
##0010220## ##
length = #138#
##0000000##1022016##
length = #136#
##0000000##22016 ##
length = #134#
##0000000##016 ##
length = #137#
##0000000##022016 ##
Is there a reason why awk treats lines of the same length as having a different length? Does it have something to do with the spacing of the input file?
Thanks in advance for any help.
After the comments about cleaning the file up with sed, I got this output (and yes, now the lines have different sizes):
1 0M-DM-EM-G M-A.M-E. #DEH M-SM-TM-OM-IM-WM-EM-IM-A M-DM-V/M-DM-T/M-TM-AM-P 01022016 $
2 110000080103M-CM-EM-QM-OM-MM-TM-A M-A. 6M-AM-HM-GM-MM-A 1055801001102 0000120000012001001142 19500000120 0100M-D000000000000000000000001022016 $
3 110000106302M-TM-AM-QM-EM-KM-KM-A 5M-AM-AM-HM-GM-MM-A 1043801001101 0000100000010001001361 19500000100M-IM-SM-O0100M-D000000000000000000000001022016 $
4 110000178902M-JM-AM-QM-AM-CM-IM-AM-MM-MM-G M-KM-EM-KM-AM-S 71M-AM-HM-GM-MM-A 1136101001101 0000130000013001006061 19500000130 0100M-D000000000000000000000001022016 $
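Two things may be worth checking here, offered as a sketch rather than a confirmed diagnosis. First, bash substring offsets start at 0 while awk's substr() starts at 1, so ${line:112:7} corresponds to substr($0, 113, 7). Second, the M- sequences in the dump above indicate bytes with the high bit set; if awk runs in a multibyte locale, its character counts may not match byte offsets, and forcing LC_ALL=C makes it count bytes (this matters for gawk in particular). The function name split_by_flags and its arguments are hypothetical.

```shell
split_by_flags() {
    # $1: input file, $2: output for lines with a non-zero flag,
    # $3: output for lines where both flags are all zeros
    LC_ALL=C awk -v yes="$2" -v no="$3" '{
        a = substr($0, 113, 7)   # bash ${line:112:7}: bash offsets start at 0, awk at 1
        b = substr($0, 124, 7)   # bash ${line:123:7}
        if (a != "0000000" || b != "0000000")
            print > yes
        else
            print > no
    }' "$1"
}
```

For example, split_by_flags "$FILE_IN" FILE_OT_YHAV FILE_OT_NHAV. If the lines really do differ in length, fixed offsets cannot work in either bash or awk, and the fields would have to be located some other way.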

using awk or sed extract first character of each column and store it in a separate file

I have a file like below
AT AT AG AG
GC GC GG GC
I want to extract the first and last character of every column and store them in two different files:
File1:
A A A A
G G G G
File2:
T T G G
C C G C
My input file is very large. Is there a way I can do it in awk or sed?
With GNU awk for gensub():
gawk '{
    print gensub(/.( |$)/,"\\1","g") > "file1"
    print gensub(/(^| )./,"\\1","g") > "file2"
}' file
You can do similar in any awk with gsub() and a couple of variables.
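One way the gsub() variant might look is sketched below. It assumes, like the gensub() version, that every column is exactly two characters wide and that columns are separated by single spaces; the function name split_ends and the variable names are made up for illustration.

```shell
split_ends() {
    # $1: input file, $2: file for first characters, $3: file for last characters
    awk -v f1="$2" -v f2="$3" '{
        firsts = lasts = $0
        # Drop the second character of each pair.
        gsub(/[^ ] /, " ", firsts); sub(/[^ ]$/, "", firsts)
        # Drop the first character of each pair.
        gsub(/ [^ ]/, " ", lasts);  sub(/^[^ ]/, "", lasts)
        print firsts > f1
        print lasts  > f2
    }' "$1"
}
```

Run it as split_ends file file1 file2. It reads the input once and keeps only the current line in memory.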
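You can try this. Save the following as test.awk:
#!/usr/bin/awk -f
BEGIN {
    outfile_head = "file1"
    outfile_tail = "file2"
}
{
    for (i = 1; i <= NF; i++) {
        sep = (i < NF ? " " : "\n")    # end each output line after the last field
        # substr() positions start at 1; a start of 0 is not portable
        printf "%s%s", substr($i, 1, 1), sep >> outfile_head
        printf "%s%s", substr($i, length($i), 1), sep >> outfile_tail
    }
}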
then you run this:
./test.awk file
It's easy to do in two passes:
sed 's/\([^ ]\)[^ ]/\1/g' file > file1
sed 's/[^ ]\([^ ]\)/\1/g' file > file2
Doing it in one pass is a challenge...
Edit 1: Modified for your multiple line edit.
You could write a perl script and pass in the file names if you plan to edit it and share it. This loops through the file only once and does not require storing the file in memory.
File "seq.pl":
#!/usr/bin/perl
open(F1,">>$ARGV[1]");
open(F2,">>$ARGV[2]");
open(DATA,"$ARGV[0]");
while($line=<DATA>) {
    $line =~ s/(\r|\n)+//g;
    @pairs = split(/\s/, $line);
    for $pair (@pairs) {
        @bases = split(//,$pair);
        print F1 $bases[0]." ";
        print F2 $bases[-1]." ";    # last character of the pair
    }
    print F1 "\n";
    print F2 "\n";
}
close(F1);
close(F2);
close(DATA);
Execute it like so:
perl seq.pl full.seq f1.seq f2.seq
File "full.seq":
AT AT AG AG
GC GC GG GC
AT AT GC GC
File "f1.seq":
A A A A
G G G G
A A G G
File "f2.seq":
T T G G
C C G C
T T C C