Replace in every nth line starting from a certain line - awk

I want to replace on every third line starting from second line using sed.
Input file
A1
A2
A3
A4
A5
A6
A7
.
.
.
Expected output
A1
A2
A3
A4_edit
A5
A6
A7_edit
.
.
.
I know there are many solutions related to this on Stack Overflow, but I was unable to find one for this specific problem.
My try:
sed '1n;s/$/_edit/;n'
This only replaces on every second line from the beginning.

Something like this?
$ seq 10 | sed '1b ; n ; n ; s/$/_edit/'
1
2
3
4_edit
5
6
7_edit
8
9
10_edit
This breaks down to a cycle of
1b if this is the first line in the input, start the next cycle, using sed default behaviour to print the line and read the next one - which skips the first line in the input
n print the current line and read the next line - which skips the first line in a group of three
n print the current line and read the next line - which skips the second line in a group of three
s/$/_edit/ append _edit to the end of the line, which is the third line of each group of three
then use the default sed behaviour to print, read next line and start the cycle again
If you want to skip more than one line at the start, change 1b to, say, 1,5b.
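As a quick check of the 1,5b variant, here is a small sketch on generated input (assuming seq is available; the variable name result is only for illustration). Lines 1 to 5 pass through untouched and the three-line cycle starts at line 6, so the first edited line is 8.

```shell
# Skip lines 1-5 entirely, then edit the third line of each following group of three.
result=$(seq 11 | sed '1,5b; n; n; s/$/_edit/')
printf '%s\n' "$result"
```

With 11 input lines, the edited lines should be 8 and 11.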
As Wiktor Stribiżew has pointed out in the comments, as an alternative, there is a GNU range extension first~step which allows us to write
sed '4~3s/$/_edit/'
which means substitute on every third line starting from line 4.

In case you are ok with awk, try following.
awk -v count="-1" '++count==3{$0=$0"_edit";count=0} 1' Input_file
Append > temp_file && mv temp_file Input_file in case you want to save output into Input_file itself.
Explanation:
awk -v count="-1" ' ##Starting awk code here and mentioning variable count whose value is -1 here.
++count==3{ ##Checking condition if increment value of count is equal to 3 then do following.
$0=$0"_edit" ##Appending _edit to current line value.
count=0 ##Making value of count as ZERO now.
} ##Closing block of condition ++count==3 here.
1 ##Mentioning 1 will print edited/non-edited lines.
' Input_file ##Mentioning Input_file name here.

Another awk
awk 'NR>3&&NR%3==1{$0=$0"_edit"}1' file
A1
A2
A3
A4_edit
A5
A6
A7_edit
A8
A9
A10_edit
A11
A12
A13_edit
NR>3 Test if the line number is greater than 3
NR%3==1 and the line number leaves remainder 1 when divided by 3 (lines 4, 7, 10, ...)
{$0=$0"_edit"} edit the line
1 print everything
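The same test can be written with the starting line and the step as parameters. This is only a sketch: start and step are illustrative variable names that do not appear in the answers above, and it uses nothing beyond POSIX awk, so it should also work where GNU sed is not available.

```shell
# start and step are hypothetical names chosen for this example.
result=$(seq 10 | awk -v start=4 -v step=3 '
    NR >= start && (NR - start) % step == 0 { $0 = $0 "_edit" }  # every step-th line, counting from start
    1                                                            # print every line, edited or not
')
printf '%s\n' "$result"
```

Changing start=4 to another line number moves the first edited line without touching the pattern.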

You can use sed's ~ step operator.
sed '4~3s|$|_edit|'
~ is a feature of GNU sed, so it will be available in most (all?) distros of Linux. But to use it on macOS (which comes with BSD sed), you would have to install GNU sed to get this feature: brew install gnu-sed.

Related

split based on the last dot and create a new column with the last part of the string

I have a file with 2 columns. The first column contains strings (IDs) and the second contains values. The strings contain a variable number of dots. I would like to split these strings at the last dot. I found in the forum how to remove the last part after the last dot, but I don't want to remove it. I would like to create a new column with the last part of the strings, using a bash command (e.g. awk).
Example of strings:
5_8S_A.3-C_1.A 50
6_FS_B.L.3-O_1.A 20
H.YU-201.D 80
UI-LP.56.2011.A 10
Example of output:
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10
I tried to solve it by using the following command but it works if I have just 1 dot in the string:
awk -F' ' '{{split($1, arr, "."); print arr[1] "\t" arr[2] "\t" $2}}' file.txt
You may use this sed:
sed -E 's/^([[:blank:]]*[^[:blank:]]+)\.([^[:blank:]]+)/\1 \2/' file
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10
Details:
^: Start
([[:blank:]]*[^[:blank:]]+): Capture group #1 to match 0 or more whitespace characters followed by 1+ non-whitespace characters.
\.: Match a dot. Because the preceding + is greedy, this is the last dot in the first field
([^[:blank:]]+): Capture group #2 to match 1+ non-whitespace characters
\1 \2: Replacement to place a space between capture value #1 and capture value #2
Assumptions:
each line consists of two (white) space delimited fields
first field contains at least one period (.)
Sticking with OP's desire (?) to use awk:
awk '
{ n=split($1,arr,".") # split first field on period (".")
pfx=""
for (i=1;i<n;i++) { # print all but the nth array entry
printf "%s%s",pfx,arr[i]
pfx="."}
print "\t" arr[n] "\t" $2} # print last array entry and last field of line
' file.txt
Removing comments and reducing to a one-liner:
awk '{n=split($1,arr,"."); pfx=""; for (i=1;i<n;i++) {printf "%s%s",pfx,arr[i]; pfx="."}; print "\t" arr[n] "\t" $2}' file.txt
This generates:
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10
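A shorter awk variant is possible with match(), which returns the position of the match. The sketch below is not from the answers above; it assumes two whitespace-separated fields as in the samples, and the NODOT line is a made-up input that shows what happens when the first field has no dot.

```shell
result=$(printf '%s\n' '5_8S_A.3-C_1.A 50' 'H.YU-201.D 80' 'NODOT 7' | awk '
    {
        pos = match($1, /\.[^.]*$/)      # position of the last dot in field 1, 0 if there is none
        if (pos)
            print substr($1, 1, pos - 1), substr($1, pos + 1), $2
        else
            print                        # no dot: leave the line unchanged
    }
')
printf '%s\n' "$result"
```

The regex anchors at the end of the field and forbids further dots after the one it matches, so it can only land on the last dot.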
With your shown samples, here is one more variant of rev + awk solution.
rev Input_file | awk '{sub(/\./,OFS)} 1' | rev
Explanation: rev reverses each line character by character, so the last dot of the original line becomes the first dot of the reversed line. awk then replaces that first dot with OFS (a space) and prints every line, and the second rev restores the original character order. This relies on the shown samples, where no dot appears after the ID column.
$ sed 's/\.\([^.]*$\)/\t\1/' file
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10

Print file from the beginning until a pattern is found in the 3rd column

At the end of the third column there are three numbers that start with 1.0, how can I print all rows from the beginning until the first of these three numbers (it is in bold below) is found? is there a simple, one-liner solution with awk or sed?
-80.2743682 -8.6420053 0
-80.2740679 -8.6418526 0.0371169854015
-80.2737584 -8.641709 0.0747022394455
-80.2734489 -8.6415562 0.112728651431
-80.2731395 -8.6414126 0.150303954286
-80.27283 -8.6412689 0.187893918808
-80.2725297 -8.6411162 0.22501096103
-80.2722202 -8.6409635 0.263032506523
-80.2719108 -8.6408198 0.300612533441
-80.2716103 -8.6406581 0.33821233839
-80.27131 -8.6405053 0.375334461391
-80.2710095 -8.6403526 0.412471166086
-80.2707001 -8.6401999 0.450482911927
-80.2703906 -8.6400471 0.488509444666
-80.2700993 -8.6398853 0.52522713763
-80.2697989 -8.6397326 0.562354088058
-80.2694894 -8.6395709 0.600828217536
-80.2691981 -8.6394181 0.637071247595
-80.2688886 -8.6392473 0.676023425931
-80.2685882 -8.6390946 0.713150425596
-80.2682788 -8.6389509 0.750730603804
-80.2679783 -8.6387981 0.787872459485
-80.2676599 -8.6386635 0.825947927083
-80.2673686 -8.6385108 0.862185868818
-80.2670501 -8.6383762 0.900271491364
-80.2667406 -8.6382235 0.938293256103
-80.2664311 -8.6380798 0.975883478883
-80.2661217 -8.6379452 **1.01304928976** <--- print until here
-80.2658214 -8.6378015 1.04972439283
-80.2655029 -8.6376578 1.08821462134
Any help is much appreciated.
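With awk it'd be:
awk '1; $3>=1{exit}' file
Putting the print rule (1) before the exit test includes the matching line in the output; with the order reversed ($3>=1{exit} 1), output stops just before that line.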
Could you please try the following. Written and tested with the shown samples only.
awk '1;$NF~/^1\.0/{exit}' Input_file
Also, in case your Input_file has more than 3 fields, change the above to:
awk '1;$3~/^1\.0/{exit}' Input_file
Sorry, I am a little confused by the description. In case you need to print until the 3rd occurrence of 1.0, try the following (your bold line shows that you want to print only until the very first occurrence, but I am adding this in case anyone needs it).
awk '1;$NF~/^1\.0/ && ++count==3{exit}' Input_file
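Whether the matching line itself is printed depends on the order of the print rule and the exit rule. The following sketch shows both orders on a made-up four-line sample (the data and variable names are illustrative, not taken from the question).

```shell
sample=$(printf '%s\n' 'a b 0.5' 'a b 0.9' 'a b 1.01' 'a b 1.04')

# Exit rule first: the first line with $3 >= 1 is NOT printed.
exclusive=$(printf '%s\n' "$sample" | awk '$3 >= 1 { exit } 1')

# Print rule first: the first line with $3 >= 1 IS printed, then awk exits.
inclusive=$(printf '%s\n' "$sample" | awk '1; $3 >= 1 { exit }')

printf 'exclusive:\n%s\ninclusive:\n%s\n' "$exclusive" "$inclusive"
```

The inclusive form matches the "print until here" marker in the question.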
sed -r '1,/.+ +.+ +0\.263/ p' -n < Input_file
This sed says:
don't print anything from Input_file by default (-n)
except for the lines starting with line 1
up to the first line matching /.+ +.+ +0\.263/, which means:
1 or more characters followed by
1 or more spaces followed by
1 or more characters followed by
1 or more spaces followed by
the number 0.263 (use 1\.0 instead to stop at the line the question asks about)
print these lines (p)
The regex can definitely be improved, for example by anchoring it to the start of the line and by handling tabs as well as spaces. But the general idea is there.
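A related sed idiom is the q command, which prints the current line and quits, so no range or -n is needed. The sketch below is an alternative rather than the answer's own method; it assumes space-separated columns and uses a made-up sample.

```shell
sample=$(printf '%s\n' 'a b 0.5' 'a b 0.9' 'a b 1.01' 'a b 1.04')

# Quit at the first line whose third space-separated column starts with 1.0;
# sed prints that line before exiting.
result=$(printf '%s\n' "$sample" | sed -E '/^[^ ]+ +[^ ]+ +1\.0/q')
printf '%s\n' "$result"
```

Because sed stops reading at the match, this also avoids scanning the rest of a large file.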

Problems with awk substr

I am trying to split a file column using the substr awk command. So the input is as follows (it consists of 4 lines, one blank line):
#NS500645:122:HYGVMBGX2:4:21402:2606:16446:ACCTAGAAGG:R1
ACCTAGAAGGATATGCGCTTGCGCGTTAGAGATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTATGATCC
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
I want to split the second line by the pattern "GATC" but keeping it on the right sub-string like:
ACCTAGAAGGATATGCGCTTGCGCGTTAGA GATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTATGATCC
I want the last line to be split at the same positions, so that each piece has the same length as the corresponding piece of the sequence, and to regenerate the file like:
ACCTAGAAGGATATGCGCTTGCGCGTTAGA
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEE
GATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTAT
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
GATCC
EEEEE
To split the last column I am using this awk script:
cat prove | paste - - - - | awk 'BEGIN {FS="\t"; OFS="\t"}
{gsub("GATC","\tGATC", $2); split($2, a, "\t");
for (i in a) print substr($4, length(a[i-1])+1, length(a[i-1])+length(a[i]))}'
But the output is as follows:
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
The second and third lines are longer than expected.
I checked the calculated values that are passed to the substr command, and they are correct:
1 30
31 70
41 45
Using these length the output should be:
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEE
But as I showed it is not the case.
Any suggestions?
I guess you're looking for something like this, but your question's formatting is really confusing.
$ awk -v OFS='\t' 'NR==1 {next}
NR==2 {n=index($0,"GATC")}
/^[^+]/ {print substr($0,1,n-1),substr($0,n)}' file
ACCTAGAAGGATATGCGCTTGCGCGTTAGA GATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTATGATCC
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEE EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
I assumed your file is in this format
dummy header line to be ignored
ACCTAGAAGGATATGCGCTTGCGCGTTAGAGATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTATGATCC
+
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
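On the lengths in the question: the third argument of awk's substr(s, start, length) is a length, not an end position, which would explain why the second and third pieces come out too long. The sketch below cuts a quality string into pieces that mirror the GATC split of the sequence. The short read, the variable names, and the use of -v are assumptions for illustration; note that -v interprets backslash escapes, so a real quality string containing a backslash would need to be passed another way.

```shell
# Hypothetical short read: sequence and quality string of equal length.
seq_line='AAGATCTTGATCC'
qual_line='0123456789abc'

result=$(awk -v s="$seq_line" -v q="$qual_line" 'BEGIN {
    gsub(/GATC/, "\tGATC", s)            # mark every cut point with a tab
    n = split(s, part, "\t")
    start = 1
    for (i = 1; i <= n; i++) {           # numeric loop keeps the pieces in order
        len = length(part[i])
        if (len == 0) continue           # the sequence began with GATC
        print part[i]
        print substr(q, start, len)      # third argument is a LENGTH, not an end position
        start += len
    }
}')
printf '%s\n' "$result"
```

A running start offset replaces the a[i-1] lookups, and the numeric loop avoids the unspecified order of for (i in a).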

awk: identify column by condition, change value, and finally print all columns

I want to extract the value in each row of a file that comes after AA. I can do this like so:
awk -F'[;=|]' '{for(i=1;i<=NF;i++)if($i=="AA"){print toupper($(i+1));next}}'
This gives me the exact information I need and converts to uppercase, which is exactly what I want to do. How can I do this and then print the entire row with this altered value in its previous position? I am essentially trying to do a find and replace where the value is changed to uppercase.
EDIT:
Here is a sample input line:
11 128196 rs576393503 A G 100 PASS AC=453;AF=0.0904553;AN=5008;NS=2504;DP=5057;EAS_AF=0.0159;AMR_AF=0.0259;AFR_AF=0.3071;EUR_AF=0.006;SAS_AF=0.0072;AA=g|||;VT=SNP
and here is how I would like the output to look:
11 128196 rs576393503 A G 100 PASS AC=453;AF=0.0904553;AN=5008;NS=2504;DP=5057;EAS_AF=0.0159;AMR_AF=0.0259;AFR_AF=0.3071;EUR_AF=0.006;SAS_AF=0.0072;AA=G|||;VT=SNP
All that has changed is the g after AA= is changed to uppercase.
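The following awk may help:
awk '
match($0,/AA=[^|]*/){ ##Run this block only on lines that contain AA=.
$0=substr($0,1,RSTART+2) toupper(substr($0,RSTART+3,RLENGTH-3)) substr($0,RSTART+RLENGTH) ##Rebuild the line with the value after AA= in uppercase.
}
1 ##Print every line, changed or not.
' Input_file
The match() guard matters: on a line without AA=, RSTART is 0 and RLENGTH is -1, so unguarded substr calls would duplicate the first two characters of the line.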
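To see how RSTART and RLENGTH cut the line into three pieces, here is a small sketch on the tail of the sample line. The head=, value= and tail= labels are only for display.

```shell
result=$(echo 'SAS_AF=0.0072;AA=g|||;VT=SNP' | awk '
    match($0, /AA=[^|]*/) {
        print "RSTART=" RSTART, "RLENGTH=" RLENGTH
        print "head=" substr($0, 1, RSTART + 2)              # everything through AA=
        print "value=" substr($0, RSTART + 3, RLENGTH - 3)   # text between AA= and the first |
        print "tail=" substr($0, RSTART + RLENGTH)           # rest of the line
    }
')
printf '%s\n' "$result"
```

Only the value piece is passed through toupper() in the answer; head and tail are concatenated back unchanged.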
With GNU sed and perl, using word boundaries
$ echo 'SAS_AF=0.0072;AA=g|||;VT=SNP' | sed 's/\bAA=[^;=|]*\b/\U&/'
SAS_AF=0.0072;AA=G|||;VT=SNP
$ echo 'SAS_AF=0.0072;AA=g|||;VT=SNP' | perl -pe 's/\bAA=[^;=|]*\b/\U$&/'
SAS_AF=0.0072;AA=G|||;VT=SNP
\U will uppercase string following it until end or \E or another case-modifier
use g modifier if there can be more than one match per line

print whole variable contents if the number of lines are greater than N

How to print all lines if certain condition matches.
Example:
echo "$ip"
this is a sample line
another line
one more
last one
If this file has more than 3 lines then print the whole variable.
I tried:
echo $ip| awk 'NR==4'
last one
echo $ip|awk 'NR>3{print}'
last one
echo $ip|awk 'NR==12{} {print}'
this is a sample line
another line
one more
last one
echo $ip| awk 'END{x=NR} x>4{print}'
Need to achieve this:
If this file has more than 3 lines then print the whole file. I can do this using wc and bash but need a one liner.
The right way to do this (no echo, no pipe, no loops, etc.):
$ awk -v ip="$ip" 'BEGIN{if (gsub(RS,"&",ip)>2) print ip}'
this is a sample line
another line
one more
last one
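One caveat with -v is that awk interprets backslash escapes in the assigned value. A variant that passes the text through the environment avoids this. The sketch below is an assumption-laden illustration: count_and_print is a hypothetical helper name, and the threshold of more than 3 lines is hard-coded as in the question.

```shell
count_and_print() {
    # Pass the text through the environment; ENVIRON does no escape processing.
    ip="$1" awk 'BEGIN {
        s = ENVIRON["ip"]
        if (gsub(/\n/, "&", s) > 2)   # gsub returns the number of newlines; "&" leaves s unchanged
            print s
    }'
}

four=$(printf '%s\n' 'this is a sample line' 'another line' 'one more' 'last one')
three=$(printf '%s\n' 'one' 'two' 'three')

out_four=$(count_and_print "$four")     # four lines: printed in full
out_three=$(count_and_print "$three")   # three lines: nothing printed
printf '%s\n' "$out_four"
```

Counting newlines works because a variable with N lines and no trailing newline contains N-1 of them, so more than 2 newlines means more than 3 lines.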
You can use Awk as follows,
echo "$ip" | awk '{a[$0]; next}END{ if (NR>3) { for(i in a) print i }}'
one more
another line
this is a sample line
last one
you can also make the value 3 configurable from an awk variable,
echo "$ip" | awk -v count=3 '{a[$0]; next}END{ if (NR>count) { for(i in a) print i }}'
The idea is to store each line in the array with {a[$0]; next} as it is read; by the time the END block runs, NR holds the total number of lines. The lines are printed only if that count is greater than 3, or whatever value is passed in count. Note that for (i in a) visits the keys in an unspecified order, and that using $0 as the key collapses duplicate lines into one.
Always remember to double-quote variables in bash so that the shell does not word-split them.
Using James Brown's useful comment below to preserve the order of lines, do
echo "$ip" | awk -v count=3 '{a[NR]=$0; next}END{if(NR>count)for(i=1;i<=NR;i++)print a[i]}'
this is a sample line
another line
one more
last one
Another in awk. First test files:
$ cat 3
1
2
3
$ cat 4
1
2
3
4
Code:
$ awk 'NR<4{b=b (NR==1?"":ORS)$0;next} b{print b;b=""}1' 3 # look ma, no lines
[this line left intentionally blank. no wait!]
$ awk 'NR<4{b=b (NR==1?"":ORS)$0;next} b{print b;b=""}1' 4
1
2
3
4
Explained:
NR<4 { # for the first 3 records
b=b (NR==1?"":ORS) $0 # buffer them to b with ORS delimiter
next # proceed to next record
}
b { # if buffer has records, ie. NR>=4
print b # output buffer
b="" # and reset it
}1 # print all records after that
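The same streaming idea can take the threshold as a parameter. The sketch below is a variant, not the answer's code: print_if_more_than is a hypothetical helper name, and it buffers into an array instead of a string so that it does not depend on the buffer being non-empty.

```shell
# Print all of stdin only if it has more than n lines (n is the first argument).
print_if_more_than() {
    awk -v n="$1" '
        NR <= n { buf[NR] = $0; next }                          # hold back the first n lines
        NR == n + 1 { for (i = 1; i <= n; i++) print buf[i] }   # threshold passed: flush them in order
        1                                                       # print line n+1 and everything after it
    '
}

short=$(seq 3 | print_if_more_than 3)   # 3 lines: nothing is printed
long=$(seq 4 | print_if_more_than 3)    # 4 lines: everything is printed
printf '%s\n' "$long"
```

Only the first n lines are ever held in memory, which keeps this usable on long inputs.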