Insert pattern from current line in next line

Insert pattern from current line in next line - awk

I have this file :
>AX-899-Af-889-[A/G]
GTCCATTCAGGTAAAAAAAAAAAACATAACAATTGAAATTGCATGA
>AX-899-Af-889-[A/G]
GCAAACTATTTTCATGAATGAACTTCAGTTGATTGTGAGATG
>AX-899-Af-889-[G/T]
AAGGTAGAATGACACCATTAAACAGTAGGGAATTGGTCACAGAACTCT
I need to insert the pattern [X/X] present in the lines starting by > in the next line at the 10th position and replace this 10th character :
>AX-899-Af-889-[A/G]
GTCCATTCA[A/G]GTAAAAAAAAAAAACATAACAATTGAAATTGCATGA
>AX-899-Af-889-[A/G]
GCAAACTAT[A/G]TTCATGAATGAACTTCAGTTGATTGTGAGATG
>AX-899-Af-889-[G/T]
AAGGTAGAA[G/T]GACACCATTAAACAGTAGGGAATTGGTCACAGAACTCT
I can extract the pattern :
awk 'match($0, /^>/) {split($0,a,"-"); print; getline; print a[5]}1' file
Also replace the 10th character by a pattern ("N" for example) : sed 's/^\([ATCG].\{8\}\)[ATCG]/\1N/' file

With your shown samples, please try following awk.
awk '
BEGIN{ FS=OFS="-" }
/^>/ {
val=$NF
print
next
}
{
print substr($0,1,9) val substr($0,11)
val=""
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ FS=OFS="-" } ##Starting BEGIN section from here and setting FS and OFS as - here.
/^>/ { ##Checking condition if line starts from > then do following.
val=$NF ##Setting last field($NF) to val here.
print ##printing current line here.
next ##next will skip all further statements from here.
}
{
print substr($0,1,9) val substr($0,11) ##printing substring from 1st to 9 chars of current line.
##Followed by val and rest of values from 11th char to till last of current line.
val="" ##Nullifying val here.
}
' Input_file ##Mentioning Input_file name here.

Another:
$ awk '
BEGIN { FS=OFS="" } # each char is a field of its own
{
if(/^>/) # if record starts with a >
b=substr($0,length-4,5) # get last 5 chars to buffer
else # otherwise
$10=b # replace 10th char with buffer
}1' file # output
Some output:
>AX-899-Af-889-[A/G]
GTCCATTCA[A/G]GTAAAAAAAAAAAACATAACAATTGAAATTGCATGA
...

Using sed
$ cat sed.script
/^>/{ #If the line starts with >
p #Print it to create a duplicate line
s/[^[]*\([^]]*]\)/\1/ #Using back referencing, extract the pattern at the end
h #Store the pattern in hold space
d #Now stored in hold space, delete the duplicated line.
}
{
G #Append the contents of the hold space to that of the pattern space.
s/\n// #Remove the newline created by previous command
s/\(.\{9\}\).\([^[]*\)\(.*\)/\1\3\2/ #Replace 10th character with the content obtained from the hold space
}
$ sed -f sed.script input_file
>AX-899-Af-889-[A/G]
GTCCATTCA[A/G]GTAAAAAAAAAAAACATAACAATTGAAATTGCATGA
>AX-899-Af-889-[A/G]
GCAAACTAT[A/G]TTCATGAATGAACTTCAGTTGATTGTGAGATG
>AX-899-Af-889-[G/T]
AAGGTAGAA[G/T]GACACCATTAAACAGTAGGGAATTGGTCACAGAACTCT
Or as a one liner
$ sed '/^>/{p;s/[^[]*\([^]]*]\)/\1/;h;d};{G;s/\n//;s/\(.\{9\}\).\([^[]*\)\(.*\)/\1\3\2/}' input_file

Another idea using sed:
sed -E '/^>/{N;s/(.*-)(\[[^][]*])(\n.{9})./\1\2\3\2/}' file
Explanation
/^>/ If the line starts with >
N Append the next line to the pattern space
(.*-) Capture group 1, match till the last occurrence of -
(\[[^][]*]) Capture group 2, match from opening to closing square brackets [...]
(\n.{9}). Capture a newline and 9 characters in group 3 and match the 10th character
\1\2\n\3\2 The replacement using the backreferences to the capture groups including newline
Output
>AX-899-Af-889-[A/G]
GTCCATTCA[A/G]GTAAAAAAAAAAAACATAACAATTGAAATTGCATGA
>AX-899-Af-889-[A/G]
GCAAACTAT[A/G]TTCATGAATGAACTTCAGTTGATTGTGAGATG
>AX-899-Af-889-[G/T]
AAGGTAGAA[G/T]GACACCATTAAACAGTAGGGAATTGGTCACAGAACTCT

Related

Remove words that are less than two characters long AND don't contain a vowel

You can remove words less than length 2 with
sed -e 's/ [a-zA-Z0-9]\{1\} / /g'
although I'm not sure how remove only words that don't contain a vowel AND are less than length 2, in one command.
Thus a sentence
this is my w example of a sentence p
would end like
this is my example of a sentence

Could you please try following.
awk '
{
val=""
for(i=1;i<=NF;i++){
if($i!~/[aieou]/ && length($i)<2){ a="" }
else{ val=(val?val OFS:"")$i }
}
print val
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting an awk program from here.
{
val="" ##Nullifying val value here.
for(i=1;i<=NF;i++){ ##Starting a for loop from here.
if($i!~/[aieou]/ && length($i)<2){ a="" } ##Checking condition if field is NOT containing any vowels and length is lesser than 2 then do nothing.
else{ val=(val?val OFS:"")$i } ##Else(in case above condition is FALSE) create val which contains current field value.
}
print val ##Printing val here.
}
' Input_file ##Mentioning Input_file name here.

With a GNU sed, you can use
LC_ALL=C sed 's/[ \t]*\b[b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z]\b//g' file
# Or
LC_ALL=C sed 's/[ \t]*\b[b-df-hj-np-tv-z]\b//gI' file
It matches zero or more spaces or tabs, then a consonant word consisting of a single letter where \b mark word boundaries. LC_ALL=C is used to make sure the bracket expression ranges are compliant with the ASCII table codes.
See an online demo.

Removing lines which match with specific pattern from another file

I've got two files (I only show the beginning of these files) :
patterns.txt
m64071_201130_104452/13
m64071_201130_104452/26
m64071_201130_104452/46
m64071_201130_104452/49
m64071_201130_104452/113
m64071_201130_104452/147
myfile.txt
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/20/ccs
CAGTCGAGCGC
>m64071_201130_104452/22/ccs
CACACATCTCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
I should get an output like that :
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
I want to create a new file if the lines in patterns.txt match with the lines in myfile.txt . I need to keep the letters ACTG associated with the pattern in question. I use :
for i in $(cat patterns.txt); do
grep -A 1 $i myfile.txt; done > my_newfile.txt
It works, but it's very slow to create the new file... The files I work on are pretty large but not too much (14M for patterns.txt and 700M for myfile.txt).
I also tried to use grep -v because I have the another file which contains the others patterns of myfile.txt not present in patterns.txt. But it is the same "speed filling file" problem.
If you see a solution..

With your shown samples please try following. Written and tested in GNU awk.
awk '
FNR==NR{
arr[$0]
next
}
/^>/{
found=0
match($0,/.*\//)
if((substr($0,RSTART+1,RLENGTH-2)) in arr){
print
found=1
}
next
}
found
' patterns.txt myfile.txt
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when patterns.txt is being read.
arr[$0] ##Creating array with index of current line.
next ##next will skip all further statements from here.
}
/^>/{ ##Checking condition if line starts from > then do following.
found=0 ##Unsetting found here.
match($0,/.*\//) ##using match to match a regex to till / in current line.
if((substr($0,RSTART+1,RLENGTH-2)) in arr){ ##Checking condition if sub string of matched regex is present in arr then do following.
print ##Printing current line here.
found=1 ##Setting found to 1 here.
}
next ##next will skip all further statements from here.
}
found ##Printing the line if found is set.
' patterns.txt myfile.txt ##Mentioning Input_file names here.

Another awk:
$ awk -F/ ' # / delimiter
NR==FNR {
a[$1,$2] # hash patterns to a
next
}
{
if( tf=((substr($1,2),$2) in a) ) # if first part found in hash
print # output and store found result in var tf
if(getline && tf) # read next record and if previous record was found
print # output
}' patterns myfile
Output:
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
Edit: To output the ones not found:
$ awk -F/ ' # / delimiter
NR==FNR {
a[$1,$2] # hash patterns to a
next
}
{
if( tf=((substr($1,2),$2) in a) ) { # if first part found in hash
getline # consume the next record too
next
}
print # otherwise output
}' patterns myfile
Output:
>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/20/ccs
CAGTCGAGCGC
>m64071_201130_104452/22/ccs
CACACATCTCG

How to use Awk to output multiple consecutive lines

Input/File
A:1111
B:21222
C:33rf33
D:444dct4
E:5tdffe
F:4444we
G:j5555
H:46666
I:efe989ef
J:efee
Basically need to select the line that contains 2122 (i.e line B/2)
& line which starts with 444dct4 (i.e Line D) till efe989ef (i.e line I/9)
To summarize
Select Line B (contains 2122)
Select Line D (444dct4) till Line I
Desired Output
B:21222
D:444dct4
E:5tdffe
F:4444we
G:j5555
H:46666
I:efe989ef

Could you please try following, written and tested with shown samples in GNU awk. This one also takes care in case line's 2nd column 21222 in between range of 444dct4 to efe989ef then it will NOT re-print it.
awk -F':' '
$2=="21222" && !found{
print
next
}
$2=="444dct4"{
found=1
}
found
$2=="efe989ef"{
found=""
}
' Input_file
Explanation: Adding detailed explanation for above.
awk -F':' ' ##Starting awk program from here and setting field separator as colon here.
$2=="21222" && !found{ ##Checking if 2nd field is 21222 and found is NOT set then try following.
print ##Printing the current line here.
next ##next will skip all further statements from here.
}
$2=="444dct4"{ ##Checking condition if 2nd field is 444dct4 then do following.
found=1 ##Setting found to 1 here.
}
found ##Checking condition if found is SET then print that line.
$2=="efe989ef"{ ##Checking condition if 2nd field is efe989ef then do following.
found="" ##Nullifying found here.
}
' Input_file ##Mentioning Input_file name here.

$ awk -F: '
/2122/ { # line that contains 2122
print
next # to avoid duplicate printing if 2122 also in D-I
}
$2~/^444dct4/,$2~/efe989ef/ # starts with 444dct4 till efe989ef
' file
Output:
B:21222
D:444dct4
E:5tdffe
F:4444we
G:j5555
H:46666
I:efe989ef
Edit:
One-liner:
$ awk -F: '/2122/{print; next} $2~/^444dct4/,$2~/efe989ef/' file.txt

awk -v str1="2122" -v str2="444dct4" -v str3="efe989ef" 'BEGIN { flag=0 } $0 ~ str1 { print } $0 ~ str2 { flag=1 } $0 ~ str3 { flag=0;print;next } flag' file
For flexibility, set the line to find as str1, the from as str2 and the to as str3. Set a print flag (flag) to begin with. When 2122 is in the second field print. Then when the second field begins with 44dct4 set the print flag to one. When the second field starts with efe989ef, set the print flag to 0, print the line and skip to the next record. The variable flag will then determine what does and doesn't get printed.

Awk adding a pipe at the end of the first line

I have a little problem with my awk command.
The objective is to add a new column to my CSV :
The header must be "customer_id"
The next rows must be a customer_id from an array
Here is my csv :
email|event_date|id|type|cha|external_id|name|date
abcd#google.fr|2020-11-13 08:04:44|12|Invalid|Mail|disable|One|2020-11-13
dcab#google.fr|2020-11-13 08:04:44|13|Invalid|Mail|disable|Two|2020-11-13
I would like to have this output :
email|event_date|id|type|cha|external_id|name|date|customer_id
abcd#google.fr|2020-11-13 08:04:44|12|Invalid|Mail|disable|One|2020-11-13|20200
dcab#google.fr|2020-11-13 08:04:44|13|Invalid|Mail|disable|Two|2020-11-13|20201
But when I'm doing the awk I have this result :
awk -v a="$(echo "${customerIdList[#]}")" 'BEGIN{FS=OFS="|"} FNR==1{$(NF+1)="customer_id"} FNR>1{split(a,b," ")} {print $0,b[NR-1]}' test.csv
email|event_date|id|type|cha|external_id|name|date|customer_id|
abcd#google.fr|2020-11-13 08:04:44|12|Invalid|Mail|disable|One|2020-11-13|20200
dcab#google.fr|2020-11-13 08:04:44|13|Invalid|Mail|disable|Two|2020-11-13|20201
Where customerIdList = (20200 20201)
There is a pipe just after the "customer_id" header and I don't know why :(
Can someone help me ?

Could you please try following, written and tested with shown samples.
awk -v var="${customerIdList[*]}" '
BEGIN{
num=split(var,arr," ")
}
FNR==1{
print $0"|customer_id"
next
}
{
$0=$0 (arr[FNR-1]?"|" arr[FNR-1]:"")
}
1
' Input_file
Explanation: Adding detailed explanation for above.
awk -v var="${customerIdList[*]}" ' ##Starting awk program from here, creating var variable and passing array values to it.
BEGIN{ ##Starting BEGIN section of this program from here.
num=split(var,arr," ") ##Splitting var into arr with space delimiter.
}
FNR==1{ ##Checking condition if this is first line.
print $0"|customer_id" ##Then printing current line with string here.
next ##next will skip all further statements from here.
}
{
$0=$0 (arr[FNR-1]?"|" arr[FNR-1]:"") ##Checking condition if value of arr with current line number -1 is NOT NULL then add its value to current line with pipe else do nothing.
}
1 ##1 will print current line.
' Input_file ##Mentioning Input_file name here.

awk -v IdList="${customerIdList[*]}" 'BEGIN { split(IdList,ListId," ") } NR > 1 { $0=$0"|"ListId[NR-1]}1' file
An array will need to be created within awk and so pass the array as a space separated string and then use awk's split function to create the array IdList. The ignoring the headers (NR>1), set the line equal to the line plus the index of ListId array NR-1.

Remove duplicate line that contain an unknown string

file.txt
test (CODE:700|SIZE:2356)
asdasdad (CODE:700|SIZE:124)
xcvxcva (CODE:700|SIZE:8974)
asdavasdasdasd (CODE:700|SIZE:124)
link-categories (CODE:700|SIZE:8974)
edit (CODE:700|SIZE:124)
I need command get all duplicated SIZE: value , then remove all duplicated lines have this value except one line, i mean the output should be like this:
test (CODE:700|SIZE:2356)
xcvxcva (CODE:700|SIZE:8974)
asdavasdasdasd (CODE:700|SIZE:124)
i found this command sed '/SIZE:124/,+1 d' file.txt in Remove duplicate line only contain specific string
but this command removed all lines, what i need is remove duplicated lines except one line + this command will not search for duplicated SIZE: value, so it's not working!
What i need is:
search for duplicated SIZE: value like 124 above!
all lines have this value remove it, except one line or two line if you can.

It can be done using this simple awk also:
awk -F '[ |]+' '!seen[$NF]++{print}' file
test (CODE:700|SIZE:2356)
asdasdad (CODE:700|SIZE:124)
xcvxcva (CODE:700|SIZE:8974)

Could you please try following.
awk 'match($0,/SIZE:[0-9]+/){val=substr($0,RSTART,RLENGTH);array[val]=$0;val=""} END{for(key in array){print array[key]}}' Input_file
OR adding a non-one liner form of solution:
awk '
match($0,/SIZE:[0-9]+/){
val=substr($0,RSTART,RLENGTH)
array[val]=$0
val=""
}
END{
for(key in array){
print array[key]
}
}
' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
match($0,/SIZE:[0-9]+/){ ##Using match function to match regex of SIZE: then digits in each line here.
val=substr($0,RSTART,RLENGTH) ##Creating variable val whose value is sub string of current line which has matched value from current line.
array[val]=$0 ##Creating an array named array with index of variable val and value is current line.
val="" ##Nullify variable val here.
}
END{ ##Starting END block of this awk program here.
for(key in array){ ##Traversing through array here.
print array[key] ##Printing array value here.
}
}
' Input_file ##Mentioning Input_file name here.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Insert pattern from current line in next line - awk

Related

Remove words that are less than two characters long AND don't contain a vowel

Removing lines which match with specific pattern from another file

How to use Awk to output multiple consecutive lines

Awk adding a pipe at the end of the first line

Remove duplicate line that contain an unknown string

Categories

Resources