I have this file:
>AX-89948491
CACCTTTT[C/T]ATTTCATTCCTAC
>AX-89940152
AGATGAGA[A/G]TAAAGCTTCTGTC
>AX-89922107
ACAGAAAT[G/T]TATAGATATTACT
I need to find the pattern "[A-Z]/[A-Z]" (it is necessarily present on every second line) and put it on the line before, like this:
>AX-89948491-[C/T]
CACCTTTT[C/T]ATTTCATTCCTAC
>AX-89940152-[A/G]
AGATGAGA[A/G]TAAAGCTTCTGTC
>AX-89922107-[G/T]
ACAGAAAT[G/T]TATAGATATTACT
I did:
awk 'tmp=/\[[A-Z]\/[A-Z]]/{if (a && a !~ /\[[A-Z]\/[A-Z]]/) print a"-"$tmp; print} {a=$0}' my_file
But that gives the entire line, not the pattern.
Any help?
You could print the previous line plus the matched part of the current line, given that the pattern is present every 2 lines:
awk '
match($0, /\[[A-Z]\/[A-Z]]/) {
m = substr($0, RSTART, RLENGTH)
print prev "-" m ORS $0
}
{prev = $0}
' my_file
Output
>AX-89948491-[C/T]
CACCTTTT[C/T]ATTTCATTCCTAC
>AX-89940152-[A/G]
AGATGAGA[A/G]TAAAGCTTCTGTC
>AX-89922107-[G/T]
ACAGAAAT[G/T]TATAGATATTACT
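As a side note, awk's match() sets the built-in variables RSTART (1-based position of the match) and RLENGTH (its length), which is what makes the substr() extraction above work. A minimal sketch on one sample line:

```shell
# match() sets RSTART/RLENGTH; substr() then extracts the matched text
printf 'CACCTTTT[C/T]ATTTCATTCCTAC\n' |
awk 'match($0, /\[[A-Z]\/[A-Z]]/) {
    print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
}'
# prints: 9 5 [C/T]
```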
With your shown samples, here is a tac + awk + tac solution. tac prints the file in reverse line order (bottom to top), so each sequence line is seen before its header. The awk program then matches \[[A-Z]/[A-Z]] on the sequence line, saves the matched value in the variable val, and prints that line; when the match function finds nothing (i.e. on the header lines where we need to add the value), it prints the line followed by - and val. Passing this output through tac again restores the exact format in which OP has shown us the samples.
tac Input_file |
awk '
match($0,/\[[A-Z]\/[A-Z]]/){
val=substr($0,RSTART,RLENGTH)
print
next
}
{
print $0"-"val
}
' | tac
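To sanity-check the pipeline, here is a self-contained run on one record from the question (GNU coreutils tac assumed):

```shell
printf '>AX-89948491\nCACCTTTT[C/T]ATTTCATTCCTAC\n' |
tac |
awk 'match($0,/\[[A-Z]\/[A-Z]]/){val=substr($0,RSTART,RLENGTH); print; next}
     {print $0"-"val}' |
tac
# prints:
# >AX-89948491-[C/T]
# CACCTTTT[C/T]ATTTCATTCCTAC
```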
Related
I have this file :
>AX-899-Af-889-[A/G]
GTCCATTCAGGTAAAAAAAAAAAACATAACAATTGAAATTGCATGA
>AX-899-Af-889-[A/G]
GCAAACTATTTTCATGAATGAACTTCAGTTGATTGTGAGATG
>AX-899-Af-889-[G/T]
AAGGTAGAATGACACCATTAAACAGTAGGGAATTGGTCACAGAACTCT
I need to insert the pattern [X/X] from the lines starting with > into the following line at the 10th position, replacing the 10th character:
>AX-899-Af-889-[A/G]
GTCCATTCA[A/G]GTAAAAAAAAAAAACATAACAATTGAAATTGCATGA
>AX-899-Af-889-[A/G]
GCAAACTAT[A/G]TTCATGAATGAACTTCAGTTGATTGTGAGATG
>AX-899-Af-889-[G/T]
AAGGTAGAA[G/T]GACACCATTAAACAGTAGGGAATTGGTCACAGAACTCT
I can extract the pattern:
awk 'match($0, /^>/) {split($0,a,"-"); print; getline; print a[5]}1' file
I can also replace the 10th character with a placeholder ("N" for example): sed 's/^\([ATCG].\{8\}\)[ATCG]/\1N/' file
With your shown samples, please try the following awk.
awk '
BEGIN{ FS=OFS="-" }
/^>/ {
val=$NF
print
next
}
{
print substr($0,1,9) val substr($0,11)
val=""
}
' Input_file
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
BEGIN{ FS=OFS="-" } ##Starting BEGIN section from here and setting FS and OFS as - here.
/^>/ { ##Checking condition if line starts from > then do following.
val=$NF ##Setting last field($NF) to val here.
print ##printing current line here.
next ##next will skip all further statements from here.
}
{
print substr($0,1,9) val substr($0,11) ##Printing chars 1 to 9 of the current line, followed by val and the rest of the line from the 11th char to the end.
val="" ##Nullifying val here.
}
' Input_file ##Mentioning Input_file name here.
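The substr() splice is worth seeing in isolation: keep characters 1-9, insert the value, then resume at character 11, which drops the original 10th character. A minimal sketch with made-up data:

```shell
awk 'BEGIN {
    s = "0123456789ABC"
    # chars 1-9, inserted value, chars 11 onward; char 10 ("9") is dropped
    print substr(s, 1, 9) "[X/Y]" substr(s, 11)
}'
# prints: 012345678[X/Y]ABC
```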
Another:
$ awk '
BEGIN { FS=OFS="" } # each char is a field of its own
{
if(/^>/) # if record starts with a >
b=substr($0,length-4,5) # get last 5 chars to buffer
else # otherwise
$10=b # replace 10th char with buffer
}1' file # output
Some output:
>AX-899-Af-889-[A/G]
GTCCATTCA[A/G]GTAAAAAAAAAAAACATAACAATTGAAATTGCATGA
...
Using sed
$ cat sed.script
/^>/{ #If the line starts with >
p #Print it to create a duplicate line
s/[^[]*\([^]]*]\)/\1/ #Using back referencing, extract the pattern at the end
h #Store the pattern in hold space
d #Now stored in hold space, delete the duplicated line.
}
{
G #Append the contents of the hold space to that of the pattern space.
s/\n// #Remove the newline created by previous command
s/\(.\{9\}\).\([^[]*\)\(.*\)/\1\3\2/ #Replace 10th character with the content obtained from the hold space
}
$ sed -f sed.script input_file
>AX-899-Af-889-[A/G]
GTCCATTCA[A/G]GTAAAAAAAAAAAACATAACAATTGAAATTGCATGA
>AX-899-Af-889-[A/G]
GCAAACTAT[A/G]TTCATGAATGAACTTCAGTTGATTGTGAGATG
>AX-899-Af-889-[G/T]
AAGGTAGAA[G/T]GACACCATTAAACAGTAGGGAATTGGTCACAGAACTCT
Or as a one liner
$ sed '/^>/{p;s/[^[]*\([^]]*]\)/\1/;h;d};{G;s/\n//;s/\(.\{9\}\).\([^[]*\)\(.*\)/\1\3\2/}' input_file
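A quick sanity check of the one-liner on a shortened two-line sample (GNU sed assumed for the `;}` block syntax):

```shell
printf '>AX-899-Af-889-[A/G]\nGTCCATTCAGGTAAA\n' |
sed '/^>/{p;s/[^[]*\([^]]*]\)/\1/;h;d};{G;s/\n//;s/\(.\{9\}\).\([^[]*\)\(.*\)/\1\3\2/}'
# prints:
# >AX-899-Af-889-[A/G]
# GTCCATTCA[A/G]GTAAA
```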
Another idea using sed:
sed -E '/^>/{N;s/(.*-)(\[[^][]*])(\n.{9})./\1\2\3\2/}' file
Explanation
/^>/ If the line starts with >
N Append the next line to the pattern space
(.*-) Capture group 1, match till the last occurrence of -
(\[[^][]*]) Capture group 2, match from opening to closing square brackets [...]
(\n.{9}). Capture a newline and 9 characters in group 3 and match the 10th character
\1\2\3\2 The replacement using backreferences to the capture groups; group 3 already contains the newline
Output
>AX-899-Af-889-[A/G]
GTCCATTCA[A/G]GTAAAAAAAAAAAACATAACAATTGAAATTGCATGA
>AX-899-Af-889-[A/G]
GCAAACTAT[A/G]TTCATGAATGAACTTCAGTTGATTGTGAGATG
>AX-899-Af-889-[G/T]
AAGGTAGAA[G/T]GACACCATTAAACAGTAGGGAATTGGTCACAGAACTCT
I have a little problem with my awk command.
The objective is to add a new column to my CSV :
The header must be "customer_id"
The next rows must be a customer_id from an array
Here is my csv :
email|event_date|id|type|cha|external_id|name|date
abcd@google.fr|2020-11-13 08:04:44|12|Invalid|Mail|disable|One|2020-11-13
dcab@google.fr|2020-11-13 08:04:44|13|Invalid|Mail|disable|Two|2020-11-13
I would like to have this output :
email|event_date|id|type|cha|external_id|name|date|customer_id
abcd@google.fr|2020-11-13 08:04:44|12|Invalid|Mail|disable|One|2020-11-13|20200
dcab@google.fr|2020-11-13 08:04:44|13|Invalid|Mail|disable|Two|2020-11-13|20201
But when I'm doing the awk I have this result :
awk -v a="$(echo "${customerIdList[@]}")" 'BEGIN{FS=OFS="|"} FNR==1{$(NF+1)="customer_id"} FNR>1{split(a,b," ")} {print $0,b[NR-1]}' test.csv
email|event_date|id|type|cha|external_id|name|date|customer_id|
abcd@google.fr|2020-11-13 08:04:44|12|Invalid|Mail|disable|One|2020-11-13|20200
dcab@google.fr|2020-11-13 08:04:44|13|Invalid|Mail|disable|Two|2020-11-13|20201
Where customerIdList = (20200 20201)
There is a pipe just after the "customer_id" header and I don't know why :(
Can someone help me?
Could you please try the following, written and tested with your shown samples.
awk -v var="${customerIdList[*]}" '
BEGIN{
num=split(var,arr," ")
}
FNR==1{
print $0"|customer_id"
next
}
{
$0=$0 (arr[FNR-1]?"|" arr[FNR-1]:"")
}
1
' Input_file
Explanation: a detailed explanation of the above.
awk -v var="${customerIdList[*]}" ' ##Starting awk program from here, creating var variable and passing array values to it.
BEGIN{ ##Starting BEGIN section of this program from here.
num=split(var,arr," ") ##Splitting var into arr with space delimiter.
}
FNR==1{ ##Checking condition if this is first line.
print $0"|customer_id" ##Then printing current line with string here.
next ##next will skip all further statements from here.
}
{
$0=$0 (arr[FNR-1]?"|" arr[FNR-1]:"") ##Checking condition if value of arr with current line number -1 is NOT NULL then add its value to current line with pipe else do nothing.
}
1 ##1 will print current line.
' Input_file ##Mentioning Input_file name here.
awk -v IdList="${customerIdList[*]}" 'BEGIN { split(IdList,ListId," ") } NR > 1 { $0=$0"|"ListId[NR-1]}1' file
An array needs to be created within awk, so pass the shell array as a space-separated string and then use awk's split function to create the array ListId. Then, ignoring the header (NR > 1), set the line equal to the line plus a pipe and element NR-1 of ListId.
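As to why the original attempt printed a stray pipe after the header: print $0, b[NR-1] runs on every line, and on the header line b has no element 0, so the comma still emits OFS followed by an empty string. A minimal illustration:

```shell
# An uninitialized array element prints as "", but the comma still inserts OFS
printf 'header\n' | awk 'BEGIN{OFS="|"} {print $0, b[NR-1]}'
# prints: header|
```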
I'm using awk to process the following [sample] of data:
id,desc
168048,Prod_A
217215,Prod_C
217215,Prod_B
168050,Prod_A
168050,Prod_F
168050,Prod_B
What I'm trying to do is to create a column 'item' enumerating the lines within the same 'id':
id,desc,item
168048,Prod_A,#1
217215,Prod_C,#1
217215,Prod_B,#2
168050,Prod_A,#1
168050,Prod_F,#2
168050,Prod_B,#3
Here what I've tried:
BEGIN {
FS = ","
a = 1
}
NR != 1 {
if (id != $1) {
id = $1
printf "%s,%s\n", $0, "#"a
}
else {
printf "%s,%s\n", $0, "#"a++
}
}
But it messes the numbering:
168048,Prod_A,#1
217215,Prod_C,#1
217215,Prod_B,#1
168050,Prod_A,#2
168050,Prod_F,#2
168050,Prod_B,#3
Could someone give me some hints?
P.S. The line order doesn't matter
$ awk -F, 'NR>1{print $0,"#"++c[$1]}' OFS=, file
168048,Prod_A,#1
217215,Prod_C,#1
217215,Prod_B,#2
168050,Prod_A,#1
168050,Prod_F,#2
168050,Prod_B,#3
How it works
-F,
This sets the field separator on input to a comma.
NR>1{...}
This limits the commands in braces to lines other than the first, that is, the one with the header.
print $0,"#"++c[$1]
This prints the line followed by # and a count of the number of times that we have seen the first column.
Associative array c keeps a count of the number of times that an id has been seen. For every line, we increment the count for id $1 by 1. Because ++ precedes c[$1], the increment is done before the value is printed.
OFS=,
This sets the field separator on output to a comma.
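The pre-/post-increment distinction on an uninitialized awk variable can be seen directly:

```shell
awk 'BEGIN {
    print ++n    # pre-increment: increments first, then prints 1
    print ++n    # prints 2
    print m++    # post-increment: prints 0 (old value), then increments
    print m      # prints 1
}'
```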
Printing a new header as well
$ awk -F, 'NR==1{print $0,"item"} NR>1{print $0,"#"++c[$1]}' OFS=, file
id,desc,item
168048,Prod_A,#1
217215,Prod_C,#1
217215,Prod_B,#2
168050,Prod_A,#1
168050,Prod_F,#2
168050,Prod_B,#3
I have two files, par1.txt, par2.txt. I want to look at the first field or column of both files, compare them and then if they match print the record or row where they are matched.
Example files:
par1.txt
ocean;stuff about an ocean;definitions of oeans
park;stuff about parks;definitions of parks
ham;stuff about ham;definitions of ham
par2.txt
hand,stuff about hands,definitions of hands
bread,stuff about bread,definitions of bread
ocean,different stuff about an ocean,difference definitions of oceans
ham,different stuff about ham,different definitions of ham
As for my output I want something like
ocean:stuff about an ocean:definitions of oeans
ocean:different stuff about an ocean:difference definitions of oceans
ham:different stuff about ham:different definitions of ham
ham:stuff about ham:definitions of ham
The FS in the files are different, as shown in the example.
The output FS doesn't have to be ":" it just can't be a space.
Using awk:
awk -v OFS=":" '
{ $1 = $1 }
NR==FNR { lines[$1] = $0; next }
($1 in lines) { print lines[$1] RS $0 }
' FS=";" par1.txt FS="," par2.txt
Output:
ocean:stuff about an ocean:definitions of oeans
ocean:different stuff about an ocean:difference definitions of oceans
ham:stuff about ham:definitions of ham
ham:different stuff about ham:different definitions of ham
Explanation:
Set the output field separator to :. If you want space-delimited output, you don't need to set -v OFS.
$1=$1 helps us reformat the entire line so that it takes the value of OFS while being reconstructed.
NR==FNR reads the first file into the array.
When we process the second file, we look for the first column in our array. If it is present, we print the line from the array and the line from the second file.
FS=";" par1.txt FS="," par2.txt is a technique where you can specify different field separator for different files.
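The per-file FS trick works because awk evaluates var=value operands at the point they appear in the argument list, just before reading the next file. A small sketch using hypothetical scratch files:

```shell
tmp1=$(mktemp); tmp2=$(mktemp)    # scratch files for the demo
printf 'a;1\n' > "$tmp1"          # semicolon-separated
printf 'a,2\n' > "$tmp2"          # comma-separated
awk '{print $2}' FS=';' "$tmp1" FS=',' "$tmp2"
# prints 1 then 2 -- each file is split with its own separator
rm -f "$tmp1" "$tmp2"
```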
If you have repeated first-column values in both files and would like to capture everything, then use the following. It is similar logic, but we keep all lines in the array and print at the end.
awk -v OFS=":" '
{ $1 = $1 }
NR==FNR {
lines[$1] = (lines[$1] ? lines[$1] RS $0 : $0);
next
}
($1 in lines) {
lines[$1] = lines[$1] RS $0;
seen[$1]++
}
END { for (patt in seen) print lines[patt] }
' FS=";" par1.txt FS="," par2.txt
Edited Answer
Based on your comments, I believe you have more than 2 files, that the files sometimes have commas and sometimes semicolons as separators, and that you want to print any number of lines which have matching first fields as long as there is more than one line with that first field. If so, I think you want this:
awk -F, '
{
gsub(/;/,",");$0=$0; # Replace ";" with "," and reparse line using new field sep
sep=""; # Preset record separator to blank
if(counts[$1]++) sep="\n"; # Add newline if anything already stored in records[$1]
records[$1] = records[$1] sep $0; # Append this record to other records with same key
}
END { for (x in counts) if (counts[x]>1) print records[x] }' par*.txt
Original Answer
I came up with this:
awk -F';' '
FNR==NR {x[$1]=$0; next}
$1 in x {printf "%s\n%s\n",$0,x[$1]}' par1.txt <(sed 's/,/;/g' par2.txt)
Read in par1.txt and store it in array x[] indexed by first field. Replace the commas in par2.txt with semicolons so that the separators match. As each line of par2.txt is read, see if it is in the stored array x[] and, if it is, print the stored line from x[] and the current line.
I'm trying to run the command below, and it's giving me the error shown. Thoughts on how to fix it? I would rather have this be a one-line command than a script.
grep "id\": \"http://room.event.assist.com/event/room/event/" failed_events.txt |
head -n1217 |
awk -F/ ' { print $7 } ' |
awk -F\" ' { print "url \= \"http\:\/\/room\.event\.assist\.com\/event\/room\/event\/'{ print $1 }'\?schema\=1\.3\.0\&form\=json\&pretty\=true\&token\=582EVTY78-03iBkTAf0JAhwOBx\&account\=room_event\"" } '
awk: non-terminated string url = "ht... at source line 1
context is
>>> <<<
awk: giving up
source line number 2
The line below exports out a single column of ID's:
grep "id\": \"http://room.event.assist.com/event/room/event/" failed_events.txt |
head -n1217 |
awk -F/ ' { print $7 } '
156512145
898545774
454658748
898432413
I'm looking to get the ID's above into a string like so:
" url = "string...'ID'string"
Take a look at what you have in the last awk:
awk -F\"
' #single start here
{ print " #double starts for print, no ends
url \= \"http\:\/\/room\.event\.assist\.com\/event\/room\/event\/
' #single ends here???
{ print $1 }'..... #single again??? ...
(rest codes)
And do you really want that literal { print } printed out? I don't think so. Why were you nesting print?
Most of the elements of your pipe can be expressed right inside awk.
I can't tell exactly what you want to do with the last awk script, but here are some points:
Your "grep" is really just looking for a string of text, not a regexp.
You can save time and simplify things if you use awk's index() function instead of a RE.
Output formats are almost always best handled using printf().
Since you haven't provided your input data, I can't test this code, so you'll need to adapt it if it doesn't work. But here goes:
awk -F/ '
BEGIN {
string="id\": \"http://room.event.assist.com/event/room/event/";
fmt="url = http://example.com/event/room/event/%s?schema=whatever\n";
}
count == 1217 { nextfile; }
index($0, string) {
split($7, a, "\"");
printf(fmt, a[1]); # split() arrays are 1-based
count++;
}' failed_events.txt
If you like, you can use awk's -v option to pass in the string variable from a shell script calling this awk script. Or if this is a stand-alone awk script (using #! shebang), you could refer to command line options with ARGV.
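One caveat worth remembering with that sketch: awk's split() fills a[1]..a[n] (1-based), so the first piece is a[1], not a[0]; and index() returns a 1-based position, or 0 when the substring is absent:

```shell
awk 'BEGIN {
    n = split("156512145\", rest", a, "\"")   # split on double quotes
    print n, a[1]                             # pieces are a[1]..a[n]
    print index("abc", "c")                   # 1-based position: 3
    print index("abc", "z")                   # not found: 0
}'
```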