Awk adding a pipe at the end of the first line - awk

I have a little problem with my awk command.
The objective is to add a new column to my CSV :
The header must be "customer_id"
The next rows must be a customer_id from an array
Here is my csv :
email|event_date|id|type|cha|external_id|name|date
abcd#google.fr|2020-11-13 08:04:44|12|Invalid|Mail|disable|One|2020-11-13
dcab#google.fr|2020-11-13 08:04:44|13|Invalid|Mail|disable|Two|2020-11-13
I would like to have this output :
email|event_date|id|type|cha|external_id|name|date|customer_id
abcd#google.fr|2020-11-13 08:04:44|12|Invalid|Mail|disable|One|2020-11-13|20200
dcab#google.fr|2020-11-13 08:04:44|13|Invalid|Mail|disable|Two|2020-11-13|20201
But when I'm doing the awk I have this result :
awk -v a="$(echo "${customerIdList[#]}")" 'BEGIN{FS=OFS="|"} FNR==1{$(NF+1)="customer_id"} FNR>1{split(a,b," ")} {print $0,b[NR-1]}' test.csv
email|event_date|id|type|cha|external_id|name|date|customer_id|
abcd#google.fr|2020-11-13 08:04:44|12|Invalid|Mail|disable|One|2020-11-13|20200
dcab#google.fr|2020-11-13 08:04:44|13|Invalid|Mail|disable|Two|2020-11-13|20201
Where customerIdList = (20200 20201)
There is a pipe just after the "customer_id" header and I don't know why :(
Can someone help me ?

Could you please try following, written and tested with shown samples.
awk -v var="${customerIdList[*]}" '
BEGIN{
num=split(var,arr," ")
}
FNR==1{
print $0"|customer_id"
next
}
{
$0=$0 (arr[FNR-1]?"|" arr[FNR-1]:"")
}
1
' Input_file
Explanation: Adding detailed explanation for above.
awk -v var="${customerIdList[*]}" ' ##Starting awk program from here, creating var variable and passing array values to it.
BEGIN{ ##Starting BEGIN section of this program from here.
num=split(var,arr," ") ##Splitting var into arr with space delimiter.
}
FNR==1{ ##Checking condition if this is first line.
print $0"|customer_id" ##Then printing current line with string here.
next ##next will skip all further statements from here.
}
{
$0=$0 (arr[FNR-1]?"|" arr[FNR-1]:"") ##Checking condition if value of arr with current line number -1 is NOT NULL then add its value to current line with pipe else do nothing.
}
1 ##1 will print current line.
' Input_file ##Mentioning Input_file name here.

awk -v IdList="${customerIdList[*]}" 'BEGIN { split(IdList,ListId," ") } NR > 1 { $0=$0"|"ListId[NR-1]}1' file
An array will need to be created within awk and so pass the array as a space separated string and then use awk's split function to create the array IdList. The ignoring the headers (NR>1), set the line equal to the line plus the index of ListId array NR-1.

Related

Removing lines which match with specific pattern from another file

I've got two files (I only show the beginning of these files) :
patterns.txt
m64071_201130_104452/13
m64071_201130_104452/26
m64071_201130_104452/46
m64071_201130_104452/49
m64071_201130_104452/113
m64071_201130_104452/147
myfile.txt
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/20/ccs
CAGTCGAGCGC
>m64071_201130_104452/22/ccs
CACACATCTCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
I should get an output like that :
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
I want to create a new file if the lines in patterns.txt match with the lines in myfile.txt . I need to keep the letters ACTG associated with the pattern in question. I use :
for i in $(cat patterns.txt); do
grep -A 1 $i myfile.txt; done > my_newfile.txt
It works, but it's very slow to create the new file... The files I work on are pretty large but not too much (14M for patterns.txt and 700M for myfile.txt).
I also tried to use grep -v because I have the another file which contains the others patterns of myfile.txt not present in patterns.txt. But it is the same "speed filling file" problem.
If you see a solution..
With your shown samples please try following. Written and tested in GNU awk.
awk '
FNR==NR{
arr[$0]
next
}
/^>/{
found=0
match($0,/.*\//)
if((substr($0,RSTART+1,RLENGTH-2)) in arr){
print
found=1
}
next
}
found
' patterns.txt myfile.txt
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when patterns.txt is being read.
arr[$0] ##Creating array with index of current line.
next ##next will skip all further statements from here.
}
/^>/{ ##Checking condition if line starts from > then do following.
found=0 ##Unsetting found here.
match($0,/.*\//) ##using match to match a regex to till / in current line.
if((substr($0,RSTART+1,RLENGTH-2)) in arr){ ##Checking condition if sub string of matched regex is present in arr then do following.
print ##Printing current line here.
found=1 ##Setting found to 1 here.
}
next ##next will skip all further statements from here.
}
found ##Printing the line if found is set.
' patterns.txt myfile.txt ##Mentioning Input_file names here.
Another awk:
$ awk -F/ ' # / delimiter
NR==FNR {
a[$1,$2] # hash patterns to a
next
}
{
if( tf=((substr($1,2),$2) in a) ) # if first part found in hash
print # output and store found result in var tf
if(getline && tf) # read next record and if previous record was found
print # output
}' patterns myfile
Output:
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
Edit: To output the ones not found:
$ awk -F/ ' # / delimiter
NR==FNR {
a[$1,$2] # hash patterns to a
next
}
{
if( tf=((substr($1,2),$2) in a) ) { # if first part found in hash
getline # consume the next record too
next
}
print # otherwise output
}' patterns myfile
Output:
>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/20/ccs
CAGTCGAGCGC
>m64071_201130_104452/22/ccs
CACACATCTCG

How to use Awk to output multiple consecutive lines

Input/File
A:1111
B:21222
C:33rf33
D:444dct4
E:5tdffe
F:4444we
G:j5555
H:46666
I:efe989ef
J:efee
Basically need to select the line that contains 2122 (i.e line B/2)
& line which starts with 444dct4 (i.e Line D) till efe989ef (i.e line I/9)
To summarize
Select Line B (contains 2122)
Select Line D (444dct4) till Line I
Desired Output
B:21222
D:444dct4
E:5tdffe
F:4444we
G:j5555
H:46666
I:efe989ef
Could you please try following, written and tested with shown samples in GNU awk. This one also takes care in case line's 2nd column 21222 in between range of 444dct4 to efe989ef then it will NOT re-print it.
awk -F':' '
$2=="21222" && !found{
print
next
}
$2=="444dct4"{
found=1
}
found
$2=="efe989ef"{
found=""
}
' Input_file
Explanation: Adding detailed explanation for above.
awk -F':' ' ##Starting awk program from here and setting field separator as colon here.
$2=="21222" && !found{ ##Checking if 2nd field is 21222 and found is NOT set then try following.
print ##Printing the current line here.
next ##next will skip all further statements from here.
}
$2=="444dct4"{ ##Checking condition if 2nd field is 444dct4 then do following.
found=1 ##Setting found to 1 here.
}
found ##Checking condition if found is SET then print that line.
$2=="efe989ef"{ ##Checking condition if 2nd field is efe989ef then do following.
found="" ##Nullifying found here.
}
' Input_file ##Mentioning Input_file name here.
$ awk -F: '
/2122/ { # line that contains 2122
print
next # to avoid duplicate printing if 2122 also in D-I
}
$2~/^444dct4/,$2~/efe989ef/ # starts with 444dct4 till efe989ef
' file
Output:
B:21222
D:444dct4
E:5tdffe
F:4444we
G:j5555
H:46666
I:efe989ef
Edit:
One-liner:
$ awk -F: '/2122/{print; next} $2~/^444dct4/,$2~/efe989ef/' file.txt
awk -v str1="2122" -v str2="444dct4" -v str3="efe989ef" 'BEGIN { flag=0 } $0 ~ str1 { print } $0 ~ str2 { flag=1 } $0 ~ str3 { flag=0;print;next } flag' file
For flexibility, set the line to find as str1, the from as str2 and the to as str3. Set a print flag (flag) to begin with. When 2122 is in the second field print. Then when the second field begins with 44dct4 set the print flag to one. When the second field starts with efe989ef, set the print flag to 0, print the line and skip to the next record. The variable flag will then determine what does and doesn't get printed.

How to fetch a particular string using a sed command

I have an input string like below:
VAL:1|b:2|c:3|VAL:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|VAl:1234|name:mnp|VAL:91987654321
Like this, there are more than 1000 rows.
I want to fetch the value of the first parameter, i.e., the a field and d field, but for the d field I want only har:919876543210#abc.com.
I tried like this:
cat $filename | grep -v Orig |sed -e 's/['a:','d:']//g' |awk -F'|' -v OFS=',' '{print $1 "," $4}' >> $NGW_DATA_FILE
The output I got is below:
1,<har919876543210#abc.com>; tag=vy6r5BpcvQ
I want it like this,
1,har:919876543210#abc.com
Where did I make the mistake and how do I solve it?
EDIT: As per OP's change of Input_file and OP's comments, adding following now.
awk '
BEGIN{ FS="|"; OFS="," }
{
sub(/[^:]*:/,"",$1)
gsub(/^[^<]*|; .*/,"",$4)
gsub(/^<|>$/,"",$4)
print $1,$4
}' Input_file
With shown samples, could you please try following, written and tested with shown samples in GNU awk.
awk '
BEGIN{
FS="|"
OFS=","
}
{
val=""
for(i=1;i<=NF;i++){
split($i,arr,":")
if(arr[1]=="a" || arr[1]=="d"){
gsub(/^[^:]*:|; .*/,"",$i)
gsub(/^<|>$/,"",$i)
val=(val?val OFS:"")$i
}
}
print val
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS="|" ##Setting FS as pipe here.
OFS="," ##Setting OFS as comma here.
}
{
val="" ##Nullify val here(to avoid conflicts of its value later).
for(i=1;i<=NF;i++){ ##Traversing through all fields here
split($i,arr,":") ##Splitting current field into arr with delimiter by :
if(arr[1]=="a" || arr[1]=="d"){ ##Checking condition if first element of arr is either a OR d
gsub(/^[^:]*:|; .*/,"",$i) ##Globally substituting from starting till 1st occurrence of colon OR from semi colon to everything with NULL in $i.
val=(val?val OFS:"")$i ##Creating variable val which has current field value and keep adding in it.
}
}
print val ##printing val here.
}
' Input_file ##Mentioning Input_file name here.
You may also try this AWK script:
cat file
VAL:1|b:2|c:3|VAL:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|VAl:1234|name:mnp|VAL:91987654321
awk -F '[|;]' '{
s=""
for (i=1; i<=NF; ++i)
if ($i ~ /^VAL:/) {
gsub(/^[^:]+:|[<>]*/, "", $i)
s = (s == "" ? "" : s "," ) $i
}
print s
}' file
1,har:919876543210#abc.com
You can do the same thing with sed rather easily using Extended Regex, two capture groups and two back-references, e.g.
sed -E 's/^[^:]*:(\w+)[^<]*[<]([^>]+).*$/\1,\2/'
Explanation
's/find/replace/' standard substitution, where the find is;
^[^:]*: from the beginning skip through the first ':', then
(\w+) capture one or more word characters ([a-zA-Z0-9_]), then
[^<]*[<] consume zero or more characters not a '<', then the '<', then
([^>]+) capture everything not a '>', and
.*$ discard all remaining chars in line, then the replace is
\1,\2 reinsert the captured groups separated by a comma.
Example Use/Output
$ echo 'a:1|b:2|c:3|d:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|' |
sed -E 's/^[^:]*:(\w+)[^<]*[<]([^>]+).*$/\1,\2/'
1,har:919876543210#abc.com

Filter out FASTA files by specified sequence length in bash

There's a FASTA file assembly.fasta containing contig names and corresponding sequences:
>contig_1
CCAATACGGGCGCGCAGGCTTTCTATCGCGCGGCCGGCTTCGTCGAGGACGGGCGGCGCA
AGGATTACTACCGCAGCGGC
>contig_2
ATATAAACCTTATTCATCGTTTTCAGCCTAATTTTCCATTTAACAGGGATGATTTTCGTC
AAAATGCTGAGGCTTTACCAAGATTTTCTACCTTGCACCTTCAGAAAAAAATCATGGCAT
TTATAGACGAAATTCTCGAGAAA
>contig_3
CGTGATCTCGCCATTCGTGCCG
I want to get only contigs longer than 30 letters and get a new FASTA file assembly.filtered.fasta containing only those long sequences with contig names, in this format:
>contig_1
CCAATACGGGCGCGCAGGCTTTCTATCGCGCGGCCGGCTTCGTCGAGGACGGGCGGCGCA
AGGATTACTACCGCAGCGGC
>contig_2
ATATAAACCTTATTCATCGTTTTCAGCCTAATTTTCCATTTAACAGGGATGATTTTCGTC
AAAATGCTGAGGCTTTACCAAGATTTTCTACCTTGCACCTTCAGAAAAAAATCATGGCAT
TTATAGACGAAATTCTCGAGAAA
Using gnu-awk, you may use this simpler version:
awk -v RS='>[^\n]+\n' 'length() >= 30 {printf "%s", prt $0} {prt = RT}' file
>contig_1
CCAATACGGGCGCGCAGGCTTTCTATCGCGCGGCCGGCTTCGTCGAGGACGGGCGGCGCA
AGGATTACTACCGCAGCGGC
>contig_2
ATATAAACCTTATTCATCGTTTTCAGCCTAATTTTCCATTTAACAGGGATGATTTTCGTC
AAAATGCTGAGGCTTTACCAAGATTTTCTACCTTGCACCTTCAGAAAAAAATCATGGCAT
TTATAGACGAAATTCTCGAGAAA
A very quick way to achieve what you are after is:
awk -v n=30 '/^>/{ if(l>n) print b; b=$0;l=0;next }
{l+=length;b=b ORS $0}END{if(l>n) print b }' file
You might be also interested in BioAwk, it is an adapted version of awk which is tuned to process FASTA files
bioawk -c fastx -v '(length($seq)>30){print ">" $name ORS $seq}' file.fasta
Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language",
by Al Aho, Brian Kernighan, and Peter Weinberger
(Addison-Wesley, 1988, ISBN 0-201-07981-X)
. I'm not sure if this version is compatible with POSIX.
Could you please try following, tested and written with shown samples.
awk '
/^>/{
if(sign_val && strLen>=30){
print sign_val ORS line
}
strLen=line=""
sign_val=$0
next
}
{
strLen+=length($0)
line=(line?line ORS:"")$0
}
END{
if(sign_val && strLen>=30){
print sign_val ORS line
}
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
/^>/{ ##Checking condition if line starts from > then do following.
if(sign_val && strLen>=30){ ##Checking if sign_val is SET and steLen is SET then do following.
print sign_val ORS line ##Printing sign_val ORS and line here.
}
strLen=line="" ##Nullify variables steLen and line here.
sign_val=$0 ##Setting sign_val to current line here.
next ##next will skip all further statements from here.
}
{
strLen+=length($0) ##Checking length of line and keep adding it here.
line=(line?line ORS:"")$0 ##Creating line variable and keep appending it to it with new line.
}
END{ ##Starting END block of this program from here.
if(sign_val && strLen>=30){ ##Checking if sign_val is SET and steLen is SET then do following.
print sign_val ORS line ##Printing sign_val ORS and line here.
}
}
' Input_file ##mentioning Input_file name here.

Remove duplicate line that contain an unknown string

file.txt
test (CODE:700|SIZE:2356)
asdasdad (CODE:700|SIZE:124)
xcvxcva (CODE:700|SIZE:8974)
asdavasdasdasd (CODE:700|SIZE:124)
link-categories (CODE:700|SIZE:8974)
edit (CODE:700|SIZE:124)
I need command get all duplicated SIZE: value , then remove all duplicated lines have this value except one line, i mean the output should be like this:
test (CODE:700|SIZE:2356)
xcvxcva (CODE:700|SIZE:8974)
asdavasdasdasd (CODE:700|SIZE:124)
i found this command sed '/SIZE:124/,+1 d' file.txt in Remove duplicate line only contain specific string
but this command removed all lines, what i need is remove duplicated lines except one line + this command will not search for duplicated SIZE: value, so it's not working!
What i need is:
search for duplicated SIZE: value like 124 above!
all lines have this value remove it, except one line or two line if you can.
It can be done using this simple awk also:
awk -F '[ |]+' '!seen[$NF]++{print}' file
test (CODE:700|SIZE:2356)
asdasdad (CODE:700|SIZE:124)
xcvxcva (CODE:700|SIZE:8974)
Could you please try following.
awk 'match($0,/SIZE:[0-9]+/){val=substr($0,RSTART,RLENGTH);array[val]=$0;val=""} END{for(key in array){print array[key]}}' Input_file
OR adding a non-one liner form of solution:
awk '
match($0,/SIZE:[0-9]+/){
val=substr($0,RSTART,RLENGTH)
array[val]=$0
val=""
}
END{
for(key in array){
print array[key]
}
}
' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
match($0,/SIZE:[0-9]+/){ ##Using match function to match regex of SIZE: then digits in each line here.
val=substr($0,RSTART,RLENGTH) ##Creating variable val whose value is sub string of current line which has matched value from current line.
array[val]=$0 ##Creating an array named array with index of variable val and value is current line.
val="" ##Nullify variable val here.
}
END{ ##Starting END block of this awk program here.
for(key in array){ ##Traversing through array here.
print array[key] ##Printing array value here.
}
}
' Input_file ##Mentioning Input_file name here.