How can I convert this format of S. tuberosum gene sequence ID - Soltu.DM.10G013850.1 - to Entrez ID? - sequence

How can I convert this format of S. tuberosum gene sequence ID - Soltu.DM.10G013850.1 - to Entrez ID? I have a problem with annotation due to inadequate gene ID's.

Related

Iterating through Pandas dataframe and dictionary items

here's a tough one.
Problem Introduction:
I'm working with two different files: a GFF3, which is basically a "9 columns" TSV, and a FASTA, which is a text file.
Now, importing the GFF3 file with gffpandas it looks like this:
seq_id source type start end score strand phase attributes
1 ctg.s1.000000F_arrow prokka gene 56.0 244.0 . + . NHDIEHOJ_00001
3 ctg.s1.000000F_arrow prokka gene 902.0 2167.0 . - . NHDIEHOJ_00002
5 ctg.s1.000001F_arrow prokka gene 2363.0 2905.0 . - . NHDIEHOJ_00003
7 ctg.s1.000003F_arrow prokka gene 2916.0 3947.0 . - . NHDIEHOJ_00004
9 ctg.s2.000000F_arrow prokka gene 4353.0 5174.0 . + . NHDIEHOJ_00005
Dropping the seq_id column I got the following "values":
ctg.s1.000000F_arrow
ctg.s1.000001F_arrow
ctg.s1.000003F_arrow
ctg.s2.000000F_arrow
Now let's step to the FASTA file, which looks like this:
>ctg.s1.000000F_arrow
CCGGAACATCGTCCTCATCCGCAAAGTCGAGCTCTGCCTCGATCATTGCACGCGAATGGGTCAGCCGTCGGGCCCAACCG
GCATAGAGTGCGGACTGTCCGCCACCGGACTGCTCTATGGCGAGACGACGCTGCATTTCCGTTTCTGCCGCAATCAGGTC
>ctg.s1.000001F_arrow
ACGCCGGCTGCAACTTTGAGAAGATGTGGCGATGTCGACCGCTGCATCCCGCCCTTCTCTGCAGAATTTTCCAGCTGTCC
GAGGACATTGGCAAAAAGGCCCTTGGAATCCTTGCGGCTCATTCTCTCCCCCATGCCTTCCAGAAGAGGCCCTCGAGTTC
>ctg.s1.000003F_arrow
GGCGCTGGTTTTCCCCGACACCTCGCCGCGCGGCGAGGGCGTGGCTGACGACGAGGCTTATGATCTCGGTCAGGGTGCGG
GCTTCTATGTCAATGCGACGCAGAAGCCCTGGTCGCCGCACTATCGCATGTATGATTATATCGTCACCGAATTGCCCGCC
>ctg.s2.000000F_arrow
GCGCTCGACGGCATGCCCGTACGCGGCCGATCCTGCGCCGCTTCCTTAACCTTAGCTGCGGATGGAAAGTCGTCCTCGGA
GTTCGGCTCGCAAACGCTTTCGAGCGCGCAATTGACGACGATGTCGTACCCAACTTAGATCGCCGAACGCCATGAGGTCG
Assuming that the uppercased text part is much longer than two lines, as you can see, the text part characterised by ">" symbol presents the same values of seq_id GFF3 column.
As a matter of fact I wrote few line to assign to the FASTA file a dictionary in which the "key" is the text part characterised by ">" symbol, the "item" is the uppercased part.
Problem processing
For each attributes value inside the dataframe there's a corresponding start and end value which is an interval of the corresponding seq_id. I'd like to extract from the FASTA file that exact interval but with respect to the seq_id value which refers to. I mean the the interval 56-244 must be searched only for the FASTA sequence of ctg.s1.000000F_arrow, as well as 902-2167.
The final goal is to obtain a dataframe which has an additional 10th column (es. 'sequence') that contains the corresponding FASTA sequence of the interval, like this:
seq_id source type start end score strand phase attributes sequence
1 ctg.s1.000000F_arrow prokka gene 56.0 244.0 . + . NHDIEHOJ_00001 CCGGAACATCGTCCTCATCCG
3 ctg.s1.000000F_arrow prokka gene 902.0 2167.0 . - . NHDIEHOJ_00002 CAAGGACATCGTGATCAATTC
5 ctg.s1.000001F_arrow prokka gene 2363.0 2905.0 . - . NHDIEHOJ_00003 TCGCCGCGCGGCGAGTGATTA
7 ctg.s1.000003F_arrow prokka gene 2916.0 3947.0 . - . NHDIEHOJ_00004 TCGAGCGCGCAATTGACGACG
9 ctg.s2.000000F_arrow prokka gene 4353.0 5174.0 . + . NHDIEHOJ_00005 AGATCGCCGAACGCCATATTT
N.B. The sequences in sequence have been randomly typed of the same length but will differ proportionally to the end - start dimension for each attributes value.
I hope I was clear in the explanation.
Thank you everybody for the help.
Assuming df the DataFrame and dic the dictionary and the sequence indexing to be starting at 1 (not 0 like python indexing):
df['sequence'] = [dic[k][int(i-1):int(j)] for k, i, j in
zip(df['seq_id'], df['start'], df['end'])]

Remove just strings from the entries in my first column of data frame

I have strings and numbers in my first column of a data frame:
rn
AT457
X5377
X3477
I want to remove just the strings and keep the numbers from each entry in the column called rn.
Any help is appreciated.
Use a regular expression to do this.
For example, with R :
## Sample data :
df=data.frame(rn=c("AT457","X5377","X3477"))
## Replace the letters with *nothing* ('\D' is used to identify non-digit characters)
df$rn_strip=gsub('\\D',"",df$rn)
## Output :
rn rn_strip
1 AT457 457
2 X5377 5377
3 X3477 3477

How to extract just numeric value with REGEXP_EXTRACT in BigQuery?

I am trying to extract just the numbers from a particular column in BigQuery.
The fields concerned have this format: value = "Livraison_21J|Relais_19J" or "RELAIS_15 DAY"
I am trying to extract the number of days for each value preceeded by the keyword "Relais".
The days range from 1 to 100.
I used this to do so:
SELECT CAST(REGEXP_EXTRACT(delivery, r"RELAIS_([0-9]+J)") as string) as relayDay
FROM TABLE
I want to be able to extract just the number of days regardless of the the string that comes after the numbers, be it "J" or "DAY".
Sample data :
RETRAIT_2H|LIVRAISON_5J|RELAIS_5J | 5J
LIVRAISON_21J|RELAIS_19J | 19J
LIVRAISON_21J|RELAIS_19J | 19J
RETRAIT_2H|LIVRAISON_3J|RELAIS_3J | 3J
You may use
REGEXP_EXTRACT(delivery, r"(?:.*\D)?(\d+)\s*(?:J|DAY)")
See the regex demo
Details
(?:.*\D)? - an optional non-capturing group that matches 0+ chars other than line break chsrs as many as possible and then a non-digit char (this pattern is required to advance the index to the location right before the last sequence of digits, not the last digit)
(\d+) - Group 1 (just what the REGEXP_EXTRACT returns): one or more digits
\s* - 0+ whitespaces
(?:J|DAY) - J or DAY substrings.

Regex to replace multiple patterns with single not working

I am working on replacing multiple occurance of string 0000 with single random number in HANA SQL
I have used these patterns
'(\w+)\s+\1'
'([0000 ]+) \1'
but all occurrences are replaced except the last occurrence of the pattern
SELECT REPLACE_REGEXPR('(\w+)\s+\1' IN '0000 0000 0000' WITH ROUND(RAND()*1000) OCCURRENCE ALL) AS a2
FROM DUMMY;
Current output is
RANDOM 0000
expected output is
RANDOM
Try this regex:
((0000) +)+(0000)
Look Here
And if it's OK to use any digit and more \ less times then 4:
(\d+ +)+\d+
Good Luck!
You may use
\b(\d+)(?:\s+\1)+\b
See the regex demo
You need \d to match digits (if you need to match letters and _ keep on using \w).
Also, to match 1 or more repetitions of a sequence of patterns you need (?:....)+, a + quantified non-capturing group.
Pattern details
\b - word boundary
(\d+) - Group 1: one or more digits
(?:\s+\1)+ - 1+ repetitions of 1+ whitespaces and the same value as captured in Group 1
\b - word boundary
Regex graph:

find partial match between two data frame

I have two data frames.
head(NEexp)
Gene Transcript Ratio_log2 FDR
HLHmgamma HLHmgamma-RA 3.759200 1.09e-10
Brd Brd-RA 3.527000 2.66e-08
CG4080 CG4080-RE 3.378500 2.95e-50
RpII215 RpII215-RA 3.343967 1.82e-10
head(excel$gene)
Enhancer of split mgamma, helix-loop-helix
distal antenna
CG4080 gene product from transcript CG4080-RB
As you can see, the two gene column match partially(HLHmgamma matches Enhancer of split mgamma, helix-loop-helix; CG4080 matches CG4080 gene product from transcript CG4080-RB), is there anyway that I can link these two?
codes I have tried so far:
genename=as.character(NEexp$Gene)
query=paste("select * from excel where excel.gene LIKE \"", genename,"\ ",sep"")
Newtable<-dbGetQuery(con,query)
dbGetQuery(con,"select * from excel, NEexp where excel.gene LIKE % "NEexp$Gene" %")
You need merge , which basically is the same as join in SQL. But first you might want to split excel$gene to get the part you want to match.
http://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html
https://stat.ethz.ch/R-manual/R-devel/library/base/html/strsplit.html