How can I convert this format of S. tuberosum gene sequence ID - Soltu.DM.10G013850.1 - to Entrez ID?

How can I convert S. tuberosum gene sequence IDs in this format (e.g. Soltu.DM.10G013850.1) to Entrez IDs? I have a problem with annotation because of inadequate gene IDs.

Related

Iterating through Pandas dataframe and dictionary items

Here's a tough one.
Problem introduction:
I'm working with two different files: a GFF3, which is basically a 9-column TSV, and a FASTA, which is a text file.
After importing the GFF3 file with gffpandas, it looks like this:
seq_id source type start end score strand phase attributes
1 ctg.s1.000000F_arrow prokka gene 56.0 244.0 . + . NHDIEHOJ_00001
3 ctg.s1.000000F_arrow prokka gene 902.0 2167.0 . - . NHDIEHOJ_00002
5 ctg.s1.000001F_arrow prokka gene 2363.0 2905.0 . - . NHDIEHOJ_00003
7 ctg.s1.000003F_arrow prokka gene 2916.0 3947.0 . - . NHDIEHOJ_00004
9 ctg.s2.000000F_arrow prokka gene 4353.0 5174.0 . + . NHDIEHOJ_00005
Extracting the seq_id column, I got the following unique values:
ctg.s1.000000F_arrow
ctg.s1.000001F_arrow
ctg.s1.000003F_arrow
ctg.s2.000000F_arrow
Now let's step to the FASTA file, which looks like this:
>ctg.s1.000000F_arrow
CCGGAACATCGTCCTCATCCGCAAAGTCGAGCTCTGCCTCGATCATTGCACGCGAATGGGTCAGCCGTCGGGCCCAACCG
GCATAGAGTGCGGACTGTCCGCCACCGGACTGCTCTATGGCGAGACGACGCTGCATTTCCGTTTCTGCCGCAATCAGGTC
>ctg.s1.000001F_arrow
ACGCCGGCTGCAACTTTGAGAAGATGTGGCGATGTCGACCGCTGCATCCCGCCCTTCTCTGCAGAATTTTCCAGCTGTCC
GAGGACATTGGCAAAAAGGCCCTTGGAATCCTTGCGGCTCATTCTCTCCCCCATGCCTTCCAGAAGAGGCCCTCGAGTTC
>ctg.s1.000003F_arrow
GGCGCTGGTTTTCCCCGACACCTCGCCGCGCGGCGAGGGCGTGGCTGACGACGAGGCTTATGATCTCGGTCAGGGTGCGG
GCTTCTATGTCAATGCGACGCAGAAGCCCTGGTCGCCGCACTATCGCATGTATGATTATATCGTCACCGAATTGCCCGCC
>ctg.s2.000000F_arrow
GCGCTCGACGGCATGCCCGTACGCGGCCGATCCTGCGCCGCTTCCTTAACCTTAGCTGCGGATGGAAAGTCGTCCTCGGA
GTTCGGCTCGCAAACGCTTTCGAGCGCGCAATTGACGACGATGTCGTACCCAACTTAGATCGCCGAACGCCATGAGGTCG
Assume that the uppercase sequence part is much longer than two lines. As you can see, the header lines marked with the ">" symbol carry the same values as the seq_id column of the GFF3.
I have already written a few lines that load the FASTA file into a dictionary in which each key is a header (the text after ">") and each value is the corresponding uppercase sequence.
Problem processing
For each attributes value in the dataframe there is a corresponding start and end value, which together define an interval within the sequence of the corresponding seq_id. I'd like to extract exactly that interval from the FASTA file, but only from the sequence that the seq_id value refers to. That is, the interval 56-244 must be looked up only in the FASTA sequence of ctg.s1.000000F_arrow, and likewise 902-2167.
The final goal is to obtain a dataframe with an additional 10th column (e.g. 'sequence') that contains the FASTA sequence of each interval, like this:
seq_id source type start end score strand phase attributes sequence
1 ctg.s1.000000F_arrow prokka gene 56.0 244.0 . + . NHDIEHOJ_00001 CCGGAACATCGTCCTCATCCG
3 ctg.s1.000000F_arrow prokka gene 902.0 2167.0 . - . NHDIEHOJ_00002 CAAGGACATCGTGATCAATTC
5 ctg.s1.000001F_arrow prokka gene 2363.0 2905.0 . - . NHDIEHOJ_00003 TCGCCGCGCGGCGAGTGATTA
7 ctg.s1.000003F_arrow prokka gene 2916.0 3947.0 . - . NHDIEHOJ_00004 TCGAGCGCGCAATTGACGACG
9 ctg.s2.000000F_arrow prokka gene 4353.0 5174.0 . + . NHDIEHOJ_00005 AGATCGCCGAACGCCATATTT
N.B. The sequences in the sequence column were typed at random and all have the same length here; the real ones will have a length proportional to end - start for each attributes value.
I hope I was clear in the explanation.
Thank you everybody for the help.

Assuming df is the DataFrame, dic is the dictionary, and sequence positions start at 1 (not 0 as in Python indexing):
# 1-based inclusive [start, end] becomes the 0-based Python slice [start-1:end]
df['sequence'] = [dic[k][int(i) - 1:int(j)]
                  for k, i, j in zip(df['seq_id'], df['start'], df['end'])]
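A minimal, standard-library-only sketch of the whole lookup, assuming the FASTA text is already in memory and that the GFF3 coordinates are 1-based and inclusive (the same assumption the answer above makes). The helper names `parse_fasta` and `extract_interval` are made up for illustration, and the short sequences are toy data, not the asker's contigs:

```python
def parse_fasta(text):
    """Return {header: sequence}, joining sequences that span several lines."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the whole header line (minus ">") is the seq_id.
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def extract_interval(records, seq_id, start, end):
    """Slice a 1-based, inclusive [start, end] interval from one record."""
    return records[seq_id][int(start) - 1:int(end)]


fasta_text = """>ctg_a
ACGTAC
GTTT
>ctg_b
GGGCCC
"""
records = parse_fasta(fasta_text)

# Rows mimic (seq_id, start, end) triples taken from the GFF3 table;
# the floats mirror how start and end appear in the dataframe.
rows = [("ctg_a", 2.0, 5.0), ("ctg_a", 7.0, 10.0), ("ctg_b", 1.0, 3.0)]
sequences = [extract_interval(records, k, i, j) for k, i, j in rows]
print(sequences)
```

With pandas, a list built this way can be assigned to df['sequence'] exactly as in the answer. Note that the slice is always taken from the forward strand; features on the "-" strand would still need to be reverse-complemented in a separate step if the coding orientation matters.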

Remove just strings from the entries in my first column of data frame

I have strings and numbers in my first column of a data frame:
rn
AT457
X5377
X3477
I want to remove the letters and keep only the numbers in each entry of the column called rn.
Any help is appreciated.

Use a regular expression to do this.
For example, in R:
## Sample data:
df <- data.frame(rn = c("AT457", "X5377", "X3477"))
## Replace every non-digit character with nothing ('\\D' matches any non-digit)
df$rn_strip <- gsub("\\D", "", df$rn)
## Output:
     rn rn_strip
1 AT457      457
2 X5377     5377
3 X3477     3477
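For comparison, the same non-digit-stripping idea can be sketched in Python with the standard `re` module. The list name `rn` simply mirrors the column name from the question:

```python
import re

rn = ["AT457", "X5377", "X3477"]

# \D matches any non-digit character; replacing it with "" keeps digits only.
rn_strip = [re.sub(r"\D", "", value) for value in rn]
print(rn_strip)
```

The results are still strings; wrap each one in int() if numeric values are needed.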

How to extract just numeric value with REGEXP_EXTRACT in BigQuery?

I am trying to extract just the numbers from a particular column in BigQuery.
The fields concerned have this format: value = "Livraison_21J|Relais_19J" or "RELAIS_15 DAY"
I am trying to extract, for each value, the number of days that follows the keyword "Relais".
The days range from 1 to 100.
I used this to do so:
SELECT CAST(REGEXP_EXTRACT(delivery, r"RELAIS_([0-9]+J)") as string) as relayDay
FROM TABLE
I want to be able to extract just the number of days regardless of the string that comes after the digits, be it "J" or "DAY".
Sample data :
RETRAIT_2H|LIVRAISON_5J|RELAIS_5J | 5J
LIVRAISON_21J|RELAIS_19J | 19J
LIVRAISON_21J|RELAIS_19J | 19J
RETRAIT_2H|LIVRAISON_3J|RELAIS_3J | 3J

You may use
REGEXP_EXTRACT(delivery, r"(?:.*\D)?(\d+)\s*(?:J|DAY)")
Details
(?:.*\D)? - an optional non-capturing group that matches zero or more characters other than line breaks, as many as possible, followed by a non-digit character (this is needed to advance the match position to just before the last run of digits, rather than to its last digit)
(\d+) - Group 1 (just what the REGEXP_EXTRACT returns): one or more digits
\s* - 0+ whitespaces
(?:J|DAY) - J or DAY substrings.
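The behaviour of this pattern can be checked outside BigQuery. The sketch below uses Python's `re` module, whose syntax agrees with BigQuery's regex engine for the constructs used here; that equivalence is an assumption worth verifying on real data. The second pattern, `AFTER_RELAIS`, is not from the answer: it is a hypothetical alternative that anchors on the keyword itself instead of relying on the RELAIS part coming last.

```python
import re

# Same pattern as in the answer above, written as a Python raw string.
LAST_NUMBER = re.compile(r"(?:.*\D)?(\d+)\s*(?:J|DAY)")

# Hypothetical alternative (not from the answer): anchor on the keyword.
AFTER_RELAIS = re.compile(r"RELAIS_(\d+)", re.IGNORECASE)

samples = [
    "RETRAIT_2H|LIVRAISON_5J|RELAIS_5J",
    "LIVRAISON_21J|RELAIS_19J",
    "Livraison_21J|Relais_19J",
    "RELAIS_15 DAY",
]


def first_group(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


last_number = [first_group(LAST_NUMBER, s) for s in samples]
after_relais = [first_group(AFTER_RELAIS, s) for s in samples]
print(last_number)
print(after_relais)
```

Both patterns return only the digits, so the trailing "J" or "DAY" no longer ends up in the result.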

Regex to replace multiple patterns with single not working

I am working on replacing multiple occurrences of the string 0000 with a single random number in HANA SQL.
I have used these patterns
'(\w+)\s+\1'
'([0000 ]+) \1'
but all occurrences are replaced except the last occurrence of the pattern
SELECT REPLACE_REGEXPR('(\w+)\s+\1' IN '0000 0000 0000' WITH ROUND(RAND()*1000) OCCURRENCE ALL) AS a2
FROM DUMMY;
Current output is
RANDOM 0000
expected output is
RANDOM

Try this regex:
((0000) +)+(0000)
If it is acceptable to match any digits, in runs of more or fewer than 4:
(\d+ +)+\d+
Good Luck!

You may use
\b(\d+)(?:\s+\1)+\b
You need \d to match digits (if you need to match letters and _ keep on using \w).
Also, to match 1 or more repetitions of a sequence of patterns you need (?:....)+, a + quantified non-capturing group.
Pattern details
\b - word boundary
(\d+) - Group 1: one or more digits
(?:\s+\1)+ - 1+ repetitions of 1+ whitespaces and the same value as captured in Group 1
\b - word boundary
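To see why the original pattern leaves one 0000 behind, here is a small sketch using Python's `re.sub`. The literal string "RANDOM" stands in for the random number, and Python's regex flavour is only assumed to behave like HANA's for these constructs:

```python
import re

text = "0000 0000 0000"

# Pattern from the question: matches exactly one pair of equal words,
# so the third "0000" has no partner left and survives.
pair_only = re.sub(r"(\w+)\s+\1", "RANDOM", text)

# Pattern from the answer: one number followed by one or more repeats of it,
# so the whole run is consumed by a single match.
whole_run = re.sub(r"\b(\d+)(?:\s+\1)+\b", "RANDOM", text)

print(pair_only)
print(whole_run)
```

The first call reproduces the "RANDOM 0000" output reported in the question, while the second collapses the entire run into one replacement.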

find partial match between two data frame

I have two data frames.
head(NEexp)
Gene Transcript Ratio_log2 FDR
HLHmgamma HLHmgamma-RA 3.759200 1.09e-10
Brd Brd-RA 3.527000 2.66e-08
CG4080 CG4080-RE 3.378500 2.95e-50
RpII215 RpII215-RA 3.343967 1.82e-10
head(excel$gene)
Enhancer of split mgamma, helix-loop-helix
distal antenna
CG4080 gene product from transcript CG4080-RB
As you can see, the two gene columns match partially (HLHmgamma matches "Enhancer of split mgamma, helix-loop-helix"; CG4080 matches "CG4080 gene product from transcript CG4080-RB"). Is there any way that I can link these two?
Code I have tried so far:
genename=as.character(NEexp$Gene)
query=paste("select * from excel where excel.gene LIKE \"", genename,"\ ",sep"")
Newtable<-dbGetQuery(con,query)
dbGetQuery(con,"select * from excel, NEexp where excel.gene LIKE % "NEexp$Gene" %")

You need merge, which is basically the same as a join in SQL. But first you might want to split excel$gene to get the part you want to match on.
http://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html
https://stat.ethz.ch/R-manual/R-devel/library/base/html/strsplit.html
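As a rough sketch of the split-then-join idea in Python (standard library only): break each description into tokens, index the descriptions by token, then look up each gene symbol. The tokenizing rule and the variable names are assumptions for illustration. Exact token matching links CG4080, but not HLHmgamma, because its description never contains that symbol literally; pairs like that would need a synonym table or a looser matching rule.

```python
import re
from collections import defaultdict

genes = ["HLHmgamma", "Brd", "CG4080", "RpII215"]
descriptions = [
    "Enhancer of split mgamma, helix-loop-helix",
    "distal antenna",
    "CG4080 gene product from transcript CG4080-RB",
]

# Index every description by the tokens it contains (split on non-word chars).
token_index = defaultdict(set)
for desc in descriptions:
    for token in re.split(r"\W+", desc):
        if token:
            token_index[token.lower()].add(desc)

# "Join": for each gene symbol, collect the descriptions sharing that token.
matches = {g: sorted(token_index.get(g.lower(), ())) for g in genes}
print(matches)
```

Genes with an empty list found no exact token match and are the ones to inspect by hand.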
