How to use Bioproject ID, for example, PRJNA12997, in biopython? - sequence

I have an Excel file in which are given more then 2000 organisms, where each one of them has a Bioproject ID associated (like PRJNA12997). The idea is to use these IDs to get the sequence for a later multiple alignment with other five sequences that I have in a text file.
Can anyone help me understand how I can do this using biopython? At least the part with the bioproject ID.

You can first get the info using Bio.Entrez:
from Bio import Entrez
Entrez.email = "Your.Name.Here#example.org"
# This call to efetch fails sometimes with a 400 error.
handle = Entrez.efetch(db="bioproject", id="PRJNA12997")
I've been trying, and Entrez.read(handle) doesn't seems to work. But if you do record_xml = handle.read() you'll get the XML entry for this record. In this XML you can get the ID for the organism, in this case 12997.
handle = Entrez.esearch(db="nuccore", term="12997[BioProject]")
search_results = Entrez.read(handle)
Now you can efecth from your search results. At this point you should use Biopython to parse whatever you will get in the efetch step, playing with the rettype http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/
for result in search_results["IdList"]:
entry = Entrez.efetch(db="nuccore", id=result, rettype="fasta")
this_seq_in_fasta = entry.read()

Related

How to get the latest papers from pubmed

This is a bit of a specific question, but somebody must have done this before. I would like to get the latest papers from pubmed. Not papers about a certain subjects, but all of them. I thought to query depending on modification date (mdat). I use biopython.py and my code looks like this
handle = Entrez.egquery(mindate='2015/01/10',maxdate='2017/02/19',datetype='mdat')
results = Entrez.read(handle)
for row in results["eGQueryResult"]:
if row["DbName"]=="nuccore":
print(row["Count"])
However, this results in zero papers. If I add term='cancer' I get heaps of papers. So the query seems to need the term keyword... but I want all papers, not papers on a certain subjects. Any ideas how to do this?
thanks
carl
term is a required parameter, so you can't omit it in your call to Entrez.egquery.
If you need all the papers within a specified timeframe, you will probably need a local copy of MEDLINE and PubMed Central:
For MEDLINE, this involves getting a license. For PubMed Central, you
can download the Open Access subset without a license by ftp.
EDIT for python3. The idea is that the latest pubmed id is the same thing as the latest paper (which I'm not sure is true). Basically does a binary search for the latest PMID, then gives a list of the n most recent. This does not look at dates, and only returns PMIDs.
There is an issue however where not all PMIDs exist, for example https://pubmed.ncbi.nlm.nih.gov/34078719/ exists, https://pubmed.ncbi.nlm.nih.gov/34078720/ does not (retraction?), and https://pubmed.ncbi.nlm.nih.gov/34078721/ exists. This ruins the binary search since it can't know if it's found a PMID that hasn't been used yet, or if it has found one that has previously existed.
CODE:
import urllib
def pmid_exists(pmid):
url_stem = 'https://www.ncbi.nlm.nih.gov/pubmed/'
query = url_stem+str(pmid)
try:
request = urllib.request.urlopen(query)
return True
except urllib.error.HTTPError:
return False
def get_latest_pmid(guess = 27239557, _min_guess=None, _max_guess=None):
#print(_min_guess,'<=',guess,'<=',_max_guess)
if _min_guess and _max_guess and _max_guess-_min_guess <= 1:
#recursive base case, this guess must be the largest PMID
return guess
elif pmid_exists(guess):
#guess PMID exists, search for larger ids
_min_guess = guess
next_guess = (_min_guess+_max_guess)//2 if _max_guess else guess*2
else:
#guess PMID does not exist, search for smaller ids
_max_guess = guess
next_guess = (_min_guess+_max_guess)//2 if _min_guess else guess//2
return get_latest_pmid(next_guess, _min_guess, _max_guess)
#Start of program
n = 5
latest_pmid = get_latest_pmid()
most_recent_n_pmids = range(latest_pmid-n, latest_pmid)
print(most_recent_n_pmids)
OUTPUT:
[28245638, 28245639, 28245640, 28245641, 28245642]

Talend - Dynamic Column Name (Enterprise version)

Can anyone help me solve this case?
I have much file to process, two of them is like on below screenshot with my expected output.
I use this transformation on Talend: tFileList---tInputExcel---tUnpivotRow---tMap---tPostgresqlOutput
The output is different to my expected output. This is the screenshot of the output
Can anyone help me to reach my expected output which is like on my first picture above?
This will be pretty hard. You'd have to handle that as a text file. And whenever you found "store" value in the first column you'd update your type with the value.
Here's how I'd start:
Basically tJavaFlex begin piece would contain:
String col1Type
String colNType
main part:
if input_row.col0.equalsIgnoreCase("store") {
col1Type = input_row.col1;
col2Type = input_row.col2;
colNType = input_row.colN;
continue; /*(so this record will be Ignored for the rest of the components!)*/
}
output_row.col1Type = col1Type;
output_row.col1Value = Integer.valueOf(input_row.col1);
/*coz we have text and need numbers :( */
I think using propagate results will save you from writing down all the other fields.
And from here it would be very simple as you have key-type-value-type-value-type-value results.

REGEX_EXTRACT error in PIG

I have a CSV file with 3 columns: tweetid , tweet, and Userid. However within the tweet column there are comma separated values.
i.e. of 1 row of data:
`396124437168537600`,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143
I want to extract all 3 fields individually, but REGEX_EXTRACT is giving me an error with this code:
a = LOAD tweets USING PigStorage(',') AS (f1,f2,f3);
b = FILTER a BY REGEX_EXTRACT(f1,'(.*)\\"(.*)',1);
The error is:
error: Filter's condition must evaluate to boolean.
In the use case shared, reading the data using PigStrorage(',') will result in missing savava143 (last field value)
A = LOAD '/Users/muralirao/learning/pig/a.csv' USING PigStorage(',') AS (f1,f2,f3);
DUMP A;
Output : A : Observe that the last field value is missing.
(396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.")
For the use case shared, to extract all the values from CSV file with field values having ',' we can use either CSVExcelStorage or CSVLoader.
Approach 1 : Using CSVExcelStorage
Ref : http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html
Input : a.csv
396124437168537600,"I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.",savava143
Pig Script :
REGISTER piggybank.jar;
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (f1,f2,f3);
DUMP A;
Output : A
(396124437168537600,I really wish I didn't give up everything I did for you, I'm so mad at my self for even letting it get as far as it did.,savava143)
Approach 2 : Using CSVLoader
Ref : http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/CSVLoader.html
Below script makes use of CSVLoader(), DUMP A will result in the same output seen earlier.
A = LOAD 'a.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (f1,f2,f3);
The error is that you do not want to FILTER based on a regex but GENERATE new fields based on a regex. To filter, you need to know if the line have to be filtered, hence the boolean requirement.
Therefore, you have to use :
b = FOREACH a GENERATE REGEX_EXTRACT(FIELD, REGEX, HOW_MANY_GROUPS_TO_RETURN);
However, as #Murali Rao said, your values are not just coma separated but CSV (think how you will handle a coma in tweet : it is not a field separator, just some content).

Zoho Creator making a custom function for a report

Trying to wrap my head around zoho creator, its not as simple as they make it out to be for building apps… I have an inventory database, and i have four fields that I call to fill a field called Inventory Number (Inv_Num1) –
First Name (First_Name)
Last Name (Last_Name)
Year (Year)
Number (Number)
I have a Custom Function script that I call through a Custom Action in the form report. What I am trying to do is upload a CSV file with 900 entries. Of course, not all of those have those values (first/last/number) so I need to bulk edit all of them. However when I do the bulk edit, the Inv_Num1 field is not updated with the new values. I use the custom action to populate the Inv_Num1 field with the values of the other 4 fields.
Heres is my script:
void onetime.UpdateInv()
{
for each Inventory_Record in Management
{
FN = Inventory_Record.First_Name.subString(0,1);
LN = Inventory_Record.Last_Name.subString(0,1);
YR = Inventory_Record.Year.subString(2,4);
NO = Inventory_Record.Number;
outputstr = FN + LN + YR + NO;
Inventory_Record.Inv_Num1 = outputstr;
}
}
I get this error back when I try to run this function
Error.
Error in executing UpdateInv workflow.
Error in executing For Each Record task.
Error in executing Set Variable task. Unable to update template variable FN.
Error evaluating STRING expression :
Even though there is a First Name for example, it still thinks there is none. This only happens on the fields I changed with Bulk Edit. If I do each one by hand, then the custom action works—but of course then the Inv_Num1 is already updated through my edit on success functions and makes the whole thing moot.
this may be one year late, you might have found the solution but just to highlight, the error u were facing was just due to the null value in first name.
you just have put a null check on each field and u r good to go.
you can generate the inv_number on the time of bulk uploading also by adding null check in the same code on and placing the code on Add> On Submt.( just the part inside the loop )
the Better option would be using a formula field, you just have to put this formula in that formula field and you'll get your inventory_number autogenerated , you can rename the formula_field to Inv Number or whaterver u want.
Since you are using substring directly in year Field, I am assuming the
year field as string.else you would have to user Year.tostring().substring(2,4) & instead of if(Year=="","",...) you have to put if(Year==null , null,...);
so here's the formula
if(First_Name=="","",First_Name.subString(0,1))+if(Last_Name =="","",Last_Name.subString(0,1)) + if(Year=="","",Year.subString(2,4)+Number
Let me know ur response if u implement this.
Without knowing the datatype its difficult to fix, but making the assumption that your Inventory_Record.number is a numeric data item you are adding a string to a number:
The "+" is used for string Concatenation - Joiner but it also adds two numbers together so think "a" + "b" = "ab" for strings but for numbers 1 + 2 = 3.
All good, but when you do "a" + 2 the system doesn't know whether to add or concatenate so gives an error.

Easiest way to find a value in one of three columns?

I am using twilio to provide audio conference functionality in my rails app. When I call my conference number, twilio passes on a couple of values - including 'From' which contains the caller's phone number in international format.
I have a profile for every user in my system and in my controller I am querying the profile to provide a personalised welcome message. Every profile contains between 0 and 3 numbers (primary, secondary and cellphone) and I need to check the caller's ID against those three fields in all profiles.
When I use the console on my dev machine, the following code finds the correct profile:
Profile.find_by('+44000000000')
When I upload to heroku, I use following code instead:
name = Profile.find_by(params['From']) || 'there'
Which causes an error in my app:
2014-04-03T19:20:22.801284+00:00 app[web.1]: PG::DatatypeMismatch: ERROR: argument of WHERE must be type boolean, not type bigint
2014-04-03T19:20:22.801284+00:00 app[web.1]: LINE 1: SELECT "profiles".* FROM "profiles" WHERE (+4400000000) ...
Any suggestion how that could be solved?
Thanks!
Additional information:
I think my problem is that I don't know how to query either the whole profile or three columns at once. Right now the code:
name = Profile.find_by(params['From'])
is not correct (params['From'] contains a phone number) because I am not telling rails to query the columns primary phone number, secondary phone number and cellphone. Neither am I querying the whole profile which would also be an option.
So the question basically is:
How can I change this code:
Profile.find_by(params['From'])
so that it queries either all fields in all profiles or just the three columns with phone numbers which each profile contains?
Is there something like Profile.where(:primary_number).or.where(:secondary_number)or.where(:cellphone) => params['From']
?
I am not familiar with twilio and not sure if this helps but find and find_by_attribute_name accepts array of values as options:
name = Profile.find_by([params['From'], 'there'] )
suppose params['From'] was here , This should generate:
SELECT `profiles`.* FROM `profiles` WHERE `profiles`.`attribute` IN ('here', 'there')
Or:
If you are trying to build dynamic matcher at run time , which is called Meta-programming , you can try this code:
name = eval("Profile.find_by_#{params['From']) || 'there'}(#rest of query params here) ")
Update
First of all, i think you are not using find_by correctly!! the correct syntax is:
Model.find_by(attribute_name: value)
#e.g
Profile.find_by(phone_number: '0123456')
Which will call where and retrive one record, but passing a value will generate a condition that always passes, for example:
Model.find_by('wrong_condition')
#will generate SQL like:
SELECT `models`.* FROM `models` WHERE ('wrong_condition') LIMIT 1
#which will return the first record in the model since there is no valid condition here
Why don't you try:
Profile.where('primary_number = ? OR secondary_number = ? OR cellphone = ?', params['From'], params['From'], params['From'])
You can write your query like:
Profile.where("primary_number = ? or secondary_number = ? or cellphone = ?", params['From'])
Just double check the syntax, but that should do it.