Searching on pubmed using biopython - xlrd

I am trying to run over 200 queries against PubMed to record the number of articles published by each author, and to refine each search by including the author's mentor and institution. I have tried to do this using Biopython and xlrd (the code is below), but I consistently get 0 results for all three query formats (1. by name, 2. by name and institution name, and 3. by name and mentor's name). Are there troubleshooting steps I can take, or should I use a different format for the keywords shown below when searching PubMed?
Example of the input queries; search_terms is a list of lists, one inner list per query.
print(*search_terms[8:15], sep='\n')
[text:'Andrew Bland', 'Weill Cornell Medical College', text:'David Cutler MD']
[text:'Andy Price', 'University of Alabama at Birmingham School of Medicine', text:'Jason Warem, PhD']
[text:'Bah Chamin', 'University of Texas Southwestern Medical School', text:'Dr. Timothy Hillar']
[text:'Eduo Cera', 'University of Colorado School of Medicine', text:'Dr. Tim']
Code used to generate the input queries above and to search on Pubmed:
from Bio import Entrez

Entrez.email = "mollyzhaoe#college.harvard.edu"
for search_term in search_terms[8:55]:
    # Query 1: author name only
    handle = Entrez.egquery(term="{0} AND ((2010[Date - Publication] : 2017[Date - Publication])) ".format(search_term[0]))
    # Query 2: author name and mentor's name
    handle_1 = Entrez.egquery(term="{0} AND ((2010[Date - Publication] : 2017[Date - Publication])) AND {1}".format(search_term[0], search_term[2]))
    # Query 3: author name and institution name
    handle_2 = Entrez.egquery(term="{0} AND ((2010[Date - Publication] : 2017[Date - Publication])) AND {1}".format(search_term[0], search_term[1]))
    record = Entrez.read(handle)
    record_1 = Entrez.read(handle_1)
    record_2 = Entrez.read(handle_2)
    # Collect the PubMed hit count for each of the three queries
    pubmed_count = ['', '', '']
    for row in record["eGQueryResult"]:
        if row["DbName"] == "pubmed":
            pubmed_count[0] = row["Count"]
    for row in record_1["eGQueryResult"]:
        if row["DbName"] == "pubmed":
            pubmed_count[1] = row["Count"]
    for row in record_2["eGQueryResult"]:
        if row["DbName"] == "pubmed":
            pubmed_count[2] = row["Count"]

Check your indentation, it is difficult to know which part belongs to which loop.
If you want to troubleshoot, try printing your egquery, e.g.
print("{0} AND ((2010[Date - Publication] : 2017[Date - Publication])) ".format(search_term[0]))
and paste the output to pubmed and see what you get. Perhaps modify it a bit and see which search term causes the problems.
Your input format is a little bit hard to guess. Print the query and make sure you are getting the right search values.
For the author names, try to get rid of the academic titles. PubMed might confuse them with initials, e.g. "House MD" might be read as Mark David House.
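A rough sketch of the two suggestions above (print the query, strip the titles). The printed entries look like xlrd cells (text:'...'), and formatting such a cell directly would likely put that prefix into the query string, so this sketch reads the cell's value attribute when one is present. The helper names and the list of titles are assumptions made for illustration; they are not part of Biopython or xlrd.

```python
import re

# Hypothetical list of titles to strip; extend it to match your data.
TITLE_PATTERN = re.compile(r"\b(?:Dr|MD|PhD|Prof)\b\.?", re.IGNORECASE)
DATE_FILTER = "((2010[Date - Publication] : 2017[Date - Publication]))"


def cell_text(cell):
    """Return the plain string for an xlrd cell, or for a plain string."""
    return str(getattr(cell, "value", cell)).strip()


def clean_name(cell):
    """Drop academic titles, commas and extra whitespace from a name."""
    name = TITLE_PATTERN.sub("", cell_text(cell)).replace(",", " ")
    return " ".join(name.split())


def build_query(author, extra=None):
    """Build the search string in the same shape as the question's queries."""
    query = "{0} AND {1}".format(clean_name(author), DATE_FILTER)
    if extra is not None:
        query += " AND {0}".format(cell_text(extra))
    return query


# Print the query first and paste it into PubMed to check it by hand.
print(build_query("Andrew Bland", clean_name("David Cutler MD")))
```

Once the printed string returns sensible results on the PubMed website, the same string can be passed as the term argument in the loop from the question.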

Related

Search reference in text (Create Loop in function)

I have a question. I've been working with Vue for a while now, and in a table I have text formatted in one of two ways: name of the book, name of the author, number of pages; or name of the book, name of the edition, year of the edition, name of the author, number of pages. When I move my mouse over the reference ([name of the book] or [name of the book, name of the edition, year of the edition]), I want the description returned by my API to be displayed. Finding the reference with only the name of the book works, but I don't know how to search on either one parameter or three in my function.
getBiblio(biblio) {
    this.resultbiblio = biblio.split(/\s*,|$\s*/, 1)
    axios
        .get('../api/biblio/' + this.resultbiblio)
        .then(response => (this.biblio = response.data))
},
biblio = Les Fables, Jean de la Fontaine, 1600
Now let's say that my data is biblio: L'avare, édition de poche, 2002, Molière, 1300. I would like to test whether there is a reference to L'avare. If the answer is no, I then cut the biblio data down to L'avare, édition de poche, 2002 and test whether there is a reference to that.
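The fallback described above can be sketched as follows. This is written in Python purely to illustrate the logic, not as the Vue implementation; lookup is a hypothetical stand-in for the API call and is assumed to return None when no reference exists.

```python
def candidate_references(biblio):
    """Return the keys to try, shortest first: title, then title + edition + year."""
    parts = [part.strip() for part in biblio.split(",")]
    candidates = [parts[0]]
    if len(parts) >= 3:
        candidates.append(", ".join(parts[:3]))
    return candidates


def find_reference(biblio, lookup):
    """Try each candidate key in turn and return the first hit, else None.

    `lookup` is a hypothetical callable standing in for the API request.
    """
    for candidate in candidate_references(biblio):
        result = lookup(candidate)
        if result is not None:
            return result
    return None


# Example with a dictionary playing the role of the API.
known = {"L'avare, édition de poche, 2002": "description of this edition"}
print(find_reference("L'avare, édition de poche, 2002, Molière, 1300", known.get))
```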

Pandas: Subset of subset with multiple conditions

I need to grab a subset of the following using multiple conditions:
Event Type must contain the string 'Outreach'
AND any other field can contain the string 'STEM' - case insensitive.
Data Sample:
Title        Event Type      Presenter     Description  Tags
STEM event   STEM            Gloria        Bubbles      Craft
Robots       Outreach        STEM - John   EV3          Bots
School       STEM Outreach   Billy         Robots       Craft
Code:
cond = df['Event Type'].str.contains('Outreach')
stemA = df[cond]
This gets me all the outreach events.
cond = df['Event Type'].str.contains('Outreach') & (df['Presenter'].str.contains('STEM') | df['Tags'].str.contains('STEM') | df['Description'].str.contains('STEM') | df['Title'].str.contains('STEM'))
df[cond]
I was hoping for a grep-like solution. The above gets me less than grep does on the command line and I know this result is wrong from looking at the data.
IIUC, this should work for you
cols_to_include = df.columns[df.columns != 'Event Type']
a = df[cols_to_include].astype(str).sum(axis=1)
df[df['Event Type'].str.contains('Outreach') & a.str.contains('STEM', case=False)]
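A self-contained variant for experimenting, assuming the sample rows split into columns as shown in the frame below (the column boundaries are a guess from the flattened sample). Instead of concatenating the columns, it checks each one separately, so a match cannot straddle two neighbouring cells, and it passes case=False for the case-insensitive requirement.

```python
import pandas as pd

# Sample data; the column boundaries are an assumption.
df = pd.DataFrame(
    {
        "Title": ["STEM event", "Robots", "School"],
        "Event Type": ["STEM", "Outreach", "STEM Outreach"],
        "Presenter": ["Gloria", "STEM - John", "Billy"],
        "Description": ["Bubbles", "EV3", "Robots"],
        "Tags": ["Craft", "Bots", "Craft"],
    }
)

# Every column except 'Event Type' is searched for 'STEM'.
other_cols = df.columns.drop("Event Type")

is_outreach = df["Event Type"].str.contains("Outreach", case=False, na=False)
has_stem = (
    df[other_cols]
    .astype(str)
    .apply(lambda col: col.str.contains("STEM", case=False))
    .any(axis=1)
)

result = df[is_outreach & has_stem]
print(result)
```

With this sample only the "Robots" row is kept: "School" is an outreach event, but none of its other fields mention STEM.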

SPARQL Multi-Valued properties - Rendering Results

I am new to SPARQL and to graph database querying as a whole, so please excuse any ignorance. I am trying to write a basic output using some data stored in Fuseki, and I am struggling to understand the best practice for handling the duplication of rows caused by the cardinality between the various concepts.
I will use a simple example to hopefully demonstrate my point.
Data Set
This is a representative sample of the types of data and relationships I am currently working with;
Data Set
Based on this structure I have produced the following triples (N-Triple format);
<http://www.test.com/ontologies/Author/JohnGrisham> <http://www.test.com/ontologies/property#firstName> "John" .
<http://www.test.com/ontologies/Author/JohnGrisham> <http://www.test.com/ontologies/property#lastName> "Grisham" .
<http://www.test.com/ontologies/Author/JohnGrisham> <http://www.test.com/ontologies/property#hasWritten> <http://www.test.com/ontologies/Book/TheClient> .
<http://www.test.com/ontologies/Author/JohnGrisham> <http://www.test.com/ontologies/property#hasWritten> <http://www.test.com/ontologies/Book/TheFirm> .
<http://www.test.com/ontologies/Book/TheFirm> <http://www.test.com/ontologies/property#name> "The Firm" .
<http://www.test.com/ontologies/Book/TheFirm> <http://www.test.com/ontologies/property#soldBy> <http://www.test.com/ontologies/Retailer/Foyles> .
<http://www.test.com/ontologies/Book/TheFirm> <http://www.test.com/ontologies/property#soldBy> <http://www.test.com/ontologies/Retailer/Waterstones> .
<http://www.test.com/ontologies/Book/TheClient> <http://www.test.com/ontologies/property#name> "The Client" .
<http://www.test.com/ontologies/Book/TheClient> <http://www.test.com/ontologies/property#soldBy> <http://www.test.com/ontologies/Retailer/Amazon> .
<http://www.test.com/ontologies/Book/TheClient> <http://www.test.com/ontologies/property#soldBy> <http://www.test.com/ontologies/Retailer/Waterstones> .
<http://www.test.com/ontologies/Retailer/Amazon> <http://www.test.com/ontologies/property#name> "Amazon" .
<http://www.test.com/ontologies/Retailer/Waterstones> <http://www.test.com/ontologies/property#name> "Waterstones" .
<http://www.test.com/ontologies/Retailer/Foyles> <http://www.test.com/ontologies/property#name> "Foyles" .
Render Output Format
Now what I am trying to do is render a page where all authors are displayed, showing details of all their books and the retailers through which each individual book is sold. So something like this (pseudocode);
for-each:Author
    <h1>Author.firstName + Author.lastName</h1>
    for-each:Author.Book
        <h2>Book.Name</h2>
        Sold By:
        for-each:Book.Retailer
            <h2>Retailer.name</h2>
SPARQL
For the rendering to work my thinking was I would need the author's First name and last name, then all book names they have and the various retailer names those books are sold through and therefore I came up with the following SPARQL;
PREFIX p: <http://www.test.com/ontologies/property#>
SELECT ?authorfirstname
       ?authorlastname
       ?bookname
       ?retailername
WHERE {
    ?author p:firstName ?authorfirstname;
            p:lastName ?authorlastname;
            p:hasWritten ?book .
    OPTIONAL {
        ?book p:name ?bookname;
              p:soldBy ?retailer .
        ?retailer p:name ?retailername .
    }
}
This provides the following results;
Results Triple Table
Unfortunately due to the duplication of rows my basic rendering attempt cannot produce output as expected, in fact it's rendering a new "Author" section for every row returned from the query.
I guess what I'm trying to understand is how this type of rendering should be done.
Is it the renderer that is supposed to regroup the data back into the graph form it wants to traverse? (I honestly cannot see how this can be the case.)
Is the SPARQL invalid - is there a way to do what I want in the SPARQL language itself?
Am I just doing something completely wrong?
AMENDMENT - More Detailed Analysis on GROUP_CONCAT
When reviewing the options available to me I came across GROUP_CONCAT, but after a bit of playing with it I decided it probably wasn't going to give me what I wanted and probably wasn't the best route. The reasons for this are;
Data Size
The data set in this post is small, spanning only 3 concepts and a very restricted set of values. The concepts and data I am running against in the real world are far larger, and concatenating results will produce extremely long delimited strings, especially for free-format columns such as descriptions.
Loss of context
Whilst trying out group_concat I quickly realised that I couldn't tell how the data elements in the different group_concat columns related to each other. I can show that using the book example above.
SPARQL
PREFIX p: <http://www.test.com/ontologies/property#>
select ?authorfirstname
       ?authorLastName
       (group_concat(distinct ?bookname; separator = ";") as ?booknames)
       (group_concat(distinct ?retailername; separator = ";") as ?retailernames)
where {
    ?author p:firstName ?authorfirstname;
            p:lastName ?authorLastName;
            p:hasWritten ?book .
    OPTIONAL {
        ?book p:name ?bookname;
              p:soldBy ?retailer .
        ?retailer p:name ?retailername .
    }
}
group by ?authorfirstname ?authorLastName
This produced the following output;
firstname = "John"
lastname = "Grisham"
booknames = "The Client;The Firm"
retailernames = "Amazon;Waterstones;Foyles"
As you can see this has produced one result row but you can no longer work out how the various data elements relate. Which Retailers are for which Book?
Any help/guidance would be greatly appreciated.
Current Solution
Based on the recommended solution below I have used the concept of keys to bring the various data sets together. However, I have tweaked it slightly so that I run one query per concept (e.g. author, book and retailer) and then use the keys to bring the results together in my renderer.
Author Results
   firstname  lastname  books
--------------------------------------------------------------------------------
1  John       Grisham   ontologies/Book/TheClient|ontologies/Book/TheFirm
Book Results
   id                         name        retailers
-------------------------------------------------------------------------------------------------------
1  ontologies/Book/TheClient  The Client  ontologies/Retailer/Waterstones|ontologies/Retailer/Amazon
2  ontologies/Book/TheFirm    The Firm    ontologies/Retailer/Waterstones|ontologies/Retailer/Foyles
Retailer Results
   id                               name
--------------------------------------------------
1  ontologies/Retailer/Amazon       Amazon
2  ontologies/Retailer/Waterstones  Waterstones
3  ontologies/Retailer/Foyles       Foyles
What I then do in my renderer is use the IDs to pull results from the various result sets...
for-each author a : authors
    output(a.firstname)
    for-each book b : a.books.split("|")
        book = books.get(b) // get the result for book b (e.g. Id to Foreign key)
        output(book.name)
        for-each retailer r : book.retailers.split("|")
            retailer = retailers.get(r)
            output(retailer.name)
So effectively you are stitching together what you want from the various different result sets and presenting it.
This seems to be working OK for the moment.
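A minimal Python sketch of that stitching, assuming the three result sets have already been loaded into plain dictionaries keyed by ID (the variable and function names here are illustrative, not from any SPARQL client library):

```python
# Result sets from the three queries, keyed by resource ID where needed.
authors = [
    {
        "firstname": "John",
        "lastname": "Grisham",
        "books": "ontologies/Book/TheClient|ontologies/Book/TheFirm",
    }
]
books = {
    "ontologies/Book/TheClient": {
        "name": "The Client",
        "retailers": "ontologies/Retailer/Waterstones|ontologies/Retailer/Amazon",
    },
    "ontologies/Book/TheFirm": {
        "name": "The Firm",
        "retailers": "ontologies/Retailer/Waterstones|ontologies/Retailer/Foyles",
    },
}
retailers = {
    "ontologies/Retailer/Amazon": {"name": "Amazon"},
    "ontologies/Retailer/Waterstones": {"name": "Waterstones"},
    "ontologies/Retailer/Foyles": {"name": "Foyles"},
}


def render_authors(authors, books, retailers):
    """Follow the '|'-separated keys from author to book to retailer."""
    lines = []
    for author in authors:
        lines.append("{0} {1}".format(author["firstname"], author["lastname"]))
        for book_id in author["books"].split("|"):
            book = books[book_id]
            lines.append("  " + book["name"])
            for retailer_id in book["retailers"].split("|"):
                lines.append("    " + retailers[retailer_id]["name"])
    return lines


print("\n".join(render_authors(authors, books, retailers)))
```

Note that the lookups only work if the IDs are spelled identically in every result set, including letter case.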
I find it easier to construct objects out of the SPARQL results in code rather than trying to form a query that returns only a single row per the relevant resource.
I would use the URI of the resources to identify which rows belong to which resource (author in this case), and then merge the result rows based on said URI.
For JS applications I use the code here to construct objects out of SPARQL results.
For complex values I use __ in the variable name to denote that an object should be constructed from the value. For example all values with variables prefixed with ?book__ would be turned into an object with the remainder of the variable's name as the name of the object's attribute, each object identified by ?book__id. So having values for ?book__id and ?book__name would result in an attribute book for the author, such that author.book = { id: '<book-uri>', name: 'book name'} (or a list of such objects if there are multiple books).
For example in this case I would use the following query:
PREFIX p: <http://www.test.com/ontologies/property#>
SELECT ?id ?firstName ?lastName ?book__id ?book__name
       ?book__retailer
WHERE {
    ?id p:firstName ?firstName;
        p:lastName ?lastName;
        p:hasWritten ?book__id .
    OPTIONAL {
        ?book__id p:name ?book__name;
                  p:soldBy/p:name ?book__retailer .
    }
}
And in the application code I would construct Author objects that look like this (JavaScript notation):
[{
    id: '<http://www.test.com/ontologies/Author/JohnGrisham>',
    firstName: 'John',
    lastName: 'Grisham',
    book: [
        {
            id: '<http://www.test.com/ontologies/Book/TheFirm>',
            name: 'The Firm',
            retailer: ['Foyles', 'Waterstones']
        },
        {
            id: '<http://www.test.com/ontologies/Book/TheClient>',
            name: 'The Client',
            retailer: ['Amazon', 'Waterstones']
        }
    ]
}]
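To make the merging step concrete, here is a small Python sketch of the same idea. It is not the JavaScript code referred to above, only an illustration that assumes each result row arrives as a flat dictionary keyed by variable name and that handles just the one level of nesting used in this query.

```python
def rows_to_authors(rows):
    """Merge flat SPARQL result rows into nested author objects by URI."""
    authors = {}
    for row in rows:
        author = authors.setdefault(
            row["id"],
            {
                "id": row["id"],
                "firstName": row["firstName"],
                "lastName": row["lastName"],
                "book": {},
            },
        )
        book_id = row.get("book__id")
        if book_id is None:
            continue  # the OPTIONAL part did not match for this row
        book = author["book"].setdefault(
            book_id,
            {"id": book_id, "name": row.get("book__name"), "retailer": []},
        )
        retailer = row.get("book__retailer")
        if retailer is not None and retailer not in book["retailer"]:
            book["retailer"].append(retailer)
    # Turn the per-author book index into a plain list.
    return [dict(a, book=list(a["book"].values())) for a in authors.values()]


KEYS = ("id", "firstName", "lastName", "book__id", "book__name", "book__retailer")
AUTHOR = "http://www.test.com/ontologies/Author/JohnGrisham"
FIRM = "http://www.test.com/ontologies/Book/TheFirm"
CLIENT = "http://www.test.com/ontologies/Book/TheClient"
raw = [
    (AUTHOR, "John", "Grisham", FIRM, "The Firm", "Foyles"),
    (AUTHOR, "John", "Grisham", FIRM, "The Firm", "Waterstones"),
    (AUTHOR, "John", "Grisham", CLIENT, "The Client", "Amazon"),
    (AUTHOR, "John", "Grisham", CLIENT, "The Client", "Waterstones"),
]
rows = [dict(zip(KEYS, values)) for values in raw]
result = rows_to_authors(rows)
print(result)
```

Because rows are grouped by URI rather than by name, two authors or books that happen to share a name stay separate.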
This is a common problem that can strike any relational database, I suppose. As you say, GROUP_CONCAT is useful in many situations, but it does lose fidelity.
I worked out a solution you might find interesting. Let's assume you want to construct a view or result tree by looping through authors, then for each author their books, then for each book its retailers.
SELECT DISTINCT ?authorname ?bookname ?retailername {
...
} ORDER BY ?authorname ?bookname ?retailername
That gives you results like this:
author book retailer
-----------------------------
1 author1 book1 retailer1
2 author1 book1 retailer2
3 author1 book2 retailer2
4 author2 book3 retailer2
5 author2 book3 retailer3
...
Because of the ordering it's possible to step through the results:
get next result
currentauthor = author in result
print currentauthor
while author in next result = currentauthor:
    get next result
    currentbook = book in result
    print currentbook
    while book in next result = currentbook:
        get next result
        print retailer in result
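In Python the same walk over ordered rows can be sketched with itertools.groupby, which groups consecutive rows sharing a key and therefore relies on the ORDER BY just as the pseudocode does. The rows are assumed to be (author, book, retailer) tuples already fetched from the query.

```python
from itertools import groupby
from operator import itemgetter

# Rows as returned by the ordered query: (author, book, retailer).
rows = [
    ("author1", "book1", "retailer1"),
    ("author1", "book1", "retailer2"),
    ("author1", "book2", "retailer2"),
    ("author2", "book3", "retailer2"),
    ("author2", "book3", "retailer3"),
]


def render_tree(rows):
    """Print each author once, each book once per author, then its retailers."""
    lines = []
    for author, author_rows in groupby(rows, key=itemgetter(0)):
        lines.append(author)
        for book, book_rows in groupby(author_rows, key=itemgetter(1)):
            lines.append("  " + book)
            for _, _, retailer in book_rows:
                lines.append("    " + retailer)
    return lines


print("\n".join(render_tree(rows)))
```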

How to store smart-list rules in a relational database

The system I'm building has smart groups. By smart groups, I mean groups that update automatically based on these rules:
Include all people that are associated with a given client.
Include all people that are associated with a given client and have these occupations.
Include a specific person (i.e., by ID)
Each smart group can combine any number of these rules. So, for example, a specific smart list might have these specific rules:
Include all people that are associated with client 1
Include all people that are associated with client 5
Include person 6
Include all people associated with client 10, and who have occupations 2, 6, and 9
These rules are OR'ed together to form the group. I'm trying to think about how to best store this in the database given that, in addition to supporting these rules, I'd like to be able to add other rules in the future without too much pain.
The solution I have in mind is to have a separate model for each rule type. The model would have a method on it that returns a queryset that can be combined with other rules' querysets to, ultimately, come up with a list of people. The one downside of this that I can see is that each rule would have its own database table. Should I be concerned about this? Is there, perhaps, a better way to store this information?
Why not use Q objects?
from django.db.models import Q

rule1 = Q(client=1)
rule2 = Q(client=5)
rule3 = Q(id=6)
rule4 = Q(client=10) & (Q(occupation=2) | Q(occupation=6) | Q(occupation=9))
people = Person.objects.filter(rule1 | rule2 | rule3 | rule4)
and then store their pickled strings into the database.
import pickle

rule = rule1 | rule2 | rule3 | rule4
pickled_rule_string = pickle.dumps(rule)
Rule.objects.create(pickled_rule_string=pickled_rule_string)
Here are the models we implemented to deal with this scenario.
class ConsortiumRule(OrganizationModel):
    BY_EMPLOYEE = 1
    BY_CLIENT = 2
    BY_OCCUPATION = 3
    BY_CLASSIFICATION = 4
    TYPES = (
        (BY_EMPLOYEE, 'Include a specific employee'),
        (BY_CLIENT, 'Include all employees of a specific client'),
        (BY_OCCUPATION, 'Include all employees of a specified client '
                        'that have the specified occupation'),
        (BY_CLASSIFICATION, 'Include all employees of a specified client '
                            'that have the specified classifications'))
    consortium = models.ForeignKey(Consortium, related_name='rules')
    type = models.PositiveIntegerField(choices=TYPES, default=BY_CLIENT)
    negate_rule = models.BooleanField(default=False,
                                      help_text='Exclude people who match this rule')

class ConsortiumRuleParameter(OrganizationModel):
    """ example usage: two of these objects one with "occupation=5" one
    with "occupation=6" - both FK linked to a single Rule
    """
    rule = models.ForeignKey(ConsortiumRule, related_name='parameters')
    key = models.CharField(max_length=100, blank=False)
    value = models.CharField(max_length=100, blank=False)
At first I was resistant to this solution as I didn't like the idea of storing references to other objects in a CharField (CharField was selected, because it is the most versatile. Later on, we might have a rule that matches any person whose first name starts with 'Jo'). However, I think this is the best solution for storing this kind of mapping in a relational database. One reason this is a good approach is that it's relatively easy to clean hanging references. For example, if a company is deleted, we only have to do:
ConsortiumRuleParameter.objects.filter(key='company', value=str(pk)).delete()
If the parameters were stored as serialized objects (e.g., Q objects as suggested in a comment), this would be a lot more difficult and time consuming.
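To illustrate how key/value parameters like these can be evaluated, here is a framework-free Python sketch. It is not the Django implementation. It assumes rules are OR'ed together, that different keys within one rule are AND'ed, and that several values for the same key mean "any of these" (the reading used in the Q-object answer above). All names and sample people are hypothetical.

```python
from collections import defaultdict


def rule_matches(person, parameters):
    """Check one rule, given as (key, value) string pairs like the stored rows."""
    allowed = defaultdict(set)
    for key, value in parameters:
        allowed[key].add(value)
    # Every key must match; any one of its stored values is enough.
    return all(str(person.get(key)) in values for key, values in allowed.items())


def group_members(people, rules):
    """A person belongs to the group if at least one rule matches."""
    return [p for p in people if any(rule_matches(p, rule) for rule in rules)]


# The four example rules from the question, stored as string pairs.
rules = [
    [("client", "1")],
    [("client", "5")],
    [("id", "6")],
    [("client", "10"), ("occupation", "2"), ("occupation", "6"), ("occupation", "9")],
]

people = [
    {"id": 1, "client": 1, "occupation": 3},
    {"id": 6, "client": 2, "occupation": 1},
    {"id": 7, "client": 10, "occupation": 6},
    {"id": 8, "client": 10, "occupation": 4},
    {"id": 9, "client": 3, "occupation": 2},
]

members = group_members(people, rules)
print([p["id"] for p in members])
```

Storing values as strings keeps new rule types cheap to add, at the cost of converting values when comparing, as the str() call does here.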

SQL: Grabbing two when theres already one match?

For the recipient field in my message system, if you enter Megan Fo, it asks "Did you mean Megan Fox?". If more than one match is found, it asks "Who did you mean? Megan Fox, Megan Foxxy, Megan Foxxie".
And then of course there is the case where the name was entered correctly (the SQL statement returns exactly one row and full is 1).
SELECT users.id, users.firstname, users.lastname,
       (users.firstname = 'Meg' AND users.lastname = 'Meg') AS full
FROM users
INNER JOIN users_friends ON users.id = users_friends.uID
WHERE users_friends.bID = '1' AND users_friends.type = 'friend' AND users_friends.accepted = '1' AND (
    (users.firstname = 'Meg' AND users.lastname = 'Meg') OR
    (users.firstname LIKE 'Meg%' AND users.lastname LIKE 'Meg%') OR
    users.firstname LIKE 'Meg%' OR
    users.lastname LIKE 'Meg%')
This works fine, but now I have an issue when I try to send to the user
"Meg Meg"
It then returns two rows, "Meg Meg" and "Megan Fox". I want it to return only one row if it matches the full name. As it stands I keep getting "Who did you mean?", because I built it so that more than one row triggers that prompt.
if ($sql_findUser->rowCount() == 1) {
    $get = $sql_findUser->fetch();
    if ($get["full"] == 1) {
        echo "SUCCESS, ONE FOUND FULL MATCH";
    } else {
        echo "Did you mean ...?";
    }
} elseif ($sql_findUser->rowCount() > 1) {
    echo "Who did you mean?";
}
How can I fix this?
At the end of your SQL statement, it looks like you're including everyone with a first name starting with "Meg", which is why Meg Meg and Megan Fox both match. Try shortening your WHERE clause to:
WHERE users_friends.bID = '1' AND users_friends.type = 'friend' AND users_friends.accepted = '1' AND (
    (users.firstname = 'Meg' AND users.lastname = 'Meg') OR
    (users.firstname LIKE 'Meg%' AND users.lastname LIKE 'Meg%'))
It will be a tad slower, but you could split the logic into two queries: one that uses exact matching (= instead of LIKE), and, if that one does not return anything, the current query.
However, that will cause problems when the user enters Megan Fox but really meant Megan Foxxy.
As I understand it, this is your problem when using "Meg Meg"...
- Your code identifies an exact match with "Meg Meg"
- Your code also identifies a partial match with "Megan Fox"
- You don't want the partial matches to be included if you get an exact match
The issue here is that no single record in the return set 'knows' about the other records.
To know that you do Not want to include the partial match, you must have already completed a full check for the exact match.
There appear to me to be two ways of doing that...
1) Sequential but separate queries...
Have your client run a query for exact matches.
If no records are returned, run a query for partial matches on first AND last name.
If no records are returned, run a query for partial matches on first OR last name.
etc, etc.
2) Run one query where you specify the type of match...
Add a field to your query that is something like this...
CASE WHEN (users.firstname = 'Meg' AND users.lastname = 'Meg') THEN 1
     WHEN (users.firstname LIKE 'Meg%' AND users.lastname LIKE 'Meg%') THEN 2
     WHEN (users.firstname LIKE 'Meg%' OR users.lastname LIKE 'Meg%') THEN 3
END AS 'match_type'
Then also add it to the ORDER BY clause, to make full matches at the top, etc, etc.
Your client can then see how many of each match type were generated and choose to discard/ignore the matches that are not relevant.
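A runnable sketch of the second approach, using Python's built-in sqlite3 module rather than the PHP/MySQL setup from the question. The users table is reduced to three columns and the friends join is left out, both simplifications for illustration. The client keeps only the rows with the best (lowest) match_type.

```python
import sqlite3


def find_recipients(conn, first, last):
    """Return only the rows of the best match type found (1 = exact)."""
    sql = """
        SELECT id, firstname, lastname,
               CASE WHEN firstname = :first AND lastname = :last THEN 1
                    WHEN firstname LIKE :first_prefix
                         AND lastname LIKE :last_prefix THEN 2
                    WHEN firstname LIKE :first_prefix
                         OR lastname LIKE :last_prefix THEN 3
               END AS match_type
        FROM users
        WHERE firstname LIKE :first_prefix OR lastname LIKE :last_prefix
        ORDER BY match_type, id
    """
    params = {
        "first": first,
        "last": last,
        "first_prefix": first + "%",
        "last_prefix": last + "%",
    }
    rows = conn.execute(sql, params).fetchall()
    if not rows:
        return []
    best = rows[0][3]  # rows are ordered, so the first row has the best type
    return [row for row in rows if row[3] == best]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT)")
conn.executemany(
    "INSERT INTO users (id, firstname, lastname) VALUES (?, ?, ?)",
    [(1, "Meg", "Meg"), (2, "Megan", "Fox"), (3, "Megan", "Foxxy")],
)

exact = find_recipients(conn, "Meg", "Meg")      # one exact match
partial = find_recipients(conn, "Megan", "Fo")   # two partial matches
print(exact, partial)
```

With one row returned and match_type 1 the message can be sent directly; one row of another type maps to "Did you mean ...?", and several rows map to "Who did you mean?".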