Get grouped results with a SPARQL query

I still feel like a SPARQL newbie, so I may be way off base about what SPARQL GROUP BY does, but here is my question.
Suppose I want to request all the Categories resources in a graph database, along with all the items associated with each category and the name and price of each item.
Right now my SPARQL queries are giving me back something like the following table:
Categories  Item      ItemName  ItemPrice
Tools       HammerID  Hammer    $12
Tools       SawID     Saw       $13
Tools       WrenchID  Wrench    $10
Food        AppleID   Apple     $5
Food        CornID    Corn      $1
I wanted to use GROUP BY to group the items under a single category, so that when I start processing it, I can look through each unique category and then display the items that belong in that category.
Right now if I loop through the above results, I will be iterating over 5 entries instead of 2.
Another way to describe the results I want is to imagine what the corresponding JSON data would look like. I want something like:
{
  tools: [
    {id: hammerId,
     title: hammer,
     price: $12},
    {id: sawId,
     title: saw,
     price: $13},
    {id: wrenchId,
     title: wrench,
     price: $10}
  ],
  food: [
    {id: appleId,
     title: apple,
     price: $5},
    {id: cornId,
     title: corn,
     price: $1}
  ]
}
With results like this, I can loop directly over the top-level categories and then display the items for each.
Can I use GROUP BY to tell SPARQL to give me results like this?

No, you can't. A SPARQL SELECT query-result is defined as a sequence of solutions, with each solution being a set of variable-value pairs (with a value being defined as an IRI, BNode, or literal value). Basically it's a simple table. There is no provision for 'nested' solutions like you'd need for your JSON-like structure.
However, the difference is purely syntactic. If you order the results by the category variable (ORDER BY; note that GROUP BY on its own collapses each group into a single row and only lets you select the grouped variable and aggregates), all solutions belonging to the same category are delivered together, one after the other, so in processing the result you can simply treat the category variable as a marker. And of course, if you really want, you can easily rewrite the query result into this kind of nested structure yourself; it's just a different way of writing down the exact same information, after all.
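As a rough illustration of that rewriting step, here is a minimal Python sketch. It assumes the SELECT result has already been read into a list of dictionaries; the key names (category, item, name, price) and the function name nest_by_category are made up for this example and are not part of any SPARQL library.

```python
from itertools import groupby
from operator import itemgetter

# Flat result rows, mirroring the table in the question.
# The dictionary keys are illustrative assumptions, not SPARQL API names.
rows = [
    {"category": "Tools", "item": "HammerID", "name": "Hammer", "price": "$12"},
    {"category": "Tools", "item": "SawID", "name": "Saw", "price": "$13"},
    {"category": "Tools", "item": "WrenchID", "name": "Wrench", "price": "$10"},
    {"category": "Food", "item": "AppleID", "name": "Apple", "price": "$5"},
    {"category": "Food", "item": "CornID", "name": "Corn", "price": "$1"},
]


def nest_by_category(rows):
    """Rewrite a flat list of solutions into {category: [items]}."""
    # groupby only merges adjacent rows, so sort on the category first
    # (unnecessary if the query already used ORDER BY on that variable).
    ordered = sorted(rows, key=itemgetter("category"))
    return {
        category: [
            {"id": r["item"], "title": r["name"], "price": r["price"]}
            for r in group
        ]
        for category, group in groupby(ordered, key=itemgetter("category"))
    }


nested = nest_by_category(rows)
for category, items in nested.items():
    print(category, [i["title"] for i in items])
```

Looping over nested then visits two categories rather than five rows, which is the shape the question asks for.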

How to automatically break down a SQL-like query with many joins into discrete, independent steps?

Note: This is a learning exercise to learn how to implement a SQL-like relational database. This is just one thin slice of a question in the overall grand vision.
I have the following query, given a test database with a few hundred records:
select distinct "companies"."name"
from "companies"
inner join "projects" on "projects"."company_id" = "companies"."id"
inner join "posts" on "posts"."project_id" = "projects"."id"
inner join "comments" on "comments"."post_id" = "posts"."id"
inner join "addresses" on "addresses"."company_id" = "companies"."id"
where "addresses"."name" = 'Address Foo'
and "comments"."message" = 'Comment 3/3/2/1';
Here, the query is kind of unrealistic, but it demonstrates the point which I am trying to make. The point is to have a query with a few joins, so that I can figure out how to write this in sequential steps.
The first part of the question (which I think I've partially figured out) is: how do you write these joins as a sequence of independent steps, with the output of one fed into the input of the next? Also, is there more than one way to do it?
// select() and join() are placeholder helpers, not defined here
// step 1
let companies = select('companies')
// step 2
let projects = join(companies, select('projects'), 'id', 'company_id')
// step 3
let posts = join(projects, select('posts'), 'id', 'project_id')
// step 4
let comments = join(posts, select('comments'), 'id', 'post_id')
// step 5: keep only posts that have at least one matching comment
let finalPosts = posts.filter(post => comments.some(comment => comment.post_id === post.id))
// step 6: keep only projects that have at least one surviving post
let finalProjects = projects.filter(project => finalPosts.some(post => post.project_id === project.id))
// step 7, could also be run in parallel to step 2 potentially
let addresses = join(companies, select('addresses'), 'id', 'company_id')
// step 8: keep only companies with a surviving project and a matching address
let finalCompanies = companies.filter(company => {
  return finalProjects.some(project => project.company_id === company.id)
    && addresses.some(address => address.company_id === company.id)
})
These filters could probably be more optimized using indexes of some sort, but that is beside the point I think. This just demonstrates that there seem to be about 8 steps to find the companies we are looking for.
The main question is, how do you automatically figure out the steps from the SQL query?
I am not asking about how to parse the SQL query into an AST. Assume we have some sort of object structure we are dealing with, like an AST, to start.
How would you have to have the SQL query in structured object form, such that it would lead to these 8 steps? I would like to be able to specify a query (using a custom JSON-like syntax, not SQL), and then have it divide the query into these steps to divide and conquer so to speak and perform the queries in parts (for learning how to implement distributed databases). But I don't see how we go from SQL-like syntax, to 8 steps. Can you show how that might be done?
Here is the full code for the demo, which you can run with psql postgres -f test.sql. The result should be "Company 3".
Basically looking for a high level algorithm (doesn't even need to be code), which describes the key way you would break down some sort of AST-like object representation of a SQL query, into the actual planned steps of the query.
My algorithm looks like this in my head:
1. Represent the SQL query in an object tree.
2. Convert the object tree to steps.
I am not really sure what (1) should be structured like, and even if we had some sort of structure, I'm not sure how to get that to complete (2). Looking for more details on the implementations of these steps, mainly step (2).
My "object structure" for step 1 would be something like this:
const query = {
  select: {
    distinct: true,
    columns: ['companies.name'],
    from: ['companies'],
  },
  joins: [
    {
      type: 'inner join',
      table: 'projects',
      left: 'projects.company_id',
      right: 'companies.id',
    },
    ...
  ],
  conditions: [
    {
      left: 'addresses.name',
      op: '=',
      right: 'Address Foo'
    },
    ...
  ]
}
I am not sure how useful that is, since it doesn't relate to steps at all. At a high level, what kind of code would I have to write to convert that sort of object structure into steps? One potential avenue seems to be a topological sort on the joins. But then you need to combine that with the select and the conditions somehow, and I am not sure how you would even begin to know programmatically which step should come before which other step, or even what the steps are. Maybe if I could somehow break it into known "chunks", it would be simple to apply a topological sort after that, but then the question is still how to get from the object structure / SQL to chunks.
Basically, I have been reading about the theory behind "query planning and optimization", but don't know how to apply it in this regard. How did this site do it?
One aspect is breaking at least the where conditions into conjunctive normal form (CNF).
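To make that a little more concrete, here is a minimal Python sketch of one possible way to go from an object structure like the one in the question to an ordered list of steps. It is not how any real database plans queries (real planners compare costs). It assumes the where clause is already a plain AND of conditions, that each condition compares one column with a literal, and that all joins are inner joins. The function names plan and table_of and the step tuples are invented for this illustration.

```python
from collections import deque

query = {
    "select": {"distinct": True, "columns": ["companies.name"], "from": ["companies"]},
    "joins": [
        {"left": "projects.company_id", "right": "companies.id"},
        {"left": "posts.project_id", "right": "projects.id"},
        {"left": "comments.post_id", "right": "posts.id"},
        {"left": "addresses.company_id", "right": "companies.id"},
    ],
    "conditions": [
        {"left": "addresses.name", "op": "=", "right": "Address Foo"},
        {"left": "comments.message", "op": "=", "right": "Comment 3/3/2/1"},
    ],
}


def table_of(column):
    """Return the table part of a 'table.column' reference."""
    return column.split(".", 1)[0]


def plan(query):
    """Turn the query object into an ordered list of step tuples."""
    steps = []

    # Each condition is one conjunct; attach it to the table it mentions
    # so it can run right after that table is scanned.
    filters = {}
    for cond in query["conditions"]:
        filters.setdefault(table_of(cond["left"]), []).append(cond)

    def scan(table):
        steps.append(("scan", table))
        for cond in filters.get(table, []):
            steps.append(("filter", table, cond["left"], cond["op"], cond["right"]))

    # Treat the joins as an undirected graph between tables.
    edges = {}
    for join in query["joins"]:
        a, b = table_of(join["left"]), table_of(join["right"])
        edges.setdefault(a, []).append((b, join))
        edges.setdefault(b, []).append((a, join))

    # Walk the graph breadth-first from the FROM table, so every join
    # step only refers to tables that were scanned earlier.
    root = query["select"]["from"][0]
    scan(root)
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for neighbour, join in edges.get(current, []):
            if neighbour in seen:
                continue
            seen.add(neighbour)
            scan(neighbour)
            steps.append(("join", join["left"], join["right"]))
            queue.append(neighbour)

    steps.append(("project", tuple(query["select"]["columns"])))
    if query["select"].get("distinct"):
        steps.append(("distinct",))
    return steps


steps = plan(query)
for number, step in enumerate(steps, start=1):
    print(number, step)
```

The breadth-first walk is just one valid ordering. Any order in which each join only touches already-scanned tables would work, which is where a cost-based optimizer would start choosing between alternatives.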
Implementing joins is a huge topic which is probably out of scope for a StackOverflow answer.
If you're looking for practical information about how joins are implemented, I would suggest...
The Join Operation section of Use The Index, Luke, for the different types of join implementation.
Section 7 of The SQLite Query Optimizer Overview, which covers joins, along with reading the SQLite source code. It is about as small as a practical SQL implementation gets.
The output of explain in PostgreSQL, which gives very detailed information about how it plans to execute the query. The operators are explained in Operator Optimization Information.

MongoDB delete documents not containing custom words

I have a news article scraper that pulls articles based on certain content. On occasions, the crawlers pull back articles irrelevant to what they're supposed to.
I want to delete documents that DO NOT contain the relevant keywords. I ran the below code in pandas and was successful in deleting the unwanted documents:
relevant_words = ['Bitcoin', 'bitcoin', 'Ethereum', 'ethereum', 'Tether', 'tether', 'Cardano', 'cardano', 'XRP', 'xrp']
articles['content'] = articles['content'][articles['content'].str.contains('|'.join(relevant_words))].str.lower()
articles.dropna(subset=['content'], inplace=True)
My DB structure is as follows:
_id:
title:
url:
description:
author:
publishedAt:
content:
source_id:
urlToImage:
summarization:
The content field can contain anywhere from one sentence to several paragraphs. I'm thinking of a Python script that iterates over the content field, looks for documents without the relevant words, and deletes them.
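Filter into a new dataframe.
You were on the right track, except that the match needs to be case-insensitive. Lower-casing only the pattern would miss capitalised words such as 'Bitcoin' or 'XRP' in the content, so go in the following order:
Join the words into a single pattern:
pattern = '|'.join(relevant_words)
Filter, ignoring case and treating missing content as no match:
m = articles['content'].str.contains(pattern, case=False, na=False)
Mask the filter:
articles[m]
Combined code:
new_articles = articles[articles['content'].str.contains('|'.join(relevant_words), case=False, na=False)]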
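The dataframe filter does not touch the database itself. For the deletion in MongoDB, a minimal sketch along the following lines might work. It assumes a pymongo collection object is passed in (delete_many and deleted_count are pymongo names), and the helper names irrelevant_filter and delete_irrelevant are made up for this example. Note that a $not filter also matches documents that have no content field at all, so those would be deleted too.

```python
import re

RELEVANT_WORDS = ['Bitcoin', 'Ethereum', 'Tether', 'Cardano', 'XRP']


def irrelevant_filter(words):
    """Build a MongoDB filter matching documents whose content has none of the words."""
    # re.escape guards against words containing regex metacharacters;
    # IGNORECASE makes a separate lower-case spelling of each word unnecessary.
    pattern = re.compile('|'.join(re.escape(w) for w in words), re.IGNORECASE)
    return {'content': {'$not': pattern}}


def delete_irrelevant(collection, words):
    """Delete the matching documents and return how many were removed.

    `collection` is assumed to be a pymongo Collection.
    """
    result = collection.delete_many(irrelevant_filter(words))
    return result.deleted_count
```

Before deleting anything, it is probably worth running the same filter through the collection's find or count_documents method to check which articles would be removed.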

Django Q&Q versus filter.filter

While hunting a bug, I found out that the following 2 statements do different things:
Query 1
Order.objects \
.filter(items__name__icontains="Foo") \
.filter(items__name__icontains="Bar") \
.distinct()
Query 2
Order.objects \
.filter(
Q(items__name__icontains="Foo") &
Q(items__name__icontains="Bar")
) \
.distinct()
The result is as follows:
Query 1 also includes orders where "Foo" and "Bar" are matched by different items. For example, one item's name is "Foo" while another item's name is "Bar".
Query 2, however, only includes orders that have at least one item whose name contains all the keywords, for example an item with a name of "Foo Bar".
Looking at the queries, I can see that the filter() method adds another INNER JOIN to the query while the other doesn't.
I can see the reasoning behind this, but I really wonder if that's the intended behavior.
The difference is that the first query has two filter() calls, and the second query only has one.
The first query tries to find an order with a related item containing 'Foo' and a related item containing 'Bar'; these may be two different items. The second query tries to find an order with a single related item that contains both 'Foo' and 'Bar'.
The fact that one uses Q() objects is not significant - you could change the first query to:
Order.objects.filter(
    Q(items__name__icontains="Foo")
).filter(
    Q(items__name__icontains="Bar")
)
However, the Q() is required in your Query 2, since it would be invalid Python to repeat the keyword argument in .filter(items__name__icontains="Foo", items__name__icontains="Bar")
See the docs on spanning multi-values relationships for more info.
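The difference can be mimicked in plain Python, without Django, to see why the two result sets differ. This is only a model of the semantics described above; the orders dictionary and the function names are invented for illustration.

```python
# Each order maps to the names of its related items (made-up sample data).
orders = {
    "order_a": ["Foo", "Bar"],   # two items, one keyword each
    "order_b": ["Foo Bar"],      # one item with both keywords
    "order_c": ["Foo"],          # only one keyword present
}


def contains(name, keyword):
    """Case-insensitive substring test, like icontains."""
    return keyword.lower() in name.lower()


def chained_filters(orders, first, second):
    """Model of .filter(a).filter(b): each condition may match a different item."""
    return sorted(
        order for order, items in orders.items()
        if any(contains(item, first) for item in items)
        and any(contains(item, second) for item in items)
    )


def single_filter(orders, first, second):
    """Model of .filter(a & b): one and the same item must match both conditions."""
    return sorted(
        order for order, items in orders.items()
        if any(contains(item, first) and contains(item, second) for item in items)
    )


print(chained_filters(orders, "Foo", "Bar"))  # ['order_a', 'order_b']
print(single_filter(orders, "Foo", "Bar"))    # ['order_b']
```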

eBay API categoryId in findItemsAdvanced call returns wrong categories

I'm trying to use the categoryId in my findItemsAdvanced query:
api.execute('findItemsAdvanced', {
    'keywords': 'laptop',
    'categoryId': '51148'})
The results I get are, for example (printing the searchResult dictionary):
'itemId': {'value': '200971548007'}, 'isMultiVariationListing': .............
'primaryCategory': {'categoryId': {'value': '69202'}, 'categoryName': {'value': 'Air Conditioning'}}
....."
You can see that the result has a categoryId of 69202, and not 51148.
What am I doing wrong here? I'm just using the finding.py code at:
https://github.com/timotheus/ebaysdk-python
Thanks
Edit
I've done some tests. I extracted the XML that the SDK builds. If I call with:
'categoryId': '177'
The response is:
the request_xml is <?xml version='1.0' encoding='utf-8'?><findItemsAdvancedRequest
xmlns="http://www.ebay.com/marketplace/search/v1/services"><categoryId>177</categoryId>
<itemFilter><name>Condition</name><value>Used</value></itemFilter><itemFilter>
<name>LocatedIn</name><value>GB</value></itemFilter><keywords>laptop</keywords>
<paginationInput><entriesPerPage>100</entriesPerPage><pageNumber>1</pageNumber>
</paginationInput></findItemsAdvancedRequest>
and I get the same with
'categoryId': ['177']
I find this a bit odd, I thought the appropriate name for the XML categoryId was 'CategoryId' with a capital C. If I do that I don't get an error, but the result is not restricted to the categoryId requested.
Doing it like above, I still get the error:
Exception: findItemsAdvanced: Domain: Marketplace, Severity: Error,
errorId: 3, Invalid category ID.
The code below will do a keyword search for 'laptop' across the UK eBay site and restrict the search to the two categories Apple Laptops (111422) and PC Laptops & Netbooks (177). In addition, the results are filtered to only show the first 25 used items that are priced between £200 and £400. The results are also sorted by price from high to low.
There are a few things to keep in mind about this example.
It assumes that you have already installed ebaysdk-python.
According to the eBay docs the categoryId field is a string and more than one category can be specified. An array is therefore used to hold the category ids that we are interested in.
Our request needs to search for items in the UK eBay site. We therefore pass EBAY-GB as the siteid parameter.
Category ids are different across each eBay site. For example, the category PC Laptops & Netbooks (177) does not exist in Belgium (which, incidentally, is the site used in the ebaysdk-python finding.py example).
This example is also available as a Gist
import ebaysdk
from ebaysdk import finding
# Search the UK site, where category ids 177 and 111422 are valid
api = finding(siteid='EBAY-GB', appid='<REPLACE WITH YOUR OWN APPID>')
api.execute('findItemsAdvanced', {
    'keywords': 'laptop',
    'categoryId': ['177', '111422'],
    'itemFilter': [
        {'name': 'Condition', 'value': 'Used'},
        {'name': 'MinPrice', 'value': '200', 'paramName': 'Currency', 'paramValue': 'GBP'},
        {'name': 'MaxPrice', 'value': '400', 'paramName': 'Currency', 'paramValue': 'GBP'}
    ],
    'paginationInput': {
        'entriesPerPage': '25',
        'pageNumber': '1'
    },
    'sortOrder': 'CurrentPriceHighest'
})
# Print the id, title and primary category of each returned item
dictstr = api.response_dict()
for item in dictstr['searchResult']['item']:
    print("ItemID: %s" % item['itemId'].value)
    print("Title: %s" % item['title'].value)
    print("CategoryID: %s" % item['primaryCategory']['categoryId'].value)
I hope the following will explain why performing a search on the Belgium site results in items that contain the category 177 even though this is not valid for Belgium but is valid for the UK.
Basically eBay allow sellers from one site to appear in the search results of another site as long as they meet the required criteria, such as offering international shipping. It allows sellers to sell to other countries without the need to actually list on those sites.
From the example XML that elelias provided, I can see that a keyword search for 'laptop' was made on the Belgium site, with the results filtered so that only items located in the UK were returned.
<itemFilter>
<name>LocatedIn</name>
<value>GB</value>
</itemFilter>
Because the search was limited to items located in the UK, you won't see any Belgium items in the results. Since the items were listed on the UK site, they will contain information relevant to the UK, for example the category id 177. eBay does not convert the information to make it relevant to the site that you are searching on.
It is important to remember that whatever you are trying to do with the Finding API can also be repeated using the actual advanced search on eBay. For example, it is possible to re-create the issue by performing a keyword search for used items on the Belgium site.
This url is the equivalent of your code that was performing the search without specifying the category 177. As you can see from the results, it returns items that were listed on the UK site but which are appearing on the Belgium site. If you click on some of the items, for example, you can even see that it displays the UK category PC Laptops & Netbooks (177) even though this does not exist on the Belgium site. This matches the results from your code, where it was returning 177 but would not let you specify the same value in the request, as you were searching the Belgium site.
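If it helps while debugging, the categories that actually come back can be checked on the client side. The following is only a sketch: it assumes each result item is a plain dictionary shaped like the searchResult output printed in the question, and the function name split_by_category and the second sample item are hypothetical.

```python
def split_by_category(items, requested_ids):
    """Separate items whose primary category was requested from the rest."""
    requested = set(requested_ids)
    matching, other = [], []
    for item in items:
        category_id = item['primaryCategory']['categoryId']['value']
        if category_id in requested:
            matching.append(item)
        else:
            other.append(item)
    return matching, other


# The first item is taken from the output shown in the question;
# the second one is a made-up item for contrast.
items = [
    {'itemId': {'value': '200971548007'},
     'primaryCategory': {'categoryId': {'value': '69202'},
                         'categoryName': {'value': 'Air Conditioning'}}},
    {'itemId': {'value': '000000000000'},
     'primaryCategory': {'categoryId': {'value': '51148'},
                         'categoryName': {'value': 'Hypothetical category'}}},
]

matching, other = split_by_category(items, ['51148'])
for item in other:
    print(item['itemId']['value'], item['primaryCategory']['categoryName']['value'])
```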
I hope this helps.
Because categoryId is repeatable, you will need to pass an array into the call. Something like this should work.
api.execute('findItemsAdvanced', {
    'keywords': 'laptop',
    'categoryId': ['51148']
})
Note: See how the itemFilter element is an array in the sample file of the SDK.
'itemFilter': [
{'name': 'Condition',
'value': 'Used'},
{'name': 'LocatedIn',
'value': 'GB'},
],

freebase getting plain names of types and sorting by commonality

I'd like to be able to get a list of types by their common name from a freebase ID
{
"id": "/m/02mjmr", #obama
"type":[]
}
How can I return the names of the types instead of their IDs? The above returns
0: "/common/topic"
1: "/people/person"
2: "/user/robert/default_domain/presidential_candidate"
3: "/book/author"
4: "/award/award_winner"
5: "/book/book_subject"
6: "/user/robert/x2008_presidential_election/candidate"
7: "/government/politician"
8: "/organization/organization_member"
9: "/user/robert/default_domain/my_favorite_things"
And lastly, how could I sort them by count? or by notability possibly?
I.e.,
President
Nobel Prize Winner
Author
Person
etc?
Possibly something similar to the notable types API, but it looks like it's going away?
http://wiki.freebase.com/wiki/Notable_types_API
You can get names and instance counts with
{
  "id": "/m/02mjmr",
  "type": [{
    "name": null,
    "id": null,
    "/type/type/domain": {"key": [{"namespace": "/", "limit": 0}], "id": null},
    "/freebase/type_profile/instance_count": null,
    "sort": "/freebase/type_profile/instance_count"
  }]
}
One definition of "notable" is low frequency, so you could just sort by instance count in ascending order to approximate notability. Limiting this to types in the Freebase "commons" would exclude noisy user types. One way to identify commons types is to look for /type/type/domain property values which are in the root namespace (i.e. a single path segment like /government).
For your example, the lowest frequency commons types are:
43   /government/us_president         US President        /government
51   /people/appointer                Appointer           /people
73   /architecture/building_occupant  Building Occupant   /architecture
204  /government/political_appointer  Political Appointer /government
230  /book/poem_character             Poem character      /book
254  /event/public_speaker            Public speaker      /event
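The same filtering and ordering can be sketched on the client side in Python, for instance if the query is run without the domain key constraint. This assumes each type entry is a dictionary with the property names used in the query above; the function name notable_commons_types is made up, and the count for the user type in the sample data is invented purely for illustration.

```python
COUNT = "/freebase/type_profile/instance_count"
DOMAIN = "/type/type/domain"


def notable_commons_types(types):
    """Keep commons types only and order them from rarest to most common."""
    def is_commons(entry):
        # A commons domain sits in the root namespace, e.g. "/government",
        # so its id contains exactly one slash.
        return entry[DOMAIN]["id"].count("/") == 1

    usable = [t for t in types if is_commons(t) and t[COUNT] is not None]
    return sorted(usable, key=lambda t: t[COUNT])


types = [
    {"id": "/people/appointer", "name": "Appointer",
     DOMAIN: {"id": "/people"}, COUNT: 51},
    {"id": "/government/us_president", "name": "US President",
     DOMAIN: {"id": "/government"}, COUNT: 43},
    # User type; the count of 10 is a made-up value for this example.
    {"id": "/user/robert/default_domain/presidential_candidate",
     "name": "Presidential candidate",
     DOMAIN: {"id": "/user/robert/default_domain"}, COUNT: 10},
]

for entry in notable_commons_types(types):
    print(entry[COUNT], entry["id"], entry["name"])
```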
You could refine the filtering further by blacklisting the types that you think are not notable for your application. There are currently 2134 commons types and a bunch of those are primitive data types or things for system usage, so it wouldn't take you long to go through and hand curate the entire list.
You might also be interested in looking at the Freebase Search API which returns one or more notable types with each result. You can search for a specific topic by MID like this:
https://www.googleapis.com/freebase/v1/search?query=/m/02mjmr&indent=true