TOP function in apache pig - hive

In a dataset (approx. 200k records), there is a column named tags: a comma-separated list of the tags associated with a question. Examples of tags are "html", "error", and so on.
php,error,gd,image-processing
php,error,gd,image-processing
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
lisp,scheme,subjective,clojure
cocoa-touch,objective-c,design-patterns
cocoa-touch,objective-c,design-patterns
cocoa-touch,objective-c,design-patterns
core-animation
django,django-models
django,django-models
asp.net
scala,pattern-matching,oop,object-oriented-design,design-principles
scala,pattern-matching,oop,object-oriented-design,design-principles
scala,pattern-matching,oop,object-oriented-design,design-principles
.
.
.
.
.
How can I find the top 10 most commonly used tags in the dataset, in Pig or Hive?

Solution:
1) tokenize the data
https://pig.apache.org/docs/r0.14.0/func.html#tokenize
2) flatten out the bag created by the tokenizer
https://pig.apache.org/docs/r0.14.0/basic.html#flatten
3) group by the terms and count, something like this:
counts = FOREACH (GROUP data BY term) GENERATE
group AS term,
COUNT(data) AS term_cnt;
4) Then group the counts again by some identifier and take the MAX
https://pig.apache.org/docs/r0.14.0/func.html#max
Or order the counts in descending order and LIMIT the result to get the top x
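The steps above can be sketched outside Pig to check the logic on a small sample. This is only an illustration in Python, not Pig Latin; the function name top_tags and the sample rows are assumptions made for the example, and it splits on commas only rather than on every delimiter that Pig's TOKENIZE recognizes.

```python
from collections import Counter


def top_tags(rows, n=10):
    """Return the n most common tags across comma-separated tag rows."""
    counts = Counter()
    for row in rows:
        # "Tokenize" the row on commas; iterating the tokens plays the
        # role of FLATTEN, turning each row into a stream of single tags.
        tags = [tag.strip() for tag in row.split(",")]
        # Group-and-count: each non-empty tag adds one to its counter.
        counts.update(tag for tag in tags if tag)
    # Order by count (descending) and keep only the first n entries.
    return counts.most_common(n)


# Hypothetical sample rows in the same shape as the tags column.
sample = [
    "php,error,gd,image-processing",
    "lisp,scheme,subjective,clojure",
    "lisp,scheme,subjective,clojure",
    "django,django-models",
]
print(top_tags(sample, n=2))
```

On 200k records this fits comfortably in memory, so the sketch is also a quick way to validate the output of the Pig or Hive job.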

Related

How to query to get comma separated values if the subject is the same? [duplicate]

I have several records with the same subject and predicate but different objects, like:
Alex hasFriend A
Alex hasFriend B
Alex hasFriend C
Alex hasFriend D
How could I query the data to get the result with comma separated values, like:
Alex hasFriend A, B, C, D
The SPARQL query is like this:
select distinct ?person ?friend where {
?person <http://www.example.com/hasFriend> ?friend.
}
Adding an answer to summarize the comment section above. The GROUP_CONCAT aggregate function can be used to achieve the results you are looking for. Note that the function allows you to specify the delimiter of your choice.
SELECT ?person (GROUP_CONCAT(?friend; separator=', ') AS ?friends)
WHERE
{
?person <http://www.example.com/hasFriend> ?friend
}
GROUP BY ?person
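To make the grouping behaviour concrete, here is a rough Python analogue of what GROUP_CONCAT does with GROUP BY ?person. It is a sketch under assumptions: the function name group_concat and the (person, friend) pair format are invented for the example, and unlike SPARQL, which does not guarantee the order of concatenated values, this version keeps input order.

```python
from collections import defaultdict


def group_concat(pairs, separator=", "):
    """Join all values that share the same key into one delimited string."""
    grouped = defaultdict(list)
    for person, friend in pairs:
        # Collect every friend under its person, like GROUP BY ?person.
        grouped[person].append(friend)
    # Concatenate each group's values with the chosen separator.
    return {person: separator.join(friends) for person, friends in grouped.items()}


pairs = [("Alex", "A"), ("Alex", "B"), ("Alex", "C"), ("Alex", "D")]
print(group_concat(pairs))
```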

how to group count items in SPARQL, accumulating low hit entries?

How do I count grouped entries in SPARQL, merging entries whose quantity is less than a specific factor?
Consider for example the Nobel Prize data. I could get a count of all family names with a query like
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name (count(*) as ?count) WHERE {
?id foaf:familyName ?name
}
GROUP BY $name
ORDER BY DESC($count)
How do I modify the query so that it only returns the family names occurring at least 3 times, accumulating all the other names as "Other"?
Just wrap your SELECT into another one.
Query
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name_ (SUM(?count) AS ?count_) {
{
SELECT ?name (COUNT(*) AS ?count) {
?id foaf:familyName ?name
} GROUP BY ?name
}
BIND (IF(?count > 2, ?name, "Other") AS ?name_)
} GROUP BY ?name_ ORDER BY DESC(IF(?name_ = "Other", -1 , ?count_))
Results
name_ count_
----------- ---------
Smith 5
Fischer 4
Wilson 4
Lee 3
Lewis 3
Müller 3
Other 878
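The same two-stage idea (count per name, then re-aggregate with rare names relabelled) can be sketched in Python. The function name count_with_other, its parameters, and the sample names are assumptions for illustration only; like the query, it would merge a real family name that happens to be "Other" into the bucket.

```python
from collections import Counter


def count_with_other(names, min_count=3, other_label="Other"):
    """Count names, merging those seen fewer than min_count times."""
    # Inner aggregation: one count per distinct name.
    counts = Counter(names)
    # Outer aggregation: relabel rare names, then sum counts per label.
    merged = Counter()
    for name, count in counts.items():
        label = name if count >= min_count else other_label
        merged[label] += count
    # Sort by count descending, but always place the merged bucket last.
    return sorted(
        merged.items(),
        key=lambda item: (item[0] == other_label, -item[1]),
    )


# Hypothetical input: one family name per laureate.
names = ["Smith"] * 5 + ["Lee"] * 3 + ["Ray"] * 2 + ["Poe"]
print(count_with_other(names))
```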

How to return a union of entity IDs

Consider a list of persons (parents) and their posts (children).
Note that I have simplified the example; it could be parent-child-childOfChild and so on, i.e. entities with a tree of children.
I would like a query that returns a union of the IDs of some persons (filtered on some criteria) and the IDs of their posts, like so:
person1Id
person2Id
person1Post1Id
person1Post2Id
.......
Given a concrete situation with 3 persons (with properties age and name), where Person1 has 2 posts and the other persons have no posts, I managed to do this in 2 ways, but both have limitations. The first uses a filter, so inside the where clause we have:
?entityId ?p ?o.
{
#persons/posts graph pattern
}
filter (?entityId =?person || ?entityId=?post) # || otherChildFilter
This works, but it uses filter, which I want to avoid by using union instead:
{
SELECT (?person as ?s) WHERE{ #criteria}
order by ...
limit 1
}
UNION
{
{
SELECT ?person WHERE{ #criteria}
order by ...
limit 1
}
?s a <http://www.example.org/schema/Post> .
?s <http://www.example.org/schema/postedBy> ?person .
}
This also works but I have to duplicate the person inner query for each child.
So I tried to somehow use bind like so:
(Assume the inner query returns 1 person that has 2 posts.)
select ?s ?person ?projectedOutPerson
where
{
#tag 1
{
#returns 1 person with 2 posts
SELECT ?person WHERE
{
?person a <http://www.example.org/schema/Person> .
optional{?person <http://www.example.org/schema/age> ?age.}
}
order by desc(?age)
limit 1
}
#tag 2
bind(?person as ?projectedOutPerson)
#tag 3
{
bind(?projectedOutPerson as ?s)
}
#tag 4
UNION
{
?s a <http://www.example.org/schema/Post> .
?s <http://www.example.org/schema/postedBy> ?projectedOutPerson.
}
#tag 5
}
This returns post IDs but doesn't add person IDs, and it has some curious behaviour.
I'm testing this on Bigdata 1.3 and Stardog 2.
?projectedOutPerson is bound correctly (tags 2-3): if I select a person with no posts, no posts are returned.
In both databases the bind between tags 3 and 4 is not done/joined, so ?s is not bound to a person (it returns an empty result).
If I remove the portion between tags 4 and 5, Bigdata displays the selected person (so now the bind between tags 3 and 4 works), but Stardog returns (empty result, person1, person1), so in Stardog chaining the bindings doesn't work:
bind(?person as ?p1) #only this is bound
bind(?p2 as ?p3)
bind(?p3 as ?s)
The SPARQL 1.1 docs say "The variable introduced by the BIND clause must not have been used in the group graph pattern up to the point of use in BIND. " but a new variable is introduced each time, so I think it should work.
So how can I solve this without duplicating the subquery or using filter?
It's much easier to answer this kind of question if you provide some data. If I understand your question correctly, you have some data like the following. I'm providing it in Turtle because it's fairly human readable.
@prefix : <http://stackoverflow.com/q/21115947/1281433/> .
# person1 has two posts, person2 has one post, and
# person3 has no posts at all.
:person1 a :Person ;
:hasPost :person1post1 , :person1post2 .
:person2 a :Person ;
:hasPost :person2post1 .
:person3 a :Person .
Now, if you want to select each person, as well as all their posts, you can do it with a query like this:
prefix : <http://stackoverflow.com/q/21115947/1281433/>
select ?id where {
?person a :Person ;
:hasPost? ?id .
}
-----------------
| id |
=================
| :person3 |
| :person2 |
| :person2post1 |
| :person1 |
| :person1post2 |
| :person1post1 |
-----------------
The trick here is that in the query pattern
?person a :Person ;
:hasPost? ?id .
the ?person a :Person ensures that ?person is a Person. Then, the pattern ?person :hasPost? ?id finds ?ids such that there's a path from ?person to ?id of length zero or one. The length zero case means that ?id can be bound to the same value of ?person. The case of length one means that you'll get every x such that ?person :hasPost x.
Now, in the case of persons and posts, it doesn't make a lot of sense (I think) to talk about posts having other posts, but if you're just looking for descendants in a tree structure, you can use * instead of ? in your property path. For instance, if you had this data:
@prefix : <http://stackoverflow.com/q/21115947/1281433/> .
# 1
# / \
# / \
# 2 3
# / \ / \
# 4 5 6 7
# \ \ \
# 8 9 10
:node1 :hasChild :node2 , :node3 .
:node2 :hasChild :node4 , :node5 .
:node3 :hasChild :node6 , :node7 .
:node4 :hasChild :node8 .
:node6 :hasChild :node9 .
:node7 :hasChild :node10 .
You could get :node3 and all its descendants with a query like this:
prefix : <http://stackoverflow.com/q/21115947/1281433/>
select ?node where {
:node3 :hasChild* ?node
}
-----------
| node |
===========
| :node3 |
| :node7 |
| :node10 |
| :node6 |
| :node9 |
-----------
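For readers who want to see what the * property path computes, here is a small Python sketch that walks the same tree breadth-first. It illustrates the semantics (start node included, then everything reachable over zero or more :hasChild edges) and is not how any SPARQL engine is implemented; the function name reachable and the edge-list representation are assumptions for the example.

```python
from collections import deque


def reachable(edges, start):
    """Return start plus every node reachable over zero or more edges."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    seen = {start}      # the zero-length path: start matches itself
    order = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:   # guard against revisiting nodes
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# The :hasChild triples from the example data, as (parent, child) pairs.
edges = [
    ("node1", "node2"), ("node1", "node3"),
    ("node2", "node4"), ("node2", "node5"),
    ("node3", "node6"), ("node3", "node7"),
    ("node4", "node8"), ("node6", "node9"), ("node7", "node10"),
]
print(reachable(edges, "node3"))
```

The result contains the same five nodes as the query output above, though in breadth-first order; SPARQL itself makes no ordering promise without ORDER BY.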

ConcatRelated returns multiple lines per record ID

I made a query in Access 2003 using ConcatRelated to return a string of all children for each parent. However, my test query is returning multiple identical lines, one for each child. So the SQL:
SELECT Moms.MomID, Moms.MomLast,
ConcatRelated("KidFirst","KidsAgeQ","MomID =" & kidsageq.MomID) AS Kids
FROM Moms INNER JOIN KidsAgeQ ON Moms.MomID = KidsAgeQ.MomID;
returns the following:
MomID - MomLast - Kids
34 . . . . . . Q . . . . . . . Pippin, Sunshine, Rose
34 . . . . . . Q . . . . . . . Pippin, Sunshine, Rose
34 . . . . . . Q . . . . . . . Pippin, Sunshine, Rose
Is this normal? And whether it's normal or not, how do I fix it to return only a single record for each MomID?
I suspect you get multiple rows per MomID due to the INNER JOIN with KidsAgeQ. You shouldn't need to include that table in order to retrieve the concatenated KidFirst values matching each MomID.
If this query doesn't give you the result you need, please show us sample data from Moms.
SELECT
m.MomID,
m.MomLast,
ConcatRelated("KidFirst","KidsAgeQ","MomID =" & m.MomID) AS Kids
FROM Moms AS m;

How to pull out a list of all hyperlinked people on a person's Wikipedia page using SPARQL and DBpedia

I want to pull out a list of all the "persons" which have a link to another person on Wikipedia.
For instance, George H. W. Bush has this sentence in his bio:
"Bush was born in Milton, Massachusetts, to Senator
Prescott Bush and Dorothy Walker Bush."
Now Dorothy Bush is hyperlinked to her own page. Can I get a list which looks like:
George H. W. Bush | Dorothy Walker Bush
George H. W. Bush | Babe Ruth
George H. W. Bush | Bill Clinton
and extend this to everyone on Wikipedia? I'll obviously have to break this down into bite-sized chunks for it to output, but I am just not sure how to code this to select linked persons only. Thanks
One way to start would simply be to search for connected resources that are both of type Person. You can use DBpedia's web-based query form.
SELECT ?person1 ?p ?person2
WHERE {
?person1 ?p ?person2.
?person1 a foaf:Person.
?person2 a foaf:Person.
}
ORDER BY ?person1
LIMIT 10
OFFSET 0
You can "split this data into chunks" by using the ORDER BY keyword and iterating over the value after OFFSET (e.g. 0, 10, 20, 30, ...). You should save the results of these separate queries and then combine them afterwards to get the full result.
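As a sketch of that paging loop, the following Python helper only builds the query text for each page; it does not contact any endpoint, and how the queries are sent and the results stored is left open. The names paged_queries and BASE_QUERY and the page size are assumptions for the example.

```python
def paged_queries(base_query, page_size, page_count):
    """Yield one query string per page by appending LIMIT and OFFSET."""
    for page in range(page_count):
        # Each page starts where the previous one ended.
        offset = page * page_size
        yield f"{base_query.rstrip()}\nLIMIT {page_size}\nOFFSET {offset}"


# The query from above, without its LIMIT and OFFSET lines. ORDER BY is
# kept so that consecutive pages are cut from the same stable ordering.
BASE_QUERY = """SELECT ?person1 ?p ?person2
WHERE {
  ?person1 ?p ?person2.
  ?person1 a foaf:Person.
  ?person2 a foaf:Person.
}
ORDER BY ?person1"""

for query in paged_queries(BASE_QUERY, page_size=10, page_count=3):
    # Show only the last line of each generated query.
    print(query.splitlines()[-1])
```

In practice the loop would run until a page comes back with fewer rows than the page size, rather than for a fixed number of pages.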
If you are only looking for a particular kind of interperson relationship on dbpedia, the following query will give you all the properties used to connect two persons.
SELECT DISTINCT ?p
WHERE {
?person1 ?p ?person2.
?person1 a foaf:Person.
?person2 a foaf:Person.
}
Choose one or several of those properties, e.g. http://dbpedia.org/property/married, and get a list of persons related by this property using the following query.
SELECT ?person1 ?person2
WHERE {
?person1 <http://dbpedia.org/property/married> ?person2.
?person1 a foaf:Person.
?person2 a foaf:Person.
}
ORDER BY ?person1
LIMIT 10
OFFSET 0
As you will see for yourself, property usage on DBpedia is quite heterogeneous, so it might take some effort to get what you want.
Hope this helps as a starting point.