Say I have the following tables that model tags attached to articles:
articles (article_id, title, created_at, content)
tags (tag_id, tagname)
articles_tags (article_fk, tag_fk)
What is the idiomatic way to retrieve the n newest articles with all their attached tag-names? This appears to be a standard problem, yet I am new to SQL and don't see how to elegantly solve this problem.
From an application perspective, I would like to write a function that returns a list of records of the form [title, content, [tags]], i.e., all the tags attached to an article would be contained in a variable-length list. SQL relations aren't that flexible; so far, I can only think of a query that joins the tables and returns a new row for each article/tag combination, which I then need to programmatically condense into the above form.
Alternatively, I can think of a solution where I issue two queries: First, for the articles; second, an inner join on the link table and the tag table. Then, in the application, I can filter the result set for each article_id to obtain all tags for a given article? The latter seems to be a rather verbose and inefficient solution.
Am I missing something? Is there a canonical way to formulate a single query? Or a single query plus minor postprocessing?
On top of the bare SQL question, what would the corresponding query look like in the Opaleye DSL? That is, if it can be translated at all?
You would typically use a row-limiting query that selects the articles and orders them by descending date, and a join or a correlated subquery with an aggregation function to generate the list of tags.
The following query gives you the 10 most recent articles, along with the name of their related tags in an array:
select
    a.*,
    (
        select array_agg(t.tagname)
        from articles_tags art
        inner join tags t on t.tag_id = art.tag_fk
        where art.article_fk = a.article_id
    ) tags
from articles a
order by a.created_at desc
limit 10
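If you take the plain-join route from the question instead (one row per article/tag combination), the condensing step in the application is small. Below is a minimal Go sketch of it, assuming the rows arrive already ordered newest article first. Row, Article and condense are hypothetical names for illustration, not part of any library, and an empty Tag stands in for the NULL that a left join yields for an untagged article.

```go
package main

import "fmt"

// Row is one article/tag combination, as a plain join would return it.
// Tag is empty for an article without tags (hypothetical convention).
type Row struct {
	ArticleID int64
	Title     string
	Content   string
	Tag       string
}

// Article is the condensed form: one record per article with all its tags.
type Article struct {
	Title   string
	Content string
	Tags    []string
}

// condense folds join rows into one Article per article id,
// preserving the order in which articles first appear.
func condense(rows []Row) []Article {
	index := make(map[int64]int) // article id -> position in out
	var out []Article
	for _, r := range rows {
		i, seen := index[r.ArticleID]
		if !seen {
			i = len(out)
			index[r.ArticleID] = i
			out = append(out, Article{Title: r.Title, Content: r.Content})
		}
		if r.Tag != "" {
			out[i].Tags = append(out[i].Tags, r.Tag)
		}
	}
	return out
}

func main() {
	rows := []Row{
		{2, "Second", "...", "sql"},
		{2, "Second", "...", "haskell"},
		{1, "First", "...", ""},
	}
	for _, a := range condense(rows) {
		fmt.Println(a.Title, a.Tags)
	}
}
```

With this approach the limit has to be applied to the articles before joining (for example in a subquery), because a limit on the joined rows counts article/tag pairs rather than articles.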
You have converted most of GMB's answer successfully to Opaleye in your answer to your subsequent question. Here's a fully-working version in Opaleye.
In the future you are welcome to ask such questions on Opaleye's issue tracker. You will probably get a quicker response there.
{-# LANGUAGE Arrows #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE TemplateHaskell #-}
import Control.Arrow
import qualified Opaleye as OE
import qualified Data.Profunctor as P
import Data.Profunctor.Product.TH (makeAdaptorAndInstance')
type F field = OE.Field field
data TaggedArticle a b c =
TaggedArticle { articleFk :: a, tagFk :: b, createdAt :: c}
type TaggedArticleR = TaggedArticle (F OE.SqlInt8) (F OE.SqlInt8) (F OE.SqlDate)
data Tag a b = Tag { tagKey :: a, tagName :: b }
type TagR = Tag (F OE.SqlInt8) (F OE.SqlText)
$(makeAdaptorAndInstance' ''TaggedArticle)
$(makeAdaptorAndInstance' ''Tag)
tagsTable :: OE.Table TagR TagR
tagsTable = error "Fill in the definition of tagsTable"
taggedArticlesTable :: OE.Table TaggedArticleR TaggedArticleR
taggedArticlesTable = error "Fill in the definition of taggedArticlesTable"
-- | Query all tags.
allTagsQ :: OE.Select TagR
allTagsQ = OE.selectTable tagsTable
-- | Query all article-tag relations.
allTaggedArticlesQ :: OE.Select TaggedArticleR
allTaggedArticlesQ = OE.selectTable taggedArticlesTable
-- | Join article-ids and tag names for all articles.
articleTagNamesQ :: OE.Select (F OE.SqlInt8, F OE.SqlText, F OE.SqlDate)
articleTagNamesQ = proc () -> do
ta <- allTaggedArticlesQ -< ()
t <- allTagsQ -< ()
OE.restrict -< tagFk ta OE..=== tagKey t -- INNER JOIN ON
returnA -< (articleFk ta, tagName t, createdAt ta)
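-- | Aggregate all tag names for all articles.
-- Note that the limit is applied to the joined article/tag rows before
-- aggregation, so it keeps the 10 newest rows rather than the 10 newest articles.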
articleTagsQ :: OE.Select (F OE.SqlInt8, F (OE.SqlArray OE.SqlText))
articleTagsQ =
OE.aggregate ((,) <$> P.lmap (\(i, _, _) -> i) OE.groupBy
<*> P.lmap (\(_, t, _) -> t) OE.arrayAgg)
(OE.limit 10 (OE.orderBy (OE.desc (\(_, _, ca) -> ca)) articleTagNamesQ))
I have books with different tags (crime, fantasy, drama, etc.).
This is my SQL code:
query := `
SELECT gotoboox.books.id, gotoboox.books.title
FROM gotoboox.books
LEFT JOIN gotoboox.books_tags ON gotoboox.books.id = gotoboox.books_tags.book_id
LEFT JOIN gotoboox.tags ON gotoboox.books_tags.tag_id = gotoboox.tags.id
WHERE gotoboox.tags.title IN ($1)
GROUP BY gotoboox.books.title, gotoboox.books.id
`
rows, err := p.Db.Query(query, pq.Array(tags))
But I get an empty result.
For example, if I write
..WHERE gotoboox.tags.title IN ('Crime', 'Comedia').. // WITHOUT pq.Array()
it works.
So I need to pass my pq.Array(tags) correctly to the WHERE ... IN clause.
P.S. tags is a slice of strings: "tags []string"
Something like this:
gotoboox.tags.title IN ('Crime', 'Comedia')
is, more or less, a short way to write:
gotoboox.tags.title = 'Crime' or gotoboox.tags.title = 'Comedia'
so you don't want to supply an array for the placeholder in IN ($1) unless tags.title is itself an array (which it isn't).
If you want to pass a slice for the placeholder and use pq.Array, you want to use = ANY(array) in the SQL:
query := `... WHERE gotoboox.tags.title = any ($1) ...`
rows, err := p.Db.Query(query, pq.Array(tags))
Alternatively, if tags had n elements then you could build a string like:
"$1,$2,...,$n"
fmt.Sprintf that into your SQL (which is perfectly safe since you know exactly what's in the strings):
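ph := placeholders(len(tags))
q := fmt.Sprintf("select ... where gotoboox.tags.title in (%s) ...", ph)
and then supply values for all those placeholders when you query. Query takes ...interface{}, so the []string has to be copied into a []interface{} first:
args := make([]interface{}, len(tags))
for i, t := range tags {
    args[i] = t
}
rows, err := p.Db.Query(q, args...)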
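The placeholders helper used above is not a library function; you would write it yourself. A minimal sketch of one possible definition, assuming PostgreSQL-style numbered placeholders as in the rest of this answer:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// placeholders returns "$1,$2,...,$n" for n >= 1 and "" for n <= 0.
// It is a hypothetical helper, not part of database/sql or pq.
func placeholders(n int) string {
	if n <= 0 {
		return ""
	}
	parts := make([]string, n)
	for i := range parts {
		parts[i] = "$" + strconv.Itoa(i+1)
	}
	return strings.Join(parts, ",")
}

func main() {
	tags := []string{"Crime", "Comedia"}
	q := fmt.Sprintf("select ... where gotoboox.tags.title in (%s) ...", placeholders(len(tags)))
	fmt.Println(q)
}
```

An empty slice would produce "in ()", which is not valid SQL, so that case is best handled before building the query.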
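The query below returns the books that have all of the selected tags. It assumes that $1 contains no duplicate tag names and that each (book, tag) pair appears only once in books_tags; otherwise the count comparison would be off.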
select
b.id,
b.title
from
books b
join books_tags bt on bt.book_id = b.id
join tags t on bt.tag_id = t.id
where
t.title = any($1)
group by
b.id,
b.title
having
count(*)= cardinality($1)
I have an entity predicate eg. "Person" with related functional predicates storing attributes about the entity.
Eg.
Person(x), Person:id(x:s) -> string(s).
Person:dateOfBirth[a] = b -> Person(a), datetime(b).
Person:height[a] = b -> Person(a), decimal(b).
Person:eyeColor[a] = b -> Person(a), string(b).
Person:occupation[a] = b -> Person(a), string(b).
What I would like to do, in the terminal, is the equivalent of the SQL query:
SELECT id, dateOfBirth, eyeColor FROM Person
I am aware of the print command to get the details of a single functional predicate, but I would like to get a combination of them.
lb print /workspace 'Person:dateOfBirth'
You can use the "lb query" command to execute arbitrary logiql queries against your database. Effectively you create a temporary, anonymous, predicate with the results you want to see, and then a rule for populating that predicate using the logiql language. So in your case it would be something like:
lb query <workspace> '_(id, dob, eye) <-
Person(p),
Person:id(p:id),
Person:dateOfBirth[p] = dob,
Person:eyeColor[p] = eye.'
Try the query command with joins:
lb query /workspace '_(x, y, z) <- Person(p), Person:id(p:x), Person:dateOfBirth[p] = y, Person:eyeColor[p] = z.'
I have a file (In Relation A) with all tweets
today i am not feeling well
i have viral fever!!!
i have a fever
i wish i had viral fever
...
I have another file (In Relation B) with words to be filtered
sick
viral fever
feeling
...
My Code
//loads all the tweets
A = load 'tweets' as tweets;
//loads all the words to be filtered
B = load 'filter_list' as filter_list;
Expected Output
(sick,1)
(viral fever,2)
(feeling,1)
...
How do I achieve this in Pig using a join?
EDITED SOLUTION
The basic concept that I supplied earlier will work, but it requires the addition of a UDF to generate NGrams pairs of the tweets. You then union the NGram pairs to the Tokenized tweets, and then perform the wordcount function on that dataset.
I've tested the code below, and it works fine against the data provided. If records in your filter_list have more than 2 words in a string (ie: "I feel bad"), you'll need to recompile the ngram-udf with the appropriate count (or ideally, just turn it into a variable and set the ngram count on the fly).
You can get the source code for the NGramGenerator UDF here: Github
ngrams.pig
REGISTER ngram-udf.jar
DEFINE NGGen org.apache.pig.tutorial.NGramGenerator;
--Load the initial data
A = LOAD 'tweets.txt' as (tweet:chararray);
--Create NGram tuple with a size limit of 2 from the tweets
B = FOREACH A GENERATE FLATTEN(NGGen(tweet)) as ngram;
--Tokenize the tweets into single word tuples
C = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)tweet)) as ngram;
--Union the Ngram and word tuples
D = UNION B,C;
--Group similar tuples together
E = GROUP D BY ngram;
--For each unique ngram, generate the ngram name and a count
F = FOREACH E GENERATE group, COUNT(D);
--Load the wordlist for joining
Z = LOAD 'wordlist.txt' as (word:chararray);
--Perform the innerjoin of the ngrams and the wordlist
Y = JOIN F BY group, Z BY word;
--For each intersecting record, store the ngram and count
X = FOREACH Y GENERATE $0,$1;
DUMP X;
RESULTS/OUTPUT
(feeling,1)
(viral fever,2)
tweets.txt
today i am not feeling well
i have viral fever!!!
i have a fever
i wish i had viral fever
wordlist.txt
sick
viral fever
feeling
Original Solution
I don't have access to my Hadoop system at the moment to test this answer, so the code may be off slightly. The logic should be sound, however. An easy solution should be:
1. Perform the classic wordcount program against the tweets dataset
2. Perform an inner join of the wordlist and tweets
3. Generate the data again to get rid of the duplicate word in the tuple
4. Dump/Store the join results
Example code:
A = LOAD 'tweets.txt';
B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) as word;
C = GROUP B BY word;
D = FOREACH C GENERATE group, COUNT(B);
Z = LOAD 'wordlist.txt' as (word:chararray);
Y = JOIN D BY group, Z BY word;
X = FOREACH Y GENERATE $0, $1;
DUMP X;
As far as I know, this is not possible using a join.
You could do a CROSS followed by a FILTER with a regex match.
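To make the logic of that suggestion concrete outside Pig, here is a minimal Go sketch of the same idea: pair every filter phrase with every tweet (the CROSS) and keep only the pairs where the tweet contains the phrase (the FILTER), then count per phrase. It uses a plain substring test rather than a regex, so "fever" would also match inside a longer word. PhraseCount and countMatches are hypothetical names for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// PhraseCount pairs a filter phrase with the number of tweets containing it.
type PhraseCount struct {
	Phrase string
	Count  int
}

// countMatches checks every (phrase, tweet) pair and counts, per phrase,
// the tweets that contain it. Phrases without any match are omitted.
func countMatches(tweets, phrases []string) []PhraseCount {
	var out []PhraseCount
	for _, p := range phrases {
		n := 0
		for _, t := range tweets {
			if strings.Contains(t, p) {
				n++
			}
		}
		if n > 0 {
			out = append(out, PhraseCount{p, n})
		}
	}
	return out
}

func main() {
	tweets := []string{
		"today i am not feeling well",
		"i have viral fever!!!",
		"i have a fever",
		"i wish i had viral fever",
	}
	phrases := []string{"sick", "viral fever", "feeling"}
	for _, pc := range countMatches(tweets, phrases) {
		fmt.Printf("(%s,%d)\n", pc.Phrase, pc.Count)
	}
}
```

Like the CROSS itself, this does work proportional to the number of tweets times the number of phrases, which is only reasonable while the filter list stays small.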
I want to implement this pseudo code in SQL.
This is my code:
k = 1
C1 = generate counts from R1
repeat
k = k + 1
INSERT INTO R'k
SELECT p.Id, p.Item1, …, p.Itemk-1, q.Item
FROM Rk-1 AS p, TransactionTable as q
WHERE q.Id = p.Id AND
q.Item > p.Itemk-1
INSERT INTO Ck
SELECT p.Item1, …, p.Itemk, COUNT(*)
FROM R'k AS p
GROUP BY p.Item1, …, p.Itemk
HAVING COUNT(*) >= 2
INSERT INTO Rk
SELECT p.Id, p.Item1, …, p.Itemk
FROM R'k AS p, Ck AS q
WHERE p.item1 = q.item1 AND
.
.
p.itemk = q.itemk
until Rk = {}
How can I code this so that it changes columns using k as a variable?
For APRIORI to be reasonably fast, you need efficient data structures. I'm not convinced storing the data in SQL again will do the trick. But of course it depends a lot on your actual data set. Depending on your data set, APRIORI, FPGrowth or Eclat may each be the better choice sometimes.
Either way, a table layout like Item1, Item2, Item3, ... is pretty much a no-go in SQL table design. You may end up on The Daily WTF...
Consider keeping your itemsets in main memory, and only scanning the database using an efficient iterator.
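As a rough illustration of the in-memory route, here is a minimal Go sketch that follows the same level-wise loop as the pseudocode in the question, with itemsets held as sorted slices so that k never has to become a column count. It is a sketch under simplifying assumptions rather than a tuned APRIORI implementation: all transactions fit in memory, items are strings that never contain the separator byte, and frequentItemsets is a hypothetical name.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

const sep = "\x00" // joins items into a map key; assumes items never contain it

// frequentItemsets returns one map per level k = 1, 2, ... holding the
// itemsets of size k found in at least minSupport transactions, with counts.
func frequentItemsets(transactions [][]string, minSupport int) []map[string]int {
	// Sort and deduplicate each transaction so items have a fixed order.
	txs := make([][]string, len(transactions))
	for i, t := range transactions {
		seen := map[string]bool{}
		for _, it := range t {
			if !seen[it] {
				seen[it] = true
				txs[i] = append(txs[i], it)
			}
		}
		sort.Strings(txs[i])
	}

	// rows[i] plays the role of R(k-1): the frequent itemsets found in
	// transaction i. Starting from the empty itemset makes k = 1 a normal step.
	rows := make([][][]string, len(txs))
	for i := range rows {
		rows[i] = [][]string{{}}
	}

	var levels []map[string]int
	for {
		// R'k: extend each itemset by every larger item of its transaction.
		cand := make([][][]string, len(txs))
		counts := map[string]int{}
		for i, t := range txs {
			for _, set := range rows[i] {
				for _, it := range t {
					if len(set) > 0 && it <= set[len(set)-1] {
						continue
					}
					ext := append(append([]string{}, set...), it)
					cand[i] = append(cand[i], ext)
					counts[strings.Join(ext, sep)]++
				}
			}
		}
		// Ck: keep only itemsets that reach the support threshold.
		freq := map[string]int{}
		for key, c := range counts {
			if c >= minSupport {
				freq[key] = c
			}
		}
		if len(freq) == 0 {
			return levels
		}
		levels = append(levels, freq)
		// Rk: per transaction, keep only the extensions that are frequent.
		for i := range cand {
			rows[i] = nil
			for _, ext := range cand[i] {
				if _, ok := freq[strings.Join(ext, sep)]; ok {
					rows[i] = append(rows[i], ext)
				}
			}
		}
	}
}

func main() {
	txs := [][]string{{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}}
	for k, level := range frequentItemsets(txs, 2) {
		for key, c := range level {
			fmt.Println(k+1, strings.ReplaceAll(key, sep, ","), c)
		}
	}
}
```

The threshold of 2 in main mirrors the HAVING COUNT(*) >= 2 clause from the question.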
I want to sum the calculated column Red, which is computed by the function IsRed() that returns an integer.
When I run the query I get the following error: Method 'Int32 IsRed(Int32)' has no supported translation to SQL.
How should I rewrite this to get it to work? Thanks.
From xx In
(From l In L3s
Join a In BLs On l.L3ID Equals a.L3ID
Order By l.ID
Select PID = l.ID,
Red = IsRed(a.D1.Day- l.D2.Day))
Group By Key = xx.PID Into G = Sum(xx.Red)
Select Key, G
Function IsRed(ByVal dayx As Integer) As Integer
If (dayx < -7) Then
Return 1
Else
Return 0
End If
End Function
That's because LINQ to Entities can't figure out how to translate the IsRed function into SQL. You can create a custom SQL Function and tie that to your IsRed function, but honestly the easiest thing will be to just inline your code:
From xx In
(From l In L3s
Join a In BLs On l.L3ID Equals a.L3ID
Order By l.ID
Select PID = l.ID,
Red = If(a.D1.Day - l.D2.Day < -7, 1, 0))
Group By Key = xx.PID Into G = Sum(xx.Red)
Select Key, G
What that message is saying to you is that LINQ-to-Entities does not know how to translate a call to your custom function into SQL (well, technically, into entity-SQL, but that's not a relevant distinction here).
You won't be able to do this. The expression framework does not provide semantic conversion of any arbitrary function into SQL (which isn't possible, anyway, since SQL is not a Turing-complete procedural language). If you want to do this, you'll have to embed that logic directly into your query.