Implementing Apriori in SQL

I want to implement this pseudo code in SQL.
This is my code:
k = 1
C1 = generate counts from R1
repeat
k = k + 1
INSERT INTO R'k
SELECT p.Id, p.Item1, …, p.Itemk-1, q.Item
FROM Rk-1 AS p, TransactionTable as q
WHERE q.Id = p.Id AND
q.Item > p.Itemk-1
INSERT INTO Ck
SELECT p.Item1, …, p.Itemk, COUNT(*)
FROM R'k AS p
GROUP BY p.Item1, …, p.Itemk
HAVING COUNT(*) >= 2
INSERT INTO Rk
SELECT p.Id, p.Item1, …, p.Itemk
FROM R'k AS p, Ck AS q
WHERE p.Item1 = q.Item1 AND
.
.
p.Itemk = q.Itemk
until Rk = {}
How can I code this so that it changes columns using k as a variable?

For APRIORI to be reasonably fast, you need efficient data structures. I'm not convinced storing the data in SQL again will do the trick, but of course it depends a lot on your actual data set. Depending on your data set, APRIORI, FP-Growth, or Eclat may each be the better choice.
Either way, a table layout like Item1, Item2, Item3, ... is pretty much a no-go in SQL table design. You may end up on The Daily WTF...
Consider keeping your itemsets in main memory, and only scanning the database using an efficient iterator.
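If you do stay in SQL, the usual alternative is a normalized layout with one row per (transaction, item) pair. A minimal sketch, with hypothetical table and column names:

-- One row per (transaction, item) pair instead of Item1..Itemk columns
CREATE TABLE transactions (
    id   INTEGER NOT NULL,  -- transaction id
    item INTEGER NOT NULL,  -- item id
    PRIMARY KEY (id, item)
);

CREATE TABLE c1 (
    item    INTEGER NOT NULL,
    support INTEGER NOT NULL
);

-- The C1 step of the pseudocode: count each item, keep those with support >= 2
INSERT INTO c1 (item, support)
SELECT item, COUNT(*)
FROM transactions
GROUP BY item
HAVING COUNT(*) >= 2;

Larger itemsets can then be kept as (itemset_id, item) rows rather than ever-wider Itemk columns, so k never has to appear in a column name.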

Related

Chaining endless sql and performance

I am chaining SQL according to a user filter which is unknown in advance.
For instance, the user might first ask for certain dates:
def filterDates(**kwargs):
    q = ('''
        SELECT date_num, {subject_col}, {in_col} AS {out_col}
        FROM {base}
        WHERE date_num BETWEEN {date1} AND {date2}
        ORDER BY date_num
    ''').format(subject_col=subject_col, **kwargs)
    return q
(base is the input query string from the previous step; see next.)
Then the user wants to calculate another thing (or several), so we pass the date-filter query string q as base to this query:
q2 = ('''
    WITH BS AS (
        SELECT date_num, {subject_col}, {in_col}
        FROM {base}
    )
    SELECT t1.{subject_col}, t1.{in_col}, t2.{in_col} - t1.{in_col} AS {out_col}
    FROM BS t1
    JOIN BS t2
      ON t1.{subject_col} = t2.{subject_col} AND t2.date_num = {date2}
    WHERE t1.date_num = {date1}
''').format(subject_col=subject_col, **kwargs)
Here {base} is going to be:
base = '(' + q + ') AS base'
Now we can chain queries as much as we want and it works.
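For illustration, after one chaining step the engine receives a nested query along these lines (all table and column names here are hypothetical placeholders for the format fields):

WITH BS AS (
    SELECT date_num, subject, score
    FROM (
        SELECT date_num, subject, raw_score AS score
        FROM measurements
        WHERE date_num BETWEEN 20200101 AND 20200131
        ORDER BY date_num
    ) AS base
)
SELECT t1.subject, t1.score, t2.score - t1.score AS score_diff
FROM BS t1
JOIN BS t2
  ON t1.subject = t2.subject AND t2.date_num = 20200131
WHERE t1.date_num = 20200101;

Writing out the expanded text like this makes it easier to see exactly what the optimizer is given to work with.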
How would the engine handle this? Does that mean efficiency is bad because the engine has to make two passes (instead of having a normal WHERE on the dates)? How would it optimize this?
Is there a common good practice way to chain unknown number of queries?

SQL query slow with OR but very fast with UNION in Oracle database

I have a problem where I want to add an OR to a query, but this makes it run very slowly.
The query looks something like this:
SELECT TRIM(l.ID) ID,
TRIM(C.UID) UID,
TRIM(c.UPLOADED_ID) UPLOADED_ID,
TRIM(l.record_key) record_key,
TRIM(c.customers_record_key) customers_record_key,
c.prohibitautoupdate,
TRIM(c.c_additional_key) c_additional_key,
TRIM(c.record_status) c_record_status
FROM tmp_customers_upload l, customers c
WHERE (l.record_key = c.customers_record_key AND NVL(TRIM(l.ID),' ') <> c.UPLOADED_ID AND c.UPLOADED_ID IS NOT NULL )
OR (SUBSTR(C.UID, 6, 11) = TRIM(L.ID) AND (L.record_key <> C.UPLOADED_ID OR C.UPLOADED_ID IS NULL) AND (C.UPLOADED_ID <> L.ID OR C.UPLOADED_ID IS NULL))
and c.record_status <> 'DL'
and c.prohibitautoupdate = 0;
This query is a simplified version of what I want to run, and it takes forever (more than 3 minutes, which is unacceptable).
Now when I run this:
SELECT TRIM(l.ID) ID,
TRIM(C.UID) UID,
TRIM(c.UPLOADED_ID) UPLOADED_ID,
TRIM(l.record_key) record_key,
TRIM(c.customers_record_key) customers_record_key,
c.prohibitautoupdate,
TRIM(c.c_additional_key) c_additional_key,
TRIM(c.record_status) c_record_status
FROM tmp_customers_upload l, customers c
WHERE (l.record_key = c.customers_record_key AND NVL(TRIM(l.ID),' ') <> c.UPLOADED_ID AND c.UPLOADED_ID IS NOT NULL )
and c.record_status <> 'DL'
and c.prohibitautoupdate = 0
UNION
SELECT TRIM(l.ID) ID,
TRIM(C.UID) UID,
TRIM(c.UPLOADED_ID) UPLOADED_ID,
TRIM(l.record_key) record_key,
TRIM(c.customers_record_key) customers_record_key,
c.prohibitautoupdate,
TRIM(c.c_additional_key) c_additional_key,
TRIM(c.record_status) c_record_status
FROM tmp_customers_upload l, customers c
WHERE (SUBSTR(C.UID, 6, 11) = TRIM(L.ID) AND (L.record_key <> C.customers_record_key OR C.customers_record_key IS NULL) AND (C.UPLOADED_ID <> L.ID OR C.UPLOADED_ID IS NULL))
and c.record_status <> 'DL'
and c.prohibitautoupdate = 0;
It takes less than a second to run.
As far as I understand, the first version does an implicit JOIN and the WHERE clause is acting like ON, and the OR somehow confuses the DB which makes it do a full table scan.
My question is how to optimize the query so it runs fast. I prefer not to use the UNION because, as I said, this is only a simplified version of the query, which actually has more than 30 columns, so using the UNION would greatly reduce the readability and maintainability of the SP that contains these queries (which are actually CURSORs in the SP).
Any help will be appreciated, thank you.
The problem is that, because you are not using modern-style joins, you can't see that your join conditions are totally different.
Since your join conditions are totally different, when you try to make this into one query you actually cause the join to morph into a CROSS JOIN (look it up), so you end up with the Cartesian product of the two tables and no ability to use indexes.
If I had to unwind this I would first switch to modern-style joins so it is clear exactly how you are joining -- then it will be clear whether you have a logic problem -- I think you do. If the logic problem is removed, it will probably be trivial to combine the queries.
Or not -- if the logic does not allow them to be combined (because they join differently) and you have to use a UNION.
In that case it is probably best to create a view or a CTE that is shared by both queries to ease the maintenance issues.
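For illustration, the first branch of the original query rewritten with a modern-style join might look like this (a sketch using only the columns shown above, not a tested drop-in replacement):

SELECT TRIM(l.ID) ID,
       TRIM(c.UID) UID,
       TRIM(c.UPLOADED_ID) UPLOADED_ID
FROM tmp_customers_upload l
JOIN customers c
  ON l.record_key = c.customers_record_key  -- the join condition, now explicit
WHERE NVL(TRIM(l.ID), ' ') <> c.UPLOADED_ID
  AND c.UPLOADED_ID IS NOT NULL
  AND c.record_status <> 'DL'
  AND c.prohibitautoupdate = 0;

Written this way, the second branch's completely different ON clause becomes obvious, which is exactly the point above.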

How to use a variable AS a where clause?

I have one WHERE clause which I have to use multiple times. I am quite new to Oracle SQL, so please forgive me my newbie mistakes :). I have read this website, but could not find the answer :(. Here's the SQL statement:
var condition varchar2(100)
exec :condition := 'column 1 = 1 AND column2 = 2, etc.'
Select a.content, b.content
from
(Select (DBMS_LOB.SUBSTR(ost_bama_vrij_veld.inhoud,3)) as content
from table_name
where category = X AND :condition
group by (DBMS_LOB.SUBSTR(ost_bama_vrij_veld.inhoud,3))
) A
,
(Select (DBMS_LOB.SUBSTR(ost_bama_vrij_veld.inhoud,100)) as content
from table_name
where category = Y AND :condition
group by (DBMS_LOB.SUBSTR(ost_bama_vrij_veld.inhoud,100))) B
GROUP BY
a.content, b.content
The content field is a CLOB and, unfortunately, all the values needed are in the same column. My query does not work, of course.
You can't use a bind variable for that much of a where clause, only for specific values. You could use a substitution variable if you're running this in SQL*Plus or SQL Developer (and maybe some other clients):
define condition = 'column 1 = 1 AND column2 = 2, etc.'
Select a.content, b.content
from
(Select (DBMS_LOB.SUBSTR(ost_bama_vrij_veld.inhoud,3)) as content
from table_name
where category = X AND &condition
...
From other places, including JDBC and OCI, you'd need to have the condition as a variable and build the query string using that, so it's repeated in the code that the parser sees. From PL/SQL you could use dynamic SQL to achieve the same thing. I'm not sure why just repeating the conditions is a problem though, binding arguments if values are going to change. Certainly with two clauses like this it seems a bit pointless.
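For instance, a rough PL/SQL sketch of that dynamic SQL route (the condition string, table, and column names are the question's placeholders; the category value stays a real bind while the condition is spliced in as text):

DECLARE
    l_condition VARCHAR2(100) := 'column1 = 1 AND column2 = 2';  -- placeholder condition
    l_count     NUMBER;
BEGIN
    EXECUTE IMMEDIATE
        'SELECT COUNT(*) FROM table_name WHERE category = :cat AND ' || l_condition
        INTO l_count
        USING 'X';  -- the category value stays a proper bind
    DBMS_OUTPUT.PUT_LINE('rows: ' || l_count);
END;
/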
But maybe you could approach this from a different angle and remove the need to repeat the where clause. Querying the table twice might not be efficient anyway. You could apply your condition once as a subquery, but without knowing your indexes or the selectivity of the conditions this could be worse:
with sub_table as (
select category, content
from my_table
where category in (X, Y)
and column 1 = 1 AND column2 = 2, etc.
)
Select a.content, b.content
from
(Select (DBMS_LOB.SUBSTR(ost_bama_vrij_veld.inhoud,3)) as content
from sub_table
where category = X
group by (DBMS_LOB.SUBSTR(ost_bama_vrij_veld.inhoud,3))
) A
,
(Select (DBMS_LOB.SUBSTR(ost_bama_vrij_veld.inhoud,100)) as content
from sub_table
where category = Y
group by (DBMS_LOB.SUBSTR(ost_bama_vrij_veld.inhoud,100))) B
GROUP BY
a.content, b.content
I'm not sure what the grouping is for - to eliminate duplicates? This only really makes sense if you have a single X and Y record matching the other conditions, doesn't it? Maybe I'm not following it properly.
You could also use a case statement:
select max(content_x), max(content_y)
from (
select
case when category = X
then DBMS_LOB.SUBSTR(ost_bama_vrij_veld.inhoud,3) end as content_x,
case when category = Y
then DBMS_LOB.SUBSTR(ost_bama_vrij_veld.inhoud,100) end as content_y
from my_table
where category in (X, Y)
and column 1 = 1 AND column2 = 2, etc.
)

Select first or random row in group by

I have this query using PostgreSQL 9.1 (9.2 as soon as our hosting platform upgrades):
SELECT
media_files.album,
media_files.artist,
ARRAY_AGG(media_files.id) AS media_file_ids
FROM
media_files
INNER JOIN playlist_media_files ON media_files.id = playlist_media_files.media_file_id
WHERE
playlist_media_files.playlist_id = 1
GROUP BY
media_files.album,
media_files.artist
ORDER BY
media_files.album ASC
and it's working fine; the goal was to extract album/artist combinations and have, in the result set, an array of media file ids for each particular combo.
The problem is that I have another column in media_files, which is artwork.
artwork is unique for each media file (even within the same album), but in the result set I need to return just the first of the set.
So, for an album that has 10 media files, I also have 10 corresponding artworks, but I would like to return just the first (or a randomly picked one for that collection).
Is that possible to do with only SQL/Window Functions (first_value over..)?
Yes, it's possible. First, let's tweak your query by adding aliases and explicit column qualifiers so it's clear what comes from where - assuming I've guessed correctly, since I can't be sure without table definitions:
SELECT
mf.album,
mf.artist,
ARRAY_AGG (mf.id) AS media_file_ids
FROM
"media_files" mf
INNER JOIN "playlist_media_files" pmf ON mf.id = pmf.media_file_id
WHERE
pmf.playlist_id = 1
GROUP BY
mf.album,
mf.artist
ORDER BY
mf.album ASC
Now you can either use a subquery in the SELECT list or maybe use DISTINCT ON, though it looks like any solution based on DISTINCT ON will be so convoluted as not to be worth it.
What you really want is something like a pick_arbitrary_value_agg aggregate that just picks the first value it sees and throws the rest away. There is no such aggregate, and it isn't really worth implementing for the job. You could use min(artwork) or max(artwork), and you may find that this actually performs better than the later solutions.
To use a subquery, leave the ORDER BY as it is and add the following as an extra column in your SELECT list:
(SELECT mf2.artwork
FROM media_files mf2
WHERE mf2.artist = mf.artist
AND mf2.album = mf.album
LIMIT 1) AS picked_artwork
You can, at a performance cost, randomize the selected artwork by adding ORDER BY random() before the LIMIT 1 above:
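That is, the same correlated subquery with the extra ORDER BY (untested, same assumptions as above):

(SELECT mf2.artwork
 FROM media_files mf2
 WHERE mf2.artist = mf.artist
 AND mf2.album = mf.album
 ORDER BY random()
 LIMIT 1) AS picked_artwork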
Alternately, here's a quick and dirty way to implement selection of a random row in-line:
(array_agg(artwork))[width_bucket(random(),0,1,count(artwork)::integer)]
Since there's no sample data I can't test these modifications. Let me know if there's an issue.
"First" pick
Wouldn't it be simpler / cheaper to just use min():
SELECT m.album
,m.artist
,array_agg(m.id) AS media_file_ids
,min(m.artwork) AS artwork
FROM playlist_media_files p
JOIN media_files m ON m.id = p.media_file_id
WHERE p.playlist_id = 1
GROUP BY m.album, m.artist
ORDER BY m.album, m.artist;
Arbitrary / random pick
If you are looking for a random selection, @Craig already provided a solution with truly random picks.
You could also use a CTE to avoid additional scans on the (possibly big) base table and then run two separate (cheap) subqueries on the small result set.
For arbitrary selection - not truly random; the result will depend on the physical order of rows in the table and implementation specifics:
WITH x AS (
SELECT m.album, m.artist, m.id, m.artwork
FROM playlist_media_files p
JOIN media_files m ON m.id = p.media_file_id
WHERE p.playlist_id = 1
)
SELECT a.album, a.artist, a.media_file_ids, b.artwork
FROM (
SELECT album, artist, array_agg(id) AS media_file_ids
FROM x
GROUP BY album, artist
) a
JOIN (
SELECT DISTINCT ON (1,2) album, artist, artwork
FROM x
) b USING (album, artist);
For truly random results, you can add an ORDER BY .. random() like this to subquery b:
JOIN (
SELECT DISTINCT ON (1, 2) album, artist, artwork
FROM x
ORDER BY 1, 2, random()
) b USING (album, artist);

How to use min() in where/having clause (to avoid subquery) in Hive/SQL

I have a large table of events. Per user, I want to count the occurrences of type A events before the earliest type B event.
I am searching for an elegant query. Hive is used, so I can't do subqueries.
Timestamp  Type  User
...        A     X
...        A     X
...        B     X
...        A     X
...        A     X
...        A     Y
...        A     Y
...        A     Y
...        B     Y
...        A     Y
Wanted Result:
User  Count_Type_A
X     2
Y     3
I could get the "cut-off" timestamp by doing:
SELECT User, min(Timestamp)
FROM events
WHERE Type = 'B'
GROUP BY User;
But then how can I use that information inside the next query where I want to do something like:
SELECT User, count(Timestamp)
FROM events
WHERE Type = 'A' AND Timestamp < min(User.Timestamp_Type_B)
GROUP BY User;
My only idea so far is to determine the cut-off timestamps first, then join them with all type A events and select from the resulting table, but that feels wrong and would look ugly.
I'm also considering the possibility that this is the wrong type of problem/analysis for Hive and that I should consider hand-written MapReduce or Pig instead.
Please help me by pointing me in the right direction.
First Update:
In response to Cilvic's first comment to this answer, I've adjusted my query to the following based on workarounds suggested in the comments found at https://issues.apache.org/jira/browse/HIVE-556:
SELECT [User], COUNT([Timestamp]) AS [Before_First_B_Count]
FROM [Dataset] main
CROSS JOIN (SELECT [User], min([Timestamp]) [First_B_TS] FROM [Dataset]
WHERE [Type] = 'B'
GROUP BY [User]) sub
WHERE main.[Type] = 'A'
AND (sub.[User] = main.[User])
AND (main.[Timestamp] < sub.[First_B_TS])
GROUP BY main.[User]
Original:
Give this a shot:
SELECT [User], COUNT([Timestamp]) AS [Before_First_B_Count]
FROM [Dataset] main
JOIN (SELECT [User], min([Timestamp]) [First_B_TS] FROM [Dataset]
WHERE [Type] = 'B'
GROUP BY [User]) sub
ON (sub.[User] = main.[User]) AND (main.[Timestamp] < sub.[First_B_TS])
WHERE main.[Type] = 'A'
GROUP BY main.[User]
I did my best to follow Hive syntax. Let me know if you have any questions. I would like to know why you wish/need to avoid a subquery.
In general, I +1 coge.soft's solution. Here it is again for your reference:
SELECT [User], COUNT([Timestamp]) AS [Before_First_B_Count]
FROM [Dataset] main
JOIN (SELECT [User], min([Timestamp]) [First_B_TS] FROM [Dataset]
WHERE [Type] = 'B'
GROUP BY [User]) sub
ON (sub.[User] = main.[User]) AND (main.[Timestamp] < sub.[First_B_TS])
WHERE main.[Type] = 'A'
GROUP BY main.[User]
However, a couple of things to note:
What happens when there are no B events? Assuming you would want to count all the A events per user in that case, an inner join as specified in the solution wouldn't work, since there would be no entry for that user in the sub table. You would need to change to a left outer join for that, as sketched below.
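A sketch of that left-outer-join variant (untested; note the extra IS NULL check so that users without any B event keep all their A events):

SELECT main.[User], COUNT(main.[Timestamp]) AS [Before_First_B_Count]
FROM [Dataset] main
LEFT OUTER JOIN (SELECT [User], min([Timestamp]) [First_B_TS] FROM [Dataset]
     WHERE [Type] = 'B'
     GROUP BY [User]) sub
ON (sub.[User] = main.[User])
WHERE main.[Type] = 'A'
AND (sub.[First_B_TS] IS NULL OR main.[Timestamp] < sub.[First_B_TS])
GROUP BY main.[User]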
The solution also does two passes over the data - one to populate the sub table, another to join the sub table with the main table. Depending on your notion of performance and efficiency, there is an alternative where you could do this in a single pass over the data: distribute the data by user using Hive's DISTRIBUTE BY functionality and write a custom reducer that does your count calculation in your favorite language using Hive's TRANSFORM functionality.
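A rough sketch of that single-pass approach, keeping the bracket notation used above (the reducer script count_a_before_b.py is hypothetical; it would stream each user's rows in timestamp order, remember the earliest B, and emit one count of the preceding A events per user):

SELECT TRANSFORM (t.[User], t.[Timestamp], t.[Type])
USING 'python count_a_before_b.py'
AS ([User], [Count_Type_A])
FROM (
    SELECT [User], [Timestamp], [Type]
    FROM [Dataset]
    DISTRIBUTE BY [User]
    SORT BY [User], [Timestamp]
) t;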