SQL Query - Pull User Achievement

First, I am sorry as I could not come up with a better title for this question.
I have a badge/achievement system on my website: community users are awarded specific badges according to their activity on the site. Below is an example of the SQL I use to pull the users who have made at least 100 forum posts, together with the date they earned the badge (I am using Informix DB version 10).
SELECT tjm.userid::INTEGER AS user_id,
       EXTEND(DBINFO("UTC_TO_DATETIME", tjm.creationdate/1000), year to fraction) AS earned_date
FROM TABLE(
    MULTISET(
        SELECT jm.userid, jm.creationdate,
               (SELECT COUNT(*)
                FROM TABLE(
                    MULTISET(
                        SELECT userid, creationdate
                        FROM jive:jivemessage
                    )
                ) AS i
                WHERE i.userid = jm.userid AND i.creationdate < jm.creationdate
               ) + 1 AS row_num
        FROM jive:jivemessage jm
    )
) AS tjm
WHERE tjm.row_num = 100
This SQL takes more than 30 minutes to execute; we have a very large community and there are millions of forum posts.
Is there a way to improve the query's performance? I am trying to reduce the execution time because I have 40 SQL queries similar to this one, but for different tables and activities.

I don't know Informix DB, but the query below should do what you ask, and it's ANSI SQL (except for the EXTEND part, which I copied from your original query).
SELECT
    MessageCount.userid
    ,EXTEND(DBINFO("UTC_TO_DATETIME", MessageCount.earned_ts/1000), year to fraction) AS earned_date
FROM
(
    -- This sub-query will return all Users who have 100 messages or more
    SELECT
        jm.userid
        ,COUNT(jm.userid) AS totalmessages
        ,MAX(jm.creationdate) AS earned_ts -- latest message date, carried out as the earned date
    FROM
        jive:jivemessage jm
    GROUP BY
        jm.userid
    HAVING
        COUNT(jm.userid) >= 100
) AS MessageCount
The above could probably be done without a sub-query; the only reason I used one is to carry the earned date into the result set, as per the original query. Note that MAX(creationdate) approximates the earned date with the user's latest message rather than their 100th: selecting creationdate directly in the sub-query would have required adding it to the GROUP BY, which would split each user across multiple rows.
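For illustration, a minimal sketch of that sub-query-free version (untested; it assumes Informix accepts an aggregate inside the DBINFO call):
-- sketch: same result without the derived table
SELECT jm.userid,
       EXTEND(DBINFO("UTC_TO_DATETIME", MAX(jm.creationdate)/1000), year to fraction) AS earned_date
FROM jive:jivemessage jm
GROUP BY jm.userid
HAVING COUNT(*) >= 100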
Update 2012/08/14 - Rewritten query following new requirements
As I stated before, I don't know Informix at all, therefore the following query may or may not run.
SELECT
    UsersWithBadge.userid
    ,MAX(UsersWithBadge.creationdate) AS dateearned
FROM
(
    SELECT FIRST 100
        jm.userid
        ,jm.creationdate
    FROM
        jive:jivemessage jm
    JOIN
    (
        -- This sub-query will return all Users who have 100 messages or more
        SELECT
            jm.userid
            ,COUNT(jm.userid) AS totalmessages
        FROM
            jive:jivemessage jm
        GROUP BY
            jm.userid
        HAVING
            COUNT(jm.userid) >= 100
    ) AS MessageCount ON
        (MessageCount.userid = jm.userid)
) AS UsersWithBadge
GROUP BY
    UsersWithBadge.userid

Related

Executing an Aggregate function within a CASE without GROUP BY

I am trying to assign a specific code to a client based on the number of gifts they have given in the past 6 months, using a CASE. I am unable to use WITH (screenshot) due to the limitations of the software that I am creating the query in; it only allows plain SELECT statements. I am unsure how to get a distinct count from another table (transaction data) and use it as a parameter in the CASE I have currently built (based on my client information table). Does anyone know of any workarounds for this?
I am unable to GROUP BY clientID at the end of my query because not all of my columns are aggregates, and I only need to GROUP BY clientID for this particular WHEN branch of the CASE. I have looked into the OVER() clause, but I need the date range I am evaluating to be dynamic (counting transactions over the last six months), and the number of rows included is variable, as the transaction count varies month to month. Also, the software that I am building this in does not recognize the PARTITION BY part of the OVER clause.
Any help would be great!
EDIT:
it is not letting me attach an image... -____- I have added the two sections of code that I am looking for assistance with!
WITH "6MonthGIftCount" (
"ConstituentID"
,"GiftCount"
)
AS (
SELECT COUNT(DISTINCT "GiftView"."GiftID" FROM "GiftView" WHERE MONTHS_BETWEEN("GiftView"."GiftDate", getdate()) <= 6 GROUP BY "GiftView"."ConstituentID")
SELECT...CASE
WHEN "6MonthGiftCount"."GiftCount" >= 4
THEN 'A010'
)
Perform your grouping/COUNT(1) in a subquery to obtain the total # of donations by ConstituentID, then JOIN this total into your main query that uses this new column to perform its CASE statement.
select
hist.*,
case when timesDonated > 5 then 'gracious donor'
when timesDonated > 3 then 'repeated donor'
when timesDonated >= 1 then 'donor'
else null end as donorCode
from gifthistory hist
left join ( /* your grouping subquery here, pretending to be a new table */
select
personID,
count(1) as timesDonated
from gifthistory i
WHERE abs(months_between(giftDate, sysdate)) <= 6
group by personid ) grp on hist.personid = grp.personID
order by 1;
Naturally, syntax will vary by DB; you didn't specify which one you're using, but you should be able to adapt this template to whichever it is. It works in both Oracle and SQL Server after tweaking the month calculation appropriately.
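For example, on SQL Server the month calculation in that subquery might be tweaked like this (a sketch reusing the same table and column names as above):
select
    personID,
    count(1) as timesDonated
from gifthistory
where giftDate >= dateadd(month, -6, getdate())  -- SQL Server equivalent of the months_between filter
group by personID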

get the latest records

I am currently still on my SQL educational journey and need some help!
The query I have is as below:
SELECT
Audit_Non_Conformance_Records.kf_ID_Client_Reference_Number,
Audit_Non_Conformance_Records.TimeStamp_Creation,
Audit_Non_Conformance_Records.Clause,
Audit_Non_Conformance_Records.NC_type,
Audit_Non_Conformance_Records.NC_Rect_Received,
Audit_Non_Conformance_Records.Audit_Num
FROM Audit_Non_Conformance_Records
I am trying to tweak this to show only the most recent results, based on Audit_Non_Conformance_Records.TimeStamp_Creation.
I have tried using MAX(), but all this does is show the latest date for all records.
Basically, the results of the above give me this;
But I only need the results with the date 02/10/2019, as this is the latest. There may be multiple such results, however; so, for example, if 02/10/2019 had never happened I would need all of the individual records from 14/10/2019.
Does that make any sense at all?
You can filter with a subquery:
SELECT
kf_ID_Client_Reference_Number,
TimeStamp_Creation,
Clause,
NC_type,
NC_Rect_Received,
Audit_Num
FROM Audit_Non_Conformance_Records a
where TimeStamp_Creation = (
select max(TimeStamp_Creation)
from Audit_Non_Conformance_Records
)
This will give you all records whose TimeStamp_Creation is equal to the greatest value available in the table.
If you want all records that have the greatest day (excluding time), then you can do:
SELECT
kf_ID_Client_Reference_Number,
TimeStamp_Creation,
Clause,
NC_type,
NC_Rect_Received,
Audit_Num
FROM Audit_Non_Conformance_Records a
where cast(TimeStamp_Creation as date) = (
select cast(max(TimeStamp_Creation) as date)
from Audit_Non_Conformance_Records
)
Edit
If you want the latest record per refNumber, then you can correlate the subquery, like so:
SELECT
kf_ID_Client_Reference_Number,
TimeStamp_Creation,
Clause,
NC_type,
NC_Rect_Received,
Audit_Num
FROM Audit_Non_Conformance_Records a
where TimeStamp_Creation = (
select max(TimeStamp_Creation)
from Audit_Non_Conformance_Records a1
where a1.refNumber = a.refNumber
)
For performance, you want an index on (refNumber, TimeStamp_Creation).
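A sketch of what that index could look like (SQL Server syntax; the index name is made up):
-- hypothetical index supporting the correlated subquery above
create index ix_ancr_refnumber_ts
    on Audit_Non_Conformance_Records (refNumber, TimeStamp_Creation);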
If you want the latest date in SQL Server, you can express this as:
SELECT TOP (1) WITH TIES ancr.kf_ID_Client_Reference_Number,
ancr.TimeStamp_Creation,
ancr.Clause,
ancr.NC_type,
ancr.NC_Rect_Received,
ancr.Audit_Num
FROM Audit_Non_Conformance_Records ancr
ORDER BY CONVERT(date, ancr.TimeStamp_Creation) DESC;
SQL Server is pretty good about handling dates with conversions, so I would not be surprised if this used an index on TimeStamp_Creation.

SQL WITH AS statements in Ecto Subquery

I have an SQL query that uses PostgreSQL's WITH AS to act as an XOR or "not" LEFT JOIN. The goal is to return what is unique between the two queries.
In this instance, I want to know which users have transactions within a certain time period AND do not have transactions in another time period. The SQL query does this by using WITH to select all the transactions for a certain date range in new_transactions, then all transactions for another date range in older_transactions. From those, we select from new_transactions whatever is NOT in older_transactions.
My Query in SQL is :
/* New Customers */
WITH new_transactions AS (
    select * from transactions
    where merchant_id = 1 and inserted_at > date '2017-11-01'
), older_transactions AS (
    select * from transactions
    where merchant_id = 1 and inserted_at < date '2017-11-01'
)
SELECT * from new_transactions
WHERE user_id NOT IN (select user_id from older_transactions);
I'm trying to replicate this in Ecto via Subquery. I know I can't do a subquery in the where: statement, which leaves me with a left_join. How do I replicate that in Elixir/Ecto?
What I've replicated in Elixir/Ecto throws an (Protocol.UndefinedError) protocol Ecto.Queryable not implemented for [%Transaction....
Elixir/Ecto Code:
def new_merchant_transactions_query(merchant_id, date) do
  from t in MyRewards.Transaction,
    where: t.merchant_id == ^merchant_id and fragment("?::date", t.inserted_at) >= ^date
end

def older_merchant_transactions_query(merchant_id, date) do
  from t in MyRewards.Transaction,
    where: t.merchant_id == ^merchant_id and fragment("?::date", t.inserted_at) <= ^date
end

def new_customers(merchant_id, date) do
  from t in subquery(new_merchant_transactions_query(merchant_id, date)),
    left_join: ot in subquery(older_merchant_transactions_query(merchant_id, date)),
    on: t.user_id == ot.user_id,
    where: t.user_id != ot.user_id,
    select: t.id
end
Update:
I tried changing it to where: is_nil(ot.user_id), but I get the same error.
This maybe should be a comment instead of an answer, but it's too long and needs too much formatting so I went ahead and posted this as an answer. With that out of the way, here we go.
What I would do is re-write the query to avoid the Common Table Expression (or CTE; this is what a WITH AS is really called) and the IN() expression, and instead I'd do an actual JOIN, like this:
SELECT n.*
FROM transactions n
LEFT JOIN transactions o ON o.user_id = n.user_id and o.merchant_id = 1 and o.inserted_at < date '2017-11-01'
WHERE n.merchant_id = 1 and n.inserted_at > date '2017-11-01'
AND o.inserted_at IS NULL
You might also choose to do a NOT EXISTS(), which on SQL Server at least will often produce a better execution plan.
This is probably a better way to handle the query anyway, but once you do that you may also find it solves your problem by making the query much easier to translate to Ecto.
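For reference, a sketch of the NOT EXISTS() variant of the same query:
-- NOT EXISTS variant of the LEFT JOIN rewrite above
SELECT n.*
FROM transactions n
WHERE n.merchant_id = 1
  AND n.inserted_at > date '2017-11-01'
  AND NOT EXISTS (
      SELECT 1
      FROM transactions o
      WHERE o.user_id = n.user_id
        AND o.merchant_id = 1
        AND o.inserted_at < date '2017-11-01'
  );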

BigQuery - Query time becomes extremely long

Recently all my queries have been taking too long, even though basically none of them consume any data.
For example, for a really simple query:
Start Time: Jan 14, 2016, 12:35:13 PM
End Time: Jan 14, 2016, 12:35:15 PM
Bytes Processed: 0 B
Bytes Billed: 0 B
Billing Tier: 1
Destination Table: ****************.******************
Write Preference: Append to table
Allow Large Results: true
Flatten Results: true
This is the information I got from the BQ console, which tells me that the query doesn't consume any data (true) and only takes two seconds.
But it actually takes 27 seconds when I run the query again in the console by clicking Run Query in the query history. And after that, the Query History in the console shows the query took 2 seconds again.
Basically all the queries in this dataset have this issue.
I have over 40,000 tables in this dataset.
So my guess is that before BQ actually runs the query, it first locates the tables that are going to be used; only then does it start to execute the query, which is the start time in the query history.
If that is the case, how should I solve it, and why does it take so long?
Here is the query I mentioned (I have made some changes):
select "some_id", '2015-12-01', if (count(user_id) == 0, NULL, sum(users_in_today_again) / count(user_id)) as retention
from
(
select
users_in_last_day.user_id as user_id,
if(users_in_today.user_id is null, 0, 1) as users_in_today_again
FROM
(
select user_id
from
table_date_range(ds.sessions_some_id_, date_add(timestamp('2015-12-01'), -1, "DAY"), date_add(timestamp('2015-12-01'), -1, "DAY"))
group by user_id
) as users_in_last_day
left join
(
select user_id
from table_date_range(ds.sessions_some_id_, timestamp('2015-12-01'), timestamp('2015-12-01'))
group by user_id
) as users_in_today
on users_in_last_day.user_id = users_in_today.user_id
)
Thanks in advance!
PART 1
You can check your theory about the delay before the start time by using the Jobs: get API with the jobid taken from the Query History in the BQ Console.
As you can see in Job Resources, the statistics parameter, in addition to startTime and endTime, also has creationTime.
PART 2
Shooting in the dark here, but try below
SELECT "some_id", '2015-12-01', IF (COUNT(user_id) == 0, NULL, SUM(users_in_today_again) / COUNT(user_id)) AS retention
FROM
(
SELECT
users_in_last_day.user_id AS user_id,
IF(users_in_today.user_id IS NULL, 0, 1) AS users_in_today_again
FROM
(
SELECT user_id FROM (
SELECT user_id, ROW_NUMBER() OVER(PARTITION BY user_id) AS pos
FROM TABLE_DATE_RANGE(ds.sessions_some_id_, DATE_ADD(TIMESTAMP('2015-12-01'), -1, "DAY"), DATE_ADD(TIMESTAMP('2015-12-01'), -1, "DAY"))
) WHERE pos = 1
) AS users_in_last_day
LEFT JOIN
(
SELECT user_id FROM (
SELECT user_id, ROW_NUMBER() OVER(PARTITION BY user_id) AS pos
FROM TABLE_DATE_RANGE(ds.sessions_some_id_, TIMESTAMP('2015-12-01'), TIMESTAMP('2015-12-01'))
) WHERE pos = 1
) AS users_in_today
ON users_in_last_day.user_id = users_in_today.user_id
)
I know it might look silly, but the explanation stats (based on some dummy data) for this version are totally different from those for the version in the question.
My wild guess is that the heavy read/compute in Stage 1/2 of the original version is responsible for the delay in question.
Just a guess.
As hinted at on the comment thread on Mikhail's question, most of the time is probably spent evaluating the TABLE_DATE_RANGE functions in the query. This time is currently accounted for between creationTime and startTime in query statistics.
In general, tens or hundreds of thousands of tables in a dataset will cause slow performance when using TABLE_DATE_RANGE, TABLE_QUERY, or the <dataset>.__TABLES__ metatable. We are working to update our public documentation to mention this.
My suggestion is that if you want to use table wildcards on a dataset, make sure it doesn't have too many tables in it. If that solution is unworkable for you, let us know on our issue tracker what BigQuery could support to make your use case easier.

counting date and time for historical reporting

I am currently working on a query that will be used in conjunction with SharePoint to run reports. I have a query that I know will work with Oracle, but the company I am working for is running SQL Server 2005.
The report will give the person the ability to select any date and time and see the count for that specific operation. The problem is that there are large gaps in the timestamps (because it takes a little while for the product to get to the next operation). The date column's type is varchar, so I used substrings to parse out the year, month, day, and time. I have sample data available.
The people looking at the reports want the ability to say: at this time and day, how many units went through this operation?
I know this is confusing; let me know if you need any clarification.
Here is the Oracle syntax:
SELECT T3.PAYMENT_DATE, T3."Hr", T3."Min",
       (SELECT COUNT(*)
        FROM INVOICE_ARCHIVE T4
        WHERE TO_NUMBER(TO_CHAR(T4.PAYMENT_DATE, 'MM')) <= T3."Hr"
          AND TO_NUMBER(TO_CHAR(T4.PAYMENT_DATE, 'DD')) <= T3."Min") AS "NUM"
FROM (SELECT T1.PAYMENT_DATE, T2."Hr", T2."Min"
      FROM (SELECT FLOOR((LEVEL + 359)/60) AS "Hr",
                   MOD((LEVEL + 359), 60) AS "Min"
            FROM dual CONNECT BY LEVEL <= 961) T2, INVOICE_ARCHIVE T1
      ORDER BY T1.PAYMENT_DATE, T2."Hr", T2."Min") T3
The answer to your question is the datepart() function in SQL Server. This will allow you to extract minutes and hours from dates.
The harder part is the "connect by level" portion. How is this being used? You might need to use recursive CTEs to handle this.
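For example, a minimal sketch of a recursive CTE that reproduces the 1..961 sequence CONNECT BY LEVEL generates in Oracle (SQL Server syntax, separate from the full query below):
-- sketch: generate numbers 1..961 with a recursive CTE instead of CONNECT BY LEVEL
with numbers as (
    select 1 as level
    union all
    select level + 1 from numbers where level < 961
)
select (level + 359) / 60 as "Hr",   -- integer division gives the hour
       (level + 359) % 60 as "Min"   -- % replaces Oracle's MOD()
from numbers
option (maxrecursion 961);  -- default recursion limit is 100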
With the little hint from spencer, the following may suffice for your query:
SELECT T3.PAYMENT_DATE, T3."Hr", T3."Min",
       (SELECT COUNT(*)
        FROM INVOICE_ARCHIVE T4
        WHERE datepart(month, T4.PAYMENT_DATE) <= T3."Hr" AND
              datepart(day, T4.PAYMENT_DATE) <= T3."Min"
       ) AS "NUM"
FROM (SELECT T1.PAYMENT_DATE, T2."Hr", T2."Min"
      FROM (SELECT (level + 359) / 60 AS "Hr",
                   (level + 359) % 60 AS "Min"
            FROM (select top 961 row_number() over (order by (select NULL)) as level
                  from invoice_archive
                 ) t
           ) T2 cross join
           INVOICE_ARCHIVE T1
     ) T3
ORDER BY T3.PAYMENT_DATE, T3."Hr", T3."Min"
I made the following changes:
Changed the date arithmetic to use datepart() instead of to_char(), and the modulo to use the % operator, since SQL Server has no MOD() function.
Replaced the method for getting a list of numbers, using row_number() instead of connect by level.
Made the cross join explicit.
Moved the order by to the outer query, since neither SQL Server nor Oracle guarantees the results of an order by in a subquery (and SQL Server does not allow it unless there is a TOP in the query).