SQL query - Limiting query results

I am quite certain we cannot use the LIMIT clause for what I want to do - so wanted to find if there are any other ways we can accomplish this.
I have a table which captures which user visited which store. Every time a user visits a store, a row is inserted into this table.
Some of the fields are
shopping_id (primary key)
store_id
user_id
Now what I want is - for a given set of stores, find the top 5 users who have visited the store max number of times.
I can do this 1 store at a time as:
select store_id,user_id,count(1) as visits
from shopping
where store_id = 60
group by user_id,store_id
order by visits desc Limit 5
This will give me the 5 users who have visited store_id=60 the max times
What I want to do is provide a list of 10 store_ids and for each store fetch the 5 users who have visited that store max times
select store_id,user_id,count(1) as visits
from shopping
where store_id in (60,61,62,63,64,65,66)
group by user_id,store_id
order by visits desc Limit 5
This will not work as the Limit at the end will return only 5 rows rather than 5 rows for each store.
Any ideas on how I can achieve this? I can always write a loop and pass 1 store at a time, but I wanted to know if there is a better way.

Use two user variables and count consecutive rows that have the same store_id. You can replace <= 5 with whatever limit you want:
SELECT a.*
FROM (
SELECT store_id, user_id, count(1) as visits
FROM shopping
WHERE store_id IN (60,61,62,63,64,65,66)
GROUP BY store_id, user_id
ORDER BY store_id, visits desc, user_id
) a,
(SELECT @prev:=-1, @count:=1) b
WHERE
CASE WHEN @prev<>a.store_id THEN
CASE WHEN @prev:=a.store_id THEN
@count:=1
END
ELSE
@count:=@count+1
END <= 5
Edit: as requested, some explanation.
The first subquery (a) groups and orders the data, so you will have rows like:
store_id | user_id | visits
---------+---------+-------
      60 |       1 |      5
      60 |       2 |      3
      60 |       3 |      1
      61 |       2 |      4
      61 |       3 |      2
The second subquery (b) initializes the user variable @prev to -1 and @count to 1.
Then we select all rows from subquery (a) that satisfy the condition in the CASE:
First, check whether the previous store_id we have seen (@prev) is different from the current store_id.
Since @prev starts at -1, nothing matches the current store_id, so the <> condition is true. We then enter the second CASE, which only serves to set @prev to the current store_id and reset @count to 1. This is the trick that lets me change both user variables, @count and @prev, in the same condition.
If the current store_id is equal to @prev, just increment @count.
Finally, we check that the count is within the limit we want, hence the <= 5.
So with our test data:
step | @prev | @count | store_id | user_id | visits
-----+-------+--------+----------+---------+-------
   0 |    -1 |      1 |          |         |
   1 |    60 |      1 |       60 |       1 |      5
   2 |    60 |      2 |       60 |       2 |      3
   3 |    60 |      3 |       60 |       3 |      1
   4 |    61 |      1 |       61 |       2 |      4
   5 |    61 |      2 |       61 |       3 |      2
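For comparison, the same per-store counter can be expressed with ROW_NUMBER() on databases that support window functions (MySQL 8.0 and later, as far as I know). This is a different technique from the user-variable trick above, not a transcription of it. The sketch below runs the SQL on an in-memory SQLite database through Python purely for illustration; the sample rows and the TOP_N value are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shopping ("
    "shopping_id INTEGER PRIMARY KEY, store_id INT, user_id INT)"
)

# Made-up sample: (store_id, user_id, number of visits to insert).
sample = [(60, 1, 5), (60, 2, 3), (60, 3, 1), (61, 2, 4), (61, 3, 2)]
for store_id, user_id, visits in sample:
    conn.executemany(
        "INSERT INTO shopping (store_id, user_id) VALUES (?, ?)",
        [(store_id, user_id)] * visits,
    )

TOP_N = 2  # hypothetical limit; the question uses 5

query = """
SELECT store_id, user_id, visits
FROM (
    SELECT store_id, user_id, visits,
           -- rn restarts at 1 for every store, like @count above
           ROW_NUMBER() OVER (PARTITION BY store_id
                              ORDER BY visits DESC, user_id) AS rn
    FROM (
        SELECT store_id, user_id, COUNT(*) AS visits
        FROM shopping
        WHERE store_id IN (60, 61)
        GROUP BY store_id, user_id
    ) counted
) ranked
WHERE rn <= ?
ORDER BY store_id, visits DESC, user_id
"""
top_users = conn.execute(query, (TOP_N,)).fetchall()
print(top_users)
```

The PARTITION BY store_id clause plays the role of the @prev comparison, and the rn <= ? filter plays the role of the <= 5 check.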

The major concern here is the number of times you query the database.
Querying multiple times from your script is simply a waste of resources and should be avoided.
That is, you should NOT run a loop that executes the SQL multiple times while incrementing a value, in your case from 60 to 61 and so on.
Solution 1:
Create a view.
Here is the solution (each SELECT is wrapped in parentheses so that its ORDER BY and LIMIT apply to that SELECT only):
CREATE VIEW myView AS
(select store_id,user_id,count(1) as visits
from shopping
where store_id = 60
group by user_id,store_id
order by visits desc Limit 5)
UNION
(select store_id,user_id,count(1) as visits
from shopping
where store_id = 61
group by user_id,store_id
order by visits desc Limit 5)
UNION
(select store_id,user_id,count(1) as visits
from shopping
where store_id = 62
group by user_id,store_id
order by visits desc Limit 5)
Now use
SELECT * from myView
This is limited because you can't make it dynamic.
What if you need 60 to 100 instead of 60 to 66?
Solution 2:
Use a procedure.
I won't go into how to write a procedure because it's late at night and I have to sleep. :)
The procedure must accept two values: first the initial number (60) and second the count (6).
Inside the procedure, create a temporary table (cursor) to store the data, then run a loop starting at the initial number for count iterations.
In your case, from 60 to 66.
Inside the loop, write the desired script, replacing 60 with the loop variable.
select store_id,user_id,count(1) as visits
from shopping
where store_id = 60
group by user_id,store_id
order by visits desc Limit 5
And append the result in the temporary table (cursor).
Hope this will solve your problem.
Sorry I couldn't give you the code. If you still need it plz send me a message. I will give it to you when I wake up next morning.
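Since no code was given for Solution 2, here is a rough sketch of the loop-and-append logic it describes. It is written as a Python function against SQLite rather than as a stored procedure, so it only illustrates the control flow; the function name, its parameters, and the sample rows are my own assumptions.

```python
import sqlite3


def top_users_per_store(conn, first_store, store_count, limit=5):
    """Run the single-store query once per store and append the rows."""
    results = []  # stands in for the temporary table
    for store_id in range(first_store, first_store + store_count):
        rows = conn.execute(
            """
            SELECT store_id, user_id, COUNT(*) AS visits
            FROM shopping
            WHERE store_id = ?
            GROUP BY store_id, user_id
            ORDER BY visits DESC, user_id
            LIMIT ?
            """,
            (store_id, limit),
        ).fetchall()
        results.extend(rows)
    return results


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shopping ("
    "shopping_id INTEGER PRIMARY KEY, store_id INT, user_id INT)"
)
# Made-up visits as (store_id, user_id) pairs.
visits = [(60, 1), (60, 1), (60, 2), (61, 3), (61, 3), (61, 3), (61, 1)]
conn.executemany("INSERT INTO shopping (store_id, user_id) VALUES (?, ?)", visits)

per_store = top_users_per_store(conn, 60, 2, limit=1)
print(per_store)
```

Inside a real stored procedure the loop would run on the server, which avoids the repeated round trips that the start of this answer warns about.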

UNION may be what you are looking for.
-- first store
(select store_id,user_id,count(1) as visits
from shopping
where store_id = 60
group by user_id,store_id
order by visits desc Limit 5)
UNION ALL
-- second store
(select store_id,user_id,count(1) as visits
from shopping
where store_id = 61
group by user_id,store_id
order by visits desc Limit 5)
...
http://dev.mysql.com/doc/refman/5.0/en/union.html

If you do not need to save data about when a user visited a store or anything like that, you could simply update the table each time a user visits a store instead of appending a new row.
Something like this:
INSERT INTO `user_store` (`user_id`, `store_id`, `visits`) VALUES ('USER', 'SHOP', 1)
ON DUPLICATE KEY UPDATE `visits` = `visits` + 1
But I think this would not work as is, because neither user_id nor store_id is unique. You have to add a unique primary key like this: user#store or something else.
Another option would be to save this data (how often a user was in a store) in a separate table consisting of ID, user_id, store_id, and visits, and to increment visits every time you add a new row to your existing table.
To get the Top5 you can then use:
SELECT `visits`, `user_id` FROM `user_store_times` WHERE `store_id`=10 ORDER BY `visits` DESC LIMIT 5
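A minimal sketch of the counter-table idea, assuming a composite primary key on (user_id, store_id) so the upsert has something to conflict on. It runs on SQLite, whose upsert is spelled ON CONFLICT ... DO UPDATE rather than MySQL's ON DUPLICATE KEY UPDATE; the record_visit helper and the sample visits are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE user_store (
        user_id  INT NOT NULL,
        store_id INT NOT NULL,
        visits   INT NOT NULL DEFAULT 1,
        PRIMARY KEY (user_id, store_id)  -- assumed composite key
    )
    """
)


def record_visit(conn, user_id, store_id):
    """Insert a first visit, or bump the counter if the pair already exists."""
    conn.execute(
        """
        INSERT INTO user_store (user_id, store_id, visits) VALUES (?, ?, 1)
        ON CONFLICT (user_id, store_id) DO UPDATE SET visits = visits + 1
        """,
        (user_id, store_id),
    )


# Made-up visits as (user_id, store_id) pairs.
for user_id, store_id in [(1, 10), (1, 10), (2, 10), (1, 10), (2, 11)]:
    record_visit(conn, user_id, store_id)

top = conn.execute(
    "SELECT visits, user_id FROM user_store "
    "WHERE store_id = ? ORDER BY visits DESC LIMIT 5",
    (10,),
).fetchall()
print(top)
```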

The simplest way would be to issue 10 separate queries, one for each store. If you use parameterized queries (e.g. using PDO in PHP) this will be pretty fast since the query will be part-compiled.
If this still proves to be too resource-intensive, then another solution would be to cache the results in the store table - i.e. add a field that lists the top 5 users for each store as a simple comma-separated list. It does mean your database would not be 100% normalised but that shouldn't be a problem.

Related

Find most recent date of purchase in user day table

I'm trying to put together a query that will fetch the date, purchase amount, and number of transactions of the last time each user made a purchase. I am pulling from a user day table that contains a row for each time a user does anything in the app, purchase or not. Basically all I am trying to get is the most recent date in which the number of transactions field was greater than zero. The below query returns all days of purchase made by a particular user when all I'm looking for is the last purchase so just the 1st row shown in the attached screenshot is what I am trying to get.
screen shot of query and result set
select tuid, max(event_day),
purchases_day_rev as last_dop_rev,
purchases_day_num as last_dop_quantity,
purchases_day_rev/nullif(purchases_day_num,0) as last_dop_spend_pp
from
(select tuid, event_day,purchases_day_rev,purchases_day_num
from
app.user_day
where purchases_day_num > 0
and tuid='122d665e-1d71-4319-bb0d-05c7f37a28b0'
group by 1,2,3,4) a
group by 1,3,4,5
I'm not going to comment on the logic of your query... if all you want is the first row of your result set, you can try:
<your query here> ORDER BY 2 DESC LIMIT 1 ;
Where ORDER BY 2 DESC orders the result set on max(event_day) and LIMIT 1 extracts only the first row.
I don't know all of the ins and outs of your data, but I don't understand why you are grouping within the subquery without any aggregate function (sum, average, min, max, etc). With that said, I would try something like this:
select a.tuid
,a.event_day
,a.purchases_day_rev as last_dop_rev
,a.purchases_day_num as last_dop_quantity
,a.purchases_day_rev/nullif(a.purchases_day_num,0) as last_day_spend_pp
from app.user_day a
inner join
(
select tuid
,max(event_day) as max_day
from app.user_day
where purchases_day_num > 0
and tuid='122d665e-1d71-4319-bb0d-05c7f37a28b0'
group by 1
) b
on a.tuid = b.tuid
and a.event_day = b.max_day;
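To see the join-on-max pattern return one row per user, here is a small sketch on made-up data. It uses SQLite through Python, drops the app. schema prefix and the single-user filter, and the tuid values u1 and u2 are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_day ("
    "tuid TEXT, event_day TEXT, purchases_day_rev REAL, purchases_day_num INT)"
)
conn.executemany(
    "INSERT INTO user_day VALUES (?, ?, ?, ?)",
    [
        ("u1", "2017-01-01", 10.0, 2),
        ("u1", "2017-01-05", 9.0, 3),   # last purchase day for u1
        ("u1", "2017-01-07", 0.0, 0),   # later activity, but no purchase
        ("u2", "2017-01-03", 4.0, 1),
    ],
)

query = """
SELECT a.tuid,
       a.event_day,
       a.purchases_day_rev AS last_dop_rev,
       a.purchases_day_num AS last_dop_quantity,
       a.purchases_day_rev / NULLIF(a.purchases_day_num, 0) AS last_dop_spend_pp
FROM user_day a
JOIN (
    -- most recent day with at least one purchase, per user
    SELECT tuid, MAX(event_day) AS max_day
    FROM user_day
    WHERE purchases_day_num > 0
    GROUP BY tuid
) b ON a.tuid = b.tuid AND a.event_day = b.max_day
ORDER BY a.tuid
"""
last_purchases = conn.execute(query).fetchall()
print(last_purchases)
```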

SQL Find latest record only if COMPLETE field is 0

I have a table with multiple records submitted by a user. In each record is a field called COMPLETE to indicate if a record is fully completed or not.
I need a way to get the latest record for each combination of user, LOCATION, and DATE where COMPLETE is 0 and no other record with the same user, LOCATION, and DATE has COMPLETE set to 1. Each record has additional fields such as TYPE, AMOUNT, TOTAL, etc. These can be different, even though the USER, LOCATION, and DATE are the same.
There is a SUB_DATE field and ID field that denote the day the submission was made and auto incremented ID number. Here is the table:
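ID | NAME  | LOCATION | DATE       | COMPLETE | SUB_DATE   | TYPE1  | AMOUNT1 | TYPE2 | AMOUNT2 | TOTAL
---+-------+----------+------------+----------+------------+--------+---------+-------+---------+-------
1  | user1 | loc1     | 2017-09-15 | 1        | 2017-09-10 | Food   | 12.25   | Hotel | 65.54   | 77.79
2  | user1 | loc1     | 2017-09-15 | 0        | 2017-09-11 | Food   | 12.25   | NULL  | 0       | 12.25
3  | user1 | loc2     | 2017-08-13 | 0        | 2017-09-05 | Flight | 140     | Food  | 5       | 145.00
4  | user1 | loc2     | 2017-08-13 | 0        | 2017-09-10 | Flight | 140     | NULL  | 0       | 140
5  | user1 | loc3     | 2017-07-14 | 0        | 2017-07-15 | Taxi   | 25      | NULL  | 0       | 25
6  | user1 | loc3     | 2017-08-25 | 1        | 2017-08-26 | Food   | 45      | NULL  | 0       | 45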
The result I would like to retrieve is ID 4, because its SUB_DATE is later than that of ID 3, which has the same Name, Location, and Date information, and there is no record in that group with a COMPLETE value of 1.
I would also like to retrieve ID 5, since it is the latest record for its User, Location, and Date, and COMPLETE is 0.
I would also appreciate it if you could explain your answer to help me understand what is happening in the solution.
Not sure if I fully understood but try this
SELECT *
FROM (
SELECT *,
MAX(CONVERT(INT,COMPLETE)) OVER (PARTITION BY NAME,LOCATION,DATE) AS CompleteForNameLocationAndDate,
MAX(SUB_DATE) OVER (PARTITION BY NAME, LOCATION, DATE) AS LastSubDate
FROM your_table t
) a
WHERE CompleteForNameLocationAndDate = 0 AND
SUB_DATE = LastSubDate
So what we have done here:
First, if you run just the inner query in Management Studio, you will see what that does:
The first max function will partition the data in the table by each unique Name,Location,Date set.
In the case of your data, ID 1 & 2 are the first partition, 3&4 are the second partition, 5 is the 3rd partition and 6 is the 4th partition.
So for each of these partitions it will get the max value in the complete column. Therefore any partition with a 1 as its max value has been completed.
Note also, the convert function. This is because COMPLETE is of datatype BIT (1 or 0) and the max function does not work with that datatype. We therefore convert to INT. If your COMPLETE column is type INT, you can take the convert out.
The second max function partitions by unique Name, Location and Date again, but this time we are getting the max SUB_DATE, which gives us the date of the latest record for the Name, Location, Date.
So we take that query and add it to a derived table which for simplicity we call a. We need to do this because SQL Server doesn't allow windowed functions in the WHERE clause of queries. A windowed function is one that makes use of the OVER keyword as we have done. In an ideal world, SQL would let us do
SELECT *,
MAX(CONVERT(INT,COMPLETE)) OVER (PARTITION BY NAME,LOCATION,DATE) AS CompleteForNameLocationAndDate,
MAX(SUB_DATE) OVER (PARTITION BY NAME, LOCATION, DATE) AS LastSubDate
FROM your_table t
WHERE MAX(CONVERT(INT,COMPLETE)) OVER (PARTITION BY NAME,LOCATION,DATE) = 0 AND
SUB_DATE = MAX(SUB_DATE) OVER (PARTITION BY NAME, LOCATION, DATE)
But it doesn't allow it so we have to use the derived table.
So then we basically SELECT everything from our derived table Where
CompleteForNameLocationAndDate = 0
Which are Name,Location, Date partitions which do not have a record marked as complete.
Then we filter further asking for only the latest record for each partition
SUB_DATE = LastSubDate
Hope that makes sense, not sure what level of detail you need?
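To check the logic against the sample rows from the question, here is a small sketch that loads them into an in-memory SQLite database and runs the same derived-table query. It assumes COMPLETE is stored as an integer, so the CONVERT is dropped, and it keeps only the columns the query needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table ("
    "ID INT, NAME TEXT, LOCATION TEXT, DATE TEXT, COMPLETE INT, SUB_DATE TEXT)"
)
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "user1", "loc1", "2017-09-15", 1, "2017-09-10"),
        (2, "user1", "loc1", "2017-09-15", 0, "2017-09-11"),
        (3, "user1", "loc2", "2017-08-13", 0, "2017-09-05"),
        (4, "user1", "loc2", "2017-08-13", 0, "2017-09-10"),
        (5, "user1", "loc3", "2017-07-14", 0, "2017-07-15"),
        (6, "user1", "loc3", "2017-08-25", 1, "2017-08-26"),
    ],
)

query = """
SELECT ID
FROM (
    SELECT *,
           MAX(COMPLETE) OVER (PARTITION BY NAME, LOCATION, DATE)
               AS CompleteForNameLocationAndDate,
           MAX(SUB_DATE) OVER (PARTITION BY NAME, LOCATION, DATE)
               AS LastSubDate
    FROM your_table
) a
WHERE CompleteForNameLocationAndDate = 0
  AND SUB_DATE = LastSubDate
ORDER BY ID
"""
ids = [row[0] for row in conn.execute(query)]
print(ids)
```

IDs 1 and 2 drop out because their partition contains a completed record, ID 6 is itself complete, and ID 3 is not the latest submission in its partition.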
As an aside, I would look at restructuring your tables (unless of course you have simplified to better explain this problem) as follows:
(Assuming the table in your examples is called Booking)
tblBooking
    BookingID
    PersonID
    LocationID
    Date
    Complete
    SubDate
tblPerson
    PersonID
    PersonName
tblLocation
    LocationID
    LocationName
tblType
    TypeID
    TypeName
tblBookingType
    BookingTypeID
    BookingID
    TypeID
    Amount
This way, if you ever want to add Type3 or Type4 to your booking information, you don't need to alter your table layout.

Query to find all timestamps more than a certain interval apart

I'm using postgres to run some analytics on user activity. I have a table of all requests(pageviews) made by every user and the timestamp of the request, and I'm trying to find the number of distinct sessions for every user. For the sake of simplicity, I'm considering every set of requests an hour or more apart from others as a distinct session. The data looks something like this:
id| request_time| user_id
1 2014-01-12 08:57:16.725533 1233
2 2014-01-12 08:57:20.944193 1234
3 2014-01-12 09:15:59.713456 1233
4 2014-01-12 10:58:59.713456 1234
How can I write a query to get the number of sessions per user?
To start a new session after every gap >= 1 hour:
SELECT user_id, count(*) AS distinct_sessions
FROM (
SELECT user_id
,(lag(request_time, 1, '-infinity') OVER (PARTITION BY user_id
ORDER BY request_time)
<= request_time - '1h'::interval) AS step -- start new session
FROM tbl
) sub
WHERE step
GROUP BY user_id
ORDER BY user_id;
Assuming request_time NOT NULL.
Explain:
In subquery sub, check for every row if a new session begins. Using the third parameter of lag() to provide the default -infinity, which is lower than any timestamp and therefore always starts a new session for the first row.
In the outer query count how many times new sessions started. Eliminate step = FALSE and count per user.
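The same idea can be tried on the sample rows from the question. The sketch below is an adaptation for SQLite (run through Python), not the Postgres query itself: SQLite has no '-infinity' timestamp or interval type, so the first row is detected with a NULL check and the gap is measured with julianday(). The timestamps are truncated to whole seconds for simplicity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INT, request_time TEXT, user_id INT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [
        (1, "2014-01-12 08:57:16", 1233),
        (2, "2014-01-12 08:57:20", 1234),
        (3, "2014-01-12 09:15:59", 1233),
        (4, "2014-01-12 10:58:59", 1234),
    ],
)

query = """
SELECT user_id, COUNT(*) AS distinct_sessions
FROM (
    SELECT user_id,
           request_time,
           lag(request_time) OVER (PARTITION BY user_id
                                   ORDER BY request_time) AS prev_time
    FROM tbl
) sub
-- a row starts a session if it is the user's first request
-- or if it comes at least one hour after the previous one
WHERE prev_time IS NULL
   OR (julianday(request_time) - julianday(prev_time)) * 24.0 >= 1.0
GROUP BY user_id
ORDER BY user_id
"""
sessions = conn.execute(query).fetchall()
print(sessions)
```

User 1233 has two requests about 18 minutes apart, so one session; user 1234 has two requests about two hours apart, so two sessions.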
Alternative interpretation
If you really wanted to count hours where at least one request happened (I don't think you do, but another answer assumes as much), you would:
SELECT user_id
, count(DISTINCT date_trunc('hour', request_time)) AS hours_with_req
FROM tbl
GROUP BY 1
ORDER BY 1;

report builder - queries returning incorrect results

I have set a reporting project where I would like to get stats for my tables and later integrate this into a webservice. For the following queries though, I am getting incorrect results and I will note where below:
1 - Get the number of new entries for a given day
SELECT COUNT(*) AS RecordsCount,CAST(FLOOR(CAST(dateadded AS float))
AS datetime)as collectionDate
FROM TFeed GROUP BY CAST(FLOOR(CAST(dateadded AS float))
AS datetime) order by collectionDate
works fine and I am able to put this in a bar graph successfully.
2 - Get the top 10 searchterms with the highest records per searchterm requested by a given client in the last 10 days
SELECT TOP 10 searchterm, clientId, COUNT(*) AS TermResults FROM TFeed
where dateadded > getdate() - 10 GROUP BY
searchterm,clientId order by TermResults desc
does not work
If I do a query in the Database for one of those terms that returns 98 in the report, the result is 984 in the database.
3 - I need to get the number of new records per client for a given day as well.
Also I was wondering if it is possible to put these queries into one report and not individual reports for each query which is not a big deal but having to cut and paste into one doc afterwards is tedious.
Any ideas appreciated
For #2,
WITH tmp as
(
SELECT clientId, searchTerm, COUNT(1) as TermResults,
DENSE_RANK() OVER (partition by clientId
ORDER BY clientId, COUNT(1) DESC) as rnk
FROM TFeed
WHERE dateadded > GETDATE() - 10
GROUP BY clientId, searchterm
)
SELECT *
FROM tmp
WHERE rnk < 11
Use RANK() if you want to skip a rank when there are ties (say term1 and term2 have the same count: they are both rank 1, and the following term will be ranked 3rd instead of 2nd).
For #3,
you can define multiple datasets within one report. Then you would just create three charts / tables and associate those with their respective datasets.
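To illustrate the RANK() versus DENSE_RANK() difference mentioned for #2, here is a tiny sketch on made-up counts. It runs on SQLite through Python; the term_counts table and its values are hypothetical stand-ins for the grouped TFeed results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE term_counts (searchterm TEXT, term_results INT)")
conn.executemany(
    "INSERT INTO term_counts VALUES (?, ?)",
    [("term1", 5), ("term2", 5), ("term3", 3)],  # term1 and term2 tie
)

query = """
SELECT searchterm,
       RANK()       OVER (ORDER BY term_results DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY term_results DESC) AS dense_rnk
FROM term_counts
ORDER BY term_results DESC, searchterm
"""
ranks = conn.execute(query).fetchall()
print(ranks)
```

With a tie for first place, RANK() gives the third term rank 3, while DENSE_RANK() gives it rank 2, so a filter like rnk < 11 can return different row counts depending on which function is used.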

a question about sql group by

I have a table named visiting that looks like this:
id | visitor_id | visit_time
-------------------------------------
1 | 1 | 2009-01-06 08:45:02
2 | 1 | 2009-01-06 08:58:11
3 | 1 | 2009-01-06 09:08:23
4 | 1 | 2009-01-06 21:55:23
5 | 1 | 2009-01-06 22:03:35
I want to work out a SQL query that gets how many times a user visits within one session (successive visits less than 1 hour apart).
So, for the example data, I want to get following result:
visitor_id | count
-------------------
1 | 3
1 | 2
BTW, I use postgresql 8.3.
Thanks!
UPDATE: updated the timestamps in the example data table. sorry for the confusion.
UPDATE: I don't care much if the solution is a single sql query, using store procedure, subquery etc. I only care how to get it done :)
The question is slightly ambiguous because you're making the assumption or requiring that the hours are going to start at a set point, i.e. a natural query would also indicate that there's a result record of (1,2) for all the visits between the hour of 08:58 and 09:58. You would have to "tell" your query that the start times are for some determinable reason visits 1 and 4, or you'd get the natural result set:
visitor_id | count
--------------------
1 | 3
1 | 2 <- extra result starting at visit 2
1 | 1 <- extra result starting at visit 3
1 | 2
1 | 1 <- extra result starting at visit 5
That extra logic is going to be expensive and too complicated for my fragile mind this morning, somebody better than me at postgres can probably solve this.
I would normally want to solve this by having a sessionkey column in the table I could cheaply group by for performance reasons, but there's also a logical problem I think. Deriving session info from timings seems dangerous to me because I don't believe that the user will definitely be logged out after an hour's activity. Most session systems work by expiring the session after a period of inactivity, i.e. it's very likely that a visit after 9:45 is going to be in the same session because your hourly period is going to be reset at 9:08.
The problem seems a little fuzzy.
It gets more complicated as id 3 is within an hour of id 1 and 2, but if the user had visited at 9:50 then that would have been within an hour of 2 but not 1.
You seem to be after a smoothed total - for a given visit, how many visits are within the following hour?
Perhaps you should be asking for how many visits have a succeeding visit less than an hour distant? If a visit is less than an hour from the preceding one then should it 'count'?
So what you probably want is how many chains do you have where the links are less than an arbitrary amount (so the hypothetical 9:50 visit would be included in the chain that starts with id 1).
no simple solution
There is no way to do this in a single SQL statement.
Below are 2 ideas: one uses a loop to count visits, the other changes the way the visiting table is populated.
loop solution
However, it can be done without too much trouble with a loop.
(I have tried to get the postgresql syntax correct, but I'm no expert)
/* find entries where there is no previous entry for */
/* the same visitor within the previous hour: */
select v1.* , 0 visits
into temp_table
from visiting v1
where not exists ( select 1
from visiting v2
where v2.visitor_id = v1.visitor_id
and v2.visit_time < v1.visit_time
and v1.visit_time - interval '1 hour' < v2.visit_time
)
select @rows = @@rowcount
while @rows > 0
begin
update temp_table
set visits = visits + 1 ,
last_time = v.visit_time
from temp_table t ,
visiting v
where t.visitor_id = v.visitor_id
and v.visit_time - interval '1 hour' < t.last_time
and not exists ( select 1
from visiting v2
where v2.visitor_id = t.visitor_id
and v2.visit_time between t.last_time and v.visit_time
)
select @rows = @@rowcount
end
/* get the result: */
select visitor_id,
visits
from temp_table
The idea here is to do this:
get all visits where there is no prior visit inside of an hour.
this identifies the sessions
loop, getting the next visit for each of these "first visits"
until there are no more "next visits"
now you can just read off the number of visits in each session.
best solution?
I suggest:
add a column to the visiting table: session_id int not null
change the process which makes the entries so that it checks to see if the previous visit by the current visitor was less than an hour ago. If so, it sets session_id to the same as the session id for that earlier visit. If not, it generates a new session_id .
you could put this logic in a trigger.
Then your original query can be solved by:
SELECT session_id, visitor_id, count(*)
FROM visiting
GROUP BY session_id, visitor_id
Hope this helps. If I've made mistakes (I'm sure I have), leave a comment and I'll correct it.
PostgreSQL 8.4 will have windowing functions; once it does, we can stop creating a temporary table just to simulate row numbers (for sequencing purposes).
create table visit
(
visitor_id int not null,
visit_time timestamp not null
);
insert into visit(visitor_id, visit_time)
values
(1, '2009-01-06 08:45:02'),
(2, '2009-02-06 08:58:11'),
(1, '2009-01-06 08:58:11'),
(1, '2009-01-06 09:08:23'),
(1, '2009-01-06 21:55:23'),
(2, '2009-02-06 08:59:11'),
(2, '2009-02-07 00:01:00'),
(1, '2009-01-06 22:03:35');
create temp table temp_visit(visitor_id int not null, sequence serial not null, visit_time timestamp not null);
insert into temp_visit(visitor_id, visit_time) select visitor_id, visit_time from visit order by visitor_id, visit_time;
select
reference.visitor_id, count(nullif(reference.visit_time - prev.visit_time < interval '1 hour',false))
from temp_visit reference
left join temp_visit prev
on prev.visitor_id = reference.visitor_id and prev.sequence = reference.sequence - 1
group by reference.visitor_id;
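Once window functions are available, the per-session counts the question asks for can be sketched in one statement: flag each visit that starts a new session, take a running sum of the flags to number the sessions, then group by that number. The sketch below is an assumption-laden illustration on SQLite through Python, not tested on PostgreSQL; it uses the question's visiting table and measures the gap with julianday(), which in PostgreSQL would be an interval comparison instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visiting (id INT, visitor_id INT, visit_time TEXT)")
conn.executemany(
    "INSERT INTO visiting VALUES (?, ?, ?)",
    [
        (1, 1, "2009-01-06 08:45:02"),
        (2, 1, "2009-01-06 08:58:11"),
        (3, 1, "2009-01-06 09:08:23"),
        (4, 1, "2009-01-06 21:55:23"),
        (5, 1, "2009-01-06 22:03:35"),
    ],
)

query = """
WITH flagged AS (
    -- is_new = 1 when a visit is the first one or follows a gap of >= 1 hour
    SELECT visitor_id, visit_time,
           CASE WHEN prev_time IS NULL
                  OR (julianday(visit_time) - julianday(prev_time)) * 24.0 >= 1.0
                THEN 1 ELSE 0 END AS is_new
    FROM (
        SELECT visitor_id, visit_time,
               lag(visit_time) OVER (PARTITION BY visitor_id
                                     ORDER BY visit_time) AS prev_time
        FROM visiting
    ) v
),
numbered AS (
    -- running sum of the flags gives each session its own number
    SELECT visitor_id,
           SUM(is_new) OVER (PARTITION BY visitor_id
                             ORDER BY visit_time) AS session_no
    FROM flagged
)
SELECT visitor_id, COUNT(*) AS visits
FROM numbered
GROUP BY visitor_id, session_no
ORDER BY visitor_id, session_no
"""
session_counts = conn.execute(query).fetchall()
print(session_counts)
```

This treats a session as a chain of visits where each one is less than an hour after the previous one, which is the inactivity-timeout reading discussed in the earlier answers.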
One or both of these may work? However, both will end up giving you more columns in the result than you are asking for.
SELECT visitor_id,
date_part('year', visit_time),
date_part('month', visit_time),
date_part('day', visit_time),
date_part('hour', visit_time),
COUNT(*)
FROM visiting
GROUP BY 1, 2, 3, 4, 5;
SELECT visitor_id,
EXTRACT(EPOCH FROM visit_time)-(EXTRACT(EPOCH FROM visit_time) % 3600),
COUNT(*)
FROM visiting
GROUP BY 1, 2;
This can't be done in a single SQL.
The better option is to handle it in stored procedure
If it were T-SQL, I would write something as:
SELECT visitor_id, COUNT(id),
DATEPART(yy, visit_time), DATEPART(m, visit_time),
DATEPART(d, visit_time), DATEPART(hh, visit_time)
FROM visiting
GROUP BY
visitor_id,
DATEPART(yy, visit_time), DATEPART(m, visit_time),
DATEPART(d, visit_time), DATEPART(hh, visit_time)
which gives me:
1 3 2009 1 6 8
1 2 2009 1 6 21
I do not know how, or if, you can write this in PostgreSQL though.