Assigning a field value to all uniques in a table - sql

I have an analytics table with the following fields:
unique_id,
revenue,
pagename
An analytics record is created for every page a user visits. The question I would like to answer is this: how much revenue is coming from users who have been to a maps screen (pagename=mapview) versus users who have not? The revenue is only recorded when the user hits a page with a transactional element. I'm not keeping track of whether the user has been to a maps view by the time they hit a page with transactional elements.
Do I need to create a separate table that tracks whether a particular user (unique_id) has been to a map screen and then join this with the original table? Or is there an easier way?

You can do this with aggregation -- two levels of aggregation:
select isMapView, sum(revenue), count(*) as numUsers
from (select unique_id, sum(revenue) as revenue,
max(case when pagename = 'mapview' then 1 else 0 end) as isMapView
from t
group by unique_id
) u
group by isMapView;
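To see the two levels of aggregation at work, here is a minimal sketch that runs the same query against an in-memory SQLite database. The sample rows and the use of Python's sqlite3 module are assumptions for illustration only; the table name t comes from the answer above.

```python
import sqlite3

# Hypothetical sample data: one row per page view, revenue only on checkout pages.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (unique_id TEXT, revenue REAL, pagename TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("u1", None, "home"),
        ("u1", None, "mapview"),
        ("u1", 10.0, "checkout"),
        ("u2", None, "home"),
        ("u2", 5.0, "checkout"),
        ("u3", None, "mapview"),
        ("u3", 7.0, "checkout"),
    ],
)

# Inner query: one row per user, flagged 1 if the user ever saw 'mapview'.
# Outer query: total revenue and user count per flag value.
rows = conn.execute(
    """
    SELECT isMapView, SUM(revenue) AS revenue, COUNT(*) AS numUsers
    FROM (SELECT unique_id,
                 SUM(revenue) AS revenue,
                 MAX(CASE WHEN pagename = 'mapview' THEN 1 ELSE 0 END) AS isMapView
          FROM t
          GROUP BY unique_id) u
    GROUP BY isMapView
    ORDER BY isMapView
    """
).fetchall()

for is_map_view, revenue, num_users in rows:
    print(is_map_view, revenue, num_users)
```

No separate tracking table is needed here because the flag is derived per user at query time.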

Related

Link data from sql from 2/3 columns

I don't know if the question title is clear, so here is my question:
I have a table, UsersMovements, which contains users along with their movements.
UsersMovements:
ID
UserID
MovementID
Comments
Time/Date
I need help with a query that tells me whether users 1, 2 and 3 share a common MovementID, given that I don't know the MovementID in advance.
The real case is that I want to see whether the X users I select have been in the same area (within a limited interval, since I have the date/time in the table).
Thank you
If you want to select the list of movements that include user IDs 1, 2 and 3, you can use GROUP BY with HAVING:
select movementid
from UsersMovements
where userid in(1,2,3)
group by movementid
having count(distinct userid)=3
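As a quick check of the GROUP BY / HAVING approach, the sketch below runs it on made-up rows in an in-memory SQLite database. The sample data and the wanted list are hypothetical; the count in HAVING is tied to the length of that list so the query still works when X users are selected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE UsersMovements ("
    "ID INTEGER PRIMARY KEY, UserID INTEGER, MovementID INTEGER)"
)
# Hypothetical rows: only movement 100 contains all of users 1, 2 and 3.
conn.executemany(
    "INSERT INTO UsersMovements (UserID, MovementID) VALUES (?, ?)",
    [(1, 100), (2, 100), (3, 100), (1, 100),
     (1, 200), (2, 200), (4, 200),
     (3, 300)],
)

wanted = [1, 2, 3]
placeholders = ", ".join("?" for _ in wanted)
query = f"""
    SELECT MovementID
    FROM UsersMovements
    WHERE UserID IN ({placeholders})
    GROUP BY MovementID
    HAVING COUNT(DISTINCT UserID) = ?
    ORDER BY MovementID
"""
# COUNT(DISTINCT ...) ignores repeat visits by the same user.
common = [row[0] for row in conn.execute(query, (*wanted, len(wanted)))]
print(common)
```

To restrict the check to a time interval, a date condition could be added to the WHERE clause.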

Creating mutually exclusive groupings in SQL (tables with pairs)

Looking for some query structuring help. I have a table with rows for link timestamp, user_id, linked_id, type_if_link. These link types are for example 'email' vs. 'phone number' so in the example below you can see user 1 is not directly connected to user 3 but is via user 2. The other complication is that each 'linked account' appears in r1 as well, meaning there are several 'duplicate' fields (in the example: row 1+2 , row 3+4)
ex:
Link time user id linked_id link type
---------------------------------------------------
link_occurred at user 1 user 2 link a
link_occurred at user 2 user 1 link a
link_occurred at user 2 user 3 link b
link_occurred at user 3 user 2 link b
link_occurred_at user 4 user 5 link a
link_occurred_at user 5 user 4 link a
What functions could I use to get the first user-id, a count of all the (directly+indirectly) linked accounts and possibly an array of the linked account ids.
For example the output I would want here is:
initial user - Count linked accounts array of linked accounts
--------------------------------------------------------------
user 1 2 linked [user 2, user 3]
user 4 1 linked account [user 5]
This would give me mutually exclusive grouping of all linked networks of accounts.
I didn't know about recursive CTEs until Erwin Brandstetter mentioned them in the comment above. The concept is what it sounds like: a CTE that refers to itself, and has a base case so that recursion terminates. For your problem, a recursive CTE solution might look something like:
WITH RECURSIVE accumulate_users AS (
-- Base case: the direct links from a user_id.
SELECT
user_id AS user_id,
ARRAY_AGG(linked_id) AS linked_accounts
FROM your_table
GROUP BY user_id
UNION ALL
-- Recursive case: transitively linked accounts.
SELECT
accumulate_users.user_id,
ARRAY_UNION(
accumulate_users.linked_accounts,
ARRAY_AGG(DISTINCT your_table.linked_id)
) AS linked_accounts
FROM accumulate_users
JOIN your_table ON CONTAINS(accumulate_users.linked_accounts, your_table.user_id)
GROUP BY accumulate_users.user_id
-- But there is no enforced termination condition, hopefully it just
-- ends at some point? This is part of why implementing recursive CTEs
-- is challenging, I think.
)
SELECT
user_id,
CARDINALITY(linked_accounts) AS count_linked_accounts,
linked_accounts
FROM accumulate_users
But, I haven't been able to test this query, because as detailed in another Stack Overflow Q&A Presto does not support recursive CTEs.
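For engines that do support recursive CTEs, here is a minimal sketch that runs on SQLite through Python's sqlite3 module. It is not the query above: it builds (root, member) pairs instead of arrays, and it uses UNION rather than UNION ALL so that already-seen pairs are discarded, which is what makes the recursion terminate. The table name links and the integer ids are assumptions, and the sketch relies on every link being stored in both directions, as in the question's example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (user_id INTEGER, linked_id INTEGER)")
# Links from the question's example, stored in both directions.
conn.executemany(
    "INSERT INTO links VALUES (?, ?)",
    [(1, 2), (2, 1), (2, 3), (3, 2), (4, 5), (5, 4)],
)

rows = conn.execute(
    """
    WITH RECURSIVE reachable(root, member) AS (
        -- Base case: direct links.
        SELECT user_id, linked_id FROM links
        UNION
        -- Recursive case: follow one more link from every reached member.
        SELECT r.root, l.linked_id
        FROM reachable AS r
        JOIN links AS l ON l.user_id = r.member
    )
    SELECT root, member
    FROM reachable
    WHERE member <> root
      -- Keep one representative per network: the smallest id in it.
      AND root IN (SELECT root FROM reachable
                   GROUP BY root HAVING MIN(member) = root)
    ORDER BY root, member
    """
).fetchall()

# Collect the linked accounts per representative user.
groups = {}
for root, member in rows:
    groups.setdefault(root, []).append(member)

for root, members in groups.items():
    print(root, len(members), members)
```

Picking the smallest id as the representative is just one way to get mutually exclusive groups; any deterministic choice per network would do.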
It is possible to traverse an arbitrary, but finite, number of links by repeatedly joining back to the table you have. Something like this; I've included the first_, second_ and third_degree_links columns only for clarity:
SELECT
yt1.user_id,
ARRAY_AGG(DISTINCT yt2.user_id) AS first_degree_links,
ARRAY_AGG(DISTINCT yt3.user_id) AS second_degree_links,
ARRAY_AGG(DISTINCT yt3.linked_id) AS third_degree_links,
ARRAY_UNION(
ARRAY_AGG(DISTINCT yt2.user_id),
ARRAY_UNION(ARRAY_AGG(DISTINCT yt3.user_id), ARRAY_AGG(DISTINCT yt3.linked_id))
) AS up_to_third_degree_links
FROM your_table AS yt1
JOIN your_table AS yt2 ON yt1.linked_id = yt2.user_id
JOIN your_table AS yt3 ON yt2.linked_id = yt3.user_id
GROUP BY yt1.user_id
I've been working with a similar set of data, although I have the original identifiers as part of the raw data set. In other words the 'email' and 'phone number' in your example. I found it helpful to create a table that groups user ids by these connecting identifiers:
CREATE TABLE email_connections AS
SELECT
email,
ARRAY_AGG(DISTINCT user_id) AS users
FROM source_table
GROUP BY email
The same arbitrary-but-finite-depth set of links can then be computed by looking for intersections between the user arrays:
SELECT
3764350 AS user_id,
FLATTEN(ARRAY_AGG(ARRAY_UNION(emails1.users, ARRAY_UNION(emails2.users, ARRAY_UNION(emails3.users, emails4.users))))) AS all_users,
CARDINALITY(FLATTEN(ARRAY_AGG(ARRAY_UNION(emails1.users, ARRAY_UNION(emails2.users, ARRAY_UNION(emails3.users, emails4.users)))))) AS count_all_users
FROM email_connections AS emails1
JOIN email_connections AS emails2 ON CARDINALITY(ARRAY_INTERSECT(emails1.users, emails2.users)) > 0
JOIN email_connections AS emails3 ON CARDINALITY(ARRAY_INTERSECT(emails2.users, emails3.users)) > 0
JOIN email_connections AS emails4 ON CARDINALITY(ARRAY_INTERSECT(emails3.users, emails4.users)) > 0
WHERE CONTAINS(emails1.users, 3764350)
GROUP BY 1
Calculating links to an arbitrary depth is a good use case for a graph database technology like Neo4j or JanusGraph. That's what I'm now looking at to address this "user linking" problem.

PostgreSQL - Best approach for summarize data

We have data as follows in system
User data
Experience
Education
Job Application
This data will be used across the application, and some logic is attached to it.
To make sure the data is consistent across the application, I thought of creating a view that returns the counts of these records and then using that view in different places.
Now the question is: since the detail tables have no relation to each other, how should I create the view?
Create a different view for each table and then use GROUP BY
Create one view and write subqueries to get these data
From a performance perspective, which is the best approach?
For e.g.
SELECT
UserId,
COUNT(*) AS ExperienceCount,
0 AS EducationCount
FROM User
INNER JOIN Experience ON user_id = User_Id
GROUP BY
UserId
UNION ALL
SELECT
UserId,
0,
COUNT(*)
FROM User
INNER JOIN Education ON user_id = user_id
GROUP BY
UserId
And then group by this to get summary of all these data in one row per user.
One way to write the query that you have specified would probably be:
SELECT UserId, SUM(ExperienceCount), SUM(EducationCount)
FROM ((SELECT UserId, COUNT(*) as ExperienceCount, 0 AS EducationCount
FROM Experience
GROUP BY UserId
) UNION ALL
(SELECT UserId, 0, COUNT(*)
FROM Education
GROUP BY UserId
)
) u
GROUP BY UserId;
This can also be written as a FULL JOIN, LEFT JOIN, and using correlated subqueries. Each of these can be appropriate in different circumstances, depending on your data.
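As an illustration of the LEFT JOIN variant, here is a minimal sketch on an in-memory SQLite database. The table name Users, the Title and Degree columns, and the sample rows are assumptions; each detail table is aggregated on its own before joining, so the counts do not multiply each other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Users (UserId INTEGER PRIMARY KEY);
    CREATE TABLE Experience (UserId INTEGER, Title TEXT);
    CREATE TABLE Education (UserId INTEGER, Degree TEXT);
    INSERT INTO Users VALUES (1), (2), (3);
    INSERT INTO Experience VALUES (1, 'dev'), (1, 'lead'), (2, 'qa');
    INSERT INTO Education VALUES (1, 'BSc'), (3, 'BA'), (3, 'MA');
    """
)

rows = conn.execute(
    """
    SELECT u.UserId,
           COALESCE(ex.ExperienceCount, 0) AS ExperienceCount,
           COALESCE(ed.EducationCount, 0) AS EducationCount
    FROM Users AS u
    LEFT JOIN (SELECT UserId, COUNT(*) AS ExperienceCount
               FROM Experience GROUP BY UserId) AS ex
           ON ex.UserId = u.UserId
    LEFT JOIN (SELECT UserId, COUNT(*) AS EducationCount
               FROM Education GROUP BY UserId) AS ed
           ON ed.UserId = u.UserId
    ORDER BY u.UserId
    """
).fetchall()

for user_id, experience_count, education_count in rows:
    print(user_id, experience_count, education_count)
```

Unlike the UNION ALL form, this version also returns users who have no rows in either detail table, with both counts at 0.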

Sql query - selecting top 5 rows and further selecting rows only if User is present

I'm kind of stuck on how to implement this query. It is pretty similar to the query I posted earlier, but I'm not able to crack it.
I have a shopping table where a record is inserted every time a user buys anything.
Some of the fields are
* shopping_id (primary key)
* store_id
* user_id
Now what I need is to pull only the list of those stores where he's among the top 5 visitors:
When I break it down - this is what I want to accomplish:
* Find all stores where this UserA has visited
* For each of these stores - see who the top 5 visitors are.
* Select the store only if UserA is among the top 5 visitors.
The corresponding queries would be:
select store_id from shopping where user_id = xxx
select user_id,count(*) as 'visits' from shopping
where store_id in (select store_id from shopping where user_id = xxx)
group by user_id
order by visits desc
limit 5
Now I need to check in this resultset if UserA is present and select that store only if he's present.
For example if he has visited a store 5 times - but if there are 5 or more people who have visited that store more than 5 times - then that store should not be selected.
So I'm kind of lost here.
Thanks for your help
This should do it. It uses an intermediate VIEW to figure out how many times each user has shopped at each store. Also, it assumes you have a stores table somewhere with each store_id listed once. If that's not true, you can change SELECT store_id FROM stores to SELECT DISTINCT store_id FROM shopping for the same effect but slower results.
CREATE VIEW shop_results (store_id, user_id, purchase_count) AS
SELECT store_id, user_id, COUNT(*)
FROM shopping GROUP BY store_id, user_id;
SELECT store_id FROM stores
WHERE 'UserA' IN
(SELECT user_id FROM shop_results
WHERE shop_results.store_id = stores.store_id
ORDER BY purchase_count DESC LIMIT 5)
You can combine these into a single query by placing the SELECT from the VIEW inside the sub-query, but I think it's easier to read this way and it may well be true that you want that aggregated information elsewhere in the system — more consistent to define it once in a view than repeat it in multiple queries.
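Here is a minimal sketch of the view-plus-subquery approach, run on SQLite through Python's sqlite3 module. The helper name stores_where_top, the top_n parameter and the sample visits are hypothetical, and the demo uses a cutoff of 2 instead of 5 to keep the data small. A user_id tiebreaker is added to the ORDER BY so equal counts are ranked deterministically. Not every database accepts LIMIT inside an IN subquery, so treat this as a sketch for engines that do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shopping ("
    "shopping_id INTEGER PRIMARY KEY, store_id INTEGER, user_id TEXT)"
)
# Hypothetical visit counts per (store, user).
visits = {(10, "A"): 3, (10, "B"): 2, (10, "C"): 1,
          (20, "A"): 1, (20, "B"): 3, (20, "C"): 2}
conn.executemany(
    "INSERT INTO shopping (store_id, user_id) VALUES (?, ?)",
    [(store, user) for (store, user), n in visits.items() for _ in range(n)],
)
conn.execute(
    """
    CREATE VIEW shop_results AS
    SELECT store_id, user_id, COUNT(*) AS purchase_count
    FROM shopping GROUP BY store_id, user_id
    """
)


def stores_where_top(conn, user, top_n):
    """Return the stores where `user` is among the top_n visitors."""
    query = """
        SELECT s.store_id
        FROM (SELECT DISTINCT store_id FROM shopping) AS s
        WHERE ? IN (SELECT user_id FROM shop_results
                    WHERE shop_results.store_id = s.store_id
                    ORDER BY purchase_count DESC, user_id
                    LIMIT ?)
        ORDER BY s.store_id
    """
    return [row[0] for row in conn.execute(query, (user, top_n))]


print(stores_where_top(conn, "A", 2))
```

In this sample, user A is the top visitor of store 10 but only the third most frequent visitor of store 20, so only store 10 qualifies with a cutoff of 2.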

Compute Users average weight

I have two tables, User and DoctorsVisit
User
- UserID
- Name
DoctorsVisit
- UserID
- Weight
- Date
The DoctorsVisit table contains all the visits a particular user made to the doctor.
The user's weight is recorded per visit.
Query: Sum up all the Users weight, using the last doctor's visit's numbers. (then divide by number of users to get the average weight)
Note: some users may have not visited the doctor at all, while others may have visited many times.
I need the average weight of all users, but using the latest weight.
Update
I want the average weight across all users.
If I understand your question correctly, you should be able to get the average weight of all users based on their last visit from the following SQL statement. We use a subquery to get the last visit as a filter.
SELECT AVG(latest.weight) FROM (SELECT uv.weight FROM uservisit uv INNER JOIN
(SELECT userid, MAX(dateVisited) DateVisited FROM uservisit GROUP BY userid) us
ON us.UserID = uv.UserId and us.DateVisited = uv.DateVisited) latest
I should point out that this does assume that there is a unique UserID that can be used to determine uniqueness. Also, if the DateVisited doesn't include a time but just a date, one patient who visits twice on the same day could skew the data.
This should get you the average weight per user if they have visited:
select user.name, temp.AvgWeight
from user left outer join (select userid, avg(weight) as AvgWeight
from doctorsvisit
group by userid) temp
on user.userid = temp.userid
Write a query to select the most recent weight for each user (QueryA), and use that query as an inner select of a query to select the average (QueryB), e.g.,
SELECT AVG(weight) FROM (QueryA)
I think there's a mistake in your specs.
If you divide by all the users, your average will be too low. Each user that has no doctor visits will tend to drag the average towards zero. I don't believe that's what you want.
I'm too lazy to come up with an actual query, but it's going to be one of these things where you use a self join between the base table and a query with a group by that pulls out all the relevant Id, Visit Date pairs from the base table. The only thing you need the User table for is the Name.
We had a sample of the same problem in here a couple of weeks ago, I think. By the "same problem", I mean the problem where we want an attribute of the representative of a group, but where the attribute we want isn't included in the group by clause.
I think this will work, though I could be wrong:
Use an inner select to make sure you have the most recent visit, then use AVG. Your User table in this example is superfluous: since you have no weight data there and you don't care about user names, it doesn't do you any good to examine it.
SELECT AVG(dv.Weight)
FROM DoctorsVisit dv
WHERE dv.Date = (
SELECT MAX(Date)
FROM DoctorsVisit innerdv
WHERE innerdv.UserID = dv.UserID
)
If you're using SQL Server 2005 you don't need the sub query on the GROUP BY.
You can use the new ROW_NUMBER and PARTITION BY functionality.
SELECT AVG(a.weight) FROM
(select
ROW_NUMBER() OVER(PARTITION BY dv.UserId ORDER BY Date desc) as ID,
dv.weight
from
DoctorsVisit dv) a
WHERE a.Id = 1
As someone else has mentioned though, this is the average weight across all the users who have VISITED the doctor. If you want the average weight across ALL of the users then anyone not visiting the doctor will give a misleading average.
Here's my stab at the solution:
select
avg(a.Weight) as AverageWeight
from
DoctorsVisit as a
inner join
(select
UserID,
max (Date) as LatestDate
from
DoctorsVisit
group by
UserID) as b
on a.UserID = b.UserID and a.Date = b.LatestDate;
Note that the User table isn't used at all.
This average entirely omits users who have no doctor's visits at all, or whose weight is recorded as NULL in their latest doctor's visit. The average is skewed if any user has more than one visit on the same date and that date is the user's latest, because the user is then weighed more than once in the result.
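To make the difference between the two averages concrete, here is a minimal sketch on an in-memory SQLite database. The table name Users (in place of User), the names and the weights are made up. It computes the average over each visitor's latest weight, and then the same sum divided by all users, which shows how a user without visits drags the figure down.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Users (UserID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE DoctorsVisit (UserID INTEGER, Weight REAL, Date TEXT);
    INSERT INTO Users VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    -- User 3 never visited the doctor.
    INSERT INTO DoctorsVisit VALUES
        (1, 80.0, '2020-01-01'),
        (1, 70.0, '2020-06-01'),
        (2, 90.0, '2020-03-01');
    """
)

# One row per user who has visited: the weight recorded on the latest date.
latest_sql = """
    SELECT a.UserID, a.Weight
    FROM DoctorsVisit AS a
    INNER JOIN (SELECT UserID, MAX(Date) AS LatestDate
                FROM DoctorsVisit GROUP BY UserID) AS b
      ON a.UserID = b.UserID AND a.Date = b.LatestDate
"""

# Average over users who have at least one visit.
avg_visited = conn.execute(
    f"SELECT AVG(Weight) FROM ({latest_sql})"
).fetchone()[0]

# Same sum divided by every user, including those without visits.
avg_all_users = conn.execute(
    f"SELECT SUM(Weight) / (SELECT COUNT(*) FROM Users) FROM ({latest_sql})"
).fetchone()[0]

print(avg_visited, avg_all_users)
```

The dates are stored as ISO-8601 text here, so MAX(Date) picks the most recent one by plain string comparison.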