SQL Time Series Homework

Imagine you have these two tables.
a) streamers: it contains time series data, at a 1-min granularity, of all the channels that broadcast on
Twitch. The columns of the table are:
username: Channel username
timestamp: Epoch timestamp, in seconds, corresponding to the moment the data was captured
game: Name of the game that the user was playing at that time
viewers: Number of concurrent viewers that the user had at that time
followers: Number of total followers that the channel had at that time
b) games_metadata: it contains information of all the games that have ever been broadcasted on Twitch.
The columns of the table are:
game: Name of the game
release_date: Timestamp, in seconds, corresponding to the date when the game was released
publisher: Publisher of the game
genre: Genre of the game
Now I want the Top 10 publishers that have been watched the most during the first quarter of 2019. The output should contain publisher and hours_watched.
The problem is I don't have a database, so I created one and entered some values by hand.
I came up with this query, but I'm not sure it does what I want. It may be right (I don't feel like it is), but I'd like a second opinion:
SELECT publisher,
       (CAST(strftime('%m', "timestamp") AS INTEGER) + 2) / 3 AS quarter,
       COUNT((strftime('%M', "timestamp") / (60 * 1.0)) * viewers) AS total_hours_watch
FROM streamers AS A
INNER JOIN games_metadata AS B ON A.game = B.game
WHERE quarter = 3
GROUP BY publisher, quarter
ORDER BY total_hours_watch DESC

Looks about right to me. You don't need to include quarter in the GROUP BY, since the WHERE clause limits you to a single quarter. You can modify the query to return only the top 10 publishers in a couple of ways, depending on the SQL engine you're using.
For SQL Server / MS Access modify your select statement: SELECT TOP 10 publisher, ...
For MySQL add a limit clause at the end of your query: ... LIMIT 10;
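For what it's worth, a few details in the query are easy to trip over in SQLite: `timestamp` holds epoch seconds, so `strftime('%m', timestamp)` needs a `'unixepoch'` modifier (or you can filter directly on the epoch range); `quarter = 3` would select July-September rather than Q1; and hours watched is a SUM, not a COUNT, since each 1-minute sample contributes viewers/60 viewer-hours. A minimal sketch against an in-memory database with invented sample rows:

```python
import sqlite3

# In-memory database with a few invented rows; epoch values are chosen so
# that only the 2019 samples fall inside Q1 2019.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE streamers (username TEXT, timestamp INTEGER, game TEXT, viewers INTEGER, followers INTEGER)")
cur.execute("CREATE TABLE games_metadata (game TEXT, release_date INTEGER, publisher TEXT, genre TEXT)")
cur.executemany("INSERT INTO streamers VALUES (?,?,?,?,?)", [
    ("alice", 1546300800, "GameA", 120, 1000),  # 2019-01-01 -> 120/60 = 2 viewer-hours
    ("alice", 1546300860, "GameA",  60, 1000),  # one minute later -> 1 viewer-hour
    ("bob",   1546300800, "GameB", 600, 5000),  # -> 10 viewer-hours
    ("bob",   1262304000, "GameB", 999, 5000),  # 2010-01-01, outside Q1 2019
])
cur.executemany("INSERT INTO games_metadata VALUES (?,?,?,?)", [
    ("GameA", 1500000000, "PubX", "FPS"),
    ("GameB", 1400000000, "PubY", "MOBA"),
])

# Each 1-minute sample adds viewers/60 viewer-hours, so hours_watched is a
# SUM over the quarter, filtered directly on the epoch range for Q1 2019.
rows = cur.execute("""
    SELECT b.publisher,
           SUM(a.viewers) / 60.0 AS hours_watched
    FROM streamers AS a
    INNER JOIN games_metadata AS b ON a.game = b.game
    WHERE a.timestamp >= CAST(strftime('%s', '2019-01-01') AS INTEGER)
      AND a.timestamp <  CAST(strftime('%s', '2019-04-01') AS INTEGER)
    GROUP BY b.publisher
    ORDER BY hours_watched DESC
    LIMIT 10
""").fetchall()
print(rows)  # [('PubY', 10.0), ('PubX', 3.0)]
```

The publisher names and viewer counts are made up purely to exercise the arithmetic; the point is the SUM/epoch-range shape of the query.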


SQL query to append a column which counts the # instance of a particular ID

I run analytics for a call center that more or less cold-calls leads our customers give us from their CRM to sell/upsell software. I'm trying to figure out a way to calculate pick-up rates based on the 1st call, 2nd call, 3rd call, etc. All data associated with our call center is logged in a table in PostgreSQL; each contact has a unique contact_id and each call has a unique ctc_id.

My thought is to order the table by created_at/ctc_id ASC, then append a calculated column that counts the n-th occurrence of each contact_id in a dial session/campaign (let's call it nTH_Call). With that, I could group by nTH_Call and, based on a pickup vs. voicemail (or any other negative call disposition), calculate a pick-up rate for the n-th attempt to call each contact. An example table below hopefully illustrates my question:
ctc_id   call_time                    contact_id   nTH_Call
347723   2021-04-06 14:14:13.287698   163836       1
354172   2021-04-12 09:29:02.564833   81153        1
354174   2021-04-12 09:29:09.759067   81153        2
367971   2021-04-21 11:57:26.050931   81153        3
369961   2021-04-22 11:45:40.009285   163836       2
374106   2021-04-26 13:41:10.687522   163836       3
In a way, I'm also trying to answer - "It takes about X.X dials (on average) to get a customer to pickup on a cold call..." etc. as well as "by the 2nd call, our average pick up rates are X.X% and by the 3rd call through, our average pick up rates are X.X%..." etc.
If it were in Excel and I had the data in a table, I'd put =COUNTIF($C$2:C2,C2) in cell D2, drag the formula down, then pivot the data however I'd want. But I'm trying to illustrate the rates in real time on a live dashboard programmatically.
Any ideas or radically different methods I can try to answer my questions? What would I need to do in my SQL statement to make this happen?
Any help would be appreciated.
Sample Query:
SELECT
ctc_id,
call_time,
contact_id,
[contact_id OCCURRENCE order count column here!!]
FROM
call_table
ORDER BY
call_time ASC
You can calculate the nth_call order by contact in this way:
SELECT
ctc_id,
call_time,
contact_id,
ROW_NUMBER() OVER(partition by contact_id order by call_time) as nth_call
FROM
call_table
ORDER BY
call_time ASC
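The same ROW_NUMBER() syntax works outside PostgreSQL too; SQLite (3.25+) supports identical window functions, so here is the query run from Python against the question's sample rows as a quick sanity check:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE call_table (ctc_id INTEGER, call_time TEXT, contact_id INTEGER)")
cur.executemany("INSERT INTO call_table VALUES (?,?,?)", [
    (347723, "2021-04-06 14:14:13.287698", 163836),
    (354172, "2021-04-12 09:29:02.564833", 81153),
    (354174, "2021-04-12 09:29:09.759067", 81153),
    (367971, "2021-04-21 11:57:26.050931", 81153),
    (369961, "2021-04-22 11:45:40.009285", 163836),
    (374106, "2021-04-26 13:41:10.687522", 163836),
])

# ROW_NUMBER() restarts at 1 for each contact_id, ordered by call_time --
# exactly the nTH_Call column from the example table.
rows = cur.execute("""
    SELECT ctc_id, call_time, contact_id,
           ROW_NUMBER() OVER (PARTITION BY contact_id ORDER BY call_time) AS nth_call
    FROM call_table
    ORDER BY call_time ASC
""").fetchall()
print([r[3] for r in rows])  # [1, 1, 2, 3, 2, 3]
```

The nth_call values match the hand-filled column in the question, so grouping by nth_call for the pick-up-rate aggregation can build directly on this.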

Query complex in Oracle SQL

I have the following tables and their fields
They ask me for a query that seems quite complex to me. I have been going around in circles for two days trying things. The assignment says:
It is desired to obtain the average age of female athletes, medal winners (gold, silver or bronze), for the different modalities of 'Artistic Gymnastics'. Analyze the possible contents of the result field in order to return only the expected values, even when there is no data of any specific value for the set of records displayed by the query. Specifically, we want to show the gender indicator of the athletes, the medal obtained, and the average age of these athletes. The age will be calculated by subtracting from the system date (SYSDATE), the date of birth of the athlete, dividing said value by 365. In order to avoid showing decimals, truncate (TRUNC) the result of the calculation of age. Order the results by the average age of the athletes.
Well right now I have this:
select person.gender,score.score
from person,athlete,score,competition,sport
where person.idperson = athlete.idathlete and
athlete.idathlete= score.idathlete and
competition.idsport = sport.idsport and
person.gender='F' and competition.idsport=18 and score.score in
('Gold','Silver','Bronze')
group by
person.gender,
score.score;
And this is what I got. By adding the person.birthdate field, instead of getting 18 records for the 18 people who have a medal, I get many more records.
Apart from that, I still have to compute the average age with SYSDATE and TRUNC, which I've tried in many ways without success.
I see it as very complicated, or maybe I'm a bit saturated from going around in circles so much; I need some help.
Reading the task you got, it seems that you're quite close to the solution. Have a look at the following query and its explanation, note the differences from your query, see if it helps.
select p.gender,
((sysdate - p.birthdate) / 365) age,
s.score
from person p join athlete a on a.idathlete = p.idperson
left join score s on s.idathlete = a.idathlete
left join competition c on c.idcompetition = s.idcompetition
where p.gender = 'F'
and s.score in ('Gold', 'Silver', 'Bronze')
and c.idsport = 18
order by age;
When two dates are subtracted, the result is a number of days. Dividing it by 365 roughly gives the number of years (for simplicity; not all years have 365 days - hint: leap years). The result is usually a decimal number, e.g. 23.912874918724. You were told to remove the decimals, so use TRUNC and get 23 as the result.
Although the data model contains 5 tables, you don't have to use all of them in a query. Maybe the best approach is to go step by step. The first one would be to simply select all female athletes and calculate their age:
select p.gender,
((sysdate - p.birthdate) / 365) age
from person p
where p.gender = 'F'
Note that I've used a table alias; I'd suggest you use them too, as they make queries easier to read (tables can have really long names, which doesn't help readability). Also, always use table aliases to avoid confusion about which column belongs to which table.
Once you're satisfied with that result, move on to the next table: athlete. It is there just as a joining mechanism with the score table, which contains ... well, scores. Note that I've used an outer join for the score table, because not all athletes have won a medal. I presume that this is what the task you've been given means by:
... even when there is no data of any specific value for the set of records displayed by the query.
It is suggested that we, as developers, use explicit table joins, which let you see all joins separated from the filters (which should be part of the WHERE clause). So:
NO : from person p, athlete a
where a.idathlete = p.idperson
and p.gender = 'F'
YES: from person p join athlete a on a.idathlete = p.idperson
where p.gender = 'F'
Then move to yet another table, and so forth.
Test frequently, all the time - don't skip steps. Move on to another one only when you're sure that the previous step's result is correct, as - in most cases - it won't automagically fix itself.

Microsoft Access query to retrieve random transactions with seemingly impossible criteria

I was asked to assist with developing a report to retrieve a 25% sample of random transactions within a specific date range. I am not a programmer but I was able to devise the following fairly quickly:
SELECT TOP 25 PERCENT account.CID, account.ACCT, account.NAME, log.DATE, log.action_txt, log.field_nm, log.from_data, log.to_data, log.tran_id, log.init
FROM account INNER JOIN log ON account.ACCT = log.ACCT
GROUP BY account.CID, account.ACCT, account.NAME, log.DATE, log.action_txt, log.field_nm, log.from_data, log.to_data, log.tran_id, log.init
HAVING (((log.DATE) Between #2/7/2018# And #6/15/2018#) AND ((log.action_txt)="mod" Or (log.action_txt)="del") AND ((log.init)="J1X"))
ORDER BY log.tran_dt
This returns 25% of the records within the date range. Each record row is unique but each account number potentially has multiple records on each day. In some cases the records have the same date and tran_id as well.
Upon further discussion with the requester, he actually wants to see all of the transactions for 25% of the accounts that have activity on each day within the date range. Thus if there were 100 accounts on 3/1/2018 with records in this table, he wants to see all of the transactions for 25 of those accounts; if there were 60 accounts on 3/2/2018 with records in this table, he wants to see all of the transactions for 15 of those accounts; and so on.
I was thinking that an Access module would work best in this scenario as I believe there are multiple parts to this. I figured that I need a function to loop through the date range and for each day:
1. Count the account numbers only one time
2. Return all of the transactions for 25% of the total accounts
But as I mentioned, I am not a programmer and I am exhausted from searching possible solutions for the many parts.
I think the key to your question is that you only really need a pseudo random selection of results for your report. So you can force the Random number generator to reorder your results based on a value in the record and the current time.
Something like this should work. I assume your action_txt field is a text field; I pull out the length of each value and apply it to the current date/time to create a pseudo-random number that can be sorted.
All I really do is change your ORDER BY line
See if this works for you
SELECT TOP 25 PERCENT
account.CID, account.ACCT, account.NAME, log.DATE, log.action_txt, log.field_nm, log.from_data,
log.to_data, log.tran_id, log.init
FROM account
INNER JOIN log ON account.ACCT = log.ACCT
GROUP BY account.CID, account.ACCT, account.NAME, log.DATE, log.action_txt, log.field_nm, log.from_data, log.to_data, log.tran_id, log.init
HAVING (((log.DATE) Between #2/7/2018# And #6/15/2018#) AND ((log.action_txt)="mod" Or (log.action_txt)="del") AND ((log.init)="J1X"))
ORDER BY Rnd(CLng(Now()*Len(log.action_txt))-(Now()*Len(log.action_txt)));
Modified similar idea from other StackOverflow question and response
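Access itself has no window functions, so the revised requirement (all transactions for 25% of each day's accounts) is hard to express there directly. If the data can be pulled into an engine that does support them, one possible shape is below, sketched in SQLite via Python with an invented table layout (ACCT/DATE/tran_id) standing in for the real log:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE log (ACCT INTEGER, DATE TEXT, tran_id INTEGER)")
# Invented data: 8 accounts active on one day, two transactions each.
cur.executemany("INSERT INTO log VALUES (?,?,?)",
                [(acct, "2018-03-01", acct * 10 + i)
                 for acct in range(1, 9) for i in (0, 1)])

sample = cur.execute("""
    WITH day_accounts AS (      -- one row per (day, account) with a random sort key
        SELECT DATE, ACCT, random() AS r
        FROM log
        GROUP BY DATE, ACCT
    ),
    ranked AS (                 -- shuffle the accounts within each day
        SELECT DATE, ACCT,
               ROW_NUMBER() OVER (PARTITION BY DATE ORDER BY r) AS rn,
               COUNT(*)     OVER (PARTITION BY DATE)            AS n
        FROM day_accounts
    ),
    sampled AS (                -- keep the first 25% of each day's accounts
        SELECT DATE, ACCT FROM ranked WHERE rn <= n / 4
    )
    SELECT l.ACCT, l.DATE, l.tran_id
    FROM log l
    JOIN sampled s ON s.DATE = l.DATE AND s.ACCT = l.ACCT
""").fetchall()
# 2 of the 8 accounts are kept, along with all of their transactions;
# which two accounts changes on every run.
```

Only the counts are deterministic here (25% of each day's accounts, all their rows); the specific accounts are random by design, which matches the audit-sampling intent of the original request.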

SQL: get records based on user likes of other records

I'm trying to write an SQL query (SQL Server) that will provide some results based on what other users like.
It is a bit like on Amazon when it says 'Users who bought this also bought...'
It is based on the vote field, where a vote of '1' means a user liked a record; or a vote of '0' means they disliked it.
So when a user is on a particular record, I want to list 3 other records that users who liked the current record also liked.
snippet of relevant table provided below:
ID   UserID   Record ID     Vote   DateAdded
16   9999     12013011290   1      2008-11-11 13:23:44.000
17   8888     12013011290   0      2008-11-11 13:23:44.000
18   7777     12013011290   0      2008-11-11 13:23:44.000
20   4930     12013011290   1      2013-11-19 15:04:06.263
I think this requires ordering by a sub-select, but I'm not sure. Can anyone advise me on whether this is possible and, if so, how? Thanks.
p.s.
To maintain the quality of the results I think it would be extra useful to filter by DateAdded. That is,
- 'user x' is seeing recommended records about 'record z'
- 'user y' is someone who has liked 'record z' and 'record a'
- only count 'user y's' like of 'record a' IF they liked 'record a' an HOUR before or after they liked 'record z'
- in other words, only count the 'record a's' like if it was during the same website-browsing session as 'record z'
Hope this makes sense!
something like this?
select r.description
from record r
join (
select top 3 v.recordid from votes v
where v.vote = 1 and recordid != 123456789
and userid in
(
select userid from votes where recordid = 123456789 and vote =1
)
order by dateadded desc
) as x on x.recordid = r.id
A method I used for the basic version of this problem is indeed multiple selects: figure out which users liked a specific item, then query further on what else they liked.
with Likers as
(select user_id from likes where content_id = 10)
select count(user_id) as like_count, content_id
from likes
natural join likers
where content_id <> 10
group by content_id
order by like_count desc;
(Tested using Sqlite3)
What you will receive is a list of items that were liked by users who liked item 10, ordered by the number of likes (within the search domain). I would probably want to limit this as well, since on a larger dataset it's likely to return a large number of stray items with only one or two similar likes that are in turn buried under items with hundreds of likes.
I suspect the reason you are checking timestamps in the first place is so that if somebody likes laundry detergent, then comes back two days later to like a movie, the system would not associate "people who like Epic Shootout 17 also like Clean More."
I would not recommend using date arithmetic for this. I might suggest creating another table to represent individual "sessions" and using the session_id for this task. Since there are (hopefully!) many, many like records on your database, you want to reduce the amount of work you are making it do. You can also use this session_id for logging any other actions a person did (for analytics purposes.) It is also computationally cheaper to ask for all things that happened within a session with a simple index and identity comparison than to perform date computations on potentially millions of records.
For reference, Piwik defines a new session as thirty minutes since the last action taken.
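The CTE above is easy to sanity-check in SQLite from Python with a handful of invented likes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE likes (user_id INTEGER, content_id INTEGER)")
cur.executemany("INSERT INTO likes VALUES (?,?)", [
    (1, 10), (2, 10),          # users 1 and 2 liked item 10
    (1, 20), (2, 20),          # both of them also liked item 20
    (1, 30), (3, 30),          # item 30: one liker of 10, plus an unrelated user
    (3, 40),                   # item 40: liked only by a user who never liked 10
])

rows = cur.execute("""
    WITH likers AS (SELECT user_id FROM likes WHERE content_id = 10)
    SELECT COUNT(user_id) AS like_count, content_id
    FROM likes
    NATURAL JOIN likers
    WHERE content_id <> 10
    GROUP BY content_id
    ORDER BY like_count DESC
""").fetchall()
print(rows)  # [(2, 20), (1, 30)] -- item 40 drops out entirely
```

The NATURAL JOIN matches on the shared user_id column, so likes from users outside the likers set (user 3's like of item 40) never enter the counts.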

SQL Query Math Gymnastics

I have two tables of concern here: users and race_weeks. User has many race_weeks, and race_week belongs to User. Therefore, user_id is a fk in the race_weeks table.
I need to perform some challenging math on fields in the race_weeks table in order to return users with the most all-time points.
Here are the fields that we need to manipulate in the race_weeks table.
races_won (int)
races_lost (int)
races_tied (int)
points_won (int, pos or neg)
recordable_type(varchar, Robots can race, but we're only concerned about type 'User')
Just so that you fully understand the business logic at work here, over the course of a week a user can participate in many races. The race_week record represents the summary results of the user's races for that week. A user is considered active for the week if races_won, races_lost, or races_tied is greater than 0. Otherwise the user is inactive.
So here's what we need to do in our query in order to return users with the most points won (actually net_points_won):
Calculate each user's net_points_won (not a field in the DB).
To calculate net_points_won, you take (1000 * count_of_active_weeks) - sum(points_won). (Why 1000? Just imagine that every week the user is spotted 1000 points to compete and enter races. We want to factor out what we spot the user, because the user could enter only one race for the week for 100 points and be sitting on 900, which would skew who actually EARNED the most points.)
This one is a little convoluted, so let me know if I can clarify further.
I believe that your business logic is incorrect: net_points should be the sum of points won for that user minus the number of points the user was spotted.
In addition, the check for active weeks should test races_won, races_lost, and races_tied against zero explicitly to give the system the opportunity to use indexes on those columns when the table becomes large.
SELECT user_id
, SUM(points_won) - 1000 * COUNT(*) AS net_points
FROM race_weeks
WHERE recordable_type = 'User'
AND (races_won > 0 OR races_lost > 0 OR races_tied > 0)
GROUP BY user_id
ORDER BY net_points DESC
SELECT user_id, 1000 * COUNT(*) - SUM(points_won) AS net_points
FROM race_weeks
WHERE races_won + races_lost + races_tied > 0
AND recordable_type = 'User'
GROUP BY user_id
ORDER BY net_points DESC
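Both variants can be sanity-checked in SQLite from Python. The rows below are invented, and the query run here is the first answer's version (points earned minus the 1000 spotted per active week):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE race_weeks (
    user_id INTEGER, races_won INTEGER, races_lost INTEGER,
    races_tied INTEGER, points_won INTEGER, recordable_type TEXT)""")
cur.executemany("INSERT INTO race_weeks VALUES (?,?,?,?,?,?)", [
    (1, 3, 1, 0, 1100, 'User'),   # active week: 100 net over the 1000 spotted
    (1, 0, 2, 0, 1200, 'User'),   # active week: 200 net
    (1, 0, 0, 0,    0, 'User'),   # inactive week: excluded by the races filter
    (2, 1, 0, 0, 1050, 'User'),   # active week: 50 net
    (3, 5, 0, 0, 9999, 'Robot'),  # robot: excluded by recordable_type
])

# First answer's version: SUM of points won minus 1000 per active week.
rows = cur.execute("""
    SELECT user_id,
           SUM(points_won) - 1000 * COUNT(*) AS net_points
    FROM race_weeks
    WHERE recordable_type = 'User'
      AND (races_won > 0 OR races_lost > 0 OR races_tied > 0)
    GROUP BY user_id
    ORDER BY net_points DESC
""").fetchall()
print(rows)  # [(1, 300), (2, 50)]
```

Note that the inactive week contributes neither points nor a 1000-point deduction, which is exactly why the active-week filter has to sit in the WHERE clause rather than be applied after grouping.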