Data Model for analytical CRM as a service - sql

We're developing an analytical CRM as a service, and I have a question about the data model.
A CRM user might upload a batch (about 1 million rows a week) of their clients/customers.
We already have 200 million rows in this database.
CRM users want us to provide a feature for estimating the number of people in a segment defined by various constraints (sex, age, location, etc.; ~50 mandatory filters), as in any advertising platform (Facebook Ads, Google AdWords). Of course, they expect to see the count in real time.
For simplicity, let's imagine you want to get a count of male smokers aged 18 or 22, so you apply the filters successively:
sex (select count (id) where sex = m)
age (select count (id) where sex = m and age in (18,22))
smokers (select count (id) where sex = m and age in (18,22) and smoker = 1)
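The filters above are written as shorthand (no FROM clause, unquoted values). A minimal runnable sketch of the same three counts, using Python's built-in sqlite3 module; the table name customers, its columns, and the sample rows are assumptions made up for illustration, not part of the actual system:

```python
import sqlite3

# In-memory database with a hypothetical, heavily simplified customers table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, sex TEXT, age INTEGER, smoker INTEGER)"
)
conn.executemany(
    "INSERT INTO customers (sex, age, smoker) VALUES (?, ?, ?)",
    [("m", 18, 1), ("m", 22, 0), ("m", 30, 1), ("f", 18, 1), ("m", 22, 1)],
)


def count_where(condition, params=()):
    """Count rows matching a WHERE condition (the condition text is trusted here)."""
    sql = f"SELECT COUNT(id) FROM customers WHERE {condition}"
    return conn.execute(sql, params).fetchone()[0]


# Each step narrows the previous segment by one more filter.
males = count_where("sex = ?", ("m",))
males_18_or_22 = count_where("sex = ? AND age IN (18, 22)", ("m",))
male_smokers_18_or_22 = count_where(
    "sex = ? AND age IN (18, 22) AND smoker = 1", ("m",)
)
```

This only illustrates the query shape; it says nothing about how such counts would perform at 200 million rows.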
Afterwards you might extract a list of emails from these segments (not real time).
The other part is the operational database, which only receives data (operational writes from an HTTPS API).
Q. How should we design the data model for this case? Which of the following should we choose:
type of DB and how many nodes
table sets, normalization and relations
Thanks in advance!

Related

SQL Time Series Homework

Imagine you have these two tables.
a) streamers: it contains time series data, at 1-minute granularity, of all the channels that broadcast on Twitch. The columns of the table are:
username: Channel username
timestamp: Epoch timestamp, in seconds, corresponding to the moment the data was captured
game: Name of the game that the user was playing at that time
viewers: Number of concurrent viewers that the user had at that time
followers: Number of total followers that the channel had at that time
b) games_metadata: it contains information on all the games that have ever been broadcast on Twitch.
The columns of the table are:
game: Name of the game
release_date: Timestamp, in seconds, corresponding to the date when the game was released
publisher: Publisher of the game
genre: Genre of the game
Now I want the Top 10 publishers that have been watched the most during the first quarter of 2019. The output should contain publisher and hours_watched.
The problem is I don't have any database, so I created one and entered some values by hand.
I thought of this query, but I'm not sure if it is what I want. It may be right (I don't feel like it is), but I'd like a second opinion.
SELECT publisher,
(cast(strftime('%m', "timestamp") as integer) + 2) / 3 as quarter,
COUNT((strftime('%M',`timestamp`)/(60*1.0)) * viewers) as total_hours_watch
FROM streamers AS A INNER JOIN games_metadata AS B ON A.game = B.game
WHERE quarter = 3
GROUP BY publisher,quarter
ORDER BY total_hours_watch DESC
Looks about right to me. You don't need to include quarter in the GROUP BY, since the WHERE clause limits you to only one quarter. You can modify the query to get only the top 10 publishers in a couple of ways, depending on the database engine you're using.
For SQL Server / MS Access modify your select statement: SELECT TOP 10 publisher, ...
For MySQL add a limit clause at the end of your query: ... LIMIT 10;
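A few details of the attempted query are worth double-checking: WHERE quarter = 3 selects the third quarter rather than the first, nothing restricts the year to 2019, COUNT counts rows instead of adding up viewers, and SQLite's strftime treats a bare integer as a Julian day number unless the 'unixepoch' modifier is given. Here is a hedged sketch of one way to compute hours watched in SQLite (driven from Python's sqlite3 module). It assumes timestamps are UTC epoch seconds and that each row stands for exactly one minute of broadcast, so viewers is viewer-minutes; the sample rows and publisher names are invented:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE streamers (username TEXT, timestamp INTEGER, game TEXT,
                        viewers INTEGER, followers INTEGER);
CREATE TABLE games_metadata (game TEXT, release_date INTEGER,
                             publisher TEXT, genre TEXT);
""")


def epoch(year, month, day):
    """Epoch seconds for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


Q1_START, Q1_END = epoch(2019, 1, 1), epoch(2019, 4, 1)  # half-open interval

conn.executemany("INSERT INTO games_metadata VALUES (?, ?, ?, ?)", [
    ("Game X", 0, "Publisher A", "action"),
    ("Game Y", 0, "Publisher B", "puzzle"),
])
conn.executemany("INSERT INTO streamers VALUES (?, ?, ?, ?, ?)", [
    ("alice", Q1_START, "Game X", 60, 10),
    ("alice", Q1_START + 60, "Game X", 120, 10),
    ("bob", Q1_START + 120, "Game Y", 30, 5),
    ("alice", epoch(2019, 7, 1), "Game X", 6000, 10),  # outside Q1, ignored
])

# One row = one minute, so SUM(viewers) is viewer-minutes; divide by 60 for hours.
TOP_PUBLISHERS_SQL = """
SELECT g.publisher, SUM(s.viewers) / 60.0 AS hours_watched
FROM streamers AS s
JOIN games_metadata AS g ON s.game = g.game
WHERE s.timestamp >= ? AND s.timestamp < ?
GROUP BY g.publisher
ORDER BY hours_watched DESC
LIMIT 10
"""
top_publishers = conn.execute(TOP_PUBLISHERS_SQL, (Q1_START, Q1_END)).fetchall()
```

Filtering on the raw epoch range avoids any date formatting in the query and lets an index on timestamp be used, if one exists.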

How can I find the number of disk accesses needed for a relational algebra query?

Hello, are any of you very nice people able to explain the concept of query optimization to me with regard to relational algebra?
My preferred method of constructing relational algebra queries is to use temporary values step by step, but the only resources I can find that explain how query optimization determines the number of disk accesses use a different notation for relational algebra queries, which confuses me.
So if I am given the following relations:
department(deptNo, deptName, location)
Employee(empNo, empName, empAddress, jobDesc, deptNo*)
and have produced the following relational algebra query to find all the programmers who work in a Manchester department:
temp1 = department JOIN employee
temp2 = SELECT(jobDesc = 'programmer')(temp1)
result = SELECT(location = 'Manchester')(temp2)
And I can assume that there are 10,000 tuples in the employee relation, 50 tuples in the department relation, 100 programmers (2 in each department) and one department located in Manchester, how would I work out how many disk accesses are needed?
Thank you in advance!
Yup - Gordon's right. However, this is an academic exercise: you're building sets of data, so assume each element/tuple returned by a sub-query is one disk access. General rule of thumb: filter out as much data as early as possible. Let's assume you do the JOIN first (10000 employees + 50 departments = 10050 disk entries {even though the number of rows returned is 10000!}), then you do the SELECTs (assuming that the sub-query is perfectly indexed) = (100 programmers + 1 department in Manchester), thus the total number of "accesses" = 10050+101 = 10151.
If you do the SELECTs first, the whole exercise changes dramatically: (temp1 = get programmers = 100 rows/disk accesses), (temp2 = get departments = 1 row/disk access), JOIN (again, assuming perfect indexing on temporary views/queries, etc.) = 50 rows; therefore the total number of "accesses" = 100+1+50 = 151.
Same results, but the way it is interpreted and executed can influence the amount of work the database engine has to perform.
It's been many years, so there's every possibility that I've got this wrong - I don't mind being corrected.
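The arithmetic in this answer can be checked with a few lines of Python. This only reproduces the answer's own counting convention (one access per tuple read, perfect indexing) and takes its figure of 50 accesses for the final join as given; it is not a general cost model:

```python
# Figures taken from the question.
EMPLOYEES = 10_000
DEPARTMENTS = 50
PROGRAMMERS = 100
MANCHESTER_DEPARTMENTS = 1

# Plan 1: JOIN first, then apply both SELECTs to the joined result.
join_cost = EMPLOYEES + DEPARTMENTS                    # read every tuple of both relations
select_cost = PROGRAMMERS + MANCHESTER_DEPARTMENTS     # indexed selections afterwards
join_first = join_cost + select_cost

# Plan 2: apply both SELECTs first, then JOIN the two small results.
JOIN_AFTER_SELECT = 50  # join cost as quoted in the answer, not derived here
select_first = PROGRAMMERS + MANCHESTER_DEPARTMENTS + JOIN_AFTER_SELECT

print(f"join first:   {join_first} accesses")
print(f"select first: {select_first} accesses")
```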

recursive geometric query : five closest entities

The question is whether the query described below can be done without recourse to procedural logic, that is, can it be handled by SQL and a CTE and a windowing function alone? I'm using SQL Server 2012 but the question is not limited to that engine.
Suppose we have a national database of music teachers with 250,000 rows:
teacherName, address, city, state, zipcode, geolocation, primaryInstrument
where the geolocation column is a geography::point datatype with an optimally tessellated index.
A user wants the five closest guitar teachers to their location. A query using a windowing function performs well enough if we pick some arbitrary distance cutoff, say 50 miles, so that we are not selecting all 250,000 rows and then ranking them by distance and taking the closest 5.
But that arbitrary 50-mile radius cutoff might not always succeed in encompassing 5 teachers if, for example, the user picks an instrument from a different culture, such as sitar or oud or balalaika; there might not be five teachers of such instruments within 50 miles of their location.
Also, now imagine we have a query where a conservatory of music has sent us a list of 250 singers, who are students who have been accepted to the school for the upcoming year, and they want us to send them the five closest voice coaches for each person on the list, so that those students can arrange to get some coaching before they arrive on campus. We have to scan the teachers database 250 times (i.e. scan the geolocation index) because those students all live at different places around the country.
So, I was wondering, is it possible, for that latter query involving a list of 250 student locations, to write a recursive query where the radius begins small, at 10 miles, say, and then increases by 10 miles with each iteration, until either a maximum radius of 100 miles has been reached or the required five (5) teachers have been found? And can it be done only for those students who have yet to be matched with the required 5 teachers?
I'm thinking it cannot be done with SQL alone, and must be done with looping and a temporary table--but maybe that's because I haven't figured out how to do it with SQL alone.
P.S. The primaryInstrument column could reduce the size of the set ranked by distance too but for the sake of this question forget about that.
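For reference, the procedural version of the expanding-radius search described above could look like the following Python sketch. It assumes the distances from one student to the candidate teachers have already been computed (in a real system each iteration would be a radius-limited index query instead); the function name and its parameters are hypothetical:

```python
def closest_within_expanding_radius(distances_miles, needed=5, step=10, max_radius=100):
    """Grow the search radius in fixed steps until enough teachers are found.

    Returns (radius_used, matches), where matches holds at most `needed`
    distances, nearest first. Stops at max_radius even if too few were found.
    """
    ordered = sorted(distances_miles)
    radius = step
    while True:
        matches = [d for d in ordered if d <= radius]
        if len(matches) >= needed or radius >= max_radius:
            return radius, matches[:needed]
        radius += step


# A student in a well-served area stops early; one in a sparse area hits the cap.
dense = closest_within_expanding_radius([3, 12, 18, 25, 41, 77])
sparse = closest_within_expanding_radius([95, 150])
```

Only students who still lack the required number of matches would be carried into the next iteration, which is the part that is awkward to express in a single set-based statement.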
EDIT: Here's an example query. The SINGER (submitted) dataset contains a column with the arbitrary radius to limit the geo-results to a smaller subset, but as stated above, that radius may define a circle (whose centerpoint is the student's geolocation) which might not encompass the required number of teachers. Sometimes the supplied datasets contain thousands of addresses, not merely a few hundred.
select TEACHERSRANKEDBYDISTANCE.* from
(
    select STUDENTSANDTEACHERSINRADIUS.*,
        rowpos = row_number()
            over(partition by
                STUDENTSANDTEACHERSINRADIUS.zipcode+STUDENTSANDTEACHERSINRADIUS.streetaddress
                order by DistanceInMiles)
    from
    (
        select
            SINGER.name,
            SINGER.streetaddress,
            SINGER.city,
            SINGER.state,
            SINGER.zipcode,
            TEACHERS.name as TEACHERname,
            TEACHERS.streetaddress as TEACHERaddress,
            TEACHERS.city as TEACHERcity,
            TEACHERS.state as TEACHERstate,
            TEACHERS.zipcode as TEACHERzip,
            TEACHERS.teacherid,
            geography::Point(SINGER.lat, SINGER.lon, 4326).STDistance(TEACHERS.geolocation)
                / (1.6 * 1000) as DistanceInMiles
        from
            SINGER left join TEACHERS
            on
                (TEACHERS.geolocation).STDistance(geography::Point(SINGER.lat, SINGER.lon, 4326))
                < (SINGER.radius * (1.6 * 1000))
                and TEACHERS.primaryInstrument='voice'
    ) as STUDENTSANDTEACHERSINRADIUS
) as TEACHERSRANKEDBYDISTANCE
where rowpos < 6 -- closest 5 is an arbitrary requirement given to us
I think that if you just need the closest 5 teachers regardless of radius, you could write something like this. Each student will appear up to 5 times in the result (once per matched teacher); I'm not sure what output shape you want.
select
    S.name,
    S.streetaddress,
    S.city,
    S.state,
    S.zipcode,
    T.name as TEACHERname,
    T.streetaddress as TEACHERaddress,
    T.city as TEACHERcity,
    T.state as TEACHERstate,
    T.zipcode as TEACHERzip,
    T.teacherid,
    T.geolocation.STDistance(geography::Point(S.lat, S.lon, 4326))
        / (1.6 * 1000) as DistanceInMiles
from SINGER as S
outer apply (
    select top 5 TT.*
    from TEACHERS as TT
    where TT.primaryInstrument='voice'
    order by TT.geolocation.STDistance(geography::Point(S.lat, S.lon, 4326)) asc
) as T
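To make the semantics of the OUTER APPLY ... TOP 5 pattern concrete, here is a small pure-Python analogue: for each student, rank all teachers by distance and keep the nearest n. It is only an illustration of the result shape, not of how SQL Server executes the query. The haversine formula with a spherical Earth radius of 3958.8 miles is an approximation chosen for the sketch, and the tuples of (name, lat, lon) are made-up data:

```python
import heapq
import math

EARTH_RADIUS_MILES = 3958.8  # mean radius; spherical approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def closest_teachers(student, teachers, n=5):
    """Return the n teachers nearest to the student, nearest first."""
    _, s_lat, s_lon = student
    return heapq.nsmallest(
        n, teachers, key=lambda t: haversine_miles(s_lat, s_lon, t[1], t[2])
    )


# Hypothetical sample data: (name, latitude, longitude).
teachers = [("t1", 0.0, 1.0), ("t2", 0.0, 0.5), ("t3", 0.0, 3.0)]
student = ("s1", 0.0, 0.0)
nearest_two = [name for name, _, _ in closest_teachers(student, teachers, n=2)]
```

Like the query, this ranks every teacher for every student, so the cost grows with students times teachers unless a spatial index narrows the candidates first.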

Dynamic user ranks

I have a basic karma/rep system that awards users based on their activities (questions, answers, etc..). I want to have user ranks (title) based on their points. Different ranks have different limitations and grant powers.
ranks table
id rankname points questions_per_day
1 beginner 150 10
2 advanced 300 30
I'm not sure if I need both a lower and an upper limit, but for the sake of simplicity I have kept only a maximum points limit: a user below 150 points is a 'beginner', and a user at 150 or above, whether below or above 300, is 'advanced'.
For example, Bob with 157 points would have an 'advanced' tag displayed by his username.
How can I determine and display the rank/title of a user? Do I loop through each row and compare values?
What problems might arise if I scale this to thousands of users having their rank calculated this way? Surely it will tax the system to query and loop each time a user's rank is requested, no?
It would be better to cache the rank along with the score. If a user's score only changes when they perform certain activities, you can put a trigger on those activities. When the score changes, you recalculate the rank and save it in the user's record. That way, retrieving the rank is trivial, and you only need to calculate it when the score changes.
You can get the matching rank id like this: query the rank whose points value is closest to (but below or equal to) the user's score. Store this rank id in the user's record.
I added the pseudo-variable {USERSCORE} because I don't know whether you use parameters or some other way to pass values into a query.
select r.id
from ranks r
where r.points <= {USERSCORE}
order by r.points desc
limit 1
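As a runnable sketch of this lookup, the query can be executed with a bound parameter in place of {USERSCORE}. The example uses Python's sqlite3 module and the two sample ranks from the question; the helper name rank_id_for is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ranks (id INTEGER PRIMARY KEY, rankname TEXT, "
    "points INTEGER, questions_per_day INTEGER)"
)
conn.executemany(
    "INSERT INTO ranks VALUES (?, ?, ?, ?)",
    [(1, "beginner", 150, 10), (2, "advanced", 300, 30)],
)


def rank_id_for(score):
    """Return the id of the highest rank whose points value is <= score, or None."""
    row = conn.execute(
        "SELECT r.id FROM ranks r WHERE r.points <= ? "
        "ORDER BY r.points DESC LIMIT 1",
        (score,),
    ).fetchone()
    return row[0] if row else None
```

Note that this query treats points as the minimum score needed for a rank: with the sample rows, a score of 157 maps to id 1 and a score below 150 matches no row at all. If points is meant as an upper limit, as the question's Bob example suggests, the comparison would need to be reversed.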
A little difficult without knowing your schema. Try:
SELECT user.id, MIN(ranks.id) AS rankid FROM user JOIN ranks ON (user.score <= ranks.points) GROUP BY user.id;
Now you know the rank's id.
This is non-trivial though (GROUP BY and MIN are pipeline breakers, and so quite heavyweight operations), so GolezTrol's advice is good: you should cache this information and update it only when a user's score changes. A trigger sounds fine for this.
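A minimal sketch of the trigger-based caching idea, written for SQLite and driven from Python's sqlite3 module. The users table with its score and rank_id columns and the trigger name are assumptions, since the real schema is unknown; the rank rule is the one from the query above (lowest rank id whose points value is at least the score):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ranks (id INTEGER PRIMARY KEY, rankname TEXT,
                    points INTEGER, questions_per_day INTEGER);
INSERT INTO ranks VALUES (1, 'beginner', 150, 10), (2, 'advanced', 300, 30);

-- Hypothetical users table with the cached rank stored next to the score.
CREATE TABLE users (id INTEGER PRIMARY KEY,
                    score INTEGER NOT NULL DEFAULT 0,
                    rank_id INTEGER);

-- Recompute the cached rank only when the score column is updated.
CREATE TRIGGER users_refresh_rank AFTER UPDATE OF score ON users
BEGIN
    UPDATE users
    SET rank_id = (SELECT MIN(r.id) FROM ranks r WHERE NEW.score <= r.points)
    WHERE id = NEW.id;
END;

INSERT INTO users (id, score) VALUES (1, 0);
""")


def set_score(user_id, score):
    """Update a user's score and return the rank id cached by the trigger."""
    conn.execute("UPDATE users SET score = ? WHERE id = ?", (score, user_id))
    row = conn.execute("SELECT rank_id FROM users WHERE id = ?", (user_id,)).fetchone()
    return row[0]
```

Reading a user's rank is then a plain column lookup. With this rule a score above the highest points value leaves rank_id NULL, so a real table would probably need a top rank with a very large limit.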

SQL Query Math Gymnastics

I have two tables of concern here: users and race_weeks. User has many race_weeks, and race_week belongs to User. Therefore, user_id is a fk in the race_weeks table.
I need to perform some challenging math on fields in the race_weeks table in order to return users with the most all-time points.
Here are the fields that we need to manipulate in the race_weeks table.
races_won (int)
races_lost (int)
races_tied (int)
points_won (int, pos or neg)
recordable_type (varchar; robots can race, but we're only concerned with type 'User')
Just so that you fully understand the business logic at work here, over the course of a week a user can participate in many races. The race_week record represents the summary results of the user's races for that week. A user is considered active for the week if races_won, races_lost, or races_tied is greater than 0. Otherwise the user is inactive.
So here's what we need to do in our query in order to return users with the most points won (actually net_points_won):
Calculate each user's net_points_won (not a field in the DB).
To calculate net_points_won, you take (1000 * count_of_active_weeks) - sum(points_won). (Why 1000? Just imagine that every week the user is spotted 1000 points to compete and enter races. We want to factor out what we spot the user, because the user could enter only one race for the week for 100 points and be sitting on 900, which would skew who actually EARNED the most points.)
This one is a little convoluted, so let me know if I can clarify further.
I believe that your business logic is incorrect: net_points should be the sum of points won for that user minus the number of points the user was spotted.
In addition, the check for active weeks should test races_won, races_lost, and races_tied against zero explicitly to give the system the opportunity to use indexes on those columns when the table becomes large.
SELECT user_id
, SUM(points_won) - 1000 * COUNT(*) AS net_points
FROM race_weeks
WHERE recordable_type = 'User'
AND (races_won > 0 OR races_lost > 0 OR races_tied > 0)
GROUP BY user_id
ORDER BY net_points DESC
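To see what this query returns, here is a small runnable check using Python's sqlite3 module. The race_weeks rows are invented sample data, and the column types are assumed from the field list in the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE race_weeks (
    user_id INTEGER, races_won INTEGER, races_lost INTEGER,
    races_tied INTEGER, points_won INTEGER, recordable_type TEXT
)
""")
conn.executemany("INSERT INTO race_weeks VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 2, 1, 0, 1200, "User"),   # active week
    (1, 0, 3, 0, 900, "User"),    # active week (only losses)
    (1, 0, 0, 0, 1000, "User"),   # inactive week, excluded
    (2, 1, 0, 0, 800, "User"),    # active week
    (3, 5, 0, 0, 5000, "Robot"),  # not a user, excluded
])

NET_POINTS_SQL = """
SELECT user_id,
       SUM(points_won) - 1000 * COUNT(*) AS net_points
FROM race_weeks
WHERE recordable_type = 'User'
  AND (races_won > 0 OR races_lost > 0 OR races_tied > 0)
GROUP BY user_id
ORDER BY net_points DESC
"""
leaderboard = conn.execute(NET_POINTS_SQL).fetchall()
```

User 1 has two active weeks totalling 2100 points, so the net is 2100 - 2000 = 100; user 2 has one active week with 800 points, giving 800 - 1000 = -200.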
SELECT user_id, 1000 * COUNT(*) - SUM(points_won) AS net_points
FROM race_weeks
WHERE races_won + races_lost + races_tied
AND recordable_type = 'User'
GROUP BY
user_id
ORDER BY
net_points DESC