How to create an aggregate table (data mart) that will improve chart performance? - sql

I created a table named user_preferences in which user preferences are grouped by user_id and month.
Each month I collect all user_ids and record their preferences:
city
district
number of rooms
the maximum price they can spend
The plan is to display a chart of users' shopping intentions. The blue line is the number of interested users for the values selected in the filters, and the chart should allow filtering by the parameters marked in red.
What you see above is a simplified form to clarify the subject. In reality there are many more users, and every month the table grows by several hundred thousand records. The SQL query that feeds the chart takes up to 50 seconds. That's far too long - I can't afford it.
So, I need to create a table (an aggregate table / data mart) into which I can insert the precalculated number of interested users for all combinations. Thanks to this, the end user will not have to wait for the data to be counted.
Now the question is - how do I create such a table in PostgreSQL?
I know how to write a SQL query that calculates one specific example:
SELECT
month,
count(DISTINCT user_id) interested_users
FROM
user_preferences
WHERE
month BETWEEN '2020-01' AND '2020-03'
AND city = 'Madrid'
AND district = 'Latina'
AND rooms IN (1,2)
AND price_max BETWEEN 400001 AND 500000
GROUP BY
1
The question is - how do I calculate all possible combinations? Can I write multiple nested loops in SQL?
The topic is extremely important to me, and I think it will also be useful to others in the future.
I will be extremely grateful for any tips.

Well, based on your query, you have the following filters:
month
city
district
rooms
price_max
You can try creating a view with the following structure:
SELECT month
,city
,district
,rooms
,price_max
,count(DISTINCT user_id) AS interested_users
FROM user_preferences
GROUP BY month
,city
,district
,rooms
,price_max
You can make this view materialized, so the query behind the view is not re-executed every time it is queried - it behaves like a table.
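For example (a sketch - my_view is just an illustrative name, matching the refresh command below):
CREATE MATERIALIZED VIEW my_view AS
SELECT month
,city
,district
,rooms
,price_max
,count(DISTINCT user_id) AS interested_users
FROM user_preferences
GROUP BY month
,city
,district
,rooms
,price_max;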
When you add new records to the base table, you will need to refresh the view (unfortunately, PostgreSQL does not support auto-refresh of materialized views like some other databases do):
REFRESH MATERIALIZED VIEW my_view;
or you can schedule a task.
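For example, if the pg_cron extension happens to be installed (an assumption - any external scheduler such as plain cron works just as well):
-- refresh the view every night at 03:00
SELECT cron.schedule('refresh-my-view', '0 3 * * *', 'REFRESH MATERIALIZED VIEW my_view');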
If you are using only exact search for each field, this will work. But in your example, you have criteria like:
month BETWEEN '2020-01' AND '2020-03'
AND rooms IN (1,2)
AND price_max BETWEEN 400001 AND 500000
In such cases, I usually write the same query but SUM the pre-aggregated data from the materialized view. In your case, however, you are using DISTINCT, and summing may lead to counting a user multiple times.
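A sketch of that rolled-up query, reading from the materialized view instead of the base table (and subject to the double-counting caveat just mentioned):
SELECT month
,SUM(interested_users) AS interested_users
FROM my_view
WHERE month BETWEEN '2020-01' AND '2020-03'
AND city = 'Madrid'
AND district = 'Latina'
AND rooms IN (1,2)
AND price_max BETWEEN 400001 AND 500000
GROUP BY month;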
If that is an issue, you would need to precalculate far too many combinations, and I doubt this is the answer. Alternatively, you can try to normalize your data - this will improve the performance of the aggregations.

Related

SQL to Spotfire query filtering issue with multiple tables

I am trying to calculate hours flowing in and out of a cost center. When the cost center lends out an employee for an hour it's +1 and when they borrow an employee for an hour it's -1.
Right now I'm using a query that says
select
columns
from dbo.table
where EmployeeCostCenter <> ProjectCostCenter
So when ProjectCostCenter = ID_CostCenter it returns +HoursQuantity.
Then I update ID_CostCenter to EmployeeCostCenter and, where ID_CostCenter = EmployeeCostCenter, take -HoursQuantity.
That works fine. The problem is when I import it to Spotfire I can't filter on the main table even after I added the table relations. Can anyone explain why?
I can upload the actual code if needed, but I use 4 queries and a couple of them are quite lengthy. The main table, a temp table to calculate incoming hours, and a temp table to calculate outgoing hours are the only ones involved in this problem I think.
(moved to answer to avoid lengthy discussion)
Essentially, data relations are used to propagate filtering / marking between different data sets. Just like in an RDBMS, the relation is what Spotfire uses as the link between data sets. Essentially it's the same as the column or columns you join on. Thus, any column that you wish to filter in TableA and have the result set limited in TableB (or vice versa) must be a relation.
Column matches aren't related columns, but are associated for aggregations, category axis, etc within each visualization. So if TableA has "amount" and TableB has "amount debit" and you wanted to use both of these in an expression, say Sum([TableA].[amount],[TableB].[amount debit]), they would need to be matched in order to not produce erroneous results.
Lastly, once you set up your relations, check your filter panel to configure how you want the filtering to work. You can have the rows included, excluded, or ignored altogether.

Transpose to Count columns of Boolean values on Access SQL

Ok, so I have a Student table that has 6 fields, (StudentID, HasBamboo, HasFlower, HasAloe, HasFern, HasCactus) the "HasPlant" fields are boolean, so 1 for having the plant, 0 for not having the plant.
I want to find the average number of plants that a student has. There are hundreds of students in the table. I know this could involve transposing of some sort and of course counting the boolean values and getting an average. I did look at this question SQL to transpose row pairs to columns in MS ACCESS database for information on Transposing (never done it before), but I'm thinking there would be too many columns perhaps.
My first thought was using a for loop, but I'm not sure those exist in SQL in Access. Maybe a SELECT/FROM/WHERE/IN type structure?
Just hints on the logic and some possible reading material would be greatly appreciated.
you could just get individual totals per category:
SELECT COUNT(*) FROM STUDENTS WHERE HasBamboo
add them all up, and divide by
SELECT COUNT(*) FROM STUDENTS
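In Access you could also fold this into a single query; a sketch (the IIf is there because Access stores True as -1, and AvgPlants is just an illustrative alias):
SELECT (SUM(IIF(HasBamboo,1,0)) + SUM(IIF(HasFlower,1,0))
      + SUM(IIF(HasAloe,1,0)) + SUM(IIF(HasFern,1,0))
      + SUM(IIF(HasCactus,1,0))) / COUNT(*) AS AvgPlants
FROM Students;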
It's not a great database design though... Better normalized would be:
Table Students; fields StudentID, StudentName
Table Plants; fields PlantID, PlantName
Table OwnedPlants; fields StudentID,PlantID
The last table then stores a record for each student that owns a particular plant, and you could easily add different information in the right place (apartment number to Students; Latin name to Plants; date acquired to OwnedPlants) without completely redesigning the table structure and adding lots of fields (DateAcquiredBamboo, DateAcquiredFlower, etc.).
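With that design, the average no longer depends on how many plant columns exist; a sketch:
SELECT COUNT(*) / (SELECT COUNT(*) FROM Students) AS AvgPlants
FROM OwnedPlants;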

bigquery dataset design, multiple vs single tables for storing the same type of data

I'm planning to build a new ads system and we are considering using Google BigQuery.
I'll quickly describe my data flow:
Each user will be able to create multiple ads (1 user, N ads).
I would like to store the ad impressions, and I thought of 2 options.
Option 1 - create one table for impressions, for example a table named Impressions with fields (userid, adsid, datetime, meta data fields...).
In this option, all of my impressions will be stored in a single table.
Main pros: I'll be able to run big data queries quite easily.
Main cons: the table will be huge, and with multiple queries I'll end up paying too much (:
Option 2 is to create a table per ad.
For example, ad id 1 will create Impression_1 with fields (datetime, meta data fields).
Pros: queries are cheaper, each data table is smaller.
Cons: to do a big data query I'll sometimes have to create a union, and things will get complex.
I wonder what your thoughts are regarding this?
In BigQuery it's easy to do this, because you can create a table per day and then query only the tables you need.
And you have Table wildcard functions, which are a cost-effective way to query data from a specific set of tables. When you use a table wildcard function, BigQuery only accesses and charges you for tables that match the wildcard. Table wildcard functions are specified in the query's FROM clause.
Assuming you have some tables like:
mydata.people20140325
mydata.people20140326
mydata.people20140327
You can query like:
SELECT
name
FROM
(TABLE_DATE_RANGE(mydata.people,
TIMESTAMP('2014-03-25'),
TIMESTAMP('2014-03-27')))
WHERE
age >= 35
Also there are Table Decorators:
Table decorators support relative and absolute <time> values. Relative values are indicated by a negative number, and absolute values are indicated by a positive number.
To get a snapshot of the table at one hour ago:
SELECT COUNT(*) FROM [data-sensing-lab:gartner.seattle#-3600000]
There is also TABLE_QUERY, which you can use for more complex queries.
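For example (legacy SQL, reusing the hypothetical mydata.people tables from above):
SELECT name
FROM (TABLE_QUERY(mydata,
 'table_id CONTAINS "people" AND LENGTH(table_id) >= 6'))
WHERE age >= 35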

Select based on table name w/ SQL for time series tables?

I'm trying to count the number of actions per account over a period of time. The period of time may be variable. The way I've thought of doing this is by creating a table for each time period, containing records for each account that get upserted on user action. For example, I'd have a table named '12-05-2013' which would contain a record w/ the following attributes:
{
"account_id" = uniqueid12345,
"uploads" = 4
}
The table would basically represent the uploads by all users on that specific day.
I want to be able to find all accounts that had < 2 uploads over the last seven days. This would require me to query over the 7 tables representing the last 7 days (i.e. 12-05-2013, 12-04-2013, 12-03-2013, etc.). So far I've been unable to find a method of selecting out of these specific tables. Am I modeling the problem wrong? Is there an easy way to do this? I'm new to relational databases, so go easy on me :)
Why wouldn't you use a query such as this?
select userid, count(*) as NumActions
from ActionTable
where ActionDate between #date1 and #date2
group by userid;
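To get the accounts with fewer than 2 uploads over the last seven days, add a HAVING clause (a sketch, assuming a single ActionTable with one row per upload):
select userid, count(*) as NumActions
from ActionTable
where ActionDate between #date1 and #date2
group by userid
having count(*) < 2;
Note that accounts with zero uploads in the range won't appear at all; to include them you would need a left join from your accounts table.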

How do I optimise my voting application to produce monthly charts?

I'd appreciate any help you can offer - I'm currently trying to decide on a schema for a voting app I'm building with PHP / MySQL, but I'm completely stuck on how to optimise it. The key elements are to allow only one vote per user per item, and be able to build a chart detailing the top items of the month – based on votes received that month.
So far the initial schema is:
Items_table
item_id
total_points
(lots of other fields unrelated to voting)
Voting_table
voting_id
item_id
user_id
vote (1 = up; 0 = down)
month_cast
year_cast
So I'm wondering if it's a case of selecting everything from the voting table where month = currentMonth and year = currentYear, running a count, and grouping by item_id; if so, how would I go about doing that? Or would I be better off creating a separate table for monthly charts that is updated with each vote - but then should I be concerned about having to update 3 database tables per vote?
I'm not particularly competent – if it shows – so would really love any help / guidance someone could provide.
Thanks,
_just_me
I wouldn't add separate tables for monthly charts; to prevent users from casting more than one vote per item, you could use a unique key on voting_table(item_id, user_id).
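For example (a sketch - the key name is illustrative):
alter table voting_table add unique key uq_item_user (item_id, user_id);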
As for the summary, you should be able to use a simple query like
select item_id, vote, count(*), month_cast, year_cast
from voting_table
group by item_id, vote, month_cast, year_cast
I would use a voting table similar to this:
create table votes(
 item_id int unsigned not null -- column types are assumptions; adjust to your schema
,user_id int unsigned not null
,vote_dtm datetime not null
,vote tinyint not null
,primary key(item_id, user_id)
,foreign key(item_id) references item(item_id)
,foreign key(user_id) references users(user_id)
)Engine=InnoDB;
Using a composite primary key on an InnoDB table will cluster the data around the items, making it much faster to find the votes related to an item. I added a column vote_dtm which holds the timestamp of when the user voted.
Then I would create one or several views, used for reporting purposes.
create view votes_monthly as
select item_id
,year(vote_dtm) as year
,month(vote_dtm) as month
,sum(vote) as score
,count(*) as num_votes
from votes
group
by item_id
,year(vote_dtm)
,month(vote_dtm);
If you start having performance issues, you can replace the view with a table containing pre-computed values without even touching the reporting code.
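For example, something along these lines (votes_monthly_agg is an illustrative name):
-- one-time build from the existing view
create table votes_monthly_agg as select * from votes_monthly;
-- then, on a schedule, rebuild the precomputed values
truncate table votes_monthly_agg;
insert into votes_monthly_agg select * from votes_monthly;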
Note that I used both count(*) and sum(vote). The count(*) returns the number of cast votes, whereas the sum returns the number of up-votes. However, if you changed the vote column to use +1 for upvotes and -1 for downvotes, sum(vote) would return a score much like the way votes on Stack Overflow are calculated.