How do I optimise my voting application to produce monthly charts? - sql

I'd appreciate any help you can offer - I'm currently trying to decide on a schema for a voting app I'm building with PHP / MySQL, but I'm completely stuck on how to optimise it. The key elements are to allow only one vote per user per item, and be able to build a chart detailing the top items of the month – based on votes received that month.
So far the initial schema is:
Items_table
item_id
total_points
(lots of other fields unrelated to voting)
Voting_table
voting_id
item_id
user_id
vote (1 = up; 0 = down)
month_cast
year_cast
So I'm wondering if it's going to be a case of selecting all information from voting table where month = currentMonth & year = currentYear, somehow running a count and grouping by item_id; if so, how would I go about doing so? Or would I be better off creating a separate table for monthly charts which is updated with each vote, but then should I be concerned with the requirement to update 3 database tables per vote?
I'm not particularly competent – if it shows – so would really love any help / guidance someone could provide.
Thanks,
_just_me

I wouldn't add separate tables for monthly charts; to prevent users from casting more than one vote per item, you could use a unique key on voting_table(item_id, user_id).
As for the summary, you should be able to use a simple query like
select item_id, vote, count(*), month_cast, year_cast
from voting_table
group by item_id, vote, month_cast, year_cast
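To make the two suggestions above concrete, here is a minimal sketch rather than a definitive implementation. It uses Python's built-in sqlite3 module as a stand-in for PHP / MySQL, and the helper names cast_vote and monthly_chart are made up for illustration. The unique key rejects a second vote by the same user on the same item, and the chart query filters on the month and year columns before grouping.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; the table mirrors the question's schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE voting_table (
        voting_id  INTEGER PRIMARY KEY,
        item_id    INTEGER NOT NULL,
        user_id    INTEGER NOT NULL,
        vote       INTEGER NOT NULL,   -- 1 = up, 0 = down
        month_cast INTEGER NOT NULL,
        year_cast  INTEGER NOT NULL,
        UNIQUE (item_id, user_id)      -- one vote per user per item
    )
""")

def cast_vote(item_id, user_id, vote, month, year):
    """Insert a vote; return False if this user already voted on this item."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO voting_table "
                "(item_id, user_id, vote, month_cast, year_cast) "
                "VALUES (?, ?, ?, ?, ?)",
                (item_id, user_id, vote, month, year),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def monthly_chart(month, year):
    """Return (item_id, up_votes) rows for one month, highest first."""
    return conn.execute(
        "SELECT item_id, SUM(vote) AS up_votes FROM voting_table "
        "WHERE month_cast = ? AND year_cast = ? "
        "GROUP BY item_id ORDER BY up_votes DESC, item_id",
        (month, year),
    ).fetchall()

results = [
    cast_vote(1, 10, 1, 5, 2010),
    cast_vote(1, 11, 1, 5, 2010),
    cast_vote(2, 10, 1, 5, 2010),
    cast_vote(1, 10, 0, 5, 2010),  # duplicate: user 10 already voted on item 1
]
chart = monthly_chart(5, 2010)
```

Because the database enforces the constraint, the application does not need a separate "has this user voted?" lookup before inserting.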

I would use a voting table similar to this:
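-- column types are assumed; match them to item.item_id and users.user_id
create table votes(
item_id int not null
,user_id int not null
,vote_dtm datetime not null
,vote tinyint not null
,primary key(item_id, user_id)
,foreign key(item_id) references item(item_id)
,foreign key(user_id) references users(user_id)
)Engine=InnoDB;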
Using a composite primary key on an InnoDB table will cluster the data around the items, making it much faster to find the votes related to an item. I added a column vote_dtm, which holds the timestamp of when the user voted.
Then I would create one or several views, used for reporting purposes.
create view votes_monthly as
select item_id
,year(vote_dtm) as year
,month(vote_dtm) as month
,sum(vote) as score
,count(*) as num_votes
from votes
group
by item_id
,year(vote_dtm)
,month(vote_dtm);
If you start having performance issues, you can replace the view with a table containing pre-computed values without even touching the reporting code.
Note that I used both count(*) and sum(vote). The count(*) returns the number of votes cast, whereas the sum returns the number of up-votes. However, if you changed the vote column to use +1 for upvotes and -1 for downvotes, sum(vote) would return a net score, much like the way votes on Stack Overflow are calculated.
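Here is a small sketch of the +1 / -1 variant, run against SQLite through Python's sqlite3 module instead of MySQL. SQLite has no year() or month() functions, so strftime is substituted for them, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE votes (
        item_id  INTEGER NOT NULL,
        user_id  INTEGER NOT NULL,
        vote_dtm TEXT    NOT NULL,                      -- ISO-8601 timestamp
        vote     INTEGER NOT NULL CHECK (vote IN (-1, 1)),
        PRIMARY KEY (item_id, user_id)
    );
    -- strftime replaces MySQL's year()/month() in this SQLite sketch.
    CREATE VIEW votes_monthly AS
    SELECT item_id,
           CAST(strftime('%Y', vote_dtm) AS INTEGER) AS year,
           CAST(strftime('%m', vote_dtm) AS INTEGER) AS month,
           SUM(vote) AS score,
           COUNT(*)  AS num_votes
    FROM votes
    GROUP BY item_id, strftime('%Y', vote_dtm), strftime('%m', vote_dtm);
""")
conn.executemany(
    "INSERT INTO votes VALUES (?, ?, ?, ?)",
    [
        (1, 10, "2010-05-03 09:00:00", 1),
        (1, 11, "2010-05-04 10:00:00", 1),
        (1, 12, "2010-05-09 11:00:00", -1),
        (2, 10, "2010-05-05 12:00:00", 1),
        (2, 12, "2010-05-06 12:00:00", 1),
        (2, 11, "2010-06-01 08:00:00", 1),  # June vote, excluded from May
    ],
)
# The reporting code only ever talks to the view.
may_chart = conn.execute(
    "SELECT item_id, score, num_votes FROM votes_monthly "
    "WHERE year = 2010 AND month = 5 ORDER BY score DESC, item_id"
).fetchall()
```

Item 1 received more votes in May, but item 2 ranks first because its net score is higher.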

Related

How to create an aggregate table (data mart) that will improve chart performance?

I created a table named user_preferences where user preferences have been grouped by user_id and month.
Table:
Each month I collect all user_ids and assign all preferences:
city
district
number of rooms
the maximum price they can spend
The plan assumes displaying a graph showing users' shopping intentions like this:
The blue line is the number of interested users for the selected values in the filters.
The graph should enable filtering by parameters marked in red.
What you see above is a simplified form for clarifying the subject. In fact, there are many more users. Every month, the table increases by several hundred thousand records. The SQL query retrieving data (feeding) for chart lasts up to 50 seconds. It's far too much - I can't afford it.
So, I need to create a table (table/aggregation/data mart) where I will be able to insert the previously calculated number of interested users for all combinations. Thanks to this, the end user will not have to wait for the data to be counted.
Details below:
Now the question is - how to create such a table in PostgreSQL?
I know how to write a SQL query that will calculate a specific example.
SELECT
month,
count(DISTINCT user_id) interested_users
FROM
user_preferences
WHERE
month BETWEEN '2020-01' AND '2020-03'
AND city = 'Madrid'
AND district = 'Latina'
AND rooms IN (1,2)
AND price_max BETWEEN 400001 AND 500000
GROUP BY
1
The question is - how to calculate all possible combinations? Can I write multiple nested loop in SQL?
The topic is extremely important to me, I think it will also be useful to others for the future.
I will be extremely grateful for any tips.
Well, based on your query, you have the following filters:
month
city
district
rooms
price_max
You can try creating a view with the following structure:
SELECT month
,city
,district
,rooms
,price_max
,count(DISTINCT user_id) AS interested_users
FROM user_preferences
GROUP BY month
,city
,district
,rooms
,price_max
You can make this view materialized, so the query behind it is not executed each time the view is queried; it behaves like a table.
When you add new records to the base table you will need to refresh the view (unfortunately, PostgreSQL does not support auto-refresh like some other databases):
REFRESH MATERIALIZED VIEW my_view;
or you can schedule a task to do it.
If you are using only exact search for each field, this will work. But in your example, you have criteria like:
month BETWEEN '2020-01' AND '2020-03'
AND rooms IN (1,2)
AND price_max BETWEEN 400001 AND 500000
In such cases, I usually write the same query but SUM the data from the materialized view. In your case, you are using DISTINCT and this may lead to counting a user multiple times.
If this is an issue, you would need to precalculate too many combinations, and I doubt that is the answer. Alternatively, you can try to normalize your data - this will improve the performance of the aggregations.
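To show the double-counting problem, here is a rough sketch using Python's sqlite3 module. SQLite has no materialized views, so a plain table (preferences_agg, a made-up name) stands in for one, and the sample rows assume that a single user can have several preference rows in the same month.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_preferences (
        month TEXT, user_id INTEGER, city TEXT, district TEXT,
        rooms INTEGER, price_max INTEGER
    );
    -- Assumed sample data: user 1 is interested in both 1- and 2-room flats.
    INSERT INTO user_preferences VALUES
        ('2020-01', 1, 'Madrid', 'Latina', 1, 450000),
        ('2020-01', 1, 'Madrid', 'Latina', 2, 450000),
        ('2020-01', 2, 'Madrid', 'Latina', 1, 450000);
    -- Plain table standing in for the materialized view.
    CREATE TABLE preferences_agg AS
    SELECT month, city, district, rooms, price_max,
           COUNT(DISTINCT user_id) AS interested_users
    FROM user_preferences
    GROUP BY month, city, district, rooms, price_max;
""")

where = ("month = '2020-01' AND city = 'Madrid' "
         "AND district = 'Latina' AND rooms IN (1, 2)")

# Counting distinct users on the base table gives the true answer.
exact = conn.execute(
    f"SELECT COUNT(DISTINCT user_id) FROM user_preferences WHERE {where}"
).fetchone()[0]

# Summing the pre-aggregated counts counts user 1 once per matching row.
summed = conn.execute(
    f"SELECT SUM(interested_users) FROM preferences_agg WHERE {where}"
).fetchone()[0]
```

Here exact is 2 while summed is 3, because user 1 appears in both the rooms = 1 and rooms = 2 groups. Summing is only safe when each user falls into exactly one group for the filters applied.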

How do you store quantities of items in an SQL database? (Postgresql)

I'm trying to think of the most straightforward way to store quantities of an item in a database. I'm creating a database to work in conjunction with a web app I'm developing to monitor and log gear lent out to people. So, I've thought a few different ways already, though I'm not sure if they will be easy to maintain into the future.
Idea 1
I have a gear table that stores the types of gear (e.g. shirt, pants, hat) along with data like sizes etc. Each time gear is taken out, it is logged in the gear_inventory table, storing details such as the user id, the gear type, a boolean to signify whether it was returned, and the return date. To track quantities, the gear table holds a total_quantity and a count of gear currently out for each gear item; these are updated manually by a second query, triggered when an item is taken out or returned, that adds or subtracts the given quantity.
Idea 2
Have the aforementioned total quantity out linked to the gear_inventory as a count of all non-returned items of that type.
Idea 3
Have an update task to change these quantities on table update or insert. Then do the same by adding or subtracting the quantity of a given query.
Idea 1 would be the easiest but not as reliable as the others. Idea 2 would be the most reliable, since the handling of quantities happens entirely in the database. Idea 3 is not as reliable, since it still relies on a scheduled task to update the quantities.
So, how would you implement a quantity amount to ensure it doesn't get out of sync with the logged inventory records?
Edit 1
More info - The core solution I am trying to achieve is a method of storing, or obtaining, a count of items taken out that can be compared to a total number associated with that item. As suggested below, it will act in a similar fashion to a bowling alley lending out shoes, except users will be logged. So a record will be inserted when an item is borrowed, and that record will be updated with a return date on return. The types of items will be in their own table to store details of a general item, and that table will have a one-to-many relationship with the logged gear table. The problem is, what is a foolproof/reliable way to store/retrieve the number of items out? Or am I overthinking this and a simple count query would suffice? I'm not sure how intensive a count is when performed over and over again.
Let's think about what we're keeping track of.
Information about kinds of items.
Current inventory.
Which item?
How many?
What's been loaned out.
To whom?
Which items?
How many?
What's been returned.
By whom?
Which items?
How many?
First cut might look something like this:
create table items (
id serial primary key,
name text,
...
);
create table inventory (
id serial primary key,
item integer references items(id),
quantity integer check(quantity >= 0)
);
create table loans (
id serial primary key,
-- "user" is a reserved word in PostgreSQL, so it has to be quoted
"user" integer references users(id),
item integer references items(id),
quantity integer check(quantity >= 0),
when_loaned timestamp not null default now(),
when_returned timestamp
);
When you loan something out, insert a row into loans. When it's returned, set loans.when_returned. If loans.when_returned is null, it's still out.
You could decrement the inventory when items are loaned, and increment it when they're returned. To preserve data integrity this should be done as a trigger, not as a scheduled process.
Alternatively, don't change the inventory quantities. Instead, subtract the number of loaned items from the amount in inventory. That is, subtract select sum(quantity) from loans where item = ? and when_returned is null from select quantity from inventory where item = ?. This makes loans and returns simpler, and avoids the possibility of the inventory count being corrupted, but it might cause performance problems if there are many, many outstanding loans.
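As a rough sketch of that second approach, the two queries can be combined into one. This uses Python's sqlite3 module as a stand-in for PostgreSQL, a trimmed-down version of the tables above, and a hypothetical helper called available.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inventory (
        item     INTEGER PRIMARY KEY,
        quantity INTEGER CHECK (quantity >= 0)
    );
    CREATE TABLE loans (
        id            INTEGER PRIMARY KEY,
        item          INTEGER,
        quantity      INTEGER CHECK (quantity >= 0),
        when_loaned   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        when_returned TEXT              -- NULL means the loan is still out
    );
    INSERT INTO inventory VALUES (1, 10);
    INSERT INTO loans (item, quantity, when_returned) VALUES
        (1, 3, NULL),
        (1, 2, NULL),
        (1, 4, '2020-01-15 12:00:00');  -- already returned, so ignored
""")

def available(item):
    """Inventory quantity minus everything currently out on loan."""
    row = conn.execute(
        """
        SELECT i.quantity - COALESCE(SUM(l.quantity), 0)
        FROM inventory i
        LEFT JOIN loans l
               ON l.item = i.item AND l.when_returned IS NULL
        WHERE i.item = ?
        GROUP BY i.item, i.quantity
        """,
        (item,),
    ).fetchone()
    return row[0] if row else None

in_stock = available(1)
```

The LEFT JOIN with COALESCE keeps the result correct for items that have no outstanding loans at all. In PostgreSQL, an index on loans (item) restricted to rows where when_returned is null would probably keep this cheap even with a long loan history, though that is untested here.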
What happens if you loan out 5 items and they return 3? How do you track that? One simple option is to split a single loan into two loans.
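-- Copy the loan row, letting the copy get its own id
insert into loans ("user", item, quantity, when_loaned)
select "user", item, quantity, when_loaned from loans where id = :orig_id
returning id; -- use the returned id as :new_id
-- Track that 3 were returned
update loans
set quantity = 3, when_returned = now()
where id = :orig_id;
-- Two are now outstanding
update loans
set quantity = 2
where id = :new_id;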
Without knowing more about what you're using this for, what the use cases are, and what the scale is, that's about all I can say. It's a good starting point.

How to maintain a list of N most recent viewed items per user in a relational database

I would like to keep track of the last n items that a user has viewed in a PostgreSQL database. My first thought is to create a table such as
CREATE TABLE history (
id SERIAL PRIMARY KEY,
user_id integer REFERENCES users (id),
item_id integer REFERENCES items (id),
view_date timestamp DEFAULT current_timestamp
);
When a user views an object, a new row in the history table will record this view. But I only need to maintain the last n views for each user, and this approach will store every view that ever occurs.
Is there an efficient way to periodically drop all users' entries that are in excess of their n most recent?
EDIT: If there's a better way to store this data than using a SQL table, I'd be interested to hear about that.
delete from history
where id in (
select id
from (
select
id,
row_number() over(
partition by user_id
order by view_date desc
) as rn
from history
) s
where rn > n
)
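For anyone who wants to try the delete above, here is a small sketch using Python's sqlite3 module in place of PostgreSQL. It needs SQLite 3.25 or newer for window functions, passes n as a bound parameter, and the trim_history helper and sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE history (
        id        INTEGER PRIMARY KEY,
        user_id   INTEGER,
        item_id   INTEGER,
        view_date TEXT
    )
""")
conn.executemany(
    "INSERT INTO history (user_id, item_id, view_date) VALUES (?, ?, ?)",
    [
        (1, 100, "2020-01-01 10:00:00"),
        (1, 101, "2020-01-02 10:00:00"),
        (1, 102, "2020-01-03 10:00:00"),
        (1, 103, "2020-01-04 10:00:00"),
        (2, 200, "2020-01-01 10:00:00"),
    ],
)

def trim_history(n):
    """Delete every row beyond each user's n most recent views."""
    with conn:
        cur = conn.execute(
            """
            DELETE FROM history
            WHERE id IN (
                SELECT id FROM (
                    SELECT id,
                           row_number() OVER (
                               PARTITION BY user_id
                               ORDER BY view_date DESC
                           ) AS rn
                    FROM history
                ) s
                WHERE rn > ?
            )
            """,
            (n,),
        )
    return cur.rowcount

deleted = trim_history(2)
remaining = conn.execute(
    "SELECT user_id, item_id FROM history ORDER BY user_id, view_date"
).fetchall()
```

User 1 keeps only the two newest views, and user 2, who has fewer than n rows, is left untouched.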
Is there an efficient way to periodically drop all users' entries that are in excess of their n most recent?
Set up a job that groups, orders and drops every ten minutes or so. You aren't going to find a lot of room for improvement in that sort of query.
From a design perspective though I would favor creating an in memory data structure which you load/save at the start/end of the user's session. That way you don't beat up your database with this sort of work. But your requirements may make this strategy impossible.
Cheers!
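A minimal sketch of the in-memory idea, in Python. The RecentViews class is hypothetical, and moving a re-viewed item to the front instead of storing it twice is an assumption about what "most recent" should mean here.

```python
from collections import deque

class RecentViews:
    """Keeps a user's n most recently viewed item ids, newest first."""

    def __init__(self, n):
        # A bounded deque drops the oldest entry automatically when full.
        self._items = deque(maxlen=n)

    def view(self, item_id):
        # Assumption: a re-viewed item moves to the front, not duplicated.
        try:
            self._items.remove(item_id)
        except ValueError:
            pass
        self._items.appendleft(item_id)

    def snapshot(self):
        """Return the list to persist when the session ends."""
        return list(self._items)

recent = RecentViews(3)
for item in (1, 2, 3, 2, 4):
    recent.view(item)
saved = recent.snapshot()
```

The snapshot would be written back to the database once, at the end of the session, rather than on every page view.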
If there's a better way to store this data than using a SQL table, I'd
be interested to hear about that.
A database is for persisting values / object states over a fairly long period of time. If you need frequent access to, or updates of, the most recent items, use a cache.
You can listen for the cache notification: when the list expires or is evicted, capture it, serialize it and save it to the database.
http://msdn.microsoft.com/en-us/library/ee808091(v=azure.10).aspx

SQL - Complex query using foreign keys

So, I am totally new to SQL, but the book I have from the courses I take is useless and I am trying to do a project for said course. Internet did not help me all that much (I do not know where to start exactly), so I want to ask for both links to good tutorials to check out as well as help with a very specific piece of query.
If anything I say is not clear enough, please ask me to explain! :)
Suppose two tables sale and p_sale in a database called jewel_store.
sale contains two columns: sale_CODE and sale_date
p_sale contains sale_CODE (which references the above sale_CODE in sale), p_ID, p_sl_quantity and p_sl_value
sale_CODE is the primary key of sale and sale_CODE,p_ID is the primary key of p_sale
For the time being p_ID is not of much use so just ignore it for the most part.
p_sl_quantity is int and p_sl_value is double(8,2). The first one is the quantity of the product bought and the second one is the value PER UNIT of the product.
As it probably is obvious a sale_CODE can be linked to a multitude of entries in the p_sale table (example for sale_CODE 1, I have 2 entries on p_sale).
All this is based on what I was given from the task and is correctly implemented and has some example values in.
What I now have to do is find the total income from sales in a specific month. My initial approach was to structure everything step by step, so I have come to a point that looks like the following:
SELECT
SUM(p_sl_value * p_sl_quantity) AS sales_monthly_income,
p_sale.sale_CODE
FROM jewel_store.p_sale
GROUP BY p_sale.sale_CODE
This is probably half way through as I can get the total money a sale generated for the store. So my next step was to use this query and SELECT from it. I messed it up a couple of times already and I am scratching my head now. What I did was like this:
SELECT
SUM(sales_monthly_income),
sales_monthly_income,
EXTRACT(MONTH FROM jewel_store.sale.sale_date) AS sales_month
FROM (
SELECT
SUM(p_sl_value * p_sl_quantity) AS sales_monthly_income,
sale_CODE
FROM jewel_store.p_sale
GROUP BY sale_CODE
) as code_income, jewel_store.sale
GROUP BY sales_month
First off, I only need to print the total_montly_income and the month columns in my final form, but I used this to clarify that everything went wrong in there. I think I need to somehow use the foreign key that references the other table, but my book is totally useless in helping me out. I would like someone to explain why this is wrong and what the right one would be and please point me to a good pdf, site or anything to learn how to do this kind of stuff. (I have checked W3SCHOOLS, it is good for the basics, but not for too advanced stuff)
Thanks in advance!
Off the top of my head, this could be it: join the two tables on sale_code, then group by month and sum value times quantity.
SELECT
SUM(p.p_sl_value * p.p_sl_quantity) AS sales_monthly_income,
month(s.sale_date)
FROM p_sale p
inner join sale s on s.sale_code = p.sale_code
GROUP BY MONTH(s.sale_date)
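To see why the join matters: the comma join in the question has no join condition, so every per-sale total gets paired with every row of sale. Below is a small sketch of the joined version, run on SQLite through Python's sqlite3 module with invented sample data. strftime stands in for MONTH(), and grouping on year plus month is my own addition so that the same month of different years is not merged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sale (
        sale_CODE INTEGER PRIMARY KEY,
        sale_date TEXT NOT NULL
    );
    CREATE TABLE p_sale (
        sale_CODE     INTEGER NOT NULL REFERENCES sale(sale_CODE),
        p_ID          INTEGER NOT NULL,
        p_sl_quantity INTEGER NOT NULL,
        p_sl_value    REAL    NOT NULL,   -- value per unit
        PRIMARY KEY (sale_CODE, p_ID)
    );
    INSERT INTO sale VALUES
        (1, '2014-01-10'), (2, '2014-01-25'), (3, '2014-02-03');
    INSERT INTO p_sale VALUES
        (1, 1, 2, 10.0), (1, 2, 1, 5.0), (2, 1, 3, 10.0), (3, 2, 4, 5.0);
""")

# Each p_sale row is matched to exactly one sale row through sale_CODE,
# so every line item is counted once, in the month of its own sale.
monthly_income = conn.execute("""
    SELECT strftime('%Y-%m', s.sale_date)      AS sales_month,
           SUM(p.p_sl_value * p.p_sl_quantity) AS sales_monthly_income
    FROM p_sale p
    INNER JOIN sale s ON s.sale_CODE = p.sale_CODE
    GROUP BY sales_month
    ORDER BY sales_month
""").fetchall()
```

January adds up to 2*10 + 1*5 + 3*10 = 55 and February to 4*5 = 20.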

What is a fast way of joining two tables and using the first table column to "filter" the second table?

I am trying to develop a SQL Server 2005 query, but I have been unsuccessful so far. I have tried every approach that I know, like derived tables, sub-queries, CTEs, etc., but I couldn't solve the problem. I won't post the queries I tried here because they involve many other columns and tables, but I will try to explain the problem with a simpler example:
There are two tables: PARTS_SOLD and PARTS_PURCHASED. The first contains products that were sold to customers, and the second contains products that were purchased from suppliers. Both tables contain a foreign key associated with the movement itself, which contains the dates, etc.
Here is the simplified schema:
Table PARTS_SOLD:
part_id
date
other columns
Table PARTS_PURCHASED
part_id
date
other columns
What I need is to join every row in PARTS_SOLD with a unique row from PARTS_PURCHASED, chosen by part_id and the maximum "date", where that "date" is equal to or before the "date" column from PARTS_SOLD. In other words, for every sale of an item, I need to collect some information from the last purchase of that item.
The problem itself is that I didn't find a way of joining the PARTS_PURCHASED table with PARTS_SOLD table using the column "date" from PARTS_SOLD to limit the MAX(date) of the PARTS_PURCHASED table.
I could have done this with a cursor to solve the problem with the tools I know, but every table has millions of rows, and perhaps using cursors or sub-queries that evaluate a query for every row would make the process very slow.
You aren't going to like my answer. Your database is designed incorrectly, which is why you can't get the data back out the way you want. Even using a cursor, you would not get good data from this. Assume that you purchased 5 of part 1 on May 31, 2010. Assume that on June 1 you sold ten of part 1. Matching just on date, you would match all ten to the May 31 purchase even though that is clearly not correct; some parts might have been purchased on May 23 and some may have been purchased on July 19, 2008.
If you want to know which purchased part relates to which sold part, your database design should include the PartPurchasedID as part of the PartsSold record and this should be populated at the time of the purchase, not later for reporting when you have 1,000,000 records to sort through.
Perhaps the following would help:
SELECT S.*
FROM PARTS_SOLD S
INNER JOIN (SELECT PART_ID, MAX(DATE) AS MAX_DATE
FROM PARTS_PURCHASED
GROUP BY PART_ID) D
ON (D.PART_ID = S.PART_ID)
WHERE D.MAX_DATE <= S.DATE
Share and enjoy.
I'll toss this out there, but it's likely to contain all kinds of mistakes... both because I'm not sure I understand your question and because my SQL is... weak at best. That being said, my thought would be to try something like:
SELECT * FROM PARTS_SOLD
INNER JOIN (SELECT part_id, max(date) AS max_date
FROM PARTS_PURCHASED
GROUP BY part_id) AS subtable
ON PARTS_SOLD.part_id = subtable.part_id
AND PARTS_SOLD.date < subtable.max_date
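For comparison, here is a sketch of a different technique from the two answers above: a correlated subquery that picks, for each sale, the latest purchase dated on or before that sale. It runs on SQLite through Python's sqlite3 module rather than SQL Server 2005, and the supplier column is a made-up stand-in for the "other columns" in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PARTS_PURCHASED (part_id INTEGER, date TEXT, supplier TEXT);
    CREATE TABLE PARTS_SOLD (part_id INTEGER, date TEXT);
    INSERT INTO PARTS_PURCHASED VALUES
        (1, '2010-05-01', 'A'),
        (1, '2010-05-31', 'B'),
        (1, '2010-06-15', 'C');
    INSERT INTO PARTS_SOLD VALUES
        (1, '2010-05-10'),
        (1, '2010-06-01'),
        (1, '2010-04-01');   -- sold before any purchase exists
""")

# For each sale, the subquery finds the newest purchase date that is not
# after the sale date; the LEFT JOIN keeps sales with no earlier purchase.
rows = conn.execute("""
    SELECT s.part_id, s.date AS sold_date,
           p.date AS purchased_date, p.supplier
    FROM PARTS_SOLD s
    LEFT JOIN PARTS_PURCHASED p
           ON p.part_id = s.part_id
          AND p.date = (SELECT MAX(p2.date)
                        FROM PARTS_PURCHASED p2
                        WHERE p2.part_id = s.part_id
                          AND p2.date <= s.date)
    ORDER BY s.date
""").fetchall()
```

If two purchases of the same part share the same date, this returns both, so a tie-breaker would be needed for a strictly unique match. On SQL Server 2005 an OUTER APPLY with TOP 1 is another way to express the same idea, and an index on PARTS_PURCHASED (part_id, date) is likely to matter at millions of rows, though neither is tested here.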