We're developing an application, one function of which is managing payments to people. Each payment will be written as a row in a table with the following fields:
PersonId (INT)
TransactionDate (DATETIME)
Amount (MONEY)
PaymentTypeId (INT)
...
...
...
It looks like we deal with around 8,000 people to whom we send payments, and a new transaction per person is added daily (around 8,000 inserts per day). This means that after 7 years (the time we need to retain the data), we will have over 20,000,000 rows.
We get around 10% more people per year, so this number rises a bit.
The most common query would be to get a SUM(Amount), per person, where Transaction Date between a start date and an end date.
SELECT PersonId, SUM(Amount)
FROM Table
WHERE PaymentTypeId = x
AND TransactionDate BETWEEN StartDate AND EndDate
GROUP BY PersonId
My question is, is this going to be a performance problem for SQL Server 2012? Or is 20,000,000 rows not too bad?
I'd have assumed a clustered index on PersonId (to group the rows per person), but would that cause very slow inserts/updates?
Or an index on TransactionDate?
If your query filters on TransactionDate and PaymentTypeId and also needs PersonId and Amount at the same time, I would recommend putting a nonclustered index on TransactionDate and PaymentTypeId and including those other two columns in the index:
CREATE NONCLUSTERED INDEX IX_Table_TransactionDate
ON dbo.Table (TransactionDate, PaymentTypeId)
INCLUDE (PersonId, Amount)
That way, your query can be satisfied from just this index - no need to go back to the actual complete data pages.
Also: if you have years that can be "finalized" (no more changes), you could pre-compute and store some of those sums, e.g. per day or per month. With this approach, certain queries can just read pre-computed sums from a summary table rather than re-computing the sum over thousands of rows each time.
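For example, a minimal sketch of a pre-computed daily summary, assuming the payments table is dbo.[Table] as above (the summary table and column names are made up, and @x, @StartDate, @EndDate stand for the parameters in the original query):

CREATE TABLE dbo.PaymentDailySums (
    PersonId       INT   NOT NULL,
    PaymentTypeId  INT   NOT NULL,
    TransactionDay DATE  NOT NULL,
    DayAmount      MONEY NOT NULL,
    CONSTRAINT PK_PaymentDailySums PRIMARY KEY (PersonId, PaymentTypeId, TransactionDay)
);

-- populate for finalized periods only
INSERT INTO dbo.PaymentDailySums (PersonId, PaymentTypeId, TransactionDay, DayAmount)
SELECT PersonId, PaymentTypeId, CAST(TransactionDate AS DATE), SUM(Amount)
FROM dbo.[Table]
GROUP BY PersonId, PaymentTypeId, CAST(TransactionDate AS DATE);

-- the common query can then read far fewer rows:
SELECT PersonId, SUM(DayAmount)
FROM dbo.PaymentDailySums
WHERE PaymentTypeId = @x
  AND TransactionDay BETWEEN @StartDate AND @EndDate
GROUP BY PersonId;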
I created a table named user_preferences where user preferences have been grouped by user_id and month.
Each month I collect all user_ids and assign all preferences:
city
district
number of rooms
the maximum price they can spend
The plan is to display a graph showing users' shopping intentions.
The line on the chart is the number of interested users for the values selected in the filters.
The graph should allow filtering by the parameters listed above (city, district, rooms, maximum price) plus a month range.
What you see above is a simplified description to clarify the subject. In reality there are many more users, and every month the table grows by several hundred thousand records. The SQL query that feeds the chart takes up to 50 seconds, which is far too long - I can't afford that.
So I need to create a table (an aggregate table / data mart) into which I can insert the pre-calculated number of interested users for all combinations. Thanks to this, the end user will not have to wait for the data to be counted.
Now the question is - how to create such a table in PostgreSQL?
I know how to write a SQL query that calculates one specific example:
SELECT
    month,
    count(DISTINCT user_id) interested_users
FROM
    user_preferences
WHERE
    month BETWEEN '2020-01' AND '2020-03'
    AND city = 'Madrid'
    AND district = 'Latina'
    AND rooms IN (1,2)
    AND price_max BETWEEN 400001 AND 500000
GROUP BY
    1
The question is: how do I calculate all possible combinations? Can I write multiple nested loops in SQL?
The topic is extremely important to me, and I think it will also be useful to others in the future.
I will be extremely grateful for any tips.
Well, based on your query, you have the following filters:
month
city
district
rooms
price_max
You can try creating a view with the following structure:
SELECT month
      ,city
      ,district
      ,rooms
      ,price_max
      ,count(DISTINCT user_id) AS interested_users
FROM user_preferences
GROUP BY month
        ,city
        ,district
        ,rooms
        ,price_max
You can make this view materialized, so the query behind the view is not executed every time it is queried; it will behave like a table.
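For example, a minimal sketch of the materialized version (using the name my_view to match the REFRESH statement below):

CREATE MATERIALIZED VIEW my_view AS
SELECT month
      ,city
      ,district
      ,rooms
      ,price_max
      ,count(DISTINCT user_id) AS interested_users
FROM user_preferences
GROUP BY month, city, district, rooms, price_max;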
When you are adding new records to the base table, you will need to refresh the view (unfortunately, PostgreSQL does not support automatic refresh like some other databases):
REFRESH MATERIALIZED VIEW my_view;
or you can schedule a task.
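For instance, if the pg_cron extension happens to be available (an assumption - it is not part of core PostgreSQL), the refresh could be scheduled along these lines:

-- refresh every night at 03:00 (requires the pg_cron extension)
SELECT cron.schedule('refresh-my-view', '0 3 * * *', 'REFRESH MATERIALIZED VIEW my_view');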
If you are using only exact search for each field, this will work. But in your example, you have criteria like:
month BETWEEN '2020-01' AND '2020-03'
AND rooms IN (1,2)
AND price_max BETWEEN 400001 AND 500000
In such cases, I usually write the same query but SUM the pre-aggregated data from the materialized view, as sketched below. In your case, you are using DISTINCT, and this may lead to counting a user multiple times.
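A rough version of your original query rewritten against the materialized view (my_view as above):

SELECT month
      ,SUM(interested_users) AS interested_users  -- may overcount users that match several combinations
FROM my_view
WHERE month BETWEEN '2020-01' AND '2020-03'
  AND city = 'Madrid'
  AND district = 'Latina'
  AND rooms IN (1,2)
  AND price_max BETWEEN 400001 AND 500000
GROUP BY month;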
If this is an issue, you would need to pre-calculate too many combinations, and I doubt that is the answer. Alternatively, you can try to normalize your data - this will improve the performance of the aggregations.
I have a table which has an id and a date. (id, date) make up the composite key for the table.
What I am trying to do is delete all entries older than a specific date.
delete from my_table where date < '2018-12-12'
The query plan shows that it will do a sequential scan on the date column.
I somehow want to make use of the existing index, since the number of distinct ids is very small compared to the total number of rows in the table.
How do I do it? I have tried searching for a solution, but to no avail.
If your use case involves data archival on a monthly basis (or some other time period), you can consider updating your database table to use partitions.
Let's say you collect data on a monthly basis and want to keep only the last 5 months. It would be really efficient to partition the table by month (a rough sketch follows the list below).
This will:
optimise your READ queries (table scans will reduce to partition scans)
optimise your DELETE requests (just delete the complete partition)
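Here is a minimal sketch of what monthly range partitioning could look like, assuming PostgreSQL 11 or later (the mention of a sequential scan suggests PostgreSQL, but that is an assumption; table and partition names are made up):

CREATE TABLE my_table (
    id   integer NOT NULL,
    date date    NOT NULL,
    PRIMARY KEY (id, date)
) PARTITION BY RANGE (date);

CREATE TABLE my_table_2018_12 PARTITION OF my_table
    FOR VALUES FROM ('2018-12-01') TO ('2019-01-01');

-- deleting a whole month then becomes a cheap metadata operation:
DROP TABLE my_table_2018_12;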
You need an index on date for this query:
create index idx_mytable_date on mytable(date);
Alternatively, you can drop your existing index and add a new one on (date, id); date needs to be the leading column for this query to use the index.
I have a task to design a SQL database that will record values for the company's commodity products (SKU numbers). The company has 7,600 unique product items that need to be tracked, and each product will have approximately 200 values over the course of a year (one value per product per day).
My first guess is that the sku numbers go top to bottom (each sku has a row) and each date is a column.
The data will be used to view in chart / graph format and additional calculations will be displayed against those (such as percentage profit margin etc)
My question is:
- Is this layout advisable?
- Do I have to be cautious of anything if this type of data goes back about 15 years (each table would represent a year)?
Any suggestions?
It is better to have only 3 columns, instead of the many you are suggesting:
sku  date        value
-----------------------
1    2011-01-01  10
1    2011-01-02  12
2    2011-01-01  5
This way you can easily add another column if you want to record something else about a given product per date.
I would suggest a table for your products, and a table for the historical values. Maybe create an index for the historical values based on date if you plan to select for specific time periods.
create table products (
    id          integer primary key,
    sku         integer,
    name        text,
    description text
);

create table price_history (
    id          integer primary key,
    product_id  integer references products (id),
    price_date  date,
    value       numeric
);

create index idx_price_history_date on price_history (price_date);

NOTE: adjust the types and syntax to your particular database; names such as desc and values were avoided above because they are reserved words in many databases.
If you do it like #fiver wrote, you don't need a table for each year either - everything goes into one table. And add an index on sku/date for faster searching.
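For example (the three-column table above wasn't given a name, so product_prices below is a made-up name):

create index idx_product_prices_sku_date on product_prices (sku, date);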
I'd appreciate any help you can offer - I'm currently trying to decide on a schema for a voting app I'm building with PHP/MySQL, but I'm completely stuck on how to optimise it. The key requirements are to allow only one vote per user per item, and to be able to build a chart of the top items of the month, based on votes received that month.
So far the initial schema is:
Items_table
item_id
total_points
(lots of other fields unrelated to voting)
Voting_table
voting_id
item_id
user_id
vote (1 = up; 0 = down)
month_cast
year_cast
So I'm wondering: is it a case of selecting everything from the voting table where month = currentMonth and year = currentYear, running a count, and grouping by item_id? If so, how would I go about doing that? Or would I be better off creating a separate table for monthly charts that is updated with each vote - but then should I be concerned about having to update three database tables per vote?
I'm not particularly competent – if it shows – so would really love any help / guidance someone could provide.
Thanks,
_just_me
I wouldn't add separate tables for monthly charts; to prevent users from casting more than one vote per item, you could use a unique key on voting_table(item_id, user_id).
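For example, assuming MySQL and the table/column names from your schema:

ALTER TABLE Voting_table
    ADD UNIQUE KEY uq_item_user (item_id, user_id);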
As for the summary, you should be able to use a simple query like
select item_id, vote, count(*), month_cast, year_cast
from voting_table
group by item_id, vote, month_cast, year_cast
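And to restrict it to the current month's chart (assuming month_cast and year_cast hold numeric month and year values - adjust if you store them differently):

select item_id, vote, count(*)
from voting_table
where month_cast = month(curdate())
  and year_cast = year(curdate())
group by item_id, vote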
I would use a voting table similar to this:
create table votes(
    item_id   int
   ,user_id   int
   ,vote_dtm  datetime
   ,vote      tinyint
   ,primary key(item_id, user_id)
   ,foreign key(item_id) references item(item_id)
   ,foreign key(user_id) references users(user_id)
) Engine=InnoDB;
Using a composite primary key on an InnoDB table will cluster the data around the items, making it much faster to find the votes related to an item. I added a column vote_dtm which holds the timestamp of when the user voted.
Then I would create one or several views, used for reporting purposes.
create view votes_monthly as
select item_id
,year(vote_dtm) as year
,month(vote_dtm) as month
,sum(vote) as score
,count(*) as num_votes
from votes
group
by item_id
,year(vote_dtm)
,month(vote_dtm);
If you start having performance issues, you can replace the view with a table containing pre-computed values without even touching the reporting code.
Note that I used both count(*) and sum(vote). The count(*) returns the number of votes cast, whereas the sum returns the number of up-votes. However, if you changed the vote column to use +1 for upvotes and -1 for downvotes, sum(vote) would return a score much like the way votes on Stack Overflow are calculated.
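If that day comes, a sketch of swapping the view for a pre-computed table with the same name (so the reporting queries stay untouched) could look like this - the refresh would have to be run periodically, e.g. from a cron job:

drop view votes_monthly;

create table votes_monthly as
select item_id
      ,year(vote_dtm)  as year
      ,month(vote_dtm) as month
      ,sum(vote)       as score
      ,count(*)        as num_votes
from votes
group by item_id, year(vote_dtm), month(vote_dtm);

-- periodic refresh:
truncate table votes_monthly;
insert into votes_monthly
select item_id, year(vote_dtm), month(vote_dtm), sum(vote), count(*)
from votes
group by item_id, year(vote_dtm), month(vote_dtm);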
I have a table as below
dbo.UserLogs
-------------------------------------
Id | UserId | Date | Name | P1 | Dirty
-------------------------------------
There can be several records per UserId [even in the millions].
I have a clustered index on the Date column and query this table very frequently over time ranges.
The column 'Dirty' is non-nullable and can only take the values 0 or 1, so I have no index on 'Dirty'.
I have several million records in this table, and in one particular case my application needs to query this table to get all UserIds that have at least one record marked dirty.
I tried this query: select distinct(UserId) from UserLogs where Dirty=1
I have 10 million records in total, and this takes about 10 minutes to run; I want it to run much faster than that.
[I am able to query this table on the Date column in less than a minute.]
Any comments/suggestion are welcome.
My environment: 64-bit, Sybase 15.0.3, Linux
My suggestion would be to reduce the amount of data that needs to be queried by "archiving" log entries to an archive table at suitable intervals.
You can still access all entries if you provide a UNION view over the current and archived log data, while queries against the current log table would have much less data to scan.
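A rough sketch of that approach (UserLogsArchive and AllUserLogs are made-up names, and the cutoff date is only an example):

-- move old rows into an archive table with the same structure
insert into UserLogsArchive select * from UserLogs where Date < '20130101'
delete from UserLogs where Date < '20130101'

-- optional: a view that still exposes everything (run the create view in its own batch)
create view AllUserLogs as
select * from UserLogs
union all
select * from UserLogsArchive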
Add an index containing both the UserId and Dirty fields. Put UserId before Dirty in the index as it has more unique values.
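For example (the index name is made up):

create index idx_userlogs_user_dirty on UserLogs (UserId, Dirty)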