I have an interesting problem: I want to enforce a specific limit on how many offers a user can place. The offers are saved in a PostgreSQL (version 10) database and should not exceed 1000 per user.
I am using the following sql query to check how many offers a user has and check it against the limit:
select count(*) from offers where offers.userId = 'b27e1d2f-c2c1-4d0b-8451-287013d7b716';
In the performance metrics I see that most of the time is spent on this query. Therefore I looked it up and found this: https://wiki.postgresql.org/wiki/Slow_Counting
PostgreSQL will still need to read the resulting rows to verify that they exist;
In the query plan I can see that, in addition to the index-only scan, some heap fetches are needed, which I assume slows down the whole query:
Index Only Scan using offers_by_user_id_index on offers
  Index Cond: (account_id = 'b27e1d2f-c2c1-4d0b-8451-287013d7b716'::uuid)
  Heap Fetches: 650
What are ways to speed this up?
Is tracking the row count a good approach to speed up the check?
Thanks for your help!
Edit: userId is a UUID and an index exists on that column.
The number of heap fetches suggests that the table is not vacuumed often enough. If you manually VACUUM it, does that speed things up?
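For example, something along these lines should show whether the heap fetches go away (a quick check, not a permanent fix; tuning autovacuum is the longer-term lever):
VACUUM (VERBOSE, ANALYZE) offers;
EXPLAIN (ANALYZE, BUFFERS)
select count(*) from offers where offers.userId = 'b27e1d2f-c2c1-4d0b-8451-287013d7b716';
With an up-to-date visibility map, Heap Fetches in the new plan should be close to zero.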
I'd say that the right tool for that is de-normalization:
In the users table, add a column offersCount
Create an index on the users table on (userId, offersCount)
Add two triggers to the offers table (a sketch follows below):
Insert trigger - updates the users table and increments the offersCount column
Delete trigger - updates the users table and decrements the offersCount column
With this approach there will be almost no latency on the check.
Note: if you don't want to touch the users table, just create a new one with only two columns.
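A minimal sketch of those triggers in PL/pgSQL, assuming a users.offers_count column and that offers.userId references users.id (the names here are assumptions, adjust them to your schema):

alter table users add column offers_count integer not null default 0;

create or replace function maintain_offers_count() returns trigger as $$
begin
    if TG_OP = 'INSERT' then
        -- increment the counter for the offer's owner
        update users set offers_count = offers_count + 1 where id = NEW.userId;
        return NEW;
    else  -- DELETE
        -- decrement the counter for the offer's owner
        update users set offers_count = offers_count - 1 where id = OLD.userId;
        return OLD;
    end if;
end;
$$ language plpgsql;

create trigger offers_count_trg
after insert or delete on offers
for each row execute procedure maintain_offers_count();

The limit check then becomes a single-row read of users.offers_count instead of counting up to 1000 index entries per request.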
First, your ids are presumably numbers, so the comparisons should not be to a string. So:
select count(*)
from offers
where offers.userId = 1;
For this query, I recommend an index on offers(userId). That might be a big help.
This might be a situation where storing the ids in an array is beneficial. Then you can just add:
alter table users add constraint chk_offers check (array_length(offers, 1) <= 1000)
This also changes how to insert and delete values.
For many purposes this will work well. It does not work well if you care about keeping lots of other information about the user/offer, such as creation date, offer date, channel, and so on.
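A rough sketch of that array variant (the users.offers array column and users.id are assumptions, not part of the original schema):

alter table users add column offers uuid[] not null default '{}';
alter table users add constraint chk_offers check (array_length(offers, 1) <= 1000);

-- "inserting" an offer now means appending its id to the array
update users
set offers = array_append(offers, '00000000-0000-0000-0000-000000000001'::uuid)  -- placeholder offer id
where id = 'b27e1d2f-c2c1-4d0b-8451-287013d7b716';

The check constraint then rejects the 1001st append automatically.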
Related
I have a huge table (200 million records). About 70% of it is not needed now (there is a column ACTIVE in the table and those records have the value 'N'). There are a lot of multi-column indexes, but none of them includes that column. Will removing those 70% of records improve SELECT (ACTIVE='Y') performance (because Oracle has to read table blocks with no active records and then exclude them from the final result)? Is a shrink space necessary?
It's really impossible to say without knowing more about your queries.
At one extreme, access by primary key would only improve if the height of the supporting index was reduced, which would probably require deletion of the rows and then a rebuild of the index.
At the other extreme, if you're selecting nearly all active records then a full scan of the table with 70% of the rows removed (and the table shrunk) would take only 30% of the pre-deletion time.
There are many other considerations -- for example, selecting a set of data and accessing the table via indexes, then needing to reject 99% of the rows after reading the table because it turns out that there is a positive correlation between the required rows and an inactive status.
One way of dealing with this would be through list partitioning the table on the ACTIVE column. That would move inactive records to a partition that could be eliminated from many queries, with no need to index the column, and would keep the time for full scans of active records down.
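In Oracle, the list-partitioned version might look roughly like this (the column list is trimmed and the names are assumptions):

create table my_table_part (
  id        number,
  client_id number,
  active    varchar2(1) not null
)
partition by list (active) (
  partition p_active   values ('Y'),
  partition p_inactive values ('N')
);

-- queries filtering on active = 'Y' can then prune the inactive partition entirely
select count(*) from my_table_part where active = 'Y';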
If you really do not need these inactive records, why not just delete them instead of marking them inactive?
Edit: Furthermore, although indexing a column with a 70/30 split is not generally helpful, you could try a couple of other indexing tricks.
For example, if you have an indexed column which is frequently used in queries (client_id?) then you can add the active flag to that index. You could also construct a partial index:
create index my_table_active_clients
on my_table (case when active = 'Y' then client_id end);
... and then query on:
select ...
from ...
where (case when active = 'Y' then client_id end) = :client_id
This would keep the index smaller, and both indexing approaches would probably be helpful.
Another edit: A beneficial side effect of partitioning could be that it keeps the inactive records and active records "physically" apart, and every block read into memory from the "active" partition of course only has active records. This could have the effect of improving your cache efficiency.
Partitioning, putting the active='N' records in a separate partition, might be a good option.
http://docs.oracle.com/cd/B19306_01/server.102/b14223/parpart.htm
Yes, it most likely will. But depending on your access patterns, the improvement will most likely not be as big as you hope. Setting up an index that includes the column would be a better solution for the future, IMHO.
Most probably not. A DELETE will not reduce the size of the table's segment. Additional maintenance might help; after the DELETE, also execute:
ALTER TABLE <tablename> SHRINK SPACE COMPACT;
ALTER INDEX <indexname> SHRINK SPACE COMPACT; -- for every table's index
Alternatively you can use old school approach:
ALTER TABLE <tablename> MOVE;
ALTER INDEX <indexnamename> REBUILD;
When deleting 70% of a table, also consider CTAS (create table as select) as a possible solution. It will be much faster.
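A rough sketch of the CTAS route (Oracle; table names are placeholders, and you would need to recreate indexes, constraints and grants afterwards):

create table my_table_active as
  select * from my_table where active = 'Y';

-- then swap the tables, e.g.:
-- drop table my_table;
-- alter table my_table_active rename to my_table;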
Indexing plays a vital role in SELECT queries. Performance will increase drastically if you use those indexed columns in the query. Yes, deleting rows will enhance performance somewhat, but not drastically.
I have a table that shows the users currently connected to a system, but I receive about 100 new connections every minute, and about 100 users also leave the site every minute. If I have to query on a particular column, is it sensible to create a secondary index for that column (considering that the table content changes every minute)?
Would it make any difference if the query was an aggregation (like the count of users at a given hour)?
Thanks!
You should create indexes on columns as required by your queries.
Take into consideration that every index you have on a table will increase the load on the server for update/insert queries. You must weigh the performance benefit of having the index against the decrease in performance for modification queries.
Trial and error can be a good approach.
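As an illustration only (the table and column names below are made up, not taken from the question):

-- index the column you filter on most often
create index idx_connections_room on connections (room_id);

-- an aggregation such as "count of users connected in a given hour" benefits
-- more from an index on the time column
create index idx_connections_connected_at on connections (connected_at);

select count(*)
from connections
where connected_at >= '2019-05-01 10:00:00'
  and connected_at <  '2019-05-01 11:00:00';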
Friends,
I have already implemented paging in my SP -
with MyData As (
select ROW_NUMBER() over (order by somecolumn desc) AS [Row],
x,y,z,...
)
Select x,y,z,...
From MyData
Where [Row] between ((@currentPage - 1) * @pageSize + 1) and (@currentPage * @pageSize)
The problem here is that data is retrieved very fast if the WITH clause returns a small number of rows, but it takes a long time when there are millions of records. Sometimes it times out.
Is there any other alternative?
Thanks for sharing your valuable time.
SQL Server optimisation is a very broad subject, and it is pretty much impossible to work out the issue with the limited amount of information you have posted. However, since you're in a rush for a solution: first, I would suggest checking your actual execution plan, posting it here, and making sure that the index is actually being used. If it is not, consider using the FASTFIRSTROW table hint to force the index to be used; be aware that it can improve things in some cases and make them worse in others.
Next, consider SQL parameter sniffing. It's unlikely from what you have said, but possible.
For large-scale performance gains you may need to look at architectural changes. At the very least, ensure that your transaction logs are on a different disk from your data. The reason you separate the database files from the log files is that database access is random and log access is sequential; best practice dictates that you don't mix those two I/O types on the same disk.
Also, if you've got millions of rows, then you really need to consider splitting the data across multiple disks.
Finally, I would strongly consider partitioning either the table or the index.
The reason your query is slow is that you have to sort the whole table on every request. To speed it up significantly you need to avoid sorting a big chunk of data, at the cost of CPU, disk/memory, or limitations on the pagination logic.
As there is not much information about how your table is sorted and whether you insert in the middle / delete entries very often, I'll narrow down your question by making these assumptions:
I would imagine you have a table storing an archive of articles. New entries mostly go at the bottom of the table, and entries from the middle of the table are rarely deleted.
You sort always by the same column somecolumn and in the same order, e.g. descending.
You do not have any user entered filters (like article title or author).
This makes the table static in terms of the output: each article stays in the same place unless a new one is inserted, and new ones come to the top of your output. You can then store ROW_NUMBER() OVER () as a column; a more convenient solution is an IDENTITY column. It will speed things up further if you create a clustered index on this column.
alter table MyTable add [Record_Number] int null
The new column is added as nullable so you can populate its values for the existing rows first; then you can make it not null. (An IDENTITY column would be populated automatically instead.)
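A sketch of that one-time backfill, assuming the table is MyTable and the output is ordered by somecolumn as in the question:

;with numbered as (
    select Record_Number,
           row_number() over (order by somecolumn asc) as rn
    from MyTable
)
update numbered set Record_Number = rn;

-- once populated, make the column mandatory
alter table MyTable alter column Record_Number int not null;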
You can then get the last row number very quickly with:
select @Max_Row = MAX(Record_Number) from MyTable
Now that you have the total number of rows, the page size and the page number, you can select the rows you need in one statement without sorting the whole lot:
Select * From MyTable
Where Record_Number between
      (@Max_Row - @Page * @Page_Size) + 1 AND
      @Max_Row - (@Page - 1) * @Page_Size
If you do have a filter in your CTE, then give some more information about how your data is structured, so we can think of a way to limit the scope of the CTE.
I have a device I'm polling for lots of different fields every x milliseconds.
The device returns a list of ids and values which I need to store with a timestamp in a DB of sorts.
Users of the system need to be able to query this DB for historic logs to create graphs, or query the last timestamp for each value.
A simple approach would be to define a MySQL table with
id,value_id,timestamp,value
and let users select
Select value from t where value_id=x order by timestamp desc limit 1
and just push everything there with an index on timestamp and id. But my question is: what's the best approach, performance- and size-wise, for designing the schema? Or should I use NoSQL? Can anyone comment on possible design trade-offs? Will such a design scale to millions of records?
When you say "... or query the last timestamp for each value" is this what you had in mind?
select max(timestamp) from T where value = ?
If you have millions of records, and the above is what you meant (i.e. value is alone in the WHERE clause), then you'd need an index on the value column, otherwise you'd have to do a full table scan. But if queries will ALWAYS have [timestamp] column in the WHERE clause, you do not need an index on [value] column if there's an index on timestamp.
You need an index on the timestamp column if your users will issue queries where the timestamp column appears alone in the WHERE clause:
select * from T where timestamp > x and timestamp < y
You could index all three columns, but you want to make sure the writes do not slow down because of the indexing overhead.
The rule of thumb when you have a very large database is that every query should be able to make use of an index, so you can avoid a full table scan.
EDIT:
Adding some additional remarks after your clarification.
I am wondering how you will know the id? Is [id] perhaps a product code?
A single simple index on id might not scale very well if there are not many different product codes, i.e. if it's a low-cardinality index. The rebalancing of the trees could slow down the batch inserts that are happening every x milliseconds. A composite index on (id,timestamp) would be better than a simple index.
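Illustratively, using the question's column names (here I am assuming the answer's "id" corresponds to the value/product id being filtered on):

create index idx_t_value_id_ts on t (value_id, timestamp);

-- the "latest value" query can then walk the index backwards instead of scanning
select value from t where value_id = x order by timestamp desc limit 1;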
If you rarely need to sort multiple products but are most often selecting based on a single product code, then a non-traditional DBMS that uses a hashed-key sparse table rather than a b-tree might be a very viable, even superior, alternative for you. In such a database, all of the records for a given key would be found physically on the same set of contiguous "pages"; the hashing algorithm looks at the key and returns the page number where the record will be found. There is no need to rebalance an index, as there isn't an index, and so you completely avoid the related scaling worries.
However, while hashed-file databases excel at low-overhead nearly instant retrieval based on a key value, they tend to be poor performers at sorting large groups of records on an attribute, because the data are not stored physically in any meaningful order, and gathering the records can involve much thrashing. In your case, timestamp would be that attribute. If I were in your shoes, I would base my decision on the cardinality of the id: in a dataset of a million records, how many DISTINCT ids would be found?
YET ANOTHER EDIT SINCE THE SITE IS NOT LETTING ME ADD ANOTHER ANSWER:
The simplest way is to have two tables: one with the ongoing history, which always has new values inserted, and the other, containing only 250 records, one per part, where the latest value overwrites/replaces the previous one.
Update latest
set value = x
where id = ?
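A minimal sketch of that two-table pattern in MySQL (table and column names are assumptions):

create table history (
  id    int      not null,
  ts    datetime not null,
  value double   not null,
  key idx_id_ts (id, ts)
);

create table latest (
  id    int      not null primary key,
  ts    datetime not null,
  value double   not null
);

-- on every poll: append to the history, upsert the latest value
insert into history (id, ts, value) values (?, now(), ?);
insert into latest  (id, ts, value) values (?, now(), ?)
  on duplicate key update ts = values(ts), value = values(value);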
You have a choice of
Indexes (composite; covering value_id, timestamp and value, or some combination of them): you should test performance with different indexes, composite and non-composite. Also be aware that there are quite a few significantly different ways to get 'max per group' (search Stack Overflow, especially for the MySQL version with variables).
Triggers - you might use triggers to maintain the max row values in another table (best performance for subsequent selects; this data is redundant and could even be kept in memory).
Lazy statistics/triggers - since your database is updated quite often, you can save cycles if you update your statistics only periodically (if you can allow the stats to be y seconds old and you poll 1000 / x times a second, then you potentially save y * 1000 / x updates; this can be noticeable, especially in terms of scalability).
The above is true if you are looking for the last bit of performance; if not, keep it simple.
ASP.NET and SQL Server: I have SQL queries for selecting a subset of rows, and I need the count(*) frequently.
Of course I can run a select count(*) for each of these queries in each roundtrip, but that will soon become too slow.
How do you make it really fast?
Are you experiencing a problem that can't be solved by adding another index to your table? COUNT(*) operations are usually O(log n) in terms of total rows, and O(n) in terms of returned rows.
Edit: What I mean is (in case I misunderstood your question)
Given this structure:
CREATE TABLE emails (
id INT,
.... OTHER FIELDS
)
CREATE TABLE filters (
filter_id int,
filter_expression nvarchar(max) -- Or whatever...
)
Create the mapping table:
CREATE TABLE email_filter_matches (
filter int,
email int,
CONSTRAINT pk_email_filter_matches PRIMARY KEY(filter, email)
)
The data in this table would have to be updated every time a filter is updated, or when a new email is received.
Then, a query like
SELECT COUNT(*) FROM email_filter_matches WHERE filter = @filter_id
should be O(log n) with regard to total number of filter matches, and O(n) in regard to number of matches for this particular filter. Since your example shows only a small number of matches (which seems realistic when it comes to email filters), this could very well be OK.
If you really want to, of course you could create a trigger on the email_filter_matches table to keep a cached value in the filters table in sync, but that can be done the day you hit performance issues. It's not trivial to get these kinds of things right in concurrent systems.
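If you do go that way, the trigger would have roughly this shape (the match_count column is an assumption, and as noted above, concurrency needs care):

ALTER TABLE filters ADD match_count int NOT NULL DEFAULT 0;
GO
CREATE TRIGGER trg_email_filter_matches_count
ON email_filter_matches
AFTER INSERT, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- apply the net change per filter from the inserted/deleted pseudo-tables
    UPDATE f
    SET match_count = f.match_count + ISNULL(i.cnt, 0) - ISNULL(d.cnt, 0)
    FROM filters f
    LEFT JOIN (SELECT filter, COUNT(*) AS cnt FROM inserted GROUP BY filter) i ON i.filter = f.filter_id
    LEFT JOIN (SELECT filter, COUNT(*) AS cnt FROM deleted  GROUP BY filter) d ON d.filter = f.filter_id
    WHERE i.filter IS NOT NULL OR d.filter IS NOT NULL;
END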
Here are a few ideas for speeding up count(*) at the data tier:
Keep the table and the clustered index as narrow as possible, so that more rows fit per page
Keep the filtering criteria as simple as possible, so the counting goes fast
Do what you can to make sure the rows to be counted are in memory before you start to count them (perhaps using pre-caching)
Make sure your hardware is optimized (enough RAM, fast enough disks, etc)
Consider caching results in separate tables
As an alternative, if only the filters change frequently and not the data itself, you might consider building a cube using Analysis Services, and run your queries against that.