Efficient db design for retrieving 'most popular' table rows

Efficient db design for retrieving 'most popular' table rows - sql

I am planning to create a mysql 5 (myISAM) table that will contain x thousand rows of data. Each row will have a count field, & the rows with the 20 highest count values will be retrieved quite a lot, probably on a one-for-one ratio to every row update. The x thousand rows not in this 20 will not typically be retrieved.
What are the choices wrt choosing an efficient method of querying for the 20 top count values. Data does not have to be realtime live but it helps!

Add an index to the count field, then just a normal query:
SELECT * FROM myTable ORDER BY countField DESC LIMIT 20;

Create an index on the count field. Done.

Every five minutes, empty out and then copy the top 20 to a separate table. To get the top 20, just consult that table. Done!
(Or, you could spend the next few hours tweaking your query and parameters and indices and making it all "just right".)

Related

I am using select top 5 from(select * from tbl2) tbl, will it get all records from tbl2 or will it get only specific records i.e 5

I am using select top 5 from(select * from tbl2) tbl, will it get all records from tbl2 and then it will it get only specific records i.e 5? or do it only get 5 records from internal memory. suppose we have 1000 records in tbl2.

For future reference, SQL actually has a huge amount of amazing documentation/resources online. For instance: This google search.
As to your question, it will pull the top five results matching your criteria, so it depends. It'll go through, and find the first five results matching your criteria. If those are the last results it goes through, it'll still have to do the comparisons and filtering on all rows, the only difference will be that it will have to send less rows to your computer.
For example, let's say we have a table my_table with two columns: Customer_ID (which is unique) and First_Purchase_Date (which is a date value between 2015-01-01 and 2017-07-26). If we simply do SELECT TOP 5 * FROM my_table then it will go through and pull the first five rows it finds, without looking at the rest of the rows. On the other hand, if we do SELECT TOP 5 * FROM my_table WHERE First_Purchase_Date = '2017-05-17' then it will have to go through all the rows until it can find five rows with a First_Purchase_Date of 2017-05-17. If First_Purchase_Date is indexed, this should not be very expensive, as it'll more or less know where to look. If it's not, then it depends on your how SQL has decided to structure your table, and if it has created any useful statistics. Worst case, it could not have any statistics and the desired rows could be the last five in the database, in which case it will have to complete the comparison on all the rows in the database.
By the way, this is a somewhat poor idea, as the columns returned will not necessarily stay consistent over time. It may be a good idea to throw in an ORDER BY clause, to ensure you get the same records every time.

The SELECT TOP clause is used to specify the number of records to return.
It limits the rows returned in a query result set to a specified number of rows or percentage of rows. When TOP is used in conjunction with the ORDER BY clause, the result set is limited to the first N number of ordered rows; otherwise, it returns the first N number of rows in an undefined order.
SELECT TOP number|percent column_name(s)
FROM table_name
WHERE condition;

Does column of integers contains value

What is the fastest regarding performance way to check that integer column contains specific value?
I have a table with 10 million rows in postgresql 8.4. I need to do at least 10000 checks per sec.
Currently i am doing query SELECT id FROM table WHERE id = my_value and then checking does DataReader have rows. But it is quite slow. Is there any way to speed up without loading whole column into memory?

You can select COUNT instead:
SELECT COUNT(*) FROM table WHERE id = my_value
It will return just one integer value - number of rows matching your select condition.

You need two things,
As Marcin pointed out, you want to use the COUNT(*) if all you need is to know how many. You also need an index on that column. The index will have the answer pretty much right at hand. Without the index, Postgresql would still have to go through the entire table to count that one number.
CREATE INDEX id_idx ON table (id) ASC NULLS LAST;
Something of the sort should get you there. Whether it is enough to run the query 10,000/sec. will depend on your hardware...

If you use where id = X then all values matching X will be returned. Suppose 1000 values match X then 1000 values will be returned.
Now, if you only want to check if the value is at least once then after you matched the first value there is no need to process the other 999. Even if you count the values you are still going through all of them.
What I would do in this case is this:
SELECT 1 FROM table
WHERE id = my_value
LIMIT 1
Note that I'm not even returning the id itself. So if you get one record then the value is there.
Of course, in order to improve this query, make sure you have an index on the id column.

Is it possible to query a table without order columns by page

I've a big table which contains more than 100K records, in oracle. I want to get all of the records and save each row to a file with JDBC.
In order to make it faster, I want to create 100 threads to read the data from the table concurrently. I will get the total count of the records in the first sql, then split it to 100 pages, then get one page in a thread with a new connection.
But I've a problem, that there is no any column can be used to order. There is no column with sequence, no accurate timestamp. I can't use a sql query without order by clause to query, since there is no guarantee it will return the data with the same order every time (per this question).
So is it possible to solve it?

Finally, I used rowid to order:
select * from mytable order by rowid
It seems work well.

Index Created but doesn't speed-up on retrieval process

I have created table as bellow
create table T1(num varchar2(20))
then I inserted 3 lac numbers in above table so now it looks like below
num
1
2
3
.
.
300000
Now if I do
select * from T1
then it takes 1min 15sec to completely fetch the records and as I created index on column num and if I use below query then it should be faster to fetch 3 lac records but it takes also 1min15sec for fetch the records
select * from T1 where num between '1' and '300000'
So how the index has improved my retrieval process?

The index does not improve the retrieval process when you are trying to fetch all rows.
The index makes it possible to find a subset of rows much more quickly.

An index can help if you want to retrieve a few rows from a large table. But since you retrieve all rows and since your index contains all the columns of your table, it won't speed up the query.
Furthermore, you don't tell us what tool you use to retrieve the data. I guess you use SQL Developer or Toad. So what you measure is the time it takes SQL Developer or Toad to store 300,000 rows in memory in such a way that they can be easily displayed on screen in a scrollable table. You aren't really measuring how long it takes to retrieve them.

To get a test of the effects of having an index in place you might want to try a query such as
SELECT *
FROM T1
WHERE NUM IN ('288888', '188888', '88888')
both with with the index in place, and again after removing the index. You should also collect statistics on the table prior to running the query with the index in place or you may still get a query which performs a full table scan. Share and enjoy.

Selecting 'highest' X rows without sorting

I've got a table with huge amount of data. Lets say 10GB of lines, containing bunch of crap. I need to select for example X rows (X is usually below 10) with highest amount column.
Is there any way how to do it without sorting the whole table? Sorting this amount of data is extremely time-expensive, I'd be OK with one scan through the whole table and selecting X highest values, and letting the rest untouched. I'm using SQL Server.

Create an index on amount then SQL Server can select the top 10 from that and do bookmark lookups to retrieve the missing columns.

SELECT TOP 10 Amount FROM myTable ORDER BY Amount DESC
if it is indexed, the query optimizer should use the index.
If not, I do no see how one could avoid scanning the whole thing...
Wether an index is usefull or not depends on how often you do that search.
You could also consider putting that query into an indexed view. I think this will give you the best benefit/cost ration.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Efficient db design for retrieving 'most popular' table rows - sql

Add an index to the count field, then just a normal query: SELECT * FROM myTable ORDER BY countField DESC LIMIT 20;

Create an index on the count field. Done.

Every five minutes, empty out and then copy the top 20 to a separate table. To get the top 20, just consult that table. Done! (Or, you could spend the next few hours tweaking your query and parameters and indices and making it all "just right".)

Related

I am using select top 5 from(select * from tbl2) tbl, will it get all records from tbl2 or will it get only specific records i.e 5

Does column of integers contains value

Is it possible to query a table without order columns by page

Index Created but doesn't speed-up on retrieval process

Selecting 'highest' X rows without sorting

Categories

Resources