Does a column of integers contain a value - sql

What is the fastest way, performance-wise, to check whether an integer column contains a specific value?
I have a table with 10 million rows in PostgreSQL 8.4. I need to do at least 10,000 checks per second.
Currently I run the query SELECT id FROM table WHERE id = my_value and then check whether the DataReader has rows, but it is quite slow. Is there any way to speed this up without loading the whole column into memory?

You can select COUNT instead:
SELECT COUNT(*) FROM table WHERE id = my_value
It will return just one integer value - number of rows matching your select condition.

You need two things.
As Marcin pointed out, you want to use COUNT(*) if all you need is to know how many. You also need an index on that column. The index will have the answer pretty much right at hand. Without the index, PostgreSQL would still have to go through the entire table to count that one number.
CREATE INDEX id_idx ON table (id ASC NULLS LAST);
Something of the sort should get you there. Whether it is enough to run the query 10,000 times per second will depend on your hardware...

If you use WHERE id = X, then all rows matching X will be returned. If 1000 rows match X, then 1000 rows will be returned.
Now, if you only want to check whether the value occurs at least once, then after you have matched the first row there is no need to process the other 999. Even if you count the rows, you are still going through all of them.
What I would do in this case is this:
SELECT 1 FROM table
WHERE id = my_value
LIMIT 1
Note that I'm not even returning the id itself. So if you get one record then the value is there.
Of course, in order to improve this query, make sure you have an index on the id column.
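As a rough sketch of the LIMIT 1 existence check with an index, here is a small runnable example. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption made only so the snippet is self-contained); the table name items, the index name items_id_idx and the helper contains are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the real PostgreSQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER)")
conn.executemany("INSERT INTO items (id) VALUES (?)",
                 [(n,) for n in range(0, 1000, 2)])  # even numbers only

# The index lets the lookup go straight to the value instead of scanning the table.
conn.execute("CREATE INDEX items_id_idx ON items (id)")

def contains(conn, value):
    """Return True if at least one row has id = value."""
    row = conn.execute(
        "SELECT 1 FROM items WHERE id = ? LIMIT 1", (value,)
    ).fetchone()
    return row is not None

print(contains(conn, 42))  # 42 was inserted
print(contains(conn, 43))  # 43 was not inserted
```

The query returns at most one row, so the caller only has to test whether anything came back.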

Related

How to get tables for distinct value of a variable that is multiple times in the data

I haven't worked with SQL much. I have a dataset with the columns A, Index, Fail and Country. Here A is unique, but we don't need it for the analysis. What I want is to find which countries have the most fails, counted per distinct index number. So what I tried is
SELECT Index, Count(Fail), Country
FROM Data
GROUP BY Country
SORT BY Count(Fail) DESC
But for a single index we might have multiple fails, and I want to count only one fail per index number. For instance, if Index 1 has 2 fails, Index 2 has 1 fail, and Index 3 has 4 fails, I want the count to be 3, not 2+1+4=7. FYI, each row in the table represents either a fail or not, so the Fail values are either 0 or 1. I think I need to include a SUM or DISTINCT clause, but I'm not sure how to do it.
You can put DISTINCT in the COUNT() as follows:
SELECT Index, Count(DISTINCT Fail), Country
FROM Data
GROUP BY Index, Country
ORDER BY Count(DISTINCT Fail) DESC
I've added Index to the GROUP BY so that selecting it does not cause an error. Also note that the sorting clause is ORDER BY, not SORT BY.
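Since Fail only holds 0 or 1, COUNT(DISTINCT Fail) can never exceed 2. One possible reading of the question is "per country, how many distinct index numbers have at least one fail". The sketch below shows that variant. It runs on SQLite through Python's sqlite3 module; the sample rows are made up, and the column is named idx here because Index is a reserved word (both are assumptions for illustration).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: (a, idx, fail, country)
conn.execute("CREATE TABLE data (a INTEGER, idx INTEGER, fail INTEGER, country TEXT)")
rows = [
    (1, 1, 1, "DE"), (2, 1, 1, "DE"),                    # index 1: 2 fails
    (3, 2, 1, "DE"),                                     # index 2: 1 fail
    (4, 3, 1, "DE"), (5, 3, 1, "DE"),
    (6, 3, 1, "DE"), (7, 3, 1, "DE"),                    # index 3: 4 fails
    (8, 4, 0, "DE"),                                     # index 4: no fail
    (9, 5, 1, "FR"), (10, 5, 0, "FR"),                   # index 5: 1 fail
]
conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)

# Count each failing index once per country, most failures first.
query = """
    SELECT country, COUNT(DISTINCT idx) AS failed_indexes
    FROM data
    WHERE fail = 1
    GROUP BY country
    ORDER BY failed_indexes DESC
"""
result = conn.execute(query).fetchall()
print(result)  # DE has 3 failing indexes (not 7), FR has 1
```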

Two questions on PostgreSQL performance

1) What is the best way to implement paging in PostgreSQL?
Assume we need to implement paging. The simplest query is select * from MY_TABLE order by date_field DESC limit 10 offset 20. As far as I understand, there are two problems here: if the dates can contain duplicate values, each run of this query may return different results, and the larger the offset, the longer the query runs. So we have to add an extra column, date_field_index:
--date_field--date_field_index--
12-01-2012 1
12-01-2012 2
14-01-2012 1
16-01-2012 1
--------------------------------
Now we can write something like
create index MY_INDEX on MY_TABLE (date_field, date_field_index);
select * from MY_TABLE where date_field<=last_page_date and not (date_field_index>=last_page_date_index and date_field=last_page_date) order by date_field DESC, date_field_index DESC limit 20;
..thus using the where clause and corresponding index instead of offset. OK, now the questions:
1) is this the best way to improve the initial query?
2) How can we populate that date_field_index field? Do we have to provide a trigger for this?
3) We should not use row_number() functions in Postgres because they do not use indexes and are thus very slow. Is that correct?
2) Why does column order in a concatenated index not affect query performance?
My measurements show that when searching with a concatenated index (an index consisting of two or more columns), it makes no difference whether the most selective column is placed first or last. Why? If we place the most selective column first, we scan a shorter range of matching rows, which should have an impact on performance. Am I right?
Use the primary key as the tie-breaker instead of the date_field_index column. Otherwise, explain why that is not an option.
order by date_field DESC, "primary_key_column(s)" DESC
The combined index with the most unique column first is the best performer, but it will not be used if:
the distinct values are more than a few percent of the table
there aren't enough rows to make it worthwhile
the range of dates is not small enough
What is the output of explain my_query?
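To make the "primary key as tie-breaker" idea concrete, here is a minimal sketch of paging by the last seen (date_field, id) pair instead of OFFSET. It runs on SQLite via Python's sqlite3 module as a stand-in for PostgreSQL; the id column, the sample rows and the helper fetch_page are assumptions, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, date_field TEXT)")
conn.executemany(
    "INSERT INTO my_table (id, date_field) VALUES (?, ?)",
    [(1, "2012-01-12"), (2, "2012-01-12"), (3, "2012-01-14"),
     (4, "2012-01-16"), (5, "2012-01-16")],
)
# Composite index matching the ORDER BY columns.
conn.execute("CREATE INDEX my_index ON my_table (date_field, id)")

def fetch_page(conn, page_size, last=None):
    """Return the next page, newest first.

    `last` is the (date_field, id) pair of the last row of the previous page,
    or None for the first page.
    """
    if last is None:
        sql = ("SELECT date_field, id FROM my_table "
               "ORDER BY date_field DESC, id DESC LIMIT ?")
        return conn.execute(sql, (page_size,)).fetchall()
    last_date, last_id = last
    sql = """
        SELECT date_field, id FROM my_table
        WHERE date_field < ? OR (date_field = ? AND id < ?)
        ORDER BY date_field DESC, id DESC
        LIMIT ?
    """
    return conn.execute(sql, (last_date, last_date, last_id, page_size)).fetchall()

page1 = fetch_page(conn, 2)
page2 = fetch_page(conn, 2, last=page1[-1])
page3 = fetch_page(conn, 2, last=page2[-1])
print(page1, page2, page3)
```

Because the id is unique, rows with the same date always come back in the same order, and each page starts right after the previous one regardless of how deep it is.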

What is the most efficient way to count rows in a table in SQLite?

I've always just used "SELECT COUNT(1) FROM X" but perhaps this is not the most efficient. Any thoughts? Other options include SELECT COUNT(*) or perhaps getting the last inserted id if it is auto-incremented (and never deleted).
How about if I just want to know if there is anything in the table at all? (e.g., count > 0?)
The best way is to make sure that you run SELECT COUNT on a single column (SELECT COUNT(*) is slower) - but SELECT COUNT will always be the fastest way to get a count of things (the database optimizes the query internally).
If you check out the comments below, you can see arguments for why SELECT COUNT(1) is probably your best option.
To follow up on girasquid's answer, as a data point, I have a SQLite table with 2.3 million rows. Using select count(*) from table, it took over 3 seconds to count the rows. I also tried SELECT rowid FROM table (thinking that rowid is a default indexed primary key), but that was no faster. Then I made an index on one of the fields in the database (just an arbitrary field, but I chose an integer field because I knew from past experience that indexes on short fields can be very fast, I think because the index stores a copy of the value in the index itself). SELECT my_short_field FROM table brought the time down to less than a second.
If you are sure (really sure) that you've never deleted any row from that table and your table has not been defined with the WITHOUT ROWID optimization you can have the number of rows by calling:
select max(RowId) from table;
Or if your table is a circular queue you could use something like
select MaxRowId - MinRowId + 1 from
(select max(RowId) as MaxRowId from table) JOIN
(select min(RowId) as MinRowId from table);
This is really, really fast (milliseconds), but you must be careful: SQLite only guarantees that the row id is unique among all rows in the same table. It does not guarantee that row ids are, or always will be, consecutive numbers.
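A small sketch of why max(RowId) only matches the row count when nothing has been deleted, using Python's built-in sqlite3 module. The table t and its contents are hypothetical. The last query also shows one way to answer the "is there anything in the table at all" part of the question with EXISTS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (val TEXT)")
conn.executemany("INSERT INTO t (val) VALUES (?)",
                 [("a",), ("b",), ("c",), ("d",)])

# On a fresh table with no deletes, both numbers agree.
count_before = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
max_before = conn.execute("SELECT MAX(rowid) FROM t").fetchone()[0]

# After a delete in the middle, MAX(rowid) no longer equals the row count.
conn.execute("DELETE FROM t WHERE val = 'b'")
count_after = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
max_after = conn.execute("SELECT MAX(rowid) FROM t").fetchone()[0]

# Emptiness check that can stop at the first row found.
has_rows = conn.execute("SELECT EXISTS (SELECT 1 FROM t)").fetchone()[0]

print(count_before, max_before, count_after, max_after, has_rows)
```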
The fastest way to get row counts is directly from the table metadata, if any. Unfortunately, I can't find a reference for this kind of data being available in SQLite.
Failing that, any query of the type
SELECT COUNT(non-NULL constant value) FROM table
should optimize to avoid the need for a table, or even an index, scan. Ideally the engine will simply return the current number of rows known to be in the table from internal metadata. Failing that, it simply needs to know the number of entries in the index of any non-NULL column (the primary key index being the first place to look).
As soon as you introduce a column into the SELECT COUNT you are asking the engine to perform at least an index scan and possibly a table scan, and that will be slower.
I do not believe you will find a special method for this. However, you could do your select count on the primary key to be a little bit faster.
sp_spaceused 'table_name' (exclude the single quotes). Note that sp_spaceused is a SQL Server stored procedure, not part of SQLite.
This will return the number of rows in the table; it is the most efficient way I have come across so far.
It's more efficient than select Count(1) from 'table_name' (exclude the single quotes).
sp_spaceused can be used for any table. It's very helpful when the table is exceptionally big (hundreds of millions of rows): it returns the number of rows right away, whereas select Count(1) might take more than 10 seconds. Moreover, it does not need any column names or key fields.

How to check if all fields are unique in Oracle?

SELECT myColumn, COUNT(*)
FROM myTable
GROUP BY myColumn
HAVING COUNT(*) > 1
This will return all myColumn values that occur more than once (i.e. that are not unique), along with their number of occurrences.
If the result of this query is empty, then you have unique values in this column.
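Here is a small runnable sketch of that duplicate check. It uses SQLite through Python's sqlite3 module rather than Oracle (an assumption made so the example is self-contained); the GROUP BY ... HAVING query itself is plain SQL, and the sample values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (my_column TEXT)")
conn.executemany("INSERT INTO my_table (my_column) VALUES (?)",
                 [("a",), ("b",), ("b",), ("c",), ("c",), ("c",)])

# Values that occur more than once, with their counts.
duplicates = conn.execute("""
    SELECT my_column, COUNT(*)
    FROM my_table
    GROUP BY my_column
    HAVING COUNT(*) > 1
    ORDER BY my_column
""").fetchall()

# An empty result means every value in the column is unique.
is_unique = not duplicates
print(duplicates, is_unique)
```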
An easy way to do this is to analyze the table using DBMS_STATS. After you do, look at the num_rows column in dba_tables. Then look at dba_tab_columns and compare num_distinct for each column to the number of rows. This is a roundabout way of doing what you want without performing a full table scan, if you are worried about affecting a production system on a huge table. If you want direct results, do what the others suggest and run the query against the table with a GROUP BY.
One way is to create a unique index.
If the index creation fails, you have existing duplicate data; if a later insert fails, it would have produced a duplicate...
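The unique-index approach can be sketched as follows. This again uses SQLite via Python's sqlite3 module as a stand-in for Oracle, so the exact error raised is SQLite's, not Oracle's; the table, index name and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (my_column TEXT)")
conn.executemany("INSERT INTO my_table (my_column) VALUES (?)",
                 [("a",), ("b",), ("b",)])

def try_create_unique_index(conn):
    """Return True if the unique index could be created, False if duplicates exist."""
    try:
        conn.execute("CREATE UNIQUE INDEX my_column_uq ON my_table (my_column)")
        return True
    except sqlite3.DatabaseError:
        return False

# Fails while the duplicate 'b' is present.
created_with_duplicates = try_create_unique_index(conn)

# Remove one of the duplicates, then the index can be created.
conn.execute("""
    DELETE FROM my_table
    WHERE rowid = (SELECT MAX(rowid) FROM my_table WHERE my_column = 'b')
""")
created_after_cleanup = try_create_unique_index(conn)

# From now on, an insert that would introduce a duplicate is rejected.
try:
    conn.execute("INSERT INTO my_table (my_column) VALUES ('a')")
    duplicate_insert_rejected = False
except sqlite3.DatabaseError:
    duplicate_insert_rejected = True

print(created_with_duplicates, created_after_cleanup, duplicate_insert_rejected)
```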

Explanation of sqlite_stat1 table

I'm trying to diagnose why a particular query is slow against SQLite. There seems to be plenty of information on how the query optimizer works, but scant information on how to actually diagnose issues.
In particular, when I analyze the database I get the expected sqlite_stat1 table, but I don't know what the stat column is telling me. An example row is:
MyTable,ix_id,25112 1 1 1 1
What does the "25112 1 1 1 1" actually mean?
As a wider question, does anyone have any good resources on the best tools and techniques for diagnosing SQLite query performance?
Thanks
from analyze.c:
/* Store the results.
**
** The result is a single row of the sqlite_stat1 table. The first
** two columns are the names of the table and index. The third column
** is a string composed of a list of integer statistics about the
** index. The first integer in the list is the total number of entries
** in the index. There is one additional integer in the list for each
** column of the table. This additional integer is a guess of how many
** rows of the table the index will select. If D is the count of distinct
** values and K is the total number of rows, then the integer is computed
** as:
**
** I = (K+D-1)/D
**
** If K==0 then no entry is made into the sqlite_stat1 table.
** If K>0 then it is always the case the D>0 so division by zero
** is never possible.
Remember that an index can be comprised of more than one column of a table. So, in the case of "25112 1 1 1 1", this would be described as a composite index that is made up of 4 columns of a table. The numbers mean as follows:
25112 is an estimate of the total number of rows in the index
The second integer (the first "1") is an estimate of the number of rows that have the same value in the 1st column on the index.
The third integer (the second "1") is an estimate of the number of rows that have the same value for the first TWO columns of the index. It is NOT the "distinctness" of column 2.
The fourth integer (the third "1") is an estimate of the number of rows that have the same values for the first THREE columns of the index.
The same logic applies to the last integer.
The last integer should always be one. Consider a table that has two rows and two columns with a composite index made up of column1+column2. The data in the table is:
Apple,Red
Apple,Green
The stats would look like "2 2 1". Meaning, there are 2 rows in the index. There are two rows that would be returned if only using column1 of the index (Apple and Apple). And 1 unique row that would be returned using column1+column2 (Apple+Red is unique from Apple+Green)
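The Apple example can be reproduced with a short script. This sketch uses Python's built-in sqlite3 module; the table name fruit, the index name fruit_idx and the helper estimated_rows are made up for illustration. It compares what ANALYZE stores with the I = (K+D-1)/D formula from the analyze.c comment, using integer division.

```python
import sqlite3

def estimated_rows(total_rows, distinct_values):
    """I = (K + D - 1) / D with integer division (a rounded-up average)."""
    return (total_rows + distinct_values - 1) // distinct_values

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fruit (name TEXT, color TEXT)")
conn.executemany("INSERT INTO fruit VALUES (?, ?)",
                 [("Apple", "Red"), ("Apple", "Green")])
conn.execute("CREATE INDEX fruit_idx ON fruit (name, color)")
conn.execute("ANALYZE")  # populates sqlite_stat1

stat = conn.execute(
    "SELECT stat FROM sqlite_stat1 WHERE idx = 'fruit_idx'"
).fetchone()[0]

# K = 2 rows; 1 distinct value of name; 2 distinct (name, color) pairs.
by_formula = "2 {} {}".format(estimated_rows(2, 1), estimated_rows(2, 2))
print("from ANALYZE:", stat)
print("from formula:", by_formula)
```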
Also, in I = (K+D-1)/D, K is the total number of rows and D is the number of distinct values for the index columns considered.
So suppose your table is created with CREATE TABLE TEST (C1 INT, C2 TEXT, C3 INT, C4 INT);
and you create an index like CREATE INDEX IDX on TEST(C1, C2)
Then you can INSERT manually, or let SQLite update the sqlite_stat1 table automatically, with a row such as:
"TEST" --> table name, "IDX" --> index name, "10000 1000 1". Here 10000 is the total number of rows in table TEST. 1000 means that each distinct C1 value matches about 1000 rows, so C1 has few distinct values. 1 means that each (C1, C2) combination matches about one row, which is what you would see if the pair behaves like an ID. The higher a value is, the fewer distinct values that part of the index has. Each integer covers the index columns up to that position, so the values never increase from left to right.
You can run ANALYZE or update the table manually. (Prefer the first.)
So what are the values used for? SQLite uses these statistics to pick the best index. Consider CREATE INDEX IDX2 ON TEST(C2) with the stat value "10000 1", and CREATE INDEX IDX1 ON TEST(C1) with the stat value "10000 100".
Suppose we don't have the index IDX that we defined before. When you issue
SELECT * FROM TEST WHERE C1=? AND C2=?, SQLite will choose IDX2 rather than IDX1. Why? It's simple: IDX2 narrows the result down to far fewer rows than IDX1 does.
Clear?
Simply run EXPLAIN QUERY PLAN followed by your SQL statement. You will see whether the tables referenced in the statement use an index. If not, try rewriting the SQL; if so, check whether it is the index you intended. For more information, refer to www.sqlite.org
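As a hedged illustration of the IDX1 versus IDX2 choice, the sketch below builds a smaller version of the TEST table with Python's sqlite3 module, runs ANALYZE, and asks EXPLAIN QUERY PLAN which index is picked. The data (1000 rows, 10 distinct C1 values, unique C2 values) is made up, and the exact wording of the plan text differs between SQLite versions, so the script only checks which index name appears.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (c1 INT, c2 TEXT, c3 INT, c4 INT)")
# c1 has 10 distinct values (100 rows each); c2 is unique per row.
conn.executemany("INSERT INTO test VALUES (?, ?, 0, 0)",
                 [(i % 10, "v%d" % i) for i in range(1000)])
conn.execute("CREATE INDEX idx1 ON test (c1)")
conn.execute("CREATE INDEX idx2 ON test (c2)")
conn.execute("ANALYZE")

# Statistics gathered by ANALYZE, keyed by index name.
stats = dict(conn.execute(
    "SELECT idx, stat FROM sqlite_stat1 WHERE tbl = 'test'"
))

# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM test WHERE c1 = ? AND c2 = ?",
    (3, "v3"),
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)

print(stats)
print(plan_text)
```

With these statistics the more selective index on c2 should be the one named in the plan.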