How to get a count from tables with a huge amount of data - SQL

I am trying to get the count of records from a table with around 40 million records. My query is as follows:
Select count(*) from Employee
where code = '000111' and status = 'A' and rank = 'B'
There are around 2-3 million records that satisfy the condition. Status has just two values (A and C) and rank also has only two values (A and B).
Indexes have been added for the columns 'code', 'status' and 'rank' and all are VARCHAR.
In spite of this, the above query is taking a lot of time.
Is there a way to retrieve the count in quick time?
Note that I just want the count of records.
Edit: Column Details
EMPLOYEE
CODE NOT NULL VARCHAR2(6)
STATUS NOT NULL VARCHAR2(1)
RANK NOT NULL VARCHAR2(1)
Indexes:
CODE_IDX - Normal Index (For Code)
STATUS_IDX - BITMAP Index (For Status)
TARIFF_IDX - Normal Index (For Tariff)

If the table is continuously updated and you need the count continuously, I would recommend creating a key/value table (if you don't have one already) that stores the count as a row in the database, rather than computing the count each time. That is to say, if you need your query to be faster, this will definitely speed it up. Make the key the primary key of that table and you won't have to worry about indexing. Update the key/value pair whenever a row is inserted or removed by adding or subtracting 1 from the value. Then periodically verify that the value is still accurate by running the count the way you already do, in a cron job or some other fashion.
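The maintenance mechanism is left open above, so here is a minimal sketch of the idea, assuming Oracle (the VARCHAR2 columns suggest it). The counter table, trigger name and seeding step are illustrative, and the trigger only covers inserts and deletes, not updates that change code, status or rank:
-- Counter table keyed by the filter values (names are illustrative)
CREATE TABLE employee_counts (
    code   VARCHAR2(6),
    status VARCHAR2(1),
    rank   VARCHAR2(1),
    cnt    NUMBER NOT NULL,
    CONSTRAINT employee_counts_pk PRIMARY KEY (code, status, rank)
);

-- Seed it once with the counts computed the slow way
INSERT INTO employee_counts (code, status, rank, cnt)
SELECT code, status, rank, COUNT(*)
FROM Employee
GROUP BY code, status, rank;

-- Keep it current on insert/delete (assumes the combination already has a row;
-- otherwise insert one with cnt = 0 first)
CREATE OR REPLACE TRIGGER employee_counts_trg
AFTER INSERT OR DELETE ON Employee
FOR EACH ROW
BEGIN
    IF INSERTING THEN
        UPDATE employee_counts SET cnt = cnt + 1
        WHERE code = :NEW.code AND status = :NEW.status AND rank = :NEW.rank;
    ELSIF DELETING THEN
        UPDATE employee_counts SET cnt = cnt - 1
        WHERE code = :OLD.code AND status = :OLD.status AND rank = :OLD.rank;
    END IF;
END;
/

-- The original count then becomes a single-row lookup
SELECT cnt FROM employee_counts
WHERE code = '000111' AND status = 'A' AND rank = 'B';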

Have you tried:
Select count(1) from Employee where code = '000111' and status = 'A' and rank = 'B'

Related

Best way to compare two tables in SQL by matching string?

I have a program whose goal is to take data from an API and capture the differences in that data from minute to minute. It involves three tables: Table 1 (for new data), Table 2 (for the previous minute's data), and a Results table (for the results).
The sequence of the program is like this:
Update table 1 -> Calculate the differences from table 2 and update a "Results" table with the differences -> Copy table 1 to table 2.
Then it repeats! It's simple and it works.
Here is my SQL query:
Insert into Results (symbol, bid, ask, description, Vol_Dif, Price_Dif, Time) Select * FROM(
Select symbol, bid, ask, description, Vol_Dif, Price_Dif, '$now' as Time FROM (
Select t1.symbol, t1.bid, t1.ask, t1.description, (t1.volume - t2.volume) AS Vol_Dif, (t1.totalPrice - t2.totalPrice) AS Price_Dif
FROM `Table_1` t1
Inner Join (
Select id, volume, ask, totalPrice FROM Table_2) t2
ON t2.id = t1.id) as test
) as final
The tables are identical in structure, obviously. The primary key is the 'id' field that auto-increments. And as you can see, I am comparing both tables on the basis of these 'id' fields being equal.
The PROBLEM is that the API seems to be inconsistent. One API call will have 50,000 entries; the next will have 51,000. And the entries are not just added to the end or the beginning, they are mixed into the middle.
So comparing on equal IDs means I am comparing entries for DIFFERENT data whenever the API calls return a different number of entries.
The data I am trying to get the differences of is the 'bid', 'ask', 'Vol_Dif' and 'Price_Dif' from minute to minute. There are many instances of the same 'symbol', so I can't compare on that. The ONLY other way to compare entries from table to table, besides matching IDs, would be matching the "description" fields.
I have tried this. The script is almost the same as above except the end of the query is
ON t2.description = t1.description
The problem is that looking for matching description fields takes 3 minutes for 50,000 entries, whereas looking for matching ID's takes 1 second.
Is there a better, faster way to do what I'm trying to do? Thanks in advance. Any help is appreciated.
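One thing worth trying, as a sketch rather than a definitive fix: a join on an unindexed text column forces a scan of Table_2 for every row of Table_1. Assuming MySQL (the backtick quoting suggests it), an index on the join column of both tables usually closes most of that gap; if description is very long, indexing a prefix of it may be needed instead:
-- Index the join column on both tables (index names are illustrative)
ALTER TABLE Table_1 ADD INDEX idx_t1_description (description);
ALTER TABLE Table_2 ADD INDEX idx_t2_description (description);

-- The join itself stays as in the question:
-- ... Inner Join Table_2 t2 ON t2.description = t1.description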

Selecting records from PostgreSQL after the primary key has cycled?

I have a PostgreSQL 9.5 table that is set to cycle when the primary key ID hits the maximum value. For argument's sake, let's say the maximum ID value is 999,999. I'll add commas to make the numbers easier to read.
We run a job that deletes data from the table that is older than 45 days. Let's assume that the table now only contains records with IDs of 999,998 and 999,999.
The primary key ID cycles back to 1 and 20 more records have been written. I need to keep it generic so I won't make any assumptions about how many were written. In my real world needs, I don't care how many were written.
How can I select the records without getting duplicates with an ID of 999,998 and 999,999?
For example:
SELECT * FROM my_table WHERE ID > 0;
Would return (in no particular order):
999,998
999,999
1
2
...
20
My real-world case is that I need to publish every record written to the table to a message broker. I maintain a separate table that tracks the row ID and timestamp of the last record published. The pseudo-query/pseudo-algorithm to determine which new records to write is something like this; the IF statement handles the case where the primary key ID cycles back to 1, as I need to read the new records written after the ID cycled:
SELECT * from my_table WHERE id > last_written_id
PUBLISH each record
if ID of last record published == MAX_TABLE_ID (e.g. 999,999):
??? What to do here? I need to get the newest records where ID >= 1 but less than the oldest record I have
I realise that the "code" is rough, but it's really just an idea at the moment so there's no code.
Thanks
Hmm, you can use the current value of the sequence to do what you want:
select t.*
from my_table t
where t.id > #last_written_id or
(currval(pg_get_serial_sequence('my_table', 'id')) < #last_written_id and
t.id <= currval(pg_get_serial_sequence('my_table', 'id'))
);
This is not a 100% solution. After all, 2,000,000 records could have been added, so the numbers would all be repeated, or the records could have been deleted. It also gets shaky if inserts happen while the query is running, particularly in a multithreaded environment.
Here is a completely different approach: you could fill the table up front and give it a column for the deletion time. Instead of deleting a row, you merely set this datetime; and instead of inserting a row, you update the one that was deleted the longest time ago:
update my_table
set col1 = 123, col2 = 456, col3 = 'abc', deletion_datetime = null
where deletion_datetime =
(
select deletion_datetime
from my_table
where deletion_datetime is not null
order by deletion_datetime
limit 1
);

How to Never Retrieve Different Rows in a Changing Table

I have a table of millions of rows that is constantly changing (new rows are inserted, existing rows are updated, and some are deleted). I'd like to query 100 new rows every minute, but these rows can't be ones I've queried before. The table has about two dozen columns and a primary key.
Happy to answer any questions or provide clarification.
A simple solution is to have a separate table with just one row to store the last ID you fetched.
Let's say that's your "table of millions of rows":
-- That's your table with million of rows
CREATE TABLE test_table (
id serial unique,
col1 text,
col2 timestamp
);
-- Data sample
INSERT INTO test_table (col1, col2)
SELECT 'test', generate_series
FROM generate_series(now() - interval '1 year', now(), '1 day');
You can create the following table to store an ID:
-- Table to keep the last fetched id
CREATE TABLE last_query (
last_query_id int references test_table (id)
);
-- Initial row
INSERT INTO last_query (last_query_id) VALUES (1);
Then with the following query, you will always fetch 100 rows that have never been fetched from the original table, and maintain the pointer in last_query:
WITH last_id as (
SELECT last_query_id FROM last_query
), new_rows as (
SELECT *
FROM test_table
WHERE id > (SELECT last_query_id FROM last_id)
ORDER BY id
LIMIT 100
), update_last_id as (
-- data-modifying CTE: advance the pointer to the highest id just fetched
UPDATE last_query SET last_query_id = (SELECT MAX(id) FROM new_rows)
)
SELECT * FROM new_rows;
Rows are fetched in ID order (oldest rows first).
You basically need a unique, sequential value that is assigned to each record in this table. That allows you to search for the next X records where the value of this field is greater than the last one you got from the previous page.
The easiest way would be to have an identity column as your PK, start from the beginning, and include a "where id > #last_id" filter on your query. This is a fairly straightforward way to page through data, regardless of underlying updates. However, if you already have millions of rows and are constantly creating and updating, an ordinary 32-bit integer identity will eventually run out of numbers (a bigint identity is unlikely to run out in your great-grandchildren's lifetimes, but not every DB supports anything beyond a 32-bit identity).
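A minimal sketch of that filter, reusing test_table from the answer above and the same #last_id placeholder style; the caller remembers the largest id returned and uses it as #last_id next time:
-- Fetch the next batch of rows not seen before
SELECT *
FROM test_table
WHERE id > #last_id   -- largest id returned by the previous batch
ORDER BY id
LIMIT 100;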
You can do the same thing with a "CreatedDate" datetime column, but as these dates aren't 100% guaranteed to be unique, depending on how this date is set you might have more than one row with the same creation timestamp, and if those records cross a "page boundary", you'll miss any occurring beyond the end of your current page.
Some SQL systems' GUID generators are guaranteed to be not only unique but sequential. You'll have to look into whether PostgreSQL's GUIDs work this way; if they're true V4 GUIDs, they'll be totally random except for the version identifier, and you're out of luck. If you do have access to sequential GUIDs, you can filter just like with an integer identity column, only with many more possible key values.

Moving large amounts of data instead of updating it

I have a large table (about 40M rows) where a number of columns are 0 and need to be null instead, so we can better key the data.
I've written scripts that chop the update into chunks of 10,000 records, find the occurrences of columns with zero, and update them to null.
Example:
update FooTable
set order_id = case when order_id = 0 then null else order_id end,
person_id = case when person_id = 0 then null else person_id end
WHERE person_id = 0
OR order_id = 0
This works great, but it takes forever.
I'm thinking a better way to do this would be to create a second table, insert the data into it, and then rename it to replace the old table that has the zero columns.
Question is - can I do an insert into table2 select from table1, and in the process cleanse the data from table1 before it goes in?
You can usually create a new, sanitised table, depending on the actual DB server you are using.
The hard thing is that if there are other tables in the database, you may have issues with foreign keys, indexes, etc which will refer to the original table.
Whether making a new sanitised table will be quicker than updating your existing table is something you can only tell by trying it.
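To answer the concrete question: yes, the cleansing can happen during the copy. A minimal sketch using standard NULLIF, assuming the new table (called FooTable_clean here purely for illustration) already exists with the same structure; only the two columns from the question are shown:
-- Copy everything across, converting 0 to NULL on the way in
INSERT INTO FooTable_clean (order_id, person_id /*, remaining columns */)
SELECT NULLIF(order_id, 0),
       NULLIF(person_id, 0)
       /*, remaining columns */
FROM FooTable;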
Dump the PK/clustered key of all the records you want to update into a temp table, then perform the update joining to the temp table. That will ensure the lowest locking level and quickest access. You can also add an identity column to the temp table; then you can loop through and do the updates in batches, as sketched below.
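A minimal sketch of that approach, assuming SQL Server (the "clustered key" and "identity column" wording suggests it) and assuming FooTable's primary key is a single int column named id:
-- Collect the keys of the rows that need fixing; the identity column gives a batch order
CREATE TABLE #to_fix (
    rn INT IDENTITY(1,1) PRIMARY KEY,
    id INT NOT NULL
);
INSERT INTO #to_fix (id)
SELECT id
FROM FooTable
WHERE order_id = 0 OR person_id = 0;

-- Update in batches of 10,000, joining on the temp table
DECLARE @batch INT = 0, @batch_size INT = 10000;
WHILE EXISTS (SELECT 1 FROM #to_fix WHERE rn > @batch * @batch_size)
BEGIN
    UPDATE f
    SET order_id  = NULLIF(f.order_id, 0),
        person_id = NULLIF(f.person_id, 0)
    FROM FooTable f
    JOIN #to_fix t ON t.id = f.id
    WHERE t.rn > @batch * @batch_size
      AND t.rn <= (@batch + 1) * @batch_size;

    SET @batch = @batch + 1;
END;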

sqlite: save array?

I have a simple table with two columns: "id" INTEGER as a key, and "data" INTEGER.
One of the user's requirements is to save the order in which he views the data,
so I have to save the order of records in the table.
The simple solution as I see it: id, data, order_id.
But in this case, if the user adds a record into the middle of his table's view,
we have to update many records.
Another idea: id, data, next_id, previous_id.
Insertion is fast, but extracting records in a defined order is slow.
So what is the best (fast) method to save the order of records in a table
using sqlite? Fast = fast insert + fast extraction of records in a defined order.
Update:
As I see it, the problem with order_id is the insertion of a new record. I expect around 10 * 10^3 records, and inserting a new record will in the worst case mean updating all 10 * 10^3 of them. The sqlite database file is on flash memory, so it is not as fast as on a PC, and it would be better to reduce the amount written in order to extend the lifetime of the flash.
I think order_id is the better option; you only need one update statement to make room for the new record:
update my_table
set order_id = order_id + 1
where order_id >= #newRecordOrder
I do wonder whether that order is unique across the whole table or only within a subset, in which case the where clause would need a second key field.
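A minimal sketch of that order_id approach in SQLite (column names follow the question; the table name and the position value 10 are just illustrative):
CREATE TABLE items (
    id       INTEGER PRIMARY KEY,
    data     INTEGER,
    order_id INTEGER NOT NULL
);
CREATE INDEX idx_items_order ON items(order_id);

-- Insert a new record at position 10:
-- 1) make room by shifting everything at or after that position
UPDATE items SET order_id = order_id + 1 WHERE order_id >= 10;
-- 2) insert the new row into the freed slot
INSERT INTO items (data, order_id) VALUES (42, 10);

-- Extracting records in the user's defined order is then a simple indexed scan
SELECT id, data FROM items ORDER BY order_id;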