Will Redis SCAN return data inserted after the scan operation started? - redis

I am using Redis SCAN to iterate over the keys in my Redis DB. This is the code I am using:
#example :1
import redis

start_cursor = 0
r = redis.StrictRedis(host='localhost', port=6379)
cur, keys = r.scan(cursor=start_cursor, match='*', count=3)
records = len(keys)
values = []
values.extend(i for i in keys)
print cur, records, values
while cur != 0:
    cur, keys = r.scan(cursor=cur, match='*', count=3)
    records += len(keys)
    data = r.mget(keys=keys)  # values for the scanned keys (fetched but not printed)
    values.extend(i for i in keys)
    print cur, len(keys), values
print "Total records scanned :{}".format(records)
My DB has the following values in the format (key, value): (1,1) (2,2) (3,3) (4,4) (5,5)
When I scan my records with
start_cursor=0
I get the following records:
scan #=>Cursor,new records,all records
scan #1 =>5,3,['1', '2', '5']
scan #2 =>0,2,['1', '2', '5', '3', '4']
Total Records scanned :5
When I start scanning with the cursor returned by scan #1, i.e. 5, I get the records from scan #2 of example 1.
Example : 2
start_cursor=5
scan #=>Cursor,new records,all records
scan #1 =>0 2 ['3', '4']
Total Records scanned :2
Everything is fine up to now. But when I insert four more records, say (6,6) (7,7) (8,8) (9,9), and start scanning with start_cursor=5 as in example 2, some of the data is missing; here key 6 is missed.
Example : 3
start_cursor=5
scan #=>Cursor,new records,all records
scan #1 =>3 3 ['3', '4', '9']
scan #2 =>0 2 ['3', '4', '9', '7', '8']
Total Records scanned :5
Will Redis SCAN return data inserted after the scan operation started? If not, is there any other way to achieve this?
Thanks in advance!!

From the doc:
Elements that were not constantly present in the collection during a full iteration, may be returned or not: it is undefined.
Since the data is inserted after the scan operation starts, it is NOT constantly present during the iteration, so SCAN does not guarantee to return the newly inserted data.
If not, is there any other way to achieve this?
AFAIK, there's no other way to achieve that.
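If you need keys that were added later, the only option SCAN itself offers is to start another full iteration from cursor 0 once the current one has finished (redis-py's scan_iter does exactly that), or to be told about new keys via keyspace notifications. A minimal sketch of the re-scan approach, assuming redis-py; full_key_snapshot is a made-up helper:

import redis

r = redis.StrictRedis(host='localhost', port=6379)

def full_key_snapshot(match='*', count=100):
    # scan_iter drives SCAN from cursor 0 until the cursor returns to 0,
    # i.e. one complete iteration over the keyspace.
    return set(r.scan_iter(match=match, count=count))

before = full_key_snapshot()
# ... new keys such as (6,6) (7,7) (8,8) (9,9) are inserted here ...
after = full_key_snapshot()   # a fresh full iteration sees keys present from its start
new_keys = after - before

Resuming from a cursor captured earlier (as in example 2) only continues the old iteration, which is exactly the case where SCAN gives no guarantee for newly inserted keys.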

Related

What is the maximum number of scalars that can be used in an IN clause?

Whenever I use a small statement for example:
DELETE FROM c_ordertax WHERE (c_order_id,c_tax_id) IN ((183691598,1000862),(183691198,1000862));
It executes perfectly... but if I execute a lengthy statement to delete, say, 18755 records with these scalar values, it says "max_stack_depth" exceeded... this option in postgresql.conf has been set to 2MB, and the query that threw the error doesn't even amount to 2MB, it's just 300kB.
Note: No Triggers are attached in the table
And one thing I noticed about other queries is that when I use a single value in the IN clause, e.g. DELETE FROM c_ordertax WHERE (c_order_id) IN ((183691598),(183691198)); there are no issues, and however lengthy the query may be, it executes perfectly...
My current options are:
I could increase the "max_stack_depth" value, but it is limited to 8MB, and increasing it further causes issues: the PostgreSQL server couldn't restart. It only restarts properly if the option is set to a value less than 8MB.
I could split up those statements, but it might not be a graceful solution, and it also requires me to know the maximum number of scalar values that can be accommodated in a single statement; I fear that if the number of fields per scalar value increases, the total number of values that can be used in a single statement could shrink.
So my question is: what is the maximum number of scalar values that can be used in an IN clause? And if the number of fields per scalar value increases, is there a formula to determine the maximum number of scalar values that can be used, e.g.:
5 values with 2 fields => ((1,2),(1,2),(1,2),(1,2),(1,2))
2 values with 3 fields => ((1,2,3),(1,2,3))
Has any database mastermind encountered these kinds of issues? If so, how do I tackle them?
It should work if you rewrite the list of scalar values to a values() list:
DELETE FROM c_ordertax
USING (
  VALUES
    (183691598,1000862),
    (183691198,1000862)
) AS t(ord_id,tax_id)
WHERE c_order_id = t.ord_id
  AND c_tax_id = t.tax_id;
I tried this with 10000 pairs in the values list and it did not throw an error. That was with Postgres 11 however. I don't have 9.3 available right now.
The problem is that IN lists of pairs get transformed like this during the parse stage:
EXPLAIN DELETE FROM large WHERE (id, id) IN ((1, 1), (2, 2), (3, 3), (4, 4), (5, 5));
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on large (cost=0.00..39425.00 rows=1 width=6)
-> Seq Scan on large (cost=0.00..39425.00 rows=1 width=6)
Filter: (((id = 1) AND (id = 1)) OR ((id = 2) AND (id = 2)) OR ((id = 3) AND (id = 3)) OR ((id = 4) AND (id = 4)) OR ((id = 5) AND (id = 5)))
(3 rows)
If the list consists of scalars, PostgreSQL can do better:
EXPLAIN DELETE FROM large WHERE id IN (1, 2, 3, 4, 5);
QUERY PLAN
---------------------------------------------------------------
Delete on large (cost=0.00..20675.00 rows=5 width=6)
-> Seq Scan on large (cost=0.00..20675.00 rows=5 width=6)
Filter: (id = ANY ('{1,2,3,4,5}'::integer[]))
(3 rows)
The second version will run with large lists, but the first will run into the limit during a recursive parse procedure.
I am not sure whether that can be improved, but it will probably not be seen as a case worth spending a lot of effort on. You can always rewrite your query as "a_horse_with_no_name" suggested.
Usually, if you have long IN lists like that, you are probably doing something wrong, like trying to perform a join outside the database.
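Along the same lines as the VALUES rewrite above, the pairs can also be shipped as two parallel arrays and expanded with unnest(), which keeps the statement small no matter how many pairs there are. A sketch (requires PostgreSQL 9.4+ for multi-argument unnest; column names taken from the question):

DELETE FROM c_ordertax c
USING unnest(
        ARRAY[183691598, 183691198],   -- c_order_id values
        ARRAY[1000862,   1000862]      -- matching c_tax_id values
      ) AS t(ord_id, tax_id)
WHERE c.c_order_id = t.ord_id
  AND c.c_tax_id   = t.tax_id;

The arrays can also be sent as two bind parameters, so the SQL text stays the same size regardless of how many pairs are deleted.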

Does SQL Server Table-Scan Time depend on the Query?

I observed that a full table scan takes a different amount of time depending on the query. I believed that under similar conditions (same set of columns in the SELECT, same column data types) a table scan should take roughly the same time. It seems that is not the case. I just want to understand the reason behind that.
I have used "CHECKPOINT" and "DBCC DROPCLEANBUFFERS" before querying to make sure there is no impact from the query cache.
Table:
10 Columns
10M rows
Each column has a different density, ranging from 0.1 to 0.000001
No indexes
Queries:
Query A: returned 100 rows, time taken: ~900ms
SELECT [COL00]
FROM [TEST].[dbo].[Test]
WHERE COL07 = 50000
Query B: returned 910595 rows, time taken: ~15000ms
SELECT [COL00]
FROM [TEST].[dbo].[Test]
WHERE COL01 = 5
** Where column COL07 was randomly populated with integers ranging from 0 to 100000 and column COL01 was randomly populated with integers ranging from 0 to 10
Time Taken:
Query A: around 900 ms
Query B: around 18000 ms
What's the point I'm missing here?
Query A: (returned 100 rows, time taken: ~900ms)
Query B: (returned 910595 rows, time taken: ~15000ms)
I believe that what you are missing is that there are roughly 9,000 times more rows to fetch in the second query (910,595 vs 100). That alone could explain why it took about 20 times longer.
The two columns have different density of the data.
Query A, COL07: 10000000/100000 = 100
Query B, COL01: 10000000/10 = 1000000
The fact that both search parameters are in the middle of the data range doesn't necessarily impact the speed of the search. It depends on the number of times the engine scans the column to return the values of the search predicate.
In order to see if this is indeed the case, I would try the following:
COL04: 10000000/1000 = 10000. Filtering on WHERE COL04 = 500
COL08: 10000000/10000 = 1000. Filtering on WHERE COL08 = 5000
Considering the times from the initial test, you would expect to see COL04 at ~7200ms and COL08 at ~3600ms.
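A hedged sketch of that follow-up test, reusing the cache-clearing steps and table name from the question (the filter values and the densities of COL04/COL08 are assumptions):

-- Clear the buffer cache so each query measures a cold full table scan.
CHECKPOINT;
DBCC DROPCLEANBUFFERS;

SET STATISTICS TIME ON;

-- Density 1/1000: ~10,000 matching rows expected
SELECT [COL00] FROM [TEST].[dbo].[Test] WHERE COL04 = 500;

-- Repeat CHECKPOINT / DBCC DROPCLEANBUFFERS here for a fair second measurement.

-- Density 1/10000: ~1,000 matching rows expected
SELECT [COL00] FROM [TEST].[dbo].[Test] WHERE COL08 = 5000;

SET STATISTICS TIME OFF;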
There is an interesting article on this topic: SQL Server COUNT() Function Performance Comparison.
A Full Table Scan (also known as a Sequential Scan) is a scan made on a database where each row of the table under scan is read in sequential (serial) order.
Reference
In your case, the full table scan reads sequentially (in order), so it does not need to scan the whole table to advance to the next record, because COL07 is ordered. But in Query B that is not the case: COL01 is randomly distributed, so a full table scan is needed.
Query A is an optimistic scan, whereas Query B is a pessimistic scan.

How to retrieve keys in large REDIS databases using SCAN

I have a large redis database where I query keys using SCAN with the syntax:
SCAN 0 MATCH *something* COUNT 50
I get the result
1) "500000"
2) (empty list or set)
but the key is there. If I call SCAN again with the new cursor from 1), at some point I will get the result.
I was under the impression that MATCH would return matching keys up to the maximum number specified by COUNT, but it seems Redis scans COUNT keys per call and returns only the ones that match.
Am I missing something? How can I say: "give me the first (count) keys that match the pattern"?
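COUNT is only a hint for how many keys each SCAN call inspects, not how many matching keys it returns, so getting "the first N matching keys" means iterating until enough matches have accumulated. A minimal client-side sketch, assuming redis-py; first_matching_keys is a made-up helper:

import redis

r = redis.StrictRedis(host='localhost', port=6379)

def first_matching_keys(pattern, n, count=1000):
    # Keep calling SCAN until n matches are collected or the
    # iteration completes (cursor returns to 0).
    found = []
    cursor = 0
    while True:
        cursor, keys = r.scan(cursor=cursor, match=pattern, count=count)
        found.extend(keys)
        if len(found) >= n or cursor == 0:
            return found[:n]

keys = first_matching_keys('*something*', 50)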

How to store and query version of same document in PostgreSQL?

I am storing versions of a document in PostgreSQL 9.4. Every time the user creates a new version, it inserts a row so that I can track all changes over time. Each row shares a reference_id column with the previous rows. Some of the rows get approved, and some remain as drafts. Each row also has a viewable_at time.
id | reference_id | approved | viewable_at         | created_at | content
 1 | 1            | true     | 2015-07-15 00:00:00 | 2015-07-13 | Hello
 2 | 1            | true     | 2015-07-15 11:00:00 | 2015-07-14 | Guten Tag
 3 | 1            | false    | 2015-07-15 17:00:00 | 2015-07-15 | Grüß Gott
The most frequent query is to get the rows grouped by the reference_id where approved is true and viewable_at is less than the current time. (In this case, row id 2 would be included in the results)
So far, this is the best query I've come up with that doesn't require me to add additional columns:
SELECT DISTINCT ON (reference_id) reference_id, id, approved, viewable_at, content
FROM documents
WHERE approved = true AND viewable_at <= '2015-07-15 13:00:00'
ORDER BY reference_id, created_at DESC;
I have an index on reference_id and a multi-column index on approved and viewable_at.
At only 15,000 rows it's still averaging a few hundred milliseconds (140 - 200) on my local machine. I suspect that the distinct call or the ordering may be slowing it down.
What is the most efficient way to store this information so that SELECT queries are the most performant?
Result of EXPLAIN (BUFFERS, ANALYZE):
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=6668.86..6730.36 rows=144 width=541) (actual time=89.862..99.613 rows=145 loops=1)
Buffers: shared hit=2651, temp read=938 written=938
-> Sort (cost=6668.86..6699.61 rows=12300 width=541) (actual time=89.861..97.796 rows=13184 loops=1)
Sort Key: reference_id, created_at
Sort Method: external merge Disk: 7488kB
Buffers: shared hit=2651, temp read=938 written=938
-> Seq Scan on documents (cost=0.00..2847.80 rows=12300 width=541) (actual time=0.049..40.579 rows=13184 loops=1)
Filter: (approved AND (viewable_at < '2015-07-20 06:46:55.222798'::timestamp without time zone))
Rows Removed by Filter: 2560
Buffers: shared hit=2651
Planning time: 0.218 ms
Execution time: 178.583 ms
(12 rows)
Document Usage Notes:
The documents are manually edited and we're not yet autosaving the documents every X seconds or anything, so the volume will be reasonably low. At this point, there is an average of 7 versions and an average of only 2 approved versions per reference_id. (~30%)
On the min and max side, the vast majority of documents will have 1 or 2 versions and it seems unlikely that any document would have more than 30 or 40. There is a garbage collection process to clean out unapproved versions older than a week, so the total number of versions should stay pretty low.
For retrieving and practical usage, I could use limit / offset on the queries but in my tests that doesn't make a huge difference. Ideally this is a base query that populates a view or something so that I can do additional queries on top of these results, but I'm not entirely sure how that would affect the resulting performance and am open to suggestions. My impression is that if I can get this storage / query as simple / fast as possible then all other queries that start from this point could be improved, but it's likely that I'm wrong and that each query needs more independent thought.
Looking at your explain output, it looks like you're fetching most of the contents of the documents table, so it's sensibly doing a sequential scan. Your rowcount estimates are reasonable; there doesn't seem to be any stats issue here.
It's doing an external merge sort on disk, so you might see a significant increase in performance by increasing work_mem in the session, e.g.
SET work_mem = '12MB'
It's possible that an index on (reference_id ASC, created_at DESC) WHERE (approved) might be useful, since it'll allow results to be fetched in the order required.
You could also experiment with adding viewable_at to the index. I think it might have to be the last column, but I'm not sure. Or even making it into a covering index by appending viewable_at, id, content and omitting the unnecessary approved column from the result set. This may permit an index-only scan, though with DISTINCT ON involved I'm not sure.
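Spelled out as DDL, that suggestion would look something like this (a sketch; the index name is made up):

-- Partial index matching DISTINCT ON (reference_id) ... ORDER BY created_at DESC,
-- restricted to approved rows.
CREATE INDEX documents_ref_created_approved_idx
    ON documents (reference_id ASC, created_at DESC)
    WHERE approved;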
@Craig already covers most options to make this query faster. More work_mem for the session is probably the most effective item.
Since:
There is a garbage collection process to clean out unapproved versions older than a week
A partial index excluding unapproved versions won't amount to much. If you use an index, you would still exclude those irrelevant rows, though.
Since you seem to have very few versions per reference_id:
the vast majority of documents will have 1 or 2 versions
You already have the best query technique with DISTINCT ON:
Select first row in each GROUP BY group?
With a growing number of versions, other techniques would be increasingly superior:
Optimize GROUP BY query to retrieve latest record per user
The only slightly unconventional element in your query is that the predicate is on viewable_at, but you then take the row with the latest created_at, which is why your index would actually be:
(reference_id, viewable_at ASC, created_at DESC) WHERE (approved)
Assuming all columns to be defined NOT NULL. The alternating sort order between viewable_at and created_at is important. Then again, while you have so few rows per reference_id I don't expect any index to be of much use. The whole table has to be read anyway, a sequential scan will be about as fast. The added maintenance cost of the index may even outweigh its benefit.
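For completeness, that index written out as DDL (a sketch; the index name is made up, and as noted above it may not pay off at this table size):

CREATE INDEX documents_ref_viewable_created_idx
    ON documents (reference_id, viewable_at ASC, created_at DESC)
    WHERE approved;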
However, since:
Ideally this is a base query that populates a view or something so that I can do additional queries on top of these results
I have one more suggestion: create a MATERIALIZED VIEW from your query, giving you a snapshot of your data for the given point in time. If disk space is not an issue and snapshots might be reused, you could even keep a couple of them around:
CREATE MATERIALIZED VIEW doc_20150715_1300 AS
SELECT DISTINCT ON (reference_id)
reference_id, id, approved, viewable_at, content
FROM documents
WHERE approved -- simpler expression for boolean column
AND viewable_at <= '2015-07-15 13:00:00'
ORDER BY reference_id, created_at DESC;
Or, if all additional queries happen in the same session, use a temp table instead (which dies at the end of the session automatically):
CREATE TEMP TABLE doc_20150715_1300 AS ...;
ANALYZE doc_20150715_1300;
Be sure to run ANALYZE on the temp table (and also on the MV if you run queries immediately after creating it):
Are regular VACUUM ANALYZE still recommended under 9.1?
PostgreSQL partial index unused when created on a table with existing data
Either way, it may pay to create one or more indexes on the snapshots supporting subsequent queries. Depends on data and queries.
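For example, if the follow-up queries filter the snapshot by reference_id, something like this would support them (a sketch; which columns to index depends entirely on those queries):

CREATE INDEX doc_20150715_1300_reference_id_idx
    ON doc_20150715_1300 (reference_id);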
Note, the current version 1.20.0 of pgAdmin does not display indexes for MVs. That's already been fixed and is waiting to be released with the next version.

Postgres - Is this the right way to create a partial index on a boolean column?

I have the following table:
CREATE TABLE recipemetadata
(
  --Lots of columns
  diet_glutenfree boolean NOT NULL
);
Most every row will be set to FALSE unless someone comes up with some crazy new gluten free diet that sweeps the country.
I need to be able to very quickly query for rows where this value is true. I've created the index:
CREATE INDEX IDX_RecipeMetadata_GlutenFree ON RecipeMetadata(diet_glutenfree) WHERE diet_glutenfree;
It appears to work; however, I can't figure out how to tell whether it's indeed only indexing rows where the value is true. I want to make sure it's not doing something silly like indexing all rows regardless of value.
Should I add an operator to the WHERE clause, or is this syntax perfectly valid? Hopefully this isn't one of those super easy RTFM questions that will get downvoted 30 times.
UPDATE:
I've gone ahead and added 10,000 rows to RecipeMetadata with random values. I then did an ANALYZE on the table and a REINDEX just to be sure. When I run the query:
select recipeid from RecipeMetadata where diet_glutenfree;
I get:
'Seq Scan on recipemetadata (cost=0.00..214.26 rows=5010 width=16)'
' Filter: diet_glutenfree'
So, it appears to be doing a sequential scan on the table even though only about half the rows have this flag. The index is being ignored.
If I do:
select recipeid from RecipeMetadata where not diet_glutenfree;
I get:
'Seq Scan on recipemetadata (cost=0.00..214.26 rows=5016 width=16)'
' Filter: (NOT diet_glutenfree)'
So no matter what, this index is not being used.
I've confirmed the index works as expected.
I re-created the random data, only this time set diet_glutenfree to random() > 0.9 so there's only a 10% chance of an on bit.
I then re-created the indexes and tried the query again.
SELECT RecipeId from RecipeMetadata where diet_glutenfree;
Returns:
'Index Scan using idx_recipemetadata_glutenfree on recipemetadata (cost=0.00..135.15 rows=1030 width=16)'
' Index Cond: (diet_glutenfree = true)'
And:
SELECT RecipeId from RecipeMetadata where NOT diet_glutenfree;
Returns:
'Seq Scan on recipemetadata (cost=0.00..214.26 rows=8996 width=16)'
' Filter: (NOT diet_glutenfree)'
It seems my first attempt was polluted since PG estimates it's faster to scan the whole table rather than hit the index if it has to load over half the rows anyway.
However, I think I would get these exact results on a full index of the column. Is there a way to verify the number of rows indexed in a partial index?
UPDATE
The index is around 40k. I created a full index of the same column and it's over 200k, so it looks like it's definitely partial.
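For reference, those sizes can be compared directly in SQL (a sketch; the name of the full index is hypothetical):

SELECT pg_size_pretty(pg_relation_size('idx_recipemetadata_glutenfree'))      AS partial_index,
       pg_size_pretty(pg_relation_size('idx_recipemetadata_glutenfree_full')) AS full_index;  -- hypothetical full-index name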
An index on a one-bit field makes no sense. To understand the decisions made by the planner, you must think in terms of pages, not in terms of rows.
For 8K pages and an (estimated) row size of 80 bytes, there are about 100 rows on every page. Assuming a random distribution, the chance that a page consists only of rows with a true value is negligible: pow(0.5, 100), about 1e-30, IIRC. (And the same for 'false', of course.) Thus for a query on diet_glutenfree = true, every page has to be fetched anyway and filtered afterwards. Using an index would only cause more pages (the index) to be fetched.
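To relate this to the actual table, the page and row counts the planner works with can be read from pg_class (a sketch; the numbers depend on the installation):

-- Pages vs. rows the planner sees for the table and the partial index.
SELECT relname, relpages, reltuples
FROM pg_class
WHERE relname IN ('recipemetadata', 'idx_recipemetadata_glutenfree');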