"or" match pattern in redis scan - redis

Is there something like "or" in SCAN match patterns in Redis? I need to match complicated keys like:
00,All+cities|123, 00,Paris|234, 00,London|345
Is it possible to match Paris or All cities using SCAN?
Something like:
SCAN 0 match 00,[Paris] || [All+cities]*
Thanks.
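SCAN's MATCH option takes a glob-style pattern (*, ?, character classes such as [abc], and backslash escapes), and glob patterns have no alternation, so there is no direct "or". One possible workaround, sketched here in Python, is to scan with the broader pattern 00,* and filter on the client. filter_keys and its parameters are hypothetical names for illustration, and the sample list stands in for keys that would come from a client call such as redis-py's scan_iter(match="00,*").

```python
import re
from typing import Iterable, Iterator


def filter_keys(keys: Iterable[str], cities: Iterable[str], prefix: str = "00,") -> Iterator[str]:
    """Yield keys shaped like '<prefix><city>|<id>' whose city is in `cities`.

    Hypothetical helper: the key layout is assumed from the examples in the question.
    """
    # Escape each city so characters like '+' are matched literally.
    alternatives = "|".join(re.escape(city) for city in cities)
    pattern = re.compile(re.escape(prefix) + "(?:" + alternatives + r")\|")
    for key in keys:
        if pattern.match(key):
            yield key


# Stand-in for the keys returned by a SCAN loop with MATCH 00,*
sample_keys = ["00,All+cities|123", "00,Paris|234", "00,London|345"]
matched = list(filter_keys(sample_keys, ["Paris", "All+cities"]))
print(matched)
```

Another option under the same assumption is to run one SCAN per alternative (MATCH 00,Paris* and MATCH 00,All+cities*) and merge the results.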

Related

Index for comparing to beginning of every word in a column

So I have a table
id | name | gender
---+-----------------+-------
0 | Markus Meskanen | M
1 | Jack Jackson | M
2 | Jane Jackson | F
And I've created an index
CREATE INDEX people_name_idx ON people (LOWER(name));
And then I query with
SELECT * FROM people WHERE LOWER(name) LIKE LOWER('Jack%');
Here Jack stands in for the user's input (the %(name)s parameter). However, this only matches at the beginning of the whole column, and I'd like it to match at the beginning of any of the words. I'd prefer not to use '%Jack%', since it would also return invalid matches from the middle of a word.
Is there a way to create an index so that each word gets a separate row?
Edit: If the name is something long like 'Michael Jackson's First Son Bob' it should match to the beginning of any of the words, i.e. Mich would match to Michael and Fir would match to First but ackson wouldn't match to anything since it's not from the beginning.
Edit 2: And we have 3 million rows so performance is an issue, thus I'm looking at indexes mostly.
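To pin down the rule described in the first edit, here is a small Python reference check. It is only a sketch of the intended semantics (case-insensitive prefix of any whitespace-separated word), useful for validating query results on sample data; it is not how the database would evaluate it, and matches_word_prefix is a hypothetical name.

```python
def matches_word_prefix(name: str, prefix: str) -> bool:
    """Return True if any whitespace-separated word in `name` starts with `prefix`."""
    wanted = prefix.lower()
    return any(word.lower().startswith(wanted) for word in name.split())


name = "Michael Jackson's First Son Bob"
print(matches_word_prefix(name, "Mich"))    # prefix of "Michael"
print(matches_word_prefix(name, "Fir"))     # prefix of "First"
print(matches_word_prefix(name, "ackson"))  # only occurs mid-word
```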
Postgres has two index types to help with full text searches: GIN and GIST indexes (and I think GIN is the more commonly used one).
There is a brief overview of the indexes in the documentation. There is more extensive documentation for each index class, as well as plenty of blogs on the subject (here is one and here is another).
These can speed up the searches that you are trying to do.
The pg_trgm module does exactly what you want.
You need to create either:
CREATE INDEX people_name_idx ON people USING GIST (name gist_trgm_ops);
Or:
CREATE INDEX people_name_idx ON people USING GIN (name gin_trgm_ops);
See the difference here.
After that, these queries could use one of the indexes above:
SELECT * FROM people WHERE name ILIKE '%Jack%';
SELECT * FROM people WHERE name ~* '\mJack';
As @GordonLinoff answered, full text search is also capable of prefix matching. But FTS is not designed to do that efficiently; it is best at matching lexemes. Still, if you want to achieve the best performance, I advise you to give it a try too and measure each. In FTS, your query looks something like this:
SELECT * FROM people WHERE to_tsvector('english', name) @@ to_tsquery('english', 'Jack:*');
Note: if your query filter (Jack) comes from user input, all of the queries above need some protection. In the ILIKE one you need to escape the % and _ characters; in the regexp one you need to escape a lot more; and in the FTS one you'll need to parse the input with some parser and generate a valid tsquery, because to_tsquery() will give you an error if its parameter is not valid. (And with plainto_tsquery() you cannot use a prefix-matching query.)
Note 2: the regexp variant with name ~* '\mJack' will work best with English names. If you want to use the whole range of Unicode (i.e. you want to use characters like æ), you'll need a slightly different pattern. Something like:
SELECT * FROM people WHERE name ~* '(^|\s|,)Jack';
This will work with most of the names, plus this will work like a real prefix match with some old names too, like O'Brian.
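The escaping mentioned in the first note could be done on the application side before the pattern is sent as a bound parameter. The following Python sketch assumes the default LIKE escape character (backslash) and relies on the Postgres regex rule that a backslash followed by a non-alphanumeric character matches that character literally. escape_like and escape_regex are hypothetical helper names, not part of any driver.

```python
def escape_like(text: str, escape_char: str = "\\") -> str:
    """Escape LIKE/ILIKE wildcards (% and _) and the escape character itself."""
    out = []
    for ch in text:
        if ch in (escape_char, "%", "_"):
            out.append(escape_char)
        out.append(ch)
    return "".join(out)


def escape_regex(text: str) -> str:
    """Backslash-escape every non-alphanumeric character for use in a regex."""
    return "".join(ch if ch.isalnum() else "\\" + ch for ch in text)


user_input = "50%_off"
like_pattern = "%" + escape_like(user_input) + "%"   # for: name ILIKE <param>
regex_pattern = r"\m" + escape_regex("O'Brian")      # for: name ~* <param>
print(like_pattern)
print(regex_pattern)
```

The resulting strings are meant to be passed as query parameters, not concatenated into the SQL text.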
You can use regular expressions to find text inside name:
create table ci(id int, name text);
insert into ci values
(1, 'John McEnroe Blackbird Petrus'),
(2, 'Michael Jackson and Blade');
select id, name
from ci
where name ~ 'Pe+'
;
Returns:
1 John McEnroe Blackbird Petrus
Or you can use something similar: where substring(name, <regex exp>) is not null
Check it here: http://rextester.com/LHA16094
If you know that the words are space separated, you can do:
SELECT * FROM people WHERE LOWER(name) LIKE LOWER('Jack%') OR LOWER(name) LIKE LOWER('% Jack%');
For more control you can use RegEx with MySQL
see https://dev.mysql.com/doc/refman/5.7/en/regexp.html

How to retrieve keys in large REDIS databases using SCAN

I have a large redis database where I query keys using SCAN using the syntax:
SCAN 0 MATCH *something* COUNT 50
I get the result
1) "500000"
2) (empty list or set)
but the key is there. If I call SCAN again with the new cursor from 1), at some point I get the result.
I was under the impression MATCH would return matching keys up to the maximum number specified by COUNT, but it seems Redis scans COUNT keys and returns only those that match.
Am I missing something? How can I do: "give me the first (count) keys that match the match"?
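This matches the documented behaviour: COUNT is a hint for how much work a single call does, and MATCH is applied to the elements after they are retrieved, so a call can return an empty list with a non-zero cursor. A minimal Python sketch of the usual loop follows: keep calling with the returned cursor until enough matches are collected or the cursor comes back as 0. first_matches and make_fake_scan are hypothetical names; the fake scan only imitates the (cursor, keys) return shape so the example runs without a server.

```python
from fnmatch import fnmatchcase
from typing import Callable, List, Tuple

# Shape assumed for the scan callable: (cursor, pattern, count) -> (next_cursor, keys)
ScanFn = Callable[[int, str, int], Tuple[int, List[str]]]


def first_matches(scan: ScanFn, pattern: str, limit: int, count: int = 50) -> List[str]:
    """Follow the cursor until `limit` distinct matches are found or the scan ends."""
    found: List[str] = []
    seen = set()
    cursor = 0
    while True:
        cursor, keys = scan(cursor, pattern, count)
        for key in keys:
            if key not in seen:  # SCAN may return the same key more than once
                seen.add(key)
                found.append(key)
        if cursor == 0 or len(found) >= limit:
            return found[:limit]


def make_fake_scan(all_keys: List[str]) -> ScanFn:
    """In-memory stand-in for SCAN: inspects `count` keys per call, then filters."""
    def scan(cursor: int, pattern: str, count: int) -> Tuple[int, List[str]]:
        batch = all_keys[cursor:cursor + count]
        next_cursor = cursor + count
        if next_cursor >= len(all_keys):
            next_cursor = 0  # a cursor of 0 signals that the iteration is complete
        return next_cursor, [k for k in batch if fnmatchcase(k, pattern)]
    return scan


keys = [f"key:{i}" for i in range(1000)] + ["has-something-inside"]
result = first_matches(make_fake_scan(keys), "*something*", limit=1)
print(result)
```

With redis-py, the callable could wrap the client, for example lambda c, m, n: r.scan(cursor=c, match=m, count=n), assuming r is a connected client; keys come back as bytes unless the client decodes responses.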

How can I find matches between two tables using Hive?

I am trying to find matches between two tables of URLs using Hive:
blacklist.url   siem.url
a.com           d.fr
b.net           f.es
c.ru            a.com
...             ...
When using:
SELECT blacklist.url FROM blacklist
INNER JOIN siem ON (blacklist.url = siem.url);
I get no match (the only case where I have a match is when I put "a.com" on the same row of the two tables, e.g. when the siem table looks like {a.com,...,...} in my example).
So I was thinking I could use a nested loop of this form:
for each line1 in blacklist do
    for each line2 in siem do
        if line1 = line2
            then print line1
I couldn't find any documentation in the Apache LanguageManual for nested loops, and very little on conditional statements, so if anyone has an idea it would be a great help.
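For reference, an inner join on equality already expresses the nested loop above, independent of row position. The Python sketch below shows both forms returning the same result on the sample data. It does not diagnose why the Hive query returns nothing; one assumption worth checking is invisible differences in the values (trailing whitespace, case, or a delimiter issue when loading), which the hypothetical normalize helper illustrates.

```python
blacklist = ["a.com", "b.net", "c.ru"]
siem = ["d.fr", "f.es", "a.com"]

# The nested loop from the question, written out literally.
nested = [b for b in blacklist for s in siem if b == s]

# What an equality join computes: membership lookup instead of a double loop.
siem_set = set(siem)
joined = [b for b in blacklist if b in siem_set]


def normalize(url: str) -> str:
    """Hypothetical cleanup: strip surrounding whitespace and lowercase."""
    return url.strip().lower()


# If the loaded values carry stray whitespace, an exact comparison finds nothing.
dirty_siem = ["d.fr ", "f.es ", "a.com "]
exact = [b for b in blacklist if b in set(dirty_siem)]
cleaned = [b for b in blacklist if normalize(b) in {normalize(s) for s in dirty_siem}]

print(nested, joined, exact, cleaned)
```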

Postgres preferring costly ST_Intersects() over cheap index

I'm executing a rather simple query on a full planet dump of OSM using Postgres 9.4. What I want to do is fetch all ways which belong to the A8 autobahn in Germany. In a preparation step, I've created multipolygons for all administrative boundary relations and stored them in the table polygons so I can do an easier spatial intersection test. To allow for fast query processing, I also created an index for the 'ref' hstore tags:
CREATE INDEX idx_ways_tags_ref ON planet_20141222.ways USING btree (lower(tags->'ref'));
Additionally, I have already obtained the id of the administrative boundary of Germany by a previous query (result id = 51477).
My db schema is the normal API 0.6 schema, the data was imported via the dump approach into Postgres (using the pgsnapshot_schema_0.6*.sql scripts which come with osmosis). VACUUM ANALYZE was also performed for all tables.
The problematic query looks like this:
SELECT DISTINCT wy.id FROM planet_20141222.ways wy, planet_20141222.polygons py
WHERE py.id=51477 AND ST_Intersects(py.geom,wy.linestring) AND ((wy.tags->'highway') is not null) AND (lower(wy.tags->'ref') like lower('A8'));
The runtime of this query is terrible because Postgres prefers the costly ST_Intersects() test over the cheap (and highly selective) index on 'ref'. When removing the intersection test, the query returns in some milliseconds.
What can I do so that Postgres first evaluates the parts of the query where an index exists instead of testing each way in the entire planet for an intersection with Germany?
My current solution is to split the SQL query into two separate queries. The first does the index-supported tag tests and the second does the spatial intersection test. I suppose that Postgres can do better, but how?
Edit:
a) the OSM 0.6 import scripts create the following indexes on the ways table:
CREATE INDEX idx_ways_bbox ON ways USING gist (bbox);
CREATE INDEX idx_ways_linestring ON ways USING gist (linestring);
b) Additionally, I created another index on polygons:
CREATE INDEX polygons_geom_tags on polygons using gist(geom, tags);
c) The EXPLAIN ANALYZE output of the query without ST_Intersects() looks like this:
"Index Scan using ways_tags_ref on ways (cost=0.57..4767.61 rows=1268 width=467) (actual time=0.064..0.267 rows=60 loops=1)"
" Index Cond: (lower((tags -> 'ref'::text)) = 'a8'::text)"
" Filter: (((tags -> 'highway'::text) IS NOT NULL) AND (lower((tags -> 'ref'::text)) ~~ 'a8'::text))"
" Rows Removed by Filter: 5"
"Total runtime: 0.300 ms"
The runtime of the query with ST_Intersects() is more than 15 minutes, so I cancelled it.
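Maybe try something like this. In Postgres 9.4 a CTE acts as an optimization fence, so the index-supported tag filter is evaluated first and ST_Intersects() only runs on the few ways that remain: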
WITH wy AS (
SELECT * FROM planet_20141222.ways
WHERE ((tags->'highway') IS NOT null)
AND (lower(tags->'ref') LIKE lower('A8'))
)
SELECT DISTINCT wy.id
FROM wy, planet_20141222.polygons py
WHERE py.id=51477
AND ST_Intersects(py.geom,wy.linestring);

PostgreSQL, find strings differ by n characters

Suppose I have a table like this
id data
1 0001
2 1000
3 2010
4 0120
5 0020
6 0002
id is the primary key; data is a fixed-length string whose characters can be 0, 1, or 2.
Is there a way to build an index so I could quickly find strings that differ by n characters from a given string? For example, for string 0001 and n = 1 I want to get row 6.
Thanks.
There is the levenshtein() function, provided by the additional module fuzzystrmatch. It does exactly what you are asking for:
SELECT *
FROM a
WHERE levenshtein(data, '1110') = 1;
But it is not very fast: it is slow with big tables, because it can't use an index.
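One caveat worth checking on sample data: "differ by n characters" for fixed-length strings is the Hamming distance, while levenshtein() also counts insertions and deletions. For equal-length strings and n = 1 the two agree, but for larger n they can differ (for example, 0120 and 1200 differ in three positions but are two edits apart). The Python sketch below is only a reference check against the table in the question; hamming is a hypothetical helper, not a Postgres function.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Sample rows from the question: id -> data
rows = {1: "0001", 2: "1000", 3: "2010", 4: "0120", 5: "0020", 6: "0002"}

# Rows that differ from '0001' in exactly one position.
matches = [row_id for row_id, data in rows.items() if hamming(data, "0001") == 1]
print(matches)
```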
You might get somewhere with the similarity or distance operators provided by the additional module pg_trgm. Those can use a trigram index as detailed in the linked manual pages. I did not get anywhere; the module uses a different definition of "similarity".
Generally the problem seems to fit in the KNN ("k nearest neighbours") search pattern.
If your case is as simple as the example in the question, you can use LIKE in combination with a trigram GIN index, which should be reasonably fast with big tables:
SELECT *
FROM a
WHERE data <> '1110'
AND (data LIKE '_110' OR
data LIKE '1_10' OR
data LIKE '11_0' OR
data LIKE '111_');
Obviously, this technique quickly becomes unfeasible with longer strings and more than 1 difference.
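To see how quickly the pattern list grows, the LIKE patterns can be generated rather than written by hand. This Python sketch puts the _ wildcard at every combination of n positions; like_patterns is a hypothetical helper, and it assumes the data never contains % or _ (true for strings made of 0, 1, 2). Since _ also matches the original character, the patterns cover "at most n differences", which is why the query above excludes the exact string separately.

```python
from itertools import combinations
from typing import List


def like_patterns(value: str, n: int) -> List[str]:
    """All LIKE patterns with '_' substituted at n distinct positions of `value`."""
    patterns = []
    for positions in combinations(range(len(value)), n):
        chars = list(value)
        for i in positions:
            chars[i] = "_"
        patterns.append("".join(chars))
    return patterns


one_off = like_patterns("1110", 1)
two_off = like_patterns("1110", 2)
print(one_off)
print(len(two_off))
```

The number of patterns is the binomial coefficient C(length, n), so a 20-character string with n = 3 would already need 1140 patterns.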
However, since the string is so short, any query will match a rather big percentage of the base table. Therefore, index support will hardly buy you anything. Most of the time it will be faster for Postgres to scan sequentially.
I tested with 10k and 100k rows with and without a trigram GIN index. Since ~ 19% match the criteria for the given test case, a sequential scan is faster and levenshtein() still wins. For more selective queries matching less than around 5 % of the rows (depends), a query using an index is (much) faster.