How can I find matches between two tables using Hive? - hive

I am trying to find matches between two tables of URLs using Hive :
blacklist.url | siem.url
--------------+---------
a.com         | d.fr
b.net         | f.es
c.ru          | a.com
...           | ...
When using :
SELECT blacklist.url FROM blacklist
INNER JOIN siem ON (blacklist.url = siem.url);
I get no match (the only case where I have a match is when I put "a.com" on the same row of the two tables, e.g. when the siem table looks like {a.com,...,...} in my example).
So I was thinking I could use a nested loop of this form:
for each line1 in blacklist do
for each line2 in siem do
if line1 = line2
then print line1
I couldn't find any documentation in the Apache LanguageManual for nested loops and very little on conditional statements, so if anyone has an idea it would be of great help.
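For what it's worth, the INNER JOIN above already computes exactly what that nested loop describes, so when it returns nothing the values themselves usually differ in some invisible way (trailing whitespace, case, a scheme prefix). A minimal sketch, assuming the mismatch comes only from whitespace or case:
SELECT b.url
FROM blacklist b
JOIN siem s
  ON lower(trim(b.url)) = lower(trim(s.url));  -- normalize both sides before comparing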

Related

Parsing location from a search query in postgresql

I have a table of location data that is stored in json format with an attributes column that contains data as below:-
{
"name" : "Common name of a place or a postcode",
"other_name":"Any aliases",
"country": "country"
}
This is indexed as follows:-
CREATE INDEX location_jsonb_ts_vector
ON location
USING gin (jsonb_to_tsvector('simple'::regconfig, attributes,'["string","numeric"]'::jsonb));
I can search this for a location using the query:-
SELECT *
FROM location
WHERE jsonb_to_tsvector('simple'::regconfig, attributes, '["string", "numeric"]'::jsonb) @@ plainto_tsquery('place name')
This works well if just using place names. But I want to search using more complex text strings such as:-
'coffee shops with wifi near charing cross'
'all restaurants within 10 miles of swindon centre'
'london nightlife'
I want to get the location found first and then strip it from the search text and go looking for the items in other tables using my location record to narrow down the scope.
This does not work with my current search mechanism as the intent and requirement pollute the text search vector and can cause odd results. I know this is a NLP problem and needs proper parsing of the search string, but this is for a small proof of concept and needs to work entirely in postgres via SQL or PL/PGSQL.
How can I modify my search to get better matches? I've tried splitting the text into keywords and looking for them individually, but they risk not bringing back results unless combined. For example, "Kings Cross" will bring back "Kings".
I've come up with a cheap and cheerful solution:-
WITH tsv AS (
SELECT to_tsquery('english', 'football | matches | in | swindon') AS search_vector,
'football matches in swindon' AS search_text
)
SELECT * FROM
(
SELECT attributes,
position(lower(ATTRIBUTES->>'name1') IN lower(search_text)) AS name1_position
FROM location,tsv
WHERE jsonb_to_tsvector('simple'::regconfig, attributes, '["string", "numeric"]'::jsonb) @@ search_vector
) loc
ORDER BY name1_position DESC
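If the '|'-separated query has to be built from the raw search text rather than hand-written, a rough sketch (assuming plain space-separated words; stop words such as 'in' are dropped by the 'english' configuration anyway) could be:
-- builds 'football | matches | in | swindon' and parses it into a tsquery
SELECT to_tsquery('english',
         array_to_string(string_to_array('football matches in swindon', ' '), ' | ')) AS search_vector;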

SQL query to find a few strings in different columns of a table row (restrictive)

I have a table like this one (in a SQL SERVER):
field_name | field_descriptor | tag1 | tag2  | tag3 | tag4 | tag5
-----------+------------------+------+-------+------+------+-----
house      | your home        | home | house | null | null | null
car        | first car        | car  | wheel | null | null | null
...        | ...              | ...  | ...   | ...  | ...  | ...
I'm developing a wiki with a search bar, which should be able to handle a query with more than one search string. When a user enters a second string (space-separated), the query should return only the results that match both strings (if any exist) in any column, and likewise for a three-string search.
Easy to do for one string with a simple SELECT with ORs.
I tried it in the frontend in JS with libraries like match-sorter, but it's heavy with a table of more than 100,000 rows, and there will be more in the future.
I thought the query should do the heavy work, but maybe there is no simple way doing it.
Thanks in advance!
I tried to do the heavy work with all results in the frontend, with filtering and libraries like match-sorter. It works but takes several seconds and blocks the front end.
I also tried to create a simple OR/AND query, but the number of possibilities with 3 search strings (could be 1, 2 or 3), each matching any column, is overwhelming.
You can use STRING_SPLIT to get a separate row per search word from the search words string. Then only select rows where all search words have a match.
The query should look like this:
select *
from mytable t
where exists
(
select null
from (select value from string_split(@search, ' ')) search
having min(case when search.value in (t.tag1, t.tag2, t.tag3, t.tag4, t.tag5) then 1 else 0 end) = 1
);
Unfortunately, SQL Server seems to have a flaw (or even a bug) here and reports:
Msg 8124 Level 16 State 1 Line 8
Multiple columns are specified in an aggregated expression containing an outer reference. If an expression being aggregated contains an outer reference, then that outer reference must be the only column referenced in the expression.
Demo: https://dbfiddle.uk/kNL1PVOZ
I don't have more time at hand right now, so you may use this query as a starting point to get the final query.
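One possible way around that restriction (an untested sketch, keeping the same @search variable and table/column names as above) is to compare the number of matched search words against the total number of search words, so that no outer reference appears inside an aggregated expression:
declare @search varchar(100) = 'home house';

select *
from mytable t
where (select count(*)
       from string_split(@search, ' ') s
       where s.value in (t.tag1, t.tag2, t.tag3, t.tag4, t.tag5))  -- words that hit a tag column
    = (select count(*) from string_split(@search, ' '));           -- total words: all must match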

Index for comparing to beginning of every word in a column

So I have a table
id | name | gender
---+-----------------+-------
0 | Markus Meskanen | M
1 | Jack Jackson | M
2 | Jane Jackson | F
And I've created an index
CREATE INDEX people_name_idx ON people (LOWER(name));
And then I query with
SELECT * FROM people WHERE name LIKE LOWER('Jack%');
Where 'Jack' stands in for the user's input. However, it now matches only the beginning of the whole column, but I'd like it to match the beginning of any of the words. I'd prefer not to use '%Jack%' since it would also return invalid results from the middle of a word.
Is there a way to create an index so that each word gets a separate row?
Edit: If the name is something long like 'Michael Jackson's First Son Bob' it should match to the beginning of any of the words, i.e. Mich would match to Michael and Fir would match to First but ackson wouldn't match to anything since it's not from the beginning.
Edit 2: And we have 3 million rows so performance is an issue, thus I'm looking at indexes mostly.
Postgres has two index types to help with full text searches: GIN and GIST indexes (and I think GIN is the more commonly used one).
There is a brief overview of the indexes in the documentation. There is more extensive documentation for each index class, as well as plenty of blogs on the subject (here is one and here is another).
These can speed up the searches that you are trying to do.
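As a rough illustration of that approach (the index name here is arbitrary, and the 'english' configuration is an assumption), a full-text GIN index plus a prefix query might look like:
-- expression index over the tsvector of the name column
CREATE INDEX people_name_fts_idx ON people USING GIN (to_tsvector('english', name));

-- prefix match on any word in the name
SELECT * FROM people
WHERE to_tsvector('english', name) @@ to_tsquery('english', 'jack:*');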
The pg_trgm module does exactly what you want.
You need to create either:
CREATE INDEX people_name_idx ON people USING GIST (name gist_trgm_ops);
Or:
CREATE INDEX people_name_idx ON people USING GIN (name gin_trgm_ops);
See the difference here.
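Note that both of these require the pg_trgm extension to be installed in the database first:
CREATE EXTENSION IF NOT EXISTS pg_trgm;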
After that, these queries could use one of the indexes above:
SELECT * FROM people WHERE name ILIKE '%Jack%';
SELECT * FROM people WHERE name ~* '\mJack';
As @GordonLinoff answered, full text search is also capable of searching by prefix matches. But FTS is not designed to do that efficiently; it is best at matching lexemes. Though if you want to achieve the best performance, I advise you to give it a try too & measure each. In FTS, your query looks something like this:
SELECT * FROM people WHERE to_tsvector('english', name) @@ to_tsquery('english', 'Jack:*');
Note: however, if your query filter (Jack) comes from user input, all of these queries need some protection (i.e. in the ILIKE one you need to escape the % and _ characters, in the regexp one you need to escape a lot more, and in the FTS one you'll need to parse the input with some parser & generate a valid tsquery, because to_tsquery() will give you an error if its parameter is not valid, and with plainto_tsquery() you cannot use a prefix-matching query).
Note 2: the regexp variant with name ~* '\mJack' will work best with English names. If you want to use the whole range of Unicode (i.e. you want to use characters like æ), you'll need a slightly different pattern. Something like:
SELECT * FROM people WHERE name ~* '(^|\s|,)Jack';
This will work with most of the names, plus this will work like a real prefix match with some old names too, like O'Brian.
You can use regular expressions to find text inside name:
create table ci(id int, name text);
insert into ci values
(1, 'John McEnroe Blackbird Petrus'),
(2, 'Michael Jackson and Blade');
select id, name
from ci
where name ~ 'Pe+'
;
Returns:
1 John McEnroe Blackbird Petrus
Or you can use something similar, where substring(name, <regex exp>) is not null.
Check it here: http://rextester.com/LHA16094
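That substring variant could look like this (a hedged illustration against the same sample table, reusing the 'Pe+' pattern):
select id, name
from ci
where substring(name from 'Pe+') is not null;  -- non-null only when the pattern matches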
If you know that the words are space-separated, you can do:
SELECT * FROM people WHERE name LIKE LOWER('Jack%') OR name LIKE LOWER('% Jack%');
For more control you can use RegEx with MySQL,
see https://dev.mysql.com/doc/refman/5.7/en/regexp.html
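A rough illustration of that REGEXP idea (assuming MySQL, where REGEXP is case-insensitive for non-binary strings):
-- 'Jack' either at the start of the name or right after a space
SELECT * FROM people WHERE name REGEXP '(^| )Jack';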

SQL query to find records with specific prefix

I'm writing SQL queries and getting tripped up by wanting to solve everything with loops instead of set operations. For example, here's two tables (lists, really - one column each); idPrefix is a subset of idFull. I want to select every full ID that has a prefix I'm interested in; that is, every row in idFull which has a corresponding entry in idPrefix.
idPrefix.ID | idFull.ID
------------+----------
12          | 8
15          | 12
300         | 12-1-1
            | 12-1-2
            | 15
            | 15-1
            | 300
Desired result would be everything in idFull except the value 8. Super-easy with a for each loop, but I'm just not conceptualizing it as a set operation. I've tried a few variations on the below; everything seems to return all of one table. I'm not sure if my issue is with how I'm doing joins, or how I'm using LIKE.
SELECT f.ID
FROM idPrefix AS p
JOIN idFull AS f
ON f.ID LIKE (p.ID + '%')
Details:
Values are varchars, prefixes can be any length but do not contain the delimiter '-'.
This question seems similar, but more complex; this one only uses one table.
Answer doesn't need to be fast/optimized/whatever.
Using SQL Server 2008, but am more interested in conceptual understanding than a flavor-specific query.
Aaaaand I'm coming back to both real coding & SO after ~3 years, so sorry if I'm rusty on any etiquette.
Thanks!
You can join the full table to the prefix table with a LIKE:
SELECT f.ID
FROM idFull f
INNER JOIN idPrefix pre ON f.ID LIKE pre.ID + '%'
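Since the question says prefixes never contain the '-' delimiter, a slightly stricter variant (just a sketch) avoids accidental matches such as prefix '12' against an ID like '125-3' by requiring either an exact match or the delimiter right after the prefix:
SELECT f.ID
FROM idFull f
INNER JOIN idPrefix pre
  ON f.ID = pre.ID              -- exact match, e.g. '15'
  OR f.ID LIKE pre.ID + '-%'    -- prefix followed by the delimiter, e.g. '15-1'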

Neo4j: WITH and HAVING in the same query?

I am trying to understand if I can perform a query with Neo4j that contains both WITH and HAVING clauses. I have this so far:
MATCH (n)-[r:RELATIONSHIP*1..3]->(m)
SET m:LABEL
WITH m
MATCH (m:LABEL)-[r2:RELATIONSHIP]->(q:OTHERLABEL)
WHERE r2.time<100
RETURN p,r2,q;
I'd now need to add to the same query something that in SQL would look like this:
MATCH (n)-[r:RELATIONSHIP*1..3]->(m)
SET m:LABEL
WITH m
MATCH (m:LABEL)-[r2:RELATIONSHIP]->(q:OTHERLABEL)
WHERE r2.time<100
AND WHERE count(q)=3
RETURN m,r2,q;
I know that Cypher doesn't let me use that without using something like the HAVING clause but when I try to add it to my query it conflicts with the previous WITH clause.
Is this feasible or it is too nested that Cypher won't allow me to do it?
You can have as many WITH statements as you want; it is just piping query results from one part to the next. Actually, WITH + WHERE = HAVING:
MATCH (n)-[r:RELATIONSHIP*1..3]->(m)
SET m:LABEL
WITH m
MATCH (m:LABEL)-[r2:RELATIONSHIP]->(q:OTHERLABEL)
WHERE r2.time<100
WITH m,collect([r2,q]) as paths
WHERE length(paths) = 3
RETURN m,paths;
Btw. I don't know where your p comes from.
Not sure what your reference is for HAVING in Cypher, but that's not the problem with the query.
drop the second WHERE - in Cypher, you WHERE once and then you can expand that with all the binary fun you want
your first filter condition tests individual relationships (r2), but the second tests an aggregate (count(q)). You can't test a flat pattern and an aggregate from the same pattern at the same time
return things that you have actually bound (what is p?)
You may also want to change the second MATCH; m is already bound, but you are re-matching it with the just-created label. All in all, try something like:
MATCH (n)-[r:RELATIONSHIP*1..3]->(m)
SET m:LABEL
WITH m
MATCH (m)-[r2:RELATIONSHIP]->(q:OTHERLABEL)
WHERE r2.time<100
WITH m, collect(r2) as rr, collect(q) as qq
WHERE length(qq) = 3
RETURN m,rr,qq;
for filtering first on the flat relationship r2 and then on the size of the aggregate, or for a flat WHERE .. AND .. try something like:
MATCH (n)-[r:RELATIONSHIP*1..3]->(m)
SET m:LABEL
WITH m
MATCH (m)-[r2:RELATIONSHIP]->(q:OTHERLABEL)
WHERE r2.time<100 AND q.someProp = 10
RETURN m,r2,q;