PostgreSQL - Query keyword patterns in columns in table

PostgreSQL - Query keyword patterns in columns in table - sql

we all know in SQL we can query a column (lets say, column "breeds") for a certain word like "dog" via a query like this:
select breeds
from myStackOverflowDBTable
where breeds = 'dog'
However, say I had many more columns with much more data, say millions of records, and I did not want to find a word, but rather the most common keyword pattern or wildcard expression, a query like this:
SELECT *
FROM myStackOverflowDBTable
WHERE address LIKE '%alb%'"
Is there an efficient way to find these 'patterns' inside the columns using SQL? I need to find the most common substring so-to-speak, per the query above, say the wildcard string "alb" appeared the most in a "location" column that had words like Albany, Albuquerque, Alabama, obviously querying the words directly would yield 0 results but querying on that wildcard keyword pattern would yield many, but I want to find the most repeating or most frequent wildcard/keyword pattern/regex expression/substring (however you want to define it) for a given column - is there an easy way to do this without querying a million test queries and doing it manually???

Well, if you want to find three character patterns, you could extract all 3-character patterns, aggregate and count:
select substr(t.address, gs.i, 3) as ngram_3, count(*)
from t cross join lateral
generate_series(1, length(address) - 3, 1) gs(i)
group by ngram_3
order by count(*) desc
limit 100;

Related

How to rotate a two-column table?

This might be a novice question – I'm still learning. I'm on PostgreSQL 9.6 with the following query:
SELECT locales, count(locales) FROM (
SELECT lower((regexp_matches(locale, '([a-z]{2,3}(-[a-z]{2,3})?)', 'i'))[1])
AS locales FROM users)
AS _ GROUP BY locales
My query returns the following dynamic rows:
locales
count
en
10
fr
7
de
3
n additional locales (~300)...
n-count
I'm trying to rotate it so that locale values end up as columns with a single row, like this:
en
fr
de
n additional locales (~300)...
10
7
3
n-count
I'm having to do this to play nice with a time-series db/app
I've tried using crosstab(), but all the examples show better defined tables with 3 or more columns.
I've looked at examples using join, but I can't figure out how to do it dynamically.

Base query
In Postgres 10 or later you could use the simpler and faster regexp_match() instead of regexp_matches(). (Since you only take the first match per row anyway.) But don't bother and use the even simpler substring() instead:
SELECT lower(substring(locale, '(?i)[a-z]{2,3}(?:-[a-z]{2,3})?')) AS locale
, count(*)::int AS ct
FROM users
WHERE locale ~* '[a-z]{2,3}' -- eliminate NULL, allow index support
GROUP BY 1
ORDER BY 2 DESC, 1
Simpler and faster than your original base query.
About those ordinal numbers in GROUP BY and ORDER BY:
Select first row in each GROUP BY group?
Subtle difference: regexp_matches() returns no row for no match, while substring() returns null. I added a WHERE clause to eliminate non-matches a-priori - and allow index support if applicable, but I don't expect indexes to help here.
Note the prefixed (?i), that's a so-called "embedded option" to use case-insensitive matching.
Added a deterministic ORDER BY clause. You'd need that for a simple crosstab().
Aside: you might need _ in the pattern instead of - for locales like "en_US".
Pivot
Try as you might, SQL does not allow dynamic result columns in a single query. You need two round trips to the server. See;
How do I generate a pivoted CROSS JOIN where the resulting table definition is unknown?
You can use a dynamically generated crosstab() query. Basics:
PostgreSQL Crosstab Query
Dynamic query:
PostgreSQL convert columns to rows? Transpose?
But since you generate a single row of plain integer values, I suggest a simple approach:
SELECT 'SELECT ' || string_agg(ct || ' AS ' || quote_ident(locale), ', ')
FROM (
SELECT lower(substring(locale, '(?i)[a-z]{2,3}(?:-[a-z]{2,3})?')) AS locale
, count(*)::int AS ct
FROM users
WHERE locale ~* '[a-z]{2,3}'
GROUP BY 1
ORDER BY 2 DESC, 1
) t
Generates a query of the form:
SELECT 10 AS en, 7 AS fr, 3 AS de, 3 AS "de-at"
Execute it to produce your desired result.
In psql you can append \gexec to the generating query to feed the generated SQL string back to the server immediately. See:
My function returned a string. How to execute it?

SQL count query number of lines per row

I am trying to count the amount of urls we have in field in sql I have googled but cannot find anything !
So for example this could be in field "url" row 1 / id 1
url/32432
url/32434
So for example this could be field "url" in row 2 / id 2
url/32432
url/32488
url/32477
So if you were to run the query the count would be 5. There is no comma in between them, only space.
Kind Regards
Scott

This is a very bad layout for data. If you have multiple urls per id, then they should be stored as separate rows in another table.
But, sometimes we are stuck with other people's bad design decisions. You can do something like this:
select (length(replace(urls, 'url', 'urlx')) - length(urls)) as num_urls
Note that the specific functions for length() and replace() might vary, depending on the database.

Sql Server Contains search not Giving Result as expected

select * from table1 where contains(searchWord,"*comfort*")
I want result as
Uncomfortable
with search Word in between but it is showing
comfort xyz
only

You do not need contains function here. Searches for precise or fuzzy (less precise) matches to single words and phrases, words within a certain distance of one another, or weighted matches in SQL Server
You need simple predicate for the required result.
select * from table1 where searchWord like '%comfort%'

search criteria difference between Like vs Contains() in oracle

I created a table with two columns.I inserted two rows.
id name
1 narsi reddy
2 narei sia
one is simply number type and another one is CLOB type.So i decided to use indexing on that. I queried on that by using contains.
query:
select * from emp where contains(name,'%a%e%')>0
2 narei sia
I expected 2 would come,but not. But if i give same with like it's given what i wanted.
query:
select * from emp where name like '%a%e%'
ID NAME
1 (CLOB) narsi reddy
2 (CLOB) narei sia
2 rows selected
finally i understood that like is searching whole document or paragraph but contains is looking in words.
so how can i get required output?

LIKE and CONTAINS are fundamentally different methods for searching.
LIKE is a very simple string pattern matcher - it recognises two wildcards (%) and (_) which match zero-or-more, or exactly-one, character respectively. In your case, %a%e% matches two records in your table - it looks for zero or more characters followed by a, followed by zero or more characters followed by e, followed by zero or more characters. It is also very simplistic in its return value: it either returns "matched" or "not matched" - no shades of grey.
CONTAINS is a powerful search tool that uses a context index, which builds a kind of word tree which can be searched using the CONTAINS search syntax. It can be used to search for a single word, a combination of words, and has a rich syntax of its own, such as boolean operators (AND, NEAR, ACCUM). It is also more powerful in that instead of returning a simple "matched" or "not matched", it returns a "score", which can be used to rank results in order of relevance; e.g. CONTAINS(col, 'dog NEAR cat') will return a higher score for a document where those two words are both found close together.

I believe that your CONTAINS query is matching 'narei sia' because the pattern '%a%e%' matches the word 'narei'. It does not match against 'narsi reddy' because neither word, taken individually, matches the pattern.
I assume you want to use CONTAINS instead of LIKE for performance reasons. I am not by any means an expert on CONTAINS query expressions, but I don't see a simple way to do the exact search you want, since you are looking for letters that can be in the same word or different words, but must occur in a given order. I think it may be best to do a combination of the two techniques:
WHERE CONTAINS(name,'%a% AND %e%') > 0
AND name LIKE '%a%e%'
I think this would allow the text index to be used to find candidate matches (anything which has at least one word containing 'a' and at least one word containing 'e'). These would would then be filtered by the LIKE condition, enforcing the requirement that 'a' precede 'e' in the string.

Weighted Keyword Search

Hello: I want to do a "weighted search" on product that are tagged with keywords.
(So: not fulltext search, but n-to-m-relation). So here it is:
Table 'product':
sku - the primary key
name
Table 'keywords':
kid - keyword idea
keyword_de - German language String (e.g. 'Hund','Katze','Maus')
keyword_en - English language String (e.g. 'Dog','Cat','Mouse')
Table 'product_keyword' (the cross-table)
sku \__ combined primary key
kid /
What I want is to get a score for all products that at least "contain" one relevant keyword. If I search for ('Dog','Elephant','Maus') I want that
Dog credits a score of 1.003,
Elephant of 1.002
Maus of 1.001
So least important search term starts at 1.001, everything else 0.001++. That way, a lower score limit of 3.0 would equal "AND" query (all three keywords must be found), a lower score limit of 1.0 would equal an "OR". Anything in between something more or less matching. In particular by sorting according to this score, most relevant search results would be first (regardless of lower limit)...
I guess I will have to do something with
IF( keyword1 == 'dog', 1.001, 0) + IF...
maybe inside a SUM() and probably with a GROUP BY at the end of a JOIN over the cross table, eh? But I am fairly clueless how to tackle this.
What would be feasible, is to get the keyword id's from the keywords beforehand. That's a cheap query. So the keywords table can be left ignored and it's all about the other of the cross and product table...
I have PHP at hand to automatically prepare a fairly lengthy PHP statement, but I would like to avoid further multiple SQL statements. In particular since I will limit the query outcome (most often to "LIMIT 0, 20") for paging mode results, so looping a very large number of in between results through a script would be no good...
DANKESCHÖN, if you can help me on this :-)

I think a lot of this is in the Lucene engine (http://lucene.apache.org/java/docs/index.html), which is available for PHP in the Zend Framework: http://framework.zend.com/manual/en/zend.search.lucene.html.
EDIT:
If you want to do the weighted thing you are talking about, I guess you could use something like this:
select p.sku, sum(case k.keyword_en when 'Dog' then 1001 when 'Cat' then 1002 when 'Mouse' then 1003 else 0 end) as totalscore
from products p
left join product_keyword pk on p.sku = pk.sku
inner join keywords k on k.kid = pk.kid
where k.keyword_en in ('Dog', 'Cat', 'Mouse')
group by p.sku
(Edit 2: forgot the group by clause.)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas