searching in postgres tsvector for a term containing `-` does not work - sql

I have a Postgres tsvector labels field which contains
'hello-world' 'hungry'
Now we let the user search for these terms and bring the matching rows back; for example, they can search for explicit labels.
Every time a user searches for a term with a - in it, nothing is returned.
does not work:
SELECT
labels
FROM tbt
WHERE labels @@ to_tsquery('hello-world | jobs')
works:
SELECT
labels
FROM tbt
WHERE labels @@ to_tsquery('hungry | jobs')
How can I make - work in the tsquery search as well?
I would like to achieve this in the query and not have to change any data in the tables.
thanks in advance!

Since you seem to have bypassed the to_tsvector in constructing your tsvector, you should also bypass to_tsquery and generate the tsquery directly. For example, this yields true:
select 'hello-world'::tsvector @@ 'hello-world | foo'::tsquery;
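Applied to the query in the question, the direct cast might look like this (a sketch; it assumes the labels column really stores the lexemes verbatim, as shown above):
SELECT
labels
FROM tbt
WHERE labels @@ 'hello-world | jobs'::tsquery;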


How to make pie chart of these values in Splunk

Have the following query:
index=app (splunk_server_group=bex OR splunk_server_group=default) sourcetype=rpm-web* host=rpm-web* "CACHE_NAME=RATE_SHOPPER" method=GET | stats count(eval(searchmatch("found=true"))) as Hit, count(eval(searchmatch("found=false"))) as Miss
I need to make a pie chart of the two values, "Hit" and "Miss" rates.
The field where it is possible to distinguish the values is Message=[CACHE_NAME=RATE_SHOPPER some_other_strings method=GET found=false], where found can also be true.
Without knowing the structure of your data it's hard to say exactly what you need to do, but:
A pie chart is a single data series, so you need to use a transforming command to generate that single series. PieChart Doc
If you have a field that denotes a hit or miss (you could use an eval statement to create one if you don't already have it), you can use it to create the single series like this.
Let's say this field is called result.
| stats count by result
Here is a link to the documentation for the Eval Command
Good luck, hope you can get the results you're looking for
Since you seem to be concerned only about whether "found" equals either "hit" or "miss", try this:
index=app (splunk_server_group=bex OR splunk_server_group=default) sourcetype=rpm-web* host=rpm-web* "CACHE_NAME=RATE_SHOPPER" method=GET found IN("hit","miss")
| stats count by found
Pie charts require a single field so it's not possible to graph the Hit and Miss fields in a pie. However, if the two fields are combined into one field with two possible values, then it will work.
index=app (splunk_server_group=bex OR splunk_server_group=default) sourcetype=rpm-web* host=rpm-web* "CACHE_NAME=RATE_SHOPPER" method = GET
| eval result=if(searchmatch("found=true"), "Hit", "Miss")
| stats count by result

How can I put several extracted values from a Json in an array in Kusto?

I'm trying to write a query that returns the vulnerabilities found by "Built-in Qualys vulnerability assessment" in log analytics.
It was all going smoothly: I was getting the values from the properties JSON and turning them into separate strings, but I found out that some of the properties hold more than one value, and I need to get all of them in a single cell.
My query is like this right now
securityresources | where type =~ "microsoft.security/assessments/subassessments"
| extend assessmentKey=extract(@"(?i)providers/Microsoft.Security/assessments/([^/]*)", 1, id), IdAzure=tostring(properties.id)
| extend IdRecurso = tostring(properties.resourceDetails.id)
| extend NomeVulnerabilidade=tostring(properties.displayName),
Correcao=tostring(properties.remediation),
Categoria=tostring(properties.category),
Impacto=tostring(properties.impact),
Ameaca=tostring(properties.additionalData.threat),
severidade=tostring(properties.status.severity),
status=tostring(properties.status.code),
Referencia=tostring(properties.additionalData.vendorReferences[0].link),
CVE=tostring(properties.additionalData.cve[0].link)
| where assessmentKey == "1195afff-c881-495e-9bc5-1486211ae03f"
| where status == "Unhealthy"
| project IdRecurso, IdAzure, NomeVulnerabilidade, severidade, Categoria, CVE, Referencia, status, Impacto, Ameaca, Correcao
Ignore the awkward names of the columns, for they are in Portuguese.
As you can see in the "Referencia" and "CVE" columns, I'm able to extract the value at a specific index of the array, but I want all the links in the whole array.
Without sample input and expected output it's hard to understand what you need, so trying to guess here...
I think that summarize make_list(...) by ... will help you (see this to learn how to use make_list)
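For example, a minimal sketch of that approach (guessing at the data shape, since no sample input was given): expand the cve array with mv-expand, extract each link, then collapse the links back into one list per record with make_list:
securityresources
| where type =~ "microsoft.security/assessments/subassessments"
| extend IdAzure = tostring(properties.id)
| mv-expand cveEntry = properties.additionalData.cve
| extend CVE = tostring(cveEntry.link)
| summarize CVEs = make_list(CVE) by IdAzure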
If this is not what you're looking for, please delete the question, and post a new one with minimal sample input (using datatable operator), and expected output, and we'll gladly help.

Full text search returning too many irrelevant results and causing poor performance

I'm using the full text search feature from Postgres and for the most part it works fine.
I have a column in my database table called documentFts that is basically the tsvector version of the body field, which is a text column, and it's indexed with a GIN index.
Here's my query:
select
count(*) OVER() AS full_count,
id,
url,
(("urlScore" / 100) + ts_rank("documentFts", websearch_to_tsquery($4, $1))) as "finalScore",
ts_headline('english_unaccent', title, websearch_to_tsquery($4, $1)) as title,
ts_headline('english_unaccent', body, websearch_to_tsquery($4, $1)) as body,
"possibleEncoding",
"responseYear"
from "Entries"
where
"language" = $3 and
"documentFts" @@ websearch_to_tsquery($4, $1)
order by (("urlScore" / 100) + ts_rank("documentFts", websearch_to_tsquery($4, $1))) desc limit 20 offset $2;
The configuration is english_unaccent; I created it based on english, using the unaccent extension:
CREATE TEXT SEARCH CONFIGURATION english_unaccent (
COPY = english
);
ALTER TEXT SEARCH CONFIGURATION english_unaccent
ALTER MAPPING FOR hword, hword_part, word WITH unaccent,
english_stem;
I did the same for other languages.
And then I did this to my Entries db:
ALTER TABLE "Entries"
ADD COLUMN "documentFts" tsvector;
UPDATE
"Entries"
SET
"documentFts" = (setweight(to_tsvector('english_unaccent', coalesce(title, '')), 'A') || setweight(to_tsvector('english_unaccent', coalesce(body, '')), 'C'))
WHERE
"language" = 'english';
I have a column in my table with the language of the entry, hence the "language" = 'english'.
So, the problem I'm having is that for words like animal, anime or animation, they all go into the vector as anim, which means that if I search for any of those words I get results with all of those variations.
That returns a HUGE dataset, which makes the query quite slow compared to searches that return fewer items. Also, if I search for Anime, my first results contain Animal and Animated, and the first result that actually has the word Anime is the 12th one.
Shouldn't animation be transformed to animat in the vector and animal just be animal as the other variations for it are animals or animalia?
I've been searching for a solution to this without much luck. Is there any way I can improve this? I'm happy to install extensions, reindex the column, or whatever else it takes.
There are so many little details to this. The best solution depends on the exact situation and exact requirements.
Two simple options:
Simple tweak 1
If you want rows where title or body contains a word starting with 'Anime' (exactly), matched case-insensitively, to sort first, add an ORDER BY expression like:
ORDER BY unaccent(concat_ws(' ', title, body)) !~* ('\m' || f_regexp_escape($1))
, (("urlScore" / 100) + ts_rank("documentFts", websearch_to_tsquery($4, $1))) DESC
Since !~* means "does not match", matching rows evaluate to false, and false sorts before true. The auxiliary function f_regexp_escape() escapes special regexp characters and is defined here:
Escape function for regular expression or LIKE patterns
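If you can't follow the link, here is a simple variant of that helper (my own sketch, not the linked definition: it just backslash-escapes every character that is not alphanumeric, which is safe in Postgres regular expressions):
-- escape regexp metacharacters by prefixing each non-alphanumeric character with a backslash
CREATE OR REPLACE FUNCTION f_regexp_escape(text)
  RETURNS text
  LANGUAGE sql IMMUTABLE STRICT AS
$$SELECT regexp_replace($1, '([^A-Za-z0-9_])', '\\\1', 'g')$$;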
That expression is rather expensive, but since it's only applied to filtered results, the effect is limited.
You may have to fine-tune, as other search terms present other difficulties. Think of 'body' / 'bodies' stemming to 'bodi' ...
Simple tweak 2
To remove English stemming completely, base yours on the 'simple' TEXT SEARCH CONFIGURATION:
CREATE TEXT SEARCH CONFIGURATION simple_unaccent (
COPY = simple
);
Etc.
Then the actual language of the text is irrelevant. The index gets substantially bigger, and the search is done on literal spellings. You can now widen the search with prefix matching like:
WHERE "documentFts" @@ to_tsquery('simple_unaccent', $1 || ':*')
Again, you'll have to fine-tune. The simple example only works for single-word patterns. And I doubt you want to get rid of stemming altogether. Probably too radical.
See:
Get partial match from GIN indexed TSVECTOR column
Proper solution: Synonym dictionary
You need access to the installation drive of the Postgres server for this. So typically not possible with most hosted services.
To overrule some of the stemmer's decisions, add your own set of synonym rules. Create a mapping file $SHAREDIR/tsearch_data/my_synonyms.syn; that's /usr/share/postgresql/13/tsearch_data/my_synonyms.syn in my Linux installation.
Let it contain (case insensitive by default):
anime anime
Then:
CREATE TEXT SEARCH DICTIONARY my_synonym (
TEMPLATE = synonym,
SYNONYMS = my_synonyms
);
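A quick sanity check that the dictionary is picked up (assumes the file and dictionary above are in place):
SELECT ts_lexize('my_synonym', 'Anime');  -- returns {anime}
In the configuration below, my_synonym is consulted before english_stem, so 'Anime' never reaches the stemmer.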
There is a chapter with instructions in the manual. One quote:
A synonym dictionary can be used to overcome linguistic problems, for example, to prevent an English stemmer dictionary from reducing the word “Paris” to “pari”. It is enough to have a Paris paris line in the synonym dictionary and put it before the english_stem dictionary.
Then:
CREATE TEXT SEARCH CONFIGURATION my_english_unaccent (
COPY = english
);
ALTER TEXT SEARCH CONFIGURATION my_english_unaccent
ALTER MAPPING FOR hword, hword_part, word
WITH unaccent, my_synonym, english_stem; -- added my_synonym!
You have to update your column "documentFts" to use my_english_unaccent. While you're at it, use a proper lower-case column name like document_fts, and consider a GENERATED column (a sketch follows the links below). See:
Computed / calculated / virtual / derived columns in PostgreSQL
Are PostgreSQL column names case-sensitive?
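For example, a generated column could look like this (a sketch; assumes Postgres 12 or later and the my_english_unaccent configuration from above, with empty-string fallbacks for NULL):
ALTER TABLE "Entries"
  ADD COLUMN document_fts tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('my_english_unaccent'::regconfig, coalesce(title, '')), 'A') ||
    setweight(to_tsvector('my_english_unaccent'::regconfig, coalesce(body, '')), 'C')
  ) STORED;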
Now, searching for Anime (or ánime, for that matter) won't find animal any more. And searching for animal won't find Anime.

group by part of url using regex splunk

I have multiple URLs that all start with /api/net. I want to group by the next couple of path segments separated by /, like:
/api/net/abc/def?key=value
/api/net/c/d?key1=value1
/api/net/j/h?key2=value2
I have the below regular expression, which parses all URLs, but I explicitly have to specify the required paths in the regular expression.
| rex field=requestPath "(?<volga>.+?(\/abc\/def)|(\/c\/d)|(\/j\/h).+?)"
volga is a named capturing group. I want to do a group by on volga without adding /abc/def, /c/d, /j/h to the regular expression, so that I can see how many distinct paths there are instead of hard-coding them.
There are other paths I wouldn't know to add, so I want to group by the next two segments split by / after "net" and ignore the rest of the URL. Let me know if you did not understand; I can explain more.
If I understand the question correctly, this regex will parse the URL and return the two path segments as 'dom1' and 'dom2', respectively. Then you can group/sort on them.
... | rex field=requestPath "\/api\/net\/(?<dom1>[^\/]+)\/(?<dom2>[^\/\?]+)"
| stats values(*) as * by dom1,dom2

regular expression to pull words beginning with #

Trying to parse an SQL string and pull out the parameters.
Ex: "select * from table where [Year] between #Yr1 and #Yr2"
I want to pull out "#Yr1" and "#Yr2"
I have tried many patterns, but none has worked, such as:
matches = Regex.Matches(sSQL, "\b#\w*\b")
and
matches = Regex.Matches(sSQL, "\b\#\w*\b")
Any help?
The word boundary you need is after the #, not before it: between a space and a # there is no boundary, so \b#\w*\b never matches these parameters. Maybe this:
\W(#[A-Za-z0-9]+)
or
\W(#[^\s]+)
I would have gone with
/(?:^|\s)(#\w+)(?=\s|$)/
(the alternation needs grouping — a bare /^|\s(#\w+)\s|$/ parses as three separate alternatives — and a lookahead instead of a trailing \s keeps a parameter at the very end of the string matchable), or if you didn't want to include the #
/(?:^|\s)#(\w+)(?=\s|$)/
though I also like Joel's above, so maybe one of these
/(?:^|\s)(#[^\s]+)(?=\s|$)/
/(?:^|\s)#([^\s]+)(?=\s|$)/
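For example, using the capture-group variant in VB.NET (a sketch; sSQL is the question's sample string):
Imports System.Text.RegularExpressions

Module Demo
    Sub Main()
        Dim sSQL = "select * from table where [Year] between #Yr1 and #Yr2"
        ' # must be preceded by start-of-string or whitespace; capture # plus word chars
        For Each m As Match In Regex.Matches(sSQL, "(?:^|\s)(#\w+)")
            Console.WriteLine(m.Groups(1).Value) ' prints #Yr1, then #Yr2
        Next
    End Sub
End Module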