I have a table with Id and Text fields. The Text field holds sentences, averaging 50 words. There are >1,000,000 rows.
This is part of a web app where users need to be able to search through these sentences. Here's the twist though - I need to be able to run a custom search function written in C# that uses Machine Learning instead.
From what I understand, this means I'll have to download the entire database of >1,000,000 rows every time a user makes a search! This seems really inefficient to me.
How would you implement this in the most efficient/fast way possible?
If this is relevant, I'm using EF Core with LINQ .Where(my_custom_search_function), with a PostgreSQL database
I think I've found the solution. PostgreSQL full-text search currently provides two built-in ranking functions. In this case "sorting" in the question and "ranking" here refer to the same thing.
The PostgreSQL docs state:
However, the concept of relevancy is vague and very application-specific. Different applications might require additional information for ranking, e.g., document modification time. The built-in ranking functions are only examples. You can write your own ranking functions and/or combine their results with additional factors to fit your specific needs.
These functions can be any of the four kinds of functions that PostgreSQL supports.
Then they answer this exact question:
Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches.
Credits to @Used_By_Already for pointing me to PostgreSQL full-text search.
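To illustrate why the whole table does not have to be downloaded for every search, here is a minimal Python sketch of the match-then-rank idea the docs describe: a cheap matching step picks candidates, and the expensive custom ranking runs only on those. The in-memory index stands in for the database's full-text index, and `custom_score` is a hypothetical placeholder for the machine-learning function from the question; none of these names come from PostgreSQL or EF Core.

```python
from collections import defaultdict


def build_index(rows):
    """Map each lowercase word to the set of row ids that contain it."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for word in text.lower().split():
            index[word].add(row_id)
    return index


def custom_score(query_words, text):
    """Hypothetical stand-in for the expensive ranking function.

    Here it is simply the fraction of query words present in the text.
    """
    words = set(text.lower().split())
    return sum(1 for w in query_words if w in words) / len(query_words)


def search(rows, index, query, limit=10):
    query_words = query.lower().split()
    # Stage 1: cheap candidate selection (any query word matches).
    candidates = set()
    for word in query_words:
        candidates |= index.get(word, set())
    # Stage 2: run the expensive scoring on the candidates only.
    scored = [(custom_score(query_words, rows[i]), i) for i in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [row_id for _, row_id in scored[:limit]]


rows = {
    1: "the cat sat on the mat",
    2: "a dog chased the cat",
    3: "birds sing at dawn",
}
index = build_index(rows)
print(search(rows, index, "cat mat"))
```

In a real setup the first stage would be the full-text match executed inside the database, so only the matching rows (or only their ranks) ever cross the wire.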
Related
In my project, I have been asked to implement a text query service on the database we are using, PostgreSQL. I have used the PostgreSQL full-text search features, which work fairly well in terms of time. One problem with full-text search is that it does not have fuzzy search abilities. On the other hand, there is an extension named pg_trgm that provides functions and operators for determining the similarity of alphanumeric text. There are also several examples of text search using pg_trgm, like:
select actor
from products
where actor % 'tomy';
And here is an example of PostgreSQL FTS:
SELECT title
FROM pgweb
WHERE to_tsvector(body) @@ to_tsquery('friend');
So, the main question is: what is the difference between these two search strategies? Which one is the more appropriate way to search text? Is it possible to mix them? I should also say that performance is an important concern. Thanks in advance!
They do completely different things. About the only thing they have in common is that they operate on text and can benefit from the use of indexes. From your question, it seems like you already have a good sense of the differences. The appropriate one is the one that does what you want. If one of them was always appropriate, we probably wouldn't have created the other one.
You can mix them, but you will need a different index for each one; they cannot share an index. You probably need different tables as well, as full-text search is more appropriate for sentences or paragraphs, while trigram search is more appropriate for individual words or short phrases.
One way to mix them would be to have one table of full texts, and another table which lists only each distinct word present in any of the full texts. The 2nd table could be used to detect probable typos in the query, and then once those are fixed by suggestions from trigram searching, run the fixed query against the 1st table.
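The two-table idea above can be sketched in Python. This is only an approximation of what pg_trgm does (lowercase the word, pad it with two leading spaces and one trailing space, then compare trigram sets), and the names `suggest` and `fix_query` are made up for illustration. The 0.3 threshold mirrors pg_trgm's default similarity threshold; the vocabulary set plays the role of the second table of distinct words.

```python
def trigrams(word):
    """Trigram set of a single word, padded roughly the way pg_trgm pads it."""
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both words."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def suggest(word, vocabulary, threshold=0.3):
    """Return the most similar known word, or the input if nothing is close."""
    best = max(vocabulary, key=lambda known: similarity(word, known))
    return best if similarity(word, best) >= threshold else word


def fix_query(query, vocabulary):
    """Replace unknown words with their closest known word."""
    return " ".join(
        w if w in vocabulary else suggest(w, vocabulary) for w in query.split()
    )


vocabulary = {"tommy", "hanks", "friend"}
print(fix_query("tomy hanks", vocabulary))
```

The corrected query string is then what you would pass to the full-text search against the first table.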
The difference is huge: in fuzzy search you are searching for similar results, while in full-text search you are searching for exact matches. Whether one is more appropriate than the other depends on the use case.
If you don't need fuzziness, don't use it. It is a huge performance overhead, because instead of only matching the text exactly it also has to try other combinations.
I have used Oracle for the past half year and learned some SQL tuning tricks, but now our DB is moving to Greenplum, and the project manager suggests that we change some of the code that was written in Oracle SQL, for efficiency or for syntax reasons.
I am curious: are SQL tuning techniques the same across different DB engines, like Oracle, PostgreSQL, MySQL and so on? Whether yes or no, why? Any suggestions are welcome!
For example:
in or exists
count(*) or count(column)
use index or not
use exact column instead of select *
For the most part, the syntax will remain the same. There may be small differences from one engine to another, and you may run into different terms for producing some of the more specific output or for doing more complex tasks. To achieve parity you will need to learn those new terms.
As for tuning, this varies from system to system. Specifically, going from Oracle to Greenplum, you are moving from a database where query efficiency is often driven by putting an index on the data to a parallel execution system where efficiency is gained by distributing the data effectively across multiple systems and querying them in parallel. In Greenplum, indexing is an additional layer that usually adds overhead rather than benefit.
Even within a single system, changing the storage engine type can change how a query is best optimized. In practice, queries are often moved to a new platform and work, but are far from optimal because they don't take advantage of that platform's optimizations. I would strongly suggest getting an understanding of the new platform, and you should not assume that a query optimized for one platform is the optimal way to run it on another.
Getting into the specifics of why they differ requires someone who is an expert in both to be able to compare them. I don't claim to know much about Greenplum.
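As a rough illustration of the "distribute, then query in parallel" idea, here is a small Python sketch. It is a toy model only: the modulo on the key stands in for whatever hash the real system applies to its distribution key, threads stand in for separate segment hosts, and the function names are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, key, num_segments):
    """Assign each row to a segment based on its distribution key."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[row[key] % num_segments].append(row)
    return segments


def segment_sum(segment, column):
    """Work done locally on one segment: a partial aggregate."""
    return sum(row[column] for row in segment)


def parallel_sum(segments, column):
    """Run the partial aggregate on every segment, then combine the results."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(lambda seg: segment_sum(seg, column), segments))
    return sum(partials)


rows = [{"id": i, "amount": i * 10} for i in range(1, 9)]
segments = distribute(rows, "id", 4)
print([len(s) for s in segments], parallel_sum(segments, "amount"))
```

The point of the sketch is that an evenly spread distribution key keeps every segment equally busy; a skewed key would leave most of the work on one segment, which an index would not fix.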
The basic principles, which I would expect all developers to learn over time, don't really change. But there are "quirks" of individual engines which make specific differences. From your question, I would personally anticipate 1 and 4 to remain the same.
Indexing is something which does vary. For example, the ability to use two indexes in one query was not (is not?) ubiquitous. I wouldn't like to guess which DBMS can or can't count columns from the second field in a composite index. And the way indexes are maintained is very different from one DBMS to the next.
From my own experience I've also seen differences caused by:
Different capabilities in the data access path. As an example, one optimisation is for a DBMS to create a bitmap of rows (matching and not matching) and then combine multiple bitmaps to select rows. A DBMS with this feature can use multiple indexes in a single query. One without it can't.
Availability of hints / lack of hints. Not all DBMS support them. I know they are very common in Oracle.
Different locking strategies. This is a big one and can really affect update and insert queries.
In some cases DBMS have very specific capabilities for certain types of data such as geographic data or searchable free text (natural language). In these cases the way of working with the data is entirely different from one DBMS to the next.
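The bitmap combination mentioned in the list above can be sketched in a few lines of Python. This is a simplified model, not any particular engine's implementation: each bitmap is an integer with one bit per row position, and the predicate scan stands in for what would really be an index lookup.

```python
def build_bitmap(rows, predicate):
    """Set bit i when the row at position i satisfies the predicate."""
    bitmap = 0
    for position, row in enumerate(rows):
        if predicate(row):
            bitmap |= 1 << position
    return bitmap


def rows_from_bitmap(rows, bitmap):
    """Fetch only the rows whose bit is set."""
    return [row for position, row in enumerate(rows) if (bitmap >> position) & 1]


rows = [
    {"id": 1, "city": "Oslo", "age": 30},
    {"id": 2, "city": "Oslo", "age": 45},
    {"id": 3, "city": "Rome", "age": 30},
]

# One bitmap per condition, as if each came from its own index.
city_bitmap = build_bitmap(rows, lambda r: r["city"] == "Oslo")
age_bitmap = build_bitmap(rows, lambda r: r["age"] == 30)

# AND the bitmaps to find rows that satisfy both conditions.
matches = rows_from_bitmap(rows, city_bitmap & age_bitmap)
print(matches)
```

Using `|` instead of `&` would give the rows matching either condition, which is how the same mechanism covers OR predicates.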
I am writing a prototype of a new app for an enterprise. I want to include a great search engine, which is something they have never had before. What I am looking for is something that can translate a Lucene-style query language into SQL statements on a key-value pair data model (three fields: grouping id, key, value).
I've been looking for a while now and haven't had any luck. I'm about to open the source for Lucene and see if I can pull the query algorithms out and have them generate SQL instead of index search commands, but I'm not very hopeful.
I can't just run Lucene or any other indexing system in this enterprise, for political and regulatory reasons, so that's not an option.
Does this type of system exist?
see if I can pull the query algorithms out and have them generate sql instead
Don't waste your time. SQL and Lucene queries work in a completely different way; this is because they use different underlying data structures, algorithms, etc.
The best you can do is to write an SQL query parser and rewrite those queries into Lucene queries. But you'd have to be naive to think you can write a full-blown SQL query parser. You can easily solve simple cases, but what are you going to do when somebody sends you a JOIN? Or a GROUP BY bar HAVING foo>3?
If you can't jump over political hurdles, just use one of the full text indexing algorithms databases can offer; this is better than nothing.
Suppose I am to search against two types [cars] and [buildings], and I would want the results to be separated. Is there a way one can group results by types?
I understand that one simple way would be to query each type separately, but for other use cases one may actually need to query tens or hundreds of types together. Is there a native way or a hacky way (like using sort) to achieve this?
This type of grouping behavior is (currently) not available in elasticsearch. It has been a long standing request:
https://github.com/elasticsearch/elasticsearch/issues/256
There are two approaches that can help, both of which are far from perfect, but may be good enough for some use cases.
Client-side aggregation. Request a lot more results than you plan on displaying and then bucket those.
Using multi-query. This allows you to easily pass down some number of queries in a single batch, but it will have potential scaling problems if the number of queries gets too large.
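The first approach, client-side aggregation, could look roughly like the Python sketch below. It assumes the hits have already been fetched and arrive sorted by descending score, and that each hit carries `_type`, `_id` and `_score` fields; `bucket_by_type` and the `per_type` limit are names invented for this example, not part of any client library.

```python
from collections import defaultdict


def bucket_by_type(hits, per_type=3):
    """Group hits by their type, keeping the top `per_type` hits of each."""
    buckets = defaultdict(list)
    for hit in hits:  # hits are assumed to be sorted by descending score
        bucket = buckets[hit["_type"]]
        if len(bucket) < per_type:
            bucket.append(hit)
    return dict(buckets)


hits = [
    {"_type": "cars", "_id": "c1", "_score": 2.4},
    {"_type": "buildings", "_id": "b1", "_score": 2.1},
    {"_type": "cars", "_id": "c2", "_score": 1.9},
    {"_type": "cars", "_id": "c3", "_score": 1.2},
    {"_type": "buildings", "_id": "b2", "_score": 0.7},
]

grouped = bucket_by_type(hits, per_type=2)
for type_name, type_hits in grouped.items():
    print(type_name, [h["_id"] for h in type_hits])
```

The weakness noted above still applies: if one type dominates the top of the result list, the other types may end up with few or no hits unless you request a much larger page.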
This is one feature that Solr has that elasticsearch doesn't, but I have never tried it. I used a similar feature with Autonomy IDOL years back, but the performance was abysmal.
If you want the results separated into groups of documents, you're going to have to restructure your documents, since elasticsearch is focused on finding matching documents. You might get around this by designing a document that has child documents; then you can query for matches on the parent document that represents your type.
I guess there might be some common field (let's say it's [price]) if you want to search against different types. Then it would be reasonable to add a separate type like [price_aggregator] and put the fields [type] and [price] into it. You could then easily build your query against just that one type. This requires some additional work while indexing and more memory to store the index, but it's much more performant when you search.
Can you advise on whether I can use just the Query functionality from Lucene to generate SQL queries? Something like an SQLQueryBuilder?
I have a massive SQL database of logs from a webserver cluster containing the original request and response strings plus some other useful/less bits and bobs. What I need to do is analyse the parameters in the original request and compare with the generated responses, looking at ratios, volatility, variability, consistency etc.
This question does not relate to the analysis stage, but only the retrieval of data from database which matches the parameters I'm interested in. So, I could just do this in good old sql queries, manually building the exact queries I need on a case-by-case basis. But that's kinda lame; I reckon we can be a bit smarter than that. Particularly as I can already see large numbers of similar but subtly different queries being useful. And as I'm hoping that I can expose a single search box via a web interface to non-technical end-users, adding sql queries seems like a bad idea... and a recipe for permanent maintenance requests (and can I be the first to say, er no thanks!).
In an ideal world I expose a search form, with the option to write simple queries like
request:"someAttribute=\"someValue\"" AND response="some hoped for result" AND daterange:30
which would then hopefully find all instances of requests which contain someAttribute="someValue" over the last 30 days. The results will then be put through standard statistical analyses on the given response text and printed out on-screen. At least, that's the idea.
Much of the actual logic to determine how to handle custom field definitions or special words I'll need to write myself, and that's ok. And NB, my non-technical end users are familiar enough with xml that they can handle a bit of attr="value" syntax, at least for the first iteration of the tool :D
In summary, I want to:
1) allow users to use google-like search syntax (e.g. via Lucene's QueryAPI) to specify text to match in the logs
2) allow a layer to manipulate the query based on special words or fields (e.g. this layer could be during a Java object phase)
3) convert the final query into an sql query appropriate for my database schema
4) query the database and spit back the resultset for statistical analysis
5) pretty-print on website:)
Am I completely barking up the wrong tree? It looks like it should be possible, but I can't seem to find much on it. I've been googling for a bit on this, for example trying "Lucene SQLQueryBuilder" as a possible start but didn't really find much by way of a lead.
So, my questions are:
Has anyone tried using Lucene's QueryAPI like this before? Did it work? Any gotchas?
Are there better query api libraries out there?
Examples, finished discussions and open-source implementations would be most helpful.
Many thanks.
NB: I don't think I want Lucene's search capabilities as such, as I'm only ever looking for exact matches. I just need a query layer on top of the database.
Lucene and SQL have very little in common, as they use totally different syntax (as HefferWolf mentioned) and different underlying data models. As you said yourself, I'm afraid you're barking up the wrong tree.
There are however attempts, such as Hibernate Search to bridge this gap. These are interesting experiments as such, but I would be very careful to use any of that code in production.
You could possibly use Full Text Search features available in some SQL databases, or reindex all data in Lucene and use it without database.
I doubt you can reuse any code from lucene for this. Lucene does an internal rewrite of such queries but into a syntax which wouldn't be of much help for SQL I think.
name: Phil AND lastname: Miller AND NOT age: 26
would be rewritten to
+name: Phil +lastname: Miller -age: 26
So I think you would have to write your own translation into SQL query syntax.
But maybe you can use Lucene as such for this. Have a look into hibernate-search which is quite handy to easily create a lucene index of a sql table.
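If you do end up writing your own translation, a minimal Python sketch for the simplest case might look like this. It handles only `field: value` terms joined by AND, with an optional leading NOT, and it is not based on any Lucene code. The function name `to_sql`, the `people` table, the whitelist of fields, and the `?` placeholder style are all assumptions made for the example.

```python
import re

# One term: optional NOT, a field name, a colon, then a quoted or bare value.
TERM = re.compile(r'(NOT\s+)?(\w+):\s*("([^"]*)"|\S+)')


def to_sql(query, table, allowed_fields):
    """Translate 'a: x AND NOT b: y' into a parameterized SELECT."""
    clauses, params = [], []
    # Naive split: a quoted value containing " AND " is not supported.
    for part in re.split(r"\s+AND\s+", query.strip()):
        match = TERM.fullmatch(part)
        if not match:
            raise ValueError(f"unsupported term: {part!r}")
        negated, field, raw, quoted = match.groups()
        if field not in allowed_fields:
            # Whitelisting keeps user input out of the SQL text itself.
            raise ValueError(f"unknown field: {field!r}")
        clauses.append(f"{field} {'<>' if negated else '='} ?")
        params.append(quoted if quoted is not None else raw)
    return f"SELECT * FROM {table} WHERE " + " AND ".join(clauses), params


sql, params = to_sql(
    "name: Phil AND lastname: Miller AND NOT age: 26",
    table="people",
    allowed_fields={"name", "lastname", "age"},
)
print(sql)
print(params)
```

Values are passed as parameters rather than spliced into the string, which matters if the search box is exposed to end users. Anything beyond this (OR, grouping, ranges such as the daterange idea in the question) would need a real parser.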