How to get all the records matching a regex in Aerospike? - aerospike

I have millions of records in a set. I would like to retrieve all the records that match the same pattern.
For example, I may have:
id=4444?mode=mode1?fieldA=abc
id=4444?mode=mode1?fieldA=azerty
id=4444?mode=mode1?fieldA=qwerty
id=4444?mode=mode1?fieldA=foo
id=4444?mode=mode1?fieldA=bar
Is it possible to write a query that returns all of the above records without knowing the value of fieldA in advance? Something like this regex:
id=4444?mode=mode1?fieldA=[\w]*
Thanks for your time.

Yes, it can be done. You would first query by a secondary index to narrow the result set to a manageable size, then apply a filter written in Lua that discards the records you don't want. This filter could take the regex you want to match against (passed in dynamically) and return only those records that match.
Whilst this would work, it would not be as performant as the key-value operations in Aerospike. You would definitely want to benchmark such a solution before putting it into production.

Predicate filtering was added in release 3.12 on March 15. You can use the stringRegex method of the PredExp class of the Java client to build complex filters such as the one you mentioned. It also currently exists for the C, C# and Go clients.
There's a similar example in the Aerospike Java client:
Statement stmt = new Statement();
stmt.setNamespace(params.namespace);
stmt.setSetName(params.set);
stmt.setFilter(Filter.range(binName, begin, end));
stmt.setPredExp(
    PredExp.stringBin("bin3"),
    PredExp.stringValue("prefix.*suffix"),
    PredExp.stringRegex(RegexFlag.ICASE | RegexFlag.NEWLINE)
);
The RegexFlag class in com.aerospike.client.query defines the flags that control how the regular expression is interpreted.
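One detail worth checking before handing the pattern from the question to any regex engine: ? is a quantifier, so the literal question marks in the key have to be escaped. Below is a minimal sketch of a client-side check using java.util.regex. It is only an illustration: the class and method names are hypothetical, and the server-side engine behind stringRegex may treat classes such as \w differently, so the pattern should be tested there as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class KeyPatternCheck {
    // Literal '?' characters are escaped; \w* matches any run of word characters.
    static final Pattern KEY_PATTERN =
        Pattern.compile("id=4444\\?mode=mode1\\?fieldA=\\w*");

    // Returns only the candidates that match the whole pattern.
    static List<String> matching(List<String> candidates) {
        List<String> out = new ArrayList<>();
        for (String c : candidates) {
            if (KEY_PATTERN.matcher(c).matches()) {
                out.add(c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
            "id=4444?mode=mode1?fieldA=abc",
            "id=4444?mode=mode2?fieldA=abc",
            "id=5555?mode=mode1?fieldA=foo");
        // Only the first key has both the right id and the right mode.
        System.out.println(matching(keys));
    }
}
```

Without the backslashes, the pattern would treat the preceding 4 and 1 as optional and would never match the literal question marks.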

Related

Keyword based JPA query with statuses as Enums and with NOT clause [Kotlin]

I have a Keyword based JPA query I need to modify in order to exclude records with a particular status. Currently, I have the following:
findAllByLatestVersion_Entity_DataFieldGreaterThanEqualAndLatestVersion_AnotherFieldNull(datefield: Instant, pageable: Pageable)
I do not want to add a parameter, so I would like the query to behave as if there were a WHERE clause stating that the status IS NOT C, for example. I am struggling to find clear documentation on how to go about it. Is it possible to write something along these lines:
findAllByLatestVersion_Entity_DataFieldGreaterThanEqualAndLatestVersion_AnotherFieldNullAndLatestVersion_StatusCNot(datefield: Instant, pageable: Pageable)
Thank you
No, this is not possible with query derivation, i.e. the feature you are using here. And even if it were possible, you shouldn't do it.
Query derivation is intended for simple queries where the name of the repository method that you would choose anyway perfectly expresses everything one needs to know about the query to generate it.
It is not intended as a replacement for JPQL or SQL.
It should never be used when the resulting method name isn't a good method name.
So just formulate the query as a JPQL query and use a @Query annotation to specify it.

How to keep SQL data and Elasticsearch in-sync, and which to search from?

I've seen two solutions mentioned, and was wondering what most people do.
1. Use Logstash
2. Code your application to make writes to Elasticsearch alongside SQL. For example,
public void saveRecord() {
    saveToElasticsearch();
    saveToSQL();
}
Another question is how to handle actually searching the entity? Do you ONLY use Elasticsearch?
If not, I would assume you fetch from Elasticsearch based on keywords and use the IDs returned to filter your SQL query. My question then, is how do you handle pagination? For example let's say you only want results 50 to 100. First you query Elasticsearch which returns 50-100. Then the SQL query reduces that to 20 results - the other 30 results are in what would've been the next Elasticsearch query (100 - 150 for example). Do you keep going back and forth?
As for your first question check here
As for the second question, if you plan to use Elasticsearch as your search layer, then it is better to use it for all the searchable/filterable fields. As you've described, the alternative will get very messy very soon. Use Elasticsearch for all your searches/filters, and even aggregations if it suits your needs. Use the SQL database as your source of truth and just get the full payload from there.
In general, if you need to paginate, your search should live in one place; otherwise it will get ugly.
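The flow described above (search and paginate in one place, then load the full payload by ID from the SQL database) can be sketched roughly as follows. The in-memory list and map stand in for Elasticsearch and the SQL table; every name here is hypothetical and no real client API is shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SearchThenHydrate {
    // Stand-in for the search index: the IDs are assumed to be already
    // filtered and ranked by the search layer; this only cuts out one page.
    static List<Integer> searchPage(List<Integer> rankedIds, int from, int size) {
        int start = Math.min(from, rankedIds.size());
        int end = Math.min(from + size, rankedIds.size());
        return new ArrayList<>(rankedIds.subList(start, end));
    }

    // Stand-in for "SELECT ... WHERE id IN (...)": looks up payloads by ID,
    // keeping the order returned by the search layer.
    static List<String> hydrate(List<Integer> ids, Map<Integer, String> sqlRows) {
        List<String> out = new ArrayList<>();
        for (Integer id : ids) {
            String row = sqlRows.get(id);
            if (row != null) {
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> sqlRows = new LinkedHashMap<>();
        List<Integer> rankedIds = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            sqlRows.put(i, "record-" + i);
            rankedIds.add(11 - i); // search ranking differs from insertion order
        }
        // Second page with 3 results per page: offsets 3..5 of the ranked list.
        System.out.println(hydrate(searchPage(rankedIds, 3, 3), sqlRows));
    }
}
```

Because all filtering happens in the search step, a page never shrinks after the SQL lookup, which avoids the back-and-forth described in the question.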

SQL-ish : how to change enormous code into an elegant one?

I have just one abhorrent table: no index, no keys, no IDs, no order, 25 columns, 19 million rows.
I am using the SQL-ish language named TaQL ("Table Query Language").
I need to select-from-where ... It sounds no problem!
However the WHERE conditions are 1683 sets of simple conditions:
set#1: columnA>num1 and columnB>num2 and columnC<num3 and columnD>=num4 ...
or
set#2: columnA>num189 and columnB>num274 and columnC<num321 and columnD>=num457 ...
or
set#n: ...
or
set#1683: ....
My current code works fine, but the WHERE clause is 1683 lines long. I generated it with awk and regular expressions.
Is there an elegant way to reduce such enormous code?
Have you tried converting the data into a new format that DOES have some normalization to it, so you could import it into your own database? Then you could add indexes, clean up the criteria, and so on. You could even add a new column (or columns) to the end of your table, such as "KeepThis".
Then apply an update along the lines of UPDATE YourTable SET KeepThis = 1 WHERE <your criteria>, maybe even setting the value to the number of the condition set the row qualified under. You could then query based on those values, or look at all the rows NOT assigned a value and see whether any of them have merit that was not previously realized.
It sounds like a big task no matter what, but it might be a nice approach to have things pre-stamped and in a database you can manage, rather than in some other source.
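To illustrate the pre-stamping idea, here is a minimal sketch that keeps the threshold sets as data and records, for each row, the number of the first set it satisfies. It assumes only the four columns and comparison operators shown in the question; the class, the thresholds, and the sample rows are all hypothetical, and a real version would read the 1683 sets from a file instead of hard-coding them.

```java
import java.util.Arrays;
import java.util.List;

class KeepThisStamp {
    // One set of thresholds:
    // columnA > a and columnB > b and columnC < c and columnD >= d.
    static final class ConditionSet {
        final double a, b, c, d;

        ConditionSet(double a, double b, double c, double d) {
            this.a = a;
            this.b = b;
            this.c = c;
            this.d = d;
        }

        boolean matches(double[] row) {
            return row[0] > a && row[1] > b && row[2] < c && row[3] >= d;
        }
    }

    // Returns the 1-based number of the first set the row satisfies, or 0 if none.
    static int keepThis(double[] row, List<ConditionSet> sets) {
        for (int i = 0; i < sets.size(); i++) {
            if (sets.get(i).matches(row)) {
                return i + 1;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        List<ConditionSet> sets = Arrays.asList(
            new ConditionSet(1, 2, 10, 4),
            new ConditionSet(5, 5, 3, 0));
        double[][] rows = { {2, 3, 9, 4}, {6, 6, 2, 0}, {0, 0, 0, 0} };
        for (double[] row : rows) {
            System.out.println(Arrays.toString(row) + " -> KeepThis=" + keepThis(row, sets));
        }
    }
}
```

Rows stamped with 0 are the ones that matched no set, which is the group the answer suggests inspecting separately.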

Hibernate Search - possible to get new Lucene query after facets applied?

A Lucene Query is generated as so:
Query luceneQuery = builder.all().createQuery();
Then facets are applied.
I'm not sure whether, when facets are applied, the luceneQuery is ANDed and ORed with other queries, resulting in a new Lucene Query. Alternatively, perhaps a bunch of BitSets are applied to the original Query to refine the results. (I don't know.)
If a new query is generated I'd like to retrieve it. If not, I need a rethink. That's the crux of the question.
Why:
I'm applying a faceted search on a field with multiple possible values.
E.g. TMovie.class many-to-many TTag.class (multiple-value-facet)
I'm filtering on TMovie where TTag is some value.
Anyway, the filtering works but there is a known problem whereby the Facet-counts returned are incorrect.
Detailed here: Add faceting over multivalued to application using Hibernate Search and https://forum.hibernate.org/viewtopic.php?f=9&t=1010472
I'm using this solution:
http://sujitpal.blogspot.ie/2007/04/lucene-search-within-search-with.html (see comment on new API under article)
The BitSet solution (in this example at least) generates counts based on the original Lucene Query. This works perfectly. However.....
If alternate (different, not TTags) facets are applied to the original query some complications arise.
The Bitset solution calculates on the original Lucene query. It does not calculate on the lucene query now reduced by the application of alternate Facets (a different FacetSelection) (or even TTag Facets themselves for that matter). I.e. the count calculations are irrespective of any other FacetSelection Facets applied.
So...
A. can I get the new Lucene query after facets are applied? The BitSet solution applied to this would be correct.
B. Any other alternative suggestions?
Thanks so much.. All comments welcome.
John
Regarding your first question, applying a facet does not modify the original query; it uses a custom Collector called FacetCollector - see https://github.com/hibernate/hibernate-search/blob/master/engine/src/main/java/org/hibernate/search/query/collector/impl/FacetCollector.java. Under the hood the collector uses a Lucene FieldCache to do the facet count. That is also the root of the limitation for multi-value faceting: FieldCache does not support multiple values per field.
In any case, no additional queries are applied during faceting and the original query is unmodified. The benefit, of course, is performance. The solution you are pointing to probably works as well, but relies on running multiple queries. However, it might be a valid workaround for your use case.
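Since there is no modified query to retrieve, one way to read alternative B is to apply BitSet-style counting to whatever set of documents is current after the other facet selections, rather than to the original query. The following is a minimal sketch of that idea using only java.util.BitSet. It does not use the Hibernate Search or Lucene APIs, and the document IDs, tag names, and helper methods are hypothetical.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

class BitSetFacetCount {
    // Counts, per facet value, the documents that are both in the current
    // result set and tagged with that value.
    static Map<String, Integer> count(BitSet currentResults, Map<String, BitSet> docsByValue) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> e : docsByValue.entrySet()) {
            BitSet overlap = (BitSet) e.getValue().clone(); // keep the source bits intact
            overlap.and(currentResults);
            counts.put(e.getKey(), overlap.cardinality());
        }
        return counts;
    }

    // Builds a BitSet with the given document IDs set.
    static BitSet bits(int... docIds) {
        BitSet b = new BitSet();
        for (int id : docIds) {
            b.set(id);
        }
        return b;
    }

    public static void main(String[] args) {
        Map<String, BitSet> docsByTag = new LinkedHashMap<>();
        docsByTag.put("comedy", bits(0, 1, 2));
        docsByTag.put("drama", bits(1, 3)); // document 1 carries both tags
        BitSet original = bits(0, 1, 2, 3);
        BitSet afterOtherFacet = bits(1, 2, 3); // narrowed by another facet selection
        System.out.println(count(original, docsByTag));
        System.out.println(count(afterOtherFacet, docsByTag));
    }
}
```

A document with several tags is counted once per tag, which is the multi-value behaviour the question is after, and the counts follow whichever result set is passed in.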

RESTful API Design OR Predicates

I'm designing a RESTful API and I'm trying to work out how I could represent a predicate with an OR operator when querying for a resource.
For example if I had a resource Foo with a property Name, how would you search for all Foo resources with a name matching "Name1" OR "Name2"?
This is straightforward when it's an AND operator, as I could do the following:
http://www.website.com/Foo?Name=Name1&Age=19
The other approach I've seen is to post the search in the body.
You will need to pick your own approach, but I can name a few that seem pretty logical (although not without disadvantages):
Option 1: Using the | operator:
http://www.website.com/Foo?Name=Name1|Name2
Option 2: Using a modified query param that selects by any one of a set of values (a comma-separated list of possible values):
http://www.website.com/Foo?Name_in=Name1,Name2
Option 3: Using PHP-like notation to provide a list instead of a single string:
http://www.website.com/Foo?Name[]=Name1&Name[]=Name2
All of the above mentioned options have one huge advantage: they do not interfere with other query params.
But as I mentioned, pick your own approach and be consistent about it across your API.
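As a rough illustration of how Option 2 could be handled on the server, here is a minimal sketch that splits the comma-separated value and applies OR semantics. It assumes the raw value of Name_in has already been extracted and URL-decoded by the web framework; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NameInFilter {
    // Splits the value of the "Name_in" parameter, e.g. "Name1,Name2",
    // ignoring empty entries and surrounding whitespace.
    static Set<String> parseIn(String rawValue) {
        Set<String> values = new HashSet<>();
        for (String v : rawValue.split(",")) {
            String trimmed = v.trim();
            if (!trimmed.isEmpty()) {
                values.add(trimmed);
            }
        }
        return values;
    }

    // OR semantics: keep a name if it equals any of the allowed values.
    static List<String> filter(List<String> names, Set<String> allowed) {
        List<String> out = new ArrayList<>();
        for (String n : names) {
            if (allowed.contains(n)) {
                out.add(n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> foos = Arrays.asList("Name1", "Name2", "Name3");
        System.out.println(filter(foos, parseIn("Name1,Name2")));
    }
}
```

One limitation of this scheme is that values containing a comma need an escaping rule, which is among the disadvantages to weigh when choosing between the options.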
Well, one quick way to fix that is to add an additional parameter that identifies the relationship between your parameters, i.e. whether they are combined with AND or with OR. For example:
http://www.website.com/Foo?Name=Name1&Age=19&or=true
Or, for much more complex queries, keep a single parameter that contains your whole query, written in your own little query language. On the server side you would parse the whole string and extract the conditions and how they are combined.