Not in List Operation - Aerospike

Does Aerospike support a "not in" operation on lists? I couldn't see it in the documentation or find any reference to it. Could somebody confirm whether that's the case?

Your understanding is correct. Aerospike does not support a "not in" operation on lists. Such an operation would have to be based on a search value, and there are no list read operations by value.
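For illustration only, a minimal sketch (not from the answer) of the usual workaround, assuming the Aerospike Python client and a hypothetical list bin named "tags": read the record and perform the "not in" check in application code.

# Hypothetical workaround: the server offers no "not in" list operation,
# so read the list bin and do the membership check client-side.
import aerospike

client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()

key = ("test", "demo", "user1")     # namespace, set, user key (placeholder values)
_, _, bins = client.get(key)        # returns (key, metadata, bins)
tags = bins.get("tags", [])         # "tags" is an assumed list bin

if "inactive" not in tags:          # the "not in" check happens in the application
    print("value is absent from the list")

client.close()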

Related

Splunk: Record deduplication using a unique field

We are considering moving our log analytics solution from Elasticsearch/Kibana to Splunk.
We currently use "document id" in ElasticSearch to deduplicate records when indexing :
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
We generate the id using a hash of the content of each log record.
In Splunk, I found the internal field "_cd", which is unique to each record in a Splunk index: https://docs.splunk.com/Documentation/Splunk/8.1.0/Knowledge/Usedefaultfields
However, using the HTTP Event Collector to ingest records, I couldn't find any way to embed this "_cd" field in the request:
https://docs.splunk.com/Documentation/Splunk/8.1.0/Data/HECExamples
Any tips on how to achieve this in Splunk?
What are you trying to achieve?
If you're sending "unique" events to the HEC, or you're running UFs on "unique" logs, you'll never get duplicate "records when indexing".
It sounds like you (perhaps routinely?) resend the same data to your aggregation platform - which is not a problem with the aggregator, but with your sending process.
Almost like you're doing a MySQL/PostgreSQL "insert if not exists" operation. If that is a correct understanding of your situation, based on your statement
We currently use "document id" in ElasticSearch to deduplicate records when indexing:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
We generate the id using hash of the content of the each log-record.
then you need to evaluate what is going "wrong" in your sending process that makes you feel you need to pre-clean the data before ingesting it.
It is true that Splunk won't "deduplicate records when indexing" - because it presumes the data coming in to be 'correct' from whatever is submitting it.
How are you getting duplicate data in the first place?
Fields in Splunk which begin with an underscore (e.g. _time, _cd, etc.) are not editable/sendable - they're generated by Splunk when it receives data. IOW, they're all internal fields. Searchable. Usable. But not overrideable.
If you really have a problem with [lots of/too much] duplicate data, and there is no way to fix your sending process[es], then you'll need to rely on deduplication operations in SPL when searching for/reporting on whatever you've ingested (primarily by using stats and, when absolutely necessary/unavoidable, dedup).
HEC inputs don't go through the usual ingestion pipeline so not all internal fields are present.
Not that it matters, really, because Splunk doesn't deduplicate at index time. There is no provision for searching data to see if a given record is already present. Any deduplication must be done at search time.
One cannot use the _cd field to deduplicate at search time because two identical records will have different _cd values.
Consider using a tool such as Cribl to add a hash to each ingested record and use that hash in Splunk to deduplicate in your searches.
Good call @RichG. Cribl has some nice options for this use case.
https://cribl.io/blog/streaming-data-deduplication-with-cribl/
Be aware you can add other fields to HEC data if you are using Cribl LogStream. You get many more options using LogStream. It saved my old team so much time and effort.
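For illustration, a rough sketch (not from the thread) of doing the hashing yourself instead of in Cribl, assuming a standard HEC token and the /services/collector/event endpoint: compute a content hash per record and attach it as an indexed field, then deduplicate on that field at search time with stats or dedup.

# Hypothetical sketch: attach a content hash to each HEC event so searches
# can deduplicate on it later; endpoint, token, and field name are placeholders.
import hashlib
import json
import requests

HEC_URL = "https://splunk.example.com:8088/services/collector/event"
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"

def send_event(record):
    # Hash the canonicalised record content, mirroring the Elasticsearch approach.
    content_hash = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode("utf-8")
    ).hexdigest()
    payload = {
        "event": record,
        "sourcetype": "_json",
        "fields": {"content_hash": content_hash},  # indexed field usable in searches
    }
    requests.post(
        HEC_URL,
        headers={"Authorization": "Splunk " + HEC_TOKEN},
        json=payload,
        timeout=10,
    )

send_event({"msg": "user login", "user": "alice"})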

Get a range of elements from Redis using ReJSON

So inside the JSON we have a key called items which contains an array of elements. We can get a specific index from that array using JSON.GET employees-list .items[1].
But in our case what we need is to get a range from this array, say elements with index 0-10, 10-20, etc., for pagination purposes, so that we don't have to fetch the entire data in code and then filter the results.
The reason I am looking for this is that if we read the entire list, the data transfer latency would grow with the data size, since the API using it and the Redis server aren't on the same instance; plus it makes more sense not to have to do it in code if possible.
So first, is it even possible, and if yes, how can it be achieved?
RedisJSON doesn't support full JSONPath syntax and only supports simple single paths.
But you can utilize Redis pipeline support to achieve a good enough result, sending the following in a non-blocking way:
JSON.GET employees-list .items[1]
JSON.GET employees-list .items[2]
JSON.GET employees-list .items[3]
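A rough sketch of that pipeline idea with redis-py, assuming the key and path from the question (start/end are page bounds; out-of-range indices may come back as errors, so they are filtered out here):

# Hypothetical sketch: fetch a page of array elements in one round trip by
# pipelining individual JSON.GET calls, since there is no range query here.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def get_items_page(start, end):
    pipe = r.pipeline(transaction=False)   # plain pipeline, no MULTI/EXEC
    for i in range(start, end):
        pipe.execute_command("JSON.GET", "employees-list", ".items[%d]" % i)
    replies = pipe.execute(raise_on_error=False)
    # Keep only successful replies, each a JSON-encoded element.
    return [json.loads(raw) for raw in replies if isinstance(raw, bytes)]

first_page = get_items_page(0, 10)    # elements 0-9
second_page = get_items_page(10, 20)  # elements 10-19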
RedisJSON2, on the other hand, has full JSONPath support and does support such queries, but currently, for backward compatibility, it only returns the first element (like RedisJSON). This kind of support is about to be added, probably in the next week.

Can an Elasticsearch snapshot that ended in a 'success' state have missing index documents?

I have a snapshot that ended in the 'success' state.
But I wonder: does that necessarily mean that all the documents inside the indices that were snapshotted were indeed snapshotted?
Their documentation says this means all shards were available and got snapshotted, which in turn means that all indices were snapshotted. But what about the documents inside? Is there any validation of them? Any index document count check?
An Elasticsearch team member confirms:
Yes, "success" means that all the data from the index was successfully snapshotted.

Python Apache Beam: BigQuery streaming deduplication by row_id

According to the BigQuery docs, you can ensure data consistency by providing an insertId (https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency). If it's not provided, BQ will try to ensure consistency based on internal IDs, on a best-effort basis.
Using the BQ API you can do that with the row_ids param (https://google-cloud-python.readthedocs.io/en/latest/bigquery/generated/google.cloud.bigquery.client.Client.insert_rows_json.html#google.cloud.bigquery.client.Client.insert_rows_json) but I can't find the same for the Apache Beam Python SDK.
Looking into the SDK I have noticed that a 'unique_row_id' property exists, but I really don't know how to pass my param to WriteToBigQuery().
How can I write into BQ (streaming) providing a row Id for deduplication?
Update:
If you use WriteToBigQuery then it will automatically create and insert a unique row id called insertId for you, which will be sent to BigQuery. It's handled for you; you don't need to worry about it. :)
WriteToBigQuery is a PTransform, and in its expand method it calls BigQueryWriteFn
BigQueryWriteFn is a DoFn, and in its process method it calls _flush_batch
_flush_batch is a method that then calls the BigQueryWrapper.insert_rows method
BigQueryWrapper.insert_rows creates a list of bigquery.TableDataInsertAllRequest.RowsValueListEntry objects which contain the insertId and the row data as a JSON object
The insertId is generated by calling the unique_row_id method, which returns a value consisting of a UUID4 concatenated with _ and an auto-incremented number.
In the current 2.7.0 code, there is this happy comment; I've also verified it is true :)
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L1182
# Prepare rows for insertion. Of special note is the row ID that we add to
# each row in order to help BigQuery avoid inserting a row multiple times.
# BigQuery will do a best-effort if unique IDs are provided. This situation
# can happen during retries on failures.
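So in practice you simply hand your rows to WriteToBigQuery and let it attach the insertId. A minimal streaming usage sketch, with placeholder project/topic/table names and options that can vary slightly between Beam versions:

# Hypothetical sketch: streaming inserts via WriteToBigQuery, which generates
# and attaches the per-row insertId internally; names below are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # plus your runner/project options

with beam.Pipeline(options=options) as p:
    (p
     | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
     | "Parse" >> beam.Map(lambda msg: {"user": msg.decode("utf-8")})
     | "Write" >> beam.io.WriteToBigQuery(
         table="my-project:my_dataset.events",
         schema="user:STRING",
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))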
* Don't use BigQuerySink
At least, not in its current form, as it doesn't support streaming. I guess that might change.
Original (non)answer
Great question, I also looked and couldn't find a certain answer.
Apache Beam doesn't appear to use the google.cloud.bigquery client SDK you've linked to; it has its own internally generated API client, but it appears to be up to date.
I looked at the source:
The insertAll method is there: https://github.com/apache/beam/blob/18d2168ee71a1b1b04976717f0f955199bb00961/sdks/python/apache_beam/io/gcp/internal/clients/bigquery/bigquery_v2_client.py#L476
I also found the insertId mentioned:
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/internal/clients/bigquery/bigquery_v2_messages.py#L1707
So if you can make an InsertAll call, it will use a TableDataInsertAllRequest and pass a RowsValueListEntry:
class TableDataInsertAllRequest(_messages.Message):
  """A TableDataInsertAllRequest object.

  Messages:
    RowsValueListEntry: A RowsValueListEntry object.
The RowsValueListEntry message is where the insertId is.
Here are the API docs for insertAll:
https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/insertAll
I will look some more at this because I don't see the WriteToBigQuery() exposing this.
I suspect that the "BigQuery will remember this for at least one minute" behavior is a pretty loose guarantee for de-duping. The docs suggest using Datastore if you need transactions. Otherwise you might need to run SQL with window functions to de-dupe at runtime, or run some other de-duping jobs on BigQuery.
Perhaps using the batch_size parameter of WriteToBigQuery(), and running a combine (or at worst a GroupByKey) step in Dataflow, is a more stable way to de-dupe prior to writing.
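A rough sketch of that pre-write de-dupe idea (field names are hypothetical, and in streaming mode the GroupByKey needs an appropriate windowing/triggering strategy):

# Hypothetical sketch: key each row by a content hash and keep one row per key
# before handing the PCollection to WriteToBigQuery.
import hashlib
import json

import apache_beam as beam

def key_by_content_hash(row):
    digest = hashlib.sha256(json.dumps(row, sort_keys=True).encode("utf-8")).hexdigest()
    return digest, row

deduped = (rows  # 'rows' is the upstream PCollection of dict rows
           | "KeyByHash" >> beam.Map(key_by_content_hash)
           | "GroupByHash" >> beam.GroupByKey()
           | "TakeFirst" >> beam.Map(lambda kv: next(iter(kv[1]))))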

Endeca returning zero-count refinements when none of those refinements appear in the ref app?

I am using the Endeca 3.1.2 Assembler API. When I hit the Endeca query, it gives me a bunch of refinements, some containing zero counts and some positive counts.
Example:
category
  category1 (0)
  category2 (25)
  category3 (0)
This is the result I am getting. When I hit the same query in the jspref application, I do not get any refinements with a zero count.
My expectation is that I should not get those zero-count refinements among the available refinements.
Please help me resolve this.
You might have disabled refinements enabled in your query.
Check whether the Ndr parameter is present in the Dgraph request log file.
If so, ensure your code isn't calling the ENEQuery.setNavDisabledRefinementsConfig() method.
Endeca has a feature called implicit dimensions. It might be the case that an implicit dimension is being displayed on the front end. Endeca provides implicit dimensions as part of the query response.
The following code is used to get an implicit dimension:
Navigation.getCompleteDimensions().getDimension(dimensionid)