Aerospike AQL: How to calculate the sum of records in a stream

How do I calculate the sum of values with a where clause in Aerospike?
I am a newbie to Aerospike. Is there any good reference documentation I could follow?

Either aggregate in the client as records are returned in the callback from a secondary index (SI) query, or use a Stream UDF.
You can use the Stream UDF approach with AQL, but you should really write a standalone application using one of the clients, such as the Java client.
Using the Java client:
For the SI query approach, see the code example here: https://www.aerospike.com/docs/client/java/examples/application/queries.html
For the Stream UDF approach, see the code example here: https://www.aerospike.com/docs/client/java/examples/application/aggregate.html
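For illustration, here is a minimal client-side aggregation sketch using the Aerospike Python client; the host, namespace, set, bin names and filter range are placeholders, and it assumes a secondary index exists on the filtered bin (the Java client follows the same query-and-callback pattern):

import aerospike
from aerospike import predicates as p

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

query = client.query('test', 'demo')        # namespace and set are placeholders
query.where(p.between('age', 18, 65))       # requires a secondary index on 'age'

total = 0

def add_amount(record):
    # Each record arrives as a (key, metadata, bins) tuple; sum the 'amount' bin.
    global total
    _, _, bins = record
    total += bins.get('amount', 0)

query.foreach(add_amount)
print('sum of amount:', total)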

Related

BigQueryIO.write() use SQL functions

I have a streaming Dataflow job and use the BigQueryIO.write library to insert rows into BigQuery tables. There is a column in the BQ table that is supposed to store the row-creation timestamp. I need to use the SQL function CURRENT_TIMESTAMP() to set the value of this column.
I cannot use any of Java's libraries (like Instant.now()) to get the current timestamp, because that would derive the value during job execution. I am using BigQuery load jobs with a triggering frequency of 10 minutes, so if I use a Java library to derive the timestamp, it won't return the expected value.
I could not find any method in BigQueryIO.write that takes a SQL function as input. So what is the solution to this issue?
It sounds like you want BigQuery to assign a timestamp to each row, based on when the row was inserted. The only way I can think of to accomplish this is to submit a QueryJob to BigQuery that contains an INSERT statement that includes CURRENT_TIMESTAMP() along with the values of the other columns. But this method is not particularly scalable with data volume, and it's not something that BigQueryIO.write() supports.
BigQueryIO.write supports batch loads, the streaming inserts API, and the Storage Write API, none of which to my knowledge provide a method to inject a BigQuery-side timestamp like you are suggesting.
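As a rough illustration of that query-job approach, here is a sketch using the google-cloud-bigquery Python client; the project, dataset, table and column names are hypothetical, and CURRENT_TIMESTAMP() is evaluated by BigQuery when the DML statement runs, not in the pipeline:

from google.cloud import bigquery

client = bigquery.Client()

sql = """
    INSERT INTO `my_project.my_dataset.my_table` (id, payload, created_at)
    VALUES (@id, @payload, CURRENT_TIMESTAMP())
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter('id', 'INT64', 42),
        bigquery.ScalarQueryParameter('payload', 'STRING', 'example row'),
    ]
)
client.query(sql, job_config=job_config).result()  # wait for the insert job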

Select random bin from a set in Aerospike Query Language?

I want to select a sample of random 'n' bins from a set in the namespace. Is there a way to achieve this in Aerospike Query Language?
In Oracle, we achieve something similar with the following query:
SELECT * FROM <table-name> sample block(10) where rownum < 101
The above query fetches blocks of 10 rows from a sample size of 100.
Can we do something similar to this in Aerospike also?
Rows are like records in Aerospike, and columns are like bins. You don’t have a way to sample random columns from a table, do you?
You can sample random records from a set using ScanPolicy.maxRecords added to a scan of that set. Note that the new (optional) set indexes in Aerospike version 5.6 may accelerate that operation.
Each namespace has its data partitioned into 4096 logical partitions, and the records in the namespace are evenly distributed across those partitions using the characteristics of the 20-byte RIPEMD-160 digest. Therefore, Aerospike doesn't have a rownum, but you can leverage the data distribution to sample data.
Each partition is roughly 0.0244% of the namespace. That's a sample space you can use, similar to the SQL query above. Next, if you are using the ScanPartition method of the client, you can give it ScanPolicy.maxRecords to pick a specific number of records out of that partition. Further, you can start after an arbitrary digest (see PartitionFilter.after) if you'd like.
OK, now let's talk about data browsing. Instead of using the aql tool, you could be using the Aerospike JDBC driver, which works with any JDBC-compatible data browser like DBeaver, SQuirreL, and Tableau. When you use LIMIT on a SELECT statement it will basically do what I described above - use partition scanning and a max-records sample on that scan. I suggest you try this as an alternative.
AQL is a tool written using the Aerospike C client. Aerospike does not have a SQL-like query language per se that the server understands; whatever functionality AQL provides is documented - type HELP at the aql> prompt.
You can write an application in C or Java to achieve this. For example, in Java, you can do a scanAll() API call with maxRecords defined in the ScanPolicy. I don't see the AQL tool offering that option for scans. (It just allows you to specify a scan rate, one of the other ScanPolicy options.)
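For illustration, here is roughly the same scan-with-a-cap idea using the Aerospike Python client; the host, namespace and set names are placeholders, and it assumes a client/server combination whose scan policy supports max_records:

import aerospike

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

scan = client.scan('test', 'demo')          # namespace and set are placeholders

# Ask the server to stop after roughly 100 records. The result is an
# arbitrary, digest-ordered sample, not a statistically uniform one.
records = scan.results({'max_records': 100})
print(len(records), 'records sampled')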

Airflow: BigQueryOperator vs BigQuery Quotas and Limits

Is there any practical way to control quotas and limits in Airflow?
I'm especially interested in controlling BigQuery concurrency.
There are different levels of quotas on BigQuery, so depending on the operator inputs, there should be a way to check whether the conditions are met and otherwise wait until they are.
It seems this could be a composition of Sensors and Operators, querying against a database such as Redis, for example:
QuotaSensor(Project, Dataset, Table, Query) >> QuotaAddOperator(Project, Dataset, Table, Query)
QuotaAddOperator(Project, Dataset, Table, Query) >> BigQueryOperator(Project, Dataset, Table, Query)
BigQueryOperator(Project, Dataset, Table, Query) >> QuotaSubOperator(Project, Dataset, Table, Query)
The Sensor must check conditions like:
- Global running queries <= 300
- Project running queries <= 100
- .. etc
Is there any lib that already does that for me? A plugin perhaps?
Or any other easier solution?
Otherwise, following the Sensor-Operator approach, how can I encapsulate all of it under a single operator (say, a QuotaBigQueryOperator) to avoid repeating code?
Currently, it is only possible to get the Compute Engine quotas programmatically. However, there is an open feature request to get/set other project quotas via API. You can post there about the specific case you would like to have implemented, and follow it to track progress and ask for updates.
Meanwhile, as a workaround, you can try the PythonOperator. With it you can define your own custom code and implement retries for queries that fail with a quotaExceeded error (or whatever specific error you are getting). That way you don't have to explicitly check the quota levels; you just run the queries and retry until they get executed. This is simplified code for the strategy I have in mind:
for query in QUERIES_TO_RUN:
    while True:
        try:
            run(query)
        except quotaExceededException:
            continue  # quota hit: jump back to the top of the while loop and retry
        break  # query succeeded, move on to the next one
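As a rough sketch of wiring that loop into a DAG, assuming an Airflow 1.x import path and treating QUERIES_TO_RUN, run() and quotaExceededException as placeholders for your own query list, execution helper and error type:

from airflow.operators.python_operator import PythonOperator  # Airflow 1.x path

def run_queries_with_retry(**context):
    for query in QUERIES_TO_RUN:
        while True:
            try:
                run(query)
            except quotaExceededException:
                continue  # quota hit: retry the same query
            break

run_queries = PythonOperator(
    task_id='run_queries_with_retry',
    python_callable=run_queries_with_retry,
    dag=dag,  # assumes a DAG object defined elsewhere in the file
)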

Python Apache Beam: BigQuery streaming deduplication by row_id

According to the BigQuery docs, you can ensure data consistency by providing an insertId (https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency). If it's not provided, BQ will try to ensure consistency based on internal IDs, on a best-effort basis.
Using the BQ API you can do that with the row_ids param (https://google-cloud-python.readthedocs.io/en/latest/bigquery/generated/google.cloud.bigquery.client.Client.insert_rows_json.html#google.cloud.bigquery.client.Client.insert_rows_json), but I can't find the same for the Apache Beam Python SDK.
Looking into the SDK, I noticed that a 'unique_row_id' property exists, but I really don't know how to pass my param to WriteToBigQuery().
How can I write into BQ (streaming) while providing a row ID for deduplication?
Update:
If you use WriteToBigQuery then it will automatically create and insert a unique row ID called insertId for you, which will be inserted into BigQuery. It's handled for you; you don't need to worry about it. :)
WriteToBigQuery is a PTransform, and its expand method calls BigQueryWriteFn.
BigQueryWriteFn is a DoFn, and its process method calls _flush_batch.
_flush_batch then calls the BigQueryWrapper.insert_rows method.
BigQueryWrapper.insert_rows creates a list of bigquery.TableDataInsertAllRequest.RowsValueListEntry objects which contain the insertId and the row data as a JSON object.
The insertId is generated by calling the unique_row_id method, which returns a value consisting of a UUID4 concatenated with _ and an auto-incremented number.
In the current 2.7.0 code, there is this happy comment; I've also verified it is true :)
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L1182
# Prepare rows for insertion. Of special note is the row ID that we add to
# each row in order to help BigQuery avoid inserting a row multiple times.
# BigQuery will do a best-effort if unique IDs are provided. This situation
# can happen during retries on failures.
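As a small illustration of "it's handled for you", here is a hedged sketch of streaming rows with WriteToBigQuery without touching the insertId at all; the table spec and schema are hypothetical:

import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'Create' >> beam.Create([{'user': 'alice', 'score': 10}])
        | 'Write' >> beam.io.WriteToBigQuery(
            'my_project:my_dataset.scores',      # hypothetical table spec
            schema='user:STRING,score:INTEGER',
            # WriteToBigQuery generates a unique insertId per row internally.
        )
    )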
* Don't use BigQuerySink
At least, not in its current form, as it doesn't support streaming. I guess that might change.
Original (non)answer
Great question; I also looked and couldn't find a definitive answer.
Apache Beam doesn't appear to use the google.cloud.bigquery client SDK you've linked to; it has an internally generated API client, but it appears to be up to date.
I looked at the source:
The InsertAll method is here: https://github.com/apache/beam/blob/18d2168ee71a1b1b04976717f0f955199bb00961/sdks/python/apache_beam/io/gcp/internal/clients/bigquery/bigquery_v2_client.py#L476
I also found the insertId mentioned here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/internal/clients/bigquery/bigquery_v2_messages.py#L1707
So if you make an InsertAll call, it will use a TableDataInsertAllRequest and pass RowsValueListEntry objects:
class TableDataInsertAllRequest(_messages.Message):
    """A TableDataInsertAllRequest object.

    Messages:
      RowsValueListEntry: A RowsValueListEntry object.
    """
The RowsValueListEntry message is where the insertId lives.
Here are the API docs for insertAll:
https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/insertAll
I will look at this some more, because I don't see WriteToBigQuery() exposing this.
I suspect that the "BigQuery will remember this for at least one minute" behavior is a pretty loose guarantee for de-duping. The docs suggest using Datastore if you need transactions. Otherwise you might need to run SQL with window functions to de-dupe at runtime, or run some other de-duping jobs on BigQuery.
Perhaps using the batch_size parameter of WriteToBigQuery(), and running a combine (or at worst a GroupByKey) step in Dataflow, is a more stable way to de-dupe prior to writing.
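For what it's worth, a hedged sketch of that pre-write de-duping idea: key each row by a caller-chosen ID, group, keep one element per key, then write. The 'row_id' field, table spec and schema are hypothetical, and WriteToBigQuery still attaches its own auto-generated insertId on top of this:

import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'Create' >> beam.Create([
            {'row_id': 'a', 'value': 1},
            {'row_id': 'a', 'value': 1},   # duplicate that will be dropped
            {'row_id': 'b', 'value': 2},
        ])
        | 'KeyByRowId' >> beam.Map(lambda row: (row['row_id'], row))
        | 'GroupByRowId' >> beam.GroupByKey()
        | 'TakeFirst' >> beam.Map(lambda kv: list(kv[1])[0])
        | 'Write' >> beam.io.WriteToBigQuery(
            'my_project:my_dataset.my_table',
            schema='row_id:STRING,value:INTEGER',
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            batch_size=500,
        )
    )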

Apache Beam Stream from KafkaIO - Window need

I am streaming messages from a Kafka topic using the KafkaIO API:
https://beam.apache.org/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/io/kafka/KafkaIO.html
The pipeline flow is as follows:
KafkaStream --> Decode Message using transformer -->Save to BigQuery
I decode the message and save it to BigQuery using BigQueryIO. I would like to know whether I need to use a window or not.
Window.into[Array[Byte]](FixedWindows.of(Duration.standardSeconds(10)))
  .triggering(
    Repeatedly.forever(
      AfterProcessingTime
        .pastFirstElementInPane()
        .plusDelayOf(Duration.standardSeconds(10))
    )
  )
  .withAllowedLateness(Duration.standardSeconds(0))
  .discardingFiredPanes()
As per the documentation, a window is required when doing computations such as GroupByKey, etc. Since I am just decoding the byte-array message and storing it in BigQuery, it may not be required.
Please let me know: do I need to use a window or not?
There is an answer already posted to a similar question where the data is being streamed from Pub/Sub. The main idea is that it is impossible to collect all of the elements of an unbounded PCollection, since new elements are constantly being added, and therefore one of two strategies must be implemented:
Windowing: you should first set a non-global windowing function.
Triggers: you can set up a trigger for an unbounded PCollection in such a way that it provides periodic updates on an unbounded dataset, even if the data in the subscription is still flowing
It might also be necessary to enable streaming in the pipeline by setting the appropriate pipeline option, for example:
pipeline_options.view_as(StandardOptions).streaming = True
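For reference, a hedged sketch of those two strategies in the Beam Python SDK (your pipeline reads from Kafka in Java, so this only illustrates the shape of the windowing configuration; the 10-second window and trigger delay are placeholders):

import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # enable streaming mode

# Non-global window plus a repeated processing-time trigger, discarding
# fired panes; apply it as: ... | 'Window' >> windowing | ...
windowing = beam.WindowInto(
    window.FixedWindows(10),
    trigger=trigger.Repeatedly(trigger.AfterProcessingTime(10)),
    accumulation_mode=trigger.AccumulationMode.DISCARDING,
)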