Aerospike Management Console: No batch query throughput

How can I see batch query throughput in AMC (Aerospike Management Console)? I am using the Community Edition. I can see read requests made without batch, but not those made through batch. Or is there another tool that can be used for this?

The AMC dashboard includes the ability to filter for and visualize statistics. This includes a list of batch-index related metrics. The cumulative metric batch_index_initiate and the realtime metric batch_index_queue are probably interesting to you.
However, the number of records read per second through batch-index is not something you can see from AMC. In the batch index protocol, the records in those batch requests get split and placed on the single-record transaction queues and threads. What you can do is initiate the batch_index_reads micro-benchmark using asinfo, then analyze it with asloglatency.

Related

How to get query time and space information from BigQuery API

I'm going to build a web app that uses BigQuery as part of the backend database, and I want to show the query cost information (e.g. 1.8 sec elapsed, 264.9 MB processed) in the app.
I know we can check BigQuery's query information inside GCP, but how do I get that information from the BigQuery API?
The information you are interested in is present in the job statistics.
See jobs.get for more details: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/get
The dry-run sample may also be of interest, though you can get the stats from a real invocation too (a dry run estimates costs without executing the query):
https://cloud.google.com/bigquery/docs/samples/bigquery-query-dry-run
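As a minimal sketch of turning those job statistics into the string from the question, the function below reads `statistics.startTime`, `statistics.endTime` and `statistics.query.totalBytesProcessed` from a jobs.get response body. The sample dictionary is hand-written with made-up values, and dividing by 1024 squared to get "MB" is an assumption about the unit you want to display.

```python
def summarize_job(job: dict, bytes_per_mb: int = 1024 ** 2) -> str:
    """Format elapsed time and bytes processed from a jobs.get response body."""
    stats = job["statistics"]
    # startTime and endTime are milliseconds since the epoch, sent as strings.
    elapsed_sec = (int(stats["endTime"]) - int(stats["startTime"])) / 1000
    # totalBytesProcessed is an int64 that the REST API serializes as a string.
    mb = int(stats["query"]["totalBytesProcessed"]) / bytes_per_mb
    return f"{elapsed_sec:.1f} sec elapsed, {mb:.1f} MB processed"


# Hand-written sample shaped like a jobs.get response (values are made up).
sample_job = {
    "statistics": {
        "startTime": "1700000000000",
        "endTime": "1700000001800",
        "query": {"totalBytesProcessed": "277767782"},
    }
}

print(summarize_job(sample_job))  # 1.8 sec elapsed, 264.9 MB processed
```

For a dry run the job is never executed, so the sketch would need to skip the elapsed-time part and report only the bytes estimate.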

Google Pub/Sub + Cloud Run scalability

I have a Python application writing Pub/Sub messages into BigQuery. The Python code uses the google-cloud-bigquery library, and the TableData.insertAll() method quota is 10,000 requests per second per table (see the quotas documentation).
Cloud Run container auto scaling is set to 100 with 1000 requests per container. So technically, I should be able to reach 10,000 requests/sec, right? With the BQ insert API being the biggest bottleneck.
I only have a few hundred requests per second at the moment, with multiple services running at the same time.
CPU and RAM are at 50%.
Having confirmed your project structure and a few details given in the comments, I would review the Pub/Sub quotas and limits, especially the Quotas and the Resource limits tables, where you can check the limits that apply depending on message size. The Throughput quota units section tells you how to calculate quota usage.
To answer your question: yes, you should be able to reach 10,000 req/sec. And as noted in this question, depending on the byte size you can have 10,000 row inserts, although the recommendation is 500.
The concurrency in Cloud Run can be modified in case you need to change it.
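Note that 100 containers times 1000 concurrent requests bounds the number of requests in flight, not requests per second; the rate also depends on how long each request takes. A rough sketch of that arithmetic using Little's law (throughput = in-flight requests / average latency) follows. The latency values and the function name are assumptions for illustration, not measurements from this service.

```python
def request_rate_ceiling(max_instances: int, concurrency: int,
                         avg_latency_sec: float, quota_per_sec: float) -> float:
    """Upper bound on requests/sec: Little's law, capped by the downstream quota."""
    in_flight = max_instances * concurrency
    service_ceiling = in_flight / avg_latency_sec
    return min(service_ceiling, quota_per_sec)


# 100 instances x 1000 concurrent requests, against a 10,000 req/sec quota.
# The average latencies (0.5 s and 20 s) are assumed values.
print(request_rate_ceiling(100, 1000, 0.5, 10_000))   # 10000.0 (quota is the limit)
print(request_rate_ceiling(100, 1000, 20.0, 10_000))  # 5000.0 (latency is the limit)
```

This is only a ceiling; CPU, memory, and Pub/Sub delivery rate can keep the observed rate well below it.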

PubSub topic with binary data to BigQuery

I expect to have thousands of sensors sending telemetry data at 10 FPS with around 1 KB of binary data per frame, using IoT Core, meaning I'll get it via Pub/Sub. I'd like to get that data to BigQuery, and no processing is needed.
As Dataflow doesn't have a template capable of dealing with binary data, and working with it seems a bit cumbersome, I'd like to try to avoid it and go full serverless.
Question is, what's my best alternative?
I've thought about a Cloud Run service running an Express app to accept the data from Pub/Sub, using a global variable to accumulate around 500 rows in RAM, then dumping them using BigQuery's insert() method (Node.js client).
How reasonable is that? Will I gain something from accumulation, or should I just insert every single incoming row into BigQuery?
Streaming Ingestion
If your requirement is to analyze high volumes of continuously arriving data with near-real-time dashboards and queries, streaming inserts would be a good choice. The quotas and limits for streaming inserts can be found here.
Since you are using the Node.js client library, use the BigQuery legacy streaming API's insert() method as you have already mentioned. The insert() method streams one row at a time irrespective of accumulation of rows.
For new projects, the BigQuery Storage Write API is recommended, as it is cheaper and has a richer feature set than the legacy API. The BigQuery Storage Write API currently supports only the Java, Python and Go (in preview) client libraries.
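If you do accumulate rows as the question proposes, the buffering logic could look roughly like the sketch below. It is written in Python rather than Node.js, and `send_batch` is a hypothetical callback standing in for whatever insert call you use; no BigQuery client is involved here.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class RowBuffer:
    """Collects rows in memory and hands them off in batches of max_rows."""

    def __init__(self, send_batch: Callable[[List[Row]], None], max_rows: int = 500) -> None:
        self._send_batch = send_batch  # hypothetical: wraps the real insert call
        self._max_rows = max_rows
        self._rows: List[Row] = []

    def add(self, row: Row) -> None:
        self._rows.append(row)
        if len(self._rows) >= self._max_rows:
            self.flush()

    def flush(self) -> None:
        if not self._rows:
            return
        # Swap the list out first so a failing send does not resend old rows.
        batch, self._rows = self._rows, []
        self._send_batch(batch)


# Usage: record the batches in a list instead of calling a real API.
sent_batches: List[List[Row]] = []
buffer = RowBuffer(send_batch=sent_batches.append, max_rows=500)
for i in range(1200):
    buffer.add({"sensor_id": i % 10, "frame": i})
buffer.flush()  # send whatever is left over
print([len(b) for b in sent_batches])  # [500, 500, 200]
```

Rows still in the buffer when the process stops are never sent, so the final flush() matters; in this sketch a failed send also drops that batch rather than retrying it.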
Batch Ingestion
If your requirement is to load large, bounded data sets that don’t have to be processed in real-time, prefer batch loading. BigQuery batch load jobs are free. You only pay for storing and querying the data but not for loading the data. Refer to quotas and limits for batch load jobs here. Some more key points on batch loading jobs have been quoted from this article.
Load performance is best effort
Since the compute used for loading data is made available from a shared pool at no cost to the user, BigQuery does not make guarantees on performance and available capacity of this shared pool. This is governed by the fair scheduler allocating resources among load jobs that may be competing with loads from other users or projects. Quotas for load jobs are in place to minimize the impact.
Load jobs do not consume query capacity
Slots used for querying data are distinct from the slots used for ingestion. Hence, data ingestion does not impact query performance.
ACID semantics
For data loaded through the bq load command, queries will either reflect the presence of all or none of the data. Queries never scan partial data.

Trapping All Batch Jobs from MVS

I'm trying to trap all the batch jobs from MVS.
I want to transmit all the batch job information (start, end, error) to an external system in order to conduct further analysis.
Has anyone got any idea how to do this?
Write an IEFACTRT exit (or whatever its modern day equivalent is) and have the systems programmers install it.
IBM actually provides a facility for this. You can have it write SMF (System Management Facility) records for all jobs. The record layouts are available and you can write code to do analysis on them or you can get 3rd party products like OmegaMon that will do the analysis and reporting for you.
In my shop, we print the job info into plain files and FTP them down to some file servers, from where we run extract/format scripts and pull the data into a BI platform for later analysis and visualisation.
Currently, we are studying how to use the power of a graph DB like Neo4j to better understand our batch job relationships and to present them to the people who are interested. For now, we think a graph DB is a very neat tool for this kind of thing (batch job management).
I hope my answer gives you some inspiration.
Typically, installations cut SMF type 30 records. Subtype 1 is written when a new transaction is started. Here, transaction means a System Resources Manager (SRM) transaction; don't confuse it with transactions in the context of, e.g., a database system. A batch job that begins execution is such a transaction. Subtype 5 is written when a transaction ends. Along with subtype 5, there is a completion section that reports the job termination status.
Now, SMF processing is traditionally done in batch as you have to prepare the SMF records first either by extracting them from the log stream or from one of the SYS1.MANx data sets.
But recently, capabilities have been added to z/OS that allow you to hook into the process when SMF records are written. A product like the IBM Common Data Provider for z/OS can be used to transform the data the way you want and to stream it to a destination of your choice, for instance Logstash. This technique allows you to process SMF records in almost real time.

Is there a way to leverage Hadoop tools to manage parallel REST API calls to external sources?

I am writing software that creates a large graph database. The software needs to access dozens of different REST APIs with millions of total requests. The data will then be processed by the Hadoop cluster. Each of these APIs has rate limits that vary by requests/second, per window, per day and per user (typically via OAuth).
Does anyone have any suggestions on how I might use either a Map function or another Hadoop-ecosystem tool to manage these queries? The goal would be to leverage the parallel processing in Hadoop.
Because of the varied rate limits, it often makes sense to switch to a different API query while waiting for the first limit to reset. An example would be one API call that creates nodes in the graph and another that enriches the data for that node. I could have the system go out and enrich the data for the new nodes while waiting for the first API limit to reset.
I have tried using SQS queuing on EC2 to manage the various API limits and states (creating a queue for each API call), but have found it to be ridiculously slow.
Any ideas?
It looks like the best option for my scenario will be using Storm, or specifically the Trident abstraction. It gives me the greatest flexibility for both workload management and process management.
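Independent of the framework, the idea of switching to another API while the first limit resets can be sketched in a few lines. This is plain Python, not Storm or Trident code; the fixed-window limiter, the API names and the limits are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class WindowLimit:
    """Fixed-window limit: at most max_calls per window_sec seconds."""
    max_calls: int
    window_sec: float
    window_start: float = 0.0
    used: int = 0

    def try_acquire(self, now: float) -> bool:
        # Start a fresh window once the previous one has elapsed.
        if now - self.window_start >= self.window_sec:
            self.window_start = now
            self.used = 0
        if self.used < self.max_calls:
            self.used += 1
            return True
        return False


def pick_api(limits: Dict[str, WindowLimit], preferred: List[str], now: float) -> Optional[str]:
    """Return the first API in preference order that still has quota, else None."""
    for name in preferred:
        if limits[name].try_acquire(now):
            return name
    return None


# Hypothetical APIs: create nodes first, fall back to enrichment when limited.
limits = {
    "create_nodes": WindowLimit(max_calls=2, window_sec=60.0),
    "enrich_nodes": WindowLimit(max_calls=3, window_sec=60.0),
}
order = ["create_nodes", "enrich_nodes"]
schedule = [pick_api(limits, order, now=float(t)) for t in range(6)]
print(schedule)
```

With these assumed limits, the first two calls go to create_nodes, the next three to enrich_nodes, and the sixth gets None because both windows are exhausted. The clock is passed in as `now` so the logic can be tested without sleeping; per-user or per-day limits would need one WindowLimit per key.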