Select random bins from a set in Aerospike Query Language?

I want to select a random sample of 'n' bins from a set in a namespace. Is there a way to achieve this in Aerospike Query Language?
In Oracle, we achieve something similar with the following query:
SELECT * FROM <table-name> sample block(10) where rownum < 101
The above query samples roughly 10% of the table's blocks and returns at most 100 rows.
Can we do something similar to this in Aerospike also?

Rows are like records in Aerospike, and columns are like bins. You wouldn't sample random columns from a table, so I assume you mean sampling random records rather than bins.
You can sample random records from a set by adding ScanPolicy.maxRecords to a scan of that set. Note that the new (optional) set indexes in Aerospike 5.6 may accelerate that operation.
Each namespace has its data partitioned into 4096 logical partitions, and the records in the namespace are distributed evenly across those partitions based on the record's 20-byte RIPEMD-160 digest. Aerospike therefore doesn't have a rownum, but you can leverage the data distribution to sample data.
Each partition is roughly 0.0244% (1/4096) of the namespace. That's a sample space you can use, similar to the SQL query above. Next, if you are using the client's scanPartitions method, you can give it ScanPolicy.maxRecords to pick a specific number of records out of that partition. Further, you can start after an arbitrary digest (see PartitionFilter.after) if you'd like.
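For example, a minimal Java client sketch of that approach, assuming a namespace test and a set demo (partition 0 is an arbitrary choice):

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.policy.ScanPolicy;
import com.aerospike.client.query.PartitionFilter;

public class PartitionSample {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        ScanPolicy policy = new ScanPolicy();
        policy.maxRecords = 100; // sample at most 100 records

        // Scan a single partition (IDs range from 0 to 4095) as the sample space.
        PartitionFilter filter = PartitionFilter.id(0);

        client.scanPartitions(policy, filter, "test", "demo",
            (key, record) -> System.out.println(record.bins));

        client.close();
    }
}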
OK, now let's talk data browsing. Instead of using the aql tool, you could be using the Aerospike JDBC driver, which works with any JDBC-compatible data browser such as DBeaver, SQuirreL, and Tableau. When you use LIMIT on a SELECT statement, it will basically do what I described above - use partition scanning and a max-records sample on that scan. I suggest you try this as an alternative.
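You could also drive the JDBC driver from plain Java. A sketch, with the connection URL format following the driver's README; the host, namespace, and set name here are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AerospikeJdbcSample {
    public static void main(String[] args) throws Exception {
        // URL format per the Aerospike JDBC driver docs: jdbc:aerospike:<host>[:port]/<namespace>
        try (Connection conn = DriverManager.getConnection("jdbc:aerospike:localhost:3000/test");
             Statement stmt = conn.createStatement();
             // LIMIT is translated into a partition scan with a max-records cap
             ResultSet rs = stmt.executeQuery("SELECT * FROM demo LIMIT 100")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}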

AQL is a tool written using the Aerospike C client. Aerospike does not have a SQL-like query language per se that the server understands. Whatever functionality AQL provides is documented - type HELP at the aql> prompt.
You can write an application in C or Java to achieve this. For example, in Java, you can do a scanAll() API call with maxRecords defined in the ScanPolicy. I don't see the AQL tool offering that option for scans. (It just allows you to specify a scan rate, one of the other ScanPolicy options.)
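A minimal sketch of that scanAll() call, reusing an AerospikeClient instance like the one in the earlier example (namespace and set names are placeholders):

ScanPolicy policy = new ScanPolicy();
policy.maxRecords = 1000; // cap the number of records returned by the scan
client.scanAll(policy, "test", "demo",
    (key, record) -> System.out.println(record.bins));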

Related

BigQueryIO.write() use SQL functions

I have a Dataflow streaming job that uses the BigQueryIO.write library to insert rows into BigQuery tables. One column in the BQ table is supposed to store the row creation timestamp, and I need the SQL function "CURRENT_TIMESTAMP()" to populate its value.
I cannot use any of Java's libraries (like Instant.now()) to get the current timestamp, because that would derive the value during job execution. I am using BigQuery load jobs with a triggering frequency of 10 minutes, so a timestamp derived in Java won't reflect when the row actually lands in the table.
I could not find any method in BigQueryIO.write that takes a SQL function as input. So what's the solution to this issue?
It sounds like you want BigQuery to assign a timestamp to each row, based on when the row was inserted. The only way I can think of to accomplish this is to submit a QueryJob to BigQuery that contains an INSERT statement that includes CURRENT_TIMESTAMP() along with the values of the other columns. But this method is not particularly scalable with data volume, and it's not something that BigQueryIO.write() supports.
BigQueryIO.write supports batch loads, the streaming inserts API, and the Storage Write API, none of which, to my knowledge, provides a way to inject a BigQuery-side timestamp like you are suggesting.
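If you do go the QueryJob route, a sketch with the google-cloud-bigquery client might look like this (the dataset, table, and column names are hypothetical):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.QueryParameterValue;

public class InsertWithServerTimestamp {
    public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        // CURRENT_TIMESTAMP() is evaluated by BigQuery when the statement runs,
        // so created_at reflects the actual insertion time, not job-build time.
        QueryJobConfiguration config = QueryJobConfiguration.newBuilder(
                "INSERT INTO `my_dataset.my_table` (user_id, created_at) "
                    + "VALUES (@userId, CURRENT_TIMESTAMP())")
            .addNamedParameter("userId", QueryParameterValue.string("u-123"))
            .build();
        bigquery.query(config);
    }
}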

Aerospike AQL How to calculate sum of records in stream

How do I calculate the sum of values with a where clause in Aerospike?
I am a newbie in Aerospike. Is there any good reference documentation that I could follow?
Either aggregate in the client as records are returned in the callback from a secondary index (SI) query, or use a Stream UDF.
You can use the Stream UDF approach with AQL. But, you should really write a standalone application using one of the clients, such as the Java client.
Using Java Client:
For SI query approach see code example here: https://www.aerospike.com/docs/client/java/examples/application/queries.html
For Stream UDF approach, see code example here: https://www.aerospike.com/docs/client/java/examples/application/aggregate.html
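As an illustration, the first approach (client-side aggregation over an SI query) might look like this in Java; the secondary index on the age bin and the amount bin being summed are assumptions:

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.query.Filter;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

public class SumWithWhere {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        Statement stmt = new Statement();
        stmt.setNamespace("test");
        stmt.setSetName("demo");
        stmt.setFilter(Filter.range("age", 20, 30)); // the "where" clause

        long sum = 0;
        RecordSet rs = client.query(null, stmt);
        try {
            while (rs.next()) {
                sum += rs.getRecord().getLong("amount"); // aggregate in the client
            }
        } finally {
            rs.close();
        }
        System.out.println("sum = " + sum);
        client.close();
    }
}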

Camel Sql Consumer Performance for Large DataSets

I am trying to cache some static data in an Ignite cache in order to query it faster, so I need to read the data from the database and insert it into the cache cluster.
But the table has around 3 million rows, and this causes an OutOfMemoryError because the SqlComponent tries to process all the data as one batch, collecting the entire result set at once.
Is there any way to split the result set while reading it (for example, 1000 items per Exchange)?
You can add a limit in the SQL query depending on what SQL database you use.
Or you can try setting jdbcTemplate.maxRows=1000. Whether that option actually limits the result set depends on the JDBC driver.
Also bear in mind that you need some way to mark or delete rows after they are processed, so they are not selected again by the next query, for example with the onConsume option (see the sketch after the link below).
You can look in the unit tests to find some examples with onConsume etc: https://github.com/apache/camel/tree/master/components/camel-sql/src/test
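A sketch of such a route; the items table, its processed flag, and the downstream bean are hypothetical, and the LIMIT syntax depends on your database:

import org.apache.camel.builder.RouteBuilder;

public class BatchedSqlRoute extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        // Poll a bounded batch of unprocessed rows. onConsume runs once per
        // row after it is processed, flagging it so the next poll skips it.
        from("sql:select * from items where processed = false limit 1000"
                + "?onConsume=update items set processed = true where id = :#id")
            .to("bean:igniteCacheLoader"); // push each row into the Ignite cache
    }
}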

Why does DataStax DevCenter have a row restriction of 1000?

There is a limit of 1000 rows when displaying table data in DataStax DevCenter. Is there a reason for having this option?
Because when querying, say, SELECT count(*) FROM tablename;
the load on Cassandra is going to be the same whether it displays 1000 records or the complete record set.
DevCenter version 1.6.0 introduces result set paging which allows you to browse all the rows in your result set.
In DevCenter 1.6.0 the "with limit" value sets the paging size, i.e. the number of records to view per page, and is still limited to a maximum of 1000. However, you can now page forward (and back) through all of the query results.
A related new feature allows you to export all results to a file, either as CSV or INSERT statements. Right-click in the results view area and select "Export all results to File as [CSV|Insert]".
This is by design; consider it as a safeguard that prevents you from potentially fetching thousands or millions of rows by accident, which, among other problems, could have a serious impact on your network's bandwidth usage.
When you run a query in DataStax DevCenter 1.6, it displays 1000 records in the result view per the selected limit, but if you export the same result to CSV it will give you all the records you are looking for.
I run DataStax DevCenter 1.4.
I run the query with a LIMIT and it gives me the actual count.
But LIMIT is capped at the maximum value of a signed 32-bit integer (2147483647):
select count(*) from users LIMIT 2147483647; -- ALLOW FILTERING

long running queries: observing partial results?

As part of a data analysis project, I will be issuing some long running queries on a mysql database. My future course of action is contingent on the results I obtain along the way. It would be useful for me to be able to view partial results generated by a SELECT statement that is still running.
Is there a way to do this? Or am I stuck with waiting until the query completes to view results which were generated in the very first seconds it ran?
Thank you for any help : )
In the general case, a partial result cannot be produced. For example, if you have an aggregate function with a GROUP BY clause, all the data must be analysed before the first row is returned. A LIMIT clause will not help you, because it is applied after the output is computed. Maybe you can give concrete data and the SQL query?
One thing you may consider is sampling your tables down. This is good practice in data analysis in general to get your iteration speed up when you're writing code.
For example, if you have table-create privileges and you have some mega-huge table X with key unique_id and some data data_value
If unique_id is numeric, in nearly any database
create table sample_table as
select unique_id, data_value
from X
where mod(unique_id, <some_large_prime_number_like_1013>) = 1
will give you a random sample of data to work your queries out on, and you can inner join sample_table against the other tables to speed up testing. Thanks to the sampling, your query results should be roughly representative of what you will get on the full data. Note that the number you're modding by should be prime; otherwise it may not give a representative sample. The example above will shrink your table down to about 0.1% of the original size (0.0987% to be exact).
Most databases also have better sampling and random-number methods than just using mod. Check the documentation to see what's available for your version.
Hope that helps,
McPeterson
It depends on what your query is doing. If it needs to have the whole result set before producing output - such as might happen for queries with group by or order by or having clauses, then there is nothing to be done.
If, however, the reason for the delay is client-side buffering (which is the default mode), then that can be adjusted by using mysql_use_result instead of the default mysql_store_result as an attribute of the database handle. This is true for the Perl and Java interfaces; I think in the C API you have to call the unbuffered mysql_use_result() function instead of mysql_store_result() after executing the query.
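For example, in Java with MySQL Connector/J, the conventional way to get unbuffered (streaming) results is to set the fetch size to Integer.MIN_VALUE; a sketch, with the database, table, and column names as placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingSelect {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/analysis", "user", "password");
             Statement stmt = conn.createStatement(
                 ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            // Tells Connector/J to stream rows one at a time instead of
            // buffering the whole result set in client memory.
            stmt.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = stmt.executeQuery("SELECT id, value FROM big_table")) {
                while (rs.next()) {
                    // Rows arrive as the server produces them.
                    System.out.println(rs.getLong("id") + " " + rs.getString("value"));
                }
            }
        }
    }
}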