Spark SQL: Get Total Matching Rows - apache-spark-sql

I am using Spark SQL to build a query UI on top of JSON logs stored on Amazon S3. In the UI, most queries use LIMIT to bring back the top results, usually just the first ten.
Is there a way with Spark SQL to show the total number of rows that matched the query without re-running the query as a count?
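For context, here is a minimal PySpark sketch of the setup described above; the bucket path, view name, and filter are placeholders. Today the total comes from a second pass over the same data, which is what I'd like to avoid:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log-query-ui").getOrCreate()

# Placeholder S3 path and filter, just to illustrate the pattern in question.
logs = spark.read.json("s3a://my-bucket/logs/")
logs.createOrReplaceTempView("logs")

matched = spark.sql("SELECT * FROM logs WHERE level = 'ERROR'")
top_ten = matched.limit(10).collect()  # the rows the UI actually shows
total = matched.count()                # second scan of the data; the part to avoid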

Related

Laravel Nova: How to display a resource with millions of rows?

I have a resource with 5 million rows in an InnoDB table. Nova times out when executing a COUNT(*) query, which I imagine is used for pagination. Is there a way to disable that behavior (just have NEXT/PREV links in pagination) for only one particular resource?
The query it times out on is:
select count(*) as aggregate from `queue_articles`
I expect it to display the resource as a Nova resource index.

Redshift query output too big for RAM

I have a Redshift query whose output is too big to fit into the RAM of my EC2 instance. I am using psycopg2 to execute the query. If I use the LIMIT keyword, will rows repeat as I increment the limit?
Say I fetch one block with a limit of 0,1000 at first, then another with a limit of 1001,2000. Will there be repeated rows across those two blocks, considering that Redshift fetches data in parallel?
Is there a better alternative to this?
You want to DECLARE a cursor to store the full results on Redshift and then FETCH rows from the cursor in batches as you need them. This way the query only runs once, when the cursor is filled. See https://docs.aws.amazon.com/redshift/latest/dg/declare.html (the example is at the bottom of the page).
This is exactly how BI tools like Tableau get data from Redshift - in blocks of 10,000 rows. Using cursors prevents the tool/system/network from being overwhelmed when you select very large result sets.
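A minimal psycopg2 sketch of that DECLARE/FETCH pattern (connection details, table name, and batch size are placeholders, not a drop-in solution):

import psycopg2

# Placeholder connection parameters for the Redshift cluster.
conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="mydb", user="me", password="secret")

with conn:  # DECLARE must run inside a transaction; the with-block handles commit
    with conn.cursor() as cur:
        cur.execute("DECLARE big_cur CURSOR FOR SELECT * FROM my_large_table")
        while True:
            cur.execute("FETCH FORWARD 10000 FROM big_cur")  # next batch of rows
            rows = cur.fetchall()
            if not rows:
                break
            # process this batch here instead of holding the full result in RAM
        cur.execute("CLOSE big_cur")

psycopg2 can also do this for you: a named cursor (conn.cursor(name="big_cur")) is a server-side cursor that issues the DECLARE/FETCH under the hood.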

BigQuery rows get swapped

I have been trying to create a table in BigQuery from a .csv file stored in my bucket. The table is created and the data is loaded with the correct number of rows and columns; however, the rows get swapped around in BigQuery for some reason.
I tried using the R connector to push the data from my local machine to BigQuery, and the same problem occurs.
So when I run a SELECT * FROM on the table, it shows me the complete table inside BigQuery, but the rows are swapped (i.e. row 21 becomes row 1, row 4000 becomes row 3, for example).
I would appreciate your response.
As in most SQL databases, data stored in BigQuery has no natural order. When you store data in BigQuery, it is automatically organized in ways that can optimize the execution time of queries.
If you need to preserve the original order, you might need to add an additional column noting the relative order of the rows, and then reference it with an ORDER BY in your queries.
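For example, a small sketch with the google-cloud-bigquery Python client, assuming a row_num column was added to the data before loading to record each row's original position (the project, dataset, table, and column names are made up):

from google.cloud import bigquery

client = bigquery.Client()

# `row_num` is the assumed extra column holding each row's position in the original CSV.
sql = """
    SELECT *
    FROM `my_project.my_dataset.my_table`
    ORDER BY row_num
"""
for row in client.query(sql).result():
    print(row)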

Datastore for aggregations

What is a preferred datastore for fast aggregation of data?
I have data that I pull from other systems regularly, and the data store should support queries like:
What is the number of transactions done by a user in a time range?
What is the total sum of successful transactions done by a user in a time range?
Queries should support SQL constructs like GROUP BY, COUNT, SUM, etc. over a large set of data.
Right now I'm using a custom data model in Redis: data is fetched into memory and aggregates are then run over it. The problem with this model is that it is closely tied to my pivots (columns); any additional pivot causes my data to explode, leading to huge memory consumption on my Redis boxes.
I've explored Elasticsearch, but Elasticsearch queries with aggregations take longer than 200 ms for the kind of data that I have.
Are there any other alternatives? I'm also looking at Aerospike now. Can someone shed some light on how Aerospike aggregations would work in this scenario?
Aerospike supports aggregations on top of secondary index queries. It seems most of your queries are pivoted on user, so you can build a secondary index on userid and query for all the data corresponding to a user. You can then apply the aggregation logic and filter on the desired time range. You need to do this because Aerospike does not yet support multiple where clauses, i.e. you cannot query on a user and a time range at the same time.
Your queries 1 & 2 can be done by writing an aggregation UDF on top of a secondary index query on userid, as above.
I am not very clear on your third requirement. Aerospike does not provide GROUP BY, SUM, COUNT, etc. as native queries, but you can always write an aggregation UDF to achieve them: http://www.aerospike.com/docs/guide/aggregation.html
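A rough sketch with the Aerospike Python client of what that looks like; the namespace, set, bin names, time range, and the txn_aggs/sum_success_in_range stream UDF are all assumptions (the UDF itself would be written in Lua and registered on the server separately):

import aerospike
from aerospike import predicates

config = {"hosts": [("127.0.0.1", 3000)]}  # placeholder cluster address
client = aerospike.client(config).connect()

start_ts, end_ts = 1609459200, 1612137600  # placeholder time range (epoch seconds)

query = client.query("test", "transactions")  # placeholder namespace and set
query.where(predicates.equals("userid", "user-123"))  # uses the secondary index on userid
# The hypothetical stream UDF filters records to the time range and sums successful transactions.
query.apply("txn_aggs", "sum_success_in_range", [start_ts, end_ts])

for result in query.results():  # aggregated result(s) produced by the UDF
    print(result)

client.close()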

How did my BigQuery bill get calculated

I've been playing with BigQuery and building some aggregates for a reporting product. To my horror, I noticed that my bill so far this month is over $4000! There doesn't seem to be any way of breaking down my usage - is there a report I can get of queries by data processed/cost?
BigQuery data processing is billed at $0.035 per gigabyte (see the pricing page). You can see how much data you're processing by looking at your query jobs.
If you're using the UI, it will tell you how much data your query processed next to the 'Run Query' button. If you're using the bq command-line tool, you can list the jobs you've run with bq ls -j and then show how much data each job processed with bq show -j job_0bb47924271b433b895b690726099f69 (substitute your own job id). If you're running queries through the API directly, the number of bytes scanned is returned on the job in the statistics.totalBytesProcessed field.
If you'd like to reduce the amount you're spending, you can use fewer columns in your queries, break your tables up into smaller pieces (e.g. daily tables), or use the batch query mode for non-time-sensitive queries, which is billed at only $0.020/GB processed.
If you break tables into smaller pieces, you can still run queries over multiple tables when needed using the ',' syntax, e.g. SELECT foo FROM table1, table2, table3.
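If you're going through the API from Python, here is a hedged sketch using the google-cloud-bigquery client to estimate a query's cost up front with a dry run and to inspect how much data recent jobs actually processed (the query and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
sql = "SELECT foo FROM `my_dataset.table1`"  # placeholder query

# Dry run: reports the bytes the query would scan without running (or billing) it.
dry_cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
estimate = client.query(sql, job_config=dry_cfg)
print("Would process about", estimate.total_bytes_processed, "bytes")

# List recent jobs and how much data each query job processed.
for job in client.list_jobs(max_results=10):
    if job.job_type == "query":
        print(job.job_id, job.total_bytes_processed)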