AWS Athena: can we write query results in a distributed manner upon query?

I'm trying to compare the performance of SELECT vs. CTAS.
The reason CTAS is faster for bigger data is the data format and its ability to write query results in a distributed manner into multiple Parquet files.
All Athena query results are written to S3 and then read from there (I may be wrong). A regular SELECT writes its result into a single file; is there a way to have it written in a distributed manner, without bucketing or partitioning?
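For reference, a rough sketch of the two patterns being compared (Athena SQL; the table, column, and S3 location names are hypothetical). Newer Athena engine versions also offer UNLOAD, which writes a plain SELECT result to S3 without creating a table:

    -- CTAS: Athena can write the result as multiple Parquet files in parallel
    CREATE TABLE results_parquet            -- hypothetical table name
    WITH (
        format = 'PARQUET',
        external_location = 's3://my-bucket/ctas-results/',   -- hypothetical bucket
        bucketed_by = ARRAY['customer_id'],
        bucket_count = 10
    ) AS
    SELECT customer_id, revenue
    FROM source_table;

    -- UNLOAD (Athena engine v2+): writes a SELECT result straight to S3,
    -- no table, no bucketing or partitioning required
    UNLOAD (SELECT customer_id, revenue FROM source_table)
    TO 's3://my-bucket/unload-results/'
    WITH (format = 'PARQUET');

Both statements write their output under the given S3 locations, typically as multiple files depending on how much data each writer produces.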

Related

Small single parquet file on Data Lake, or relational SQL DB?

I am designing a Data Lake in Azure Synapse and, in my model, there is a table that will store a small amount of data (like 5000 rows).
The single Parquet file that stores this data will surely be smaller than the smallest recommended size for a Parquet file (128 MB), and I know that Spark is not optimized to handle small files. This table will be linked to a delta table, and I will insert/update new data using the MERGE command.
In this scenario, regarding performance, is it better to stick with a delta table, or should I create a SQL relational table in another DB and store this data there?
It depends on multiple factors, like the types of queries you will be running and how often you want to run the MERGE command to upsert data into the delta table.
But even if you do perform analytical queries, given the size of the data I would go with a relational DB.
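For reference, the delta-table upsert mentioned in the question would look roughly like this (a sketch in Spark SQL for Delta Lake; the table and column names are hypothetical):

    -- Hypothetical delta table 'dim_customer' upserted from a staging table
    MERGE INTO dim_customer AS target
    USING dim_customer_updates AS source
    ON target.customer_id = source.customer_id
    WHEN MATCHED THEN
        UPDATE SET *
    WHEN NOT MATCHED THEN
        INSERT *;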

AWS Equivalent For Incremental SQL Query

I can create a materialised view in RDS (PostgreSQL) to keep track of the 'latest' data output from a SQL query, and then visualise it in QuickSight. This process is also very quick, as it doesn't call additional AWS services or re-process all the data again through the SQL query. My assumption about how this works is that it runs the SQL, then re-runs it but not over the whole data again, so that if you structure the query correctly you can end up with a 'real-time running total' metric, for example.
The issue is that creating materialised views (every 5 seconds) for hundreds of queries, and storing them all in a database, is not scalable. Imagine a DB with 1 TB of data: creating an incremental/materialised view seems much less painful than using other AWS services, but eventually it won't be optimal for processing time, cost, etc.
I have explored various AWS services, none of which seem to solve this problem.
I tried AWS Glue. You would need to create one script per query and output it to a DB. The lag between reading and writing the incremental data is larger than with a materialised view, because you can process the data incrementally, but appending it to the current 'total' metric is another process.
I explored AWS Kinesis followed by a Lambda to run SQL on the 'new' data in the stream and store the value in S3 or RDS. Again, this adds latency and doesn't work as well as a materialised view.
I read that AWS Redshift does not have materialised views, so I stuck with RDS (PostgreSQL).
Any thoughts?
[A similar issue: incremental SQL query - except I want to avoid running the SQL on "all" data to avoid massive processing costs.]
Edit (example):
table1 has schema (datetime, customer_id, revenue)
I run this query: select sum(revenue) from table1.
This would scan the whole table to come up with the metric.
table1 now gets updated with new data as the datetime progresses, e.g. 1 hour of extra data.
If I run select sum(revenue) from table1 again, it scans all the data again.
A more efficient way would be to compute the query only on the new data and append the result to the previous total.
Also, I want the query to run whenever there is a change in the data, rather than on a schedule, so that my front-end dashboards basically auto-update without the customer doing much.
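To make the pattern concrete, here is a rough PostgreSQL sketch of both approaches, using the example schema above (the summary table and its columns are hypothetical, introduced only for illustration):

    -- Naive approach: a materialised view that re-runs the full query on refresh
    CREATE MATERIALIZED VIEW total_revenue_mv AS
    SELECT sum(revenue) AS total_revenue
    FROM table1;

    REFRESH MATERIALIZED VIEW total_revenue_mv;   -- rescans table1 each time

    -- Incremental approach: keep a running total plus a watermark of what has
    -- already been processed, and only aggregate the new rows
    CREATE TABLE revenue_summary (
        total_revenue   numeric   NOT NULL DEFAULT 0,
        processed_up_to timestamp NOT NULL DEFAULT '-infinity'
    );
    INSERT INTO revenue_summary DEFAULT VALUES;

    WITH new_rows AS (
        SELECT coalesce(sum(revenue), 0) AS delta, max(datetime) AS latest
        FROM table1
        WHERE datetime > (SELECT processed_up_to FROM revenue_summary)
    )
    UPDATE revenue_summary
    SET total_revenue   = total_revenue + new_rows.delta,
        processed_up_to = coalesce(new_rows.latest, processed_up_to)
    FROM new_rows;

The incremental UPDATE still has to be triggered when new rows arrive (for example by a trigger or a scheduled job), which is exactly the run-on-change versus run-on-a-schedule trade-off described above.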

Data Movement from generic-ODBC database to BigQuery

I would like to save the result of a query on an external DB to BigQuery.
I am using pyodbc to manage the odbc connection.
What is the most efficient way to perform such an operation?
Should I fetch each cursor row one at a time (fetchone) and then insert it into BigQuery?
Does the result contain a large amount of data?
If the result is small, you can just read all the rows and insert them into BigQuery. The benefit is that the result is immediately available to BigQuery queries. However, for large results the streaming insert might be expensive (see https://cloud.google.com/bigquery/pricing).
For large results I would just save the result to a file (commonly CSV), upload it to Google Cloud Storage, and run a load job.
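For the load-job route, once the CSV has been uploaded, recent BigQuery versions can also run the load as a SQL statement (a sketch; the dataset, table, and bucket names are hypothetical):

    -- Load the exported CSV from Cloud Storage into a BigQuery table
    LOAD DATA INTO my_dataset.odbc_export
    FROM FILES (
        format = 'CSV',
        uris = ['gs://my-bucket/exports/result-*.csv'],
        skip_leading_rows = 1     -- skip the header row
    );

The same load can also be triggered from the pyodbc script via the BigQuery client library, or with the bq command-line tool.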

How does Hive execute a query?

Hive accepts SQL-like queries. Under the hood, how is a query executed? Is it the same as how an RDBMS executes a SQL query?
Hive query processing has significant similarities to, and differences from, a standard RDBMS.
Some key Similarities:
Support for a SQL grammar: though not full ANSI SQL-92, it is a fair subset.
Query Parser
Query Optimizer
Execution planner
Some key Differences:
Support for HDFS loading and features
Hive-specific functions such as explode, regexp_*, split
Accepting/processing Hadoop/cluster configuration and tuning parameters
Managing input/output formats such as for HDFS, S3, Avro, etc.
Creation of a DAG of Map/Reduce stages/jobs
Coordination with the JobTracker for management of Map/Reduce jobs, including the job lifecycle: submission, monitoring, and so on
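A quick way to see the last two points in practice is Hive's EXPLAIN statement, which prints the plan of stages a query compiles to (a sketch; the table name is hypothetical):

    -- Show the execution plan (DAG of stages) Hive generates for a query
    EXPLAIN
    SELECT customer_id, SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY customer_id;

The output lists the stage dependencies and, when running on the MapReduce engine, the map and reduce operator trees for each stage.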

Using Hive with Pig

My Hive query has multiple outer joins and takes a very long time to execute. I was wondering if it would make sense to break it into multiple smaller queries and use Pig to perform the transformations.
Is there a way I could query hive tables or read hive table data within a pig script?
Thanks
The goal of the Howl project is to allow Pig and Hive to share a single metadata repository. Once Howl is mature, you'll be able to run Pig Latin and HiveQL queries over the same tables. For now, you can try to work with the data as it is stored in HDFS.
Note that Howl has been renamed to HCatalog.