Is there a way to check how frequently a table has been accessed/queried in AWS Redshift?
The frequency could be daily, monthly, hourly, or whatever. Can someone help me?
It could be SQL queries against Redshift system tables or a Python script. What is the best way?
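For reference, here is a minimal sketch of the kind of system-table query the question mentions, wrapped in a Python script. It assumes the STL_SCAN and SVV_TABLE_INFO system tables, a placeholder cluster endpoint and credentials, and a daily grain (the DATE_TRUNC unit can be swapped for 'hour' or 'month'). Note that STL_SCAN only retains a few days of history, so the script would need to run on a schedule and store its output somewhere for longer-term trends:

# Hypothetical sketch: count how many distinct queries scanned each table per day,
# by joining STL_SCAN (scan steps) to SVV_TABLE_INFO (table names).
# The connection details are placeholders.
import psycopg2

ACCESS_FREQUENCY_SQL = """
SELECT ti."table"                     AS table_name,
       DATE_TRUNC('day', s.starttime) AS scan_day,
       COUNT(DISTINCT s.query)        AS query_count
FROM stl_scan s
JOIN svv_table_info ti ON s.tbl = ti.table_id
GROUP BY 1, 2
ORDER BY 1, 2;
"""

def fetch_table_access_frequency():
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-west-2.redshift.amazonaws.com",  # placeholder endpoint
        port=5439,
        dbname="mydb",
        user="myuser",
        password="mypassword",
    )
    try:
        with conn.cursor() as cur:
            cur.execute(ACCESS_FREQUENCY_SQL)
            return cur.fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    for table_name, scan_day, query_count in fetch_table_access_frequency():
        print(table_name, scan_day, query_count)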
I want to use Delta Lake tables in my Hive Metastore on Azure Data Lake Gen2 as the basis for my company's lakehouse.
Previously, I used "regular" Hive catalog tables. I would load data from parquet into a Spark dataframe and create a temp table using df.createOrReplaceTempView("TableName"), so I could use Spark SQL or %%sql magic to do ETL. After doing this, I can use spark.sql or %%sql on TableName. When I was done, I would write my tables to the Hive Metastore.
However, what if I don't want to perform this saveAsTable operation and write to my data lake? What would be the best way to perform ETL with SQL?
I know I can persist Delta tables in the Hive Metastore in a number of ways, for instance by creating a managed catalog table through df.write.format("delta").saveAsTable("LakeHouseDB.TableName").
I also know that I can create a DeltaTable object through DeltaTable.forPath(spark, table_path_data_lake), but then I can only use the Python API and not SQL.
Does there exist some equivalent of createOrReplaceTempView(), or is there a better way to achieve ETL with SQL without 'writing' to the data lake first?
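For illustration, a minimal PySpark sketch of the workflow described above, with placeholder paths, view and table names (an existing Delta path could likewise be read with spark.read.format("delta").load(path)):

# Hypothetical sketch of the ETL flow described in the question; all names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-etl").getOrCreate()

# Read raw data into a DataFrame (parquet here).
raw_df = spark.read.parquet("abfss://raw@mydatalake.dfs.core.windows.net/sales/")

# Register a temp view so the transformations can be written in SQL.
raw_df.createOrReplaceTempView("SalesRaw")

curated_df = spark.sql("""
    SELECT customer_id,
           SUM(amount) AS total_amount
    FROM SalesRaw
    GROUP BY customer_id
""")

# Only the final result is persisted, as a managed Delta table in the Hive Metastore.
curated_df.write.format("delta").mode("overwrite").saveAsTable("LakeHouseDB.SalesCurated")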
However, what if I don't want to perform this saveAsTable operation and write to my data lake? What would be the best way to perform ETL with SQL?
This is not possible with Delta Lake, since it relies heavily on the transaction log (_delta_log) stored under the data directory of a Delta table.
I'm looking at storing data in S3 in ORC format for querying with Athena.
I want to partition the data like so ...
.../year=2019/month=7/
... and bucket the data further by id (each id will have multiple records for each month, and there are lots of ids).
I want to be able to insert new data into this structure daily... I understand that I can't use the INSERT INTO statement from Athena because bucketed tables are not supported.
What would be the best way to insert data daily into a table of this structure? Is it even possible to do with bucketed data?
Cheers
Presto allows inserts into existing partitions of bucketed, partitioned tables since Presto 312. If Athena does not support this, you can easily run a Presto cluster yourself, e.g. using the Starburst Presto AWS integration. (I can recommend this for other reasons too, as it can be much cheaper than Athena if you run more than just a few queries. Disclaimer: I'm from Starburst.)
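To make that concrete, here is a hedged sketch of what the DDL and a daily insert might look like from Python against a self-managed Presto cluster, using the PyHive client; the host, catalog, schema, table, columns, bucket count and S3 location are all placeholders, and the table properties follow the Presto Hive connector syntax (partition columns go last in the column list):

# Hypothetical sketch using the PyHive Presto client; all names and locations are placeholders.
from pyhive import presto

conn = presto.connect(host="presto-coordinator.example.com", port=8080,
                      catalog="hive", schema="analytics")
cur = conn.cursor()

# One-time DDL: an ORC table partitioned by year/month and bucketed by id.
cur.execute("""
CREATE TABLE IF NOT EXISTS events (
    id      varchar,
    payload varchar,
    year    integer,
    month   integer
)
WITH (
    format = 'ORC',
    external_location = 's3://my-bucket/events/',
    partitioned_by = ARRAY['year', 'month'],
    bucketed_by = ARRAY['id'],
    bucket_count = 32
)
""")

# Daily load: append yesterday's rows into the existing partition.
cur.execute("""
INSERT INTO events
SELECT id, payload, 2019, 7
FROM staging_events
WHERE event_date = current_date - interval '1' day
""")
cur.fetchall()  # fetching the (row count) result drives the INSERT to completion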
I have two MySQL databases, each on its own EC2 instance. Each database has a table 'report' under a schema 'product'. I use a crawler to get the table schemas into the AWS Glue Data Catalog in a database called db1. Then I use AWS Glue to copy the tables from the EC2 instances into an S3 bucket, and I query the tables with Redshift. I get the external schema into Redshift from the AWS crawler using the script below in the query editor.
I would like to union the two tables into one table and add a column 'source' with a flag to indicate the original table each record came from. Does anyone know if it's possible to do that with AWS Glue during the ETL process? Or can you suggest another solution? I know I could just union them with SQL in Redshift, but my end goal is to create an ETL pipeline that does that before the data gets to Redshift.
script:
create external schema schema1 from data catalog
database 'db1'
iam_role 'arn:aws:iam::228276743211:role/madeup'
region 'us-west-2';
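For what it's worth, here is a hedged sketch of what the union plus a 'source' flag could look like inside a Glue PySpark job, before the data ever reaches Redshift; the database and table names, source labels and output path are placeholders based on the setup described above:

# Hypothetical Glue job sketch: read the two catalog tables, tag each with a
# 'source' column, union them, and write a single dataset to S3.
# Database/table names, labels and the output path are placeholders.
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql.functions import lit

glue_context = GlueContext(SparkContext.getOrCreate())

df1 = glue_context.create_dynamic_frame.from_catalog(
    database="db1", table_name="mysql_table_1").toDF()
df2 = glue_context.create_dynamic_frame.from_catalog(
    database="db1", table_name="mysql_table_2").toDF()

combined = (
    df1.withColumn("source", lit("instance_1"))
    .unionByName(df2.withColumn("source", lit("instance_2")))
)

# Convert back to a DynamicFrame and write out (Parquet on S3 as an example).
out = DynamicFrame.fromDF(combined, glue_context, "combined")
glue_context.write_dynamic_frame.from_options(
    frame=out,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/combined_report/"},
    format="parquet",
)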
You can create a view that unions the two tables using Athena; that view will then be available in Redshift Spectrum.
CREATE OR REPLACE VIEW db1.combined_view AS
SELECT col1, col2, col3 FROM db1.mysql_table_1
UNION ALL
SELECT col1, col2, col3 FROM db1.mysql_table_2;
Run the above using Athena (not Redshift).
I need to transform a fairly big database table to CSV with AWS Glue. However, I only need the newest rows from the past 24 hours. There is a column which specifies the creation date of each row. Is it possible to transform just those rows, without copying the whole table into the CSV file? I am using a Python script with Spark.
Thank you very much in advance!
There are some Built-in Transforms in AWS Glue which can be used to process your data, and these transforms can be called from your ETL scripts.
Please refer to the link below:
https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html
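For example, here is a hedged sketch of using the built-in Filter transform from that page to keep only the rows created in the last 24 hours; the catalog database/table, the created_at column and the output path are placeholders. Note that Filter is applied after the source is read, so the whole table is still scanned; the JDBC pushdown approach in the next answer avoids that:

# Hypothetical sketch: filter a Glue DynamicFrame down to rows created in the last
# 24 hours with the built-in Filter transform, then write the result as CSV.
# Database, table, column and path names are placeholders.
from datetime import datetime, timedelta

from awsglue.context import GlueContext
from awsglue.transforms import Filter
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
cutoff = datetime.utcnow() - timedelta(hours=24)

rows = glue_context.create_dynamic_frame.from_catalog(
    database="mydb", table_name="big_table")

recent = Filter.apply(frame=rows, f=lambda row: row["created_at"] >= cutoff)

glue_context.write_dynamic_frame.from_options(
    frame=recent,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/daily_export/"},
    format="csv",
)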
You haven't mentioned the type of database that you are trying to connect to. In any case, for JDBC connections Spark has a query option, in which you can issue a regular SQL query to fetch only the rows you need.
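A hedged sketch of that approach, assuming a MySQL-style source purely for the example; the JDBC URL, driver, credentials, schema/table and the created_at column are placeholders (the query option is available for Spark's JDBC source since 2.4):

# Hypothetical sketch: push the 24-hour filter down to the source database via
# Spark's JDBC "query" option, then write the result as CSV.
# URL, driver, credentials and names are placeholders; the WHERE clause assumes MySQL syntax.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-export").getOrCreate()

pushdown_query = """
    SELECT *
    FROM my_schema.big_table
    WHERE created_at >= NOW() - INTERVAL 1 DAY
"""

recent_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://my-host:3306/my_schema")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("query", pushdown_query)
    .option("user", "etl_user")
    .option("password", "etl_password")
    .load()
)

recent_df.write.mode("overwrite").csv("s3://my-bucket/daily_export/", header=True)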
I currently have Hadoop 2, Pig, Hive and HBase.
I have some input data, which I have loaded into HDFS.
I want to create staging data in this environment.
My question is:
In which big data component (Pig/Hive/HBase) should I create the staging table? It will have data coming in based on a condition, and later we might want to run MapReduce jobs with complex logic on it.
Please assist
Hive: if you have an OLAP kind of workload and don't need real-time read/write.
HBase: if you have an OLTP kind of workload and need real-time/streaming read/write. Some batch or OLAP processing can still be done using MapReduce, and SQL-like querying is possible using Apache Phoenix.
You can run MapReduce jobs on both Hive and HBase.
Anywhere you want. Pig is not an option, as it does not have a metastore. Use Hive if you want SQL-like queries; use HBase depending on your access patterns.
When you run a Hive query on top of the data, it is converted into MapReduce anyway.
If you create the staging table in Hive, use Hive queries rather than hand-written MapReduce. If you intend to use MapReduce directly, use Pig instead; you will not benefit from creating a Hive table on top of the data.
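To illustrate the Hive route, here is a hedged sketch of creating an external staging table over the data already sitting in HDFS. It is issued through the PyHive client purely for illustration (any Hive CLI or Beeline session would do), and the host, database, columns, delimiter and HDFS path are placeholders:

# Hypothetical sketch: create an external Hive staging table over data already in HDFS
# and derive a conditionally filtered table from it. Queries like these compile to
# MapReduce jobs under the hood. All names, the delimiter and the path are placeholders.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000, database="default")
cur = conn.cursor()

cur.execute("""
CREATE EXTERNAL TABLE IF NOT EXISTS staging_input (
    id          STRING,
    event_time  STRING,
    payload     STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hadoop/inputdata'
""")

# Load only the rows that meet the staging condition into a separate table.
cur.execute("""
CREATE TABLE staging_filtered AS
SELECT * FROM staging_input WHERE payload IS NOT NULL
""")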