Athena Create External Table for different payloads - hive

Hi, I am working in the AWS Athena editor. I am trying to create an external table for a pool of Kafka events saved in S3. The Kafka events have different payloads, for example:
{
  "type": "job_started",
  "jobId": "someId"
}
{
  "type": "equipment_taken",
  "equipmentId": "equipmentId"
}
and I was wondering if there is a way to do something like this:
Create External table _table()... where type='job_started'..
LOCATION 'some-s3-bucket'
TBLPROPERTIES ('has_encrypted_data' = 'false')
I know that I can create a table whose schema includes all the attributes of every event type (job_started and equipment_taken), but then there will be a lot of nulls, and each time there is a new event "type" I have to expand the schema, so it keeps growing. Instead I want to have two tables (table_for_job_started and table_for_equipment_taken), each mapping to the data in the S3 bucket for its type, so that only relevant data is populated. Can you help with this?
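For illustration, this is the kind of per-type DDL I have in mind. It is only a sketch, assuming one JSON object per line in the S3 objects and the OpenX JSON SerDe (which returns NULL for keys missing from a record); note that both tables still point at the same location, so queries would still need to filter on type (or use a view):

CREATE EXTERNAL TABLE table_for_job_started (
  type string,
  jobId string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://some-s3-bucket/'
TBLPROPERTIES ('has_encrypted_data' = 'false');

CREATE EXTERNAL TABLE table_for_equipment_taken (
  type string,
  equipmentId string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://some-s3-bucket/'
TBLPROPERTIES ('has_encrypted_data' = 'false');

-- Optional: a view that keeps only the rows relevant to one event type.
CREATE OR REPLACE VIEW job_started_events AS
SELECT jobId FROM table_for_job_started WHERE type = 'job_started';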

Related

How to create a blank "Delta" Lake table schema in Azure Data Lake Gen2 using Azure Synapse Serverless SQL Pool?

I have a file with data integrated from 2 different sources using Azure Mapping Data Flow and loaded into an ADLS Gen2 data lake container/folder, for example /staging/EDW/Current/products.parquet.
I now need to process this file in staging using Azure Mapping Data Flow and load it into its corresponding dimension table using the SCD type 2 method to maintain history.
However, I want to try creating & processing this dimension table as a "Delta" table in Azure Data Lake using Azure Mapping Data Flow only. SCD type 2 requires a source lookup to check whether there are any existing records/rows: if not, insert them all; if some records have changed, do updates, etc. (say, during the first load).
For that, I need to first create a default/blank "Delta" table in an Azure Data Lake folder, for example /curated/Delta/Dimension/Products/. Just like we would have done if it were in Azure SQL DW (Dedicated Pool), where we could have first created a blank dbo.dim_products table with just the schema/structure and no rows.
I am trying to implement a DataLake-House architecture by utilizing & evaluating the best features of both Delta Lake and Azure Synapse Serverless SQL pool with Azure Mapping Data Flow - for performance, cost savings, ease of development (low code) & understanding. At the same time, I want to avoid a Logical Data Warehouse (LDW) kind of architecture at this time.
For this, I tried creating a new database under the built-in Azure Synapse Serverless SQL pool, defined a data source and a file format, and created a blank Delta table/schema structure (without any rows); but no luck.
create database delta_dwh;
create external data source deltalakestorage
with ( location = 'https://aaaaaaaa.dfs.core.windows.net/curated/Delta/' );
create external file format deltalakeformat
with (format_type = delta);
drop external table products;
create external table dbo.products
(
product_skey int,
product_id int,
product_name nvarchar(max),
product_category nvarchar(max),
product_price decimal (38,18),
valid_from date,
valid_to date,
is_active char(1)
)
with
(
location='https://aaaaaaaa.dfs.core.windows.net/curated/Delta/Dimensions/Products',
data_source = deltalakestorage,
file_format = deltalakeformat
);
However, this fails since a Delta table requires a _delta_log/*.json folder/file to be present, which maintains the transaction log. That means I would first have to write a few (dummy) rows in Delta format to the target folder, and only then could I read it and run the following kind of query used in the SCD type 2 implementation:
select isnull(max(product_skey), 0)
FROM OPENROWSET(
    BULK 'https://aaaaaaaa.dfs.core.windows.net/curated/Delta/Dimensions/Products/',
    FORMAT = 'DELTA') as rows
Any thoughts, inputs, or suggestions?
Thanks!
You may try creating an initial/dummy data flow + pipeline to create these empty Delta files.
It's only a simple workaround.
Create a CSV with your sample table data.
Create a data flow with the name initDelta.
Use this CSV as the source in the data flow.
In the projection panel, set up the correct data types.
Add a filter after the source and set up a dummy filter such as 1=2.
Add a sink with Delta output.
Put your initDelta data flow into a dummy pipeline and run it.
The folder structure for Delta should be created.
You mentioned that your initial data is in a Parquet file. You can use this file instead: the schema of the table (columns and data types) will be imported from the file. Filter out all rows and save the result as Delta.
I think it should work, unless I missed something in your problem.
I don't think you can use Serverless SQL pool to create a Delta table... yet. I think it is coming soon though.
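As an aside: if a Synapse Spark pool (or Databricks) is an option, an empty Delta table with just the schema can be created directly with Spark SQL. This is only a sketch, and the abfss:// path is a guess at the container layout used above:

-- Runs on a Spark pool with Delta Lake, not on the serverless SQL pool.
-- Creating the table writes the initial _delta_log entry, so the folder becomes readable afterwards.
CREATE TABLE IF NOT EXISTS dim_products (
    product_skey INT,
    product_id INT,
    product_name STRING,
    product_category STRING,
    product_price DECIMAL(38,18),
    valid_from DATE,
    valid_to DATE,
    is_active STRING
)
USING DELTA
LOCATION 'abfss://curated@aaaaaaaa.dfs.core.windows.net/Delta/Dimensions/Products';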

AWS - How to extract CSV reports from a set of JSON files in S3

I have an RDS database with the following structure: CustomerId | Date | FileKey.
FileKey points to a JSON file in S3.
Now I want to create CSV reports with a customer filter, date-range filters, and a columns definition (ColumnName + JsonPath), like this:
Name => data.person.name
OtherColumn1 => data.exampleList[0]
OtherColumn2 => data.exampleList[2]
I often need to add and remove columns from the columns definition.
I know I can run a SQL SELECT on RDS, get each S3 file (JSON), extract the data, and create my CSV file, but this is not a good solution because I would need to query my RDS instance and make millions of requests to S3 for every report request or every change to the columns definition.
Saving all the data in an RDS table instead of S3 is also not a good solution, because the JSON files contain a lot of data and the columns are not the same across customers.
Any idea?
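To illustrate the columns definition: expressed as SQL (for example Athena/Presto over the raw JSON, as in the first question), each ColumnName + JsonPath pair would become one extracted column. The table, columns, and filter values below are hypothetical:

-- Hypothetical table 'customer_files' exposing each JSON document as a string column 'json_data',
-- with the customer id and date also available (e.g. as partition columns) for filtering.
SELECT
    json_extract_scalar(json_data, '$.data.person.name')    AS Name,
    json_extract_scalar(json_data, '$.data.exampleList[0]') AS OtherColumn1,
    json_extract_scalar(json_data, '$.data.exampleList[2]') AS OtherColumn2
FROM customer_files
WHERE customer_id = '12345'
  AND file_date BETWEEN date '2020-01-01' AND date '2020-01-31';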

Mapping AWS glue table columns to target RDS instance table columns

I have created a Glue job that takes data from an S3 bucket and inserts it into an RDS Postgres instance.
In the S3 bucket I have created different folders (partitions).
Can I map different columns in different partitions to the same target RDS instance table?
When you say partition in S3, is it indicated using the Hive style? E.g. bucket/folder1/folder2/partition1=xx/partition2=xx/partition3=yy/..
If so, you shouldn't be storing data with different structures of information in S3 partitions and then mapping them to a single table. However, if it's just data in different folders, like s3://bucket/folder/folder2/folder3/.., and these are genuinely different datasets, then yes, it is possible to map those datasets to a single table. However, you cannot do it via the UI. You will need to read these datasets as separate dynamic/data frames, join them using a key in Glue/Spark, and load them to RDS.

How does the OVERWRITE_EXISTING insert mode work in RedshiftCopyActivity for AWS Data Pipeline

I am new to AWS Data Pipeline. We have a use case where we copy updated data into Redshift. I wanted to know whether I can use the OVERWRITE_EXISTING insert mode for RedshiftCopyActivity. Also, please explain the internal workings of OVERWRITE_EXISTING.
Data Pipeline is used to move data from DynamoDB or Amazon S3 to Amazon Redshift. You can load data into a new table, or easily merge data into an existing table.
OVERWRITE_EXISTING overwrites the existing data in the destination table, but it relies on a unique identifier (primary key) being defined on the Redshift table.
You can use TRUNCATE if you don't want your table structure to change by adding a primary key.
For more details, see: https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html
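For intuition, on the Redshift side OVERWRITE_EXISTING behaves roughly like the classic staging-table merge pattern sketched below. This is not what the activity literally executes; the table names, S3 path, and IAM role are placeholders:

begin transaction;

-- Stage the incoming data in a temporary copy of the target table.
create temp table stage_target (like target);

copy stage_target
from 's3://my-bucket/updated-data/'
iam_role 'arn:aws:iam::111122223333:role/MyRedshiftCopyRole'
format as csv;

-- Delete destination rows whose primary key matches an incoming row...
delete from target
using stage_target
where target.id = stage_target.id;

-- ...then insert the new versions of those rows (plus any brand-new rows).
insert into target
select * from stage_target;

end transaction;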

Can I issue a query rather than specify a table when using the BigQuery connector for Spark?

I have followed the "Use the BigQuery connector with Spark" guide to extract data from a table in BigQuery by running the code on Google Dataproc. As far as I'm aware, the code shared there:
conf = {
    # Input Parameters.
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'publicdata',
    'mapred.bq.input.dataset.id': 'samples',
    'mapred.bq.input.table.id': 'shakespeare',
}

# Output Parameters.
output_dataset = 'wordcount_dataset'
output_table = 'wordcount_output'

# Load data in from BigQuery.
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
copies the entirety of the named table into input_directory. The table I need to extract data from contains >500m rows and I don't need all of those rows. Is there a way to instead issue a query (as opposed to specifying a table) so that I can copy a subset of the data from a table?
It doesn't look like BigQuery supports any kind of filtering/querying for table exports at the moment:
https://cloud.google.com/bigquery/docs/exporting-data
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.extract
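A commonly used workaround consistent with that limitation is to materialize the subset with a query job first, and then point the connector (or an export) at the resulting table instead of the full one. A sketch, with the destination table name as a placeholder:

-- Run as a BigQuery query job (standard SQL CTAS), then point
-- 'mapred.bq.input.project.id' / 'dataset.id' / 'table.id' at the subset table.
CREATE TABLE wordcount_dataset.shakespeare_subset AS
SELECT word, word_count, corpus
FROM `publicdata.samples.shakespeare`
WHERE corpus = 'hamlet';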