How to map dynamic DynamoDB columns in EMR Hive

I have a table in Amazon DynamoDB with a record structure like
{"username" : "joe bloggs" , "products" : ["1","2"] , "expires1" : "01/01/2013" , "expires2" : "01/02/2013"}
where the products property is a list of products belonging to the user, and the expiresN properties correspond to the products in that list. The list of products is dynamic and can be long. I need to transfer this data to S3 in a format like
joe bloggs|1|01/01/2013
joe bloggs|2|01/02/2013
Using Hive external tables I can map the username and products columns in DynamoDB; however, I am unable to map the dynamic columns. Is there a way that I could extend or adapt org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler in order to interpret and structure the data retrieved from DynamoDB before Hive ingests it? Or is there an alternative solution to convert the DynamoDB data to first normal form?
One of my key requirements is that I maintain the throttling provided by the dynamodb.throughput.read.percent setting so that I do not compromise operational use of the table.

You could build a custom UDTF (user-defined table-generating function) for that case.
I'm not sure how Hive handles an asterisk (which you would probably need in your case) as an argument to the function.
It would be something like what explode (source) does.
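For illustration only, here is a rough HiveQL sketch of an alternative that avoids a custom UDTF. It assumes the EMR connector's option of mapping the whole DynamoDB item into a single map<string,string> column; the table and column names are placeholders, and attribute values may come back wrapped in DynamoDB type markers that need extra parsing.

-- Map the entire item instead of individual attributes (placeholder names).
CREATE EXTERNAL TABLE dynamo_users (item map<string,string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
    "dynamodb.table.name" = "users",
    "dynamodb.throughput.read.percent" = "0.5"
);

-- Pair each product with its matching expiresN attribute, one output row per product.
SELECT
    item['username'] AS username,
    product,
    item[concat('expires', cast(pos + 1 AS string))] AS expires
FROM dynamo_users
LATERAL VIEW posexplode(split(item['products'], ',')) p AS pos, product;

Because the scan still goes through the same storage handler, the dynamodb.throughput.read.percent throttling should still be honored.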

Related

How to update insert new record with updated value from staging table in Azure Data Explorer

I have a requirement where data is ingested from Azure IoT Hub. Sample incoming data:
{
    "message": {
        "deviceId": "abc-123",
        "timestamp": "2022-05-08T00:00:00+00:00",
        "kWh": 234.2
    }
}
I have the same column mapping in the Azure Data Explorer table. kWh always arrives as a cumulative value, not as the delta between two timestamps. Now I need another table that holds the difference between the last inserted kWh value and the current kWh.
It would be a great help if anyone has a suggestion or solution here.
I'm able to calculate the difference on the fly using prev(), but I need to update the table while inserting the data into it.
As far as I know, there is no way to perform data manipulation on the fly and ingest Azure IoT data into Azure Data Explorer through JSON mapping. However, I found a couple of approaches you can take to get the calculations you need. Both approaches involve creating a secondary table to store the calculated data.
Approach 1
This is the closest approach I found to on-the-fly data manipulation. For this to work you would need to create a function that calculates the difference in the kWh field for the latest entry. Once the function is created, you can bind it to the secondary (target) table using an update policy so that it triggers for every new entry on your source table.
Refer to the following resource, Ingest JSON records, which explains with an example how to create a function and bind it to the target table.
Note that you would have to create your own custom function that calculates the difference in kWh.
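In case it helps, here is a rough, untested KQL sketch of what such a function and update policy could look like; the table, column, and function names (SourceTable, TargetTable, deviceId, timestamp, kWh, CalcKwhDelta) are placeholders and not taken from the linked resource.

// Function that adds a kWh delta column. Note that an update policy query only
// sees the rows of the incoming ingestion batch, so deltas against rows ingested
// earlier may need an explicit lookup against the stored data.
.create-or-alter function CalcKwhDelta() {
    SourceTable
    | sort by deviceId asc, timestamp asc
    | extend kWhDelta = kWh - prev(kWh, 1)
}

// Bind the function to the target table so it runs on every new ingestion.
.alter table TargetTable policy update
@'[{"IsEnabled": true, "Source": "SourceTable", "Query": "CalcKwhDelta()", "IsTransactional": false}]'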
Approach 2
If you do not need real-time data manipulation and your business can tolerate a 1-minute delay, you can create a query similar to the one below, which calculates the temperature difference from the source table (jsondata in my scenario) and writes it to the target table (jsondiffdata):
.set-or-append jsondiffdata <| jsondata | serialize
| extend temperature = temperature - prev(temperature,1), humidity, timesent
Refer to the following resource for more information on how to Ingest from query. You can use Microsoft Power Automate to schedule this query to trigger every minute.
Please be cautious if you decide to go with the second approach, as it uses serialization, which might prevent query parallelism in many scenarios. Please review this resource on window functions and identify a query approach that is better optimized for your business needs.

How to write a dynamic query in Amazon Athena?

I have created an Athena table which contains access logs (source: S3).
Now I have a working query to check when account-specific data was downloaded by a different account than the account itself. It looks like the query below (the account ID is used as the S3 prefix):
SELECT * FROM s3_access_logs_db.mybucket_logs WHERE requester LIKE '%account-id-A%' AND operation = 'REST.GET.OBJECT' AND key NOT LIKE '%account-id-A%';
Now I want to make this more dynamic: I don't want to hardcode account IDs as I'm doing now. Is there a way to do that? I don't care what happens as long as the account ID doesn't match.
WHERE requester LIKE '%account-id-A%' AND key NOT LIKE '%account-id-A%';
How can I achieve this?
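One possible direction, sketched here only as an illustration: derive the account ID from the object key itself instead of hardcoding it. This assumes the account ID is the first path segment of the key; adjust the split_part call if your key layout differs.

SELECT *
FROM s3_access_logs_db.mybucket_logs
WHERE operation = 'REST.GET.OBJECT'
  -- Compare the requester against the account-id prefix extracted from the key.
  AND requester NOT LIKE '%' || split_part(key, '/', 1) || '%';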

Airflow Pipeline CSV to BigQuery with Schema Changes

Background
I need to design an Airflow pipeline to load CSVs into BigQuery.
I know the CSVs frequently have a changing schema. After loading the first file, the schema might be
id | ps_1 | ps_1_value
when the second file lands and I load it, it might look like
id | ps_1 | ps_1_value | ps_2 | ps_2_value.
Question
What's the best approach to handling this?
My first thought on approaching this would be
Load the second file
Compare the schema against the current table
Update the table, adding two columns (ps_2, ps_2_value)
Insert the new rows
I would do this in a PythonOperator.
If file 3 comes in and looks like id | ps_2 | ps_2_value, I would fill in the missing columns and do the insert.
Thanks for the feedback.
After loading two prior files, example_data_1.csv and example_data_2.csv, I can see that the fields are being inserted into the correct columns, with new columns being added as needed.
Edit: The light bulb moment was realizing that schema_update_options exists. See here: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.SchemaUpdateOption.html
csv_to_bigquery = GoogleCloudStorageToBigQueryOperator(
    task_id='csv_to_bigquery',
    google_cloud_storage_conn_id='google_cloud_default',
    bucket=airflow_bucket,
    source_objects=['data/example_data_3.csv'],
    skip_leading_rows=1,
    bigquery_conn_id='google_cloud_default',
    destination_project_dataset_table='{}.{}.{}'.format(project, schema, table),
    source_format='CSV',
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_APPEND',
    schema_update_options=['ALLOW_FIELD_RELAXATION', 'ALLOW_FIELD_ADDITION'],
    autodetect=True,
    dag=dag
)
Basically, the recommended pipeline for your case consists of creating a temporary table for processing your new data.
Since Airflow is an orchestration tool, it's not recommended to push large volumes of data through it.
Given that, your DAG could be very similar to your current DAG:
Load the new file to a temporary table
Compare the actual table's schema and the temporary table's schema.
Run a query to move the data from the temporary table to the actual table. If the temporary table has new fields, add them to the actual table using the parameter schema_update_options, as in the sketch after this list. Besides that, if your actual table has its fields in NULLABLE mode, it will easily handle the case where your new data is missing some field.
Delete your temporary table
If you're using GCS, move your file to another bucket or directory.
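For the query step, something along these lines might work with the BigQuery Python client. This is only a sketch: the project, dataset, and table names as well as the SELECT statement are placeholders.

# Append the temporary table into the final table, letting BigQuery add new columns.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string("my-project.my_dataset.final_table"),
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[
        bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
        bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION,
    ],
)

# Run the query and wait for it to finish before dropping the temporary table.
client.query(
    "SELECT * FROM `my-project.my_dataset.temp_table`",
    job_config=job_config,
).result()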
Finally, I would like to point out some links that might be useful to you:
Airflow documentation (BigQuery operators)
An article which describes a problem similar to yours and where you can find some of the information mentioned above.
I hope it helps.

BigQuery: Too many total leaf fields: 10852

I am importing some data from Google Cloud Datastore with about 980 columns. I first export it to a Cloud Storage bucket and then attempt to import it into BigQuery (using the GCP guide here). However, I get the error Too many total leaf fields: 10852.
I know for certain that none of the entities have more than 1000 fields. Is there a possibility that the import process is transforming my data and creating additional fields?
The schemas generated by the Managed Import/Export service will not contain more than 10k fields. So it looks like you are importing into a BigQuery table that already has data. BigQuery will take the union of the existing schema and the new schema. So even if any given entity has fewer than 1,000 fields, the union of all field names across all your entities of a kind, plus the existing fields in the BigQuery schema, can exceed the limit.
Some options you have include:
1) Use a new table for each import into BigQuery.
2) Try using projectionFields to limit the fields loaded into BigQuery.
Jim Morrison's solution (using projectionFields) solved the issue for me.
I ended up passing a list of entity columns I was interested in and only exporting this subset to BigQuery. The following command line instruction achieves this.
bq --location=US load \
    --source_format=DATASTORE_BACKUP \
    --projection_fields="field1, field4, field2, field3" \
    --replace \
    mydataset.table \
    gs://mybucket/2019-03-25T02:56:02_47688/default_namespace/kind_data/default_namespace_datakind.export_metadata

How to retrieve Hive table Partition Location?

SHOW PARTITIONS: in Hive/Spark, this command only lists the partitions, without providing their location on HDFS/S3.
Since we maintain a different location for each partition in a table, is there a way to retrieve the location information using Hive/Spark without querying the metastore tables?
DESCRIBE FORMATTED <db>.<table> will give you the location, among a lot of other data. There will be a line in the output that starts with Location.
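For a single partition's location, a partition spec can be added as well; the partition column and value below are just placeholders:

-- Shows the Location line for one specific partition.
DESCRIBE FORMATTED mydb.mytable PARTITION (dt='2020-01-01');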
You can use the query:
show table extended like 'your_table_name' partition (partition_name);
This provides more concise information, in a format that is easy to parse if you want to extract the location using a shell script.
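For example, something along these lines (table name and partition spec are placeholders) pulls out just the location line:

hive -e "show table extended like 'your_table_name' partition (dt='2020-01-01');" | grep -i location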