AWS DMS CDC to S3 target - amazon-s3

So I was experimenting to see what could be achieved using AWS Database Migration Service (DMS) Change Data Capture (CDC), taking data from MSSQL to S3 and also to Redshift.
The Redshift testing was fine: if I delete a record in my source DB, a second or two later the record disappears from Redshift. Same with insert/update, etc.
But S3 ...
You get the original record from the first full load.
Then if you update a record in source, S3 receives a new copy of the record, marked with an 'I'.
If I delete a record, I get another copy of the record marked with a 'D'.
So my question is - what do I do with all this ?
How would I query my S3 bucket to see the 'current' state of my data set, as it reflects the source database?
Do I have to write some code myself to pick up all these files and process them, applying the inserts/updates/deletes until I finally resolve back to a 'normal' data set?
Any insight welcomed !

The records containing 'I', 'D' or 'U' are CDC data (change data capture). This is sometimes called "history" or "historical data". This type of data has applications in data warehousing, and it can also be used in many machine learning use cases.
Now coming to the next point: to get the 'current' state of the data set, you have to script/code it yourself. You can use AWS Glue to perform the task. For example, this post explains something similar.
If you do not want to maintain the Glue code, a shortcut is not to use the S3 target with DMS directly, but to use a Redshift target and, once all CDC has been applied, offload the final copy to S3 using the Redshift UNLOAD command.

As explained here, this is what 'I', 'U' and 'D' mean.
What do we do to get the current state of the DB? An alternative is to first add this additional column to the full-load files as well, i.e. the initial files loaded before CDC should also have this additional column. How?
Now query the data in Athena so that you keep only the records where Op NOT IN ('D', 'U') (or AR_H_OPERATION NOT IN ('DELETE', 'UPDATE')). That gives you the correct count (ONLY the COUNT, since a 'U' row only appears if there is already an 'I' row for that entry).
SELECT count(*) FROM "database"."table_name"
WHERE Op NOT IN ('D','U')
Also, to get all the records back, you may try a more complex SQL in Athena: exclude rows where Op = 'D'; for each key, if there is a single row with Op = 'I', take it; if there is more than one row, pick the latest one (the Op = 'U' row).
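To make that resolve logic concrete, here is a minimal sketch using SQLite as a local stand-in for Athena (it needs a Python build with SQLite 3.25+ for window functions; the table, columns and timestamps are made up for illustration): keep only the newest change per key, then drop keys whose newest change is a delete.

```python
import sqlite3

# Hypothetical CDC rows as DMS writes them: an Op flag ('I' insert,
# 'U' update, 'D' delete), the key, the payload, and a change ordering.
cdc_rows = [
    ("I", 1, "alice",  1),
    ("I", 2, "bob",    1),
    ("U", 1, "alicia", 2),  # id 1 was updated
    ("D", 2, "bob",    3),  # id 2 was deleted
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cdc (op TEXT, id INTEGER, name TEXT, ts INTEGER)")
conn.executemany("INSERT INTO cdc VALUES (?, ?, ?, ?)", cdc_rows)

# Keep only the newest row per key, then drop keys whose newest row is a
# delete. The same ROW_NUMBER() pattern works in Athena over the S3 files.
current = conn.execute("""
    SELECT id, name FROM (
        SELECT id, name, op,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) AS rn
        FROM cdc
    )
    WHERE rn = 1 AND op <> 'D'
""").fetchall()
print(current)  # [(1, 'alicia')]
```

In Athena you would partition by your table's primary key and order by the DMS timestamp column rather than the illustrative ts used here.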

How do you deduplicate records in a BigQuery table?

We have a script that should run daily at 12 AM via a GCP Cloud Function and Cloud Scheduler, sending data to a table in BigQuery.
Unfortunately, the cron job used to send the data every minute at 12 AM, meaning the file was uploaded 60 times instead of only once.
The cron timer was * * 3 * * * instead of 00 3 * * *
How can we fix the table?
Note that the transferred data has since been deleted from the source. So far we depend on getting the unique values, but the table is getting too large.
Any help would be much appreciated
I have two options for you, plus a comment on how to avoid this in future. I recommend reading and comparing both options before proceeding.
Option One
If this is a one-off fix, I recommend you simply
navigate to the table (your_dataset.your_table) in the UI
click 'snapshot' and create a snapshot in case you make a mistake in the next part
run SELECT DISTINCT * FROM your_dataset.your_table in the UI
click 'save results' and select 'bigquery table' then save as a new table (e.g. your_dataset.your_table_deduplicated)
navigate back to the old table and click the 'delete' button, then authorise the deletion
navigate to the new table and click the 'copy' button, then save it in the location the old table was in before (i.e. call the copy your_dataset.your_table)
delete your_dataset.your_table_deduplicated
This procedure will replace the current table with one that has the same schema but no duplicated records. You should check that it looks as you expect before you discard your snapshot.
Option Two
A quicker approach, if you're comfortable with it, would be using the Data Manipulation Language (DML).
There is a DELETE statement, but you'd have to construct an appropriate WHERE clause to only delete the duplicate rows.
There is a simpler approach, which is equivalent to option one and just requires you to run this query:
CREATE OR REPLACE TABLE your_dataset.your_table AS
SELECT DISTINCT * FROM your_dataset.your_table
Again, you may wish to take a snapshot before running this.
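For illustration, the same replace-with-distinct idea can be sketched locally, using SQLite as a stand-in for BigQuery (SQLite has no CREATE OR REPLACE TABLE, so the sketch creates, drops and renames instead; the table and values are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, v TEXT)")
# Simulate the 60x duplicated upload: two real rows, each inserted 60 times.
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")] * 60)

# Equivalent of: CREATE OR REPLACE TABLE t AS SELECT DISTINCT * FROM t
conn.executescript("""
    CREATE TABLE t_dedup AS SELECT DISTINCT * FROM t;
    DROP TABLE t;
    ALTER TABLE t_dedup RENAME TO t;
""")
row_count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(row_count)  # 2
```

Note that SELECT DISTINCT * deduplicates whole rows, so it only works here because every duplicated upload produced byte-identical rows.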
The Future
If you have a cloud function that sends data to BigQuery on a schedule, then best practice would be for this function to be idempotent (i.e. no matter how many times you run it, the same input produces the same output).
A typical pattern would be to add a stage to your function to pre-filter the new records.
Depending on your requirements, this stage could
prepare the new records you want to insert, which should have some unique, immutable ID field
SELECT some_unique_id FROM your_dataset.your_table -> old_record_ids
filter the new records, e.g. in python new_records = [record for record in prepared_records if record["id"] not in old_record_ids]
upload only the records that don't exist yet
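The pre-filter stage above could be sketched as follows (plain Python; the id values and record shapes are made up, and the query/upload calls are placeholders for the real BigQuery client calls):

```python
def filter_new_records(prepared_records, old_record_ids):
    """Keep only records whose unique id is not already in the table."""
    old = set(old_record_ids)  # set membership is O(1) per record
    return [r for r in prepared_records if r["id"] not in old]

# prepared_records: the new batch; old_record_ids: result of the
# SELECT some_unique_id query against the existing table.
prepared = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
already_loaded = [1, 3]
to_upload = filter_new_records(prepared, already_loaded)
print(to_upload)  # [{'id': 2, 'v': 'b'}]
```

With this in place, re-running the function against the same input uploads nothing the second time, which is exactly the idempotence property described above.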
This will prevent the sort of issues you have encountered here.

How to store and serve coupons with Google tools and javascript

I'll get a list of coupons by mail. That needs to be stored somewhere somehow (BigQuery?) where I can request it and send a code to the user. The user should only be able to get one unique code, one that was not used beforehand.
I need the ability to fetch a code and record that it was used, so the next request gets the next code...
I know it is a completely vague question, but I'm not sure how to implement this; does anyone have any ideas?
Thanks in advance.
There can be multiple solutions for the same requirement; one of them is given below:
Step 1. Get the coupons as a file (CSV, JSON, etc.) as per your preference/requirement.
Step 2. Load the source file to GCS (storage).
Step 3. Write Dataflow code that reads the data from GCS (the file) and loads it into a BigQuery table (tentative name: New_data). Sample code.
Step 4. Create Dataflow code that reads the data from the BigQuery table New_data, compares it with History_data to identify new coupons, and writes the data to a file on GCS or a BigQuery table. Sample code.
Step 5. Schedule the entire process with an orchestrator / Cloud Scheduler / cron job.
Step 6. Once you have the data, you can send it to consumers through any communication channel.
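As a rough sketch of the serving side (step 6), the "each user gets one unused code" rule boils down to an atomic claim: read one unused code and mark it used in the same transaction. The example below uses SQLite purely for illustration (table name and codes are made up); for per-request updates like this, a transactional store such as Cloud SQL or Firestore is usually a better fit than BigQuery.

```python
import sqlite3

# One row per coupon code, with a 'used' flag.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE coupons (code TEXT PRIMARY KEY, used INTEGER NOT NULL DEFAULT 0)")
conn.executemany("INSERT INTO coupons (code) VALUES (?)",
                 [("SAVE10",), ("SAVE20",), ("SAVE30",)])

def claim_coupon(conn):
    """Hand out one unused code and mark it used, or return None if none left."""
    with conn:  # select + update committed together as one transaction
        row = conn.execute(
            "SELECT code FROM coupons WHERE used = 0 ORDER BY code LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE coupons SET used = 1 WHERE code = ?", (row[0],))
        return row[0]

first, second = claim_coupon(conn), claim_coupon(conn)
print(first, second)  # SAVE10 SAVE20
```

Each call returns a different code, and once the table is exhausted further requests get None, which is the behaviour the question asks for.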

Incompatible parquet type in Impala

I have seen some other posts about this, but have not found an answer that permanently works.
We have a table, and I had to add two columns to it. To do so, I dropped the table and recreated it. But since it was an external table, dropping it did not drop the associated data files. The data gets loaded from a control file and is partitioned by date. So let's say the dates that were in the table were 2021-01-01 and 2021-01-02, but only 2021-01-02 is in the control file. When I load that date, it gets re-run with the new columns and everything is fine. But 2021-01-01 is still there, with a different schema.
This is no issue in Hive, as it seems to default to resolve by name, not position. But Impala resolves by position, so the new columns throw it off.
If I have a table that before had the columns c1,c2,c3, and now have the additional columns c4,c5, if I try to run a query such as
select * from my_table where c5 is null limit 3;
This will give an incompatible parquet schema error in Impala (but Hive is fine, it would just have null for c4 and c5 for the date 2021-01-01).
If I run set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name; and then the above query again, it is fine. But I would have to run that command at the beginning of each session, which is not ideal.
From searching online, I have come up with a few solutions:
Drop all data files when creating the new table and just start loading from scratch (I think we want to keep the old files)
Re-load each date (this might not be ideal as there could be many, many dates that would have to be re-loaded and overwritten)
Change the setting permanently in Cloudera Manager (I do not have access to CM and don't know how feasible it would be to change it)
Are there any other solutions to have it so I don't have to run set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name; each time I want to use this table in Impala?
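One more avenue that may be worth checking (please verify against your Impala version's documentation): impala-shell can read per-user defaults from a ~/.impalarc file, so if your sessions go through impala-shell you may be able to set the option there once instead of per session, e.g.:

```
[impala.query_options]
PARQUET_FALLBACK_SCHEMA_RESOLUTION=name
```

This would not help connections made through JDBC/ODBC clients, which would still need the SET statement or a server-side default.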

Airflow Pipeline CSV to BigQuery with Schema Changes

Background
I need to design an Airflow pipeline to load CSVs into BigQuery.
I know the CSV's frequently have a changing schema. After loading the first file the schema might be
id | ps_1 | ps_1_value
when the second file lands and I load it, it might look like
id | ps_1 | ps_1_value | ps_2 | ps_2_value.
Question
What's the best approach to handling this?
My first thought on approaching this would be
Load the second file
Compare the schema against the current table
Update the table, adding two columns (ps_2, ps_2_value)
Insert the new rows
I would do this in a PythonOperator.
If file 3 comes in and looks like id | ps_2 | ps_2_value I would fill in the missing columns and do the insert.
Thanks for the feedback.
Edit: The light bulb moment was realizing that schema_update_options exists. See here: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.SchemaUpdateOption.html
After loading two prior files, example_data_1.csv and example_data_2.csv, I can see that the fields are being inserted into the correct columns, with new columns being added as needed.
csv_to_bigquery = GoogleCloudStorageToBigQueryOperator(
    task_id='csv_to_bigquery',
    google_cloud_storage_conn_id='google_cloud_default',
    bucket=airflow_bucket,
    source_objects=['data/example_data_3.csv'],
    skip_leading_rows=1,
    bigquery_conn_id='google_cloud_default',
    destination_project_dataset_table='{}.{}.{}'.format(project, schema, table),
    source_format='CSV',
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_APPEND',
    schema_update_options=['ALLOW_FIELD_RELAXATION', 'ALLOW_FIELD_ADDITION'],
    autodetect=True,
    dag=dag
)
Basically, the recommended pipeline for your case consists of creating a temporary table for treating your new data.
Since Airflow is an orchestration tool, it is not recommended to push big flows of data through it.
Given that, your DAG could be very similar to your current DAG:
Load the new file to a temporary table
Compare the actual table's schema with the temporary table's schema.
Run a query to move the data from the temporary table to the actual table. If the temporary table has new fields, add them to the actual table using the schema_update_options parameter. Besides that, if your actual table's fields are in NULLABLE mode, it will easily handle the case where your new data is missing some field.
Delete your temporary table
If you're using GCS, move your file to another bucket or directory.
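Step 2 above (the schema comparison) can be sketched in plain Python by treating each schema as a list of (name, type) pairs; with the real BigQuery client you would build these pairs from table.schema. The field names below follow the question's example:

```python
# Schema of the actual table vs. the temporary table holding the new file.
actual_schema = [("id", "STRING"), ("ps_1", "STRING"), ("ps_1_value", "STRING")]
temp_schema = [("id", "STRING"), ("ps_1", "STRING"), ("ps_1_value", "STRING"),
               ("ps_2", "STRING"), ("ps_2_value", "STRING")]

# Fields present in the new data but missing from the actual table.
existing = set(actual_schema)
new_fields = [f for f in temp_schema if f not in existing]
print(new_fields)  # [('ps_2', 'STRING'), ('ps_2_value', 'STRING')]
```

The resulting new_fields list is what you would expect schema_update_options=['ALLOW_FIELD_ADDITION'] to add when the move query runs.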
Finally, I would like to point some links that might be useful to you:
AirFlow Documentation (BigQuery's operators)
An article which shows a problem similar to yours and where you can find some of the mentioned information.
I hope it helps.

Optimal way to add / update EF entities if added items may or may not already exist

I need some guidance on adding / updating SQL records using EF. Lets say I am writing an application that stores info about files on a hard disk, into an EF4 database. When you press a button, it will scan all the files in a specified path (maybe the whole drive), and store information in the database like the file size, change date etc. Sometimes the file will already be recorded from a previous run, so its properties should be updated; sometimes a batch of files will be detected for the first time and will need to be added.
I am using EF4, and I am seeking the most efficient way of adding new file information and updating existing records. As I understand it, when I press the search button and files are detected, I will have to check for the presence of a file entity, retrieve its ID field, and use that to add or update related information; but if it does not exist already, I will need to create a tree that represents it and its related objects (eg. its folder path), and add that. I will also have to handle the merging of the folder path object as well.
It occurs to me that if there are many millions of files, as there might be on a server, loading the whole database into the context is neither ideal nor practical. So for every file, I might conceivably have to make a round trip to the database to detect whether the entry already exists and retrieve its ID, then another trip to update it. Is there a more efficient way to insert/update multiple file object trees in one trip to the DB? If there were an Entity context method like 'Insert If It Doesn't Exist And Update If It Does', for example, then I could wrap multiple of them up in a transaction?
I imagine this is a fairly common requirement; how is it best done in EF? Any thoughts would be appreciated. (Oh, my DB is SQLite, if that makes a difference.)
You can check whether the record already exists in the DB. If not, create and add it. Either way, you can then set the fields that are common to insert and update, as in the sample code below.
var strategy_property_in_db = _dbContext.ParameterValues()
    .Where(r => r.Name == strategy_property.Name)
    .FirstOrDefault();

if (strategy_property_in_db == null)
{
    // Not found: create the record and attach it to the context.
    strategy_property_in_db = new ParameterValue() { Name = strategy_property.Name };
    _dbContext.AddObject("ParameterValues", strategy_property_in_db);
}

// Fields common to both the insert and update paths.
strategy_property_in_db.Value = strategy_property.Value;