Airflow Pipeline CSV to BigQuery with Schema Changes - google-bigquery

Background
I need to design an Airflow pipeline to load CSVs into BigQuery.
I know the CSVs frequently have a changing schema. After loading the first file the schema might be
id | ps_1 | ps_1_value
When the second file lands and I load it, it might look like
id | ps_1 | ps_1_value | ps_2 | ps_2_value.
Question
What's the best approach to handling this?
My first thought on approaching this would be:
1. Load the second file.
2. Compare the schema against the current table.
3. Update the table, adding two columns (ps_2, ps_2_value).
4. Insert the new rows.
I would do this in a PythonOperator.
If file 3 comes in and looks like id | ps_2 | ps_2_value, I would fill in the missing columns and do the insert.
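Roughly, the PythonOperator callable would do something like this sketch. The schema diff is plain Python; the google.cloud.bigquery calls in the comments are only illustrative, and the table names are placeholders:

```python
def columns_to_add(existing, incoming):
    """Return the (name, type) pairs present in the incoming file but not in the table."""
    have = {name for name, _ in existing}
    return [(name, typ) for name, typ in incoming if name not in have]

# Inside the PythonOperator callable, roughly (illustrative, not tested):
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   table = client.get_table("project.dataset.table")
#   new = columns_to_add([(f.name, f.field_type) for f in table.schema], incoming_schema)
#   table.schema = list(table.schema) + [bigquery.SchemaField(n, t) for n, t in new]
#   client.update_table(table, ["schema"])  # then insert the new rows
```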
Thanks for the feedback.

Edit: The light-bulb moment was realizing that schema_update_options exists. See here: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.SchemaUpdateOption.html
After loading two prior files (example_data_1.csv and example_data_2.csv) with the operator below, I can see that the fields are being inserted into the correct columns, with new columns being added as needed.
csv_to_bigquery = GoogleCloudStorageToBigQueryOperator(
    task_id='csv_to_bigquery',
    google_cloud_storage_conn_id='google_cloud_default',
    bucket=airflow_bucket,
    source_objects=['data/example_data_3.csv'],
    skip_leading_rows=1,
    bigquery_conn_id='google_cloud_default',
    destination_project_dataset_table='{}.{}.{}'.format(project, schema, table),
    source_format='CSV',
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_APPEND',
    schema_update_options=['ALLOW_FIELD_RELAXATION', 'ALLOW_FIELD_ADDITION'],
    autodetect=True,
    dag=dag
)

Basically, the recommended pipeline for your case consists of creating a temporary table for treating your new data.
Since Airflow is an orchestration tool, it's not recommended to push big flows of data through it.
Given that, your DAG could be very similar to your current DAG:
1. Load the new file into a temporary table.
2. Compare the actual table's schema and the temporary table's schema.
3. Run a query to move the data from the temporary table to the actual table. If the temporary table has new fields, add them to the actual table using the parameter schema_update_options. Besides that, if your actual table has fields in NULLABLE mode, it will easily handle the missing-columns case when your new data lacks some field.
4. Delete the temporary table.
5. If you're using GCS, move the file to another bucket or directory.
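For step 3, missing columns can be padded with NULL when moving the data. A small sketch of how that query could be built (table and column names are made up; in Airflow the resulting SQL would be run by a BigQuery operator):

```python
def build_copy_query(final_cols, temp_cols, src, dst):
    """INSERT ... SELECT from the temp table, padding columns the file lacked with NULL."""
    temp = set(temp_cols)
    select_list = ", ".join(c if c in temp else f"NULL AS {c}" for c in final_cols)
    return f"INSERT INTO {dst} ({', '.join(final_cols)}) SELECT {select_list} FROM {src}"

# A file that only carried id, ps_2, ps_2_value gets NULLs for ps_1 and ps_1_value:
print(build_copy_query(["id", "ps_1", "ps_1_value", "ps_2", "ps_2_value"],
                       ["id", "ps_2", "ps_2_value"],
                       "dataset.temp_table", "dataset.actual_table"))
```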
Finally, I would like to point out some links that might be useful to you:
Airflow documentation (BigQuery operators)
An article which covers a problem similar to yours and where you can find some of the information mentioned.
I hope it helps.


How to import updated records from XML files into SQL Database?

For the last few weeks I have been trying to design an SSIS package that would read some XML files I have and move the data from them into the multiple tables I want.
These files contain different nodes like Individual (parent node) and Address, Alias, Articles (all child nodes of Individual), etc.
Data in those files looks like this:
<Individuals>
  <Individual>
    <UniqueID>1001</UniqueID>
    <Name>Ben</Name>
    <Soft_Delete>N</Soft_Delete>
    <Soft_Delete_Date>NULL</Soft_Delete_Date>
  </Individual>
  <Addresses>
    <Address>
      <Address_Line_1>House no 280</Address_Line_1>
      <Address_Line_2>NY</Address_Line_2>
      <Country>US</Country>
      <Soft_Delete>N</Soft_Delete>
      <Soft_Delete_Date>NULL</Soft_Delete_Date>
    </Address>
    <Address>
      <Address_Line_1>street 100</Address_Line_1>
      <Address_Line_2>California</Address_Line_2>
      <Country>US</Country>
      <Soft_Delete>N</Soft_Delete>
      <Soft_Delete_Date>NULL</Soft_Delete_Date>
    </Address>
  </Addresses>
</Individuals>
I was successful in designing it and now I have a different task.
The files I had were named like this: Individual_1.xml,Individual_2.xml,Individual_3.xml etc.
Now I have received some new files which are named like this:
Individual_UPDATE_20220716.xml,Individual_UPDATE_20220717.xml,Individual_UPDATE_20220718.xml,Individual_UPDATE_20220720.xml etc
Basically, these files contain the updated information of previously inserted records
OR
There are totally new records
For example:
A record or a particular information like Address of an Individual was Soft Deleted.
Now I am wondering how would I design or modify my current SSIS package to update the data from these new files into my database?
Any guidance would be appreciated....
Thank you...
It looks like you have no problem reading the XML, so I won't really talk about that. @Yitzack's comment to his prior answer is "a" way to do it. However, his answer assumes you can create staging tables. To do this entirely inside SSIS, proceed as follows.
I would treat all the files the same (as long as they have the same data structure, and it seems that is the case).
Read the XML as a source.
Proceed to a lookup.
Set the lookup to ignore errors (this is handled in the next step)
Set your lookup to the destination table, look up UniqueID, and add it to the data flow. Since you said ignore errors, SSIS will insert a null in that field if the lookup fails to find a match.
Add a Conditional Split based on destination.UniqueID == null; call that output "inserts" and change the default output to "updates".
Add a SQL statement to update the existing record and map your row to it (this is somewhat slow, which is why a merge is better with large data sets).
Connect the Conditional Split to the update SQL statement and map appropriately.
Add an insert destination, connect the proper output from the Conditional Split, and map.
Note: It looks like you are processing from a file system, and it is likely that order is important. You may have to order your foreach loop. I will provide a simple example that you can modify.
Create a filesInOrder variable of type object.
Create a script task.
Add filesInOrder as a read/write variable
Enter script...
var diFiles = new System.IO.DirectoryInfo(@"path to your folder").GetFiles("*.xml");
var files = diFiles.OrderBy(o => o.CreationTime).Select(s => s.FullName);
Dts.Variables["filesInOrder"].Value = files.ToArray();
Make sure you add using System.Linq; to your code.
Finally, use filesInOrder as the base of a foreach component with an ADO enumerator.
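For reference, the same ordering idea can be prototyped outside SSIS; here it is sketched in Python (sorting by modification time, since creation-time semantics vary by filesystem, and with forced timestamps so the demo is deterministic):

```python
import os
import tempfile

def files_in_order(folder, suffix=".xml"):
    """Full paths of matching files, oldest first (mirrors the C# OrderBy(CreationTime))."""
    paths = [os.path.join(folder, f) for f in os.listdir(folder) if f.endswith(suffix)]
    return sorted(paths, key=os.path.getmtime)

# Tiny demo: create two files and force distinct timestamps.
with tempfile.TemporaryDirectory() as d:
    for i, name in enumerate(["Individual_UPDATE_20220716.xml",
                              "Individual_UPDATE_20220717.xml"]):
        path = os.path.join(d, name)
        open(path, "w").close()
        os.utime(path, (i, i))  # older file gets the earlier timestamp
    ordered = [os.path.basename(p) for p in files_in_order(d)]
print(ordered)  # ['Individual_UPDATE_20220716.xml', 'Individual_UPDATE_20220717.xml']
```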

Incompatible parquet type in Impala

I have seen some other posts about this, but have not found an answer that permanently works.
We have a table, and I had to add two columns to it. In order to do so, I dropped the table and recreated it. But since it was an external table, it did not drop the associated data files. The data gets loaded from a control file and is partitioned by date. So let's say the dates that were in the table were 2021-01-01 and 2021-01-02. But only 2021-01-02 is in the control file. So when I am loading that date, it gets re-run with the new columns and everything is fine. But 2021-01-01 is still there, but with a different schema.
This is no issue in Hive, as it seems to default to resolve by name, not position. But Impala resolves by position, so the new columns throw it off.
If I have a table that before had the columns c1, c2, c3, and now has the additional columns c4, c5, and I try to run a query such as
select * from my_table where c5 is null limit 3;
This will give an incompatible parquet schema error in Impala (but Hive is fine, it would just have null for c4 and c5 for the date 2021-01-01).
If I run the command set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name; and then the above query again, it is fine. But I would have to run set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name; at the beginning of each session, which is not ideal.
From searching online, I have come up with a few solutions:
Drop all data files when creating the new table and just start loading from scratch (I think we want to keep the old files)
Re-load each date (this might not be ideal as there could be many, many dates that would have to be re-loaded and overwritten)
Change the setting permanently in Cloudera Manager (I do not have access to CM and don't know how feasible it would be to change it)
Are there any other solutions to have it so I don't have to run set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name; each time I want to use this table in Impala?

AWS DMS CDC to S3 target

So I was playing around seeing what could be achieved using Data Migration Service Change Data Capture, to take data from MSSQL to S3 and also to Redshift.
The Redshift testing was fine: if I delete a record in my source DB, a second or two later the record disappears from Redshift. Same with insert/update, etc.
But S3 ...
You get the original record from the first full load.
Then if you update a record in source, S3 receives a new copy of the record, marked with an 'I'.
If I delete a record, I get another copy of the record marked with a 'D'.
So my question is - what do I do with all this ?
How would I query my S3 bucket to see the 'current' state of my data set as reflecting the source database ?
Do I have to script some code myself to pick up all these files and process them, performing the inserts/updates and deletes until I finally resolve back to a 'normal' data set ?
Any insight welcomed !
The records containing 'I', 'D' or 'U' are actually CDC data (change data capture). This is sometimes called "history" or "historical data". This type of data has applications in data warehousing and can also be used in many machine-learning use cases.
Now coming to the next point: in order to get the 'current' state of the data set, you have to script/code it yourself. You can use AWS Glue to perform the task. For example, this post explains something similar.
If you do not want to maintain the Glue code, a shortcut is to not use the S3 target with DMS directly, but to use a Redshift target, and once all CDC is applied, offload the final copy to S3 using the Redshift UNLOAD command.
See here for an explanation of what 'I', 'U' and 'D' mean.
What do we do to get the current state of the DB? An alternative is to first add this additional column to the full-load files as well, i.e. the files loaded initially, before CDC, should also carry the Op column (the S3 endpoint setting includeOpForFullLoad=true does this, marking full-load rows with 'I').
Now query the data in Athena, keeping only the records where Op NOT IN ('D', 'U') (or AR_H_OPERATION NOT IN ('DELETE', 'UPDATE')). This gives the correct count (only the count, since a 'U' only appears if there is already an 'I' for that entry).
SELECT count(*) FROM "database"."table_name"
WHERE Op NOT IN ('D','U')
To reconstruct the full current records (not just the count), you need a more complex query in Athena: for each key, exclude it if its latest operation is 'D'; otherwise, when a key has more than one row (an 'I' plus one or more 'U's), pick the latest one.
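That last step is easier to see with a window function: take the latest change per key, then drop keys whose latest operation is a delete. Athena supports ROW_NUMBER() the same way; here is a runnable sketch against SQLite with made-up columns (seq stands in for the change timestamp DMS writes):

```python
import sqlite3

# Toy CDC feed: (op, id, value, seq).
rows = [
    ("I", 1, "a", 1),   # insert id 1
    ("I", 2, "b", 2),   # insert id 2
    ("U", 1, "a2", 3),  # update id 1
    ("D", 2, "b", 4),   # delete id 2
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cdc (op TEXT, id INT, value TEXT, seq INT)")
con.executemany("INSERT INTO cdc VALUES (?, ?, ?, ?)", rows)

# Keep only the latest change per id, then drop ids whose latest op is a delete.
current = con.execute("""
    SELECT id, value FROM (
        SELECT id, value, op,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY seq DESC) AS rn
        FROM cdc
    )
    WHERE rn = 1 AND op != 'D'
    ORDER BY id
""").fetchall()
print(current)  # [(1, 'a2')]
```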

Copy one database table's contents into another on same database

I'm a bit of a newbie with the workings of phpMyAdmin. I have a database which now has two sets of tables within it: the original tables with the jos_ prefix, and the same again with a different prefix, say ****_, that will be the finished database.
This has come about because I am upgrading my Joomla 1.5 site to 2.5. I used a migration tool for the bulk of the new database but one particular piece of information did not transfer because the new database has a different structure.
I want to copy the keyref= value inside jos_content's attribs column across to the "xreference" key inside ****_content's metadata column, if that makes sense. This will save manually typing in the information contained within 1000s of articles.
The jos_content attribs column currently contains
show_title=
link_titles=
show_intro=
show_section=
link_section=
show_category=
link_category=
show_vote=
show_author=
show_create_date=
show_modify_date=
show_pdf_icon=
show_print_icon=
show_email_icon=
language=
keyref=41.126815,0.732623
readmore=
The ****_content metadata column currently contains
{"robots":"all","author":""}
but I want it to end up like this
{"robots":"","author":"","rights":"","xreference":"41.126815,0.732623","marker":""}
Could anyone tell me the SQL string that I would need to run to achieve this please?
If it makes any difference I have manually changed about 300 of these articles already and thought there must be a better way.
Edit: Being nervous of trying this I would like to try and find the exact syntax (if that's the right word) for the SQL Query to run.
The value I want to extract from the source table is just, and only, the numbers next to keyref= and I want them to turn up in the destination table prefixed by "xreference". - so it shows "xreference"."VALUE" with VALUE being the required numbers. There is also an entry - ,"marker":"" that is in the destination table so I guess the Query needs to produce that as well?
Sorry for labouring this but if I get it wrong, maybe by guessing what to put, I don't really have the knowledge to put it all right again....
Thanks.
Please try it:
insert into tableone (column1, column2) select column1, column2 from Tablesecond
If the destination table does not exist yet, this query creates it:
select * into anyname_Table from tablesource
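For what it's worth, the transformation the question actually asks about (pull the value after keyref= out of attribs and build the new metadata JSON) can be sketched like this in Python; in MySQL the same extraction would be an UPDATE with string functions, which is worth testing on a copy of the table first:

```python
import json
import re

# Sample attribs content from the question (abbreviated).
attribs = """show_title=
link_titles=
keyref=41.126815,0.732623
readmore="""

match = re.search(r"^keyref=(.*)$", attribs, re.MULTILINE)
xreference = match.group(1).strip() if match else ""

# Target metadata shape (json.dumps adds spaces; Joomla stores it compact).
metadata = json.dumps({"robots": "", "author": "", "rights": "",
                       "xreference": xreference, "marker": ""})
print(metadata)
```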

SQL Server: Remove substrings from field data by iterating through a table of city names

I have two databases, Database A and Database B.
Database A contains some data which needs to be placed in a table in Database B. However, before that can happen, some of that data must be “cleaned up” in the following way:
The table in Database A which contains the data to be placed in Database B has a field called “Desc.” Every now and then the users of the system put city names in with the data they enter into the “Desc” field. For example: a user may type in “Move furniture to new cubicle. New York. Add electric.”
Before that data can be imported into Database B the word “New York” needs to be removed from that data so that it only reads “Move furniture to new cubicle. Add electric.” However—and this is important—the original data in Database A must remain untouched. In other words, Database A’s data will still read “Move furniture to new cubicle. New York. Add electric,” while the data in Database B will read “Move furniture to new cubicle. Add electric.”
Database B contains a table which has a list of the city names which need to be removed from the “Desc” field data from Database A before being placed in Database B.
How do I construct a stored procedure or function which will grab the data from Database A, iterate through the Cities table in Database B, and, if it finds a city name in the "Desc" field, remove it while keeping the rest of the information in that field, thus creating a recordset which I can then use to populate the appropriate table in Database B?
I have tried several things but still haven’t cracked it. Yet I’m sure this is probably fairly easy. Any help is greatly appreciated!
Thanks.
EDIT:
The latest thing I have tried to solve this problem is this:
DECLARE @cityName VarChar(50)

WHILE (SELECT COUNT(*) FROM ABCScanSQL.dbo.tblDiscardCitiesList) > 0
BEGIN
    SELECT @cityName = ABCScanSQL.dbo.tblDiscardCitiesList.CityName FROM ABCScanSQL.dbo.tblDiscardCitiesList

    SELECT JOB_NO, LTRIM(RTRIM(SUBSTRING(JOB_NO, (LEN(job_no) - 2), 5))) AS LOCATION
         , JOB_DESC, [Date_End], REPLACE(Job_Desc, @cityName, ' ') AS NoCity
    FROM fmcs_tables.dbo.Jobt
    WHERE Job_No LIKE '%loc%'
END
"Job_Desc" is the field which needs to have the city names removed.
This is a data quality issue. You can always make a copy of the [description] column in Database A and call it [cleaned_desc].
One simple solution is to write a function that does the following:
1 - Read the data from [tbl_remove_these_words]. These are the phrases you want removed.
2 - Compare the input, @var_description, to the rows in the table.
3 - Upon a match, replace it with an empty string.
This solution depends upon a cleansing table that you maintain and update.
Run an update query that feeds [description] through a call to [fn_remove_these_words] and sets [cleaned_desc] to the output.
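The replace loop in steps 1 to 3 is easy to prototype outside T-SQL. Here is a Python sketch of the same logic (in T-SQL this would become a scalar function that loops over the cleansing table calling REPLACE; the names below are made up):

```python
def remove_cities(desc, cities):
    """Strip each city name from desc, then collapse any leftover whitespace."""
    for city in cities:
        # Remove "City. " first so the surrounding punctuation goes with it.
        desc = desc.replace(city + ". ", "").replace(city, "")
    return " ".join(desc.split())

cities = ["New York", "California"]  # stands in for [tbl_remove_these_words]
print(remove_cities("Move furniture to new cubicle. New York. Add electric.", cities))
# Move furniture to new cubicle. Add electric.
```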
Another solution is to look at products like the Melissa Data (DQ) product for SSIS, or Data Quality Services in the SQL Server stack, to give you an application framework for solving the problem.