Incompatible parquet type in Impala - hive

I have seen some other posts about this, but have not found an answer that permanently works.
We have a table to which I had to add two columns. In order to do so, I dropped the table and recreated it. But since it was an external table, the associated data files were not dropped. The data gets loaded from a control file and is partitioned by date. So let's say the dates that were in the table were 2021-01-01 and 2021-01-02, but only 2021-01-02 is in the control file. When I load that date, it gets re-run with the new columns and everything is fine. But 2021-01-01 is still there, with the old schema.
This is no issue in Hive, which seems to default to resolving columns by name rather than by position. But Impala resolves by position, so the new columns throw it off.
So if a table previously had the columns c1, c2, c3 and now has the additional columns c4, c5, and I run a query such as
select * from my_table where c5 is null limit 3;
this gives an "incompatible Parquet schema" error in Impala (Hive is fine; it just returns NULL for c4 and c5 for the date 2021-01-01).
If I run the command set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name; and then run the above query again, it works. But I would have to run that set command at the beginning of each session, which is not ideal.
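For reference, the per-session workaround looks like this (my_table and c5 are just the placeholders from the example above):

-- run once at the start of each Impala session, then query as normal
set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;
select * from my_table where c5 is null limit 3;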
From searching online, I have come up with a few solutions:
Drop all data files when creating the new table and just start loading from scratch (I think we want to keep the old files)
Re-load each date (this might not be ideal as there could be many, many dates that would have to be re-loaded and overwritten)
Change the setting permanently in Cloudera Manager (I do not have access to CM and don't know how feasible it would be to change it)
Are there any other solutions so that I don't have to run set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name; every time I want to use this table in Impala?

Related

How do you deduplicate records in a BigQuery table?

We have a script that should run daily at 12 AM as a GCP Cloud Function, triggered by Cloud Scheduler, that sends data to a table in BigQuery.
Unfortunately, the cron job used to send the data every minute during that hour, which means the file was uploaded 60 times instead of just once.
The cron schedule was * * 3 * * * instead of 00 3 * * *.
How can we fix the table?
Note that the transferred data has since been deleted from the source. So far we rely on selecting only the unique values, but the table is getting too large.
Any help would be much appreciated
I have two options for you, plus a comment on how to avoid this in future. I recommend reading and comparing both options before proceeding.
Option One
If this is a one-off fix, I recommend you simply
navigate to the table (your_dataset.your_table) in the UI
click 'snapshot' and create a snapshot in case you make a mistake in the next part
run SELECT DISTINCT * FROM your_dataset.your_table in the UI
click 'save results' and select 'bigquery table' then save as a new table (e.g. your_dataset.your_table_deduplicated)
navigate back to the old table and click the 'delete' button, then authorise the deletion
navigate to the new table and click the 'copy' button, then save it in the location the old table was in before (i.e. call the copy your_dataset.your_table)
delete your_dataset.your_table_deduplicated
This procedure replaces the current table with one that has the same schema but no duplicated records. You should check that it looks as you expect before you discard your snapshot.
Option Two
A quicker approach, if you're comfortable with it, would be using the Data Manipulation Language (DML).
There is a DELETE statement, but you'd have to construct an appropriate WHERE clause to only delete the duplicate rows.
There is a simpler approach, which is equivalent to option one and just requires you to run this query:
CREATE OR REPLACE TABLE your_dataset.your_table AS
SELECT DISTINCT * FROM your_dataset.your_table
Again, you may wish to take a snapshot before running this.
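If you prefer to take that snapshot with SQL as well, BigQuery has snapshot tables; a minimal sketch (the snapshot name and the 7-day expiry are placeholders):

CREATE SNAPSHOT TABLE your_dataset.your_table_snapshot
CLONE your_dataset.your_table
OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
);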
The Future
If you have a cloud function that sends data to BigQuery on a schedule, then best practice would be for this function to be idempotent (i.e. it doesn't matter how many times you run it; if the input is the same, the output is the same).
A typical pattern would be to add a stage to your function to pre-filter the new records.
Depending on your requirements, this stage could
prepare the new records you want to insert, which should have some unique, immutable ID field
SELECT some_unique_id FROM your_dataset.your_table -> old_record_ids
filter the new records, e.g. in python new_records = [record for record in prepared_records if record["id"] not in old_record_ids]
upload only the records that don't exist yet
This will prevent the sort of issues you have encountered here.
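If you would rather push the deduplication into BigQuery itself, a common variant of this pattern is to load each batch into a staging table and MERGE it into the main table; the staging table and the some_unique_id column below are assumptions, not part of your current setup:

MERGE your_dataset.your_table AS target
USING your_dataset.your_table_staging AS batch
ON target.some_unique_id = batch.some_unique_id
WHEN NOT MATCHED THEN
  INSERT ROW;

Running the same MERGE twice with the same batch inserts nothing the second time, which gives you the idempotency described above.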

Airflow Pipeline CSV to BigQuery with Schema Changes

Background
I need to design an Airflow pipeline to load CSVs into BigQuery.
The CSVs frequently have a changing schema. After loading the first file the schema might be
id | ps_1 | ps_1_value
when the second file lands and I load it, it might look like
id | ps_1 | ps_1_value | ps_2 | ps_2_value.
Question
What's the best approach to handling this?
My first thought on approaching this would be
Load the second file
Compare the schema against the current table
Update the table, adding two columns (ps_2, ps_2_value)
Insert the new rows
I would do this in a PythonOperator.
If file 3 comes in and looks like id | ps_2 | ps_2_value I would fill in the missing columns and do the insert.
Thanks for the feedback.
After loading two prior files example_data_1.csv and example_data_2.csv I can see that the fields are being inserted into the correct columns, with new columns being added as needed.
Edit: The light bulb moment was realizing that the schema_update_options exist. See here: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.SchemaUpdateOption.html
csv_to_bigquery = GoogleCloudStorageToBigQueryOperator(
    task_id='csv_to_bigquery',
    google_cloud_storage_conn_id='google_cloud_default',
    bucket=airflow_bucket,
    source_objects=['data/example_data_3.csv'],
    skip_leading_rows=1,
    bigquery_conn_id='google_cloud_default',
    destination_project_dataset_table='{}.{}.{}'.format(project, schema, table),
    source_format='CSV',
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_APPEND',
    schema_update_options=['ALLOW_FIELD_RELAXATION', 'ALLOW_FIELD_ADDITION'],
    autodetect=True,
    dag=dag
)
Basically, the recommended pipeline for your case consists of creating a temporary table in which to stage your new data.
Since Airflow is an orchestration tool, it's not recommended to push large volumes of data through it.
Given that, your DAG could be very similar to your current DAG:
Load the new file to a temporary table
Compare the actual table's schema and the temporary table's schema.
Run a query to move the data from the temporary table to the actual table. If the temporary table has new fields, add them to the actual table using the schema_update_options parameter. In addition, if the fields in your actual table are in NULLABLE mode, it will easily handle the case where your new data is missing some columns (a plain-SQL sketch of this step follows the list).
Delete your temporary table
If you're using GCS, move your file to another bucket or directory.
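As a rough plain-SQL illustration of step 3 (the column names are taken from the question, and my_dataset.my_table / my_table_tmp are placeholder names; with schema_update_options the column additions happen automatically, so treat this only as a sketch of what that step does):

-- add any columns that exist only in the temporary table
ALTER TABLE my_dataset.my_table
  ADD COLUMN IF NOT EXISTS ps_2 STRING,
  ADD COLUMN IF NOT EXISTS ps_2_value STRING;

-- insert only the columns present in the new file; the others stay NULL
INSERT INTO my_dataset.my_table (id, ps_2, ps_2_value)
SELECT id, ps_2, ps_2_value
FROM my_dataset.my_table_tmp;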
Finally, I would like to point out some links that might be useful to you:
Airflow documentation (BigQuery operators)
An article that describes a problem similar to yours, where you can find some of the information mentioned above.
I hope it helps.

How to reference the latest table from a manually partitioned BigQuery table

We have a manually partitioned "video metadata" table being fed fresh data each day. In our system, old data is only kept for historical reasons since the latest data is the most up to date.
What we can't figure out is how to reference only the latest partition in this table using LookML.
So far we have attempted to store views in BigQuery. We have tried and failed to store a simple "fetch the newest partition" query as a view, in both standard and legacy SQL, and upon some searching, this seems to be by design, even though the error message states "Dataset not found" instead of something more relevant.
We've also tried to build the filtering into Looker, but we're having trouble with getting things to actually work and only having the latest data returned to us through it.
Any help would be appreciated.
We've managed to find a solution: derived tables.
We figured that since we couldn't define a view on BigQuery's side, we could do it on Looker's side instead, so we defined the table in a derived table block inside a view.
derived_table: {
  sql: SELECT * FROM `dataset.table_*`
       WHERE _TABLE_SUFFIX = (
         SELECT max(_TABLE_SUFFIX) FROM `dataset.table_*`
       );;
  sql_trigger_value: SELECT max(_TABLE_SUFFIX) FROM `dataset.table_*`;;
}
This gave us a view with just the newest data in it.

Use a Query from the Destination db to limit OLE DB Source task in SSIS 2008

All,
I have a package that I'm building as a data importer so I can copy sets of data from my production environment and develop on another instance.
I have two tables that contain header and detail rows for service tickets. Those service tickets are tied back to orders.
I am pulling the service tickets from a certain time window; however, the originating orders fall outside of the date range that I'm pulling for the tickets.
I want to be able to take the following steps in an SSIS package:
Import the header and detail rows within the given date range from prod to dev
Select the relevant order numbers from dev tables
Use the list of order numbers to import only the relevant orders from prod
I poked through other questions and couldn't find an answer that addressed this directly, so I apologize if there is one out there and I missed it. I may not have been asking the question correctly. I'm assuming that I would need to pull those order numbers into a temp table or variable in order to apply them as a filter.
As I write this, it just crossed my mind to use a join on the source system with the ticket to order tables and still use the date range to limit, but I'm still posting the question to see if anyone has dealt with this before.
Your steps are already fairly clear; are you asking how to actually implement them? It looks like you can do all three steps by using SELECT statements in your data sources:
Build a SELECT statement dynamically with the correct dates to use in your data source. The dates could be programmatically generated in a script task, or saved in a database table and populated into variables. Then you copy the data across to the dev system.
Run a SELECT statement in the dev system that returns the order numbers, and copy the results to a table in the prod database.
Run a SELECT statement in the prod database that joins on the table from step 2 and copy the results back to dev.
An alternative to the table in steps 2 and 3 would be a lookup transformation, but if you have a large number of rows then using a table will probably be faster.
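For example, the step 3 query (and the join-based alternative you mention at the end of the question) could look roughly like this; every table and column name here is made up for illustration:

-- step 3: pull only the orders whose numbers were staged from dev
SELECT o.*
FROM dbo.OrderHeader AS o
INNER JOIN dbo.Staging_OrderNumbers AS s
    ON s.OrderNumber = o.OrderNumber;

-- join-based alternative: derive the order numbers from the ticket date range directly
SELECT DISTINCT o.*
FROM dbo.OrderHeader AS o
INNER JOIN dbo.ServiceTicketHeader AS t
    ON t.OrderNumber = o.OrderNumber
WHERE t.TicketDate >= @StartDate
  AND t.TicketDate < @EndDate;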

What do I gain by adding a timestamp column called recordversion to a table in ms-sql?

What do I gain by adding a timestamp column called recordversion to a table in ms-sql?
You can use that column to make sure your users don't overwrite data from another user.
Let's say user A pulls up record 1 and at the same time user B pulls up record 1. User A edits the record and saves it. Five minutes later, user B edits the record, not knowing about user A's changes. When he saves his changes, you use the recordversion column in your UPDATE's WHERE clause, which prevents user B from overwriting what user A did. You can detect this condition and throw some kind of "data out of date" error.
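A minimal T-SQL sketch of that pattern, with made-up table and column names and assuming the rowversion column is called RecordVersion:

-- the client read the row earlier and kept the RecordVersion value it saw
UPDATE dbo.ServiceTicket
SET Status = @NewStatus
WHERE Id = @Id
  AND RecordVersion = @OriginalRecordVersion;

IF @@ROWCOUNT = 0
    RAISERROR('The record was modified by another user since you loaded it.', 16, 1);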
Nothing that I'm aware of, or that Google seems to find quickly.
You don't get anything inherent by using that name for a column. Sure, you can create a column and do the record versioning described in the other answers, but there's nothing special about the column name. You could call the column anything you want and do versioning, and you could call any column RecordVersion and nothing special would happen.
Timestamp is mainly used for replication. I have also used it successfully to determine whether data has been updated since the last feed to the client (when I needed to send a delta feed), and thus pick out only the records that have changed since then. This does require another table that stores the timestamp values (in a varbinary field) at the time you run the report, so you can compare against them on the next run.
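A sketch of that delta-feed approach, assuming a hypothetical watermark table that stores the last timestamp value used:

-- @LastVersion was saved (as varbinary(8)) after the previous run
DECLARE @LastVersion varbinary(8) =
    (SELECT LastVersion FROM dbo.FeedWatermark WHERE FeedName = 'client_delta');

SELECT *
FROM dbo.ServiceTicket
WHERE RecordVersion > @LastVersion;

-- record the database's current high-water mark for the next run
UPDATE dbo.FeedWatermark
SET LastVersion = @@DBTS
WHERE FeedName = 'client_delta';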
If you think that timestamp records the date or time of the last update, it does not; you would need datetime fields with constraints (to capture the original datetime) and triggers (to maintain it on updates) to store that information.
Also, keep in mind if you want to keep track of your data, it's a good idea to add these four columns to every table:
CreatedBy(varchar) | CreatedOn(date) | ModifiedBy(varchar) | ModifiedOn(date)
While it doesn't give you full history, it lets you know who created an entry and when, and who last modified it and when. Those four columns provide pretty powerful tracking without any serious overhead on your DB.
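For example, with defaults so the "created" values fill themselves in automatically (the table itself is hypothetical):

CREATE TABLE dbo.Customer (
    CustomerId int IDENTITY(1,1) PRIMARY KEY,
    Name       varchar(100) NOT NULL,
    CreatedBy  varchar(128) NOT NULL DEFAULT SUSER_SNAME(),
    CreatedOn  datetime     NOT NULL DEFAULT GETDATE(),
    ModifiedBy varchar(128) NULL,
    ModifiedOn datetime     NULL
);
-- ModifiedBy/ModifiedOn are then set by the application (or a trigger) on every UPDATE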
Obviously, you could create a full-blown logging system that tracks every change and gives you full-blown history, but that's not the solution for the issue I think you are proposing.