Reject / Bad Records Table in BigQuery

I am looking for a reject link type of solution in a dedup scenario. For example in the following code:
MERGE temp.many_random t
USING (
  SELECT DISTINCT * FROM temp.many_random WHERE d = CURRENT_DATE()
)
ON FALSE
WHEN NOT MATCHED BY SOURCE AND d = CURRENT_DATE() THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
Can I replace THEN DELETE with something like an INSERT into a different table (separate from the tables being compared), so that we can capture these rejected rows and troubleshoot them for pipeline analysis?

If I understand the problem, you want the result of a MERGE query to be written to two different tables.
Since MERGE can't do that, I'd suggest writing two queries:
One that does whatever the primary query is doing.
A second, almost identical one that writes the rejected records to a different table.
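For example, here is a minimal sketch of that second query, assuming a hypothetical temp.many_random_rejects table with the same schema as temp.many_random. Run it before the dedup MERGE: it captures every row for today that has at least one exact duplicate, so those rows can be inspected later.

INSERT INTO temp.many_random_rejects
SELECT * EXCEPT(dup_count)
FROM (
  SELECT t.*,
         COUNT(*) OVER (PARTITION BY TO_JSON_STRING(t)) AS dup_count
  FROM temp.many_random t
  WHERE t.d = CURRENT_DATE()
)
WHERE dup_count > 1;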

Related

SQL: Insert certain records from one table into another and also add few other fields using query

I have two tables, say TABLE1 and TABLE2, and the field id is common to both. The rest of the fields are different.
I now select all distinct id values from TABLE1 and want to insert them into TABLE2 while also writing its other attributes, like in the pseudocode below.
for each distinct id (i) in TABLE1:
INSERT in TABLE2 (i, false, unix_timestamp())
end
Now, for reasons outside my control, I cannot use a programming language to do this. Is it possible to do this in SQL using Apache Drill?
What you could do is write a query that produces the output you're looking for and then save that as a table. Drill is really a query engine and doesn't support INSERT operations the way a database does.
So a pseudo-query might look like this:
CREATE TABLE <your file> AS
SELECT ...
Then you could query that file. I don't know if that helps or not. You can also create views and temporary tables, but Drill itself doesn't really implement INSERT commands.
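As a sketch under those constraints, the pseudocode above might translate to a single Drill CTAS. Here dfs.tmp is assumed to be a writable workspace, and the flag and created_at column names are made up for illustration:

CREATE TABLE dfs.tmp.`TABLE2_NEW` AS
SELECT DISTINCT t1.id,
       false            AS flag,
       unix_timestamp() AS created_at
FROM TABLE1 t1;

You would then point further queries (or a view) at the new file instead of inserting into the original TABLE2.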

Maintain / Set Column Descriptions for Write Truncate Scheduled Query

Scenario: We have a number of scheduled queries that copy data into a project that we use as our centralized data warehouse. These scheduled queries are configured to run nightly and are set to WRITE_TRUNCATE.
Problem: We added descriptions to the columns in several of our destination tables in order to document them. However, when the scheduled queries ran they removed all of the column descriptions. (Table description was maintained.)
Desired Outcome: Is there a way to insert the column descriptions as part of the scheduled queries, or some other way to avoid having these deleted nightly? Or is that simply a limitation of WRITE_TRUNCATE scheduled queries?
I've searched Google & Stack Overflow, and reviewed the documentation, but I can't find any references to table / column descriptions in relation to scheduled queries.
One solution is, instead of using WRITE_TRUNCATE with a SELECT, to use:
CREATE OR REPLACE TABLE <table_name> ( <column_list_with_descriptions> )
AS SELECT ...
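For instance, a sketch of that pattern with hypothetical dataset, table, and column names, attaching the descriptions through column OPTIONS so they are recreated on every scheduled run:

CREATE OR REPLACE TABLE mydataset.destination_table (
  customer_id INT64   OPTIONS(description="Unique customer identifier"),
  order_total NUMERIC OPTIONS(description="Order total in USD")
)
AS
SELECT customer_id, order_total
FROM mydataset.source_table;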
If you don't want to repeat the column descriptions in every scheduled query, you may use:
DELETE FROM table WHERE true;
INSERT INTO table SELECT ...
If atomicity of the update is required, the two statements above can be combined into one MERGE statement like:
MERGE full_table
USING (
SELECT *
FROM data_updates_table
)
ON FALSE
WHEN NOT MATCHED BY SOURCE THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW

Insert with select, dependent on the values in the table inserting into EDITED

So I need to figure out how to insert into a table, from another table, with a WHERE clause that requires me to access the table that I am inserting into. I tried aliasing the table I am inserting into, but I quickly found out that you cannot do that. Basically, what I want to check is that the values I am inserting into the table match a particular field within the table that I am inserting into. Here is what I've tried:
INSERT INTO "USER"."TABLE1" AS A1
SELECT *
FROM "USER"."TABLE2" AS A2
WHERE A2."HIERARCHYLEVEL" = 2
AND A2."PARENT" = A1."INSTANCE"
Obviously, this was to no avail. I've tried a couple of other queries, but they didn't get me anywhere, either. Any help would be much appreciated.
EDIT:
I would like to add rows to this table, not add columns to the table. The two tables are of the exact same structure -- in fact, I extracted the data already in table1 from table2. What I have in table1 currently is a bunch of records that have NO PARENT, but do have an instance. What I want to add is all the records in table2 whose parent is equal to an instance in table1.
Currently there is no way to join on a table when inserting. The solution with the subselect, where you select from the table you are inserting into, is the correct one.
Aliasing the table you want to change is only possible with UPDATE, UPSERT and MERGE. For these operations it makes sense, as you need to match a column and then decide whether to update it or insert something instead. In your example, the row from table1 that you match is not relevant, as you don't want to change it, so from the statement's point of view it doesn't really matter that the table you use in your subselect is the same as the one you insert into.
As an alternative, I can suggest the following solution, which is equivalent to yours:
INSERT INTO "user"."table1"
SELECT
A1."ROOT",
A1."INSTANCE",
A1."PARENT",
A1."HIERARCHYLEVEL"
FROM "user"."table2" AS A1
WHERE A1."INSTANCE" in (select "PARENT" from "user"."table1")
AND A1."HIERARCHYLEVEL" = 2
This gave me the answer I was looking for, although I am sure there is an easier -- or more efficient -- way to do it.
INSERT INTO "user"."table1"
SELECT
A1."ROOT",
A1."INSTANCE",
A1."PARENT",
A1."HIERARCHYLEVEL"
FROM "user"."table2" AS A1,
"user"."table1" AS A2
WHERE A1."INSTANCE" = A2."PARENT"
AND A2."HIERARCHYLEVEL" = 2

Oracle SQL merge tables without specifying columns

I have a table people with less than 100,000 records and I have taken a backup of this table using the following:
create table people_backup as select * from people
I add some new records to my people table over time, but eventually I want to merge the records from my backup table into people. Unfortunately I cannot simply DROP my table as my new records will be lost!
So I want to update the records in my people table using the records from people_backup, based on their primary key id and I have found 2 ways to do this:
MERGE the tables together
use some sort of fancy correlated update
Great! However, both of these methods use SET and make me specify which columns I want to update. Unfortunately I am lazy, and the structure of people may change over time; while my CTAS statement doesn't need to be updated, my update/merge script will need changes, which feels like unnecessary work.
Is there a way to merge entire rows without having to specify columns? I see here that not specifying columns during an INSERT directs SQL to insert values by column order; can the same methodology be applied here, and is it safe?
NB: The structure of the table will not change between backups
Given that your table is small, you could simply
DELETE FROM table t
WHERE EXISTS ( SELECT 1
               FROM backup b
               WHERE t.key = b.key );

INSERT INTO table
SELECT *
FROM backup;
That is slow and not particularly elegant (particularly if most of the data from the backup hasn't changed) but assuming the columns in the two tables match, it does allow you to not list out the columns. Personally, I'd much prefer writing out the column names (presumably those don't change all that often) so that I could do an update.
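Applied to the question's tables, that pattern would look roughly like this, assuming id is the primary key of both people and people_backup:

-- remove the current versions of any rows that also exist in the backup
DELETE FROM people p
WHERE EXISTS ( SELECT 1
               FROM people_backup b
               WHERE b.id = p.id );

-- re-insert every backup row; no column list, so this relies on both tables
-- having identical column order, which the question guarantees
INSERT INTO people
SELECT *
FROM people_backup;

COMMIT;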

Merge Statement VS Lookup Transformation

I am stuck with a problem with different views.
Present Scenario:
I am using SSIS packages to get data from Server A to Server B every 15 minutes. I created 10 packages for 10 different tables, and also created 10 staging tables for them. In the Data Flow Task, the package selects data from Server A with an ID greater than the last imported ID and dumps it into a staging table (each table has its own staging table). After the Data Flow Task I am using a MERGE statement to merge records from the staging table into the destination table where the ID is NOT MATCHED.
Problem:
This takes care of all newly inserted records, but once a record has been picked up by the SSIS job, if it is later updated at the source I am not able to pick it up again and grab the updated data.
Questions:
How will I be able to achieve the updates without impacting the source database server too much?
Do I use a MERGE statement and select 10,000 records every single run (every 15 minutes)?
Do I use a Lookup transformation to do the updates?
Some tables have more than 2 million records and growing, so what is the best approach for them?
NOTE:
I can truncate tables in destination and reinsert complete data for the first run.
Edit:
The Source has a column 'LAST_UPDATE_DATE' which I can Use in my query.
If I'm understanding your statements correctly, it sounds like you're pretty close to your solution. If you currently have a merge statement that includes the insert (where the source does not match the destination), you should be able to easily include an update clause for the case where the source matches the destination.
example:
MERGE target_table AS destination_table_alias
USING (
    SELECT <column_name(s)>
    FROM source_table
) AS source_alias
ON
    [source_alias].[table_identifier] = [destination_table_alias].[table_identifier]
WHEN MATCHED THEN UPDATE
    SET [destination_table_alias].[column_name1] = [source_alias].[column_name1],
        [destination_table_alias].[column_name2] = [source_alias].[column_name2]
WHEN NOT MATCHED THEN
    INSERT ([column_name1], [column_name2])
    VALUES ([source_alias].[column_name1], [source_alias].[column_name2]);
So, to your points:
Update can be achieved via the 'WHEN MATCHED' logic within the merge statement
If you have the last ID of the table that you're loading, you can include this as a filter on your select statement so that the dataset is incremental.
No lookup is needed when the 'WHEN MATCHED' clause is utilized.
For the larger, growing tables, keep each run small by utilizing a filter in the SELECT portion of the MERGE statement (for example on LAST_UPDATE_DATE), as sketched below.
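A rough sketch of that incremental filter, using made-up dbo.staging_table / dbo.destination_table names and a hypothetical @LastRunTime variable that your package would load from its own run log:

DECLARE @LastRunTime DATETIME = '2020-01-01';  -- placeholder; load from your run log

MERGE dbo.destination_table AS dest
USING (
    SELECT [table_identifier], [column_name1], [column_name2]
    FROM dbo.staging_table
    WHERE LAST_UPDATE_DATE > @LastRunTime   -- only rows changed since the last run
) AS src
ON src.[table_identifier] = dest.[table_identifier]
WHEN MATCHED THEN UPDATE
    SET dest.[column_name1] = src.[column_name1],
        dest.[column_name2] = src.[column_name2]
WHEN NOT MATCHED THEN
    INSERT ([table_identifier], [column_name1], [column_name2])
    VALUES (src.[table_identifier], src.[column_name1], src.[column_name2]);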
Hope this helps