Run truncate in BigQuery with Apache NiFi - google-bigquery

I have a process that uses the PutBigQueryBatch processor, in which I would like it to truncate the table before inserting the data. I defined an AVRO schema, and previously created the table in BigQuery specifying how I wanted the fields. I am aware that if I change the "Write Disposition" property to the value "WRITE_TRUNCATE", it will truncate the table. However, when I use this option, the schema of the table in BigQuery ends up being deleted, which I would not like to happen, and a new schema is created to record the data. I understand that the "Create Disposition" property exists, and that if the "CREATE_NEVER" option is selected, the schema should be respected and not deleted.
When I run this processor with the "Write Disposition" property set to "WRITE_APPEND", the schema I created in BigQuery is respected, but with "WRITE_TRUNCATE" it is not.
Is there any way to use the "WRITE_TRUNCATE" option without the table schema being deleted?
Am I doing something wrong?
Below is the configuration that I am using in the PutBigQueryBatch processor:
[Image: PutBigQueryBatch processor configuration]

It sounds like what you want is to run a TRUNCATE TABLE statement before starting your process. It deletes all rows from the table but leaves the table's schema in place: https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#truncate_table_statement
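As a rough sketch of that idea, the statement could be generated like this before the load step runs. The helper name `truncate_statement` and the table ID `my-project.my_dataset.my_table` are placeholders, not something from the question, and actually executing the statement (through a BigQuery client or a step in the NiFi flow that can run SQL) is not shown here.

```python
import re

# Hypothetical helper: builds a BigQuery standard SQL TRUNCATE TABLE statement.
# Only letters, digits, underscores and hyphens are accepted, so that nothing
# unexpected ends up inside the backtick-quoted table name.
_PART = re.compile(r"^[A-Za-z0-9_-]+$")


def truncate_statement(project: str, dataset: str, table: str) -> str:
    """Return a TRUNCATE TABLE statement for a fully qualified table."""
    for part in (project, dataset, table):
        if not _PART.match(part):
            raise ValueError(f"unexpected character in identifier: {part!r}")
    # Backticks quote the full name, which may contain hyphens (project IDs).
    return f"TRUNCATE TABLE `{project}.{dataset}.{table}`"


if __name__ == "__main__":
    # Placeholder identifiers, for illustration only.
    print(truncate_statement("my-project", "my_dataset", "my_table"))
```

With the table emptied this way, the processor could presumably stay on "WRITE_APPEND", which is the disposition the question reports as respecting the existing schema.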

Related

Why isn't there an option to upsert data in Azure Data Factory inline sink

The problem I'm trying to tackle is inserting and/or updating dynamic tables in a sink within an Azure Data Factory data flow. I've managed to get the source data, transform it how I want it and then send it to a sink. The pipeline ran successfully and it said it copied 37 rows (as expected) but investigation showed that no data was actually deposited in the target table. This was because the Table Action on the sink was set to 'None'. So in trying to fix this last part, it seems I don't have the 'Create' option but do have the 'Recreate' option (see screenshot of the sink below) which is not what I want as the datasource will eventually only have changed data. I need the process to create the table if it doesn't exist and then Upsert data. (Recreate drops the table and then creates it).
If I change the sink type from Inline to Dataset, then I can select options such as Insert and Upsert, but this is then not dynamic, as I need to select a specific dataset.
So has anyone come across the same issue, and have you managed to have dynamic sinks in your data flow where the table is created if it doesn't exist and the data is then upserted?
I guess I can add a Pre SQL script which takes care of the 'create the table if it doesn't exist' but I still can't select the Upsert option with inline tables.
For the CREATE TABLE IF NOT EXISTS issue, I would recommend a Stored Procedure that is executed in the pipeline prior to the Data Flow.
For Inline vs Dataset, you can make the Dataset very flexible: it can still be based on your runtime table name and carry no schema, so there is no need to target a specific table.
For the UPSERT issue, make sure you have an Alter Row (AlterRow) transformation before the Sink.

What's the best approach to load teradata table data into a hive table using Nifi?

I'm new to NiFi, so could you help me understand this platform and its capabilities?
Would I be able to use a NiFi process to create a new table in Hive and move data into it weekly from a Teradata database in the way I've defined below?
How would I go about it? Not sure if I'm building a sensible flow.
Would the following process suffice: QueryDatabaseTable (and configure a pooling service for Teradata, define a new table name and schedule ingestion) --> PutHiveStreaming (create the table defined earlier)
And then how do I pull the Teradata schema into the new table?
If you want to create a new Hive table as part of the ingestion process, there are two options.
Method1:
The ConvertAvroToORC processor adds a hive.ddl attribute (an external table definition) to the flowfile. You can execute this attribute with the PutHiveQL processor to create the table in Hive.
If you want to create a transactional table, you need to change the hive.ddl attribute accordingly.
Refer to this link for more details.
If you want to pull only the delta records from the source, you can use the
ListDatabaseTables (lists all tables from the source db) + GenerateTableFetch (stores the state) processors.
Method2:
The QueryDatabaseTable processor produces a flowfile in Avro format. You can then use the ExtractAvroMetadata processor to extract the Avro schema and, with a script, create a new attribute with the required schema (i.e. managed/external/transactional table).

Appending data to a table created from an Avro file in BigQuery

Every morning, an automatic job creates a new table from an Avro file. In the afternoon, I would need to append some data to this table from a Query.
When trying to do so, I get the following error:
Error: Invalid schema update. Field chn has changed mode from REQUIRED to NULLABLE
I noticed that I can change the mode of the field chn from REQUIRED to NULLABLE in the BigQuery Web UI and then it works fine, but I would have to do it manually every day, which is not what I am looking for.
Is there a way to "cast" the field as REQUIRED during the append query ?
Or during the first import from the Avro file, force the field to be NULLABLE and not REQUIRED ?
Thanks !
The feature that allows relaxing a field as part of a query or a load job will be available in production shortly. I will update this answer when it goes live (likely within a week).
Update: 08/25/2016
You can supply schemaUpdateOptions in load or query job configuration.
Multiple options can be provided.
It allows the schema of the destination table to be updated as a side effect of the load or query job. Schema update options are supported in two cases:
When writeDisposition is WRITE_APPEND
When writeDisposition is WRITE_TRUNCATE and the destination table is a partition of a table, specified by partition decorators
For non-partitioned tables, WRITE_TRUNCATE will always overwrite the schema.
The following values are supported:
ALLOW_FIELD_ADDITION: allow adding a nullable field to the schema
ALLOW_FIELD_RELAXATION: allow relaxing a required field in the original schema to nullable
NOTE: This doesn't currently work with schema auto-detection. We plan to support that soon.
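To make the rules above concrete, here is a minimal sketch that assembles a load job configuration as a plain dictionary and rejects combinations the answer describes as unsupported. The field names (`schemaUpdateOptions`, `writeDisposition`, `sourceUris`, `destinationTable`) follow the BigQuery jobs resource as I understand it; the function name, bucket and table names are made up for illustration, and submitting the job is left out.

```python
import json

# The two values listed in the answer above.
ALLOWED_OPTIONS = {"ALLOW_FIELD_ADDITION", "ALLOW_FIELD_RELAXATION"}


def load_job_config(project, dataset, table, source_uri,
                    write_disposition="WRITE_APPEND",
                    schema_update_options=("ALLOW_FIELD_RELAXATION",)):
    """Build a load-job configuration dict (hypothetical helper)."""
    options = list(schema_update_options)
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError(f"unsupported schema update options: {sorted(unknown)}")

    # A partition decorator looks like "table$20160825".
    targets_partition = "$" in table
    allowed = (write_disposition == "WRITE_APPEND"
               or (write_disposition == "WRITE_TRUNCATE" and targets_partition))
    if options and not allowed:
        raise ValueError(
            "schema update options need WRITE_APPEND, or WRITE_TRUNCATE "
            "on a partition specified by a partition decorator")

    return {
        "configuration": {
            "load": {
                "sourceUris": [source_uri],
                "sourceFormat": "AVRO",
                "destinationTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    "tableId": table,
                },
                "writeDisposition": write_disposition,
                "schemaUpdateOptions": options,
            }
        }
    }


if __name__ == "__main__":
    # Placeholder project, dataset, table and bucket names.
    config = load_job_config("my-project", "my_dataset", "daily_table",
                             "gs://my-bucket/afternoon.avro")
    print(json.dumps(config, indent=2))
```

For the afternoon append described in the question, the default arguments (WRITE_APPEND with ALLOW_FIELD_RELAXATION) are the combination that would let chn go from REQUIRED to NULLABLE as a side effect of the job.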

Change the database's table in hive or hcatalog

Is there a way to move a table to a different database in Hive or HCatalog?
For instance, I have the table foo in the database default, and I want to put this table in the database bar. I tried this, but it doesn't work:
ALTER TABLE foo RENAME TO bar.foo
Thanks in advance
AFAIK there is no way to do this in HiveQL. A ticket was raised long ago, but the issue is still open.
An alternative could be to use the EXPORT/IMPORT feature provided by Hive. With this feature we can export the data of a table, along with its metadata, to an HDFS location using the EXPORT command. The metadata is stored in JSON format. Data exported this way can be imported into another database (even another Hive instance) using the IMPORT command.
More on this can be found on the IMPORT/EXPORT MANUAL.
HTH
Thanks for your response. I found another way to change the database:
USE db1; CREATE TABLE db2.foo LIKE foo;
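Note that CREATE TABLE ... LIKE copies only the table definition, not the rows, so a full move needs a copy step as well. A small sketch of the statements involved, generated from a hypothetical helper (`move_table_statements` is my own name, and the plain INSERT ... SELECT assumes an unpartitioned table):

```python
def move_table_statements(table: str, source_db: str, target_db: str) -> list:
    """Return HiveQL statements that copy a table into another database.

    Hypothetical helper for illustration. It assumes an unpartitioned
    table; a partitioned one would need a partition-aware INSERT.
    """
    src = f"{source_db}.{table}"
    dst = f"{target_db}.{table}"
    return [
        # Copy the table definition only (no rows).
        f"CREATE TABLE {dst} LIKE {src}",
        # Copy the rows into the new table.
        f"INSERT INTO TABLE {dst} SELECT * FROM {src}",
        # Remove the original; run this only after checking the copy.
        f"DROP TABLE {src}",
    ]


if __name__ == "__main__":
    for statement in move_table_statements("foo", "db1", "db2"):
        print(statement + ";")
```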

remove source file from Hive table

When I load a (csv)-file to a hive table I can load without overwriting, thus adding the new file to the table.
Internally the file is just copied to the correct folder in HDFS
(e.g. user/warehouse/dbname/tablName/datafile1.csv). And probably some metadata is updated.
After a few loads I want to remove the contents of a specific file from the table.
I am sure I cannot simply delete the file because of the metadata that needs to be adjusted as well. There must be some kind of built-in function for this.
How do I do that?
Why do you need that? Hive was developed to serve as a warehouse where you put lots and lots of data, not to delete data every now and then. To me, at least, such a need suggests a poorly thought-out schema or a poor use of Hive.
And if you really have this kind of need, why don't you create partitioned tables? If you need to delete some specific data, just remove that particular partition using either TRUNCATE or ALTER:
TRUNCATE TABLE table_name [PARTITION partition_spec];
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec, PARTITION partition_spec,...
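As an illustration of the ALTER form, the sketch below builds a DROP PARTITION statement from a partition spec. It assumes the table was partitioned by a column that identifies each loaded file; the column name `load_file`, the table name `sales` and the helper itself are hypothetical.

```python
def drop_partition_statement(table: str, spec: dict) -> str:
    """Build an ALTER TABLE ... DROP IF EXISTS PARTITION statement.

    `spec` maps partition columns to string values, e.g.
    {"load_file": "datafile1"}. Hypothetical helper for illustration.
    """
    if not spec:
        raise ValueError("a partition spec needs at least one column")
    for value in spec.values():
        # Keep the sketch simple: refuse values that would need escaping.
        if "'" in str(value):
            raise ValueError(f"quote character in partition value: {value!r}")
    parts = ", ".join(f"{col}='{val}'" for col, val in spec.items())
    return f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({parts})"


if __name__ == "__main__":
    # Placeholder table and partition names.
    print(drop_partition_statement("sales", {"load_file": "datafile1"}))
```

Loading each file into its own partition is what makes this work: removing the contents of one file then becomes a single partition drop instead of a rewrite of the table.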
If this feature is needed more than just once in a while, you can use MapR's distribution, which allows this kind of operation with no problem (even via NFS). Otherwise, if you don't have partitions, I think you'll have to create a new table using CTAS, filtering out the data from the bad file, or just copy the good files to the local OS with "hadoop fs -copyToLocal" and move them back to HDFS into a new table.