How to transfer data from Google Cloud Storage to BigQuery into a partitioned and clustered table? - google-bigquery

At first, I created an empty table with partitioning and clustering. After that, I wanted to configure the Data Transfer Service to fill my table from Google Cloud Storage. But when I configured the transfer, I did not see any parameter field that allows choosing the clustering field.
I tried to do the same thing without the clustering and I can fill my table easily.
BigQuery error when I ran the transfer:
Failed to start job for table matable$20190701 with error INVALID_ARGUMENT: Incompatible table partitioning specification. Destination table exists with partitioning specification interval(type:DAY,field:) clustering(string_field_15), but transfer target partitioning specification is interval(type:DAY,field:). Please retry after updating either the destination table or the transfer partitioning specification.

When you define the table, you specify the partitioning and clustering columns. That's all you need to do.
When you load the data from GCS (from the CLI or the UI), BigQuery automatically partitions and clusters it.
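For example, a destination table defined like the following is already ingestion-time partitioned and clustered, and load jobs against it will follow that layout. This is only a minimal sketch: the dataset name and the second column are made up, and string_field_15 is taken from the error message above.

```sql
-- Ingestion-time (DAY) partitioned, clustered destination table
CREATE TABLE mydataset.matable (
  string_field_15 STRING,   -- clustering column from the error message
  some_value      FLOAT64   -- placeholder column
)
PARTITION BY _PARTITIONDATE -- ingestion-time partitioning
CLUSTER BY string_field_15;
```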
If you could give more detail about how you created the table and set up the transfer, that would help me provide a more detailed explanation.

Thanks for your time.
Of course:
empty table configuration
transfer configuration
I manage to transfer the data without clustering, but when I add clustering to my empty table, the transfer fails.

Related

BigQuery Scheduled Data Transfer throws "Incompatible table partitioning specification." Error - but error message is truncated

I'm using the new BQ Data Transfer UI and upon scheduling a Data Transfer, the transfer fails.
The error message in Run History isn't terribly helpful as the error message seems truncated.
Incompatible table partitioning specification. Expects partitioning specification interval(type:hour), but input partitioning specification is ; JobID: xxxxxxxxxxxx
Note the part of the error that says "...but input partitioning specification is..." with nothing before the semicolon. It seems this error is truncated.
Some details about the run:
The run imports data from a CSV file located in a GCS bucket on a nightly basis. Once the data is successfully ingested, the process deletes the file. The target table in BQ is a partitioned table using the default partition pseudo column (_PARTITIONTIME).
What I have done so far:
Reran the scheduled Data Transfer -- it failed and threw the same error.
Deleted the target table in BQ and recreated it with different partition specifications (day, hour, month) -- then reran the scheduled transfer -- it failed and threw the same error.
Imported the data manually (I downloaded the file from GCS and uploaded it locally from my machine) using the BQ UI (Create Table, append to the specific table) -- this worked perfectly.
Checked to see if this was a known issue here on Stack Overflow and only found this (which is now closed) -- close, but not exactly the issue. (BigQuery Data Transfer Service with BigQuery partitioned table)
What I'm holding off on doing, since it would take a bit more work:
Change schema of the target BQ table to include a specified column specific for partitioning
Include a system-generated timestamp in the original file inside of GCS and ensure the process recognizes this as the partitioning field.
Am I doing something wrong? Or is this a known issue?
Alright, I believe I have solved this. It looks like you need to include runtime parameters in your destination table name if the destination table is partitioned.
https://cloud.google.com/bigquery-transfer/docs/gcs-transfer-parameters
Specifically this section called "Runtime Parameter Examples" here: https://cloud.google.com/bigquery-transfer/docs/gcs-transfer-parameters#loading_a_snapshot_of_all_data_into_an_ingestion-time_partitioned_table
They also advise that minutes cannot be specified in these parameters.
You will need to append the parameters to your destination table details as shown below:
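For example, reusing the table name from the first question, the "Destination table" value for an ingestion-time partitioned table would look something like the snippets below. These are a sketch based on the runtime parameters described on that documentation page; the offset and the format string are purely illustrative.

```
matable${run_date}
matable${run_time-24h|"%Y%m%d"}
```

The first form loads each run into the partition for the run date; the second applies an offset and an explicit date format. As noted above, minutes cannot be used in these parameters.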

Are two tables (native, external) always required in Hive for querying a DynamoDB table from AWS EMR?

Are two Hive tables (native, external) always required for querying a DynamoDB table from AWS EMR?
I have created a native Hive table (CTAS, create table as select) from a Hive external table that was mapped to a DynamoDB table. My (read) query times against the external table are slow and it uses up the read throughput, whereas queries against the native table are fast and do not consume read throughput.
My questions:
Is this a standard/best practice, i.e., create an external table mapped to a DynamoDB table, then create a CTAS table and run all read queries against the CTAS table?
Where or how do GSIs on DynamoDB come into the picture on the Hive side of things? Out of curiosity I tried to map a column of my external Hive table to a DynamoDB GSI and, somewhat expectedly, saw NULLs.
So, coming back to question #2, I was wondering how GSIs are used with a native or external Hive table?
Thanks,
The answer is no.
However, from my observation, if a native Hive table is populated (via CTAS) from a Hive external table that references a DynamoDB table, then querying the native table from EMR does not consume DynamoDB read throughput. You do, however, have to take care of periodically refreshing the data in the native table.
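A minimal sketch of that setup (the DynamoDB table name, attribute names, and Hive column names below are made up for illustration):

```sql
-- External table mapped to a DynamoDB table; no data is copied yet
CREATE EXTERNAL TABLE ddb_orders (order_id STRING, total DOUBLE)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name"     = "Orders",
  "dynamodb.column.mapping" = "order_id:OrderId,total:Total"
);

-- Native (managed) table created once via CTAS; reads against it
-- do not touch DynamoDB
CREATE TABLE orders_local STORED AS ORC AS
SELECT * FROM ddb_orders;

-- Periodic refresh: this is the step that consumes DynamoDB read throughput
INSERT OVERWRITE TABLE orders_local
SELECT * FROM ddb_orders;
```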

How does the OVERWRITE_EXISTING insert mode work in RedshiftCopyActivity for AWS Data Pipeline?

I am new to AWS Data Pipeline. We have a use case where we copy updated data into Redshift. I wanted to know whether I can use the OVERWRITE_EXISTING insert mode for RedshiftCopyActivity. Also, please explain the internal working of OVERWRITE_EXISTING.
Data Pipeline is used to move data from DynamoDB or Amazon S3 to Amazon Redshift. You can load data into a new table, or easily merge data into an existing table.
"OVERWRITE_EXISTING" overwrites data that already exists in the destination table, using a unique identifier (primary key) on the Redshift cluster to decide which rows to replace.
You can use "TRUNCATE" if you don't want your table structure to be changed by the addition of a primary key.
You can find more details here: https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html
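Conceptually, the OVERWRITE_EXISTING merge described above behaves roughly like the following. This is only a sketch of the upsert semantics, not the literal statements the service runs, and the table and column names are made up.

```sql
-- Incoming data is first loaded (COPY) into a staging table, then merged by primary key
BEGIN;

DELETE FROM target
USING staging
WHERE target.id = staging.id;   -- id: primary key of the target table

INSERT INTO target
SELECT * FROM staging;

END;
```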

What's the best approach to load Teradata table data into a Hive table using NiFi?

I'm new to NiFi, so could you help me understand this platform and its capabilities?
Would I be able to use a NiFi process to create a new table in Hive and move data into it weekly from a Teradata database in the way I've defined below?
How would I go about it? I'm not sure if I'm building a sensible flow.
Would the following process suffice: QueryDatabaseTable (configure a connection pooling service for Teradata, define the new table name and schedule the ingestion) --> PutHiveStreaming (into the table defined earlier)?
And then how do I pull the Teradata schema into the new table?
If you want to create the new Hive table as part of the ingestion process:
Method 1:
The ConvertAvroToOrc processor adds a hive.ddl attribute (an external-table CREATE statement) to the flowfile; you can take this attribute and execute it with a PutHiveQL processor to create the table in Hive.
If you want to create a transactional table instead, you need to modify the hive.ddl attribute (see the sketch after this answer).
Refer to this link for more details.
If you want to pull only the delta records from the source, you can use the ListDatabaseTables (lists all tables from the source DB) + GenerateTableFetch (keeps the state) processors.
Method 2:
The QueryDatabaseTable processor produces flowfiles in Avro format; you can then use the ExtractAvroMetadata processor to extract the Avro schema and, with a small script, build a new attribute containing the DDL you need (i.e. a managed/external/transactional table).
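As a rough illustration of the difference between the two DDL flavors (the table names, columns, and location path below are made up, and the exact DDL ConvertAvroToOrc generates depends on your data):

```sql
-- Shape of the external-table DDL that ConvertAvroToOrc puts in hive.ddl;
-- a LOCATION clause pointing at the directory holding the ORC files is
-- typically appended before executing it with PutHiveQL
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (id INT, name STRING)
STORED AS ORC
LOCATION '/data/my_table';

-- A transactional (ACID) variant you would issue instead; Hive ACID tables
-- must be bucketed ORC tables with transactional=true
CREATE TABLE IF NOT EXISTS my_table_txn (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
```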

When should we go for an external table vs. an internal table in Hive?

I understand the difference between internal tables and external tables in Hive as below:
1) if we drop an internal table, both the file and the metadata are deleted; however, in the case of an external table, only the metadata is deleted
2) if the file data needs to be shared by other tools/applications then we go for an external table, otherwise an internal table, so that if we drop the (external) table the data will still be available for those other tools/applications
I have gone through the answers for the question "Difference between Hive internal tables and external tables?",
but I am still not clear about the proper use cases for internal tables,
so my question is: why do I need to make an internal table? Why can't I make everything an external table?
Use EXTERNAL tables when:
The data is also used outside of Hive.
For example, the data files are read and processed by an existing program that doesn't lock the files.
The data is permanent, i.e. it is used whenever needed.
Use INTERNAL tables when:
The data is temporary.
You want Hive to completely manage the lifecycle of the table and data.
Let's understand it with two simple scenarios:
Suppose you have a data set and you have to perform some analytics/problem statements on it. Because of the nature of the problem statements, some of them can be solved with HiveQL, some need Pig Latin, and some need MapReduce, etc., to get the job done. In this situation an external table comes into the picture: the same data set can be used for all the analytics, instead of keeping a separate copy of it for each tool. Here Hive doesn't need authority over the data set, because several tools are going to use it.
There can also be a scenario where the entire set of analytics/problem statements can be solved by HiveQL alone. In that situation an internal table comes into the picture: you can put the entire data set into Hive's warehouse, and Hive has complete authority over it.
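To make the distinction concrete, here is a minimal sketch of both kinds of table; the table names, columns, and HDFS path are made up for illustration.

```sql
-- EXTERNAL: Hive only tracks the metadata; dropping the table leaves the
-- files under /data/shared/events in place for other tools (Pig, MapReduce, ...)
CREATE EXTERNAL TABLE events_ext (event_id STRING, event_ts TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/shared/events';

-- INTERNAL (managed): the data lives in Hive's warehouse directory;
-- dropping the table deletes both the metadata and the data files
CREATE TABLE events_managed (event_id STRING, event_ts TIMESTAMP)
STORED AS ORC;

-- DROP TABLE events_ext;      -- data files remain on HDFS
-- DROP TABLE events_managed;  -- data files are removed as well
```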