I am using Confluent to import data from Kafka to Hive, trying to do the same thing as in this question: Bucket records based on time (kafka-hdfs-connector).
My sink config is like this:
{
"name":"yangfeiran_hive_sink_9",
"config":{
"connector.class":"io.confluent.connect.hdfs.HdfsSinkConnector",
"topics":"peoplet_people_1000",
"name":"yangfeiran_hive_sink_9",
"tasks.max":"1",
"hdfs.url":"hdfs://master:8020",
"flush.size":"3",
"partitioner.class":"io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner",
"partition.duration.ms":"300000",
"path.format":"'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm/",
"locale":"en",
"logs.dir":"/tmp/yangfeiran",
"topics.dir":"/tmp/yangfeiran",
"hive.integration":"true",
"hive.metastore.uris":"thrift://master:9083",
"schema.compatibility":"BACKWARD",
"hive.database":"yangfeiran",
"timezone": "UTC",
}
}
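For reference, a connector config like this is normally registered through the Kafka Connect REST interface. A rough sketch in Python, assuming a Connect worker listening on localhost:8083 and the config above saved as hive_sink.json:

import json
import requests

# Hypothetical worker address; replace with your Connect REST endpoint.
connect_url = "http://localhost:8083/connectors"

# Load the connector definition shown above.
with open("hive_sink.json") as f:
    payload = json.load(f)

# POST the definition; Kafka Connect returns 201 Created on success.
resp = requests.post(connect_url, json=payload)
resp.raise_for_status()
print(resp.json())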
Everything works fine: I can see that the data is in HDFS and the table is created in Hive. But when I use "select * from yang" to check whether the data is already in Hive, it prints this error:
FAILED: SemanticException Unable to determine if hdfs://master:8020/tmp/yangfeiran/peoplet_people_1000 is encrypted: java.lang.IllegalArgumentException: Wrong FS: hdfs://master:8020/tmp/yangfeiran/peoplet_people_1000, expected: hdfs://nsstargate
How do I solve this problem?
Feiran
When we use a BigQueryIO transform to insert rows, we have an option called:
.withCreateDisposition(CreateDisposition.CREATE_NEVER)
which instructs the pipeline to NOT attempt to create the table if the table doesn't already exist. In my scenario, I want to trap all errors. I attempted to use the following:
var write=mypipline.apply("Write table", BigQueryIO
.<Employee>write()
.to(targetTableName_notpresent)
.withExtendedErrorInfo()
.withFormatFunction(new EmployeeToTableRow())
.withSchema(schema)
.withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
.withTableDescription("My Test Table")
.withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
.withCreateDisposition(CreateDisposition.CREATE_NEVER)
.withWriteDisposition(WriteDisposition.WRITE_APPEND));
which tried to insert rows into a non-existent table. What I found was a RuntimeException. Where I am stuck is that I don't know how to handle the RuntimeException; I don't believe there is anything here I can surround with a try/catch.
This question is similar to this one:
Is it possible to catch a missing dataset java.lang.RuntimeException in a Google Cloud Dataflow pipeline that writes from Pub/Sub to BigQuery?
but I don't think that one got a working answer, and it was focused on a missing dataset rather than a table.
My exception from the fragment above is:
org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.RuntimeException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
POST https://bigquery.googleapis.com/bigquery/v2/projects/XXXX/datasets/jupyter/tables/not_here/insertAll?prettyPrint=false
{
"code" : 404,
"errors" : [ {
"domain" : "global",
"message" : "Not found: Table XXXX:jupyter.not_here",
"reason" : "notFound"
} ],
"message" : "Not found: Table XXXX:jupyter.not_here",
"status" : "NOT_FOUND"
}
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:373)
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:341)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:218)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:67)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:323)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:309)
at .(#126:1)
You can't add a try/catch directly around BigQueryIO in the Beam job to handle the case where the destination table doesn't exist.
I think it's better to delegate this responsibility outside of Beam, or to launch the job only if your table exists.
Usually a tool like Terraform is responsible for creating the infrastructure before resources are deployed and Beam jobs are run.
If it's mandatory for you to check the existence of the table, you can create:
A shell script using the bq and gcloud CLIs to check for the table before launching the job
A Python script to check for the table before launching the job
Python script:
For Python, there is the BigQuery Python client:
from google.cloud import bigquery
from google.cloud.exceptions import NotFound
client = bigquery.Client()
# TODO(developer): Set table_id to the ID of the table to determine existence.
# table_id = "your-project.your_dataset.your_table"
try:
    client.get_table(table_id)  # Make an API request.
    print("Table {} already exists.".format(table_id))
except NotFound:
    print("Table {} is not found.".format(table_id))
BQ shell script:
bq show <project_id>:<dataset_id>.<table_id>
If the table doesn't exist, catch the error and do not start the Dataflow job.
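Putting the check and the launch together, a minimal sketch (launch_dataflow_job() below is a hypothetical helper standing in for however you start the Beam job):

from google.cloud import bigquery
from google.cloud.exceptions import NotFound

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table"  # placeholder table ID

try:
    client.get_table(table_id)  # raises NotFound if the table is missing
except NotFound:
    raise SystemExit("Table {} does not exist, not launching the job.".format(table_id))

launch_dataflow_job()  # hypothetical helper that starts the Beam/Dataflow job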
I have just been working on a cloud database project involving some big data processing and need to quickly migrate data directly from BigQuery to Snowflake. I was wondering if there is any direct way to push data from BigQuery to Snowflake without any intermediate storage.
Please let me know if there is a utility, an API, or any other way BigQuery provides to push data to Snowflake. Appreciate your help.
If this is a one-time need, you can perform it in two steps: exporting the table data to Google Cloud Storage and then bulk loading from Cloud Storage into Snowflake.
For the export stage, use BigQuery's built-in export functionality, or the BigQuery Storage API if the data is very large. All of the export formats supported by BigQuery (CSV, JSON or AVRO) are readily supported by Snowflake, so a data transformation step may not be required.
Once the export is ready at the target Cloud Storage location, use Snowflake's COPY INTO <table> with the external location option (or a named external stage) to copy the files into a Snowflake-managed table.
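As a rough illustration of the export step, the BigQuery Python client can run an extract job to Cloud Storage (the table, bucket and format below are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder identifiers; replace with your table and bucket.
source_table = "your-project.your_dataset.your_table"
destination_uri = "gs://your-bucket/export/your_table-*.csv"

# CSV is the default; NEWLINE_DELIMITED_JSON and AVRO are also supported.
job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.CSV
)

extract_job = client.extract_table(source_table, destination_uri, job_config=job_config)
extract_job.result()  # wait for the export to finish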
If you want to move all your data from BigQuery to Snowflake, I have a few assumptions:
1. You already have all your data in GCS.
2. You can connect to the Snowflake cluster from GCP.
Now your problem shrinks down to moving your data directly from the GCS bucket to Snowflake.
So now you can run a COPY command like:
COPY INTO [<namespace>.]<table_name>
FROM { internalStage | externalStage | externalLocation }
[ FILES = ( '<file_name>' [ , '<file_name>' ] [ , ... ] ) ]
[ PATTERN = '<regex_pattern>' ]
[ FILE_FORMAT = ( { FORMAT_NAME = '[<namespace>.]<file_format_name>' |
TYPE = { CSV | JSON | AVRO | ORC | PARQUET | XML } [ formatTypeOptions ] } ) ]
[ copyOptions ]
[ VALIDATION_MODE = RETURN_<n>_ROWS | RETURN_ERRORS | RETURN_ALL_ERRORS ]
where the external location is:
externalLocation ::=
'gcs://<bucket>[/<path>]'
[ STORAGE_INTEGRATION = <integration_name> ]
[ ENCRYPTION = ( [ TYPE = 'GCS_SSE_KMS' ] [ KMS_KEY_ID = '<string>' ] | [ TYPE = NONE ] ) ]
The documentation for the same can be found here and here
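As a concrete, hedged sketch of that load step using the Snowflake Python connector (the credentials, bucket, storage integration and table names are placeholders):

import snowflake.connector

# Placeholder connection parameters.
conn = snowflake.connector.connect(
    account="your_account",
    user="your_user",
    password="your_password",
    warehouse="your_wh",
    database="your_db",
    schema="public",
)

copy_sql = """
COPY INTO my_table
FROM 'gcs://my-bucket/export/'
STORAGE_INTEGRATION = my_gcs_integration
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
"""

cur = conn.cursor()
try:
    cur.execute(copy_sql)  # bulk load from the GCS external location
finally:
    cur.close()
    conn.close()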
If the Snowflake instance is hosted with a different vendor (e.g. AWS or Azure), there is a tool I wrote called sling that could be of use. There is a blog entry covering this in detail, but essentially you run one command after setting up your credentials:
$ sling run --src-conn BIGQUERY --src-stream segment.team_activity --tgt-conn SNOWFLAKE --tgt-object public.activity_teams --mode full-refresh
11:37AM INF connecting to source database (bigquery)
11:37AM INF connecting to target database (snowflake)
11:37AM INF reading from source database
11:37AM INF writing to target database [mode: full-refresh]
11:37AM INF streaming data
11:37AM INF dropped table public.activity_teams
11:38AM INF created table public.activity_teams
11:38AM INF inserted 77668 rows
11:38AM INF execution succeeded
Let's say I have a topic created via Kafka Streams from Confluent which contains messages serialized in Avro with io.confluent.kafka.streams.serdes.avro.SpecificAvroSerializer.
Then I create an external Kafka table in Hive:
CREATE EXTERNAL TABLE k_table
(`id` string , `sequence` int)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES
(
"kafka.topic" = "sample-topic",
"kafka.bootstrap.servers"="kafka1:9092",
"kafka.serde.class"="org.apache.hadoop.hive.serde2.avro.AvroSerDe",
"avro.schema.url"="Sample.avsc"
);
When I run the query:
select * from k_table WHERE `__timestamp` > 1000 * to_unix_timestamp(CURRENT_TIMESTAMP - interval '2' DAYS)
I got an unexpected IO error:
INFO : Executing command(queryId=root_20190205160129_4579b5ff-9a5c-496d-8d03-9a7ccc0f6d90): select * from k_tickets_prod2 WHERE `__timestamp` > 1000 * to_unix_timestamp(CURRENT_TIMESTAMP - interval '1' minute)
INFO : Completed executing command(queryId=root_20190205160129_4579b5ff-9a5c-496d-8d03-9a7ccc0f6d90); Time taken: 0.002 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
Error: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: 55 (state=,code=0)
Everything works fine with the Confluent Kafka consumer. I also tried setting the Confluent Kafka deserializer in TBLPROPERTIES, but it seems to have no effect.
Environment:
Hive 4.0 + Beeline 3.1.1 + Kafka 1.1 (Clients & Broker) + Confluent 4.1
The problem is that the Confluent producer serializes Avro messages in a custom format: <magic_byte 0x00><4 bytes of schema ID><regular Avro bytes for an object that conforms to the schema>. So the Hive Kafka handler has trouble deserializing them, because it uses the basic byte-array Kafka deserializer, and these 5 bytes at the beginning of the message are unexpected.
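To make that framing concrete, here is a small sketch in plain Python that splits a Confluent-framed value into its parts (the record bytes themselves are hypothetical):

import struct

def split_confluent_frame(value: bytes):
    """Split a Confluent-serialized Kafka value into (schema_id, avro_payload)."""
    if not value or value[0] != 0x00:
        raise ValueError("missing Confluent magic byte")
    schema_id = struct.unpack(">I", value[1:5])[0]  # 4-byte big-endian schema ID
    return schema_id, value[5:]  # the remaining bytes are the regular Avro payload

# Usage (record_value is a hypothetical raw Kafka message value):
# schema_id, avro_bytes = split_confluent_frame(record_value)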
I've created a bug in Hive to support the Confluent format and Schema Registry as well, and I've also made a PR with a quick fix that strips the 5 bytes from the message when the "avro.serde.magic.bytes"="true" property is set in TBLPROPERTIES.
After this patch it works like a charm.
I am using the Cloud Storage Text to BigQuery template on Cloud Composer.
The template is kicked off from the Python Google API client.
The same program:
works fine in the US location (for Dataflow and BigQuery).
fails in the asia-northeast1 location.
works fine with fewer (less than 10,000) input files in the asia-northeast1 location.
Does anybody have an idea about this?
I want to execute in the asia-northeast1 location for business reasons.
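For reference, launching the GCS_Text_to_BigQuery template from the Python Google API client with an explicit location could look roughly like this (the project, bucket and parameter values are placeholders, not the real ones used here):

from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")

# Placeholder project, bucket and template parameters.
request = dataflow.projects().locations().templates().launch(
    projectId="my-project",
    location="asia-northeast1",  # regional endpoint for the job
    gcsPath="gs://dataflow-templates/latest/GCS_Text_to_BigQuery",
    body={
        "jobName": "gcs-text-to-bq",
        "parameters": {
            "inputFilePattern": "gs://my-bucket/input/*.txt",
            "JSONPath": "gs://my-bucket/schema.json",
            "javascriptTextTransformGcsPath": "gs://my-bucket/transform.js",
            "javascriptTextTransformFunctionName": "transform",
            "outputTable": "my-project:my_dataset.my_table",
            "bigQueryLoadingTemporaryDirectory": "gs://my-bucket/tmp",
        },
    },
)
response = request.execute()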
More details about failure:
The program worked until "ReifyRenameInput", and then failed.
The Dataflow job failed with the error message below:
java.io.IOException: Unable to insert job: beam_load_textiotobigquerydataflow0releaser0806214711ca282fc3_8fca2422ccd74649b984a625f246295c_2a18c21953c26c4d4da2f8f0850da0d2_00000-0, aborting after 9 .
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.startJob(BigQueryServicesImpl.java:231)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.startJob(BigQueryServicesImpl.java:202)
at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$JobServiceImpl.startCopyJob(BigQueryServicesImpl.java:196)
at org.apache.beam.sdk.io.gcp.bigquery.WriteRename.copy(WriteRename.java:144)
at org.apache.beam.sdk.io.gcp.bigquery.WriteRename.writeRename(WriteRename.java:107)
at org.apache.beam.sdk.io.gcp.bigquery.WriteRename.processElement(WriteRename.java:80)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException:
404 Not Found { "code" : 404, "errors" : [ { "domain" : "global", "message" : "Not found: Dataset pj:datasetname", "reason" : "notFound" } ], "message" : "Not found: Dataset pj:datasetname" }
(pj and datasetname are not the real names; they are the project name and dataset name for the outputTable parameter.)
Although the error message says the dataset is not found, the dataset definitely exists.
Moreover, some new tables, which seem to be temporary tables, were created in the dataset after the program ran.
This is a known issue related to your Beam SDK version according to this public issue tracker. The Beam 2.5.0 SDK version doesn't have this issue.
Let's say I have a table with a single field named "version", which is a string. When I try to load data into the table using autodetect with values like "1.1" or "1", the autodetect feature infers these values as float or integer type respectively.
data1.json example:
{ "version": "1.11.0" }
bq load output:
$ bq load --autodetect --schema_update_option=ALLOW_FIELD_ADDITION --source_format=NEWLINE_DELIMITED_JSON temp_test.temp_table ./data1.json
Upload complete.
Waiting on bqjob_ZZZ ... (1s) Current status: DONE
data2.json example:
{ "version": "1.11" }
bq load output:
$ bq load --autodetect --schema_update_option=ALLOW_FIELD_ADDITION --source_format=NEWLINE_DELIMITED_JSON temp_test.temp_table ./data2.json
Upload complete.
Waiting on bqjob_ZZZ ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'YYY:bqjob_ZZZ': Invalid schema update. Field version has changed type from STRING to FLOAT
data3.json example:
{ "version": "1" }
bq load output:
$ bq load --autodetect --schema_update_option=ALLOW_FIELD_ADDITION --source_format=NEWLINE_DELIMITED_JSON temp_test.temp_table ./data3.json
Upload complete.
Waiting on bqjob_ZZZ ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'YYY:bqjob_ZZZ': Invalid schema update. Field version has changed type from STRING to INTEGER
The scenario where this problem doesn't happen is when, in the same file, you have another JSON record whose value is correctly inferred as a string (as seen in the Bigquery autoconverting fields in data question):
{ "version": "1.12" }
{ "version": "1.12.0" }
In the question listed above, there's an answer stating that a fix was pushed to production, but it looks like the bug is back again. Is there a way/workaround to prevent this?
It looks like the confusing part here is whether "1.12" should be detected as a string or a float. BigQuery chose to detect it as a float. Before autodetect was introduced in BigQuery, BigQuery allowed users to load float values in string format; this is very common in CSV/JSON. So when autodetect was introduced, it kept this behavior. Autodetect will scan up to 100 rows to detect the type. If for all 100 rows the data looks like "1.12", then very likely this field is a float value. If one of the rows has the value "1.12.0", then BigQuery will detect the type as string, as you have observed.
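As a possible workaround (a sketch, assuming you can drop autodetect for this load), supplying an explicit schema pins the column to STRING regardless of what the sampled values look like; here via the BigQuery Python client with placeholder names:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.temp_test.temp_table"  # placeholder table ID

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    schema=[bigquery.SchemaField("version", "STRING")],  # explicit type, no autodetect
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)

with open("data2.json", "rb") as f:
    load_job = client.load_table_from_file(f, table_id, job_config=job_config)
load_job.result()  # wait for the load job to complete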