I have an Avro schema with a namespace of "ca.gms.api-event-log". I have used this schema to serialize messages into Kafka, successfully registered that schema with the Kafka Schema Registry and am using a Kafka Connector to send that data to Amazon S3 as .avro files. So far, no issues.
I am now attempting to copy that data from AWS S3 to Azure using Azure Data Factory, and it's complaining about the following:
Failed to deserialize Avro source file 'topics/api-event-log/partition=0/api-event-log+0+0000000000.avro'. This could be caused by invalid Avro data. Check the data and try again. Namespace 'ca.gms.api-event-log' contains invalid characters. . Activity ID: 12a7dda0-8cb7-4c79-a070-d366fddb1c00
Does "ca.gms.api-event-log" really contain invalid characters? Are hyphens not allowed? The Apache Avro spec seems to indicate any valid JSON string should work: https://avro.apache.org/docs/current/spec.html
I've noticed that hyphens are not allowed in the Python avro client, but are fine in the Java API.
So it ultimately depends on the parser being used, but as a rule of thumb, follow the same naming rules as Java packages, where hyphens also aren't allowed.
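For reference, the Avro spec's "Names" section requires each dot-separated part of a name to start with [A-Za-z_] and then contain only [A-Za-z0-9_]. A minimal sketch of that check in plain Python (no Avro library needed):

import re

# Per the Avro spec's "Names" section, each dot-separated part of a namespace
# must start with [A-Za-z_] and contain only [A-Za-z0-9_]; strict parsers
# (such as the Python avro client, and apparently ADF) reject hyphens.
NAME_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def is_valid_namespace(namespace: str) -> bool:
    return all(NAME_RE.match(part) for part in namespace.split("."))

print(is_valid_namespace("ca.gms.api-event-log"))  # False -- "api-event-log" contains '-'
print(is_valid_namespace("ca.gms.api_event_log"))  # True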
Note: You should probably try using a Kafka Connector capable of writing to Azure directly rather than paying S3 storage + transfer fees. It's also not clear why the files are even being opened and their schemas checked if you are just copying raw files.
Related
I have a Flink streaming pipeline that reads messages from Kafka; each message contains an S3 path to a log file. Using Flink async I/O, I download the log file and parse and extract some key information from it. I now need to write this extracted data (a HashMap<String, String>) as a Parquet file back to another bucket in S3. How do I do it?
I have completed everything up to the transformation, using Flink 1.15. The Parquet writing part is unclear, and some of the methods seem to be deprecated.
You should be using the FileSink. There are some examples in the documentation, but here's an example that writes protobuf data in Parquet format:
// Bulk-encode ProtoRecord elements as Parquet files under outputBasePath
final FileSink<ProtoRecord> sink = FileSink
        .forBulkFormat(outputBasePath, ParquetProtoWriters.forType(ProtoRecord.class))
        .withRollingPolicy(
                OnCheckpointRollingPolicy.builder().build())
        .build();

// Attach the sink to the stream of extracted records
stream.sinkTo(sink);
Flink includes support for Protobuf and Avro. Otherwise you'll need to implement a ParquetWriterFactory with a custom implementation of the ParquetBuilder interface.
The OnCheckpointRollingPolicy is the default for bulk formats like Parquet. There's no need to specify that unless you go further and include some custom configuration -- but I added it to the example to illustrate how the pieces fit together.
I have a task in hand where I am supposed to create a Python-based HTTP API connector for Airbyte. The connector will return a response containing links to some zip files.
Each zip file contains a CSV file, which is supposed to be uploaded to BigQuery.
So far I have built the connector, and it returns the URL of the zip file.
The main question is how to send the underlying CSV file to BigQuery.
I can certainly unzip or even read the CSV file in the Python connector, but I am stuck on the part of sending it to BigQuery.
P.S. If you can also tell me about sending the CSV to Google Cloud Storage, that would be awesome too.
When you are building an Airbyte source connector with the CDK, your connector code must output records that will be sent to the destination, BigQuery in your case. This lets you decouple extraction logic from loading logic and makes your source connector destination-agnostic.
I'd suggest this high level logic in your source connector's implementation:
Call the source API to retrieve the zip's url
Download + unzip the zip
Parse the CSV file with Pandas
Output parsed records
This is under the assumption that all CSV files have the same schema. If not, you'll have to declare one stream per schema.
A great guide, with more details on how to develop a Python connector, is available here.
Once your source connector outputs AirbyteRecordMessages, you'll be able to connect it to BigQuery and choose the loading method that best fits your needs (Standard or GCS staging).
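As a rough sketch of steps 2-4 above (the helper name and wiring are illustrative, not part of the Airbyte CDK; it assumes requests and pandas are available -- plug the logic into your stream's read_records/parse_response method):

import io
import zipfile

import pandas as pd
import requests

def records_from_zip_url(zip_url: str):
    """Download a zip, read the CSV(s) inside, and yield one dict per row."""
    response = requests.get(zip_url)
    response.raise_for_status()
    with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
        for name in archive.namelist():
            if not name.endswith(".csv"):
                continue
            with archive.open(name) as csv_file:
                df = pd.read_csv(csv_file)
            # Each row becomes one record; Airbyte wraps these dicts in
            # AirbyteRecordMessages and hands them to the destination.
            for row in df.to_dict(orient="records"):
                yield row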
I have custom metadata properties on my s3 files such as:
x-amz-meta-custom1: "testvalue"
x-amz-meta-custom2: "whoohoo"
When these files are loaded into Snowflake, how do I access the custom properties associated with the files? Google and the Snowflake documentation haven't turned up any gems yet.
Based on the docs, I think the only metadata you can access via the stage is the filename and row number: https://docs.snowflake.com/en/user-guide/querying-metadata.html
You could possibly write something custom that picks up the S3 metadata, writes out the S3 filename along with that metadata, and then ingests it back into another Snowflake table.
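A rough sketch of that custom step, assuming boto3 with configured AWS credentials; the bucket/prefix and the custom1/custom2 keys are just the example values from the question:

import csv

import boto3

def dump_s3_metadata(bucket: str, prefix: str, out_path: str) -> None:
    """Write filename + custom metadata rows to a CSV you can load into a
    separate Snowflake table and join on METADATA$FILENAME."""
    s3 = boto3.client("s3")
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "custom1", "custom2"])
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                head = s3.head_object(Bucket=bucket, Key=obj["Key"])
                meta = head.get("Metadata", {})  # user metadata, x-amz-meta- prefix stripped
                writer.writerow([obj["Key"], meta.get("custom1"), meta.get("custom2")])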
According to the BigQuery federated source documentation:
[...]or are compressed must be less than 1 GB each.
This would imply that compressed files are supported types for federated sources in BigQuery.
However, I get an error when trying to query a gz file in GCS.
I tested with an uncompressed file and it works fine. Are compressed files supported as federated sources in BigQuery, or have I misinterpreted the documentation?
Compression mode defaults to NONE and needs to be explicitly specified in the external table definition.
At the time of the question, this couldn't be done through the UI. This has since been fixed, and compressed data should be detected automatically.
For more background information, see:
https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.query
The interesting parameter is "configuration.query.tableDefinitions.[key].compression".
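For example, with the google-cloud-bigquery Python client the same setting is exposed as ExternalConfig.compression; the bucket, project, and table names below are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# External (federated) table over a gzipped CSV in GCS.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/data.csv.gz"]
external_config.compression = "GZIP"  # defaults to NONE if omitted

table = bigquery.Table("my-project.my_dataset.my_external_table")
table.external_data_configuration = external_config
client.create_table(table)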
Whenever I try to load a CSV file stored in Cloud Storage into BigQuery, I get an InternalError (both using the web interface and the command line). The CSV is (an abbreviated) part of the Google Ngram dataset.
A command like:
bq load 1grams.ngrams gs://otichybucket/import_test.csv word:STRING,year:INTEGER,freq:INTEGER,volume:INTEGER
gives me:
BigQuery error in load operation: Error processing job 'otichyproject1:bqjob_r28187461b449065a_000001504e747a35_1': An internal error occurred and the request could not be completed.
However, when I load this file directly using the web interface and the File upload as a source (loading from my local drive), it works.
I need to load from Cloud Storage, since I need to load much larger files (original ngrams datasets).
I tried different files; the result is always the same.
I'm an engineer on the BigQuery team. I was able to look up your job, and it looks like there was a problem reading the Google Cloud Storage object.
Unfortunately, we didn't log much of the context, but looking at the code, the things that could cause this are:
The URI you specified for the job is somehow malformed. It doesn't look malformed, but maybe there is some odd UTF-8 non-printing character that I didn't notice.
The 'region' for your bucket is somehow unexpected. Is there any chance you've set the data location on your GCS bucket to something other than {US, EU, or ASIA}? See here for more info on bucket locations. If so, and you've set the location to a region rather than a continent, that could cause this error (a quick way to check is sketched after this list).
There could have been some internal error in GCS that caused this. However, I didn't see this in any of the logs, and it should be fairly rare.
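For what it's worth, you can check the bucket's location yourself; a sketch using the google-cloud-storage Python client, assuming default credentials (bucket name taken from the bq command above):

from google.cloud import storage

# Multi-region values look like US/EU/ASIA; a single region (e.g. us-central1)
# would hit the cross-region limitation described above.
client = storage.Client()
bucket = client.get_bucket("otichybucket")
print(bucket.location)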
We're putting in some more logging to detect this in the future and to fix the issue with regional buckets (regional buckets may still fail, because BigQuery doesn't support cross-region data movement, but at least they will fail with an intelligible error).