Azure Data Factory V2 - Compression type (enum) as parameter

I have a pipeline to copy blobs between storages.
I'd like to have Compression Type and Compression Level in the Sink Dataset (storage) set up as parameters.
When I edit the dataset as JSON, I'm able to use expressions and parameters, but there's no way to switch compression off (None), which would be represented as an empty object.
Is there a way to achieve this?
Thanks!

OK, I finally figured it out: you have to edit the JSON of the target linked storage service directly, and the expression I used is:
"compression": {
"type": "Expression",
"value": "#json(if(or(equals(dataset().TargetCompressionType, 'None'),equals(dataset().TargetCompressionType, '')),'{}',concat('{\"type\":\"', dataset().TargetCompressionType, '\", \"level\":\"', dataset().TargetCompressionLevel, '\"}')))"
}
So when TargetCompressionType is set to an empty string or "None", the result is an empty object and compression is not used; when it is set to "ZipDeflate" or another supported type, the files from the source are compressed.

Related

Apache Beam Java 2.26.0: BigQueryIO 'No rows present in the request'

Since the Beam 2.26.0 update we have run into errors in our Java SDK streaming data pipelines. We have been investigating the issue for quite some time now but are unable to track down the root cause. When we downgrade to 2.25.0, the pipeline works as expected.
Our pipelines are responsible for ingestion, i.e., they consume from Pub/Sub and ingest into BigQuery. Specifically, we use the PubSubIO source and the BigQueryIO sink (streaming mode). When running the pipeline, we encounter the following error:
{
    "code" : 400,
    "errors" : [ {
        "domain" : "global",
        "message" : "No rows present in the request.",
        "reason" : "invalid"
    } ],
    "message" : "No rows present in the request.",
    "status" : "INVALID_ARGUMENT"
}
Our initial guess was that the pipeline's logic was somehow bugged, causing the BigQueryIO sink to fail. After investigation, we concluded that the PCollection feeding the sink does indeed contain correct data.
Earlier today I was looking in the changelog and noticed that the BigQueryIO sink received numerous updates. I was specifically worried about the following changes:
BigQuery’s DATETIME type now maps to Beam logical type org.apache.beam.sdk.schemas.logicaltypes.SqlTypes.DATETIME
Java BigQuery streaming inserts now have timeouts enabled by default. Pass --HTTPWriteTimeout=0 to revert to the old behavior
With respect to the first change, I made sure to remove all DATETIME values from the resulting TableRow objects. Even in that scenario, the error persists.
For the second change, I'm unsure how to pass the --HTTPWriteTimeout=0 flag to the pipeline. How is this best achieved?
Any other suggestions as to the root cause of this issue?
Thanks in advance!
We have finally been able to fix this issue and rest assured it has been a hell of a ride. We basically debugged the entire BigQueryIO connector and came to the following conclusions:
The TableRow objects being forwarded to BigQuery contained enum values. Because these are not serializable, an empty payload was forwarded to BigQuery. In my opinion, this error should be made more explicit (and why was this suddenly changed anyway?).
The issue was solved by adding the @Value annotation (com.google.api.client.util.Value) to each enum entry.
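For reference, here is a minimal sketch of what that fix looks like; the enum name and values are hypothetical stand-ins, not the ones from our actual pipeline:
import com.google.api.client.util.Value;

// Hypothetical enum that ends up inside a TableRow. Annotating each constant
// with @Value tells the google-http-client serializer which string to emit,
// instead of the row silently serializing to an empty payload.
public enum OrderStatus {
    @Value("OPEN") OPEN,
    @Value("CLOSED") CLOSED,
    @Value("CANCELLED") CANCELLED
}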
The same TableRow objects also contained values of type byte[]. These values were written to a BigQuery column of type BYTES. While this previously worked without explicitly base64-encoding the data, it now yields errors.
The issue was solved by base64-encoding the values ourselves (this setup is also discussed in the following post).
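As a rough illustration of that workaround (the field and column names here are made up):
import com.google.api.services.bigquery.model.TableRow;
import java.util.Base64;

// Hypothetical conversion step: base64-encode the raw bytes ourselves before
// handing the row to BigQueryIO, so the BYTES column receives a valid string.
static TableRow toTableRow(String id, byte[] rawPayload) {
    return new TableRow()
        .set("id", id)
        .set("payload", Base64.getEncoder().encodeToString(rawPayload));
}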
--HTTPWriteTimeout is a pipeline option. You can set it the same way you set the runner, etc. (typically on the command line).
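A sketch of what that looks like, assuming the BigQuery IO module is on the classpath so the flag is recognized by the options parser:
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class IngestionPipeline {
    public static void main(String[] args) {
        // Invoked e.g. with: --runner=DataflowRunner --HTTPWriteTimeout=0 ...
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .create();
        // ... build and run the pipeline with these options
    }
}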

How to define TransferConfig for BigQuery Data Transfer Service

The source projectId and source datasetId are supposed to be defined in the params, but I am not sure how to set the params properly.
TransferConfig transferConfig = TransferConfig.newBuilder()
    .setDisplayName(jobName)
    .setDestinationDatasetId(dstDatasetId)
    .setParams(Struct.newBuilder().build())
    .build();
The corresponding CLI command looks like this:
bq mk --transfer_config \
    --data_source="cross_region_copy" \
    --display_name=copy-cli-display-name \
    --project_id=play \
    --target_dataset=copy_dataset_cli \
    --params='{"source_project_id": "tough-talent", "source_dataset_id": "billing"}'
Actually, Google BigQuery Data Transfer Service provides a number of client libraries that offer a flexible way to build data transfer capabilities on top of the BigQuery Data Transfer API.
Assuming that your aim is to create the transfer via the compatible BigQuery Data Transfer Java client library, you are looking for the TransferConfig.Builder class in order to create a data transfer configuration. This class contains dedicated methods for declaring the destination BigQuery location, while the source transfer settings are passed through the setParams() method, which takes a Struct (from the google.protobuf package) for propagating key-value structured parameters.
Since a Struct "represents a structured data value, consisting of fields which map to dynamically typed values", you can use either the putAllFields() or putFields() method to map key-value parameters when building with Struct.newBuilder(), as demonstrated in this example:
Struct struct1 = Struct.newBuilder()
    .putFields("some-key", Value.newBuilder().setStringValue("some-value").build())
    .build();
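Putting it together for the cross-region copy from the question, a sketch could look like the following (I have not run this exact snippet; the parameter values simply mirror the --params of the bq CLI command above):
import com.google.cloud.bigquery.datatransfer.v1.TransferConfig;
import com.google.protobuf.Struct;
import com.google.protobuf.Value;

// Source settings, mirroring --params of the bq CLI command.
Struct params = Struct.newBuilder()
    .putFields("source_project_id", Value.newBuilder().setStringValue("tough-talent").build())
    .putFields("source_dataset_id", Value.newBuilder().setStringValue("billing").build())
    .build();

TransferConfig transferConfig = TransferConfig.newBuilder()
    .setDisplayName(jobName)
    .setDataSourceId("cross_region_copy")    // mirrors --data_source
    .setDestinationDatasetId(dstDatasetId)
    .setParams(params)
    .build();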
Find more related examples here.

How to get a RAW16 from CX3

This is my data flow for my system:
Because I could not find a demo that configures RAW16, and the enum type CyU3PMipicsiDataFormat_t does not contain a RAW16 type, I do not know how to transfer my RAW16 data to the host.
I tried using the YUV422 configuration to transfer my raw data to the host, and I do receive data from the CX3 in e-CAM, but the image is wrong because e-CAM uses the YUV2 format to interpret the raw data. Now I think I can use MATLAB to grab a frame and process it. But when I grab a snapshot in MATLAB, the data has the shape 1280*800*3 (full frame size: 1280x800). Does MATLAB treat it as YUV data? How can I configure the CX3 to support RAW16, or how should I process the data I grab from the CX3 over the YUV format transfer?
Has any other developer run into the same requirement?

JClouds S3: Specify Content Length when uploading file

I wrote an application using JClouds 1.6.2 and had file upload code like
java.io.File file = ...
blobStore.putBlob(containerName,
    blobStore.blobBuilder(name)
        .payload(file)
        .calculateMD5()
        .build()
);
This worked perfectly well.
Now, in jclouds 1.7, BlobBuilder.calculateMD5() is deprecated. Furthermore, even when calculating the MD5 hash manually (using Guava's Hashing) and passing it with BlobBuilder.contentMD5(), I get the following error:
java.lang.IllegalArgumentException: contentLength must be set, streaming not supported
So obviously, I also have to set the content length.
What is the easiest way to calculate the correct content length?
I don't really think jclouds suddenly removed these features just to make uploading files much more difficult. Is there a way to let jclouds calculate the MD5 and/or content length?
You should work with ByteSource, which offers several helper methods:
// ByteSource, Files, and Hashing come from Guava (com.google.common.io / com.google.common.hash).
ByteSource byteSource = Files.asByteSource(new File(...));
Blob blob = blobStore.blobBuilder(name)
    .payload(byteSource)
    .contentLength(byteSource.size())
    .contentMD5(byteSource.hash(Hashing.md5()).asBytes())
    .build();
blobStore.putBlob(containerName, blob);
jclouds made these changes to remove functionality duplicated by Guava and to make the costs of some operations, e.g., hashing, more obvious.

Cannot view document in RavenDB Studio

When I try to view my document I get this error:
Client side exception:
System.InvalidOperationException: Document's property: "DocumentData" is too long to view in the studio (property length: 699.608, max allowed length: 500.000)
   at Raven.Studio.Models.EditableDocumentModel.AssertNoPropertyBeyondSize(RavenJToken token, Int32 maxSize, String path)
   at Raven.Studio.Models.EditableDocumentModel.AssertNoPropertyBeyondSize(RavenJToken token, Int32 maxSize, String path)
   at Raven.Studio.Models.EditableDocumentModel.<LoadModelParameters>b__2a(DocumentAndNavigationInfo result)
   at Raven.Studio.Infrastructure.InvocationExtensions.<>c__DisplayClass17`1.<>c__DisplayClass19.<ContinueOnSuccessInTheUIThread>b__16()
   at AsyncCompatLibExtensions.<>c__DisplayClass55.<InvokeAsync>b__54()
I am saving a PDF in that field.
I want to be able to edit the other fields.
Is it possible for the studio to ignore the field that is too big?
Thanks!
Don't save large binary (or base64-encoded) data in the JSON document. That's a poor use of the database. Instead, you should consider one of these two options:
Option 1
Write the binary data to disk (or cloud storage) yourself.
Save a file path (or URL) to it in your document (a small sketch of this approach follows at the end of this answer).
Option 2
Use Raven's attachments feature. This is a separate area in the database meant specifically for storing binary files.
The advantage is that your binary documents are included in database backups, and if you like you can take advantage of features like my Indexed Attachments Bundle, or write your own custom bundles that use attachment triggers.
The disadvantage is that your database can grow very large. For this reason, many prefer Option 1.
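If you go with Option 1, here is a minimal sketch (shown in Java; the class, field names, and paths are purely illustrative, and persisting the document itself is done with your RavenDB client session as usual):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative document class: it stores only the path to the PDF, not its bytes.
public class Invoice {
    public String id;
    public String customerName;
    public String documentDataPath; // path (or URL) of the PDF on disk or in cloud storage

    // Illustrative helper: write the PDF to disk and keep only its path in the document.
    public static Invoice create(String id, String customerName, byte[] pdfBytes) throws IOException {
        Path pdfPath = Paths.get("pdf-store", id + ".pdf");
        Files.createDirectories(pdfPath.getParent());
        Files.write(pdfPath, pdfBytes);

        Invoice invoice = new Invoice();
        invoice.id = id;
        invoice.customerName = customerName;
        invoice.documentDataPath = pdfPath.toString();
        return invoice; // then store this document with your session and save changes
    }
}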