Data Factory V2 - Wildcards - azure-data-lake

I am trying to move & decompress data from Azure Data Lake Storage Gen1.
I have a couple of files with the ".tsv.gz" extension, and I want to decompress and move them to a different folder in the same data lake.
I've tried using the wildcard "*.tsv.gz" in the connection configuration so I can process them all in one go.
Am I doing something wrong?
Thanks

Just tested it. You should use:
*.tsv.gz
without quotes (no ' or ").
Hope this helps!
PS: also remember to tick the "Copy file recursively" option when you select the dataset in the pipeline.

Related

How to concatenate the OutputPathPlaceholder with a string in Kubeflow pipelines?

I am using Kubeflow Pipelines (KFP) with GCP Vertex AI Pipelines, with kfp==1.8.5 (kfp SDK) and google-cloud-pipeline-components==0.1.7. I'm not sure how to find out which version of Kubeflow is used on GCP.
I am building a component (YAML) using Python, inspired by this GitHub issue. I am defining an output like:
outputs=[(OutputSpec(name='drt_model', type='Model'))]
This will be a base output directory on Cloud Storage for a few artifacts, such as model checkpoints and the model itself.
I would like to keep one base output directory but add subdirectories depending on the artifact:
<output_dir_base>/model
<output_dir_base>/checkpoints
<output_dir_base>/tensorboard
but I couldn't find a way to concatenate OutputPathPlaceholder('drt_model') with a string like '/model'.
How can I append extra folder structure like /model or /tensorboard to the OutputPathPlaceholder that KFP sets at run time?
I didn't realize at first that ConcatPlaceholder accepts both an Artifact and a string. This is exactly what I wanted to achieve:
ConcatPlaceholder([OutputPathPlaceholder('drt_model'), '/model'])
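For reference, here is a minimal sketch of how that placeholder can be wired into a v1 container ComponentSpec; the image name, entry point, and flag names below are hypothetical, not taken from the original component:
from kfp.components import structures

component_spec = structures.ComponentSpec(
    name='train-drt',
    outputs=[structures.OutputSpec(name='drt_model', type='Model')],
    implementation=structures.ContainerImplementation(
        container=structures.ContainerSpec(
            image='gcr.io/my-project/trainer:latest',  # hypothetical training image
            command=[
                'python', '-m', 'trainer.task',  # hypothetical entry point
                # each flag receives the base output path plus a fixed subdirectory
                '--model-dir', structures.ConcatPlaceholder([structures.OutputPathPlaceholder('drt_model'), '/model']),
                '--checkpoint-dir', structures.ConcatPlaceholder([structures.OutputPathPlaceholder('drt_model'), '/checkpoints']),
                '--tensorboard-dir', structures.ConcatPlaceholder([structures.OutputPathPlaceholder('drt_model'), '/tensorboard']),
            ],
        )
    ),
)
print(component_spec.to_dict())  # serialize this dict to YAML to get the component file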

From S3 to Kafka using Apache Camel Source

I want to read data from Amazon S3 into Kafka. I found the camel-aws-s3-kafka-connector source and tried it, and it works, but I want to read data from S3 without deleting the files, exactly once for each consumer and without duplicates. Is it possible to do this using only the configuration file? I've already created a file which looks like this:
name=CamelSourceConnector
connector.class=org.apache.camel.kafkaconnector.awss3.CamelAwss3SourceConnector
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.camel.kafkaconnector.awss3.converters.S3ObjectConverter
camel.source.maxPollDuration=10000
topics=ReadTopic
#prefix=WriteTopic
camel.source.endpoint.prefix=full/path/to/WriteTopic2
camel.source.path.bucketNameOrArn=BucketName
camel.source.endpoint.autocloseBody=false
camel.source.endpoint.deleteAfterRead=false
camel.sink.endpoint.region=xxxx
camel.component.aws-s3.accessKey=xxxx
camel.component.aws-s3.secretKey=xxxx
Additionally, with the configuration above I am not able to read only from "WriteTopic"; it reads from all folders in S3. Is it possible to configure that as well?
(Screenshot: S3 bucket folders with files)
I found a workaround for the duplicates problem. I'm not completely sure it is the best possible way, but it may help somebody. The approach is described here: https://camel.apache.org/blog/2020/12/CKC-idempotency-070/. I used camel.idempotency.repository.type=memory, and my configuration file looks like this:
name=CamelAWS2S3SourceConnector
connector.class=org.apache.camel.kafkaconnector.aws2s3.CamelAws2s3SourceConnector
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
camel.source.maxPollDuration=10000
topics=ReadTopic
# path from which we read the data
camel.source.endpoint.prefix=full/path/to/topic/prefix
camel.source.path.bucketNameOrArn="Bucket name"
camel.source.endpoint.deleteAfterRead=false
camel.component.aws2-s3.access-key=****
camel.component.aws2-s3.secret-key=****
camel.component.aws2-s3.region=****
#remove duplicates from messages#
camel.idempotency.enabled=true
camel.idempotency.repository.type=memory
camel.idempotency.expression.type=body
It is also important that I changed the Camel connector library. Initially I used the camel-aws-s3-kafka-connector source; to use the Idempotent Consumer I had to switch to the camel-aws2-s3-kafka-connector source.

Qlik Sense: how to specify path in Google Drive?

I have a Google drive account divided into some folders (say, Folder1, Folder2, etc.), with some subfolders in it.
I successfully managed to connect my Qlik Sense app to it.
I need to make it look for files only in a given subfolder.
At the moment, I read it as follows ([...] is the location):
(URL IS [[...]connectorID=GoogleDriveConnector&table=ListSpreadsheets&appID=], qvx);
It works and reloads successfully, but I need it to filter the Spreadsheets properly. How could I get what I need?
To connect to Google Drive you in fact use a web connector. Once the web connector is installed, it can be run as a service or started manually from its folder.
Once it is installed (a recent version can be downloaded from https://qliksupport.force.com/apex/QS_Home_Page, but it seems you already have it, since Google Drive is part of it), it is much easier to configure connections to online drives there.
You just go to http://localhost:5555/web and generate ready-made code.
In my implementation I used the following options, step by step, to get the data I wanted:
1) CanAuthenticate to generate a permanent token
2) ListSpreadsheets
3) ListWorksheets
4) GetWorksheet
You can't just specify a path, but it is possible to retrieve the path from the QWC services. Use an algorithm like this:
Use tables like ListFiles/ListWorksheets.
Iterate through every row with a FOR loop:
FOR i=0 to (NoOfRows('Google_ListWorksheets')-1);
Let vWorksheetKey = Peek('worksheetKey', $(i), 'Google_ListWorksheets');
Let vTitle = left(Peek('title', $(i), 'Google_ListWorksheets'),3);
Using an IF statement, find the desired folder id/worksheet key by its name (stored in the vTitle variable) and use it:
load * FROM [$(vQwcConnectionName)]
(URL IS [http://localhost:5555/data?connectorID=GoogleDriveConnector&table=GetWorksheet&worksheetKey=$(vWorksheetKey)&appID=], qvx);
In the end you will get your files by their location.

Saving RDD to file results in _temporary path for parts

I have data in Spark which I want to save to S3. The recommended method is to use the saveAsTextFile method on the RDD, which succeeds. I expect the data to be saved as 'parts'.
My problem is that when I go to S3 to look at my data, it has been saved in a folder named _temporary, with a subfolder 0, and then each part or task saved in its own folder.
For example,
data.saveAsTextFile("s3://kirk/data");
results in files like
s3://kirk/data/_SUCCESS
s3://kirk/data/_temporary/0/_temporary_$folder$
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00000_$folder$
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00000/part-00000
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00001_$folder$
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00001/part-00001
and so on. I would expect and have seen something like
s3://kirk/data/_SUCCESS
s3://kirk/data/part-00000
s3://kirk/data/part-00001
Is this a configuration setting, or do I need to 'commit' the save to resolve the temporary files?
I had the same problem with Spark Streaming. That was because my Spark master was set up with conf.setMaster("local") instead of conf.setMaster("local[*]").
Without the [*], Spark can't execute saveAsTextFile during the stream.
Try using coalesce() to reduce the RDD to 1 partition before you export.
Good luck!
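For illustration, a rough PySpark sketch of those two suggestions (the app name and paths are placeholders):
from pyspark import SparkConf, SparkContext

# use all local cores rather than a single one ("local[*]" instead of "local")
conf = SparkConf().setAppName("export-to-s3").setMaster("local[*]")
sc = SparkContext(conf=conf)

data = sc.textFile("s3://kirk/input")  # hypothetical input path
# collapse to a single partition so the export produces one part file;
# a committed job leaves part-* files plus _SUCCESS, not a _temporary folder
data.coalesce(1).saveAsTextFile("s3://kirk/data")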

Google BigQuery: export table to own bucket results in unexpected error

I'm stuck trying to export a table to my Google Cloud Storage bucket.
Example job id: job_0463426872a645bea8157604780d060d
I tried the Cloud Storage target with a lot of different variations; all produce the same error. If I try to copy the natality report, it works.
What am I doing wrong?
Thanks!
Daniel
It looks like the error says:
"Table too large to be exported to a single file. Specify a uri including a * to shard export." Try switching the destination URI to something like gs://foo/bar/baz*
Specify the file extension along with the pattern. For example:
gs://foo/bar/baz*.gz in the case of GZIP (compressed)
gs://foo/bar/baz*.csv in the case of CSV (uncompressed)
Here foo is the bucket name, and the bar directory can be, for example, a date in string format generated on the fly.
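If you are scripting the export, here is a rough sketch of the same sharded export using the google-cloud-bigquery Python client (the project, dataset, table, and bucket names are placeholders):
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.CSV,
    compression=bigquery.Compression.GZIP,  # produces compressed shards
)
extract_job = client.extract_table(
    "myproject.mydataset.mytable",   # placeholder table
    "gs://foo/bar/baz-*.csv.gz",     # the * lets BigQuery shard the export
    job_config=job_config,
)
extract_job.result()  # wait for the export job to finish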
I was able to do it with:
bq extract --destination_format=NEWLINE_DELIMITED_JSON myproject:mydataset.mypartition gs://mybucket/mydataset/mypartition/{*}.json