Google BigQuery: exporting a table to my own bucket results in an unexpected error

I'm stuck trying to export a table to my Google Cloud Storage bucket.
Example job id: job_0463426872a645bea8157604780d060d
I tried the Cloud Storage target with a lot of different variations, and all of them produce the same error. If I try to copy the natality report, it works.
What am I doing wrong?
Thanks!
Daniel

It looks like the error says:
"Table too large to be exported to a single file. Specify a uri including a * to shard export." Try switching the destination URI to something like gs://foo/bar/baz*

Specify the file extension along with the pattern. For example:
gs://foo/bar/baz*.gz for GZIP (compressed)
gs://foo/bar/baz*.csv for CSV (uncompressed)
Here foo is the bucket name, and bar can be a folder named after the date in string format, which could be generated on the fly.
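As a rough illustration of generating such a URI on the fly, here is a minimal Python sketch. The helper name build_export_uri and its parameters are hypothetical (they are not part of any BigQuery API); it only assembles the string you would pass as the export destination.

```python
from datetime import date


def build_export_uri(bucket: str, name: str, day: date, extension: str) -> str:
    """Build a sharded export URI such as gs://foo/2024-01-31/baz*.csv."""
    if "*" in name:
        raise ValueError("pass the base name without a wildcard")
    folder = day.isoformat()      # date folder generated on the fly
    ext = extension.lstrip(".")   # accept "csv" as well as ".csv"
    # The single * marks where the shard number is substituted by the export.
    return f"gs://{bucket}/{folder}/{name}*.{ext}"


uri = build_export_uri("foo", "baz", date(2024, 1, 31), "csv")
print(uri)
```

The resulting string can then be used as the destination URI of the export job in place of a single-file path.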

I was able to do it with:
bq extract --destination_format=NEWLINE_DELIMITED_JSON myproject:mydataset.mypartition gs://mybucket/mydataset/mypartition/{*}.json

From S3 to Kafka using Apache Camel Source

I want to read data from Amazon S3 into Kafka. I found the camel-aws-s3-kafka-connector source, tried it, and it works, but I want to read data from S3 without deleting files, exactly once for each consumer and without duplicates. Is it possible to do this using only the configuration file? I've already created a file which looks like this:
name=CamelSourceConnector
connector.class=org.apache.camel.kafkaconnector.awss3.CamelAwss3SourceConnector
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.camel.kafkaconnector.awss3.converters.S3ObjectConverter
camel.source.maxPollDuration=10000
topics=ReadTopic
#prefix=WriteTopic
camel.source.endpoint.prefix=full/path/to/WriteTopic2
camel.source.path.bucketNameOrArn=BucketName
camel.source.endpoint.autocloseBody=false
camel.source.endpoint.deleteAfterRead=false
camel.sink.endpoint.region=xxxx
camel.component.aws-s3.accessKey=xxxx
camel.component.aws-s3.secretKey=xxxx
Additionally, with the configuration above I am not able to read only from "WriteTopic"; the connector reads from all folders in S3. Is it also possible to configure this?
[Image: S3 bucket folders with files]
I found a workaround for the duplicates problem. I'm not completely sure it is the best possible way, but it may help somebody. My approach is described here: https://camel.apache.org/blog/2020/12/CKC-idempotency-070/ . I used camel.idempotency.repository.type=memory, and my configuration file looks like this:
name=CamelAWS2S3SourceConnector
connector.class=org.apache.camel.kafkaconnector.aws2s3.CamelAws2s3SourceConnector
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
camel.source.maxPollDuration=10000
topics=ReadTopic
# path from which we read the data
camel.source.endpoint.prefix=full/path/to/topic/prefix
camel.source.path.bucketNameOrArn="Bucket name"
camel.source.endpoint.deleteAfterRead=false
camel.component.aws2-s3.access-key=****
camel.component.aws2-s3.secret-key=****
camel.component.aws2-s3.region=****
#remove duplicates from messages#
camel.idempotency.enabled=true
camel.idempotency.repository.type=memory
camel.idempotency.expression.type=body
It is also important that I changed the Camel connector library. Initially I used the camel-aws-s3-kafka-connector source; to use the Idempotent Consumer I had to switch to the camel-aws2-s3-kafka-connector source.
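To make the idea behind the idempotency settings a bit more concrete, here is a small conceptual sketch in Python. It is not Camel's implementation; idempotent_filter and the sample bodies are made up for illustration. It only shows what "in-memory repository keyed on the message body" means: a body that has already been seen is skipped.

```python
from typing import Iterable, Iterator


def idempotent_filter(bodies: Iterable[bytes]) -> Iterator[bytes]:
    """Yield each distinct body once, skipping bodies already seen."""
    seen = set()  # in-memory repository: its contents are lost on restart
    for body in bodies:
        if body in seen:
            continue  # duplicate, e.g. the same S3 object polled again
        seen.add(body)
        yield body


# With deleteAfterRead=false the same objects show up on every poll.
polled = [b"file-a", b"file-b", b"file-a", b"file-c", b"file-b"]
unique = list(idempotent_filter(polled))
print(unique)
```

Because the set lives only in memory, a restarted process starts with an empty repository and would forward the same bodies again, which is worth keeping in mind when choosing the memory repository type.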

pyspark.sql.utils.IllegalArgumentException: requirement failed: Temporary GCS path has not been set

On Google Cloud Platform, I am trying to submit a pyspark job that writes a dataframe to BigQuery.
The code that performs the write is as follows:
finalDF.write.format("bigquery")\
.mode('overwrite')\
.option("table","[PROJECT_ID].dataset.table")\
.save()
I get the error mentioned in the title. How can I set the GCS temporary path?
As the GitHub repository of spark-bigquery-connector states, you can specify it when writing:
df.write \
.format("bigquery") \
.option("temporaryGcsBucket","some-bucket") \
.save("dataset.table")
Or in a global manner:
spark.conf.set("temporaryGcsBucket","some-bucket")
The "temporaryGcsBucket" property needs to be set either when writing the DataFrame or when creating the SparkSession:
.option("temporaryGcsBucket","some-bucket")
or
.option("temporaryGcsBucket","some-bucket/optional_path")
For example:
finalDF.write.format("bigquery")\
.mode('overwrite')\
.option("temporaryGcsBucket","some-bucket")\
.option("table","[PROJECT_ID].dataset.table")\
.save()
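If you would rather fail early than hit the error at save time, one option is to collect the write options in one place and check them before calling the writer. This is only a sketch: bigquery_write_options is a hypothetical helper, and the Spark call is left as a comment because it needs a live SparkSession with the connector available.

```python
def bigquery_write_options(table: str, temporary_gcs_bucket: str) -> dict:
    """Collect the connector options, refusing an empty temporary bucket."""
    if not temporary_gcs_bucket:
        raise ValueError("temporaryGcsBucket must be set for this write")
    return {"table": table, "temporaryGcsBucket": temporary_gcs_bucket}


opts = bigquery_write_options("[PROJECT_ID].dataset.table", "some-bucket")
print(opts)

# With a running SparkSession (not executed here), the dict can be passed
# to the writer through DataFrameWriter.options:
# finalDF.write.format("bigquery").mode("overwrite").options(**opts).save()
```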

Data Factory V2 - Wildcards

I am trying to move & decompress data from Azure Data Lake Storage Gen1.
I have a couple of files with ".tsv.gz" extension, and I want to decompress and move them to a different folder, which is in the same data lake.
I've tried to use the wildcard "*.tsv.gz" inside the connection configuration so that I can process all the files at once.
Am I making some mistake?
Thanks
I just tested it. You should use:
*.tsv.gz
without single or double quotes.
Hope this helps!
PS: Also, remember to check "Copy file recursively" when you select the dataset in the pipeline.
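To see why the quotes matter, here is a quick analogy using Python's fnmatch module. Data Factory has its own wildcard matching, so this is only an illustration of the general effect: once the quote characters are part of the pattern, they have to appear in the file name too, and nothing matches. The file names are made up.

```python
from fnmatch import fnmatchcase

files = ["a.tsv.gz", "b.tsv.gz", "notes.txt", "c.tsv"]

# Pattern typed without quotes: matches the compressed TSV files.
unquoted = [f for f in files if fnmatchcase(f, "*.tsv.gz")]

# Pattern typed with literal double quotes: the quotes become part of
# the pattern, so no real file name can match.
quoted = [f for f in files if fnmatchcase(f, '"*.tsv.gz"')]

print(unquoted)
print(quoted)
```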

Federated table/query not working - "Cannot read in location: us-west1"

I have a GCS bucket in US-WEST1:
That bucket has two files:
wiki_1b_000000000000.csv.gz
wiki_1b_000000000001.csv.gz
I've created an external table definition to read those files like so:
The dataset where this external table definition exists is also in the US.
When I query it with:
SELECT
*
FROM
`grey-sort-challenge.bigtable.federated`
LIMIT
100
I get the following error:
Error: Cannot read in location: us-west1
I tested with asia-northeast1 and it works fine.
Why isn't this working for the US region?
I faced the same issue earlier. See G's answer (you must use us-central1 for now): https://issuetracker.google.com/issues/76127552#comment11
For people from Europe:
If you get the error "Cannot read in location: EU" while trying to read from an external source (a regional GCS bucket), you have to place your data in the europe-west1 region, as per the same comment. Unfortunately, this is not reflected in the documentation yet.
I wanted to create a federated (external) table to continually load data from a new CSV file that was imported each day.
In attempting to do so, I was getting "Error: Cannot read in location: xxxx".
I solved the problem as follows:
I created a NEW bucket, this time selecting US (multiple regions).
I then went back to BigQuery and created a NEW dataset with the data location set to United States (US).
Presto! I am now able to query a (constantly updating) external table!
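The combinations reported in this thread can be written down as a small lookup, which may be handy as a pre-flight check. Treat it as a sketch: the table only encodes what the answers above say worked at the time (it may well be outdated), and REPORTED_WORKING and reported_to_work are names made up for this example.

```python
# Dataset location -> bucket locations reported to work in this thread.
# Assumption: this reflects only the answers above, not current documentation.
REPORTED_WORKING = {
    "US": {"US", "US-CENTRAL1"},
    "EU": {"EUROPE-WEST1"},
}


def reported_to_work(dataset_location: str, bucket_location: str) -> bool:
    """Return True if this pairing was reported as working in the thread."""
    allowed = REPORTED_WORKING.get(dataset_location.upper(), set())
    return bucket_location.upper() in allowed


print(reported_to_work("US", "us-west1"))
print(reported_to_work("US", "us-central1"))
print(reported_to_work("EU", "europe-west1"))
```

A False result only means the pairing was not reported here as working; it does not prove that the pairing fails.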

Saving RDD to file results in _temporary path for parts

I have data in Spark which I want to save to S3. The recommended way to save it is the saveAsTextFile method on the RDD, and the call succeeds. I expect the data to be saved as 'parts'.
My problem is that when I go to S3 to look at my data, it has been saved in a folder named _temporary, with a subfolder 0, and then each part or task is saved in its own folder.
For example,
data.saveAsTextFile("s3://kirk/data");
results in files like
s3://kirk/data/_SUCCESS
s3://kirk/data/_temporary/0/_temporary_$folder$
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00000_$folder$
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00000/part-00000
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00001_$folder$
s3://kirk/data/_temporary/0/task_201411291454_0001_m_00001/part-00001
and so on. I would expect and have seen something like
s3://kirk/data/_SUCCESS
s3://kirk/data/part-00000
s3://kirk/data/part-00001
Is this a configuration setting, or do I need to 'commit' the save to resolve the temporary files?
I had the same problem with Spark Streaming. It was because my Spark master was set up with conf.setMaster("local") instead of conf.setMaster("local[*]").
Without the [*], Spark can't execute saveAsTextFile during the stream.
Try using coalesce() to reduce the RDD to 1 partition before you export.
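As far as I understand it, task output is staged under _temporary and only moved to the final location when the job commits, so anything still sitting under _temporary after the job ends points to a commit that did not complete. A small sketch for spotting that in a bucket listing (uncommitted_paths is a hypothetical helper, and the listing is taken from the paths in the question):

```python
def uncommitted_paths(paths):
    """Return the paths that still live under a _temporary staging folder."""
    return [p for p in paths if "/_temporary/" in p]


listing = [
    "s3://kirk/data/_SUCCESS",
    "s3://kirk/data/_temporary/0/task_201411291454_0001_m_00000/part-00000",
    "s3://kirk/data/part-00001",
]
leftover = uncommitted_paths(listing)
print(leftover)
```

An empty result alongside part-NNNNN files at the top level is the layout the question expected to see.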
Good luck!