Amazon S3 parquet file - Transferring to GCP / BQ

Good morning everyone. I have a GCS bucket containing files that were transferred from our Amazon S3 bucket. These files are in .gz.parquet format. I am trying to set up a transfer from the GCS bucket to BigQuery with the transfer feature, but I am running into issues with the Parquet file format.
When I create a transfer and specify the file format as Parquet, I receive an error stating that the data is not in Parquet format. When I try specifying the file format as CSV instead, weird values appear in my table, as shown in the linked image.
I have tried the following URIs:
bucket-name/folder-1/folder-2/dt={run_time|"%Y-%m-%d"}/b=1/geo/*.parquet. FILE FORMAT: PARQUET. RESULTS: FILE NOT IN PARQUET FORMAT.
bucket-name/folder-1/folder-2/dt={run_time|"%Y-%m-%d"}/b=1/geo/*.gz.parquet. FILE FORMAT: PARQUET. RESULTS: FILE NOT IN PARQUET FORMAT.
bucket-name/folder-1/folder-2/dt={run_time|"%Y-%m-%d"}/b=1/geo/*.gz.parquet. FILE FORMAT: CSV. RESULTS: TRANSFER DONE, BUT WEIRD VALUES.
bucket-name/folder-1/folder-2/dt={run_time|"%Y-%m-%d"}/b=1/geo/*.parquet. FILE FORMAT: CSV. RESULTS: TRANSFER DONE, BUT WEIRD VALUES.
Does anyone have any idea on how I should proceed? Thank you in advance!

There is dedicated documentation explaining how to load Parquet data from a Cloud Storage bucket into BigQuery, linked below. Could you please go through it and let us know if it still doesn't solve your problem?
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet
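If the console transfer keeps rejecting the files, it may also be worth trying the load through the google-cloud-bigquery Python client. This is only a minimal sketch: the project, dataset and table names are hypothetical, and the dt= value below stands in for a concrete partition.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.geo_table"  # hypothetical destination table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
)

# Wildcard URI over the .gz.parquet objects described in the question;
# substitute a real dt= partition for the placeholder below.
uri = "gs://bucket-name/folder-1/folder-2/dt=YYYY-MM-DD/b=1/geo/*.parquet"

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish

print(client.get_table(table_id).num_rows, "rows loaded")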
Regards,
Anbu.

Judging by your URIs, the page you are looking for is this one, on loading hive-partitioned Parquet files into BigQuery.
You can try something like below in Cloud Shell:
bq load --source_format=PARQUET --autodetect \
  --hive_partitioning_mode=STRINGS \
  --hive_partitioning_source_uri_prefix=gs://bucket-name/folder-1/folder-2/ \
  dataset.table gcs_uris
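If you prefer to do the same thing from Python, a rough equivalent using the client library's hive partitioning options could look like the sketch below (the dataset/table name and the wildcard URI are placeholders, just like gcs_uris in the bq command):
from google.cloud import bigquery

client = bigquery.Client()

hive_opts = bigquery.HivePartitioningOptions()
hive_opts.mode = "STRINGS"  # mirrors --hive_partitioning_mode=STRINGS
hive_opts.source_uri_prefix = "gs://bucket-name/folder-1/folder-2/"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    autodetect=True,
    hive_partitioning=hive_opts,
)

job = client.load_table_from_uri(
    "gs://bucket-name/folder-1/folder-2/*.parquet",  # placeholder wildcard URI
    "my-project.dataset.table",                      # hypothetical table id
    job_config=job_config,
)
job.result()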

Related

Databricks showing an empty DF when reading snappy.parquet files

Having an issue with loading parquet files on Databricks. We used the Amazon DMS service to migrate Postgres databases to Databricks in order to store them in the Delta lake. DMS moved the database from RDS Postgres into an S3 bucket that is already mounted. The files are visible, but I am unable to read them.
Running:
df = spark.read.option("header","true").option("recursiveFileLookup","true").format('delta').load('/mnt/delta/postgres_table')
display(df)
Show:
Query returned no results
Inside this directory there is a slew of snappy.parquet files.
Thank you
I downloaded an individual parquet file (LOAD0000.parquet) and reviewed it, and it does display with pandas. Aside from that, several scripts were tested to see if I could get one DataFrame to show, to no avail.
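One thing that may be worth checking: AWS DMS writes plain Parquet files to S3 rather than a Delta table, so reading the directory with format('delta') may not behave as expected. A hedged sketch, reusing the mount path from the question, that reads the directory with the Parquet reader instead:
df = (
    spark.read
    .option("recursiveFileLookup", "true")
    .parquet("/mnt/delta/postgres_table")  # same mount path as above
)
display(df)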

Why are empty parquet files created when writing to S3 from a pyspark job?

I am reading a Cassandra table from pyspark, and I am observing that for very small data three files get written to S3, but one of those three is an empty parquet file. I am wondering how the empty file gets created.
Below are the steps in my pyspark job (a sketch of them follows the list):
read a table from Cassandra
coalesce(15) on the data read from Cassandra
write into an AWS S3 bucket.
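A minimal sketch of those three steps, assuming the spark-cassandra-connector is available and using hypothetical keyspace, table and bucket names:
# Read a table from Cassandra (names are hypothetical).
df = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="my_table")
    .load()
)

# coalesce(15) caps the output at 15 partitions; with very little data some of
# those partitions can hold zero rows, and an empty partition is still written
# out as a part file, which would explain the empty parquet file.
df.coalesce(15).write.parquet("s3a://my-bucket/output/")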

Download and convert Parquet files

I have a vendor who is sharing thousands of parquet files which I have to download and convert to CSV for my boss to run analysis on. But the vendor is not granting ListObjects permission on their AWS S3 bucket. What are my alternatives for getting these thousands of files? I would prefer to get them into my own S3 bucket so I can convert them to CSV using Spark, and my boss can download the CSVs later. I am trying to use pyspark with boto3. Below is a snippet of the code that I am running on a standalone EC2 instance with Spark.
print("starting...")
for s3_file in vendorbucket.objects.filter(Prefix=PREFIX):
if 'parquet' in s3_file.key:
basename, ext = os.path.splitext(os.path.split(s3_file.key)[1])
print ('processing s3 object= ',s3_file.key)
df = spark.read.parquet("s3a://{bucket}/{file}".format(bucket=BUCKET_NAME,file=s3_file.key))
df.write.csv("s3a://{bucket}/{file}".format(bucket=OUTPUT_BUCKET_NAME,file=(basename+".csv")))
The above code works when I tested with my 2 S3 buckets in my account - one for source and one for output.
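Since the vendor's bucket won't allow ListObjects, the objects.filter() call would fail against it. A hedged alternative, assuming the vendor can share a plain list of object keys out of band (GetObject on a known key does not need list permission), is to drive the same loop from that list:
import os

# Hypothetical manifest file: one object key per line, supplied by the vendor.
with open("vendor_keys.txt") as f:
    keys = [line.strip() for line in f if line.strip()]

# BUCKET_NAME and OUTPUT_BUCKET_NAME are the same variables as in the snippet above.
for key in keys:
    if 'parquet' in key:
        basename, ext = os.path.splitext(os.path.split(key)[1])
        print('processing s3 object =', key)
        df = spark.read.parquet("s3a://{bucket}/{file}".format(bucket=BUCKET_NAME, file=key))
        df.write.csv("s3a://{bucket}/{file}".format(bucket=OUTPUT_BUCKET_NAME, file=(basename + ".csv")))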
Thanks

Export table from BigQuery into GCS - split sizes

I am exporting a table of size > 1 GB from BigQuery into GCS, but it splits the output into very small files of 2-3 MB. Is there a way to get bigger files, like 40-60 MB per file, rather than 2-3 MB?
I do the export via the API:
https://cloud.google.com/bigquery/docs/exporting-data#exporting_data_into_one_or_more_files
https://cloud.google.com/bigquery/docs/reference/v2/jobs
The source table size is 60 GB in BigQuery. I extract the data with format NEWLINE_DELIMITED_JSON and GZIP compression:
destination_cloud_storage_uris = [
    'gs://bucket_name/main_folder/partition_date=xxxxxxx/part-*.gz'
]
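For reference, here is roughly how that export looks with the google-cloud-bigquery Python client, using the same destination URI, format and compression as above (the source table id is hypothetical):
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON,
    compression=bigquery.Compression.GZIP,
)

extract_job = client.extract_table(
    "my-project.my_dataset.my_table",  # hypothetical source table
    "gs://bucket_name/main_folder/partition_date=xxxxxxx/part-*.gz",
    job_config=job_config,
)
extract_job.result()  # wait for the export to finish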
Are you trying to export a partitioned table? If yes, each partition is exported as a different table, and that might cause the small files.
I ran the export in the CLI with each of the following commands and in both cases received files of size 49 MB:
bq extract --compression=GZIP --destination_format=NEWLINE_DELIMITED_JSON project:dataset.table gs://bucket_name/path5-component/file-name-*.gz
bq extract --compression=GZIP project:dataset.table gs://bucket_name/path5-component/file-name-*.gz
Please add more details to the question so we can provide specific advice: how exactly are you requesting this export?
Nevertheless, if you have many files in GCS and you want to merge them all into one, you can do:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/composite
https://cloud.google.com/storage/docs/gsutil/commands/compose
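The same composition can also be done from Python with the Cloud Storage client; a small sketch with hypothetical object names (a single compose call accepts at most 32 source objects):
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("bucket")  # hypothetical bucket name

sources = [bucket.blob("obj1"), bucket.blob("obj2")]  # up to 32 per call
composite = bucket.blob("composite")
composite.compose(sources)  # server-side concatenation, no download needed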

How to move compressed TSV files from a Google Cloud Storage bucket to BigQuery with schema auto-detect?

I have been trying multiple ways to move the compressed TSV to BigQuery. I was able to get the command running, but I didn't see any table being loaded. Please help me figure out how to write a command that works.
bq '--project_id' --nosync load --source_format CSV --field_delimiter '\t' --autodetect --skip_leading_rows '0' --quote='' --encoding UTF-8 :table.destinationtable 'gs://bucketname/filename.tsv.gz'
Successfully started load 162822:bqjob_r2d00a5817904935f_0000015c79e61b7c_1
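One way to sanity-check the same load outside the bq CLI is the google-cloud-bigquery Python client. The sketch below mirrors the flags above (project, dataset and table names are placeholders) and, unlike --nosync, it blocks until the job finishes so any error surfaces immediately:
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.destinationtable"  # hypothetical table id

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,  # TSV is CSV with a tab delimiter
    field_delimiter="\t",
    autodetect=True,
    skip_leading_rows=0,
    quote_character="",
    encoding="UTF-8",
)

load_job = client.load_table_from_uri(
    "gs://bucketname/filename.tsv.gz", table_id, job_config=job_config
)
load_job.result()  # raises if the load fails
print(client.get_table(table_id).num_rows, "rows loaded")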