Download and convert Parquet files - amazon-s3

I have a vendor who is sharing thousands of Parquet files that I have to download and convert to CSV for my boss to run analysis on. However, the vendor is not granting ListObjects permission on their AWS S3 bucket. What are my alternatives for getting these thousands of files? I would prefer to get them into my own S3 bucket so I can convert them to CSV using Spark, and my boss can download the CSVs later. I am trying to use PySpark with boto3. Below is a snippet of the code I am running on a standalone EC2 instance with Spark.
print("starting...")
for s3_file in vendorbucket.objects.filter(Prefix=PREFIX):
if 'parquet' in s3_file.key:
basename, ext = os.path.splitext(os.path.split(s3_file.key)[1])
print ('processing s3 object= ',s3_file.key)
df = spark.read.parquet("s3a://{bucket}/{file}".format(bucket=BUCKET_NAME,file=s3_file.key))
df.write.csv("s3a://{bucket}/{file}".format(bucket=OUTPUT_BUCKET_NAME,file=(basename+".csv")))
The above code works when I test it with two S3 buckets in my own account - one for the source and one for the output.
Thanks
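
One possible alternative, sketched under the assumption that the vendor grants GetObject on individual keys and can share the exact object keys (for example as a manifest text file): GetObject on a known key does not require ListObjects, so each file can be copied into your own bucket by key. The bucket names and manifest filename below are hypothetical.

import boto3

# Hypothetical names - adjust to your environment.
VENDOR_BUCKET = "vendor-bucket"
MY_BUCKET = "my-bucket"

s3 = boto3.client("s3")

# Exact keys supplied by the vendor (hypothetical manifest file).
with open("vendor_keys.txt") as f:
    keys = [line.strip() for line in f if line.strip()]

for key in keys:
    # Server-side copy; needs GetObject on the vendor bucket, not ListObjects.
    s3.copy_object(
        Bucket=MY_BUCKET,
        Key=key,
        CopySource={"Bucket": VENDOR_BUCKET, "Key": key},
    )

Once the files are in your own bucket, the Spark loop above can read and convert them as before.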

Related

Databricks showing an empty DF when reading snappy.parquet files

I am having an issue loading Parquet files into Databricks. We used the Amazon DMS service to migrate Postgres databases to Databricks in order to save them in the Delta lake. DMS moved the database from RDS Postgres into an S3 bucket that is already mounted. The files are visible, but I am unable to read them.
Running:
df = spark.read.option("header","true").option("recursiveFileLookup","true").format('delta').load('/mnt/delta/postgres_table')
display(df)
This shows:
Query returned no results
Inside this directory there is a slew of snappy.parquet files.
Thank you
I downloaded an individual Parquet file (LOAD0000.parquet) and reviewed it, and it does display with pandas. Aside from that, I tested several scripts to see if I could get a single DataFrame to display, to no avail.
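
As an illustration only (not a confirmed fix), a minimal sketch assuming the DMS output under the mount is plain snappy-compressed Parquet rather than a registered Delta table, in which case reading it with the Parquet reader instead of format('delta') may return the rows; the path is the one from the question.

# Sketch: read the DMS output as plain Parquet instead of Delta.
df = (spark.read
      .option("recursiveFileLookup", "true")
      .parquet("/mnt/delta/postgres_table"))
display(df)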

Why are empty parquet files created when writing to S3 from a PySpark job?

I am reading a Cassandra table from PySpark, and I am observing that for very small data, three files get written into S3, but one of those three is an empty Parquet file. I am wondering how the empty file gets created.
Below are the steps in my PySpark job (a sketch of these steps follows the list):
1) Read a table from Cassandra.
2) coalesce(15) on the data read from Cassandra.
3) Write into an AWS S3 bucket.
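
For illustration, a minimal sketch of the job described above, assuming the spark-cassandra-connector is configured on the cluster; the keyspace, table, and bucket names are hypothetical. An empty part file usually corresponds to a partition that ends up with no rows after coalesce(15) on very small data, and this sketch does not avoid that by itself.

# Sketch of the described job (hypothetical keyspace/table/bucket names).
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="my_table")
      .load())

# With very small data, coalesce(15) can leave some of the 15 partitions
# empty; each empty partition is still written out as an empty part file.
df = df.coalesce(15)

df.write.mode("overwrite").parquet("s3a://my-output-bucket/my_table/")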

Amazon S3 parquet file - Transferring to GCP / BQ

Good morning everyone. I have a GCS bucket which has files that have been transferred from our Amazon S3 bucket. These files are in .gz.parquet format. I am trying to set up a transfer from the GCS bucket to BigQuery with the transfer feature; however, I am running into issues with the Parquet file format.
When I create a transfer and specify the file format as Parquet, I receive an error stating that the data is not in Parquet format. When I tried specifying the format as CSV, strange values appeared in my table.
I have tried the following URIs:
bucket-name/folder-1/folder-2/dt={run_time|"%Y-%m-%d"}/b=1/geo/*.parquet. FILE FORMAT: PARQUET. RESULTS: FILE NOT IN PARQUET FORMAT.
bucket-name/folder-1/folder-2/dt={run_time|"%Y-%m-%d"}/b=1/geo/*.gz.parquet. FILE FORMAT: PARQUET. RESULTS: FILE NOT IN PARQUET FORMAT.
bucket-name/folder-1/folder-2/dt={run_time|"%Y-%m-%d"}/b=1/geo/*.gz.parquet. FILE FORMAT: CSV. RESULTS: TRANSFER DONE, BUT WEIRD VALUES.
bucket-name/folder-1/folder-2/dt={run_time|"%Y-%m-%d"}/b=1/geo/*.parquet. FILE FORMAT: CSV. RESULTS: TRANSFER DONE, BUT WEIRD VALUES.
Does anyone have any idea on how I should proceed? Thank you in advance!
There is dedicated documentation explaining how to load Parquet data from a Cloud Storage bucket into BigQuery, linked below. Could you please go through it and let us know if it still does not solve your problem?
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet
Regards,
Anbu.
Judging by your URIs, the page you are looking for is the one on loading hive-partitioned Parquet files into BigQuery.
You can try something like below in Cloud Shell:
bq load --source_format=PARQUET --autodetect \
  --hive_partitioning_mode=STRINGS \
  --hive_partitioning_source_uri_prefix=gs://bucket-name/folder-1/folder-2/ \
  dataset.table gcs_uris
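
For comparison, a minimal sketch of the same load using the google-cloud-bigquery Python client, assuming the data is hive-partitioned Parquet under that prefix; the project, dataset, and table names are placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder identifiers - replace with your own project/dataset/table.
table_id = "my-project.my_dataset.my_table"
source_uri = "gs://bucket-name/folder-1/folder-2/*.parquet"

hive_opts = bigquery.HivePartitioningOptions()
hive_opts.mode = "STRINGS"
hive_opts.source_uri_prefix = "gs://bucket-name/folder-1/folder-2/"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    autodetect=True,
    hive_partitioning=hive_opts,
)

load_job = client.load_table_from_uri(source_uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish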

Merge multiple parquet files to single parquet file in AWS S3 using AWS Glue ETL python spark (pyspark)

I have an AWS Glue ETL job running every 15 minutes that generates one Parquet file in S3 each time.
I need to create another job that runs at the end of each hour to merge those four Parquet files in S3 into a single Parquet file, using AWS Glue ETL PySpark code.
Has anyone tried this? Any suggestions or best practices?
Thanks in advance!
Well, an easy option would be to convert it into a Spark DataFrame:
1) Read the Parquet into a DynamicFrame (or better yet, just read it as a Spark DataFrame).
2) sourcedf.toDF().repartition(1)
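
Expanding on that, a minimal sketch in plain PySpark (assuming an existing SparkSession named spark; the S3 paths are placeholders): coalescing to one partition forces a single output part file, at the cost of funneling all the data through a single task.

# Sketch: merge the hour's Parquet files into a single output file.
# Placeholder S3 paths - adjust to your bucket/prefix layout.
df = spark.read.parquet("s3://my-bucket/every-15-min/hour=10/")

(df.coalesce(1)        # one partition -> one part file
   .write
   .mode("overwrite")
   .parquet("s3://my-bucket/hourly-merged/hour=10/"))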

How to load multiple huge csv (with different columns) into AWS S3

I have around 50 CSV files, each with a different structure, and each with close to 1,000 columns. I am using DictReader to merge the CSV files locally, but it is taking too much time. The approach was to merge 1.csv and 2.csv to create 12.csv, then merge 12.csv with 3.csv, and so on. This is not the right approach.
for filename in inputs:
    with open(filename, "r", newline="") as f_in:
        reader = csv.DictReader(f_in)  # Uses the field names in this file
Since I ultimately have to upload this huge single CSV to AWS, I was thinking about a better AWS-based solution. Any suggestions on how I can import these multiple differently structured CSVs and merge them in AWS?
Launch an EMR cluster and merge the files with Apache Spark; this gives you complete control over the schema. This answer might help, for example.
Alternatively, you can try your luck and see how AWS Glue handles the multiple schemas when you create a crawler.
In both cases you should copy your data to S3 first.
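
As an illustration of the Spark route, a hedged sketch that reads each CSV with its own header and unions them by column name, so columns missing from a file become nulls; unionByName with allowMissingColumns needs Spark 3.1+, an existing SparkSession named spark is assumed, and the paths are placeholders.

from functools import reduce

# Placeholder input paths (could be local staging files or s3a:// URIs).
inputs = [
    "s3a://my-bucket/raw/1.csv",
    "s3a://my-bucket/raw/2.csv",
    "s3a://my-bucket/raw/3.csv",
]

# Read each file with its own header so differing column sets are preserved.
frames = [spark.read.option("header", "true").csv(path) for path in inputs]

# Union by column name; columns absent from a file are filled with nulls (Spark 3.1+).
merged = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), frames)

# Write the merged result back to S3 as a single CSV part file.
merged.coalesce(1).write.option("header", "true").mode("overwrite").csv("s3a://my-bucket/merged/")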