Having an issue with loading parquet files onto Databricks. We used the AWS DMS service to migrate Postgres databases onto Databricks in order to store them in the Delta lake. DMS moved the database from RDS Postgres into an S3 bucket that is already mounted. The files are visible, but I am unable to read them.
Running:
df = spark.read.option("header","true").option("recursiveFileLookup","true").format('delta').load('/mnt/delta/postgres_table')
display(df)
Shows:
Query returned no results
Inside this directory there is a slew of snappy.parquet files.
Thank you
I downloaded an individual parquet file (LOAD0000.parquet) and reviewed it, and it does display with pandas. Aside from that, several scripts were tested to see if I could get a single df to show, to no avail.
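For reference, AWS DMS writes its output as plain snappy-compressed parquet files and does not create a Delta transaction log, so a hedged sketch of reading the same mounted path with the plain parquet reader (assuming Spark 3.0+ for recursiveFileLookup; the path is the one from the question) would look like:

df = (spark.read
      .option("recursiveFileLookup", "true")
      .parquet("/mnt/delta/postgres_table"))
display(df)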
Related
I am reading a Cassandra table from PySpark, and I am observing that for very small data three files get written to S3, but out of those 3, one is an empty parquet file. I am wondering how the empty file creation took place?
Below are my steps in the PySpark job (sketched in code after the list):
read a table from Cassandra
coalesce(15) on the data read from Cassandra
write into an AWS S3 bucket.
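A minimal sketch of those three steps, assuming the Spark Cassandra connector is available and using placeholder keyspace, table, host, and bucket names. With very little data, coalesce(15) can leave some of the 15 partitions empty, and each partition is written out as its own file, which is one common source of "empty" (schema-only) parquet files:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-to-s3")
         .config("spark.cassandra.connection.host", "cassandra-host")  # hypothetical host
         .getOrCreate())

# 1. read a table from Cassandra
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="my_table")  # hypothetical names
      .load())

# 2. coalesce(15); with tiny input, several of these partitions can end up with no rows
df = df.coalesce(15)

# 3. write into the AWS S3 bucket; one file is written per partition,
# so a row-less partition can produce a metadata-only parquet file
df.write.mode("overwrite").parquet("s3a://my-bucket/output/")  # hypothetical path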
I have a vendor who is sharing thousands of parquet files which I have to download and convert to CSV for my boss to run analysis on. But the vendor is not granting ListObjects permission on their AWS S3 bucket. What are my alternatives for getting these thousands of files? I would prefer to get them into my own S3 bucket so I can convert them to CSV using Spark, and then my boss can download the CSVs later. I am trying to use PySpark with boto3. Below is a snippet of the code that I am running on a standalone EC2 instance with Spark.
print("starting...")
for s3_file in vendorbucket.objects.filter(Prefix=PREFIX):
if 'parquet' in s3_file.key:
basename, ext = os.path.splitext(os.path.split(s3_file.key)[1])
print ('processing s3 object= ',s3_file.key)
df = spark.read.parquet("s3a://{bucket}/{file}".format(bucket=BUCKET_NAME,file=s3_file.key))
df.write.csv("s3a://{bucket}/{file}".format(bucket=OUTPUT_BUCKET_NAME,file=(basename+".csv")))
The above code works when I tested it with two S3 buckets in my own account - one for source and one for output.
Thanks
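One alternative that avoids listing: s3:GetObject on a known key does not require ListObjects, so if the vendor can share a manifest of object keys (even a plain text file), each key can be copied server-side into your own bucket. A hedged boto3 sketch, with the manifest file and bucket names as placeholders:

import boto3

s3 = boto3.client("s3")
VENDOR_BUCKET = "vendor-bucket-name"   # placeholder
MY_BUCKET = "my-bucket-name"           # placeholder

# vendor_manifest.txt: one object key per line, supplied by the vendor (hypothetical)
with open("vendor_manifest.txt") as f:
    for key in (line.strip() for line in f if line.strip()):
        # Server-side copy; needs s3:GetObject on the vendor key and
        # s3:PutObject on the destination bucket, but no ListObjects.
        s3.copy({"Bucket": VENDOR_BUCKET, "Key": key}, MY_BUCKET, key)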
I need to move my BigQuery tables to Redshift.
Currently I have a Python job that fetches data from BigQuery and incrementally loads it into Redshift.
This Python job reads the BigQuery data, creates a CSV file on the server, drops it on S3, and the Redshift table reads the data from the file on S3. But now the data size will be very big, so the server won't be able to handle it.
Do you guys happen to know anything better than this?
The 7 new tables on BigQuery that I need to move are around 1 TB each, with repeated columns (I am doing an UNNEST join to flatten them).
You could actually move the data from BigQuery to a Cloud Storage bucket by following the instructions here. After that, you can easily move the data from the Cloud Storage bucket to the Amazon S3 bucket by running:
gsutil rsync -d -r gs://your-gs-bucket s3://your-s3-bucket
The documentation for this can be found here
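For the export step itself, a hedged sketch using the google-cloud-bigquery client (project, dataset, table, and bucket names are placeholders); the wildcard URI lets BigQuery shard a ~1 TB table into multiple files, since a single export file is capped at 1 GB:

from google.cloud import bigquery

client = bigquery.Client()

# Export the (flattened) table to Cloud Storage as sharded CSV files.
destination_uri = "gs://your-gs-bucket/my_table/part-*.csv"   # placeholder bucket/prefix
extract_job = client.extract_table("my-project.my_dataset.my_table", destination_uri)
extract_job.result()  # wait for the export to finish

# Then: gsutil rsync -d -r gs://your-gs-bucket s3://your-s3-bucket (as above),
# followed by the existing Redshift load from the S3 files.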
I have around 50 CSV files, each with a different structure. Each CSV file has close to 1000 columns. I am using DictReader to merge the CSV files locally, but it is taking too much time to merge. The approach was to merge 1.csv and 2.csv to create 12.csv, then merge 12.csv with 3.csv. This is not the right approach.
import csv

for filename in inputs:
    with open(filename, "r", newline="") as f_in:
        reader = csv.DictReader(f_in)  # uses the field names in this file
Since I have to finally upload this huge single CSV to AWS, I was thinking about a better AWS-based solution. Any suggestions on how I can import these multiple CSVs with different structures and merge them in AWS?
Launch an EMR cluster and merge the files with Apache Spark. This gives you complete control over the schema. This answer might help for example.
Alternatively, you can also try your luck and see how AWS Glue handles the multiple schemas when you create a crawler.
You should copy your data to S3 in both cases.
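For the Spark route, a hedged sketch of the merge, assuming Spark 3.1+ (for allowMissingColumns) and that the input CSVs have already been copied to an S3 prefix; all paths are placeholders:

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-csvs").getOrCreate()

# Placeholder input paths; each file's header is used as its per-file schema.
paths = ["s3a://my-bucket/input/{}.csv".format(i) for i in range(1, 51)]
dfs = [spark.read.option("header", "true").csv(p) for p in paths]

# unionByName with allowMissingColumns=True fills columns that a given file
# lacks with nulls, so ~50 different structures collapse into one wide schema.
merged = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), dfs)

# coalesce(1) yields a single CSV part file for the final download/upload.
merged.coalesce(1).write.option("header", "true").csv("s3a://my-bucket/merged/")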
I have a couple of Spark jobs that produce parquet files in AWS S3. Every once in a while I need to run some ad-hoc queries on a given date range of this data. I don't want to do this in Spark because I want our QA team, which has no knowledge of Spark, to be able to do it. What I would like to do is spin up an AWS EMR cluster, load the parquet files into HDFS, and run my queries against it. I have figured out how to create tables with Hive and point them at one S3 path. But that limits my data to only one day, because each day of data has multiple files under a path like
s3://mybucket/table/date/(parquet files 1 ... n).
So problem one is figuring out how to load multiple days of data into Hive, i.e.
s3://mybucket/table_a/day_1/(parquet files 1 ... n).
s3://mybucket/table_a/day_2/(parquet files 1 ... n).
s3://mybucket/table_a/day_3/(parquet files 1 ... n).
...
s3://mybucket/table_b/day_1/(parquet files 1 ... n).
s3://mybucket/table_b/day_2/(parquet files 1 ... n).
s3://mybucket/table_b/day_3/(parquet files 1 ... n).
I know Hive can support partitions, but my S3 files are not set up that way.
I have also looked into PrestoDB, which looks to be the favorite tool for this type of data analysis. The fact that it supports ANSI SQL makes it a great tool for people who have SQL knowledge but know very little about Hadoop or Spark. I did install it on my cluster and it works great. But it looks like you can't really load data into your tables, and you have to rely on Hive to do that part. Is this the right way to use PrestoDB? I watched a Netflix presentation about their use of PrestoDB with S3 in place of HDFS. If this works it's great, but I wonder how the data is moved into memory. At what point will the parquet files be moved from S3 to the cluster? Do I need a cluster that can load the entire data set into memory? How is this generally set up?
You can install Hive and create Hive tables with your data in S3, as described in the blog post here: https://blog.mustardgrain.com/2010/09/30/using-hive-with-existing-files-on-s3/
Then install Presto on AWS and configure Presto to connect to the Hive catalog you set up previously. Then you can query your data on S3 with Presto using SQL.
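For the multi-day layout in the question, one hedged way to register it in the Hive metastore is an external table with one explicit partition per day directory (since the folders are named day_1/ rather than Hive's day=.../ convention, the partitions are added manually). The sketch below runs the DDL through spark.sql, but the same statements work in the Hive CLI; the column list is a placeholder and the paths are the ones from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# External table over the existing parquet files; columns are placeholders.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS table_a (id BIGINT, value STRING)
    PARTITIONED BY (day STRING)
    STORED AS PARQUET
    LOCATION 's3://mybucket/table_a/'
""")

# Attach each day directory as a partition with an explicit LOCATION.
for d in ["day_1", "day_2", "day_3"]:
    spark.sql("""
        ALTER TABLE table_a ADD IF NOT EXISTS PARTITION (day='{0}')
        LOCATION 's3://mybucket/table_a/{0}/'
    """.format(d))

Presto pointed at the same Hive metastore can then query table_a across all days with plain SQL, reading the parquet files directly from S3.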
Rather than trying to load multiple files, you could instead use the API to concatenate the days you want into a single object, which you can then load through the means you already mentioned.
AWS has a blog post highlighting how to do exactly this purely through the API (without downloading and re-uploading the data):
https://ruby.awsblog.com/post/Tx2JE2CXGQGQ6A4/Efficient-Amazon-S3-Object-Concatenation-Using-the-AWS-SDK-for-Ruby
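A hedged boto3 sketch of the same server-side concatenation idea (a multipart upload whose parts are copied with UploadPartCopy, so nothing is downloaded or re-uploaded); bucket and key names are placeholders, and note that every part except the last must be at least 5 MB:

import boto3

s3 = boto3.client("s3")
bucket = "mybucket"                                             # placeholder
source_keys = ["table_a/day_1/file1", "table_a/day_2/file1"]    # placeholder keys
dest_key = "table_a/concatenated/days_1_2"                      # placeholder key

mpu = s3.create_multipart_upload(Bucket=bucket, Key=dest_key)
parts = []
for i, src in enumerate(source_keys, start=1):
    # Copy each source object as one part of the destination object.
    resp = s3.upload_part_copy(
        Bucket=bucket, Key=dest_key, UploadId=mpu["UploadId"],
        PartNumber=i, CopySource={"Bucket": bucket, "Key": src},
    )
    parts.append({"PartNumber": i, "ETag": resp["CopyPartResult"]["ETag"]})

s3.complete_multipart_upload(
    Bucket=bucket, Key=dest_key, UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)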