Sqoop Process S3/CSV to S3/Parquet - hive

Is Sqoop able to read a directory or CSV file stored in S3 and then import it into another S3 directory in Parquet format? I have been testing and researching for a while but have found nothing. Any assistance would be appreciated.
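As far as I know, Sqoop imports and exports run against a JDBC source such as a relational database, so it cannot take S3 as the input side by itself. Since the question is tagged hive, one common substitute is a pair of external Hive tables over the two S3 locations; the table names, columns, and paths below are invented for illustration:

```sql
-- External table over the existing CSV data in S3 (schema is assumed)
CREATE EXTERNAL TABLE csv_src (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/csv-input/';

-- External table over the target S3 prefix, stored as Parquet
CREATE EXTERNAL TABLE parquet_dst (id INT, name STRING)
STORED AS PARQUET
LOCATION 's3://my-bucket/parquet-output/';

-- Rewrite the CSV data as Parquet
INSERT OVERWRITE TABLE parquet_dst SELECT * FROM csv_src;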

Related

How to read .zip files in Synapse spark notebooks

I'm new to Synapse and I'm stuck on a problem. I want to read a '.zip' file from ADLS Gen2 via Spark notebooks. As far as I can tell, spark.read.csv doesn't support '.zip' compression. I also tried reading it with Python's zipfile library, but it doesn't accept the ABFSS path. Thanks in advance.
I am fairly new to Synapse so this may not be a definitive answer, but I don't think it is possible.
I have tried the following approaches:
zipfile with abfss:// path (as you have)
zipfile with synfs:// path
shutil with synfs:// path
copy file to tempdir and use shutil (could not get this to work)
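For what it's worth, the reason the zipfile attempts fail is that zipfile only understands local paths and file-like objects, not abfss:// URLs. One workaround is to read the blob's raw bytes first (e.g. via mssparkutils.fs or the Azure Storage SDK, both assumptions here) and hand zipfile a BytesIO. A stdlib-only sketch, with an in-memory zip standing in for the downloaded bytes:

```python
import io
import zipfile

# Stand-in for bytes downloaded from ADLS Gen2 (e.g. via mssparkutils
# or the Azure SDK): build a small zip in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data/rows.csv", "a,b\n1,2\n")
zip_bytes = buf.getvalue()

# zipfile can't open an abfss:// URL, but it accepts any file-like
# object, so wrapping the raw bytes in BytesIO works:
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
    names = zf.namelist()
print(names)
```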

Convert S3 files to .csv using airflow task

I have an Airflow task which fetches data from Redshift, creates a file out of it, and loads it into an S3 bucket. I want the files to end with .csv, but the UNLOAD command doesn't allow that. How can I add a new task in the same DAG to convert the files to .csv files?
The flow has to be:
task1: unload the output of the query to the S3 bucket >> task2: convert those files into .csv
Just add another rename task to the Airflow DAG, downstream of the unload task that writes the files to S3.
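A sketch of that rename task with boto3 (bucket, prefix, and task names are made up): list the keys the UNLOAD produced, copy each one to the same key with a .csv suffix, then delete the original, since S3 has no in-place rename:

```python
def with_csv_suffix(key):
    # UNLOAD part files have no extension; append .csv if it's missing.
    return key if key.endswith(".csv") else key + ".csv"

def rename_unloaded_files(bucket="my-bucket", prefix="unload/output/"):
    # Hypothetical bucket/prefix; intended to run in a PythonOperator.
    import boto3  # available on the Airflow workers
    s3 = boto3.client("s3")
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket,
                                                         Prefix=prefix)
    for page in pages:
        for obj in page.get("Contents", []):
            key = obj["Key"]
            new_key = with_csv_suffix(key)
            if new_key != key:
                # S3 cannot rename: copy to the new key, delete the old.
                s3.copy_object(Bucket=bucket,
                               CopySource={"Bucket": bucket, "Key": key},
                               Key=new_key)
                s3.delete_object(Bucket=bucket, Key=key)
```

Wired in as something like task2 = PythonOperator(task_id="convert_to_csv", python_callable=rename_unloaded_files), downstream of the unload task.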

Inspect Parquet in S3 from Command Line

I can download a single snappy.parquet partition file with:
aws s3 cp s3://bucket/my-data.parquet/my-data-0000.snappy.parquet ./my-data-0000.snappy.parquet
And then use:
parquet-tools head my-data-0000.snappy.parquet
parquet-tools schema my-data-0000.snappy.parquet
parquet-tools meta my-data-0000.snappy.parquet
But I'd rather not download the file, and I'd rather not have to specify a particular snappy.parquet file. Instead, I'd like to point at just the prefix: "s3://bucket/my-data.parquet"
Also what if the schema is different in different row groups across different partition files?
Following the instructions here, I downloaded a jar file and ran:
hadoop jar parquet-tools-1.9.0.jar schema s3://bucket/my-data.parquet/
But this resulted in the error: No FileSystem for scheme "s3".
This answer seems promising, but only for reading from HDFS. Any solution for S3?
I wrote the tool clidb to help with this kind of "quick peek at a parquet file in S3" task.
You should be able to do:
pip install "clidb[extras]"
clidb s3://bucket/
and then click to load parquet files as views to inspect and run SQL against.
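As for the original "No FileSystem for scheme" error: Hadoop only registers S3 support when the hadoop-aws module (plus a matching AWS SDK jar) is on the classpath, and the scheme it registers is s3a://, not s3://. Assuming those jars are available, the earlier command might work in the form:

```shell
# Requires hadoop-aws and its AWS SDK dependency on the classpath;
# s3a:// is the scheme Hadoop's S3 connector registers.
hadoop jar parquet-tools-1.9.0.jar schema s3a://bucket/my-data.parquet/
```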

Get zip files from one S3 bucket and unzip them to another S3 bucket

I have zip files in one S3 bucket.
I need to unzip them and copy the unzipped contents to another S3 bucket, keeping the source path.
For example, if in the source bucket the zip file is under
"s3://bucketname/foo/bar/file.zip"
then in the destination bucket it should be "s3://destbucketname/foo/bar/zipname/files.."
How can this be done?
I know that it is somehow possible with Lambda, so I won't have to download the files locally, but I have no idea how.
Thanks!
If your desire is to trigger the above process as soon as the Zip file is uploaded into the bucket, then you could write an AWS Lambda function.
When the Lambda function is triggered, it will be passed the name of the bucket and object that was uploaded. The function should then:
Download the Zip file to /tmp
Unzip the file (Beware: maximum storage available: 500MB)
Loop through the unzipped files and upload them to the destination bucket
Delete all local files created (to free-up space for any future executions of the function)
For a general example, see: Tutorial: Using AWS Lambda with Amazon S3 - AWS Lambda
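Those steps might look roughly like this in Python (the destination bucket name is an assumption; the key-mapping helper mirrors the foo/bar/zipname/... layout the question asks for):

```python
import os
import zipfile

def destination_keys(zip_key, member_names):
    # Map 'foo/bar/file.zip' + member names to 'foo/bar/file/<member>'
    # destination keys, skipping directory entries.
    base, _ = os.path.splitext(zip_key)
    return {name: f"{base}/{name}"
            for name in member_names if not name.endswith("/")}

def lambda_handler(event, context):
    import boto3  # preinstalled in the AWS Lambda Python runtime
    s3 = boto3.client("s3")
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]
    local_zip = "/tmp/source.zip"  # /tmp is Lambda's only writable path
    s3.download_file(bucket, key, local_zip)
    with zipfile.ZipFile(local_zip) as zf:
        for name, dest in destination_keys(key, zf.namelist()).items():
            with zf.open(name) as member:
                s3.upload_fileobj(member, "destbucketname", dest)
    os.remove(local_zip)  # free /tmp for future invocations
```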
You can use AWS Lambda for this. Set an event notification on your S3 bucket so that a Lambda function is triggered every time a new file arrives. In the function, write Python code that uses boto3 to connect to S3, read each file into a buffer, unzip it with the libraries below, gzip the contents, and re-upload them to S3 under your desired folder/path:
import gzip
import io
import zipfile
import boto3

s3 = boto3.resource("s3")
destinationbucket = s3.Bucket("destbucketname")

zipped = zipfile.ZipFile(zip_buffer)  # zip_buffer: BytesIO of the source object
for file in zipped.namelist():
    final_file_path = file + ".gz"
    with zipped.open(file, "r") as f_in:
        gzipped_content = gzip.compress(f_in.read())
        destinationbucket.upload_fileobj(io.BytesIO(gzipped_content),
                                         final_file_path,
                                         ExtraArgs={"ContentType": "text/plain"})
There is also a tutorial here: https://betterprogramming.pub/unzip-and-gzip-incoming-s3-files-with-aws-lambda-f7bccf0099c9
Arguably Python is simpler to use for your Lambda, but if you are considering Java, I've made a library that manages unzipping of data in AWS S3 utilising stream download and multipart upload.
Unzipping is achieved without keeping data in memory or writing to disk. That makes it suitable for large data files - it has been used to unzip files of size 100GB+.
It is available in Maven Central, here is the GitHub link: nejckorasa/s3-stream-unzip

Copying all files from a directory using a pig command

Hey, I need to copy all files from a local directory to HDFS using Pig.
In the Pig script I am using the copyFromLocal command with a wildcard in the source path,
i.e. copyFromLocal /home/hive/Sample/* /user
It says the source path doesn't exist.
When I use copyFromLocal /home/hive/Sample/ /user, it creates another directory in HDFS named 'Sample', which I don't need.
But when I include the file name, i.e. /home/hive/Sample/sample_1.txt, it works.
I don't need just a single file; I need to copy all the files in the directory without creating a directory in HDFS.
PS: I've also tried *.txt, ?, and ?.txt.
No wildcards work.
Pig's copyFromLocal/copyToLocal commands work only on a single file or a directory; they will never take a series of files or a wildcard. Moreover, Pig concentrates on processing data from/to HDFS. To my knowledge you can't even loop over the files in a directory with ls, because it lists files in HDFS. So for this scenario I would suggest you write a shell script/action (i.e. an fs command) to copy the files from local storage to HDFS.
check this link below for info:
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#copyFromLocal
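A minimal version of that shell workaround, using the paths from the question (the glob works here because the local shell expands it before Hadoop sees the arguments, so no Sample/ directory is created in HDFS):

```shell
hadoop fs -put /home/hive/Sample/* /user
```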