PySpark project structure - amazon-s3

I am very new to PySpark and the AWS domain. I have a project to complete where I need to read files from various data sources (a database, XML files) and store and process the data in the AWS cloud. Is there a project structure that I should follow? I am thinking of having the following folders:
config/
input/
utils/
process/
readme.txt

Related

Get all the contents of a Data Lake Gen2 folder in a list (Azure Synapse workspace)

I am brand new to Azure.
I have created a Data Lake Gen2 storage account and a container inside it, and saved some files and folders in it. I want to list all the files and folders in an Azure Synapse notebook so that I can process a particular file. I am using this command:
mssparkutils.fs.ls("abfss://iogen2@demoadlsgen2.dfs.core.windows.net/first/")
but it is giving me only one output, like:
[FileInfo(path=abfss://iogen2@demoadlsgen2.dfs.core.windows.net/first/stocks, name=stocks, size=0)]
I want my answer in a list like:
'abfss://iogen2@demoadlsgen2.dfs.core.windows.net/first/stocks/',
'abfss://iogen2@demoadlsgen2.dfs.core.windows.net/first/stocks/2022-03-06/',
'abfss://iogen2@demoadlsgen2.dfs.core.windows.net/first/stocks/2022-03-06/csv_files/',
'abfss://iogen2@demoadlsgen2.dfs.core.windows.net/first/stocks/2022-03-06/csv_files/demo.csv'
Apparently when I am using os.listdir it is giving an error:
FileNotFoundError: [Errno 2] No such file or directory
Can anyone please help me with this?
As per the repro from my end, mssparkutils.fs.ls shows all the files in the folder; I'm able to get all of the files contained in a folder named sample.
If you want to use os.listdir, you need to use the file mount/unmount API in Synapse so that the storage is accessible through a local path.
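To get a flat list of every path under the folder, you can walk the listing recursively. A minimal sketch, assuming the FileInfo entries returned by mssparkutils.fs.ls expose path and isDir attributes (check the FileInfo fields in your runtime if not):

from notebookutils import mssparkutils  # available by default in Synapse notebooks

def list_recursive(path):
    # Collect every file and folder path under `path` into a flat list.
    paths = []
    for entry in mssparkutils.fs.ls(path):
        paths.append(entry.path)
        if entry.isDir:  # assumes FileInfo exposes an isDir flag
            paths.extend(list_recursive(entry.path))
    return paths

all_paths = list_recursive("abfss://iogen2@demoadlsgen2.dfs.core.windows.net/first/")
for p in all_paths:
    print(p)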

Inspect Parquet in S3 from Command Line

I can download a single snappy.parquet partition file with:
aws s3 cp s3://bucket/my-data.parquet/my-data-0000.snappy.parquet ./my-data-0000.snappy.parquet
And then use:
parquet-tools head my-data-0000.snappy.parquet
parquet-tools schema my-data-0000.snappy.parquet
parquet-tools meta my-data-0000.snappy.parquet
But I'd rather not download the file, and I'd rather not have to specify a particular snappy.parquet file. Instead, I'd like to point at the prefix: "s3://bucket/my-data.parquet".
Also what if the schema is different in different row groups across different partition files?
Following instructions here I downloaded a jar file and ran
hadoop jar parquet-tools-1.9.0.jar schema s3://bucket/my-data.parquet/
But this resulted in the error: No FileSystem for scheme "s3".
This answer seems promising, but only for reading from HDFS. Any solution for S3?
I wrote the tool clidb to help with this kind of "quick peek at a parquet file in S3" task.
You should be able to do:
pip install "clidb[extras]"
clidb s3://bucket/
and then click to load parquet files as views to inspect and run SQL against.
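If you'd rather do the quick peek from plain Python instead of a TUI, a minimal sketch using pyarrow and s3fs (my own choice of libraries, not part of clidb) can read the schema and metadata of each partition file directly from S3, which also lets you compare schemas across partition files:

import pyarrow.parquet as pq
import s3fs  # pip install pyarrow s3fs

fs = s3fs.S3FileSystem()  # picks up credentials the same way the AWS CLI does

# List every partition file under the dataset prefix instead of naming one by hand.
keys = fs.glob("bucket/my-data.parquet/*.parquet")

for key in keys:
    with fs.open(key, "rb") as f:
        pf = pq.ParquetFile(f)
        print(key)
        print(pf.schema_arrow)  # roughly what parquet-tools schema shows
        print(pf.metadata)      # row groups, sizes, codecs, as in parquet-tools meta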

How to use a common file across many Terraform builds?

I have a directory structure as follows to build the Terraform resources for my project:
s3/
    main.tf
    variables.tf
    tag_variables.tf
ec2/
    main.tf
    variables.tf
    tag_variables.tf
vpc/
    main.tf
    variables.tf
    tag_variables.tf
When I want to build or change something in s3, I run Terraform in the s3 directory.
When I want to build the ec2 resources, I cd into that folder and do a Terraform build there.
I run them one at a time.
At the moment I have a list of tags defined as variables, inside each directory.
This is the same file, copied many times to each directory.
Is there a way to avoid copying the same tags file into all of the folders? I'm looking for a solution where I only have only one copy of the tags file.
Terraform does offer a solution of sorts using locals, but this still needs the file to be repeated in each directory.
What I tried:
I tried putting the variables in a module, but variables are internal to a module; modules are not designed to share code into the calling configuration.
I tried making the variables an output from the module, but Terraform didn't like that either.
Does anyone have a way to achieve one central tags file that gets used everywhere? I'm thinking of something like an include of a source code chunk from elsewhere? Or any other solution would be great.
Thanks for that advice ydaetskcoR, I used symlinks and this works perfectly.
I placed the .tf file containing the tag list in a common directory. Each Terraform project now has a symbolic link to that file. (Plus I've linked some other common files in this way, like provider.tf.)
For the benefit of others, in Linux a symbolic link is a small file that is a link to another file, but it can be used exactly like the original file.
This allows many different and separate Terraform projects to refer to the common file.
Note: If you are looking to modularise common Terraform functions, have a look at Terraform modules, these are designed to modularise your code. They can't be used for the simple use case above, however.
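As an illustration of the layout described above (the common directory name and the tags filename are my own choice, not something from the question), the tree ends up looking like this, with each project holding a symlink instead of a copy:

common/
    tag_variables.tf
    provider.tf
s3/
    main.tf
    variables.tf
    tag_variables.tf -> ../common/tag_variables.tf
ec2/
    main.tf
    variables.tf
    tag_variables.tf -> ../common/tag_variables.tf

Terraform loads every .tf file in the working directory, so the linked file is picked up exactly as if it were a local copy.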

Can make figure out file dependencies in AWS S3 buckets?

The source directory contains numerous large image and video files.
These files need to be uploaded to an AWS S3 bucket with the aws s3 cp command. For example, as part of this build process, I copy my image file my_image.jpg to the S3 bucket like this: aws s3 cp my_image.jpg s3://mybucket.mydomain.com/
I have no problem doing this copy to AWS manually. And I can script it too. But I want to use the makefile to upload my image file my_image.jpg iff the same-named file in my S3 bucket is older than the one in my source directory.
Generally make is very good at this kind of dependency checking based on file dates. However, is there a way I can tell make to get the file dates from files in S3 buckets and use that to determine if dependencies need to be rebuilt or not?
The AWS CLI has an s3 sync command that can take care of a fair amount of this for you. From the documentation:
A s3 object will require copying if:
the sizes of the two s3 objects differ,
the last modified time of the source is newer than the last modified time of the destination,
or the s3 object does not exist under the specified bucket and prefix destination.
I think you'll need to make S3 look like a file system to make this work. On Linux it is common to use FUSE to build adapters like that, and there are several projects that present S3 as a local filesystem. I haven't tried any of them, but it seems like the way to go.
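make itself can't stat objects in S3, but the date comparison can live in a small helper script that a makefile target shells out to. A minimal sketch, assuming boto3 is installed and credentials are configured (the bucket and key names here are just placeholders):

import os
import boto3

def upload_if_newer(local_path, bucket, key):
    # Copy local_path to s3://bucket/key only if the local mtime is newer than the object's LastModified.
    s3 = boto3.client("s3")
    local_mtime = os.path.getmtime(local_path)
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
        remote_mtime = head["LastModified"].timestamp()
    except s3.exceptions.ClientError:
        remote_mtime = 0.0  # object does not exist yet, so always upload
    if local_mtime > remote_mtime:
        s3.upload_file(local_path, bucket, key)
        return True
    return False

upload_if_newer("my_image.jpg", "mybucket.mydomain.com", "my_image.jpg")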

How to download multiple file objects of particular directory from AWS s3 using ASP.Net Core?

How can I download multiple file objects from a particular directory (e.g. a folder named by batch number, containing multiple subfolders with files such as images or PDF files) from AWS S3 using ASP.NET Core?
In this regard I would recommend creating a zipped file with all the directory contents and downloading it as one file.
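The question targets ASP.NET Core, but as a rough sketch of the flow (list every object under the prefix, write each one into a single zip archive, return that archive as the download), here it is in Python with boto3; the same steps translate to the AWS SDK for .NET. The bucket and prefix names are placeholders:

import io
import zipfile

import boto3

def zip_s3_prefix(bucket, prefix):
    # Collect every object under the given prefix into one in-memory zip archive.
    s3 = boto3.client("s3")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
                archive.writestr(obj["Key"], body)
    buffer.seek(0)
    return buffer  # stream this back to the client as a single downloadable file

zipped = zip_s3_prefix("my-bucket", "batch-001/")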