Upload folder with sub-folders and files on S3 using python - amazon-s3

I have a folder with a bunch of subfolders and files which I am fetching from a server and assigning to a variable. The folder structure is as follows:
└── main_folder
    ├── folder
    │   ├── folder
    │   │   ├── folder
    │   │   │   └── a.json
    │   │   ├── folder
    │   │   │   ├── folder
    │   │   │   │   └── b.json
    │   │   │   ├── folder
    │   │   │   │   └── c.json
    │   │   │   └── folder
    │   │   │       └── d.json
    │   │   └── folder
    │   │       └── e.json
    ├── folder
    │   └── f.json
    └── folder
        └── i.json
Now I want to upload this main_folder to an S3 bucket with the same structure using boto3. boto3 has no built-in way to upload a folder to S3.
I have seen the solutions at the links below, but they fetch the files from the local machine, whereas I am fetching the data from a server and assigning it to a variable.
Uploading a folder full of files to a specific folder in Amazon S3
upload a directory to s3 with boto
https://gist.github.com/feelinc/d1f541af4f31d09a2ec3
Has anybody faced the same type of issue?
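Since the data is already in memory (fetched from a server and held in a variable) rather than sitting on disk, one option is to skip upload_file entirely and write each object with put_object, using a key that mirrors the folder structure; S3 has no real folders, so the "/" separators in the key are enough to reproduce the hierarchy. Below is a minimal sketch; the bucket name and the files dict are placeholders for whatever you fetched from the server:

import boto3

s3 = boto3.client("s3")

# Placeholder: relative paths inside main_folder mapped to the content
# fetched from the server and assigned to a variable.
files = {
    "main_folder/folder/folder/folder/a.json": '{"example": 1}',
    "main_folder/folder/folder/folder/folder/b.json": '{"example": 2}',
}

for relative_path, content in files.items():
    # The object key reproduces the directory structure inside the bucket.
    s3.put_object(Bucket="my-bucket", Key=relative_path, Body=content.encode("utf-8"))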

Below is code that works for me, pure python3.
""" upload one directory from the current working directory to aws """
from pathlib import Path
import os
import glob
import boto3
def upload_dir(localDir, awsInitDir, bucketName, tag, prefix='/'):
"""
from current working directory, upload a 'localDir' with all its subcontents (files and subdirectories...)
to a aws bucket
Parameters
----------
localDir : localDirectory to be uploaded, with respect to current working directory
awsInitDir : prefix 'directory' in aws
bucketName : bucket in aws
tag : tag to select files, like *png
NOTE: if you use tag it must be given like --tag '*txt', in some quotation marks... for argparse
prefix : to remove initial '/' from file names
Returns
-------
None
"""
s3 = boto3.resource('s3')
cwd = str(Path.cwd())
p = Path(os.path.join(Path.cwd(), localDir))
mydirs = list(p.glob('**'))
for mydir in mydirs:
fileNames = glob.glob(os.path.join(mydir, tag))
fileNames = [f for f in fileNames if not Path(f).is_dir()]
rows = len(fileNames)
for i, fileName in enumerate(fileNames):
fileName = str(fileName).replace(cwd, '')
if fileName.startswith(prefix): # only modify the text if it starts with the prefix
fileName = fileName.replace(prefix, "", 1) # remove one instance of prefix
print(f"fileName {fileName}")
awsPath = os.path.join(awsInitDir, str(fileName))
s3.meta.client.upload_file(fileName, bucketName, awsPath)
if __name__ == '__main__':
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--localDir", help="which dir to upload to aws")
parser.add_argument("--bucketName", help="to which bucket to upload in aws")
parser.add_argument("--awsInitDir", help="to which 'directory' in aws")
parser.add_argument("--tag", help="some tag to select files, like *png", default='*')
args = parser.parse_args()
# cd whatever is above your dir, then run it
# (below assuming this script is in ~/git/hu-libraries/netRoutines/uploadDir2Aws.py )
# in the example below you have directory structure ~/Downloads/IO
# you copy full directory of ~/Downloads/IO to aws bucket markus1 to 'directory' 2020/IO
# NOTE: if you use tag it must be given like --tag '*txt', in some quotation marks...
# cd ~/Downloads
# python ~/git/hu-libraries/netRoutines/uploadDir2Aws.py --localDir IO --bucketName markus1 --awsInitDir 2020
upload_dir(localDir=args.localDir, bucketName=args.bucketName,
awsInitDir=args.awsInitDir, tag=args.tag)

I had to solve this problem myself, so I thought I would include a snippet of my code here.
I also had the requirement to filter for specific file types, and to upload only the directory contents (vs the directory itself).
import logging
from pathlib import Path
from typing import Union

import boto3

log = logging.getLogger(__name__)


def upload_dir(
    self,
    local_dir: Union[str, Path],
    s3_path: str = "/",
    file_type: str = "",
    contents_only: bool = False,
) -> dict:
    """
    Upload the content of a local directory to a bucket path.

    Args:
        local_dir (Union[str, Path]): Directory to upload files from.
        s3_path (str, optional): The path within the bucket to upload to.
            If omitted, the bucket root is used.
        file_type (str, optional): Upload files with this extension only, e.g. txt.
        contents_only (bool): Copy only the directory contents to the
            specified path, not the directory itself.

    Returns:
        dict: key:value pairs of file_name:upload_status, where upload_status
            is True if the file was uploaded and False if it failed.
    """
    resource = boto3.resource(
        "s3",
        aws_access_key_id="xxx",
        aws_secret_access_key="xxx",
        endpoint_url="xxx",
        region_name="xxx",
    )
    status_dict = {}
    local_dir_path = Path(local_dir).resolve()
    log.debug(f"Directory to upload: {local_dir_path}")
    all_subdirs = local_dir_path.glob("**")
    for dir_path in all_subdirs:
        log.debug(f"Searching for files in directory: {dir_path}")
        file_names = dir_path.glob(f"*{('.' + file_type) if file_type else ''}")
        # Only keep regular files
        file_names = [f for f in file_names if f.is_file()]
        log.debug(f"Files found: {file_names}")
        for file_name in file_names:
            s3_key = str(Path(s3_path) / file_name.relative_to(
                local_dir_path if contents_only else local_dir_path.parent
            ))
            log.debug(f"S3 key to upload: {s3_key}")
            status_dict[str(file_name)] = self.upload_file(s3_key, file_name)
    return status_dict
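The snippet above is cut out of a larger class (note the self parameter and the self.upload_file(...) call at the end). Purely as an illustration of how it might be wired together, here is a hypothetical wrapper that reuses the imports above; S3Bucket and its upload_file helper are my own names, not part of the original answer:

# Hypothetical wrapper class (illustrative only); upload_dir() from the
# snippet above is attached to it as a method at the bottom.
class S3Bucket:
    def __init__(self, bucket_name: str):
        self.bucket_name = bucket_name
        self.client = boto3.client("s3")

    def upload_file(self, s3_key: str, file_name: Path) -> bool:
        """Upload a single file; return True on success, False on failure."""
        try:
            self.client.upload_file(str(file_name), self.bucket_name, s3_key)
            return True
        except Exception:
            log.exception(f"Failed to upload {file_name}")
            return False


S3Bucket.upload_dir = upload_dir  # reuse the function above as a method

# Example usage (names are placeholders):
# bucket = S3Bucket("my-bucket")
# statuses = bucket.upload_dir("reports", s3_path="backups", file_type="json")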

Related

Terraform BigQuery replaces table & deletes the data when schema gets updated

I have roughly the below folder structure:
.
├── locals.tf
├── main.tf
├── modules
│   ├── bigquery
│   │   ├── main.tf
│   │   ├── schema
│   │   └── variables.tf
│   ├── bigquery_tables
│   │   ├── main.tf
│   │   ├── schema
│   │   └── variables.tf
│   ├── bigquery_views
│   │   ├── main.tf
│   │   ├── queries
│   │   ├── schema
│   │   └── variables.tf
│   ├── cloud_composer
│   ├── list_projects
├── providers.tf
├── storage.tf
├── variables.tf
└── versions.tf
and my main.tf in bigquery_tables is:
resource "google_bigquery_table" "bq_tables" {
for_each = { for table in var.bigquery_dataset_tables : table.table_id => table }
project = var.project_id
dataset_id = each.value.dataset_id
table_id = each.value.table_id
schema = file(format("${path.module}/schema/%v.json", each.value.file_name))
deletion_protection = false
dynamic "time_partitioning" {
for_each = try(each.value.time_partition, false) == false ? [] : [each.value.time_partition]
content {
type = each.value.partitioning_type
field = each.value.partitioning_field
}
}
}
The issue I am facing is that, since things are in the development stage, we have frequent schema updates.
Each change in schema causes Terraform to replace the BigQuery table, which also leads to data loss in BigQuery.
Can someone suggest what I should add to my resource block to avoid replacing the table and losing data?
I am also unsure how I can add "external_data_configuration" to my current block as per https://github.com/hashicorp/terraform-provider-google/issues/10919.
Unfortunately, if you change the schema structure of the BigQuery table, you will get this behaviour.
For the deletion_protection param, I think it's better to set it to true to prevent data loss, or not to set it at all (it is true by default).
Since you are in dev mode, the solution I have seen is to use Terraform workspaces and run your schema updates in a separate workspace.
It will create a new dataset or table on each apply. Example:
locals.tf file:
locals {
  workspace = terraform.workspace != "default" ? "${terraform.workspace}_" : ""
}
main.tf file:
resource "google_bigquery_table" "bq_tables" {
  for_each = { for table in var.bigquery_dataset_tables : table.table_id => table }

  project             = var.project_id
  dataset_id          = "${local.workspace}${each.value.dataset_id}"
  table_id            = each.value.table_id
  schema              = file(format("${path.module}/schema/%v.json", each.value.file_name))
  deletion_protection = false

  dynamic "time_partitioning" {
    for_each = try(each.value.time_partition, false) == false ? [] : [each.value.time_partition]
    content {
      type  = each.value.partitioning_type
      field = each.value.partitioning_field
    }
  }
}
In this example I used the workspace as a prefix for the dataset ID: if the default workspace is used, the prefix is empty; otherwise it is the given workspace name:
dataset=mydataset
- workspace=default => dataset=mydataset
- workspace=evolfeature1 => dataset=evolfeature1_mydataset

Why can't files introduced by require.context() be dynamically updated in Vue?

I am building a static markdown blog website using Vue 3, and I am using require.context() to load markdown files.
This is the project structure; I load files from static/posts.
├── dist
│   ├── assets
│   │   └── static
│   │       └── posts
│   │           ├── dev-first-vue3-todolist.md
│   │           ├── dev-fix-missing-xcrun.md
├── package.json
├── src
│   ├── App.vue
│   ├── main.js
├── static
│   └── posts
│       ├── dev-first-vue3-todolist.md
│       ├── dev-fix-missing-xcrun.md
└── vue.config.js
Here's how I load these markdown files.
let context = require.context("../static/posts", true, /\.md$/, "sync");
let keys = context.keys();
// the markdown file list
const postRawList = [];
keys.forEach((key) => {
  const raw = context(key).default;
  postRawList.push(raw);
});
It worked well. I can read postRawList from static/posts and render the posts with a markdown parser.
The point is, I want my blog website to be static, so when I build this Vue app I use copy-webpack-plugin to copy these markdown files from static/posts to dist/assets/static/posts, and that worked well too.
config.plugins.push(
  new CopyWebpackPlugin({
    patterns: [
      {
        from: path.resolve(__dirname, "static/posts"),
        to: path.resolve(__dirname, "dist/assets/static/posts"),
      },
    ],
  })
);
I want to manage my markdown posts only by adding or deleting markdown files in dist/assets/static/posts.
But after I change the markdown files in dist/assets/static/posts (for example, delete dev-first-vue3-todolist.md), the markdown object list (postRawList) is not updated.
The list is the same as the one built from static/posts, and no matter how I change the files in these two posts folders, it is not updated.
I guess this is caused by the "require cache", and I tried to delete the cache after changing the files, but it didn't work.
delete require.cache[context.id];
So I want to know why the file objects are not updated after I build the app and change the files in the markdown directory (loaded by require.context()).
This is my first time asking a question on Stack Overflow, and I'm sorry if I didn't make it clear.

Terraform Outputs across modules

I am struggling to work out how to pass outputs from one module and consume them in another.
My folder structure:
.
├── main.tf
├── modules
│   ├── cloudwatch-event
│   │   ├── basic_event_rule.tf
│   │   ├── basic_event_target.tf
│   │   └── variables.tf
│   └── lambda
│      ├── basic_lambda.tf
│      ├── output.tf
│      ├── lambda.py
│      └── variables.tf
├── lambda
│   ├── main.tf
│   └── variables.tf
└── terraform.tfvar
In order to add scheduling to the Lambda, I need to consume the Lambda ARN in the CloudWatch module.
The lambda - basic_lambda.tf
resource "aws_lambda_function" "lambda_function" {
The lambda - outputs.tf
output "lambda_arn" {
value = "${aws_lambda_function.lambda_function.arn}"
In my lambda application module, I have this in lambda/main.tf:
module "cloudwatch-event" {
  source     = "../modules/cloudwatch-event"
  lambda_arn = "${module.lambda.lambda_arn}"
}

module "lambda" {
  source = "../modules/lambda"
}
My lambda/variables.tf includes the lambda_arn variable as a string
variable "lambda_arn" {
type = "string"
}
The root main file looks like this:
provider "aws" {
region = var.aws_region
}
module "accesskey-lambda" {
source = "./lambda/"
}
Running Terraform, I get this:
Error: Missing required argument
on main.tf line 5, in module "accesskey-lambda":
5: module "accesskey-lambda" {
The argument "lambda_arn" is required, but no definition was found.
Then adding it to the root main file doesn't resolve my issue.
Thanks,
Nick
Solved, I had a typo.
In cloudwatch/basic_event_target.tf:
arn = "${var.lambda_arn}"
Then in the cloudwatch variables file:
variable "lambda_arn" {
type = string
}
The module then needed
module "cloudwatch-event" {
source = "../modules/cloudwatch-event"
lambda_arn = "${module.lambda.lambda_arn}"
}

Read Karate config from YAML

I would like to define environment-specific properties in a .yml/.yaml file. Therefore I created the following test.yaml:
baseUrl: 'http://localhost:1234'
Next, I wrote this karate-config.js:
function() {
  var env = karate.env;
  if (!env) {
    env = 'test'; // default is test
  }
  // config = read(env + '.yaml')
  var config = read('/home/user/git/karate-poc/src/test/java/test.yaml');
  // var config = read('test.yaml');
  // var config = read('classpath:test.yaml');
  return config;
}
As seen at https://github.com/intuit/karate#reading-files, the read() function should be known by Karate; however, I'm not sure whether this applies only to .feature files or to karate-config.js too.
Unfortunately, none of the above read()s work, as I'm getting this error:
Caused by: com.intuit.karate.exception.KarateException: javascript function call failed: could not find or read file: /home/user/git/karate-poc/src/test/java/test.yaml, prefix: NONE
at com.intuit.karate.Script.evalFunctionCall(Script.java:1602)
I'm sure that the file exists and is readable.
Am I doing something wrong or is my approach not supported? If it's not supported, what would be the recommended way to read the configuration based on the environment from a YAML file (once) in order to use it in (multiple) .feature files?
Thank you very much
Edit: Tree structure of the project:
.
├── build.gradle
├── gradle
│   └── wrapper
│       ├── gradle-wrapper.jar
│       └── gradle-wrapper.properties
├── gradle.properties
├── gradlew
├── gradlew.bat
└── src
    └── test
        └── java
            ├── karate
            │   └── rest
            │       ├── rest.feature
            │       └── RestRunner.java
            ├── karate-config.js
            └── test.yaml
Run with ./gradlew test
In JS, use the karate object, which is explained here: https://github.com/intuit/karate#the-karate-object
So this should work:
var config = karate.read('classpath:test.yaml');

How to add image from within 'content' subfolder - pelican

I have this structure:
content/
├── applications
│   └── 2017
│       └── 08
│           └── 30
│               ├── article.md
│               └── forecast1.png
I want the image files to be in the same directories as the md files so that they can be put to:
ARTICLE_SAVE_AS = 'posts/{date:%Y}/{date:%b}/{date:%d}/{slug}/index.html'
I have STATIC_PATHS = ['static_files', 'content']; however, the
[alt]({attach}applications/2017/08/30/forecast1.png)
gives error:
WARNING: Unable to find `applications/2017/08/30/forecast1.png`, skipping url replacement.
How can I include an image in my md file in this simple case?
EDIT
So I changed the config (applications is my category) to:
PATH = 'content'
STATIC_PATHS = ['static_files','applications/2017/08/30/img', 'applications/2017/09/01/img']
ARTICLE_PATHS = ['applications', 'cat2', 'cat3']
I also added the ! before the [alt](), and still the images are not copied over to output.
EDIT2
It works when I apply the edit above and change the reference to ({attach}img/forecast1.png).
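In other words, a sketch of the working combination assembled from EDIT and EDIT2 above (the paths are the ones from the question, shown only to summarise):

# pelicanconf.py (summary of the combination from EDIT and EDIT2)
PATH = 'content'
STATIC_PATHS = ['static_files',
                'applications/2017/08/30/img',
                'applications/2017/09/01/img']
ARTICLE_PATHS = ['applications', 'cat2', 'cat3']

# and in the article markdown:
# ![alt]({attach}img/forecast1.png)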
This works for me (following this):
content/
├── p001
│   ├── myArticle001.md
│   └── img001
│       ├── myPic1.png
│       └── myPic2.png
├── p002
│   ├── myArticle002.md
│   └── img002
│       ├── myPic1.png
│       └── myPic2.png
In pelicanconf.py set:
PATH = 'content'
STATIC_PATHS = ['p001','p002']
ARTICLE_PATHS = STATIC_PATHS
In the md-files set:
![pic_A1]({attach}img001/myPic1.png)
![pic_A2]({attach}img001/myPic2.png)
and
![pic_B1]({attach}img002/myPic1.png)
![pic_B2]({attach}img002/myPic2.png)
Probably you just missed a ! at the beginning of the command. So you might try this:
![alt]({attach}applications/2017/08/30/forecast1.png)
Or try this:
PATH = 'content'
STATIC_PATHS = ['applications']
ARTICLE_PATHS = STATIC_PATHS
...
![alt]({attach}2017/08/30/forecast1.png)