AWS Lambda - dynamically import python module from S3 at runtime

I have a few dozen Python modules, each exposing a common method (e.g. run(params)) but with a different implementation. I also have an AWS Lambda function that needs to call that method on one of those modules, choosing the module based on the Lambda's input.
It seems that I can achieve that by using Layers in Lambda.
However, if I use one single layer for all those modules, then I could see problems with versioning that. If I need to update one module, I'll need to re-deploy that layer, which could bring unexpected changes to other modules.
If I use one layer for each module, then there will be too many layers to manage.
I thought of putting each module into an individual zip file and uploading those zip files to an S3 location. My Lambda would then dynamically read the required zip file from S3 and execute it.
Is that approach viable?
=====================
My current solution is to have something like this:
import io
import zipfile
from types import ModuleType

import boto3

def read_python_script_from_zip(bucket: str, key: str, script_name: str) -> ModuleType:
    s3 = boto3.resource('s3')
    raw = s3.Object(bucket, key).get()['Body'].read()
    zf = zipfile.ZipFile(io.BytesIO(raw), "r")
    scripts = list(filter(lambda f: f.endswith(f"/{script_name}.py"), zf.namelist()))
    if len(scripts) == 0:
        raise ModuleNotFoundError(f"{script_name} not found.")
    if len(scripts) > 1:
        raise ModuleNotFoundError(f"{script_name} is ambiguous.")
    source = zf.read(scripts[0])
    mod = ModuleType(script_name, '')
    exec(source, mod.__dict__)
    return mod
read_python_script_from_zip(source_bucket, source_key, module_name).run(params)
This looks complicated to me, though; I would expect there to be an easier way.

You could try packaging each module as a separate distribution package, which would let you version them separately. However, creating a Python distribution package is not as simple as you might hope, especially if you want to publish it to a private repository hosted on S3.
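For illustration only, a hedged sketch of how such per-module packages might be consumed at runtime: each module published to S3 as its own versioned wheel, installed into Lambda's writable /tmp, then imported. The bucket, key, and module names are hypothetical, and it assumes pip is bundled with the function or a layer (it is not present in the Lambda runtime by default):

```python
import importlib
import subprocess
import sys


def wheel_cache_path(wheel_key: str, cache_dir: str = "/tmp") -> str:
    """Local staging path for a wheel, e.g. 'wheels/a.whl' -> '/tmp/a.whl'."""
    return f"{cache_dir}/{wheel_key.rsplit('/', 1)[-1]}"


def load_packaged_module(bucket: str, wheel_key: str, module_name: str):
    """Download one module's wheel from S3, install it into /tmp, import it.

    Each module being its own wheel means each one can be versioned and
    updated independently, without redeploying a shared layer.
    """
    import boto3  # imported lazily so the helper above stays testable offline

    local_wheel = wheel_cache_path(wheel_key)
    boto3.client("s3").download_file(bucket, wheel_key, local_wheel)
    # Assumes pip is available in the runtime (bundled with the function).
    subprocess.check_call([sys.executable, "-m", "pip", "install",
                           "--target", "/tmp/pkgs", local_wheel])
    if "/tmp/pkgs" not in sys.path:
        sys.path.insert(0, "/tmp/pkgs")
    return importlib.import_module(module_name)

# load_packaged_module("my-bucket", "wheels/job_a-1.2.0-py3-none-any.whl", "job_a").run(params)
```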

Related

How to access a file inside sagemaker entrypoint script

I want to know how to access a file or folder in a private S3 bucket from inside the script.py entry point of a SageMaker job.
I uploaded the file to S3 using the following code:
boto3_client = boto3.Session(
    region_name='us-east-1',
    aws_access_key_id='xxxxxxxxxxx',
    aws_secret_access_key='xxxxxxxxxxx'
)
sess = sagemaker.Session(boto3_client)
role = sagemaker.session.get_execution_role(sagemaker_session=sess)
inputs = sess.upload_data(path="df.csv", bucket=sess.default_bucket(), key_prefix=prefix)
This is the code of the estimator:
import sagemaker
from sagemaker.pytorch import PyTorch
pytorch_estimator = PyTorch(
    entry_point='script.py',
    instance_type='ml.g4dn.xlarge',
    source_dir='./',
    role=role,
    sagemaker_session=sess,
)
Now, inside script.py, I want to access the df.csv file from S3. This is my code inside script.py:
import argparse
import os

import boto3
import pandas as pd
from sagemaker.session import Session
from sagemaker.s3 import S3Downloader

parser = argparse.ArgumentParser()
parser.add_argument("--data-dir", type=str, default=os.environ["SM_CHANNEL_TRAINING"])
args, _ = parser.parse_known_args()

# create session
sess = Session(boto3.Session(region_name='us-east-1'))
S3Downloader.download(s3_uri=args.data_dir,
                      local_path='./',
                      sagemaker_session=sess)
df = pd.read_csv('df.csv')
But this gives the error:
ValueError: Expecting 's3' scheme, got: in /opt/ml/input/data/training., exit code: 1
I think one way is to pass the secret key and access key. But I am already passing sagemaker_session. How can I use that session inside the script.py file and get my file read?
I think this approach is conceptually wrong.
Files within SageMaker jobs (whether training or otherwise) should be passed in during machine initialization. Imagine you have to create a job with 10 machines: do you want each machine to read the file from S3 separately, or have it read once and replicated to all of them?
In the case of a training job, the files should be passed into fit() (for direct code like yours) or as a TrainingInput in the case of a pipeline.
You can follow this official AWS example: "Train an MNIST model with PyTorch"
However, the important part is simply passing a dictionary of input channels to the fit:
pytorch_estimator.fit({'training': s3_input_train})
You can name the channel (in this case 'training') any way you want. The S3 path will be the one where your df.csv lives.
Within your script.py, you can read df.csv directly via an environment variable (or at least make its location configurable via argparse). Generic code with this default will suffice:
parser.add_argument("--train", type=str, default=os.environ["SM_CHANNEL_TRAINING"])
It follows the nomenclature "SM_CHANNEL_" + your_channel_name.
So if you had put "train": s3_path, the variable would have been called SM_CHANNEL_TRAIN.
Then you can read your file directly by pointing to the path corresponding to that environment variable.
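Putting the answer together, script.py could look roughly like this minimal sketch (the load_training_frame helper is hypothetical; df.csv is the file name from the question, and no S3 download or credentials are needed because SageMaker has already copied the channel data into the container before the script runs):

```python
import os

import pandas as pd


def load_training_frame(data_dir=None):
    """Read df.csv from the directory SageMaker mounted for the channel.

    SageMaker copies the 'training' channel's S3 data into the container
    before the entry point runs, so this is a plain local file read.
    """
    if data_dir is None:
        # "SM_CHANNEL_" + upper-cased channel name; the channel is 'training'.
        data_dir = os.environ["SM_CHANNEL_TRAINING"]
    return pd.read_csv(os.path.join(data_dir, "df.csv"))
```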

Scrapy upload files to dynamically created directories in S3 based on field

I've been experimenting with Scrapy for some time now and have recently been trying to upload files (data and images) to an S3 bucket. If the directory is static, it is pretty straightforward and I didn't hit any roadblocks. But what I want to achieve is to dynamically create directories based on a certain field of the extracted data and place the data & media in those directories. The template path, if you will, is below:
s3://<bucket-name>/crawl_data/<account_id>/<media_type>/<file_name>
For example if the account_id is 123, then the images should be placed in the following directory:
s3://<bucket-name>/crawl_data/123/images/file_name.jpeg
and the data file should be placed in the following directory:
s3://<bucket-name>/crawl_data/123/data/file_name.json
I have been able to achieve this for the media downloads (with a somewhat crude way of segregating media types, as of now) using the following custom Files Pipeline:
class CustomFilepathPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        adapter = ItemAdapter(item)
        account_id = adapter["account_id"]
        file_name = os.path.basename(urlparse(request.url).path)
        if ".mp4" in file_name:
            media_type = "video"
        else:
            media_type = "image"
        file_path = f"crawl_data/{account_id}/{media_type}/{file_name}"
        return file_path
The following settings have been configured at a spider level with custom_settings:
custom_settings = {
    'FILES_STORE': 's3://<my_s3_bucket_name>/',
    'FILES_RESULT_FIELD': 's3_media_url',
    'DOWNLOAD_WARNSIZE': 0,
    'AWS_ACCESS_KEY_ID': <my_access_key>,
    'AWS_SECRET_ACCESS_KEY': <my_secret_key>,
}
So, the media part works flawlessly and I have been able to download the images and videos into their separate directories based on account_id in the S3 bucket. My question is:
Is there a way to achieve the same results with the data files as well? Maybe another custom pipeline?
I have tried to experiment with the first example on the Item Exporters page but couldn't make any headway. One thing I thought might help is to use boto3 to establish a connection and then upload the files, but that would possibly require me to segregate the files locally first and upload them together, using a combination of Pipelines (to split the data) and Signals (to upload the files to S3 once the spider is closed).
Any thoughts and/or guidance on this or a better approach would be greatly appreciated.
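One way the boto3 idea mentioned in the question could look: a hedged sketch of a second item pipeline that buffers items per account_id in memory and uploads one JSON file per account when the spider closes. The class name, bucket name, and key layout are hypothetical:

```python
import json
from collections import defaultdict


class S3DataPipeline:
    """Hypothetical Scrapy item pipeline: group scraped items by the
    account_id field, then upload one JSON file per account on close,
    mirroring the crawl_data/<account_id>/data/ path used for media."""

    def __init__(self, bucket="my-crawl-bucket"):  # bucket name is an assumption
        self.bucket = bucket
        self.items = defaultdict(list)

    def process_item(self, item, spider):
        # Buffer the item under its account_id so each account gets its own file.
        self.items[item["account_id"]].append(dict(item))
        return item

    def close_spider(self, spider):
        import boto3  # imported here so the buffering logic stays testable offline

        s3 = boto3.client("s3")
        for account_id, rows in self.items.items():
            key = f"crawl_data/{account_id}/data/items.json"
            s3.put_object(Bucket=self.bucket, Key=key,
                          Body=json.dumps(rows).encode("utf-8"))
```

For very large crawls, buffering everything in memory may not be acceptable; spooling each account's items to a local temp file and uploading those on close would be the same idea with bounded memory.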

boto3 load custom models

For example:
session = boto3.Session()
client = session.client('custom-service')
I know that I can create a JSON file with API definitions under ~/.aws/models and botocore will load it from there. The problem is that I need to do this in an AWS Lambda function, where that seems impossible.
I'm looking for a way to tell boto3 where the custom JSON API definitions are, so it can load them from that path.
Thanks
I have only a partial answer. There's a bit of documentation about botocore's loader module, which is what reads the model files. In a discussion about loading models from ZIP archives, a monkey patch was offered which extracts the ZIP to a temporary filesystem location and then extends the loader's search path to include that location. It doesn't seem like you can load model data directly from memory based on the API, but Lambda does give you some scratch space in /tmp.
Here are the important bits:
import boto3
session = boto3.Session()
session._loader.search_paths.extend(["/tmp/boto"])
client = session.client("custom-service")
The directory structure of /tmp/boto needs to follow the resource loader documentation. The main model file needs to be at /tmp/boto/custom-service/yyyy-mm-dd/service-2.json.
The issue also mentions that alternative loaders can be swapped in using Session.register_component so if you wanted to write a scrappy loader which returned a model straight from memory you could try that too. I don't have any info about how to go about doing that.
Just adding more details:
import boto3
import zipfile
import os

s3_client = boto3.client('s3')
s3_client.download_file('your-bucket', 'model.zip', '/tmp/model.zip')
os.chdir('/tmp')
with zipfile.ZipFile('model.zip', 'r') as archive:
    archive.extractall()
session = boto3.Session()
session._loader.search_paths.extend(["/tmp/boto"])
client = session.client("custom-service")
model.zip is just a compressed file that contains:
Archive: model.zip
Length Date Time Name
--------- ---------- ----- ----
0 11-04-2020 16:44 boto/
0 11-04-2020 16:44 boto/custom-service/
0 11-04-2020 16:44 boto/custom-service/2018-04-23/
21440 11-04-2020 16:44 boto/custom-service/2018-04-23/service-2.json
Just remember to have the proper lambda role to access S3 and your custom-service.
boto3 also allows setting the AWS_DATA_PATH environment variable which can point to a directory path of your choice.
[boto3 Docs]
Everything zipped into a Lambda layer is extracted under /opt/.
Let's assume all your custom models live under a models/ folder. When this folder is mounted to the lambda environment, it'll live under /opt/models/.
Simply specify AWS_DATA_PATH=/opt/models/ in the Lambda configuration and boto3 will pick up models in that directory.
This is better than fetching models from S3 during runtime, unpacking, and then modifying session parameters.

scrapyd multiple spiders writing items to same file

I have a scrapyd server with several spiders running at the same time; I start the spiders one by one using the schedule.json endpoint. All spiders write their contents to a common file using a pipeline:
class JsonWriterPipeline(object):
    def __init__(self, json_filename):
        # self.json_filepath = json_filepath
        self.json_filename = json_filename
        self.file = open(self.json_filename, 'wb')

    @classmethod
    def from_crawler(cls, crawler):
        save_path = '/tmp/'
        json_filename = crawler.settings.get('json_filename', 'FM_raw_export.json')
        completeName = os.path.join(save_path, json_filename)
        return cls(completeName)

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
Once the spiders are running, I can see them collecting data correctly: items are stored in the XXXX.jl files and the spiders work fine. However, the crawled contents are not reflected in the common file. The spiders seem to work well, but the pipeline is not doing its job of collecting the data into the common file.
I also noticed that only one spider writes to the file at a time.
I don't see any good reason to do what you're doing :) You can change the json_filename setting by passing arguments in your scrapyd schedule.json request. Then you can make each spider generate a slightly different file that you merge with post-processing or at query time. You can also write JSON files similar to what you have by just setting the FEED_URI value (example). If you write to a single file simultaneously from multiple processes (especially when you open in 'wb' mode), you're asking for corrupted data.
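As an illustration of passing a per-job setting through schedule.json, a small sketch (the project, spider, and host names are placeholders; scrapyd's `setting` form field overrides a Scrapy setting for just that job):

```python
def schedule_payload(project: str, spider: str, json_filename: str) -> dict:
    """Form data for scrapyd's schedule.json endpoint. The 'setting' field
    overrides a Scrapy setting for this one job, so each scheduled spider
    can write to its own file via the pipeline's json_filename setting."""
    return {
        "project": project,
        "spider": spider,
        "setting": f"json_filename={json_filename}",
    }

# e.g. with requests (scrapyd's default port is 6800):
# requests.post("http://localhost:6800/schedule.json",
#               data=schedule_payload("myproject", "spider_a", "spider_a.json"))
```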
Edit:
After understanding a bit better what you need - in this case - it's scrapyd starting multiple crawls running different spiders where each one crawls a different website. The consumer process is monitoring a single file continuously.
There are several solutions including:
named pipes
Relatively easy to implement and ok for very small Items only (see here)
RabbitMQ or some other queueing mechanism
Great solution but might be a bit of an overkill
A database e.g. SQLite based solution
Nice and simple but likely requires some coding (custom consumer)
A nice inotifywait-based or other filesystem monitoring solution
Nice and likely easy to implement
The last one seems like the most attractive option to me. When a scrapy crawl finishes (the spider_closed signal), move, copy or soft-link the FEED_URI file to a directory that you monitor with a script like this. mv and ln are atomic unix operations, so you should be fine. Hack the script to append each new file to the tmp file that you feed once to your consumer program.
By using this way, you use the default feed exporters to write your files. The end-solution is so simple that you don't need a pipeline. A simple Extension should fit the bill.
On an extensions.py in the same directory as settings.py:
from scrapy import signals
from scrapy.exceptions import NotConfigured

class MoveFileOnCloseExtension(object):
    def __init__(self, feed_uri):
        self.feed_uri = feed_uri

    @classmethod
    def from_crawler(cls, crawler):
        # instantiate the extension object
        feed_uri = crawler.settings.get('FEED_URI')
        ext = cls(feed_uri)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        # return the extension object
        return ext

    def spider_closed(self, spider):
        # Move the file to the proper location, e.g.:
        # os.rename(self.feed_uri, ... destination path...)
        pass
On your settings.py:
EXTENSIONS = {
    'myproject.extensions.MoveFileOnCloseExtension': 500,
}

Renaming an Amazon CloudWatch Alarm

I'm trying to organize a large number of CloudWatch alarms for maintainability, and the web console grays out the name field on an edit. Is there another method (preferably something scriptable) for updating the name of CloudWatch alarms? I would prefer a solution that does not require any programming beyond simple executable scripts.
Here's a script we use to do this for the time being:
import sys
import boto

def rename_alarm(alarm_name, new_alarm_name):
    conn = boto.connect_cloudwatch()

    def get_alarm():
        alarms = conn.describe_alarms(alarm_names=[alarm_name])
        if not alarms:
            raise Exception("Alarm '%s' not found" % alarm_name)
        return alarms[0]

    alarm = get_alarm()
    # work around boto comparison serialization issue
    # https://github.com/boto/boto/issues/1311
    alarm.comparison = alarm._cmp_map.get(alarm.comparison)
    alarm.name = new_alarm_name
    conn.update_alarm(alarm)
    # update actually creates a new alarm because the name has changed, so
    # we have to manually delete the old one
    get_alarm().delete()

if __name__ == '__main__':
    alarm_name, new_alarm_name = sys.argv[1:3]
    rename_alarm(alarm_name, new_alarm_name)
It assumes you're either on an ec2 instance with a role that allows this, or you've got a ~/.boto file with your credentials. It's easy enough to manually add yours.
Unfortunately, it looks like this is not currently possible.
I looked around for the same solution, but it seems neither the console nor the CloudWatch API provides that feature.
Note: we can, however, copy the existing alarm with the same parameters and save it under a new name.
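Building on that note, a hedged sketch of the copy-then-delete workaround with modern boto3 (the legacy boto library in the script above is long deprecated). The field list is a best-effort subset of what put_metric_alarm accepts for simple metric alarms and would need extending for metric-math or composite alarms:

```python
# Fields returned by describe_alarms that put_metric_alarm also accepts;
# read-only fields such as AlarmArn and StateValue must be dropped.
# This subset covers simple metric alarms only (an assumption).
_COPYABLE = (
    "AlarmDescription", "ActionsEnabled", "OKActions", "AlarmActions",
    "InsufficientDataActions", "MetricName", "Namespace", "Statistic",
    "Dimensions", "Period", "EvaluationPeriods", "Threshold",
    "ComparisonOperator", "Unit", "TreatMissingData",
)


def copyable_params(alarm: dict) -> dict:
    """Keep only the keys of a described alarm that can be written back."""
    return {k: alarm[k] for k in _COPYABLE if k in alarm}


def rename_alarm(old_name: str, new_name: str) -> None:
    """CloudWatch has no rename API, so create a copy under the new name
    and then delete the original."""
    import boto3  # imported lazily so copyable_params stays testable offline

    cw = boto3.client("cloudwatch")
    alarms = cw.describe_alarms(AlarmNames=[old_name])["MetricAlarms"]
    if not alarms:
        raise ValueError(f"Alarm {old_name!r} not found")
    cw.put_metric_alarm(AlarmName=new_name, **copyable_params(alarms[0]))
    cw.delete_alarms(AlarmNames=[old_name])
```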