Snakemake add checkpoint output to DAG and report - snakemake

What I currently want to achieve is to add all the files created in "someDir/" to my DAG and include them in my report. The problem is mainly that those files are created by the checkpoint, so I can't define them as wildcards beforehand. The allFiles(wildcards) function currently returns the directory and not the files.
checkpoint someRule:
    input:
        "output/some.rds"
    output:
        directory("someDir/")

def allFiles(wildcards):
    checkpoints.someRule.get(**wildcards).output[0]  # is "output/some.rds" instead of wildcards
    filenames, = glob_wildcards("someDir/{filenames}")
    return expand("someDir/{fn}", fn=filenames)

rule all:
    input:
        allFiles

Found a workaround. If someone has the same problem, this worked for me.
def aggregate_input(wildcards):
    checkpoint_output = checkpoints.someRule.get(**wildcards).output[0]
    return expand('someDir/{i}',
                  i=glob_wildcards(os.path.join(checkpoint_output, '{i}')).i)
There still remains the problem that the DAG doesn't include the checkpoint "someRule".
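For reference, a minimal sketch of the pattern from the Snakemake documentation for this situation, with a dedicated aggregation rule sitting between the checkpoint and rule all (the aggregate rule, its output file name, and the shell command are made-up illustrations, not part of my Snakefile):

rule all:
    input:
        "aggregated.txt"

rule aggregate:
    input:
        aggregate_input
    output:
        "aggregated.txt"
    shell:
        "cat {input} > {output}"

Because aggregate_input calls checkpoints.someRule.get(), Snakemake should re-evaluate the DAG once someRule has run, and the checkpoint then shows up as a dependency of aggregate.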

Related

Writing preprocessed output CSV to S3 from Scikit Learn image on Sagemaker

My problem: writing out a CSV file to S3 from inside a Sagemaker SKLearn image. I know how to write CSVs to S3 from a notebook - that is working fine. It's within the docker image that I'm unable to get it to work.
This is a preprocessing.py script called as an entry_point parameter to the SKLearn estimator. The purpose is to pre-process the data prior to running an inference. It's the first step in my inference pipeline.
Everything is working as expected in my preprocessing script, except outputting the file at the end.
Attempt #1 - this produces a CSV file that has strange binary-looking data at the beginning and end of the file (before the first cell and after the last cell of the CSV). It's almost a valid CSV but not quite. See the image at the end.
import boto3
import joblib
from io import BytesIO

def write_joblib(file, path):
    # split "s3://bucket/key..." into bucket and key
    s3_bucket, s3_key = path.split('/')[2], path.split('/')[3:]
    s3_key = '/'.join(s3_key)
    with BytesIO() as f:
        joblib.dump(file, f)
        f.seek(0)
        boto3.client("s3").upload_fileobj(Bucket=s3_bucket, Key=s3_key, Fileobj=f)

predictors_csv = predictors_df.to_csv(index=False)
write_joblib(predictors_csv, predictors_s3_uri)
Attempt #2 - I used StringIO rather than BytesIO. However, this produced a zero-byte file on S3.
Attempt #3 - I tried boto3.client('s3').put_object(...) but got ClientError: An error occurred (AccessDenied) when calling the PutObject operation: Access Denied
I believe I am almost there with Attempt #1 above. I assume it's an encoding issue. If I can fix the structure of the CSV file to remove the non-text characters at the start, it will work. A screenshot of the CSV in Notepad++ is below; notice the non-character text at the start of the file.
I solved this myself. This works within an SKLearn estimator container. I assume it will work inside any built-in fit/transform container for writing a CSV to S3 directly.
The use case is writing out the results of pre-processing for model featurization, vectorization, dimensionality reduction, etc. This would occur prior to model inference, as the first step in an Inference Pipeline.
import boto3

def write_text_file_to_s3(file_string, path):
    # split "s3://bucket/key..." into bucket and key
    s3_bucket, s3_key = path.split('/')[2], path.split('/')[3:]
    s3_key = '/'.join(s3_key)
    s3 = boto3.resource('s3')
    s3.Object(s3_bucket, s3_key).put(Body=file_string)

predictors_csv = predictors_df.to_csv(index=False, encoding='utf-8-sig')
write_text_file_to_s3(predictors_csv, predictors_s3_uri)
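To illustrate how the path is parsed (the bucket and key below are made up; only the s3://bucket/key... layout is assumed):

# Hypothetical URI, purely to show what the split produces:
path = 's3://my-bucket/preprocessing/predictors.csv'
s3_bucket = path.split('/')[2]            # 'my-bucket'
s3_key = '/'.join(path.split('/')[3:])    # 'preprocessing/predictors.csv'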

Keras ImageDataGenerator: PIL.UnidentifiedImageError

I tried to use ImageDataGenerator to build image generators to train my model, but I am unable to do so because of a PIL.UnidentifiedImageError. I tried different datasets and the problem occurs only with my dataset.
Unfortunately I can't delete all the training/testing images, as one answer suggested, but I can remove the files causing this problem. How can I detect the files that cause the error?
This is a common problem, particularly if you download images from, say, Google. I developed a function that, given a directory, goes through all sub directories and checks the files in each sub directory to ensure they have proper extensions and are valid image files. The code is provided below. It returns two lists: good_list is a list of valid image files and bad_list is a list of invalid image files. You will need to have OpenCV installed. If you do not have it installed, use pip install opencv-contrib-python.
import os
import cv2

def test_images(dir):
    bad_list = []
    good_list = []
    good_exts = ['jpg', 'png', 'bmp', 'tiff', 'jpeg', 'gif']  # list of acceptable image file extensions
    for klass in os.listdir(dir):  # iterate through the sub directories
        class_path = os.path.join(dir, klass)  # create path to sub directory
        if os.path.isdir(class_path):
            for f in os.listdir(class_path):  # iterate through the image files
                f_path = os.path.join(class_path, f)  # path to the image file
                ext = f[f.rfind('.') + 1:]  # get the file's extension
                if ext not in good_exts:
                    print(f'file {f_path} has an invalid extension {ext}')
                    bad_list.append(f_path)
                else:
                    try:
                        img = cv2.imread(f_path)
                        size = img.shape  # raises if cv2 could not read the file (img is None)
                        good_list.append(f_path)
                    except Exception:
                        print(f'file {f_path} is not a valid image file')
                        bad_list.append(f_path)
        else:
            print(f'** WARNING ** directory {dir} has files in it, it should only contain sub directories')
    return good_list, bad_list
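A possible usage sketch, assuming the images live in a directory of class sub directories (the directory name below is made up, and deleting the bad files is optional):

import os

good, bad = test_images('dataset/train')  # hypothetical training directory
print(f'{len(good)} valid images, {len(bad)} problematic files')
for f_path in bad:
    os.remove(f_path)  # drop the files that ImageDataGenerator/PIL would fail on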

tf.train.latest_checkpoint returning none when passing checkpoint path

When I try to load a checkpoint after training the ENet model for prediction using tf.train.latest_checkpoint(), it returns None even though I am passing the correct checkpoint path.
Here is my code:
image_dir = './dataset/test/'
images_list = sorted([os.path.join(image_dir, file)
                      for file in os.listdir(image_dir) if file.endswith('.png')])

checkpoint_dir = "./checkpoint_mk"
listi = os.listdir(checkpoint_dir)
print(listi)

checkpoint = tf.train.latest_checkpoint("./log/original/check")
print(checkpoint, '---------------------------------------++++++++++++++++++++++++++++++++++++++++++++++++++++')
It returns None.
I am passing the absolute path of the checkpoints, as they are stored in a different directory.
Here is my checkpoint folder.
EDIT - contents of the checkpoint file:
model_checkpoint_path: "model.ckpt-400"
all_model_checkpoint_paths: "model.ckpt-0"
all_model_checkpoint_paths: "model.ckpt-400"
The tf.train.latest_checkpoint path argument needs to be relative to your current directory (the one from which the Python script is executed). If the layout is more complex (e.g. the data set is stored in a different folder or on an HDD), you can simply use the absolute path to the folder. That is why tf.train.latest_checkpoint("/home/nikhil_m/TensorFlow-ENet/log/original") works in this case.
Try tf.train.latest_checkpoint(os.path.dirname('your_checkpoint_path'))
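As a minimal sketch, assuming the folder contains the checkpoint index file shown in the EDIT above:

import tensorflow as tf

# Absolute path taken from the answer; adjust it to wherever the
# "checkpoint" file and the model.ckpt-* files actually live.
latest = tf.train.latest_checkpoint("/home/nikhil_m/TensorFlow-ENet/log/original")
print(latest)  # e.g. ".../model.ckpt-400", or None if the directory is wrong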

How to specify model directory in Floydhub?

I am new to Floydhub. I am trying to run the code from this github repository and the corresponding tutorial.
For the training, I successfully used this command:
floyd run --gpu --env tensorflow-1.2 --data janinanu/dataset/data/2:tut_train 'python udc_train.py'
I adjusted this line in the training file to work in Floydhub:
tf.flags.DEFINE_string("input_dir", "/tut_train", "Directory containing
input data files 'train.tfrecords' and 'validation.tfrecords'")
As said, this worked without problems for the training.
Now for the testing, I cannot really find any details on how to specify the model directory in which the output of the training is stored. I mounted the output from training with model_dir as the mount point. I assumed that the correct command should look something like this:
floyd run --cpu --env tensorflow-1.2 --data janinanu/datasets/data/2:tut_test --data janinanu/projects/retrieval-based-dialogue-system-on-ubuntu-corpus/18/output:model_dir 'python udc_test.py --model_dir=?'
I have no idea what to put in the --model_dir=?
Correspondingly, I assumed that I have to adjust some lines in the test file:
tf.flags.DEFINE_string("test_file", "/tut_test/test.tfrecords", "Path of
test data in TFRecords format")
tf.flags.DEFINE_string("model_dir", "/model_dir", "Directory to load model
checkpoints from")
...as well as in the train file (not sure about that though...):
tf.flags.DEFINE_string("input_dir", "/tut_train", "Directory containing
input data files 'train.tfrecords' and 'validation.tfrecords'")
tf.flags.DEFINE_string("model_dir", "/model_dir", "Directory to store
model checkpoints (defaults to /model_dir)")
When I use e.g. --model_dir=/model_dir and the code with the above adjustments, I get this error:
2017-12-22 12:17:49,048 INFO - return func(*args, **kwargs)
2017-12-22 12:17:49,048 INFO - File "/usr/local/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 543, in evaluate
2017-12-22 12:17:49,048 INFO - log_progress=log_progress)
2017-12-22 12:17:49,049 INFO - File "/usr/local/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 816, in _evaluate_model
2017-12-22 12:17:49,049 INFO - % self._model_dir)
2017-12-22 12:17:49,049 INFO - tensorflow.contrib.learn.python.learn.estimators._sklearn.NotFittedError: Couldn't find trained model at /model_dir
Which doesn't come as a surprise.
Can anyone give me some clarification on how to feed the training output into the test run?
I will also post this question in the Floydhub Forum.
Thanks!!
You can mount the output of any job just like you mount data. In your example:
--data janinanu/projects/retrieval-based-dialogue-system-on-ubuntu-corpus/18/output:model_dir
should mount the entire output directory from run 18 to /model_dir of the new job.
You can confirm this by viewing the job page (select the "data" tab to see what datasets are mounted at which paths).
In your case, can you confirm that the test is looking for the correct model filename?
I will also respond to this in the FloydHub forum.

How to load two checkpoints in init_fn in slim.learning.train

I want to load two checkpoints while using slim.learning.train. For example,
init_fn = assign_from_checkpoint_fn(model_path, variables_to_restore)
slim.learning.train(train_op, log_dir, init_fn=init_fn)
The problem is that I can pass only one checkpoint file as model_path, but I want to use two checkpoints. I think there are two possible solutions:
1. Modify the assign_from_checkpoint_fn function in tf.contrib.framework.assign_from_checkpoint_fn so that model_path can be a list of checkpoint files.
2. Merge the two checkpoints beforehand. (I didn't find any tool for this.)
Can anyone help me?
I found a solution: we can define our own init function using the session, like this:
flow_init_assign_op, flow_init_feed_dict = slim.assign_from_checkpoint(
    flow_ckpt, flow_var_to_restore)

resnet_init_assign_op, resnet_init_feed_dict = slim.assign_from_checkpoint(
    resnet_ckpt, resnet_var_to_restore, ignore_missing_vars=True)

def init_fn(sess):
    sess.run(flow_init_assign_op, flow_init_feed_dict)
    sess.run(resnet_init_assign_op, resnet_init_feed_dict)
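The resulting init_fn can then be passed to slim.learning.train exactly as in the call from the question (train_op and log_dir are assumed to be defined elsewhere):

slim.learning.train(train_op, log_dir, init_fn=init_fn)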