Is there a way to store data in hdf5 file in ScriptJob in pyiron? - pyiron

I have my own Monte Carlo code (which is not part of pyiron), which I launch via ScriptJob in pyiron. Currently, I store the output data in a file, but since the script job is a pyiron object and an hdf5 is created, I would love to store the data there. So, I'd love to have something like:
script_job = pr.create_job('ScriptJob', 'job')
script_job.script_path = 'monte_carlo.ipynb'
script_job.run()
script_job['user/output/'] # This returns the output of what I store in monte_carlo.ipynb
Is there a way to do something inside monte_carlo.ipynb to make this happen?

You can summarise your output in a dictionary named output_dict and then use:
from pyiron import Notebook
Notebook().store_custom_output_dict(output_dict)
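A minimal sketch of what the end of monte_carlo.ipynb might look like, assuming the Monte Carlo results are plain Python numbers and lists. The helper names (run_monte_carlo, build_output_dict, store_in_job_hdf5) and the dictionary keys are hypothetical; only Notebook().store_custom_output_dict comes from the answer above.

```python
import random


def run_monte_carlo(n_steps, seed=0):
    """Stand-in for the user's own Monte Carlo code (hypothetical)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_steps)]


def build_output_dict(energies):
    """Summarise the run in a flat dictionary of plain Python values."""
    return {
        "n_steps": len(energies),
        "energy_mean": sum(energies) / len(energies),
        "energies": list(energies),
    }


def store_in_job_hdf5(output_dict):
    """Hand the dictionary to pyiron; only meaningful inside a ScriptJob run."""
    from pyiron import Notebook  # import path as given in the answer above

    Notebook().store_custom_output_dict(output_dict)


output_dict = build_output_dict(run_monte_carlo(n_steps=100))
# In the notebook executed by the ScriptJob, the last cell would then call:
# store_in_job_hdf5(output_dict)
```

Keeping the summary step separate from the pyiron call means the dictionary can be checked on its own before the notebook is run as a job.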

Related

Impossible to get post transform statistics by split

I'm running a simple penguin pipeline in interactive mode with a train/eval split. The Transform step runs, but I can't get the post_transform_statistics artifacts by split.
Inside the dedicated artifacts folder /tmp/tfx-penguin_custom_INTERACTIVE-nq5dn56x/Transform/post_transform_stats/5, I have just one FeaturesStats.pb, but no Split-train and Split-eval subfolders with a FeaturesStats.pb inside each.
However, I do have these subfolders in the artifacts folder for the transformed examples (/tmp/tfx-penguin_custom_INTERACTIVE-nq5dn56x/Transform/transformed_examples/5/).
Here is how I define the Transform component, explicitly providing the splits and also setting disable_statistics=False:
transform = tfx.components.Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    disable_statistics=False,
    splits_config=transform_pb2.SplitsConfig(
        analyze=['train'], transform=['train', 'eval']),
    module_file=_transformer_module_file)
I went through the docstring and even the __init__ of the component (https://github.com/tensorflow/tfx/blob/master/tfx/components/transform/component.py). I don't seem to have forgotten or mistaken anything, but I was puzzled by the following comment, which points to a location for the stats that I could not find:
disable_statistics: If True, do not invoke TFDV to compute pre-transform
and post-transform statistics. When statistics are computed, they will
will be stored in the `pre_transform_feature_stats/` and
`post_transform_feature_stats/` subfolders of the `transform_graph`
export.
For now, my workaround is to explicitly disable statistics in the Transform component and define next to it a dedicated statistics component that works on the transformed feature splits, but it would have been great to have the per-split statistics inside the Transform component directly.
Thanks for any help
This is expected: the statistics generation in Transform currently works on the entire transformed dataset, regardless of split/span.
To generate separate statistics for different splits, please use the StatisticsGen component.
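A hedged sketch of that suggestion: a separate StatisticsGen fed with the transformed examples, which writes one statistics output per split. The function name is hypothetical, `transform` is assumed to be the component defined in the question, and the import sits inside the function so that the definition itself does not require TFX to be installed.

```python
def make_post_transform_stats(transform):
    """Build a StatisticsGen over the Transform component's transformed examples.

    `transform` is assumed to be the tfx.components.Transform instance
    defined in the question.
    """
    from tfx import v1 as tfx  # assumes the TFX v1 public API

    return tfx.components.StatisticsGen(
        examples=transform.outputs["transformed_examples"]
    )


# In an interactive pipeline this might be used as follows, where `context`
# is assumed to be the InteractiveContext that ran the other components:
# post_transform_stats = make_post_transform_stats(transform)
# context.run(post_transform_stats)
```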
Thank you!

How to change the model parameter saving location when training with DefaultTrainer in detectron2

My code looks like this:
cfg = get_cfg()
...
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
After training, the model is automatically saved in ./output/model_final.pth. I found the folder where the model is saved, but I couldn't find an interface to change the model filename.
What can I do to change the saved filename? I would really appreciate your help.
I've spent ages trying to figure this out and still haven't got there. A workaround I've used is to change the output path itself rather than the filename:
cfg.OUTPUT_DIR = "./path/to/"
So I've got output/hands/model_final.pth and output/feet/model_final.pth instead of different filenames.
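If a different filename is still wanted, one option is simply to rename the file once training has finished. The sketch below is not a detectron2 interface; it uses only the standard library, and the helper name and the example filename are hypothetical. A temporary directory stands in for cfg.OUTPUT_DIR.

```python
import os
import tempfile


def rename_final_checkpoint(output_dir, new_name, old_name="model_final.pth"):
    """Rename the checkpoint written at the end of training; return the new path."""
    src = os.path.join(output_dir, old_name)
    dst = os.path.join(output_dir, new_name)
    os.replace(src, dst)  # overwrites dst if it already exists
    return dst


# Demonstration with a temporary directory standing in for cfg.OUTPUT_DIR.
with tempfile.TemporaryDirectory() as output_dir:
    # Create an empty placeholder where trainer.train() would leave the model.
    open(os.path.join(output_dir, "model_final.pth"), "wb").close()
    new_path = rename_final_checkpoint(output_dir, "hands_final.pth")
    renamed_exists = os.path.exists(new_path)
```

After a real run this would be called as rename_final_checkpoint(cfg.OUTPUT_DIR, "hands_final.pth") once trainer.train() returns. Note that anything that later looks for model_final.pth in that folder, such as resuming training, may no longer find it.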

GridFs read PDF

I am trying to build a financial dashboard with Flask and pymongo. The starting point is a flask form which saves data in a MongoDB database. One of the fields in the form is a FileField (wtforms) which allows the upload of a PDF, which is then stored in MongoDB with GridFS.
I manage to save the PDF, and I can see the resulting entries in the .files and .chunks collections. Now I would like to build a function that retrieves the PDFs and analyses them with some basic NLP, but I struggle to get meaningful data out of them.
When I do:
storage = gridfs.GridFS(db, collection)
data = storage.get('some id')
a = data.read()
The result is a binary file. If I continue with:
with open(data, 'rb') as f:
b = f.read()
The result is "ValueError: embedded null byte" or sometimes an empty byte string.
Any help on this?
To follow up on the above, I found a solution for myself that consists in 2 separate functions:
(1) Upon upload of the form and before uploading the files to MongoDB, I apply a function based on pdfminer that extracts the string content of the PDF and transforms it into a list of sentences using NLTK. I then store this list in the .files collection via storage.put(file, sent_list=sent_list), where sent_list is the variable holding the list of sentences.
Whenever I wish to run NLP operations on the file, I will just call the "sent_list" variable from mongodb.
(2) If I wish to display the stored PDF in its original form, however, I use the following function as a separate route.
storage = GridFS(db, collection)
data = storage.get_last_version(filename)
response = make_response(data.read())
extension = data.filename.split('.')[-1]
response.headers['Content-Type'] = f'application/{extension}'
response.headers['Content-Disposition'] = f'inline; filename={data.filename}'
return response
(2) will open a new tab in my flask app showing the .pdf file in its original format.
I hope this helps anyone coming across a similar problem in the future.
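On the error in the question: data.read() already returns the file content as bytes, whereas open() expects a filesystem path, not file content. A minimal sketch of an alternative, assuming pdfminer.six is the extraction library (the answer above mentions pdfminer): wrap the bytes in io.BytesIO and pass that file-like object to the parser. The function names are hypothetical, and the bytes below are only a stand-in for what GridFS would return.

```python
import io


def as_file_object(raw_bytes):
    """Wrap bytes read from GridFS in an in-memory binary file object."""
    return io.BytesIO(raw_bytes)


def extract_pdf_text(raw_bytes):
    """Extract text from PDF bytes; assumes pdfminer.six is installed."""
    from pdfminer.high_level import extract_text

    return extract_text(as_file_object(raw_bytes))


# Stand-in for the result of `storage.get(...).read()`; not a real PDF.
raw = b"%PDF-1.4\x00 placeholder bytes containing a null byte"
buffer = as_file_object(raw)
header = buffer.read(8)  # reads like a file, without touching the disk
```

With a real document, extract_pdf_text(data.read()) would return the text that can then be split into sentences.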

How to store data from Google Ngram API?

I need to store the data presented in the graphs on the Google Ngram website. For example, I want to store the occurrences of "it's" as a percentage from 1800-2008, as presented in the following link: https://books.google.com/ngrams/graph?content=it%27s&year_start=1800&year_end=2008&corpus=0&smoothing=3&share=&direct_url=t1%3B%2Cit%27s%3B%2Cc0.
The data I want is the data you're able to scroll over on the graph. How can I extract this for about 140 different terms (e.g. "it's", "they're", "she's", etc.)?
econpy wrote a nice little module in Python that you can use through a command-line interface.
For your "it's" example, you would need to type this command in a terminal / windows console:
python getngrams.py it's -startYear=1800 -endYear=2008 -corpus=eng_2009 -smoothing=3
This will automatically save the query result in a CSV file named after your query parameters.
econpy's package, mentioned in @HugoMailhot's answer, no longer works (as of 2021) and does not seem to be maintained.
Here's an updated version, with some improvements for easier integration into Python code:
https://gitlab.com/cpbl/google-ngrams
You can call this from the command line (as in econpy's) to create a CSV file, e.g.
getngrams.py it's -startYear=1800 -endYear=2008 -corpus=eng_2009 -smoothing=3
or call it from python to get (and plot) data directly in python, e.g.:
from getngrams import ngrams
df = ngrams('bells and whistles -startYear=1900 -endYear=2018 -smoothing=2')
df.plot()
The xkcd functionality is still there too.
(Issues / bug fix pull requests /etc welcome there)
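For the roughly 140 terms in the question, the per-term call could be wrapped in a loop. This is a sketch under assumptions: the query-string format is copied from the example above, the helper names build_query and fetch_all are hypothetical, and the fetching function is passed in as a parameter so the code does not depend on the package being installed.

```python
def build_query(term, start_year=1800, end_year=2008, smoothing=3):
    """Build a query string in the format shown in the example above."""
    return (
        f"{term} -startYear={start_year} "
        f"-endYear={end_year} -smoothing={smoothing}"
    )


def fetch_all(terms, fetch):
    """Call `fetch` once per term and collect the results keyed by term."""
    return {term: fetch(build_query(term)) for term in terms}


terms = ["it's", "they're", "she's"]
queries = [build_query(term) for term in terms]

# With the package above installed, `fetch` would be its `ngrams` function:
# from getngrams import ngrams
# results = fetch_all(terms, ngrams)
```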

how do I add a pandas object (e.g. DataFrame) to a group within an HDF file?

Suppose I have an HDF5 file (myHDF.h5) with a hierarchy of groups, something like:
/root/groupA
/groupB
Now I want to add a DataFrame (myFrame) to groupA (along with some other objects such as dictionaries). How do I do that? If I open myHDF.h5 with pandas.io.HDFStore:
store = pandas.io.HDFStore('myHDF.h5')
and then try:
store['groupA']['myFrame'] = myFrame
I get:
AttributeError: Attribute 'pandas_type' does not exist in node: '/groupA'
What is the proper way to do this?
This is enabled as of pandas version 0.10.0; see the documentation on hierarchical keys:
http://pandas.pydata.org/pandas-docs/stable/io.html#hierarchical-keys
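A minimal sketch of the hierarchical-key approach from the linked documentation: rather than indexing the store twice, the group path becomes part of the key. The helper name and the example data are illustrative, and PyTables is assumed to be installed, since pandas relies on it for HDF5 access.

```python
import os
import tempfile

import pandas as pd


def add_frame_to_group(h5_path, key, frame):
    """Store `frame` under a hierarchical key such as 'groupA/myFrame'."""
    with pd.HDFStore(h5_path) as store:  # opens in append mode by default
        store.put(key, frame)
        return store.keys()


my_frame = pd.DataFrame({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]})

# A temporary file stands in for myHDF.h5 from the question.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "myHDF.h5")
    keys = add_frame_to_group(path, "groupA/myFrame", my_frame)
    restored = pd.read_hdf(path, "groupA/myFrame")
```

The shorter form store['groupA/myFrame'] = myFrame does the same thing as store.put with the same key.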
Currently pandas does not support hierarchical paths as you specified.
There is an open github issue about this: https://github.com/pydata/pandas/issues/13
I'm not sure when we will get around to adding this feature. I would more than welcome a pull request if you're interested in completing the skeleton code that's in the issue discussion.