How do I add a pandas object (e.g. DataFrame) to a group within an HDF file?

Suppose I have an HDF5 file (myHDF.h5) with a hierarchy of groups, something like:
/root/groupA
/root/groupB
Now I want to add a DataFrame (myFrame) to groupA (along with some other objects such as dictionaries). How do I do that? If I open myHDF.h5 with pandas.io.HDFStore:
store = pandas.io.HDFStore('myHDF.h5')
and then try:
store['groupA']['myFrame'] = myFrame
I get:
AttributeError: Attribute 'pandas_type' does not exist in node: '/groupA'
What is the proper way to do this?

This is enabled as of version 0.10.0; see the documentation on hierarchical keys:
http://pandas.pydata.org/pandas-docs/stable/io.html#hierarchical-keys
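With 0.10.0 and later you can assign using a hierarchical key directly. A minimal sketch, where the small DataFrame stands in for the question's myFrame:

import pandas as pd

myFrame = pd.DataFrame({'a': [1, 2, 3]})  # stand-in for the question's DataFrame

store = pd.HDFStore('myHDF.h5')
store['groupA/myFrame'] = myFrame    # hierarchical key: stored under the /groupA group
retrieved = store['groupA/myFrame']  # read it back the same way
store.close()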

Currently pandas does not support hierarchical paths as you specified.
There is an open github issue about this: https://github.com/pydata/pandas/issues/13
I'm not sure when we will get around to adding this feature; I would more than welcome a pull request if you're interested in completing the skeleton code that's in the issue discussion.


Impossible to get post-transform statistics by split

I'm running a simple penguin pipeline in interactive mode with a train/eval split. The transform step runs, but I can't get the post_transform_statistics artifacts.
Inside the dedicated artifacts folder /tmp/tfx-penguin_custom_INTERACTIVE-nq5dn56x/Transform/post_transform_stats/5, I have just one FeaturesStats.pb, but not the subfolders Split-train and Split-eval with a FeaturesStats.pb inside each.
However, I do have those subfolders inside the artifacts dedicated to transformed examples (/tmp/tfx-penguin_custom_INTERACTIVE-nq5dn56x/Transform/transformed_examples/5/).
Here is how I define the Transform component, explicitly providing splits and disable_statistics=False:
from tfx import v1 as tfx
from tfx.proto import transform_pb2

transform = tfx.components.Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    disable_statistics=False,
    splits_config=transform_pb2.SplitsConfig(
        analyze=['train'], transform=['train', 'eval']),
    module_file=_transformer_module_file)
I went through the docstring and even the __init__ of the component (https://github.com/tensorflow/tfx/blob/master/tfx/components/transform/component.py); it seems there is nothing I have forgotten or mistaken, but I was puzzled to read the following comment, which points to a location for the stats that I cannot find:
disable_statistics: If True, do not invoke TFDV to compute pre-transform
and post-transform statistics. When statistics are computed, they will
be stored in the `pre_transform_feature_stats/` and
`post_transform_feature_stats/` subfolders of the `transform_graph`
export.
For now, the workaround is to explicitly disable stats in the Transform component and define, next to it, a dedicated statistics component that works on the transformed feature splits; but it would have been great to have the split statistics inside the Transform component directly.
Thanks for any help.
This is expected, as the statistics generation inside Transform currently works on the entire transformed dataset regardless of split/span.
To generate separate statistics for different splits, please use the StatisticsGen component.
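A minimal sketch of that workaround, reusing the component wiring from the question (output key names and exact signatures may vary slightly across TFX versions):

# Disable the whole-dataset statistics inside Transform ...
transform = tfx.components.Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    disable_statistics=True,
    splits_config=transform_pb2.SplitsConfig(
        analyze=['train'], transform=['train', 'eval']),
    module_file=_transformer_module_file)

# ... and compute per-split statistics with a dedicated StatisticsGen,
# which writes one FeaturesStats.pb per split (Split-train, Split-eval).
post_transform_stats = tfx.components.StatisticsGen(
    examples=transform.outputs['transformed_examples'])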
Thank you!

How to download data from the ECB using pandaSDMX?

I am trying to download the NEER data from this target page.
Here is a good example: Most efficient way of converting RESTful output to dataframe
Here is the code to get the OECD data:
import pandasdmx as sdmx
df = sdmx.Request('OECD').data(
    resource_id='MEI_FIN',
    key='IR3TIB.GBR+USA.M',
    params={'startTime': '2008-06', 'dimensionAtObservation': 'TimeDimension'},
).write()
I just wonder where to find, and how to search for, the 'resource_id', 'params' and 'key' needed to get the EUR NEER data from my target ECB page.
Thank you.
I solved it.
1. Find the data in https://sdw.ecb.europa.eu/. If it is not there, search other databases.
2. Select and filter until I get the data series.
3. Click the data series; it shows the data in a window, and the URL of this window gives me the key: https://sdw.ecb.europa.eu/quickview.do?SERIES_KEY=120.EXR.D.E5.EUR.EN00.A (here EXR is the dataflow, i.e. the resource_id, and D.E5.EUR.EN00.A is the key).
4. Then use the key to download the data:
df = sdmx.Request('ECB').data(
    resource_id='EXR',
    key='D.E5.EUR.EN00.A',
    params=dict(startPeriod='2019-01', endPeriod='2019-06'),
).write()
The remaining question is whether I can search directly from code instead of visiting the website.
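A rough sketch of how such a search could start, based on the pandaSDMX 1.x API (the keyword filter at the end is just one possible way to narrow the list):

import pandasdmx as sdmx

ecb = sdmx.Request('ECB')
flows = ecb.dataflow()                  # fetch metadata for all ECB dataflows
names = sdmx.to_pandas(flows.dataflow)  # pandas Series: flow id -> description
print(names[names.str.contains('exchange', case=False)])  # locate EXR by keyword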

Is there a way to store data in the HDF5 file in ScriptJob in pyiron?

I have my own Monte Carlo code (which is not part of pyiron), which I launch via ScriptJob in pyiron. Currently, I store the output data in a file, but since the ScriptJob is a pyiron object and an HDF5 file is created, I would love to store the data there. So, I'd like to have something like:
script_job = pr.create_job('ScriptJob', 'job')
script_job.script_path = 'monte_carlo.ipynb'
script_job.run()
script_job['user/output/'] # This returns the output of what I store in monte_carlo.ipynb
Is there a way to do something inside monte_carlo.ipynb to make this happen?
You can summarise your output in a dictionary named output_dict and then use:
from pyiron import Notebook
Notebook().store_custom_output_dict(output_dict)
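For illustration, inside monte_carlo.ipynb this could look like the following (the dictionary contents are hypothetical stand-ins for the Monte Carlo results):

from pyiron import Notebook

# Collect the results of the Monte Carlo run (hypothetical values).
output_dict = {'energy': -1.23, 'acceptance_rate': 0.42, 'n_steps': 10000}

# Store the dictionary in the job's HDF5 file via pyiron's Notebook helper.
Notebook().store_custom_output_dict(output_dict)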

Accessing resources of a dynamically loaded module

I can't find a way to correctly get access to the resources of an installed distribution. For example, when a module is loaded dynamically:
require ::($module);
One way to get hold of its %?RESOURCES is to have the module provide a sub that returns this hash:
sub resources { %?RESOURCES }
But that adds extra boilerplate code.
Another way is to deep-scan $*REPO and fetch the module's distribution meta.
Are there any better options to achieve this task?
One way is to use $*REPO (as you already mention) along with the Distribution object that CompUnit::Repository provides as an interface to the META6 data and its mapping to a given data store / file system.
my $spec = CompUnit::DependencySpecification.new(:short-name<Zef>);
my $dist = $*REPO.resolve($spec).distribution;
say $dist.content("resources/$_").open.slurp for $dist.meta<resources>.list;
Note this only works for installed distributions at the moment, but it would work for not-yet-installed distributions (like -Ilib) with https://github.com/rakudo/rakudo/pull/1812

How to store data from Google Ngram API?

I need to store the data presented in the graphs on the Google Ngram website. For example, I want to store the occurrences of "it's" as a percentage from 1800-2008, as presented in the following link: https://books.google.com/ngrams/graph?content=it%27s&year_start=1800&year_end=2008&corpus=0&smoothing=3&share=&direct_url=t1%3B%2Cit%27s%3B%2Cc0.
The data I want is the data you're able to scroll over on the graph. How can I extract this for about 140 different terms (e.g. "it's", "they're", "she's", etc.)?
econpy wrote a nice little module in Python that you can use through a command-line interface.
For your "it's" example, you would need to type this command in a terminal / Windows console:
python getngrams.py it's -startYear=1800 -endYear=2008 -corpus=eng_2009 -smoothing=3
This will automatically save the query result in a CSV file named after your query parameters.
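The resulting CSV can then be loaded with pandas; a small sketch (the filename below is hypothetical, since the script derives the real one from the query parameters):

import pandas as pd

# Hypothetical filename; getngrams.py names the CSV after the query parameters.
df = pd.read_csv('its_1800_2008_eng_2009_3.csv')
print(df.head())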
econpy's package, from HugoMailhot's answer, no longer works (2021) and seems unmaintained.
Here's an updated version, with some improvements for easier integration into Python code:
https://gitlab.com/cpbl/google-ngrams
You can call it from the command line (as with econpy's) to create a CSV file, e.g.:
getngrams.py it's -startYear=1800 -endYear=2008 -corpus=eng_2009 -smoothing=3
or call it from Python to get (and plot) the data directly, e.g.:
from getngrams import ngrams
df = ngrams('bells and whistles -startYear=1900 -endYear=2018 -smoothing=2')
df.plot()
The xkcd functionality is still there too.
(Issues / bug-fix pull requests / etc. are welcome there.)