Impossible to get post transform statistics by split - tensorflow

I'm running a simple penguin pipeline in interactive mode with a train/eval split. The Transform step runs, but I can't get per-split post_transform_statistics artifacts.
Inside the dedicated artifacts folder /tmp/tfx-penguin_custom_INTERACTIVE-nq5dn56x/Transform/post_transform_stats/5, I have just one FeaturesStats.pb, but no Split-train and Split-eval subfolders with a FeaturesStats.pb inside each.
However, I do have those subfolders inside the artifacts folder dedicated to transformed examples (/tmp/tfx-penguin_custom_INTERACTIVE-nq5dn56x/Transform/transformed_examples/5/).
Here is how I define the Transform component, explicitly providing the splits and setting disable_statistics=False:
transform = tfx.components.Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    disable_statistics=False,
    splits_config=transform_pb2.SplitsConfig(
        analyze=['train'],
        transform=['train', 'eval']),
    module_file=_transformer_module_file)
I went through the docstring and even the __init__ of the component (https://github.com/tensorflow/tfx/blob/master/tfx/components/transform/component.py); it seems there is nothing I have forgotten or mistaken, but I was quite confused to read the following comment, which points to a stats location I cannot trace:
disable_statistics: If True, do not invoke TFDV to compute pre-transform
and post-transform statistics. When statistics are computed, they will
be stored in the `pre_transform_feature_stats/` and
`post_transform_feature_stats/` subfolders of the `transform_graph`
export.
For now, the workaround is to explicitly disable statistics in the Transform component and define, next to it, a dedicated StatisticsGen component that works on the transformed example splits, but it would have been great to have the per-split statistics inside the Transform component directly.
Thanks for any help

This is expected: the statistics generation inside Transform currently works on the entire transformed dataset regardless of split/span.
To generate separate statistics for different splits, please use the StatisticsGen component.
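For example, a minimal sketch of that wiring, reusing the component definitions from the question:

# Disable the whole-dataset statistics inside Transform...
transform = tfx.components.Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    disable_statistics=True,
    splits_config=transform_pb2.SplitsConfig(
        analyze=['train'], transform=['train', 'eval']),
    module_file=_transformer_module_file)

# ...and let StatisticsGen compute statistics per split
# (Split-train, Split-eval) over the transformed examples.
post_transform_stats = tfx.components.StatisticsGen(
    examples=transform.outputs['transformed_examples'])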
Thank you!

Related

How to adapt tf.contrib.data.TextLineDataset for text from other sources?

For example, if my text data comes from a database, how can I get one line/doc (i.e. one database record) using the same mechanism as TextLineDataset (subclassing Dataset such that the pipeline described here still works)?
By looking at the source code of TextLineDataset, I find that make_dataset_resource() seems to be an important method to implement. But I can't find the actual code that yields a line from a file, as the docstring of TextLineDataset says: A Dataset comprising lines from one or more text files.
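One way to approach this without subclassing, in later TF releases, is tf.data.Dataset.from_generator, which can wrap any Python iterable such as a database cursor. A minimal sketch, assuming a sqlite3 database with a hypothetical documents(text) table:

import sqlite3
import tensorflow as tf

def record_generator():
    # Yield one document per database record (database, table, and
    # column names here are hypothetical placeholders).
    conn = sqlite3.connect('corpus.db')
    for (doc,) in conn.execute('SELECT text FROM documents'):
        yield doc
    conn.close()

dataset = tf.data.Dataset.from_generator(
    record_generator, output_types=tf.string, output_shapes=())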

Change images slider step in TensorBoard

TensorBoard 1.1.0's image history (the question included a screenshot). I would like to set the slider's position (on top of the black image with a 7) more precisely, to be able to select any step. Right now I can only select e.g. between steps 2050 and 2810. Is that possible?
Maybe there is a place in the sources where the constant 10 is hardcoded?
I answered this question over at "TensorBoard doesn't show all data points", but since this one seems to be more popular I will quote it here.
You don't have to change the source code for this; there is a flag called --samples_per_plugin.
Quoting from the help output:
--samples_per_plugin: An optional comma separated list of plugin_name=num_samples pairs to explicitly
specify how many samples to keep per tag for that plugin. For unspecified plugins, TensorBoard
randomly downsamples logged summaries to reasonable values to prevent out-of-memory errors for long
running jobs. This flag allows fine control over that downsampling. Note that 0 means keep all
samples of that type. For instance, "scalars=500,images=0" keeps 500 scalars and all images. Most
users should not need to set this flag.
(default: '')
So if you want to have a slider of 100 images, use:
tensorboard --samples_per_plugin images=100
I managed to do this by changing this line in the TensorBoard backend.
This question is covered in the FAQ:
Is my data being downsampled? Am I really seeing all the data?
TensorBoard uses reservoir sampling to downsample your data so that it
can be loaded into RAM. You can modify the number of elements it will
keep per tag in tensorboard/backend/application.py. See this
StackOverflow question for some more information.
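For intuition, reservoir sampling keeps a fixed-size uniform random sample of a stream of unknown length. A minimal sketch of the classic algorithm (Algorithm R), not TensorBoard's actual code:

import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of
    unknown length (the technique the FAQ above refers to)."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)  # fill the reservoir first
        else:
            # Replace an existing element with probability k/(i+1).
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample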

Avoid duplicates in the destination schema

I have a little problem. I want to map every detail line to one OrderInfo. The destination schema cannot have any duplicate OrderInfo: all the detail lines should end up under the single destination OrderInfo, and the SuppliersOrderNo and BuyersOrderNo should not appear twice.
Any ideas how to do this? Is it possible to use XSLT or an inline script?
<inv:OrderInfo>
  <inv:SuppliersOrderNo>123456</inv:SuppliersOrderNo>
  <inv:BuyersOrderNo>6789</inv:BuyersOrderNo>
  <inv:DetailLines>
    <inv:DetailLine>
      <inv:InvoiceDetailLineNo>1</inv:InvoiceDetailLineNo>
      <inv:Item>
        <inv:SuppliersArticleNo>article2</inv:SuppliersArticleNo>
        <inv:SuppliersDescription>BestArticle</inv:SuppliersDescription>
      </inv:Item>
    </inv:DetailLine>
    <inv:DetailLine>
      <inv:InvoiceDetailLineNo>2</inv:InvoiceDetailLineNo>
      <inv:Item>
        <inv:SuppliersArticleNo>article3</inv:SuppliersArticleNo>
        <inv:SuppliersDescription>AlmostBestArticle</inv:SuppliersDescription>
      </inv:Item>
    </inv:DetailLine>
  </inv:DetailLines>
</inv:OrderInfo>
<inv:OrderInfo>
  <inv:SuppliersOrderNo>123456</inv:SuppliersOrderNo>
  <inv:BuyersOrderNo>6789</inv:BuyersOrderNo>
  <inv:DetailLines>
    <inv:DetailLine>
      <inv:InvoiceDetailLineNo>1</inv:InvoiceDetailLineNo>
      <inv:Item>
        <inv:SuppliersArticleNo>article1337</inv:SuppliersArticleNo>
        <inv:SuppliersDescription>WOW</inv:SuppliersDescription>
      </inv:Item>
    </inv:DetailLine>
  </inv:DetailLines>
</inv:OrderInfo>
If you want to do this purely in XSLT, you'll have to use Muenchian grouping. I wrote a blog post a little while back that links to some other posts on how to do this in BizTalk: https://blog.tallan.com/2014/12/09/muenchian-grouping-in-biztalk-while-keeping-mapper-functionality/
To summarize the post: if you pursue this, you'll need a map that is completely custom XSLT somewhere, but you could put it into a custom pipeline component if you still want to be able to use "regular" map functionality without any other caveats (in the post I describe a method of doing that in a pipeline component so that a "regular" BizTalk map can still be used on the preprocessed output). There are lots of resources on Muenchian grouping out there (including on StackOverflow), so I won't rehash all of that in this answer.
You could also serialize the message in a C# component and use LINQ methods to group/sort/order it, or, if you're inserting the content into SQL at some point, you could do the deduplication in SQL (which handles this kind of task more naturally).
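To make the grouping idea concrete, here is a rough sketch of the logic in Python/ElementTree rather than C#/LINQ, purely as an illustration of the technique (the namespace URI is a placeholder, and the function assumes the OrderInfo elements are direct children of the root):

import xml.etree.ElementTree as ET

# Placeholder namespace URI; substitute the real 'inv' namespace.
NS = {'inv': 'urn:example:invoice'}

def merge_order_infos(root):
    """Group OrderInfo elements by (SuppliersOrderNo, BuyersOrderNo) and
    move the detail lines of duplicates under the first occurrence."""
    seen = {}
    for info in list(root.findall('inv:OrderInfo', NS)):
        key = (info.findtext('inv:SuppliersOrderNo', namespaces=NS),
               info.findtext('inv:BuyersOrderNo', namespaces=NS))
        if key not in seen:
            seen[key] = info.find('inv:DetailLines', NS)
        else:
            # Re-parent this duplicate's detail lines, then drop it.
            for line in list(info.find('inv:DetailLines', NS)):
                seen[key].append(line)
            root.remove(info)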

How to store data from Google Ngram API?

I need to store the data presented in the graphs on the Google Ngram website. For example, I want to store the occurrences of "it's" as a percentage from 1800 to 2008, as presented in the following link: https://books.google.com/ngrams/graph?content=it%27s&year_start=1800&year_end=2008&corpus=0&smoothing=3&share=&direct_url=t1%3B%2Cit%27s%3B%2Cc0.
The data I want is what you can scroll over on the graph. How can I extract this for about 140 different terms (e.g. "it's", "they're", "she's", etc.)?
econpy wrote a nice little module in Python that you can use through a command-line interface.
For your "it's" example, you would need to type this command in a terminal / windows console:
python getngrams.py it's -startYear=1800 -endYear=2008 -corpus=eng_2009 -smoothing=3
This will automatically save the query result in a CSV file named after your query parameters.
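To cover all ~140 terms, you could loop over them and shell out to the same script, e.g. (a small sketch reusing the exact flags above):

import subprocess

terms = ["it's", "they're", "she's"]  # ...extend to all ~140 terms

# Each call writes one CSV named after the query parameters, as above.
for term in terms:
    subprocess.run(
        ['python', 'getngrams.py', term,
         '-startYear=1800', '-endYear=2008',
         '-corpus=eng_2009', '-smoothing=3'],
        check=True)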
econpy's package, from #HugoMailhot's answer, no longer works (as of 2021) and seems unmaintained.
Here's an updated version, with some improvements for easier integration into Python code:
https://gitlab.com/cpbl/google-ngrams
You can call this from the command line (as with econpy's version) to create a CSV file, e.g.
getngrams.py it's -startYear=1800 -endYear=2008 -corpus=eng_2009 -smoothing=3
or call it from Python to get (and plot) the data directly, e.g.:
from getngrams import ngrams
df = ngrams('bells and whistles -startYear=1900 -endYear=2018 -smoothing=2')
df.plot()
The xkcd functionality is still there too.
(Issues, bug-fix pull requests, etc. are welcome there.)

Pentaho Kettle: how to pass variable from transformation to another transformation inside job

I have two transformations in the job.
In the first transformation, I get details about the file.
Now I would like to pass this information to the second transformation. I have set a variable in the settings/parameters of transformation #2 and use Get Variables inside it, but the values are not passed.
See the attached sample: https://www.hightail.com/download/bXBiV28wMVhLVlZWeHNUQw
LOG_1 displays the file time; however, LOG_2 and LOG_3 are not producing any value.
How can I pass a variable across transformations and to the parent job?
Try checking the box in the second transformation's job entry settings (the original answer showed the exact checkbox in a screenshot).
Remove all the logs from the job file you shared.
You may also try reading this blog.
Hope this resolves your issue :)