How to adapt tf.contrib.data.TextLineDataset for text from other sources? - tensorflow

For example, if my text data comes from a database, how can I read one line/document (one database record) at a time using the same mechanism as TextLineDataset, i.e. by subclassing Dataset so that the pipeline described here still works?
Looking at the source code of TextLineDataset, make_dataset_resource() seems to be the important method to implement. But I can't find the code that actually yields a line from a file, which is what the TextLineDataset docstring promises: A Dataset comprising lines from one or more text files.

Related

Impossible to get post transform statistics by split

I'm running a simple penguin pipeline in interactive mode with a train/eval split. The Transform step runs, but I can't get per-split post-transform statistics artifacts.
The dedicated artifacts folder /tmp/tfx-penguin_custom_INTERACTIVE-nq5dn56x/Transform/post_transform_stats/5 contains just one FeaturesStats.pb, and no Split-train and Split-eval subfolders with a FeaturesStats.pb in each.
However, those subfolders do exist in the artifacts folder for the transformed examples (/tmp/tfx-penguin_custom_INTERACTIVE-nq5dn56x/Transform/transformed_examples/5/).
Here is how I define the Transform component, explicitly providing the splits and setting disable_statistics=False:
transform = tfx.components.Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    disable_statistics=False,
    splits_config=transform_pb2.SplitsConfig(
        analyze=['train'], transform=['train', 'eval']),
    module_file=_transformer_module_file)
I checked the docstring and even the __init__ of the component (https://github.com/tensorflow/tfx/blob/master/tfx/components/transform/component.py), and I don't seem to have forgotten or misconfigured anything, but I was puzzled by the following comment, which points to a stats location I cannot find:
disable_statistics: If True, do not invoke TFDV to compute pre-transform
and post-transform statistics. When statistics are computed, they will
will be stored in the `pre_transform_feature_stats/` and
`post_transform_feature_stats/` subfolders of the `transform_graph`
export.
For now, the workaround is to explicitly disable stats in the Transform component and define a dedicated statistics component next to it that works on the transformed feature splits, but it would have been great to get the split statistics directly from the Transform component.
Thanks for any help
This is expected, as the statistics generation in Transform currently works on the entire transform dataset regardless of split/span.
To generate separate statistics for each split, use the StatisticsGen component.
Thank you!

How to "single File Save" empty dataframe in Spark?

I have a job which processes files and then lands them as single CSVs on a blob storage container. The problem I face is that I also need to land empty files, which only contain the header. How can this be achieved when I use .saveSingleFile?
Example code snippet:
df.coalesce(1)
  .write
  .options(configuration.readAndWriteOptions)
  .partitionBy(INGESTION_TIME)
  .format("csv")
  .mode("append")
  .saveSingleFile(path.toString)
Example readAndWriteOptions:
{"sep": ";", "header": "true"}
In other words:
In the above case, if df.show() displays only a header, no CSV file is written. However, I want to output a CSV file that contains the column names but no data. Is there an option that allows this? Both cases need to be possible, with and without data, so something like .take(1) is not a sufficient solution.
Update:
This looks related to a Spark API bug that should have been resolved in Spark version 3.
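For older Spark versions, one possible fallback is to write the header-only file on the driver whenever the DataFrame turns out to be empty. The sketch below is plain Python, not Spark API: the function name write_csv_with_header is hypothetical, and it assumes the column names are available as a list (for example from df.columns) and that the ";" separator from readAndWriteOptions is wanted.

```python
import csv
import os
import tempfile
from typing import Iterable, Sequence


def write_csv_with_header(path: str, columns: Sequence[str],
                          rows: Iterable[Sequence[object]],
                          sep: str = ";") -> int:
    """Write a CSV file that always starts with the header row.

    Returns the number of data rows written, which is 0 for an empty input.
    """
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=sep, lineterminator="\n")
        writer.writerow(columns)  # header is written even without data
        for row in rows:
            writer.writerow(row)
            count += 1
    return count


# Demo: an empty input still produces a file containing only the header.
out_path = os.path.join(tempfile.mkdtemp(), "empty.csv")
written = write_csv_with_header(out_path, ["id", "name"], [])
with open(out_path, encoding="utf-8") as handle:
    content = handle.read()
print(repr(content))
```

This only covers the empty case on a local or mounted path; writing to blob storage would need the corresponding storage client instead of open().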

Edit a Mainframe file in the RecordEditor without a copybook

How do you edit a (binary EBCDIC) Mainframe file in the RecordEditor without a Cobol Copybook?
How do you generate Java code to read the file using the RecordEditor?
Note: This is an attempt to split a question that is far too broad to answer meaningfully
into a series of simpler questions and answers.
Try to avoid editing a binary file without a Cobol Copybook if at all possible. This should only be attempted as a last resort!
Try to get that Cobol copybook (or some field layout document) for the file!
Some general advice:
It is feasible when dealing with 10 to 20 fields in a record, but not if there are thousands of fields in a record.
Take your time and do not rush the process. Try to get each step correct before moving on.
Finally, upgrade to the latest version of the RecordEditor (currently 0.98.4).
This process will also work for normal text files.
RecordEditor Layout Wizard
To start the wizard select option Record Layouts >>> Layout Wizard.
File Structure screen
The file structure screen has 3 purposes:
Get the File structure - It could be Fixed Width, VB, Windows/Unix Text file
Get the Record-Length (if it is a fixed width file).
Get the font (character-set / encoding)
The RecordEditor will try and work this out for you
Field Selection Screen
The RecordEditor will try and work out where fields start and end but
it is not perfect. You need to carefully check and correct its choices
On this screen, the fields are displayed in alternating colors
you create/delete a field by clicking on
Use the Clear Fields button to clear all the fields.
You can change which field types to search for using the various check boxes (e.g. Mainframe Zoned Decimal).
The Add Fields button will do another field search.
Field Definition screen
On this screen you define the field names and types. You may need to go back to the Field Selection screen to adjust the fields.
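To make concrete what the wizard is collecting (record length, encoding, and the start and length of each field), here is a minimal Python sketch that applies such a layout to a fixed-width EBCDIC file. It is an illustration only and not something the RecordEditor produces: the layout, the field names and the cp037 code page are assumptions, and only text fields are handled (no packed or zoned decimal).

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (field name, start offset in bytes, length in bytes).
Layout = List[Tuple[str, int, int]]


def split_records(data: bytes, record_length: int) -> List[bytes]:
    """Cut a fixed-width file into records of record_length bytes each."""
    return [data[i:i + record_length]
            for i in range(0, len(data), record_length)]


def decode_record(record: bytes, layout: Layout,
                  encoding: str = "cp037") -> Dict[str, str]:
    """Decode each text field of one record using the given code page."""
    fields = {}
    for name, start, length in layout:
        raw = record[start:start + length]
        fields[name] = raw.decode(encoding).rstrip()  # drop padding spaces
    return fields


# Demo data: two 10-byte records encoded as EBCDIC (code page 037).
layout: Layout = [("id", 0, 4), ("name", 4, 6)]
data = "0001ALICE 0002BOB   ".encode("cp037")

records = split_records(data, record_length=10)
decoded = [decode_record(rec, layout) for rec in records]
print(decoded)
```

If a field decodes to unreadable characters, either the offsets are off or the field is a binary type, which is the kind of mistake the Field Selection screen is meant to help you catch.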
Editing the file
Once the Record Layout has been defined, it can be used on the open file screen
Generating Java code
When editing your file, you can generate Java/JRecord code to read the file
by selecting Generate >>> Java >>> ....
You can then enter a package-id and generate options,
and finally your sample Java code is generated to read / write the
file.

How to store data from Google Ngram API?

I need to store the data presented in the graphs on the Google Ngram website. For example, I want to store the occurrences of "it's" as a percentage from 1800-2008, as presented in the following link: https://books.google.com/ngrams/graph?content=it%27s&year_start=1800&year_end=2008&corpus=0&smoothing=3&share=&direct_url=t1%3B%2Cit%27s%3B%2Cc0.
The data I want is the data you're able to scroll over on the graph. How can I extract this for about 140 different terms (e.g. "it's", "they're", "she's", etc.)?
econpy wrote a nice little module in Python that you can use through a command-line interface.
For your "it's" example, you would need to type this command in a terminal / windows console:
python getngrams.py it's -startYear=1800 -endYear=2008 -corpus=eng_2009 -smoothing=3
This will automatically save the query result in a CSV file named after your query parameters.
econpy's package, mentioned in @HugoMailhot's answer, no longer works (as of 2021) and does not seem to be maintained.
Here's an updated version, with some improvements for easier integration into Python code:
https://gitlab.com/cpbl/google-ngrams
You can call this from the command line (as in econpy's) to create a CSV file, e.g.
getngrams.py it's -startYear=1800 -endYear=2008 -corpus=eng_2009 -smoothing=3
or call it from python to get (and plot) data directly in python, e.g.:
from getngrams import ngrams
df = ngrams('bells and whistles -startYear=1900 -endYear=2018 -smoothing=2')
df.plot()
The xkcd functionality is still there too.
(Issues / bug fix pull requests /etc welcome there)
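For the roughly 140 terms in the question, the query strings can be built in a loop rather than typed by hand. The sketch below only assembles strings in the flag format shown in the commands above; the helper name build_queries and its defaults are hypothetical, and it is an assumption that ngrams() accepts the -corpus flag in the same way as the command line.

```python
from typing import Iterable, List


def build_queries(terms: Iterable[str], start_year: int = 1800,
                  end_year: int = 2008, corpus: str = "eng_2009",
                  smoothing: int = 3) -> List[str]:
    """Build one getngrams-style query string per term."""
    flags = (f"-startYear={start_year} -endYear={end_year} "
             f"-corpus={corpus} -smoothing={smoothing}")
    return [f"{term} {flags}" for term in terms]


queries = build_queries(["it's", "they're", "she's"])
for query in queries:
    print(query)
```

Each string can then be passed to ngrams(query) as in the snippet above, and the returned data stored per term, for example by writing each result to its own CSV file.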

Documentation with Diagram "Hyperlinks" in Enterprise Architect?

I'm struggling to get all the required (and only the required) information into the documentation of my Enterprise Architect Project. Precisely: we have modelled various requirements and displayed the source "standards" for these requirements in our diagrams by using the "hyperlink"-element out of the common toolbox. (This allows us to capture a title, the website where the documentation is found and a description of this documentation).
Now this element is visible on the diagram, but not in the package view of our model, and it does not get generated in our Word (docx) documentation.
I can see that it should be possible to get this into the documentation, because a "Model Report", which basically prints everything, does print the hyperlinks. But I can't find what I have to select in my template (in the package-tree view, as a package field, element field or diagram field) to get this printed. I can't just use the Model Report, since it basically dumps the whole database into the document, and reverse-engineering the Model Report has proven too difficult for me. I would expect this to be covered in some kind of documentation for EA, but I could not find anything with this level of detail. Is there a reproducible way of finding such things out in future cases? (By the way, I'm using EA 11.0.)
[sorry there were illustrations here, but I'm not allowed to upload them...]
As Geert has already noted, there is a difference between "proper" elements and diagram-only elements. This is actually reflected in the document template editor, where there is an "Element" section inside the "Diagram" section. This will produce output for all elements in the diagram, whether or not they are also in the project browser.
Here's an example of the information you can pull out of your hyperlinks. Given a diagram with a hyperlink:
... and a template which outputs name, alias and hyperlink for each element in the diagram:
... EA will generate a document with the following contents:
So if you want the hyperlink to result in a hyperlink in the document, use the HyperlinkAlias field.
What might be a bit confusing is the fact that in addition to the Hyperlink element type in the Common diagram toolbox, EA allows you to create hyperlinks in regular elements (in the Element Properties dialog, Related tab: Files, which can be local files or web addresses).
In fact, I would recommend that you use those in your Requirement elements rather than diagram-only Hyperlinks if traceability is a priority in your model. The diagram-only Hyperlinks, on the other hand, give you a clearer visual.
Selecting a subset of the elements in a diagram ("only the required information") is a little more involved and depends on how your model is structured. Template fragments will get the job done, but you might be able to achieve your desired result by just using the filters in the document generation dialog.
The hyperlink is an element that is stored in the same package as the diagram it is used on, it is just not visible in the project browser (similar to a note element).
There's a good chance that it doesn't have a name, so make sure you don't omit nameless elements.
So if you print all the elements of the package containing the diagram, you should be able to print the hyperlink as well.
In case that fails you might want to consider creating a template fragment based on an SQL query or a script. Those offer lots of flexibility to print whatever you need, even if it is located in a different package.
[Edited on 04.05.15 to reflect the comment by Uffe and provide a final solution]
Ok, based on Geert's answer, using the following custom query fragment in the diagram section:
select
    t_object.ea_guid as CLASSGUID,
    t_object.Object_Type as CLASSTYPE,
    t_object.Object_Id as OBJECTID,
    t_object.name as HL_Name,
    t_object.Stereotype as HL_Stereotype,
    t_object.object_type as HL_Type,
    t_object.Alias as HL_Alias,
    Note as Notes
    --,t_object.*
from t_object
    left join t_diagramobjects on (t_object.Object_ID = t_diagramobjects.Object_ID)
    left join t_diagram on (t_diagram.Diagram_ID = t_diagramobjects.Diagram_ID)
where t_diagram.Diagram_ID = '#DIAGRAMID#'
    and t_object.Object_Type='Text'
I was able to get a list of the hyperlinks following the diagram. This is the fragment:
custom >
{HL_Alias}: {HL_Name}
{Notes}
< custom
The "Notes" can be printed by getting the attribute directly out of the t_object table. Don't get confused as I was at first: the auto-completion on t_object and the results (t_object.*) DO NOT SHOW a Note attribute, but it does exist, and when you write it into the query, it gets generated in the document.