How to loop over ExampleSets in Rapidminer? - text-mining

I am trying to extract the data from a pdf without the data in the tables.
I used "Read PDF Table", which extracts each table from the PDF as one ExampleSet, so the output is an IOObject collection of ExampleSets.
I tried different "Loop" operators to extract text from this IOObject collection (from the above step), but the operators seem to process only the FIRST ExampleSet in the collection.
Can someone suggest how to loop over ALL the ExampleSets in the IOObject collection?
Note: Since all the ExampleSets are of different types, I couldn't append or join them.

Specific to your question:
Use the Operator "Append (Superset)" from the "Operator Toolbox Extension".
This allows you to append ExampleSets even if there are new Attributes or the Attributes have a different value type.
In general regarding looping over a collection:
The Operator of your choice would be "Loop Collection".
The Operators inside this nested Operator are applied on every ExampleSet in the collection and the output is again a collection of ExampleSets.
Happy Mining,
Edin
P.S.:
Have you already checked the RapidMiner Community website (https://community.rapidminer.com)? Maybe you can find possible future questions already answered there?

Create a List of elements from a DataTable LINQ Column

I would like to know how I can convert elements of a column of a DataTable to a list of type string, grouping the elements to avoid repetition.
For example my DataTable would look like this
[image: DataTable]
and I want to make a list containing the elements of only "User" without repeating itself using LINQ.
The code I was trying to use is
InvoiceList = InvoiceDT.AsEnumerable().GroupBy(Function(r) r("User").ToString).ToList(Function(g) g.ToList())
But it doesn't work for me since I am new to LINQ and still have problems forming the structures.
I'd use this:
InvoiceList = InvoiceDT.AsEnumerable().Select(Function(r) r("User").ToString()).Distinct().ToList()
If you wanted a GroupBy solution it's
InvoiceList = InvoiceDT.AsEnumerable().GroupBy(Function(r) r("User").ToString()).Select(Function(g) g.Key).ToList()
Where your code went wrong was in trying to pass a delegate to ToList; it doesn't take one (and you wouldn't call ToList on g either, as it's a list of data rows with all varying properties).
To reshape the IGroupings produced by GroupBy (each is something like a list of objects that all share the same Key, where Key is a property of the grouping itself) into a sequence of string keys, we Select the Key and then call ToList on the result.
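As a language-neutral illustration of why the two queries give the same list, here is a minimal sketch in Python rather than VB.NET. The sample rows and the "Amount" field are made up for the example; only the "User" column name comes from the question.

```python
# Hypothetical sample rows standing in for the DataTable.
rows = [
    {"User": "Ann", "Amount": 10},
    {"User": "Bob", "Amount": 20},
    {"User": "Ann", "Amount": 30},
]

# Distinct analogue: project the column, then drop repeats (order preserved).
distinct_users = list(dict.fromkeys(r["User"] for r in rows))

# GroupBy analogue: bucket whole rows under their key, then keep only the keys.
groups = {}
for r in rows:
    groups.setdefault(r["User"], []).append(r)
group_keys = list(groups)
```

Each bucket in groups still holds full rows, which is why the grouped version needs the extra step of selecting the key before it can become a list of strings.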
There is a lot of back and forthing between developers over things like ToList vs ToArray - some people universally use ToList because, for collections of an unknown number of elements, both list and array will grow and resize repeatedly in the same way but using ToArray requires one additional resizing step at the end to trim off any unused slots. Mostly that's trivial in terms of an overall performance consideration and should be weighed against the benefit of releasing the memory with the trim. Getting into finer details is way beyond the scope of this answer but you can read some huge blog posts about it.
I personally think it's more important to generate sensible code by calling the method that results in the relevant type depending on what you plan to do with it. I ToList if I need List functionality (add/insert/remove). I prefer ToArray if an array suits the follow-on purposes (read/write/random access, no insert or delete), and if I'll only ever enumerate it I don't To... anything at all - I just ForEach the result of the query, which can give a bigger performance boost than anything else because it means I may not have to enumerate the entire set (if I stop early) or allocate memory all at once for doing so (if I'm writing to a socket or file).
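The benefit of not materializing at all can be sketched in Python, whose generators behave roughly like a deferred LINQ query. This is an analogy under that assumption, not VB.NET code; the users helper and the sample rows are hypothetical.

```python
def users(rows, visited):
    """Lazily yield each row's "User" value, recording every row touched."""
    for row in rows:
        visited.append(row["User"])
        yield row["User"]

rows = [{"User": "Ann"}, {"User": "Bob"}, {"User": "Cy"}, {"User": "Dee"}]

# Materializing first (the ToList/ToArray analogue) touches every row.
eager_visited = []
all_users = list(users(rows, eager_visited))

# Enumerating directly and stopping early touches only the rows consumed.
lazy_visited = []
found = None
for name in users(rows, lazy_visited):
    if name == "Bob":
        found = name
        break
```

After this runs, eager_visited holds all four names while lazy_visited holds only the two that were needed before the loop stopped.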
On the use of ToString; it's worth avoiding if you think you'll fall into a pattern where you do it on every column just to get a string. If the column is already a string it's an acceptable way to get the object that DataRow.Item gives you, into a string. If the column is another type it's better to cast it:
DirectCast(r("Age"), Integer)
r.Field(Of Integer)("Age")
Thing is, it's verbose, and ugly, and intellisense doesn't help you out with writing Age or knowing it's an Int. LINQ in VB is bad enough for verbosity without pouring gas on that fire. If you're working with datatables of a known structure, it's a lot nicer if you make strongly typed ones:
Add a new file of type DataSet to your project
Open it so the design surface appears. In the properties grid call it something reasonable, such as AccountsDataSet
Right click, Add Table, call it Invoices
Right click the empty table, Add Column, call it User
Then use it like:
Dim dt as new AccountsDataSet.InvoicesDataTable
Populate it like:
dt.AddInvoicesRow("John Smith", ... other properties here)
Query it like:
dt.Select(Function(r) r.User).Distinct()
Much nicer than accessing column names by string, and having them be objects that need casting.
Consider the dataset generator as a way to quickly, visually, create POCO classes with named, typed properties.
Try this
dim list as List(of string) = InvoiceDT.Rows.
Cast(of DataRow)().
Select(Function(r) r("User").ToString()).
Distinct().
ToList()
Here you cast the Rows collection to IEnumerable(Of DataRow); the rest is trivial.

ListObjects.Add.QueryTable Source Array String

I will provide some context before I ask my question.
I am attempting to query an SQL Server and create a table within Excel from the data. Because I am not familiar with how to accomplish this in VBA I recorded by using Data -> Get External Data -> From Other Sources -> Microsoft Query. In the dialog box that appears, I chose a .DSN file provided to me by someone else. I then used the Microsoft Query interface to structure the query and import the data onto a worksheet.
The code in the recorded macro looked something like this. I will use generic terms instead of the actual code.
With Sheet2.ListObjects.Add(SourceType:= 0, Source:=Array _
(Array("ODBC;DRIVER=SQL Server;SERVER=ServerName;UID=userid;Trusted_Connection=Yes;APP=Microsoft Windows Operating System;WSID=SomeString"), _
Array("A;DATABASE=DatabaseName")), Destination:=Range ("Sheet2!$A$1")).QueryTable
I know this is not formatted ideally, which is part of my question below.
https://msdn.microsoft.com/en-us/library/bb211863(v=office.12).aspx
From the above article, I know that SourceType:= 0 is an xlSrcExternal, or an external data source. This makes sense to me.
My confusion begins to arise when I get to the Source component of the Add method. From the provided article: "When SourceType = xlSrcExternal, an array of String values specifying a connection to the source, containing the following elements:
•0 - URL to SharePoint site
•1 - ListName
•2 - ViewGUID"
So to begin with, what exactly is meant by "an array of String values", as the code from the recorded macro does not appear to correspond to what I thought was an array. I know that normally an array is declared something like this Array("string1", "string2", etc.). Or is the array recorded simply an array of one value? In other words Array("string1"). Does anyone know the purpose of passing an "array of string values" as opposed to just passing a string?
Also does anyone know the nuances of why the recorded macro has this particular formatting/syntax? In other words, why does it appear to have this syntax Array(Array("string1"),_ (new line) Array("string2"))? Why not just Array ("string1")? Does it have something to do with the second line being too long?
I have several more questions related to this topic, but this seemed like a good place to start.
Thank you all for any help given.

how do you flatten and unflatten an array of doubles in labview?

I have created a simple LabView program shown below that attempts to flatten an array [1,0,3] and then unflatten it and print out the contents.
However, I am unsuccessful in doing so. What am I doing wrong?
"What am I doing wrong?"
You're not going through tutorials or you're not reading the context help for the unflatten function (Ctrl+H) or you're not reading the full help for the function (right click>>Help) or you're not looking at the examples (from the help or Help>>Find Examples). Take your pick (preferably all four).
If you want an actual answer, it is that LabVIEW is strictly typed, so you need to tell the unflatten function which data type you want it to output (a 1D DBL array), and you're not doing that. But the real answer is what's in the previous paragraph: you should use those tools to learn how to find such an answer yourself.
The string returned by Flatten to String only contains the data, not the description of what data type was passed in, so in order to unflatten it again you need to tell Unflatten from String what type it was. You do this by wiring some data of the appropriate type (any data - if it's an array it can be an empty one) to the Type terminal.
I don't think this is immediately obvious from the LabVIEW 2012 help but I think it's fairly clear if you follow the link from the Unflatten from String help page to one of the examples. The Read Flattened Data.vi example has an array wired to the Type input.
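The reason a type has to be supplied when unflattening can be shown with a small text-language analogy. This sketch uses Python's struct module and is not LabVIEW's actual flattened format; the byte order and layout here are choices made for the example only.

```python
import struct

values = [1.0, 0.0, 3.0]

# "Flatten": pack the doubles into raw bytes. Nothing in the bytes records
# that they started life as doubles.
flat = struct.pack(f">{len(values)}d", *values)

# "Unflatten" with the right type (8-byte doubles) recovers the array.
as_doubles = list(struct.unpack(f">{len(flat) // 8}d", flat))

# The same bytes read as 4-byte integers decode without error, but to
# meaningless numbers, so the reader has to be told the intended type.
as_int32 = list(struct.unpack(f">{len(flat) // 4}i", flat))
```

Wiring an array of DBL to the Type terminal plays the same role as choosing the "d" format above: it tells the decoder how to interpret bytes that carry no type description of their own.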

String Template: is it possible to get the n-th element of a Java List in the template?

In String Template one can easily get an element of a Java Map within the template.
Is it possible to get the n-th element of an array in a similar way?
According to the String Template Cheat Sheet you can easily get the first or second element:
You can combine operations to say things like first(rest(names)) to get second element.
but it doesn't seem possible to get the n-th element easily. I usually transform my list into a map with list indexes as keys and do something like
map.("25")
Is there some easier/more straightforward way?
Sorry, there is no mechanism to get a[i].
There is no easy way to get the n-th element of a list.
In my opinion this indicates that your view and business logic are not separated enough: knowledge of what the magic number 25 means is spread across both tiers.
One possible solution might be converting the list of values to an object which gives meaning to the elements. For example, let's say the list of Strings represents address lines, in which case instead of map.("3") you would write address.street.
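Both approaches can be sketched on the model side, before anything is handed to the template. The sketch below is in Python rather than Java, and the Address class, its field names, and the three-line layout are hypothetical; the source only suggests that a property such as address.street would exist.

```python
from dataclasses import dataclass

lines = ["J. Doe", "1 Main St", "Springfield"]  # made-up address lines

# The asker's workaround: a map keyed by the list index as a string,
# which a template could then read with something like map.("1").
indexed = {str(i): value for i, value in enumerate(lines)}

# The suggested alternative: give each position a name so the template
# never needs to know which index means what.
@dataclass
class Address:
    name: str
    street: str
    city: str

    @classmethod
    def from_lines(cls, lines):
        name, street, city = lines  # assumes exactly three lines
        return cls(name, street, city)

address = Address.from_lines(lines)
```

With the second form, the knowledge that the street is the second line lives in one place (from_lines) instead of being repeated as a magic index in the template.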

Extract terms from query for highlighting

I'm extracting terms from the query by calling ExtractTerms() on the Query object that I get as the result of QueryParser.Parse(). I get a Hashtable, but each item appears as:
Key - term:term
Value - term:term
Why are the key and the value the same? Moreover, why is the term value duplicated and separated by a colon?
Do highlighters only insert tags, or do they do anything else? I want not only to get text fragments but also to highlight the source text (it's fairly big). I am trying to get the terms and insert tags by hand using their offsets, but I'm not sure this is the right solution.
I think the answer to this question may help.
It is because .NET 2.0 doesn't have an equivalent to Java's HashSet. The conversion to .NET uses Hashtables with the same object as both key and value. The colon you see is just the result of Term.ToString(): a Term is a field name plus the term text, and your field name is probably "term".
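A rough illustration of both effects, sketched in Python with a stand-in class. This Term is hypothetical and is not the Lucene.Net type; it only mimics the "field:text" string form described above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen gives equality and hashing by field values
class Term:
    field: str
    text: str

    def __str__(self):
        # Mimics the "fieldname:termtext" form described in the answer.
        return f"{self.field}:{self.text}"

terms = [Term("term", "lucene"), Term("term", "highlight"), Term("term", "lucene")]

# A hash table used as a set: each term is stored as both key and value,
# so duplicates collapse and every entry prints the same thing twice.
table = {}
for t in terms:
    table[t] = t

printed = [(str(k), str(v)) for k, v in table.items()]
```

Here printed contains pairs such as ("term:lucene", "term:lucene"), which matches what the question observed: identical key and value, with the colon coming from the string form of a single term rather than from any duplication of the text.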
To highlight an entire document using the Highlighter contrib, use the NullFragmenter.