Can't run k-means with SPSS Modeler 16

I'm using IBM SPSS Modeler 16.0 to analyze my data, which has four fields, all of them retrieved from a database as strings and converted to numbers with a replace node using to_number(). When I connect my node to the k-means node to create the clusters from that data, I get an error (I'm running a French version, so this is a translation of the error):
Type not sufficiently specified for field 'MyField1'
Type not sufficiently specified for field 'MyField2'
Type not sufficiently specified for field 'MyField3'
Type not sufficiently specified for field 'MyField4'
I tried almost everything but I can't get rid of this error. Can anyone help me figure this out?
Many thanks.

You will need to instantiate the input fields used by the k-means model.
You do this by adding a Type node before the modeling node and after any field operation node that computes or changes any of the fields used as input to the model.
In the Type node, make sure to click the "Read Values" button or make the proper selections for each field; this is what instantiates the fields.
This step is not only required for the k-means model, but for most (if not all) of the modeling nodes.

Is it possible to access SCIP's Statistics output values directly from PyScipOpt Model Object?

I'm using SCIP to solve MILPs in Python using PyScipOpt. After solving a problem, the solver statistics can be either 1) printed as a string using printStatistics(), or 2) saved to an external file using writeStatistics(). For example:
import pyscipopt as pso
model = pso.Model()
model.addVar(name="x", obj=1)
model.optimize()
model.printStatistics()
model.writeStatistics(filename="stats.txt")
There's a lot of information in printStatistics/writeStatistics that doesn't seem to be accessible from the Python model object directly (e.g. the primal-dual integral value, data for individual branching rules or primal heuristics, etc.). It would be helpful to be able to extract the data from this output via, e.g., attributes of the model object or a dictionary.
Is there any way to access this information from the model object without having to parse the raw text/file output?
PySCIPOpt does not provide access to the statistics directly. The data for the various tables (e.g. separators, presolvers, etc.) are stored separately for every single plugin in SCIP and are sometimes not straightforward to collect.
If you are only interested in certain statistics about the general solving process, then you might want to add PySCIPOpt wrappers for a few of the simple get functions defined in scip_solvingstats.c.
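For example, a few high-level numbers are already wrapped on the Model object, so you can collect them yourself; a minimal sketch (assuming a solved model as in the snippet above -- the exact set of available getters depends on your PySCIPOpt version):
# collect a few of the already-wrapped aggregate statistics into a dict
stats = {
    "status": model.getStatus(),
    "solving_time": model.getSolvingTime(),
    "nodes": model.getNNodes(),
    "gap": model.getGap(),
    "primal_bound": model.getPrimalbound(),
    "dual_bound": model.getDualbound(),
}
print(stats)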
Lastly, you might want to check out IPET for parsing the statistics output.

Is it possible in my case to implement a strategy pattern with different semantics of algorithms?

Hi everyone, I am new to Stack Overflow, so if you like my example please vote it up so I can get the 50 reputation needed for some extra features.
Now let's start with my problem.
I have several classes that literally convert one data model to another.
Different classes use different versions of the data model.
Here is my example:
In this example I have 3 converters (for now) and two algorithms that convert one data model to another, but they work with different versions of the data model. For example, AlgoVerOne works with an older version of the data model, while AlgoVer2 works with a newer version that contains more or less information.
What matters is that ConverterA and ConverterB use the same version of the data model. So the conversion algorithm is exactly the same because the versions of the data model do not differ.
PROBLEM
My problem is that the semantics of some parts are different for these two classes. Let's say there is an element in a data model that has a value of 100. This value can be converted and inserted into another data model, because these classes use the same version of it. But the value 100 means "car" for ConverterA, while for ConverterB it means "bus".
So the algorithm needed to convert one data model to another is the same, but the value of an element within that data model is semantically different for these two classes.
I don't want to write a completely separate algorithm for each of the two classes, because the difference is only about 1% of the semantics of the whole data model.
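To make this concrete, here is a rough sketch of one way to express it (all names below -- SourceModel, TargetModel, AlgoVer1, the vehicle-type field -- are hypothetical and not taken from the real code): the conversion algorithm stays shared, and only the value semantics are injected per converter.
// Sketch only: every type and name below is illustrative.
class SourceModel { int code; SourceModel(int c) { code = c; } }
class TargetModel { String vehicleType; }

// The value semantics are kept separate from the conversion algorithm.
interface ValueSemantics { String interpret(int code); }

// One algorithm per data-model version; the semantics are injected, not hard-coded.
class AlgoVer1 {
    TargetModel convert(SourceModel s, ValueSemantics semantics) {
        TargetModel t = new TargetModel();
        t.vehicleType = semantics.interpret(s.code); // identical for ConverterA and ConverterB
        return t;
    }
}

public class ConverterDemo {
    public static void main(String[] args) {
        AlgoVer1 shared = new AlgoVer1();
        ValueSemantics semanticsA = code -> code == 100 ? "car" : "unknown";
        ValueSemantics semanticsB = code -> code == 100 ? "bus" : "unknown";
        SourceModel element = new SourceModel(100);
        System.out.println(shared.convert(element, semanticsA).vehicleType); // car
        System.out.println(shared.convert(element, semanticsB).vehicleType); // bus
    }
}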

Contact pressure representation in Abaqus

The main question concerns extracting the contact pressure from an .odb file.
The issue is described by the three observations below.
Imagine that we have a simple 3D contact model in Abaqus/CAE:
1. If we plot CPRESS on the deformed shape in the visualization module, we get one value of CPRESS for each node. We get the same (one value per node) if we request XY data field output for all frames. This all seems fine because, as far as I know, Abaqus/CAE averages surface output (CPRESS) so that it can be requested as nodal output.
2. If we use the "Probe values" tool to examine the CPRESS value at a node, we get four values for one node. This still seems fine because, I suppose, it shows the values before averaging.
3. If we request the CPRESS values from the command window using this script:
odb.steps['step_name'].frames[frame_number].fieldOutputs['CPRESS'].getSubset(region='node_path').values
the length of this vector of CPRESS values at a single node may be anywhere from 1 to 6, depending on the chosen node. And the number of CPRESS values obtained with this method has no connection with the number obtained with method 2.
So the problem is that I can't understand how the vector of CPRESS values at a node is formed.
I found very little information about this topic in the Abaqus manual.
I hope somebody can help.
Probe Values extracts the CPRESS values for the whole element. It shows the face number and its node IDs together with their corresponding values.
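One way to see where those values come from is to look at the metadata carried by each FieldValue in the odb; a minimal scripting sketch (the odb name, step name and grouping are placeholders, not taken from the original model):
from odbAccess import openOdb

odb = openOdb('job.odb')                    # placeholder odb name
frame = odb.steps['Step-1'].frames[-1]      # placeholder step name, last frame
cpress = frame.fieldOutputs['CPRESS']

per_node = {}
for v in cpress.values:
    # each value records the element and position it was written for, so
    # several element faces may contribute separate entries to the same node
    per_node.setdefault(v.nodeLabel, []).append((v.elementLabel, v.position, v.data))

for node, entries in sorted(per_node.items()):
    print('%s: %s' % (node, entries))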

Validating rows before inserting into BigQuery from Dataflow

According to "How do we set maximum_bad_records when loading a Bigquery table from dataflow?", there is currently no way to set the maxBadRecords configuration when loading data into BigQuery from Dataflow. The suggestion is to validate the rows in the Dataflow job before inserting them into BigQuery.
If I have the TableSchema and a TableRow, how do I go about making sure that the row can safely be inserted into the table?
There must be an easier way of doing this than iterating over the fields in the schema, looking at their type and looking at the class of the value in the row, right? That seems error-prone, and the method must be fool-proof since the whole pipeline fails if a single row cannot be loaded.
Update:
My use case is an ETL job that at first will run on JSON (one object per line) logs on Cloud Storage and write to BigQuery in batch, but later will read objects from PubSub and write to BigQuery continuously. The objects contain a lot of information that isn't necessary to have in BigQuery and also contains parts that aren't even possible to describe in a schema (basically free form JSON payloads). Things like timestamps also need to be formatted to work with BigQuery. There will be a few variants of this job running on different inputs and writing to different tables.
In theory it's not a very difficult process, it takes an object, extracts a few properties (50-100), formats some of them and outputs the object to BigQuery. I more or less just loop over a list of property names, extract the value from the source object, look at a config to see if the property should be formatted somehow, apply the formatting if necessary (this could be downcasing, dividing a millisecond timestamp by 1000, extracting the hostname from a URL, etc.), and write the value to a TableRow object.
My problem is that data is messy. With a couple of hundred million objects there are some that don't look as expected, it's rare, but with these volumes rare things still happen. Sometimes a property that should contain a string contains an integer, or vice-versa. Sometimes there's an array or an object where there should be a string.
Ideally I would like to take my TableRow, hand it to the TableSchema and ask "does this work?".
Since this isn't possible, what I do instead is look at the TableSchema object and try to validate/cast the values myself. If the TableSchema says a property is of type STRING I run value.toString() before adding it to the TableRow. If it's an INTEGER I check that it's an Integer, Long or BigInteger, and so on. The problem with this method is that I'm just guessing what will work in BigQuery. What Java data types will it accept for FLOAT? For TIMESTAMP? I think my validations/casts catch most problems, but there are always exceptions and edge cases.
In my experience, which is very limited, the whole work pipeline (job? workflow? not sure about the correct term) fails if a single row fails BigQuery's validations (just like a regular load does unless maxBadRecords is set to a sufficiently large number). It also fails with superficially helpful messages like 'BigQuery import job "dataflow_job_xxx" failed. Causes: (5db0b2cdab1557e0): BigQuery job "dataflow_job_xxx" in project "xxx" finished with error(s): errorResult: JSON map specified for non-record field, error: JSON map specified for non-record field, error: JSON map specified for non-record field, error: JSON map specified for non-record field, error: JSON map specified for non-record field, error: JSON map specified for non-record field'. Perhaps there is somewhere I can see a more detailed error message that could tell me which property it was and what its value was? Without that information it could just as well have said "bad data".
From what I can tell, at least when running in batch mode Dataflow will write the TableRow objects to the staging area in Cloud Storage and then start a load once everything is there. This means that there is nowhere for me to catch any errors, my code is no longer running when BigQuery is loaded. I haven't run any job in streaming mode yet, but I'm not sure how it would be different there, from my (admittedly limited) understanding the basic principle is the same, it's just the batch size that's smaller.
People use Dataflow and BigQuery, so it can't be impossible to make this work without always having to worry about the whole pipeline stopping because of a single bad input. How do people do it?
I'm assuming you deserialize the JSON from the file as a Map<String, Object>. Then you should be able to recursively type-check it with a TableSchema.
I'd recommend an iterative approach to developing your schema validation, with the following two steps.
Write a PTransform<Map<String, Object>, TableRow> that converts your JSON rows to TableRow objects. The TableSchema should also be a constructor argument to the function. You can start off making this function really strict -- for instance, require that the parsed JSON value actually is an Integer whenever the BigQuery schema says INTEGER -- and aggressively declare records in error. Basically, ensure that no invalid records are output by being super-strict in your handling (a rough sketch of such a strict converter follows after these steps).
Our code here does something somewhat similar -- given a file produced by BigQuery and written as JSON to GCS, we recursively walk the schema and do some type conversions. However, we do not need to validate, because BigQuery itself wrote the data.
Note that the TableSchema object is not Serializable. We've worked around this by converting the TableSchema in a DoFn or PTransform constructor to a JSON String and back. See the code in BigQueryIO.java that uses the jsonTableSchema variable.
Use the "dead-letter" strategy described in this blog post to handle bad records -- side output the offending Map<String, Object> rows from your PTransform and write them to a file. That way, you can inspect the rows that failed your validation later.
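As a starting point for step 1, here is a rough sketch of such a strict converter (the class and method names are mine, the type handling is deliberately incomplete -- no REPEATED fields, no TIMESTAMP parsing -- and anything unexpected is rejected rather than guessed at):
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Map;

public class StrictRowConverter {

    // Convert a parsed JSON object into a TableRow, failing fast on anything
    // that does not match the schema so no questionable record reaches BigQuery.
    public static TableRow toTableRow(Map<String, Object> json, TableSchema schema) {
        TableRow row = new TableRow();
        for (TableFieldSchema field : schema.getFields()) {
            Object value = json.get(field.getName());
            if (value == null) {
                if ("REQUIRED".equals(field.getMode())) {
                    throw new IllegalArgumentException("Missing required field: " + field.getName());
                }
                continue;
            }
            row.set(field.getName(), coerce(field, value));
        }
        return row;
    }

    private static Object coerce(TableFieldSchema field, Object value) {
        String type = field.getType();
        if ("STRING".equals(type) && value instanceof String) {
            return value;
        }
        if ("INTEGER".equals(type) && (value instanceof Integer || value instanceof Long)) {
            return value;
        }
        if ("FLOAT".equals(type) && value instanceof Number) {
            return ((Number) value).doubleValue();
        }
        if ("BOOLEAN".equals(type) && value instanceof Boolean) {
            return value;
        }
        if ("RECORD".equals(type) && value instanceof Map) {
            @SuppressWarnings("unchecked")
            Map<String, Object> nested = (Map<String, Object>) value;
            return toTableRow(nested, new TableSchema().setFields(field.getFields()));
        }
        throw new IllegalArgumentException("Field " + field.getName() + " of type " + type
                + " got an incompatible value of class " + value.getClass().getName());
    }
}
The dead-letter output from step 2 then catches everything this converter rejects, instead of the whole pipeline failing.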
You might start with some small files and use the DirectPipelineRunner rather than the DataflowPipelineRunner. The direct runner runs the pipeline on your computer, rather than on Google Cloud Dataflow service, and it uses the BigQuery streaming writes. I believe when those writes fail you will get better error messages.
(We use the GCS->BigQuery Load Job pattern for Batch jobs because it's much more efficient and cost-effective, but BigQuery streaming writes in Streaming jobs because they are low-latency.)
Finally, in terms of logging information:
Definitely check Cloud Logging (by following the Worker Logs link on the logs panel).
You may get better information about why the load jobs triggered by your Batch Dataflows fail if you run the bq command-line utility: bq show -j PROJECT:dataflow_job_XXXXXXX.

DATA_BUFFER_EXCEEDED error when calling RFC_READ_TABLE?

My Java/Groovy program receives table names and table fields from user input; it queries the tables in SAP and returns their contents.
The user input may concern the tables CDPOS and CDHDR. After reading the SAP documentation and googling, I found that these are tables storing change document logs. But I did not find any remote-enabled function modules that can be used from Java to perform this kind of query.
I then used the deprecated RFC function module RFC_READ_TABLE and tried to build customized queries depending only on this RFC. However, I found that if I pass more than 2 desired fields to this RFC, I always get the DATA_BUFFER_EXCEEDED error, even if I limit the maximum number of rows.
I am not authorized to be an ABAP developer in the SAP system and cannot add any function modules to the existing systems, so I can only accomplish this requirement with code written in Java.
Am I doing something wrong? Could you give me some hints on that issue?
DATA_BUFFER_EXCEEDED only happens if the total width of the fields you want to read exceeds the width of the DATA parameter, which may vary depending on the SAP release - 512 characters for current systems. It has nothing to do with the number of rows, only with the size of a single data set (row).
So the question is: what are the contents of the FIELDS parameter? If it's empty, this means "read all fields." CDHDR is 192 characters wide, so I'd assume that the problem is CDPOS, which is 774 characters wide. The main issue would be the fields VALUE_OLD and VALUE_NEW, both 245 characters long.
Even if you don't get developer access, you should prod someone to get read-only dictionary access to be able to examine the structures in detail.
Shameless plug: RCER contains a wrapper class for RFC_READ_TABLE that takes care of field handling and ensures that the total width of the selected fields is below the limit imposed by the function module.
Also be aware that these tables can be HUGE in production environments - think billions of entries. You can easily bring your database to a grinding halt by performing excessive read operations on these tables.
PS: RFC_READ_TABLE is not released for customer use as per SAP note 382318, and note 758278 recommends creating your own function module and provides a template with improved logic.
Use BBP_RFC_READ_TABLE instead
There is a way around the DATA_BUFFER_EXCEEDED error. Although this function is not released for customer use as per SAP OSS note 382318, you can get around this issue by changing the way you pass parameters to the function. It's not a single field that causes the error; the error is raised if a row of data exceeds 512 bytes. CDPOS will have this issue for sure!
The workaround, if you know how to call the function using JCo and pass table parameters, is to specify the exact fields you want returned. You can then keep the returned results under the 512-byte limit.
Using your example of table CDPOS, specify something like this and you should be good to go... (be careful, CDPOS can get massive! You should specify and pass a WHERE clause!)
FIELDS = 'OBJECTCLAS'....
FIELDS = 'OBJECTID'
In Java it can be expressed as:
listParams.setValue(this.getpObjectclas(), "OBJECTCLAS");
By limiting the fields you return, you can avoid this error.
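For completeness, here is a sketch of the same call with the JCo 3 API (the destination name, field list and WHERE clause are placeholders; restricting the FIELDS table is what keeps each returned row under the DATA width limit):
import com.sap.conn.jco.JCoDestination;
import com.sap.conn.jco.JCoDestinationManager;
import com.sap.conn.jco.JCoException;
import com.sap.conn.jco.JCoFunction;
import com.sap.conn.jco.JCoTable;

public class ReadCdpos {
    public static void main(String[] args) throws JCoException {
        JCoDestination dest = JCoDestinationManager.getDestination("MY_SAP_SYSTEM"); // placeholder destination
        JCoFunction fn = dest.getRepository().getFunction("RFC_READ_TABLE");

        fn.getImportParameterList().setValue("QUERY_TABLE", "CDPOS");
        fn.getImportParameterList().setValue("DELIMITER", "|");
        fn.getImportParameterList().setValue("ROWCOUNT", 100); // always limit the result set

        // request only the columns you need so each row stays below the DATA width limit
        JCoTable fields = fn.getTableParameterList().getTable("FIELDS");
        for (String name : new String[] {"OBJECTCLAS", "OBJECTID", "CHANGENR"}) {
            fields.appendRow();
            fields.setValue("FIELDNAME", name);
        }

        // a WHERE clause keeps the read selective (placeholder condition)
        JCoTable options = fn.getTableParameterList().getTable("OPTIONS");
        options.appendRow();
        options.setValue("TEXT", "OBJECTCLAS = 'MATERIAL'");

        fn.execute(dest);

        JCoTable data = fn.getTableParameterList().getTable("DATA");
        for (int i = 0; i < data.getNumRows(); i++) {
            data.setRow(i);
            System.out.println(data.getString("WA")); // delimited row content
        }
    }
}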