Visualize H2O XGBoost tree in JSON format, cannot find the missing child - xgboost

I am trying to visualize my H2O XGBoost model in JSON format using the command below:
java -cp h2o-genmodel.jar hex.genmodel.tools.PrintMojo -i XGBoost_model_R_1597776279050_3.zip --tree 1 --format json
The above command outputs the tree structure in JSON format, like below:
"rightChild": {
"nodeNumber": 2,
"weight": 0.0,
"colId": 382,
"colName": "var_2",
"leftward": false,
"isCategorical": false,
"inclusiveNa": false,
"splitValue": 0.195,
"rightChild": {
"nodeNumber": 6,
"weight": 0.0,
"colId": 340,
"colName": "var_6",
"leftward": false,
"isCategorical": false,
"inclusiveNa": true,
"splitValue": 1.0,
"rightChild": {
"nodeNumber": 10,
"weight": 0.0,
"predValue": 0.011794609
},
"leftChild": {
"nodeNumber": 9,
"weight": 0.0,
"predValue": 0.011531689
}
I am trying to understand how the missing child can be determined from the above JSON for each node. The same structure can be viewed in PNG format, and there the missing branch for node var_6 is shown as the left child. Is there a way to figure out the missing node by looking at the JSON?

Check out the leftward attribute - it's the so-called "majority way", which means that all data records that can't be explicitly evaluated (e.g. a data record with a missing value) will be sent in that direction.
In the current case, "leftward": false should cause missing values to be sent to the right child node (node number 10).
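To make this concrete, the routing rule can be read straight from the JSON. Here is a small Python sketch (assuming node dictionaries with exactly the keys shown above: leftward, colName, leftChild, rightChild, predValue) that walks a subtree and reports where missing values go at each split:

```python
def missing_direction(node):
    """Return which child receives records whose split value is missing,
    based on the node's "leftward" flag."""
    return "left" if node["leftward"] else "right"

def describe_splits(node, out=None):
    """Walk a PrintMojo-style node dict and collect one line per split."""
    if out is None:
        out = []
    if "predValue" in node:  # leaf node: no split here
        return out
    out.append(f'node {node["nodeNumber"]} splits on {node["colName"]}: '
               f'missing values go {missing_direction(node)}')
    describe_splits(node["leftChild"], out)
    describe_splits(node["rightChild"], out)
    return out
```

For the var_6 node above, "leftward": false yields "right", i.e. node 10.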

Related

DataTables Pager Showing Many Pages when there is Only One

This is a weird one.
I'm using datatables v1.10.19 with jQuery 3.3.1 and Bootstrap 3.3.7
My datatables grid is configured to display 1000 records (but you can change it to 2500, 5000 and "all").
I've only got about 60 records in my database.
It is using Server-Side processing to retrieve data.
When the grid loads, the pager displays 5 buttons plus an ellipsis (as if there were even more).
And even weirder, if I change the drop-down to display "all" records, it acts as I would expect, i.e. the pager has 1 page button.
The payloads are pretty much identical:
{
  "data": {
    "draw": 8,
    "recordsTotal": 86,
    "recordsFiltered": 66,
    "data": [rows of data here]
  },
  "outcome": {
    "opResult": "Success",
    "message": ""
  }
}
When you click on page 2, it does successfully retrieve a payload with 0 rows.
But there shouldn't be a page 2 available on the pager.
The config object for the datatable looks like this:
eventsSvr.buildConfig = function (url) {
    return {
        "processing": true,
        "serverSide": true,
        //"paging": true,
        "ajax": {
            url: url,
            type: ajax.requestPOST,
            dataSrc: 'data.data' // the path in the JSON structure to the array which will be the rows.
        },
        "order": [[1, "asc"]],
        "lengthMenu": [[1000, 2500, 5000, -1], [1000, 2500, 5000, "All"]],
        "initComplete": function (settings, json) {
            eventsSvr.searchTextSpan.text('Search').removeClass('search-is-on');
        },
        "columns": eventsSvr.grid.columns,
        "columnDefs": eventsSvr.grid.columnDefs,
        dom: 'ltp'
    };
};
I do have a bunch of custom searches on the page, so I've had to write a lot of code like this:
$.fn.dataTable.ext.search.push(
    function (settings, data, dataIndex) {
        var picker3 = $(eventsSvr.datePickerInputs[0]).data(icapp.kendoKey);
        var picker4 = $(eventsSvr.datePickerInputs[1]).data(icapp.kendoKey);
        var rowStartDate = moment(data[3], icapp.date.momentParseFormat).toDate();
        var rowEndDate = moment(data[4], icapp.date.momentParseFormat).toDate();
        // ... etc.
    }
);
But the odd thing is the different behavior as between "All" records vs 1000 records.
As described above, selecting "All" records works (resulting in just 1 page button), but none of the other page sizes work (i.e. 1000, 2500, 5000). The data for the single page does return, but I get 5 page buttons and an ellipsis.
Any ideas why this would be happening?
When using server-side processing mode, DataTables expects draw, recordsTotal and recordsFiltered to be root-level elements. Consider changing your response to the following, and then you can remove the dataSrc option.
{
  "draw": 8,
  "recordsTotal": 86,
  "recordsFiltered": 66,
  "data": [rows of data here],
  "outcome": {
    "opResult": "Success",
    "message": ""
  }
}
Alternatively, you can manipulate the response before passing it to DataTables by supplying a function as the value of the dataSrc option, but I would recommend keeping the response in the expected format for more readable code.
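If the response is easier to change on the server than on the client, the flattening itself is trivial. A minimal Python sketch of a server-side reshaping step (the key names match the payloads above; the web framework around it is assumed):

```python
def to_datatables_payload(response):
    """Hoist the nested draw/recordsTotal/recordsFiltered/data fields to the
    root level, which is where server-side DataTables expects them."""
    inner = response["data"]
    return {
        "draw": inner["draw"],
        "recordsTotal": inner["recordsTotal"],
        "recordsFiltered": inner["recordsFiltered"],
        "data": inner["data"],
        # extra root-level keys such as "outcome" are simply ignored by DataTables
        "outcome": response["outcome"],
    }
```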

Spark SQL to explode array of structure

I have the JSON structure below, which I am trying to convert into a structure with each struct field as a column, as shown below, using Spark SQL. explode(control) is not working. Can someone please suggest a way to do this?
Input:
{
  "control": [
    {
      "controlIndex": 0,
      "customerValue": 100.0,
      "guid": "abcd",
      "defaultValue": 50.0,
      "type": "discrete"
    },
    {
      "controlIndex": 1,
      "customerValue": 50.0,
      "guid": "pqrs",
      "defaultValue": 50.0,
      "type": "discrete"
    }
  ]
}
Desired output:
controlIndex customerValue guid defaultValue type
0            100.0         abcd 50.0         discrete
1            50.0          pqrs 50.0         discrete
In addition to Paul Leclercq's answer, here is what can work:
import org.apache.spark.sql.functions.explode
df.select(explode($"control").as("control")).select("control.*")
explode will create a new row for each element in the given array or map column:
import org.apache.spark.sql.functions.explode
df.select(
explode($"control")
)
explode alone will not give you the desired columns, as this is not a plain array column but an array of structs. You might want to try something like:
df.select(col("control.controlIndex"), col("control.customerValue"), col("control.guid"), col("control.defaultValue"), col("control.type"))
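For intuition about what the accepted approach produces, explode on an array-of-struct column followed by select("control.*") behaves like this plain-Python sketch (illustrative only, not Spark code):

```python
def explode_structs(rows, array_col):
    """Emit one output row per element of rows[i][array_col], with each
    struct field promoted to a top-level column -- analogous to Spark's
    explode($"control") followed by select("control.*")."""
    out = []
    for row in rows:
        for element in row[array_col]:
            out.append(dict(element))  # struct fields become columns
    return out
```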

AWS boto3 page_iterator.search can't compare datetime.datetime to str

Trying to capture delta files (files created after the last processing run) sitting on S3. To do that, I am using the boto3 filter iterator to query the LastModified value, rather than returning the whole list of files and filtering on the client side.
According to http://jmespath.org/, the query below is valid and should filter the following JSON response:
filtered_iterator = page_iterator.search(
    "Contents[?LastModified>='datetime.datetime(2016, 12, 27, 8, 5, 37, tzinfo=tzutc())'].Key")
for key_data in filtered_iterator:
    print(key_data)
However, it fails with:
RuntimeError: xxxxxxx has failed: can't compare datetime.datetime to str
Sample paginator response:
{
  "Contents": [{
    "LastModified": "datetime.datetime(2016, 12, 28, 8, 5, 31, tzinfo=tzutc())",
    "ETag": "1022dad2540da33c35aba123476a4622",
    "StorageClass": "STANDARD",
    "Key": "blah1/blah11/abc.json",
    "Owner": {
      "DisplayName": "App-AWS",
      "ID": "bfc77ae78cf43fd1b19f24f99998cb86d6fd8220dbfce0ce6a98776253646656"
    },
    "Size": 623
  }, {
    "LastModified": "datetime.datetime(2016, 12, 28, 8, 5, 37, tzinfo=tzutc())",
    "ETag": "1022dad2540da33c35abacd376a44444",
    "StorageClass": "STANDARD",
    "Key": "blah2/blah22/xyz.json",
    "Owner": {
      "DisplayName": "App-AWS",
      "ID": "bfc77ae78cf43fd1b19f24f99998cb86d6fd8220dbfce0ce6a81234e632c5a8c"
    },
    "Size": 702
  }]
}
The boto3 JMESPath implementation does not support filtering on dates (it will mark them as incompatible types, "unicode" and "datetime" in your example). But because of the way dates are rendered by Amazon, you can perform a lexicographical comparison on them using the to_string() function of JMESPath.
Something like this:
"Contents[?to_string(LastModified)>='\"2015-01-01 01:01:01+00:00\"']"
But keep in mind that it's a lexicographical comparison, not a date comparison. It works most of the time, though.
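The reason the lexicographical comparison usually works: ISO-8601-style timestamps rendered with a fixed-width format and the same UTC offset sort identically as strings and as datetimes. A quick plain-Python check of that property:

```python
from datetime import datetime, timezone

# Timestamps in a fixed-width "YYYY-MM-DD HH:MM:SS+00:00" rendering order
# the same way whether compared as strings or as datetime objects.
a = datetime(2016, 12, 27, 8, 5, 37, tzinfo=timezone.utc)
b = datetime(2016, 12, 28, 8, 5, 31, tzinfo=timezone.utc)

assert (a < b) == (str(a) < str(b))
assert str(a) == "2016-12-27 08:05:37+00:00"
```

This breaks down as soon as the rendered strings mix offsets or widths, which is why it is a lexicographical trick rather than a true date comparison.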
After spending a few minutes on the boto3 paginator documentation, I realised it is actually a syntax problem which I had overlooked, reading the quote as an ordinary string quote.
Actually, the quotes that embrace the comparison value on the right are backquotes/backticks, symbol [ ` ]. You cannot use single quotes [ ' ] for the comparison values/objects.
After inspecting the JMESPath examples, I noticed they use backquotes for the comparison value, so the boto3 paginator implementation does indeed comply with the JMESPath standard.
Here is the code I ran without error, using backquotes:
import boto3

s3 = boto3.client("s3")
s3_paginator = s3.get_paginator('list_objects')
s3_iterator = s3_paginator.paginate(Bucket='mytestbucket')
filtered_iterator = s3_iterator.search(
    "Contents[?LastModified >= `datetime.datetime(2016, 12, 27, 8, 5, 37, tzinfo=tzutc())`].Key"
)
for key_data in filtered_iterator:
    print(key_data)

How to load JSON data in the form of a map in Spark SQL?

I have JSON data as shown below:
"vScore": {
"300x600": {
"v1": "0.50",
"v2": "0.67",
"v3": "ATF",
"v4": "H2",
"v5": "0.11"
},
"728x90": {
"v1": "0.48",
"v2": "0.57",
"v3": "Unknown",
"v4": "H2",
"v5": "0.51"
},
"300x250": {
"v1": "0.64",
"v2": "0.77",
"v3": "ATF",
"v4": "H2",
"v5": "0.70"
},
I want to load this JSON data in the form of a map, i.e. load vScore into a map so that each size such as 300x250 becomes a key and the nested v1...v5 object becomes the value.
How can I do this in Spark SQL in Scala?
You need to load your data using:
data = sqlContext.read.json("file")
You can check how your data was loaded with:
data.printSchema()
Then get your data with a select query, using:
data.select....
More:
How to parse jsonfile with spark
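For intuition about the target shape, here is a plain-Python sketch of the map being asked for (a trimmed copy of the data above; not Spark code — in Spark the analogous column type would be a map from the size string to the v1..v5 struct):

```python
import json

# A trimmed version of the vScore document above.
raw = '''{"vScore": {
    "300x600": {"v1": "0.50", "v2": "0.67", "v3": "ATF"},
    "300x250": {"v1": "0.64", "v2": "0.77", "v3": "ATF"}
}}'''

vscore = json.loads(raw)["vScore"]

# Each size string is a key; the nested object is the value.
assert vscore["300x250"]["v1"] == "0.64"
assert sorted(vscore) == ["300x250", "300x600"]
```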

RavenDB facet query takes too long

I am new to RavenDB and am trying it out to see if it can do the job for the company I work for.
I uploaded 10K records to the server.
Each record looks like this:
{
  "ModelID": 371300,
  "Name": "6310I",
  "Image": "0/7/4/9/28599470c",
  "MinPrice": 200.0,
  "MaxPrice": 400.0,
  "StoreAmounts": 4,
  "AuctionAmounts": 0,
  "Popolarity": 16,
  "ViewScore": 0.0,
  "ReviewAmount": 4,
  "ReviewScore": 40,
  "Cat": "E-Cellphone",
  "CatID": 1100,
  "IsModel": true,
  "ParamsList": [
    {
      "ModelID": 371300,
      "Pid": 188396,
      "IntVal": 188402,
      "Param": "Nokia",
      "Name": "Manufacturer",
      "Unit": "",
      "UnitDir": "left",
      "PrOrder": 0,
      "IsModelPage": true
    },
    {
      "ModelID": 371305,
      "Pid": 398331,
      "IntVal": 1559552,
      "Param": "1.6",
      "Name": "Screen size",
      "Unit": "inch",
      "UnitDir": "left",
      "PrOrder": 1,
      "IsModelPage": false
    },.....
where ParamsList is an array of all the attributes of a single product.
After building an index of:
from doc in docs.FastModels
from docParamsListItem in ((IEnumerable<dynamic>)doc.ParamsList).DefaultIfEmpty()
select new { Param = docParamsListItem.IntVal, Cat = doc.Cat }
and a facet of
var facetSetupDoc = new FacetSetup
{
    Id = "facets/Params2Facets",
    Facets = new List<Facet> { new Facet { Name = "Param" } }
};
and searching like this:
var facetResults = session.Query<FastModel>("ParamNewIndex")
    .Where(x => x.Cat == "e-cellphone")
    .ToFacets("facets/Params2Facets");
it takes more than a second to query, and that is with only 10K records, whereas our company has more than 1M products in the DB.
Am I doing something wrong?
In order to generate facets, you have to check each and every individual value of docParamsListItem.IntVal. If you have a lot of them, that can take some time.
In general, you shouldn't have a lot of facet values, since that makes no sense and doesn't help the user.
For integers, you usually use ranges instead of the actual values.
For example, a price within a certain range.
You use the raw field only for things like the manufacturer or the megapixel count, where you have a low number of items (about a dozen or two).
You didn't mention which build you are using, but we made some major improvements there recently.
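The "use ranges instead of actual values" advice amounts to bucketing numeric fields before faceting, so a facet carries a dozen distinct values instead of thousands. A plain-Python sketch with hypothetical bucket boundaries (RavenDB expresses this with range facets on the server, not client-side code):

```python
def price_bucket(price, bounds=(100, 200, 400, 800)):
    """Map a numeric value to a coarse range label, so the facet has only
    len(bounds) + 1 distinct values no matter how many products exist."""
    lo = 0
    for hi in bounds:
        if price < hi:
            return f"{lo}-{hi}"
        lo = hi
    return f"{bounds[-1]}+"
```

With these boundaries, the MinPrice of 200.0 in the sample record would land in the 200-400 bucket.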