How to test a black-box predictive model in data science

I have a question about the following situation.
Suppose I have a black box that contains only the code for one specific model, such as a Support Vector Machine, with no other information in the box.
How should I test whether the model is still effective to use?
Thanks.

I would:
- First, figure out whether it works, and how to train it and generate predictions.
- Then pick a couple of datasets and divide each one into training and test data.
- Train and test the black-box model and compare the results with a couple of known models.
The point to stress here is to make sure you don't train your model(s) with your testing data, because performance on unseen data is the true test of how the model will generalize. If you're new to modelling, this is the most important thing.
It is common for certain models to do well on some types of data and not others, so that's the trick here: finding where the black box can be effective.
If your goal is to figure out which model is in the box, select datasets known to favour certain models; if it does well on them, you can make an educated guess. But it is tricky to say for sure.
Not knowing the type of model is a drawback because it can waste time if you are running a bunch of different algorithms on some data: you don't want to duplicate your efforts, and it's nice to know how the model can be regularized (unless it does that for you).
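The steps above can be sketched in a few lines. This is only an illustration under stated assumptions: `BlackBox` is a hypothetical stand-in (here a nearest-centroid classifier) for whatever is really in the box, `MajorityBaseline` plays the role of a "known model", and the data is synthetic. The one thing it is meant to show is that the test rows are never passed to `fit`.

```python
import random
from collections import Counter


class BlackBox:
    """Hypothetical stand-in for the opaque model: a nearest-centroid classifier."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, value in enumerate(row):
                acc[i] += value
        self.centroids = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, X):
        def nearest(row):
            return min(
                self.centroids,
                key=lambda label: sum(
                    (a - b) ** 2 for a, b in zip(row, self.centroids[label])
                ),
            )
        return [nearest(row) for row in X]


class MajorityBaseline:
    """Known reference model: always predicts the most common training label."""

    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label] * len(X)


def train_test_split(X, y, test_fraction=0.3, seed=0):
    """Shuffle once, then hold out rows that are never used for training."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    cut = len(indices) - round(len(indices) * test_fraction)
    train, test = indices[:cut], indices[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def make_toy_data(n=200, seed=1):
    """Two Gaussian blobs; purely synthetic data for illustration."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = 3.0 * label
        X.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
        y.append(label)
    return X, y


X, y = make_toy_data()
X_train, y_train, X_test, y_test = train_test_split(X, y)
scores = {}
for name, model in [("black box", BlackBox()), ("majority baseline", MajorityBaseline())]:
    model.fit(X_train, y_train)                      # training rows only
    scores[name] = accuracy(y_test, model.predict(X_test))
print(scores)
```

Repeating this over several datasets, and adding more reference models to the list, gives the comparison described above.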

Related

Conditional GANs to Causal GANs?

Can we use conditional GANs to show causality in our data?
I tried a Conditional GAN and I want to know how I can convert it into a causal one.
Finding causal relationships is very difficult and depends on both the model and the data.
Generally speaking, there is no quick fix that can just make any complex ML model into a causal one (this applies to GANs as much as to anything else). It all depends on what data you have and what causal relationships you hope to find or estimate.
For example, if you have data with a lot of interventions (e.g. data collected through many controlled experiments), you may be able to leverage the difference in outcomes between the experiments to estimate causal effects. If you have only an observational dataset, as is the standard for many vanilla machine learning tasks, finding causal relationships is extremely difficult.
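As a minimal sketch of the interventional case, assume a hypothetical randomized experiment in which we decide by coin flip which units receive a treatment. Under that assumption the difference in average outcomes between the two groups estimates the causal effect. The data, the effect size and the function names below are all made up for illustration, and the same calculation on purely observational data would generally not have a causal interpretation.

```python
import random
from statistics import mean


def simulate_experiment(n=2000, true_effect=2.0, seed=0):
    """Synthetic randomized experiment: treatment is assigned by coin flip."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        treated = rng.random() < 0.5      # the intervention, set by the experimenter
        baseline = rng.gauss(10.0, 1.0)   # unit-level variation
        outcome = baseline + (true_effect if treated else 0.0)
        rows.append((treated, outcome))
    return rows


def difference_in_means(rows):
    """Average outcome of treated units minus average outcome of controls."""
    treated = [outcome for is_treated, outcome in rows if is_treated]
    control = [outcome for is_treated, outcome in rows if not is_treated]
    return mean(treated) - mean(control)


estimate = difference_in_means(simulate_experiment())
print(f"estimated effect: {estimate:.2f}")
```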

Neural Network: Convert HTML Table into JSON data

I'm kinda new to Neural Networks and just started to learn coding them by trying some examples.
Two weeks ago I was searching for an interesting challenge and I found one. But I'm about to give up because it seems to be too hard for me... But I was curious to know if anyone of you is able to solve this?
The Problem: Assume there are ".htm"-files that contain tables about the same topic. But the table structure isn't the same for every file. For example: We have a lot of ".htm"-files containing information about teacher substitutions per day per school. Because the structure of those ".htm"-files isn't the same for every file, it would be hard to program a parser that could extract the data from those tables. So my thought was that this is a task for a Neural Network.
First Question: Is it a task a Neural Network can/should handle, or am I mistaken about that?
Because a Neural Network seemed to me to fit this kind of challenge, I tried to think of an input. I came up with two options:
First Input Option: Take the HTML Code (only from the body-tag) as string and convert it as Tensor
Second Input Option: Convert the HTML Tables into Images (via Canvas maybe) and feed this input to the DNN through Conv2D-Layers.
Second Question: Are those Options any good? Do you have any better solution to this?
After that I wanted to figure out how I would make a DNN output this heavily dynamic data for me. My thought was to convert my desired JSON output into Tensors and feed them to the DNN while training, and for every prediction I would expect the DNN to return a Tensor that is convertible into a JSON output...
Third Question: Is it even possible to get such a detailed Output from a DNN? And if Yes: Do you think the Output would be suitable for this task?
Last Question: Assuming all my assumptions are correct - wouldn't training this DNN take forever? Let's say you have an RTX 2080 Ti for it. What would you guess?
I guess that's it. I hope I can learn a lot from you guys!
(I'm sorry about my bad English - it's not my native language)
Addition:
Here is a more in-depth example. Let's say we have a ".htm"-file that looks like this:
The task would be to get all the relevant information from this table. For example:
All Students from Class "9c" don't have lessons in their 6th hour due to cancellation.
1) This is not a particularly suitable problem for a Neural Network, as your domain is structured data with clear dependencies inside. Tree-based ML algorithms tend to show much better results on such problems.
2) Both of your choices of input are very unstructured. Learning from such data would be nearly impossible. There are clear ways to give the model more knowledge. For example, you have the same data in different formats; only the structure differs. This means the model needs to learn a mapping from one structure to another; it doesn't need to know the data itself. Hence, words can be tokenized into unique identifiers to remove unnecessary information. The .htm data can be parsed into a tree, as can the JSON. Then there are different ways to represent graph structures that can be used in an ML model.
3) It seems that the only adequate option for the output is a sequence of identifiers pointing to unique entities from the text. The whole problem is then similar to Seq2Seq, which is best solved by RNNs with an encoder-decoder architecture.
I believe that, if there is enough data and the .htm files don't have a huge amount of noise, the task can be completed. Training time depends hugely on the selected model and its complexity, as well as on the diversity of the initial data.
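The preprocessing suggested in point 2 can be sketched with the standard library alone. The snippet below is an illustration, not a full solution: the HTML sample is made up (the table from the question is not reproduced here), `TableParser` and `tokenize` are hypothetical helper names, and nested tables or merged cells are not handled. It parses the table into rows and then replaces each distinct cell text with an integer identifier.

```python
import json
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def tokenize(rows):
    """Replace each distinct cell text with an integer identifier."""
    vocab = {}
    ids = [[vocab.setdefault(cell, len(vocab)) for cell in row] for row in rows]
    return ids, vocab


# Made-up sample input for illustration only.
html = """
<table>
  <tr><th>Class</th><th>Hour</th><th>Status</th></tr>
  <tr><td>9c</td><td>6</td><td>cancelled</td></tr>
  <tr><td>8a</td><td>2</td><td>cancelled</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
token_ids, vocab = tokenize(parser.rows)
print(json.dumps(records))
print(token_ids)
```

When the files share enough structure for a parser like this to work, the JSON falls out directly; the identifier sequences are what a learned structure-to-structure mapping would consume otherwise.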

idea behind xgboost/lightgbm/catboost in comparison

I'm trying to decide which of the following to use in practice for regression tasks: xgboost, lightgbm or catboost (python 3).
So, what is the general idea behind each of them? Why should I choose one and not another?
I'm not interested in very slight differences in the accuracy score like 0.781 vs 0.782. The results should be tenable, and my tool should be robust and convenient to use. The workhorse.
As I understand these methods, they differ only in how they are implemented; otherwise they all implement gradient boosting machine (GBM) methods.
So you should just try to do some hyperparameter tuning.
Also, it's a good idea to read this paper:
catboost-vs-light-gbm-vs-xgboost
You cannot determine a priori which tree algorithm (or any algorithm) will be the best. This is because of the no free lunch theorem: https://en.wikipedia.org/wiki/No_free_lunch_theorem
It's best to try them all out. You should also throw in Random Forest (RF) as another one to try.
I will say that http://CatBoost.ai (CB) does have one advantage over the others: if you have Categorical Variables, CB will most likely beat the others because it can handle categorical variables directly without One-Hot-Encoding.
You might try the grid search from http://H2O.ai, which supports several algorithms (RF, XGBoost, GBM, Linear Regression) with hyperparameter tuning, to see which one works best. You can run this overnight. (CB is not included in H2O's grid search.)
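"Trying them all out" usually means scoring every candidate on the same cross-validation folds and comparing the averages. The sketch below shows that loop with no third-party dependencies. `MeanRegressor` and `NearestNeighbourRegressor` are toy stand-ins, not the libraries discussed above, and the data is synthetic. The assumption is that real candidates expose the same `fit`/`predict` pair, in which case they could be dropped into `candidates` unchanged.

```python
import random
from statistics import mean


class MeanRegressor:
    """Trivial reference model: always predicts the training mean."""

    def fit(self, X, y):
        self.value = mean(y)
        return self

    def predict(self, X):
        return [self.value] * len(X)


class NearestNeighbourRegressor:
    """1-nearest-neighbour regressor, used here only as a stand-in candidate."""

    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, X):
        def closest(row):
            i = min(range(len(self.X)),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(row, self.X[j])))
            return self.y[i]
        return [closest(row) for row in X]


def k_fold_indices(n, k=5, seed=0):
    """Shuffle the row indices and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_rmse(make_model, X, y, folds):
    """Mean RMSE over the folds; each fold is held out exactly once."""
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = make_model().fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = model.predict([X[i] for i in test_idx])
        mse = mean((p - y[i]) ** 2 for p, i in zip(preds, test_idx))
        scores.append(mse ** 0.5)
    return mean(scores)


# Synthetic regression data: y is roughly 3 * x plus noise.
rng = random.Random(42)
X = [[rng.uniform(0, 10)] for _ in range(150)]
y = [3.0 * row[0] + rng.gauss(0, 0.5) for row in X]

folds = k_fold_indices(len(X))
candidates = {
    "mean baseline": MeanRegressor,
    "1-NN stand-in": NearestNeighbourRegressor,
}
results = {name: cross_val_rmse(factory, X, y, folds) for name, factory in candidates.items()}
best = min(results, key=results.get)
print(results, "->", best)
```

Using the same `folds` for every candidate keeps the comparison fair, which matters more than small differences in the final score.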

How to make testing data manually for clustering of citation records?

I'm doing research on the author name disambiguation problem. I want to run some experiments. I want to perform clustering on citation records. My dataset consists of 2000 XML records. I need testing data. The dataset that I'm using is not popular, so I need to make testing data manually, and I don't know how to do so. I need instructions on how to make testing data manually. Note: I want to compare the performance of a set of techniques in solving the author name disambiguation problem, so I must perform testing.
It is not really clear what kind of testing you want to perform, but the general answer to the issue at hand (artificially creating more data from the data you have) is the bootstrap. This is a technique in which you sample with replacement from your dataset as many times as you want. It repeatedly picks a random element from your data until you get a sample of the size you want. The sample can be larger than your original dataset, but it should have similar statistical properties to the original. Bootstrap sampling is available in sklearn.
P.S. You need to keep in mind that this solution is not optimal - the best solution to this problem is to actually get more real data somehow.
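A minimal sketch of sampling with replacement, using only the standard library; `bootstrap_sample` is a hypothetical helper name and the records are placeholders standing in for the 2000 XML records. In scikit-learn the corresponding utility is `sklearn.utils.resample` with `replace=True`.

```python
import random


def bootstrap_sample(data, size=None, seed=None):
    """Draw `size` items from `data` with replacement (default: same size as data)."""
    rng = random.Random(seed)
    size = len(data) if size is None else size
    return rng.choices(data, k=size)


# Placeholder records for illustration only.
records = [f"record-{i}" for i in range(2000)]
sample = bootstrap_sample(records, size=3000, seed=0)
print(len(sample), len(set(sample)))
```

Because the draw is with replacement, the sample can be larger than the original data, but it contains no records that were not already there.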
Classification vs. Clustering
For author name disambiguation, I don't think you want clustering. What you want is classification.
You have features for each author / publication. Now you give the classifier two of those feature vectors. It classifies them as "it is the same author" or "those are different authors".
Training / testing data
With a binary classification problem, the testing suddenly becomes simple: just use one of the measures so often used in the literature (accuracy, precision, recall, confusion matrix).
Getting the data might be a bit more complicated. You wrote that you have an XML file of 2000 records. I guess you can derive features from those records automatically and the authors have an identifier? Then you can simply generate negative examples by pairing records from different authors and positive examples by pairing records whose identifier is the same.
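For reference, these measures can all be computed from the four confusion-matrix counts. The sketch below uses made-up labels, where `True` means "same author"; `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and the usual measures derived from them."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    return {
        # Rows are the true class, columns the predicted class.
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels: True means "same author".
y_true = [True, True, True, False, False, False, False, True]
y_pred = [True, True, False, False, False, True, False, True]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```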
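Assuming each record does carry an author identifier, the labelled pairs can be generated mechanically. The field names (`id`, `author_id`, `name`) and the four records below are invented for illustration; the real XML schema may differ.

```python
from itertools import combinations

# Invented records; in practice these would be parsed from the XML file.
records = [
    {"id": "r1", "author_id": "A1", "name": "J. Smith"},
    {"id": "r2", "author_id": "A1", "name": "John Smith"},
    {"id": "r3", "author_id": "A2", "name": "J. Smith"},
    {"id": "r4", "author_id": "A3", "name": "Jane Doe"},
]


def labelled_pairs(records):
    """Every unordered pair of records, labelled True if the author ids match."""
    pairs = []
    for a, b in combinations(records, 2):
        same_author = a["author_id"] == b["author_id"]
        pairs.append((a["id"], b["id"], same_author))
    return pairs


pairs = labelled_pairs(records)
positives = [p for p in pairs if p[2]]
negatives = [p for p in pairs if not p[2]]
print(len(positives), "positive and", len(negatives), "negative pairs")
```

Negative pairs will usually far outnumber positive ones, so it may be worth subsampling them before training.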
Otherwise you can have a look at http://dblp.uni-trier.de/. Although it likely still lists some publications under one author that actually belong to different people, it distinguishes authors not only by name but also by identifier.
Alternatively, you can train a classifier to classify each of the known authors with e.g. > 30 publications. Then remove the softmax layer and use those features to distinguish the authors.

Suitability of Naive Bayes classifier in Mahout to classifying websites

I'm currently working on a project that requires a database categorising websites (e.g. cnn.com = news). We only require broad classifications - we don't need every single URL classified individually. We're talking to the usual vendors of such databases, but most quotes we've had back are quite expensive and often they impose annoying requirements - like having to use their SDKs to query the database.
In the meantime, I've also been exploring the possibility of building such a database myself. I realise that this is not a 5 minute job, so I'm doing plenty of research.
From reading various papers on the subject, it seems a Naive Bayes classifier is generally the standard approach for doing this. However, many of the papers suggest enhancements to improve its accuracy in web classification - typically by making use of other contextual information, such as hyperlinks, header tags, multi-word phrases, the URL, word frequency and so on.
I've been experimenting with Mahout's Naive Bayes classifier against the 20 Newsgroup test dataset, and I can see its applicability to website classification, but I'm concerned about its accuracy for my use case.
Is anyone aware of the feasibility of extending the Bayes classifier in Mahout to take into account additional attributes? Any pointers as to where to start would be much appreciated.
Alternatively, if I'm barking up entirely the wrong tree please let me know!
You can control the input about as much as you'd like. In the end the input is just a feature vector. The feature vector's features can be words, or bigrams -- but they can also be whatever you want. So, yes, you can inject new features by modifying the input as you like.
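To make this concrete, here is a small sketch of a feature vector that mixes words and bigrams from the page body with tokens from the title and the URL. It is plain Python, not Mahout's API; the function names and the prefix scheme (`word=`, `bigram=`, `title=`, `url=`) are assumptions made up for this example, and the page content is invented.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def page_features(url, title, body):
    """Bag of prefixed features; the prefix keeps the sources apart."""
    words = tokens(body)
    features = Counter(f"word={w}" for w in words)
    features.update(f"bigram={a}_{b}" for a, b in zip(words, words[1:]))
    features.update(f"title={w}" for w in tokens(title))
    features.update(f"url={w}" for w in tokens(url))
    return features


features = page_features(
    url="http://cnn.com/world",
    title="Breaking News",
    body="Latest world news and breaking news",
)
print(features.most_common(5))
```

Prefixing each feature with its source lets the classifier weigh a word in the title differently from the same word in the body.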
How best to weave in those features is another topic entirely -- there's not one best way to convert them to numbers. Mahout in Action covers this reasonably well FWIW.