How to make testing data manually for clustering of citation records?

I'm doing research on the author name disambiguation problem and want to run some experiments that cluster citation records. My dataset consists of 2000 XML records. The dataset I'm using is not popular, so I need to create the testing data manually, and I don't know how to do that. I need instructions on how to make testing data manually. Note: I want to compare the performance of a set of techniques for solving the author name disambiguation problem, so I must perform testing.

It is not really clear what kind of testing you want to perform, but the general answer to the issue at hand, artificially creating more data from the data you have, is the bootstrap. It is a technique in which you sample with replacement from your dataset as many times as you want: it repeatedly picks a random element from your data until you have a sample of the size you want. The sample can be larger than your original dataset but should have statistical properties similar to those of the original dataset. Bootstrap sampling is available in sklearn (for example sklearn.utils.resample).
P.S. Keep in mind that this solution is not optimal; the best solution to this problem is to actually get more real data somehow.
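A minimal sketch of such a bootstrap draw, assuming scikit-learn is installed. The `records` array is a hypothetical stand-in for the 2000 citation records (here just their indices), and the sample size of 3000 is arbitrary.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical stand-in for the 2000 citation records (here just record indices).
records = np.arange(2000)

# Sampling with replacement; the sample may be larger than the original dataset.
boot = resample(records, replace=True, n_samples=3000, random_state=0)

n_drawn = len(boot)
n_distinct = len(np.unique(boot))  # some records repeat, others are never drawn
print(n_drawn, n_distinct)
```

Because every drawn element is a copy of an existing record, the bootstrap sample contains no new authors or name variants, which is the limitation the P.S. points at.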

Classification vs. Clustering
For author name disambiguation, I don't think you want clustering. What you want is classification.
You have a feature vector for each author / publication. You give the classifier two of those feature vectors, and it classifies the pair as "it is the same author" or "those are different authors".
Training / testing data
With a binary classification problem, testing suddenly becomes simple: just use one of the measures used so often in the literature (accuracy, precision, recall, confusion matrix).
Getting the data might be a bit more complicated. You wrote that you have an XML file of 2000 records. I guess you can derive features from those records automatically and the authors have an identifier? Then you can simply generate positive examples from pairs of records whose author identifier is the same, and negative examples from pairs whose identifiers differ.
Otherwise you can have a look at Although there are likely many publications under the same author which should be different, they do distinguish authors not only by name but give them identifiers.
Alternatively, you can train a classifier to classify each of the known authors with e.g. > 30 publications. Then remove the softmax layer and use those features to distinguish the authors.
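The pair-labelling idea (same identifier gives a positive example, different identifiers a negative one) could be sketched as follows. The record tuples and the identifiers `A1`, `A2`, `A3` are made up for illustration; real records would come from the XML file and carry derived feature vectors instead of bare names.

```python
from itertools import combinations

# Hypothetical records: (record_id, author_identifier, author_name_as_printed).
records = [
    ("r1", "A1", "J. Smith"),
    ("r2", "A1", "John Smith"),
    ("r3", "A2", "J. Smith"),
    ("r4", "A3", "Jane Smith"),
]

def make_pairs(records):
    """Label every pair of records: 1 if the author identifier matches, else 0."""
    pairs = []
    for (id_a, author_a, _), (id_b, author_b, _) in combinations(records, 2):
        pairs.append((id_a, id_b, int(author_a == author_b)))
    return pairs

pairs = make_pairs(records)
positives = [p for p in pairs if p[2] == 1]
negatives = [p for p in pairs if p[2] == 0]
print(len(positives), len(negatives))
```

Negative pairs usually far outnumber positive ones, so it may be worth subsampling the negatives or reporting precision and recall rather than accuracy alone.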


Using PCA on Part of Dataframe

I want to apply a clustering algorithm to a dataframe that contains a lot of features (32 columns).
Some of the features are one-hot encoded.
I want to use PCA (principal component analysis) to reduce the dimensionality and make the machine learning process easier.
Is it possible to apply PCA to only some columns of the dataframe, keep the other columns as they are, and then fit the machine learning model?
Or is it obligatory to apply PCA to the whole dataframe before clustering?
I guess there should be no issue with doing what you describe.
What this does, effectively, is merge some of the objects' features into fewer ones and then use the other, non-merged features alongside the merged ones. I don't know what effect that would have on the outcome; it might be good to compute correlations to see whether the unmerged features add anything to the PCA-merged ones. You might find that they basically duplicate what is there already.
Since clustering is an exploratory method, you can basically do whatever you want. It is of course advisable to have a reason for doing so, as it otherwise ends up as simply trial-and-error, and if you find a result, you won't be able to describe why you got there. It is possible (or even likely for some data sets) that there are multiple ways to cluster them, so you should make decisions based on what you know about the data already, so they can be justified in those terms.
Running random trial-and-error clustering until you find a structure makes it a bit difficult to come up with a good explanation why that structure is valid.
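One way to apply PCA to only part of a dataframe is scikit-learn's ColumnTransformer with remainder="passthrough". The sketch below assumes the numeric columns are the ones to be reduced and the one-hot columns are kept as they are; the column names, the synthetic data, and the choice of KMeans with three clusters are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 6 numeric columns to reduce, 2 one-hot columns to keep.
numeric_cols = [f"num_{i}" for i in range(6)]
df = pd.DataFrame(rng.normal(size=(100, 6)), columns=numeric_cols)
df["cat_a"] = rng.integers(0, 2, size=100)
df["cat_b"] = 1 - df["cat_a"]

# Scale and reduce only the numeric columns; pass the other columns through untouched.
reduce_numeric = Pipeline([("scale", StandardScaler()), ("pca", PCA(n_components=2))])
preprocess = ColumnTransformer(
    [("pca_part", reduce_numeric, numeric_cols)],
    remainder="passthrough",
)

# Output columns: 2 principal components followed by the 2 untouched columns.
features = preprocess.fit_transform(df)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
print(features.shape)
```

Note that the principal components and the passed-through columns end up on different scales, which affects distance-based clustering; whether to rescale them is one of the decisions that should be justified from what is known about the data.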

Box-Cox transformation with tree-based models (XGBoost to be specific)

I have a question regarding the Box-Cox transformation (or log transformation). I am working on a dataset that has lots of skewed features. When I apply the Box-Cox transformation, I get quite a nice distribution, but the correlation with the target decreases. If I were working with linear models, I would just use the correlation to decide whether to transform the feature or not. But as I mentioned, I am working with tree-based models, so should I transform the feature to get a more dispersed distribution, or leave the feature as it is to avoid a decrease in correlation?
I have added a screenshot of the distribution and its relationship with the target variable, for both the transformed and the untransformed feature (the left 2 plots show the original feature and the target).
PS: Guessing from the plots, it seems to me that if I transform the feature, it will be easier for the tree to find a split for this particular feature.
Thanks a lot,
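One way to check the guess in the PS empirically is to fit the same tree on the raw and on the Box-Cox-transformed feature and compare. The sketch below is only an illustration under several assumptions: the data are synthetic, a single scikit-learn DecisionTreeRegressor stands in for XGBoost, and there is only one feature. Since Box-Cox is a monotone increasing map, it preserves the ordering of the values, and the ordering is what an exact split search on one feature depends on.

```python
import numpy as np
from scipy.stats import boxcox
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical skewed, strictly positive feature and a target that depends on it.
x = rng.lognormal(mean=0.0, sigma=1.0, size=200)
y = 3.0 * x + rng.normal(scale=1.0, size=200)

# Box-Cox with lambda estimated by maximum likelihood (monotone increasing map).
x_bc, lam = boxcox(x)

# Pearson correlation with the target before and after the transformation.
corr_raw = np.corrcoef(x, y)[0, 1]
corr_bc = np.corrcoef(x_bc, y)[0, 1]

# Fit the same tree on the raw and on the transformed feature.
tree_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(x.reshape(-1, 1), y)
tree_bc = DecisionTreeRegressor(max_depth=3, random_state=0).fit(x_bc.reshape(-1, 1), y)

pred_raw = tree_raw.predict(x.reshape(-1, 1))
pred_bc = tree_bc.predict(x_bc.reshape(-1, 1))

print(f"Pearson r raw: {corr_raw:.3f}, after Box-Cox: {corr_bc:.3f}")
print("same training predictions:", np.allclose(pred_raw, pred_bc))
```

In this toy setup the Pearson correlation changes while the tree's training predictions should not, which suggests that correlation is not a useful criterion for deciding on monotone transforms for tree models. Whether that carries over to a particular XGBoost configuration and dataset would need to be checked the same way.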

Data types and how rapidminer emphasizes them

Fairly new to rapidminer and data science.
I imported data (it's very wide, so it took a while to classify all of the data types). I put the data through random forest and it appears to have emphasized the wrong things. I believe this is due to incorrect data type classification. I can't seem to find good data type documentation and am looking for an explanation of how rapidminer looks at each.
For example, I have some columns with 90% blanks and only a few values filled in. I labeled these as "nominal" and RapidMiner weighted them heavily. I wanted it to weigh the date columns more, since I'm trying to predict cycle time. Any help or insight is very much appreciated!
Some of the data types available are:
I'm not 100% sure I understood your question correctly, but neither RapidMiner nor the Random Forest algorithm emphasizes one data type over another.
So if the algorithm puts more importance on the nominal columns, it is because they strongly separate your examples.
The different data types in RapidMiner are there to allow or disallow certain operations.
A classic example is phone numbers. If they are stored as real numbers, you could compute something like a square root or an average, which does not make sense. So storing them as String (or Nominal) makes more sense.
If you want to exclude certain attributes, you could try a feature selection or dimensionality reduction method (like PCA or the Remove Correlated and Remove Useless operators).
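The point that importance follows how strongly a column separates the examples, not its declared type, can be illustrated outside RapidMiner. The sketch below uses scikit-learn's RandomForestRegressor as a stand-in (it is not RapidMiner's implementation), and the columns and target are synthetic assumptions: a flag that is filled in for about 10% of the rows, a pure noise column, and a cycle time that jumps whenever the flag is set.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 1000

# Hypothetical columns: a mostly blank flag (about 10% filled in) and a noise column.
mostly_blank = (rng.random(n) < 0.1).astype(float)
noise = rng.normal(size=n)

# Hypothetical target: cycle time jumps whenever the flag is filled in.
cycle_time = 10.0 + 20.0 * mostly_blank + rng.normal(scale=1.0, size=n)

X = np.column_stack([mostly_blank, noise])
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, cycle_time)

# Impurity-based importances, one per input column, summing to 1.
importance_blank, importance_noise = forest.feature_importances_
print(importance_blank, importance_noise)
```

Here the mostly blank column dominates simply because knowing whether it is filled in explains most of the variation in the target.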
Also feel free to ask further, or re-post, questions in the RapidMiner community forum.

Neural Network: Convert HTML Table into JSON data

I'm kinda new to Neural Networks and just started to learn coding them by trying some examples.
Two weeks ago I was searching for an interesting challenge and I found one. But I'm about to give up because it seems to be too hard for me... But I was curious to know if anyone of you is able to solve this?
The Problem: Assume there are ".htm"-files that contain tables about the same topic, but the table structure isn't the same for every file. For example: We have a lot of ".htm"-files containing information about teacher substitutions per day per school. Because the structure of those ".htm"-files isn't the same for every file, it would be hard to program a parser that could extract the data from those tables. So my thought was that this is a task for a Neural Network.
First Question: Is it a task a Neural Network can/should handle or am I mistaken by that?
Because a Neural Network seemed to fit this kind of challenge, I tried to think of an input. I came up with two options:
First Input Option: Take the HTML Code (only from the body-tag) as string and convert it as Tensor
Second Input Option: Convert the HTML Tables into Images (via Canvas maybe) and feed this input to the DNN through Conv2D-Layers.
Second Question: Are those Options any good? Do you have any better solution to this?
After that I wanted to figure out how I would make a DNN output this heavily dynamic data for me. My thought was to convert my desired JSON-Output into Tensors and feed them to the DNN while training, and for every prediction I would expect the DNN to return a Tensor that is convertible into a JSON-Output...
Third Question: Is it even possible to get such a detailed Output from a DNN? And if Yes: Do you think the Output would be suitable for this task?
Last Question: Assuming all my assumptions are correct - Wouldn't training this DNN take for ever? Let's say you have a RTX 2080 ti for it. What would you guess?
I guess that's it. I hope i can learn a lot from you guys!
(I'm sorry about my bad English - it's not my native language)
Here is a more in-depth example. Let's say we have a ".htm"-file that looks like this:
The task would be to get all the relevant information from this table. For example:
All Students from Class "9c" don't have lessons in their 6th hour due to cancellation.
1) This is not a particularly suitable problem for a Neural Network, as your domain is structured data with clear dependencies inside. Tree-based ML algorithms tend to show much better results on such problems.
2) Both of your input choices are very unstructured. Learning from such data would be nearly impossible. There are clear ways to give more knowledge to the model. For example, you have the same data in different formats; the only difference is the structure. This means that a model needs to learn a mapping from one structure to another; it doesn't need to know any of the data. Hence, words can be tokenized with unique identifiers to remove unnecessary information. The htm data can be parsed into a tree, as can the JSON. Then there are different ways to represent graph structures that can be used in an ML model.
3) It seems that the only adequate option for the output is a sequence of identifiers pointing to unique entities from the text. The whole problem is then similar to Seq2Seq, best solved by RNNs with an encoder-decoder architecture.
I believe that, if there is enough data and the htm files don't have a huge amount of noise, the task can be completed. Training time depends heavily on the selected model and its complexity, as well as on the diversity of the initial data.
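The preprocessing suggested in point 2 (parse the table structure, then replace cell text with identifiers) could look roughly like this. It is a minimal sketch using only Python's standard library; the `TableRows` class and the sample table are hypothetical, and real substitution files will be messier.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Hypothetical substitution table; real files will differ in structure.
html = """
<table>
  <tr><th>Class</th><th>Hour</th><th>Note</th></tr>
  <tr><td>9c</td><td>6</td><td>cancelled</td></tr>
</table>
"""

parser = TableRows()
parser.feed(html)

# Replace each distinct cell text with an integer identifier.
vocab = {}
token_rows = [[vocab.setdefault(cell, len(vocab)) for cell in row] for row in parser.rows]
print(parser.rows)
print(token_rows)
```

The identifier rows keep the structure but hide the actual words, so a model trained on them only has to learn where each piece of content belongs in the target JSON.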

How to encode inputs like artist or actor

I am currently developing a neural network that tries to make a suggestion for a specific user based on his recent activities. I will try to illustrate my problem with an example.
Now, let's say I'm trying to suggest new music to a user based on the music he recently listened to. Since people often listen to artists they know, one input of such a neural network might be the artists he recently listened to.
The problem is the encoding of this feature. As the id of the artist in the database has no meaning for the neural network, the only other option that comes to my mind would be one-hot encoding every artist, but that doesn't sound too promising either, given the thousands of different artists out there.
My question is: How can I encode such a feature?
The approach you describe is called content-based filtering. The intuition is to recommend to customer A items that are similar to items A liked before. An advantage of this approach is that you only need data about one user, which tends to result in a "personalized" approach to recommendation. Its disadvantages include the construction of features (the problem you're dealing with now) and the difficulty of building an interesting profile for new users; it will also never recommend items outside a user's content profile. As for the difficulty of representation, features are usually handcrafted and abstracted afterwards. For music specifically, features would be things like 'artist', 'genre', etc., and abstraction of informative keywords (if necessary) is widely done using tf-idf.
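As an illustration of the tf-idf step, the sketch below turns keyword descriptions into vectors and compares them with cosine similarity, assuming scikit-learn. The artist names and keywords are invented; in practice they would come from handcrafted features such as genre tags.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical keyword descriptions for three artists.
artists = ["artist_a", "artist_b", "artist_c"]
keywords = [
    "indie rock guitar",
    "indie rock piano",
    "techno electronic synth",
]

# One tf-idf vector per artist; rare keywords get more weight than common ones.
vectors = TfidfVectorizer().fit_transform(keywords)

# Pairwise cosine similarity between the artist vectors.
similarity = cosine_similarity(vectors)
print(similarity.round(2))
```

Artists that share keywords end up close to each other, so a user who listened to artist_a would be pointed towards artist_b rather than artist_c.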
This may go outside the scope of the question, but I think it is also worth mentioning an alternative approach to this: collaborative filtering. Rather than similar items, here we instead try to find users with similar tastes and recommend products that they liked. The only data you need here are some sort of user ratings or values of how much they (dis)liked some data - eliminating the need for feature design. Furthermore, since we analyze similar persons rather than items for recommendation, this approach tends to also work well for new users. The general flow for collaborative filtering looks like:
1. Measure similarity between user of interest and all other users
2. (optional) Select a smaller subset consisting of most similar users
3. Predict ratings as a weighted combination of "nearest neighbors"
4. Return the highest rated items
A popular approach for the similarity weighting in the algorithm is based on the Pearson correlation coefficient.
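A minimal sketch of this flow with Pearson-weighted neighbours, using only NumPy. The rating matrix is invented, and the mean-centred weighted average in `predict` is one common variant chosen here as an assumption, not the only way to combine neighbour ratings.

```python
import numpy as np

# Hypothetical ratings (rows = users, columns = items); np.nan marks "not rated".
ratings = np.array([
    [5.0, 4.0, 1.0, np.nan],   # user of interest: the last item is unrated
    [5.0, 5.0, 1.0, 5.0],
    [4.0, 5.0, 2.0, 4.0],
    [1.0, 2.0, 5.0, 1.0],
])

def pearson(a, b):
    """Pearson correlation over the items both users have rated."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if both.sum() < 2:
        return 0.0
    a, b = a[both], b[both]
    if a.std() == 0 or b.std() == 0:
        return 0.0
    return float(np.corrcoef(a, b)[0, 1])

def predict(ratings, user, item, k=2):
    """Predict one rating from the k most similar users who rated the item."""
    others = [u for u in range(len(ratings))
              if u != user and not np.isnan(ratings[u, item])]
    # Similarity between the user of interest and every candidate neighbour.
    sims = {u: pearson(ratings[user], ratings[u]) for u in others}
    # Keep the k most similar users, dropping non-positive similarities.
    neighbours = sorted(sims, key=sims.get, reverse=True)[:k]
    neighbours = [u for u in neighbours if sims[u] > 0]
    user_mean = float(np.nanmean(ratings[user]))
    if not neighbours:
        return user_mean
    weights = np.array([sims[u] for u in neighbours])
    # Similarity-weighted average of the neighbours' mean-centred ratings.
    deviations = np.array([ratings[u, item] - np.nanmean(ratings[u]) for u in neighbours])
    return user_mean + float(weights @ deviations / weights.sum())

# Rank the items the user has not rated yet by predicted rating.
unrated = [i for i in range(ratings.shape[1]) if np.isnan(ratings[0, i])]
ranked = sorted(unrated, key=lambda i: predict(ratings, 0, i), reverse=True)
predicted = predict(ratings, user=0, item=3)
print(ranked, round(predicted, 2))
```

The dictionary of similarities is rebuilt for every prediction here, which is fine for a toy matrix but is exactly the pairwise cost that becomes a problem at scale.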
Finally, something to consider here is the need for performance/scalability: calculating pairwise similarities for millions of users is not really light-weight on a normal computer.