How to visualize (plot) regression output against a categorical input variable? [closed] - matplotlib

I am doing linear regression with multiple variables. My data has n = 143 features and m = 13000 training examples. Some of my features are continuous (ordinal) variables (area, year, number of rooms), but I also have categorical variables (district, color, type). So far I have visualized some of my features against the predicted price. For example, here is the plot of area against predicted price:
Since area is a continuous ordinal variable, I had no trouble visualizing the data. But now I want to visualize how the predicted price depends on my categorical variables (such as district).
For categorical variables I used one-hot (dummy) encoding.
For example, this kind of data:
turned into this format:
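(The original before/after tables are images and are not reproduced here.) For illustration, a minimal pandas sketch of this kind of one-hot transformation, with hypothetical column names:
import pandas as pd

# Hypothetical sample with a single categorical column
df = pd.DataFrame({"district": ["DistrictA", "DistrictB", "DistrictA"]})
print(pd.get_dummies(df, columns=["district"]))  # one 0/1 column per district value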
If I were using ordinal encoding for districts this way:
DistrictA - 1
DistrictB - 2
DistrictC - 3
DistrictD - 4
DistrictE - 5
I could plot these values against predicted price pretty easily by putting 1-5 on the X axis and price on the Y axis.
But I used dummy coding, and now I do not know how to show (visualize) the dependency between price and the categorical variable 'District', which is represented as a series of zeros and ones.
How can I make a plot showing the regression output for districts against predicted price when using dummy coding?

If you just want to know how much the different districts influence your prediction, you can look at the trained coefficients directly: a high theta for a district's dummy column indicates that that district increases the price.
If you want to plot this, one possible way is to make a scatter plot with the x coordinate depending on which district is set.
Something like this (untested):
import matplotlib.pyplot as plt

plt.scatter([0] * (data["DistrictA"] == 1).sum(), predict(data[data["DistrictA"] == 1]))
plt.scatter([1] * (data["DistrictB"] == 1).sum(), predict(data[data["DistrictB"] == 1]))
And so on.
(Note that the x coordinate must be repeated to match the length of the filtered data, as above.)
It looks even better if you add a slight random perturbation (jitter) to the x coordinates.
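A minimal sketch of the full jittered plot, assuming data is a pandas DataFrame with one dummy column per district and predict is your trained model's prediction function (both names come from the question, not from a specific library):
import numpy as np
import matplotlib.pyplot as plt

districts = ["DistrictA", "DistrictB", "DistrictC", "DistrictD", "DistrictE"]
for i, col in enumerate(districts):
    subset = data[data[col] == 1]                        # rows belonging to this district
    jitter = np.random.uniform(-0.1, 0.1, len(subset))   # spread overlapping points
    plt.scatter(i + jitter, predict(subset), alpha=0.3)
plt.xticks(range(len(districts)), districts)
plt.ylabel("Predicted price")
plt.show()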

Related

Optimize integration region

I have an optimization problem: a 2-D square domain 0 < x < X and 0 < y < Y, where X and Y are the boundaries of the square, and a 2-D array Arr[x, y] over this region with both positive and negative values.
The task is to find the optimal region inside this square such that integrating the array over that region gives the maximum possible value.
Which direction should I look in? Which techniques exist for this type of problem?
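One observation worth noting (not from the original question): if the region may be an arbitrary subset of grid cells, the optimum is trivially the set of cells where Arr is positive, so any interesting version of the problem needs a shape constraint. For axis-aligned rectangles, this becomes the classic 2-D maximum-sum subarray problem, solvable with Kadane's algorithm. A sketch of the trivial unconstrained case:
import numpy as np

Arr = np.random.randn(50, 40)   # hypothetical example array
best_region = Arr > 0           # unconstrained optimum: keep every positive cell
best_value = Arr[best_region].sum()
print(best_value)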

Why does 'dimension' mean several different things in the machine-learning world? [closed]

I've noticed that the AI community refers to various tensors as 512-d, meaning 512-dimensional, where the term 'dimension' seems to mean the number of float values in the representation of a single data point; e.g., 512-d word embeddings means a length-512 vector of floats used to represent one English word, e.g. https://medium.com/@jonathan_hui/nlp-word-embedding-glove-5e7f523999f6
But those aren't 512 different dimensions; it's only a 1-dimensional vector, isn't it? Why is the term 'dimension' used so differently from its usual meaning?
With conv1d or conv2d, which are convolutions over 1 and 2 dimensions, 'dimension' is used in the typical math/science sense; but in the word-embedding context, a 1-d vector is said to be a 512-d vector. Or am I missing something?
Why is the term 'dimension' overloaded like this, and what context determines which meaning applies in machine learning?
In the context of word embeddings in neural networks, dimensionality reduction, and many other machine learning areas, it is indeed correct to call such a vector (which is, typically, a 1D array or tensor) n-dimensional, where n is usually greater than 2. This is because we usually work in Euclidean space, where a (data) point in an n-dimensional (Euclidean) space is represented as an n-tuple of real numbers (i.e. real n-space, ℝ^n).
Below is an example (see the reference at the end of this answer) of a (data) point in 3D (Euclidean) space. To represent any point in this space, say d1, we need a tuple of three real numbers (x1, y1, z1).
Now, your confusion is about why this point d1 is called 3-dimensional instead of a 1-dimensional array. The reason is that it lies (or lives) in this 3D space. The same argument extends to all points in any n-dimensional real space, as in the case of embeddings with 300d, 512d, 1024d vectors, etc.
However, in all nD-array compute frameworks such as NumPy, PyTorch, TensorFlow, etc., these are still 1D arrays, because the length of each of the above vectors can be represented by a single number.
But what if you have more than one data point? Then you have to stack them in some (unique) way, and this is where the need for a second dimension arises. So if you stack 4 of these 512d vectors vertically, you end up with a 2D array/tensor of shape (4, 512). Note that we call this array 2D because two integers are required to represent the extent along each axis.
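A minimal NumPy illustration of this distinction, with the numbers matching the example above:
import numpy as np

v = np.zeros(512)               # one "512-d" embedding: a rank-1 array of length 512
print(v.ndim, v.shape)          # 1 (512,)
batch = np.stack([v, v, v, v])  # stack 4 such vectors vertically
print(batch.ndim, batch.shape)  # 2 (4, 512)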
To understand this better, please refer to my other answer on axis-parameter visualization for nD arrays.
ref: Euclidean space wiki
It is not overloading, but standard usage. What are the elements of a 512-dimensional vector space? They are 512-dimensional vectors, each of which can be represented by 512 floating-point numbers, as in your equation. Each such vector spans a 1-dimensional subspace of the 512-dimensional space.
As for the dimension of a tensor: a tensor is, roughly speaking (omitting the duals), a linear map from the product of N vector spaces to the reals. The dimension of the TENSOR is that N.
If you want to be more specific, you need to be clear on the terms dimension, rank, and shape.
The dimensionality of a tensor means the rank, which has a specific definition: the rank is the number of indices. When you see "3-dimensional tensor", you can take that to mean that the tensor has 3 indices, namely T[i][j][k]. So a vector has rank 1, a matrix has rank 2, a cube has rank 3, etc.
When you want to specify the size of each dimension, you should prefer to use the term shape. A 3-dimensional (aka rank 3) tensor can have shape [10, 20, 30] if the 0th dimension has 10 values, the 1st dimension has 20 values, and the 2nd dimension has 30 values. (This shape might represent, say, a batch of 10 images, each of shape 20x30.)
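For illustration, a short NumPy sketch of rank versus shape (NumPy calls the rank ndim):
import numpy as np

T = np.zeros((10, 20, 30))  # e.g. a batch of 10 images, each of shape 20x30
print(T.ndim)               # 3 -> three indices are needed: T[i][j][k]
print(T.shape)              # (10, 20, 30) -> the extent along each axis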
Note, though, that when talking about vectors, it is common to say "512-D vector". As you mentioned, this terminology comes up a lot with word embeddings (e.g. "we used 512-D word embeddings"). Since "vector" by definition means rank 1, people will interpret that statement to mean "a structure of rank 1 with 512 values".
You might encounter someone saying "I have a 5-d vector", in which case you'd need to follow up with "wait, do you mean a 5-d tensor or a 1-d vector with 5 values?".
I am not a mathematician, by the way.

Neural network for AI playing Connect Four: how to encode inputs and outputs, and what kind of NN setup? [closed]

I'm trying to understand how to build and train a neural-network-based AI for games, and I'm struggling to get some details straight.
I'm not yet concerned with whether to use TensorFlow, something else, or build my own. First I'm trying to grasp some basics, such as what value ranges to use (e.g., inputs between -1 and 1, or between 0 and 1) and how to represent input game situations and output moves.
Suppose I'm building a neural network to play Connect Four. Given a game situation, the AI is to generate an 'optimal move'. Let's say the optimal move is the move with the highest probability of eventually winning the game, assuming a reasonably smart opponent.
I guess the input would be 7 columns * 6 rows = 42 input neurons, each containing a value that represents either my color, or the opponent's color, or empty.
How do I encode this? Does it make a difference whether I use:
0 = empty, 1 = my piece, 2 = opponent's piece
0 = empty, 0.5 = my piece, 1 = opponent's piece
0 = empty, 1 = my piece, -1 = opponent's piece
Or should I even use two times 42 = 84 input neurons, all binary/boolean, where 0=empty, 1s in the first 42 neurons represent my pieces, and 1s in the second 42 neurons represent the opponent's pieces?
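For what it's worth, the last option (two binary planes) is the encoding commonly used in board-game networks; a minimal sketch, with all names hypothetical:
import numpy as np

# Hypothetical board: 6 rows x 7 columns; 0 = empty, 1 = my piece, 2 = opponent's
board = np.zeros((6, 7), dtype=int)
board[5, 3] = 1
board[5, 4] = 2

mine = (board == 1).astype(np.float32).ravel()    # first 42 binary inputs
theirs = (board == 2).astype(np.float32).ravel()  # second 42 binary inputs
inputs = np.concatenate([mine, theirs])           # shape (84,)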
What does the output look like? 7 neurons that each represent one column, getting an output value in the 0..1 range, and the AI's move would be the one with the highest value? Or just one output value ranging from 0 to 1, where roughly each 1/7th of the interval represents a particular column?
Also, how should I design my neural network: how many hidden layers? Classification or regression? Sigmoid, ReLU, or tanh as the activation function?
Intuitively, based on my limited experience with neural networks, I would use 2 or 3 hidden layers, each with twice the number of input neurons. On the other considerations I have no idea; I would just take stabs in the dark and use trial and error.
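A minimal Keras sketch of the architecture described above (84 binary inputs, two hidden layers of twice that size, a 7-way softmax over columns); all hyperparameters here are just the guesses from the question, not recommendations:
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(84,)),
    tf.keras.layers.Dense(168, activation="relu"),
    tf.keras.layers.Dense(168, activation="relu"),
    tf.keras.layers.Dense(7, activation="softmax"),  # one probability per column
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()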
Any feedback or suggestions to get me in the right direction?
If you have not yet delved into AI gameplay, then might I suggest OpenAI Gym as a starting place?
I believe starting with that foundation will let you (1) start building and learning through active participation right away, and (2) answer a lot of the foundational 'what should I do?' questions you have now. Then you can come back with the 'how do I...' questions that we can really help you with here.
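For instance, a minimal random-agent loop in classic Gym (the reset/step signatures differ slightly in newer Gym/Gymnasium versions):
import gym

env = gym.make("CartPole-v1")   # a standard beginner environment
obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()          # replace with your policy
    obs, reward, done, info = env.step(action)  # classic 4-tuple API
env.close()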

How to customize a Deep Learning model if the output is one-hot vectors? [closed]

I am trying to build a deep learning model with TensorFlow and Keras. It is a sequential model for a Single-Instance Multi-Label task, which is a simplified version of Multi-Instance Multi-Label.
Concretely, the input of my model is an array of fixed length, so it can be represented as a fixed-length vector.
The output of my model is a sequence of letters from an alphabet of fixed size, for example the alphabet {A,B,C,D} with only 4 possible members. So I can use a one-hot vector to represent each letter in a sequence.
The length of the sequences is variable, but for simplicity I use a fixed length (equal to that of the longest sequence) to store all sequences.
If a sequence is shorter than the fixed length, it is represented by one-hot vectors (as many as the sequence's actual length) followed by zero vectors (filling the remaining length). For example, CADB is represented by a 4 * 5 matrix like this:

     pos1  pos2  pos3  pos4  pos5
A     0     1     0     0     0
B     0     0     0     1     0
C     1     0     0     0     0
D     0     0     1     0     0

Please note: the first 4 columns of this matrix are one-hot vectors, each of which has one and only one 1 entry, with all other entries 0.
The entries of the last column are all 0s; this can be seen as zero padding, because the sequence of letters is not long enough.
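A small sketch of this encoding (alphabet and padding length taken from the example above):
import numpy as np

alphabet = "ABCD"
def encode(seq, max_len=5):
    # rows = letters, columns = positions; trailing columns remain zero padding
    m = np.zeros((len(alphabet), max_len))
    for pos, letter in enumerate(seq):
        m[alphabet.index(letter), pos] = 1.0
    return m

print(encode("CADB"))  # the 4 * 5 matrix shown above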
In short, the input is a vector and the output is a matrix.
Unlike in the question linked above, the output matrix should be seen as a whole: one input vector is assigned to an entire matrix, not to a row or column of it.
My question is: how do I customize my deep learning model for this special output? For example:
What loss function and accuracy metric should I choose or design?
Do I need to customize a special layer at the very end of my model?
You should use a softmax activation on the output layer and categorical_crossentropy as the loss function.
However, as you can see in the links above, the problem is that these two functions are by default applied along the last axis (axis=-1), while in your situation it is the second-to-last axis (the columns of the matrix) that is one-hot encoded.
To use the right axis, one option is to define your own versions of these functions like so:
import tensorflow as tf

def softmax_columns(x):
    # softmax over the columns (second-to-last axis) instead of the default last axis
    return tf.keras.backend.softmax(x, axis=-2)

def categorical_crossentropy_columns(target, output):
    # crossentropy along the same column axis
    return tf.keras.backend.categorical_crossentropy(target, output, axis=-2)
Then, you can use these like so:
model.add(SomeLayer(..., activation=softmax_columns, ...)) # output layer
model.compile(loss=categorical_crossentropy_columns, ...)
One good alternative (in general, not only here) is to use from_logits=True in the categorical_crossentropy call. This effectively builds the softmax into the loss function, so that your model itself no longer needs (in fact, must not have) a final softmax activation. This not only saves work but is also more numerically stable.
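A sketch of that variant, reusing the column axis from above (the model's final layer should then output raw logits, with no activation):
def categorical_crossentropy_columns_from_logits(target, output):
    # softmax is folded into the loss, so the model outputs raw logits
    return tf.keras.backend.categorical_crossentropy(
        target, output, from_logits=True, axis=-2)

model.compile(optimizer="adam", loss=categorical_crossentropy_columns_from_logits)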

Feature scaling implemented in DataFrame modelling

I have a dataset with 15 columns, in the following scenario:
9 columns are categorical, so I converted the data with a one-hot encoder.
6 columns are numeric; 3 of those 6 contain outliers and span different value ranges, so I chose RobustScaler() as the scaling feature for them and StandardScaler for the others.
After that I combined all the data frames and applied the logistic regression algorithm, but my model produced a very low score, even though I got a good score without scaling.
Can anyone help with this?
Please apply column standardization to the data frame and see the output. I suspect that, since logistic regression is sensitive to outliers, that is where your problem lies.
Impute the outliers properly and then apply column standardization.
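For instance, one way to keep the per-column scaling consistent between training and scoring is an sklearn ColumnTransformer pipeline; a minimal sketch with hypothetical column names:
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler, StandardScaler

categorical_cols = ["cat_1", "cat_2"]   # the 9 categorical columns in practice
outlier_cols = ["num_1"]                # the 3 numeric columns with outliers
other_num_cols = ["num_2"]              # the remaining numeric columns

preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("robust", RobustScaler(), outlier_cols),
    ("standard", StandardScaler(), other_num_cols),
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
model.fit(X_train, y_train)             # X_train, y_train are hypothetical
print(model.score(X_test, y_test))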