Do I really need to one-hot the gender? - one-hot-encoding

I know the advantage of the one-hot method when there are many categories, but if the feature has only two categories (like gender, with only male and female), 'the distance from 1 to 0' is the same as 'the distance from 0 to 1'. In this case, do we really need one-hot encoding?
PS: My English is not good. Thank you for your understanding.

It depends on how the gender is represented in the input data: if it's already encoded as 0 and 1, then obviously there's no need to do anything. However, if it's stored as strings like "Male" and "Female", then you must encode it, since the features have to be numerical.
Also note that sometimes you might have more than two categories even for gender, e.g. "non-binary" or "prefer not to say".
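As a minimal sketch of the point above (the sample values and the helper names `binary_encode` and `one_hot_encode` are assumptions made for illustration, not part of the question): with exactly two categories, the two one-hot columns always sum to 1, so a single 0/1 column carries the same information. With a third category, the single column is no longer enough.

```python
def binary_encode(values, positive):
    """Encode a two-category feature as a single 0/1 column."""
    return [1 if v == positive else 0 for v in values]


def one_hot_encode(values):
    """Return (sorted categories, rows) with one 0/1 column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


genders = ["Male", "Female", "Female", "Male"]  # hypothetical sample data

single_column = binary_encode(genders, positive="Female")
categories, rows = one_hot_encode(genders)

# With two categories the second one-hot column is just 1 - first column,
# so it adds no information beyond the single 0/1 column.
redundant = all(sum(row) == 1 for row in rows)

# With a third category, one 0/1 column can no longer tell all values apart.
categories3, rows3 = one_hot_encode(genders + ["Prefer not to say"])
```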


The set of atomic irrational numbers used to express the character table and corresponding (unitary) representations

I want to calculate, in GAP, the irrational number expressed by the following formula: 3^(1/7). I've read through the related description here, but still can't figure out the trick. Will numbers like this appear in the computation of the character table and corresponding (unitary) representations?
P.S. Basically, I want to figure out the following question: For the computation of the character table and corresponding (unitary) representations, what is the minimum complete set of atomic irrational numbers used to express the results?
Regards,
HZ
You can't do that with GAP's standard cyclotomic numbers, as seventh roots of 3 are not cyclotomic. Indeed, suppose $r$ is such a root, i.e. a root of the polynomial $f = x^7-3 \in \mathbb{Q}[x]$. Then $r$ is cyclotomic if and only if the field $\mathbb{Q}(r)$ is a subfield of a cyclotomic field. By Kronecker-Weber this is equivalent to $\mathbb{Q}(r)$ being an abelian extension of $\mathbb{Q}$, i.e., a Galois extension whose Galois group is abelian. One can check that this is not the case here (the Galois group of $f$ is a semidirect product of $C_7$ with $C_6$, which is not abelian).
So, $r$ is not cyclotomic.

How to encode the column which has more than 50 categories

How should I encode a column that has more than 50 categories?
Can we use one-hot encoding?
Here is a great blogpost: https://towardsdatascience.com/benchmarking-categorical-encoders-9c322bd77ee8
Basically, there are the following ways of encoding:
basic label encoding - simply replacing by numbers
one hot encoding (can be used with 50 categories, it is okay)
lots of ways to use numerical encoding: frequency, mean target, and many others
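The three families listed above can be sketched in plain Python. This is only an illustration under assumptions: the function names and the toy `colors` column are made up here, and a real project would more likely use a library encoder.

```python
from collections import Counter


def label_encode(values):
    """Replace each category by an integer ID (order of first appearance)."""
    ids = {}
    for v in values:
        ids.setdefault(v, len(ids))
    return [ids[v] for v in values]


def one_hot_encode(values):
    """One 0/1 column per category; fine for ~50 categories, just wide."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


def frequency_encode(values):
    """Replace each category by its relative frequency in the column."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]


colors = ["red", "blue", "red", "green"]  # hypothetical toy column

labels = label_encode(colors)
categories, one_hot = one_hot_encode(colors)
frequencies = frequency_encode(colors)
```

Label encoding imposes an arbitrary order on the categories, which is why it tends to suit tree-based models better than linear ones. One-hot avoids that at the cost of one column per category.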

How to treat numbers inside text strings when vectorizing words?

If I have a text string to be vectorized, how should I handle numbers inside it? Or if I feed a Neural Network with numbers and words, how can I keep the numbers as numbers?
I am planning on making a dictionary of all my words (as suggested here). In this case all strings will become arrays of numbers. How should I handle characters that are numbers? How do I output a vector that does not mix the word index with the number character?
Does converting numbers to strings weaken the information I feed the network?
Expanding on your discussion with @user1735003, let's consider both ways of representing numbers:
1. Treat the number as a string, i.e. as just another word, and assign an ID to it when forming the dictionary.
2. Convert the numbers to actual words: '1' becomes 'one', '2' becomes 'two', and so on.
Does the second one change the context in any way? To check, we can compare the two representations using word2vec similarity. The scores will be high if they are used in similar contexts.
For example,
1 and one have a similarity score of 0.17, and 2 and two have a similarity score of 0.23. This suggests that the contexts in which they are used are quite different.
By treating the numbers as just another word, you are not changing the context, but with any other transformation of those numbers you can't guarantee it's for the better. So it's better to leave them untouched and treat them as words.
Note: Both word2vec and GloVe were trained by treating the numbers as strings (case 1).
The link you provide suggests that everything resulting from a .split(' ') is indexed: words, but also numbers, possibly smileys, and so on. (I would still take care of punctuation marks.) Unless you have more prior knowledge about your data or your problem, you could start with that.
EDIT
Example literally using your string and their code:
corpus = {'my car number 3'}
dictionary = {}
i = 1
for tweet in corpus:
    for word in tweet.split(" "):
        # Assign the next free ID only when a word is seen for the first time
        if word not in dictionary:
            dictionary[word] = i
            i += 1
print(dictionary)
# {'my': 1, 'car': 2, 'number': 3, '3': 4}
The following paper can be helpful: http://people.csail.mit.edu/mcollins/6864/slides/bikel.pdf
Specifically, page 7.
Before they fall back to an <unknown> tag, they try to replace alphanumeric symbol combinations with tags that name common patterns, such as:
FourDigits (good for years)
I've tried to implement it and it gave great results.
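A rough sketch of that idea follows. Only the FourDigits tag comes from the text above. The other tag names (TwoDigits, OtherNumber, DigitsAndLetters) and the helper names `tag_token` and `normalize` are assumptions for illustration, not the exact classes from the slides.

```python
import re


def tag_token(token):
    """Replace number-like tokens with a pattern tag; leave other tokens alone."""
    if re.fullmatch(r"\d{4}", token):
        return "FourDigits"        # e.g. years
    if re.fullmatch(r"\d{2}", token):
        return "TwoDigits"         # assumed tag name
    if re.fullmatch(r"\d+", token):
        return "OtherNumber"       # assumed tag name
    if re.search(r"\d", token) and re.search(r"[A-Za-z]", token):
        return "DigitsAndLetters"  # assumed tag name
    return token


def normalize(text):
    """Split on whitespace and tag each token."""
    return [tag_token(t) for t in text.split()]


tokens = normalize("my car number 3 from 1999 model A4")
```

The dictionary then only needs one ID per tag instead of one ID per distinct number, which keeps the vocabulary small and lets unseen numbers map to a known entry.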

How to recognize if word has no meaning, maybe some impossible syllables?

Initially, I have m arrays of n characters, where each array contains one character (unknown to me) of the word I need (condition: the word has meaning).
For example, m = 4, n = 3: array0 = {'t', 'e', 'c'}, array1 = {'g', 'o', 'a'}, array2 = {'w', 'd', 'y'}, array3 = {'e', 'o', 's'}. Each array contains only one correct letter: array0 holds the first letter, array1 the second, and so on. So, the probable secret word is 'code': array0[2] = 'c', array1[1] = 'o', array2[1] = 'd', array3[0] = 'e'.
I need to find all of existing letter-combinations, i.e. exclude generated meaningless words.
Are there any rules/regularities of 'impossible' syllables/letter-combinations in English?
I'm attacking Vigenere's cipher. So, I know the length of the key and its probable characters. I'm shuffling my arrays and getting many meaningless words. The problem is to filter them. As I understand it, some conditions can help to recognize incorrect words. For example, if the word length is > 4, then a word made of only vowels or only consonants is wrong. Some letter combinations, such as kk, hh, ww, are in general impossible too. Where can I find such rules?
I'm supposing what you mean by the "word has meaning" is that it is an English dictionary word.
I believe that you should approach the problem from the other direction, as GregS suggests, and go through a dictionary. English has many exceptions when it comes to letters and spelling, and the number of words that look English is much greater than the actual number of English words. You won't be able to cut down your search very much in that way.
But because you know the length and probable characters you are able to quickly throw out many dictionary words. Also, if the message isn't too short, it would also be very fast to attempt a decoding of the message with possible words, and throw out unlikely decodings by letter, digram or trigram frequencies.
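A minimal sketch of the dictionary-first approach, using the arrays from the question. The tiny `word_list` and the function name `matching_words` are assumptions for illustration. In practice the list would be loaded from a real dictionary file.

```python
def matching_words(words, position_sets):
    """Keep dictionary words whose i-th letter is allowed at position i."""
    length = len(position_sets)
    return [
        word
        for word in words
        if len(word) == length
        and all(ch in allowed for ch, allowed in zip(word, position_sets))
    ]


# Candidate letters per position, taken from the question's example.
position_sets = [
    {"t", "e", "c"},
    {"g", "o", "a"},
    {"w", "d", "y"},
    {"e", "o", "s"},
]

# Hypothetical stand-in for a real English word list.
word_list = ["code", "toys", "cows", "tree", "game", "word"]

candidates = matching_words(word_list, position_sets)
```

This makes one pass over the dictionary instead of generating all n^m letter combinations, and any word that survives is a real word by construction.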
I'm not sure I follow your strategy for attacking a Vigenere cipher. However, in response to:
I need to find all of existing letter-combinations, i.e. exclude generated meaningless words. Are there any rules/regularities of 'impossible' syllables/letter-combinations in English?
Yes, indeed there is a plethora of such rules. There are two ways of learning and implementing these rules:
1. Carefully study the morphology of English, and meticulously implement the rules.
2. Train a Markov model on a corpus of English text.
Option 1 will be substantially more work for little additional benefit.
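The Markov-model option can be sketched as a character bigram model with add-one smoothing. Everything below is an assumption made for illustration: the function names, the start/end markers "^" and "$", and the toy training list, which would normally be a large English corpus.

```python
import math
from collections import Counter


def train_bigram_model(words):
    """Count character bigrams, with ^ and $ marking word start and end."""
    pair_counts = Counter()
    context_counts = Counter()
    for word in words:
        padded = "^" + word.lower() + "$"
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            context_counts[a] += 1
    return pair_counts, context_counts


def avg_log_prob(word, model, alphabet_size=27):
    """Average log-probability per transition, with add-one smoothing.

    alphabet_size = 26 letters plus the end marker (an assumption).
    """
    pair_counts, context_counts = model
    padded = "^" + word.lower() + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        p = (pair_counts[(a, b)] + 1) / (context_counts[a] + alphabet_size)
        total += math.log(p)
    return total / (len(padded) - 1)


# Hypothetical toy corpus; a real model needs far more text.
training_words = ["code", "come", "cold", "mode", "node",
                  "road", "toad", "done", "tone", "core"]
model = train_bigram_model(training_words)

plausible = avg_log_prob("code", model)
implausible = avg_log_prob("tgwe", model)
```

Candidates whose score falls below a threshold chosen on held-out words can then be discarded. The score is only a plausibility measure: it ranks letter sequences, it does not prove that a word exists.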

Do you use singular or plural in names of arrays, maps, sets, etc.?

I have a quick question that is not particularly technical, but I sometimes wonder what's better ...
Do you use singular or plural in names of arrays, maps, sets, etc.? Example:
Singular
1 std::map<string,double> age;
2 age["diego maradonna"] = 49;
Plural
1 std::map<string,double> ages;
2 ages["diego maradonna"] = 49;
In the plural version, the second line isn't nice (because you're looking up the age, not the ages of Maradonna). In the singular version, the first line sounds kind of wrong (because the map contains many ages).
Singular for instances, plural for collections.
For maps, I will typically even go a step further and name them in terms of both their keys and values (ex. agesByPersonNames). This is especially helpful if you have a map of maps.
Plurals. I use the same kind of names for SQL tables. The case of:
ages["diego maradonna"] = 49;
should be read as "in the collection of ages, find me the one that belongs to maradonna and change it to 49"
I would use nameToAgeMap["diego maradonna"], so it's obvious what you put in (a name) and what you get out (an age). It also reads nicely in assignments: nameToAgeMap["diego maradonna"] = 49; can be read as "put 49 into the name-to-age map for Diego Maradonna".