How to encode a column that has more than 50 categories - dataframe

How do I encode a column that has more than 50 categories?
Can we use one-hot encoding?

Here is a great blogpost: https://towardsdatascience.com/benchmarking-categorical-encoders-9c322bd77ee8
Basically, there are the following ways of encoding:
basic label encoding - simply replacing each category with a number
one-hot encoding (it can be used with 50 categories; that is okay)
lots of numerical encodings: frequency, mean target, and many others
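To make those options concrete, here is a minimal sketch in plain Python (the column values are invented for illustration; in practice a library such as pandas, scikit-learn, or category_encoders would do this for you):

```python
from collections import Counter

# Hypothetical high-cardinality categorical column.
column = ["rome", "paris", "rome", "oslo", "paris", "rome"]

# 1. Label encoding: map each distinct category to an integer.
labels = {cat: idx for idx, cat in enumerate(sorted(set(column)))}
label_encoded = [labels[c] for c in column]

# 2. One-hot encoding: one 0/1 indicator per category
#    (a 50-category column becomes 50 indicator columns).
categories = sorted(set(column))
one_hot = [[1 if c == cat else 0 for cat in categories] for c in column]

# 3. Frequency encoding: replace each category with its relative frequency.
counts = Counter(column)
freq_encoded = [counts[c] / len(column) for c in column]

print(label_encoded)    # [2, 1, 2, 0, 1, 2]
print(freq_encoded[0])  # 0.5
```

Mean target encoding works the same way as the frequency version, except each category is replaced by the mean of the target variable over its rows.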

What is the meaning of # in espeak\dictsource\en_list?

I'm trying to convert English sounds to IPA symbols in code. I came across eSpeak, which has a good collection of English words with their corresponding sounds in en_list.
As far as I understand, eSpeak uses the Kirshenbaum representation.
In en_list I came across the usage of #,
e.g.: anomaly a#n0m#li
But # is an unused character according to Kirshenbaum.
So I wanted to know the meaning of # in en_list.

How to treat numbers inside text strings when vectorizing words?

If I have a text string to be vectorized, how should I handle numbers inside it? Or, if I feed a neural network with numbers and words, how can I keep the numbers as numbers?
I am planning on making a dictionary of all my words (as suggested here). In this case all strings will become arrays of numbers. How should I handle characters that are numbers? How do I output a vector that does not mix the word indices with the number characters?
Does converting numbers to strings weaken the information I feed the network?
Expanding on your discussion with #user1735003 - let's consider both ways of representing numbers:
Treating a number as a string, considering it as just another word, and assigning it an ID when forming the dictionary; or
Converting the numbers to actual words: '1' becomes 'one', '2' becomes 'two', and so on.
Does the second one change the context in any way? To verify it, we can find the similarity of the two representations using word2vec. The scores will be high if they have similar contexts.
For example,
1 and one have a similarity score of 0.17, and 2 and two have a similarity score of 0.23. These scores suggest that the contexts in which they are used are totally different.
By treating the numbers as just another word, you are not changing the context, but by applying any other transformation to those numbers, you can't guarantee it is for the better. So it is better to leave them untouched and treat them as another word.
Note: both word2vec and GloVe were trained by treating the numbers as strings (case 1).
The link you provide suggests that everything resulting from a .split(' ') is indexed: words, but also numbers, possibly smileys, and so on (I would still take care of punctuation marks). Unless you have more prior knowledge about your data or your problem, you could start with that.
EDIT
Example literally using your string and their code:
corpus = {'my car number 3'}
dictionary = {}
i = 1
for tweet in corpus:
    for word in tweet.split(" "):
        if word not in dictionary:
            dictionary[word] = i
            i += 1
print(dictionary)
# {'my': 1, 'car': 2, 'number': 3, '3': 4}
The following paper can be helpful: http://people.csail.mit.edu/mcollins/6864/slides/bikel.pdf
Specifically, page 7.
Before falling back to an <unknown> tag, they try to replace alphanumeric symbol combinations with common pattern-name tags, such as:
FourDigits (good for years)
I've tried to implement it and it gave great results.
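A rough sketch of that idea in Python (the tag names and regex patterns here are illustrative, not taken from the paper):

```python
import re

# Regex patterns mapped to pattern-name tags, checked in order (illustrative set).
PATTERNS = [
    (re.compile(r"^\d{4}$"), "FourDigits"),               # e.g. years like 1984
    (re.compile(r"^\d+$"), "OtherNumber"),                # any other run of digits
    (re.compile(r"^\d+[a-zA-Z]+$"), "DigitsAndLetters"),  # e.g. 3rd, 100km
]

def normalize_token(token):
    """Replace a token with a pattern tag if one matches, else keep it as-is."""
    for pattern, tag in PATTERNS:
        if pattern.match(token):
            return tag
    return token

tokens = "my car number 3 was built in 1984".split(" ")
print([normalize_token(t) for t in tokens])
# ['my', 'car', 'number', 'OtherNumber', 'was', 'built', 'in', 'FourDigits']
```

The normalized tokens are then indexed in the dictionary as usual, so all years share one ID instead of each year being its own rare word.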

How to represent a letter as a bitmap?

I know how to represent a number as a bitmap, for instance:
17 = 010001
11 = 001011
This is about numbers, but what about letters? Is there a way to do this? For example:
w = ??
[ = ??
Everything on your computer is represented as a sequence of bits, which is what you are calling a "bitmap". So the answer to your question is yes: characters have a binary representation, along with integers, floating-point numbers, machine instructions, etc.
Different languages use different binary encodings for characters; Objective-C uses Unicode. See the section Understanding Characters in the NSString documentation.
HTH
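To see those bit patterns for yourself, here is a quick sketch (shown in Python, but any language that exposes a character's code point works the same way):

```python
# A character's code point gives its binary representation.
# These examples are ASCII, where the code point fits in 8 bits.
for ch in ["w", "["]:
    code = ord(ch)               # numeric code point, e.g. 'w' -> 119
    bits = format(code, "08b")   # zero-padded 8-bit binary string
    print(ch, "=", bits)
# w = 01110111
# [ = 01011011
```

Characters outside ASCII need more than one byte, and the exact bytes depend on the encoding (UTF-8, UTF-16, etc.).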
You might as well jump in at the deep end and visit www.unicode.org.
Letters are complicated. And anyway, representing them as a bitmap is a bit pointless, isn't it?

Search for a little string in a huge one

I'm working on a project in which I must search for a small string (about 40 chars) in a very, very large string (we're talking about a hundred million chars). I'm looking for the fastest way. I've tried several methods; these are the benchmark results:
Contains returned True in 248 ms;
IndexOf returned True in 671 ms (I would have never said that!);
Contains using an array instead of a string returned True in 48 ms only;
Even though Contains on an array seemed to be the best method, I've also had a look at some search algorithms (Knuth-Morris-Pratt, Rabin-Karp, and Boyer-Moore), but none of them seems suitable for my scenario.
My question is: is there a faster way to search for a small string in a very big string?
Thanks,
PWhite
If you have a fixed, gigantic string that you want to search repeatedly for very small substrings, you may want to look for a library that implements a suffix tree or suffix array. Once you've done the preprocessing work to build the suffix tree or array, the runtime of searching for a pattern of length P is O(P) for the suffix tree and O(P + log T) for the suffix array, where T is the length of the long text string. That's likely to be significantly faster than what you're seeing now, though you'll need heavier-weight libraries to do this.
On the other hand, if you have a fixed set of pattern strings and a rotating cast of large strings to search in, you may want to use the Aho-Corasick string matching algorithm, which can scan for all occurrences of a fixed set of patterns in a string of length T in time O(T + z), where z is the number of matches. This is typically very fast in practice.
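To illustrate the suffix-array idea, here is a minimal sketch (the naive O(n² log n) construction below is for demonstration only; a real library builds the array in O(n log n) or O(n)):

```python
def build_suffix_array(text):
    """Start indices of all suffixes, sorted lexicographically.
    Naive build for clarity; production code should use a library."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def contains(text, suffix_array, pattern):
    """Binary-search the sorted suffixes for one starting with pattern."""
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(suffix_array) and text.startswith(pattern, suffix_array[lo])

text = "a hundred million chars would go here"
sa = build_suffix_array(text)   # one-time preprocessing cost
print(contains(text, sa, "million"))  # True
print(contains(text, sa, "billion"))  # False
```

Each lookup is O(P log T) string comparisons regardless of where (or whether) the pattern occurs, which is what makes the preprocessing pay off when the same huge text is searched many times.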

How to send xon/xoff in case of binary data?

In the case of software data flow control, we use the standard XON and XOFF characters (0x11 and 0x13) to pause and resume transmission. But if we want to send binary data that contains bytes matching the ASCII values of XON and XOFF, what character set should we use to send XON or XOFF?
A simple solution is to use Base64 encoding, which is available in Python:
base64.b64encode(yourData) - encode
base64.b64decode(yourData) - decode
It adds some overhead, but the sent data is in a simple character format, so this could be one option for you, I suppose.
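A complete sketch showing why this works: the Base64 alphabet (A-Z, a-z, 0-9, +, /, =) contains no control characters, so the encoded stream can never collide with XON or XOFF.

```python
import base64

XON, XOFF = 0x11, 0x13

# Binary payload that happens to contain the XON and XOFF byte values.
payload = bytes([0x00, XON, 0x41, XOFF, 0xFF])

encoded = base64.b64encode(payload)   # printable ASCII only
decoded = base64.b64decode(encoded)

# The encoded stream is safe to send over a link using software flow control:
assert XON not in encoded and XOFF not in encoded
assert decoded == payload
```

The cost is roughly 33% size overhead (4 output bytes for every 3 input bytes).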
Using software handshaking precludes the sending of raw binary data.
Short of doing something esoteric (sending 9 bits per byte instead of 8 - very non-standard), there is no way to distinguish 2 of the 256 different binary values from the 2 codes selected for use as XON/XOFF.
There are various protocols that attempt to deal with this. They all encode the "binary data" into something efficient but not a one-to-one mapping. One can use escape codes, compression, data packets, etc. Of course, both ends of the communication need to know how to encode/decode, which often limits your choices. If in doubt, start with a binary-to-text encoding, as it tends to be easier to debug: http://en.wikipedia.org/wiki/Binary-to-text_encoding
To be able to use those two special characters as control characters, you have to make sure they do not occur in the payload data. One way to do that is to encode the payload with a reduced alphabet that does not include the special characters. The binary-to-text encodings mentioned in a parallel answer would do the job, but if low overhead that does not depend on the distribution of input bytes is critical, then an escapeless encoding may help.
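The escape-code approach mentioned above can be sketched as byte stuffing (the 0x7D escape byte and XOR-with-0x20 mask mirror PPP-style framing; the exact values are a design choice):

```python
XON, XOFF, ESC = 0x11, 0x13, 0x7D
SPECIAL = {XON, XOFF, ESC}
MASK = 0x20  # escaped bytes are XORed with this mask, PPP-style

def stuff(data):
    """Replace each special byte with ESC followed by (byte XOR MASK)."""
    out = bytearray()
    for b in data:
        if b in SPECIAL:
            out += bytes([ESC, b ^ MASK])
        else:
            out.append(b)
    return bytes(out)

def unstuff(data):
    """Reverse the stuffing: after an ESC, XOR the next byte back."""
    out = bytearray()
    it = iter(data)
    for b in it:
        out.append(next(it) ^ MASK if b == ESC else b)
    return bytes(out)

raw = bytes([0x41, XON, 0x42, XOFF, ESC, 0x43])
wire = stuff(raw)
assert XON not in wire and XOFF not in wire   # safe for software flow control
assert unstuff(wire) == raw                   # round-trips exactly
```

Unlike Base64, the overhead here depends on the data: worst case the stream doubles in size, but typical binary data grows only slightly.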