Reducing number of unique categories pandas - pandas

I have my dataset with job metrics, and one of my features is industry. It is a categorical feature and has 1200 unique values. Before I go on and work on building a model, I need to figure out how to best encode it esp because it has 1200 unique values. Does anyone have any tips or guidance as to where I should start?
The picture below shows the top 9 industries. I am thinking of selective encoding - maybe only using one-hot encoding for these 15-20 most frequent values, but I will be thankful for any suggestions. Thanks
Tried to look for several resources, but couldn't find anything promising so far
[A picture of the 9 most occurring industries]
https://i.stack.imgur.com/tDAEk.jpg

You could one hot encode everything, and maybe check correlations against target to see which job categories may be informative features.
if the data is too large to do this, then yes perhaps selective encoding as you said -- just conditionally fill everything else as "other" and then proceed with one hot encoding.

Related

How can I recode 53k unique addresses (saved as objects) w/o One-Hot-Encoding in Pandas?

My data frame has 3.8 million rows and 20 or so features, many of which are categorical. After paring down the number of features, I can "dummy up" one critical column with 20 or so categories and my COLAB with (allegedly) TPU running won't crash.
But there's another column with about 53,000 unique values. Trying to "dummy up" this feature crashes my session. I can't ditch this column.
I've looked up target encoding, but the data set is very imbalanced and I'm concerned about target leakage. Is there a way around this?
EDIT: My target variable is a simple binary one.
Without knowing more details of the problem/feature, there's no obvious way to do this. This is the part of Data Science/Machine Learning that is an art, not a science. A couple ideas:
One hot encode everything, then use a dimensionality reduction algorithm to remove some of the columns (PCA, SVD, etc).
Only one hot encode some values (say limit it to 10 or 100 categories, rather than 53,000), then for the rest, use an "other" category.
If it's possible to construct an embedding for these variables (Not always possible), you can explore this.
Group/bin the values in the columns by some underlying feature. I.e. if the feature is something like days_since_X, bin it by 100 or something. Or if it's names of animals, group it by type instead (mammal, reptile, etc.)

Organising csv. file data in Python

I am quite a beginner with Python but I have a programming-related project to work on, so, I really would like to ask some help. I didn´t find many simple solutions to organize the data such a way that I could do some analysis with that.
First, I have multiple csv-files, which I read in as DataFrame objects. In the end, I need them all to analyze together (right now the files are separated to the list of DataFrames but later on I probably will need those as one DataFrame object).
However, I have a problem with organizing and separating the data. These are thousands of rows in one column, a part of it is presented:
CIP;Date;Hour;Cons;REAL/ESTIMATED
EN025140855608477018TC2L;11/03/2020;1;0 057;R
EN025140855608477018TC2L;11/03/2020;2;0 078;R
EN025140855608477018TC2L;11/03/2020;3;0 033;R
EN025140855608477018TC2L;11/03/2020;4;0 085;R
EN025140855608477018TC2L;11/03/2020;5;0 019;R
...
EN025140855608477018TC2L;11/04/2020;20;0 786;R
EN025140855608477018TC2L;11/04/2020;21;0 288;R
EN025140855608477018TC2L;11/04/2020;22;0 198;R
EN025140855608477018TC2L;11/04/2020;23;0 728;R
EN025140855608477018TC2L;11/04/2020;24;0 275;R
The area, where the huge space between, the number should be merged together, for example, 0.057, which information represents "Cons" (actually it is the most important information).
I should be able to split data into 5 columns in order to proceed with the analysis. However, it should be a universal tool for different csv-files without knowing the including symbols. But the structure of the content and the heading is always the same.
I would be happy if anyone might know to recommend a way to work with this kind of data.
Sounds like what you are trying to do is convert the Cons column so that the spaces become a dot.
df = pd.read_csv("file.txt", sep=";")
df['Cons'] = df['Cons'].str.replace("\s+",".")
df['Cons'].head()
Output:
0 0.057
1 0.078
2 0.033
3 0.085
4 0.019

Machine Learning text comparison model

I am creating a machine learning model that essentially returns the correctness of one text to another.
For example; “the cat and a dog”, “a dog and the cat”. The model needs to be able to identify that some words (“cat”/“dog”) are more important/significant than others (“a”/“the”). I am not interested in conjunction words etc. I would like to be able to tell the model which words are the most “significant” and have it determine how correct text 1 is to text 2, with the “significant” words bearing more weight than others.
It also needs to be able to recognise that phrases don’t necessarily have to be in the same order. The two above sentences should be an extremely high match.
What is the basic algorithm I should use to go about this? Is there an alternative to just creating a dataset with thousands of example texts and a score of correctness?
I am only after a broad overview/flowchart/process/algorithm.
I think TF-IDF might be a good fit to your problem, because:
Emphasis on words occurring in many documents (say, 90% of your sentences/documents contain the conjuction word 'and') is much smaller, essentially giving more weight to the more document specific phrasing (this is the IDF part).
Ordering in Term Frequency (TF) does not matter, as opposed to methods using sliding windows etc.
It is very lightweight when compared to representation oriented methods like the one mentioned above.
Big drawback: Your data, depending on the size of corpus, may have too many dimensions (the same number of dimensions as unique words), you could use stemming/lemmatization in order to mitigate this problem to some degree.
You may calculate similiarity between two TF-IDF vector using cosine similiarity for example.
EDIT: Woops, this question is 8 months old, sorry for the bump, maybe it will be of use to someone else though.

Oracle String Conversion - Alpha String to Numeric Score, Fuzzy Match

I'm working with a lot of name data where the following events are happening:
In one stream the data is submitted as "Sung" and in the other stream "Snug" my initial thought to this was to convert Sung and Snug to where each character equals a number then the sums would be the same, so even if they transverse a character, I'd be able to bucket these appropriately.
The other is where in one stream it comes in as "Lillly" as opposed to "Lilly" in the other stream. I'd like to figure out how to fuzzy match these such that I can identify them. I'm not sure if this is possible in Oracle.
I'm working with many millions of data points and trying to figure out how to write these classification buckets such that I can stop having so much noise in my primary task of finding where people are truly different people as opposed to a clerical error.
Any thoughts would be very appreciated.
A common measure for such distance is called Levenshtein distance (Wikipedia here). This measures the "edit" distance between two strings -- number of edit operations needed to convert one into the other.
That's the good news. More good news is that Oracle even has an implementation in the UTL_MATCH library.
The bad news is that it is really, really expensive on millions of data points. Unfortunately, I cannot help you there so much. One idea is to determine which names are "close enough" because they already share a certain minimum number of characters.
Another method is to convert the strings to what they sound like. That is called soundex. You may be able to use the two together -- assuming your names are predominantly English (the soundex algorithm was invented by the US Census Bureau, so it would work best on names in America).

Neural Network Input and Output Data formatting

and thanks for reading my thread.
I have read some of the previous posts on formatting/normalising input data for a Neural Network, but cannot find something that addresses my queries specifically. I apologise for the long post.
I am attempting to build a radial basis function network for analysing horse racing data. I realise that this has been done before, but the data that I have is "special" and I have a keen interest in racing/sportsbetting/programming so would like to give it a shot!
Whilst I think I understand the principles for the RBFN itself, I am having some trouble understanding the normalisation/formatting/scaling of the input data so that it is presented in a "sensible manner" for the network, and I am not sure how I should formulate the output target values.
For example, in my data I look at the "Class change", which compares the class of race that the horse is running in now compared to the race before, and can have a value between -5 and +5. I expect that I need to rescale these to between -1 and +1 (right?!), but I have noticed that many more runners have a class change of 1, 0 or -1 than any other value, so I am worried about "over-representation". It is not possible to gather more data for the higher/lower class changes because thats just 'the way the data comes'. Would it be best to use the data as-is after scaling, or should I trim extreme values, or something else?
Similarly, there are "continuous" inputs - like the "Days Since Last Run". It can have a value between 1 and about 1000, but values in the range of 10-40 vastly dominate. I was going to scale these values to be between 0 and 1, but even if I trim the most extreme values before scaling, I am still going to have a huge representation of a certain range - is this going to cause me an issue? How are problems like this usually dealt with?
Finally, I am having trouble understanding how to present the "target" values for training to the network. My existing results data has the "win/lose" (0 or 1?) and the odds at which the runner won or lost. If I just use the "win/lose", it treats all wins and loses the same when really they're not - I would be quite happy with a network that ignored all the small winners but was highly profitable from picking 10-1 shots. Similarly, a network could be forgiven for "losing" on a 20-1 shot but losing a bet at 2/5 would be a bad loss. I considered making the results (+1 * odds) for a winner and (-1 / odds) for a loser to capture the issue above, but this will mean that my results are not a continuous function as there will be a "discontinuity" between short price winners and short price losers.
Should I have two outputs to cover this - one for bet/no bet, and another for "stake"?
I am sorry for the flood of questions and the long post, but this would really help me set off on the right track.
Thank you for any help anyone can offer me!
Kind regards,
Paul
The documentation that came with your RBFN is a good starting point to answer some of these questions.
Trimming data aka "clamping" or "winsorizing" is something I use for similar data. For example "days since last run" for a horse could be anything from just one day to several years but tends to centre in the region of 20 to 30 days. Some experts use a figure of say 63 days to indicate a "spell" so you could have an indicator variable like "> 63 =1 else 0" for example. One clue is to look at outliers say the upper or lower 5% of any variable and clamp these.
If you use odds/dividends anywhere make sure you use the probabilities ie 1/(odds+1) and a useful idea is to normalize these to 100%.
The odds or parimutual prices tend to swamp other predictors so one technique is to develop separate models, one for the market variables (the market model) and another for the non-market variables (often called the "fundamental" model).