K-Means clustering for the following mixed variable data

Can somebody help me with this problem?
I'm learning KMeans clustering concepts. I know how to cluster if the variables are continuous. But this data set contains categorical/discrete variables like gender and zip code.
Sno Age Gender Zip Salary
1 26 0 9822 100
2 38 1 9822 700
3 19 1 9822 100
4 64 0 9810 2500
5 53 1 9810 1200
6 75 1 9810 1800
7 19 0 9822 75
8 36 1 9822 350
9 42 1 9875 1800
10 41 0 9875 750

K-Means works only with numerical data.
K-Means fails for categorical data because taking the mean of categorical values doesn't make sense, and neither does the distance between them. Some people run K-Means on one-hot-encoded data, but that doesn't produce meaningful clusters either.
For this kind of problem, look at a variation of K-Means called the K-Prototypes algorithm, which works well with a mix of categorical and numerical data.
Check out https://pypi.python.org/pypi/kmodes/
This link contains the paper and the python package for using this algorithm. It's easy to understand as well.
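For reference, here is a minimal sketch of how the k-prototypes implementation in that package can be called on the table above. The choice of n_clusters=3 is just an assumption for illustration; Gender and Zip are passed as the categorical columns.
import numpy as np
from kmodes.kprototypes import KPrototypes

# Columns: Age, Gender, Zip, Salary (Gender and Zip are categorical)
X = np.array([
    [26, 0, 9822, 100],
    [38, 1, 9822, 700],
    [19, 1, 9822, 100],
    [64, 0, 9810, 2500],
    [53, 1, 9810, 1200],
    [75, 1, 9810, 1800],
    [19, 0, 9822, 75],
    [36, 1, 9822, 350],
    [42, 1, 9875, 1800],
    [41, 0, 9875, 750],
])

# n_clusters=3 is an arbitrary choice for illustration
kproto = KPrototypes(n_clusters=3, init='Cao', random_state=0)
clusters = kproto.fit_predict(X, categorical=[1, 2])  # indices of the categorical columns
print(clusters)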

Related

How to redistribute outliers over the previous time period?

Imagine a dataframe that looks like this:
1
2
3
4
5
6
7
50
16
17
Normally we would apply an algorithm from Detect and exclude outliers in a pandas DataFrame to remove the 50 entirely; however, my particular dataset instead requires me to distribute the value of the 50 over the previous 7 days:
8
9
10
11
12
13
14
15
16
17
How can I make this work in pandas? I can detect the outliers easily enough, but I'm not sure how to spread the values out into the previous days. Note that a simple moving average doesn't work well for this type of data, as there would still be a jump in the average value when the 50 shows up. What I need is to smooth the 50 into the previous days so that no jump is visible.
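For what it's worth, here is a minimal sketch of one possible approach (the threshold of 20 and the 7-row window are assumptions, not from the post): cap each detected outlier and spread the excess evenly over the preceding rows.
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 50, 16, 17], dtype=float)

threshold = 20  # assumed cutoff for flagging a value as an outlier
window = 7      # number of previous rows to spread the excess over

for i in s.index[s > threshold]:
    excess = s[i] - threshold
    prev = s.index[s.index < i][-window:]  # up to 7 preceding rows
    if len(prev):
        s.loc[prev] += excess / len(prev)  # distribute the excess evenly
        s.loc[i] = threshold               # keep the capped value in place

print(s)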

Deflate: code lengths of > 7 bits for top-level HCLEN?

RFC 1951 specifies that the first level of encoding in a block contains HCLEN 3-bit values, which encode the lengths of the next level of Huffman codes. Since these are 3-bit values, it follows that no code for the next level can be longer than 7 bits (111 in binary).
However, there seem to be corner cases which (at least with the "classical" algorithm to build Huffman codes, using a priority queue) apparently generate codes of 8 bits, which can of course not be encoded.
An example I came up with is the following (this represents the 19 possible symbols resulting from the RLE encoding, 0-15 plus 16, 17 and 18):
symbol | frequency
-------+----------
0 | 15
1 | 14
2 | 6
3 | 2
4 | 18
5 | 5
6 | 12
7 | 26
8 | 3
9 | 20
10 | 79
11 | 94
12 | 17
13 | 7
14 | 8
15 | 4
16 | 16
17 | 1
18 | 13
According to various online calculators (eg https://people.ok.ubc.ca/ylucet/DS/Huffman.html), and also building the tree by hand, some symbols in the above table (namely 3 and 17) produce 8-bit long Huffman codes. The resulting tree looks ok to me, with 19 leaf nodes and 18 internal nodes.
So, is there a special way to calculate Huffman codes for use in DEFLATE?
Yes. deflate uses length-limited Huffman codes. You need either a modified Huffman algorithm that limits the length, or an algorithm that shortens a Huffman code that has exceeded the length. (zlib does the latter.)
In addition to the code lengths code being limited to seven bits, the literal/length and distance codes are limited to 15 bits. It is not at all uncommon to exceed those limits when applying Huffman's algorithm to sets of frequencies encountered during compression.
Though your example is not a valid or possible set of frequencies for that code. Here is a valid example that results in a 9-bit Huffman code, which would then need to be squashed down to seven bits:
3 0 0 5 5 1 9 31 58 73 59 28 9 1 2 0 6 0 0
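To see the limit being exceeded, here is a small sketch (not part of the original answer) that computes plain, unrestricted Huffman code lengths for a frequency set; run on the valid example above, it reports the over-long code that deflate would then have to squash down to seven bits.
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    # Plain (unrestricted) Huffman construction; returns {symbol: code length}.
    # Symbols with zero frequency get no code.
    tiebreak = count()
    heap = [(f, next(tiebreak), {sym: 0}) for sym, f in enumerate(freqs) if f > 0]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf in them one level deeper
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

freqs = [3, 0, 0, 5, 5, 1, 9, 31, 58, 73, 59, 28, 9, 1, 2, 0, 6, 0, 0]
lengths = huffman_code_lengths(freqs)
print(max(lengths.values()))  # longest code length, which exceeds the 7-bit limit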

How to Create a CDF out of a PDF in SQL

So I have a data table that looks something like the following. ID represents an object, bin represents how I am segmenting the data, and percent is how much of the data falls into that bin.
id bin percent
2 8 0.20030698388
2 16 0.14504988488
2 24 0.12356101304
2 32 0.09976976208
2 40 0.09056024558
2 48 0.07137375287
2 56 0.04067536454
2 64 0.03914044512
2 72 0.02916346891
2 80 0.16039907904
3 8 0.36316695352
3 16 0.03958691910
3 24 0.11876075731
3 32 0.13253012048
3 40 0.03098106712
3 48 0.07228915662
3 56 0.07745266781
3 64 0.02581755593
3 72 0.02065404475
3 80 0.11876075731
I am looking for a function to turn this dataset into a CDF, partitioning by id. I have tried CUME_DIST and PERCENT_RANK, but they do not appear to work.
I am facing a similar problem and found this great tutorial for doing exactly that:
https://dwaincsql.com/2015/05/14/excel-in-t-sql-part-2-the-normal-distribution-norm-dist-density-functions/
It tries to rebuild the Excel NORM.DIST function, which gives you either the PDF if you set the cumulative flag to FALSE or the CDF if you set it to TRUE. I assumed that CUME_DIST would do the exact same thing in SQL. However, it turns out that the latter distributes by counting the elements, whereas Excel uses the relative differences in the values.
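For what it's worth, since each row already stores its bin's probability mass, the CDF is just a running sum of percent within each id; a windowed SUM(percent) OVER (PARTITION BY id ORDER BY bin) expresses that idea in SQL. Here is a small pandas sketch of the same logic (column names follow the table in the question, data abbreviated):
import pandas as pd

df = pd.DataFrame({
    'id':      [2, 2, 2, 3, 3, 3],
    'bin':     [8, 16, 24, 8, 16, 24],
    'percent': [0.2003, 0.1450, 0.1236, 0.3632, 0.0396, 0.1188],
})

df = df.sort_values(['id', 'bin'])
df['cdf'] = df.groupby('id')['percent'].cumsum()  # running sum per id
print(df)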

How to return a group of rows when one row meets "where" criteria in SQL Anywhere

I am somewhat overwhelmed by what I am trying to do, since I have only been using SQL for 3 days now, but I already love the increased functionality over MS Query. The need for the IN function is what drove me to learn about this, and I thank the community for the info here that got me through learning that.
I tried looking through other questions, but I couldn't find one in which the intent was to group more than two rows, or to group a varying number of rows. This means that count and duplicate are both out as options.
What I am doing is analyzing a table of part number information that spans multiple store locations. The table gives a row to each instance of a part number, so if all 15 stores have some sort of history for a given part number, that part number will have 15 rows in the table.
I am wanting to look at other store's history for parts that meet the criteria of 0 sales history for my location. The purpose is to see if they can be transferred to another store instead of being returned to the vendor and incurring a restock fee.
Here is a simplified version of the table organized in the way I would want the output to be structured. I got here by having suspected part numbers and using the list of them as a text string in IN() but I want to go about this the other way and build a list of part numbers from sales data in this table.
Branch| Part_No| Description| Bin Qty|current 12 mo sales|previous 12 mo sales|
------|--------|------------|---------|-------------------|--------------------|
20 CA38385 SUPPORT 2 1 1
23 CA38385 SUPPORT 1 0 0
25 CA38385 SUPPORT 0 0 1
20 DFC10513 Hdw Kit 0 1 0
23 DFC10513 Hdw Kit 1 0 0
07 DFC10513 Hdw Kit 0 1 0
3 D59096 VALVE 0 0 12
5 D59096 VALVE 0 0 4
6 D59096 VALVE 4 6 12
8 D59096 VALVE 0 0 0
33 D59096 VALVE 11 14 18
21 D59096 VALVE 4 4 4
22 D59096 VALVE 0 0 0
23 D59096 VALVE 10 0 0
24 D59096 VALVE 0 0 0
25 D59096 VALVE 0 0 0
26 D59096 VALVE 2 2 0
1 TE67401 Repair Kit 1 1 2
21 TE67401 REPAIR KIT 1 3 0
22 TE67401 REPAIR KIT 0 1 0
I am branch 23, so the start of the query as I understand it would be
Select * from part_information
Group By part_number
Having IN(Branch) 23 and bin qty > 0 and current_12_mo_sales=0 and previous_12_mo_sales = 0
Can you point me down the right track? This table has approx. 200000 rows in it, so I really need to learn how to do this. I really don't see a better way.
Thank you in advance for your help and/or criticism. -Cody
Select * from part_information
where part_number not in (
select part_number from part_information
where branch = 23 and bin_qty > 0 -- etc...
)
(Apologies for lack of formatting).
This ended up working the way I wanted
SELECT pi_Branch, pi_Franchise, pi_Part_No, pi_Description, pi_Bin_Qty,
pi_Bin, pi_current_12_mo_sales, pi_previous_12_mo_sales, pi_Inventory_Cost,
pi_Return_Indicator
From Part_Information
Where pi_Part_No IN (Select pi_Part_No
From Part_Information
Where pi_Branch=23 And
pi_Bin_Qty>0 And pi_current_12_mo_sales<=0
And pi_previous_12_mo_sales<=0)
I was thinking that this had to be some complex process, but in reality, two simple queries were all that was needed.
I would still be interested in anyone's opinion on a better or more efficient way of handling this.
Thanks Mischa for getting me there!

What type of graph can best show the correlation between 'Fare' (price) and "Survival" (Titanic)?

I'm playing around with Seaborn and Matplotlib, and I'm trying to find the best type of graph to show the correlation between fare values and chance of survival in the Titanic dataset.
The Titanic fare column has a lot of different values ranging from 1 to 500 and some of the values are repeated often.
Here is a sample of value_counts:
titanic.fare.value_counts()
8.0500 43
13.0000 42
7.8958 38
7.7500 34
26.0000 31
10.5000 24
7.9250 18
7.7750 16
0.0000 15
7.2292 15
26.5500 15
8.6625 13
7.8542 13
7.2500 13
7.2250 12
16.1000 9
9.5000 9
15.5000 8
24.1500 8
14.5000 7
7.0500 7
52.0000 7
31.2750 7
56.4958 7
69.5500 7
14.4542 7
30.0000 6
39.6875 6
46.9000 6
21.0000 6
.....
91.0792 2
106.4250 2
164.8667 2
The survived column, on the other hand, has only two values:
>>> titanic.survived.head(10)
271 1
597 0
302 0
633 0
277 0
413 0
674 0
263 0
466 0
A histogram would only show the frequency of fares in certain ranges.
For a scatter plot I would need two variables; having "survived" which has only two values would make for a strange variable.
Is there a way to show the rise of survivability as fare increases clearly through a line graph?
I know there is a correlation, because if I sort the fare values in ascending order (0 to 500) and then do:
>>> titanic.head(50).survived.sum()
5
>>>titanic.tail(50).survived.sum()
37
I see a correlation.
Thanks.
This is what I did to show the correlation between the fare values and the chance of survival:
First, I created a new column Fare Groups, converting fare values to groups of fare ranges, using cut().
df['Fare Groups'] = pd.cut(df.Fare, [0,50,100,150,200,550])
Next, I created a pivot_table().
piv_fare = df.pivot_table(index='Fare Groups', columns='Survived', values = 'Fare', aggfunc='count')
Output:
Survived 0 1
Fare Groups
(0, 50] 484 232
(50, 100] 37 70
(100, 150] 5 19
(150, 200] 3 6
(200, 550] 6 14
Plot:
piv_fare.plot(kind='bar')
It seems that those who had the cheapest tickets (0 to 50) had the lowest chance of survival. In fact, (0 to 50) is the only fare range where the chance of dying is higher than the chance of surviving. Not just higher, but significantly higher.
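A small follow-up (not in the original answer): dividing each group's survivor count by its total turns the counts into a survival rate per fare group, which makes the trend explicit. The 0/1 column labels come from the pivot table above.
rate = piv_fare[1] / (piv_fare[0] + piv_fare[1])  # survivors / total per fare group
rate.plot(kind='bar')  # survival rate rises with the fare range
print(rate)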