Calculate intersecting sums from flat DataFrame for a heatmap - pandas

I'm trying to wrangle some data to show how many items a range of people have in common. The goal is to show this data in a heatmap format via Seaborn to understand these overlaps visually.
Here's some sample data:
import pandas as pd
demo_df = pd.DataFrame([
    ("Get Back", 1, 0, 2),
    ("Help", 5, 2, 0),
    ("Let It Be", 0, 2, 2),
], columns=["Song", "John", "Paul", "Ringo"])
demo_df.set_index("Song")
           John  Paul  Ringo
Song
Get Back      1     0      2
Help          5     2      0
Let It Be     0     2      2
I don't need a breakdown by song, just the total of shared items. The resulting data would show a sum of how many items they share like this:
Name   John  Paul  Ringo
John      -     7      3
Paul      7     -      4
Ringo     3     4      -
So far I've tried a few options with groupby and unstack but haven't been able to work out how to cross match the names into both column and header rows.

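You can use dot and then fill the diagonal. Here df is the sample frame indexed by song:
import numpy as np
df = demo_df.set_index("Song")
# for each pair, sum one person's counts over the songs the other also has, in both directions
out = df.T.dot(df.ne(0)) + df.T.ne(0).dot(df)
np.fill_diagonal(out.values, 0)
out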
Out[176]:
       John  Paul  Ringo
John      0     7      3
Paul      7     0      4
Ringo     3     4      0
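For reference, here is a self-contained sketch of the same idea done on plain NumPy arrays and then wrapped back into a DataFrame. It avoids writing into out.values in place, on the assumption that such in-place writes may not be permitted in every pandas version. The names counts, present and shared are only illustrative.

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame(
    [("Get Back", 1, 0, 2), ("Help", 5, 2, 0), ("Let It Be", 0, 2, 2)],
    columns=["Song", "John", "Paul", "Ringo"],
)
df = demo_df.set_index("Song")

counts = df.to_numpy()                   # songs x people
present = (counts != 0).astype(int)      # 1 where a person has the song at all

# For each pair (i, j): person i's counts on songs j also has,
# plus person j's counts on songs i also has.
shared = counts.T @ present + present.T @ counts
np.fill_diagonal(shared, 0)              # a person paired with themselves is not an overlap

out = pd.DataFrame(shared, index=df.columns, columns=df.columns)
print(out)
```

The resulting frame can then be passed to a heatmap function such as seaborn's heatmap, as the question intends.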

Related

Multilevel Indexing with Groupby

Being new to python I'm struggling to apply other questions about the groupby function to my data. A sample of the data frame :
ID Condition Race Gender Income
1 1 White Male 1
2 2 Black Female 2
3 3 Black Male 5
4 4 White Female 3
...
I am trying to use the groupby function to gain a count of how many black/whites, male/females, and income (12 levels) there are in each of the four conditions. Each of the columns, including income, are strings (i.e., categorical).
I'd like to get something such as
Condition Race Gender Income Count
1 White Male 1 19
1 White Female 1 17
1 Black Male 1 22
1 Black Female 1 24
1 White Male 2 12
1 White Female 2 15
1 Black Male 2 17
1 Black Female 2 19
...
Everything I've tried has come back very wrong, so I don't think I'm anywhere near right, but I've been using variations of
Data.groupby(['Condition','Gender','Race','Income'])['ID'].count()
When I run the above line I just get a 2 column matrix with an indecipherable index (e.g., f2df9ecc...) and the second column is labeled ID with what appear to be count numbers. Any help is appreciated.
If you inspect the result, you will see that the grouping columns have become the index, so just reset the index:
df = Data.groupby(['Condition','Gender','Race','Income'])['ID'].count().reset_index()
That was mainly to demonstrate what is happening. Since you know what you want, you can pass the argument as_index=False instead:
df = Data.groupby(['Condition','Gender','Race','Income'],as_index=False)['ID'].count()
Also, since you want the last column to be called 'count':
df = df.rename(columns={'ID':'count'})
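A minimal runnable sketch of the steps above, using a small made-up frame in place of the asker's Data (the six rows below are invented purely for illustration). It also shows size() with reset_index(name=...) as an alternative way to get the same counts.

```python
import pandas as pd

# Hypothetical stand-in for the asker's data; all columns except ID are strings
data = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Condition": ["1", "1", "2", "2", "1", "1"],
    "Race": ["White", "Black", "Black", "White", "White", "Black"],
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "Income": ["1", "2", "5", "3", "1", "2"],
})
keys = ["Condition", "Race", "Gender", "Income"]

# as_index=False keeps the grouping keys as ordinary columns
counts = (
    data.groupby(keys, as_index=False)["ID"]
    .count()
    .rename(columns={"ID": "Count"})
)

# Alternative: size() counts rows per group without picking a column
counts_alt = data.groupby(keys).size().reset_index(name="Count")
print(counts)
```

Note that count() skips missing values in ID while size() counts every row, so the two only agree when ID has no NaN entries.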

How to count the ID with the same prefix and store the total number in another column

I have a dataset in which the ID also carries classification info. The last 2 digits of the ID are the sub-ID (01, 02, 03, etc.) within a family. Below is an example. I am trying to add another column (the 2nd column) that stores how many sub-IDs exist in the same family. E.g., 22302 belongs to family 223, which has 3 members: 22301, 22302, and 22303. This would give me a new feature for classification modeling. I'm not sure whether there is a better way to extract this information. Can someone show me how to compute the number of members in the same class (as shown in the 2nd column)?
ID Same class
23401 1
22302 3
43201 1
144501 2
144502 2
22301 3
22303 3
You can do it with a string slice and transform:
df['New']=df.groupby(df.ID.astype(str).str[:-2]).ID.transform('size')
df
Out[223]:
ID Sameclass New
0 23401 1 1
1 22302 3 3
2 43201 1 1
3 144501 2 2
4 144502 2 2
5 22301 3 3
6 22303 3 3
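The same approach as a self-contained sketch, rebuilt from the sample IDs in the question. The column name family_size is just an illustrative choice. For purely integer IDs, integer division by 100 should give the same grouping key as the string slice; that variant is an assumption about the data and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"ID": [23401, 22302, 43201, 144501, 144502, 22301, 22303]})

# Family prefix = everything except the last two digits
family = df["ID"].astype(str).str[:-2]

# transform("size") broadcasts each group's row count back onto its rows
df["family_size"] = df.groupby(family)["ID"].transform("size")

# Integer variant: dropping the last two digits via floor division
family_int = df["ID"] // 100
size_via_int = df.groupby(family_int)["ID"].transform("size")
print(df)
```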

Data analysis with pandas

The following df is a summary of my whole dataset, just to illustrate my problem.
The df shows the job applications of each id, and I want to know which combinations of sectors an individual is most likely to apply to.
df
id education area_job_application
1 Collage Construction
1 Collage Sales
1 Collage Administration
2 University Finance
2 University Sales
3 Collage Finance
3 Collage Sales
4 University Administration
4 University Sales
4 University Data analyst
5 University Administration
5 University Sales
answer
                Construction  Sales  Administration  Finance  Data analyst
Construction               1      1               1        0             0
Sales                      1      5               3        2             1
Administration             1      3               3        0             1
Finance                    0      2               0        2             0
Data analyst               0      1               1        0             1
This answer shows that Administration and Sales are the sectors most likely to receive applications from the same id (this is the answer I am looking for). I am also interested in other combinations, and I think a heatmap would be very informative for illustrating this data.
Combinations of a sector with itself are irrelevant (the diagonal of the answer matrix could be 0, but the value doesn't matter since I won't analyse it).
Use crosstab, or groupby with size and unstack, to build an id-by-sector count table first. Then use DataFrame.dot with the transposed DataFrame, and finally reindex for a custom order of index and columns:
# build the order dynamically from the unique values of the column
L = df['area_job_application'].unique()
# alternative: df = pd.crosstab(df.id, df.area_job_application)
df = df.groupby(['id', 'area_job_application']).size().unstack(fill_value=0)
df = df.T.dot(df).rename_axis(None).rename_axis(None, axis=1).reindex(columns=L, index=L)
print(df)
                Construction  Sales  Administration  Finance  Data analyst
Construction               1      1               1        0             0
Sales                      1      5               3        2             1
Administration             1      3               3        0             1
Finance                    0      2               0        2             0
Data analyst               0      1               1        0             1
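A self-contained sketch of the crosstab variant, rebuilt from the sample rows in the question (the education column is left out because it is not used). Since the question says the diagonal is irrelevant, the sketch also builds a copy with the diagonal set to 0; the names indicator, co and co_offdiag are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5],
    "area_job_application": [
        "Construction", "Sales", "Administration",
        "Finance", "Sales",
        "Finance", "Sales",
        "Administration", "Sales", "Data analyst",
        "Administration", "Sales",
    ],
})

order = df["area_job_application"].unique()   # keep first-seen order of sectors

# One row per id, one column per sector, cell = number of applications
indicator = pd.crosstab(df["id"], df["area_job_application"])

# sector x sector co-occurrence: sum over ids of count_i * count_j
co = indicator.T.dot(indicator).reindex(index=order, columns=order)

# Copy with the diagonal zeroed, since same-sector pairs are not of interest
values = co.to_numpy(copy=True)
np.fill_diagonal(values, 0)
co_offdiag = pd.DataFrame(values, index=co.index, columns=co.columns)
print(co_offdiag)
```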

Python3 pandas: data frame grouped by a column (such as name), then extract a number of rows for each group

There is a data frame called df, as follows:
name id age text
a 1 1 very good, and I like him
b 2 2 I play basketball with his brother
c 3 3 I hope to get a offer
d 4 4 everything goes well, I think
a 1 1 I will visit china
b 2 2 no one can understand me, I will solve it
c 3 3 I like followers
d 4 4 maybe I will be good
a 1 1 I should work hard to finish my research
b 2 2 water is the source of earth, I agree it
c 3 3 I hope you can keep in touch with me
d 4 4 My baby is very cute, I like him
The data frame is grouped by name, and I want to extract a number of rows per group by position (for example, the first 2) into a new dataframe, df_new:
name id age text
a 1 1 very good, and I like him
a 1 1 I will visit china
b 2 2 I play basketball with his brother
b 2 2 no one can understand me, I will solve it
c 3 3 I hope to get a offer
c 3 3 I like followers
d 4 4 everything goes well, I think
d 4 4 maybe I will be good
df_new = (df.groupby('screen_name'))[0:2]
But this raises an error:
hash(key)
TypeError: unhashable type: 'slice'
Try using head() instead. Indexing a groupby object with [] selects columns rather than rows, which is why the slice fails.
import pandas as pd
from io import StringIO
buff = StringIO('''
name,id,age,text
a,1,1,"very good, and I like him"
b,2,2,I play basketball with his brother
c,3,3,I hope to get a offer
d,4,4,"everything goes well, I think"
a,1,1,I will visit china
b,2,2,"no one can understand me, I will solve it"
c,3,3,I like followers
d,4,4,maybe I will be good
a,1,1,I should work hard to finish my research
b,2,2,"water is the source of earth, I agree it"
c,3,3,I hope you can keep in touch with me
d,4,4,"My baby is very cute, I like him"
''')
df = pd.read_csv(buff)
Use head() instead of [:2], then sort by name:
df_new = df.groupby('name').head(2).sort_values('name')
print(df_new)
name id age text
0 a 1 1 very good, and I like him
4 a 1 1 I will visit china
1 b 2 2 I play basketball with his brother
5 b 2 2 no one can understand me, I will solve it
2 c 3 3 I hope to get a offer
6 c 3 3 I like followers
3 d 4 4 everything goes well, I think
7 d 4 4 maybe I will be good
Another solution with iloc:
df_new = df.groupby('name').apply(lambda x: x.iloc[:2]).reset_index(drop=True)
print(df_new)
name id age text
0 a 1 1 very good, and I like him
1 a 1 1 I will visit china
2 b 2 2 I play basketball with his brother
3 b 2 2 no one can understand me, I will solve it
4 c 3 3 I hope to get a offer
5 c 3 3 I like followers
6 d 4 4 everything goes well, I think
7 d 4 4 maybe I will be good
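A third variant, sketched here as an assumption-labelled alternative and not taken from the answers above: filter on cumcount(), which numbers the rows inside each group. This avoids groupby.apply, which in recent pandas versions may warn about or change how the grouping column is handled. A stable sort (mergesort) is used so the original row order within each name is kept.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["a", "b", "c", "d"] * 3,
    "id": [1, 2, 3, 4] * 3,
    "age": [1, 2, 3, 4] * 3,
    "text": [
        "very good, and I like him",
        "I play basketball with his brother",
        "I hope to get a offer",
        "everything goes well, I think",
        "I will visit china",
        "no one can understand me, I will solve it",
        "I like followers",
        "maybe I will be good",
        "I should work hard to finish my research",
        "water is the source of earth, I agree it",
        "I hope you can keep in touch with me",
        "My baby is very cute, I like him",
    ],
})

# cumcount() labels the rows of each group 0, 1, 2, ... in original order
position = df.groupby("name").cumcount()

df_new = (
    df[position < 2]                          # keep the first two rows per name
    .sort_values("name", kind="mergesort")    # stable: preserves order within a name
    .reset_index(drop=True)
)
print(df_new)
```

Changing the comparison (for example position == 1) picks out a specific row position per group in the same way.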

Excel VBA - Group Data by Column A, Get the Range Value from C - Copy results to New Sheet

I've been searching for an example of this grouping and tested a few code snippets, but I haven't been able to adapt them to what I need, as I'm just getting to know Excel VBA.
What I'm trying to do is to group by column A then get the range of the values used in that category which are in column C and get the results in a new worksheet.
Main Sheet.
A B C D
3 Baseball 4 Blue
2 Football 1 Red
2 Football 3 Red
3 Baseball 4 Blue
1 Soccer 2 Green
3 Baseball 4 Blue
1 Soccer 3 Green
1 Soccer 5 Green
2 Football 2 Red
Expected Results:
New Sheet.
A B C D
1 Soccer 2-5 Green
2 Football 1-3 Red
3 Baseball 4 Blue
If you need column C to be a range of values, e.g. 2-5, then it is text in Excel. A pivot table can only return Min, Max, Sum, or Average, not a range of values.
You will need to use VBA to solve the problem.
First, copy columns A, B, and D somewhere, then use Remove Duplicates to find the unique combinations.
E.g. (assuming you have some new records in the future):
A B C D
3 Baseball 4 Blue
2 Football 1 Red
2 Football 3 Red
3 Baseball 4 Blue
1 Soccer 2 Green
3 Baseball 4 Blue
1 Soccer 3 Green
1 Soccer 5 Green
2 Football 2 Red
4 Tennis 3 Yellow
Then you should have something like below:
A B D
1 Soccer Green
2 Football Red
3 Baseball Blue
4 Tennis Yellow
Then use a loop to find the range of values for each unique combination (here we have 4 unique records).
*** This assumes you know how to use a loop to find the range for each combination.
I've actually figured this out:
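For Each key In fCatId.Keys
    'Debug.Print fCatId(key), key
    With wshcore
        llastrow = .Range("A" & .Rows.Count).End(xlUp).Row
        ' Toggle AutoFilter on the data range, then filter column A on the current category
        .Range("A1:N" & llastrow).AutoFilter
        .Range("A1:N" & llastrow).AutoFilter Field:=1, Criteria1:=fCatId(key)
        ' Subtotal 5 = MIN and 4 = MAX, computed over the rows left visible by the filter
        lwmin = WorksheetFunction.Subtotal(5, .Range("H:H"))
        lwmax = WorksheetFunction.Subtotal(4, .Range("H:H"))
    End With
Next key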
I'm getting column A: fcatid, B: key, lwmin: the lowest value, and lwmax: the highest.