Excel VBA - Group Data by Column A, Get the Range of Values from C, Copy Results to a New Sheet

I've been trying to search for an example of this grouping and have tested a few code snippets, but I haven't been able to adapt them to what I need, as I'm just getting to know Excel VBA.
What I'm trying to do is group by column A, get the range of the values used in that category (which are in column C), and put the results in a new worksheet.
Main Sheet.
A B C D
3 Baseball 4 Blue
2 Football 1 Red
2 Football 3 Red
3 Baseball 4 Blue
1 Soccer 2 Green
3 Baseball 4 Blue
1 Soccer 3 Green
1 Soccer 5 Green
2 Football 2 Red
Expected Results:
New Sheet.
A B C D
1 Soccer 2-5 Green
2 Football 1-3 Red
3 Baseball 4 Blue

If you need column C to be a range of values, e.g. 2-5, then it's text in Excel. A pivot table can only return Min, Max, Sum, Average, and so on, but not a range of values.
You will need VBA to solve the problem.
First, copy columns A, B, and D somewhere else, then use Remove Duplicates
to find the unique combinations (a code sketch for this step follows the tables below).
E.g. (assuming you get some new records in the future):
A B C D
3 Baseball 4 Blue
2 Football 1 Red
2 Football 3 Red
3 Baseball 4 Blue
1 Soccer 2 Green
3 Baseball 4 Blue
1 Soccer 3 Green
1 Soccer 5 Green
2 Football 2 Red
4 Tennis 3 Yellow
Then you should have something like below:
A B D
1 Soccer Green
2 Football Red
3 Baseball Blue
4 Tennis Yellow
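For reference, the copy-and-dedupe step can also be done in code. A minimal sketch, assuming the raw data sits in columns A:D of a sheet named "Main" with no header row and the unique combinations are built in columns F:H of the same sheet (the sheet name and column positions are assumptions):
With ThisWorkbook.Worksheets("Main")
    .Columns("A").Copy .Range("F1")   ' ID
    .Columns("B").Copy .Range("G1")   ' sport
    .Columns("D").Copy .Range("H1")   ' colour
    ' Keep only the unique ID / sport / colour combinations.
    .Range("F:H").RemoveDuplicates Columns:=Array(1, 2, 3), Header:=xlNo
End With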
Then use a loop to find the range of values for each unique combination (here we have 4 unique records).
*** This assumes you know how to use a loop to find the range for each combination; a sketch of one possible loop follows.
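Here is a minimal sketch of that loop, in case it helps. It assumes the raw data is in columns A:D of a sheet named "Main" (no header row), the unique combinations from Remove Duplicates are in columns F:H of the same sheet, and the summary goes to a sheet named "Summary"; all of those names and positions are assumptions, so adjust them to your workbook.
Sub BuildRangeSummary()
    Dim wsMain As Worksheet, wsNew As Worksheet
    Dim lastData As Long, lastUniq As Long
    Dim u As Long, r As Long
    Dim vMin As Double, vMax As Double
    Dim found As Boolean

    Set wsMain = ThisWorkbook.Worksheets("Main")
    Set wsNew = ThisWorkbook.Worksheets("Summary")

    lastData = wsMain.Cells(wsMain.Rows.Count, "A").End(xlUp).Row
    lastUniq = wsMain.Cells(wsMain.Rows.Count, "F").End(xlUp).Row

    For u = 1 To lastUniq
        ' Scan the raw data and track the min and max of column C for this ID.
        found = False
        For r = 1 To lastData
            If wsMain.Cells(r, "A").Value = wsMain.Cells(u, "F").Value Then
                If Not found Then
                    vMin = wsMain.Cells(r, "C").Value
                    vMax = vMin
                    found = True
                Else
                    If wsMain.Cells(r, "C").Value < vMin Then vMin = wsMain.Cells(r, "C").Value
                    If wsMain.Cells(r, "C").Value > vMax Then vMax = wsMain.Cells(r, "C").Value
                End If
            End If
        Next r

        ' Write ID, sport, the value range as text, and colour to the summary sheet.
        wsNew.Cells(u, "A").Value = wsMain.Cells(u, "F").Value
        wsNew.Cells(u, "B").Value = wsMain.Cells(u, "G").Value
        If found Then
            If vMin = vMax Then
                wsNew.Cells(u, "C").Value = vMin
            Else
                wsNew.Cells(u, "C").NumberFormat = "@"   ' keep "2-5" as text, not a date
                wsNew.Cells(u, "C").Value = vMin & "-" & vMax
            End If
        End If
        wsNew.Cells(u, "D").Value = wsMain.Cells(u, "H").Value
    Next u
End Sub
Formatting the result cell as text matters because a value like 2-5 would otherwise be read by Excel as a date, which is also why a pivot table cannot produce this column directly.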

I've actually figured this out:
For Each key In fCatId.Keys
    'Debug.Print fCatId(key), key
    With wshcore
        llastrow = .Range("A" & .Rows.Count).End(xlUp).Row
        .Range("A1:N" & llastrow).AutoFilter
        .Range("A1:N" & llastrow).AutoFilter Field:=1, Criteria1:=fCatId(key)
        ' SUBTOTAL ignores the filtered-out rows: 5 = MIN, 4 = MAX.
        lwmin = WorksheetFunction.Subtotal(5, .Range("H:H"))
        lwmax = WorksheetFunction.Subtotal(4, .Range("H:H"))
    End With
Next key
I'm getting column A: fCatId, column B: key, lwmin: the lowest value, and lwmax: the highest value.
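To land those values on a new sheet (the original goal of the question), they can be written out inside the same loop before End With / Next. A hedged continuation, where wsNew and the running row counter outRow are assumptions that would need to be set up earlier in the procedure:
        outRow = outRow + 1
        wsNew.Cells(outRow, "A").Value = fCatId(key)
        wsNew.Cells(outRow, "B").Value = key
        wsNew.Cells(outRow, "C").NumberFormat = "@"   ' so "1-3" is not read as a date
        wsNew.Cells(outRow, "C").Value = lwmin & "-" & lwmax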

Related

Calculate intersecting sums from flat DataFrame for a heatmap

I'm trying to wrangle some data to show how many items a range of people have in common. The goal is to show this data in a heatmap format via Seaborn to understand these overlaps visually.
Here's some sample data:
import pandas as pd

demo_df = pd.DataFrame([
    ("Get Back", 1, 0, 2),
    ("Help", 5, 2, 0),
    ("Let It Be", 0, 2, 2)
], columns=["Song", "John", "Paul", "Ringo"])
demo_df = demo_df.set_index("Song")  # set_index returns a new frame, so assign it back
John Paul Ringo
Song
Get Back 1 0 2
Help 5 2 0
Let It Be 0 2 2
I don't need a breakdown by song, just the total of shared items. The resulting data would show a sum of how many items they share like this:
Name   John  Paul  Ringo
John      -     7      3
Paul      7     -      4
Ringo     3     4      -
So far I've tried a few options with groupby and unstack but haven't been able to work out how to cross-match the names into both the rows and the column headers.
We can do a dot product and then fill the diagonal:
import numpy as np

df = demo_df  # the frame indexed by "Song" from the question
out = df.T.dot(df.ne(0)) + df.T.ne(0).dot(df)
np.fill_diagonal(out.values, 0)
out
Out[176]:
John Paul Ringo
John 0 7 3
Paul 7 0 4
Ringo 3 4 0

Multilevel Indexing with Groupby

Being new to Python, I'm struggling to apply other questions about the groupby function to my data. A sample of the data frame:
ID Condition Race Gender Income
1 1 White Male 1
2 2 Black Female 2
3 3 Black Male 5
4 4 White Female 3
...
I am trying to use the groupby function to get a count of how many Blacks/Whites, males/females, and income levels (12 of them) there are in each of the four conditions. Each of the columns, including income, is a string (i.e., categorical).
I'd like to get something such as
Condition Race Gender Income Count
1 White Male 1 19
1 White Female 1 17
1 Black Male 1 22
1 Black Female 1 24
1 White Male 2 12
1 White Female 2 15
1 Black Male 2 17
1 Black Female 2 19
...
Everything I've tried has come back very wrong, so I don't think I'm anywhere near right, but I've been using variations of
Data.groupby(['Condition','Gender','Race','Income'])['ID'].count()
When I run the above line I just get a two-column matrix with an indecipherable index (e.g., f2df9ecc...); the second column is labeled ID with what appear to be count numbers. Any help is appreciated.
If you investigate the resulting dataframe, you will see that the grouping columns have gone into the index, so just reset the index:
df = Data.groupby(['Condition','Gender','Race','Income'])['ID'].count().reset_index()
That was mainly to demonstrate; since you know what you want, you can specify the argument as_index directly, as follows:
df = Data.groupby(['Condition','Gender','Race','Income'],as_index=False)['ID'].count()
Also, since you want the last column to be 'count':
df = df.rename(columns={'ID':'count'})

How do you “pivot” using conditions, aggregation, and concatenation in Pandas?

I have a dataframe in a format such as the following:
Index Name Fruit Quantity
0 John Apple Red 10
1 John Apple Green 5
2 John Orange Cali 12
3 Jane Apple Red 10
4 Jane Apple Green 5
5 Jane Orange Cali 18
6 Jane Orange Spain 2
I need to turn it into a dataframe such as this:
Index Name All Fruits Apples Total Oranges Total
0 John Apple Red, Apple Green, Orange Cali 15 12
1 Jane Apple Red, Apple Green, Orange Cali, Orange Spain 15 20
The question is: how do I do this? I have looked at the groupby docs as well as a number of posts on pivot and aggregation, but translating that into this use case somehow escapes me. Any help or pointers would be much appreciated.
Cheers!
Use GroupBy.agg with join, create a helper column F from the first word of Fruit via split, pass it to DataFrame.pivot_table, and finally join everything together with DataFrame.join:
df1 = df.groupby('Name', sort=False)['Fruit'].agg(', '.join)
df2 = (df.assign(F=df['Fruit'].str.split().str[0])
         .pivot_table(index='Name',
                      columns='F',
                      values='Quantity',
                      aggfunc='sum')
         .add_suffix(' Total'))
df3 = df1.to_frame('All Fruits').join(df2).reset_index()
print(df3)
Name All Fruits Apple Total \
0 John Apple Red, Apple Green, Orange Cali 15
1 Jane Apple Red, Apple Green, Orange Cali, Orange Spain 15
Orange Total
0 12
1 20

Python3 pandas: data frame grouped by a column (such as name), then extract a number of rows for each group

There is a data frame called df, as follows:
name id age text
a 1 1 very good, and I like him
b 2 2 I play basketball with his brother
c 3 3 I hope to get a offer
d 4 4 everything goes well, I think
a 1 1 I will visit china
b 2 2 no one can understand me, I will solve it
c 3 3 I like followers
d 4 4 maybe I will be good
a 1 1 I should work hard to finish my research
b 2 2 water is the source of earth, I agree it
c 3 3 I hope you can keep in touch with me
d 4 4 My baby is very cute, I like him
The data frame is grouped by name, and then I want to extract a number of rows (for example, 2) from each group into the new dataframe df_new.
name id age text
a 1 1 very good, and I like him
a 1 1 I will visit china
b 2 2 I play basketball with his brother
b 2 2 no one can understand me, I will solve it
c 3 3 I hope to get a offer
c 3 3 I like followers
d 4 4 everything goes well, I think
d 4 4 maybe I will be good
df_new = (df.groupby('screen_name'))[0:2]
But there is an error:
hash(key)
TypeError: unhashable type: 'slice'
Try using head() instead.
import pandas as pd
from io import StringIO
buff = StringIO('''
name,id,age,text
a,1,1,"very good, and I like him"
b,2,2,I play basketball with his brother
c,3,3,I hope to get a offer
d,4,4,"everything goes well, I think"
a,1,1,I will visit china
b,2,2,"no one can understand me, I will solve it"
c,3,3,I like followers
d,4,4,maybe I will be good
a,1,1,I should work hard to finish my research
b,2,2,"water is the source of earth, I agree it"
c,3,3,I hope you can keep in touch with me
d,4,4,"My baby is very cute, I like him"
''')
df = pd.read_csv(buff)
Using head() instead of [:2], then sorting by name:
df_new = df.groupby('name').head(2).sort_values('name')
print(df_new)
name id age text
0 a 1 1 very good, and I like him
4 a 1 1 I will visit china
1 b 2 2 I play basketball with his brother
5 b 2 2 no one can understand me, I will solve it
2 c 3 3 I hope to get a offer
6 c 3 3 I like followers
3 d 4 4 everything goes well, I think
7 d 4 4 maybe I will be good
Another solution with iloc:
df_new = df.groupby('name').apply(lambda x: x.iloc[:2]).reset_index(drop=True)
print(df_new)
name id age text
0 a 1 1 very good, and I like him
1 a 1 1 I will visit china
2 b 2 2 I play basketball with his brother
3 b 2 2 no one can understand me, I will solve it
4 c 3 3 I hope to get a offer
5 c 3 3 I like followers
6 d 4 4 everything goes well, I think
7 d 4 4 maybe I will be good

Removing duplicates from many Excel sheets

I have a question: is there any fast way to remove duplicate rows across two Excel spreadsheets? After searching, I can do it by comparing rows at the same position in the two spreadsheets (VBA). But I want to check whether a row from sheet one is included anywhere in sheet two; if exactly the same row exists in sheet two, it should be removed. So far I can only do it when they are at the same row position (e.g. row 1 and row 1).
Thanks in advance for any kind of help.
I can think of a workaround for this:
Create a column at the end of each row which is a concatenation of all the columns of that particular row. Let's say below are the two tables on the two Excel sheets:
sheet1
A B C D(Concat)
1 2 3 123
4 5 6 456
7 8 9 789
1 3 5 135
4 3 2 432
sheet2
A B C D(Concat)
2 3 4 234
1 1 1 111
1 2 3 123
2 2 2 222
4 5 6 456
We will now identify the duplicate rows based on the last concatenated column. Using the formula =IF(ISNUMBER(MATCH(D4,Sheet1!D:D,0)),"DUP","NONDUP") in the second sheet, we can identify the rows which are already present in sheet1, irrespective of where the row appears in sheet1 relative to sheet2.
Result on Sheet2 shows up as below:
A B C D E(Result)
2 3 4 234 NONDUP
1 1 1 111 NONDUP
1 2 3 123 DUP
2 2 2 222 NONDUP
4 5 6 456 DUP
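If you would rather do the whole thing in VBA instead of helper formulas, here is a minimal sketch of the same idea: build the concatenated key for every row of Sheet1, then walk Sheet2 bottom-up and delete any row whose key already exists. It assumes the data sits in columns A:C starting at row 1 with no headers, on sheets named "Sheet1" and "Sheet2" (all assumptions); using a "|" delimiter in the key avoids false matches such as 1,23 versus 12,3 that a plain concatenation could produce.
Sub RemoveRowsAlreadyInSheet1()
    Dim ws1 As Worksheet, ws2 As Worksheet
    Dim seen As Object
    Dim r As Long, lastRow As Long
    Dim key As String

    Set ws1 = ThisWorkbook.Worksheets("Sheet1")
    Set ws2 = ThisWorkbook.Worksheets("Sheet2")
    Set seen = CreateObject("Scripting.Dictionary")

    ' Collect the concatenated key of every row on Sheet1.
    lastRow = ws1.Cells(ws1.Rows.Count, "A").End(xlUp).Row
    For r = 1 To lastRow
        key = ws1.Cells(r, "A").Value & "|" & ws1.Cells(r, "B").Value & "|" & ws1.Cells(r, "C").Value
        seen(key) = True
    Next r

    ' Walk Sheet2 bottom-up so deleting a row does not shift the rows still to be checked.
    lastRow = ws2.Cells(ws2.Rows.Count, "A").End(xlUp).Row
    For r = lastRow To 1 Step -1
        key = ws2.Cells(r, "A").Value & "|" & ws2.Cells(r, "B").Value & "|" & ws2.Cells(r, "C").Value
        If seen.Exists(key) Then ws2.Rows(r).Delete
    Next r
End Sub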