How to count the distinct values across a column in pandas

I have a dataframe like:
Company  Date        Country
ABC      2017-09-17  USA
BCD      2017-09-16  USA
ABC      2017-09-17  USA
BCD      2017-09-16  USA
BCD      2017-09-16  USA
ABC      2017-09-19  USA
I want to get a resultant df as:
Company  No: of distinct Days
ABC      2
BCD      1
How do I do it?

This should work:
df[['Company', 'Date']].drop_duplicates()['Company'].value_counts()
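To check this quickly, here is a minimal, self-contained sketch; the DataFrame construction below is an assumed reconstruction of the question's data:

import pandas as pd

# assumed reconstruction of the question's data
df = pd.DataFrame({
    'Company': ['ABC', 'BCD', 'ABC', 'BCD', 'BCD', 'ABC'],
    'Date': ['2017-09-17', '2017-09-16', '2017-09-17',
             '2017-09-16', '2017-09-16', '2017-09-19'],
    'Country': ['USA'] * 6,
})

# keep one row per (Company, Date) pair, then count rows per company
counts = df[['Company', 'Date']].drop_duplicates()['Company'].value_counts()
print(counts)  # ABC 2, BCD 1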

You can use the nunique method of groupby objects:
df.groupby('Company')['Date'].nunique()
Out:
Company
ABC    2
BCD    1
Name: Date, dtype: int64
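The groupby version gives the same counts, and reset_index turns the Series back into a two-column frame like the requested output (a sketch reusing df from above; the column label is taken from the question, so the naming is illustrative):

result = df.groupby('Company')['Date'].nunique().reset_index(name='No: of distinct Days')
print(result)
#   Company  No: of distinct Days
# 0     ABC                     2
# 1     BCD                     1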

Related

With 2 group by columns, how can I do a subtotal by each of the group by columns?

I'm new to pandas. I'm trying to do a subtotal within 2 group-by columns. I have managed to figure out how to sum using 2 group-by attributes, but within that I'm also trying to do a subtotal. Please see below for an example -
df.groupby(['Fruit','Name'])['Number'].sum()
Output
Fruit    Name   Number
Apples   Bob        16
         Mike        9
         Steve      10
                ------
                    35
                ------
Grapes   Bob        35
         Tom        87
         Tony       15
                ------
                   137
                ------
Oranges  Bob        67
         Mike       57
         Tom        15
         Tony        1
What I'm looking for is to show a subtotal by each fruit within the dataframe. Thank you!
You can use a mix of unstack, assign, and stack to do this (the result comes back as float64 because unstack fills missing Fruit/Name combinations with NaN):
sums = (df.groupby(['Fruit', 'Name'])['Number'].sum()
        .unstack()
        .assign(Total=df.groupby('Fruit')['Number'].sum())
        .stack())
Output:
>>> sums
Fruit    Name
Apples   Bob       16.0
         Mike       9.0
         Steve     10.0
         Total     35.0
Grapes   Bob       35.0
         Tom       87.0
         Tony      15.0
         Total    137.0
Oranges  Bob       67.0
         Mike      57.0
         Tom       15.0
         Tony       1.0
         Total    140.0
dtype: float64
IIUC, you can call sum with level=0 on the grouped result, or use groupby with level=0:
df.sum(level=0) # Will be deprecated
# or
df.groupby(level=0).sum()
Output:
Fruit
Apples 35
Grapes 137
Oranges 140
Name: Number, dtype: int64
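For reference, here is a self-contained sketch, with the sample data reconstructed from the question, that reproduces both answers:

import pandas as pd

# assumed reconstruction of the question's data
df = pd.DataFrame({
    'Fruit': ['Apples'] * 3 + ['Grapes'] * 3 + ['Oranges'] * 4,
    'Name': ['Bob', 'Mike', 'Steve', 'Bob', 'Tom', 'Tony',
             'Bob', 'Mike', 'Tom', 'Tony'],
    'Number': [16, 9, 10, 35, 87, 15, 67, 57, 15, 1],
})

per_name = df.groupby(['Fruit', 'Name'])['Number'].sum()

# subtotal per fruit: group the result again on the first index level
print(per_name.groupby(level=0).sum())  # Apples 35, Grapes 137, Oranges 140

# or interleave the totals with the detail rows via unstack/assign/stack
sums = (per_name.unstack()
        .assign(Total=df.groupby('Fruit')['Number'].sum())
        .stack())
print(sums)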

In Oracle SQL, add max values (row by row) from another table when other columns of the table are already populated

I have two tables, A and B. Table B has 4 columns (ID, NAME, CITY, COUNTRY); 3 of the columns have values and one column (ID) is all NULLs. I want to populate table B's ID column starting from the max value of table A's ID column, with the IDs in B in increasing order.
TABLE A
ID NAME
------- -------
231 Bred
134 Mick
133 Tom
233 Helly
232 Kathy
TABLE B
ID NAME CITY COUNTRY
------- ------- ---------- -----------
(NULL) Alex NY USA
(NULL) Jon TOKYO JAPAN
(NULL) Jeff TORONTO CANADA
(NULL) Jerry PARIS FRANCE
(NULL) Vicky LONDON ENGLAND
The ID column in B should be populated as MAX(ID) + 1 from table A, incrementing for each row. The output should look like this:
TABLE B
ID NAME CITY COUNTRY
------ -------- ---------- -----------
234 Alex NY USA
235 Jon TOKYO JAPAN
236 Jeff TORONTO CANADA
237 Jerry PARIS FRANCE
238 Vicky LONDON ENGLAND
Perhaps the simplest method is to create a one-time sequence for the update:
create sequence temp_b_seq;
update b
set id = (select max(id) from a) + temp_b_seq.nextval;
drop sequence temp_b_seq;
You could actually initialize the sequence with the maximum value from a, but that requires dynamic SQL, so this seems like the simplest approach. Oracle should be smart enough to run the subquery only once.

SPARK SQL query for match output

I have 2 datasets as below:
ds1:
CustId  Name    Street1     City
=================================
1       Ron     1 Mn strt   Hyd
2       Ashok   westend av  Delhi
3       Rajesh  5th Cross   Mumbai
4       Venki   2nd Main    NY
ds2:
Id  CustName  CustAddr1    City
=========================================
11  Ron       1 Mn Street  Hyd
12  Ron       eastend avn  Patna
13  Rajesh    2nd Main     Mumbai
14  Girish    100ft rd     BLR
15  Dinesh    60ft         Mum
16  Rajesh    1st Cross    Mumbai
I am trying to find exact matches like ds1.Name --> ds2.CustName and ds1.City --> ds2.City.
Output:
GrpID  Rec_Id  Count  ds1.cond         Rec_Id  Count  ds2.cond
======================================================================
1      1       1      Ron + Hyd        1001    1      Ron + Hyd
2      2       1      Rajesh + Mumbai  1002    2      Rajesh + Mumbai
How do I write a (Spark) SQL query for it?
I tried
final Dataset<Row> rslt = spark.sql("select * from ds1 JOIN ds2 ON ds1.Name == ds2.CustName");
(using only the name), but it gives an output of m×n rows for m matching rows in ds1 and n matching rows in ds2.
This is my first attempt at this. Any suggestions?
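There is no accepted answer in this excerpt, but one way to avoid the m×n explosion is to aggregate each side down to one row per (name, city) key before joining. A hedged PySpark sketch, assuming ds1 and ds2 are registered as temp views, and using MIN(...) as an arbitrary choice of representative record id:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Collapse each dataset to one row per (name, city) key, keeping a
# count and a representative id, then join key-to-key so each match
# produces a single row instead of m x n rows.
result = spark.sql("""
    SELECT g1.rec_id AS ds1_rec_id, g1.cnt AS ds1_count,
           CONCAT(g1.Name, ' + ', g1.City) AS cond,
           g2.rec_id AS ds2_rec_id, g2.cnt AS ds2_count
    FROM (SELECT Name, City, MIN(CustId) AS rec_id, COUNT(*) AS cnt
          FROM ds1 GROUP BY Name, City) g1
    JOIN (SELECT CustName, City, MIN(Id) AS rec_id, COUNT(*) AS cnt
          FROM ds2 GROUP BY CustName, City) g2
      ON g1.Name = g2.CustName AND g1.City = g2.City
""")
result.show()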

SAS Transpose and summarize

I'm working on the following scenario in SAS.
Input 1
AccountNumber   Loans
123             abc, def, ghi
456             jkl, mnopqr, stuv
789             w, xyz
Output 1
AccountNumbers  Loans
123             abc
123             def
123             ghi
456             jkl
456             mnopqr
456             stuv
789             w
789             xyz
Input 2
AccountNumbers  Loans
123             15-abc
123             15-def
123             15-ghi
456             99-jkl
456             99-mnopqr
456             99-stuv
789             77-w
789             77-xyz
Output 2
AccountNumber   Loans
123             15-abc, 15-def, 15-ghi
456             99-jkl, 99-mnopqr, 99-stuv
789             77-w, 77-xyz
I managed to get Input 2 from Output 1; I just need Output 2 now.
I will really appreciate the help. Thanks!
Try this, replacing [Input 2] with the actual name of your Input 2 table.
data output2 (drop=loans);
    /* DOW loop: read all records of one ACCOUNTNUMBERS group, then */
    /* write a single combined record per group at the end of the step */
    do until (last.accountnumbers);
        set [Input 2];
        by accountnumbers;
        length loans_combined $100;
        loans_combined = catx(', ', loans_combined, loans);
    end;
run;

Quarterly mean by group

I have a DataFrame with monthly observations (var1, var2) for a group (Area)
date        var1  var2  Area
2008-03-01     2    22  OH
2008-02-01     3    33  OH
2008-01-01     4    44  OH
... etc
2008-03-01   111  1111  AK
2008-02-01   222  2222  AK
2008-01-01   333  3333  AK
I wish to 'downsample' these variables to quarterly data by taking the 3-month mean. I.e. the first quarterly observation (var1) for 'OH' should be (2+3+4)/3 = 3.
How do I do this in pandas? Thank you
EDIT: Here is what I intended the output to be:
dateQtr  var1  var2  Area
2008-Q1     3    33  OH
2007-Q4   ...   ...  OH
... etc
2008-Q1   222  2222  AK
If you set the index to 'date' then you can resample quarterly:
In [114]:
df.resample('Q')[['var1', 'var2']].mean()
Out[114]:
             var1    var2
date
2008-03-31  112.5  1127.5
So on your existing df:
In [116]:
df.set_index('date').resample('Q')[['var1', 'var2']].mean()
Out[116]:
             var1    var2
date
2008-03-31  112.5  1127.5
Note that this pools both Areas into one mean; see the EDIT below for the per-Area version.
EDIT
Thanks to @JohnE for pointing this out:
In [134]:
df.set_index('date').groupby('Area')[['var1', 'var2']].resample('Q').mean().reset_index()
Out[134]:
  Area       date  var1  var2
0   AK 2008-03-31   222  2222
1   OH 2008-03-31     3    33
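If you want the 2008-Q1-style labels from the question's expected output, here is a hedged sketch using pd.Grouper and to_period, assuming the sample data above (the DataFrame construction is a reconstruction):

import pandas as pd

# assumed reconstruction of the question's data
df = pd.DataFrame({
    'date': pd.to_datetime(['2008-03-01', '2008-02-01', '2008-01-01'] * 2),
    'var1': [2, 3, 4, 111, 222, 333],
    'var2': [22, 33, 44, 1111, 2222, 3333],
    'Area': ['OH'] * 3 + ['AK'] * 3,
})

# group by Area and calendar quarter in one step, then relabel the
# quarter-end timestamps as period labels like 2008Q1
out = (df.groupby(['Area', pd.Grouper(key='date', freq='Q')])[['var1', 'var2']]
       .mean()
       .reset_index())
out['dateQtr'] = out['date'].dt.to_period('Q')
print(out[['dateQtr', 'var1', 'var2', 'Area']])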