Select column with the most unique values from csv, python - pandas

I'm trying to come up with a way to select from a csv file the one numeric column that shows the most unique values. If there are multiple with the same amount of unique values it should be the left-most one. The output should be either the name of the column or the index.
Position,Experience in Years,Salary,Starting Date,Floor,Room
Middle Management,5,5584.10,2019-02-03,12,100
Lower Management,2,3925.52,2016-04-18,12,100
Upper Management,1,7174.46,2019-01-02,10,200
Middle Management,5,5461.25,2018-02-02,14,300
Middle Management,7,7471.43,2017-09-09,17,400
Upper Management,10,12021.31,2020-01-01,11,500
Lower Management,2,2921.92,2019-08-17,11,500
Middle Management,5,5932.94,2017-11-21,15,600
Upper Management,7,10192.14,2018-08-18,18,700
So here I would want 'Floor' or 4 as my output, given that Floor and Room have the same number of unique values but Floor is the left-most one. (I need it in pure Python; I can't use pandas.)
I have this nested in a whole bunch of other code for what I need to do as a whole. I will spare you the details, but these are the elements used in the code:
new_types_list = [str, int, str, datetime.datetime, int, int] #all the datatypes of the columns
l1_listed = ['Position', 'Experience in Years', 'Salary', 'Starting Date', 'Floor', 'Room'] #the header for each column
difference = [3, 5, 9, 9, 7, 7] #the number of unique values each column has
And here I try to do exactly what I mentioned before:
another_list = []  # now I create another list
for i in new_types_list:  # this is where the error occurs: it only fills the list with the index of the first integer 3 times instead of with the individual indices
    if i == int:
        another_list.append(new_types_list.index(i))
integer_listi = [difference[i] for i in another_list]  # and this list is the corresponding unique values from the integers
for i in difference:  # now we want to find out the one that is the highest
    if i == max(integer_listi):
        chosen_one_i = difference.index(i)  # the index of the column with the most unique values is the chosen one
MUV_LMNC = l1_listed[chosen_one_i]

You can use .nunique() to get the number of unique values in each column:
import pandas as pd
df = pd.read_csv("your_file.csv")
print(df.nunique())
Prints:
Position 3
Experience in Years 5
Salary 9
Starting Date 9
Floor 7
Room 7
dtype: int64
Then to find max, use .idxmax():
print(df.nunique().idxmax())
Prints:
Salary
EDIT: To select only integer columns (.idxmax() returns the first maximum, so ties resolve to the left-most column):
print(df.select_dtypes(include="integer").nunique().idxmax())
Prints:
Floor
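Since the question rules out pandas, the same selection can be sketched with only the standard library. This is a minimal sketch, not the answerer's method. It assumes the CSV content is available as a string (CSV_TEXT is a hypothetical name holding the sample data), treats a column as an integer column when every value parses with int(), and relies on max() returning the first maximal item so that ties resolve to the left-most column. The helper names is_int and most_unique_int_column are made up for illustration.

```python
import csv
import io

# Sample data from the question (assumed to be available as a string).
CSV_TEXT = """Position,Experience in Years,Salary,Starting Date,Floor,Room
Middle Management,5,5584.10,2019-02-03,12,100
Lower Management,2,3925.52,2016-04-18,12,100
Upper Management,1,7174.46,2019-01-02,10,200
Middle Management,5,5461.25,2018-02-02,14,300
Middle Management,7,7471.43,2017-09-09,17,400
Upper Management,10,12021.31,2020-01-01,11,500
Lower Management,2,2921.92,2019-08-17,11,500
Middle Management,5,5932.94,2017-11-21,15,600
Upper Management,7,10192.14,2018-08-18,18,700
"""


def is_int(text):
    """Return True if the cell text parses as an integer."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def most_unique_int_column(csv_text):
    """Return (index, name) of the integer column with the most unique values."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    columns = list(zip(*body))  # transpose rows into columns
    # enumerate yields every position, unlike list.index, which always
    # returns the first match
    int_cols = [i for i, col in enumerate(columns) if all(is_int(v) for v in col)]
    # max() keeps the first maximal item, so ties go to the left-most column
    best = max(int_cols, key=lambda i: len(set(columns[i])))
    return best, header[best]


index, name = most_unique_int_column(CSV_TEXT)
print(index, name)
```

Note that max() raises ValueError if the file has no integer columns, so real code would want to handle that case.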

Related

how to sum rows in my dataframe Pandas with specific condition?

Could anyone help me?
I want to sum the values using something like:
print (...+....+)
for example:
a b
France 2
Italie 15
Croatie 7
I want to make the sum of France and Croatie.
Thank you for your help !
One possible solution:
set column a as the index,
using loc select rows for the "wanted" values,
take column b,
sum the values found.
So the code can be:
result = df.set_index('a').loc[['France', 'Croatie']].b.sum()
Note the double square brackets. The outer pair is the indexing operator of loc.
The inner pair, together with what is inside it, is a list of the index values to select.
To subtract two sums (one for some set of countries and the second for another set),
you can run e.g.:
wrk = df.set_index('a').b
result = wrk.loc[['Italie', 'USA']].sum() - wrk.loc[['France', 'Croatie']].sum()
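As a self-contained check of the approach above, here is a small sketch that rebuilds the sample table from the question. The isin variant is an alternative not mentioned in the answer: it avoids setting an index and silently skips labels that are not present, whereas in recent pandas versions loc raises a KeyError for a missing label such as 'USA'.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({"a": ["France", "Italie", "Croatie"], "b": [2, 15, 7]})

# Approach from the answer: index by country, select the rows, sum column b.
result = df.set_index("a").loc[["France", "Croatie"], "b"].sum()

# Alternative: a boolean mask; labels that do not occur are simply ignored.
wanted = ["France", "Croatie", "USA"]
result_isin = df.loc[df["a"].isin(wanted), "b"].sum()

print(result, result_isin)
```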

How to combine certain column values together in Python and make values in the other column be the means of the values combined?

I have a pandas DataFrame where one column ('sequence') is a sequence of numbers, many of them repeating, and the other column ('binary variable') holds values that are either 1 or 0.
I have grouped by the values in the 'sequence' column and made the 'binary variable' value for each group the percentage of entries in that group that are non-zero.
I now want to combine some of the entries in the 'sequence' column and make the 'binary variable' value the mean of the values of the entries that were combined.
So my data frame looks like this:
df = pd.DataFrame({'sequence': [1, 1, 4, 4, 4, 6], 'binary variable': [1, 0, 0, 1, 0, 1]})
I then used this code to group together the same values in 'sequence':
df.groupby("sequence")["binary variable"].apply(lambda s: (s != 0).sum() / s.count() * 100)
I am left with a 'sequence' column of non-repeating values, and the 'binary variable' column is now the percentage of non-zeros.
But now I want to group some of the 'sequence' values together (for this toy example, the values 1 and 4) and have the 'binary variable' column hold the mean of the percentages for, say, 1 and 4.
This isn't terribly well worded, as I'm finding it awkward to describe, but any help would be much appreciated. I've tried to look online and made many failed attempts with code of my own, but it just is not working.
It seems like you want to group the table twice and take the mean each time. For the second grouping, you need to create a new column to indicate the group.
Try this code:
import pandas as pd
# sequence groups for final average
grps = {(1, 4): [1, 4],
        (5, 6): [5, 6]}
# initial data
df = pd.DataFrame({'sequence' : [1,1,4,4,4,5,5,6], 'binvar' : [1,0,0,1,0,1,0,1]})
# the mean of a 0/1 column is the fraction of non-zero entries per sequence value
gb = df.groupby(["sequence"])['binvar'].mean().reset_index()
def getgrp(x):  # search groups
    for k in grps:
        if x in grps[k]:
            return k
print(df.to_string(index=False))
# look up the group of each sequence value
gb['group'] = gb['sequence'].map(getgrp)
gb = gb.reset_index()
print(gb.to_string(index=False))
gb = gb[['group','binvar']].groupby("group")['binvar'].mean().reset_index()
print(gb.to_string(index=False))
Output
 sequence  binvar
        1       1
        1       0
        4       0
        4       1
        4       0
        5       1
        5       0
        6       1
 index  sequence    binvar   group
     0         1  0.500000  (1, 4)
     1         4  0.333333  (1, 4)
     2         5  0.500000  (5, 6)
     3         6  1.000000  (5, 6)
  group    binvar
 (1, 4)  0.416667
 (5, 6)  0.750000
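The same two-step grouping can also be sketched with a plain dictionary lookup instead of a search function. This is an alternative under assumptions, not the answerer's code: the mapping seq_to_group and its labels '1+4' and '5+6' are hypothetical names chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"sequence": [1, 1, 4, 4, 4, 5, 5, 6],
                   "binvar": [1, 0, 0, 1, 0, 1, 0, 1]})

# Hypothetical mapping from each sequence value to the group it should join.
seq_to_group = {1: "1+4", 4: "1+4", 5: "5+6", 6: "5+6"}

# Step 1: fraction of non-zero entries per sequence value.
per_sequence = df.groupby("sequence")["binvar"].mean().reset_index()

# Step 2: label each sequence value with its group, then average the fractions.
per_sequence["group"] = per_sequence["sequence"].map(seq_to_group)
combined = per_sequence.groupby("group")["binvar"].mean()

print(combined)
```

Note that this averages the per-sequence fractions with equal weight, as the question asks. It is not the same as pooling all rows of sequences 1 and 4 and taking one mean.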

Taking mean of N largest values of group by absolute value

I have some DataFrame:
import numpy as np
import pandas as pd
d = {'fruit': ['apple', 'pear', 'peach'] * 6, 'values': np.random.uniform(-5,5,18), 'values2': np.random.uniform(-5,5,18)}
df = pd.DataFrame(data=d)
I can take the mean of each fruit group as such:
df.groupby('fruit').mean()
However, for each group of fruit, I'd like to take the mean of the N largest values as ranked by absolute value.
So for example, if my values were as follows and N=3:
[ 0.7578507 , 3.81178045, -4.04810913, 3.08887538, 2.87999752, 4.65670954]
The desired outcome would be (4.65670954 + -4.04810913 + 3.81178045) / 3 = ~1.47
Edit - to clarify that the sign is preserved in the outcome. If the negative value were -20.04810913 instead:
(4.65670954 + -20.04810913 + 3.81178045) / 3 = ~-3.86
Updating with a new approach that I think is simpler. I was avoiding apply like the plague but maybe this is one of the more acceptable uses. Plus it fixes the fact that you want to mean the original values as ranked by their absolute values:
def foo(d):
    return d[d.abs().nlargest(3).index].mean()
out = df.groupby('fruit')['values'].apply(foo)
So you index each group by the 3 largest absolute values, then mean.
And for the record my original, incorrect, and slower code was:
df['values'].abs().groupby(df['fruit']).nlargest(3).groupby("fruit").mean()
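To check the approach against the worked example from the question, here is a small deterministic sketch. The helper name top_n_abs_mean and the 'pear' values are made up for illustration; the 'apple' values are the six numbers quoted in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple"] * 6 + ["pear"] * 4,
    "values": [0.7578507, 3.81178045, -4.04810913,
               3.08887538, 2.87999752, 4.65670954,
               1.0, -6.0, 2.0, -0.5],
})


def top_n_abs_mean(s, n=3):
    """Mean of the n values with the largest absolute value, signs preserved."""
    # Rank by absolute value, but keep only the row labels of the winners...
    top_labels = s.abs().nlargest(n).index
    # ...and average the original (signed) values at those labels.
    return s.loc[top_labels].mean()


out = df.groupby("fruit")["values"].apply(top_n_abs_mean, n=3)
print(out)
```

For 'apple' this reproduces the ~1.47 from the question, and for the made-up 'pear' values it averages -6.0, 2.0 and 1.0.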

using list as an argument in groupby() in pandas and none of the key elements match column or index names

I have a dataframe of random values, as below, and a book I am studying uses a list as the groupby key (key_list). How is the dataframe grouped in this case, since none of the list values match column or index names? The last two lines are confusing to me.
people = pd.DataFrame(np.random.randn(5,5), columns = ['a','b','c','d','e'], index=['Joe','Steve','Wes','Jim','Travis'])
key_list = ['one','one','one','two','two']
people.groupby(key_list).min()
people.groupby([len, key_list]).min()
Thank you in advance!
The user guide on groupby explains a lot and I suggest you have a look at it. I'll explain as much as I understand for your use case.
You can verify the groups created using the groups attribute:
people.groupby(key_list).groups
{'one': Index(['Joe', 'Steve', 'Wes'], dtype='object'),
'two': Index(['Jim', 'Travis'], dtype='object')}
The result is a dictionary whose keys, 'one' and 'two', are the groups taken from key_list. The list is matched to the rows by position, so the first three rows form group 'one' and the last two form group 'two'. When you ask for min, it computes the minimum of each column within each group. Let's inspect group 'one' using the get_group method:
people.groupby(key_list).get_group('one')
              a         b         c         d         e
Joe   -0.702122  0.277164  1.017261 -1.664974 -1.852730
Steve -0.866450 -0.373737  1.964857 -1.123291  1.251595
Wes   -0.043835 -0.011108  0.214802  0.065022 -1.335713
Steve has the lowest value in columns 'a' and 'b', Wes in 'c', and Joe in 'd' and 'e'. Taking the minimum of this group therefore combines values from different rows:
people.groupby(key_list).get_group('one').min()
a -0.866450
b -0.373737
c 0.214802
d -1.664974
e -1.852730
dtype: float64
The same concept applies when you run it on the second group 'two'. As such, when you run the first part of your groupby code:
people.groupby(key_list).min()
You get the column-wise minimum for each group:
            a         b         c         d         e
one -0.866450 -0.373737  0.214802 -1.664974 -1.852730
two -1.074355 -0.098190 -0.595726 -2.194481  0.232505
The second part of your code, which involves len, applies the same grouping concept. Here len is called on each index label, so the dataframe is grouped by the length of the strings in its index: (Jim, Joe, Wes) - 3 letters, (Steve) - 5 letters, (Travis) - 6 letters, and then by key_list to give the final output:
              a         b         c         d         e
3 one -0.702122 -0.011108  0.214802 -1.664974 -1.852730
  two -0.928987 -0.098190  3.025985  0.702471  0.232505
5 one -0.866450 -0.373737  1.964857 -1.123291  1.251595
6 two -1.074355  1.110879 -0.595726 -2.194481  0.394216
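Note that length 3 is split into 'one' and 'two': 'Joe' and 'Wes' are both in group 'one', so that row holds the column-wise minimum over those two rows (for example, 'a' comes from Joe and 'b' from Wes), while 'Jim' is the only three-letter name in group 'two'. 'Steve' (5 letters) and 'Travis' (6 letters) are each alone in their groups, so their rows are returned unchanged.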
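The column-wise behaviour is easier to see with fixed numbers than with random ones. The following is a small sketch with made-up integer values (only the index labels and key_list come from the question), chosen so that the minimum of each column comes from a different row:

```python
import pandas as pd

# Made-up values so the result is reproducible.
people = pd.DataFrame(
    {"a": [1, 5, 3, 2, 0], "b": [9, 4, 6, 3, 7]},
    index=["Joe", "Steve", "Wes", "Jim", "Travis"],
)
key_list = ["one", "one", "one", "two", "two"]

# The list is matched to the rows by position: the first three rows are
# group 'one', the last two are group 'two'.
by_key = people.groupby(key_list).min()
# 'one': a = min(1, 5, 3) comes from Joe, b = min(9, 4, 6) comes from Steve

# len is applied to each index label; key_list is again matched by position.
by_len_and_key = people.groupby([len, key_list]).min()

print(by_key)
print(by_len_and_key)
```

In by_len_and_key, the row for (3, 'one') combines Joe and Wes, while Jim, Steve and Travis each end up alone in their groups.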

Create new column on pandas DataFrame in which the entries are randomly selected entries from another column

I have a DataFrame with the following structure.
df = pd.DataFrame({'tenant_id': [1,1,1,2,2,2,3,3,7,7], 'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh', 'bcv', 'bcb', 'yth', 'ytn'],
'text':['apple', 'ball', 'card', 'toy', 'sleep', 'happy', 'sad', 'be', 'u', 'pop']})
This gives the following output:
df = df[['tenant_id', 'user_id', 'text']]
tenant_id user_id text
1 ab1 apple
1 avc1 ball
1 bc2 card
2 iuyt toy
2 fvg sleep
2 fbh happy
3 bcv sad
3 bcb be
7 yth u
7 ytn pop
I would like to groupby on tenant_id and create a new column which is a random selection of strings from the user_id column.
Thus, I would like my output to look like the following:
tenant_id user_id text new_column
1 ab1 apple [ab1, bc2]
1 avc1 ball [ab1]
1 bc2 card [avc1]
2 iuyt toy [fvg, fbh]
2 fvg sleep [fbh]
2 fbh happy [fvg]
3 bcv sad [bcb]
3 bcb be [bcv]
7 yth u [pop]
7 ytn pop [u]
Here, random ids from the user_id column have been selected. The ids can be repeated, as "fvg" is for tenant_id=2. I would like a threshold of no more than ten ids. This data is just a sample with only 10 ids to start with, so in general the number should be much smaller than the total number of user_ids; in this case, say one less than the number of user_ids that belong to a tenant.
I first tried to figure out how to select a random subset of varying length with df.sample:
new_column = df.user_id.sample(n=np.random.randint(1, 10))
I am kind of lost after this. Assigning it to my df results in NaNs, probably because the results are of variable length. Please help.
Thanks.
per my comment:
Your 'new column' is not a new column, it's a new cell for a single row.
If you want to assign the result to a new column, you need to create a new column, and apply the cell computation to it.
df['new column'] = df['user_id'].apply(lambda x: df.user_id.sample(n=np.random.randint(1, 10)).tolist())
It doesn't really matter which column you use for the apply, since the variable x is not used in the computation. The .tolist() call stores a plain list in each cell rather than a Series.
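The line above samples from all user_ids, regardless of tenant. A possible way to restrict each row's sample to its own tenant is sketched below. It is one interpretation of the question, not the answerer's method: the names MAX_IDS, pools and pick_ids are hypothetical, the upper bound is taken as one less than the tenant's id count (capped at ten), and a row may draw its own user_id, as in the question's expected output.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "tenant_id": [1, 1, 1, 2, 2, 2, 3, 3, 7, 7],
    "user_id": ["ab1", "avc1", "bc2", "iuyt", "fvg", "fbh", "bcv", "bcb", "yth", "ytn"],
    "text": ["apple", "ball", "card", "toy", "sleep", "happy", "sad", "be", "u", "pop"],
})

MAX_IDS = 10                    # threshold from the question
rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# One list of user_ids per tenant.
pools = df.groupby("tenant_id")["user_id"].agg(list)


def pick_ids(tenant_id):
    """Draw a random-length sample of user_ids belonging to one tenant."""
    pool = pools.loc[tenant_id]
    # At most one fewer than the tenant's ids, capped at MAX_IDS, at least 1.
    upper = max(1, min(MAX_IDS, len(pool) - 1))
    size = int(rng.integers(1, upper + 1))  # upper bound is exclusive
    return rng.choice(pool, size=size, replace=False).tolist()


# Wrapping in a Series keeps each cell as a list of variable length.
df["new_column"] = pd.Series([pick_ids(t) for t in df["tenant_id"]], index=df.index)
print(df)
```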