How to make a new df with column name and unique values? - pandas

I am attempting to create a new df that shows all columns and their unique values. I have the following code, but I think I am referencing the column of the df in the loop incorrectly.
# Create empty df
df_unique = pd.DataFrame()
# Loop to take unique values from each column and append to df
for col in df:
    list = df(col).unique().tolist()
    df_unique.loc[len(df_unique)] = list
To visualize what I am hoping to achieve, I've included a before and after example below.
Before
ID Name Zip Type
01 Bennett 10115 House
02 Sally 10119 Apt
03 Ben 11001 House
04 Bennett 10119 House
After
Column List_of_unique
ID 01, 02, 03, 04
Name Bennett, Sally, Ben
Zip 10115, 10119, 11001
Type House, Apt

You can use np.unique applied to each column (assuming numpy is imported as np):
>>> df.apply(np.unique)
ID [1, 2, 3, 4]
Name [Ben, Bennett, Sally]
Zip [10115, 10119, 11001]
Type [Apt, House]
dtype: object
# OR
>>> (df.apply(lambda x: ', '.join(x.unique().astype(str)))
.rename_axis('Column').rename('List_of_unique').reset_index())
Column List_of_unique
0 ID 1, 2, 3, 4
1 Name Bennett, Sally, Ben
2 Zip 10115, 10119, 11001
3 Type House, Apt
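For completeness, the loop from the question needs only two small fixes: columns are indexed with square brackets (df[col], not df(col), which tries to call the DataFrame), and each row should store the column name alongside its unique values. A minimal sketch of that repair (the joined-string format is an assumption to match the desired output):
import pandas as pd

df_unique = pd.DataFrame(columns=['Column', 'List_of_unique'])
for col in df:  # iterating a DataFrame yields its column names
    uniques = df[col].unique().tolist()
    df_unique.loc[len(df_unique)] = [col, ', '.join(map(str, uniques))]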

Related

Convert some columns in a dataframe to a column of lists in the dataframe

I would like to convert some of the columns of a dataframe into a single column of lists.
The dataframe, df:
Name salary department days other
0 ben 1000 A 90 abc
1 alex 3000 B 80 gf
2 linn 600 C 55 jgj
3 luke 5000 D 88 gg
The desired output, df1:
Name list other
0 ben [1000,A,90] abc
1 alex [3000,B,80] gf
2 linn [600,C,55] jgj
3 luke [5000,D,88] gg
You can slice and convert the columns to a list of lists, then to a Series:
cols = ['salary', 'department', 'days']
out = (df.drop(columns=cols)
.join(pd.Series(df[cols].to_numpy().tolist(), name='list', index=df.index))
)
Output:
Name other list
0 ben abc [1000, A, 90]
1 alex gf [3000, B, 80]
2 linn jgj [600, C, 55]
3 luke gg [5000, D, 88]
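If you prefer to stay in pandas instead of going through NumPy, the same list column can also be built with apply over the rows (a sketch, not part of the original answer); with the default result_type, list results come back as a Series of lists:
cols = ['salary', 'department', 'days']
out = df.drop(columns=cols).join(df[cols].apply(list, axis=1).rename('list'))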
If you want to preserve the column order, we can break it down into three parts, building on #mozway's answer:
Define the columns we want to group together.
Find the index of the first of those columns (you could go a step further and find the smallest index, since the list won't necessarily follow the DataFrame's column order).
Insert the Series into the dataframe at that position.
cols = ['salary', 'department', 'days']
first_location = df.columns.get_loc(cols[0])
list_values = pd.Series(df[cols].values.tolist()) # converting values to one list
df.insert(loc=first_location, column='list', value=list_values) # inserting the Series in the desired location
df = df.drop(columns=cols) # dropping the columns we grouped together.
print(df)
Which results in:
Name list other
0 ben [1000, A, 90] abc
1 alex [3000, B, 80] gf
...

Group by count in pandas dataframe

In a pandas dataframe I want to create two new columns that count the occurrences of the same value, and a third column that calculates the ratio:
ratio = count_occurrence_both_columns / count_occurrence_columnA * 100
df = pd.DataFrame({"column A": ["Atlanta", "Atlanta", "New York", "New York","New York"], "column B": ["AT", "AT", "NY", "NY", "AT"]})
df
column A  column B  occurrence_columnA  occurrence_both_columns  Ratio
Atlanta   AT        2                   2                        100%
Atlanta   AT        2                   2                        100%
New York  NY        3                   2                        66.66%
New York  NY        3                   2                        66.66%
New York  AT        3                   1                        33.33%
First, you can create a dictionary whose keys are column A's unique values and whose values are their counts.
>>> column_a_mapping = df['column A'].value_counts().to_dict()
>>> column_a_mapping
>>> {'New York': 3, 'Atlanta': 2}
Then, you can create a new column that merges the two columns, so you can build the same kind of value-counts dictionary as above.
>>> df['both_columns'] = (
df[['column A', 'column B']]
.apply(lambda row: '_'.join(row), axis=1)
)
>>> both_columns_mapping = df['both_columns'].value_counts().to_dict()
>>> both_columns_mapping
>>> {'New York_NY': 2, 'Atlanta_AT': 2, 'New York_AT': 1}
Once you have the unique value counts, you can simply use the pd.Series.replace method.
>>> df['count_occurrence_both_columns'] = df['both_columns'].replace(both_columns_mapping)
>>> df['count_occurrence_columnA'] = df['column A'].replace(column_a_mapping)
Lastly, you can create your ratio column and then drop the helper column that has both columns merged:
>>> df['ratio'] = df['count_occurrence_both_columns'] / df['count_occurrence_columnA'] * 100
>>> df.drop('both_columns', axis=1, inplace=True)
You should obtain this dataframe:
column A  column B  count_occurrence_columnA  count_occurrence_both_columns  ratio
Atlanta   AT        2                         2                              100.000000
Atlanta   AT        2                         2                              100.000000
New York  NY        3                         2                              66.666667
New York  NY        3                         2                              66.666667
New York  AT        3                         1                              33.333333
Use pandas groupby to count the items
df['occurrence_columnA'] = df.groupby(['column A'])['column B'].transform(len)
df['occurrence_both_columns'] = df.groupby(['column A','column B'])['occurrence_columnA'].transform(len)
The alternative way is to use transform('count'), but this will ignore NaNs.
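A sketch of that variant on the same df, including the ratio column the question asks for (note that 'count' skips NaN values, while len counts every row in the group):
df['occurrence_columnA'] = df.groupby(['column A'])['column B'].transform('count')
df['occurrence_both_columns'] = df.groupby(['column A', 'column B'])['column B'].transform('count')
df['Ratio'] = df['occurrence_both_columns'] / df['occurrence_columnA'] * 100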

pandas - get count of duplicate rows (matching across multiple columns)

I have a table like below - unique IDs and names. I want to return any duplicated names (based on matching First and Last).
Id First Last
1 Dave Davis
2 Dave Smith
3 Bob Smith
4 Dave Smith
I've managed to return a count of duplicates across all columns if I don't have an ID column, i.e.
import pandas as pd
dict2 = {'First': pd.Series(["Dave", "Dave", "Bob", "Dave"]),
'Last': pd.Series(["Davis", "Smith", "Smith", "Smith"])}
df2 = pd.DataFrame(dict2)
print(df2.groupby(df2.columns.tolist()).size().reset_index().\
rename(columns={0:'records'}))
Output:
First Last records
0 Bob Smith 1
1 Dave Davis 1
2 Dave Smith 2
I want to be able to return the duplicates (of first and last) when I also have an ID column, i.e.
import pandas as pd
dict1 = {'Id': pd.Series([1, 2, 3, 4]),
'First': pd.Series(["Dave", "Dave", "Bob", "Dave"]),
'Last': pd.Series(["Davis", "Smith", "Smith", "Smith"])}
df1 = pd.DataFrame(dict1)
print(df1.groupby(df1.columns.tolist()).size().reset_index().\
rename(columns={0:'records'}))
gives:
Id First Last records
0 1 Dave Davis 1
1 2 Dave Smith 1
2 3 Bob Smith 1
3 4 Dave Smith 1
I want (ideally):
First Last records Ids
0 Dave Smith 2 2, 4
First filter only the duplicated rows with DataFrame.duplicated, passing the columns to check and keep=False so that all duplicates are returned, and select them with boolean indexing. Then aggregate with GroupBy.agg: count with GroupBy.size and join the IDs after converting them to strings:
tup = [('records','size'), ('Ids',lambda x: ','.join(x.astype(str)))]
df2 = (df1[df1.duplicated(['First','Last'], keep=False)]
.groupby(['First','Last'])['Id'].agg(tup)
.reset_index())
print (df2)
First Last records Ids
0 Dave Smith 2 2,4
Another idea is aggregate all values and then filter with DataFrame.query:
tup = [('records','size'), ('Ids',lambda x: ','.join(x.astype(str)))]
df2 = (df1.groupby(['First','Last'])['Id'].agg(tup)
.reset_index()
.query('records != 1'))
print (df2)
First Last records Ids
2 Dave Smith 2 2,4
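If you are on pandas 0.25 or later, the same aggregation can also be written with named aggregation instead of the list of tuples; a sketch of the first variant in that style:
df2 = (df1[df1.duplicated(['First', 'Last'], keep=False)]
       .groupby(['First', 'Last'])
       .agg(records=('Id', 'size'),
            Ids=('Id', lambda x: ','.join(x.astype(str))))
       .reset_index())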

How to make a new column based on a value appearing in a dataframe cell

I want to create a new column in a dataframe if a value is present in an existing list-type column and another column matches a condition.
Dataset:
name loto
0 Jason [22]
1 Molly [222]
2 Tina [232]
3 Jake [223]
4 Amy [73, 1, 2, 3]
If name == "Jason" and loto contains 22, then new = 1.
I tried to use np.where, but I'm having issues checking for an element inside the array.
import numpy as np
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'loto': [[22], [222], [232], [223], [73,1,2,3]]}
df = pd.DataFrame(data, columns = ['name', 'loto'])
df['new'] = np.where((22 in df['loto']) & (df[name]=="Jason"), 1, 0)
First create the value you want to check as a set, e.g. set([22]).
Then pass loto_chck to map and apply the condition with .loc:
loto_val = set([22])
loto_chck= loto_val.issubset
df.loc[(df['loto'].map(loto_chck))&(df['name']=='Jason'),"new"]=1
name loto new
0 Jason [22] 1
1 Molly [222] NaN
2 Tina [232] NaN
3 Jake [223] NaN
4 Amy [73, 1, 2, 3] NaN
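If you want plain 0/1 flags instead of NaN for the non-matching rows (an assumption about the desired output), fill the column afterwards:
df['new'] = df['new'].fillna(0).astype(int)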
You could try:
df['new'] = ((df.apply(lambda x : 22 in x.loto , axis = 1)) & \
(df.name =='Jason')).astype(int)
Even though it's not a good idea to store lists in a dataframe.

How do I make a Dataframe of columns and unique values stacked?

I have a large data frame that I would like to develop a summation table from. In other words, column 1 would be the columns of the first data frame, column 2 would be each unique value of each column, and columns three through ... would be summations of different variables I choose. Like the below:
Variable Level Summed_Column
Here is some sample code:
data = {"name": ['bob', 'john', 'mary', 'timmy']
, "age": [32, 32, 29, 28]
, "location": ['philly', 'philly', 'philly', 'ny']
, "amt": [100, 2000, 300, 40]}
df = pd.DataFrame(data)
df.head()
So the output in the above example would be as follows:
Variable Level Summed_Column
Name Bob 100
Name john 2000
Name Mary 300
Name timmy 40
age 32 2100
age 29 300
age 28 40
location philly 2400
location ny 40
I'm not even sure where to start. The actual dataframe has 32 columns, of which 4 will be summed and 28 put into the Variable and Level format.
You don't need a loop and concatenation for this; you can do it in one go by combining melt with groupby and using the agg method:
final = df.melt(value_vars=['name', 'age', 'location'], id_vars='amt')\
.groupby(['variable', 'value']).agg({'amt':'sum'})\
.reset_index()
Which yields:
print(final)
variable value amt
0 age 28 40
1 age 29 300
2 age 32 2100
3 location ny 40
4 location philly 2400
5 name bob 100
6 name john 2000
7 name mary 300
8 name timmy 40
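If you want the exact headers from the question (Variable, Level, Summed_Column), a final rename is enough, e.g.:
final = final.rename(columns={'variable': 'Variable',
                              'value': 'Level',
                              'amt': 'Summed_Column'})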
ok #Datanovice. I figured out how to do this using a for loop w/ pd.melt.
id = ['name', 'age', 'location']
final = pd.DataFrame(columns=['variable', 'value', 'amt'])
for i in id:
    table = df.groupby(i).agg({'amt': 'sum'}).reset_index()
    table2 = pd.melt(table, value_vars=i, id_vars=['amt'])
    final = pd.concat([final, table2])
print(final)