Pandas get unique values in every column

I'd like to print the unique values in each column of a grouped dataframe, but the following code snippet doesn't work as expected:
df = pd.DataFrame({'a' : [1, 2, 1, 2], 'b' : [5, 5, 5, 5], 'c' : [11, 12, 13, 14]})
print(
    df.groupby(['a']).apply(
        lambda df: df.apply(
            lambda col: col.unique(), axis=0))
)
I'd expect it to print
1 [5] [11, 13]
2 [5] [12, 14]
While there are other ways of doing so, I'd like to understand what's wrong with this approach. Any ideas?

This should do the trick:
print(df.groupby(['a', 'b'])['c'].unique())
a  b
1  5    [11, 13]
2  5    [12, 14]
Name: c, dtype: object
As to what's wrong with your approach: when you groupby on df and then apply some function f, the input to f is a DataFrame with all of df's columns, including the grouping column, unless you select otherwise (as my snippet does with ['c']). So your outer apply passes a DataFrame with 3 columns, and your inner apply then calls unique() on each of those 3 columns, including a itself, so you get 3 result columns for every group instead of the 2 you expected.
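For completeness, a minimal sketch of a version that does work across all non-grouping columns (my variant, using agg so each column is passed to the function separately):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': [5, 5, 5, 5], 'c': [11, 12, 13, 14]})

# agg applies the function to each non-grouping column separately,
# so the grouping column a is excluded from the result
print(df.groupby('a').agg(lambda col: col.unique().tolist()))
# roughly:
#      b         c
# a
# 1  [5]  [11, 13]
# 2  [5]  [12, 14]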

Apply custom functions to groupby pandas

I know there are questions/answers about how to use a custom function with groupby in pandas, but my case is slightly different.
My data is
group_col val_col
0 a [1, 2, 34]
1 a [2, 4]
2 b [2, 3, 4, 5]
data = {'group_col': {0: 'a', 1: 'a', 2: 'b'}, 'val_col': {0: [1, 2, 34], 1: [2, 4], 2: [2, 3, 4, 5]}}
df = pd.DataFrame(data)
What I am trying to do is group by group_col, then sum up the lengths of the lists in val_col for each group. My desired output is
a 5
b 4
I wonder if I can do this in pandas?
You can try
df['val_col'].str.len().groupby(df['group_col']).sum()
df.groupby('group_col')['val_col'].sum().str.len()
Output:
group_col
a 5
b 4
Name: val_col, dtype: int64
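A note on how the two one-liners differ (my reading, not stated in the answer): the first measures each list's length with .str.len() and then sums those lengths within each group; the second concatenates the lists within each group (sum on object-dtype lists is list concatenation) and then measures the combined length. A self-contained sketch:

import pandas as pd

data = {'group_col': {0: 'a', 1: 'a', 2: 'b'},
        'val_col': {0: [1, 2, 34], 1: [2, 4], 2: [2, 3, 4, 5]}}
df = pd.DataFrame(data)

# per-row lengths first, then sum within each group
print(df['val_col'].str.len().groupby(df['group_col']).sum())

# concatenate each group's lists first, then take the combined length
print(df.groupby('group_col')['val_col'].sum().str.len())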

Combine unequal length lists to dataframe pandas with values repeating

How to add a list to a dataframe column such that the values repeat for every row of the dataframe?
mylist = ['one error','delay error']
df['error'] = mylist
This gives an unequal-length error, as df has 2000 rows. I can still add it if I make mylist into a Series; however, that only fills the first row, and the output looks like this:
d = {'col1': [1, 2, 3, 4, 5],
'col2': [3, 4, 9, 11, 17],
'error': ['one error', np.nan, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data=d)
However I would want the solution to look like this:
d = {'col1': [1, 2, 3, 4, 5],
'col2': [3, 4, 9, 11, 17],
'error': [['one error', 'delay error'], ['one error', 'delay error'], ['one error', 'delay error'], ['one error', 'delay error'], ['one error', 'delay error']]}
df = pd.DataFrame(data=d)
I have tried ffill() but it didn't work.
You can assign into the result of df['error'].to_numpy(). Note that you'll have to use [mylist] instead of mylist, even though it's already a list, so that NumPy broadcasts it as one object per row ;)
>>> mylist = ['one error']
>>> df['error'].to_numpy()[:] = [mylist]
>>> df
col1 col2 error
0 1 3 [one error]
1 2 4 [one error]
2 3 9 [one error]
3 4 11 [one error]
4 5 17 [one error]
>>> mylist = ['abc', 'def', 'ghi']
>>> df['error'].to_numpy()[:] = [mylist]
>>> df
col1 col2 error
0 1 3 [abc, def, ghi]
1 2 4 [abc, def, ghi]
2 3 9 [abc, def, ghi]
3 4 11 [abc, def, ghi]
4 5 17 [abc, def, ghi]
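A hedged caveat (my note, not from the answer above): the in-place to_numpy() write relies on getting a view of the column's underlying array, which pandas with copy-on-write enabled may not provide, and it assumes the error column already exists with object dtype. A safer sketch is to assign a list of references directly:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [3, 4, 9, 11, 17]})
mylist = ['one error', 'delay error']

# one entry per row; every row shares the same underlying list object
df['error'] = [mylist] * len(df)
print(df.head(2))
#    col1  col2                     error
# 0     1     3  [one error, delay error]
# 1     2     4  [one error, delay error]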
It's not a very clean way to do it, but you can first expand mylist into a list of lists with the same length as the dataframe, and only then put it into your dataframe.
mylist = ['one error','delay error']
new_mylist = [mylist for i in range(len(df['col1']))]
df['error'] = new_mylist
Repeat the elements in mylist N times, where N is the ceiling of len(df) / len(mylist); then, while assigning to the new column, slice the repeated list so that its length doesn't exceed the length of the dataframe:
df['error'] = (mylist * (len(df) // len(mylist) + 1))[:len(df)]
col1 col2 error
0 1 3 one error
1 2 4 delay error
2 3 9 one error
3 4 11 delay error
4 5 17 one error
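The same cycling idea can be written with itertools (a variant sketch, my phrasing rather than the answer's):

from itertools import cycle, islice

# cycle repeats mylist endlessly; islice cuts it off at the dataframe's length
df['error'] = list(islice(cycle(mylist), len(df)))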
Or, if a single string per row is acceptable, you can store the list's string representation (i.e. str(mylist)) in every row:
df.assign(error=mylist.__str__())
Note that this fills the column with the one string "['one error', 'delay error']" rather than with list objects.

fastest way to get value frequency stored in dictionary format in groupby pandas

In order to calculate the frequency of each value by id, we can use value_counts and groupby.
>>> df = pd.DataFrame({"id":[1,1,1,2,2,2], "col":['a','a','b','a','b','b']})
>>> df
id col
0 1 a
1 1 a
2 1 b
3 2 a
4 2 b
5 2 b
>>> df.groupby('id')['col'].value_counts()
id col
1 a 2
b 1
2 b 2
a 1
But I would like the results stored in dictionary format, not as a Series. How can I achieve that in a way that is also fast on a large dataset?
The ideal format is:
id
1 {'a': 2, 'b': 1}
2 {'a': 1, 'b': 2}
You can unstack the groupby result to get a dict-of-dicts:
df.groupby('id')['col'].value_counts().unstack().to_dict(orient='index')
# {1: {'a': 2, 'b': 1}, 2: {'a': 1, 'b': 2}}
If you want a Series of dicts, use agg instead of to_dict:
df.groupby('id')['col'].value_counts().unstack().agg(pd.Series.to_dict)
col
a {1: 2, 2: 1}
b {1: 1, 2: 2}
dtype: object
I don't recommend storing data in this format; object columns are generally more troublesome to work with.
If unstacking generates NaNs, try an alternative with GroupBy.agg:
df.groupby('id')['col'].agg(lambda x: x.value_counts().to_dict())
id
1 {'a': 2, 'b': 1}
2 {'b': 2, 'a': 1}
Name: col, dtype: object
We can use pd.crosstab:
pd.Series(pd.crosstab(df.id, df.col).to_dict('index'))
1 {'a': 2, 'b': 1}
2 {'a': 1, 'b': 2}
dtype: object
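Since the question also asks about speed, one more sketch (my suggestion, not from the answers above): build each group's dict with collections.Counter, which skips the intermediate value_counts Series per group:

from collections import Counter

import pandas as pd

df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 2], "col": ['a', 'a', 'b', 'a', 'b', 'b']})

# Counter consumes each group's values directly and converts to a plain dict
out = df.groupby('id')['col'].agg(lambda x: dict(Counter(x)))
print(out)
# id
# 1    {'a': 2, 'b': 1}
# 2    {'a': 1, 'b': 2}
# Name: col, dtype: object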

How to get the index of each increment in pandas series?

How do I get the indices of a pandas Series at which the value increments by one?
Ex. The input is
A
0 0
1 1
2 1
3 1
4 2
5 2
6 3
7 4
8 4
the output should be: [0, 1, 4, 6, 7]
You can use Series.duplicated and index into df.index; this should be slightly faster than drop_duplicates.
df.index[~df.A.duplicated()]
# Int64Index([0, 1, 4, 6, 7], dtype='int64')
If you really want a list, you can do this:
df.index[~df.A.duplicated()].tolist()
# [0, 1, 4, 6, 7]
Note that duplicated (and drop_duplicates) will only work if your Series does not have any decrements.
Alternatively, you can use diff and index into df.index, prepending the first row's index manually (diff leaves the first entry as NaN):
np.insert(df.index[df.A.diff().gt(0)], 0, 0)
# Int64Index([0, 1, 4, 6, 7], dtype='int64')
You can use drop_duplicates:
df.drop_duplicates('A').index.tolist()
[0, 1, 4, 6, 7]
If you want to make sure the next row increments by exactly one (not by two or anything else!), compare against the shifted values:
df[(df.A.shift(-1) - df.A) == 1.0].index.values
The output is a NumPy array:
array([2, 5])
Example:
# index:  0  1  2  3  4  5  6  7
# value:  1  1  1  2  8  3  4  4
#                ^        ^      (the next value is greater by exactly 1 at indices 2 and 5)
df = pd.DataFrame({'A': [1, 1, 1, 2, 8, 3, 4, 4]})
df[(df.A.shift(-1) - df.A) == 1.0].index.values
array([2, 5])
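To tie the two readings of the question together, a minimal sketch (assuming, as in the original example, that the values never decrease):

import pandas as pd

s = pd.Series([0, 1, 1, 1, 2, 2, 3, 4, 4], name='A')

# first index at which each new value appears (the question's expected output)
print(s.index[~s.duplicated()].tolist())  # [0, 1, 4, 6, 7]

# indices whose value is exactly one greater than the previous row's
print(s.index[s.diff().eq(1)].tolist())   # [1, 4, 6, 7]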

how to convert a pandas column containing list into dataframe

I have a pandas dataframe.
One of its columns contains a list of 60 elements, constant across its rows.
How do I convert each of these lists into a row of a new dataframe?
Just to be clearer: say A is the original dataframe with n rows. One of its columns contains a list of 60 elements.
I need to create a new dataframe of shape n x 60.
My tentative:
def expand(x):
    return pd.DataFrame(np.array(x).reshape(-1, len(x)))

df["col"].apply(lambda x: expand(x))
it gives funny results....
The weird thing is that if I call the function "expand" on a single row, it does exactly what I expect from it:
expand(df["col"][0])
To ChootsMagoots: This is the result when I try to apply your suggestion. It does not work.
Sample data
df = pd.DataFrame()
df['col'] = np.arange(4*5).reshape(4,5).tolist()
df
Output:
col
0 [0, 1, 2, 3, 4]
1 [5, 6, 7, 8, 9]
2 [10, 11, 12, 13, 14]
3 [15, 16, 17, 18, 19]
Now extract a DataFrame from col:
df.col.apply(pd.Series)
Output:
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
Try this:
new_df = pd.DataFrame(df["col"].tolist())
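One small follow-up (my note): pd.DataFrame(df["col"].tolist()) resets the row labels to 0..n-1. If the new frame should stay aligned with A's original index, you can pass it explicitly:

# keep the original row labels instead of the default RangeIndex
new_df = pd.DataFrame(df["col"].tolist(), index=df.index)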
This is a little frankensteinish, but you could also try:
import numpy as np
np.savetxt('outfile.csv', np.array(df['col'].tolist()), delimiter=',')
new_df = pd.read_csv('outfile.csv', header=None)
You can try this as well:
newCol = pd.Series(yourList)
df['colD'] = newCol.values
The above code:
1. Creates a pandas Series from your list.
2. Assigns the Series' values as a new column of the original dataframe.