How to split one column into many columns and count the frequency - pandas

Here is the question I have in mind. Given a table
   Id    type
0   1  [a, b]
1   2     [c]
2   3  [a, d]
I want to convert it into the form of:
   Id  a  b  c  d
0   1  1  1  0  0
1   2  0  0  1  0
2   3  1  0  0  1
I need a very efficient way to convert a large table. Any comment is welcome.
====================================
I have received several good answers, and I really appreciate your help.
Now a new question comes along: my laptop's memory is insufficient to generate the whole dataframe using pd.get_dummies.
Is there any way to generate a sparse vector row by row and stack them together?

Try this
>>> df
   Id    type
0   1  [a, b]
1   2     [c]
2   3  [a, d]
>>> df2 = pd.DataFrame([x for x in df['type'].apply(
...     lambda item: dict(map(lambda x: (x, 1), item))
... ).values]).fillna(0)
>>> df2.join(df)
   a  b  c  d  Id    type
0  1  1  0  0   1  [a, b]
1  0  0  1  0   2     [c]
2  1  0  0  1   3  [a, d]
It basically converts the list of lists to a list of dicts and constructs a DataFrame out of that:
[ ['a', 'b'], ['c'], ['a', 'd'] ]                 # list of lists
[ {'a': 1, 'b': 1}, {'c': 1}, {'a': 1, 'd': 1} ]  # list of dicts
and makes a DataFrame out of the latter.
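For illustration, here is that last step on its own, as a minimal standalone sketch (the intermediate list of dicts is written out by hand):
import pandas as pd
records = [{'a': 1, 'b': 1}, {'c': 1}, {'a': 1, 'd': 1}]  # list of dicts
print(pd.DataFrame(records).fillna(0))  # missing keys become NaN, then 0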

Try this:
pd.get_dummies(df.type.apply(lambda x: pd.Series([i for i in x])))
To explain:
df.type.apply(lambda x: pd.Series([i for i in x]))
gets you a column for each index position in your lists. You can then use get_dummies to get the count of each value:
pd.get_dummies(df.type.apply(lambda x: pd.Series([i for i in x])))
outputs:
   a  c  b  d
0  1  0  1  0
1  0  1  0  0
2  1  0  0  1
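The memory follow-up in the edited question is not addressed by the answers above. As a hedged sketch (not from the answers), you can build the indicator matrix row by row as a SciPy CSR matrix, so the dense frame is never materialised; this assumes scipy is installed, and the names labels/col_of are illustrative:
import numpy as np
import pandas as pd
from scipy import sparse

df = pd.DataFrame({'Id': [1, 2, 3], 'type': [['a', 'b'], ['c'], ['a', 'd']]})

# map each label to a column index
labels = sorted({x for row in df['type'] for x in row})
col_of = {lab: j for j, lab in enumerate(labels)}

# accumulate the CSR components one row at a time
indptr, indices = [0], []
for row in df['type']:
    indices.extend(col_of[x] for x in row)
    indptr.append(len(indices))
data = np.ones(len(indices), dtype=np.int8)
mat = sparse.csr_matrix((data, indices, indptr), shape=(len(df), len(labels)))

# wrap in a sparse-backed DataFrame only when a pandas object is needed
result = pd.DataFrame.sparse.from_spmatrix(mat, columns=labels)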

Related

Change the 1st instance of every unique row to 1 in pandas

Hi, let us assume I have a data frame
  Name  quantity
0    a         0
1    a         0
2    b         0
3    b         0
4    c         0
And I want something like
  Name  quantity
0    a         1
1    a         0
2    b         1
3    b         0
4    c         1
Essentially, I want to set the first row of every unique element to one.
Currently I am using code like:
def store_counter(df):
    unique_names = list(df.Name.unique())
    df['quantity'] = 0
    for i, j in df.iterrows():
        if j['Name'] in unique_names:
            df.loc[i, 'quantity'] = 1
            unique_names.remove(j['Name'])
    return df
which is highly inefficient. Is there a better approach for this?
Thank you in advance.
Use Series.duplicated with DataFrame.loc:
df.loc[~df.Name.duplicated(), 'quantity'] = 1
print(df)
  Name  quantity
0    a         1
1    a         0
2    b         1
3    b         0
4    c         1
If you need to set both values, use numpy.where:
df['quantity'] = np.where(df.Name.duplicated(), 0, 1)
print(df)
  Name  quantity
0    a         1
1    a         0
2    b         1
3    b         0
4    c         1
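As an extra sketch (not part of the answer above), the same result can come from groupby.cumcount, since the first row in each group has a cumulative count of 0:
df['quantity'] = (df.groupby('Name').cumcount() == 0).astype(int)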

How to transform a dataframe column containing lists of values into individual columns with counts of occurrence?

I have a frame like this
presence_data = pd.DataFrame({
    "id": ["id1", "id2"],
    "presence": [
        ["A", "B", "C", "A"],
        ["G", "A", "B", "I", "B"],
    ]
})
id   presence
id1  [A, B, C, A]
id2  [G, A, B, I, B]
I want to transform the above into something like this:
id   A  B  C  G  I
id1  2  1  1  0  0
id2  1  2  0  1  1
Currently, I have an approach where I iterate over rows, iterate over the values in the presence column, and then create/update new columns with counts based on the values encountered. I want to see if there is a better way.
Edited based on feedback from Henry Ecker in the comments; might as well have the better answer here:
You can use DataFrame.explode() to turn everything within the lists into separate rows, and then use pd.crosstab() to count the occurrences.
df = presence_data.explode('presence')
pd.crosstab(index=df['id'],columns=df['presence'])
This gave me the following:
presence  A  B  C  G  I
id
id1       2  1  1  0  0
id2       1  2  0  1  1
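A small follow-up on the crosstab result above, if you would rather have id as a regular column than the index (the leftover 'presence' columns label is dropped too):
out = pd.crosstab(index=df['id'], columns=df['presence']).reset_index()
out.columns.name = None  # remove the residual 'presence' axis label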
from collections import Counter
(presence_data
.set_index('id')
.presence
.map(Counter)
.apply(pd.Series)
.fillna(0, downcast='infer')
.reset_index()
)
id A B C G I
0 id1 2 1 1 0 0
1 id2 1 2 0 1 1
Speedwise it is hard to say; it is usually more efficient to deal with Python-native data structures within Python, yet this solution has a lot of method calls, which in themselves are relatively expensive.
Alternatively, you can create a new dataframe (and reduce the number of method calls):
(pd.DataFrame(map(Counter, presence_data.presence),
              index=presence_data.id)
 .fillna(0, downcast='infer')
 .reset_index()
)
id A B C G I
0 id1 2 1 1 0 0
1 id2 1 2 0 1 1
You can use apply and value_counts. First we use the lists in your presence column to create new columns. We can then use axis=1 to get the row value counts.
df = (pd.DataFrame(presence_data['presence'].tolist(), index=presence_data['id'])
        .apply(pd.Series.value_counts, axis=1)
        .fillna(0)
        .astype(int))
print(df)
A B C G I
id
id1 2 1 1 0 0
id2 1 2 0 1 1
You can use this afterwards if you want to have id as a column rather than the index.
df.reset_index(inplace=True)
print(df)
id A B C G I
0 id1 2 1 1 0 0
1 id2 1 2 0 1 1

python pandas sort_values with multiple custom keys

I have a dataframe with 2 columns. I would like to sort column A ascending, and column B ascending by absolute value. How do I do this? I tried df.sort_values(by=['A', 'B'], key=lambda x: abs(x)), but it takes the absolute value of both columns and sorts them ascending.
df = pd.DataFrame({'A': [1,2,-3], 'B': [-1, 2, -3]})
output:
   A  B
0  1 -1
1  2  2
2 -3 -3
Expected output:
   A  B
0 -3 -1
1  1  2
2  2 -3
You can't have multiple sort keys because the rows can't be dissociated from their index. The only way is to sort your columns independently and recreate a dataframe:
>>> df.agg({'A': lambda x: x.sort_values().values,
...         'B': lambda x: x.sort_values(key=abs).values}) \
...     .apply(pd.Series).T
   A  B
0 -3 -1
1  1  2
2  2 -3
Use numpy.sort to sort column A's values (in this example, column B already happens to be in the desired order):
df = df.assign(A=np.sort(df['A'].values))
   A  B
0 -3 -1
1  1  2
2  2 -3
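A hedged sketch along the same lines (not from the answers above): sort each column independently with numpy and rebuild the frame, assuming the row association really can be discarded:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, -3], 'B': [-1, 2, -3]})
out = pd.DataFrame({
    'A': np.sort(df['A'].to_numpy()),                      # plain ascending
    'B': df['B'].to_numpy()[np.argsort(np.abs(df['B']))],  # ascending by |B|
})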

Adding new column to an existing dataframe at an arbitrary position [duplicate]

Can I insert a column at a specific column index in pandas?
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
This will put column n as the last column of df, but isn't there a way to tell df to put n at the beginning?
See the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.insert.html
Using loc=0 will insert at the beginning:
df.insert(loc, column, value)
df = pd.DataFrame({'B': [1, 2, 3], 'C': [4, 5, 6]})
df
Out:
   B  C
0  1  4
1  2  5
2  3  6
idx = 0
new_col = [7, 8, 9]  # can be a list, a Series, an array or a scalar
df.insert(loc=idx, column='A', value=new_col)
df
Out:
   A  B  C
0  7  1  4
1  8  2  5
2  9  3  6
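One detail worth noting from the DataFrame.insert docs: loc must be an integer with 0 <= loc <= len(df.columns), so you can also append at the end (the 'D' column below is just illustrative):
df.insert(loc=len(df.columns), column='D', value=0)  # append as the last column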
If you want a single value for all rows:
df.insert(0, 'name_of_column', '')
df['name_of_column'] = value
Edit:
You can also:
df.insert(0, 'name_of_column', value)
df.insert(loc, column_name, value)
This will work if there is no other column with the same name. If a column with your provided name already exists in the dataframe, it will raise a ValueError.
You can pass the optional parameter allow_duplicates=True to create a new column with an already existing column name.
Here is an example:
>>> df = pd.DataFrame({'b': [1, 2], 'c': [3, 4]})
>>> df
   b  c
0  1  3
1  2  4
>>> df.insert(0, 'a', -1)
>>> df
   a  b  c
0 -1  1  3
1 -1  2  4
>>> df.insert(0, 'a', -2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python39\lib\site-packages\pandas\core\frame.py", line 3760, in insert
    self._mgr.insert(loc, column, value, allow_duplicates=allow_duplicates)
  File "C:\Python39\lib\site-packages\pandas\core\internals\managers.py", line 1191, in insert
    raise ValueError(f"cannot insert {item}, already exists")
ValueError: cannot insert a, already exists
>>> df.insert(0, 'a', -2, allow_duplicates=True)
>>> df
   a  a  b  c
0 -2 -1  1  3
1 -2 -1  2  4
You could extract the columns as a list, massage it as you want, and reindex your dataframe:
>>> cols = df.columns.tolist()
>>> cols = [cols[-1]] + cols[:-1]  # or whatever change you need
>>> df.reindex(columns=cols)
   n  l  v
0  0  a  1
1  0  b  2
2  0  c  1
3  0  d  2
EDIT: this can be done in one line; however, it looks a bit ugly. Maybe a cleaner proposal will come...
>>> df.reindex(columns=['n'] + df.columns[:-1].tolist())
   n  l  v
0  0  a  1
1  0  b  2
2  0  c  1
3  0  d  2
Here is a very simple answer to this (only one line). You can do this after you have added the 'n' column to your df, as follows.
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
df
   l  v  n
0  a  1  0
1  b  2  0
2  c  1  0
3  d  2  0
# here you can add the below code and it should work
df = df[list('nlv')]
df
   n  l  v
0  0  a  1
1  0  b  2
2  0  c  1
3  0  d  2
However, if your column names are whole words instead of single letters, you need two pairs of brackets around your column names:
import pandas as pd
df = pd.DataFrame({'Upper':['a','b','c','d'], 'Lower':[1,2,1,2]})
df['Net'] = 0
df['Mid'] = 2
df['Zsore'] = 2
df
  Upper  Lower  Net  Mid  Zsore
0     a      1    0    2      2
1     b      2    0    2      2
2     c      1    0    2      2
3     d      2    0    2      2
# here you can add the below line and it should work
df = df[list(('Mid', 'Upper', 'Lower', 'Net', 'Zsore'))]
df
  Mid Upper  Lower  Net  Zsore
0   2     a      1    0      2
1   2     b      2    0      2
2   2     c      1    0      2
3   2     d      2    0      2
A general 4-line routine
You can have the following 4-line routine whenever you want to create a new column and insert into a specific location loc.
df['new_column'] = ... #new column's definition
col = df.columns.tolist()
col.insert(loc, col.pop()) #loc is the column's index you want to insert into
df = df[col]
In your example, it is simple:
df['n'] = 0
col = df.columns.tolist()
col.insert(0, col.pop())
df = df[col]

Extract rows with maximum values in pandas dataframe

We can use .idxmax to get the index of the maximum value in a dataframe (df). My problem is that I have a df with several columns (more than 10), one of which holds identifiers that repeat. I need to extract, for each identifier, the maximum value:
>df
id value
a 0
b 1
b 1
c 0
c 2
c 1
Now, this is what I'd want:
>df
id value
a 0
b 1
c 2
I am trying to get it by using df.groupby(['id']), but it is a bit tricky:
df.groupby(["id"]).ix[df['value'].idxmax()]
Of course, that doesn't work. I fear that I am not on the right path, so I thought I'd ask you guys! Thanks!
Close! Groupby the id, then use the value column; return the max for each group.
In [14]: df.groupby('id')['value'].max()
Out[14]:
id
a 0
b 1
c 2
Name: value, dtype: int64
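A small follow-up: to get a frame shaped exactly like the desired output, reset the index on the same result:
df.groupby('id')['value'].max().reset_index()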
If the OP wants to broadcast these values back to the frame, just create a transform and assign:
In [17]: df['max'] = df.groupby('id')['value'].transform(lambda x: x.max())
In [18]: df
Out[18]:
  id  value  max
0  a      0    0
1  b      1    1
2  b      1    1
3  c      0    2
4  c      2    2
5  c      1    2
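Since the question mentions a frame with many more columns, here is a hedged sketch for extracting each group's entire max row rather than just the value (ties resolve to the first occurrence):
import pandas as pd

df = pd.DataFrame({'id': list('abbccc'), 'value': [0, 1, 1, 0, 2, 1]})

# idxmax returns the row label of each group's maximum; .loc pulls the full rows
rows = df.loc[df.groupby('id')['value'].idxmax()]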