How to reset the index with respect to a group? - pandas

I have an id column identifying each person (rows with the same id belong to one person). Right now the id column is not a simple numbering; it is a 10-digit value. How can I reset the ids to consecutive integers, e.g. 1, 2, 3, 4?
For example:
id col1
12a4 summer
12a4 goest
3b yes
3b No
3b why
4t Hi
Output:
id col1
1 summer
1 goest
2 yes
2 No
2 why
3 Hi

Use factorize:
df['id'] = df['id'].factorize()[0] + 1
Output:
id col1
0 1 summer
1 1 goest
2 2 yes
3 2 No
4 2 why
5 3 Hi
Another option is to use categorical data:
df['id'] = df['id'].astype('category').cat.codes + 1
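For reference, a minimal end-to-end sketch of the factorize approach on the question's sample data:
import pandas as pd

df = pd.DataFrame({'id': ['12a4', '12a4', '3b', '3b', '3b', '4t'],
                   'col1': ['summer', 'goest', 'yes', 'No', 'why', 'Hi']})

# factorize returns (codes, uniques); the codes number the ids by order
# of first appearance, starting at 0, so add 1 to start at 1
df['id'] = df['id'].factorize()[0] + 1
print(df)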

Try:
df.reset_index(inplace=True)
Example:
import pandas as pd
import numpy as np

df = pd.DataFrame([('bird', 389.0),
                   ('bird', 24.0),
                   ('mammal', 80.5),
                   ('mammal', np.nan)],
                  index=['falcon', 'parrot', 'lion', 'monkey'],
                  columns=('class', 'max_speed'))
print(df)
class max_speed
falcon bird 389.0
parrot bird 24.0
lion mammal 80.5
monkey mammal NaN
This is how it looks; now let's replace the index:
df.reset_index(inplace=True)
print(df)
index class max_speed
0 falcon bird 389.0
1 parrot bird 24.0
2 lion mammal 80.5
3 monkey mammal NaN
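If you don't want to keep the old index as a column, reset_index also accepts drop=True; a quick sketch:
df.reset_index(drop=True, inplace=True)
print(df)
This discards the old labels and replaces the index with a plain RangeIndex (0, 1, 2, ...).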

import pandas as pd

df = pd.DataFrame({'id': ['12a4', '12a4', '3b', '3b', '3b', '4t'],
                   'col1': ['summer', 'goest', 'yes', 'No', 'why', 'Hi']})

unique_id = df.drop_duplicates(subset=['id']).reset_index(drop=True)
id_dict = dict(zip(unique_id['id'], unique_id.index))
df['id'] = df['id'].apply(lambda x: id_dict[x])
df.drop_duplicates(subset=['id']).reset_index(drop=True) removes the duplicate rows in column id and renumbers the remaining rows from 0.
# print(unique_id)
id col1
0 12a4 summer
1 3b yes
2 4t Hi
dict(zip(unique_id['id'], unique_id.index)) builds a dictionary mapping each id to its index value.
# print(id_dict)
{'12a4': 0, '3b': 1, '4t': 2}
df['id'].apply(lambda x: id_dict[x]) replaces each id in the column with its value from the dictionary.
# print(df)
id col1
0 0 summer
1 0 goest
2 1 yes
3 1 No
4 1 why
5 2 Hi
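As a side note, Series.map accepts a dict directly, so the apply/lambda lookup can be written more idiomatically (and usually faster) as:
df['id'] = df['id'].map(id_dict)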

Related

how to add custom ID in pandas dataframe [duplicate]

In pandas, how can I convert a column of a DataFrame into dtype object?
Or better yet, into a factor? (For those who speak R, in Python, how do I as.factor()?)
Also, what's the difference between pandas.Factor and pandas.Categorical?
You can use the astype method to cast a Series (one column):
df['col_name'] = df['col_name'].astype(object)
Or the entire DataFrame:
df = df.astype(object)
Update
Since version 0.15, you can use the category datatype in a Series/column:
df['col_name'] = df['col_name'].astype('category')
Note: pd.Factor was deprecated and has been removed in favor of pd.Categorical.
There's also the pd.factorize function:
# use the df data from #herrfz
In [150]: pd.factorize(df.b)
Out[150]: (array([0, 1, 0, 1, 2]), array(['yes', 'no', 'absent'], dtype=object))
In [152]: df['c'] = pd.factorize(df.b)[0]
In [153]: df
Out[153]:
a b c
0 1 yes 0
1 2 no 1
2 3 yes 0
3 4 no 1
4 5 absent 2
Factor and Categorical are the same, as far as I know. I think it was initially called Factor, and then changed to Categorical. To convert to Categorical maybe you can use pandas.Categorical.from_array, something like this:
In [27]: df = pd.DataFrame({'a' : [1, 2, 3, 4, 5], 'b' : ['yes', 'no', 'yes', 'no', 'absent']})
In [28]: df
Out[28]:
a b
0 1 yes
1 2 no
2 3 yes
3 4 no
4 5 absent
In [29]: df['c'] = pd.Categorical.from_array(df.b).labels
In [30]: df
Out[30]:
a b c
0 1 yes 2
1 2 no 1
2 3 yes 2
3 4 no 1
4 5 absent 0
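pd.Categorical.from_array and the labels attribute have since been removed from pandas; a sketch of what should be the modern equivalent, producing the same codes as above:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': ['yes', 'no', 'yes', 'no', 'absent']})

# pd.Categorical sorts the categories alphabetically, so the codes
# match the example above: 'absent' -> 0, 'no' -> 1, 'yes' -> 2
df['c'] = pd.Categorical(df['b']).codes
print(df)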

Pivoting and transposing using pandas dataframe

Suppose that I have a pandas dataframe like the one below:
import pandas as pd

df = pd.DataFrame({'fk ID': [1, 1, 2, 2],
                   'value': [3, 3, 4, 5],
                   'valID': [1, 2, 1, 2]})
The above would give me the following output:
print(df)
fk ID value valID
0 1 3 1
1 1 3 2
2 2 4 1
3 2 5 2
or
| fk ID | value | valID |
| 1 | 3 | 1 |
| 1 | 3 | 2 |
| 2 | 4 | 1 |
| 2 | 5 | 2 |
and I would like to transpose and pivot it in such a way that I get the following table, with the same order of column names:
fk ID value valID fk ID value valID
| 1 | 3 | 1 | 1 | 3 | 2 |
| 2 | 4 | 1 | 2 | 5 | 2 |
The most straightforward solution I can think of is
df = pd.DataFrame({'fk ID': [1, 1, 2, 2],
                   'value': [3, 3, 4, 5],
                   'valID': [1, 2, 1, 2]})

# concatenate the rows (Series) of each 'fk ID' group side by side
def flatten_group(g):
    return pd.concat(row for _, row in g.iterrows())

res = df.groupby('fk ID', as_index=False).apply(flatten_group)
However, DataFrame.iterrows is not ideal, and it can be very slow if the groups are large.
Furthermore, the above solution doesn't work if the 'fk ID' groups have different sizes. To see that, we can add a third group to the DataFrame
>>> df2 = df.append({'fk ID': 3, 'value': 10, 'valID': 4},
                    ignore_index=True)
>>> df2
fk ID value valID
0 1 3 1
1 1 3 2
2 2 4 1
3 2 5 2
4 3 10 4
>>> df2.groupby('fk ID', as_index=False).apply(flatten_group)
0 fk ID 1
value 3
valID 1
fk ID 1
value 3
valID 2
1 fk ID 2
value 4
valID 1
fk ID 2
value 5
valID 2
2 fk ID 3
value 10
valID 4
dtype: int64
The result is not a DataFrame as one could expect, because pandas can't align the columns of the groups.
To solve this I suggest the following solution. It should work for any group size, and it should be faster for large DataFrames.
import numpy as np

def flatten_group(g):
    # flatten each group's data into a single row
    flat_data = g.to_numpy().reshape(1, -1)
    return pd.DataFrame(flat_data)

# group the rows by 'fk ID'
groups = df.groupby('fk ID', group_keys=False)

# get the maximum group size
max_group_size = groups.size().max()

# construct the new columns by repeating the
# original columns 'max_group_size' times
new_cols = np.tile(df.columns, max_group_size)

# aggregate the flattened rows
res = groups.apply(flatten_group).reset_index(drop=True)

# update the columns
res.columns = new_cols
Output:
# df
>>> res
fk ID value valID fk ID value valID
0 1 3 1 1 3 2
1 2 4 1 2 5 2
# df2
>>> res
fk ID value valID fk ID value valID
0 1 3 1 1.0 3.0 2.0
1 2 4 1 2.0 5.0 2.0
2 3 10 4 NaN NaN NaN
You can cast df to a NumPy array, reshape it, cast it back to a DataFrame, and then rename the columns (0..5).
This also works if the values are strings rather than numbers.
import pandas as pd

df = pd.DataFrame({'fk ID': [1, 1, 2, 2],
                   'value': [3, 3, 4, 5],
                   'valID': [1, 2, 1, 2]})

nrows = 2
array = df.to_numpy().reshape((nrows, -1))
pd.DataFrame(array).rename(mapper=lambda x: df.columns[x % len(df.columns)], axis=1)
If your group sizes are guaranteed to be the same, you could merge your odd and even rows:
import pandas as pd
df = pd.DataFrame({'fk ID': [1,1,2,2],
'value': [3,3,4,5],
'valID': [1,2,1,2]})
df_even = df[df.index%2==0].reset_index(drop=True)
df_odd = df[df.index%2==1].reset_index(drop=True)
df_odd.join(df_even, rsuffix='_2')
Yields
fk ID value valID fk ID_2 value_2 valID_2
0 1 3 2 1 3 1
1 2 5 2 2 4 1
I'd expect this to be pretty performant, and it could be generalized to any number of rows per group (rather than assuming odd/even for two rows per group), but it requires the same number of rows per fk ID; one possible generalization is sketched below.
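A hedged sketch of one way such a generalization might look, numbering the rows within each group with groupby().cumcount() and pivoting them across columns (groups of uneven size come out NaN-padded):
import pandas as pd

df = pd.DataFrame({'fk ID': [1, 1, 2, 2],
                   'value': [3, 3, 4, 5],
                   'valID': [1, 2, 1, 2]})

# number the rows within each 'fk ID' group: 0, 1, ...
pos = df.groupby('fk ID').cumcount()

# pivot so each group occupies a single row; the columns become a
# MultiIndex of (original column, position within group)
wide = df.assign(pos=pos).pivot(index='fk ID', columns='pos')

# sort by position so each repetition's columns sit together
wide = wide.sort_index(axis=1, level='pos')
print(wide)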

How to make new cell based on appearance in dataframe cell

I want to create a new column in a dataframe when a value appears in an existing list-valued column and another column matches a condition.
Dataset:
name loto
0 Jason [22]
1 Molly [222]
2 Tina [232]
3 Jake [223]
4 Amy [73, 1, 2, 3]
If name == "Jason" and loto contains 22, then new = 1.
I tried to use np.where, but I'm having issues checking for an element inside the arrays.
import numpy as np
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'loto': [[22], [222], [232], [223], [73,1,2,3]]}
df = pd.DataFrame(data, columns = ['name', 'loto'])
df['new'] = np.where((22 in df['loto']) & (df[name]=="Jason"), 1, 0)
First create the value you want to check as a set, like set([22]), then map the set's issubset method over the loto column and apply the condition in .loc:
loto_val = set([22])
loto_chck = loto_val.issubset
df.loc[(df['loto'].map(loto_chck)) & (df['name'] == 'Jason'), "new"] = 1
name loto new
0 Jason [22] 1.0
1 Molly [222] NaN
2 Tina [232] NaN
3 Jake [223] NaN
4 Amy [73, 1, 2, 3] NaN
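Since .loc only assigns where the condition holds, the other rows come out as NaN; if you want the 0/1 integer column from the question, you can follow up with:
df['new'] = df['new'].fillna(0).astype(int)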
You could try:
df['new'] = ((df.apply(lambda x: 22 in x.loto, axis=1)) &
             (df.name == 'Jason')).astype(int)
Even though it's not a good idea to store lists in a DataFrame.
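As a side note, a sketch of a way to avoid the per-row apply: explode the list column, test membership, and reduce back per original row (Series.explode repeats the index, so groupby(level=0) regroups the flags):
# one boolean per original row: does its loto list contain 22?
has_22 = df['loto'].explode().eq(22).groupby(level=0).any()
df['new'] = (has_22 & (df['name'] == 'Jason')).astype(int)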

Use a row on a DF as new columns name in another DF

I would like to replace some column names of a DF with names stored in a row of another DF.
import pandas as pd
df1=pd.DataFrame({'T2': [2,3],
'T1': [4,5],
'HO': [2,7]
})
df2=pd.DataFrame({'T1' : ['cat'],
'T2' :['dog']
})
How can I replace 'T1' and 'T2' in df1 with 'dog' and 'cat', which are in df2?
You can also use the easier way:
print(df2.iloc[0])
T1 cat
T2 dog
Name: 0, dtype: object
Solution:
df1 = df1.rename(columns=df2.iloc[0])
print(df1)
HO cat dog
0 2 4 2
1 7 5 3
You can convert df2 to a dict that maps the column name to the row value using df.to_dict, then use the dict to rename the columns using df.rename. Here's how:
In [4]: df1
Out[4]:
HO T1 T2
0 2 4 2
1 7 5 3
In [5]: df2
Out[5]:
T1 T2
0 cat dog
In [6]: df2.to_dict(orient="records")
Out[6]: [{'T1': 'cat', 'T2': 'dog'}]
In [7]: df1.rename(columns=df2.to_dict(orient="records")[0])
Out[7]:
HO cat dog
0 2 4 2
1 7 5 3
You can use a mapping function (note that df2[x] returns a whole column, so take its first value):
def mapping(x):
    return df2[x].iloc[0] if x in df2 else x

df1.columns = list(map(mapping, df1.columns))
print(df1)
dog cat HO
0 2 4 2
1 3 5 7

Convert ordered levels to numeric in pandas

I was wondering, is there any function in pandas that allows me to do this?
I have a column with levels [low, medium, high].
I would like to translate them to [1, 2, 3] to perform linear regression. However, what I am currently doing is df[df['interest_level'] == 'low'] = 1. Is there a better way of doing this?
Thanks.
Use the pd.factorize() method:
df['interest_level'] = pd.factorize(df['interest_level'])[0]
You can also categorize the new numerical values (this might save a lot of memory):
Sample DataFrame:
In [34]: df = pd.DataFrame({'interest_level':np.random.choice(['medium','high','low'], 10)})
In [35]: df
Out[35]:
interest_level
0 high
1 low
2 medium
3 high
4 low
5 high
6 high
7 low
8 low
9 medium
Solution:
In [36]: df['interest_level'], cats = pd.factorize(df['interest_level'])
In [37]: df['interest_level'] = pd.Categorical(df['interest_level'], categories=np.arange(len(cats)))
In [38]: df
Out[38]:
interest_level
0 0
1 1
2 2
3 0
4 1
5 0
6 0
7 1
8 1
9 2
In [39]: cats # this can be used for the backtracing ...
Out[39]: Index(['high', 'low', 'medium'], dtype='object')
In [40]: df.memory_usage()
Out[40]:
Index 80
interest_level 34 # <---- NOTE: only 34 bytes used for 10 integers
dtype: int64
In [41]: df.dtypes
Out[41]:
interest_level category
dtype: object
You can use map:
d = {'low':1,'medium':2,'high':3}
df['interest_level'] = df['interest_level'].map(d)
Sample:
df = pd.DataFrame({'interest_level':['medium','high','low', 'low', 'medium']})
print (df)
interest_level
0 medium
1 high
2 low
3 low
4 medium
d = {'low':1,'medium':2,'high':3}
df['interest_level'] = df['interest_level'].map(d)
print (df)
interest_level
0 2
1 3
2 1
3 1
4 2
Another solution is to cast to Categorical and then use cat.codes:
categories = ['low', 'medium', 'high']
df['interest_level'] = df['interest_level'].astype('category',
                                                   categories=categories,
                                                   ordered=True).cat.codes + 1
print (df)
interest_level
0 2
1 3
2 1
3 1
4 2
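The categories and ordered keywords of astype were removed in later pandas versions; a sketch of the modern equivalent using CategoricalDtype (assuming a recent pandas):
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'interest_level': ['medium', 'high', 'low', 'low', 'medium']})

# ordered dtype low < medium < high; cat.codes are 0-based, so add 1
cat_type = CategoricalDtype(categories=['low', 'medium', 'high'], ordered=True)
df['interest_level'] = df['interest_level'].astype(cat_type).cat.codes + 1
print(df)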