Pandas - Select multiple nested rows after a groupby - pandas

Given the following dataset :
df = pd.DataFrame({'a' : [1, 1, 3],
'b' : [4, 5, 6],
'c' : [7, 8, 9],
'cat_1' : ['a', 'a', 'b'],
'cat_2' : ['a', 'b', 'c']})
df
group_cat = df.groupby(['cat_1', 'cat_2'])
agg_cat = group_cat.agg({'a':['sum','mean'], 'b':['min','max']})
print(agg_cat)
a b
sum mean min max
cat_1 cat_2
a a 1 1 4 4
b 1 1 5 5
b c 3 3 6 6
Using xs() I'm able to select specific nested columns :
print(agg_cat.xs([('a','sum'),('b','max')], axis = 1))
a b
sum max
cat_1 cat_2
a a 1 4
b 1 5
b c 3 6
But when I'm trying to apply the same logic at the row level (axis=0), I'm getting an error :
print(agg_cat.xs([('a','a'),('b','c')], axis = 0))
KeyError: (('a', 'a'), ('b', 'c'))

You need to use .loc indexer to filter data by indexes.
agg_cat.loc[[('a','a'),('b','c')]]
a b
sum mean min max
cat_1 cat_2
a a 1 1 4 4
b c 3 3 6 6

Related

Replace a single value in a column of a dataframe with 3 new column values

I would effectively like to replace a single value in a column (based on a criteria) with 3 new column values. For example:
Ch
Time
A
1 min
A
1 min
B
2 min
B
2 min
A1
A2
A3
1
2
3
2
3
4
1
1
2
B1
B2
B3
1
2
1
1
1
1
In the above table, If Ch==A, then I would like to see (All values from table 2 repeated twice and so on so forth for Ch==B etc):
ColA
ColB
ColC
1
2
3
1
3
4
1
1
2
1
2
3
1
3
4
1
1
2
How can I go about doing this efficiently? Thank you in advance for your help!
I tried replacing the value C, however I am not sure of how to insert 3 new columns.
IIUC, you can concat and merge:
df1 = pd.DataFrame({'Ch': ['A', 'A', 'B', 'B'], 'Time': ['1 min', '1 min', '2 min', '2 min']})
df2 = pd.DataFrame({'A1': [1, 2, 1], 'A2': [2, 3, 1], 'A3': [3, 4, 2]})
df3 = pd.DataFrame({'B1': [1, 1], 'B2': [2, 1], 'B3': [1, 1]})
d = {'A': df2, 'B': df3}
df1.merge(pd.concat({k: v.set_axis(['ColA', 'ColB', 'ColC'], axis=1)
for k,v in d.items()}
).droplevel(1),
left_on='Ch', right_index=True
)
Output:
Ch Time ColA ColB ColC
0 A 1 min 1 2 3
0 A 1 min 2 3 4
0 A 1 min 1 1 2
1 A 1 min 1 2 3
1 A 1 min 2 3 4
1 A 1 min 1 1 2
2 B 2 min 1 2 1
2 B 2 min 1 1 1
3 B 2 min 1 2 1
3 B 2 min 1 1 1

Retrieving values from different columns in Pandas based on a column condition [duplicate]

The operation pandas.DataFrame.lookup is "Deprecated since version 1.2.0", and has since invalidated a lot of previous answers.
This post attempts to function as a canonical resource for looking up corresponding row col pairs in pandas versions 1.2.0 and newer.
Standard LookUp Values With Default Range Index
Given the following DataFrame:
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
Col A B
0 B 1 5
1 A 2 6
2 A 3 7
3 B 4 8
I would like to be able to lookup the corresponding value in the column specified in Col:
I would like my result to look like:
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 B 4 8 8
Standard LookUp Values With a Non-Default Index
Non-Contiguous Range Index
Given the following DataFrame:
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=[0, 2, 8, 9])
Col A B
0 B 1 5
2 A 2 6
8 A 3 7
9 B 4 8
I would like to preserve the index but still find the correct corresponding Value:
Col A B Val
0 B 1 5 5
2 A 2 6 2
8 A 3 7 3
9 B 4 8 8
MultiIndex
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))
Col A B
C E B 1 5
F A 2 6
D E A 3 7
F B 4 8
I would like to preserve the index but still find the correct corresponding Value:
Col A B Val
C E B 1 5 5
F A 2 6 2
D E A 3 7 3
F B 4 8 8
LookUp with Default For Unmatched/Not-Found Values
Given the following DataFrame
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'C'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
Col A B
0 B 1 5
1 A 2 6
2 A 3 7
3 C 4 8 # Column C does not correspond with any column
I would like to look up the corresponding values if one exists otherwise I'd like to have it default to 0
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 C 4 8 0 # Default value 0 since C does not correspond
LookUp with Missing Values in the lookup Col
Given the following DataFrame:
Col A B
0 B 1 5
1 A 2 6
2 A 3 7
3 NaN 4 8 # <- Missing Lookup Key
I would like any NaN values in Col to result in a NaN value in Val
Col A B Val
0 B 1 5 5.0
1 A 2 6 2.0
2 A 3 7 3.0
3 NaN 4 8 NaN # NaN to indicate missing
Standard LookUp Values With Any Index
The documentation on Looking up values by index/column labels recommends using NumPy indexing via factorize and reindex as the replacement for the deprecated DataFrame.lookup.
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=[0, 2, 8, 9])
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
df
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 B 4 8 8
factorize is used to convert the column encode the values as an "enumerated type".
idx, col = pd.factorize(df['Col'])
# idx = array([0, 1, 1, 0], dtype=int64)
# col = Index(['B', 'A'], dtype='object')
Notice that B corresponds to 0 and A corresponds to 1. reindex is used to ensure that columns appear in the same order as the enumeration:
df.reindex(columns=col)
B A # B appears First (location 0) A appers second (location 1)
0 5 1
1 6 2
2 7 3
3 8 4
We need to create an appropriate range indexer compatible with NumPy indexing.
The standard approach is to use np.arange based on the length of the DataFrame:
np.arange(len(df))
[0 1 2 3]
Now NumPy indexing will work to select values from the DataFrame:
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
[5 2 3 8]
*Note: This approach will always work regardless of type of index.
MultiIndex
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
Col A B Val
C E B 1 5 5
F A 2 6 2
D E A 3 7 3
F B 4 8 8
Why use np.arange and not df.index directly?
Standard Contiguous Range Index
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
In this case only, there is no error as the result from np.arange is the same as the df.index.
df
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 B 4 8 8
Non-Contiguous Range Index Error
Raises IndexError:
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=[0, 2, 8, 9])
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
IndexError: index 8 is out of bounds for axis 0 with size 4
MultiIndex Error
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
Raises IndexError:
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
LookUp with Default For Unmatched/Not-Found Values
There are a few approaches.
First let's look at what happens by default if there is a non-corresponding value:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'C'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
# Col A B
# 0 B 1 5
# 1 A 2 6
# 2 A 3 7
# 3 C 4 8
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
Col A B Val
0 B 1 5 5.0
1 A 2 6 2.0
2 A 3 7 3.0
3 C 4 8 NaN # NaN Represents the Missing Value in C
If we look at why the NaN values are introduced, we will find that when factorize goes through the column it will enumerate all groups present regardless of whether they correspond to a column or not.
For this reason, when we reindex the DataFrame we will end up with the following result:
idx, col = pd.factorize(df['Col'])
df.reindex(columns=col)
idx = array([0, 1, 1, 2], dtype=int64)
col = Index(['B', 'A', 'C'], dtype='object')
df.reindex(columns=col)
B A C
0 5 1 NaN
1 6 2 NaN
2 7 3 NaN
3 8 4 NaN # Reindex adds the missing column with the Default `NaN`
If we want to specify a default value, we can specify the fill_value argument of reindex which allows us to modify the behaviour as it relates to missing column values:
idx, col = pd.factorize(df['Col'])
df.reindex(columns=col, fill_value=0)
idx = array([0, 1, 1, 2], dtype=int64)
col = Index(['B', 'A', 'C'], dtype='object')
df.reindex(columns=col, fill_value=0)
B A C
0 5 1 0
1 6 2 0
2 7 3 0
3 8 4 0 # Notice reindex adds missing column with specified value `0`
This means that we can do:
idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(
columns=col,
fill_value=0 # Default value for Missing column values
).to_numpy()[np.arange(len(df)), idx]
df:
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 C 4 8 0
*Notice the dtype of the column is int, since NaN was never introduced, and, therefore, the column type was not changed.
LookUp with Missing Values in the lookup Col
factorize has a default na_sentinel=-1, meaning that when NaN values appear in the column being factorized the resulting idx value is -1
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
# Col A B
# 0 B 1 5
# 1 A 2 6
# 2 A 3 7
# 3 NaN 4 8 # <- Missing Lookup Key
idx, col = pd.factorize(df['Col'])
# idx = array([ 0, 1, 1, -1], dtype=int64)
# col = Index(['B', 'A'], dtype='object')
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
# Col A B Val
# 0 B 1 5 5
# 1 A 2 6 2
# 2 A 3 7 3
# 3 NaN 4 8 4 <- Value From A
This -1 means that, by default, we'll be pulling from the last column when we reindex. Notice the col still only contains the values B and A. Meaning, that we will end up with the value from A in Val for the last row.
The easiest way to handle this is to fillna Col with some value that cannot be found in the column headers.
Here I use the empty string '':
idx, col = pd.factorize(df['Col'].fillna(''))
# idx = array([0, 1, 1, 2], dtype=int64)
# col = Index(['B', 'A', ''], dtype='object')
Now when I reindex, the '' column will contain NaN values meaning that the lookup produces the desired result:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
idx, col = pd.factorize(df['Col'].fillna(''))
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
df:
Col A B Val
0 B 1 5 5.0
1 A 2 6 2.0
2 A 3 7 3.0
3 NaN 4 8 NaN # Missing as expected
Other Approaches to LookUp
There are 2 other approaches to performing this operation:
apply (Intuitive, but quite slow)
apply can be used on axis=1 in order to use the Column values as the key:
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
df['Val'] = df.apply(lambda row: row[row['Col']], axis=1)
df
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 B 4 8 8
This operation will work regardless of index type:
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]},
index=[0, 2, 8, 9])
# Col A B
# 0 B 1 5
# 2 A 2 6
# 8 A 3 7
# 9 B 4 8
df['Val'] = df.apply(lambda row: row[row['Col']], axis=1)
df:
Col A B Val
0 B 1 5 5
2 A 2 6 2
8 A 3 7 3
9 B 4 8 8
When dealing with Missing/Non-Corresponding Values we can use Series.get can be used to remedy this issue:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'C', np.nan],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
# Col A B
# 0 B 1 5
# 1 A 2 6
# 2 C 3 7 <- Non Corresponding
# 3 NaN 4 8 <- Missing
df['Val'] = df.apply(lambda row: row.get(row['Col']), axis=1)
Col A B Val
0 B 1 5 5.0
1 A 2 6 2.0
2 C 3 7 NaN # Missing value
3 NaN 4 8 NaN # Missing value
With Default Value
df['Val'] = df.apply(lambda row: row.get(row['Col'], default=-1), axis=1)
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 C 3 7 -1 # Default -1
3 NaN 4 8 -1 # Default -1
apply is extremely flexible and modifications are straightforward, however, the general iterative approach, as well as all the individual Series lookups can become extremely costly in large DataFrames.
get_indexer (limited)
Index.get_indexer can be used to convert the column to index values into an indexer for the DataFrame. This means there is no reason to reindex the DataFrame as the indexer corresponds to the DataFrame as a whole.
import pandas as pd
df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
df['Val'] = df.to_numpy()[df.index, df.columns.get_indexer(df['Col'])]
df
Col A B Val
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 B 4 8 8
This approach is reasonably fast, however, missing values are represented by -1 meaning that if a value is missing it will grab the value from the -1 column (The last column in the DataFrame).
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8],
'Col': ['B', 'A', 'A', 'C']})
# A B Col <- Col is now the Last Col
# 0 1 5 B
# 1 2 6 A
# 2 3 7 A
# 3 4 8 C <- Notice Col `C` does not correspond to a Valid Column Header
df['Val'] = df.to_numpy()[df.index, df.columns.get_indexer(df['Col'])]
df:
A B Col Val
0 1 5 B 5
1 2 6 A 2
2 3 7 A 3
3 4 8 C C # <- Value from the last column in the DataFrame (index -1)
It is also notable that not reindexing the DataFrame means converting the entire DataFrame to numpy. This can be very costly if there are many unrelated columns that all need converted:
import numpy as np
import pandas as pd
df = pd.DataFrame({1: 10,
2: 20,
3: 't',
4: 40,
5: np.nan,
'Col': ['B', 'A', 'A', 'B'],
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]})
df['Val'] = df.to_numpy()[df.index, df.columns.get_indexer(df['Col'])]
df.to_numpy()
[[10 20 't' 40 nan 'B' 1 5 5]
[10 20 't' 40 nan 'A' 2 6 2]
[10 20 't' 40 nan 'A' 3 7 3]
[10 20 't' 40 nan 'B' 4 8 8]]
Compared to the reindexing approach which only contains columns relevant to the column values:
df.reindex(columns=['B', 'A']).to_numpy()
[[5 1]
[6 2]
[7 3]
[8 4]]
Another option is to build a tuple of the lookup columns, pivot the dataframe, and select the relevant columns with the tuples:
cols = [(ent, ent) for ent in df.Col.unique()]
df.assign(Val = df.pivot(index = None, columns = 'Col')
.reindex(columns = cols)
.ffill(axis=1)
.iloc[:, -1])
Col A B Val
0 B 1 5 5.0
2 A 2 6 2.0
8 A 3 7 3.0
9 B 4 8 8.0
Another possible method is to use melt:
df['value'] = (df.melt('Col', ignore_index=False)
.loc[lambda x: x['Col'] == x['variable'], 'value'])
print(df)
# Output:
Col A B value
0 B 1 5 5
1 A 2 6 2
2 A 3 7 3
3 B 4 8 8
This method also works with Missing/Non-Corresponding Values:
df['value'] = (df.melt('Col', ignore_index=False)
.loc[lambda x: x['Col'] == x['variable'], 'value'])
print(df)
# Output
Col A B value
0 B 1 5 5.0
1 A 2 6 2.0
2 C 3 7 NaN
3 NaN 4 8 NaN
You can replace .loc[...] by query(...) but it's little slower although more expressive:
df['value'] = df.melt('Col', ignore_index=False).query('Col == variable')['value']

Iterate over rows and subtract values in pandas df

I have the following table:
ID
Qty_1
Qty_2
A
1
10
A
2
0
A
3
0
B
3
29
B
2
0
B
1
0
I want to iterate based on the ID, and subtract Qty_2 - Qty_1 and update the next row with that result.
The result would be:
ID
Qty_1
Qty_2
A
1
10
A
2
8
A
3
5
B
3
29
B
2
27
B
1
26
Ideally, I would also like to start by subtracting the first row end a new ID appears and only after that start the loop:
ID
Qty_1
Qty_2
A
1
9
A
2
7
A
3
4
B
3
26
B
2
24
B
1
23
Each of the solutions is ok! Thank you!
First compute the difference between 'Qty_1' and 'Qty_2' row by row, then group by 'ID' and compute cumulative sum:
df['Qty_2'] = df.assign(Qty_2=df['Qty_2'].sub(df['Qty_1'])) \
.groupby('ID')['Qty_2'].cumsum()
print(df)
# Output:
ID Qty_1 Qty_2
0 A 1 9
1 A 2 7
2 A 3 4
3 B 3 26
4 B 2 24
5 B 1 23
Setup:
data = {'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
'Qty_1': [1, 2, 3, 3, 2, 1],
'Qty_2': [10, 0, 0, 29, 0, 0]}
df = pd.DataFrame(data)

Why does groupby in Pandas place counts under existing column names?

I'm coming from R and do not understand the default groupby behavior in pandas. I create a dataframe and groupby the column 'id' like so:
d = {'id': [1, 2, 3, 4, 2, 2, 4], 'color': ["r","r","b","b","g","g","r"], 'size': [1,2,1,2,1,3,4]}
df = DataFrame(data=d)
freq = df.groupby('id').count()
When I check the header of the resulting dataframe, all the original columns are there instead of just 'id' and 'freq' (or 'id' and 'count').
list(freq)
Out[117]: ['color', 'size']
When I display the resulting dataframe, the counts have replaced the values for the columns not employed in the count:
freq
Out[114]:
color size
id
1 1 1
2 3 3
3 1 1
4 2 2
I was planning to use groupby and then to filter by the frequency column. Do I need to delete the unused columns and add the frequency column manually? What is the usual approach?
count aggregate all columns of DataFrame with excluding NaNs values, if need id as column use as_index=False parameter or reset_index():
freq = df.groupby('id', as_index=False).count()
print (freq)
id color size
0 1 1 1
1 2 3 3
2 3 1 1
3 4 2 2
So if add NaNs in each column should be differences:
d = {'id': [1, 2, 3, 4, 2, 2, 4],
'color': ["r","r","b","b","g","g","r"],
'size': [np.nan,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)
freq = df.groupby('id', as_index=False).count()
print (freq)
id color size
0 1 1 0
1 2 3 3
2 3 1 1
3 4 2 2
You can specify columns for count:
freq = df.groupby('id', as_index=False)['color'].count()
print (freq)
id color
0 1 1
1 2 3
2 3 1
3 4 2
If need count with NaNs:
freq = df.groupby('id').size().reset_index(name='count')
print (freq)
id count
0 1 1
1 2 3
2 3 1
3 4 2
d = {'id': [1, 2, 3, 4, 2, 2, 4],
'color': ["r","r","b","b","g","g","r"],
'size': [np.nan,2,1,2,1,3,4]}
df = pd.DataFrame(data=d)
freq = df.groupby('id').size().reset_index(name='count')
print (freq)
id count
0 1 1
1 2 3
2 3 1
3 4 2
Thanks Bharath for pointed for another solution with value_counts, differences are explained here:
freq = df['id'].value_counts().rename_axis('id').to_frame('freq').reset_index()
print (freq)
id freq
0 2 3
1 4 2
2 3 1
3 1 1

Separate aggregated data in different rows [duplicate]

This question already has answers here:
How can I replicate rows of a Pandas DataFrame?
(10 answers)
Closed 11 months ago.
I want to replicate rows in a Pandas Dataframe. Each row should be repeated n times, where n is a field of each row.
import pandas as pd
what_i_have = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'n' : [ 1, 2, 3],
'v' : [ 10, 13, 8]
})
what_i_want = pd.DataFrame(data={
'id': ['A', 'B', 'B', 'C', 'C', 'C'],
'v' : [ 10, 13, 13, 8, 8, 8]
})
Is this possible?
You can use Index.repeat to get repeated index values based on the column then select from the DataFrame:
df2 = df.loc[df.index.repeat(df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
Or you could use np.repeat to get the repeated indices and then use that to index into the frame:
df2 = df.loc[np.repeat(df.index.values, df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
After which there's only a bit of cleaning up to do:
df2 = df2.drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Note that if you might have duplicate indices to worry about, you could use .iloc instead:
df.iloc[np.repeat(np.arange(len(df)), df["n"])].drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
which uses the positions, and not the index labels.
You could use set_index and repeat
In [1057]: df.set_index(['id'])['v'].repeat(df['n']).reset_index()
Out[1057]:
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Details
In [1058]: df
Out[1058]:
id n v
0 A 1 10
1 B 2 13
2 C 3 8
It's something like the uncount in tidyr:
https://tidyr.tidyverse.org/reference/uncount.html
I wrote a package (https://github.com/pwwang/datar) that implements this API:
from datar import f
from datar.tibble import tribble
from datar.tidyr import uncount
what_i_have = tribble(
f.id, f.n, f.v,
'A', 1, 10,
'B', 2, 13,
'C', 3, 8
)
what_i_have >> uncount(f.n)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
Not the best solution, but I want to share this: you could also use pandas.reindex() and .repeat():
df.reindex(df.index.repeat(df.n)).drop('n', axis=1)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
You can further append .reset_index(drop=True) to reset the .index.