Separate aggregated data in different rows [duplicate] - pandas

I want to replicate rows in a Pandas Dataframe. Each row should be repeated n times, where n is a field of each row.
import pandas as pd
what_i_have = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'n' : [ 1, 2, 3],
'v' : [ 10, 13, 8]
})
what_i_want = pd.DataFrame(data={
'id': ['A', 'B', 'B', 'C', 'C', 'C'],
'v' : [ 10, 13, 13, 8, 8, 8]
})
Is this possible?

You can use Index.repeat to get repeated index values based on the column then select from the DataFrame:
df2 = df.loc[df.index.repeat(df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
Or you could use np.repeat to get the repeated indices and then use that to index into the frame:
import numpy as np
df2 = df.loc[np.repeat(df.index.values, df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
After which there's only a bit of cleaning up to do:
df2 = df2.drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Note that if you might have duplicate indices to worry about, you could use .iloc instead:
df.iloc[np.repeat(np.arange(len(df)), df["n"])].drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
which uses the positions, and not the index labels.
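To make the duplicate-index caveat concrete, here is a minimal sketch. The frame with a repeated index label is a made-up example, not data from the question. With labels, a lookup matches every row carrying that label, so it can return more rows than intended; positions avoid that.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index contains a duplicate label (0 appears twice).
df = pd.DataFrame(
    {"id": ["A", "B", "C"], "n": [1, 2, 3], "v": [10, 13, 8]},
    index=[0, 0, 1],
)

# One position per output row: position i is repeated df["n"].iloc[i] times.
positions = np.repeat(np.arange(len(df)), df["n"].to_numpy())

# Select by position, so duplicate labels cannot match extra rows.
result = df.iloc[positions].drop(columns="n").reset_index(drop=True)
print(result)
```

The result should have exactly `df["n"].sum()` rows, whatever the index looks like.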

You could use set_index and repeat
In [1057]: df.set_index(['id'])['v'].repeat(df['n']).reset_index()
Out[1057]:
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Details
In [1058]: df
Out[1058]:
id n v
0 A 1 10
1 B 2 13
2 C 3 8

It's something like the uncount in tidyr:
https://tidyr.tidyverse.org/reference/uncount.html
I wrote a package (https://github.com/pwwang/datar) that implements this API:
from datar import f
from datar.tibble import tribble
from datar.tidyr import uncount
what_i_have = tribble(
f.id, f.n, f.v,
'A', 1, 10,
'B', 2, 13,
'C', 3, 8
)
what_i_have >> uncount(f.n)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8

Not the best solution, but I want to share this: you could also use DataFrame.reindex() with Index.repeat():
df.reindex(df.index.repeat(df.n)).drop('n', axis=1)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
You can further append .reset_index(drop=True) to reset the .index.

Related

Passing Tuple to a function via apply

I am trying to run the function below, which takes two points:
pointA = (2, 3)
pointB = (4, 5)
def Somefunc(pointA, pointB):
    x = pointA[0] + pointB[1]
    return x
Now, when I try to create a separate column based on this function, it throws errors like "cannot convert the series to <class 'float'>", so I tried this:
df['T']=df.apply(Somefunc((df['A'].apply(lambda x: float(x)),df['B'].apply(lambda x: float(x))),\
(df['C'].apply(lambda x: float(x)),df['D'].apply(lambda x: float(x)))),axis=0))
Sample dataframe below;
A B C D
1 2 3 5
2 4 7 8
4 7 9 0
Any help will be appreciated.
This is the best guess I can make as to what you're trying to do:
df['T']=df.apply(lambda row: [(row['A'],row['B']),(row['C'],row['D'])],axis=1)
Edit: to apply your function:
df['T'] = df.apply(lambda row: Somefunc((row['A'],row['B']),(row['C'],row['D'])),axis=1)
That being said, if all you need is each row's values packed into a tuple, this is quicker and more idiomatic:
>>> df
A B C D
0 2 7 3 3
1 3 1 5 7
2 2 0 6 2
3 3 9 5 9
4 0 2 3 7
>>> df['T']=df.apply(tuple,axis=1)
>>> df
A B C D T
0 2 7 3 3 (2, 7, 3, 3)
1 3 1 5 7 (3, 1, 5, 7)
2 2 0 6 2 (2, 0, 6, 2)
3 3 9 5 9 (3, 9, 5, 9)
4 0 2 3 7 (0, 2, 3, 7)
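For the function in the question specifically, the row-wise apply can probably be skipped altogether. A minimal sketch, assuming point A is built from columns A and B and point B from columns C and D (as in the answer above), so that pointA[0] + pointB[1] is just column A plus column D:

```python
import pandas as pd

def Somefunc(pointA, pointB):
    # Same arithmetic as the question's function: first coordinate of
    # point A plus second coordinate of point B.
    return pointA[0] + pointB[1]

# Sample data from the question.
df = pd.DataFrame({"A": [1, 2, 4], "B": [2, 4, 7], "C": [3, 7, 9], "D": [5, 8, 0]})

# Row-wise version: build the two points from each row and call the function.
df["T_apply"] = df.apply(
    lambda row: Somefunc((row["A"], row["B"]), (row["C"], row["D"])), axis=1
)

# Vectorized version (assumption: points are (A, B) and (C, D)),
# so the function reduces to column A plus column D.
df["T_vec"] = df["A"] + df["D"]
print(df)
```

Both columns should hold the same values; the vectorized form avoids calling Python code once per row. The column names T_apply and T_vec are only for comparison here.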

Dataframe count of columns matching value in another column in that row

How can I count, for each row, the columns that hold the same value as a specified column, in a dataframe with a large number of rows?
For instance, given the df below:
df = pd.DataFrame(np.random.randint(0,10,size=(5, 4)), columns=list('ABCD'))
df.index.name = 'id'
A B C D
id
0 7 6 6 2
1 6 5 3 5
2 8 8 0 9
3 0 2 8 9
4 4 3 8 5
bc_cols = ['B', 'C']
df['BC_max'] = df[bc_cols].max(axis=1)
A B C D BC_max
id
0 7 6 6 2 6
1 6 5 3 5 5
2 8 8 0 9 8
3 0 2 8 9 8
4 4 3 8 5 8
For each row, we want to get the number of columns whose value matches the max. I was able to get there by doing this:
df["BC_freq"] = df[bc_cols].stack().groupby(by='id').apply(lambda g: g[g==g.max()].count())
A B C D BC_max BC_freq
id
0 7 6 6 2 6 2
1 6 5 3 5 5 1
2 8 8 0 9 8 1
3 0 2 8 9 8 1
4 4 3 8 5 8 1
But this is turning out to be very inefficient and slow. We need to do this on a fairly large dataframe with several hundred thousand rows so I am looking for an efficient way to do this. Any ideas?
Once you have BC_max why not re-use it:
def get_bc_freq(row):
    if (row.B == row.BC_max) and (row.C == row.BC_max):
        return 2
    elif (row.B == row.BC_max) or (row.C == row.BC_max):
        return 1
    return 0
df['freq'] = df.apply(lambda row: get_bc_freq(row), axis=1)
Or the prettier one-liner:
df['freq'] = df.apply(lambda row: [row.B, row.C].count(row.BC_max), axis=1)
UPDATE - to make the columns you use more dynamic you could use list comprehension (not sure how much this helps with performance but...):
cols_to_use = ['B', 'C']
df['freq'] = df.apply(lambda row: [row[x] for x in cols_to_use].count(row.BC_max), axis=1)
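Since the question is about speed on several hundred thousand rows, a column-wise comparison may be worth trying instead of a row-wise apply. This is a sketch of a different technique from the answers above (DataFrame.eq followed by sum), using the example values from the question; I have not benchmarked it on the asker's data.

```python
import pandas as pd

# Example values from the question, with "id" as the index name.
df = pd.DataFrame(
    {
        "A": [7, 6, 8, 0, 4],
        "B": [6, 5, 8, 2, 3],
        "C": [6, 3, 0, 8, 8],
        "D": [2, 5, 9, 9, 5],
    }
)
df.index.name = "id"

bc_cols = ["B", "C"]
df["BC_max"] = df[bc_cols].max(axis=1)

# Compare every selected column against the row maximum (aligned on the
# row index), then count the True values in each row.
df["BC_freq"] = df[bc_cols].eq(df["BC_max"], axis=0).sum(axis=1)
print(df)
```

The comparison runs once per column rather than once per row, and bc_cols can be any list of column names.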

Append dataframe in specific row

I have a dataframe in the following format:
a b label
1 5 A
2 6 A
3 7 A
4 8 B
1 5 B
2 6 B
5 6 C
3 2 C
I want to append a new dataframe to it:
a b label
3 4 A
The result becomes this:
a b label
1 5 A
2 6 A
3 7 A
4 8 B
1 5 B
2 6 B
5 6 C
3 2 C
3 4 A <-- New Data
My question is: how can I make the new data land at the end of its label group, like this, every time I append new data?
a b label
1 5 A
2 6 A
3 7 A
3 4 A <-- New Data
4 8 B
1 5 B
2 6 B
5 6 C
3 2 C
This is my code
import pandas as pd
df1 = pd.DataFrame({"a":[1, 2, 3, 4, 1, 2,5,3],
"b":[5, 6, 7, 8, 5, 6,6,2],
"label":['A','A','A','B','B','B','C','C']})
new_data = pd.DataFrame({"a":[3],
"b":[4],
"label":['A']})
df1 = df1.append(new_data,ignore_index = True)
You can simply sort it on the label column after the data frame append
import numpy as np
import pandas as pd
df1 = pd.DataFrame({"a":[1, 2, 3, 4, 1, 2,5,3],
"b":[5, 6, 7, 8, 5, 6,6,2],
"label":['A','A','A','B','B','B','C','C']})
new_data = pd.DataFrame({"a":[3],
"b":[4],
"label":['A']})
df1 = df1.append(new_data,ignore_index = True).sort_values(by='label')
Result :
a b label
1 5 A
2 6 A
3 7 A
3 4 A <-- new data here
4 8 B
1 5 B
2 6 B
5 6 C
3 2 C
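Two caveats on the approach above. DataFrame.append was removed in pandas 2.0, so on current versions pd.concat is the replacement. Also, the default sort algorithm is not guaranteed to be stable, so the order inside each label group may not be preserved. A minimal sketch using the question's data, with a stable sort so the new row ends up last in its group:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3, 4, 1, 2, 5, 3],
                    "b": [5, 6, 7, 8, 5, 6, 6, 2],
                    "label": ["A", "A", "A", "B", "B", "B", "C", "C"]})
new_data = pd.DataFrame({"a": [3],
                         "b": [4],
                         "label": ["A"]})

# Concatenate, then sort by label with a stable algorithm (mergesort) so
# rows keep their existing relative order within each label.
df1 = (
    pd.concat([df1, new_data], ignore_index=True)
    .sort_values(by="label", kind="mergesort")
    .reset_index(drop=True)
)
print(df1)
```

If this runs on every append and the frame is large, re-sorting each time may get expensive; collecting the new rows and sorting once at the end is one alternative.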

Unstack a single column dataframe

I have a dataframe that looks like this:
statistics
0 2013-08
1 4
2 8
3 2013-09
4 7
5 13
6 2013-10
7 2
8 10
And I need it to look like this:
statistics X Y
0 2013-08 4 8
1 2013-09 7 13
2 2013-10 2 10
It would be useful to find a way that doesn't depend on the number of rows, as I want to use it in a loop and the number of original rows might change. However, the output should always have these 3 columns.
What you are doing is not an unstack operation; you are trying to do a reshape.
You can do this with NumPy's reshape method. The variable n_cols is the number of columns you want.
Here is an example:
df = pd.DataFrame(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L'], columns=['col'])
df
col
0 A
1 B
2 C
3 D
4 E
5 F
6 G
7 H
8 I
9 J
10 K
11 L
n_cols = 3
pd.DataFrame(df.values.reshape(int(len(df)/n_cols), n_cols))
0 1 2
0 A B C
1 D E F
2 G H I
3 J K L
import pandas as pd
data = pd.read_csv('data6.csv')
x=[]
y=[]
statistics= []
for i in range(0,len(data)):
    if i%3==0:
        statistics.append(data['statistics'][i])
    elif i%3==1:
        x.append(data['statistics'][i])
    elif i%3 == 2:
        y.append(data['statistics'][i])
data1 = pd.DataFrame({'statistics':statistics,'x':x,'y':y})
data1
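Applied to the data in the question, the reshape approach could look like the sketch below. The frame is typed in by hand here rather than read from a file, and passing -1 lets NumPy work out the number of rows so the code does not depend on the input length.

```python
import pandas as pd

# The question's single-column frame, entered by hand for illustration.
df = pd.DataFrame(
    {"statistics": ["2013-08", 4, 8, "2013-09", 7, 13, "2013-10", 2, 10]}
)

n_cols = 3
# -1 lets NumPy infer the number of rows; reshape raises a ValueError if
# the number of values is not a multiple of n_cols.
wide = pd.DataFrame(
    df["statistics"].to_numpy().reshape(-1, n_cols),
    columns=["statistics", "X", "Y"],
)
print(wide)
```

The X and Y columns come out with object dtype because the source column mixes dates and numbers; converting them with pd.to_numeric afterwards is probably what you want if you need to do arithmetic.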

How to get column name based on multiple columns in pandas?

Goal:
Create columns
fst_imp: the name of the column that holds the smallest value in each row.
snd_imp: the name of the column that holds the second-smallest value in each row.
trd_imp: the name of the column that holds the third-smallest value in each row.
Example result:
A B C fst_imp snd_imp trd_imp
0 1 2 3 A B C
1 6 5 4 C B A
2 7 9 8 A C B
Here is one potential solution using numpy.argsort, the pandas.DataFrame constructor and DataFrame.join:
# Setup
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': {0: 1, 1: 6, 2: 7}, 'B': {0: 2, 1: 5, 2: 9}, 'C': {0: 3, 1: 4, 2: 8}})
df.join(pd.DataFrame([df.columns.values[x] for x in np.argsort(df.values)],
columns=['fst_imp', 'snd_imp', 'trd_imp']))
[out]
A B C fst_imp snd_imp trd_imp
0 1 2 3 A B C
1 6 5 4 C B A
2 7 9 8 A C B
Or a bit more scalable...
df.join(pd.DataFrame([df.columns.values[x] for x in np.argsort(df.values)]))
[out]
A B C 0 1 2
0 1 2 3 A B C
1 6 5 4 C B A
2 7 9 8 A C B
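A variation on the same idea that avoids the Python-level list comprehension and names the new columns for any number of input columns. This is only a sketch: the imp_1, imp_2, ... names are made up for illustration, and ties are broken by whatever order argsort returns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 6, 7], "B": [2, 5, 9], "C": [3, 4, 8]})

# Column positions sorted by value within each row (smallest first).
order = np.argsort(df.to_numpy(), axis=1)

# Index the column labels with the whole 2-D position array at once.
ranked = pd.DataFrame(
    df.columns.to_numpy()[order],
    index=df.index,
    # Hypothetical column names: imp_1 is the smallest, imp_2 the next, ...
    columns=[f"imp_{i + 1}" for i in range(df.shape[1])],
)
result = df.join(ranked)
print(result)
```

Passing index=df.index keeps the join aligned even when the original frame does not have a default 0..n-1 index.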