In pandas I have a Series and a MultiIndex:
s = pd.Series([1,2,3,4], index=['w', 'x', 'y', 'z'])
idx = pd.MultiIndex.from_product([['a', 'b'], ['c', 'd']])
What is the best way to create a DataFrame that has idx as its index and s as the values of every row, preserving the index of s as the columns?
df =
w x y z
a c 1 2 3 4
d 1 2 3 4
b c 1 2 3 4
d 1 2 3 4
Use the pd.DataFrame constructor followed by assign
pd.DataFrame(index=idx).assign(**s)
w x y z
a c 1 2 3 4
d 1 2 3 4
b c 1 2 3 4
d 1 2 3 4
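This works because **s unpacks the Series into keyword arguments (w=1, x=2, ...) and assign broadcasts each scalar down the new index, so the index labels of s must be strings. A dict of scalars passed straight to the constructor does the same thing (a sketch, not one of the original answers):
pd.DataFrame(s.to_dict(), index=idx)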
You can use numpy.repeat along a new leading axis to duplicate the data and pass the result to the DataFrame constructor:
arr = np.repeat(s.values[None, :], len(idx), axis=0)
df = pd.DataFrame(arr, index=idx, columns=s.index)
print (df)
w x y z
a c 1 2 3 4
d 1 2 3 4
b c 1 2 3 4
d 1 2 3 4
Timings:
np.random.seed(123)
s = pd.Series(np.random.randint(10, size=1000))
s.index = s.index.astype(str)
idx = pd.MultiIndex.from_product([np.random.randint(10, size=250), ['a','b','c', 'd']])
In [32]: %timeit (pd.DataFrame(np.repeat(s.values, len(idx)).reshape(len(idx), -1), index=idx, columns=s.index))
100 loops, best of 3: 3.94 ms per loop
In [33]: %timeit (pd.DataFrame(index=idx).assign(**s))
1 loop, best of 3: 332 ms per loop
In [34]: %timeit pd.DataFrame([s]*len(idx),idx,s.index)
10 loops, best of 3: 82.9 ms per loop
Use [s]*len(idx) as the data, idx as the index and s.index as the columns to reconstruct a df.
pd.DataFrame([s]*len(idx), idx, s.index)
Out[56]:
w x y z
a c 1 2 3 4
d 1 2 3 4
b c 1 2 3 4
d 1 2 3 4
I have this simple dataframe df:
df = pd.DataFrame({'c':[1,1,1,2,2,2,2],'type':['m','n','o','m','m','n','n']})
My goal is to count the values of type for each c, and then add a column with the size of each c group. So starting with:
In [27]: g = df.groupby('c')['type'].value_counts().reset_index(name='t')
In [28]: g
Out[28]:
c type t
0 1 m 1
1 1 n 1
2 1 o 1
3 2 m 2
4 2 n 2
the first problem is solved. Then I can also:
In [29]: a = df.groupby('c').size().reset_index(name='size')
In [30]: a
Out[30]:
c size
0 1 3
1 2 4
How can I add the size column directly to the first dataframe? So far I used map as:
In [31]: a.index = a['c']
In [32]: g['size'] = g['c'].map(a['size'])
In [33]: g
Out[33]:
c type t size
0 1 m 1 3
1 1 n 1 3
2 1 o 1 3
3 2 m 2 4
4 2 n 2 4
which works, but is there a more straightforward way to do this?
Use transform to add a column back from a groupby aggregation; transform returns a Series with its index aligned to the original df:
In [123]:
g = df.groupby('c')['type'].value_counts().reset_index(name='t')
g['size'] = df.groupby('c')['type'].transform('size')
g
Out[123]:
c type t size
0 1 m 1 3
1 1 n 1 3
2 1 o 1 3
3 2 m 2 4
4 2 n 2 4
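Note that the assignment above relies on the default RangeIndex of g overlapping the index of df. If you prefer an explicit key, a merge on c does the same job (a sketch built from the frames already defined in the question, not part of the original answer):
g = df.groupby('c')['type'].value_counts().reset_index(name='t')
a = df.groupby('c').size().reset_index(name='size')
g = g.merge(a, on='c')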
Another solution with transform and len:
df['size'] = df.groupby('c')['type'].transform(len)
print (df)
c type size
0 1 m 3
1 1 n 3
2 1 o 3
3 2 m 4
4 2 m 4
5 2 n 4
6 2 n 4
Another solution with Series.map and Series.value_counts:
df['size'] = df['c'].map(df['c'].value_counts())
print (df)
c type size
0 1 m 3
1 1 n 3
2 1 o 3
3 2 m 4
4 2 m 4
5 2 n 4
6 2 n 4
You can compute the groupby object once and use it multiple times:
g = df.groupby('c')['type']
df = g.value_counts().reset_index(name='counts')
df['size'] = g.transform('size')
or
g.value_counts().reset_index(name='counts').assign(size=g.transform('size'))
Output:
c type counts size
0 1 m 1 3
1 1 n 1 3
2 1 o 1 3
3 2 m 2 4
4 2 n 2 4
I have two DataFrames like these:
df1
  a b c
0 1 2 3
1 2 3 4
2 3 4 5
df2
  x y z
0 T T F
1 F T T
2 F T F
I want to merge these DataFrames by interleaving their columns one by one, like this:
df
  a x b y c z
0 1 T 2 T 3 F
1 2 F 3 T 4 T
2 3 F 4 T 5 F
What's your idea? How can we merge, append, or concat to get this?
I used this code; it works dynamically:
df = pd.DataFrame()
for i in range(0, 6):
    if i % 2 == 0:
        j = i / 2
        df.loc[:, i] = df1.iloc[:, int(j)]
    else:
        j = (i - 1) / 2
        df.loc[:, i] = df2.iloc[:, int(j)]
And it works correctly!
Try:
df = pd.concat([df1, df2], axis=1)
df = df[['a','x','b','y','c','z']]
Prints:
a x b y c z
0 1 T 2 T 3 F
1 2 F 3 T 4 T
2 3 F 4 T 5 F
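If the column names are not known in advance, the interleaved order can be built dynamically (a sketch, assuming both frames have the same number of columns and a shared row index):
df = pd.concat([df1, df2], axis=1)
order = [col for pair in zip(df1.columns, df2.columns) for col in pair]
df = df[order]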
I have a dataframe that looks like this:
statistics
0 2013-08
1 4
2 8
3 2013-09
4 7
5 13
6 2013-10
7 2
8 10
And I need it to look like this:
statistics X Y
0 2013-08 4 8
1 2013-09 7 13
2 2013-10 2 10
It would be useful to find a way that doesn't depend on the number of rows, as I want to use it in a loop and the number of original rows might change. However, the output should always have these 3 columns.
What you are doing is not an unstack operation; it is a reshape.
You can do this with numpy's reshape method. The variable n_cols is the number of columns you are looking for.
Here is an example:
df = pd.DataFrame(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L'], columns=['col'])
df
col
0 A
1 B
2 C
3 D
4 E
5 F
6 G
7 H
8 I
9 J
10 K
11 L
n_cols = 3
pd.DataFrame(df.values.reshape(int(len(df)/n_cols), n_cols))
0 1 2
0 A B C
1 D E F
2 G H I
3 J K L
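Applied to the question's frame, that might look like the following (a sketch, assuming df is the original statistics frame, its length is a multiple of 3, and the repeating order is date, X, Y):
n_cols = 3
out = pd.DataFrame(df['statistics'].values.reshape(-1, n_cols),
                   columns=['statistics', 'X', 'Y'])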
import pandas as pd
data = pd.read_csv('data6.csv')
x = []
y = []
statistics = []
for i in range(0, len(data)):
    if i % 3 == 0:
        statistics.append(data['statistics'][i])
    elif i % 3 == 1:
        x.append(data['statistics'][i])
    elif i % 3 == 2:
        y.append(data['statistics'][i])
data1 = pd.DataFrame({'statistics': statistics, 'x': x, 'y': y})
data1
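The same split can also be written without the explicit loop by slicing every third row (a sketch, assuming the same data frame and that its length is a multiple of 3):
s = data['statistics']
data1 = pd.DataFrame({'statistics': s.iloc[0::3].values,
                      'x': s.iloc[1::3].values,
                      'y': s.iloc[2::3].values})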
import pandas as pd
df = pd.DataFrame(columns=['A','B'])
df['A']=['A','B','A','A','B','B','B']
df['B']=[2,4,3,5,6,7,8]
df
A B
0 A 2
1 B 4
2 A 3
3 A 5
4 B 6
5 B 7
6 B 8
df.columns=['id','num']
df
id num
0 A 2
1 B 4
2 A 3
3 A 5
4 B 6
5 B 7
6 B 8
I would like to group by the id column, with a condition on the num column.
I want two columns, is_even_count and is_odd_count, in the final DataFrame, where is_even_count counts only the even numbers in the num column after grouping and is_odd_count counts only the odd numbers in the num column after grouping.
My desired output is:
is_even_count is_odd_count
A 1 2
B 3 1
How can I do this in pandas?
Use modulo division by 2, compare with 1, and map the boolean result to labels:
d = {True:'is_odd_count', False:'is_even_count'}
df = df.groupby(['id', (df['num'] % 2 == 1).map(d)]).size().unstack(fill_value=0)
print (df)
num is_even_count is_odd_count
id
A 1 2
B 3 1
Another solution with crosstab:
df = pd.crosstab(df['id'], (df['num'] % 2 == 1).map(d))
Alternative with numpy.where:
a = np.where(df['num'] % 2 == 1, 'is_odd_count', 'is_even_count')
df = pd.crosstab(df['id'], a)
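A further option (a sketch, not one of the original answers, assuming pandas 1.0+ where named aggregation accepts multiple lambdas) avoids the mapping dict entirely:
out = df.groupby('id')['num'].agg(
    is_even_count=lambda s: (s % 2 == 0).sum(),
    is_odd_count=lambda s: (s % 2 == 1).sum(),
)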
I want to selectively remove elements of a pandas group based on their properties within the group.
Here's an example: remove all elements except the row with the highest value in the 'A' column
>>> dff = pd.DataFrame({'A': [0, 2, 4, 1, 9, 2, 3, 10], 'B': list('aabbbbcc'), 'C': list('lmnopqrt')})
>>> dff
A B C
0 0 a l
1 2 a m
2 4 b n
3 1 b o
4 9 b p
5 2 b q
6 3 c r
7 10 c t
>>> grped = dff.groupby('B')
>>> grped.groups
{'a': [0, 1], 'c': [6, 7], 'b': [2, 3, 4, 5]}
Apply a custom function/method to the groups (sort within each group on column 'A', filter elements):
>>> yourGenius(grped,'A').reset_index()
returns dataframe:
A B C
0 2 a m
1 9 b p
2 10 c t
Maybe there is a compact way to do this with a lambda function or .filter()? Thanks.
If you want to select one row per group, you could use groupby/agg
to return index values and select the rows using loc.
For example, to group by B and then select the row with the highest A value:
In [171]: dff
Out[171]:
A B C
0 0 a l
1 2 a m
2 4 b n
3 1 b o
4 9 b p
5 2 b q
6 3 c r
7 10 c t
[8 rows x 3 columns]
In [172]: dff.loc[dff.groupby('B')['A'].idxmax()]
Out[172]:
A B C
1 2 a m
4 9 b p
7 10 c t
Another option (suggested by jezrael), which in practice is faster for a wide range of DataFrames, is
dff.sort_values(by=['A'], ascending=False).drop_duplicates('B')
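A related pattern (a sketch, not part of the original answer) keeps every row equal to its group maximum via transform, which also preserves ties:
dff[dff['A'] == dff.groupby('B')['A'].transform('max')]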
If you wish to select many rows per group, you could use groupby/apply with a function that returns sub-DataFrames for
each group. apply will then try to merge these sub-DataFrames for you.
For example, to select every row except the last from each group:
In [216]: df = pd.DataFrame(np.arange(15).reshape(5,3), columns=list('ABC'), index=list('vwxyz')); df['A'] %= 2; df
Out[216]:
A B C
v 0 1 2
w 1 4 5
x 0 7 8
y 1 10 11
z 0 13 14
In [217]: df.groupby(['A']).apply(lambda grp: grp.iloc[:-1]).reset_index(drop=True, level=0)
Out[217]:
A B C
v 0 1 2
x 0 7 8
w 1 4 5
Another way is to use groupby/apply to return a Series of index values. Again apply will try to join the Series into one Series. You could then use df.loc to select rows by index value:
In [218]: df.loc[df.groupby(['A']).apply(lambda grp: pd.Series(grp.index[:-1]))]
Out[218]:
A B C
v 0 1 2
x 0 7 8
w 1 4 5
I don't think groupby/filter will do what you wish, since
groupby/filter filters whole groups. It doesn't allow you to select particular rows from each group.
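For illustration, a sketch using the dff from the question: filter keeps or drops entire groups based on a group-level condition, so every row of a surviving group comes back.
# keep only groups of 'B' whose maximum 'A' exceeds 5 -> all rows of groups 'b' and 'c'
dff.groupby('B').filter(lambda grp: grp['A'].max() > 5)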