Unstack a single column dataframe - pandas

I have a dataframe that looks like this:
statistics
0 2013-08
1 4
2 8
3 2013-09
4 7
5 13
6 2013-10
7 2
8 10
And I need it to look like this:
statistics X Y
0 2013-08 4 8
1 2013-09 7 13
2 2013-10 2 10
It would be useful to find a way that doesn't depend on the number of rows, as I want to use it in a loop and the number of original rows might change. However, the output should always have these 3 columns.

What you are doing is not an unstack operation; you are trying to do a reshape.
You can do this with NumPy's reshape method. The variable n_cols is the number of columns you are looking for.
Here is an example:
df = pd.DataFrame(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L'], columns=['col'])
df
col
0 A
1 B
2 C
3 D
4 E
5 F
6 G
7 H
8 I
9 J
10 K
11 L
n_cols = 3
pd.DataFrame(df.values.reshape(-1, n_cols))
0 1 2
0 A B C
1 D E F
2 G H I
3 J K L
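Applied back to the frame in the question, the same idea yields the three requested columns. The column names X and Y come from the desired output, and the variable name data below is my own, so treat this as a sketch assuming the rows always come in groups of three:
out = pd.DataFrame(data['statistics'].values.reshape(-1, 3), columns=['statistics', 'X', 'Y'])
out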

import pandas as pd

data = pd.read_csv('data6.csv')
x = []
y = []
statistics = []
for i in range(0, len(data)):
    if i % 3 == 0:
        statistics.append(data['statistics'][i])
    elif i % 3 == 1:
        x.append(data['statistics'][i])
    elif i % 3 == 2:
        y.append(data['statistics'][i])
data1 = pd.DataFrame({'statistics': statistics, 'x': x, 'y': y})
data1

Related

pandas - replace rows in dataframe with rows of another dataframe by latest matching column entry

I have 2 dataframes, df1
A B C
0 a 1 x
1 b 2 y
2 c 3 z
3 d 4 g
4 e 5 h
and df2:
0 A B C
0 1 a 6 i
1 2 a 7 j
2 3 b 8 k
3 3 d 10 k
What I want to do is the following:
Whenever an entry in column A of df1 matches an entry in column A of df2, the matching row in df1 should be replaced with parts of the corresponding row in df2.
In my approach, shown in the code below, I tried to replace the first row (a,1,x) by (a,6,i) and then by (a,7,j).
Also all other matching rows should be replaced:
So: (b,2,y) with (b,8,k) and (d,4,g) with (d,10,k)
Meaning that every row in df1 should be replaced by the latest match of column A in df2.
import numpy as np
import pandas as pd
columns = ["0","A", "B", "C"]
s1 = pd.Series(['a', 1, 'x'])
s2 = pd.Series(['b', 2, 'y'])
s3 = pd.Series(['c', 3, 'z'])
s4 = pd.Series(['d', 4, 'g'])
s5 = pd.Series(['e', 5, 'h'])
df1 = pd.DataFrame([list(s1), list(s2),list(s3),list(s4),list(s5)], columns = columns[1::])
s1 = pd.Series([1, 'a', 6, 'i'])
s2 = pd.Series([2, 'a', 7, 'j'])
s3 = pd.Series([3, 'b', 8, 'k'])
s4 = pd.Series([3, 'd', 10, 'k'])
df2 = pd.DataFrame([list(s1), list(s2),list(s3),list(s4)], columns = columns)
cols = ["A", "B", "C"]
print(df1[columns[1::]])
print("---")
print(df2[columns])
print("---")
df1.loc[df1["A"].isin(df2["A"]), columns[1::]] = df2[columns[1::]]
print(df1)
The expected result would therefore be:
A B C
0 a 7 j
1 b 2 y
2 c 3 z
3 d 10 k
4 e 5 h
But the above approach results in:
A B C
0 a 6 i
1 a 7 j
2 c 3 z
3 d 10 k
4 e 5 h
I know I could do what I want with iterrows(), but I don't think that is the intended way of doing this, right? (Also, I have quite some data to process, so I think it would not be the most efficient - but please correct me if I'm wrong here; in that case it would be OK to use it.)
Or is there any other easy approach to achieve this?
Use pd.concat with drop_duplicates, keeping the last occurrence of each value in A, then sort and drop the helper column '0':
df = pd.concat([df1, df2]).drop_duplicates(['A'], keep='last').sort_values('A').drop('0', axis=1)
print (df)
A B C
1 a 7 j
2 b 8 k
2 c 3 z
3 d 10 k
4 e 5 h
You can try merge, then update:
df1.update(df1[['A']].merge(df2.drop_duplicates('A', keep='last'), on='A', how='left')[['B', 'C']])
print(df1)
A B C
0 a 7.0 j
1 b 8.0 k
2 c 3.0 z
3 d 10.0 k
4 e 5.0 h
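Note that B comes out as float: the left join fills NaN for the rows of df1 with no match in df2 ('c' and 'e'), which upcasts the column. If the integer dtype matters, a cast afterwards restores it (my addition, not part of the original answer):
df1['B'] = df1['B'].astype(int)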

Split and concatenate dataframe

So I have a dataframe which looks like this one:
>>> df = pd.DataFrame({
...     'id': [i for i in range(5)],
...     '1': ['a', 'b', 'c', 'd', 'e'],
...     '2': ['f', 'g', 'h', 'i', 'g']
... })
>>> df
id 1 2
0 0 a f
1 1 b g
2 2 c h
3 3 d i
4 4 e g
I want to convert this dataframe to the following one:
>>> df_concatenated
id val
1 0 a
1 1 b
1 2 c
1 3 d
1 4 e
2 0 f
2 1 g
2 2 h
2 3 i
2 4 g
One way is to use pd.melt:
pd.melt(df, id_vars=['id'], value_vars=['1','2']).set_index('variable', append=True)
id value
variable
0 1 0 a
1 1 1 b
2 1 2 c
3 1 3 d
4 1 4 e
5 2 0 f
6 2 1 g
7 2 2 h
8 2 3 i
9 2 4 g
The other is to split with the .iloc accessor and concatenate. Long, but it works:
res1 = df.iloc[:, [0, 2]]
res1.columns = ['id', 'val']
res = df.iloc[:, :2]
res.columns = ['id', 'val']
res2 = pd.concat([res, res1])
res2
id val
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e
0 0 f
1 1 g
2 2 h
3 3 i
4 4 g
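If you want exactly the frame shown in the question (the old column name as the index and the value column called val), a small variation on the melt approach should get there; value_name and rename_axis are my additions, not part of the answers above:
out = df.melt(id_vars='id', value_vars=['1', '2'], value_name='val').set_index('variable').rename_axis(None)
out
id val
1 0 a
1 1 b
1 2 c
1 3 d
1 4 e
2 0 f
2 1 g
2 2 h
2 3 i
2 4 g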
You can try this:
df = df.rename({"1":"val"},axis=1)
df_temp = df[["id","2"]]
df_temp = df_temp.rename({"2":"val"},axis=1)
df.drop("2",axis=1,inplace=True)
out_df = pd.concat([df,df_temp],axis=0).reset_index(drop=True)
print(out_df)
Output:
id val
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e
5 0 f
6 1 g
7 2 h
8 3 i
9 4 g

Create pandas dataframe by repeating one row with new multiindex

In Pandas I have a series and a multi-index:
s = pd.Series([1,2,3,4], index=['w', 'x', 'y', 'z'])
idx = pd.MultiIndex.from_product([['a', 'b'], ['c', 'd']])
What is the best way for me to create a DataFrame that has idx as its index and s as the value of each row, preserving the index of s as columns?
df =
w x y z
a c 1 2 3 4
d 1 2 3 4
b c 1 2 3 4
d 1 2 3 4
Use the pd.DataFrame constructor followed by assign
pd.DataFrame(index=idx).assign(**s)
w x y z
a c 1 2 3 4
d 1 2 3 4
b c 1 2 3 4
d 1 2 3 4
You can duplicate the data with numpy.tile, which repeats the whole row once per entry of idx, and pass the result to the DataFrame constructor:
arr = np.tile(s.values, len(idx)).reshape(-1, len(s))
df = pd.DataFrame(arr, index=idx, columns=s.index)
print (df)
w x y z
a c 1 2 3 4
d 1 2 3 4
b c 1 2 3 4
d 1 2 3 4
Timings:
np.random.seed(123)
s = pd.Series(np.random.randint(10, size=1000))
s.index = s.index.astype(str)
idx = pd.MultiIndex.from_product([np.random.randint(10, size=250), ['a','b','c', 'd']])
In [32]: %timeit (pd.DataFrame(np.tile(s.values, len(idx)).reshape(-1, len(s)), index=idx, columns=s.index))
100 loops, best of 3: 3.94 ms per loop
In [33]: %timeit (pd.DataFrame(index=idx).assign(**s))
1 loop, best of 3: 332 ms per loop
In [34]: %timeit pd.DataFrame([s]*len(idx),idx,s.index)
10 loops, best of 3: 82.9 ms per loop
Use [s]*len(idx) as the data, idx as the index and s.index as the columns to reconstruct the df.
pd.DataFrame([s]*len(idx), idx, s.index)
Out[56]:
w x y z
a c 1 2 3 4
d 1 2 3 4
b c 1 2 3 4
d 1 2 3 4

Apply an element-wise function on a pandas dataframe with index and column values as inputs

I often have this need, and I can't seem to find the way to do it efficiently.
Let's say I have a pandas DataFrame object and I want the value of each element (i,j) to be equal to f(index[i], columns[j]).
With applymap, the index and column values of each element are lost.
What is the best way to do it?
It depends on what you are trying to do specifically.
Clever hack
Using pd.Panel.apply
It works because it iterates over each series along the major and minor axes; its name will be the (index, column) tuple we need.
df = pd.DataFrame(index=range(5), columns=range(5))

def f1(x):
    n = x.name
    return n[0] + n[1] ** 2

pd.Panel(dict(A=df)).apply(f1, 0)
0 1 2 3 4
0 0 1 4 9 16
1 1 2 5 10 17
2 2 3 6 11 18
3 3 4 7 12 19
4 4 5 8 13 20
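pd.Panel was deprecated in pandas 0.20 and removed in 1.0, so the hack above no longer runs on current versions. A minimal modern sketch of the same computation, broadcasting f1 directly from the index and columns with NumPy:
import numpy as np
import pandas as pd

df = pd.DataFrame(index=range(5), columns=range(5))
# element (i, j) = index[i] + columns[j] ** 2, matching f1 above
out = pd.DataFrame(np.add.outer(df.index.to_numpy(), df.columns.to_numpy() ** 2), index=df.index, columns=df.columns)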
example 1
Here is one such use case and one possible solution for it:
df = pd.DataFrame(index=range(5), columns=range(5))
f = lambda x: x[0] + x[1]
s = df.stack(dropna=False)
s.loc[:] = s.index.map(f)
s.unstack()
0 1 2 3 4
0 0 1 2 3 4
1 1 2 3 4 5
2 2 3 4 5 6
3 3 4 5 6 7
4 4 5 6 7 8
Or this will do the same thing:
df.stack(dropna=False).to_frame().apply(lambda x: f(x.name), 1).unstack()
example 2
df = pd.DataFrame(index=list('abcd'), columns=list('xyz'))
v = df.values
c = df.columns.values
i = df.index.values
pd.DataFrame(
    (np.repeat(i, len(c)) + np.tile(c, len(i))).reshape(v.shape),
    i, c
)
x y z
a ax ay az
b bx by bz
c cx cy cz
d dx dy dz
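For string labels like these, plain NumPy broadcasting over the object arrays builds the same frame more directly (a sketch reusing i and c from above):
pd.DataFrame(i[:, None] + c, index=i, columns=c)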

Separate aggregated data in different rows [duplicate]

This question already has answers here:
How can I replicate rows of a Pandas DataFrame?
I want to replicate rows in a Pandas Dataframe. Each row should be repeated n times, where n is a field of each row.
import pandas as pd
what_i_have = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n' : [ 1, 2, 3],
    'v' : [ 10, 13, 8]
})
what_i_want = pd.DataFrame(data={
    'id': ['A', 'B', 'B', 'C', 'C', 'C'],
    'v' : [ 10, 13, 13, 8, 8, 8]
})
Is this possible?
You can use Index.repeat to get repeated index values based on the column then select from the DataFrame:
df2 = df.loc[df.index.repeat(df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
Or you could use np.repeat to get the repeated indices and then use that to index into the frame:
df2 = df.loc[np.repeat(df.index.values, df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
After which there's only a bit of cleaning up to do:
df2 = df2.drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Note that if you might have duplicate indices to worry about, you could use .iloc instead:
df.iloc[np.repeat(np.arange(len(df)), df["n"])].drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
which uses the positions, and not the index labels.
You could use set_index and repeat
In [1057]: df.set_index(['id'])['v'].repeat(df['n']).reset_index()
Out[1057]:
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Details
In [1058]: df
Out[1058]:
id n v
0 A 1 10
1 B 2 13
2 C 3 8
It's something like the uncount in tidyr:
https://tidyr.tidyverse.org/reference/uncount.html
I wrote a package (https://github.com/pwwang/datar) that implements this API:
from datar import f
from datar.tibble import tribble
from datar.tidyr import uncount
what_i_have = tribble(
    f.id, f.n, f.v,
    'A',  1,   10,
    'B',  2,   13,
    'C',  3,   8
)
what_i_have >> uncount(f.n)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
Not the best solution, but I want to share it: you could also use reindex() together with Index.repeat():
df.reindex(df.index.repeat(df.n)).drop('n', axis=1)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
You can further append .reset_index(drop=True) to reset the index, as shown below.
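Putting the two steps together, the full chain reads:
df.reindex(df.index.repeat(df.n)).drop('n', axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8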