Merge two DataFrames into one with interleaved columns

I have two DataFrames like these:

df1
   a  b  c
0  1  2  3
1  2  3  4
2  3  4  5

df2
   x  y  z
0  T  T  F
1  F  T  T
2  F  T  F

I want to merge them so that the columns interleave, one in between the other, like this:

df
   a  x  b  y  c  z
0  1  T  2  T  3  F
1  2  F  3  T  4  T
2  3  F  4  T  5  F

What's your idea? How can we merge, append, or concat to get this?

I used this code; it builds the columns dynamically:

df = pd.DataFrame()
for i in range(6):
    if i % 2 == 0:
        df.loc[:, i] = df1.iloc[:, i // 2]
    else:
        df.loc[:, i] = df2.iloc[:, (i - 1) // 2]

And it works correctly.

Try:

df = pd.concat([df1, df2], axis=1)
df = df[['a', 'x', 'b', 'y', 'c', 'z']]

Prints:

   a  x  b  y  c  z
0  1  T  2  T  3  F
1  2  F  3  T  4  T
2  3  F  4  T  5  F
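The column order can also be built dynamically instead of hardcoding it; a minimal sketch, assuming df1 and df2 have the same number of columns:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [3, 4, 5]})
df2 = pd.DataFrame({'x': ['T', 'F', 'F'], 'y': ['T', 'T', 'T'], 'z': ['F', 'T', 'F']})

# Pair up the two column lists and flatten: a, x, b, y, c, z.
order = [col for pair in zip(df1.columns, df2.columns) for col in pair]
df = pd.concat([df1, df2], axis=1)[order]
print(df.columns.tolist())  # ['a', 'x', 'b', 'y', 'c', 'z']
```

This keeps the original column names, unlike the loop above, which numbers the columns 0 to 5.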

Setting values whose count is lower than a threshold to 'other'

I want to set items with count <= 1 to 'other'. Code for the input table:
import pandas as pd
df=pd.DataFrame({"item":['a','a','a','b','b','c','d']})
input table:

  item
0    a
1    a
2    a
3    b
4    b
5    c
6    d

expected output:

  item result
0    a      a
1    a      a
2    a      a
3    b      b
4    b      b
5    c  other
6    d  other
How could I achieve that?
Use Series.where and check whether all values are duplicated with Series.duplicated with keep=False:
df['result'] = df.item.where(df.item.duplicated(keep=False), 'other')
Or use GroupBy.transform with 'size' and test for greater than 1 with Series.gt:
df['result'] = df.item.where(df.groupby('item')['item'].transform('size').gt(1), 'other')
Or use Series.map with Series.value_counts:
df['result'] = df.item.where(df['item'].map(df['item'].value_counts()).gt(1), 'other')
print(df)

  item result
0    a      a
1    a      a
2    a      a
3    b      b
4    b      b
5    c  other
6    d  other
Use numpy.where with GroupBy.transform and Series.le:
In [926]: import numpy as np
In [927]: df['result'] = np.where(df.groupby('item')['item'].transform('count').le(1), 'other', df.item)
In [928]: df
Out[928]:
  item result
0    a      a
1    a      a
2    a      a
3    b      b
4    b      b
5    c  other
6    d  other
Or use GroupBy.size with merge:

In [917]: x = df.groupby('item').size().reset_index(name='size')
In [919]: ans = df.merge(x)
In [921]: ans['result'] = np.where(ans['size'].le(1), 'other', ans['item'])
In [923]: ans = ans.drop(columns='size')
In [924]: ans
Out[924]:
  item result
0    a      a
1    a      a
2    a      a
3    b      b
4    b      b
5    c  other
6    d  other
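The three one-liners above generalize to any count threshold; a small helper sketch using the value_counts-map pattern (lump_rare is a hypothetical name, not a pandas API):

```python
import pandas as pd

def lump_rare(s, threshold=1, other='other'):
    # Hypothetical helper: replace values whose count is <= threshold.
    counts = s.map(s.value_counts())  # per-row count of each value
    return s.where(counts.gt(threshold), other)

df = pd.DataFrame({"item": ['a', 'a', 'a', 'b', 'b', 'c', 'd']})
df['result'] = lump_rare(df['item'], threshold=1)
print(df['result'].tolist())  # ['a', 'a', 'a', 'b', 'b', 'other', 'other']
```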

Group by year and get count and total count [duplicate]

I have this simple dataframe df:
df = pd.DataFrame({'c':[1,1,1,2,2,2,2],'type':['m','n','o','m','m','n','n']})
My goal is to count the values of type for each c, and then add a column with the size of each c group. So starting with:

In [27]: g = df.groupby('c')['type'].value_counts().reset_index(name='t')
In [28]: g
Out[28]:
   c type  t
0  1    m  1
1  1    n  1
2  1    o  1
3  2    m  2
4  2    n  2

the first problem is solved. Then I can also do:

In [29]: a = df.groupby('c').size().reset_index(name='size')
In [30]: a
Out[30]:
   c  size
0  1     3
1  2     4

How can I add the size column directly to the first dataframe? So far I have used map:

In [31]: a.index = a['c']
In [32]: g['size'] = g['c'].map(a['size'])
In [33]: g
Out[33]:
   c type  t  size
0  1    m  1     3
1  1    n  1     3
2  1    o  1     3
3  2    m  2     4
4  2    n  2     4

This works, but is there a more straightforward way to do it?
Use transform to add a column back to the original df from a groupby aggregation; transform returns a Series with its index aligned to the original df:
In [123]:
g = df.groupby('c')['type'].value_counts().reset_index(name='t')
g['size'] = df.groupby('c')['type'].transform('size')
g
Out[123]:
   c type  t  size
0  1    m  1     3
1  1    n  1     3
2  1    o  1     3
3  2    m  2     4
4  2    n  2     4
Another solution, with transform and len:

df['size'] = df.groupby('c')['type'].transform(len)
print(df)

   c type  size
0  1    m     3
1  1    n     3
2  1    o     3
3  2    m     4
4  2    m     4
5  2    n     4
6  2    n     4
Another solution, with Series.map and Series.value_counts:

df['size'] = df['c'].map(df['c'].value_counts())
print(df)

   c type  size
0  1    m     3
1  1    n     3
2  1    o     3
3  2    m     4
4  2    m     4
5  2    n     4
6  2    n     4
You can compute the groupby object once and use it multiple times:

g = df.groupby('c')['type']
df = g.value_counts().reset_index(name='counts')
df['size'] = g.transform('size')

or

g.value_counts().reset_index(name='counts').assign(size=g.transform('size'))

Output:

   c type  counts  size
0  1    m       1     3
1  1    n       1     3
2  1    o       1     3
3  2    m       2     4
4  2    n       2     4
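The transform-based assignments above rely on the transformed Series lining up with the new frame's RangeIndex; merging the group sizes in on the 'c' key avoids that assumption. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'c': [1, 1, 1, 2, 2, 2, 2],
                   'type': ['m', 'n', 'o', 'm', 'm', 'n', 'n']})

# Count each (c, type) pair, then merge in the per-c group size by key,
# so no index alignment between the two intermediate results is needed.
g = df.groupby('c')['type'].value_counts().reset_index(name='t')
sizes = df.groupby('c').size().reset_index(name='size')
g = g.merge(sizes, on='c')
print(g)
```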

Pandas use several rows for a column MultiIndex in a DataFrame

When you load a CSV in pandas you can easily specify the number of rows to use as column indexes, as such:
import pandas
from six import StringIO

df = """a | X | X | Y | Y | Z | Z
b | C | N | C | N | C | N
c | i | i | i | j | j | j
d | 3 | 10 | 4 | 98 | 81 | 0"""
df = StringIO(df.replace(' ', ''))
df = pandas.read_csv(df, sep="|", header=[0, 1, 2])

>>> df
  a  X      Y      Z
  b  C   N  C   N  C   N
  c  i   i  i   j  j   j
0 d  3  10  4  98 81   0
But how do you produce the same result from a DataFrame already in memory? How do you simply specify which rows should be used as the column index?
Without, of course, going through this hack:

>>> df
   0  1   2  3   4   5  6
0  a  X   X  Y   Y   Z  Z
1  b  C   N  C   N   C  N
2  c  i   i  i   j   j  j
3  d  3  10  4  98  81  0

path = '~/test/temp.csv'
df.to_csv(path, header=None, index=None)
df = pandas.read_csv(path, header=[0, 1, 2])

Or even this hack:

>>> df
   0  1   2  3   4   5  6
0  a  X   X  Y   Y   Z  Z
1  b  C   N  C   N   C  N
2  c  i   i  i   j   j  j
3  d  3  10  4  98  81  0

df = df.transpose().set_index([0, 1, 2]).transpose()
I tried using this method, but it does not accept an axis parameter:
df.set_index(['a', 'b', 'c'], axis=1)
Your alternative solution should be improved a bit:
df = df.T.set_index([0,1,2]).T
Another solution, without transpose:

df.columns = pd.MultiIndex.from_tuples(df.iloc[:3].apply(tuple))
df = df.iloc[3:].reset_index(drop=True)
print(df)

  a  X      Y      Z
  b  C   N  C   N  C   N
  c  i   i  i   j  j   j
0 d  3  10  4  98 81   0
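The same promotion can be done with MultiIndex.from_arrays. One caveat of all these tricks is that the data rows keep object dtype; infer_objects can often repair that afterwards. A sketch, assuming the first three rows hold the header levels:

```python
import pandas as pd

df = pd.DataFrame([['a', 'X', 'X', 'Y', 'Y', 'Z', 'Z'],
                   ['b', 'C', 'N', 'C', 'N', 'C', 'N'],
                   ['c', 'i', 'i', 'i', 'j', 'j', 'j'],
                   ['d', 3, 10, 4, 98, 81, 0]])

# Promote the first three rows to a 3-level column MultiIndex.
df.columns = pd.MultiIndex.from_arrays(list(df.iloc[:3].values))
# Keep the remaining rows as data and restore inferable dtypes.
df = df.iloc[3:].reset_index(drop=True).infer_objects()
print(df)
```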

How to impute column values on Dask Dataframe?

I would like to impute negative values in a Dask DataFrame. With pandas I use this code:

df.loc[(df.column_name < 0), 'column_name'] = 0

I think you need dask.dataframe.Series.clip with a lower bound:

ddf['B'] = ddf['B'].clip(lower=0)
Sample:

import pandas as pd

df = pd.DataFrame({'F': list('abcdef'),
                   'B': [-4, 5, 4, -5, 5, 4],
                   'A': list('aaabbb')})
print(df)

   A  B  F
0  a -4  a
1  a  5  b
2  a  4  c
3  b -5  d
4  b  5  e
5  b  4  f

from dask import dataframe as dd
ddf = dd.from_pandas(df, npartitions=3)

ddf['B'] = ddf['B'].clip(lower=0)
print(ddf.compute())

   A  B  F
0  a  0  a
1  a  5  b
2  a  4  c
3  b  0  d
4  b  5  e
5  b  4  f
For a more general solution, use dask.dataframe.Series.mask (applied here to the original, unclipped ddf):

ddf['B'] = ddf['B'].mask(ddf['B'] > 0, 3)
print(ddf.compute())

   A  B  F
0  a -4  a
1  a  3  b
2  a  3  c
3  b -5  d
4  b  3  e
5  b  3  f

Create pandas dataframe by repeating one row with new multiindex

In Pandas I have a series and a multi-index:
s = pd.Series([1,2,3,4], index=['w', 'x', 'y', 'z'])
idx = pd.MultiIndex.from_product([['a', 'b'], ['c', 'd']])
What is the best way to create a DataFrame that has idx as index and s as the value of every row, preserving the index of s as the columns?

df =
     w  x  y  z
a c  1  2  3  4
  d  1  2  3  4
b c  1  2  3  4
  d  1  2  3  4
Use the pd.DataFrame constructor followed by assign:

pd.DataFrame(index=idx).assign(**s)

     w  x  y  z
a c  1  2  3  4
  d  1  2  3  4
b c  1  2  3  4
  d  1  2  3  4
You can use numpy.tile to stack copies of the row and pass the result to the DataFrame constructor (numpy.repeat repeats each element in place, which gives rows of repeated scalars instead of copies of s):

arr = np.tile(s.values, (len(idx), 1))
df = pd.DataFrame(arr, index=idx, columns=s.index)
print(df)

     w  x  y  z
a c  1  2  3  4
  d  1  2  3  4
b c  1  2  3  4
  d  1  2  3  4
Timings:
np.random.seed(123)
s = pd.Series(np.random.randint(10, size=1000))
s.index = s.index.astype(str)
idx = pd.MultiIndex.from_product([np.random.randint(10, size=250), ['a','b','c', 'd']])
In [32]: %timeit (pd.DataFrame(np.repeat(s.values, len(idx)).reshape(len(idx), -1), index=idx, columns=s.index))
100 loops, best of 3: 3.94 ms per loop
In [33]: %timeit (pd.DataFrame(index=idx).assign(**s))
1 loop, best of 3: 332 ms per loop
In [34]: %timeit pd.DataFrame([s]*len(idx),idx,s.index)
10 loops, best of 3: 82.9 ms per loop
Use [s]*len(idx) as data, idx as index and s.index as columns to reconstruct the df:

pd.DataFrame([s]*len(idx), idx, s.index)
Out[56]:
     w  x  y  z
a c  1  2  3  4
  d  1  2  3  4
b c  1  2  3  4
  d  1  2  3  4
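As another alternative, numpy.broadcast_to produces a read-only view of the row repeated len(idx) times, which only needs to be materialized once for the constructor. A sketch with the same s and idx:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['w', 'x', 'y', 'z'])
idx = pd.MultiIndex.from_product([['a', 'b'], ['c', 'd']])

# broadcast_to yields a read-only (len(idx), 4) view of the single row;
# .copy() materializes it once so the DataFrame owns writable data.
arr = np.broadcast_to(s.to_numpy(), (len(idx), len(s)))
df = pd.DataFrame(arr.copy(), index=idx, columns=s.index)
print(df)
```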