How to combine two pandas DataFrames into a single one along axis=2 (i.e. so that the cell values are tuples)? - pandas

I have two (large) dataframes. They have the same index & columns, and I want to combine them so that they have tuple values in each cell.
The example explains it best:
df1 = pd.DataFrame({
'A':[True, True, False],
'B':[False, True, False],
})
df2 = pd.DataFrame({
'A':[1, 2, 3],
'B':[5, 6, 7],
})
# Desired output:
pd.DataFrame({
'A':[(True, 1), (True, 2), (False, 3)],
'B':[(False, 5), (True, 6), (False, 7)],
})
The DataFrames are large (1m+ rows), so I am looking to do this reasonably efficiently.
I tried np.stack([df1.values, df2.values], axis=2) and that got me the right value array, but I could not convert it into a dataframe.
Any ideas?

I got your desired output with this solution (note that it overwrites the columns of df1 in place):
import pandas as pd
df1 = pd.DataFrame({
'A':[True, True, False],
'B':[False, True, False],
})
df2 = pd.DataFrame({
'A':[1, 2, 3],
'B':[5, 6, 7],
})
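# Walk over matching column names and pair the values row by row.
for df_1k, df_2k in zip(df1.columns, df2.columns):
    # zip already yields tuples, so no extra conversion is needed
    df1[df_1k] = list(zip(df1[df_1k], df2[df_2k]))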
print(df1)
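A non-mutating variant of the same idea, as a minimal sketch: it builds a new frame instead of overwriting df1, and assumes both frames share the same columns and row order. The name `combined` is only illustrative. The tuples are still Python objects stored in object-dtype columns, so this is not truly vectorized, and memory use on 1m+ rows will be higher than keeping the two frames separate.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [True, True, False], 'B': [False, True, False]})
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [5, 6, 7]})

# Pair the values column by column; .tolist() yields plain Python scalars.
combined = pd.DataFrame(
    {col: list(zip(df1[col].tolist(), df2[col].tolist())) for col in df1.columns},
    index=df1.index,
)
print(combined)
```

df1 and df2 are left untouched, which helps if the original frames are still needed afterwards.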

Related

Set index for aggregated dataframe

I ran a calculation over a list of dataframes. I'd like the resulting dataframe to use a RangeIndex. However, it uses one of the column names as the index, even though I set index=None.
d1 = {'id': [1, 2, 3, 4, 5], 'is_free': [True, False, False, True, True], 'level': ['Top', 'Mid', 'Top', 'Top', 'Low']}
d2 = {'id': [1, 3, 4, 5, 7], 'is_free': [True, True, False, False, False], 'level': ['Top', 'High', 'Top', 'Top', 'Low']}
d1 = pd.DataFrame(data=d1)
d2 = pd.DataFrame(data=d2)
df_list = [d1, d2]
dfs = []
for i, df in enumerate(df_list):
    df = df.groupby('is_free')['id'].count()
    dfs.append(df)
df = pd.DataFrame(data=dfs, index=None)
It returns
is_free False True
id 2 3
id 3 2
df.index returns
Index(['id', 'id'], dtype='object')
Starting from your code, reset the index:
df = pd.DataFrame(data=dfs, index=None).reset_index(drop=True)
However, in general, I would avoid appending iteratively. Try concat:
pd.concat({i: d.groupby('is_free')['id'].count()
           for i, d in enumerate(df_list)},
          axis=1).T
Or use pd.DataFrame:
pd.DataFrame({i: d.groupby('is_free')['id'].count()
              for i, d in enumerate(df_list)}).T
Output:
is_free False True
0 2 3
1 3 2
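The 'id' labels come from the name of each grouped Series: when a list of Series is passed to pd.DataFrame, each Series becomes a row labelled with its name. A short sketch using the question's data (the 'level' column is left out, and the names `counts` and `result` are illustrative):

```python
import pandas as pd

d1 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'is_free': [True, False, False, True, True]})
d2 = pd.DataFrame({'id': [1, 3, 4, 5, 7],
                   'is_free': [True, True, False, False, False]})

# Each groupby result is a Series named after the counted column.
counts = [d.groupby('is_free')['id'].count() for d in (d1, d2)]
print(counts[0].name)

# The Series names become row labels; reset_index(drop=True) discards them.
result = pd.DataFrame(counts).reset_index(drop=True)
print(result.index)
```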

numpy unique over multiple arrays

numpy.unique expects a 1-D array. If the input is not a 1-D array, it flattens it by default.
Is there a way for it to accept multiple arrays? To keep it simple, let's just say a pair of arrays, and we are unique-ing the pairs of elements across the 2 arrays.
For example, say I have 2 numpy arrays as inputs
a = [1, 2, 3, 3]
b = [10, 20, 30, 31]
I'm unique-ing against both of these arrays, so against these 4 pairs: (1, 10), (2, 20), (3, 30), and (3, 31). These 4 are all unique, so I want my result to say
[True, True, True, True]
If instead the inputs are as follows
a = [1, 2, 3, 3]
b = [10, 20, 30, 30]
Then the last 2 elements are not unique. So the output should be
[True, True, True, False]
You could stack the arrays and use the first-occurrence indices that numpy.unique() returns when called with axis=1 and return_index=True:
In [243]: def is_unique(*lsts):
     ...:     arr = np.vstack(lsts)
     ...:     _, ind = np.unique(arr, axis=1, return_index=True)
     ...:     out = np.zeros(shape=arr.shape[1], dtype=bool)
     ...:     out[ind] = True
     ...:     return out
In [244]: a = [1, 2, 2, 3, 3]
In [245]: b = [1, 2, 2, 3, 3]
In [246]: c = [1, 2, 0, 3, 3]
In [247]: is_unique(a, b)
Out[247]: array([ True, True, False, True, False])
In [248]: is_unique(a, b, c)
Out[248]: array([ True, True, True, True, False])
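If pandas is already available, an alternative sketch is to put the arrays in a DataFrame and negate DataFrame.duplicated(), which flags every row that repeats an earlier one. This is a different technique from the np.unique approach above, and the helper name `is_first_occurrence` is made up for illustration.

```python
import numpy as np
import pandas as pd

def is_first_occurrence(*arrays):
    """Return True where the tuple of values at that position has not appeared earlier."""
    # One column per input array; each row is one tuple to compare.
    frame = pd.DataFrame({i: np.asarray(arr) for i, arr in enumerate(arrays)})
    return (~frame.duplicated()).to_numpy()

print(is_first_occurrence([1, 2, 3, 3], [10, 20, 30, 31]))
print(is_first_occurrence([1, 2, 3, 3], [10, 20, 30, 30]))
```

Unlike np.vstack, this does not force the inputs to a common dtype, which may matter when the arrays mix numbers and strings.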

Efficient column MultiIndex ordering

I have this dataframe :
df = pandas.DataFrame({'A' : [2000, 2000, 2000, 2000, 2000, 2000],
'B' : ["A+", 'B+', "A+", "B+", "A+", "B+"],
'C' : ["M", "M", "M", "F", "F", "F"],
'D' : [1, 5, 3, 4, 2, 6],
'Value' : [11, 12, 13, 14, 15, 16] }).set_index((['A', 'B', 'C', 'D']))
df = df.unstack(['C', 'D']).fillna(0)
I'm wondering if there is a more elegant way to order the column MultiIndex than the following code:
# rows ordering
df = df.sort_values(by = ['A', "B"], ascending = [True, True])
# col ordering
df = df.transpose().sort_values(by = ["C", "D"], ascending = [False, False]).transpose()
In particular, I feel like the last line with the two transposes is far more complex than it should be. I tried using sort_index but wasn't able to use it in a MultiIndex context (for either rows or columns).
You can use sort_index on both levels:
out = df.sort_index(level=[0,1],axis=1,ascending=[True, False])
sort_values also accepts axis=1, so the last line becomes:
df = df.sort_values(axis = 1, by = ["C", "D"], ascending = [True, False])
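As a sketch of the sort_index route with level names instead of positions (so that the unnamed top level holding 'Value' is not picked by accident), the question's frame can be ordered on both axes without transposing. The descending order on C and D mirrors the question's original code, and the name `out` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [2000] * 6,
                   'B': ['A+', 'B+', 'A+', 'B+', 'A+', 'B+'],
                   'C': ['M', 'M', 'M', 'F', 'F', 'F'],
                   'D': [1, 5, 3, 4, 2, 6],
                   'Value': [11, 12, 13, 14, 15, 16]}).set_index(['A', 'B', 'C', 'D'])
df = df.unstack(['C', 'D']).fillna(0)

# Rows: sort by index levels A and B (ascending).
# Columns: sort by column levels C and D, both descending.
out = (df.sort_index(axis=0, level=['A', 'B'])
         .sort_index(axis=1, level=['C', 'D'], ascending=[False, False]))
print(out.columns.tolist())
```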

Printing unique list of indices in multiindex pandas dataframe

I am just starting out with pandas and have the following code:
import pandas as pd
d = {'num_legs': [4, 4, 2, 2, 2],
'num_wings': [0, 0, 2, 2, 2],
'class': ['mammal', 'mammal','bird-mammal', 'mammal', 'bird'],
'animal': ['cat', 'dog','cat', 'bat', 'penguin'],
'locomotion': ['walks', 'walks','hops', 'flies', 'walks']}
df = pd.DataFrame(data=d)
df = df.set_index(['class', 'animal', 'locomotion'])
I want to print everything that the animal cat does; here, that will be 'walks' and 'hops'.
I can filter to just the cat cross-section using
df2=df.xs('cat', level=1)
But from here, how do I access the level 'locomotion'?
You can use get_level_values on the index of the cross-section:
df.xs('cat', level=1).index.get_level_values(1)
Out[181]: Index(['walks', 'hops'], dtype='object', name='locomotion')
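The same lookup can be written with level names rather than positions, which may be easier to read when the index has several levels. A minimal sketch on the question's data (the names `cat` and `moves` are illustrative); `.unique()` removes repeats in case an animal has the same locomotion listed twice.

```python
import pandas as pd

d = {'num_legs': [4, 4, 2, 2, 2],
     'num_wings': [0, 0, 2, 2, 2],
     'class': ['mammal', 'mammal', 'bird-mammal', 'mammal', 'bird'],
     'animal': ['cat', 'dog', 'cat', 'bat', 'penguin'],
     'locomotion': ['walks', 'walks', 'hops', 'flies', 'walks']}
df = pd.DataFrame(data=d).set_index(['class', 'animal', 'locomotion'])

# Select the 'cat' cross-section by level name, then read the remaining level.
cat = df.xs('cat', level='animal')
moves = cat.index.get_level_values('locomotion').unique().tolist()
print(moves)
```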

pandas dataframe subplot grouping by columns

df = pd.DataFrame([[0, 1, 2], [0, 1, 2]])
df.plot(subplots=True)
I want one subplot per group of columns: [0, 1] in one and [2] in another. Is there a way to do this?
You can map the column labels to group names with Index.map and a dictionary, then call DataFrameGroupBy.plot to get one plot per group:
mapping = {0:'a', 1:'a', 2:'b'}
df.groupby(df.columns.map(mapping.get), axis=1).plot()
Detail:
print (df.columns.map(mapping.get))
Index(['a', 'a', 'b'], dtype='object')
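Recent pandas releases deprecate groupby(..., axis=1), so an alternative sketch is to select each column group explicitly and draw it on its own matplotlib axes. The helper names `split_columns` and `plot_groups` are hypothetical, and matplotlib is assumed to be installed for the plotting step only.

```python
import pandas as pd

def split_columns(df, groups):
    """Return one sub-frame per named group of column labels."""
    return {name: df[list(cols)] for name, cols in groups.items()}

def plot_groups(df, groups):
    """Draw each column group on its own subplot (needs matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily so split_columns works without it
    parts = split_columns(df, groups)
    fig, axes = plt.subplots(nrows=len(parts), squeeze=False)
    for ax, (name, part) in zip(axes[:, 0], parts.items()):
        part.plot(ax=ax, title=name)
    return fig

df = pd.DataFrame([[0, 1, 2], [0, 1, 2]])
groups = {'a': [0, 1], 'b': [2]}
parts = split_columns(df, groups)
print({name: list(part.columns) for name, part in parts.items()})
```

Calling plot_groups(df, groups) would then produce a single figure with columns 0 and 1 in the first subplot and column 2 in the second.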