populate a dense dataframe given a key-value dataframe - pandas

I have a key-value dataframe:
pd.DataFrame(columns=['X','Y','val'],data= [['a','z',5],['b','g',3],['b','y',6],['e','r',9]])
> X Y val
0 a z 5
1 b g 3
2 b y 6
3 e r 9
Which I'd like to convert into a denser dataframe:
X z g y r
0 a 5 0 0 0
1 b 0 3 6 0
2 e 0 0 0 9
Before I resort to a pure-Python solution, I was wondering if there is a simple way to do this with pandas.

You can use get_dummies:
In [11]: dummies = pd.get_dummies(df['Y'])
In [12]: dummies
Out[12]:
g r y z
0 0 0 0 1
1 1 0 0 0
2 0 0 1 0
3 0 1 0 0
and then multiply by the val column:
In [13]: res = dummies.mul(df['val'], axis=0)
In [14]: res
Out[14]:
g r y z
0 0 0 0 5
1 3 0 0 0
2 0 0 6 0
3 0 9 0 0
To fix the index, you can add X to it by first applying set_index:
In [21]: df1 = df.set_index('X', append=True)
In [22]: df1
Out[22]:
Y val
X
0 a z 5
1 b g 3
2 b y 6
3 e r 9
In [23]: dummies = pd.get_dummies(df1['Y'])
In [24]: dummies.mul(df1['val'], axis=0)
Out[24]:
g r y z
X
0 a 0 0 0 5
1 b 3 0 0 0
2 b 0 0 6 0
3 e 0 9 0 0
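The set_index and get_dummies steps can also be chained so X ends up as the index in one pass; a minimal sketch of the same idea:

```python
import pandas as pd

df = pd.DataFrame(columns=['X', 'Y', 'val'],
                  data=[['a', 'z', 5], ['b', 'g', 3], ['b', 'y', 6], ['e', 'r', 9]])

# Move X into the index once, then one-hot encode Y and scale each row by val
tmp = df.set_index('X')
res = pd.get_dummies(tmp['Y']).mul(tmp['val'], axis=0)
```

Note that duplicate X values (like 'b' here) remain as separate rows; they are not aggregated.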
Alternatively, you can do this with pivot (you can also use pivot_table); note that in recent pandas versions pivot takes keyword-only arguments:
In [31]: df.pivot(index='X', columns='Y').fillna(0)
Out[31]:
val
Y g r y z
X
a 0 0 0 5
b 3 0 6 0
e 0 9 0 0
Perhaps you want to reset_index, to make X a column (I'm not sure whether that makes sense):
In [32]: df.pivot(index='X', columns='Y').fillna(0).reset_index()
Out[32]:
X val
Y g r y z
0 a 0 0 0 5
1 b 3 0 6 0
2 e 0 9 0 0
For completeness, the pivot_table:
In [33]: df.pivot_table('val', 'X', 'Y', fill_value=0)
Out[33]:
Y g r y z
X
a 0 0 0 5
b 3 0 6 0
e 0 9 0 0
In [34]: df.pivot_table('val', 'X', 'Y', fill_value=0).reset_index()
Out[34]:
Y X g r y z
0 a 0 0 0 5
1 b 3 0 6 0
2 e 0 9 0 0
Note: after resetting the index, the columns axis is still named Y; I'm not sure whether this makes sense (it's easy to rectify via res.columns.name = None).
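Putting the pivot_table route and that clean-up together (keyword arguments used throughout):

```python
import pandas as pd

df = pd.DataFrame(columns=['X', 'Y', 'val'],
                  data=[['a', 'z', 5], ['b', 'g', 3], ['b', 'y', 6], ['e', 'r', 9]])

# Pivot to the dense shape, make X a column again, and drop the 'Y' axis label
res = df.pivot_table(values='val', index='X', columns='Y', fill_value=0).reset_index()
res.columns.name = None
```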

If you want something that feels more direct, something akin to DataFrame.lookup but for np.put might make sense:
def lookup_index(self, row_labels, col_labels):
    ridx = self.index.get_indexer(row_labels)
    cidx = self.columns.get_indexer(col_labels)
    if (ridx == -1).any():
        raise ValueError('One or more row labels was not found')
    if (cidx == -1).any():
        raise ValueError('One or more column labels was not found')
    flat_index = ridx * len(self.columns) + cidx
    return flat_index
flat_index = lookup_index(df, vals.X, vals.Y)
np.put(df.values, flat_index, vals.val.values)
This assumes that df has the appropriate columns and index to hold the X/Y values. Here's an ipython notebook http://nbviewer.ipython.org/6454120
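As a quick sanity check of the helper, here is a self-contained run; the pre-allocated frame and its labels are made up to match the question's data:

```python
import numpy as np
import pandas as pd

def lookup_index(frame, row_labels, col_labels):
    # Translate label pairs into positions in the flattened values array
    ridx = frame.index.get_indexer(row_labels)
    cidx = frame.columns.get_indexer(col_labels)
    if (ridx == -1).any():
        raise ValueError('One or more row labels was not found')
    if (cidx == -1).any():
        raise ValueError('One or more column labels was not found')
    return ridx * len(frame.columns) + cidx

vals = pd.DataFrame({'X': ['a', 'b', 'b', 'e'],
                     'Y': ['z', 'g', 'y', 'r'],
                     'val': [5, 3, 6, 9]})

# Pre-allocate the dense array, scatter the values into it, then wrap it
arr = np.zeros((3, 4), dtype=int)
template = pd.DataFrame(arr, index=['a', 'b', 'e'], columns=['g', 'r', 'y', 'z'])
flat = lookup_index(template, vals.X, vals.Y)
np.put(arr, flat, vals.val.to_numpy())
dense = pd.DataFrame(arr, index=template.index, columns=template.columns)
```

Scattering into the plain ndarray and wrapping it afterwards sidesteps any question of whether df.values is a view or a copy.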

Related

create new pandas column that is a tabulation of rows above

I have:
pd.DataFrame({'col1':['A','A','B','F']})
col1
0 A
1 A
2 B
3 F
I want:
pd.DataFrame({'col1':['A','A','B','F'],'col2':['1A:0B:0C:0D:0E:0F','2A:0B:0C:0D:0E:0F','2A:1B:0C:0D:0E:0F','2A:1B:0C:0D:0E:1F']})
col1 col2
0 A 1A:0B:0C:0D:0E:0F
1 A 2A:0B:0C:0D:0E:0F
2 B 2A:1B:0C:0D:0E:0F
3 F 2A:1B:0C:0D:0E:1F
Requirements:
I have a column that can take one of 6 values (A:F). I want to create a new column that shows the running total of the values of that row and above.
any suggestions?
You can use get_dummies + cumsum. That output is generally easier to work with, but if you need that single string output, you can join the columns with the counts. The .reindex and .fillna ensure everything is ordered and includes exactly the categories you want.
import pandas as pd
df = pd.DataFrame({'col1':['A','A','B','F']})
df = (pd.get_dummies(df['col1'])
        .reindex(list('ABCDEF'), axis=1)
        .fillna(0, downcast='infer')
        .cumsum())
# A B C D E F
#0 1 0 0 0 0 0
#1 2 0 0 0 0 0
#2 2 1 0 0 0 0
#3 2 1 0 0 0 1
df['res'] = [':'.join(x) for x in (df.astype(str)+df.columns).to_numpy()]
# A B C D E F res
#0 1 0 0 0 0 0 1A:0B:0C:0D:0E:0F
#1 2 0 0 0 0 0 2A:0B:0C:0D:0E:0F
#2 2 1 0 0 0 0 2A:1B:0C:0D:0E:0F
#3 2 1 0 0 0 1 2A:1B:0C:0D:0E:1F
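The same running-count idea can be sketched with pd.crosstab instead of get_dummies (a crosstab of the row index against col1, then the same reindex and cumsum):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B', 'F']})

# Per-row indicator counts, padded to all six categories, then running totals
counts = (pd.crosstab(df.index, df['col1'])
            .reindex(columns=list('ABCDEF'), fill_value=0)
            .cumsum())
df['col2'] = [':'.join(f'{n}{c}' for c, n in row.items())
              for _, row in counts.iterrows()]
```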

Python Select N number of rows dataframe

I have a dataframe with 2 columns, and I want to select N rows from column B per unique value in column A:
A B
0 A
0 B
0 I
0 D
1 A
1 F
1 K
1 L
2 R
For each unique number in column A, give me N random rows from column B. If a group in column A has fewer than N rows, return all of them. With N == 2, the resulting dataframe would look like:
A B
0 A
0 D
1 F
1 K
2 R
Use DataFrame.sample per group in GroupBy.apply, testing the length of each group with if-else:
N = 2
df1 = df.groupby('A').apply(lambda x: x.sample(N) if len(x) >= N else x).reset_index(drop=True)
print (df1)
A B
0 0 I
1 0 D
2 1 A
3 1 K
4 2 R
Or:
N = 2
df1 = df.groupby('A', group_keys=False).apply(lambda x: x.sample(N) if len(x) >= N else x)
print (df1)
A B
0 0 A
3 0 D
5 1 F
6 1 K
8 2 R
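An apply-free variant with the same behavior: shuffle the frame once, then take the first N rows per group with GroupBy.head, which happily returns fewer rows when a group is small. The random_state is pinned here only to make the sketch repeatable:

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 0, 0, 1, 1, 1, 1, 2],
                   'B': list('ABIDAFKLR')})
N = 2

df1 = (df.sample(frac=1, random_state=0)   # shuffle all rows
         .groupby('A')
         .head(N)                          # first N shuffled rows per group
         .sort_index())
```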

How to display nth-row and last row

I need to display every nth row plus the last row using pandas. I know that every nth row can be selected using iloc,
for example:
data = {"x1": [1,2,3,4,5,6,7,8,9,10], "x2": ["a","b","c","d","e","f","g","h","i","j"]}
df = pd.DataFrame(data=data)
a = df.iloc[::2]
print(a)
will display
x1 x2
0 1 a
2 3 c
4 5 e
6 7 g
8 9 i
But I need it to be:
x1 x2
0 1 a
2 3 c
4 5 e
6 7 g
8 9 i
9 10 j
How can this be achieved?
Use a union of indices and select by loc (this assumes the default RangeIndex):
a = df.loc[df.index[::2].union([df.index[-1]])]
print(a)
x1 x2
0 1 a
2 3 c
4 5 e
6 7 g
8 9 i
9 10 j
Detail:
print(df.index[::2].union([df.index[-1]]))
Int64Index([0, 2, 4, 6, 8, 9], dtype='int64')
Another more general solution:
data = {"x1": [1,2,3,4,5,6,7,8,9,10], "x2": ["a","b","c","d","e","f","g","h","i","j"]}
df = pd.DataFrame(data=data, index=[0]*10)
print (df)
x1 x2
0 1 a
0 2 b
0 3 c
0 4 d
0 5 e
0 6 f
0 7 g
0 8 h
0 9 i
0 10 j
arr = np.arange(len(df.index))
a = df.iloc[np.union1d(arr[::2], [arr[-1]])]
print(a)
x1 x2
0 1 a
0 3 c
0 5 e
0 7 g
0 9 i
0 10 j
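The positional variant condensed into a couple of lines, which works regardless of what the index looks like:

```python
import numpy as np
import pandas as pd

data = {"x1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "x2": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]}
df = pd.DataFrame(data=data)

# Every 2nd position plus the last position; union1d also deduplicates
pos = np.union1d(np.arange(0, len(df), 2), len(df) - 1)
a = df.iloc[pos]
```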

Select rows if columns meet condition

I have a DataFrame with 75 columns.
How can I select rows based on a condition in a specific array of columns? If I want to do this on all columns I can just use
df[(df.values > 1.5).any(1)]
But let's say I just want to do this on columns 3:45.
Use ix (in modern pandas use iloc instead, since ix has been removed) to slice the columns by ordinal position:
In [31]:
df = pd.DataFrame(np.random.randn(5,10), columns=list('abcdefghij'))
df
Out[31]:
a b c d e f g \
0 -0.362353 0.302614 -1.007816 -0.360570 0.317197 1.131796 0.351454
1 1.008945 0.831101 -0.438534 -0.653173 0.234772 -1.179667 0.172774
2 0.900610 0.409017 -0.257744 0.167611 1.041648 -0.054558 -0.056346
3 0.335052 0.195865 0.085661 0.090096 2.098490 0.074971 0.083902
4 -0.023429 -1.046709 0.607154 2.219594 0.381031 -2.047858 -0.725303
h i j
0 0.533436 -0.374395 0.633296
1 2.018426 -0.406507 -0.834638
2 -0.079477 0.506729 1.372538
3 -0.791867 0.220786 -1.275269
4 -0.584407 0.008437 -0.046714
So to slice the 4th to 5th columns inclusive:
In [32]:
df.ix[:, 3:5]
Out[32]:
d e
0 -0.360570 0.317197
1 -0.653173 0.234772
2 0.167611 1.041648
3 0.090096 2.098490
4 2.219594 0.381031
So in your case
df[(df.ix[:, 2:45].values > 1.5).any(1)]
should work
Indexing is 0-based, and the start of the range is included but the end is not, so here the 3rd column is included and we slice up to the 46th column, which is excluded from the slice.
Another solution with iloc; .values can be omitted:
#if need from 3rd to 45th columns
print (df[((df.iloc[:, 2:45]) > 1.5).any(1)])
Sample:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(3, size=(5,10)), columns=list('abcdefghij'))
print (df)
a b c d e f g h i j
0 1 0 0 1 1 0 0 1 0 1
1 0 2 1 2 0 2 1 2 0 0
2 2 0 1 2 2 0 1 1 2 0
3 2 1 1 1 1 2 1 1 0 0
4 1 0 0 1 2 1 0 2 2 1
print (df[((df.iloc[:, 2:5]) > 1.5).any(1)])
a b c d e f g h i j
1 0 2 1 2 0 2 1 2 0 0
2 2 0 1 2 2 0 1 1 2 0
4 1 0 0 1 2 1 0 2 2 1
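If the columns are easier to name than to count, the same filter can be written label-based with loc; note that label slicing is inclusive of both endpoints (column names here come from the sample frame above):

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame(np.random.randint(3, size=(5, 10)), columns=list('abcdefghij'))

# Rows where any of columns 'c' through 'e' (inclusive) exceeds 1.5
res = df[(df.loc[:, 'c':'e'] > 1.5).any(axis=1)]
```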

create new column based on other columns in pandas dataframe

What is the best way to create a set of new columns based on two other columns? (similar to a crosstab or SQL case statement)
This works but performance is very slow on large dataframes:
for label in labels:
    df[label + '_amt'] = df.apply(lambda row: row['amount'] if row['product'] == label else 0, axis=1)
You can use pivot_table
>>> df
amount product
0 6 b
1 3 c
2 3 a
3 7 a
4 7 a
>>> df.pivot_table(index=df.index, values='amount',
... columns='product', fill_value=0)
product a b c
0 0 6 0
1 0 0 3
2 3 0 0
3 7 0 0
4 7 0 0
or,
>>> for label in df['product'].unique():
...     df[label + '_amt'] = (df['product'] == label) * df['amount']
...
>>> df
amount product b_amt c_amt a_amt
0 6 b 6 0 0
1 3 c 0 3 0
2 3 a 0 0 3
3 7 a 0 0 7
4 7 a 0 0 7
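To get exactly the label + '_amt' column names that the original loop produced, the pivoted table can be suffixed and joined back onto the frame; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'amount': [6, 3, 3, 7, 7],
                   'product': ['b', 'c', 'a', 'a', 'a']})

# Pivot to one column per product, rename with the _amt suffix, join back
wide = (df.pivot_table(index=df.index, values='amount',
                       columns='product', fill_value=0)
          .add_suffix('_amt'))
res = df.join(wide)
```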