Reshape neural network input based on condition - numpy

I'm working with numerical data and a deep neural network; I'm using the Keras library for this purpose.
p u d ms action B x y-c pre area finger
0 0 36 3 1334893336790 0 1 262 262 262 0.044444 0.0
1 0 36 3 1334893336790 2 1 262 271 0.32 0.044444 0.0
2 0 36 3 1334893336795 2 1 123 327 0.28 0.044444 0.0
3 0 36 3 1334893336800 1 1 123 327 0.28 0.044444 0.0
4 0 36 3 1334893336885 0 1 216 298 0.34 0.044444 0.0
5 0 36 3 1334893336907 2 1 216 298 0.38 0.044444 0.0
6 0 36 3 1334893336926 2 1 147 312 0.60 0.088889 0.0
7 0 36 3 1334893336949 2 1 115 328 0.63 0.044444 0.0
8 0 36 3 1334893336952 2 1 98 336 0.17 0.133333 0.0
9 0 36 3 1334893336971 1 1 98 336 0.17 0.133333 0.0
10 0 36 3 1334893337798 0 1 108 339 0.48 0.044444 0.0
The code below works, but as I understand it, neural network inputs are currently fed row by row. Instead, I am trying to build each network input from a group of rows delimited by the action column, as in the example above: a group starts where action is 0 and ends where action is 1, so the first input would be rows 0 to 3 (3 included), the second would be rows 4 to 9 (9 included), and so on.
The values in the action column represent the hand movement: 0 means the finger pressed down on the screen and 1 means it was lifted, so each press/lift pair forms one stroke. I am trying to make the network input per stroke; based on this idea the number of inputs would drop from about 900k rows to about 20k samples, but each input would then span multiple rows.
the first input will be as below:
p u d ms action B x y-c pre area finger
0 0 36 3 1334893336790 0 1 262 262 262 0.044444 0.0
1 0 36 3 1334893336790 2 1 262 271 0.32 0.044444 0.0
2 0 36 3 1334893336795 2 1 123 327 0.28 0.044444 0.0
3 0 36 3 1334893336800 1 1 123 327 0.28 0.044444 0.0
and the second input will be :
p u d ms action B x y-c pre area finger
4 0 36 3 1334893336885 0 1 216 298 0.34 0.044444 0.0
5 0 36 3 1334893336907 2 1 216 298 0.38 0.044444 0.0
6 0 36 3 1334893336926 2 1 147 312 0.60 0.088889 0.0
7 0 36 3 1334893336949 2 1 115 328 0.63 0.044444 0.0
8 0 36 3 1334893336952 2 1 98 336 0.17 0.133333 0.0
9 0 36 3 1334893336971 1 1 98 336 0.17 0.133333 0.0
Here is my code. It works well in the normal row-by-row setup, but I'm trying to change it to follow the idea above.
import numpy as np
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense

o = 0  # o = no_of_click (number of strokes)
lenf = len(dataset)
for h in dataset.index[dataset.iloc[:, 4] == 0]:
    if dataset.iloc[h + 1, 4] == 1:
        dataset.iloc[h + 1, 4] = -1
        dataset.iloc[h, 4] = -1
        o = o + 1
dataset = dataset.drop(dataset[dataset.iloc[:, 4] == -1].index)
lenf = o * 2
X = dataset.iloc[:, 2:].values  # features: columns 3 to 11
y = dataset.iloc[:, 1].values   # target: column 2, the user id
binariz = LabelBinarizer()
s = binariz.fit_transform(X[:, 0])
X = np.delete(X, [0], axis=1)
X = np.hstack([s,X])
y = binariz.fit_transform(y)
# X Features scaling
sc_X = StandardScaler()
X = sc_X.fit_transform(X)
# Splitting Data
X_train, X_test,y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
classifier = Sequential()
# Adding the input layer and the first hidden layer
classifier.add(Dense(units = 50, activation = 'relu', input_dim = X_train.shape[1]))
# Adding the second hidden layer
classifier.add(Dense(units = 50, activation = 'relu'))
# Adding the output layer
classifier.add(Dense(units = y.shape[1], activation = 'softmax'))
# Compiling the ANN
classifier.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
# Fitting the ANN to the Training set
classifier.fit(X_train, y_train, batch_size = 100, epochs = 10)

I'm not sure if I understood your question correctly; apologies in advance if I'm making incorrect assumptions.
It seems to me that you are asking whether you can reshape the input vector so in one case it has shape=(4,) and in the other shape=(6,).
I don't believe you can: when you add a Dense layer after an input layer, that Dense layer has a weight matrix shaped (input_dims, output_dims), and that shape is fixed when the graph is built.
Even if you could, I don't think you would want to.
The input vector to a NN is a set of features; it seems that in your case this is a set of different measurements. You don't want to feed the network with measurement of feature0 in the input tensor position 0 in one scenario and measurement of feature4 in another scenario. That makes it much harder for the network to understand how to process these values.
Given that you have a small set of features, is there any reason you don't just pass all the data all the time?
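If you do want per-stroke samples despite the caveats above, one common workaround is to cut the frame at each press/lift pair and zero-pad every stroke to a common length, so the input shape stays fixed. This is only a sketch, not part of the original code; `split_strokes` and `pad_strokes` are hypothetical helper names, and the column names are assumptions:

```python
import numpy as np
import pandas as pd

def split_strokes(df, action_col='action'):
    """Split rows into strokes: each stroke runs from an action == 0 row
    (finger down) through the next action == 1 row (finger up), inclusive."""
    strokes, current = [], None
    for _, row in df.iterrows():
        if row[action_col] == 0:
            current = [row]          # start a new stroke on finger-down
        elif current is not None:
            current.append(row)
            if row[action_col] == 1:  # finger-up closes the stroke
                strokes.append(pd.DataFrame(current))
                current = None
    return strokes

def pad_strokes(strokes, feature_cols, max_len):
    """Zero-pad (or truncate) each stroke to max_len rows so every
    sample has the same fixed shape (n_strokes, max_len, n_features)."""
    out = np.zeros((len(strokes), max_len, len(feature_cols)))
    for i, s in enumerate(strokes):
        vals = s[feature_cols].to_numpy()[:max_len]
        out[i, :len(vals)] = vals
    return out
```

The padded 3-D array can then be flattened per sample for a Dense network, or fed as-is to a recurrent or 1-D convolutional layer, which handles variable-length sequences more naturally.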

Related

Calculate Weights of a Column in Pandas

This is a basic question that is easy to do in Excel, but I have no idea how to do it in Python, and every example online uses groupby with repeated names in the name column. All I need is a weights column computed from a single column. Suppose I have data that looks like this:
name value
0 A 45
1 B 76
2 C 320
3 D 210
The answer should look like this:
0 name value weights
1 A 45 0.069124
2 B 76 0.116743
3 C 320 0.491551
4 D 210 0.322581
thank you,
If the name column contained repeated groups (demonstrated below with every name set to 'A', so all rows share one group), you could use GroupBy.transform to broadcast each group's sum back to its rows, then divide the original column by it:
print (df.groupby('name')['value'].transform('sum'))
0 651
1 651
2 651
3 651
Name: value, dtype: int64
df['weights'] = df['value'].div(df.groupby('name')['value'].transform('sum'))
print (df)
name value weights
0 A 45 0.069124
1 A 76 0.116743
2 A 320 0.491551
3 A 210 0.322581
EDIT: If the names are unique, as in your sample data, simply divide by the column total:
df['weights'] = df['value'].div(df['value'].sum())
print (df)
name value weights
0 A 45 0.069124
1 B 76 0.116743
2 C 320 0.491551
3 D 210 0.322581
You can also group by 'name' and then apply a function that divides each value by its group sum (again demonstrated with every name set to 'A', so all rows form a single group):
df['weights'] = df.groupby('name')['value'].apply(lambda x: x / x.sum())
Output:
name value weights
0 A 45 0.069124
1 A 76 0.116743
2 A 320 0.491551
3 A 210 0.322581
For your data, where each name is unique:
df['weights'] = df['value'] / df['value'].sum()
name value weights
0 A 45 0.069124
1 B 76 0.116743
2 C 320 0.491551
3 D 210 0.322581

NaNs when using Pandas subtract [duplicate]

The question
Given a Series s and DataFrame df, how do I operate on each column of df with s?
df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=[0, 1],
    columns=['a', 'b', 'c']
)
s = pd.Series([3, 14], index=[0, 1])
When I attempt to add them, I get all np.nan
df + s
a b c 0 1
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
What I thought I should get is
a b c
0 4 5 6
1 18 19 20
Objective and motivation
I've seen this kind of question several times over and have seen many other questions that involve some element of this. Most recently, I had to spend a bit of time explaining this concept in comments while looking for an appropriate canonical Q&A. I did not find one and so I thought I'd write one.
These questions usually arise with respect to a specific operation, but they apply equally to most arithmetic operations.
How do I subtract a Series from every column in a DataFrame?
How do I add a Series to every column in a DataFrame?
How do I multiply every column in a DataFrame by a Series?
How do I divide every column in a DataFrame by a Series?
It is helpful to create a mental model of what Series and DataFrame objects are.
Anatomy of a Series
A Series should be thought of as an enhanced dictionary. This isn't always a perfect analogy, but we'll start here. Also, there are other analogies that you can make, but I am targeting a dictionary in order to demonstrate the purpose of this post.
index
These are the keys that we can reference to get at the corresponding values. When the elements of the index are unique, the comparison to a dictionary becomes very close.
values
These are the corresponding values that are keyed by the index.
Anatomy of a DataFrame
A DataFrame should be thought of as a dictionary of Series or a Series of Series. In this case the keys are the column names and the values are the columns themselves as Series objects. Each Series agrees to share the same index which is the index of the DataFrame.
columns
These are the keys that we can reference to get at the corresponding Series.
index
This is the index that all of the Series values agree to share.
Note: RE: columns and index objects
They are the same kind of object. One DataFrame's index can be used as another DataFrame's columns. In fact, this happens when you do df.T to get a transpose.
values
This is a two-dimensional array that contains the data in a DataFrame. The reality is that values is not what is stored inside the DataFrame object. (Well, sometimes it is, but I'm not about to try to describe the block manager). The point is, it is better to think of this as access to a two-dimensional array of the data.
Define Sample Data
These are sample pandas.Index objects that can be used as the index of a Series or DataFrame or can be used as the columns of a DataFrame:
idx_lower = pd.Index([*'abcde'], name='lower')
idx_range = pd.RangeIndex(5, name='range')
These are sample pandas.Series objects that use the pandas.Index objects above:
s0 = pd.Series(range(10, 15), idx_lower)
s1 = pd.Series(range(30, 40, 2), idx_lower)
s2 = pd.Series(range(50, 10, -8), idx_range)
These are sample pandas.DataFrame objects that use the pandas.Index objects above:
df0 = pd.DataFrame(100, index=idx_range, columns=idx_lower)
df1 = pd.DataFrame(
    np.arange(np.prod(df0.shape)).reshape(df0.shape),
    index=idx_range, columns=idx_lower
)
Series on Series
When operating on two Series, the alignment is obvious. You align the index of one Series with the index of the other.
s1 + s0
lower
a 40
b 43
c 46
d 49
e 52
dtype: int64
Which is the same as when I randomly shuffle one before I operate. The indices will still align.
s1 + s0.sample(frac=1)
lower
a 40
b 43
c 46
d 49
e 52
dtype: int64
But that is not the case when I instead operate with the values of the shuffled Series. In this case, Pandas has no index to align with and therefore operates by position.
s1 + s0.sample(frac=1).values
lower
a 42
b 42
c 47
d 50
e 49
dtype: int64
Add a scalar
s1 + 1
lower
a 31
b 33
c 35
d 37
e 39
dtype: int64
DataFrame on DataFrame
The same is true when operating between two DataFrames. The alignment is obvious and does what we think it should do:
df0 + df1
lower a b c d e
range
0 100 101 102 103 104
1 105 106 107 108 109
2 110 111 112 113 114
3 115 116 117 118 119
4 120 121 122 123 124
Here I shuffle the second DataFrame on both axes. The index and columns still align and give us the same thing.
df0 + df1.sample(frac=1).sample(frac=1, axis=1)
lower a b c d e
range
0 100 101 102 103 104
1 105 106 107 108 109
2 110 111 112 113 114
3 115 116 117 118 119
4 120 121 122 123 124
This is the same shuffling, but it adds the underlying array rather than the DataFrame. Nothing is aligned any longer, so we get different results.
df0 + df1.sample(frac=1).sample(frac=1, axis=1).values
lower a b c d e
range
0 123 124 121 122 120
1 118 119 116 117 115
2 108 109 106 107 105
3 103 104 101 102 100
4 113 114 111 112 110
Add a one-dimensional array. It will align with columns and broadcast across rows.
df0 + [*range(2, df0.shape[1] + 2)]
lower a b c d e
range
0 102 103 104 105 106
1 102 103 104 105 106
2 102 103 104 105 106
3 102 103 104 105 106
4 102 103 104 105 106
Add a scalar. There isn't anything to align with, so it broadcasts to everything:
df0 + 1
lower a b c d e
range
0 101 101 101 101 101
1 101 101 101 101 101
2 101 101 101 101 101
3 101 101 101 101 101
4 101 101 101 101 101
DataFrame on Series
If DataFrames are to be thought of as dictionaries of Series and Series are to be thought of as dictionaries of values, then it is natural that when operating between a DataFrame and Series that they should be aligned by their "keys".
s0:
lower a b c d e
10 11 12 13 14
df0:
lower a b c d e
range
0 100 100 100 100 100
1 100 100 100 100 100
2 100 100 100 100 100
3 100 100 100 100 100
4 100 100 100 100 100
And when we operate, the 10 in s0['a'] gets added to the entire column of df0['a']:
df0 + s0
lower a b c d e
range
0 110 111 112 113 114
1 110 111 112 113 114
2 110 111 112 113 114
3 110 111 112 113 114
4 110 111 112 113 114
The heart of the issue and point of the post
What about if I want s2 and df0?
s2: df0:
| lower a b c d e
range | range
0 50 | 0 100 100 100 100 100
1 42 | 1 100 100 100 100 100
2 34 | 2 100 100 100 100 100
3 26 | 3 100 100 100 100 100
4 18 | 4 100 100 100 100 100
When I operate, I get the all np.nan as cited in the question:
df0 + s2
a b c d e 0 1 2 3 4
range
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
This does not produce what we wanted, because Pandas is aligning the index of s2 with the columns of df0. The columns of the result are the union of the index of s2 and the columns of df0.
We could fake it out with a tricky transposition:
(df0.T + s2).T
lower a b c d e
range
0 150 150 150 150 150
1 142 142 142 142 142
2 134 134 134 134 134
3 126 126 126 126 126
4 118 118 118 118 118
But it turns out Pandas has a better solution. There are operation methods that allow us to pass an axis argument to specify the axis to align with.
- sub
+ add
* mul
/ div
** pow
And so the answer is simply:
df0.add(s2, axis='index')
lower a b c d e
range
0 150 150 150 150 150
1 142 142 142 142 142
2 134 134 134 134 134
3 126 126 126 126 126
4 118 118 118 118 118
It turns out that axis='index' is synonymous with axis=0, just as axis='columns' is synonymous with axis=1:
df0.add(s2, axis=0)
lower a b c d e
range
0 150 150 150 150 150
1 142 142 142 142 142
2 134 134 134 134 134
3 126 126 126 126 126
4 118 118 118 118 118
The rest of the operations
df0.sub(s2, axis=0)
lower a b c d e
range
0 50 50 50 50 50
1 58 58 58 58 58
2 66 66 66 66 66
3 74 74 74 74 74
4 82 82 82 82 82
df0.mul(s2, axis=0)
lower a b c d e
range
0 5000 5000 5000 5000 5000
1 4200 4200 4200 4200 4200
2 3400 3400 3400 3400 3400
3 2600 2600 2600 2600 2600
4 1800 1800 1800 1800 1800
df0.div(s2, axis=0)
lower a b c d e
range
0 2.000000 2.000000 2.000000 2.000000 2.000000
1 2.380952 2.380952 2.380952 2.380952 2.380952
2 2.941176 2.941176 2.941176 2.941176 2.941176
3 3.846154 3.846154 3.846154 3.846154 3.846154
4 5.555556 5.555556 5.555556 5.555556 5.555556
df0.pow(1 / s2, axis=0)
lower a b c d e
range
0 1.096478 1.096478 1.096478 1.096478 1.096478
1 1.115884 1.115884 1.115884 1.115884 1.115884
2 1.145048 1.145048 1.145048 1.145048 1.145048
3 1.193777 1.193777 1.193777 1.193777 1.193777
4 1.291550 1.291550 1.291550 1.291550 1.291550
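To tie the above together, a minimal runnable check using the df and s from the original question at the top of this post:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                  index=[0, 1], columns=['a', 'b', 'c'])
s = pd.Series([3, 14], index=[0, 1])

# Plain `+` aligns the index of s (0, 1) against the columns of df
# ('a', 'b', 'c'), so every cell in the union comes out NaN.
all_nan = (df + s).isna().all().all()

# add with axis='index' (axis=0) aligns s against the rows instead,
# adding 3 to row 0 and 14 to row 1.
out = df.add(s, axis='index')
```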
I prefer the method mentioned by piRSquared (i.e., df.add(s, axis=0)), but another method uses apply together with a lambda to perform an action on each column in the dataframe:
>>> df.apply(lambda col: col + s)
a b c
0 4 5 6
1 18 19 20
To apply the lambda function to the rows, use axis=1:
>>> df.T.apply(lambda row: row + s, axis=1)
0 1
a 4 18
b 5 19
c 6 20
This method could be useful when the transformation is more complex, e.g.:
df.apply(lambda col: 0.5 * col ** 2 + 2 * s - 3)
Just to add an extra layer from my own experience, extending what others have done here: this shows how to operate on a DataFrame with a Series when the DataFrame has extra columns whose values you want to keep. Below is a short demonstration of the process.
import pandas as pd
d = [1.056323, 0.126681, 0.142588, 0.254143,
     0.15561, 0.139571, 0.102893, 0.052411]
df = pd.Series(d, index = ['const', '426', '428', '424', '425', '423', '427', '636'])
print(df)
const 1.056323
426 0.126681
428 0.142588
424 0.254143
425 0.155610
423 0.139571
427 0.102893
636 0.052411
d2 = {
    'loc': ['D', 'D', 'E', 'E', 'F', 'F', 'G', 'G', 'E', 'D'],
    '426': [9, 2, 3, 2, 4, 0, 2, 7, 2, 8],
    '428': [2, 4, 1, 0, 2, 1, 3, 0, 7, 8],
    '424': [1, 10, 5, 8, 2, 7, 10, 0, 3, 5],
    '425': [9, 2, 6, 8, 9, 1, 7, 3, 8, 6],
    '423': [2, 7, 3, 10, 8, 1, 2, 9, 3, 9],
    '427': [4, 10, 4, 0, 8, 3, 1, 5, 7, 7],
    '636': [10, 5, 6, 4, 0, 5, 1, 1, 4, 8],
    'seq': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
}
df2 = pd.DataFrame(d2)
print(df2)
loc 426 428 424 425 423 427 636 seq
0 D 9 2 1 9 2 4 10 1
1 D 2 4 10 2 7 10 5 1
2 E 3 1 5 6 3 4 6 1
3 E 2 0 8 8 10 0 4 1
4 F 4 2 2 9 8 8 0 1
5 F 0 1 7 1 1 3 5 1
6 G 2 3 10 7 2 1 1 1
7 G 7 0 0 3 9 5 1 1
8 E 2 7 3 8 3 7 4 1
9 D 8 8 5 6 9 7 8 1
To multiply a DataFrame by a Series and keep dissimilar columns
Create a list of the columns in the DataFrame and Series that you want to operate on:
col = ['426', '428', '424', '425', '423', '427', '636']
Perform your operation using the list and indicate the axis to use:
df2[col] = df2[col].mul(df[col], axis=1)
print(df2)
loc 426 428 424 425 423 427 636 seq
0 D 1.140129 0.285176 0.254143 1.40049 0.279142 0.411572 0.524110 1
1 D 0.253362 0.570352 2.541430 0.31122 0.976997 1.028930 0.262055 1
2 E 0.380043 0.142588 1.270715 0.93366 0.418713 0.411572 0.314466 1
3 E 0.253362 0.000000 2.033144 1.24488 1.395710 0.000000 0.209644 1
4 F 0.506724 0.285176 0.508286 1.40049 1.116568 0.823144 0.000000 1
5 F 0.000000 0.142588 1.779001 0.15561 0.139571 0.308679 0.262055 1
6 G 0.253362 0.427764 2.541430 1.08927 0.279142 0.102893 0.052411 1
7 G 0.886767 0.000000 0.000000 0.46683 1.256139 0.514465 0.052411 1
8 E 0.253362 0.998116 0.762429 1.24488 0.418713 0.720251 0.209644 1
9 D 1.013448 1.140704 1.270715 0.93366 1.256139 0.720251 0.419288 1

pandas - Replacing values in a column

I have a dataframe df like this:
ID_USER CODE
0 433805 11.0
24 5448 44.0
48 3434 11.0
72 34434 11.0
96 3202 33.0
120 23766 33.0
153 39457 44.0
168 4113 33.0
172 3435 13.0
374 34093 11.0
I am trying to replace the values in the 'CODE' column with other values:
11.0 and 44.0 -> 1
33.0 -> 0
all other -> 5
So I did among others the following:
df['CODE'] = df.apply(lambda s: func1(s))

def func1(x):
    if (x['CODE'] == 11.0) or (x['CODE'] == 44.0):
        return 1
    elif (x['CODE'] == 33.0):
        return 0
    else:
        return 5
And I get this error:
KeyError: ('NTL', u'occurred at index ID_UC')
How can I solve my problem?
You can use np.where
df1.CODE = np.where((df1.CODE == 11.0) | (df1.CODE == 44.0), 1, np.where((df1.CODE == 33.0), 0, 5))
ID_USER CODE
0 433805 1
24 5448 1
48 3434 1
72 34434 1
96 3202 0
120 23766 0
153 39457 1
168 4113 0
172 3435 5
374 34093 1
The short answer is that you forgot to specify the axis over which to apply. By default, apply iterates over every column. Your function looks up x['CODE'], so it is safe to assume you meant it to iterate over rows:
df.apply(lambda s:func1(s), axis=1)
0 1
24 1
48 1
72 1
96 0
120 0
153 1
168 0
172 5
374 1
dtype: int64
You can shorten this up with
df.apply(func1, 1)
That said, I'd improve your function by assuming it receives a pd.Series rather than a row of a pd.DataFrame, and apply it to the targeted column only:
def func2(x):
    return 1 if (x == 11) or (x == 44) else 0 if (x == 33) else 5
df.CODE.apply(func2)
Even better, I like using map + lambda
m = {11: 1, 44: 1, 33: 0}
df.CODE.map(lambda x: m.get(x, 5))
All together
df.assign(CODE=df.CODE.map(lambda x: m.get(x, 5)))
ID_USER CODE
0 433805 1
24 5448 1
48 3434 1
72 34434 1
96 3202 0
120 23766 0
153 39457 1
168 4113 0
172 3435 5
374 34093 1
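A close variant of the map approach above that avoids the lambda entirely: map with the dict directly, which leaves unmatched codes as NaN, then fill those with the catch-all value. A runnable sketch on a three-row slice of the question's data:

```python
import pandas as pd

df = pd.DataFrame({'ID_USER': [433805, 5448, 3435],
                   'CODE': [11.0, 44.0, 13.0]})

m = {11: 1, 44: 1, 33: 0}
# map() leaves codes missing from the dict as NaN;
# fillna supplies the catch-all 5, and astype restores int.
df['CODE'] = df['CODE'].map(m).fillna(5).astype(int)
```

Note that the float codes (11.0) still hit the int keys (11) because they hash and compare equal in Python.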

How to plot aggregated DataFrame using two columns?

I have the following, using a DF that has two columns that I would like to aggregate by:
df2.groupby(['airline_clean','sentiment']).size()
airline_clean sentiment
americanair -1 14
0 36
1 1804
2 722
3 171
4 1
jetblue -1 2
0 7
1 1074
2 868
3 250
4 11
southwestair -1 4
0 20
1 1320
2 829
3 237
4 4
united -1 7
0 74
1 2467
2 1026
3 221
4 5
usairways -1 5
0 62
1 1962
2 716
3 155
4 2
virginamerica -1 2
0 2
1 250
2 180
3 69
dtype: int64
Plotting the aggregated view:
dfc=df2.groupby(['airline_clean','sentiment']).size()
dfc.plot(kind='bar', stacked=True,figsize=(18,6))
Results in:
I would like to change two things:
plot the data in a stacked chart by airline
using % instead of raw numbers (by airline as well)
I am not sure how to achieve that. Any direction is appreciated.
The best way to plot this dataset is to convert it to % values first and use unstack() for plotting:
airline_sentiment = df3.groupby(['airline_clean', 'sentiment']).agg({'tweet_count': 'sum'})
airline = df3.groupby(['airline_clean']).agg({'tweet_count': 'sum'})
p = airline_sentiment.div(airline, level='airline_clean') * 100
p.unstack().plot(kind='bar',stacked=True,figsize=(9, 6),title='Sentiment % distribution by airline')
This results in a nice chart:
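The same normalization can also be done directly from the size() output in the question, without needing a tweet_count column. A sketch, where df2 here is a tiny invented stand-in for the tweets frame:

```python
import pandas as pd

# Tiny stand-in for the tweets frame.
df2 = pd.DataFrame({
    'airline_clean': ['americanair'] * 3 + ['jetblue'] * 3,
    'sentiment': [1, 1, 2, 1, 2, 2],
})

counts = df2.groupby(['airline_clean', 'sentiment']).size()
# Divide each airline's counts by that airline's total
# so every airline's stacked bar sums to 100%.
totals = counts.groupby(level='airline_clean').sum()
pct = counts.div(totals, level='airline_clean') * 100
table = pct.unstack(fill_value=0)
# table.plot(kind='bar', stacked=True, figsize=(9, 6))  # plotting omitted here
```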

pandas dataframe multiply with a series [duplicate]

This question already has answers here:
How do I operate on a DataFrame with a Series for every column?
(3 answers)
Closed 4 years ago.
What is the best way to multiply all the columns of a Pandas DataFrame by a column vector stored in a Series? I used to do this in Matlab with repmat(), which doesn't exist in Pandas. I can use np.tile(), but it looks ugly to convert the data structure back and forth each time.
Thanks.
What's wrong with
result = dataframe.mul(series, axis=0)
?
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mul.html#pandas.DataFrame.mul
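A quick runnable illustration of mul with axis=0 (sample data invented here): the Series is broadcast down the rows, which is the repmat-free equivalent.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6.).reshape(3, 2), columns=['a', 'b'])
s = pd.Series([1., 10., 100.])

# Multiply each row of df by the matching element of s;
# axis=0 aligns s against the row index, not the columns.
result = df.mul(s, axis=0)
```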
This can be accomplished quite simply with the DataFrame method apply.
In[1]: import pandas as pd; import numpy as np
In[2]: df = pd.DataFrame(np.arange(40.).reshape((8, 5)), columns=list('abcde')); df
Out[2]:
a b c d e
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
5 25 26 27 28 29
6 30 31 32 33 34
7 35 36 37 38 39
In[3]: ser = pd.Series(np.arange(8) * 10); ser
Out[3]:
0 0
1 10
2 20
3 30
4 40
5 50
6 60
7 70
Now that we have our DataFrame and Series we need a function to pass to apply.
In[4]: func = lambda x: np.asarray(x) * np.asarray(ser)
We can pass this to df.apply and we are good to go
In[5]: df.apply(func)
Out[5]:
a b c d e
0 0 0 0 0 0
1 50 60 70 80 90
2 200 220 240 260 280
3 450 480 510 540 570
4 800 840 880 920 960
5 1250 1300 1350 1400 1450
6 1800 1860 1920 1980 2040
7 2450 2520 2590 2660 2730
df.apply acts column-wise by default, but it can also act row-wise by passing axis=1 as an argument to apply.
In[6]: ser2 = pd.Series(np.arange(5) * 5); ser2
Out[6]:
0 0
1 5
2 10
3 15
4 20
In[7]: func2 = lambda x: np.asarray(x) * np.asarray(ser2)
In[8]: df.apply(func2, axis=1)
Out[8]:
a b c d e
0 0 5 20 45 80
1 0 30 70 120 180
2 0 55 120 195 280
3 0 80 170 270 380
4 0 105 220 345 480
5 0 130 270 420 580
6 0 155 320 495 680
7 0 180 370 570 780
This could be done more concisely by defining the anonymous function inside apply
In[9]: df.apply(lambda x: np.asarray(x) * np.asarray(ser))
Out[9]:
a b c d e
0 0 0 0 0 0
1 50 60 70 80 90
2 200 220 240 260 280
3 450 480 510 540 570
4 800 840 880 920 960
5 1250 1300 1350 1400 1450
6 1800 1860 1920 1980 2040
7 2450 2520 2590 2660 2730
In[10]: df.apply(lambda x: np.asarray(x) * np.asarray(ser2), axis=1)
Out[10]:
a b c d e
0 0 5 20 45 80
1 0 30 70 120 180
2 0 55 120 195 280
3 0 80 170 270 380
4 0 105 220 345 480
5 0 130 270 420 580
6 0 155 320 495 680
7 0 180 370 570 780
Why not create your own dataframe tile function (DataFrame.append was removed in pandas 2.0, so pandas.concat does the repetition here):
def tile_df(df, n, m):
    # repeat the frame m times across the columns, then n times down the rows
    wide = pandas.concat([df] * m, axis=1, ignore_index=True)
    return pandas.concat([wide] * n, axis=0, ignore_index=True)
Example:
df = pandas.DataFrame([[1,2],[3,4]])
tile_df(df, 2, 3)
# 0 1 2 3 4 5
# 0 1 2 1 2 1 2
# 1 3 4 3 4 3 4
# 2 1 2 1 2 1 2
# 3 3 4 3 4 3 4
However, the docs note: "DataFrame is not intended to be a drop-in replacement for ndarray as its indexing semantics are quite different in places from a matrix." Which presumably should be interpreted as "use numpy if you are doing lots of matrix stuff".