I have two matrix: numpy square matrix and a panda multiindexed square matrix. They are the same size. The idea is to get the value from numpy into the multiindex panda matrix to navigate more easily into the data.
My matrix are around 100 000 x 100 000.
And my panda matrix has three level of index.
tuples = [('1','A','a'), ('1','A','b'), ('1','A','c'), ('1','B','a'), ('1','B','b'), ('1','B','c'), ('2','A','a'), ('2','A','b'), ('2','B','a')]
index = pd.MultiIndex.from_tuples(tuples, names=['geography', 'product','activity'])
df = pd.DataFrame(index=index, columns=index)
geography 1 2
product A B A B
activity a b c a b c a b a
geography product activity
1 A a 0 0 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0 0 0
c 0 0 0 0 0 0 0 0 0
B a 0 0 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0 0 0
c 0 0 0 0 0 0 0 0 0
2 A a 0 0 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0 0 0
B a 0 0 0 0 0 0 0 0 0
np.random.rand(9,9)
array([[ 0.27302806, 0.33926193, 0.01489047, 0.71959889, 0.43500806,
0.03607795, 0.03747561, 0.43000199, 0.8091691 ],
[ 0.96626878, 0.37613022, 0.7739084 , 0.16724657, 0.01144436,
0.0107722 , 0.73513494, 0.13305542, 0.2910334 ],
[ 0.00622779, 0.93699165, 0.62725798, 0.25009469, 0.14010666,
0.61826728, 0.72060106, 0.58864557, 0.29375779],
[ 0.14937979, 0.45269751, 0.68450964, 0.15986812, 0.69879559,
0.06573519, 0.57504452, 0.49540882, 0.77283616],
[ 0.60933817, 0.2701683 , 0.69067959, 0.22806386, 0.79456502,
0.75107457, 0.2805325 , 0.27659171, 0.33446821],
[ 0.82860687, 0.27055835, 0.37684942, 0.18962783, 0.59885119,
0.31246936, 0.94522335, 0.53487273, 0.00611481],
[ 0.27683582, 0.23653112, 0.41250374, 0.5024068 , 0.27621212,
0.81379001, 0.6704781 , 0.87521485, 0.04577144],
[ 0.95516958, 0.21844023, 0.86558273, 0.52300142, 0.91328259,
0.7587479 , 0.15201837, 0.15376074, 0.12092142],
[ 0.36835891, 0.0381736 , 0.36473176, 0.30510363, 0.19433639,
0.43431018, 0.00112607, 0.35334684, 0.82307449]])
How I can put the value of the numpy matrix into in the panda multiindex matrix. The two matrix by construction have the same structure, i.e. the numpy matrix is the panda one without label indexes.
I found a dozen of examples to transform multiindex df into numpy array, but not in this way. Only one example of a 3 dimensional numpy array, but mine is not a 3-d np array.
Thanks to Divakar.
Something, just df[:] = np.random.rand(9,9) and it is all right.
Related
I have difficulty in creating a classification matrix for multi-label classification to evaluate the performance of the MLPClassifier model. The confusion matrix output should be 10x10 but instead I get 8x8 as it doesn't shows label values for 9 and 10. The class labels of true and predicted labels are from 1 to 10 (unordered). The implementation of the code looks like this:
import matplotlib.pyplot as plt
import seaborn as sns
side_bar = [1,2,3,4,5,6,7,8,9,10]
f, ax = plt.subplots(figsize=(12,12))
sns.heatmap(cm, annot=True, linewidth=.5, linecolor="r", fmt=".0f", ax = ax)
ax.set_xticklabels(side_bar)
ax.set_yticklabels(side_bar)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
confusion matrix heatmap
Edit: The code & output of the constructed confusion matrix are as follows:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
str(cm)
[[20 0 0 1 0 5 1 0]
[ 3 0 0 0 0 0 0 0]
[ 1 1 0 1 0 1 0 0]
[ 3 0 0 0 0 3 1 1]
[ 0 0 0 0 0 1 0 0]
[ 3 0 0 1 0 2 1 1]
[ 3 0 0 0 0 0 0 2]
[ 1 0 0 0 0 0 0 1]]
'[[20 0 0 1 0 5 1 0]\n [ 3 0 0 0 0 0 0 0]\n [ 1 1 0 1 0
1 0 0]\n [ 3 0 0 0 0 3 1 1]\n [ 0 0 0 0 0 1 0 0]\n [ 3 0
0 1 0 2 1 1]\n [ 3 0 0 0 0 0 0 2]\n [ 1 0 0 0 0 0 0
1]]'
Could anyone provide me a solution on how can I fix this issue?
Can use please help with below problem:
Given two dataframes df1 and df2, need to get something like result dataframe.
import pandas as pd
import numpy as np
feature_list = [ str(i) for i in range(6)]
df1 = pd.DataFrame( {'value' : [0,3,0,4,2,5]})
df2 = pd.DataFrame(0, index=np.arange(6), columns=feature_list)
Expected Dataframe :
Need to be driven by comparing values from df1 with column names (features) in df2. if they match, we put 1 in resultDf
Here's expected output (or resultsDf):
I think you need:
(pd.get_dummies(df1['value'])
.rename(columns = str)
.reindex(columns = df2.columns,
index = df2.index,
fill_value = 0))
0 1 2 3 4 5
0 1 0 0 0 0 0
1 0 0 0 1 0 0
2 1 0 0 0 0 0
3 0 0 0 0 1 0
4 0 0 1 0 0 0
5 0 0 0 0 0 1
I have 3 dataframes where column names and number of rows are exactly the same in all 3 data frames. I want to plot all the columns from all three dataframes as a grouped boxplot into one image using seaborn or matplotlib. But I am having difficulties in combining and formating the data so that I can plot them as grouped box plot.
df=
A B C D E F G H I J
0 0.031810 0.000556 0.007798 0.000741 0 0 0 0.000180 0.002105 0
1 0.028687 0.000571 0.009356 0.000000 0 0 0 0.000183 0.001250 0
2 0.029635 0.001111 0.009121 0.000000 0 0 0 0.000194 0.001111 0
3 0.030579 0.002424 0.007672 0.000000 0 0 0 0.000194 0.001176 0
4 0.028544 0.002667 0.007973 0.000000 0 0 0 0.000179 0.001333 0
5 0.027286 0.003226 0.006881 0.000000 0 0 0 0.000196 0.001111 0
6 0.031597 0.003030 0.006695 0.000000 0 0 0 0.000180 0.002353 0
7 0.034226 0.003030 0.010804 0.000667 0 0 0 0.000179 0.003333 0
8 0.035105 0.002941 0.010176 0.000645 0 0 0 0.000364 0.003529 0
9 0.035171 0.003125 0.012666 0.001250 0 0 0 0.000612 0.005556 0
df1 =
A B C D E F G H I J
0 0.034898 0.003750 0.014091 0.001290 0 0 0 0.001488 0.005333 0
1 0.042847 0.003243 0.011559 0.000625 0 0 0 0.002272 0.010769 0
2 0.046087 0.005455 0.013101 0.000588 0 0 0 0.002147 0.008750 0
3 0.042719 0.003684 0.010496 0.001333 0 0 0 0.002627 0.004444 0
4 0.042410 0.004211 0.011580 0.000645 0 0 0 0.003007 0.006250 0
5 0.044515 0.003500 0.013990 0.000000 0 0 0 0.003954 0.007000 0
6 0.046062 0.004865 0.013278 0.000714 0 0 0 0.004035 0.011111 0
7 0.043666 0.004444 0.013460 0.000625 0 0 0 0.003826 0.010000 0
8 0.039888 0.006857 0.014351 0.000690 0 0 0 0.004314 0.011474 0
9 0.048203 0.006667 0.016338 0.000741 0 0 0 0.005294 0.013603 0
df3 =
A B C D E F G H I J
0 0.048576 0.006471 0.020130 0.002667 0 0 0 0.005536 0.015179 0
1 0.056270 0.007179 0.021519 0.001429 0 0 0 0.005524 0.012333 0
2 0.054020 0.008235 0.024464 0.001538 0 0 0 0.005926 0.010445 0
3 0.047297 0.008649 0.026650 0.002198 0 0 0 0.005870 0.010000 0
4 0.049347 0.009412 0.022808 0.002838 0 0 0 0.006541 0.012222 0
5 0.052026 0.010000 0.019935 0.002714 0 0 0 0.005062 0.012222 0
6 0.055124 0.010625 0.022950 0.003499 0 0 0 0.005954 0.008964 0
7 0.044411 0.010909 0.019129 0.005709 0 0 0 0.005209 0.007222 0
8 0.047697 0.010270 0.017234 0.008800 0 0 0 0.004808 0.008355 0
9 0.048562 0.010857 0.020219 0.008504 0 0 0 0.005665 0.004862 0
I can do single boxplots by using the following:
g = sns.boxplot(data=df, color = 'white', fliersize=1, linewidth=2, meanline = True, showmeans=True)
But how to get all three in one figure seems a bit difficult. I see I need to re-arrange the whole data and use hue in order to get every thing from combined data frame, but how exactly should I format the data is a question. Any help?
You can do all in one sns.boxplot run by concatenate the dataframes and passing hue:
tmp = (pd.concat([d.assign(data=i) # assign adds the column `data` with values i
for i,d in enumerate([df,df1,df3])] # enumerate gives you a generator of pairs (0,df), (1,df1), (2,df2)
)
.melt(id_vars='data') # melt basically turns `id_vars` columns into index,
# and stacks other columns
)
sns.boxplot(data=tmp, x='variable', hue='data', y='value')
Output:
I am doing a News recommendation system and I need to build a table for users and news they read. my raw data just like this :
001436800277225 [12,456,157]
009092130698762 [248]
010003000431538 [361,521,83]
010156461231357 [173,67,244]
010216216021063 [203,97]
010720006581483 [86]
011199797794333 [142,12,86,411,201]
011337201765123 [123,41]
011414545455156 [62,45,621,435]
011425002581540 [341,214,286]
the first column is userID, the second column is the newsID.newsID is a index column, for example, after transformation, [12,456,157] in the first row means that this user has read the 12th, 456th and 157th news (in sparse vector, the 12th column, 456th column and 157th column are 1, while other columns have value 0). And I want to change these data into a sparse vector format that can be used as input vector in Kmeans or DBscan algorithm of sklearn.
How can I do that?
One option is to construct the sparse matrix explicitly. I often find it easier to build the matrix in COO matrix format and then cast to CSR format.
from scipy.sparse import coo_matrix
input_data = [
("001436800277225", [12,456,157]),
("009092130698762", [248]),
("010003000431538", [361,521,83]),
("010156461231357", [173,67,244])
]
NUMBER_MOVIES = 1000 # maximum index of the movies in the data
NUMBER_USERS = len(input_data) # number of users in the model
# you'll probably want to have a way to lookup the index for a given user id.
user_row_map = {}
user_row_index = 0
# structures for coo format
I,J,data = [],[],[]
for user, movies in input_data:
if user not in user_row_map:
user_row_map[user] = user_row_index
user_row_index+=1
for movie in movies:
I.append(user_row_map[user])
J.append(movie)
data.append(1) # number of times users watched the movie
# create the matrix in COO format; then cast it to CSR which is much easier to use
feature_matrix = coo_matrix((data, (I,J)), shape=(NUMBER_USERS, NUMBER_MOVIES)).tocsr()
Use MultiLabelBinarizer from sklearn.preprocessing
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(df.newsID), columns=mlb.classes_)
12 41 45 62 67 83 86 97 123 142 ... 244 248 286 341 361 411 435 456 521 621
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 1 0 0 0 1 0
3 0 0 0 0 1 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 1 0 0 0 0 0 1 0 0 1 ... 0 0 0 0 0 1 0 0 0 0
7 0 1 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
8 0 0 1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 1
9 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 1 0 0 0 0 0 0
Let's say I have a pandas DataFrame.
df = pd.DataFrame(index = [ix for ix in range(10)], columns=list('abcdef'), data=np.random.randn(10,6))
df:
a b c d e f
0 -1.238393 -0.755117 -0.228638 -0.077966 0.412947 0.887955
1 -0.342087 0.296171 0.177956 0.701668 -0.481744 -1.564719
2 0.610141 0.963873 -0.943182 -0.341902 0.326416 0.818899
3 -0.561572 0.063588 -0.195256 -1.637753 0.622627 0.845801
4 -2.506322 -1.631023 0.506860 0.368958 1.833260 0.623055
5 -1.313919 -1.758250 -1.082072 1.266158 0.427079 -1.018416
6 -0.781842 1.270133 -0.510879 -1.438487 -1.101213 -0.922821
7 -0.456999 0.234084 1.602635 0.611378 -1.147994 1.204318
8 0.497074 0.412695 -0.458227 0.431758 0.514382 -0.479150
9 -1.289392 -0.218624 0.122060 2.000832 -1.694544 0.773330
how to I get set 1 to rowwise max and 0 to other elements?
I came up with:
>>> for i in range(len(df)):
... df.loc[i][df.loc[i].idxmax(axis=1)] = 1
... df.loc[i][df.loc[i] != 1] = 0
generates
df:
a b c d e f
0 0 0 0 0 0 1
1 0 0 0 1 0 0
2 0 1 0 0 0 0
3 0 0 0 0 0 1
4 0 0 0 0 1 0
5 0 0 0 1 0 0
6 0 1 0 0 0 0
7 0 0 1 0 0 0
8 0 0 0 0 1 0
9 0 0 0 1 0 0
Does anyone has a better way of doing it? May be by getting rid of the for loop or applying lambda?
Use max and check for equality using eq and cast the boolean df to int using astype, this will convert True and False to 1 and 0:
In [21]:
df = pd.DataFrame(index = [ix for ix in range(10)], columns=list('abcdef'), data=np.random.randn(10,6))
df
Out[21]:
a b c d e f
0 0.797000 0.762125 -0.330518 1.117972 0.817524 0.041670
1 0.517940 0.357369 -1.493552 -0.947396 3.082828 0.578126
2 1.784856 0.672902 -1.359771 -0.090880 -0.093100 1.099017
3 -0.493976 -0.390801 -0.521017 1.221517 -1.303020 1.196718
4 0.687499 -2.371322 -2.474101 -0.397071 0.132205 0.034631
5 0.573694 -0.206627 -0.106312 -0.661391 -0.257711 -0.875501
6 -0.415331 1.185901 1.173457 0.317577 -0.408544 -1.055770
7 -1.564962 -0.408390 -1.372104 -1.117561 -1.262086 -1.664516
8 -0.987306 0.738833 -1.207124 0.738084 1.118205 -0.899086
9 0.282800 -1.226499 1.658416 -0.381222 1.067296 -1.249829
In [22]:
df = df.eq(df.max(axis=1), axis=0).astype(int)
df
Out[22]:
a b c d e f
0 0 0 0 1 0 0
1 0 0 0 0 1 0
2 1 0 0 0 0 0
3 0 0 0 1 0 0
4 1 0 0 0 0 0
5 1 0 0 0 0 0
6 0 1 0 0 0 0
7 0 1 0 0 0 0
8 0 0 0 0 1 0
9 0 0 1 0 0 0
Timings
In [24]:
# #Raihan Masud's method
%timeit df.apply( lambda x: np.where(x == x.max() , 1 , 0) , axis = 1)
# mine
%timeit df.eq(df.max(axis=1), axis=0).astype(int)
100 loops, best of 3: 7.94 ms per loop
1000 loops, best of 3: 640 µs per loop
In [25]:
# #Nader Hisham's method
%%timeit
def max_binary(df):
binary = np.where( df == df.max() , 1 , 0 )
return binary
df.apply( max_binary , axis = 1)
100 loops, best of 3: 9.63 ms per loop
You can see that my method is over 12X faster than #Raihan's method
In [4]:
%%timeit
for i in range(len(df)):
df.loc[i][df.loc[i].idxmax(axis=1)] = 1
df.loc[i][df.loc[i] != 1] = 0
10 loops, best of 3: 21.1 ms per loop
The for loop is also significantly slower
import numpy as np
def max_binary(df):
binary = np.where( df == df.max() , 1 , 0 )
return binary
df.apply( max_binary , axis = 1)
Following Nader's pattern, this is a shorter version:
df.apply( lambda x: np.where(x == x.max() , 1 , 0) , axis = 1)