add missing values in pandas dataframe - datacleaning - pandas

I have measurements stored in a data frame that looks like the one below.
These are PM measurements. The sensors measure four indicators (pm1, pm2.5, pm5, pm10), stored in the column indicator, under conditions x1..x56, and report the results in the columns area and count. The problem is that under some conditions (columns x1..x56) the sensors didn't catch all four PMs. For every combination of the condition columns (x1..x56) I want all 4 PM values in the indicator column. If the sensor didn't catch one (there is no row for that PM under some combination of Xs), I should add the row, with area and count set to 0.
x1 x2 x3 x4 x5 x6 .. x56 indicator area count
0 0 0 0 0 0 .. 0 pm1 10 56
0 0 0 0 0 0 .. 0 pm10 9 1
0 0 0 0 0 0 .. 0 pm5 1 454
.............................................
1 0 0 0 0 0 .. 0 pm1 3 4
ssl ax w 45b g g .. gb pm1 3 4
1 wdf sw d78 b fd .. b pm1 3 4
In this example for the first combination of all zeros, pm2.5 is missing so I should add it and put its area and count to be 0. Similar for the second combination (the one that starts with 1). So my dummy example should look like this after I finish:
x1 x2 x3 x4 x5 x6 .. x56 indicator area count
0 0 0 0 0 0 .. 0 pm1 10 56
0 0 0 0 0 0 .. 0 pm10 9 1
0 0 0 0 0 0 .. 0 pm5 1 454
0 0 0 0 0 0 .. 0 pm2.5 0 0
.............................................
1 0 0 0 0 0 .. 0 pm1 3 4
1 0 0 0 0 0 .. 0 pm10 0 0
1 0 0 0 0 0 .. 0 pm5 0 0
1 0 0 0 0 0 .. 0 pm2.5 0 0
ssl ax w 45b g g .. gb pm1 3 4
ssl ax w 45b g g .. gb pm10 0 0
ssl ax w 45b g g .. gb pm5 0 0
ssl ax w 45b g g .. gb pm2.5 0 0
1 wdf sw d78 b fd .. b pm1 3 4
1 wdf sw d78 b fd .. b pm10 0 0
1 wdf sw d78 b fd .. b pm5 0 0
1 wdf sw d78 b fd .. b pm2.5 0 0
How can I do that? Thanks in advance!

The key here is to create a MultiIndex from all combinations of x and indicator then fill missing records.
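As a minimal sketch of that idea in isolation, the snippet below reindexes a small Series against the full product of two levels. The data and the level name `group` are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical toy data: one value per (group, indicator) pair, with gaps.
s = pd.Series(
    [10, 9, 3],
    index=pd.MultiIndex.from_tuples(
        [("a", "pm1"), ("a", "pm10"), ("b", "pm1")],
        names=["group", "indicator"],
    ),
)

# Every group paired with every indicator.
full_index = pd.MultiIndex.from_product(
    [["a", "b"], ["pm1", "pm10"]], names=["group", "indicator"]
)

# Pairs missing from `s` are created and filled with 0.
filled = s.reindex(full_index, fill_value=0)
print(filled)
```

The steps below apply the same pattern, with the tuple of x values playing the role of `group`.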
Step 1.
Create a vector of x columns:
df['x'] = df.filter(regex=r'^x\d+').apply(tuple, axis=1)
print(df)
# Output:
x1 x2 x3 x4 x5 x6 x56 indicator area count x
0 0 0 0 0 0 0 0 pm1 10 56 (0, 0, 0, 0, 0, 0, 0)
1 0 0 0 0 0 0 0 pm10 9 1 (0, 0, 0, 0, 0, 0, 0)
2 0 0 0 0 0 0 0 pm5 1 454 (0, 0, 0, 0, 0, 0, 0)
3 1 0 0 0 0 0 0 pm1 3 4 (1, 0, 0, 0, 0, 0, 0)
Step 2.
Create the MultiIndex from the x vector and the list of indicators, then reindex your dataframe.
mi = pd.MultiIndex.from_product([df['x'].unique(),
                                 ['pm1', 'pm2.5', 'pm5', 'pm10']],
                                names=['x', 'indicator'])
out = df.set_index(['x', 'indicator']).reindex(mi, fill_value=0)
print(out)
# Output:
x1 x2 x3 x4 x5 x6 x56 area count
x indicator
(0, 0, 0, 0, 0, 0, 0) pm1 0 0 0 0 0 0 0 10 56
pm2.5 0 0 0 0 0 0 0 0 0
pm5 0 0 0 0 0 0 0 1 454
pm10 0 0 0 0 0 0 0 9 1
(1, 0, 0, 0, 0, 0, 0) pm1 1 0 0 0 0 0 0 3 4
pm2.5 *0* 0 0 0 0 0 0 0 0
pm5 *0* 0 0 0 0 0 0 0 0
pm10 *0* 0 0 0 0 0 0 0 0
# Need to be fixed ----^
Step 3.
Group by the x index level and rebuild the x columns from each group's key (the tuple of condition values), so that the added rows carry the right conditions instead of the fill value 0.
out = out.filter(regex=r'^x\d+').groupby(level='x') \
         .apply(lambda x: pd.Series(dict(zip(x.columns, x.name)))) \
         .join(out[['area', 'count']]).reset_index()[df.columns[:-1]]
print(out)
# Output:
x1 x2 x3 x4 x5 x6 x56 indicator area count
0 0 0 0 0 0 0 0 pm1 10 56
1 0 0 0 0 0 0 0 pm2.5 0 0
2 0 0 0 0 0 0 0 pm5 1 454
3 0 0 0 0 0 0 0 pm10 9 1
4 1 0 0 0 0 0 0 pm1 3 4
5 1 0 0 0 0 0 0 pm2.5 0 0
6 1 0 0 0 0 0 0 pm5 0 0
7 1 0 0 0 0 0 0 pm10 0 0
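For comparison, here is a sketch of a different technique that reaches the same result: a cross join of the observed condition combinations with the indicator list, followed by a left merge. It is not the method used in the steps above. It assumes pandas 1.2 or later (for `how="cross"`) and uses a cut-down frame with only two x columns.

```python
import pandas as pd

# Cut-down stand-in for the question's data (only x1 and x2 are kept).
df = pd.DataFrame({
    "x1": [0, 0, 0, 1],
    "x2": [0, 0, 0, 0],
    "indicator": ["pm1", "pm10", "pm5", "pm1"],
    "area": [10, 9, 1, 3],
    "count": [56, 1, 454, 4],
})

x_cols = [c for c in df.columns if c.startswith("x")]
indicators = pd.DataFrame({"indicator": ["pm1", "pm2.5", "pm5", "pm10"]})

# Every observed combination of conditions, paired with every indicator.
full = df[x_cols].drop_duplicates().merge(indicators, how="cross")

# Bring in the measured values; combinations without a measurement get 0.
out = (
    full.merge(df, on=x_cols + ["indicator"], how="left")
        .fillna({"area": 0, "count": 0})
        .astype({"area": int, "count": int})
)
print(out)
```

Because the x columns are never packed into tuples, this variant needs no repair step for the condition columns afterwards.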

Related

How to fix "tf.math.confusion_matrix()" error

I'm trying to find the confusion matrix of a multiclass classification problem. I'm using tf.math.confusion_matrix() to do that. The code snippet is as follows,
y_pred = model.predict(x_test)
y_pred = tf.argmax(y_pred, axis=1)
Y_test = tf.argmax(y_test, axis=1)
matrix = tf.math.confusion_matrix(Y_test, y_pred)
The output of Y_test is,
tf.Tensor(
[[0 2 0 ... 0 0 0]
[0 2 0 ... 0 0 0]
[0 2 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 3 ... 0 0 0]
[0 0 2 ... 0 0 0]], shape=(2124, 279), dtype=int64)
The output of y_pred is,
tf.Tensor(
[[1 2 2 ... 0 0 0]
[0 2 3 ... 0 0 0]
[3 2 0 ... 3 1 3]
...
[3 1 0 ... 2 3 2]
[1 0 3 ... 1 1 2]
[1 0 2 ... 1 1 2]], shape=(2124, 279), dtype=int64)
Y_test[1] looks like the following,
tf.Tensor(
[0 2 0 1 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0], shape=(279,), dtype=int64)
y_pred[1] looks like the following,
tf.Tensor(
[0 2 3 1 3 3 2 2 2 3 2 3 2 3 3 2 1 0 0 0 0 3 1 0 2 3 1 2 0 1 0 0 1 0 0 0 0
2 0 2 1 0 0 0 0 1 0 0 0 3 2 0 0 3 2 0 0 3 3 0 3 0 0 0 0 1 0 2 1 0 2 3 0 3
3 0 2 3 1 3 2 0 3 0 0 0 0 0 0 0 0 0 0 0 0 1 3 0 0 0 3 3 0 0 0 0 0 3 0 0 1
0 3 0 3 3 0 1 0 3 0 0 0 0 0 0 3 0 1 0 0 0 0 0 0 0 0 0 3 0 0 3 0 0 0 0 0 0
3 0 3 3 0 0 0 3 0 0 0 0 0 0 0 0 0 0 3 0 3 0 0 0 0 3 0 3 0 0 0 0 0 0 0 2 0
0 1 0 0 0 0 2 0 3 3 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 3 0 3 3 0 2 3 0 3 3
3 3 3 0 3 0 3 0 0 3 0 0 0 0 3 3 3 2 0 0 0 0 0 0 0 0 2 3 0 0 3 0 0 0 3 0 2
0 0 3 0 0 0 0 0 3 1 2 0 3 2 3 0 3 0 0 0], shape=(279,), dtype=int64)
And the error I'm getting is,
InvalidArgumentError: Dimensions [0,2) of indices[shape=[2124,2,279]] must match dimensions [0,2) of updates[shape=[2124,279]] [Op:ScatterNd]
How can this be solved?
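The error message suggests that the function expects 1-D label vectors, while both tensors here are 2-D with shape (2124, 279). The NumPy sketch below shows one possible reading: if each of the 279 positions is an independent label with classes 0 to 3, both arrays can be flattened before counting. The small random arrays and the class count of 4 are assumptions for illustration. In TensorFlow the analogous step would be to pass `tf.reshape(Y_test, [-1])` and `tf.reshape(y_pred, [-1])` to `tf.math.confusion_matrix`.

```python
import numpy as np

# Hypothetical stand-ins for Y_test and y_pred: 2-D arrays of class ids 0..3.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 4, size=(5, 7))
y_pred = rng.integers(0, 4, size=(5, 7))

# Flatten to 1-D so that every position counts as one labelled sample.
t = y_true.reshape(-1)
p = y_pred.reshape(-1)

# Count (true, predicted) pairs; rows are true classes, columns predictions.
num_classes = 4
cm = np.zeros((num_classes, num_classes), dtype=np.int64)
np.add.at(cm, (t, p), 1)
print(cm)
```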

Data formatting for grouped boxplot using seaborn or matplotlib

I have 3 dataframes whose column names and number of rows are exactly the same. I want to plot all the columns from all three dataframes as a grouped boxplot in one image using seaborn or matplotlib, but I am having difficulty combining and formatting the data so that I can plot them as a grouped boxplot.
df=
A B C D E F G H I J
0 0.031810 0.000556 0.007798 0.000741 0 0 0 0.000180 0.002105 0
1 0.028687 0.000571 0.009356 0.000000 0 0 0 0.000183 0.001250 0
2 0.029635 0.001111 0.009121 0.000000 0 0 0 0.000194 0.001111 0
3 0.030579 0.002424 0.007672 0.000000 0 0 0 0.000194 0.001176 0
4 0.028544 0.002667 0.007973 0.000000 0 0 0 0.000179 0.001333 0
5 0.027286 0.003226 0.006881 0.000000 0 0 0 0.000196 0.001111 0
6 0.031597 0.003030 0.006695 0.000000 0 0 0 0.000180 0.002353 0
7 0.034226 0.003030 0.010804 0.000667 0 0 0 0.000179 0.003333 0
8 0.035105 0.002941 0.010176 0.000645 0 0 0 0.000364 0.003529 0
9 0.035171 0.003125 0.012666 0.001250 0 0 0 0.000612 0.005556 0
df1 =
A B C D E F G H I J
0 0.034898 0.003750 0.014091 0.001290 0 0 0 0.001488 0.005333 0
1 0.042847 0.003243 0.011559 0.000625 0 0 0 0.002272 0.010769 0
2 0.046087 0.005455 0.013101 0.000588 0 0 0 0.002147 0.008750 0
3 0.042719 0.003684 0.010496 0.001333 0 0 0 0.002627 0.004444 0
4 0.042410 0.004211 0.011580 0.000645 0 0 0 0.003007 0.006250 0
5 0.044515 0.003500 0.013990 0.000000 0 0 0 0.003954 0.007000 0
6 0.046062 0.004865 0.013278 0.000714 0 0 0 0.004035 0.011111 0
7 0.043666 0.004444 0.013460 0.000625 0 0 0 0.003826 0.010000 0
8 0.039888 0.006857 0.014351 0.000690 0 0 0 0.004314 0.011474 0
9 0.048203 0.006667 0.016338 0.000741 0 0 0 0.005294 0.013603 0
df3 =
A B C D E F G H I J
0 0.048576 0.006471 0.020130 0.002667 0 0 0 0.005536 0.015179 0
1 0.056270 0.007179 0.021519 0.001429 0 0 0 0.005524 0.012333 0
2 0.054020 0.008235 0.024464 0.001538 0 0 0 0.005926 0.010445 0
3 0.047297 0.008649 0.026650 0.002198 0 0 0 0.005870 0.010000 0
4 0.049347 0.009412 0.022808 0.002838 0 0 0 0.006541 0.012222 0
5 0.052026 0.010000 0.019935 0.002714 0 0 0 0.005062 0.012222 0
6 0.055124 0.010625 0.022950 0.003499 0 0 0 0.005954 0.008964 0
7 0.044411 0.010909 0.019129 0.005709 0 0 0 0.005209 0.007222 0
8 0.047697 0.010270 0.017234 0.008800 0 0 0 0.004808 0.008355 0
9 0.048562 0.010857 0.020219 0.008504 0 0 0 0.005665 0.004862 0
I can do single boxplots by using the following:
g = sns.boxplot(data=df, color = 'white', fliersize=1, linewidth=2, meanline = True, showmeans=True)
Getting all three into one figure is harder. I see that I need to rearrange the data and use hue on a combined dataframe, but how exactly should I format the data? Any help?
You can do it all in one sns.boxplot call by concatenating the dataframes and passing hue:
import pandas as pd
import seaborn as sns

tmp = (pd.concat([d.assign(data=i)  # assign adds the column `data` with value i
                  for i, d in enumerate([df, df1, df3])]  # enumerate yields the pairs (0, df), (1, df1), (2, df3)
                 )
         .melt(id_vars='data')  # melt keeps `data` as an identifier column and stacks
                                # the other columns into `variable`/`value` pairs
       )
sns.boxplot(data=tmp, x='variable', hue='data', y='value')
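To see the long format that this produces without plotting anything, here is a small sketch with two made-up frames standing in for the question's dataframes. The names `df_a` and `df_b` and their values are hypothetical.

```python
import pandas as pd

# Two tiny stand-in frames (hypothetical values) with identical columns.
df_a = pd.DataFrame({"A": [0.1, 0.2], "B": [0.3, 0.4]})
df_b = pd.DataFrame({"A": [0.5, 0.6], "B": [0.7, 0.8]})

# Tag each frame with its position, stack them, then melt to long format.
tmp = pd.concat(
    [d.assign(data=i) for i, d in enumerate([df_a, df_b])]
).melt(id_vars="data")

print(tmp.columns.tolist())  # ['data', 'variable', 'value']
print(tmp.shape)             # (8, 3)
```

Each row of `tmp` is one observation: `variable` holds the original column name (used for x), `data` identifies the source frame (used for hue), and `value` holds the number (used for y).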

Get_dummies produces more columns than its supposed to

I'm using get_dummies on a column of data that contains zeroes, 'D' or 'E'. Instead of producing 2 columns it produces 5: C, D, E, N, O. I'm not sure what they are or how to make it produce just 2, as it's supposed to.
When I pull just that column it shows 0s, D and E, but when I pass it to get_dummies it adds extra columns.
data[[2]]
0
0
D
0
0
0
0
D
0
0
When I do this:
dummy = pd.get_dummies(data[2], dummy_na = False)
dummy.head()
I get
0 C D E N O PreferredContactTime
0 0 0 0 0 0 1
1 0 0 0 0 0 0
1 0 0 0 0 0 0
0 0 1 0 0 0 0
1 0 0 0 0 0 0
What are C , N and O? I don't understand what it is displaying at all.
Setup
dtype = pd.CategoricalDtype([0, 'C', 'D', 'E', 'N', 'O', 'PreferredContactTime'])
data = pd.DataFrame({2: [
'PreferredContactTime', 0, 0, 'D', 0, 0, 0, 0, 'D', 0, 0
]}).astype(dtype)
Your result
dummy = pd.get_dummies(data[2], dummy_na=False )
dummy.head()
0 C D E N O PreferredContactTime
0 0 0 0 0 0 0 1
1 1 0 0 0 0 0 0
2 1 0 0 0 0 0 0
3 0 0 1 0 0 0 0
4 1 0 0 0 0 0 0
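One possible explanation, consistent with the setup above, is that the column is categorical, and get_dummies emits one column per declared category whether or not it occurs in the data. The sketch below assumes that this is the cause and that the first value is a stray header row. Both are guesses about the asker's data, not confirmed facts.

```python
import pandas as pd

cats = [0, "C", "D", "E", "N", "O", "PreferredContactTime"]
values = ["PreferredContactTime", 0, 0, "D", 0, 0, 0, 0, "D", 0, 0]
col = pd.Series(pd.Categorical(values, categories=cats))

# Assumption: the first value is a stray header, so skip it, then drop
# the categories that no longer occur before encoding.
clean = col.iloc[1:].cat.remove_unused_categories()
dummy = pd.get_dummies(clean)
print(dummy.columns.tolist())
```

If the column should not be categorical at all, converting it with `astype(str)` before calling get_dummies is another option.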

How can I change my index vector into sparse feature vector that can be used in sklearn?

I am building a news recommendation system and I need to build a table of users and the news they have read. My raw data looks like this:
001436800277225 [12,456,157]
009092130698762 [248]
010003000431538 [361,521,83]
010156461231357 [173,67,244]
010216216021063 [203,97]
010720006581483 [86]
011199797794333 [142,12,86,411,201]
011337201765123 [123,41]
011414545455156 [62,45,621,435]
011425002581540 [341,214,286]
The first column is the userID and the second column is the newsID. newsID is an index column: for example, after transformation, [12,456,157] in the first row means that this user has read the 12th, 456th and 157th news items (in the sparse vector, the 12th, 456th and 157th columns are 1, while all other columns are 0). I want to convert this data into a sparse vector format that can be used as input to the KMeans or DBSCAN algorithms in sklearn.
How can I do that?
One option is to construct the sparse matrix explicitly. I often find it easier to build the matrix in COO matrix format and then cast to CSR format.
from scipy.sparse import coo_matrix
input_data = [
("001436800277225", [12,456,157]),
("009092130698762", [248]),
("010003000431538", [361,521,83]),
("010156461231357", [173,67,244])
]
NUMBER_MOVIES = 1000 # maximum index of the movies in the data
NUMBER_USERS = len(input_data) # number of users in the model
# you'll probably want to have a way to lookup the index for a given user id.
user_row_map = {}
user_row_index = 0
# structures for coo format
I,J,data = [],[],[]
for user, movies in input_data:
    if user not in user_row_map:
        user_row_map[user] = user_row_index
        user_row_index += 1
    for movie in movies:
        I.append(user_row_map[user])
        J.append(movie)
        data.append(1)  # number of times users watched the movie
# create the matrix in COO format; then cast it to CSR which is much easier to use
feature_matrix = coo_matrix((data, (I,J)), shape=(NUMBER_USERS, NUMBER_MOVIES)).tocsr()
Use MultiLabelBinarizer from sklearn.preprocessing
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(df.newsID), columns=mlb.classes_)
12 41 45 62 67 83 86 97 123 142 ... 244 248 286 341 361 411 435 456 521 621
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 1 0 0 0 1 0
3 0 0 0 0 1 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 1 0 0 0 0 0 1 0 0 1 ... 0 0 0 0 0 1 0 0 0 0
7 0 1 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
8 0 0 1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 1
9 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 1 0 0 0 0 0 0
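Since the question asks for a sparse result, a small variation is to pass `sparse_output=True`, which makes MultiLabelBinarizer return a SciPy sparse matrix instead of a dense array. The sketch below uses only the first three rows of the question's data. Note that the columns correspond to the news IDs actually seen (`mlb.classes_`), not to the raw ID values.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# First three users from the question; each list holds the news IDs read.
news = [[12, 456, 157], [248], [361, 521, 83]]

mlb = MultiLabelBinarizer(sparse_output=True)
X = mlb.fit_transform(news)  # sparse matrix, one row per user

print(X.shape)        # (3, 7)
print(mlb.classes_)   # sorted news IDs, one per column
```

`X` can then be passed to an estimator's `fit` method in place of a dense array.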

tensorflow 0.8 one hot encoding

The data that I want to encode looks as follows:
print (train['labels'])
[ 0 0 0 ..., 42 42 42]
There are 43 classes, numbered 0-42.
I read that TensorFlow 0.8 has a new feature for one-hot encoding, so I tried to use it as follows:
trainhot=tf.one_hot(train['labels'], 43, on_value=1, off_value=0)
The only problem is that I think the output is not what I need:
print (trainhot[1])
Tensor("strided_slice:0", shape=(43,), dtype=int32)
Can someone nudge me in the right direction please :)
The output is correct and expected. trainhot[1] is the label of the second (0-based index) training sample, which is of 1D shape (43,). You can play with the code below to better understand tf.one_hot:
onehot = tf.one_hot([0, 0, 41, 42], 43, on_value=1, off_value=0)
with tf.Session() as sess:
    onehot_v = sess.run(onehot)
print("v: ", onehot_v)
print("v shape: ", onehot_v.shape)
print("v[1] shape: ", onehot[1])
output:
v: [[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1]]
v shape: (4, 43)
v[1] shape: Tensor("strided_slice:0", shape=(43,), dtype=int32)
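The same shapes can be checked without a TensorFlow session. The sketch below uses NumPy instead of tf.one_hot, so it is a stand-in for illustration rather than the TensorFlow API: indexing an identity matrix by the labels yields one one-hot row per label.

```python
import numpy as np

labels = np.array([0, 0, 41, 42])
num_classes = 43

# Row i of the identity matrix is the one-hot vector for class i.
onehot = np.eye(num_classes, dtype=np.int32)[labels]

print(onehot.shape)     # (4, 43)
print(onehot[1].shape)  # (43,)
```

Each row has length 43, which matches the shape=(43,) reported for trainhot[1] in the question.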