What's the problem of this one-hot encoding? - pandas

In [4]: data = pd.read_csv('student_data.csv')
In [5]: data[:10]
Out[5]:
admit gre gpa rank
0 0 380 3.61 3
1 1 660 3.67 3
2 1 800 4.00 1
3 1 640 3.19 4
4 0 520 2.93 4
5 1 760 3.00 2
6 1 560 2.98 1
7 0 400 3.08 2
8 1 540 3.39 3
9 0 700 3.92 2
one_hot_data = pd.get_dummies(data['rank'])
# TODO: Drop the previous rank column
data = data.drop('rank', axis=1)
data = data.join(one_hot_data)
# Print the first 10 rows of our data
data[:10]
It always gives an error:
KeyError: 'rank'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-25-6a749c8f286e> in <module>()
1 # TODO: Make dummy variables for rank
----> 2 one_hot_data = pd.get_dummies(data['rank'])
3
4 # TODO: Drop the previous rank column
5 data = data.drop('rank', axis=1)

If you get:
KeyError: 'rank'
it means there is no column named rank. The most likely cause is trailing whitespace in the column name or an encoding problem. Check the exact column names:
print (data.columns.tolist())
['admit', 'gre', 'gpa', 'rank']
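If the printed list shows something like ' rank ' instead of 'rank', stripping the column names is one way to fix it. The following is a minimal sketch under that assumption; the csv_text string is made up for illustration and is not the asker's file.

```python
import io
import pandas as pd

# Hypothetical CSV whose last header has stray spaces around it
csv_text = "admit,gre,gpa, rank \n0,380,3.61,3\n1,800,4.00,1\n"
data = pd.read_csv(io.StringIO(csv_text))

# repr() makes leading/trailing whitespace visible
print([repr(c) for c in data.columns])

# Remove surrounding whitespace from every column name
data.columns = data.columns.str.strip()

# The lookup by 'rank' now succeeds
one_hot = pd.get_dummies(data['rank'])
print(one_hot)
```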
Your solution can be simplified with DataFrame.pop, which selects the column and removes it from the original DataFrame:
data = data.join(pd.get_dummies(data.pop('rank')))
# Print the first 10 rows of our data
print(data[:10])
admit gre gpa 1 2 3 4
0 0 380 3.61 0 0 1 0
1 1 660 3.67 0 0 1 0
2 1 800 4.00 1 0 0 0
3 1 640 3.19 0 0 0 1
4 0 520 2.93 0 0 0 1
5 1 760 3.00 0 1 0 0
6 1 560 2.98 1 0 0 0
7 0 400 3.08 0 1 0 0
8 1 540 3.39 0 0 1 0
9 0 700 3.92 0 1 0 0
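As an alternative sketch, pd.get_dummies can also take the whole DataFrame together with a columns argument, which encodes and drops the original column in one call. The small data frame below just reuses a few rows from the question; the prefix 'rank' and dtype=int are choices made here for readable column names and 0/1 output, not something from the original answer.

```python
import pandas as pd

# A few rows copied from the question's sample output
data = pd.DataFrame({
    'admit': [0, 1, 1, 1],
    'gre': [380, 660, 800, 640],
    'gpa': [3.61, 3.67, 4.00, 3.19],
    'rank': [3, 3, 1, 4],
})

# Encode 'rank' in place of the original column; dtype=int asks for 0/1 integers
encoded = pd.get_dummies(data, columns=['rank'], prefix='rank', dtype=int)
print(encoded)
```

Only the ranks present in the data get a column, so this sample produces rank_1, rank_3 and rank_4.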

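I tried your code and it works fine. You may need to rerun the previous cells, including the one that loads the data. Because the cell drops the rank column and reassigns data, running it a second time raises KeyError: 'rank'.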

Related

Pandas: I want to slice the data and shuffle the slices to generate some synthetic data

I want to generate some synthetic data for a data science task, since we don't have enough labelled data. I want to cut the rows at random positions where the y column is 0, and never cut through a sequence of 1s.
After cutting, I want to shuffle those slices and build a new DataFrame.
It would be good to have parameters that adjust the maximum and minimum length of a slice, the number of cuts, and so on.
The raw data
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
4 2 0
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
12 0.5 0
...
Some possible cuts
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
--------------
4 2 0
--------------
5 100 1
6 200 1
7 1234 1
-------------
8 12 0
9 40 0
10 200 1
11 300 1
-------------
12 0.5 0
...
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
4 2 0
-------------
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
------------
12 0.5 0
...
This is NOT CORRECT
ts v1 y
0 100 1
1 120 1
------------
2 80 1
3 5 0
4 2 0
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
12 0.5 0
...
You can use:
#number of cuts
N = 3
#create random N index values of index if y=0
idx = np.random.choice(df.index[df['y'].eq(0)], N, replace=False)
#create groups with check membership and cumulative sum
arr = df.index.isin(idx).cumsum()
#randomize unique integers - groups
u = np.unique(arr)
np.random.shuffle(u)
#change order of groups in DataFrame
df = df.set_index(arr).loc[u].reset_index(drop=True)
print (df)
ts v1 y
0 9 40.0 0
1 10 200.0 1
2 11 300.0 1
3 12 0.5 0
4 3 5.0 0
5 4 2.0 0
6 5 100.0 1
7 6 200.0 1
8 7 1234.0 1
9 8 12.0 0
10 0 100.0 1
11 1 120.0 1
12 2 80.0 1
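The steps above can be wrapped in a function so the number of cuts and the random seed become parameters. This is only a sketch: the function name shuffle_slices and the seed argument are introduced here for illustration, and it does not implement minimum or maximum slice lengths. As in the answer, each new slice starts at a row where y is 0, so a run of 1s is never split.

```python
import numpy as np
import pandas as pd

def shuffle_slices(df, n_cuts, seed=None):
    """Cut df before n_cuts random rows where y == 0, then shuffle the slices."""
    rng = np.random.default_rng(seed)
    # Candidate cut positions: index labels of rows with y == 0
    zero_idx = df.index[df['y'].eq(0)]
    cut_points = rng.choice(zero_idx, size=n_cuts, replace=False)
    # A new group id starts at every chosen cut row
    groups = df.index.isin(cut_points).cumsum()
    # Shuffle the group ids and reorder the rows group by group
    order = rng.permutation(np.unique(groups))
    return df.set_index(groups).loc[order].reset_index(drop=True)

# Raw data from the question
df = pd.DataFrame({
    'ts': range(13),
    'v1': [100, 120, 80, 5, 2, 100, 200, 1234, 12, 40, 200, 300, 0.5],
    'y': [1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0],
})
out = shuffle_slices(df, n_cuts=3, seed=0)
print(out)
```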

Convert multichannel image into pixelwise pandas dataframe

If you have a multiband image of, say, dimensions 1024 * 1024 * 200 (columns * lines * bands) and want to convert that to a pandas dataframe of the form:
Band Value
1 1 0.14
2 1 1.18
3 1 2.56
.
.
.
209715198 200 1.01
209715199 200 1.15
209715200 200 2.00
So basically all pixels in sequential form, with the band number (or wavelength) and the pixel value as columns.
Is there a clever and efficient way of doing this without a lot of loops, appending to arrays and so on?
Answer
You can do it with numpy. I'll try my best to walk you through it below. First you need the input images in a 3D numpy array. I'm just going to use a randomly generated small one for illustration. This is the full code, with an explanation below.
import numpy as np
import pandas as pd
images = np.random.randint(0,9,(2,5,5))
z, y, x = images.shape ## 2, 5, 5 (200, 1024, 1024 for your example)
arr = np.column_stack((np.repeat(np.arange(z),y*x), images.ravel()))
df = pd.DataFrame(arr, columns = ['Bands', 'Value'])
Explanation
The images output array looks like this (basically 2 images at 5x5 pixels):
[[[5 2 3 6 2]
[6 1 6 3 2]
[8 3 2 2 1]
[5 1 2 6 0]
[3 4 7 0 2]]
[[1 7 0 7 3]
[7 4 5 4 3]
[1 5 4 7 4]
[2 0 2 7 2]
[7 0 1 6 7]]]
The next step is to flatten it with np.ravel(), which outputs your required Value column:
#images.ravel()
[5 2 3 6 2 6 1 6 3 2 8 3 2 2 1 5 1 2 6 0 3 4 7 0 2 1 7 0 7 3 7 4 5 4 3 1 5
4 7 4 2 0 2 7 2 7 0 1 6 7]
To create the band column, you need to repeat each band index x*y times. You can do this with np.repeat() and np.arange(), which gives you a 1D array:
#(np.repeat(np.arange(z),y*x))
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1]
This is the required Band column. To combine them, use np.column_stack() and then turn the result into a dataframe. All of the above steps combined would be:
arr = np.column_stack((np.repeat(np.arange(z),y*x), images.ravel()))
df = pd.DataFrame(arr, columns = ['Bands', 'Value'])
Which will output:
Bands Value
0 0 5
1 0 2
2 0 3
3 0 6
4 0 2
5 0 6
6 0 1
7 0 6
8 0 3
9 0 2
10 0 8
11 0 3
12 0 2
13 0 2
14 0 1
15 0 5
16 0 1
17 0 2
18 0 6
19 0 0
20 0 3
21 0 4
22 0 7
23 0 0
24 0 2
25 1 1
26 1 7
27 1 0
...
As required. I hope this at least gets you moving in the right direction.
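One thing to be aware of: np.column_stack builds a single array with one dtype, so with float pixel values the band numbers would be stored as floats too. A possible variation, sketched below, builds the DataFrame from a dict so each column keeps its own dtype. It assumes the array is ordered (bands, lines, columns) as in the answer and numbers the bands from 1 as in the question; the random float image is made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical float image with shape (bands, lines, columns)
images = rng.random((2, 5, 5))
# If the bands were on the last axis instead, np.moveaxis(images, -1, 0)
# would bring them to the front first.
z, y, x = images.shape

df = pd.DataFrame({
    # Band numbers 1..z, each repeated once per pixel
    'Band': np.repeat(np.arange(1, z + 1), y * x),
    # Pixel values flattened band by band
    'Value': images.ravel(),
})
print(df.dtypes)
```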

Conditional sum after groupby based on value in another column

I have the dataframe as below.
Cycle Type Count Value
1 1 5 0.014
1 1 40 -0.219
1 1 5 0.001
1 1 100 -0.382
1 1 5 0.001
1 1 25 -0.064
2 1 5 0.003
2 1 110 -0.523
2 1 10 0.011
2 1 5 -0.009
2 1 5 0.012
2 1 156 -0.612
3 1 5 0.002
3 1 45 -0.167
3 1 5 0.003
3 1 10 -0.052
3 1 5 0.001
3 1 80 -0.194
I want to sum 'Count' separately for the rows with positive 'Value' and the rows with negative 'Value' AFTER groupby.
The answer would be something like
1 1 15 (sum of count when Value is positive),
1 1 165 (sum of count when Value is negative),
2 1 20,
2 1 271,
3 1 15,
3 1 135
I think something like this will work: grouped.set_index('Count').groupby(['Cycle','Type'])['Value']....... but I am unable to figure out how to specify positive and negative values to sum()
If I understood correctly, you can try the code below:
df = pd.DataFrame(data)
df_negative=df[df['Value'] < 0]
df_positive=df[df['Value'] > 0]
df_negative = df_negative.groupby(['Cycle','Type']).Count.sum().reset_index()
df_positive = df_positive.groupby(['Cycle','Type']).Count.sum().reset_index()
df_combine = pd.concat([df_positive,df_negative]).sort_values('Cycle')
df_combine
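The same result can also be reached with one groupby by adding the sign of Value as a grouping key. This is a sketch of that alternative, not the answerer's code; the helper column name Sign is made up here, and rows with Value equal to 0 are dropped to match the filters above.

```python
import numpy as np
import pandas as pd

# Data from the question
df = pd.DataFrame({
    'Cycle': [1] * 6 + [2] * 6 + [3] * 6,
    'Type': [1] * 18,
    'Count': [5, 40, 5, 100, 5, 25,
              5, 110, 10, 5, 5, 156,
              5, 45, 5, 10, 5, 80],
    'Value': [0.014, -0.219, 0.001, -0.382, 0.001, -0.064,
              0.003, -0.523, 0.011, -0.009, 0.012, -0.612,
              0.002, -0.167, 0.003, -0.052, 0.001, -0.194],
})

# Drop exact zeros, then label every remaining row by the sign of Value
nonzero = df[df['Value'] != 0]
sign = np.where(nonzero['Value'] > 0, 'positive', 'negative')

# Group by Cycle, Type and the sign label, then sum Count
out = (nonzero.assign(Sign=sign)
       .groupby(['Cycle', 'Type', 'Sign'])['Count']
       .sum()
       .reset_index())
print(out)
```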

Pandas: Calculate percentage of column for each class

I have a dataframe like this:
Class Boolean Sum
0 1 0 10
1 1 1 20
2 2 0 15
3 2 1 25
4 3 0 52
5 3 1 48
I want to calculate percentage of 0/1's for each class, so for example the output could be:
Class Boolean Sum %
0 1 0 10 0.333
1 1 1 20 0.666
2 2 0 15 0.375
3 2 1 25 0.625
4 3 0 52 0.520
5 3 1 48 0.480
Divide the Sum column by the result of GroupBy.transform, which returns a Series with the same length as the original DataFrame, filled with the aggregated values:
df['%'] = df['Sum'].div(df.groupby('Class')['Sum'].transform('sum'))
print (df)
Class Boolean Sum %
0 1 0 10 0.333333
1 1 1 20 0.666667
2 2 0 15 0.375000
3 2 1 25 0.625000
4 3 0 52 0.520000
5 3 1 48 0.480000
Detail:
print (df.groupby('Class')['Sum'].transform('sum'))
0 30
1 30
2 40
3 40
4 100
5 100
Name: Sum, dtype: int64

Set value from another dataframe

Having a data frame exex as
EXEX I J
1 702 2 3
2 3112 2 4
3 1360 2 5
4 702 3 2
5 221 3 5
6 591 3 11
7 3112 4 2
8 394 4 5
9 3416 4 11
10 1360 5 2
11 221 5 3
12 394 5 4
13 108 5 11
14 591 11 3
15 3416 11 4
16 108 11 5
is there a more efficient pandas approach to update an existing dataframe df, initialized to 0, with the values exex.EXEX, where exex.I gives the index label and exex.J gives the column label? Is there a way to update the data by specifying the names instead of the row positions? If the name fields change, the row positions would be different and could lead to an erroneous result.
I currently do it with:
df = pd.DataFrame(0, index=range(1, 908), columns=range(1, 908))
for index, row in exex.iterrows():
    # df.set_value was removed in pandas 1.0; df.at is the label-based replacement
    df.at[row['I'], row['J']] = row['EXEX']
Assign to df.values, using the labels minus 1 as zero-based positions:
df.values[exex.I.values - 1, exex.J.values - 1] = exex.EXEX.values
print(df.iloc[:5, :5])
1 2 3 4 5
1 0 0 0 0 0
2 0 0 702 3112 1360
3 0 702 0 0 221
4 0 3112 0 0 394
5 0 1360 221 394 0
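The assignment above relies on the labels 1..907 mapping to positions 0..906, and on df.values returning a writable view of the data, which is not guaranteed. To address the question about updating by name rather than position, a sketch using Index.get_indexer to translate labels into positions is shown below. It uses only a few rows of exex and a 5x5 frame for brevity, and it assumes every I and J value exists in df (get_indexer returns -1 for missing labels).

```python
import numpy as np
import pandas as pd

# A few rows from the question's exex frame
exex = pd.DataFrame({
    'EXEX': [702, 3112, 1360, 702],
    'I': [2, 2, 2, 3],
    'J': [3, 4, 5, 2],
})
labels = range(1, 6)
df = pd.DataFrame(0, index=labels, columns=labels)

# Translate row and column labels into integer positions
rows = df.index.get_indexer(exex['I'].to_numpy())
cols = df.columns.get_indexer(exex['J'].to_numpy())

# Work on an explicit copy, then rebuild the frame with the same labels
values = df.to_numpy(copy=True)
values[rows, cols] = exex['EXEX'].to_numpy()
df = pd.DataFrame(values, index=df.index, columns=df.columns)
print(df)
```

Because the lookup goes through the labels, the result stays correct if the index or columns are reordered or renamed consistently with exex.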