Machine Learning: getting a Dataframe after a OneHotEncoder

Machine Learning: getting a Dataframe after a OneHotEncoder - pandas

I have been stacked on how do I convert back the result of a OneHotEnocder to a DataFrame.The Idea that I have separated numeric columns from categorical columns as follows:
feats = df.drop(["Transported"], axis=1)
target = df["Transported"]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(feats, target, test_size = 0.2,
random_state=42)
here after doing the split, I needed to separet the num from cat for training set and i did this:
num_train = X_train.select_dtypes(include=['float64', 'int64'])
cat_train = X_train.select_dtypes(include=['object'])
num_test = X_test.select_dtypes(include=['float64', 'int64'])
cat_test = X_test.select_dtypes(include=['object'])
After this I did the the Simple imputer and it worked.
imputer_median = SimpleImputer(missing_values=np.nan, strategy='median')
imputer_most_frequent = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
num = ["Age", "RoomService", "FoodCourt", "ShoppingMall","Spa","VRDeck"]
num_train.loc[:,num] = imputer_median.fit_transform(num_train[num])
num_test.loc[:,num] = imputer_median.transform(num_test[num])
cat = ["HomePlanet", "CryoSleep", "Destination","VIP"]
cat_train.loc[:,cat] = imputer_most_frequent.fit_transform(cat_train[cat])
cat_test.loc[:,cat] = imputer_most_frequent.transform(cat_test[cat])
and this the head of the cat_train:
cat_train.head()
HomePlanet CryoSleep Destination VIP
2333 Earth False TRAPPIST-1e False
2589 Earth False TRAPPIST-1e False
8302 Europa True 55 Cancri e False
8177 Mars False TRAPPIST-1e False
500 Europa True 55 Cancri e False

But, after this I needed to apply the OneHotEncoder just like this:
from sklearn.preprocessing import OneHotEncoder
oneh = OneHotEncoder( drop='first',sparse=False)
cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])
cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])
And I got this error:
shape mismatch: value array of shape (6954,6) could not be broadcast to indexing result
of shape (6954,4)
I tried several ways, but everytime I could not succeed to have a DataFrame back after the OneHotEncoder. Please help me out, I am stacked on this and I cannot continue the rest of the work. Thanks in advance
here is the full traceback error:
ValueError Traceback (most recent
call last)
~\AppData\Local\Temp\ipykernel_16200\2252764984.py in <module>
3 oneh = OneHotEncoder( drop='first',sparse=False)
4
----> 5 cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])
6 cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])
~\anaconda3\lib\site-packages\pandas\core\indexing.py in
__setitem__(self, key, value)
714
715 iloc = self if self.name == "iloc" else self.obj.iloc
--> 716 iloc._setitem_with_indexer(indexer, value,
self.name)
717
718 def _validate_key(self, key, axis: int):
~\anaconda3\lib\site-packages\pandas\core\indexing.py in
_setitem_with_indexer(self, indexer, value, name)
1691 self._setitem_with_indexer_split_path(indexer,
value, name)
1692 else:
-> 1693 self._setitem_single_block(indexer, value,
name)
1694
1695 def _setitem_with_indexer_split_path(self, indexer, value,
name: str):
~\anaconda3\lib\site-packages\pandas\core\indexing.py in
_setitem_single_block(self, indexer, value, name)
1941
1942 # actually do the set
-> 1943 self.obj._mgr =
self.obj._mgr.setitem(indexer=indexer, value=value)
1944 self.obj._maybe_update_cacher(clear=True,
inplace=True)
1945
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in
setitem(self, indexer, value)
335 For SingleBlockManager, this backs s[indexer] = value
336 """
--> 337 return self.apply("setitem", indexer=indexer,
value=value)
338
339 def putmask(self, mask, new, align: bool = True):
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in
apply(self, f, align_keys, ignore_failures, **kwargs)
302 applied = b.apply(f, **kwargs)
303 else:
--> 304 applied = getattr(b, f)(**kwargs)
305 except (TypeError, NotImplementedError):
306 if not ignore_failures:
~\anaconda3\lib\site-packages\pandas\core\internals\blocks.py in
setitem(self, indexer, value)
957 else:
958 value = setitem_datetimelike_compat(values,
len(values[indexer]), value)
--> 959 values[indexer] = value
960
961 return self
ValueError: shape mismatch: value array of shape (6954,6) could not
be broadcast to indexing result of shape (6954,4)
I tried this time the next move:
from sklearn.preprocessing import OneHotEncoder
oneh = OneHotEncoder(handle_unknown='ignore')
cat_train.loc[:,cat] = oneh.fit_transform(cat_train[cat])
cat_test.loc[:,cat] = oneh.transform(cat_test)
and I got this dataframe, but this is not what I am looking for:
HomePlanet CryoSleep Destination VIP
2333 (0, 0)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ... (0,
0)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ... (0, 0)\t1.0\n (0,
3)\t1.0\n (0, 7)\t1.0\n ... (0, 0)\t1.0\n (0, 3)\t1.0\n (0,
7)\t1.0\n ...
2589 (0, 0)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ... (0,
0)\t1.0\n (0, 3)\t1.0\n (0, 7)\t1.0\n ... (0, 0)\t1.0\n (0,
3)\t1.0\n (0, 7)\t1.0\n ... (0, 0)\t1.0\n (0, 3)\t1.0\n (0,
7)\t1.0\n ...
I also used Columntransformer; but It's not getting me back the dataframe I want to(i mean the dataframe with the original columns used before the onehotencoder (look above the cat_train)) this is the steps I did:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(
transformers=[("OneHotEncoder", OneHotEncoder(drop='first',
sparse=False), cat)],
remainder='passthrough'
)
cat_train = ct.fit_transform(cat_train)
cat_test = ct.transform(cat_test)
cat_train = pd.DataFrame(cat_train, columns=ct.get_feature_names())
cat_test = pd.DataFrame(cat_test, columns=ct.get_feature_names())
cat_train
and the cat_train.head() I got is :
OneHotEncoder__x0_Europa OneHotEncoder__x0_Mars OneHotEncoder__x1_True OneHotEncoder__x2_PSO J318.5-22 OneHotEncoder__x2_TRAPPIST-1e OneHotEncoder__x3_True
0 0.0 0.0 0.0 0.0 1.0 0.0
1 0.0 0.0 0.0 0.0 1.0 0.0
2 1.0 0.0 1.0 0.0 0.0
this is weird because next I need to concatenat the cat_train with num_train and also for the test, and I done this , alot of NAN values will appears, wherease I already imputed all the nan values before. Any Idea?

The first error is because you try to assign the one-hot-encoded data, which has more columns than the original, back to the same original columns. You need to instead add these dummy columns and delete the original ones. Anyway, applying fit_transform to both train and test (assuming the repeated train row is a typo) is a bad idea.
The second error appears to be due to the one-hot-encoded data being sparse. You can specify sparse=False in the OneHotEncoder to fix that, but then probably you'll have the same issue as above.
The best thing to do is to use a ColumnTransformer; it would handle all the concatenation for you.

I succeeded to find the solution. In fact, I was looking to get back the original(since I had 4 columns so I thought I should get these columns back) columns as they were before the OneHotEnoder, which is not generally POSSIBLE. In my case I have ,for each cat_train columns, a different modality(more than one) so the result after a OneHotEncoder must be a more columns than before. So, and based on this, I ve regenerated the code as follow:
feats = df.drop(["Transported"], axis=1)
target = df["Transported"]
X_train, X_test, y_train, y_test = train_test_split(feats, target,
test_size = 0.2, random_state=42)
Separate numeric columns from categorical columns
import numpy as np
num_train = X_train.select_dtypes(include=[np.number])
cat_train = X_train.select_dtypes(exclude=[np.number])
num_test = X_test.select_dtypes(include=[np.number])
cat_test = X_test.select_dtypes(exclude=[np.number])
Fill in missing values
num_imp = SimpleImputer(strategy='median')
num_train = num_imp.fit_transform(num_train)
num_test = num_imp.transform(num_test)
cat_imp = SimpleImputer(strategy='most_frequent')
cat_train = cat_imp.fit_transform(cat_train)
cat_test = cat_imp.transform(cat_test)
Encode categorical variables
cat_enc = OneHotEncoder(handle_unknown='ignore')
cat_train = cat_enc.fit_transform(cat_train)
cat_test = cat_enc.transform(cat_test)
And Now the magic part; Reconstitute training and test sets
X_train = pd.concat([pd.DataFrame(num_train),
pd.DataFrame(cat_train.toarray())], axis=1)
X_test = pd.concat([pd.DataFrame(num_test),
pd.DataFrame(cat_test.toarray())], axis=1)
the dataframe is now as it should be
X_train.head()
0 1 2 3 4 5 0 1 2 3 4 5 6 7 8 9
0 28.0 0.0 55.0 0.0 656.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0
1 17.0 0.0 1195.0 31.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0
2 28.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0
3 20.0 0.0 2.0 289.0 976.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0
4 36.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0

Related

OneHotEncoding : TypeError: cannot perform reduce with flexible type

I was trying to fit OneHotEncoder on the X_train and then transform on X_train, X_test
However this resulted in error:
# One hot encoding
from sklearn.preprocessing import OneHotEncoder
encode_columns = ['borough','building_class_category', 'commercial_units','residential_units']
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X_train[encode_columns])
X_train = enc.transform(X_train[encode_columns])
X_test = enc.transform(X_test[encode_columns])
X_train.head()
Error:
4
5 enc = OneHotEncoder(handle_unknown='ignore')
----> 6 enc.fit(X_train[encode_columns])
7 X_train = enc.transform(X_train[encode_columns])
8 X_test = enc.transform(X_test[encode_columns])
TypeError: cannot perform reduce with flexible type
Sample row of X_train:

TLDR: You probably run the cell with fit and transform multiple times, and .transform() doesn't work the way, you think it work.
Why are you getting this error?
If you have data definition in one cell:
X_train = pd.DataFrame({'borough': ["Queens", "Brooklyn", "Queens", "Queens", "Brooklyn"],
'building_class_category': ["01", "02", "02", "01", "13"],
'commercial_units': ["O", "O", "O", "O", "A"],
'residential_units': [1,2,2,1,1]})
And fitting one hot-encoder in second one:
encode_columns = ['borough','building_class_category', 'commercial_units','residential_units']
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X_train[encode_columns])
X_train = enc.transform(X_train[encode_columns])
The cell above would work first time, but since you overwrite X_train if you run the cell second time:
TypeError: cannot perform reduce with flexible type
So the first part of the answer will be - have different name for the input and output.
What does OneHotEncoder transform returns?
If you'll print out enc.transform(X_train[encode_columns]) you'll get:
<5x9 sparse matrix of type '<class 'numpy.float64'>'
with 20 stored elements in Compressed Sparse Row format>
Defaultly the OneHotEncoder transform doesn't return the pandas DataFrame (or even a numpy array) but a sparse matrix. To get a numpy array yo have to either transform it:
enc.transform(X_train[encode_columns]).toarray()
or set sparse=False in definition of OneHotEncoder:
enc = OneHotEncoder(handle_unknown='ignore', sparse=False)
Bonus: How to have descriptive names of features?
After setting sparse=False, enc.transform(X_train[encode_columns]) would return numpy array. Even if you would transform it to pd.DataFrame, column names won't tell you much:
pd.DataFrame(enc.transform(X_train[encode_columns]))
# 0 1 2 3 4 5 6 7 8
#0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0
#1 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0
#2 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0
#3 0.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0
#4 1.0 0.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0
To get proper column names, you have to use get_feature_names_out() method:
pd.DataFrame(enc.transform(X_train[encode_columns]), columns = enc.get_feature_names_out())
# borough_Brooklyn borough_Queens ... residential_units_2
#0 0.0 1.0 ... 0.0
#1 1.0 0.0 ... 1.0
#2 0.0 1.0 ... 1.0
#3 0.0 1.0 ... 0.0
#4 1.0 0.0 ... 0.0
Whole code:
X_train = pd.DataFrame({'borough': ["Queens", "Brooklyn", "Queens", "Queens", "Brooklyn"],
'building_class_category': ["01", "02", "02", "01", "13"],
'commercial_units': ["O", "O", "O", "O", "A"],
'residential_units': [1,2,2,1,1]})
encode_columns = ['borough','building_class_category', 'commercial_units','residential_units']
enc = OneHotEncoder(handle_unknown='ignore', sparse=False)
enc.fit(X_train[encode_columns])
X_train_encoded = pd.DataFrame(enc.transform(X_train[encode_columns]), columns = enc.get_feature_names_out())

Pandas load from_records non-sparsely

I am trying to load a list of dictionaries into pandas as efficiently as possible. Here is a minimal example for constructing my data, which I call below, mylist:
import pandas as pd
import random
from string import ascii_lowercase
random.seed(100)
mylist = []
for i in range(100):
random_string_variable = "".join(random.sample("DINOSAUR", len("DINOSAUR")))
random_string = "".join(random.sample("DINOSAUR", len("DINOSAUR")))
for j in range(10):
myrecord = {"i": i,
"identifier" : random_string,
f"var_{ascii_lowercase[j].upper()}_xx" : random.random(),
f"var_{ascii_lowercase[j].upper()}_yy" : random.random()*10,
f"var_{ascii_lowercase[j].upper()}_zz" : random.random()*100
}
mylist.append(myrecord)
pprint(mylist[0:5])
[{'i': 0,
'identifier': 'NROUIDSA',
'var_A_xx': 0.03694960304368877,
'var_A_yy': 4.4615792434297585,
'var_A_zz': 68.37385464983947},
{'i': 0,
'identifier': 'NROUIDSA',
'var_B_xx': 0.7476846773635049,
'var_B_yy': 3.2014779786116643,
'var_B_zz': 58.91595571819701},
{'i': 0,
'identifier': 'NROUIDSA',
'var_C_xx': 0.3502573960649995,
'var_C_yy': 6.713087131908023,
'var_C_zz': 74.36827046647622},
{'i': 0,
'identifier': 'NROUIDSA',
'var_D_xx': 0.23513409285324904,
'var_D_yy': 3.894932754840866,
'var_D_zz': 65.35552900764706},
{'i': 0,
'identifier': 'NROUIDSA',
'var_E_xx': 0.6660170004345193,
'var_E_yy': 1.9094479278081555,
'var_E_zz': 36.84983796653053}]
When I try to load this into pandas, it makes the data frame very non-sparse, with a lot of NaN repetition:
df = pd.DataFrame.from_records(mylist)
df
produces:
df
i identifier var_A_xx var_A_yy var_A_zz var_B_xx var_B_yy ... var_H_zz var_I_xx var_I_yy var_I_zz var_J_xx var_J_yy var_J_zz
0 0 NROUIDSA 0.03695 4.461579 68.373855 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
1 0 NROUIDSA NaN NaN NaN 0.747685 3.201478 ... NaN NaN NaN NaN NaN NaN NaN
2 0 NROUIDSA NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
3 0 NROUIDSA NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
4 0 NROUIDSA NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
.. .. ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 99 SORIUDAN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
996 99 SORIUDAN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN
997 99 SORIUDAN NaN NaN NaN NaN NaN ... 63.72333 NaN NaN NaN NaN NaN NaN
998 99 SORIUDAN NaN NaN NaN NaN NaN ... NaN 0.367797 4.162167 84.699542 NaN NaN NaN
999 99 SORIUDAN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 0.634893 7.628154 75.903316
[1000 rows x 32 columns]
What I would like it to look like is:
var_A_xx var_A_yy var_A_zz var_B_xx var_B_yy var_B_zz ... var_I_xx var_I_yy var_I_zz var_J_xx var_J_yy var_J_zz
i identifier ...
0 NROUIDSA 0.036950 4.461579 68.373855 0.747685 3.201478 58.915956 ... 0.962999 7.332500 13.216899 0.847280 6.504308 8.552283
1 NURDASOI 0.814194 9.570388 21.239626 0.468727 6.180384 24.260818 ... 0.346681 9.865105 82.261586 0.221160 8.481875 92.645263
2 OARNDUIS 0.813418 1.103359 1.198749 0.646912 2.409214 76.037434 ... 0.404528 2.112085 8.461932 0.621124 5.372169 36.500880
3 DISORNAU 0.533450 1.094177 44.053734 0.804385 5.947438 28.360524 ... 0.121844 5.806337 85.657067 0.735207 4.011567 38.368097
4 SIONUDRA 0.672725 3.724022 58.280713 0.346717 7.432624 49.726532 ... 0.238869 0.769056 58.188641 0.415537 6.828866 38.802765
... ... ... ... ... ... ... ... ... ... ... ... ... ...
95 URIADNSO 0.231775 3.114448 65.241238 0.116461 4.330002 12.864624 ... 0.516712 5.589706 87.261427 0.572551 4.060943 80.102004
96 ISDONRAU 0.295684 8.406004 22.817404 0.160434 8.415922 47.288958 ... 0.050647 8.720049 44.407892 0.038166 5.027924 73.852513
97 OIAUSDNR 0.331393 9.480417 90.311381 0.985708 6.384429 55.459062 ... 0.947673 4.406426 68.098531 0.377523 5.258620 61.035638
98 DIONAURS 0.690593 4.316975 9.866558 0.822896 3.822044 68.863371 ... 0.994493 3.550660 22.769721 0.199187 7.254650 91.232969
99 SORIUDAN 0.960168 6.769579 49.488535 0.671168 1.577146 78.835216 ... 0.367797 4.162167 84.699542 0.634893 7.628154 75.903316
[100 rows x 30 columns]
You can see it is a 10x waste of memory to have the first representation. Obviously, there are a variety of ways to get from A to B. How can I tell pandas to /read in/ this list of records as non-sparse, as I assume this would be the most performant? You can see extra records are inserted with NaN values. I'm expecting 100 rows, where the index is given by ["i", "identifier"] and 30 columns.
My preference is to do this at load time with the correct keywords and data load method, rather than relying after the fact on a pivot operations in pandas as they are comparatively slow. I'm asking this question largely for performance, for example with much larger i and somewhat larger j.
df = pd.DataFrame.from_records(mylist, index=["i", "identifier"])
df
Did not do the job.
pd.DataFrame.from_records(mylist, index=["i", "identifier"]).unstack()
ValueError: Index contains duplicate entries, cannot reshape
Also fails.
If there do not exist arguments to ingest the list of dictionaries non-sparsely into a dataframe---this is the focus of my question---which of the .agg, pivot_table, reshape, long_to_wide, and unstack methods would be the fastest at getting from A to B for larger data sets?

There’s a number of ways to load the data as it is:
>>> df_idx = pd.DataFrame.from_records(mylist, index=['i', 'identifier'])
>>> df = pd.DataFrame.from_records(mylist)
>>> df = pd.DataFrame.from_dict(mylist)
>>> df = pd.DataFrame(mylist)
Then we could group by columns or levels, and take the first non-NA value:
>>> df_idx.groupby(level=[0, 1]).first()
>>> df.groupby(['i', 'identifier']).first()
Those are both pretty much improved versions of .agg() as you don’t have to specify a lambda function.
Then with the same loading, we can try to reshape the data, with stack/unstack or melt/pivot
>>> df_idx.stack().unstack()
>>> pd.melt(pd.DataFrame(mylist), id_vars=['i', 'identifier']).dropna().pivot(index=['i', 'identifier'], columns='variable', values='value')
If that’s not satisfactory, there’s also reshaping before loading, which could be done through list comprehensions, and either ChainMap or dict comprehensions, or with numpy. This relies on the fact that there’s always 10 dictionaries with the same keys in a row, and chaining the same iterator the appropriate number of times with zip():
>>> pd.DataFrame({k: v for d in tup for k, v in d.items()} for tup in zip(*[iter(mylist)] * 10))
>>> pd.DataFrame(ChainMap(*tup) for tup in zip(*[iter(mylist)] * 10))
>>> df = pd.concat([pd.DataFrame(mylist[n::10]) for n in range(10)], axis='columns')
>>> df.groupby(df.columns, axis='columns').first()
>>> reshaped = np.reshape(mylist, (100, 10))
>>> df = pd.concat([pd.json_normalize(reshaped[:,n]) for n in range(10)], axis='columns')
>>> df.groupby(df.columns, axis='columns').first()
I’ve measured with i=100 and j=10 as in your example and with i=1000 and j=100.
You can see which way you get the data into a dataframe does not matter: all groupby variants have the same results. As you suspect loading the data and then “fixing” it performs pretty bad. pd.concat does not work too well on 100x10 but scales better on the 1000x100 data, and what seems the best is pure-python dict iteration (maybe because it’ a comprehension and not a list? Not sure). The reshaping techniques, stack/unstack and melt/pivot, are always the worst.
Of course these results may change with different data sizes, and you probably know better what the right sizes to test are, based on your real data. Here’s the full script I used to run the tests so you can run some yourself:
#!/usr/bin/python3
import numpy as np
import pandas as pd
from collections import ChainMap
from matplotlib import pyplot as plt
import timeit
def gen(imax, jmax):
mylist = []
l = ord('A')
for i in range(imax):
random_string = ''.join(np.random.permutation(list('DINOSAUR')))
for j in range(jmax):
var = f'var_{chr(l + (j // 26)) if j >= 26 else ""}{chr(l + (j % 26))}'
mylist.append({
'i': i,
'identifier' : random_string,
var + '_xx' : np.random.random(),
var + '_yy' : np.random.random()*10,
var + '_zz' : np.random.random()*100
})
return mylist
def load_rec_idx_groupby(mylist, j):
return pd.DataFrame.from_records(mylist, index=['i', 'identifier']).groupby(level=[0, 1]).first()
def load_rec_groupby(mylist, j):
return pd.DataFrame.from_records(mylist).groupby(['i', 'identifier']).first()
def load_dict_groupby(mylist, j):
return pd.DataFrame.from_dict(mylist).groupby(['i', 'identifier']).first()
def load_constr_groupby(mylist, j):
return pd.DataFrame(mylist).groupby(['i', 'identifier']).first()
def load_constr_stack(mylist, j):
return pd.DataFrame(mylist).set_index(['i', 'identifier']).stack().unstack()
def load_constr_melt_pivot(mylist, j):
return pd.melt(pd.DataFrame(mylist), id_vars=['i', 'identifier']).dropna().pivot(index=['i', 'identifier'], columns='variable', values='value')
def load_zip_iter_dict(mylist, j):
return pd.DataFrame({k: v for d in tup for k, v in d.items()} for tup in zip(*[iter(mylist)] * j))
def load_zip_iter_chainmap(mylist, j):
return pd.DataFrame(ChainMap(*tup) for tup in zip(*[iter(mylist)] * 10))
def load_concat_step(mylist, j):
return pd.concat([pd.DataFrame(mylist[n::10]).drop(columns=['i', 'identifier'] if n else []) for n in range(10)], axis='columns')
def load_concat_reshape(mylist, j):
reshaped = np.reshape(mylist, (len(mylist) // j, j))
return pd.concat([pd.json_normalize(reshaped[:,n]).drop(columns=['i', 'identifier'] if n else []) for n in range(j)], axis='columns')
def plot_results(df):
mins = df.groupby(level=0).median().min(axis='columns')
rel = df.unstack().T.div(mins)
ax = rel.groupby(level=0).median().plot.barh()
ax.set_xlabel('slowdown over fastest')
ax.axvline(1, color='black', lw=1)
ax.set_xticks([1, *ax.get_xticks()[1:]])
ax.set_xticklabels([f'{n:.0f}×' for n in ax.get_xticks()])
plt.subplots_adjust(left=.4, bottom=.15)
plt.show()
def run():
candidates = {n: f for n, f in globals().items() if n.startswith('load_') and callable(f)}
df = {}
for tup in [(100, 10), (1000, 100)]:
glob = {'mylist': gen(*tup), **candidates}
dat = pd.DataFrame({name:
timeit.Timer(f'{name}(mylist, {10})', globals=glob).repeat(5, int(100000 / np.multiply(*tup)))
for name in candidates
})
print(dat)
df['{}×{}'.format(*tup)] = dat
df = pd.concat(df).rename(columns=lambda s: s.replace('load_', '').replace('_', ' '))
print(df)
plot_results(df)
if __name__ == '__main__':
run()

I don't think this is possible with a pandas argument at load time, but you can use a comprehension to collapse your list of dicts into a single dict for each row
Data:
a = [
{'i': 0,
'identifier': 'NROUIDSA',
'var_A_xx': 0.03694960304368877,
'var_A_yy': 4.4615792434297585,
'var_A_zz': 68.37385464983947},
{'i': 0,
'identifier': 'NROUIDSA',
'var_B_xx': 0.7476846773635049,
'var_B_yy': 3.2014779786116643,
'var_B_zz': 58.91595571819701},
{'i': 0,
'identifier': 'NROUIDSA',
'var_C_xx': 0.3502573960649995,
'var_C_yy': 6.713087131908023,
'var_C_zz': 74.36827046647622},
{'i': 0,
'identifier': 'NROUIDSA',
'var_D_xx': 0.23513409285324904,
'var_D_yy': 3.894932754840866,
'var_D_zz': 65.35552900764706},
{'i': 0,
'identifier': 'NROUIDSA',
'var_E_xx': 0.6660170004345193,
'var_E_yy': 1.9094479278081555,
'var_E_zz': 36.84983796653053},
{'i': 1,
'identifier': 'SORIUDAN',
'var_A_xx': 0.03694960304368877,
'var_A_yy': 4.4615792434297585,
'var_A_zz': 68.37385464983947},
{'i': 1,
'identifier': 'SORIUDAN',
'var_B_xx': 0.7476846773635049,
'var_B_yy': 3.2014779786116643,
'var_B_zz': 58.91595571819701},
{'i': 1,
'identifier': 'SORIUDAN',
'var_C_xx': 0.3502573960649995,
'var_C_yy': 6.713087131908023,
'var_C_zz': 74.36827046647622},
{'i': 1,
'identifier': 'SORIUDAN',
'var_D_xx': 0.23513409285324904,
'var_D_yy': 3.894932754840866,
'var_D_zz': 65.35552900764706},
{'i': 1,
'identifier': 'SORIUDAN',
'var_E_xx': 0.6660170004345193,
'var_E_yy': 1.9094479278081555,
'var_E_zz': 36.84983796653053}
]
Cleaning the list of dicts:
# get list of keys--assumed here to be the identifier dict value
l_key = list(dict.fromkeys([l.get('identifier') for l in a]))
# a data dict we'll append the properly parsed dict to
data = list()
# iterate through original dict and append.
for i in l_key:
l_data = [l for l in a if l.get('identifier') == i]
data.append({k: v for d in l_data for k, v in d.items()})
adding data to the pandas df.
import pandas as pd
df = pd.DataFrame.from_records(data)
print(df)
print(df.dtypes)
I don't know that this would be faster than dealing with the data after you've loaded it into the DataFrame but it is another approach.

I think your issue is not from pandas, you can create the records according to what you want as a result. I edit your implementation as below,
import random
from string import ascii_lowercase
import pandas as pd
random.seed(100)
mylist = []
for i in range(100):
random_string_variable = "".join(random.sample("DINOSAUR", len("DINOSAUR")))
random_string = "".join(random.sample("DINOSAUR", len("DINOSAUR")))
record = {
"i": i,
"identifier": random_string
}
for j in range(10):
record[f"var_{ascii_lowercase[j].upper()}_xx"] = random.random()
record[f"var_{ascii_lowercase[j].upper()}_yy"] = random.random() * 10,
record[f"var_{ascii_lowercase[j].upper()}_zz"] = random.random() * 100
mylist.append(record)
print(len(mylist))
df = pd.DataFrame.from_records(mylist)
df
As it is about loading the records into Pandas, maybe it can be easier to process the list before passing into pandas such as
from itertools import groupby
from collections import ChainMap
records = []
for k, v in groupby(mylist, key=lambda x: (x['i'], x['identifier'])):
record = dict(ChainMap(*v))
records.append(record)
df = pd.DataFrame.from_records(records)
print(df)

Doing a ML predict but prices appears as in CSV

I am learning ML and running my code on prediction. When I run the code, I find the prices in the csv is the same as the predict, what am I doing wrong?
----CODE---
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
melbourne_file_path = 'melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
melbourne_data = melbourne_data.dropna(axis=0)
y = melbourne_data.Price
melbourne_features = ['Rooms', 'Price', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
print(X.describe())
print(X.head())
melbourne_model = DecisionTreeRegressor(random_state=1)
melbourne_model.fit(X, y)
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))
-----OUTPUT----
Rooms Price ... Lattitude Longtitude
count 6196.000000 6.196000e+03 ... 6196.000000 6196.000000
mean 2.931407 1.068828e+06 ... -37.807904 144.990201
std 0.971079 6.751564e+05 ... 0.075850 0.099165
min 1.000000 1.310000e+05 ... -38.164920 144.542370
25% 2.000000 6.200000e+05 ... -37.855438 144.926198
50% 3.000000 8.800000e+05 ... -37.802250 144.995800
75% 4.000000 1.325000e+06 ... -37.758200 145.052700
max 8.000000 9.000000e+06 ... -37.457090 145.526350
[8 rows x 6 columns]
Rooms Price Bathroom Landsize Lattitude Longtitude
1 2 1035000.0 1.0 156.0 -37.8079 144.9934
2 3 1465000.0 2.0 134.0 -37.8093 144.9944
4 4 1600000.0 1.0 120.0 -37.8072 144.9941
6 3 1876000.0 2.0 245.0 -37.8024 144.9993
7 2 1636000.0 1.0 256.0 -37.8060 144.9954
Making predictions for the following 5 houses:
Rooms Price Bathroom Landsize Lattitude Longtitude
1 2 1035000.0 1.0 156.0 -37.8079 144.9934
2 3 1465000.0 2.0 134.0 -37.8093 144.9944
4 4 1600000.0 1.0 120.0 -37.8072 144.9941
6 3 1876000.0 2.0 245.0 -37.8024 144.9993
7 2 1636000.0 1.0 256.0 -37.8060 144.9954
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]

First, split your data into a train and test file.
Next, train the model using the .fit() function using your x_train and y_train datasets.
Then, run the .predict() function to make a prediction and assign the values as a list in the y_pred variable.
Finally, Make sure not to include the column that you are trying to predict in melbourne_features.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
melbourne_file_path = 'melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
melbourne_data = melbourne_data.dropna(axis=0)
y = melbourne_data.Price
#Make sure not to include the column that you are trying to predict.
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
print(X.describe())
print(X.head())
#Enter 0.50 when you wanted to have 50 percent of your data to be tested and 50 percent to be trained.
x_train, x_test, y_train, y_test = train_test_split(X,Y, test_size = 0.50)
melbourne_model = DecisionTreeRegressor(random_state=1)
#Alternatively, you can use RandomForestRegressor to lower down your mean absolute error compare to DecisionTreeRegressor.
#melbourne_model = RandomForestRegressor(n_estimators = 1000)
#Fit the x_train and y_train data only. In other words, train the model.
melbourne_model.fit(x_train, y_train)
#Finally, make a prediction.
y_pred = melbourne_model.predict(x_test)
print("Making predictions for the following 5 houses:")
print(x_test.head())
print("The predictions are")
print(pd.DataFrame({'Actual Price':y_test,
'Predicted Price': y_pred
}
)
)
#The mean absolute error is a single number that you can plus or minus
#from your prediction price to get the best estimate of the actual price
#Your goal is to have as low mean absolute error as possible.
print(f'Mean Absolute Error : {mean_absolute_error(y_test, y_pred)}')
Source:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
https://www.geeksforgeeks.org/python-decision-tree-regression-using-sklearn/
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html
Additional Reference:
https://www.youtube.com/watch?v=PaFPbb66DxQ
https://www.youtube.com/watch?v=YSB7FtzeicA
https://www.youtube.com/watch?v=BFaadIqWlAg
https://www.youtube.com/watch?v=ENvSybznF_o
https://www.youtube.com/watch?v=yXoxdXMvD7c

Copy values from one column in a pandas dataframe only when target field is blank in target dataframe

I have 2 dataframes of equal length. The source has one column, ML_PREDICTION that I want to copy to the target dataframe, which has some values already that I don't want to overwrite.
#Select only blank values in target dataframe
mask = br_df['RECOMMENDED_ACTION'] == ''
# Attempt 1 - Results: KeyError: "['Retain' 'Retain' '' ... '' '' 'Retain'] not in index"
br_df.loc[br_df['RECOMMENDED_ACTION'][mask]] = ML_df['ML_PREDICTION'][mask]
br_df.loc['REASON_CODE'][mask] = 'ML01'
br_df.loc['COMMENT'][mask] = 'Automated Prediction'
# Attempt 2 - Results: Overwrites all values in target dataframe
br_df['RECOMMENDED_ACTION'].where(mask, other=ML_df['ML_PREDICTION'], inplace=True)
br_df['REASON_CODE'].where(mask, other='ML01', inplace=True)
br_df['COMMENT'].where(mask, other='Automated Prediction', inplace=True)
# Attempt 3 - Results: Overwrites all values in target dataframe
br_df['RECOMMENDED_ACTION'] = [x for x in ML_df['ML_PREDICTION'] if [mask] ]
br_df['REASON_CODE'] = ['ML01' for x in ML_df['ML_PREDICTION'] if [mask]]
br_df['COMMENT'] = ['Automated Prediction' for x in ML_df['ML_PREDICTION'] if [mask]]
Attempt 4 - Results: Values in target (br_df) were unchanged
br_df.loc[br_df['RECOMMENDED_ACTION'].isnull(), 'REASON_CODE'] = 'ML01'
br_df.loc[br_df['RECOMMENDED_ACTION'].isnull(), 'COMMENT'] = 'Automated Prediction'
br_df.loc[br_df['RECOMMENDED_ACTION'].isnull(), 'RECOMMENDED_ACTION'] = ML_df['ML_PREDICTION']
Attempt 5
#Dipanjan
` # Before - br_df['REASON_CODE'].value_counts()
BR03 10
BR01 8
Name: REASON_CODE, dtype: int64
#Attempt 5
br_df.loc['REASON_CODE'] = br_df['REASON_CODE'].fillna('ML01')
br_df.loc['COMMENT'] = br_df['COMMENT'].fillna('Automated Prediction')
br_df.loc['RECOMMENDED_ACTION'] = br_df['RECOMMENDED_ACTION'].fillna(ML_df['ML_PREDICTION'])
# After -- print(br_df['REASON_CODE'].value_counts())
BR03 10
BR01 8
ML01 2
Automated Prediction 1
Name: REASON_CODE, dtype: int64
#WTF? -- br_df[br_df['REASON_CODE'] == 'Automated Prediction']
PERSON_STATUS ... RECOMMENDED_ACTION REASON_CODE COMMENT
COMMENT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Automated Prediction Automated Prediction Automated Prediction
What am I missing here?

use below options -
df.loc[df['A'].isnull(), 'A'] = df['B']
or
df['A'] = df['A'].fillna(df['B'])
import numpy as np
df_a = pd.DataFrame([0,1,np.nan])
df_b = pd.DataFrame([0,np.nan,2])
df_a
0
0 0.0
1 1.0
2 NaN
df_b
0
0 0.0
1 NaN
2 2.0
df_a[0] = df_a[0].fillna(df_b[0])
final_output-
df_a
0
0 0.0
1 1.0
2 2.0

Ultimately, this is the syntax that appears to solve my problem:
mask = mask[:len(br_df)] # create the boolean index
br_df = br_df[:len(mask)] # make sure they are the same length
br_df['RECOMMENDED_ACTION'].loc[mask] = ML_df['ML_PREDICTION'].loc[mask]
br_df['REASON_CODE'].loc[mask] = 'ML01'
br_df['COMMENT'].loc[mask] = 'Automated Prediction'

fillNa(0) instead producing None

My question is just below the code snippet, below:
I have raw sensor time series data that .. is getting really close to be usable now :)
locDf = locationDf.copy()
locDf.set_index('date', inplace=True)
locDfs = {}
for user, user_loc_dc in locDf.groupby('user'):
locDfs[user] = user_loc_dc.resample('15T').agg('max').bfill()
aDf = appDf.copy()
aDf.set_index('date', inplace=True)
userLocAppDfs = {}
appDfs = []
for user, a2_df in aDf.groupby('user'):
userDf = a2_df.resample('15T').agg('min')
userDf.reset_index(inplace=True)
userDf = pd.crosstab(index=userDf['date'], columns=userDf['app'], values=userDf['metric'], aggfunc=np.mean).fillna(0, downcast='infer')
userDf['user'] = user
userDf.reset_index(inplace=True)
userDf.set_index('date', inplace=True)
appDfs.append(userDf)
userLocAppDfs[user] = userDf
loDf = locDfs[user]
loDf.reset_index(inplace=True)
loDf = pd.crosstab([loDf.date, loDf.user], loDf.location)
loDf.reset_index(inplace=True)
loDf.set_index('date', inplace=True)
loDf.drop('user', axis=1, inplace=True)
userLocAppDfs[user] = userLocAppDfs[user].join(loDf, how='outer')
userLocAppDfs[user]['user'].fillna(user, inplace=True)
#for app in a2_df['app'].unique():
# userLocAppDfs[user][app] = userLocAppDfs[user][app].fillna(0, inplace=True)
userLocAppDfs['user_1'].head(5)
Question
If I uncomment those last two lines to try to fill the NaN's at the start, I dont' get zeros. I get None. :( Can anyone tell me why?
I'd like to.. you know, get 0's there:
2017-08-28 00:00:00 0 0 user_1 0.0 0.0 0.0 1.0 0.0
2017-08-28 00:15:00 0 0 user_1 0.0 0.0 1.0 0.0 0.0
2017-08-28 00:30:00 0 0 user_1 0.0 0.0 1.0 0.0 0.0
2017-08-28 00:45:00 0 0 user_1 0.0 0.0 1.0 0.0 0.0
2017-08-28 01:00:00 0 0 user_1 0.0 0.0 1.0 0.0 0.0
The last step will be for me to get the rolling average of those app_* numbers, so that I get a curve.

Try
for app in a2_df['app'].unique():
userLocAppDfs[user][app].fillna(0, inplace=True)
# or userLocAppDfs[user][app] = userLocAppDfs[user][app].fillna(0)
So it is because you have specified inplace = True and at the same time you assign it back.
Note that df.fillna(0, inplace=True) will not return a value. Rather it will directly modify the originaldf. Try print(df.fillna(0, inplace=True)), it will give you None. So what you've done above was assigning None to column apps.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Machine Learning: getting a Dataframe after a OneHotEncoder - pandas

Related

OneHotEncoding : TypeError: cannot perform reduce with flexible type

Pandas load from_records non-sparsely

Doing a ML predict but prices appears as in CSV

Copy values from one column in a pandas dataframe only when target field is blank in target dataframe

fillNa(0) instead producing None

Categories

Resources