Does sklearn use pandas index as a feature? - pandas

I'm passing a pandas DataFrame containing various features to sklearn and I do not want the estimator to use the dataframe index as one of the features. Does sklearn use the index as one of the features?
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
df_features = pd.DataFrame(columns=["feat1", "feat2", "target"])
# Populate the dataframe (not shown here)
y = df_features["target"]
X = df_features.drop(columns=["target"])
estimator = RandomForestClassifier()
estimator.fit(X, y)

No, sklearn does not use the index as one of your features. When you call the fit method, the input is passed through the check_array function. If you dig into check_array, you will find that it converts the input into a NumPy array, which keeps only the values and strips the index from your DataFrame. Calling np.array on a DataFrame shows the same effect:
import pandas as pd
import numpy as np
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
df
   Name  Age
0   tom   10
1  nick   15
2  juli   14
np.array(df)
array([['tom', 10],
       ['nick', 15],
       ['juli', 14]], dtype=object)
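To make this concrete, here is a minimal sketch (the column names and index values are made up for illustration). It shows that the converted array contains only the feature columns, whatever the index is, and that you would have to call reset_index() yourself if you ever did want the index as a feature:

```python
import numpy as np
import pandas as pd

# Hypothetical feature frame with a non-default index
df = pd.DataFrame(
    {"feat1": [1.0, 2.0, 3.0], "feat2": [4.0, 5.0, 6.0]},
    index=[100, 200, 300],
)

# Converting to an array keeps the column values only; the index is dropped
arr = np.asarray(df)
print(arr.shape)

# Only an explicit reset_index() turns the index into a regular column
arr_with_index = np.asarray(df.reset_index())
print(arr_with_index.shape)
```

The first shape has two columns and the second has three, so the index only becomes a feature if you add it on purpose.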

Related

Converting dict to dataframe of Solution point values & plotting

I am trying to plot some results obtained after optimisation with Gurobi.
I have converted the dictionary to a pandas DataFrame of shape 96×1.
How do I plot this DataFrame so that each row label is paired with its value (1st row and its value, 2nd row and its value, and so on)? I am attaching a snapshot of the DataFrame.
Can anyone help me with this?
x = {}
for t in time1:
    x[t] = [price_energy[t-1] * EnergyResource[174, t].X]
df = pd.DataFrame.from_dict(x, orient='index')
df
You can try pandas.DataFrame(data=x.values()) to properly create a pandas DataFrame while using row numbers as indices.
In the example below, I generate a (pseudo)random dictionary with 10 values and store it in a DataFrame using pandas.DataFrame, naming its only column xyz. To understand how indexing works, see "Indexing and selecting data" in the pandas user guide.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Create a dictionary 'x'
rng = np.random.default_rng(121)
x = dict(zip(np.arange(10), rng.random((1, 10))[0]))
# Create a dataframe from 'x'
df = pd.DataFrame(x.values(), index=x.keys(), columns=["xyz"])
print(df)
print(df.index)
# Plot the dataframe
plt.plot(df.index, df.xyz)
plt.show()
This prints df as:
        xyz
0  0.632816
1  0.297902
2  0.824260
3  0.580722
4  0.593562
5  0.793063
6  0.444513
7  0.386832
8  0.214222
9  0.029993
and gives df.index as:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
and also plots the figure:
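If you would rather keep the from_dict call from the question, a minimal sketch along these lines should also work. The data here is made up, and the column name value is just an assumption for illustration. The index supplies the x-values and the single column supplies the y-values:

```python
import pandas as pd

# Hypothetical stand-in for the question's dict: key -> one-element list
x = {1: [10.5], 2: [12.0], 3: [9.75]}

# orient="index" makes each key a row label; columns names the single column
df = pd.DataFrame.from_dict(x, orient="index", columns=["value"])

# Row labels and values, ready to be used as x and y for a plot
xs = df.index.to_list()
ys = df["value"].to_list()
print(xs)
print(ys)
```

The two lists can be passed to plt.plot(xs, ys), or you can plot df.index against the column directly as in the answer above.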

pandas DataFrame value_counts on column that stores DataFrame

I am trying to use value_counts() on a pandas DataFrame column that stores another DataFrame.
Is there a possibility to get the value_counts() function working (or something similar), without having to transform my DataFrames to Strings or Hashes or something like that?
I first tried counting the inner DataFrames, which fails completely. I then tried with NumPy arrays, which do not seem to be compared correctly either:
# importing pandas
import pandas as pd
import numpy as np
# Creating arrays
ar1 = np.array([11,22])
ar2 = np.array([11,22])
ar3 = np.array([33,44])
df = pd.DataFrame([
['0', ar1],
['1', ar2],
['2', ar3]
], columns =['str', 'ars'])
print(df["ars"].value_counts())
Expected:
[11, 22] 2
[33, 44] 1
Actual:
[11, 22] 1
[11, 22] 1
[33, 44] 1
# importing pandas
import pandas as pd
import numpy as np
# Creating DataFrames
df1 = pd.DataFrame({'col1': [11], 'col2': [22]})
df2 = pd.DataFrame({'col1': [11], 'col2': [22]})
df3 = pd.DataFrame({'col1': [33], 'col2': [44]})
df = pd.DataFrame([
['0', df1],
['1', df2],
['2', df3]
], columns =['str', 'dfs'])
print(df["dfs"].value_counts())
Expected:
{} 2
{} 1
Actual:
BREAKS COMPLETELY
How can I count complex values in a DataFrame?
I'm honestly confused how either of those managed to run without raising an exception.
Neither np.ndarray nor pd.DataFrame is hashable, and as far as I understand, hashing is necessary for value_counts.
Case in point: neither of your examples can be translated to its DataFrame.value_counts equivalent, because underneath it does df.groupby(["ars"], dropna=True).grouper.size(), which requires hashing.
>>> df.value_counts(["ars"])
TypeError: unhashable type: 'numpy.ndarray'
Overall, I would not count on any value_counts method working on non-hashable columns.
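One workaround, which is not part of the answer above but follows from it, is to map each array to a hashable stand-in such as a tuple before counting. A minimal sketch, assuming small 1-D arrays like the ones in the question:

```python
import numpy as np
import pandas as pd

ar1 = np.array([11, 22])
ar2 = np.array([11, 22])
ar3 = np.array([33, 44])
df = pd.DataFrame([
    ['0', ar1],
    ['1', ar2],
    ['2', ar3],
], columns=['str', 'ars'])

# Tuples of plain Python ints are hashable and compare by value
as_tuples = df['ars'].map(lambda a: tuple(a.tolist()))
counts = as_tuples.value_counts()
print(counts)
```

The same idea should carry over to inner DataFrames if you first reduce each one to some hashable representation of its contents, but that conversion is up to you and is not shown here.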

Running df.apply, dask and pd.get_dummies together

I have multiple categorical columns, each with millions of distinct values, so I am using dask and pd.get_dummies to convert these columns into bit vectors. Like this:
import pandas as pd
import numpy as np
import scipy.sparse
import dask.dataframe as dd
import multiprocessing
train_set = pd.read_csv('train_set.csv')
def convert_into_one_hot(col1, col2):
    return pd.get_dummies(train_set, columns=[col1, col2], sparse=True)
ddata = dd.from_pandas(train_set, npartitions=2*multiprocessing.cpu_count()).map_partitions(lambda df: df.apply((lambda row: convert_into_one_hot(row.col1, row.col2)), axis=1)).compute(scheduler='processes')
But, I get this error:
ValueError: Metadata inference failed in `lambda`.
You have supplied a custom function and Dask is unable to determine the type of output that that function returns.
To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.
Original error is below:
------------------------
KeyError("None of [Index(['foo'], dtype='object')] are in the [columns]")
What am I doing wrong here? Thanks.
EDIT:
A small example to reproduce the error. Hope it helps to understand the problem.
def convert_into_one_hot(x, y):
    return pd.get_dummies(df, columns=[x, y], sparse=True)
d = {'col1': ['a', 'b'], 'col2': ['c', 'd']}
df = pd.DataFrame(data=d)
dd.from_pandas(df, npartitions=2*multiprocessing.cpu_count()).map_partitions(lambda df: df.apply((lambda row: convert_into_one_hot(row.col1, row.col2)), axis=1)).compute(scheduler='processes')
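I think the problem is that get_dummies expects column names, while your df.apply call passes each row's values to it (the 'foo' in the error appears to be a placeholder value that Dask uses while inferring metadata), so the lookup fails with a KeyError. Using get_dummies inside partitions can also cause problems. There is a Dask version of get_dummies, which should work as follows: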
import pandas as pd
import dask.dataframe as dd
import multiprocessing as mp
d = {'col1': ['a', 'b'], 'col2': ['c', 'd']}
df = pd.DataFrame(data=d)
Pandas
pd.get_dummies(df, columns=["col1", "col2"], sparse=True)
Dask
ddf = dd.from_pandas(df, npartitions=2 * mp.cpu_count())
# you need to convert the columns' dtypes to category
dummies_cols = ["col1", "col2"]
ddf[dummies_cols] = ddf[dummies_cols].categorize()
dd.get_dummies(ddf, columns=["col1", "col2"], sparse=True)
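To see the difference between passing column names and passing row values, here is a small pandas-only sketch using the example frame from the question. The raised flag is just a helper variable for illustration:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b'], 'col2': ['c', 'd']})

# Correct usage: columns= takes column *names*
dummies = pd.get_dummies(df, columns=['col1', 'col2'])
print(list(dummies.columns))

# What the row-wise apply effectively does: pass cell *values* as names
raised = False
try:
    pd.get_dummies(df, columns=['a', 'c'])
except KeyError as exc:
    raised = True
    print('KeyError:', exc)
```

The first call yields one indicator column per distinct value, while the second fails because 'a' and 'c' are values and not column names.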

How to split a cell which contains nested array in a pandas DataFrame

I have a pandas DataFrame with 610 rows, where every row contains a nested list of coordinate pairs. It looks like this:
[1377778.4800000004, 6682395.377599999] is one coordinate pair.
I want to unnest every row, so instead of one row containing a list of coordinates I will have one row for every coordinate pair, i.e.:
I've tried s.apply(pd.Series).stack() from this question Split nested array values from Pandas Dataframe cell over multiple rows but unfortunately that didn't work.
Any ideas? Many thanks in advance!
Here is my new answer to your problem. I used reduce to flatten your nested array and then itertools.chain to turn everything into a 1D list. After that, I reshaped the list into a 2D array, which can be converted to the DataFrame you need. I tried to be as generic as possible. Please let me know if there are any problems.
# libraries
import operator
from functools import reduce
from itertools import chain
import numpy as np
import pandas as pd
# geometry_list is assumed to hold the nested coordinate lists from your DataFrame.
# Flatten the lists of lists using reduce, then turn everything into a 1D list
# using itertools.chain.
reduced_coordinates = list(chain.from_iterable(reduce(operator.concat,
                                                      geometry_list)))
# Reshape the 1D coordinate list to 2D and convert it to a DataFrame
df = pd.DataFrame(np.reshape(reduced_coordinates, (-1, 2)))
df.columns = ['X', 'Y']
One thing you can do is use numpy. It allows you to perform a lot of list/ array operations in a fast and efficient way. This includes "unnesting" (reshaping) lists. Then you only have to convert to pandas dataframe.
For example,
import numpy as np
import pandas as pd
# your list
coordinate_list = [[[1377778.4800000004, 6682395.377599999],[6582395.377599999, 2577778.4800000004], [6582395.377599999, 2577778.4800000004]]]
# convert list to array
coordinate_array = np.array(coordinate_list)
# print shape of array
print(coordinate_array.shape)
# reshape array into rows of (x, y) pairs; -1 lets NumPy infer the row count (3 here)
reshaped_array = np.reshape(coordinate_array, (-1, 2))
df = pd.DataFrame(reshaped_array)
df.columns = ['X', 'Y']
The output will look like this. Let me know if there is something I am missing.
import pandas as pd
import numpy as np
data = np.arange(500).reshape([250, 2])
cols = ['coord']
new_data = []
for item in data:
    new_data.append([item])
df = pd.DataFrame(data=new_data, columns=cols)
print(df.head())
def expand(row):
    row['x'] = row.coord[0]
    row['y'] = row.coord[1]
    return row
df = df.apply(expand, axis=1)
df.drop(columns='coord', inplace=True)
print(df.head())
RESULT
    coord
0  [0, 1]
1  [2, 3]
2  [4, 5]
3  [6, 7]
4  [8, 9]
   x  y
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
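Another option, different from the approaches above, is Series.explode, which is available in newer pandas versions. This is only a sketch under assumptions: the column name geometry and the sample values are made up, and each cell is assumed to hold a flat list of [x, y] pairs. If your cells are nested one level deeper, you would need to explode twice or flatten first.

```python
import pandas as pd

# Hypothetical frame: each cell holds a list of [x, y] coordinate pairs
df = pd.DataFrame({
    "geometry": [
        [[1.0, 2.0], [3.0, 4.0]],
        [[5.0, 6.0]],
    ]
})

# One row per coordinate pair; the original row label is repeated
pairs = df["geometry"].explode()

# Split each pair into separate X and Y columns
coords = pd.DataFrame(pairs.tolist(), index=pairs.index, columns=["X", "Y"])
print(coords)
```

Keeping the original index on coords lets you trace every coordinate pair back to the row it came from.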

Pytables table into pandas DataFrame

There is lots of information on how to read a CSV into a pandas DataFrame, but what I have is a PyTables table, and I want a pandas DataFrame.
I've found how to store my pandas DataFrame to PyTables. When I then want to read it back, at this point it will have:
"kind = v._v_attrs.pandas_type"
I could write it out as CSV and re-read it, but that seems silly. It is what I am doing for now.
How should I be reading PyTables objects into pandas?
import tables as pt
import pandas as pd
import numpy as np
# the content is junk but we don't care
# (a 1-D structured array; the dtype is a list of (name, type) pairs)
grades = np.empty(10, dtype=[('name', 'S20'), ('grade', 'u2')])
# write to a PyTables table
# (open_file/create_table are the PyTables 3.x names for openFile/createTable)
handle = pt.open_file('/tmp/test_pandas.h5', 'w')
handle.create_table('/', 'grades', obj=grades)
print(handle.root.grades[:].dtype)  # it is a structured array
# load back as a DataFrame and check types
df = pd.DataFrame.from_records(handle.root.grades[:])
print(df.dtypes)
handle.close()
Beware that your u2 (unsigned 2-byte integer) will end as an i8 (integer 8 byte), and the strings will be objects, because Pandas does not yet support the full range of dtypes that are available for Numpy arrays.
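If the byte strings get in the way, a minimal sketch like the following decodes them after loading. It uses a hand-made structured array with made-up names and grades in place of the PyTables read, and it prints the dtypes so you can check what your pandas version does with them:

```python
import numpy as np
import pandas as pd

# Hypothetical structured array, standing in for handle.root.grades[:]
grades = np.array(
    [(b"alice", 90), (b"bob", 75), (b"carol", 82)],
    dtype=[("name", "S20"), ("grade", "u2")],
)

df = pd.DataFrame.from_records(grades)
print(df.dtypes)  # inspect how the structured dtypes were mapped

# Fixed-width byte strings arrive as bytes objects; decode them to str
df["name"] = df["name"].str.decode("utf-8")
print(df)
```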
The docs now include an excellent section on using the HDF5 store and there are some more advanced strategies discussed in the cookbook.
It's now relatively straightforward:
In [1]: store = HDFStore('store.h5')
In [2]: print(store)
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
Empty
In [3]: df = DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
In [4]: store['df'] = df
In [5]: store
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df frame (shape->[2,2])
And to retrieve from HDF5/pytables:
In [6]: store['df'] # store.get('df') is an equivalent
Out[6]:
   A  B
0  1  2
1  3  4
You can also query within a table.