pandas DataFrame value_counts on column that stores DataFrame - pandas

I am trying to use value_counts() on a pandas DataFrame column whose cells store another DataFrame.
Is there a way to get value_counts() (or something similar) working without having to convert my DataFrames to strings, hashes, or something like that?
I tried counting the inner DataFrames, which breaks completely, and then I tried with arrays, where it also seems unable to make the correct comparison:
# importing pandas and numpy
import pandas as pd
import numpy as np

# creating arrays
ar1 = np.array([11, 22])
ar2 = np.array([11, 22])
ar3 = np.array([33, 44])

df = pd.DataFrame([
    ['0', ar1],
    ['1', ar2],
    ['2', ar3]
], columns=['str', 'ars'])
print(df["ars"].value_counts())
Expected:
[11, 22] 2
[33, 44] 1
Actual:
[11, 22] 1
[11, 22] 1
[33, 44] 1
# importing pandas and numpy
import pandas as pd
import numpy as np

# creating the inner DataFrames
df1 = pd.DataFrame({'col1': [11], 'col2': [22]})
df2 = pd.DataFrame({'col1': [11], 'col2': [22]})
df3 = pd.DataFrame({'col1': [33], 'col2': [44]})

df = pd.DataFrame([
    ['0', df1],
    ['1', df2],
    ['2', df3]
], columns=['str', 'dfs'])
print(df["dfs"].value_counts())
Expected:
{} 2
{} 1
Actual:
BREAKS COMPLETELY
How can I achieve a count of complex values in a DataFrame?

I'm honestly confused how either of those managed to run without raising an exception.
Neither np.array nor pd.DataFrame objects are hashable, and as far as I understand, hashing is necessary for value_counts.
Case in point: neither of your examples can be translated to the equivalent DataFrame.value_counts call, because underneath it is doing df.groupby(["ars"], dropna=True).grouper.size(), which requires hashing.
>>> df.value_counts(["ars"])
TypeError: unhashable type: 'numpy.ndarray'
Overall, I would not count on any .value_counts method working on non-hashable columns.
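If you do need these counts, one workaround is to map each array to a hashable representation before counting. A minimal sketch, assuming a tuple is an acceptable stand-in for each array (for the inner-DataFrame case you would likewise have to derive something hashable from each frame first):
import pandas as pd
import numpy as np

ar1 = np.array([11, 22])
ar2 = np.array([11, 22])
ar3 = np.array([33, 44])
df = pd.DataFrame([
    ['0', ar1],
    ['1', ar2],
    ['2', ar3]
], columns=['str', 'ars'])

# tuples are hashable, so value_counts can group equal arrays
print(df['ars'].map(tuple).value_counts())
# (11, 22)    2
# (33, 44)    1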

Related

Running df.apply, dask and pd.get_dummies together

I have multiple categorical columns, each with millions of distinct values. So I am using dask and pd.get_dummies to convert these categorical columns into bit vectors, like this:
import pandas as pd
import numpy as np
import scipy.sparse
import dask.dataframe as dd
import multiprocessing

train_set = pd.read_csv('train_set.csv')

def convert_into_one_hot(col1, col2):
    return pd.get_dummies(train_set, columns=[col1, col2], sparse=True)

ddata = dd.from_pandas(train_set, npartitions=2 * multiprocessing.cpu_count()) \
    .map_partitions(lambda df: df.apply(lambda row: convert_into_one_hot(row.col1, row.col2), axis=1)) \
    .compute(scheduler='processes')
But, I get this error:
ValueError: Metadata inference failed in `lambda`.
You have supplied a custom function and Dask is unable to determine the type of output that that function returns.
To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.
Original error is below:
------------------------
KeyError("None of [Index(['foo'], dtype='object')] are in the [columns]")
What am I doing wrong here? Thanks.
EDIT:
A small example to reproduce the error. Hope it helps to understand the problem.
def convert_into_one_hot(x, y):
    return pd.get_dummies(df, columns=[x, y], sparse=True)

d = {'col1': ['a', 'b'], 'col2': ['c', 'd']}
df = pd.DataFrame(data=d)

dd.from_pandas(df, npartitions=2 * multiprocessing.cpu_count()) \
    .map_partitions(lambda df: df.apply(lambda row: convert_into_one_hot(row.col1, row.col2), axis=1)) \
    .compute(scheduler='processes')
I think you will run into problems if you try to use get_dummies within the partitions. There is a Dask version of get_dummies, and it should work as follows:
import pandas as pd
import dask.dataframe as dd
import multiprocessing as mp
d = {'col1': ['a', 'b'], 'col2': ['c', 'd']}
df = pd.DataFrame(data=d)
Pandas
pd.get_dummies(df, columns=["col1", "col2"], sparse=True)
Dask
ddf = dd.from_pandas(df, npartitions=2 * mp.cpu_count())
# you need to convert the column dtypes to category first
dummies_cols = ["col1", "col2"]
ddf[dummies_cols] = ddf[dummies_cols].categorize()
dd.get_dummies(ddf, columns=["col1", "col2"], sparse=True)
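A note on the design: Dask needs the full set of categories up front so that every partition produces the same output columns, which is why the .categorize() step is required. Also, dd.get_dummies returns a lazy Dask DataFrame; continuing the snippet above, a minimal sketch of materializing the result (the .compute() call here is just for illustration):
# nothing is computed until .compute() collects the result into pandas
dummies = dd.get_dummies(ddf, columns=dummies_cols)
print(dummies.compute())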

Does sklearn use pandas index as a feature?

I'm passing a pandas DataFrame containing various features to sklearn and I do not want the estimator to use the dataframe index as one of the features. Does sklearn use the index as one of the features?
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df_features = pd.DataFrame(columns=["feat1", "feat2", "target"])
# Populate the dataframe (not shown here)
y = df_features["target"]
X = df_features.drop(columns=["target"])
estimator = RandomForestClassifier()
estimator.fit(X, y)
No, sklearn doesn't use the index as one of your features. When you call the fit method, the check_array function is applied to your input, and if you dig into check_array you will find that it converts the input to an array using np.array, which strips the indices from your dataframe, as shown below:
import pandas as pd
import numpy as np
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
df
    Name  Age
0    tom   10
1   nick   15
2   juli   14
np.array(df)
array([['tom', 10],
       ['nick', 15],
       ['juli', 14]], dtype=object)
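A quick way to convince yourself is to fit the same model on the same data under two different indexes and compare the predictions. A minimal sketch with made-up random data (the column names are illustrative):
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(20, 2), columns=['feat1', 'feat2'])
y = pd.Series(rng.randint(0, 2, 20))

# same data, completely different index labels
X_shifted = X.copy()
X_shifted.index = X.index + 1000

m1 = RandomForestClassifier(random_state=0).fit(X, y)
m2 = RandomForestClassifier(random_state=0).fit(X_shifted, y)

# identical predictions: the index played no role in fitting
assert (m1.predict(X) == m2.predict(X_shifted)).all()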

pandas different column names same data

How do I use the same data storage and change only the column names?
If I do something like this:
In [30]: import pandas as pd
In [31]: import numpy as np
In [32]: df = pd.DataFrame(np.zeros((2,2)))
In [33]: df_new = pd.DataFrame(df)
In [34]: df[0][0]=5
In [35]: df_new
Out[35]:
     0    1
0  5.0  0.0
1  0.0  0.0
In [36]: df_new.columns=["a", "b"]
In [37]: df_new.columns
Out[37]: Index(['a', 'b'], dtype='object')
In [38]: df.columns
Out[38]: Index(['a', 'b'], dtype='object')
the columns change for both dataframes. Using DataFrame.rename with inplace=True likewise changes the columns of both dataframes.
You should use pandas.DataFrame.copy() to create a copy of an existing dataframe.
For your code,
df_new = df.copy()
instead of
df_new = pd.DataFrame(df)
will do the trick.
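To see that the copy no longer shares metadata with the original, a minimal sketch:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.zeros((2, 2)))
df_new = df.copy()

df_new.columns = ["a", "b"]
print(df_new.columns)  # Index(['a', 'b'], dtype='object')
print(df.columns)      # RangeIndex(start=0, stop=2, step=1) -- unchanged
Note that copy() duplicates the underlying data as well, so writes to df no longer show up in df_new either.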

Pandas dataframes and PyCharm IntelliSense

When I create new dataframes from old ones using concat or merge, PyCharm IntelliSense stops working for the resulting dataframe unless I explicitly pass it to a DataFrame constructor:
import pandas as pd
d1 = {1: [1, 2, 3], 2: [11, 22, 33]}
d2 = {1: [4], 2: [5]}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
df3 = pd.concat([df1, df2], axis=0)
df3_ = pd.DataFrame(pd.concat([df1, df2], axis=0))
In the above example df3 and df3_ are the "same" dataframe, but IntelliSense only works on df3_. Am I doing something wrong? How can I avoid always having to call the DataFrame constructor and still get IntelliSense out of PyCharm?
The answer is to use a type hint comment like this (note that with import pandas as pd, the comment should reference pd.DataFrame, not pandas.DataFrame):
df3 = pd.concat([df1, df2], axis=0)  # type: pd.DataFrame
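On Python 3.6+ you can also use a variable annotation instead of the type comment, which PyCharm understands as well; a minimal sketch reusing the question's data:
import pandas as pd

df1 = pd.DataFrame({1: [1, 2, 3], 2: [11, 22, 33]})
df2 = pd.DataFrame({1: [4], 2: [5]})

# the annotation tells PyCharm the type without an extra DataFrame() call
df3: pd.DataFrame = pd.concat([df1, df2], axis=0)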

Replace None with NaN in pandas dataframe

I have table x:
website
0 http://www.google.com/
1 http://www.yahoo.com
2 None
I want to replace python None with pandas NaN. I tried:
x.replace(to_replace=None, value=np.nan)
But I got:
TypeError: 'regex' must be a string or a compiled regular expression or a list or dict of strings or regular expressions, you passed a 'bool'
How should I go about it?
You can use DataFrame.fillna or Series.fillna which will replace the Python object None, not the string 'None'.
import pandas as pd
import numpy as np
For dataframe:
df = df.fillna(value=np.nan)
For column or series:
df.mycol.fillna(value=np.nan, inplace=True)
Here's another option:
df.replace(to_replace=[None], value=np.nan, inplace=True)
The following line replaces the string 'None' (not the Python None object) with NaN:
df['column'].replace('None', np.nan, inplace=True)
If you use df.replace([None], np.nan, inplace=True), it changes all datetime objects with missing data to object dtype. So you may end up with broken queries unless you change them back to datetime, which can be taxing depending on the size of your data.
If you want to use this method, you can first identify the object-dtype fields in your df and then replace the None:
obj_columns = list(df.select_dtypes(include=['object']).columns.values)
df[obj_columns] = df[obj_columns].replace([None], np.nan)
This solution is straightforward because it can replace the value in all the columns easily.
You can use a dict:
import pandas as pd
import numpy as np
df = pd.DataFrame([[None, None], [None, None]])
print(df)
      0     1
0  None  None
1  None  None
# replacing
df = df.replace({None: np.nan})
print(df)
    0   1
0 NaN NaN
1 NaN NaN
It's an old question, but here is a solution for multiple columns:
values = {'col_A': 0, 'col_B': 0, 'col_C': 0, 'col_D': 0}
df.fillna(value=values, inplace=True)
For more options, check the docs:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
df['Col_name'].replace("None", np.nan, inplace=True)  # note: this matches the string "None", not the None object
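Since some of the answers above target the Python object None and others the string 'None', here is a minimal sketch of the difference (the column names are made up for illustration):
import pandas as pd
import numpy as np

df = pd.DataFrame({'obj_none': ['a', None], 'str_none': ['a', 'None']})

# the object None is caught by fillna / replace({None: np.nan})
print(df['obj_none'].fillna(value=np.nan))

# the string 'None' is only caught by a string-based replace
print(df['str_none'].replace('None', np.nan))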