Running df.apply, dask and pd.get_dummies together - pandas

I have multiple categorical columns, each with millions of distinct values. So I am using Dask and pd.get_dummies to convert these categorical columns into bit vectors, like this:
import pandas as pd
import numpy as np
import scipy.sparse
import dask.dataframe as dd
import multiprocessing
train_set = pd.read_csv('train_set.csv')
def convert_into_one_hot(col1, col2):
    return pd.get_dummies(train_set, columns=[col1, col2], sparse=True)

ddata = (
    dd.from_pandas(train_set, npartitions=2 * multiprocessing.cpu_count())
    .map_partitions(lambda df: df.apply(lambda row: convert_into_one_hot(row.col1, row.col2), axis=1))
    .compute(scheduler='processes')
)
But I get this error:
ValueError: Metadata inference failed in `lambda`.
You have supplied a custom function and Dask is unable to determine the type of output that that function returns.
To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.
Original error is below:
------------------------
KeyError("None of [Index(['foo'], dtype='object')] are in the [columns]")
What am I doing wrong here? Thanks.
EDIT:
A small example to reproduce the error. Hope it helps to understand the problem.
def convert_into_one_hot(x, y):
    return pd.get_dummies(df, columns=[x, y], sparse=True)

d = {'col1': ['a', 'b'], 'col2': ['c', 'd']}
df = pd.DataFrame(data=d)
dd.from_pandas(df, npartitions=2 * multiprocessing.cpu_count()).map_partitions(
    lambda df: df.apply(lambda row: convert_into_one_hot(row.col1, row.col2), axis=1)
).compute(scheduler='processes')

I think you could run into problems if you try to use get_dummies within partitions. There is a Dask version of get_dummies, and it should work as follows:
import pandas as pd
import dask.dataframe as dd
import multiprocessing as mp
d = {'col1': ['a', 'b'], 'col2': ['c', 'd']}
df = pd.DataFrame(data=d)
Pandas
pd.get_dummies(df, columns=["col1", "col2"], sparse=True)
Dask
ddf = dd.from_pandas(df, npartitions=2 * mp.cpu_count())
# you need to convert the column dtypes to category first
dummies_cols = ["col1", "col2"]
ddf[dummies_cols] = ddf[dummies_cols].categorize()
dd.get_dummies(ddf, columns=["col1", "col2"], sparse=True)
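Note that dd.get_dummies is lazy like the rest of Dask, so nothing is materialized until you call .compute(). A minimal sketch continuing from the code above:
# dd.get_dummies returns a lazy Dask DataFrame; compute() brings it back to pandas
dummies = dd.get_dummies(ddf, columns=dummies_cols, sparse=True)
result = dummies.compute()
print(result.head())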

Related

pandas DataFrame value_counts on column that stores DataFrame

I am trying to use value_counts() on a pandas DataFrame column that stores another DataFrame.
Is there a way to get value_counts() (or something similar) working without having to transform my DataFrames into strings or hashes or something like that?
I've tried to count the inner DataFrames, which breaks completely, and then I tried with arrays, where it seems the comparison is not made correctly either:
# importing pandas
import pandas as pd
import numpy as np
# Creating Arrys
ar1 = np.array([11,22])
ar2 = np.array([11,22])
ar3 = np.array([33,44])
df = pd.DataFrame([
    ['0', ar1],
    ['1', ar2],
    ['2', ar3]
], columns=['str', 'ars'])
print(df["ars"].value_counts())
Expected:
[11, 22] 2
[33, 44] 1
Actual:
[11, 22] 1
[11, 22] 1
[33, 44] 1
# importing pandas
import pandas as pd
import numpy as np
# Creating Arrys
df1 = pd.DataFrame({'col1': [11], 'col2': [22]})
df2 = pd.DataFrame({'col1': [11], 'col2': [22]})
df3 = pd.DataFrame({'col1': [33], 'col2': [44]})
df = pd.DataFrame([
    ['0', df1],
    ['1', df2],
    ['2', df3]
], columns=['str', 'dfs'])
print(df["dfs"].value_counts())
Expected:
{} 2
{} 1
Actual:
BREAKS COMPLETELY
How can I achieve the count of complex values in a DataFrame?
I'm honestly confused how either of those managed to run without raising an exception.
Neither np.array nor pd.DataFrame is hashable, and as far as I understand, hashability is necessary for value_counts.
Case in point: neither of your examples can be translated to their DataFrame.value_counts equivalent, because underneath it is doing df.groupby(["ars"], dropna=True).grouper.size(), which requires hashing.
>>> df.value_counts(["ars"])
TypeError: unhashable type: 'numpy.ndarray'
Overall, I would not count on any value_counts method working on non-hashable columns.
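If converting the data is acceptable, a common workaround is to map the arrays to hashable tuples before counting. A minimal sketch based on your first example:
import numpy as np
import pandas as pd

ar1 = np.array([11, 22])
ar2 = np.array([11, 22])
ar3 = np.array([33, 44])
df = pd.DataFrame([
    ['0', ar1],
    ['1', ar2],
    ['2', ar3]
], columns=['str', 'ars'])

# tuples are hashable, so value_counts can group them
print(df['ars'].map(tuple).value_counts())
# (11, 22)    2
# (33, 44)    1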

how to assign different markers to the max value found in each column in the plot

How would I assign a different marker symbol to the max value found in each curve? I.e., four different markers, each showing the max value of one curve.
Here is my attempt
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
df = pd.DataFrame(np.random.randint(0,1000,size=(100, 4)), columns=list('ABCD'))
maxValues=df.max()
m=['o', '.', ',', 'x',]
df.plot()
plt.plot(maxValues, marker=m)
In my real df, the number of columns will vary.
You can do it this way. Note that I used a 'v' instead of ',', as the comma (pixel) marker wasn't showing up clearly.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
df = pd.DataFrame(np.random.randint(0,1000,size=(100, 4)), columns=list('ABCD'))
df.plot(figsize=(20,5))
mrk = pd.DataFrame({'A': [df[['A']].idxmax()[0], df['A'].max(), 'o'],
                    'B': [df[['B']].idxmax()[0], df['B'].max(), '.'],
                    'C': [df[['C']].idxmax()[0], df['C'].max(), 'v'],
                    'D': [df[['D']].idxmax()[0], df['D'].max(), 'x']})
for col in range(len(mrk.columns)):
    plt.plot(mrk.iloc[0, col], mrk.iloc[1, col], marker=mrk.iloc[2, col], markersize=20)
I created the mrk dataframe manually as it was small, but you can use loops to go through the various columns in your real data. Adjust markersize to increase or decrease the size of the markers.
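Since the number of columns varies in your real data, here is one possible sketch that loops over df.columns directly and cycles through a list of marker symbols (the marker choice is arbitrary):
import itertools
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

df = pd.DataFrame(np.random.randint(0, 1000, size=(100, 4)), columns=list('ABCD'))

ax = df.plot(figsize=(20, 5))
markers = itertools.cycle(['o', '.', 'v', 'x', 's', '^'])
for col in df.columns:
    # mark the (index, value) position of each column's maximum
    ax.plot(df[col].idxmax(), df[col].max(), marker=next(markers),
            markersize=20, linestyle='none')
plt.show()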

FacetGrid plot with aggregate in Seaborn/other library

I have a toy dataframe like this:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({'cat': ['a', 'a', 'a', 'b', 'b', 'b'], 'n1': [1,1,1,4,5,6], 'n2': [6,5,2,2,2,1]})
I want to group by cat and plot histograms for n1 and n2; additionally, I want to plot those histograms without grouping. So first, transform the data to seaborn's long format:
df2 = pd.melt(df, id_vars='cat', value_vars=['n1', 'n2'], value_name='value')
second, add an "all" category:
df_all = df2.copy()
df_all['cat'] = 'all'
df3 = pd.concat([df2, df_all])
Finally plot:
g = sns.FacetGrid(df3, col="variable", row="cat")
g.map(plt.hist, 'value', ec="k")
I wonder if it could be done in a more elegant, concise way, without creating df3 or df2. A different library could be used.
As I mentioned in my comment, I think what you are doing is perfectly fine. Wrap it in a function if you need it often. Nevertheless, you might be interested in pandas_profiling. It describes the profile of your data in detail, and in an interactive way. In my opinion, this is probably overkill for what you want to do, but I'll let you be the judge of that ;)
import pandas_profiling
df.profile_report()
Extract of the interactive output:
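As for wrapping the melt/concat/plot steps in a reusable helper, a minimal sketch could look like this (the function name and signature are just illustrative):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def facet_hists_with_all(df, cat_col, value_cols):
    # melt to long format, then append a copy of the rows labelled 'all'
    long_df = pd.melt(df, id_vars=cat_col, value_vars=value_cols, value_name='value')
    long_df = pd.concat([long_df, long_df.assign(**{cat_col: 'all'})])
    g = sns.FacetGrid(long_df, col='variable', row=cat_col)
    g.map(plt.hist, 'value', ec='k')
    return g

df = pd.DataFrame({'cat': ['a', 'a', 'a', 'b', 'b', 'b'], 'n1': [1, 1, 1, 4, 5, 6], 'n2': [6, 5, 2, 2, 2, 1]})
facet_hists_with_all(df, 'cat', ['n1', 'n2'])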

Iterate Over columns in pandas dataframe using list comprehension

I would like to perform the following operation using a list comprehension:
import numpy as np
import pandas as pd
import seaborn as sns
df = sns.load_dataset('tips')
df.head()
for i in df.columns:
    print(df.loc[:, i].is_unique)
Using [x.is_unique for x in df.loc[:, i] for i in df.columns] does not work, since that comprehension ends up iterating over the values of a single column rather than over the columns themselves.
Use Series.is_unique with a single for:
out = [df[i].is_unique for i in df.columns]
Alternative solution (I prefer the first, as iterating over the columns is more explicit):
out = [df[i].is_unique for i in df]
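If you also want to keep track of which column each flag belongs to, a dict comprehension is a natural variation (just a sketch):
unique_flags = {col: df[col].is_unique for col in df}
print(unique_flags)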

Matplotlib Bar Graph Yaxis not being set to 0 [duplicate]

My DataFrame's structure
trx.columns
Index(['dest', 'orig', 'timestamp', 'transcode', 'amount'], dtype='object')
I'm trying to plot transcode (transaction code) against amount to see how much money is spent per transaction. I made sure to convert transcode to a categorical type, as seen below.
trx['transcode']
...
Name: transcode, Length: 21893, dtype: category
Categories (3, int64): [1, 17, 99]
The result I get from doing plt.scatter(trx['transcode'], trx['amount']) is
Scatter plot
While the above plot is not entirely wrong, I would like the X axis to contain just the three possible values of transcode [1, 17, 99] instead of the entire [1, 100] range.
Thanks!
In matplotlib 2.1 you can plot categorical variables by using strings, i.e. if you provide the column for the x values as strings, it will recognize them as categories.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({"x" : np.random.choice([1,17,99], size=100),
"y" : np.random.rand(100)*100})
plt.scatter(df["x"].astype(str), df["y"])
plt.margins(x=0.5)
plt.show()
In order to obtain the same in matplotlib <= 2.0, one would plot against some index instead.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({"x" : np.random.choice([1,17,99], size=100),
"y" : np.random.rand(100)*100})
u, inv = np.unique(df["x"], return_inverse=True)
plt.scatter(inv, df["y"])
plt.xticks(range(len(u)),u)
plt.margins(x=0.5)
plt.show()
The same plot can be obtained using seaborn's stripplot:
import seaborn as sns
sns.stripplot(x="x", y="y", data=df)
And a potentially nicer representation can be done via seaborn's swarmplot:
sns.swarmplot(x="x", y="y", data=df)