This is possible in pandas.
I would like to do it with dask.
Edit: raised on dask here
FYI you can go from an xarray.Dataset to a Dask.DataFrame
Pandas solution using .to_xarray:
import pandas as pd
import numpy as np
df = pd.DataFrame([('falcon', 'bird', 389.0, 2),
                   ('parrot', 'bird', 24.0, 2),
                   ('lion', 'mammal', 80.5, 4),
                   ('monkey', 'mammal', np.nan, 4)],
                  columns=['name', 'class', 'max_speed', 'num_legs'])
df.to_xarray()
<xarray.Dataset>
Dimensions:    (index: 4)
Coordinates:
  * index      (index) int64 0 1 2 3
Data variables:
    name       (index) object 'falcon' 'parrot' 'lion' 'monkey'
    class      (index) object 'bird' 'bird' 'mammal' 'mammal'
    max_speed  (index) float64 389.0 24.0 80.5 nan
    num_legs   (index) int64 2 2 4 4
Dask solution?
import dask.dataframe as dd
ddf = dd.from_pandas(df, 1)
?
I could look at a solution using xarray, but I think it only has .from_dataframe:
import xarray as xr
ds = xr.Dataset.from_dataframe(ddf.compute())
So this is possible, and I've made a PR here that achieves it - https://github.com/pydata/xarray/pull/4659
It provides two methods, Dataset.from_dask_dataframe and DataArray.from_dask_series.
The main reason it hasn't been merged yet is that we're trying to compute the chunk sizes with as few dask computations as possible.
There's some more context in these issues: https://github.com/pydata/xarray/issues/4650, https://github.com/pydata/xarray/issues/3929
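Based purely on the method names above, usage would presumably look like the sketch below once (and if) the PR is merged; the exact signature is an assumption and may change:

import xarray as xr
import dask.dataframe as dd

# hypothetical call to the PR's proposed method; ddf is the dask DataFrame from the question
ds = xr.Dataset.from_dask_dataframe(ddf)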
I was looking for something similar and created this function (it is not perfect, but it works pretty well).
It also keeps all the dask data as dask arrays, which saves memory, etc.
import xarray as xr
import dask.dataframe as dd
def dask_2_xarray(ddf, indexname='index'):
    ds = xr.Dataset()
    ds[indexname] = ddf.index
    for key in ddf.columns:
        ds[key] = (indexname, ddf[key].to_dask_array().compute_chunk_sizes())
    return ds
# use:
ds = dask_2_xarray(ddf)
Example:
path = LOCATION TO FILE
ddf_test = dd.read_hdf(path, key="/data*", sorted_index=True, mode='r')
ds = dask_2_xarray(ddf_test, indexname="time")
ds
Result:
Most time is spent computing the chunk sizes, so if somebody knows a better way to do that, it will be faster.
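One possible way to cut this down (a sketch only, not benchmarked; dask_2_xarray and ddf are the names from above): compute the partition lengths once and pass them to every to_dask_array call, instead of calling compute_chunk_sizes() per column.

import xarray as xr
import dask.dataframe as dd

def dask_2_xarray_known_lengths(ddf, indexname='index'):
    # one small pass over the data: the length of each partition
    lengths = tuple(ddf.map_partitions(len).compute())
    ds = xr.Dataset()
    ds[indexname] = ddf.index
    for key in ddf.columns:
        # chunk sizes are supplied up front, so no per-column computation is needed
        ds[key] = (indexname, ddf[key].to_dask_array(lengths=lengths))
    return ds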
This method doesn't currently exist. If you think it should exist, then I encourage you to raise a GitHub issue as a feature request. You might want to tag some xarray people, though.
Hi community, given a simple example with the iris dataset:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df_iris = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])
df_iris['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df_iris = df_iris.drop("target", axis=1)
df_iris = df_iris[df_iris['species'] != 'setosa']
Why, after filtering out setosa, can I still see that category when printing df_iris.species?
This creates problems when visualizing the data with seaborn afterwards. Resetting the index of the dataframe did not help. How can I remove setosa completely from the dataframe?
Thank you
There's a method just for this, Series.cat.remove_unused_categories: https://pandas.pydata.org/docs/reference/api/pandas.Series.cat.remove_unused_categories.html
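A minimal sketch of applying it to the filtered frame from the question (df_iris is the name used above):

df_iris['species'] = df_iris['species'].cat.remove_unused_categories()
print(df_iris['species'].cat.categories)  # 'setosa' no longer appears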
I have multiple categorical columns with millions of distinct values, so I am using dask and pd.get_dummies to convert these categorical columns into bit vectors, like this:
import pandas as pd
import numpy as np
import scipy.sparse
import dask.dataframe as dd
import multiprocessing
train_set = pd.read_csv('train_set.csv')
def convert_into_one_hot(col1, col2):
    return pd.get_dummies(train_set, columns=[col1, col2], sparse=True)

ddata = dd.from_pandas(train_set, npartitions=2*multiprocessing.cpu_count()).map_partitions(lambda df: df.apply((lambda row: convert_into_one_hot(row.col1, row.col2)), axis=1)).compute(scheduler='processes')
But, I get this error:
ValueError: Metadata inference failed in `lambda`.
You have supplied a custom function and Dask is unable to determine the type of output that that function returns.
To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.
Original error is below:
------------------------
KeyError("None of [Index(['foo'], dtype='object')] are in the [columns]")
What am I doing wrong here? Thanks.
EDIT:
A small example to reproduce the error; I hope it helps in understanding the problem.
def convert_into_one_hot(x, y):
    return pd.get_dummies(df, columns=[x, y], sparse=True)

d = {'col1': ['a', 'b'], 'col2': ['c', 'd']}
df = pd.DataFrame(data=d)

dd.from_pandas(df, npartitions=2*multiprocessing.cpu_count()).map_partitions(lambda df: df.apply((lambda row: convert_into_one_hot(row.col1, row.col2)), axis=1)).compute(scheduler='processes')
I think you could have some problems if you try to use get_dummies within partitions. There is a dask version of get_dummies, which should work as follows:
import pandas as pd
import dask.dataframe as dd
import multiprocessing as mp
d = {'col1': ['a', 'b'], 'col2': ['c', 'd']}
df = pd.DataFrame(data=d)
Pandas
pd.get_dummies(df, columns=["col1", "col2"], sparse=True)
Dask
ddf = dd.from_pandas(df, npartitions=2 * mp.cpu_count())
# you need to convert the column dtypes to category
dummies_cols = ["col1", "col2"]
ddf[dummies_cols] = ddf[dummies_cols].categorize()
dd.get_dummies(ddf, columns=["col1", "col2"], sparse=True)
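A hedged follow-up note: like the rest of dask, dd.get_dummies is lazy, so nothing is materialized until you call compute(); the names below reuse ddf and dummies_cols from the snippet above.

dummies = dd.get_dummies(ddf, columns=dummies_cols, sparse=True)
result = dummies.compute()  # brings the one-hot encoded frame into memory as a pandas DataFrame
print(result.head())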
I am studying PySpark in Databricks. I want to generate a correlation heatmap. Let's say this is my data:
myGraph = spark.createDataFrame([(1.3, 2.1, 3.0),
                                 (2.5, 4.6, 3.1),
                                 (6.5, 7.2, 10.0)],
                                ['col1', 'col2', 'col3'])
And this is my code:
import pyspark
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from ggplot import *
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation
from pyspark.mllib.stat import Statistics
myGraph = spark.createDataFrame([(1.3, 2.1, 3.0),
                                 (2.5, 4.6, 3.1),
                                 (6.5, 7.2, 10.0)],
                                ['col1', 'col2', 'col3'])

vector_col = "corr_features"
assembler = VectorAssembler(inputCols=['col1', 'col2', 'col3'],
                            outputCol=vector_col)
myGraph_vector = assembler.transform(myGraph).select(vector_col)
matrix = Correlation.corr(myGraph_vector, vector_col)
matrix.collect()[0]["pearson({})".format(vector_col)].values
Up to here, I can get the correlation matrix values.
Now my problems are:
How do I convert the matrix to a data frame? I have tried the methods from "How to convert DenseMatrix to spark DataFrame in pyspark?" and "How to get correlation matrix values pyspark", but they do not work for me.
How do I generate a correlation heatmap from it?
I have only just started learning PySpark and Databricks, so either ggplot or matplotlib is fine for my problem.
I think the point where you get confused is:
matrix.collect()[0]["pearson({})".format(vector_col)].values
Calling .values on a DenseMatrix gives you a flat list of all the values, but what you are actually looking for is a list of lists representing the correlation matrix.
import matplotlib.pyplot as plt
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation
columns = ['col1', 'col2', 'col3']

myGraph = spark.createDataFrame([(1.3, 2.1, 3.0),
                                 (2.5, 4.6, 3.1),
                                 (6.5, 7.2, 10.0)],
                                columns)

vector_col = "corr_features"
assembler = VectorAssembler(inputCols=['col1', 'col2', 'col3'],
                            outputCol=vector_col)
myGraph_vector = assembler.transform(myGraph).select(vector_col)
matrix = Correlation.corr(myGraph_vector, vector_col)
Up to now this was basically your code. Instead of calling .values, you should use .toArray().tolist() to get a list of lists representing the correlation matrix:
matrix = Correlation.corr(myGraph_vector, vector_col).collect()[0][0]
corrmatrix = matrix.toArray().tolist()
print(corrmatrix)
Output:
[[1.0, 0.9582184104641529, 0.9780872729407004], [0.9582184104641529, 1.0, 0.8776695567739841], [0.9780872729407004, 0.8776695567739841, 1.0]]
The advantage of this approach is that you can easily turn a list of lists into a dataframe:
df = spark.createDataFrame(corrmatrix,columns)
df.show()
Output:
+------------------+------------------+------------------+
| col1| col2| col3|
+------------------+------------------+------------------+
| 1.0|0.9582184104641529|0.9780872729407004|
|0.9582184104641529| 1.0|0.8776695567739841|
|0.9780872729407004|0.8776695567739841| 1.0|
+------------------+------------------+------------------+
To answer your second question: this is just one of the many ways to plot a heatmap (like this or this, or even better with seaborn; a seaborn sketch follows the code below).
def plot_corr_matrix(correlations, attr, fig_no):
    fig = plt.figure(fig_no)
    ax = fig.add_subplot(111)
    ax.set_title("Correlation Matrix for Specified Attributes")
    ax.set_xticklabels([''] + attr)
    ax.set_yticklabels([''] + attr)
    cax = ax.matshow(correlations, vmax=1, vmin=-1)
    fig.colorbar(cax)
    plt.show()

plot_corr_matrix(corrmatrix, columns, 234)
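As mentioned above, seaborn can produce the same heatmap in a couple of lines; a minimal sketch, assuming seaborn is installed and reusing corrmatrix and columns from earlier:

import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(corrmatrix, xticklabels=columns, yticklabels=columns,
            vmin=-1, vmax=1, annot=True, cmap="coolwarm")
plt.show()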
I'm doing a k-means clustering of activities on some open source projects on GitHub and am trying to plot the results together with the cluster centroids using Seaborn Scatterplot Matrix.
I can successfully plot the results of the clustering analysis (example tsv output below)
user_id issue_comments issues_created pull_request_review_comments pull_requests category
1 0.14936519790888722 2.0100502512562812 0.0 0.60790273556231 Group 0
1882 0.11202389843166542 0.5025125628140703 0.0 0.0 Group 1
2 2.315160567587752 20.603015075376884 0.13297872340425532 1.21580547112462 Group 2
1789 36.8185212845407 82.91457286432161 75.66489361702128 74.46808510638297 Group 3
The problem I'm having is that I'd like to be able to also plot the centroids of the clusters on the matrix plot. Currently my plotting script looks like this:
import seaborn as sns
import pandas as pd
from pylab import savefig
sns.set()
# The first column (user_id) is used as the index,
# so it is not treated as a plotting variable
data = pd.read_csv('summary_clusters.tsv', sep='\t', index_col=0)
grid = sns.pairplot(data, hue="category", diag_kind="kde")
savefig('normalised_clusters.png', dpi = 150)
This produces the expected output:
I'd like to be able to mark on each of these plots the centroids of the clusters. I can think of two ways to do this:
1. Create a new 'CENTROID' category and just plot this together with the other points.
2. Manually add extra points to the plots after calling sns.pairplot(data, hue="category", diag_kind="kde").
If (1) is the solution then I'd like to be able to customise the marker (perhaps a star?) to make it more prominent.
If (2) I'm all ears. I'm pretty new to Seaborn and Matplotlib so any assistance would be very welcome :-)
pairplot isn't going to be all that well suited to this sort of thing, but it's possible to make it work with a few tricks. Here's what I would do.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
sns.set_color_codes()
# Make some random iid data
cov = np.eye(3)
ds = np.vstack([np.random.multivariate_normal([0, 0, 0], cov, 50),
                np.random.multivariate_normal([1, 1, 1], cov, 50)])
ds = pd.DataFrame(ds, columns=["x", "y", "z"])
# Fit the k means model and label the observations
km = KMeans(2).fit(ds)
ds["label"] = km.labels_.astype(str)
Now comes the non-obvious part: you need to create a dataframe with the centroid locations and then combine it with the dataframe of observations while identifying the centroids as appropriate using the label column:
centroids = pd.DataFrame(km.cluster_centers_, columns=["x", "y", "z"])
centroids["label"] = ["0 centroid", "1 centroid"]
full_ds = pd.concat([ds, centroids], ignore_index=True)
Then you just need to use PairGrid, which is a bit more flexible than pairplot and will allow you to map other plot attributes by the hue variable along with the color (at the expense of not being able to draw histograms on the diagonals):
g = sns.PairGrid(full_ds, hue="label",
hue_order=["0", "1", "0 centroid", "1 centroid"],
palette=["b", "r", "b", "r"],
hue_kws={"s": [20, 20, 500, 500],
"marker": ["o", "o", "*", "*"]})
g.map(plt.scatter, linewidth=1, edgecolor="w")
g.add_legend()
An alternate solution would be to plot the observations as normal then change the data attributes on the PairGrid object and add a new layer. I'd call this a hack, but in some ways it's more straightforward.
# Plot the data
g = sns.pairplot(ds, hue="label", vars=["x", "y", "z"], palette=["b", "r"])
# Change the PairGrid dataset and add a new layer
centroids = pd.DataFrame(km.cluster_centers_, columns=["x", "y", "z"])
g.data = centroids
g.hue_vals = [0, 1]
g.map_offdiag(plt.scatter, s=500, marker="*")
I know I'm a bit late to the party, but here is a generalized version of mwaskom's code that works with n clusters. It might save someone a few minutes.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
def cluster_scatter_matrix(data_norm, cluster_number):
    sns.set_color_codes()
    km = KMeans(cluster_number).fit(data_norm)
    # build the centroid frame before adding the label column, so the columns match
    centroids = pd.DataFrame(km.cluster_centers_, columns=data_norm.columns)
    data_norm["label"] = km.labels_.astype(str)
    centroids["label"] = [str(n) + " centroid" for n in range(cluster_number)]
    full_ds = pd.concat([data_norm, centroids], ignore_index=True)
    g = sns.PairGrid(full_ds, hue="label",
                     hue_order=[str(n) for n in range(cluster_number)]
                               + [str(n) + " centroid" for n in range(cluster_number)],
                     # palette=["b", "r", "b", "r"],
                     hue_kws={"s": [20 for n in range(cluster_number)]
                                   + [500 for n in range(cluster_number)],
                              "marker": ['o' for n in range(cluster_number)]
                                        + ['*' for n in range(cluster_number)]})
    g.map(plt.scatter, linewidth=1, edgecolor="w")
    g.add_legend()
    return g
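A hypothetical usage example, reusing the random ds frame from the answer above (selecting only the feature columns is an assumption, since the function adds a "label" column to the frame it is given):

feats = ds[["x", "y", "z"]].copy()
g = cluster_scatter_matrix(feats, cluster_number=2)
plt.show()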
There is lots of information on how to read a CSV into a pandas dataframe, but what I have is a PyTables table and I want a pandas DataFrame.
I've found out how to store my pandas DataFrame in PyTables... but when I want to read it back, at that point it will have:
"kind = v._v_attrs.pandas_type"
I could write it out as CSV and re-read it, but that seems silly; it is what I am doing for now.
How should I be reading PyTables objects into pandas?
import tables as pt
import pandas as pd
import numpy as np

# the content is just placeholder zeros; we don't care about the values
grades = np.zeros((10,), dtype=[('name', 'S20'), ('grade', 'u2')])

# write to a PyTables table
handle = pt.open_file('/tmp/test_pandas.h5', 'w')
handle.create_table('/', 'grades', grades)
print(handle.root.grades[:].dtype)  # it is a structured array

# load back as a DataFrame and check types
df = pd.DataFrame.from_records(handle.root.grades[:])
print(df.dtypes)
handle.close()
Beware that your u2 (unsigned 2-byte integer) will end up as an i8 (8-byte integer), and the strings will be objects, because pandas does not yet support the full range of dtypes that are available for NumPy arrays.
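If the original dtypes matter downstream, a minimal sketch of casting back by hand (the column names follow the example above; the decode assumes the bytes are valid UTF-8):

df['grade'] = df['grade'].astype('u2')       # back to unsigned 2-byte integers
df['name'] = df['name'].str.decode('utf-8')  # bytes -> str
print(df.dtypes)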
The docs now include an excellent section on using the HDF5 store and there are some more advanced strategies discussed in the cookbook.
It's now relatively straightforward:
In [1]: from pandas import DataFrame, HDFStore

In [2]: store = HDFStore('store.h5')

In [3]: print(store)
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
Empty

In [4]: df = DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])

In [5]: store['df'] = df

In [6]: store
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
/df            frame        (shape->[2,2])
And to retrieve from HDF5/pytables:
In [7]: store['df']  # store.get('df') is equivalent
Out[7]:
   A  B
0  1  2
1  3  4
You can also query within a table.
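For completeness, a minimal sketch of such a query, assuming the same df as above; it requires storing in the 'table' format with data columns (the key name df_table is arbitrary), and the exact display may differ by pandas version:

In [8]: store.put('df_table', df, format='table', data_columns=True)

In [9]: store.select('df_table', where='A > 1')
Out[9]:
   A  B
1  3  4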