How to groupby one or two columns and sum 100 columns - pandas

*updated original question
Example code:
import pandas as pd

df = pd.DataFrame({'Weight': [1.2, 2.0, 1.8, 2.4, 1.9, 2.3],
                   'Sex': ['Male', 'Female', 'Unknown', 'Male', 'Male', 'Female'],
                   'Neutered': ['Entire', 'Unknown', 'Neutered', 'Neutered', 'Neutered', 'Unknown'],
                   'Rabbit_Breed': ['Dutch', 'Lop', 'Dwarf', 'Giant', 'Cross-Breed', 'Dwarf'],
                   'Abscess-mouth': [0, 0, 1, 0, 0, 0],
                   'Overweight': [0, 1, 0, 1, 0, 1],
                   'underweight': [0, 0, 1, 0, 0, 1],
                   'molars-long': [1, 0, 1, 0, 0, 1]})
df.head()
NB: I have around 100 columns, so I cannot list them all; I'm looking for a way to group by and/or sum across all the columns to find the most common disorders in relation to the breed or sex of a rabbit.
I've attached an image of my thought process:
original question:
I'm looking to group by one or two columns and sum all the other columns. I'm not sure if I should use a range or what, but I keep getting errors.
Unless I've misunderstood the purpose of groupby and sum: I've got about 100 columns of disorders in domestic rabbits, and ultimately I'm trying to investigate the most common ones and plot them against breed, female/male, etc.
Thank you!!

Plotting a histogram doesn't make much sense to me, since you want to plot bivariate data (disorder vs. breed), while histograms are meant for univariate data. I think you want a heatmap, which is basically the generalization of a histogram for two dimensions. For that, you can use seaborn.heatmap.
Is this what you want?
import pandas as pd
import seaborn as sns

df = pd.DataFrame({'Weight': [1.2, 2.0, 1.8, 2.4, 1.9, 2.3],
                   'Sex': ['Male', 'Female', 'Unknown', 'Male', 'Male', 'Female'],
                   'Neutered': ['Entire', 'Unknown', 'Neutered', 'Neutered', 'Neutered', 'Unknown'],
                   'Rabbit_Breed': ['Dutch', 'Lop', 'Dwarf', 'Giant', 'Cross-Breed', 'Dwarf'],
                   'Abscess-mouth': [0, 0, 1, 0, 0, 0],
                   'Overweight': [0, 1, 0, 1, 0, 1],
                   'underweight': [0, 0, 1, 0, 0, 1],
                   'molars-long': [1, 0, 1, 0, 0, 1]})
disorder_cols = ['Abscess-mouth', 'Overweight', 'underweight', 'molars-long']
disorder_by_breed = df.groupby('Rabbit_Breed')[disorder_cols].sum()
sns.heatmap(data=disorder_by_breed, annot=True, lw=1, cmap='Reds')
Output:

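Since the real data has around 100 disorder columns, hard-coding disorder_cols won't scale. Here is a minimal sketch of how the column list could be built programmatically (my addition; meta_cols is an assumption about which columns are not disorder flags):

# Assumption: every column that isn't rabbit metadata is a 0/1 disorder flag;
# adjust meta_cols to match the real dataset.
meta_cols = ['Weight', 'Sex', 'Neutered', 'Rabbit_Breed']
disorder_cols = [c for c in df.columns if c not in meta_cols]

# Same groupby/sum as above, but over the generated list of ~100 columns
disorder_by_breed = df.groupby('Rabbit_Breed')[disorder_cols].sum()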
Related

How to define NetworkX graph from pandas dataframe having multiple columns

I have a pandas dataframe that captures whether an invoice has been raised as a dispute or not, based on some characteristics. I would like to run community detection on top of this to search for patterns, but I'm confused about how to create a graph from it. I tried the following:
import pandas as pd
import networkx as nx
from itertools import combinations as comb
data = [[4321, 543, 765, 3, 2014, 54, 0, 1, 0, 1, 0],
        [2321, 657, 654, 7, 2017, 59, 1, 0, 1, 0, 1]]
df = pd.DataFrame(data, columns=['NetValueInDocCurr', 'NetWeight', 'Volume',
                                 'BillingItems', 'FISCALYEAR', 'TaxAmtInDocCurr',
                                 'Description_Bulk', 'Description_Car_Care',
                                 'Description_Packed', 'Description_Services',
                                 'Final_Dispute'])
edges = set(comb(df.columns,2))
G = nx.Graph()
G.add_edges_from(edges)
My current assumption is to define each column name as a node, the pairwise relationship between all columns as edges, and the column values as edge weights. Is this the right approach? If yes, any help on the code to define the weights? My idea is to start with a complete graph and use divisive methods like Girvan-Newman.
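The question leaves the weight definition open. As a hedged sketch (my addition, not the asker's method, reusing the df built above), one common choice is to weight each column pair by the absolute correlation of the two columns; Girvan-Newman can then be run on the resulting complete graph:

import networkx as nx
from itertools import combinations
from networkx.algorithms.community import girvan_newman

G = nx.Graph()
for u, v in combinations(df.columns, 2):
    # Assumed weight: absolute Pearson correlation between the two columns.
    # (With only the two sample rows this is degenerate; it is meant for the
    # full dataset.)
    G.add_edge(u, v, weight=abs(df[u].corr(df[v])))

# First split produced by Girvan-Newman; note it uses unweighted edge
# betweenness by default, so pass most_valuable_edge to make it weight-aware.
communities = next(girvan_newman(G))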

How to return one NumPy array per partition in Dask?

I need to compute many NumPy arrays (that can be up to 4-dimensional), one for each partition of a Dask dataframe, and then add them as arrays. However, I'm struggling to make map_partitions return an array for each partition instead of a single array for all of them.
import dask.dataframe as dd
import numpy as np, pandas as pd

df = pd.DataFrame(range(15), columns=['x'])
ddf = dd.from_pandas(df, npartitions=3)

def func(partition):
    # Here I also tried returning the array in a list and in a tuple
    return np.array([[1, 2], [3, 4]])

# Here I tried all the options available for 'meta'
results = ddf.map_partitions(func).compute()
Then results is:
array([[1, 2],
       [3, 4],
       [1, 2],
       [3, 4],
       [1, 2],
       [3, 4]])
And if, instead, I do ddf.map_partitions(func).sum().compute(), I get 30.
What I'd like to get is:
[np.array([[1, 2],[3, 4]]), np.array([[1, 2],[3, 4]]), np.array([[1, 2],[3, 4]])]
So that if I compute the sum, I get:
array([[ 3,  6],
       [ 9, 12]])
How can you achieve this result with Dask?
I managed to make it work like this, but I don't know if this is the best way:
from dask import delayed

results = []
for partition in ddf.partitions:
    result = delayed(func)(partition)
    results.append(result)

delayed(sum)(results).compute()
The result of the computation is:
array([[ 3,  6],
       [ 9, 12]])
You are right: a dask array is usually to be viewed as a single logical array that just happens to be made of pieces. Since you are not using the logical layer, you could have done your work with delayed alone. On the other hand, it seems like the end result you want really is a sum over all the data, so maybe even simpler would be an appropriate reshape and sum(axis=)?
ddf.map_partitions(func).compute_chunk_sizes().reshape(
    -1, 2, 2).sum(axis=0).compute()
(compute_chunk_sizes is needed because, although your original pandas dataframe had a known size, Dask has not yet evaluated your function to know what sizes it returned)
However, given your setup, the following would work and be more similar to your original attempt; see .to_delayed():
import dask

list_of_delayed = ddf.map_partitions(func).to_delayed().tolist()
tuple_of_np_lists = dask.compute(*list_of_delayed)
(tolist forces evaluating the contained delayed objects)
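If you then want the element-wise sum of those per-partition arrays, a short follow-up (my addition, reusing the names above):

import numpy as np

# Stack the per-partition 2x2 arrays and sum them element-wise
total = np.sum(tuple_of_np_lists, axis=0)
# array([[ 3,  6],
#        [ 9, 12]])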

Converting pandas dataframe to scipy sparse arrays

Converting a pandas data frame with mixed column types -- numerical, ordinal, and categorical -- to SciPy sparse arrays is a central problem in machine learning.
Now, if my pandas data frame consists of only numerical data, then I can simply do the following to convert the data frame to a sparse CSR matrix:
scipy.sparse.csr_matrix(df.values)
and if my data frame consists of ordinal data types, I can handle them using LabelEncoder:
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder

d = defaultdict(LabelEncoder)
fit = df.apply(lambda x: d[x.name].fit_transform(x))
Then, I can again use the following and the problem is solved:
scipy.sparse.csr_matrix(df.values)
Categorical variables with a low number of values are also not a concern. They can easily be handled using pd.get_dummies (the pandas or scikit-learn version).
My main concern is for categorical variables with a large number of values.
The main problem: How to handle categorical variables with a large number of values?
pd.get_dummies(train_set, columns=[categorical_columns_with_large_number_of_values], sparse=True)
takes a lot of time.
This question seems to be giving interesting directions, but it is not clear whether it handles all the data types efficiently.
Let me know if you know the efficient way. Thanks.
You can convert any single column to a sparse COO matrix very easily with factorize. This will be MUCH faster than building a giant dense dataframe.
import numpy as np
import pandas as pd
import scipy.sparse

data = pd.DataFrame({"A": ["1", "2", "A", "C", "A"]})
c, u = pd.factorize(data['A'])
n, m = data.shape[0], u.shape[0]
one_hot = scipy.sparse.coo_matrix((np.ones(n, dtype=np.int16), (np.arange(n), c)),
                                  shape=(n, m))
You'll get something that looks like this:
>>> one_hot.A
array([[1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1],
       [0, 0, 1, 0]], dtype=int16)
>>> u
Index(['1', '2', 'A', 'C'], dtype='object')
The rows are your dataframe rows and the columns are the factors of your column (u holds the labels for those columns, in order).
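To extend this to a whole mixed-type frame, here is a minimal sketch (my addition, not part of the original answer) that encodes each categorical column this way and stacks the pieces side by side with scipy.sparse.hstack, assuming numeric columns can be passed through as-is:

import numpy as np
import pandas as pd
import scipy.sparse

def one_hot_sparse(col):
    # factorize-based one-hot encoding of a single column, as above
    codes, uniques = pd.factorize(col)
    n, m = len(col), len(uniques)
    return scipy.sparse.coo_matrix(
        (np.ones(n, dtype=np.int16), (np.arange(n), codes)), shape=(n, m))

# Hypothetical mixed frame: one high-cardinality categorical, one numeric column
df = pd.DataFrame({"A": ["1", "2", "A", "C", "A"],
                   "num": [0.5, 1.0, 1.5, 2.0, 2.5]})

parts = [one_hot_sparse(df["A"]),                      # sparse one-hot block
         scipy.sparse.csr_matrix(df[["num"]].values)]  # numeric passthrough
X = scipy.sparse.hstack(parts).tocsr()                 # shape (5, 4 + 1)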

Inconsistent internal representation of dates in matplotlib/pandas

import pandas as pd
index = pd.to_datetime(['2016-05-01', '2016-11-01', '2017-05-02'])
data = pd.DataFrame({'a': [1, 2, 3],
                     'b': [4, 5, 6]}, index=index)
ax = data.plot()
print(ax.get_xlim())
# Out: (736066.7, 736469.3)
Now, if we change the last date.
index = pd.to_datetime(['2016-05-01', '2016-11-01', '2017-05-01'])
data = pd.DataFrame({'a': [1, 2, 3],
                     'b': [4, 5, 6]}, index=index)
ax = data.plot()
print(ax.get_xlim())
# Out: (184.8, 189.2)
The first example seems consistent with the matplotlib docs:
Matplotlib represents dates using floating point numbers specifying the number of days since 0001-01-01 UTC, plus 1
Why does the second example return something seemingly completely different? I'm using pandas version 0.22.0 and matplotlib version 2.2.2.
In the second example, if you look at the plot, matplotlib is showing quarter values rather than dates:
The dates in this case are exactly six months, and therefore two quarters, apart, which is presumably why you're seeing this behavior. While I can't find it in the docs, the numbers given by xlim in this case are consistent with being the number of quarters since the Unix epoch (Jan. 1, 1970).
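One way to sanity-check that reading (my addition, not part of the original answer): at quarterly frequency, pandas Period ordinals count quarters from 1970, and they line up with the xlim above.

import pandas as pd

# 2016-05-01 falls in 2016Q2, 2017-05-01 in 2017Q2
print(pd.Period('2016-05-01', freq='Q').ordinal)  # 185
print(pd.Period('2017-05-01', freq='Q').ordinal)  # 189
# xlim of (184.8, 189.2) is exactly this range plus a small margin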
Pandas uses different units to represent dates and times on the axes, depending on the range of dates/times in use. This means that different locators are in use.
In the first case,
print(ax.xaxis.get_major_locator())
# Out: pandas.plotting._converter.PandasAutoDateLocator
in the second case
print(ax.xaxis.get_major_locator())
# pandas.plotting._converter.TimeSeries_DateLocator
You may force pandas to always use the PandasAutoDateLocator using the x_compat argument,
df.plot(x_compat=True)
This ensures that you always get the same datetime definition, consistent with the matplotlib.dates convention.
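A quick check (my addition, using the second example's frame):

ax = data.plot(x_compat=True)
print(ax.get_xlim())
# Back to day counts on the order of 736000 (days since 0001-01-01 under
# matplotlib 2.2's date convention), as in the first example.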
The drawback is that this removes the nice quarterly ticking and replaces it with the standard ticking.
On the other hand, it then allows you to use the very customizable matplotlib.dates tickers and formatters. For example, to get quarterly ticks/labels:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.ticker as mticker
import pandas as pd

index = pd.to_datetime(['2016-05-01', '2016-11-01', '2017-05-01'])
data = pd.DataFrame({'a': [1, 2, 3],
                     'b': [4, 5, 6]}, index=index)

ax = data.plot(x_compat=True)

# Quarterly ticks
ax.xaxis.set_major_locator(mdates.MonthLocator((1, 4, 7, 10)))

# Formatting:
def func(x, pos):
    q = (mdates.num2date(x).month - 1) // 3 + 1
    tx = "Q{}".format(q)
    if q == 1:
        tx += "\n{}".format(mdates.num2date(x).year)
    return tx

ax.xaxis.set_major_formatter(mticker.FuncFormatter(func))
plt.setp(ax.get_xticklabels(), rotation=0, ha="center")
plt.show()

Seaborn groupby pandas Series

I want to visualize my data into box plots that are grouped by another variable shown here in my terrible drawing:
So what I do is use a pandas Series variable to tell pandas that I have grouped variables; this is what I do:
import pandas as pd
import seaborn as sns

# example data for reproducibility
a = pd.DataFrame(
    [
        [2, 1],
        [4, 2],
        [5, 1],
        [10, 2],
        [9, 2],
        [3, 1]
    ])

# converting second column to Series
a.ix[:, 1] = pd.Series(a.ix[:, 1])

# Plotting by seaborn
sns.boxplot(a, groupby=a.ix[:, 1])
And this is what I get:
However, what I expected to get was two boxplots, each describing only the first column, grouped by the corresponding value in the second column (the one converted to a Series); the above plot instead shows each column separately, which is not what I want.
A column in a DataFrame is already a Series, so your conversion is not necessary. Furthermore, if you only want to use the first column for both boxplots, you should pass only that to Seaborn.
So:
# example data for reproducibility
df = pd.DataFrame(
    [
        [2, 1],
        [4, 2],
        [5, 1],
        [10, 2],
        [9, 2],
        [3, 1]
    ], columns=['a', 'b'])

# Plotting by seaborn
sns.boxplot(df.a, groupby=df.b)
I changed your example a little bit; giving the columns labels makes it a bit clearer, in my opinion.
edit:
If you want to plot all columns separately, you (I think) basically want all combinations of the values in your groupby column and any other column. So if your DataFrame looks like this:
    a   b  grouper
0   2   5        1
1   4   9        2
2   5   3        1
3  10   6        2
4   9   7        2
5   3  11        1
And you want boxplots for columns a and b, grouped by the column grouper. You should flatten the columns and change the groupby column to contain values like a1, a2, b1, etc.
Here is a crude way which I think should work, given the DataFrame shown above:
dfpiv = df.pivot(index=df.index, columns='grouper')
cols_flat = [dfpiv.columns.levels[0][i] + str(dfpiv.columns.levels[1][j])
             for i, j in zip(dfpiv.columns.labels[0], dfpiv.columns.labels[1])]
dfpiv.columns = cols_flat
dfpiv = dfpiv.stack(0)
sns.boxplot(dfpiv, groupby=dfpiv.index.get_level_values(1))
Perhaps there are fancier ways of restructuring the DataFrame. The flattening of the hierarchy after pivoting in particular is hard to read; I don't like it.
This is a new answer for an old question, because seaborn and pandas have changed across version updates. Because of these changes, Rutger's answer no longer works.
The most important changes are from seaborn==v0.5.x to seaborn==v0.6.0. I quote the changelog:
Changes to boxplot() and violinplot() will probably be the most disruptive. Both functions maintain backwards-compatibility in terms of the kind of data they can accept, but the syntax has changed to be more similar to other seaborn functions. These functions are now invoked with x and/or y parameters that are either vectors of data or names of variables in a long-form DataFrame passed to the new data parameter.
Let's now go through the examples:
# preamble
import pandas as pd # version 1.1.4
import seaborn as sns # version 0.11.0
sns.set_theme()
Example 1: Simple Boxplot
df = pd.DataFrame([[2, 1], [4, 2], [5, 1],
                   [10, 2], [9, 2], [3, 1]
                  ], columns=['a', 'b'])

# Plotting by seaborn with x and y as parameters
sns.boxplot(x='b', y='a', data=df)
Example 2: Boxplot with grouper
df = pd.DataFrame([[2, 5, 1], [4, 9, 2], [5, 3, 1],
                   [10, 6, 2], [9, 7, 2], [3, 11, 1]
                  ], columns=['a', 'b', 'grouper'])

# using pandas melt
df_long = pd.melt(df, "grouper", var_name='a', value_name='b')

# join the two columns together
df_long['a'] = df_long['a'].astype(str) + df_long['grouper'].astype(str)
sns.boxplot(x='a', y='b', data=df_long)
Example 3: rearranging the DataFrame to pass it directly to seaborn
def df_rename_by_group(data: pd.DataFrame, col: str) -> pd.DataFrame:
    '''This function takes a DataFrame, groups by one column and returns
    a new DataFrame where the old column names are extended by the group item.
    '''
    grouper = data.groupby(col)
    max_length_of_group = max([len(values) for item, values in grouper.indices.items()])
    _df = pd.DataFrame(index=range(max_length_of_group))
    for i in grouper.groups.keys():
        helper = grouper.get_group(i).drop(col, axis=1).add_suffix(str(i))
        helper.reset_index(drop=True, inplace=True)
        _df = _df.join(helper)
    return _df
df = pd.DataFrame([[2, 5, 1], [4, 9, 2], [5, 3, 1],
                   [10, 6, 2], [9, 7, 2], [3, 11, 1]
                  ], columns=['a', 'b', 'grouper'])
df_new = df_rename_by_group(data=df, col='grouper')
sns.boxplot(data=df_new)
I really hope this answer helps to avoid some confusion.
sns.boxplot() does not take a groupby argument.
You will probably see
TypeError: boxplot() got an unexpected keyword argument 'groupby'.
The best approach is to group the data first and pass the grouped DataFrame to the boxplot.
import seaborn as sns

groupedDataFrame = nameDataFrame.groupby(['A'])['B'].agg(sum).reset_index()
sns.boxplot(y='B', x='A', data=groupedDataFrame)
Here column B contains numeric values and the grouping is done on the basis of A. The B values are summed within each group of A, and the boxplot is drawn from the result. Hope this helps.