I have a pandas dataframe that records whether an invoice has been raised as a dispute, based on some characteristics. I would like to run community detection on it to search for patterns, but I am confused about how to create a graph from it. I tried the following:
import pandas as pd
import networkx as nx
from itertools import combinations as comb
data = [[4321, 543, 765, 3, 2014, 54, 0, 1, 0, 1, 0], [2321, 657, 654, 7, 2017, 59, 1, 0, 1, 0, 1]]
df = pd.DataFrame(data, columns = ['NetValueInDocCurr', 'NetWeight', 'Volume', 'BillingItems', 'FISCALYEAR', 'TaxAmtInDocCurr', 'Description_Bulk', 'Description_Car_Care', 'Description_Packed', 'Description_Services', 'Final_Dispute'])
edges = set(comb(df.columns,2))
G = nx.Graph()
G.add_edges_from(edges)
My current assumption is to define column name as node, pairwise relationship between all columns as edge and column value as edge weight. Is this the right approach? If yes, any help on the code to define weights? My idea is to start with a complete graph and use divisive methods like Girvan-Newman.
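One possible way to define the weights, as a minimal sketch rather than a definitive answer: use the absolute pairwise correlation between columns as the edge weight. The toy data below is hypothetical, and the `weakest_edge` helper is an assumption introduced here. It makes `girvan_newman` remove the lowest-weight edge at each step instead of the default highest-betweenness edge, so this is a variant of Girvan-Newman, not the textbook algorithm. Note that with columns as nodes, the communities are groups of related features, not groups of invoices.

```python
import numpy as np
import pandas as pd
import networkx as nx
from itertools import combinations
from networkx.algorithms.community import girvan_newman

# Hypothetical toy data: two strongly related columns and two unrelated ones.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "NetWeight": base + rng.normal(scale=0.1, size=200),
    "Volume": base + rng.normal(scale=0.1, size=200),
    "TaxAmtInDocCurr": rng.normal(size=200),
    "Final_Dispute": rng.integers(0, 2, size=200),
})

# Edge weight = absolute Pearson correlation between two columns.
corr = df.corr().abs()

G = nx.Graph()
G.add_nodes_from(df.columns)
for u, v in combinations(df.columns, 2):
    w = corr.loc[u, v]
    if np.isfinite(w):  # skip pairs where the correlation is undefined
        G.add_edge(u, v, weight=float(w))

def weakest_edge(graph):
    # Hypothetical helper: pick the edge with the smallest weight for removal.
    u, v, _ = min(graph.edges(data="weight"), key=lambda e: e[2])
    return (u, v)

# First partition produced by the divisive procedure.
first_split = next(girvan_newman(G, most_valuable_edge=weakest_edge))
print(first_split)
```

Because the weakest edges are removed first, strongly correlated columns tend to stay in the same community for longest.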
*updated original question
Example code:
import pandas as pd
df = pd.DataFrame({'Weight': [1.2, 2.0, 1.8,2.4,1.9,2.3],
'Sex': ['Male', 'Female', 'Unknown','Male','Male','Female'],
'Neutered': ['Entire', 'Unknown', 'Neutered','Neutered','Neutered','Unknown'],
'Rabbit_Breed': ['Dutch', 'Lop', 'Dwarf','Giant','Cross-Breed','Dwarf'],
'Abscess-mouth': [0, 0, 1,0,0,0],
'Overweight': [0, 1, 0,1,0,1],
'underweight': [0, 0, 1,0,0,1],
'molars-long': [1, 0, 1,0,0,1]})
df.head()
NB: I have around 100 columns, so I cannot list them all. I'm looking for a way to group by and/or sum across all the columns to find the most common disorders in relation to the breed or sex of a rabbit.
I've attached an image of my thought process:
original question:
I'm looking to groupby one or two columns and sum all the other columns. Not sure if I should use a range or what but I keep getting errors.
Unless I've misunderstood the purpose of groupby and sum. I've got about 100 columns of disorders in domestic rabbits and ultimately I'm trying to investigate the most common ones and plot them against breed or female/male etc.
Thank you!!
Plotting a histogram doesn't make much sense to me, since you want to plot bivariate data (disorder vs. breed), while histograms are meant for univariate data. I think you want a heatmap, which is basically the generalization of a histogram for two dimensions. For that, you can use seaborn.heatmap.
Is this what you want?
import pandas as pd
import seaborn as sns
df = pd.DataFrame({'Weight': [1.2, 2.0, 1.8,2.4,1.9,2.3],
'Sex': ['Male', 'Female', 'Unknown','Male','Male','Female'],
'Neutered': ['Entire', 'Unknown', 'Neutered','Neutered','Neutered','Unknown'],
'Rabbit_Breed': ['Dutch', 'Lop', 'Dwarf','Giant','Cross-Breed','Dwarf'],
'Abscess-mouth': [0, 0, 1,0,0,0],
'Overweight': [0, 1, 0,1,0,1],
'underweight': [0, 0, 1,0,0,1],
'molars-long': [1, 0, 1,0,0,1]})
disorder_cols = ['Abscess-mouth', 'Overweight', 'underweight', 'molars-long']
disorder_by_breed = df.groupby('Rabbit_Breed')[disorder_cols].sum()
sns.heatmap(data=disorder_by_breed, annot=True, lw=1, cmap='Reds')
Output: a heatmap of disorder counts per breed (image not reproduced).
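With around 100 disorder columns, listing them by hand is impractical. A minimal sketch, assuming every column other than a few known descriptive ones (the hypothetical `id_cols` list below) is a 0/1 disorder flag:

```python
import pandas as pd

df = pd.DataFrame({'Weight': [1.2, 2.0, 1.8, 2.4, 1.9, 2.3],
                   'Sex': ['Male', 'Female', 'Unknown', 'Male', 'Male', 'Female'],
                   'Neutered': ['Entire', 'Unknown', 'Neutered', 'Neutered', 'Neutered', 'Unknown'],
                   'Rabbit_Breed': ['Dutch', 'Lop', 'Dwarf', 'Giant', 'Cross-Breed', 'Dwarf'],
                   'Abscess-mouth': [0, 0, 1, 0, 0, 0],
                   'Overweight': [0, 1, 0, 1, 0, 1],
                   'underweight': [0, 0, 1, 0, 0, 1],
                   'molars-long': [1, 0, 1, 0, 0, 1]})

# Assumption: these are the only non-disorder columns in the frame.
id_cols = ['Weight', 'Sex', 'Neutered', 'Rabbit_Breed']
disorder_cols = list(df.columns.difference(id_cols))

# Count of each disorder per breed.
disorder_by_breed = df.groupby('Rabbit_Breed')[disorder_cols].sum()

# Overall totals, most common disorder first.
most_common = disorder_by_breed.sum().sort_values(ascending=False)
print(disorder_by_breed)
print(most_common)
```

Swapping `'Rabbit_Breed'` for `'Sex'`, or for a list such as `['Rabbit_Breed', 'Sex']`, groups by those columns instead.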
I'm making an array of sums of random choices from a negative binomial distribution (nbd), with each sum being of non-regular length. Right now I implement it as follows:
import numpy as np
from numpy.random import default_rng
rng = default_rng()
nbd = rng.negative_binomial(1, 0.5, int(1e6))
gmc = [12, 35, 4, 67, 2]
n_pp = np.empty(len(gmc))
for i in range(len(gmc)):
    n_pp[i] = np.sum(rng.choice(nbd, gmc[i]))
This works, but when I perform it over my actual data it's very slow (gmc is of dimension 1e6), and I would like to vary this for multiple values of n and p in the nbd (in this example they're set to 1 and 0.5, respectively).
I'd like to work out a pythonic way to do this which eliminates the loop, but I'm not sure it's possible. I want to keep default_rng for the better random generation than the older way of doing it (np.random.choice), if possible.
The distribution of the sum of m samples from the negative binomial distribution with parameters (n, p) is the negative binomial distribution with parameters (m*n, p). So instead of summing random selections from a large, precomputed sample of negative_binomial(1, 0.5), you can generate your result directly with negative_binomial(gmc, 0.5):
In [68]: gmc = [12, 35, 4, 67, 2]
In [69]: npp = rng.negative_binomial(gmc, 0.5)
In [70]: npp
Out[70]: array([ 9, 34, 1, 72, 7])
(The negative_binomial method will broadcast its inputs, so we can pass gmc as an argument to generate all the samples with one call.)
More generally, if you want to vary the n that is used to generate nbd, you would multiply that n by the corresponding element in gmc and pass the product to rng.negative_binomial.
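As a minimal sketch of that generalization, with hypothetical values `n = 3` and `p = 0.5`: the loop is replaced by a single broadcast call, and a quick sanity check compares sample means against the theoretical mean of a negative binomial with parameters (m*n, p), which is m*n*(1-p)/p.

```python
import numpy as np

rng = np.random.default_rng(0)

gmc = np.array([12, 35, 4, 67, 2])
n, p = 3, 0.5  # hypothetical parameters of the underlying distribution

# One draw per element of gmc, each distributed as the sum of gmc[i]
# independent negative_binomial(n, p) samples.
n_pp = rng.negative_binomial(n * gmc, p)
print(n_pp)

# Sanity check: sample means should be close to the theoretical means.
draws = rng.negative_binomial(n * gmc, p, size=(100_000, gmc.size))
expected_mean = n * gmc * (1 - p) / p
print(draws.mean(axis=0))
print(expected_mean)
```

To scan several values of n and p, only the arguments of the single `negative_binomial` call need to change.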
Converting a pandas data frame with mixed column types (numerical, ordinal, and categorical) to SciPy sparse arrays is a central problem in machine learning.
Now, if my pandas data frame consists of only numerical data, then I can simply do the following to convert the data frame to a sparse CSR matrix:
scipy.sparse.csr_matrix(df.values)
and if my data frame consists of ordinal data types, I can handle them using LabelEncoder
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder
# One LabelEncoder per column, created on first use
d = defaultdict(LabelEncoder)
fit = df.apply(lambda x: d[x.name].fit_transform(x))
Then, I can again use the following and the problem is solved:
scipy.sparse.csr_matrix(fit.values)
Categorical variables with a low number of values are also not a concern. They can easily be handled using pd.get_dummies (Pandas or Scikit-Learn versions).
My main concern is for categorical variables with a large number of values.
The main problem: How to handle categorical variables with a large number of values?
pd.get_dummies(train_set, columns=[categorical_columns_with_large_number_of_values], sparse=True)
takes a lot of time.
This question seems to be giving interesting directions, but, it is not clear whether it handles all the data types efficiently.
Let me know if you know the efficient way. Thanks.
You can convert any single column to a sparse COO array very easily with factorize. This will be MUCH faster than building a giant dense dataframe.
import numpy as np
import pandas as pd
import scipy.sparse
data = pd.DataFrame({"A": ["1", "2", "A", "C", "A"]})
c, u = pd.factorize(data['A'])
n, m = data.shape[0], u.shape[0]
one_hot = scipy.sparse.coo_matrix((np.ones(n, dtype=np.int16), (np.arange(n), c)), shape=(n,m))
You'll get something that looks like this:
>>> one_hot.toarray()
array([[1, 0, 0, 0],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1],
[0, 0, 1, 0]], dtype=int16)
>>> u
Index(['1', '2', 'A', 'C'], dtype='object')
Here the rows are your dataframe rows and the columns are the factors of your column (u holds the labels for those columns, in order).
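To cover a mixed frame, the same idea can be applied per column and the pieces stacked horizontally. This is a minimal sketch: the helper `one_hot_sparse`, the extra columns "B" and "num", and the label format are all hypothetical, and it assumes there are no missing values (factorize codes those as -1, which is not a valid column index).

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

def one_hot_sparse(series):
    # Hypothetical helper: sparse one-hot encoding of a single column.
    codes, uniques = pd.factorize(series)
    n = len(series)
    mat = sp.coo_matrix(
        (np.ones(n, dtype=np.int8), (np.arange(n), codes)),
        shape=(n, len(uniques)),
    )
    return mat, uniques

data = pd.DataFrame({
    "A": ["1", "2", "A", "C", "A"],
    "B": ["x", "y", "x", "x", "z"],
    "num": [0.5, 0.0, 1.5, 0.0, 2.0],
})

blocks, labels = [], []
for col in ["A", "B"]:  # categorical columns
    mat, uniques = one_hot_sparse(data[col])
    blocks.append(mat)
    labels.extend(f"{col}={u}" for u in uniques)

# Numerical columns are converted directly.
blocks.append(sp.csr_matrix(data[["num"]].to_numpy()))
labels.append("num")

# Stack everything into one CSR matrix; dtypes are upcast to float.
X = sp.hstack(blocks, format="csr")
print(X.shape)
print(labels)
```

Keeping `labels` alongside `X` preserves the mapping from matrix columns back to the original column values.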
import pandas as pd
index = pd.to_datetime(['2016-05-01', '2016-11-01', '2017-05-02'])
data = pd.DataFrame({'a': [1, 2, 3],
'b': [4, 5, 6]}, index=index)
ax = data.plot()
print(ax.get_xlim())
# Out: (736066.7, 736469.3)
Now, if we change the last date.
index = pd.to_datetime(['2016-05-01', '2016-11-01', '2017-05-01'])
data = pd.DataFrame({'a': [1, 2, 3],
'b': [4, 5, 6]}, index=index)
ax = data.plot()
print(ax.get_xlim())
# Out: (184.8, 189.2)
The first example seems consistent with the matplotlib docs:
Matplotlib represents dates using floating point numbers specifying the number of days since 0001-01-01 UTC, plus 1
Why does the second example return something seemingly completely different? I'm using pandas version 0.22.0 and matplotlib version 2.2.2.
In the second example, if you look at the plots, rather than giving dates matplotlib is giving quarter values:
The dates in this case are exactly six months and therefore two quarters apart, which is presumably why you're seeing this behavior. While I can't find it in the docs, the numbers given by xlim in this case are consistent with being the number of quarters since the Unix Epoch (Jan. 1, 1970).
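A quick way to check this reading, assuming the axis uses pandas Period ordinals at quarterly frequency (an inference, not something confirmed by the docs): convert the index to quarterly periods and look at their ordinals, which count quarters since 1970Q1.

```python
import pandas as pd

index = pd.to_datetime(['2016-05-01', '2016-11-01', '2017-05-01'])

# Quarterly periods; the ordinal of 1970Q1 is 0.
quarters = index.to_period('Q')
ordinals = [p.ordinal for p in quarters]
print(ordinals)  # [185, 187, 189]
```

The first and last ordinals, 185 and 189, sit just inside the reported limits (184.8, 189.2), which fits the usual small margin added around the data.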
Pandas uses different units to represent dates and times on the axes, depending on the range of dates/times in use. This means that different locators are in use.
In the first case,
print(ax.xaxis.get_major_locator())
# Out: pandas.plotting._converter.PandasAutoDateLocator
in the second case
print(ax.xaxis.get_major_locator())
# Out: pandas.plotting._converter.TimeSeries_DateLocator
You may force pandas to always use the PandasAutoDateLocator using the x_compat argument,
data.plot(x_compat=True)
This ensures you always get the same datetime definition, consistent with the matplotlib.dates convention.
The drawback is that this removes the nice quarterly ticking and replaces it with the standard ticking.
On the other hand, it then allows you to use the very customizable matplotlib.dates tickers and formatters. For example, to get quarterly ticks/labels:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.ticker as mticker
import pandas as pd
index = pd.to_datetime(['2016-05-01', '2016-11-01', '2017-05-01'])
data = pd.DataFrame({'a': [1, 2, 3],
'b': [4, 5, 6]}, index=index)
ax = data.plot(x_compat=True)
# Quarterly ticks
ax.xaxis.set_major_locator(mdates.MonthLocator((1,4,7,10)))
# Formatting:
def func(x,pos):
    q = (mdates.num2date(x).month-1)//3+1
    tx = "Q{}".format(q)
    if q == 1:
        tx += "\n{}".format(mdates.num2date(x).year)
    return tx
ax.xaxis.set_major_formatter(mticker.FuncFormatter(func))
plt.setp(ax.get_xticklabels(), rotation=0, ha="center")
plt.show()
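To check that x_compat=True really yields matplotlib date numbers, the axis limits can be converted back with mdates.num2date. This is a minimal sketch; the Agg backend is an assumption made here so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for headless runs
import matplotlib.dates as mdates
import pandas as pd

index = pd.to_datetime(['2016-05-01', '2016-11-01', '2017-05-01'])
data = pd.DataFrame({'a': [1, 2, 3],
                     'b': [4, 5, 6]}, index=index)

ax = data.plot(x_compat=True)

# Convert the numeric axis limits back to datetimes.
lo, hi = ax.get_xlim()
left, right = mdates.num2date(lo), mdates.num2date(hi)
print(left.date(), right.date())
```

The two dates should bracket the data range, whatever epoch the installed matplotlib version uses for its date numbers.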
I want to zero out all of the elements of a dask.array except for the top few elements. How do I do this?
Example
Say I have a small dask array like the following:
import numpy as np
import dask.array as da
x = np.array([0, 4, 2, 3, 1])
x = da.from_array(x, chunks=(2,))
How do I zero out all but the two largest elements? I want something like the following:
>>> result.compute()
array([0, 4, 0, 3, 0])
You can do this with a combination of the topk function and in-place item assignment:
top = x.topk(2)
x[x < top[-1]] = 0
>>> x.compute()
array([0, 4, 0, 3, 0])
Note that this won't stream particularly nicely through memory. If you're using the single machine scheduler then you might want to do this in two passes by explicitly computing top ahead of time:
top = x.topk(2)
top = top.compute() # pass through data once to get top elements
x[x < top[-1]] = 0 # then pass through again applying filter
>>> x.compute()
array([0, 4, 0, 3, 0])
This only matters if you're trying to stream through a large dataset on a single machine and should not affect you much if you're on a distributed system.
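If you would rather not modify x in place, a variant of the same idea is to build a new array with da.where. This is a minimal sketch; as with the assignment above, elements tied with the k-th largest value are all kept.

```python
import numpy as np
import dask.array as da

x = da.from_array(np.array([0, 4, 2, 3, 1]), chunks=(2,))

# Smallest of the two largest values (topk returns them in descending order).
threshold = x.topk(2)[-1]

# Keep elements at or above the threshold, zero out the rest.
result = da.where(x >= threshold, x, 0)
print(result.compute())  # [0 4 0 3 0]
```

The original x is left untouched, which helps if it is reused later in the same graph.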