How to plot multiple measurements from several different categories? - pandas

I am a bit confused about which package to use to plot my data, which typically consists of 10 categories (e.g. temperatures) with 3 or 4 parallel measurements each. So far I have tried plain pandas (Trials 1 and 2) and seaborn (Trial 3).
In the end, what I would like to have is a scatter plot showing the three measurements from each category, with an average line drawn through all my data (see examples A and B in the figure below).
I know that I can place my data in a CSV file and import it with pandas in a Jupyter notebook. That is where my problem starts, and I now think it is related to indexing or data types: I get many errors such as "x and y must be the same size", or that the index 'Degrees' is not defined. I will show the most successful trials below.
I have tried several things so far using the made-up dataset 'Dummydata' below, which is representative of what I will do with my real data.
My test CSV file:
It's a .csv file with four columns: the first holds the temperature, and the next three hold the first, second and third measurement at that temperature (y1, y2, y3).
In []: Dummydata.to_dict()
Out[]:
{'Degrees': {0: 0,
1: 10,
2: 20,
3: 30,
4: 40,
5: 50,
6: 60,
7: 70,
8: 80,
9: 90},
'y1': {0: 20, 1: 25, 2: 34, 3: 35, 4: 45, 5: 70, 6: 46, 7: 20, 8: 10, 9: 15},
'y2': {0: 20, 1: 24, 2: 32, 3: 36, 4: 41, 5: 77, 6: 48, 7: 23, 8: 19, 9: 16},
'y3': {0: 18, 1: 26, 2: 36, 3: 37, 4: 42, 5: 75, 6: 46, 7: 21, 8: 15, 9: 16}}
Trial 1: trying to achieve a scatterplot
import pandas as pd
import matplotlib.pyplot as plt
Dummydata = pd.read_csv('DummyData.csv','r',delimiter=(';'), header=0)
y = ['y1','y2','y3']
x = ['Degrees']
Dummydata.plot(x,y)
This gives a nice line plot, but also produces the UserWarning: "Pandas doesn't allow columns to be created via a new attribute name" (??).
If I change the call to Dummydata.plot.scatter(x,y), I get the error "x and y must be the same size". I know the shape of my data is (10, 4), i.e. 10 rows and 4 columns; how can I redefine this so it is okay for pandas?
Trial 2: the same thing with small adjustments
import pandas as pd
import matplotlib.pyplot as plt
#import the .csv file, set the delimiter to ';' and take the header from the first line (0)
Dummydata = pd.read_csv('DummyData.csv','r',delimiter=(';'), header = 0)
x =('Degrees')
y1 =('y1')
y2 =('y2')
y3 =('y3')
Dummydata.plot([x,y3]) #works fine for one value, but prints y1 and y2 ?? why?
Dummydata.plot([x,y1]) # also works, but prints out y2 and y3 ??? why?
Dummydata.plot([x,y]) # get error all arrays must be same length?
Dummydata.plot.scatter([x,y]) # many error, no plot
Somehow I must tell pandas that the data shape (10, 4) is okay? I am not sure what I am doing wrong here.
Trial 3: using seaborn and try to get a scatterplot
I simply started to make a factorplot, where I again ran into the same problem of getting more than one y value onto my graph. I don't think converting this to a scatter plot would be hard if I just knew how to add more data onto one graph.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#import the .csv file using pandas
Dummydata = pd.read_csv('DummyData.csv', 'r', delimiter=(';'))
#Checking what the file looks like
#Dummydata.head(2)
x =('Degrees')
y1 =('y1')
y2 =('y2')
y3 =('y3')
y =(['y1','y2','y3'])
Factorplot =sns.factorplot(x='Degrees',y='y1',data=Dummydata)
The factorplot works fine for one dataset. However, when I try to add more y values (whether I define y = ['y1','y2','y3'] beforehand or inside the plotting call), I get errors like "Could not interpret input 'y'". For instance, for this input:
Factorplot =sns.factorplot(x='Degrees',y='y',data=Dummydata)
or
Factorplot =sns.factorplot(x='Degrees',y=(['y1','y2','y3']),data=Dummydata)
#Error: cannot copy sequence with size 3 to array axis with dimension 10
What I would like to achieve is something like the figure below: in (A) a scatter plot with a rolling mean drawn through it, and in (B) a plot of only the average of each category, showing the standard deviation, with a rolling mean drawn across the categories as well:
I don't want to type my data values in manually; I want to import them from a .csv file (because the datasets can become very big).
Is there something wrong with the way I am organising my csv file?
All help appreciated.

Compute rolling statistics with rolling. Compute the mean and standard deviation with mean and std. Plot the data with plot. Add y-error bars with the yerr keyword argument.
data = data.set_index('Degrees')
mean = data.mean(axis='columns')
std = data.std(axis='columns')
ax = data.plot(style='o')                                  # the three measurements as points (A)
mean.rolling(window=6, min_periods=1).mean().plot(ax=ax)   # rolling mean line through the data
plt.figure()
mean.plot(yerr=std, capsize=3)                             # category means with std error bars (B)
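A common way around both the pandas "x and y must be the same size" error and seaborn's "Could not interpret input 'y'" is to reshape the wide table (one column per measurement) into long form with melt, so there is a single y column. A minimal sketch using the dummy data from the question; in this long form, sns.factorplot(x='Degrees', y='value', data=tidy) would also work:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import pandas as pd

Dummydata = pd.DataFrame({
    'Degrees': [0, 10, 20, 30, 40, 50, 60, 70, 80, 90],
    'y1': [20, 25, 34, 35, 45, 70, 46, 20, 10, 15],
    'y2': [20, 24, 32, 36, 41, 77, 48, 23, 19, 16],
    'y3': [18, 26, 36, 37, 42, 75, 46, 21, 15, 16],
})

# One row per measurement: columns Degrees / measurement / value
tidy = Dummydata.melt(id_vars='Degrees', var_name='measurement', value_name='value')

ax = tidy.plot.scatter(x='Degrees', y='value')       # all 30 points in one call
tidy.groupby('Degrees')['value'].mean().plot(ax=ax)  # mean line through the categories
```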

Related

Numpy Random Choice with Non-regular Array Size

I'm making an array of sums of random choices from a negative binomial distribution (nbd), with each sum being of non-regular length. Right now I implement it as follows:
import numpy as np
from numpy.random import default_rng
rng = default_rng()
nbd = rng.negative_binomial(1, 0.5, int(1e6))
gmc = [12, 35, 4, 67, 2]
n_pp = np.empty(len(gmc))
for i in range(len(gmc)):
    n_pp[i] = np.sum(rng.choice(nbd, gmc[i]))
This works, but when I perform it over my actual data it's very slow (gmc is of dimension 1e6), and I would like to vary this for multiple values of n and p in the nbd (in this example they're set to 1 and 0.5, respectively).
I'd like to work out a pythonic way to do this which eliminates the loop, but I'm not sure it's possible. I want to keep default_rng for the better random generation than the older way of doing it (np.random.choice), if possible.
The distribution of the sum of m samples from the negative binomial distribution with parameters (n, p) is the negative binomial distribution with parameters (m*n, p). So instead of summing random selections from a large, precomputed sample of negative_binomial(1, 0.5), you can generate your result directly with negative_binomial(gmc, 0.5):
In [68]: gmc = [12, 35, 4, 67, 2]
In [69]: npp = rng.negative_binomial(gmc, 0.5)
In [70]: npp
Out[70]: array([ 9, 34, 1, 72, 7])
(The negative_binomial method will broadcast its inputs, so we can pass gmc as an argument to generate all the samples with one call.)
More generally, if you want to vary the n that is used to generate nbd, you would multiply that n by the corresponding element in gmc and pass the product to rng.negative_binomial.
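The claimed identity (a sum of m draws from NB(n, p) is distributed as NB(m*n, p)) can be checked empirically; a quick sketch, not part of the answer's original code:

```python
import numpy as np
from numpy.random import default_rng

rng = default_rng(12345)
m, n, p, trials = 35, 1, 0.5, 200_000

# Left: sum m individual NB(n, p) draws per trial; right: one NB(m*n, p) draw per trial
summed = rng.negative_binomial(n, p, size=(trials, m)).sum(axis=1)
direct = rng.negative_binomial(m * n, p, size=trials)

print(summed.mean(), direct.mean())  # both should be near m*n*(1-p)/p = 35
```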

Find indexes of local maxima/minima in pandas and scipy.signal

My goal is to find the indexes of the local maxima and minima of the function in pandas or matplotlib.
Let us say we have a noisy signal with its local maxima and minima already plotted like in the following link:
https://stackoverflow.com/a/50836425/15934571
Here is the code (I just copy and paste it from the link above):
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.signal import argrelextrema
# Generate a noisy AR(1) sample
np.random.seed(0)
rs = np.random.randn(200)
xs = [0]
for r in rs:
    xs.append(xs[-1] * 0.9 + r)
df = pd.DataFrame(xs, columns=['data'])
n = 5  # number of points to be checked before and after
# Find local peaks
df['min'] = df.iloc[argrelextrema(df.data.values, np.less_equal, order=n)[0]]['data']
df['max'] = df.iloc[argrelextrema(df.data.values, np.greater_equal, order=n)[0]]['data']
# Plot results
plt.scatter(df.index, df['min'], c='r')
plt.scatter(df.index, df['max'], c='g')
plt.plot(df.index, df['data'])
plt.show()
So, I do not have any idea how to continue from this point and find indexes corresponding to the obtained local maxima and minima on the plot. Would appreciate any help!
You can use df['min'].notna(), which marks the rows where min is not NaN; to get the index of the local minima, combine it with .loc:
df.loc[df['min'].notna()].index
The output for your example is:
Int64Index([0, 11, 21, 35, 43, 54, 67, 81, 105, 127, 141, 161, 168, 187], dtype='int64')
You can use the same procedure with df['max'] to find the local maxima.
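Note also that argrelextrema itself already returns the integer positions, so the indexes can be read off directly without the NaN-column detour; a sketch using the same synthetic signal as the question:

```python
import numpy as np
from scipy.signal import argrelextrema

# Rebuild the same noisy AR(1) sample as in the question
np.random.seed(0)
rs = np.random.randn(200)
xs = [0]
for r in rs:
    xs.append(xs[-1] * 0.9 + r)
xs = np.asarray(xs)

n = 5
min_idx = argrelextrema(xs, np.less_equal, order=n)[0]     # integer positions of local minima
max_idx = argrelextrema(xs, np.greater_equal, order=n)[0]  # integer positions of local maxima
print(min_idx, max_idx)
```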

Pandas line graph x-axis extra values

I have the following:
df.groupby(['A', 'B'])['time'].mean().unstack().plot()
This gives me a line graph like this one:
The circles in the plot indicate data points. The values on x-axis are the values of A which are discrete (100, 200 and 300). That is, I only have data points on the 100th, 200th and 300th point on the x-axis. Pandas/Matplotlib adds the intermediate values on the x-axis (125, 150, 175, 225, 250 and 275) that I don't want.
How can I plot and tell Pandas not to add the extra values on the x-axis?
You're looking for tick locators
import pandas as pd
import matplotlib.ticker as ticker

df = pd.DataFrame({'y': [1, 1.75, 3]}, index=[100, 200, 300])
ax = df.plot(legend=False)
ax.xaxis.set_major_locator(ticker.MultipleLocator(100))
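Alternatively, DataFrame.plot accepts an xticks argument, so the ticks can be pinned to the index in the same call; a minimal sketch:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import pandas as pd

df = pd.DataFrame({'y': [1, 1.75, 3]}, index=[100, 200, 300])
ax = df.plot(legend=False, xticks=df.index)  # ticks only at the actual data points
```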

How to define NetworkX graph from pandas dataframe having multiple columns

I have a pandas dataframe that captures whether an invoice has been raised as a dispute, based on some characteristics. I would like to run community detection on top of this to search for patterns, but I am confused about how to create a graph from it. I tried the following:
import pandas as pd
import networkx as nx
from itertools import combinations as comb
data = [[4321, 543, 765, 3, 2014, 54, 0, 1, 0, 1, 0], [2321, 657, 654, 7, 2017, 59, 1, 0, 1, 0, 1]]
df = pd.DataFrame(data, columns = ['NetValueInDocCurr', 'NetWeight', 'Volume', 'BillingItems', 'FISCALYEAR', 'TaxAmtInDocCurr', 'Description_Bulk', 'Description_Car_Care', 'Description_Packed', 'Description_Services', 'Final_Dispute'])
edges = set(comb(df.columns,2))
G = nx.Graph()
G.add_edges_from(edges)
My current assumption is to define each column name as a node, the pairwise relationship between all columns as an edge, and the column values as edge weights. Is this the right approach? If yes, any help on the code to define the weights? My idea is to start with a complete graph and use divisive methods like Girvan-Newman.
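Not an authoritative answer, but one way to sketch the weights under that assumption: build the complete graph over the column names and weight each edge by some pairwise statistic of the two columns, e.g. their absolute correlation (a hypothetical choice; any pairwise measure would slot in the same way):

```python
import pandas as pd
import networkx as nx
from itertools import combinations as comb

data = [[4321, 543, 765, 3, 2014, 54, 0, 1, 0, 1, 0],
        [2321, 657, 654, 7, 2017, 59, 1, 0, 1, 0, 1]]
cols = ['NetValueInDocCurr', 'NetWeight', 'Volume', 'BillingItems', 'FISCALYEAR',
        'TaxAmtInDocCurr', 'Description_Bulk', 'Description_Car_Care',
        'Description_Packed', 'Description_Services', 'Final_Dispute']
df = pd.DataFrame(data, columns=cols)

G = nx.Graph()
for u, v in comb(df.columns, 2):
    # Hypothetical weight: absolute Pearson correlation between the two columns.
    # With only two rows this is degenerate (always 1); real data needs more rows.
    G.add_edge(u, v, weight=abs(df[u].corr(df[v])))
```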

Pandas dataframe - multiplying DF's elementwise on same dates - something wrong?

I've been banging my head over this; I just cannot seem to get it right and I don't understand what the problem is. I tried the following:
#!/usr/bin/env python
import matplotlib.pyplot as plt
import numpy as np
import quandl
btc_usd_price_kraken = quandl.get('BCHARTS/KRAKENUSD', returns="pandas")
btc_usd_price_kraken.replace(0, np.nan, inplace=True)
plt.plot(btc_usd_price_kraken.index, btc_usd_price_kraken['Weighted Price'])
plt.grid(True)
plt.title("btc_usd_price_kraken")
plt.show()
eur_usd_price = quandl.get('BUNDESBANK/BBEX3_D_USD_EUR_BB_AC_000', returns="pandas")
eur_dkk_price = quandl.get('ECB/EURDKK', returns="pandas")
usd_dkk_price = eur_dkk_price / eur_usd_price
btc_dkk = btc_usd_price_kraken['Weighted Price'] * usd_dkk_price
plt.plot(btc_dkk.index, btc_dkk) # WHY IS THIS [4785 rows x 1340 columns] ???
plt.grid(True)
plt.title("Historic value of 1 BTC converted to DKK")
plt.show()
As you can see from the comment, I don't understand why the result I am trying to plot has shape [4785 rows x 1340 columns].
Anyway, the code results in a lot of error messages, e.g.:
> Traceback (most recent call last): File
> "/usr/lib/python3.6/site-packages/matplotlib/backends/backend_qt5agg.py",
> line 197, in __draw_idle_agg
> FigureCanvasAgg.draw(self) File "/usr/lib/python3.6/site-packages/matplotlib/backends/backend_agg.py",
...
> return _from_ordinalf(x, tz) File "/usr/lib/python3.6/site-packages/matplotlib/dates.py", line 254, in
> _from_ordinalf
> dt = datetime.datetime.fromordinal(ix).replace(tzinfo=UTC) ValueError: ordinal must be >= 1
I have read some posts, and I know that when multiplying, pandas DataFrames automatically align and multiply elementwise only where the dates match (so if one DF has a time series for 1999-2017 and the other only for 2012-2015, only the common dates in 2012-2015 are multiplied, i.e. the intersection of the two data sets). So my problem is understanding the error messages (and the solution). The whole thing comes down to computing the btc_dkk variable (the price of Bitcoin in the currency DKK) and plotting it.
This should work:
usd_dkk_price.multiply(btc_usd_price_kraken['Weighted Price'], axis='index').dropna()
You are multiplying on columns, not on the index (this happens because you are multiplying a DataFrame by a Series; if you had selected the column of usd_dkk_price first, it would not have happened). Afterwards, just drop the rows with NaN.
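The column-versus-index alignment can be seen with a tiny made-up example (dates and prices invented for illustration; quandl not needed):

```python
import pandas as pd

idx = pd.date_range('2017-01-01', periods=4, freq='D')
usd_dkk_price = pd.DataFrame({'Value': [7.0, 7.1, 7.2, 7.3]}, index=idx)  # one-column frame
btc_usd = pd.Series([1000.0, 1010.0, 990.0, 1020.0], index=idx,
                    name='Weighted Price')                                # a Series

wrong = usd_dkk_price * btc_usd  # aligns the Series on the COLUMNS -> all-NaN, extra columns
right = usd_dkk_price.multiply(btc_usd, axis='index').dropna()  # aligns on the index
```

The `wrong` frame has one column per date plus the original 'Value' column, all NaN, which is exactly how the [4785 rows x 1340 columns] result arises.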