Is there a better way of finding summary statistics in Python? - dataframe

The following is my code for finding the five-point summary statistics. I keep getting this error:
list indices must be integers or slices, not str
It seems like the way I'm using the describe function that I created is wrong.
from statistics import stdev,median,mean

def describe(key):
    a = []
    for i in scripts:
        a.append(i[key])
    a = scripts[key]
    total = sum(script[key] for script in scripts)
    avg = total/len(a)
    avg = mean(a)
    s = stdev(a)
    q25 = min(a) + (max(a) - min(a))*25
    med = min(a) + (max(a) - min(a))*50
    med = median(a)
    q75 = min(a) + (max(a) - min(a))*75
    return (total, avg, s, q25, med, q75)
summary = [('items', describe('items')),
           ('quantity', describe('quantity')),
           ('nic', describe('nic')),
           ('act_cost', describe('act_cost'))]
I keep getting this error:
TypeError                                 Traceback (most recent call last)
<ipython-input-8-ba78d5218ead> in <module>()
----> 1 summary = [('items', describe('items')),
      2            ('quantity', describe('quantity')),
      3            ('nic', describe('nic')),
      4            ('act_cost', describe('act_cost'))]

<ipython-input-1-bcf37f98eb7d> in describe(key)
      4     for i in scripts:
      5         a.append(i[key])
----> 6     a = scripts[key]
      7     total = sum(script[key] for script in scripts)
      8     avg = total/len(a)

TypeError: list indices must be integers or slices, not str

It is hard to understand your problem, since we don't know what scripts looks like. It is a global variable which is not defined in your snippet. The error states that scripts is of type list, but your code seems to assume it is a dataframe, so please check the type of scripts.
Also, did you know that there is an easy way to calculate a five-number summary with numpy, like this:
import numpy as np
minimum, q25, med, q75, maximum = np.percentile(a, [0, 25, 50, 75, 100], interpolation='midpoint')
For description, see:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html
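For a self-contained example, here is the five-number summary of a small list using np.percentile with its default (linear) interpolation; the sample values are made up:

```python
import numpy as np

a = [1, 2, 3, 4, 5]  # any one-dimensional sample

# One call computes min, lower quartile, median, upper quartile and max
minimum, q25, med, q75, maximum = np.percentile(a, [0, 25, 50, 75, 100])
print(minimum, q25, med, q75, maximum)  # 1.0 2.0 3.0 4.0 5.0
```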

As per your question, you are accessing a list of dictionaries.
Directly indexing it with a string key does not yield the result here.
So you must do:
getValues = lambda key, inputData: [subVal[key] for subVal in inputData if key in subVal]
In this case, getValues('key', scripts) will give the corresponding list, and then it is easy to compute the statistics of that list.
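For instance, assuming scripts is a list of dictionaries like the hand-made one below (the field name 'items' and the values are invented for illustration), the helper feeds the statistics functions directly:

```python
from statistics import mean, median

# Extract the values for one key from a list of dictionaries
getValues = lambda key, inputData: [subVal[key] for subVal in inputData if key in subVal]

# Hypothetical data standing in for `scripts`
scripts = [{'items': 2}, {'items': 4}, {'items': 6}]

a = getValues('items', scripts)
total, avg, med = sum(a), mean(a), median(a)
print(total, avg, med)  # 12 4 4
```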

Related

Plotting a graph of the top 15 highest values

I am working on a dataset which shows the budget spent on movies. I want to make a plot which contains the top 15 highest-budget movies.
#sort the 'budget' column in descending order and store it in the new dataframe.
info = pd.DataFrame(dp['budget'].sort_values(ascending = False))
info['original_title'] = dp['original_title']
data = list(map(str,(info['original_title'])))
#extract the top 10 budget movies data from the list and dataframe.
x = list(data[:10])
y = list(info['budget'][:10])
This was the output I got:
C:\Users\Phillip\AppData\Local\Temp\ipykernel_7692\1681814737.py:2: FutureWarning: The behavior of `series[i:j]` with an integer-dtype index is deprecated. In a future version, this will be treated as *label-based* indexing, consistent with e.g. `series[i]` lookups. To retain the old behavior, use `series.iloc[i:j]`. To get the future behavior, use `series.loc[i:j]`.
y = list(info['budget'][:5])
I'm new to the data analysis scene, so I'm confused about how else to go about this problem.
A simple example using a movie dataset I found online:
import pandas as pd
url = "https://raw.githubusercontent.com/erajabi/Python_examples/master/movie_sample_dataset.csv"
df = pd.read_csv(url)
# Bar plot of 15 highest budgets:
df.nlargest(n=15, columns="budget").plot.bar(x="movie_title", y="budget")
You can customize your plot in various ways by adding arguments to the .bar(...) call.
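Alternatively, if you want to keep your original sort-then-slice approach, the FutureWarning itself tells you the fix: slice positionally with .iloc. A minimal sketch with made-up data standing in for the question's info dataframe:

```python
import pandas as pd

# Hypothetical stand-in for the question's `info` dataframe,
# already sorted by budget in descending order
info = pd.DataFrame({"budget": [500, 400, 300, 200, 100],
                     "original_title": list("abcde")})

# .iloc makes the slice explicitly positional, silencing the FutureWarning
x = list(info["original_title"].iloc[:3])
y = list(info["budget"].iloc[:3])
print(x, y)  # ['a', 'b', 'c'] [500, 400, 300]
```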

ValueError: setting an array element with a sequence

This python code:
import numpy,math
import scipy.optimize as optimization
import matplotlib.pyplot as plt

# Create toy data for curve_fit.
zo = numpy.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
mu = numpy.array([0.1, 0.9, 2.2, 2.8, 3.9, 5.1])
sig = numpy.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0])

# Define hubble function.
def Hubble(x, a, b):
    return H0 * m.sqrt(a*(1+x)**2 + 1/2 * a * (1+b)**3)

# Define
def Distancez(x, a, b):
    return c * (1+x) * np.asarray(quad(lambda tmp: 1/Hubble(a, b, tmp), 0, x))

def mag(x, a, b):
    return 5*np.log10(Distancez(x, a, b)) + 25
    #return a+b*x

# Compute chi-square manifold.
Steps = 101  # grid size
Chi2Manifold = numpy.zeros([Steps, Steps])  # allocate grid
amin = 0.2  # minimal value of a covered by grid
amax = 0.3  # maximal value of a covered by grid
bmin = 0.3  # minimal value of b covered by grid
bmax = 0.6  # maximal value of b covered by grid
for s1 in range(Steps):
    for s2 in range(Steps):
        # Current values of (a,b) at grid position (s1,s2).
        a = amin + (amax - amin)*float(s1)/(Steps-1)
        b = bmin + (bmax - bmin)*float(s2)/(Steps-1)
        # Evaluate chi-squared.
        chi2 = 0.0
        for n in range(len(xdata)):
            residual = (mu[n] - mag(zo[n], a, b))/sig[n]
            chi2 = chi2 + residual*residual
        Chi2Manifold[Steps-1-s2, s1] = chi2  # write result to grid.
Throws this error message:
ValueError                                Traceback (most recent call last)
<ipython-input-136-d0ef47a881a7> in <module>()
     36             residual = (mu[n] - mag(zo[n], a, b))/sig[n]
     37             chi2 = chi2 + residual*residual
---> 38         Chi2Manifold[Steps-1-s2,s1] = chi2  # write result to grid.

ValueError: setting an array element with a sequence.
Note: If I define a simple mag function such as (a+b*x), I do not get any error message.
In fact, all three functions Hubble, Distancez and mag have to be functions of the redshift z, which is an array.
Do you think I need to redefine all these functions to have an array output? I mean, first create an array of redshifts, so that the output of the functions automatically becomes an array?
I need the output of the Distancez() and mag() functions to be arrays. I managed to do it, simply by changing the upper limit of the integral in the Distancez function from x to x.any(). Now I have an array and this is what I want. However, now I see that the output value of the for example Distance(0.25, 0.5, 0.3) is different from when I just put x in the upper limit of the integral? Any help would be appreciated.
The ValueError is saying that it cannot assign an element of the array Chi2Manifold a value that is a sequence. chi2 is probably a numpy array, because residual is a numpy array, because your mag() function returns a numpy array, all because your Distancez function returns a numpy array -- you are telling it to do this with that np.asarray().
If Distancez() returned a scalar floating-point value you'd probably be set. Do you need to use np.asarray() in Distancez()? Is that actually a 1-element array, or perhaps you intend to reduce it somehow to a scalar? I don't know what your Hubble() function is supposed to do and I'm not an astronomer, but in my experience distances are often scalars ;).
If chi2 is meant to be a sequence or numpy array, you probably want to set an appropriately-sized range of values in Chi2Manifold to chi2.
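A small sketch of what is going on inside Distancez: scipy.integrate.quad returns a (value, abserr) pair, so np.asarray(quad(...)) is a 2-element array, and assigning that to a single grid cell is exactly the "setting an array element with a sequence" error. Keeping only element [0] yields a scalar (the integrand here is a trivial stand-in, not the question's 1/Hubble):

```python
import numpy as np
from scipy.integrate import quad

grid = np.zeros((2, 2))

result = quad(lambda t: t, 0.0, 1.0)  # returns (integral, abs_error)

# This would raise "ValueError: setting an array element with a sequence":
# grid[0, 0] = np.asarray(result)

# Keep only the integral's value to get a scalar:
grid[0, 0] = result[0]
print(grid[0, 0])  # 0.5
```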

Pandas dataframe - multiplying DF's elementwise on same dates - something wrong?

I've been banging my head over this; I just cannot seem to get it right, and I don't understand what the problem is. So I tried the following:
#!/usr/bin/env python
import matplotlib.pyplot as plt
import numpy as np
import quandl
btc_usd_price_kraken = quandl.get('BCHARTS/KRAKENUSD', returns="pandas")
btc_usd_price_kraken.replace(0, np.nan, inplace=True)
plt.plot(btc_usd_price_kraken.index, btc_usd_price_kraken['Weighted Price'])
plt.grid(True)
plt.title("btc_usd_price_kraken")
plt.show()
eur_usd_price = quandl.get('BUNDESBANK/BBEX3_D_USD_EUR_BB_AC_000', returns="pandas")
eur_dkk_price = quandl.get('ECB/EURDKK', returns="pandas")
usd_dkk_price = eur_dkk_price / eur_usd_price
btc_dkk = btc_usd_price_kraken['Weighted Price'] * usd_dkk_price
plt.plot(btc_dkk.index, btc_dkk) # WHY IS THIS [4785 rows x 1340 columns] ???
plt.grid(True)
plt.title("Historic value of 1 BTC converted to DKK")
plt.show()
As you can see in the comment, I don't understand why I get a result (which I'm trying to plot) that has size: [4785 rows x 1340 columns] ?
Anyway, the code results in a lot of error messages, e.g.:
> Traceback (most recent call last): File
> "/usr/lib/python3.6/site-packages/matplotlib/backends/backend_qt5agg.py",
> line 197, in __draw_idle_agg
> FigureCanvasAgg.draw(self) File "/usr/lib/python3.6/site-packages/matplotlib/backends/backend_agg.py",
...
> return _from_ordinalf(x, tz) File "/usr/lib/python3.6/site-packages/matplotlib/dates.py", line 254, in
> _from_ordinalf
> dt = datetime.datetime.fromordinal(ix).replace(tzinfo=UTC) ValueError: ordinal must be >= 1
I read some posts, and I know that when multiplying, Pandas dataframes automatically do an elementwise multiplication only on data pairs where the date is the same (so if one DF has a time series for e.g. 1999-2017 and the other only for e.g. 2012-2015, then only the common dates in 2012-2015 are multiplied, i.e. the intersection of the two data sets). The whole problem is about calculating the btc_dkk variable (the price of Bitcoin in the currency DKK) and plotting it; I don't understand the error messages or how to solve this.
This should work:
usd_dkk_price.multiply(btc_usd_price_kraken['Weighted Price'], axis='index').dropna()
You are multiplying on columns, not on the index (this happens because you are multiplying a dataframe by a series; if you had selected the single column in usd_dkk_price, this would not have happened). Then afterwards just drop the rows with NaN.
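To see the shape blow-up in isolation, here is a toy version of the same mistake with hypothetical numbers (three hand-made rows stand in for the real quandl data):

```python
import pandas as pd

dates = pd.date_range("2021-01-01", periods=3)

# Hypothetical stand-ins: a one-column dataframe and a series on the same dates
usd_dkk = pd.DataFrame({"Value": [6.0, 6.5, 7.0]}, index=dates)
btc_usd = pd.Series([100.0, 200.0, 300.0], index=dates, name="Weighted Price")

# Multiplying a dataframe by a series aligns on *columns* by default, so the
# series' date labels become extra columns and everything is NaN:
wrong = usd_dkk * btc_usd
print(wrong.shape)  # (3, 4) -- one column per label in the union

# Aligning on the index instead gives the intended elementwise product:
right = usd_dkk.multiply(btc_usd, axis="index")
print(right["Value"].tolist())  # [600.0, 1300.0, 2100.0]
```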

numpy change elements matching conditions

For two numpy arrays a, b:
a = [1, 2, 3]
b = [4, 5, 6]
I want to change the elements of a where x < 2.5 to the corresponding elements of b. So I tried
a[a < 2.5] = b
hoping a would become a = [4, 5, 3], but this raises an error:
Traceback (most recent call last):
File "<pyshell#3>", line 1, in <module>
a[a<2.5]=b
ValueError: NumPy boolean array indexing assignment cannot assign 3 input values to the 2 output values where the mask is true
What is the problem?
The issue you're seeing is a result of how masks work on numpy arrays.
When you write
a[a < 2.5]
you get back the elements of a which match the mask a < 2.5. In this case, that will be the first two elements only.
Attempting to do
a[a < 2.5] = b
is an error because b has three elements, but a[a < 2.5] has only two.
An easy way to achieve the result you're after in numpy is to use np.where.
The syntax of this is np.where(condition, valuesWhereTrue, valuesWhereFalse).
In your case, you could write
newArray = np.where(a < 2.5, b, a)
Alternatively, if you don't want the overhead of a new array, you could perform the replacement in-place (as you're trying to do in the question). To achieve this, you can write:
idxs = a < 2.5
a[idxs] = b[idxs]
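Putting both approaches together into a runnable sketch with the arrays from the question:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Out-of-place: take b where the condition holds, a elsewhere
new_array = np.where(a < 2.5, b, a)
print(new_array)  # [4 5 3]

# In-place: apply the same boolean mask to both sides,
# so the two selections have matching lengths
a2 = np.array([1, 2, 3])
mask = a2 < 2.5
a2[mask] = b[mask]
print(a2)  # [4 5 3]
```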

Selecting from pandas dataframe (or numpy ndarray?) by criterion

I find myself coding this sort of pattern a lot:
tmp = <some operation>
result = tmp[<boolean expression>]
del tmp
...where <boolean expression> is to be understood as a boolean expression involving tmp. (For the time being, tmp is always a pandas dataframe, but I suppose that the same pattern would show up if I were working with numpy ndarrays--not sure.)
For example:
tmp = df.xs('A')['II'] - df.xs('B')['II']
result = tmp[tmp < 0]
del tmp
As one can guess from the del tmp at the end, the only reason for creating tmp at all is so that I can use a boolean expression involving it inside an indexing expression applied to it.
I would love to eliminate the need for this (otherwise useless) intermediate, but I don't know of any efficient1 way to do this. (Please, correct me if I'm wrong!)
As second best, I'd like to push off this pattern to some helper function. The problem is finding a decent way to pass the <boolean expression> to it. I can only think of indecent ones. E.g.:
def filterobj(obj, criterion):
    return obj[eval(criterion % 'obj')]
This actually works2:
filterobj(df.xs('A')['II'] - df.xs('B')['II'], '%s < 0')
# Int
# 0 -1.650107
# 2 -0.718555
# 3 -1.725498
# 4 -0.306617
# Name: II
...but using eval always leaves me feeling all yukky 'n' stuff... Please let me know if there's some other way.
1E.g., any approach I can think of involving the filter built-in is probably inefficient, since it would apply the criterion (some lambda function) by iterating, "in Python", over the pandas (or numpy) object...
2The definition of df used in the last expression above would be something like this:
import itertools
import pandas as pd
import numpy as np
a = ('A', 'B')
i = range(5)
ix = pd.MultiIndex.from_tuples(list(itertools.product(a, i)),
                               names=('Alpha', 'Int'))
c = ('I', 'II', 'III')
df = pd.DataFrame(np.random.randn(len(ix), len(c)), index=ix, columns=c)
Because of the way Python works, I think this one's going to be tough. I can only think of hacks which only get you part of the way there. Something like
def filterobj(obj, fn):
    return obj[fn(obj)]
filterobj(df.xs('A')['II'] - df.xs('B')['II'], lambda x: x < 0)
should work, unless I've missed something. Using lambdas this way is one of the usual tricks for delaying evaluation.
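For what it's worth, pandas also accepts a callable directly inside .loc, which gives the same delayed-evaluation effect without a helper; a minimal sketch on a plain Series (the data values are made up):

```python
import pandas as pd

def filterobj(obj, fn):
    return obj[fn(obj)]

s = pd.Series([-1.5, 0.5, -0.3, 2.0])

via_helper = filterobj(s, lambda x: x < 0)

# .loc with a callable: the function receives the Series and
# must return a boolean mask, evaluated only at indexing time
via_loc = s.loc[lambda x: x < 0]
print(via_loc.tolist())  # [-1.5, -0.3]
```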
Thinking out loud: one could make a this object which isn't evaluated but just sticks around as an expression, something like
>>> this
this
>>> this < 3
this < 3
>>> df[this < 3]
Traceback (most recent call last):
File "<ipython-input-34-d5f1e0baecf9>", line 1, in <module>
df[this < 3]
[...]
KeyError: u'no item named this < 3'
and then either special-case the treatment of this into pandas or still have a function like
def filterobj(obj, criterion):
    return obj[eval(str(criterion.subs({"this": "obj"})))]
(with enough work we could lose the eval, this is simply proof of concept) after which something like
>>> tmp = df["I"] + df["II"]
>>> tmp[tmp < 0]
Alpha Int
A 4 -0.464487
B 3 -1.352535
4 -1.678836
Dtype: float64
>>> filterobj(df["I"] + df["II"], this < 0)
Alpha Int
A 4 -0.464487
B 3 -1.352535
4 -1.678836
Dtype: float64
would work. I'm not sure any of this is worth the headache, though, Python simply isn't very conducive to this style.
This is as concise as I could get:
(df.xs('A')['II'] - df.xs('B')['II']).apply(lambda x: x if (x<0) else np.nan).dropna()
Int
0 -4.488312
1 -0.666710
2 -1.995535
Name: II