How to calculate Type I error and Type II error by varying the sample sizes in (5, 10, 15, ..., 195, 200)? And plot these on a graph? - numpy

I have calculated the Type I and Type II errors for the following data:
np.random.seed(1005)
mean = 5 # Population mean
std = 4 # Population std
n = 50 # Sample size
samples = np.random.normal(loc=mean, scale=std, size=n) # Generate the data
print(samples)
Type I error:
sample_mean = np.mean(samples)
X = (sample_mean - 6) / (np.std(samples)/np.sqrt(n))
Type II error:
CI_lower = sample_mean - 1.96*(np.std(samples)/np.sqrt(n))
CI_upper = sample_mean + 1.96*(np.std(samples)/np.sqrt(n))
How would I use these to calculate the Type I and Type II errors while varying the sample size over {5, 10, 15, ..., 195, 200}? I've tried increasing the sample size in a range like this, but I'm not sure if this is the correct way to go:
TT1 = []
ns = np.arange(5, 201, 5)
for i in ns:
    p1 = 2*norm.cdf(-np.abs(X))   # two-sided p-value (the 6* looked like a typo for 2*);
    TT1.append(p1)                # note X still comes from the single n=50 sample above
q1 = 6 - 1.96*(np.std(samples)/np.sqrt(ns))
q2 = 6 + 1.96*(np.std(samples)/np.sqrt(ns))
TT2 = (norm.cdf(q2, loc=5.8, scale=np.std(samples)/np.sqrt(ns))
       - norm.cdf(q1, loc=5.8, scale=np.std(samples)/np.sqrt(ns)))
The code runs, but I'm not sure whether this is the correct way to apply the intervals, or whether I need to regenerate the samples variable for each sample size.
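To vary the sample size properly, the standard error has to be recomputed for each n. Here is a minimal analytic sketch, assuming (as the snippets above suggest, all hypothetical choices) H0: mu = 6, a true mean of 5.8, sigma = 4, and a two-sided test at alpha = 0.05:

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# Assumed setup, taken from the snippets above: H0: mu = 6,
# true mean 5.8, known sigma = 4, two-sided test at alpha = 0.05
mu0, mu_true, sigma, alpha = 6, 5.8, 4, 0.05
z = norm.ppf(1 - alpha / 2)          # ~1.96

ns = np.arange(5, 201, 5)
se = sigma / np.sqrt(ns)             # standard error for each sample size

# Type I error: by construction, a level-alpha test rejects a true H0
# with probability alpha, regardless of n
type1 = np.full(ns.shape, alpha)

# Type II error: probability the sample mean lands inside the
# acceptance region mu0 +/- z*se when the true mean is mu_true
type2 = (norm.cdf(mu0 + z * se, loc=mu_true, scale=se)
         - norm.cdf(mu0 - z * se, loc=mu_true, scale=se))

plt.plot(ns, type1, label="Type I error")
plt.plot(ns, type2, label="Type II error")
plt.xlabel("sample size n")
plt.ylabel("error probability")
plt.legend()
plt.show()
```

The Type II error shrinks only slowly here because the assumed effect size (6 - 5.8 = 0.2) is tiny relative to sigma = 4; with a larger effect it would drop toward zero much faster as n grows.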


How can I delete sub-sub-list elements based on condition?

I am having the following two (2) lists:
lines = [[[0, 98]], [[64], [1,65,69]]]
stations = [[0,1], [0,3,1]]
lines describes the line combinations for getting from 0 to 1, and stations describes the stops visited by choosing those line combinations. For getting from 0 to 1, the following combinations are possible:
Take line 0 or 98 (direct connection)
Take line 64 and then line 1 (1 transfer at station 3)
Take line 64 and then line 65 (1 transfer at station 3)
Take line 64 and then line 69 (1 transfer at station 3)
The len of stations always equals the len of lines. I have used the following code to explode the lists in the way I described previously and store the options in a dataframe df.
import itertools
import pandas as pd

rows = []  # df.append is deprecated (removed in pandas 2.0); collect rows first
for index, (line, station) in enumerate(zip(lines, stations)):
    if len(line) == 1:  # the sublist stores direct connections
        for l in line[0]:
            rows.append({'lines': [l], 'stations': station, 'transfers': 0})
    else:
        for combo in itertools.product(*line):
            rows.append({'lines': list(combo), 'stations': station,
                         'transfers': len(combo) - 1})
df = pd.DataFrame(rows, columns=['lines', 'stations', 'transfers'])
For the sake of the example I add the following 2 columns score1, score2:
df['score1'] = [5,5,5,2,2]
df['score2'] = [2,6,4,3,3]
I want to update the values of lines and stations based on a condition: when score2 > score1, that line/station combination should be removed from the lists.
In this example the direct line 98, as well as the combinations 64,65 and 64,69, should be removed from lines. Therefore, the expected output is the following:
lines = [[[0]], [[64], [1]]]
stations = [[0,1], [0,3,1]]
At this point, I should note that stations is not affected, since at least one combination of lines remains in the sublist. If line 0 had also been removed, the expected output would be:
lines = [[[64], [1]]]
stations = [[0,3,1]]
For starting I have tried a manual solution that works for a single line removal (e.g for removing line 98):
lines = [[y for y in x if y != 98] if isinstance(x, list) else x for x in [y for x in lines for y in x]]
I am having difficulty with line combinations (e.g. 64,65 and 64,69). Do you have any suggestions?
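One possible sketch (not the only way): filter the exploded df by the score condition, then rebuild lines and stations per original sublist. This assumes a hypothetical group column recording which sublist each row came from, and that the surviving rows of a group can be transposed back into per-leg option lists (which holds for your example, but not for an arbitrary subset of a product):

```python
import pandas as pd

# Hypothetical reconstruction of the exploded df from the question,
# with an extra `group` column marking the originating sublist
df = pd.DataFrame({
    "lines":    [[0], [98], [64, 1], [64, 65], [64, 69]],
    "stations": [[0, 1], [0, 1], [0, 3, 1], [0, 3, 1], [0, 3, 1]],
    "group":    [0, 0, 1, 1, 1],
    "score1":   [5, 5, 5, 2, 2],
    "score2":   [2, 6, 4, 3, 3],
})

kept = df[df["score1"] >= df["score2"]]   # drop rows where score2 > score1

new_lines, new_stations = [], []
for _, sub in kept.groupby("group"):      # groups with no survivors vanish,
    combos = sub["lines"].tolist()        # which also drops their stations
    # transpose surviving combinations back into per-leg option lists,
    # de-duplicating each leg while preserving order
    legs = [list(dict.fromkeys(c[pos] for c in combos))
            for pos in range(len(combos[0]))]
    new_lines.append(legs)
    new_stations.append(sub["stations"].iloc[0])

print(new_lines)     # [[[0]], [[64], [1]]]
print(new_stations)  # [[0, 1], [0, 3, 1]]
```

Because empty groups simply disappear from the groupby, the second expected output (dropping the whole first sublist when line 0 is also removed) falls out of the same code.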

Pair rows if conditions on multiple columns are met

Store  Sales Amount  Profit
27     75474         9253
30     367852        84463
55     79416         15401
The resulting output should contain pairs of rows whose Sales Amount is within ±3% of each other OR whose Profit is within ±1.5% of each other.
For example, if store 55's Sales Amount falls within ±3% of another store's Sales Amount, or its Profit within ±1.5% of that store's Profit (say store 27), the output should be:
Output
27-55
df['Lower_range_sales'] = df['Sales_amount'] - df['Sales_amount']*0.03
df['Upper_range_sales'] = df['Sales_amount'] + df['Sales_amount']*0.03
df['Lower_range_Profit'] = df['Profit'] - df['Profit']*0.015
df['Upper_range_Profit'] = df['Profit'] + df['Profit']*0.015
Let's call stores that meet your condition "within range" of each other. There are no such stores in your sample data set, so we need to mock it with some randomized data:
# Generate sample data
import string
from itertools import permutations
import numpy as np
import pandas as pd

store_names = ["".join(p) for p in permutations(string.ascii_uppercase, 2)]
n = 100
np.random.seed(42)
df = pd.DataFrame({
    "Store": np.random.choice(store_names, n, replace=False),
    "Sales Amount": np.random.randint(10_000, 1_000_000, n),
    "Profit": np.random.randint(0, 100_000, n)
})
# ---------------------------------------------------------
# Code
# ---------------------------------------------------------
# Use numpy broadcast to calculate the lower and upper
# limit of each metric
sales = df["Sales Amount"].to_numpy()[:, None]
sales_lower, sales_upper = (sales * [0.97, 1.03]).T
profit = df["Profit"].to_numpy()[:, None]
profit_lower, profit_upper = (profit * [0.985, 1.015]).T
# mask is an n*n matrix, comparing every store against every
# other store to see if they are within range. If mask[i,j]
# is True, the two stores are within range of each other.
mask = (
    ((sales_lower <= sales) & (sales <= sales_upper))
    | ((profit_lower <= profit) & (profit <= profit_upper))
)
# If mask[i,j] is True, then mask[j,i] is also True. Hence
# we only need the upper triangle of the matrix (np.triu =
# triangle upper). And since mask[i,i] is always True, we
# don't need the diagonal either. Hence k=1.
# nonzero() returns the indices of all True elements.
s1, s2 = np.triu(mask, k=1).nonzero()
# Assemble the result
store = df["Store"].to_numpy()
result = pd.DataFrame({
    "Store1": store[s1],
    "Store2": store[s2]
})
Result:
Store1 Store2
IV AL # store IV is within range of store AL
IV CG # store IV is within range of store CG
IV ZU
IV DQ
IV RP
... ...
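If you want the single "27-55"-style column from the question, join the two store columns. A quick sketch, re-creating result with dummy pairs so it runs standalone:

```python
import pandas as pd

# Dummy stand-in for the `result` frame built above
result = pd.DataFrame({"Store1": [27, 30], "Store2": [55, 27]})
result["Output"] = (result["Store1"].astype(str)
                    + "-" + result["Store2"].astype(str))
print(result["Output"].tolist())   # ['27-55', '30-27']
```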

Efficiently plot coordinate set to numpy (bitmap) array excluding offscreen coordinates

This question follows from Efficiently plot set of {coordinate+value}s to (numpy array) bitmap
A solution for plotting from x, y, color lists to a bitmap is given:
import numpy as np
import matplotlib.pyplot as plt

bitmap = np.zeros((10, 10, 3))
s_x = (0, 1, 2)  ## tuple of x coordinates
s_y = (1, 2, 3)  ## tuple of y coordinates
pixel_val = np.array([[0, 0, 1], [1, 0, 0], [0, 1, 0]])  ## RGB values
bitmap[s_x, s_y] = pixel_val
plt.imshow(bitmap)
But how to handle the case where some (x,y) pairs lie outside the bitmap?
Efficiency is paramount.
If I could map offscreen coords to the first row/col of the bitmap, e.g. (-42, 7) -> (0, 7) and (15, -6) -> (15, 0), I could simply black out the first row and column with bitmap[:,0,:] = 0; bitmap[0,:,:] = 0.
Is this doable?
Is there a smarter way?
Are you expecting offscreen coordinates? If so, don't worry; otherwise I was just wondering whether you were using a non-traditional coordinate system, where zero might be in the center of the image for whatever reason.
Anyway, since you can store the coordinates in numpy arrays, mapping outliers to the first row/col is straightforward (e.g. s_x[s_x < 0] = 0). However, I believe the most efficient way is to use a boolean mask to find the indices of the pixels that actually lie inside the bitmap, so that only those are assigned - see below:
bitmap = np.zeros((15, 16, 3))
## generate data
s_x = np.arange(-3, 22)
s_y = np.arange(-4, 21)
np.random.shuffle(s_x)
np.random.shuffle(s_y)
print(s_x)
print(s_y)
pixel_val = np.random.rand(25, 3)
## generation done
use = ((s_x >= 0) & (s_x < bitmap.shape[1])
       & (s_y >= 0) & (s_y < bitmap.shape[0]))
bitmap[s_y[use], s_x[use]] = pixel_val[use]
plt.imshow(bitmap)
output:
coordinates:
[ 8 3 21 9 -2 -3 5 14 -1 18 13 16 0 11 7 1 2 12 15 6 19 10 4 17 20]
[ 8 14 1 9 2 4 7 15 3 -3 19 16 6 -1 0 17 5 13 -2 20 -4 11 10 12 18]
image: [rendered bitmap omitted]
I ran a test where it had to allocate 3,145,728 pixels (four times the size of the bitmap in your other question), around half of which were outside the image. The mask approach took around 140 ms on average, whereas remapping the outliers and then zeroing the first row/col took 200 ms for the same task.
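For comparison, here is a sketch of the clip-then-black-out idea from the question. Note that np.clip also clamps coordinates that overshoot the far edge, so the last row/column must be zeroed as well - and any genuine pixels on those borders are lost, which is another reason to prefer the mask:

```python
import numpy as np

bitmap = np.zeros((15, 16, 3))
rng = np.random.default_rng(0)
s_x = np.arange(-3, 22)
s_y = np.arange(-4, 21)
rng.shuffle(s_x)
rng.shuffle(s_y)
pixel_val = rng.random((25, 3))

# Clamp every coordinate into the frame, paint, then zero out the
# border rows/columns that absorbed the clipped points
cx = np.clip(s_x, 0, bitmap.shape[1] - 1)
cy = np.clip(s_y, 0, bitmap.shape[0] - 1)
bitmap[cy, cx] = pixel_val
bitmap[0, :, :] = 0    # points clamped from y < 0
bitmap[-1, :, :] = 0   # points clamped from y >= height
bitmap[:, 0, :] = 0    # points clamped from x < 0
bitmap[:, -1, :] = 0   # points clamped from x >= width
```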

A type error: 'builtin_function_or_method' object does not support item assignment. How can I fix that?

I'm running the following code and got a type error:
TypeError: 'builtin_function_or_method' object does not support item assignment
Here's my code:
import numpy as np

N_object4 = 4
alpha = np.random.normal(0, 10, N_object4)  # random alpha values (could be greater or less than 0)
pa = np.abs(alpha)
num = pa.argsort()[-3:][::-1]
gs = np.zeros(N_object4).tolist
for i in range(len(num)):  # iterating from largest abs(alpha) to smallest
    if alpha[num[i]] > 0:
        gs[num[i]+1] = 1
The error happens in my last line. How can I fix this error? Thanks!!
I think it's a small typo: tolist is missing its parentheses, so gs is the bound method itself rather than a list. You should call tolist():
gs = np.zeros(N_object4).tolist()
Note also that gs[num[i]+1] can still raise an IndexError when num[i] is the last index (N_object4 - 1), since gs only has N_object4 elements.
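A minimal reproduction makes the difference clear: without the parentheses, the name refers to the bound method object, not to a list, so item assignment fails.

```python
import numpy as np

arr = np.zeros(4)
gs_bad = arr.tolist      # the bound method itself, not a list
gs_good = arr.tolist()   # an actual Python list: [0.0, 0.0, 0.0, 0.0]

try:
    gs_bad[0] = 1        # TypeError: 'builtin_function_or_method'
    raised = False       # object does not support item assignment
except TypeError:
    raised = True

gs_good[0] = 1           # works fine on the real list
```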

Using "rollmedian" function as a input for "arima" function

My time-series data includes date-time and temperature columns as follows:
rn25_29_o:
ambtemp dt
1 -1.96 2007-09-28 23:55:00
2 -2.02 2007-09-28 23:57:00
3 -1.92 2007-09-28 23:59:00
4 -1.64 2007-09-29 00:01:00
5 -1.76 2007-09-29 00:03:00
6 -1.83 2007-09-29 00:05:00
I am using a median smoothing function to smooth out small fluctuations that are caused by imprecise measurements.
unique_timeStamp <- make.time.unique(rn25_29_o$dt)
temp.zoo<-zoo(rn25_29_o$ambtemp,unique_timeStamp)
m.av<-rollmedian(temp.zoo, n,fill = list(NA, NULL, NA))
Subsequently, the output of the median smoothing is used to build a temporal model and obtain predictions with the following code:
te = (x.fit = arima(m.av, order = c(1, 0, 0)))
# fit the model and print the results
x.fore = predict(te, n.ahead=50)
Finally, I encounter the following error:
Error in seq.default(head(tt, 1), tail(tt, 1), deltat) : 'by'
argument is much too small
FYI: the modeling and prediction functions work properly when using the original time-series data.
Please guide me through this error.
The problem is caused by the zoo object produced by the zoo package: passing it to arima() triggers the error, whereas applying rollmedian() to the plain temperature vector avoids it. Thus, the code can be amended to:
Median_ambtemp <- rollmedian(ambtemp, n, fill = list(NA, NULL, NA))
te = (x.fit = arima(Median_ambtemp, order = c(1, 0, 0)))
# fit the model and print the results
x.fore = predict(te, n.ahead=5)