Hello,
I am trying to apply the χ² contingency-table test from the exploratory statistics course to the results given in Example 1 on Wikipedia.
Rolling a die 600 times in a row gave the following results:
number rolled    1    2    3    4    5    6
count           88  109  107   94  105   97
The number of degrees of freedom is 6 - 1 = 5.
We wish to test the hypothesis that the die is not rigged, with a risk α = 0.05.
The null hypothesis here is therefore: "The die is balanced".
Considering this hypothesis to be true, the statistic T defined above is:
(88 − 100)²/100 + (109 − 100)²/100 + (107 − 100)²/100 + (94 − 100)²/100 + (105 − 100)²/100 + (97 − 100)²/100 = 3.44
The χ2 distribution with five degrees of freedom gives the value below which we consider the draw to be compliant with a risk α = 0.05: P(T < 11.07) = 0.95.
Since 3.44 < 11.07, we cannot reject the null hypothesis: this statistical data does not allow us to consider that the die is rigged.
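For reference, the hand computation above maps directly onto `scipy.stats.chisquare`, which performs exactly this goodness-of-fit test (a sketch, not part of the original post):

```python
from scipy.stats import chisquare

# observed counts for faces 1..6; expected defaults to uniform (600/6 = 100 per face)
observed = [88, 109, 107, 94, 105, 97]
stat, p = chisquare(observed)
print(stat, p)  # stat = 3.44, p ≈ 0.63 (> 0.05, so the null hypothesis is not rejected)
```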
I tried to retrieve this result with pandas:
import pandas as pd
from scipy.stats import chi2_contingency

dico = {'face': [1, 2, 3, 4, 5, 6], 'numbers': [88, 109, 107, 94, 105, 97]}  # also tried [100, 100, 101, 99, 101, 99]
tab = pd.DataFrame(dico)
print(tab.head(6))
ta = pd.crosstab(tab['face'], tab['numbers'])
print(ta)
test = chi2_contingency(tab)
test
   face  numbers
0     1       88
1     2      109
2     3      107
3     4       94
4     5      105
5     6       97

numbers  88   94   97   105  107  109
face
1         1    0    0    0    0    0
2         0    0    0    0    0    1
3         0    0    0    0    1    0
4         0    1    0    0    0    0
5         0    0    0    1    0    0
6         0    0    1    0    0    0
(4.86, 0.432, 5)
This is not the expected result. (with ta, it is the same)
Then I presented the problem as follows:
dico = {'error': [-12, 9, 7, -6, 5, -3], 'number': [88, 109, 107, 94, 105, 97]}  # also tried [100, 100, 101, 99, 101, 99]
tab = pd.DataFrame(dico)
print(tab.head(6))
ta = pd.crosstab(tab['error'], tab['number'])
test = chi2_contingency(tab)
test
   error  number
0    -12      88
1      9     109
2      7     107
3     -6      94
4      5     105
5     -3      97
(10.94, 0.052, 5)... the same problem. I expect something like (3.44, a p-value between 0.5 and 0.9, 5).
Something is wrong, but what?
Regards,
Leloup
I have the following DataFrame:
num_tra num_ts Year Value
0 0 0 1 100
1 0 0 2 90
2 0 0 3 80
3 0 1 1 90
4 0 1 2 81
5 0 1 3 72
6 1 0 1 81
7 1 0 2 73
8 1 0 3 65
9 1 1 1 73
10 1 1 2 66
11 1 1 3 58
12 2 0 1 142
13 2 0 2 160
14 2 0 3 144
15 2 1 1 128
16 2 1 2 144
17 2 1 3 130
Based on the Multiple Interactions Altair example, I tried to build a chart with two sliders based on the values of columns num_tra [0 to 2] and num_ts [0 to 1], but it doesn't work:
import altair as alt
from vega_datasets import data
base = alt.Chart(df, width=500, height=300).mark_line(color="Red").encode(
    x=alt.X('Year:Q'),
    y='Value:Q',
    tooltip="Value:Q"
)
# Slider filter
tra_slider = alt.binding_range(min=0, max=2, step=1)
ts_slider = alt.binding_range(min=0, max=1, step=1)
slider1 = alt.selection_single(bind=tra_slider, fields=['num_tra'], name="TRA")
slider2 = alt.selection_single(bind=ts_slider, fields=['num_ts'], name="TS")
filter_TRA = base.add_selection(
    slider1, slider2
).transform_filter(
    slider1, slider2
).properties(title="Sensi_TRA")
filter_TRA
=> TypeError: transform_filter() takes 2 positional arguments but 3 were given
No problem with one slider but as mentioned, I wasn't able to combine two or more sliders on the same chart.
If you have any idea, it would be very appreciated.
There are a couple of ways to do this. If you want the filters to be applied sequentially, you can use two transform statements:
filter_TRA = base.add_selection(
    slider1, slider2
).transform_filter(
    slider1
).transform_filter(
    slider2
)
Alternatively, you can use a single transform_filter statement with the & or | operator to filter on the intersection or union of the slider values, respectively:
filter_TRA = base.add_selection(
    slider1, slider2
).transform_filter(
    slider1 & slider2
)
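To see which rows the combined predicate keeps, here is a plain-pandas mock-up of the `slider1 & slider2` intersection on a few of the sample rows (the slider positions `tra, ts` below are hypothetical, not Altair API):

```python
import pandas as pd

# a few sample rows from the question's DataFrame
df = pd.DataFrame({
    'num_tra': [0, 0, 0, 1, 2],
    'num_ts':  [0, 1, 1, 0, 1],
    'Year':    [1, 1, 2, 1, 1],
    'Value':   [100, 90, 81, 81, 128],
})

tra, ts = 0, 1  # hypothetical slider positions
# the & operator keeps rows matching BOTH slider values
mask = (df['num_tra'] == tra) & (df['num_ts'] == ts)
print(df[mask]['Value'].tolist())  # [90, 81]
```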
I have the following function:
def sum(x):  # note: this shadows Python's built-in sum
    oneS = x.iloc[0:len(x)//10].agg('sum')
    twoS = x.iloc[len(x)//10:2*len(x)//10].agg('sum')
    threeS = x.iloc[2*len(x)//10:3*len(x)//10].agg('sum')
    fourS = x.iloc[3*len(x)//10:4*len(x)//10].agg('sum')
    fiveS = x.iloc[4*len(x)//10:5*len(x)//10].agg('sum')
    sixS = x.iloc[5*len(x)//10:6*len(x)//10].agg('sum')
    sevenS = x.iloc[6*len(x)//10:7*len(x)//10].agg('sum')
    eightS = x.iloc[7*len(x)//10:8*len(x)//10].agg('sum')
    nineS = x.iloc[8*len(x)//10:9*len(x)//10].agg('sum')
    tenS = x.iloc[9*len(x)//10:len(x)].agg('sum')  # was :len(x)//10, an empty slice
    return [oneS, twoS, threeS, fourS, fiveS, sixS, sevenS, eightS, nineS, tenS]
How do I assign the outputs of this function to columns of a DataFrame (which already exists)?
The DataFrame I am applying the function to is below:
Cycle Type Time
1 1 101
1 1 102
1 1 103
1 1 104
1 1 105
1 1 106
9 1 101
9 1 102
9 1 103
9 1 104
9 1 105
9 1 106
The DataFrame I want to add the columns to looks like the one below; the new columns OneS, TwoS, ... should be added as shown and filled with the results of the function.
Cycle Type OneS TwoS ThreeS
1 1
9 1
8 1
10 1
3 1
5 2
6 2
7 2
If I write a function for just one value and apply it like the following, it works:
grouped_data['fm']= data_train_bel1800.groupby(['Cycle', 'Type'])['Time'].apply( lambda x: fm(x))
But I want to do it all at once so that it is neat and clear.
You can use:
def f(x):
    out = []
    for i in range(10):
        out.append(x.iloc[i*len(x)//10:(i+1)*len(x)//10].agg('sum'))
    return pd.Series(out)

df1 = (data_train_bel1800.groupby(['Cycle', 'Type'])['Time']
         .apply(f)
         .unstack()
         .add_prefix('new_')
         .reset_index())
print(df1)
Cycle Type new_0 new_1 new_2 new_3 new_4 new_5 new_6 new_7 new_8 \
0 1 1 0 101 102 205 207 209 315 211 211
1 9 1 0 101 102 205 207 209 315 211 211
new_9
0 106
1 106
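To see what `f` computes on its own: on a series whose length is a multiple of 10, the slices are exact deciles (a standalone sanity check, not from the original post):

```python
import pandas as pd

def f(x):
    # sum each tenth of the series, as in the answer above
    out = []
    for i in range(10):
        out.append(x.iloc[i*len(x)//10:(i+1)*len(x)//10].agg('sum'))
    return pd.Series(out)

s = pd.Series(range(1, 21))   # 20 values -> deciles of 2 elements each
print(f(s).tolist())          # [3, 7, 11, 15, 19, 23, 27, 31, 35, 39]
```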
I have a large SQL table and I want to add rows so that all issue ages 40-75 are present, with db_perk and accel_perk filled in via linear interpolation.
Here is a small portion of my data
class gender iss_age dur db_perk accel_perk ext_perk
111 F 40 1 0.1961 0.0025 0
111 F 45 1 0.2985 0.0033 0
111 F 50 1 0.472 0.0065 0
111 F 55 1 0.7075 0.01 0
111 F 60 1 1.0226 0.0238 0
111 F 65 1 1.5208 0.0551 0
111 F 70 1 2.3808 0.1296 0
111 F 75 1 4.0748 0.3242 0
I want my output to look something like this
class gender iss_age dur db_perk accel_perk ext_perk
111 F 40 1 0.1961 0.0025 0
111 F 41 1 0.21656 0.00266 0
111 F 42 1 0.23702 0.00282 0
111 F 43 1 0.25748 0.00298 0
111 F 44 1 0.27794 0.00314 0
111 F 45 1 0.2985 0.0033 0
I basically want every column other than iss_age, db_perk, and accel_perk to repeat the value from the row above.
Is there any way to do this?
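As a sketch of the interpolation step (outside SQL, using pandas on a mock-up of the sample rows above): reindexing on iss_age and interpolating by index gives the linear fill the question describes.

```python
import pandas as pd

# sample rows from the question (one class/gender/dur slice)
df = pd.DataFrame({
    'iss_age':    [40, 45, 50, 55, 60, 65, 70, 75],
    'db_perk':    [0.1961, 0.2985, 0.472, 0.7075, 1.0226, 1.5208, 2.3808, 4.0748],
    'accel_perk': [0.0025, 0.0033, 0.0065, 0.01, 0.0238, 0.0551, 0.1296, 0.3242],
})

# insert the missing ages, then interpolate linearly by the age gap
full = df.set_index('iss_age').reindex(range(40, 76))
full = full.interpolate(method='index').reset_index()
print(full.loc[full['iss_age'] == 41])  # db_perk ≈ 0.2166, accel_perk ≈ 0.00266
```

The constant columns (class, gender, dur, ext_perk) would simply be forward-filled.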
I have a gritty industrial control problem I'm trying to solve with T-SQL.
The goal is to calculate an index position for each of two pallet loading robots, positioned in one of two ranges; 2 to 78 (robot 1) and 4 to 80 (robot 2).
Each robot indexes in steps of 4 so complete coverage of 80 spots on the pallet is achieved. The robots work side by side with a minimum spacing of 2 spots while they move along the pallet.
Boxes of two sizes can be placed on the pallet, one twice as long as the other. If two small boxes are placed side by side taking up 1 spot each, a single larger box can be placed on top, taking up 2 spots, until a maximum height is reached. Thus the spot number for a small box is always odd, the spot number for a large box is always even, and the robot index number is always even. e.g. (see diagram) from index position 14 spots 13 and 15 are loaded, and from index 20 spots 19 and 21 can be loaded.
Robot Index Positions
I need a conversion formula that calculates the Index number for a given Spot and Robot.
The calculated Index column should look like the following:
Spot Robot Index
1 1 2
2 1 2
3 1 2
- - -
13 1 14
14 1 14
15 1 14
16 2 16
17 2 16
18 1 18
19 2 20
- - -
- - -
77 1 78
78 1 78
79 2 80
80 2 80
One way would be to update the Index column with every possible combination of Spot and Robot using a simple CASE WHEN selection, or maybe do lookups on a reference table holding every possible combination. What I would like to explore (if any math wizards are inclined!) is a math formula that calculates the Index value.
So far I've come up with the following by converting a formula developed for use in Excel. The Robot 2 section is incomplete; the 95 to 99 values are for error checking.
UPDATE MovesTable SET [Index] =
CASE
WHEN Robot = 1 THEN
CASE
WHEN Spot%4 = 0 THEN '99'
WHEN Spot = 1 or Spot = 2 or Spot = 3 THEN '02'
WHEN Spot = 5 or Spot = 6 or Spot = 7 THEN '06'
WHEN Spot = 9 or Spot = 10 or Spot = 11 THEN '10'
WHEN Spot%10 = 4 THEN CONCAT(Spot/10,'4')
WHEN Spot < 57 AND (((Spot/10)%2 = 1 AND (Spot%10)%2 = 1) AND (Spot%10 = 3 OR Spot%10 = 5)) THEN CONCAT(Spot/10,'4')
WHEN Spot%10 = 8 THEN CONCAT(Spot/10,'8')
WHEN Spot < 57 AND (((Spot/10)%2 = 1 AND (Spot%10)%2 = 1) AND (Spot%10 = 7 OR Spot%10 = 9)) THEN CONCAT(Spot/10,'8')
WHEN Spot%10 = 2 THEN CONCAT(Spot/10,'2')
WHEN Spot < 57 AND (((Spot/10)%2 = 1 AND (Spot%10)%2 = 0) AND (Spot%10 = 1 OR Spot%10 = 3)) THEN CONCAT(Spot/10,'2')
WHEN Spot%10 = 6 THEN CONCAT(Spot/10,'6')
WHEN Spot < 57 AND (((Spot/10)%2 = 0 AND (Spot%10)%2 = 1) AND (Spot%10 = 5 OR Spot%10 = 7)) THEN CONCAT(Spot/10,'6')
WHEN Spot%10 = 0 THEN CONCAT(Spot/10,'')
WHEN Spot = 49 THEN '50'
WHEN Spot < 57 AND (((Spot/10)%2 = 0 AND (Spot%10)%2 = 1) AND Spot%10 = 9) THEN '30'
WHEN Spot < 57 AND (((Spot/10)%2 = 1 AND (Spot%10)%2 = 1) AND Spot%10 = 1) THEN CONCAT(Spot/10,'0')
WHEN Spot > 56 AND (((Spot/10)%2 = 1 AND (Spot%10)%2 = 1) AND (Spot%10 = 7 OR Spot%10 = 9)) THEN CONCAT(Spot/10,'8')
WHEN Spot > 56 AND (((Spot/10)%2 = 0 AND (Spot%10)%2 = 1) AND (Spot%10 = 1 OR Spot%10 = 3)) THEN CONCAT(Spot/10,'2')
WHEN Spot > 56 AND (((Spot/10)%2 = 0 AND (Spot%10)%2 = 1) AND (Spot%10 = 5 OR Spot%10 = 7)) THEN CONCAT(Spot/10,'6')
ELSE '98'
END
ELSE
CASE
WHEN Robot = 2 THEN
CASE
WHEN (Spot%2 = 0 AND Spot%4 <> 0) OR (Spot = 1 OR Spot = 2) THEN '97'
WHEN Spot = 4 then '04'
WHEN Spot = 8 then '08'
WHEN Spot%4 = 0 THEN Spot
WHEN Spot = 2 OR Spot = 5 THEN '05'
WHEN Spot = 7 OR Spot = 9 THEN '08'
WHEN Spot = 19 THEN '20'
WHEN Spot = 39 THEN '40'
WHEN Spot = 59 THEN '60'
ELSE '96'
END
ELSE '95'
END
END
I tried to solve this mathematically, rather than by analyzing cases, etc. It matches all of your sample results:
declare @t table (Spot int, Robot int, [Index] int)
insert into @t(Spot,Robot,[Index]) values
(1 ,1 , 2 ),
(2 ,1 , 2 ),
(3 ,1 , 2 ),
(13 ,1 ,14 ),
(14 ,1 ,14 ),
(15 ,1 ,14 ),
(16 ,2 ,16 ),
(17 ,2 ,16 ),
(18 ,1 ,18 ),
(19 ,2 ,20 ),
(77 ,1 ,78 ),
(78 ,1 ,78 ),
(79 ,2 ,80 ),
(80 ,2 ,80 )
select *,
CONVERT(int,
ROUND((Spot +
CASE WHEN Robot = 1 THEN 2 ELSE 0 END
)/4.0,0)* 4 -
CASE WHEN Robot = 1 THEN 2 ELSE 0 END
) as Index2
from @t
The logic is "round to the nearest multiple of four" but we use a couple of expressions to offset Robot 1's results by 2.
Results:
Spot Robot Index Index2
----------- ----------- ----------- -----------
1 1 2 2
2 1 2 2
3 1 2 2
13 1 14 14
14 1 14 14
15 1 14 14
16 2 16 16
17 2 16 16
18 1 18 18
19 2 20 20
77 1 78 78
78 1 78 78
79 2 80 80
80 2 80 80
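The same arithmetic is easy to sanity-check outside SQL. Note that T-SQL's ROUND(x, 0) rounds halves away from zero while Python's round() rounds halves to even, so this sketch uses floor(x + 0.5) instead (a verification aid, not part of the original answer):

```python
def index_for(spot, robot):
    # "round (spot + offset) to the nearest multiple of 4, then remove the offset",
    # where robot 1 is offset by 2 (its range is 2..78, robot 2's is 4..80)
    offset = 2 if robot == 1 else 0
    return int((spot + offset) / 4.0 + 0.5) * 4 - offset

# every (Spot, Robot, Index) triple from the sample table
samples = [(1, 1, 2), (2, 1, 2), (3, 1, 2), (13, 1, 14), (14, 1, 14),
           (15, 1, 14), (16, 2, 16), (17, 2, 16), (18, 1, 18), (19, 2, 20),
           (77, 1, 78), (78, 1, 78), (79, 2, 80), (80, 2, 80)]
assert all(index_for(s, r) == ix for s, r, ix in samples)
```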
The following query performs very well at higher zoom levels (bounding box smaller than 2 degrees longitude by 0.5 degrees latitude), but degrades significantly as the bounding box gets larger. The table contains 7-8 million rows of text and location data stored as points in a geometry column.
I have tried different grid configurations, but the performance always degrades once the bounding box gets larger than something like @north=41.123029000000002, @east=-72.935406, @south=40.296503999999999, @west=-75.077740000000006
Any ideas? Thanks ~ Matt
I have included a table of the performance at different zoom levels.
declare @filter geometry
select @filter = GEOMETRY::STGeomFromText(
    'LINESTRING(' + CONVERT(varchar,@west) + ' ' + CONVERT(varchar,@south) + ',' + CONVERT(varchar,@east) + ' ' + CONVERT(varchar,@north) + ')'
    ,4326
).STEnvelope();
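For reference, the string assembly above builds a two-corner LINESTRING whose STEnvelope() is the full axis-aligned bounding box. A Python mock-up of the same concatenation (the function name is hypothetical):

```python
def bbox_linestring(west, south, east, north):
    # same two-corner LINESTRING the T-SQL CONVERT/'+' concatenation builds;
    # STEnvelope() then expands it to the full rectangle
    return f"LINESTRING({west} {south},{east} {north})"

print(bbox_linestring(-180, -90, 180, 90))  # LINESTRING(-180 -90,180 90)
```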
select
x.Id
,x.[Timestamp]
,x.Location
,x.[Text]
from (
select top(1000)
t.Id
,t.[Timestamp]
,t.Location
,t.[Text]
from dbo.Table1 AS t with (nolock, index([SPIX_Table1_Location_HIGH]))
inner join containstable(
dbo.Table1
,[text]
, N'FORMSOF(INFLECTIONAL, "word1") | FORMSOF(INFLECTIONAL, "word2") | FORMSOF(INFLECTIONAL, "wordN")'
) as r on t.Id = r.[KEY]
where t.Location.Filter(@filter) = 1
AND t.[Hour] >= @HourId
order by r.[RANK] desc
) as x
order by x.[Timestamp] desc
option (maxdop 1)
Zoom Level 2 = @north=74.542308000000006,@east=94.21875,@south=-24.370607,@west=-180
Zoom Level 20 = @north=40.711250999999997,@east=-74.006050000000002,@south=40.710847000000001,@west=-74.007096000000004
This is the spatial index:
CREATE SPATIAL INDEX [SPIX_Table1_location_HIGH] ON [dbo].[Table1]
(
[location]
)USING GEOMETRY_GRID
WITH (
BOUNDING_BOX =(-180, -90, 180, 90), GRIDS =(LEVEL_1 = HIGH,LEVEL_2 = HIGH,LEVEL_3 = HIGH,LEVEL_4 = HIGH),
CELLS_PER_OBJECT = 16, FILLFACTOR = 70) ON [PRIMARY]
GO
Request Duration by Zoom Level (1000 random tests):
Level Duration0to5 Duration5to10 Duration10to15 Duration15to20 DurationGreaterThan20
2 0 0 0 0 26
3 0 0 0 0 42
4 0 0 0 0 57
5 0 0 0 0 60
6 0 0 0 0 54
7 0 0 0 1 65
8 0 2 5 6 34
9 0 3 7 10 6
10 5 23 25 14 1
11 13 26 18 3 0
12 17 31 7 0 0
13 48 11 0 0 0
14 48 6 0 0 0
15 47 1 0 0 0
16 57 0 0 0 0
17 48 8 0 0 0
18 44 3 0 0 0
19 63 5 0 0 0
20 47 3 0 0 0
ALL 437 122 62 34 345
Row counts and exec times for a typical query, w/out top(1000), forcing spatial index:
if (@level=2)
select @north=74.542308000000006,@east=94.21875,@south=-24.370607,@west=-180 -- 70,404 rows, 2 minutes w/ spatial
if (@level=3)
select @north=61.978465999999997,@east=-5.451886,@south=9.7052770000000006,@west=-142.56126 -- 57,911 rows, 1m22s w/ spatial
if (@level=4)
select @north=52.614061999999997,@east=-39.729228999999997,@south=26.230861999999998,@west=-108.283917 -- 45,636 rows, 1m23s w/ spatial
if (@level=5)
select @north=46.992624999999997,@east=-56.867901000000003,@south=33.775959999999998,@west=-91.145245000000003 -- 32,386 rows, 26s w/ spatial
if (@level=6)
select @north=43.934699999999999,@east=-65.437236999999996,@south=37.323439,@west=-82.575908999999996 -- 19,998 rows, 13s w/ spatial
if (@level=7)
select @north=42.343530999999999,@east=-69.721905000000007,@south=39.037540999999997,@west=-78.291240999999999 -- 11,256 rows, 13s w/ spatial
if (@level=8)
select @north=41.532438999999997,@east=-71.864238999999998,@south=39.879399999999997,@west=-76.148906999999994 -- 6,147 rows, 4s w/ spatial
if (@level=9)
select @north=41.123029000000002,@east=-72.935406,@south=40.296503999999999,@west=-75.077740000000006 -- 3,667 rows, 3s w/ spatial