posthoc tests from function "phytools::phylANOVA" generating tables full of NA values

I have a factor with 9 categories (Biome), a numeric variable (visual), and a pruned phylogenetic tree (pruned.tree) containing all the species involved in these two variables. I run the following line:
phyl_phytools <- phylANOVA(pruned.tree, Biome, visual, posthoc=TRUE, p.adj=c("holm"))
But I obtain the following output:
ANOVA table: Phylogenetic ANOVA
Response: y
Sum Sq Mean Sq F value Pr(>F)
x 107.1753 13.396912 2.829053 0.008
Residual 644.0247 4.735476
P-value based on simulation.
---------
Pairwise posthoc test using method = "holm"
Pairwise t-values:
1 10 13 14 2 3 4 7 9
1 NA NA NA NA NA NA NA NA NA
10 NA NA NA NA NA NA NA NA NA
13 NA NA NA NA NA NA NA NA NA
14 NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA NA
7 NA NA NA NA NA NA NA NA NA
9 NA NA NA NA NA NA NA NA NA
Pairwise corrected P-values:
1 10 13 14 2 3 4 7 9
1 NA NA NA NA NA NA NA NA NA
10 NA NA NA NA NA NA NA NA NA
13 NA NA NA NA NA NA NA NA NA
14 NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA NA
7 NA NA NA NA NA NA NA NA NA
9 NA NA NA NA NA NA NA NA NA
I would like to know what I'm doing wrong in the posthoc tests to get these NA values instead of p-values.
Thanks in advance

Shift row left by leading NaN's without removing all NaN's
How can I remove leading NaN's in pandas when reading in a csv file?
Example code:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'c1': [20, 30, np.nan, np.nan, np.nan, 17, np.nan],
    'c2': [np.nan, 74, 65, np.nan, np.nan, 74, 82],
    'c3': [250, 290, 340, 325, 345, 315, 248],
    'c4': [250, np.nan, 340, 325, 345, 315, 248],
    'c5': [np.nan, np.nan, 340, np.nan, 345, np.nan, 248],
    'c6': [np.nan, np.nan, np.nan, 325, 345, np.nan, np.nan]})
The code displays this
| | c1 | c2 | c3 | c4 | c5 | c6 |
|:-|-----:|-----:|----:|------:|------:|------:|
|0 | 20.0 | NaN | 250 | 250.0 | NaN | NaN |
|1 | 30.0 | 74.0 | 290 | NaN | NaN | NaN |
|2 | NaN | 65.0 | 340 | 340.0 | 340.0 | NaN |
|3 | NaN | NaN | 325 | 325.0 | NaN | 325.0 |
|4 | NaN | NaN | 345 | 345.0 | 345.0 | 345.0 |
|5 | 17.0 | 74.0 | 315 | 315.0 | NaN | NaN |
|6 | NaN | 82.0 | 248 | 248.0 | 248.0 | NaN |
I'd like to remove only the leading NaN's so the result would look like this:
| | c1 | c2 | c3 | c4 | c5 | c6 |
|:-|-----:|-----:|----:|------:|------:|------:|
|0 | 20 | NaN | 250.0 | 250.0 | NaN | NaN |
|1 | 30 | 74.0 | 290.0 | NaN | NaN | NaN |
|2 | 65 | 340.0 | 340.0 | 340.0 | NaN | NaN |
|3 | 325 | 325.0 | NaN | 325.0 | NaN | NaN |
|4 | 345 | 345.0 | 345.0 | 345.0 | NaN | NaN |
|5 | 17 | 74.0 | 315.0 | 315.0 | NaN | NaN |
|6 | 82 | 248.0 | 248.0 | 248.0 | NaN | NaN |
I have tried the following, but it didn't work:
response = pd.read_csv(r'MonthlyPermitReport.csv')
df = pd.DataFrame(response)
df.loc[df.first_valid_index():]
Help please.
You can try this:
s = df.isna().cumprod(axis=1).sum(axis=1)
df.apply(lambda x: x.shift(-s[x.name]), axis=1)
Output:
c1 c2 c3 c4 c5 c6
0 20.0 NaN 250.0 250.0 NaN NaN
1 30.0 74.0 290.0 NaN NaN NaN
2 65.0 340.0 340.0 340.0 NaN NaN
3 325.0 325.0 NaN 325.0 NaN NaN
4 345.0 345.0 345.0 345.0 NaN NaN
5 17.0 74.0 315.0 315.0 NaN NaN
6 82.0 248.0 248.0 248.0 NaN NaN
Details:
s is a Series that counts the number of leading NaNs in each row. isna finds all the NaNs in the dataframe; then, using cumprod along the row axis, we eliminate any NaN that comes after a non-NaN value by multiplying by zero. Lastly, we sum along each row to get the number of places to shift that row.
When using DataFrame.apply with axis=1 (row-wise), the name of the pd.Series passed to the function is the row index of the dataframe, so we can look up the number of periods to shift from the s defined above.
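As a quick check of the mechanics, here is a short sketch (assuming df is the example frame built in the question above) showing the intermediate leading-NaN counts:
# Assuming df is the example frame from the question.
# isna() marks missing cells; cumprod(axis=1) stays at 1 only while every
# cell so far in the row is NaN; sum(axis=1) turns that run into a count.
s = df.isna().cumprod(axis=1).sum(axis=1)
print(s.tolist())   # expected: [0, 0, 1, 2, 2, 0, 1]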
Let us try apply to create the lists, then recreate the dataframe:
out = pd.DataFrame(df.apply(lambda x: [x[x.notna().cumsum() > 0].tolist()], axis=1).str[0].tolist(),
                   index=df.index,
                   columns=df.columns)
Out[102]:
c1 c2 c3 c4 c5 c6
0 20.0 NaN 250.0 250.0 NaN NaN
1 30.0 74.0 290.0 NaN NaN NaN
2 65.0 340.0 340.0 340.0 NaN NaN
3 325.0 325.0 NaN 325.0 NaN NaN
4 345.0 345.0 345.0 345.0 NaN NaN
5 17.0 74.0 315.0 315.0 NaN NaN
6 82.0 248.0 248.0 248.0 NaN NaN
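A note on why the last step restores a rectangular frame: when pd.DataFrame is given rows of unequal length, it right-pads the shorter rows with NaN. A minimal, self-contained illustration (the values here are made up):
import pandas as pd

# Rows of different lengths are right-padded with NaN up to the longest row.
print(pd.DataFrame([[1, 2, 3], [4, 5]], columns=['a', 'b', 'c']))
#    a  b    c
# 0  1  2  3.0
# 1  4  5  NaN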

Pandas df print output

How to sort this dataframe by 'Sex' column alternating F, M?
Name Sex Age Height Weight
0 Alfred M 14 69.0 112.5
1 Alice F 13 56.5 84.0
2 Barbara F 13 65.3 98.0
3 Carol F 14 62.8 102.5
4 Henry M 14 63.5 102.5
5 James M 12 57.3 83.0
6 Jane F 12 59.8 84.5
7 Janet F 15 62.5 112.5
8 Jeffrey M 13 62.5 84.0
9 John M 12 59.0 99.5
10 Joyce F 11 51.3 50.5
11 Judy F 14 64.3 90.0
12 Louise F 12 56.3 77.0
13 Mary F 15 66.5 112.0
14 Philip M 16 72.0 150.0
15 Robert M 12 64.8 128.0
16 Ronald M 15 67.0 133.0
17 Thomas M 11 57.5 85.0
18 William M 15 66.5 112.0
Goal: print the Sex column alternating F and M.
IIUC, try this:
(df.assign(sortkey=df.groupby('Sex').cumcount())
   .sort_values(['sortkey', 'Sex'])
   .drop('sortkey', axis=1))
Output:
Name Sex Age Height Weight
1 Alice F 13 56.5 84.0
0 Alfred M 14 69.0 112.5
2 Barbara F 13 65.3 98.0
4 Henry M 14 63.5 102.5
3 Carol F 14 62.8 102.5
5 James M 12 57.3 83.0
6 Jane F 12 59.8 84.5
8 Jeffrey M 13 62.5 84.0
7 Janet F 15 62.5 112.5
9 John M 12 59.0 99.5
10 Joyce F 11 51.3 50.5
14 Philip M 16 72.0 150.0
11 Judy F 14 64.3 90.0
15 Robert M 12 64.8 128.0
12 Louise F 12 56.3 77.0
16 Ronald M 15 67.0 133.0
13 Mary F 15 66.5 112.0
17 Thomas M 11 57.5 85.0
18 William M 15 66.5 112.0
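To see why this interleaves the two sexes, here is a brief sketch (assuming df is the frame shown in the question): cumcount numbers the rows within each Sex group, so the i-th F and the i-th M get the same sortkey, and 'F' sorts before 'M' whenever the sortkey ties.
# Assuming df is the Name/Sex/Age/Height/Weight frame from the question.
sortkey = df.groupby('Sex').cumcount()
print(sortkey.tolist())
# expected: [0, 0, 1, 2, 1, 2, 3, 4, 3, 4, 5, 6, 7, 8, 5, 6, 7, 8, 9]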

Pandas ffill based on condition

I have a data frame of daily prices with missing values. I have then written a function which gives the week-commencing (Monday) date for each day.
Day WC Monday Price 1 Price 2
1/1/12 1/1/12 44 34
2/1/13 1/1/12 55 34
3/1/12 1/1/12 44 34
4/1/13 1/1/12 NA NA
5/1/12 1/1/12 NA NA
6/1/13 1/1/12 34 NA
7/1/12 1/1/12 33 NA
8/1/13 8/1/12 12 NA
9/1/12 8/1/12 34 NA
10/1/13 8/1/12 23 NA
I want to say: if a price column has only NAs left until the end of its week, then forward-fill with the last value, but only to the end of that incomplete week.
So the expected output is:
Day WC Monday Price 1 Price 2
1/1/12 1/1/12 44 34
2/1/13 1/1/12 55 34
3/1/12 1/1/12 44 34
4/1/13 1/1/12 NA 34
5/1/12 1/1/12 NA 34
6/1/13 1/1/12 34 34
7/1/12 1/1/12 33 34
8/1/13 8/1/12 12 NA
9/1/12 8/1/12 34 NA
10/1/13 8/1/12 23 NA
The idea is to test whether the last row per group is missing, using GroupBy.transform with 'last' on the NaN mask, and then replace the missing values with DataFrame.mask and GroupBy.ffill:
c = ['Price 1', 'Price 2']
m = df.isna().groupby(df['WC Monday'])[c].transform('last')
df[c] = df[c].mask(m, df.groupby('WC Monday')[c].ffill())
print(df)
Day WC Monday Price 1 Price 2
0 1/1/12 1/1/12 44.0 34.0
1 2/1/13 1/1/12 55.0 34.0
2 3/1/12 1/1/12 44.0 34.0
3 4/1/13 1/1/12 NaN 34.0
4 5/1/12 1/1/12 NaN 34.0
5 6/1/13 1/1/12 34.0 34.0
6 7/1/12 1/1/12 33.0 34.0
7 8/1/13 8/1/12 12.0 NaN
8 9/1/12 8/1/12 34.0 NaN
9 10/1/13 8/1/12 23.0 NaN
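For reference, here is a self-contained sketch of the same approach; the frame below is rebuilt from the example in the question, with the dates kept as plain strings:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Day': ['1/1/12', '2/1/13', '3/1/12', '4/1/13', '5/1/12',
            '6/1/13', '7/1/12', '8/1/13', '9/1/12', '10/1/13'],
    'WC Monday': ['1/1/12'] * 7 + ['8/1/12'] * 3,
    'Price 1': [44, 55, 44, np.nan, np.nan, 34, 33, 12, 34, 23],
    'Price 2': [34, 34, 34] + [np.nan] * 7,
})

c = ['Price 1', 'Price 2']
# True for a whole week/column when that week ends on a missing value.
m = df.isna().groupby(df['WC Monday'])[c].transform('last')
# Forward-fill within the week, but only where the week ends incomplete.
df[c] = df[c].mask(m, df.groupby('WC Monday')[c].ffill())
print(df)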

how to encode only categorical data in a dataframe

How can I encode only the categorical columns in this data frame?
Income Length of Residence Median House Value Number of Vehicles Percentage Asian Percentage Black Percentage English Speaking Percentage Hispanic Percentage White MakeDescr SeriesDescr Msrp
1 90000 15.0 F 4 1 1 71 6 81 HYUNDAI Sonata-4 Cyl. 19395.0
2 125000 7.0 H 1 11 1 91 1 81 JEEP Grand Cherokee-V6 29135.0
3 90000 8.0 F 1 1 1 71 6 86 JEEP Liberty 20700.0
4 125000 8.0 F 3 1 1 86 6 86 VOLKSWAGEN Passat-V6 28750.0
5 90000 8.0 F 1 1 1 71 6 81 JEEP Wrangler 20210.0
6 110000 7.0 G 5 6 6 71 6 76 HYUNDAI Santa Fe-V6 25645.0
7 110000 7.0 G 3 11 6 71 6 71 HYUNDAI Sonata-4 Cyl. 15999.0
8 125000 8.0 G 1 1 11 81 6 76 HYUNDAI Santa Fe-V6 23645.0
9 125000 9.0 G 1 6 1 91 1 86 CHEVROLET TRUCK Trailblazer EXT 32040.0
10 110000 8.0 E 2 6 46 81 16 26 JEEP Wrangler-V6 18660.0
11 125000 11.0 G 3 6 1 76 1 86 CHEVROLET TRUCK Silverado 2500 HD 31775.0
12 125000 12.0 G 2 11 6 66 1 71 CHEVROLET Cobalt 13675.0
13 125000 13.0 G 2 1 16 95 6 71 HYUNDAI Veracruz-V6 28600.0
15 110000 11.0 F 5 6 41 61 11 41 HYUNDAI Santa Fe 22499.0
16 125000 9.0 F 2 1 6 91 1 81 HYUNDAI Santa Fe 22499.0
17 125000 8.0 G 2 11 11 66 1 66 MITSUBISHI Endeavor-V6 32602.0
18 110000 12.0 E 1 6 46 81 16 26 HYUNDAI Accent-4 Cyl. 10899.0
19 90000 9.0 F 4 1 6 71 6 81 JEEP Grand Cherokee-6 Cyl. 29080.0
21 125000 8.0 G 1 6 1 76 1 86 MITSUBISHI Endeavor-V6 29302.0
22 110000 12.0 F 2 6 26 66 11 51 HYUNDAI Santa Fe 22499.0
23 90000 9.0 F 1 6 6 66 6 76 HYUNDAI Santa Fe-V6 20995.0
24 125000 9.0 H 1 6 1 91 1 81 HYUNDAI Sonata-V6 18799.0
25 90000 14.0 F 2 1 6 71 11 81 HYUNDAI Elantra-4 Cyl. 13299.0
26 125000 9.0 G 3 1 11 81 6 76 JEEP Grand Cherokee-6 Cyl. 29080.0
27 125000 8.0 H 5 6 1 91 1 81 CHEVROLET TRUCK Trailblazer 29395.0
28 110000 12.0 E 4 6 41 61 11 36 HYUNDAI Sonata-4 Cyl. 15999.0
29 110000 10.0 E 1 6 41 61 11 36 HYUNDAI Santa Fe-V6 20995.0
30 125000 10.0 F 2 6 1 71 6 86 CHEVROLET TRUCK Tahoe 37000.0
32 90000 10.0 F 1 1 1 71 6 86 MITSUBISHI Galant-V6 19997.0
33 125000 12.0 F 1 1 1 86 6 86 CHEVROLET TRUCK Trailblazer 28175.0
... ... ... ... ... ... ... ... ... ... ... ... ...
4451 110000 9.0 F 3 6 41 61 11 36 NISSAN Sentra-4 Cyl. 17990.0
4452 125000 11.0 G 2 1 11 81 6 76 CHEVROLET TRUCK Tahoe 39515.0
4453 125000 8.0 H 1 6 1 91 1 81 HYUNDAI Elantra-4 Cyl. 15195.0
4454 110000 10.0 F 3 6 41 61 11 41 HYUNDAI Genesis-4 Cyl. 26750.0
4455 125000 7.0 H 4 11 1 76 1 76 HYUNDAI Sonata-4 Cyl. 19695.0
4456 125000 9.0 G 5 6 1 76 1 86 NISSAN Altima 22500.0
4457 110000 11.0 E 1 6 46 81 16 26 GMC LIGHT DUTY Denali 51935.0
4458 125000 6.0 H 1 11 1 76 1 76 JEEP Liberty-V6 24865.0
4459 125000 12.0 G 3 1 16 95 6 71 HONDA Accord-V6 26700.0
4460 125000 7.0 F 1 1 1 86 6 86 HYUNDAI Veloster-4 Cyl. 17300.0
4461 90000 10.0 F 2 6 11 66 6 71 CADILLAC SRX-V6 42210.0
4463 110000 8.0 F 3 6 26 61 11 56 GMC LIGHT DUTY Acadia 42390.0
4468 125000 8.0 G 1 1 1 91 1 86 HONDA Pilot-V6 40820.0
4469 125000 10.0 H 5 11 1 91 1 81 TOYOTA Highlander-V6 30695.0
4470 110000 12.0 F 1 6 41 61 11 41 HYUNDAI Elantra-4 Cyl. 15195.0
4473 110000 13.0 F 1 6 21 66 6 61 ACURA TSX 32910.0
4476 125000 9.0 G 1 6 1 76 1 86 BMW X3 36750.0
4482 125000 10.0 H 1 6 1 91 1 81 SUBARU Forester-4 Cyl. 21195.0
4486 125000 11.0 H 2 6 1 91 1 81 GMC LIGHT DUTY Yukon XL 44315.0
4492 125000 10.0 H 2 6 1 91 1 81 BMW 5 Series 53400.0
4493 110000 12.0 G 2 6 6 71 6 76 ACURA TL 33725.0
4494 125000 12.0 F 3 1 1 86 6 86 ACURA TL 33725.0
4495 125000 12.0 F 3 1 1 86 6 86 ACURA TL 33725.0
4496 125000 7.0 G 5 1 11 81 6 76 ACURA TL 33325.0
4497 125000 9.0 G 1 6 1 76 1 86 ACURA TL 33725.0
4498 125000 12.0 G 3 1 11 81 6 76 ACURA TL 33725.0
4499 110000 14.0 G 8 11 6 71 6 71 ACURA TL 33725.0
4501 125000 9.0 G 3 11 6 66 1 71 FORD Taurus-V6 20050.0
4502 110000 2.0 G 4 11 6 71 6 71 DODGE Stratus-4 Cyl. 15910.0
4503 125000 8.0 F 1 1 1 86 6 86 DODGE Stratus-4 Cyl. 19145.0
# Using the standard scikit-learn label encoder.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Encode all string columns. Assuming all categoricals are of type str.
for c in df.select_dtypes(['object']).columns:
    print("Encoding column " + c)
    df[c] = le.fit_transform(df[c])
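As a minimal check of what the loop does, here is a small hypothetical frame with values borrowed from the table above; only the object (string) columns are re-coded, and the numeric column is left untouched.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

sample = pd.DataFrame({
    'Income': [90000, 125000, 90000, 125000],                # numeric, untouched
    'Median House Value': ['F', 'H', 'F', 'G'],              # categorical
    'MakeDescr': ['HYUNDAI', 'JEEP', 'JEEP', 'VOLKSWAGEN'],  # categorical
})

le = LabelEncoder()
for c in sample.select_dtypes(['object']).columns:
    sample[c] = le.fit_transform(sample[c])

print(sample)
#    Income  Median House Value  MakeDescr
# 0   90000                   0          0
# 1  125000                   2          1
# 2   90000                   0          1
# 3  125000                   1          2
As an aside, scikit-learn documents LabelEncoder as intended for target labels; for feature columns, OrdinalEncoder or pandas.get_dummies is usually the better fit.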

sql filter condition, add date filter

I want to adjust the below SQL statement to select a date range.
Date range = this year
call.afdelingoorzaakanalyse.id IN (select distinct c.afdelingoorzaakanalyse
from call c
group by c.afdelingoorzaakanalyse
having count(*) > 250)
call = table with alert message records
afdelingoorzaakanalyse = field with department categories
Thanks!
Structure of call table:
'data.frame': 22227 obs. of 208 variables:
$ cal_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ cal_deleted : int 0 0 0 0 0 0 0 0 0 0 ...
$ cal_insertedby : int NA NA NA NA NA NA NA NA NA NA ...
$ cal_insertedon : POSIXct, format: "2005-09-01 08:35:00" "2005-09-01 08:43:00" "2005-09-01 08:46:00" "2005-09-02 15:21:00" ...
$ cal_updatedby : int 1 1 1 1 1 1 1 1 1 1 ...
$ cal_updatedon : POSIXct, format: "2007-11-20 15:47:17" "2007-11-20 15:47:17" "2007-11-20 15:47:17" "2007-11-20 15:47:17" ...
$ cal_finishedby : int NA NA NA NA NA NA NA NA NA NA ...
$ cal_finishedon : POSIXct, format: "2005-09-08 08:26:00" "2005-09-08 08:27:00" "2005-10-12 13:37:00" "2005-09-12 07:54:00" ...
$ cal_file : int NA NA NA NA NA NA NA NA NA NA ...
$ cal_notes : logi NA NA NA NA NA NA ...
$ cal_workflow : int NA NA NA NA NA NA NA NA NA NA ...
$ cal_status : int 78 78 78 78 78 78 78 78 78 78 ...
$ cal_supplier : int NA NA NA NA NA NA NA NA NA NA ...
$ cal_order : int 16176 16179 16195 16191 16188 16188 16188 16188 16188 16189 ...
$ cal_type : int NA NA NA NA NA NA NA NA NA NA ...
$ cal_problemarea : int 311 311 311 121 311 311 311 311 311 311 ...
$ cal_problemaereadetail : int NA NA NA 333 327 380 123 380 NA 385 ...
$ cal_problemclass : int NA NA NA 125 207 125 125 207 207 202 ...
$ cal_problemsubclass : int NA NA NA 198 NA 197 198 250 250 218 ...
$ cal_repairer : int NA NA NA NA NA NA NA NA NA NA ...
$ cal_causer : int NA NA NA NA NA NA NA NA NA NA ...
$ cal_hours : int NA NA NA NA 32 32 32 32 32 NA ...
$ cal_costshours : int NA NA NA NA 2080 2080 2080 2080 2080 NA ...
$ cal_costsmaterial : int NA NA NA NA NA NA NA NA NA NA ...
$ cal_coststransport : int NA NA NA NA NA NA NA NA NA NA ...
$ cal_totalcostsexp : int NA NA NA NA 2080 2080 2080 2080 2080 NA ...
$ cal_totalcostsreal : int NA NA NA NA NA NA NA NA NA NA ...
$ cal_percentage : int NA NA NA NA NA NA NA NA NA NA ...
$ cal_cause : int NA NA NA NA NA NA NA NA NA NA ...
$ cal_subcause : int NA NA NA NA 161 160 160 160 160 NA ...
$ cal_deadlineherstelactie : POSIXct, format: NA NA NA NA ...
$ cal_deadlineanalyseactie : POSIXct, format: NA NA NA NA ...
$ cal_reference : Factor w/ 1459 levels "-","---","-----",..: NA NA NA NA NA NA NA NA NA NA ...
$ cal_afdelingoorzaakanalyse : int NA NA NA 76 76 76 76 76 76 NA ...
$ cal_afdelinghersteller : int 78 78 78 NA NA NA NA NA NA NA ...
$ cal_koopbriefnr : Factor w/ 5714 levels "'-","-","--",..: NA NA NA NA NA NA NA NA NA NA ...
What I've used to make it work is the following:
call.insertedby.afdeling IN (283,282,281,280,279,101,76,75)
AND call.deleted=FALSE
AND YEAR(call.insertedon) = YEAR(getdate())
Though it is not the original code presented above, the "AND YEAR" part gives me the data set for the current year.