Number of Complete or Full Binary Search Trees

We have keys 1, 2, 3, 4, 5, 6, 7 that are inserted in some order into an empty binary search tree, using the basic BST insert algorithm. Suppose that after inserting these keys, the result is a complete or full binary search tree. How many such full or complete binary search trees can be constructed?

Once the shape of a binary tree is fixed, there is only one way the (unique) keys can be assigned to the nodes of that tree for it to be a valid BST. We can see this from the fact that the in-order traversal of a BST generates the keys in sorted order.
So, let's count the number of shapes for a tree with 7 nodes.
Complete binary tree: 1

      4
    /   \
   2     6
  / \   / \
 1   3 5   7
This is also a full binary tree.
Other full binary trees (all internal nodes have two children; the maximum height is 4): 4

      6
     / \
    4   7
   / \
  2   5
 / \
1   3

    6
   / \
  2   7
 / \
1   4
   / \
  3   5

  2
 / \
1   6
   / \
  4   7
 / \
3   5

  2
 / \
1   4
   / \
  3   6
     / \
    5   7
Any full binary search tree on 7 nodes with smaller height is the complete binary tree that was already counted.
So the total is 5.
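To double-check the count, here is a small brute-force sketch in Python (mine, not part of the original answer): it inserts every permutation of 1..7 into an empty BST using the basic insert algorithm, keeps the trees in which every node has zero or two children, and counts the distinct shapes.

from itertools import permutations

def insert(tree, key):
    # Basic BST insertion into a nested-dict representation.
    if tree is None:
        return {'key': key, 'left': None, 'right': None}
    side = 'left' if key < tree['key'] else 'right'
    tree[side] = insert(tree[side], key)
    return tree

def is_full(tree):
    # Full: every node has either zero or two children.
    if tree is None:
        return True
    if (tree['left'] is None) != (tree['right'] is None):
        return False
    return is_full(tree['left']) and is_full(tree['right'])

def shape(tree):
    # The keys are determined by the shape, so the shape alone identifies the tree.
    if tree is None:
        return None
    return (shape(tree['left']), shape(tree['right']))

shapes = set()
for perm in permutations(range(1, 8)):
    root = None
    for key in perm:
        root = insert(root, key)
    if is_full(root):
        shapes.add(shape(root))

print(len(shapes))  # 5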

Related

Is it possible to calculate a feature matrix only for test data?

I have more than 100,000 rows of training data with timestamps and would like to calculate a feature matrix for new test data, of which there are only 10 rows. Some of the features in the test data will end up aggregating some of the training data. I need the implementation to be fast since this is one step in a real-time inference pipeline.
I can think of two ways this can be implemented:
1) Concatenating the train and test entity sets, running DFS, and then keeping only the last 10 rows and throwing away the rest. This is very time consuming. Is there a way to calculate features for a subset of an entity set while using data from the entire entity set?
2) Using the steps outlined in the "Calculating Feature Matrix for New Data" section on the Featuretools Deployment page. However, as demonstrated below, this doesn't seem to work.
Create all/train/test entity sets:
import featuretools as ft

data = ft.demo.load_mock_customer(n_customers=3, n_sessions=15)
df_sessions = data['sessions']

# Create all/train/test entity sets.
all_es = ft.EntitySet(id='sessions')
train_es = ft.EntitySet(id='sessions')
test_es = ft.EntitySet(id='sessions')

all_es = all_es.entity_from_dataframe(
    entity_id='sessions',
    dataframe=df_sessions,  # all sessions
    index='session_id',
    time_index='session_start',
)

train_es = train_es.entity_from_dataframe(
    entity_id='sessions',
    dataframe=df_sessions.iloc[:10],  # first 10 sessions
    index='session_id',
    time_index='session_start',
)

test_es = test_es.entity_from_dataframe(
    entity_id='sessions',
    dataframe=df_sessions.iloc[10:],  # last 5 sessions
    index='session_id',
    time_index='session_start',
)

# Normalise customer entities so we can group by customers.
all_es = all_es.normalize_entity(base_entity_id='sessions',
                                 new_entity_id='customers',
                                 index='customer_id')
train_es = train_es.normalize_entity(base_entity_id='sessions',
                                     new_entity_id='customers',
                                     index='customer_id')
test_es = test_es.normalize_entity(base_entity_id='sessions',
                                   new_entity_id='customers',
                                   index='customer_id')
Set cutoff_time since we are dealing with data with timestamps:
cutoff_time = (df_sessions
               .filter(['session_id', 'session_start'])
               .rename(columns={'session_id': 'instance_id',
                                'session_start': 'time'}))
Calculate feature matrix for all data:
feature_matrix, features_defs = ft.dfs(entityset=all_es,
                                       cutoff_time=cutoff_time,
                                       target_entity='sessions')
display(feature_matrix.filter(['customer_id', 'customers.COUNT(sessions)']))
            customer_id  customers.COUNT(sessions)
session_id
1                     3                          1
2                     3                          2
3                     1                          1
4                     2                          1
5                     2                          2
6                     2                          3
7                     2                          4
8                     1                          2
9                     2                          5
10                    1                          3
11                    1                          4
12                    2                          6
13                    3                          3
14                    1                          5
15                    3                          4
Calculate feature matrix for train data:
feature_matrix, features_defs = ft.dfs(entityset=train_es,
                                       cutoff_time=cutoff_time.iloc[:10],
                                       target_entity='sessions')
display(feature_matrix.filter(['customer_id', 'customers.COUNT(sessions)']))
            customer_id  customers.COUNT(sessions)
session_id
1                     3                          1
2                     3                          2
3                     1                          1
4                     2                          1
5                     2                          2
6                     2                          3
7                     2                          4
8                     1                          2
9                     2                          5
10                    1                          3
Calculate feature matrix for test data (using the method shown in the "Calculating Feature Matrix for New Data" section on the Featuretools Deployment page):
feature_matrix = ft.calculate_feature_matrix(features=features_defs,
                                             entityset=test_es,
                                             cutoff_time=cutoff_time.iloc[10:])
display(feature_matrix.filter(['customer_id', 'customers.COUNT(sessions)']))
            customer_id  customers.COUNT(sessions)
session_id
11                    1                          1
12                    2                          1
13                    3                          1
14                    1                          2
15                    3                          2
As you can see, the feature matrix generated from train_es matches the first 10 rows of the feature matrix generated from all_es. However, the feature matrix generated from test_es doesn't match the corresponding rows from the feature matrix generated from all_es.
You can control which instances you want to generate features for with the cutoff_time dataframe (or the instance_ids argument in DFS if the cutoff time is a single datetime). Featuretools will only generate features for instances whose IDs are in the cutoff time dataframe and will ignore all others:
feature_matrix, features_defs = ft.dfs(entityset=all_es,
                                       cutoff_time=cutoff_time.iloc[10:],
                                       target_entity='sessions')
display(feature_matrix.filter(['customer_id', 'customers.COUNT(sessions)']))
            customer_id  customers.COUNT(sessions)
session_id
11                    1                          4
12                    2                          6
13                    3                          3
14                    1                          5
15                    3                          4
The method in "Calculating Feature Matrix for New Data" is useful when you want to calculate the same features on entirely new data. All the same features will be created, but data isn't shared between the entity sets. That doesn't work in this case, since the goal is to use all the data but generate features only for certain instances.
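For completeness, here is a sketch (mine, not from the original answer) of the instance_ids route mentioned above, which applies when a single cutoff datetime covers every instance; the timestamp below is a hypothetical placeholder, not a value taken from the demo data:

import pandas as pd

feature_matrix, features_defs = ft.dfs(
    entityset=all_es,
    target_entity='sessions',
    instance_ids=[11, 12, 13, 14, 15],  # only these sessions get features
    cutoff_time=pd.Timestamp('2014-01-01 03:00:00'),  # hypothetical cutoff
)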

Excel xlsx file modified into a dataframe is not recognized by an R package that uses dataframe

I uploaded an Excel xlsx file and then created a dataframe, converting the numeric variables into categories. When I run an R package that uses the dataframe, the output shows the following error:
> library(DiallelAnalysisR)
> Griffing(Yield, Rep, Cross1, Cross2, GriffingData41, 4, 1)
Error in `$<-.data.frame`(`*tmp*`, "Trt", value = character(0)) :
replacement has 0 rows, data has 20
When I call the str() function, it shows the numeric columns converted into categories, as below.
> str(GriffingData41)
'data.frame': 20 obs. of 4 variables:
$ CROSS1: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 2 2 2 3 3 4 ...
$ CROSS2: Factor w/ 4 levels "2","3","4","5": 1 2 3 4 2 3 4 3 4 4 ...
$ REP : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 2 ...
$ YIELD : num 11.9 14.5 9 13.5 20.5 9.8 16.5 22.1 18.4 19.4 ...
Is this a problem in my dataframe creation?
I would appreciate any help with this error. By the way, I am running this in RStudio.
Thank you.
Note: This is not really a solution to my problem, but I managed to move forward by saving my Excel data in CSV format, changing the data type of the specific columns to character, and importing it into RStudio. From there, creating the dataframe and running the R package went smoothly. Still, I am curious why it did not work with the xlsx file.

Labeling rows in pandas using multiple boolean conditions without chaining

I'm trying to label data in the original dataframe based on multiple boolean conditions. This is easy enough when labeling based on one or two conditions, but as I begin requiring multiple conditions, the code becomes difficult to manage. The obvious solution seems to be breaking the code down into intermediate copies, but that causes chained-assignment warnings. Here is one example of the issue.
This is a simplified version of what my data looks like:
import pandas as pd

# Passing a plain list of rows keeps the numeric columns numeric
# (wrapping them in np.array would coerce everything to strings).
df = pd.DataFrame([['ABC', 1, 3, 3, 4],
                   ['std', 0, 0, 2, 4],
                   ['std', 2, 1, 2, 4],
                   ['std', 4, 4, 2, 4],
                   ['std', 2, 6, 2, 6]],
                  columns=['Note', 'Na', 'Mg', 'Si', 'S'])
df
Note Na Mg Si S
0 ABC 1 3 3 4
1 std 0 0 2 4
2 std 2 1 2 4
3 std 4 4 2 4
4 std 2 6 2 6
A standard (std) is located throughout the dataframe. I would like to create a label when the instrument fails. This occurs in the data when:
String condition met (Note = standard/std)
Na>0 & Mg>0
Falls outside of a calculated range for 2 or more elements.
For requirement 3 - Here is an example of a range:
maxMin = pd.DataFrame([['Max', 3, 3, 3, 7],
                       ['Min', 1, 1, 2, 2]],
                      columns=['Note', 'Na', 'Mg', 'Si', 'S'])
maxMin
Note Na Mg Si S
0 Max 3 3 3 7
1 Min 1 1 2 2
Calculating the out-of-bounds standards:
elements = ['Na', 'Mg', 'Si', 'S']
std = df[(df['Note'].str.contains('std|standard')) & (df['Na'] > 0) & (df['Mg'] > 0)]
std.loc[(std[elements].lt(maxMin.loc[1, elements]) |
         std[elements].gt(maxMin.loc[0, elements])).sum(axis=1) >= 2]
Note Na Mg Si S
3 std 4 4 2 4
Now, I would like to label this datapoint within the original dataframe. Desired result:
Note Na Mg Si S Error
0 ABC 1 3 3 4 False
1 std 0 0 2 4 False
2 std 2 1 2 4 False
3 std 4 4 2 4 True
4 std 2 6 2 6 False
I've tried things like:
df['Error'].loc[std.loc[(std[elements].lt(maxMin.loc[1, elements]) |
                         std[elements].gt(maxMin.loc[0, elements])).sum(axis=1) >= 2].index.values.copy()] = True
That unfortunately causes a chained-assignment warning.
How would you accomplish this without triggering the warning? Most books/tutorials revolve around creating one long expression, but as I dive deeper, I feel there might be a simpler solution. Any input would be appreciated.
I figured out a solution that works for me.
The solution was to use .index.values to create an array of the index labels that pass the boolean conditions. That array can then be used to edit the original dataframe.
# These two conditions can probably be combined.
condition1 = df[(df['Note'].str.contains('std|standard')) & (df['Na'] > .01) & (df['Mg'] > .01)]
# Rows of condition1 that fall outside the bounds of the known values;
# .index.values returns an array of the indices where the condition is true.
OutofBounds = condition1.loc[(condition1[elements].lt(maxMin.loc[1, elements]) |
                              condition1[elements].gt(maxMin.loc[0, elements])).sum(axis=1) >= 2].index.values
OutofBounds
out: array([3], dtype=int64)
Now I can pass the array into the original dataframe:
df.loc[OutofBounds, 'Error']=True
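A compact alternative (my sketch, not part of the original answer) is to build a single boolean mask and assign it in one step, which avoids chained assignment entirely and fills the non-flagged rows with False instead of NaN:

# Combine the string/threshold conditions and the out-of-range count into one mask.
is_std = df['Note'].str.contains('std|standard') & (df['Na'] > 0) & (df['Mg'] > 0)
out_of_range = (df[elements].lt(maxMin.loc[1, elements]) |
                df[elements].gt(maxMin.loc[0, elements])).sum(axis=1) >= 2
df['Error'] = is_std & out_of_range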

5 numbers in an empty binary tree

You insert 5 numbers into an empty binary search tree. The numbers output from applying the in-order scan algorithm to this tree are:
4 7 8 9 11
and the numbers output from applying the level-order scan algorithm to the tree are:
7 4 9 8 11
What is (are) the leaf node value(s) in the tree after you delete node 4?
The first value in the level-order output must be the root, so 7 is the root. The in-order output 4 7 8 9 11 then puts 4 in the left subtree of 7 and 8, 9, 11 in the right subtree. The level-order output continues with 4, 9, so 9 (not 8) is the right child of 7, which makes 8 and 11 the children of 9:

  7
 / \
4   9
   / \
  8   11

In this tree, 4, 8, and 11 are the leaf nodes. Node 4 is itself a leaf, so deleting it simply removes it:

7
 \
  9
 / \
8   11

That leaves 8 and 11 as the leaf nodes.
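As a quick check, here is a small Python sketch (mine, not part of the original answer) that rebuilds the tree and lists the leaves. Inserting the level-order sequence into an empty BST reproduces the tree, since every node is inserted before its descendants.

def insert(tree, key):
    # Basic BST insertion into a nested-dict representation.
    if tree is None:
        return {'key': key, 'left': None, 'right': None}
    side = 'left' if key < tree['key'] else 'right'
    tree[side] = insert(tree[side], key)
    return tree

def inorder(tree):
    if tree is None:
        return []
    return inorder(tree['left']) + [tree['key']] + inorder(tree['right'])

def leaves(tree):
    if tree is None:
        return []
    if tree['left'] is None and tree['right'] is None:
        return [tree['key']]
    return leaves(tree['left']) + leaves(tree['right'])

root = None
for key in [7, 4, 9, 8, 11]:  # the level-order output
    root = insert(root, key)

print(inorder(root))  # [4, 7, 8, 9, 11] -- matches the in-order output
root['left'] = None   # node 4 is a leaf child of the root, so just drop it
print(leaves(root))   # [8, 11]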

Understanding Lingo derived sets

I am completely new to LINGO and I found this example in LINGO.
MODEL:
! A 6 Warehouse 8 Vendor Transportation Problem;
SETS:
WAREHOUSES / WH1 WH2 WH3 WH4 WH5 WH6/: CAPACITY;
VENDORS / V1 V2 V3 V4 V5 V6 V7 V8/ : DEMAND;
LINKS( WAREHOUSES, VENDORS): COST, VOLUME;
ENDSETS
! The objective;
MIN = @SUM( LINKS( I, J):
   COST( I, J) * VOLUME( I, J));
! The demand constraints;
@FOR( VENDORS( J):
   @SUM( WAREHOUSES( I): VOLUME( I, J)) =
      DEMAND( J));
! The capacity constraints;
@FOR( WAREHOUSES( I):
   @SUM( VENDORS( J): VOLUME( I, J)) <=
      CAPACITY( I));
! Here is the data;
DATA:
CAPACITY = 60 55 51 43 41 52;
DEMAND = 35 37 22 32 41 32 43 38;
COST = 6 2 6 7 4 2 5 9
4 9 5 3 8 5 8 2
5 2 1 9 7 4 3 3
7 6 7 3 9 2 7 1
2 3 9 5 7 2 6 5
5 5 2 2 8 1 4 3;
ENDDATA
END
I have several things that I don't understand in this code.
In the derived set LINKS( WAREHOUSES, VENDORS): COST, VOLUME;, how does LINGO know that the LINKS members should be WH1V1, WH1V2, ..., WH1V8, WH2V1, ..., WH6V8?
That is, how does it know that each vendor is connected to all the warehouses when all that is specified is LINKS( WAREHOUSES, VENDORS): COST, VOLUME;?
Is the VOLUME data given here? How has it been obtained?
I used to work a bit with LINGO a long time ago. Things have changed since then, but I have looked up the user manual (LINGO 14); see page 31, which explains how the SETS definition works.
1) All the set members of the Cartesian product WAREHOUSES x VENDORS are generated automatically (by concatenating the labels, considering all combinations); see the sketch below.
Now, if a certain warehouse-vendor pair should not be connected, its COST parameter should be left undefined. Look for 'Omitting Values in a Data Section' in the user manual, page 118. You need to use commas as separators in the COST matrix and leave an empty field (e.g. 5, 5, , 6, ...).
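To make point 1 concrete, here is an illustrative sketch in Python (mine, not LINGO): the derived set is simply the Cartesian product of the two primitive sets, so COST and VOLUME each get one entry per (warehouse, vendor) pair.

from itertools import product

warehouses = ['WH1', 'WH2', 'WH3', 'WH4', 'WH5', 'WH6']
vendors = ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8']

# LINGO names each member by concatenating the parent labels.
links = [w + v for w, v in product(warehouses, vendors)]
print(links[:3])   # ['WH1V1', 'WH1V2', 'WH1V3']
print(len(links))  # 48: every warehouse paired with every vendor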
2) VOLUME is a variable, not a parameter. The values of VOLUME are found by the solver; they will represent the optimal shipment volumes (where every vendor gets what they demand, and the total shipping cost is minimal).