CS1500
Algorithms and Data Structures for Engineering, FALL 2012
HW5 Decision Tree
PB1 (EXTRA CREDIT)
Write a non-recursive DFS Traversal function.
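A minimal sketch of one way to do this: a preorder DFS where an explicit stack replaces the recursion. It assumes a binary tree node with lchild/rchild pointers like the treenode class defined in PB2 below, and assumes that printing a node's value counts as "visiting" it.

#include <stack>
#include <iostream>

void dfs(treenode* root) {
    std::stack<treenode*> s;                   // explicit stack replaces the recursion's call stack
    if (root != NULL) s.push(root);
    while (!s.empty()) {
        treenode* n = s.top();
        s.pop();
        std::cout << n->value << std::endl;    // "visit" the node
        if (n->rchild != NULL) s.push(n->rchild);  // push right first...
        if (n->lchild != NULL) s.push(n->lchild);  // ...so the left subtree is visited first
    }
}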
PB2 - DECISION TREE
A. In machine learning, a dataset consists of records (or objects, or
datapoints, or instances), each described by features (or attributes).
For example: if each datapoint is a house, the features might be the
square footage, number of bedrooms, year of construction, purchase
price, etc. A special attribute denoted "label" or "target" is usually
the goal of the classification: knowing the other attributes, the task
is to predict the label.
Read data from here (coming from here)
into a matrix M such that M[i][j] is the j-th attribute value of
datapoint i. There are 178 records, each with 13 attributes, so your M
should have 178 rows and 13 columns. The first attribute is the label,
with three possible label values: 1, 2, or 3.
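A minimal sketch of this reading step, assuming the file is comma-separated with one record per line; the filename "wine.data" and the column count are assumptions to check against the actual file (if the label is stored as an extra field, raise NCOLS accordingly).

#include <fstream>
#include <sstream>
#include <string>
#include <cstdlib>

const int NROWS = 178;
const int NCOLS = 13;   // first column is the label, per the assignment (assumption: check the file)

double M[NROWS][NCOLS];

bool readData(const char* filename) {
    std::ifstream in(filename);                 // e.g., "wine.data" (assumed filename)
    if (!in) return false;
    std::string line, field;
    for (int i = 0; i < NROWS && std::getline(in, line); i++) {
        std::stringstream ss(line);
        for (int j = 0; j < NCOLS && std::getline(ss, field, ','); j++)
            M[i][j] = atof(field.c_str());      // M[i][j] = j-th attribute of datapoint i
    }
    return true;
}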
B. Implement a Decision Tree structure (Machine Learning). The
tree nodes are like the ones in class (parent, lchild, rchild
pointers), but the data consists of an array of 178 values, each either
0 or 1, indicating which datapoints from the set are "present" at the
node. Each node also contains a value, an index of one of the 13
attributes, and a real-valued threshold.
class treenode{
public:
short data[178];        // data[i] is 1 if datapoint i is "present" at this node, 0 otherwise
double value;           // value stored at the node (e.g., the dominant label)
int splitfeature;       // index of the attribute used to split this node
double splitthreshold;  // real-valued threshold used by the split
treenode* parent;
treenode* lchild;
treenode* rchild;
};
At the root, all datapoints are present, so the data array at the root
has all 178 values set to 1. When a node has children, we say that it
"splits": each datapoint present at the parent node has to go either
left or right (so if the root has two children, some datapoints will be
present at the leftchild but not at the rightchild, and vice versa). A
split is characterized by an attribute A (one of the 13) and a
threshold T: out of the datapoints present at the node, the ones for
which attribute A has a value smaller than T become "present" at the
leftchild (and not present at the rightchild); similarly, the
datapoints whose attribute A value is greater than or equal to T become
"present" at the rightchild.
To get you started, for each split read an attribute A and a threshold
T from the keyboard, then perform the split at the current node, as in
the sketch below.
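A sketch of one such keyboard-driven split, assuming the treenode class above and the global matrix M from part A (the function name split is made up for illustration):

#include <iostream>

void split(treenode* node) {
    int A;        // attribute index, read from the keyboard
    double T;     // threshold, read from the keyboard
    std::cout << "enter attribute A and threshold T: ";
    std::cin >> A >> T;

    node->splitfeature = A;
    node->splitthreshold = T;
    node->lchild = new treenode();     // value-initialized: data[] starts all 0, pointers NULL
    node->rchild = new treenode();
    node->lchild->parent = node;
    node->rchild->parent = node;

    for (int i = 0; i < 178; i++) {
        if (!node->data[i]) continue;  // absent at the parent: present at neither child
        if (M[i][A] < T)
            node->lchild->data[i] = 1; // value smaller than T: goes left
        else
            node->rchild->data[i] = 1; // otherwise: goes right
    }
}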
(EXTRA CREDIT) C. The goal of the tree is to create splits such that
the children nodes are less ambiguous about the label attribute than
the parent node. Say one node has 30 datapoints: 10 labeled "1", 10
labeled "2", and 10 labeled "3". A "good split" example (by a certain
attribute A and threshold T) is one that puts the 10 datapoints labeled
"1" at the leftchild and the 20 labeled "2" or "3" at the rightchild.
If a node has datapoints of only one label, then no split is necessary.
Your task is to greedily choose the attribute A and threshold T at each
node so that the tree becomes unambiguous as fast as possible (each
leaf node is dominated by a particular label); one possible approach is
sketched below.
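A sketch of the greedy choice, assuming the matrix M and the split mechanics above. The purity measure (Gini impurity) and the strategy of trying every present datapoint's attribute value as a candidate threshold are common choices assumed here, not prescribed by the assignment; entropy/information gain would be used the same way.

// Gini impurity of the datapoints marked in present[], using the label
// in column 0 of M (label values 1, 2, 3): 0 means pure, higher means mixed.
double gini(short present[178]) {
    int count[4] = {0, 0, 0, 0}, total = 0;
    for (int i = 0; i < 178; i++)
        if (present[i]) { count[(int)M[i][0]]++; total++; }
    if (total == 0) return 0.0;
    double g = 1.0;
    for (int label = 1; label <= 3; label++) {
        double p = (double)count[label] / total;
        g -= p * p;
    }
    return g;
}

// Try every feature and every present datapoint's value as a threshold;
// keep the (A, T) pair minimizing the size-weighted impurity of the two
// children. bestA/bestT are left unchanged if no valid split exists.
void bestSplit(treenode* node, int& bestA, double& bestT) {
    double bestScore = 1e9;
    short left[178], right[178];
    for (int A = 1; A < 13; A++) {            // skip column 0, which holds the label
        for (int k = 0; k < 178; k++) {
            if (!node->data[k]) continue;
            double T = M[k][A];               // candidate threshold
            int nl = 0, nr = 0;
            for (int i = 0; i < 178; i++) {
                left[i] = right[i] = 0;
                if (!node->data[i]) continue;
                if (M[i][A] < T) { left[i] = 1; nl++; }
                else { right[i] = 1; nr++; }
            }
            if (nl == 0 || nr == 0) continue; // degenerate: all points on one side
            double score = nl * gini(left) + nr * gini(right);
            if (score < bestScore) { bestScore = score; bestA = A; bestT = T; }
        }
    }
}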