4. Model Overfitting


1. Model Overfitting

- Reasons for Model Overfitting

    - Limited Training Size

    - High Model Complexity

        - Multiple Comparison Procedure

- Classification Errors

    - Training errors (apparent errors)
        - Errors committed on the training set

    - Test errors
        - Errors committed on the test set

    - Generalization errors
        - Expected error of a model over a random selection of records from the same distribution


- Effect of Multiple Comparison Procedure
    - Consider the task of predicting whether the stock market will rise or fall in the next 10 trading days

    - Random guessing:
        - P(correct) = 0.5

    - Make 10 random guesses in a row:
        - P(# correct ≥ 8) = (C(10,8) + C(10,9) + C(10,10)) / 2^10 = 56/1024 ≈ 0.0547

- Approach:
    - Get 50 analysts
    - Each analyst makes 10 random guesses
    - Choose the analyst who makes the largest number of correct predictions

- Probability that at least one analyst makes at least 8 correct predictions:
    - 1 − (1 − 0.0547)^50 ≈ 0.94, so the best analyst appears highly accurate purely by chance (see the sketch below)

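The numbers above can be checked directly; here is a minimal sketch of the arithmetic (plain Python, assuming only the 10-guess, 50-analyst setup described in the example):

```python
from math import comb

# P(a single analyst gets at least 8 of 10 fair coin-flip guesses right)
p_single = sum(comb(10, k) for k in (8, 9, 10)) / 2**10   # 56/1024 ≈ 0.0547

# P(at least one of 50 independent analysts does so)
p_any = 1 - (1 - p_single) ** 50                          # ≈ 0.94

print(f"P(one analyst >= 8 correct) = {p_single:.4f}")
print(f"P(best of 50  >= 8 correct) = {p_any:.4f}")
```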
- Many algorithms employ the following greedy strategy:
    - Initial model: M
    - Alternative model: M’ = M ∪ γ,   where γ is a component to be added to the model (e.g., a test condition of a decision tree)
    - Keep M’ if the improvement Δ(M, M’) > α

- Oftentimes, γ is chosen from a set of alternative components, Γ = {γ₁, γ₂, …, γₖ}

- If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting (see the sketch below)
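A minimal sketch of this greedy loop, assuming a hypothetical `score` function that stands in for whatever evaluation underlies Δ. With 50 noise-only candidates, several typically clear the threshold α anyway, which is the multiple comparison effect at work:

```python
import random

def greedy_build(candidates, score, alpha=0.01):
    """Greedy model growth: repeatedly add the component gamma from the
    candidate set Gamma that most improves score(M), and stop once the best
    improvement Delta(M, M') no longer exceeds alpha."""
    model = set()                       # initial model M
    remaining = set(candidates)
    while remaining:
        best = max(remaining, key=lambda g: score(model | {g}))
        delta = score(model | {best}) - score(model)
        if delta <= alpha:              # no candidate improves enough
            break
        model.add(best)                 # M' = M ∪ {gamma}
        remaining.remove(best)
    return model

# Hypothetical, purely spurious per-component "gains" drawn from noise:
# with 50 candidates, some clear alpha by chance alone.
random.seed(0)
gain = {g: random.gauss(0.0, 0.02) for g in range(50)}
print(greedy_build(range(50), lambda m: sum(gain[g] for g in m)))
```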


- Notes on Overfitting

    - Overfitting results in decision trees that are more complex than necessary

    - Training error does not provide a good estimate of how well the tree will perform on previously unseen records

    - Need ways to estimate generalization error

2. Model Selection

- Performed during model building

- Purpose is to ensure that the model is not overly complex (to avoid overfitting)

- Need to estimate generalization error; three approaches are covered below:
    - Using a validation set
    - Incorporating model complexity
    - Estimating statistical bounds

- Using a Validation Set

    - Divide the training data into two parts:
        - Training set:
            - used for model building
        - Validation set:
            - used for estimating generalization error
            - Note: the validation set is not the same as the test set
    - Drawback:
        - Less data available for training (a splitting sketch follows this list)
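A minimal sketch of such a split, in plain Python with an assumed 25% validation fraction (the fraction is illustrative, not from the notes):

```python
import random

def split_train_validation(records, val_fraction=0.25, seed=42):
    """Hold out part of the training data as a validation set.
    The validation set is used only to estimate generalization error
    during model building; it is not the final test set."""
    records = list(records)
    random.Random(seed).shuffle(records)
    n_val = int(len(records) * val_fraction)
    return records[n_val:], records[:n_val]   # (training part, validation part)

train_part, val_part = split_train_validation(range(100))
print(len(train_part), len(val_part))         # 75 25
```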
- Incorporating Model Complexity
    - Rationale: Occam’s Razor
        - Given two models of similar generalization errors, one should prefer the simpler model over the more complex model

        - A complex model has a greater chance of being fitted accidentally

        - Therefore, one should include model complexity when evaluating a model (a sketch follows this list):

        - Gen. Error(Model) = Train. Error(Model, Train. Data) + α × Complexity(Model)

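A minimal sketch of that trade-off, with a hypothetical α and with complexity measured as leaf nodes per training record (both choices are illustrative, not prescribed by the notes):

```python
def penalized_error(train_error, n_leaves, alpha=0.5, n_train=1000):
    """Occam's-razor-style estimate:
    Gen. Error = Train. Error + alpha * Complexity(Model),
    where complexity here is the number of leaves per training record."""
    return train_error + alpha * n_leaves / n_train

# A bigger tree with slightly lower training error can still look worse
# once its complexity penalty is added in.
print(penalized_error(train_error=0.10, n_leaves=4))    # ≈ 0.102
print(penalized_error(train_error=0.09, n_leaves=40))   # ≈ 0.110
```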
- Estimating Statistical Bounds
    - Example: the pessimistic error estimate of a decision tree T with k leaf nodes (see the sketch below)

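The formula itself is not reproduced in these notes; the sketch below follows the usual textbook form, charging a per-leaf penalty Ω (e.g., Ω = 0.5, an illustrative value) on top of the observed training errors:

```python
def pessimistic_error(n_train_errors, n_leaves, n_train, omega=0.5):
    """Pessimistic error estimate of a decision tree T with k leaf nodes:
    e_gen(T) = (e(T) + omega * k) / N_train,
    i.e., add a penalty omega for every leaf before normalizing."""
    return (n_train_errors + omega * n_leaves) / n_train

# e.g., 10 training errors on 1,000 records with a 30-leaf tree
print(pessimistic_error(10, 30, 1000))   # 0.025
```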
- Model Selection for Decision Trees

    - Pre-Pruning (Early Stopping Rule)
        - Stop the algorithm before it produces a fully-grown tree
        - Typical stopping conditions for a node:
            - Stop if all instances belong to the same class
            - Stop if all the attribute values are the same
        - More restrictive conditions (a sketch of these checks follows this list):
            - Stop if the number of instances is less than some user-specified threshold
            - Stop if the class distribution of the instances is independent of the available features (e.g., using a χ² test)
            - Stop if expanding the current node does not improve impurity measures (e.g., Gini index or information gain)
            - Stop if the estimated generalization error falls below a certain threshold
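A minimal sketch of the first three checks above (the thresholds are hypothetical, and the χ² test is omitted for brevity):

```python
def should_stop(labels, best_gain=None, min_instances=20, min_gain=1e-3):
    """Pre-pruning: decide whether to stop expanding a node.
    `labels` are the class labels of the instances reaching the node;
    `best_gain` is the impurity reduction of the best candidate split,
    if one has already been evaluated."""
    if len(set(labels)) <= 1:                # all instances in the same class
        return True
    if len(labels) < min_instances:          # too few records to split reliably
        return True
    if best_gain is not None and best_gain < min_gain:
        return True                          # expanding does not improve impurity
    return False

print(should_stop(["+"] * 5 + ["-"] * 3))            # True: fewer than 20 records
print(should_stop(["+", "-"] * 50, best_gain=0.2))   # False: keep splitting
```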
    - Post-pruning
        - Grow decision tree to its entirety
        - Subtree replacement (a sketch follows this list)
            - Trim the nodes of the decision tree in a bottom-up fashion
            - If generalization error improves after trimming, replace sub-tree by a leaf node
            - Class label of leaf node is determined from majority class of instances in the sub-tree
        - Subtree raising
            - Replace subtree with the most frequently used branch

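A minimal sketch of subtree replacement, assuming a hypothetical dict-based tree (internal nodes carry a splitting attribute, a majority class, and child subtrees; leaves carry a class label) and using validation error as the stand-in for generalization error:

```python
def predict(tree, record):
    """Route a record down the (hypothetical) dict-based tree."""
    while "leaf" not in tree:
        child = tree["children"].get(record.get(tree["attr"]))
        if child is None:                    # unseen attribute value
            return tree["majority"]
        tree = child
    return tree["leaf"]

def n_errors(tree, records):
    return sum(predict(tree, r) != r["label"] for r in records)

def post_prune(tree, val_records):
    """Bottom-up subtree replacement: first prune the children, then replace
    the subtree with a leaf holding its majority class if that does not
    increase the error on the validation records reaching this node."""
    if "leaf" in tree:
        return tree
    for value, child in tree["children"].items():
        reaching = [r for r in val_records if r.get(tree["attr"]) == value]
        tree["children"][value] = post_prune(child, reaching)
    leaf = {"leaf": tree["majority"]}
    if n_errors(leaf, val_records) <= n_errors(tree, val_records):
        return leaf
    return tree

tree = {"attr": "x", "majority": "+",
        "children": {"a": {"leaf": "+"}, "b": {"leaf": "-"}}}
val = [{"x": "a", "label": "+"}, {"x": "b", "label": "+"}]
print(post_prune(tree, val))                 # collapses to {'leaf': '+'}
```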
3. Model Evaluation

- Purpose:
    - To estimate performance of classifier on previously unseen data (test set)
- Holdout
    - Reserve k% for training and (100-k)% for testing
    - Random subsampling: repeated holdout
- Cross validation
    - Partition data into k disjoint subsets
    - k-fold: train on k-1 partitions, test on the remaining one
    - Leave-one-out: k = n (see the sketch below)
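A minimal sketch of the k-fold partitioning itself (index bookkeeping only; the model-training step is left abstract):

```python
import random

def k_fold_splits(n, k, seed=42):
    """Partition record indices 0..n-1 into k disjoint folds and yield
    (train_indices, test_indices) pairs; each fold is the test set exactly
    once. Setting k = n gives leave-one-out."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

for train_idx, test_idx in k_fold_splits(10, 5):
    print(len(train_idx), len(test_idx))     # 8 2, printed five times
```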
- Variations on Cross-validation

    - Repeated cross-validation
        - Perform cross-validation a number of times
        - Gives an estimate of the variance of the generalization error
    - Stratified cross-validation
        - Guarantee the same percentage of class labels in the training and test sets
        - Important when classes are imbalanced and the sample is small (see the sketch below)
    - Use a nested cross-validation approach for model selection and evaluation
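A minimal sketch of the stratification step: fold membership is assigned class by class, so every fold keeps roughly the overall class proportions, which matters most in the imbalanced, small-sample case above:

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=42):
    """Assign record indices to k folds class by class, so each fold
    preserves the overall class distribution as closely as possible."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

labels = ["+"] * 90 + ["-"] * 10             # imbalanced sample
for fold in stratified_folds(labels, 5):
    print(sum(labels[i] == "-" for i in fold), "negatives in a fold of", len(fold))
```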
