1. Data Mining (2)
5. Data Preprocessing
- 5.1 Aggregation
    - Combining two or more attributes (or objects) into a single attribute (or object)
    - Purpose
        - Data reduction
            - Reduce the number of attributes or objects
        - Change of scale
            - Cities aggregated into regions, states, countries, etc.
            - Days aggregated into weeks, months, or years
        - More “stable” data
            - Aggregated data tends to have less variability
- Example: Precipitation in Australia
    - This example is based on precipitation in Australia for the period 1982 to 1993.
    - The next slide shows
        - A histogram of the standard deviation of the average monthly precipitation for 3,030 0.5° by 0.5° grid cells in Australia, and
        - A histogram of the standard deviation of the average yearly precipitation for the same locations.
    - The average yearly precipitation has less variability than the average monthly precipitation (a synthetic-data sketch of this effect follows).
    - All precipitation measurements (and their standard deviations) are in centimeters.
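A minimal sketch of the aggregation effect using synthetic monthly data (the actual Australian precipitation values are not reproduced here); averaging twelve monthly values into a yearly figure damps the month-to-month fluctuations:

```python
# Sketch: aggregated data tends to have less variability (synthetic data,
# not the Australian precipitation set from the slides).
import numpy as np

rng = np.random.default_rng(0)

# 12 years x 12 months of simulated monthly precipitation (cm).
monthly = rng.gamma(shape=2.0, scale=3.0, size=(12, 12))

# Aggregate: average monthly precipitation within each year.
yearly_avg = monthly.mean(axis=1)

print("std of monthly values:  ", monthly.std())
print("std of yearly averages: ", yearly_avg.std())  # ~1/sqrt(12) smaller
```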
 
- 5.2 Sampling
    - Sampling is the main technique employed for data reduction.
        - It is often used for both the preliminary investigation of the data and the final data analysis.
    - Statisticians often sample because obtaining the entire set of data of interest is too expensive or time consuming.
    - Sampling is typically used in data mining because processing the entire set of data of interest is too expensive or time consuming.
    - The key principle for effective sampling is the following:
        - Using a sample will work almost as well as using the entire data set, if the sample is representative
        - A sample is representative if it has approximately the same properties (of interest) as the original set of data
 
 
 
- Types of Sampling (a short code sketch follows this list)
    - Simple Random Sampling
        - There is an equal probability of selecting any particular item
        - Sampling without replacement
            - As each item is selected, it is removed from the population
        - Sampling with replacement
            - Objects are not removed from the population as they are selected for the sample
            - In sampling with replacement, the same object can be picked more than once
    - Stratified sampling
        - Split the data into several partitions; then draw random samples from each partition
 
 
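A minimal sketch of the three schemes using pandas; the data frame and its group column are made-up examples:

```python
# Sketch: simple random sampling with/without replacement, and
# stratified sampling (made-up data and "group" column).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "value": rng.normal(size=1000),
    "group": rng.choice(["A", "B", "C"], size=1000, p=[0.6, 0.3, 0.1]),
})

without_repl = df.sample(n=100, replace=False, random_state=0)
with_repl = df.sample(n=100, replace=True, random_state=0)  # duplicates possible

# Stratified: draw 10% from each partition so rare groups stay represented.
stratified = df.groupby("group", group_keys=False).sample(frac=0.1, random_state=0)
print(stratified["group"].value_counts())
```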
- 5.3 Dimensionality Reduction
    - Purpose:
        - Avoid curse of dimensionality
        - Reduce amount of time and memory required by data mining algorithms
        - Allow data to be more easily visualized
        - May help to eliminate irrelevant features or reduce noise
 
    - Techniques
        - Principal Components Analysis (PCA)
        - Singular Value Decomposition (SVD)
        - Others: supervised and non-linear techniques
 
- Curse of Dimensionality
    - When dimensionality increases, data becomes increasingly sparse in the space that it occupies
    - Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful (see the sketch below)
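A small numpy demonstration on synthetic uniform data: as dimensionality grows, the nearest and farthest points from a reference point become almost equally distant, so distance-based notions lose contrast.

```python
# Sketch: distance concentration in high dimensions (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    pts = rng.uniform(size=(500, d))
    dists = np.linalg.norm(pts - pts[0], axis=1)[1:]  # skip self-distance
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast={contrast:.3f}")  # shrinks toward 0
```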
- Dimensionality Reduction: PCA
    - Goal is to find a projection that captures the largest amount of variation in data (a short numpy sketch follows)
 
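A minimal PCA sketch via the eigendecomposition of the covariance matrix, on synthetic data (in practice a library routine such as scikit-learn's PCA would be used):

```python
# Sketch: PCA as projection onto the top eigenvectors of the covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))           # synthetic data, 5 attributes
Xc = X - X.mean(axis=0)                 # center each attribute

cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
top2 = eigvecs[:, ::-1][:, :2]          # directions of largest variance

X_reduced = Xc @ top2                   # project onto 2 principal components
print(X_reduced.shape)                  # (200, 2)
```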
- 5.4 Feature Subset Selection
    - Another way to reduce dimensionality of data
    - Redundant features
        - Duplicate much or all of the information contained in one or more other attributes
        - Example: the purchase price of a product and the amount of sales tax paid
    - Irrelevant features
        - Contain no information that is useful for the data mining task at hand
        - Example: a student's ID is often irrelevant to the task of predicting the student's GPA
    - Many techniques have been developed, especially for classification; a simple correlation-based sketch follows
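A minimal sketch of one simple heuristic, dropping near-duplicate (highly correlated) attributes; the 0.95 threshold and the columns are illustrative (real feature subset selection often uses filter, wrapper, or embedded methods):

```python
# Sketch: drop redundant features by pairwise correlation
# (illustrative threshold and made-up data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
price = rng.uniform(10, 100, size=200)
df = pd.DataFrame({
    "price": price,
    "sales_tax": 0.08 * price,             # redundant: duplicates "price"
    "weight": rng.uniform(1, 5, size=200)  # independent attribute
})

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
print(to_drop)                              # ['sales_tax']
reduced = df.drop(columns=to_drop)
```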
- 5.6 Feature Creation
    - Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
    - Three general methodologies:
        - Feature extraction
            - Example: extracting edges from images
        - Feature construction
            - Example: dividing mass by volume to get density
        - Mapping data to new space
            - Example: Fourier and wavelet analysis (construction and Fourier mapping are sketched below)
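A minimal sketch of feature construction (density from made-up mass and volume values) and of mapping a signal into frequency space with numpy's FFT:

```python
# Sketch: feature construction (density = mass / volume) and mapping a
# signal into frequency space (Fourier); all values are made up.
import numpy as np
import pandas as pd

df = pd.DataFrame({"mass": [10.0, 4.0, 8.0], "volume": [2.0, 1.0, 4.0]})
df["density"] = df["mass"] / df["volume"]   # constructed feature

t = np.linspace(0, 1, 256, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)
spectrum = np.abs(np.fft.rfft(signal))      # new-space (frequency) features
print(np.fft.rfftfreq(256, d=1/256)[spectrum.argmax()])  # ~5 Hz peak
```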
 
- 5.7 Discretization and Binarization
    - Discretization is the process of converting a continuous attribute into an ordinal attribute
        - A potentially infinite number of values are mapped into a small number of categories
    - Discretization is commonly used in classification
    - Many classification algorithms work best if both the independent and dependent variables have only a few values
    - We give an illustration of the usefulness of discretization using the Iris data set (a small pandas sketch follows)
 
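A minimal sketch of equal-width discretization on one Iris attribute; the three-bin choice and labels are illustrative stand-ins, not the slides' actual Iris illustration:

```python
# Sketch: equal-width discretization of a continuous Iris attribute
# (3 bins and labels are illustrative choices).
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
petal_len = iris.frame["petal length (cm)"]

binned = pd.cut(petal_len, bins=3, labels=["short", "medium", "long"])
print(binned.value_counts())
```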
- Binarization
    - Binarization maps a continuous or categorical attribute into one or more binary variables
    - Typically used for association analysis
    - Often we first convert a continuous attribute to a categorical attribute and then convert the categorical attribute to a set of binary attributes (sketched below)
        - Association analysis needs asymmetric binary attributes
        - Examples: eye color and height measured as {low, medium, high}
 
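A minimal pandas sketch: height is discretized to {low, medium, high}, then both attributes are expanded into binary columns; the data and bin edges are made up:

```python
# Sketch: continuous -> categorical -> binary attributes (made-up data).
import pandas as pd

df = pd.DataFrame({
    "eye_color": ["brown", "blue", "green", "brown"],
    "height_cm": [152.0, 168.0, 181.0, 175.0],
})

# Step 1: discretize the continuous attribute.
df["height"] = pd.cut(df["height_cm"], bins=[0, 160, 175, 250],
                      labels=["low", "medium", "high"])

# Step 2: one binary attribute per category (asymmetric binary).
binary = pd.get_dummies(df[["eye_color", "height"]])
print(binary)
```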
- 5.8 Attribute Transformation
    - An attribute transform is a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
        - Simple functions: x^k, log(x), e^x, |x|
    - Normalization
        - Refers to various techniques to adjust for differences among attributes in terms of frequency of occurrence, mean, variance, range
        - Takes out unwanted, common signal, e.g., seasonality
    - In statistics, standardization refers to subtracting off the mean and dividing by the standard deviation (sketched below)
 
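A minimal numpy sketch of standardization (z-scores) of each attribute, on made-up data:

```python
# Sketch: standardization (z-score) of each attribute (made-up data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, 200.0], scale=[2.0, 50.0], size=(100, 2))

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0).round(6))  # ~[0, 0]
print(X_std.std(axis=0))            # [1, 1]
```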