Data Mining Process
Project/Business Understanding:
Identify the potential benefits, risks and
effort of a successful project.
Data Understanding: Is sufficient
relevant data available?
Visual assessment of basic
relationships and properties
Data quality (missing values)
Abnormal cases (outliers)
Data Preparation: Selection,
correction and modification of data
Modeling: Extract knowledge out of
data in the form of a model
Predictive – Explanatory
Evaluation
Deployment
Data Mining Cycle
Data Understanding
Main Goal
Gain general insights about the data that will potentially be
helpful for the further steps in the data analysis process
Not driven exclusively by goals and methods of later steps
Approach data from neutral viewpoint
Never trust data before carrying out simple plausibility
checks
At the end of Data Understanding we know much better
whether the assumptions we made during the Project
Understanding phase concerning: representativeness,
informativeness, and data quality are justified
Visualisation: Overview of basic characteristics of data and
check plausibility
Simple statistics
Outliers, missing values, data quality
Data Visualisation
Bar chart: Frequency distribution for categorical attribute
Histogram: Frequency distribution for numerical attribute
Divide values into bins and show a bar plot of the number of
objects in each bin
Height of each bar indicates the number of objects in bin
Shape of histogram depends on number of bins
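The binning step described above can be sketched as follows. This is a minimal illustration assuming equal-width bins; the function name `histogram_counts` and the sample values are made up for the example and do not come from the notes. It shows the same data producing differently shaped histograms for different numbers of bins.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins.

    Assumes the values are not all identical (otherwise the bin width is 0).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin instead of overflowing.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # made-up sample
coarse = histogram_counts(data, 2)  # two wide bins hide the isolated value 9
fine = histogram_counts(data, 4)    # four bins show a peak and a thin right tail
```

With two bins almost everything lands in the first bar; with four bins the concentration around 3 and the isolated high value become visible.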
Boxplots
Very compact method to visualise distribution of one
attribute
Many boxplots can fit in single plot: Useful for comparing
distributions
Scatterplots
Relationship between two attributes (linear/ non-linear)
Axes represent two considered attributes
Each instance in the dataset is represented by a point
Correlation between attributes
Outliers
With class label info: Separability of classes
Correlation Analysis
Scatterplots can give us an idea about correlations
between pairs of variables
Pearson’s correlation coefficient: Measure of linear
association between 2 numerical attributes. Always
between -1 and 1
Even if a functional dependency exists between two
attributes and the function is monotone, if it is non-linear
then Pearson’s correlation coefficient can be far away from
-1 and 1
Rank correlation coefficients overcome this by relying on
the ordering of the values of the attributes: Spearman’s rho
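The difference between the two coefficients can be illustrated with a small sketch. It assumes a made-up monotone but non-linear dependency, y = exp(x), and no tied values; the helper names `pearson`, `ranks` and `spearman` are chosen for this example only.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """Rank values from 1 to n (assumes no ties, to keep the sketch short)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(xs, ys):
    """Spearman's rho: Pearson's coefficient computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


xs = list(range(1, 11))
ys = [math.exp(x) for x in xs]  # monotone but strongly non-linear
r_pearson = pearson(xs, ys)     # clearly below 1 despite the exact dependency
r_spearman = spearman(xs, ys)   # equal to 1 up to rounding: the ordering is identical
```

Because Spearman's rho only uses the ordering, any strictly increasing transformation of y leaves it unchanged, while Pearson's coefficient changes.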
Outliers
Outlier: A value or a data object that is far away from, or very
different from, most or all of the other data
Intuitive but imprecise definition
It might be worthwhile to exclude outliers from analysis
Some methods are more robust to outliers than others
Categorical Attribute: value that occurs with very low
frequency
Numerical Attribute: Detection much more difficult.
Boxplot, Scatterplot
For multidimensional data much more complicated
approaches need to be used
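For a single numerical attribute, the boxplot-based check mentioned above can be sketched as follows. The rule used here (flag values more than 1.5 times the interquartile range beyond the quartiles) is a common boxplot convention and an assumption of this example; the notes do not specify a rule. The function name `iqr_outliers` and the sample are made up.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Flag values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


ages = [10, 12, 11, 13, 12, 11, 10, 12, 13, 95]  # made-up sample with one suspicious entry
outliers = iqr_outliers(ages)
```

A flagged value is only a candidate: whether it is an error or a genuine extreme case still has to be judged in context.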
Missing Values
Missing values: One of the most important problems in real
applications
There is no single best way of handling missing values
Missing Completely at Random (MCAR): No special
circumstances or special values of the variable in question
lead to higher or lower chances for values to be missing
Missing at Random (MAR): Probability of a missing value in
attribute X depends on some other variable(s) Y, but
conditionally on Y it is independent of the value of X
Nonignorable missing: Occurrence of missing values
directly depends on the true value of the attribute
Distinguish Between Types of Missing Values
Distinction between MCAR and MAR: In case of MAR
other attributes can be used to predict whether value is
missing
Turn considered attribute into binary variable: 1 if value
exists, 0 if it is missing
Build a classifier to predict binary variable using as inputs
other variables
Determine error rate
MCAR: Error rate is approx. equal to proportion of missing
values
MAR: Error rate is significantly lower (it could also be
non-ignorable missing)
In general not possible to distinguish non-ignorable
missing from the other two cases using only available data
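The procedure above can be sketched on toy data. Several things here are assumptions of the example, not part of the notes: the classifier is a one-threshold rule on a single other attribute `y`, the error rate is measured on the training data (a real analysis would use held-out data), and both missingness patterns are made-up deterministic stand-ins.

```python
def stump_error_rate(y, is_missing):
    """Lowest error rate of a one-threshold classifier predicting missingness from y."""
    n = len(y)
    best_errors = n
    for t in sorted(set(y)):
        for missing_side_high in (True, False):
            # Predict "missing" on one side of threshold t, "present" on the other.
            predictions = [(v > t) == missing_side_high for v in y]
            errors = sum(p != m for p, m in zip(predictions, is_missing))
            best_errors = min(best_errors, errors)
    return best_errors / n


y = list(range(20))          # fully observed attribute (made-up)
proportion_missing = 5 / 20  # both patterns below have 5 missing values out of 20

# MAR-like toy pattern: the attribute of interest is missing exactly when y >= 15.
mar_missing = [v >= 15 for v in y]
# MCAR-like toy pattern: missingness is unrelated to the size of y.
mcar_missing = [v % 4 == 2 for v in y]

mar_error = stump_error_rate(y, mar_missing)    # far below the proportion missing
mcar_error = stump_error_rate(y, mcar_missing)  # no better than always predicting "present"
```

An error rate close to the proportion of missing values means the other attribute carries no information about missingness, which is what MCAR predicts.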
Treating Missing Values
Explicit Value: Replace missing entries with a new attribute
value MISSING (nominal attributes)
If the fact that the value is missing carries information
about the value itself (non-ignorable missing) introduction
of new value can help because it can express an intention
not captured by other attributes
Better Approach: Introduce a new binary variable indicating
that the value was missing in the original dataset and then
substitute (impute) the missing value
If neither the other attributes nor the imputed value help but
the fact that the value was missing is important, the binary
variable captures this
If no such missing value pattern is present, the imputed
value can be used without introducing a MISSING value
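A minimal sketch of the indicator-plus-imputation approach follows. Mean imputation is an assumption of this example (the notes do not say how the substitute value is chosen), and the function name `impute_with_indicator` and the `income` column are hypothetical.

```python
def impute_with_indicator(values):
    """Return (imputed_values, was_missing) for a numeric column using None for missing."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    # Binary variable: 1 if the value was missing in the original data, else 0.
    was_missing = [1 if v is None else 0 for v in values]
    # Substitute the mean of the observed values for each missing entry.
    imputed = [mean if v is None else v for v in values]
    return imputed, was_missing


income = [30.0, None, 50.0, 40.0, None]  # made-up column
imputed, was_missing = impute_with_indicator(income)
```

Both resulting columns are passed to the model, so it can use the imputed value, the fact that the value was missing, or both.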
Relevance of Attribute
More realistic problem
Information available: X = H: P(G) high, X = L: P(G) low
General Decision Rule
Given the risk forecast of an applicant, $X \in \{H, L\}$:
$$o(G \mid X = x) = \frac{P(G \mid X = x)}{P(B \mid X = x)} > \frac{\ell}{g}$$
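The decision rule can be sketched in code under stated assumptions: there are only two classes, so P(B | x) = 1 - P(G | x), and the threshold is read as the ratio of a loss ℓ (from accepting a bad applicant) to a gain g (from accepting a good one). That reading of ℓ and g, the function names, and all numbers are assumptions of this example.

```python
def odds(p_good):
    """Odds of being good, o(G | x) = P(G | x) / P(B | x), with P(B | x) = 1 - P(G | x)."""
    return p_good / (1.0 - p_good)


def accept(p_good, loss, gain):
    """Accept the applicant if the odds of being good exceed the threshold loss / gain."""
    return odds(p_good) > loss / gain


# Hypothetical values: a bad applicant costs 5 units, a good one earns 1 unit.
accept_low_risk = accept(p_good=0.9, loss=5.0, gain=1.0)   # odds of about 9 exceed 5
accept_high_risk = accept(p_good=0.8, loss=5.0, gain=1.0)  # odds of about 4 do not
```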
Sensitivity and Specificity
Sensitivity (also called Recall): $TP / (TP + FN)$. High
sensitivity minimises misclassification of Class 1 records
Specificity: $TN / (TN + FP)$. High specificity minimises
misclassification of Class 0 records
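As a small worked example, both measures can be computed from actual and predicted class labels. The function name and the label vectors are made up for illustration; Class 1 is treated as the positive class.

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for binary labels with 1 as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]     # made-up labels
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sensitivity, specificity = sensitivity_specificity(actual, predicted)
# Here TP = 3, FN = 1, TN = 4, FP = 2.
```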
ROC Curve
Critical points on ROC curve
(FPR, TPR)
(0,0): All records classified 0
(1,1): All records classified 1
(0,1): Ideal model
Random Classifier: Diagonal Line
Below diagonal line: Worse than random guessing (predictions
tend to be the opposite of the true class)
Good classifier: As close as possible to upper left corner
Area Under ROC (AUC): Summarises ROC curve into a
single number
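The construction of the curve and its area can be sketched as follows, assuming each record has a classifier score, higher scores indicate Class 1, and no two scores are tied. Points are stored as (FPR, TPR) pairs and the area is computed with the trapezoidal rule; the function names and data are made up for the example.

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by lowering the decision threshold one record at a time."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: all records classified 0
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points  # the last point is (1, 1): all records classified 1


def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


labels = [1, 1, 0, 1, 0, 0]              # made-up true classes
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # made-up classifier scores
points = roc_points(labels, scores)
area = auc(points)
```

A classifier that ranks every Class 1 record above every Class 0 record passes through (0, 1) and has an area of 1; a random ranking gives an area near 0.5.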
Cost-Sensitive Learning
Cost of Misclassification
C(i,j): Cost of misclassifying a pattern from class i to class j
Cost Matrix:
[Table: cost matrix with entries C(i,j) for actual class i and predicted class j]
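A cost matrix can be used to score a set of predictions, as in this minimal sketch. The cost values are hypothetical (missing a Class 1 record is taken to be ten times as costly as a false alarm), and `total_cost` is a name chosen for this example.

```python
# cost[(i, j)]: cost of predicting class j for a record whose actual class is i.
# Hypothetical values; correct predictions cost nothing here.
cost = {(1, 1): 0, (1, 0): 10, (0, 1): 1, (0, 0): 0}


def total_cost(actual, predicted, cost):
    """Sum the misclassification costs over all records."""
    return sum(cost[(a, p)] for a, p in zip(actual, predicted))


actual = [1, 1, 0, 0, 0]     # made-up labels
predicted = [1, 0, 0, 1, 1]
model_cost = total_cost(actual, predicted, cost)  # one miss (10) plus two false alarms (1 each)
```

Two classifiers with the same accuracy can have very different total costs, which is why the cost matrix rather than the error rate drives model choice in this setting.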
Logistic Regression
Increasing variable $x_j$ by 1:
Increases $\log(o(1 \mid x_i))$ by $\beta_j$
Increases $o(1 \mid x_i)$ by a factor of $e^{\beta_j}$
If $x_j$ is binary then $x_j = 1$ increases $o(1 \mid x_i)$ by a factor of $e^{\beta_j}$
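The multiplicative effect on the odds can be checked numerically. This sketch assumes the usual logistic regression form, in which the log-odds equal an intercept plus a weighted sum of the attributes; the coefficient values and the function name `odds_of_success` are made up.

```python
import math


def odds_of_success(x, intercept, betas):
    """Odds o(1 | x) when the log-odds are a linear function of the attributes."""
    linear_predictor = intercept + sum(b * v for b, v in zip(betas, x))
    return math.exp(linear_predictor)


intercept = -1.0     # hypothetical fitted values
betas = [0.7, -0.3]

before = odds_of_success([2.0, 1.0], intercept, betas)
after = odds_of_success([3.0, 1.0], intercept, betas)  # first attribute increased by 1
ratio = after / before  # equals exp(0.7) up to rounding, whatever the starting point
```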
Synopsis Logistic Regression
Linear predictor:
Accommodates quantitative and qualitative variables
(dummy)
Enables transformations and combinations (interactions)
while retaining interpretability. Logistic regression extends
this idea to binomial data
Explanatory model:
Contribution of individual variables
Model comparison – Model selection
Confidence interval (not covered)
Assumes a linear relationship between attribute values and
the log-odds of success
Non-linearities can be overcome using discretisation
Decision Trees
Decision Tree Approach
Ask series of questions about attributes to determine class
Build decision tree from top to bottom (from root to leaves)
Greedy selection of a test attribute
Compute an evaluation measure for all attributes
Select the attribute with the best evaluation
Greedy Strategy
Grows a decision tree by making a series of locally optimal
decisions about how to partition the data
Divide and conquer / recursive descent
Divide examples according to the values of the test attribute
Apply the procedure recursively to the subsets (Hunt’s
algorithm)
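The greedy, recursive procedure above can be sketched for categorical attributes. Gini impurity is used as the evaluation measure, which is an assumption of this example since the notes do not name one; the attribute names and records are made up.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 for a pure node)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_gini(records, attribute, target):
    """Impurity after splitting on attribute, weighted by subset size."""
    groups = {}
    for r in records:
        groups.setdefault(r[attribute], []).append(r[target])
    return sum(len(g) / len(records) * gini(g) for g in groups.values())


def build_tree(records, attributes, target):
    labels = [r[target] for r in records]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    # Greedy step: evaluate every attribute and keep the locally best one.
    best = min(attributes, key=lambda a: weighted_gini(records, a, target))
    remaining = [a for a in attributes if a != best]
    tree = {"attribute": best, "children": {}}
    # Divide and conquer: one recursive call per value of the test attribute.
    for value in {r[best] for r in records}:
        subset = [r for r in records if r[best] == value]
        tree["children"][value] = build_tree(subset, remaining, target)
    return tree


records = [
    {"owns_home": "yes", "student": "no", "default": "no"},
    {"owns_home": "yes", "student": "yes", "default": "no"},
    {"owns_home": "no", "student": "no", "default": "yes"},
    {"owns_home": "no", "student": "yes", "default": "yes"},
]
tree = build_tree(records, ["owns_home", "student"], "default")
```

On this toy data `owns_home` separates the classes perfectly, so it is chosen at the root and both children are leaves.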
Characteristics of Decision Tree Induction
Non-parametric: No assumptions about the type of
probability distributions satisfied by the data
Finding optimal decision tree is computationally infeasible:
Greedy heuristic approaches
Decision tree induction algorithms construct trees quickly
even for very large train sets
Easy to interpret: Especially for small trees
Robust to presence of noise: Especially when methods to
avoid overfitting are employed
Redundant attributes do not adversely affect accuracy
If dataset contains many irrelevant attributes then some
could be accidentally chosen by tree-growing algorithm.
Feature selection can help
Data fragmentation: Number of records at leaf nodes can
become too small to make statistically significant decision
– Impose threshold on minimum number of records per
node
Subtree can be replicated many times within a decision
tree making the model more complex and harder to
interpret
Robust performance w.r.t. choice of impurity measure
Treatment of missing values
Small changes in train set can yield an entirely different tree,
but performance is robust
Performance adversely affected by too many interval-scaled
variables (can be addressed by discretisation)
Artificial Neural Networks (ANN)
ANNs inspired by attempts to model biological neural
systems
Brain consists of a large number of interconnected simple
processing units (neurons)
Learning in human brain takes place by changing the
strength of the synaptic connection between neurons
through repeated stimulation by the same impulse
Perceptron
Perceptron: Simple Model of a Neuron
Each input node is connected via a weighted link to the
summing junction
Weights emulate strength of synaptic connection between
neurons
Training adapts weights to reduce error
Can solve linearly separable problems
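A minimal sketch of perceptron training on a linearly separable problem (the logical AND of two inputs) follows. The update rule is the classic perceptron learning rule; the learning rate, the number of epochs and the function names are choices made for this example.

```python
def predict(weights, bias, x):
    """Summing junction followed by a threshold: output 1 if the weighted sum is positive."""
    s = sum(w * v for w, v in zip(weights, x)) + bias
    return 1 if s > 0 else 0


def train_perceptron(data, epochs=20, learning_rate=1.0):
    """Adapt the weights after each example to reduce the prediction error."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in data:
            error = target - predict(weights, bias, x)  # -1, 0 or +1
            weights = [w + learning_rate * error * v for w, v in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


# Logical AND: the two classes can be separated by a straight line.
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_data)
predictions = [predict(weights, bias, x) for x, _ in and_data]
```

For a problem that is not linearly separable, such as XOR, the same loop never settles on weights that classify every example correctly.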
Artificial Neural Networks
Number of simple processing
units (nodes)
Organised in Layers
Output layer: Returns prediction
Input layer: Receives inputs
Hidden layers: Layers between
input and output layers
Topology: 5 × 3 × 1
Multilayer Perceptrons: Only
Feed-forward connections
More complicated decision boundaries can be
approximated using more nodes and more layers
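A forward pass through a feed-forward network with the 5 × 3 × 1 topology mentioned above can be sketched as follows. The sigmoid activation, the random weights and the input vector are assumptions of this example; a trained network would have weights learned from data.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """Each node computes a weighted sum of its inputs and applies the activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


rng = random.Random(0)  # fixed seed so the untrained example weights are reproducible
w_hidden = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(3)]  # 5 inputs -> 3 hidden
b_hidden = [0.0] * 3
w_output = [[rng.uniform(-1, 1) for _ in range(3)]]                    # 3 hidden -> 1 output
b_output = [0.0]

x = [0.2, 0.4, 0.1, 0.9, 0.5]              # made-up input vector
hidden = layer(x, w_hidden, b_hidden)       # signals only flow forward
output = layer(hidden, w_output, b_output)  # single value, read as the prediction
```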
Design Issues in ANNs
Systems that combine automatic feature extraction with
classification process
By increasing the number of hidden nodes and the number of
hidden layers, ANNs can become very flexible classifiers
Flexibility can easily result in overfitting
Selecting an appropriate topology is critical
No general rule for how to choose the number of hidden
layers and the size of the hidden layers
Small neural networks might not be flexible enough to fit the
data. Large neural networks tend to overfit
Cannot handle missing values
Black box models: Explaining what an ANN has learned is
not straightforward
Very sensitive to chosen feature vector: Variable selection
and preprocessing necessary
Ensemble Methods
Central Idea
Improve accuracy by combining predictions of multiple
classifiers
Conditions for performance improvement
1 Base classifiers (close to) independent
2 Base classifiers better than random guessing
Constructing Ensemble Classifiers: Bagging
Bagging – Bootstrap Aggregating
Create many training sets
through Bootstrapping
(resampling with replacement)
Build classifier for each train set
Use majority vote to predict
Reduces variance of base classifiers
Unstable classifier: Sensitive to minor perturbations in
train data
Bagging reduces generalisation error of unstable classifiers
(Decision trees, Neural networks)
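Can be detrimental for stable/ robust classifiers because
each bootstrap sample contains fewer distinct examples than
the original train set (roughly 63% for large N), so the
effective size of the train set is reduced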
Does not focus on particular instances of training data
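The bagging steps can be sketched with a very simple base classifier. The one-threshold rule on a single attribute, the seed, the number of models and the toy data are all assumptions of this example; in practice the base classifier would typically be a decision tree.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Pick the threshold t with the fewest training errors for the rule: predict 1 if x > t."""
    candidates = [float("-inf")] + sorted({x for x, _ in sample})
    return min(candidates, key=lambda t: sum((x > t) != bool(y) for x, y in sample))


def bootstrap_sample(data, rng):
    """Draw N records with replacement from a train set of size N."""
    return [rng.choice(data) for _ in data]


def bagging(data, n_models, rng):
    """Build one base classifier per bootstrap sample."""
    return [fit_stump(bootstrap_sample(data, rng)) for _ in range(n_models)]


def predict(models, x):
    """Majority vote over the base classifiers (n_models is odd, so there are no ties)."""
    votes = Counter(int(x > t) for t in models)
    return votes.most_common(1)[0][0]


data = [(x, int(x >= 5)) for x in range(10)]  # made-up 1-D train set
rng = random.Random(42)
models = bagging(data, n_models=25, rng=rng)
low, high = predict(models, 0), predict(models, 9)
```

Individual thresholds vary from sample to sample; the vote averages out that variation, which is the variance reduction referred to above.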
Boosting
Each example has a weight; the weights determine the sampling distribution
Initially all weights are equal to 1/N
At each round i = 1,2,...
Draw bootstrap sample D_i based on weights
Base classifier built on D_i and used to classify all examples
from original dataset D
Increase weights of misclassified examples
Misclassified examples more likely to be chosen in
subsequent rounds
Attention focused on difficult to classify examples
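The weighting scheme above can be sketched as follows. The notes do not say by how much weights are increased, so doubling the weight of each misclassified example and renormalising is an arbitrary choice for this example (it is not the AdaBoost update). The base classifier, seed and data are also made up.

```python
import random


def fit_stump(sample):
    """Pick the threshold t with the fewest training errors for the rule: predict 1 if x > t."""
    candidates = [float("-inf")] + sorted({x for x, _ in sample})
    return min(candidates, key=lambda t: sum((x > t) != bool(y) for x, y in sample))


def update_weights(weights, misclassified, factor=2.0):
    """Increase the weights of misclassified examples, then renormalise to sum to 1."""
    raised = [w * factor if m else w for w, m in zip(weights, misclassified)]
    total = sum(raised)
    return [w / total for w in raised]


def boosting_rounds(data, n_rounds, rng):
    n = len(data)
    weights = [1.0 / n] * n  # initially all weights are equal
    models = []
    for _ in range(n_rounds):
        # Weighted bootstrap: heavily weighted examples are more likely to be drawn.
        sample = rng.choices(data, weights=weights, k=n)
        t = fit_stump(sample)
        # Classify all examples from the original dataset, not just the sample.
        misclassified = [(x > t) != bool(y) for x, y in data]
        weights = update_weights(weights, misclassified)
        models.append(t)
    return models, weights


# Made-up 1-D data that no single threshold classifies perfectly.
data = [(0, 0), (1, 0), (2, 1), (3, 0), (4, 1), (5, 1)]
models, weights = boosting_rounds(data, n_rounds=5, rng=random.Random(0))
one_step = update_weights([0.25, 0.25, 0.25, 0.25], [False, True, False, False])
```

How the base classifiers are combined into a final prediction is not covered in the notes and is left out of the sketch.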