- Wiley
Discovering Knowledge in Data: An Introduction to Data Mining
PREFACE
1 INTRODUCTION TO DATA MINING
What Is Data Mining?
Why Data Mining?
Need for Human Direction of Data Mining
Cross-Industry Standard Process: CRISP–DM
Case Study 1: Analyzing Automobile Warranty Claims: Example of the CRISP–DM Industry Standard Process in Action
Fallacies of Data Mining
What Tasks Can Data Mining Accomplish?
Description
Estimation
Prediction
Classification
Clustering
Association
Case Study 2: Predicting Abnormal Stock Market Returns Using Neural Networks
Case Study 3: Mining Association Rules from Legal Databases
Case Study 4: Predicting Corporate Bankruptcies Using Decision Trees
Case Study 5: Profiling the Tourism Market Using k-Means Clustering Analysis
References
Exercises
2 DATA PREPROCESSING
Why Do We Need to Preprocess the Data?
Data Cleaning
Handling Missing Data
Identifying Misclassifications
Graphical Methods for Identifying Outliers
Data Transformation
Min–Max Normalization
Z-Score Standardization
Numerical Methods for Identifying Outliers
References
Exercises
3 EXPLORATORY DATA ANALYSIS
Hypothesis Testing versus Exploratory Data Analysis
Getting to Know the Data Set
Dealing with Correlated Variables
Exploring Categorical Variables
Using EDA to Uncover Anomalous Fields
Exploring Numerical Variables
Exploring Multivariate Relationships
Selecting Interesting Subsets of the Data for Further Investigation
Binning
Summary
References
Exercises
4 STATISTICAL APPROACHES TO ESTIMATION AND PREDICTION
Data Mining Tasks in Discovering Knowledge in Data
Statistical Approaches to Estimation and Prediction
Univariate Methods: Measures of Center and Spread
Statistical Inference
How Confident Are We in Our Estimates?
Confidence Interval Estimation
Bivariate Methods: Simple Linear Regression
Dangers of Extrapolation
Confidence Intervals for the Mean Value of y Given x
Prediction Intervals for a Randomly Chosen Value of y Given x
Multiple Regression
Verifying Model Assumptions
References
Exercises
5 k-NEAREST NEIGHBOR ALGORITHM
Supervised versus Unsupervised Methods
Methodology for Supervised Modeling
Bias–Variance Trade-Off
Classification Task
k-Nearest Neighbor Algorithm
Distance Function
Combination Function
Simple Unweighted Voting
Weighted Voting
Quantifying Attribute Relevance: Stretching the Axes
Database Considerations
k-Nearest Neighbor Algorithm for Estimation and Prediction
Choosing k
Reference
Exercises
6 DECISION TREES
Classification and Regression Trees
C4.5 Algorithm
Decision Rules
Comparison of the C5.0 and CART Algorithms Applied to Real Data
References
Exercises
7 NEURAL NETWORKS
Input and Output Encoding
Neural Networks for Estimation and Prediction
Simple Example of a Neural Network
Sigmoid Activation Function
Back-Propagation
Gradient Descent Method
Back-Propagation Rules
Example of Back-Propagation
Termination Criteria
Learning Rate
Momentum Term
Sensitivity Analysis
Application of Neural Network Modeling
References
Exercises
8 HIERARCHICAL AND k-MEANS CLUSTERING
Clustering Task
Hierarchical Clustering Methods
Single-Linkage Clustering
Complete-Linkage Clustering
k-Means Clustering
Example of k-Means Clustering at Work
Application of k-Means Clustering Using SAS Enterprise Miner
Using Cluster Membership to Predict Churn
References
Exercises
9 KOHONEN NETWORKS
Self-Organizing Maps
Kohonen Networks
Example of a Kohonen Network Study
Cluster Validity
Application of Clustering Using Kohonen Networks
Interpreting the Clusters
Cluster Profiles
Using Cluster Membership as Input to Downstream Data Mining Models
References
Exercises
10 ASSOCIATION RULES
Affinity Analysis and Market Basket Analysis
Data Representation for Market Basket Analysis
Support, Confidence, Frequent Itemsets, and the A Priori Property
How Does the A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets
How Does the A Priori Algorithm Work (Part 2)? Generating Association Rules
Extension from Flag Data to General Categorical Data
Information-Theoretic Approach: Generalized Rule Induction Method
J-Measure
Application of Generalized Rule Induction
When Not to Use Association Rules
Do Association Rules Represent Supervised or Unsupervised Learning?
Local Patterns versus Global Models
References
Exercises
11 MODEL EVALUATION TECHNIQUES
Model Evaluation Techniques for the Description Task
Model Evaluation Techniques for the Estimation and Prediction Tasks
Model Evaluation Techniques for the Classification Task
Error Rate, False Positives, and False Negatives
Misclassification Cost Adjustment to Reflect Real-World Concerns
Decision Cost/Benefit Analysis
Lift Charts and Gains Charts
Interweaving Model Evaluation with Model Building
Confluence of Results: Applying a Suite of Models
Reference
Exercises
EPILOGUE: "WE'VE ONLY JUST BEGUN"
INDEX
"...selected material is described in a simple, clear, and…precise way...case studies…examples, and screen shots has definitely added to the learning value of the book." (Journal of Biopharmaceutical Statistics, January/February 2006)
"...does a good job introducing data mining to novices...it skillfully previews some of the basic statistical issues needed to understand data mining techniques." (Journal of the American Statistical Association, December 2005)
"If you need a book to help colleagues understand your data mining procedures and results, this is the one you want to give them." (Technometrics, November 2005)
"…an excellent book…it should be useful for anyone interested in analysing epidemiological data." (Statistics in Medical Research, October 2005)
"...an excellent 'white-box' overview of established approaches for data analysis, in which readers are shown how, why, and when the methods work." (CHOICE, April 2005)
"Larose has the making of a good series of books on data mining…I, for one, look forward to the next two books in the series." (Computing Reviews.com, February 15, 2005)