A guide to the principles and methods of data analysis that does not require knowledge of statistics or programming
A General Introduction to Data Analytics is an essential guide to understanding and using data analytics. The book is written in easy-to-understand terms and does not require familiarity with statistics or programming. The authors, noted experts in the field, explain the intuition behind the basic data analytics techniques. The text also contains exercises and illustrative examples.
Designed to be accessible to non-experts, the book motivates the need to analyze data. It explains how to visualize and summarize data, and how to find natural groups and frequent patterns in a dataset. The book also explores predictive tasks, whether classification or regression. Finally, the book discusses popular data analytics applications, such as mining the web, information retrieval, social network analysis, working with text, and recommender systems. The learning resources offer:
* A guide to the reasoning behind data mining techniques
* A single illustrative example that runs through all the chapters
* Exercises at the end of each chapter and larger projects at the end of each of the text's two main parts
Together with these learning resources, the book can serve as the guide for a 13-week course, with one chapter per course topic.
The book is written so that non-mathematicians, non-statisticians and non-computer scientists looking for an introduction to data science can understand the main data analytics concepts. A General Introduction to Data Analytics is a basic guide to data analytics written in highly accessible terms.
About the Authors
João Mendes Moreira, PhD, is an assistant professor in the Faculty of Engineering at the University of Porto, Porto, Portugal and is also a researcher in LIAAD-INESC TEC, Porto, Portugal.
André de Carvalho, PhD, is a full professor in the Institute of Mathematics and Computer Science at the University of São Paulo, Brazil.
Tomáš Horváth, PhD, is an assistant professor at the Faculty of Informatics of the Eötvös Loránd University in Budapest, Hungary, and is also associated with the Faculty of Science at the Pavol Jozef Šafárik University in Košice, Slovakia.
Contents
Preface
Acknowledgments
Presentational Conventions
About the Companion Website
Part I Introductory Background
1 What Can We Do With Data?
1.1 Big Data and Data Science
1.2 Big Data Architectures
1.3 Small Data
1.4 What is Data?
1.5 A Short Taxonomy of Data Analytics
1.6 Examples of Data Use
1.6.1 Breast Cancer in Wisconsin
1.6.2 Polish Company Insolvency Data
1.7 A Project on Data Analytics
1.7.1 A Little History on Methodologies for Data Analytics
1.7.2 The KDD Process
1.7.3 The CRISP-DM Methodology
1.8 How this Book is Organized
1.9 Who Should Read this Book
Part II Getting Insights from Data
2 Descriptive Statistics
2.1 Scale Types
2.2 Descriptive Univariate Analysis
2.2.1 Univariate Frequencies
2.2.2 Univariate Data Visualization
2.2.3 Univariate Statistics
2.2.4 Common Univariate Probability Distributions
2.3 Descriptive Bivariate Analysis
2.3.1 Two Quantitative Attributes
2.3.2 Two Qualitative Attributes, at Least one of them Nominal
2.3.3 Two Ordinal Attributes
2.4 Final Remarks
2.5 Exercises
3 Descriptive Multivariate Analysis
3.1 Multivariate Frequencies
3.2 Multivariate Data Visualization
3.3 Multivariate Statistics
3.3.1 Location Multivariate Statistics
3.3.2 Dispersion Multivariate Statistics
3.4 Infographics and Word Clouds
3.4.1 Infographics
3.4.2 Word Clouds
3.5 Final Remarks
3.6 Exercises
4 Data Quality and Preprocessing
4.1 Data Quality
4.1.1 Missing Values
4.1.2 Redundant Data
4.1.3 Inconsistent Data
4.1.4 Noisy Data
4.1.5 Outliers
4.2 Converting to a Different Scale Type
4.2.1 Converting Nominal to Relative
4.2.2 Converting Ordinal to Relative or Absolute
4.2.3 Converting Relative or Absolute to Ordinal or Nominal
4.3 Converting to a Different Scale
4.4 Data Transformation
4.5 Dimensionality Reduction
4.5.1 Attribute Aggregation
4.5.1.1 Principal Component Analysis
4.5.1.2 Independent Component Analysis
4.5.1.3 Multidimensional Scaling
4.5.2 Attribute Selection
4.5.2.1 Filters
4.5.2.2 Wrappers
4.5.2.3 Embedded
4.5.2.4 Search Strategies
4.6 Final Remarks
4.7 Exercises
5 Clustering
5.1 Distance Measures
5.1.1 Differences between Values of Common Attribute Types
5.1.2 Distance Measures for Objects with Quantitative Attributes
5.1.3 Distance Measures for Non-conventional Attributes
5.2 Clustering Validation
5.3 Clustering Techniques
5.3.1 K-means
5.3.1.1 Centroids and Distance Measures
5.3.1.2 How K-means Works
5.3.2 DBSCAN
5.3.3 Agglomerative Hierarchical Clustering Technique
5.3.3.1 Linkage Criterion
5.3.3.2 Dendrograms
5.4 Final Remarks
5.5 Exercises
6 Frequent Pattern Mining
6.1 Frequent Itemsets
6.1.1 Setting the min_sup Threshold
6.1.2 Apriori: a Join-based Method
6.1.3 Eclat
6.1.4 FP-Growth
6.1.5 Maximal and Closed Frequent Itemsets
6.2 Association Rules
6.3 B...