The go-to guidebook for deploying Big Data solutions with
Hadoop
Today's enterprise architects need to understand how the Hadoop
frameworks and APIs fit together, and how they can be integrated to
deliver real-world solutions. This book is a practical, detailed
guide to building and implementing those solutions, with code-level
instruction in the popular Wrox tradition. It covers storing data
with HDFS and HBase, processing data with MapReduce, and automating
data processing with Oozie. Hadoop security, running Hadoop with
Amazon Web Services, best practices, and automating Hadoop
processes in real time are also covered in depth.
With in-depth code examples in Java and XML and the latest on
recent additions to the Hadoop ecosystem, this complete resource
also covers the use of APIs, exposing their inner workings and
allowing architects and developers to better leverage and customize
them.
* The ultimate guide for developers, designers, and architects
who need to build and deploy Hadoop applications
* Covers storing and processing data with various technologies,
automating data processing, Hadoop security, and delivering
real-time solutions
* Includes detailed, real-world examples and code-level
guidelines
* Explains when, why, and how to use these tools effectively
* Written by a team of Hadoop experts in the
programmer-to-programmer Wrox style
Professional Hadoop Solutions is the reference enterprise
architects and developers need to maximize the power of Hadoop.
About the Authors
Boris Lublinsky is principal architect at Nokia and an
author of more than 70 publications, including Applied SOA:
Service-Oriented Architecture and Design Strategies.
Kevin T. Smith is Director of Technology Solutions for
the AMS division of Novetta Solutions, where he builds highly
secure, data-oriented solutions for customers.
Alexey Yakubovich is a system architect at Hortonworks
and a member of the Object Management Group SIG on SOA governance
and model-driven architecture.
Contents
Introduction
Chapter 1: Big Data and the Hadoop Ecosystem
Big Data Meets Hadoop
Hadoop: Meeting the Big Data Challenge
Data Science in the Business World
The Hadoop Ecosystem
Hadoop Core Components
Hadoop Distributions
Developing Enterprise Applications with Hadoop
Summary
Chapter 2: Storing Data in Hadoop
HDFS
HDFS Architecture
Using HDFS Files
Hadoop-Specific File Types
HDFS Federation and High Availability
HBase
HBase Architecture
HBase Schema Design
Programming for HBase
New HBase Features
Combining HDFS and HBase for Effective Data Storage
Using Apache Avro
Managing Metadata with HCatalog
Choosing an Appropriate Hadoop Data Organization for Your Applications
Summary
Chapter 3: Processing Your Data with MapReduce
Getting to Know MapReduce
MapReduce Execution Pipeline
Runtime Coordination and Task Management in MapReduce
Your First MapReduce Application
Building and Executing MapReduce Programs
Designing MapReduce Implementations
Using MapReduce as a Framework for Parallel Processing
Simple Data Processing with MapReduce
Building Joins with MapReduce
Building Iterative MapReduce Applications
To MapReduce or Not to MapReduce?
Common MapReduce Design Gotchas
Summary
Chapter 4: Customizing MapReduce Execution
Controlling MapReduce Execution with InputFormat
Implementing InputFormat for Compute-Intensive Applications
Implementing InputFormat to Control the Number of Maps
Implementing InputFormat for Multiple HBase Tables
Reading Data Your Way with Custom RecordReaders
Implementing a Queue-Based RecordReader
Implementing RecordReader for XML Data
Organizing Output Data with Custom Output Formats
Implementing OutputFormat for Splitting MapReduce Job's Output into Multiple Directories
Writing Data Your Way with Custom RecordWriters
Implementing a RecordWriter to Produce Output tar Files
Optimizing Your MapReduce Execution with a Combiner
Controlling Reducer Execution with Partitioners
Implementing a Custom Partitioner for One-to-Many Joins
Using Non-Java Code with Hadoop
Pipes
Hadoop Streaming
Using JNI
Summary
Chapter 5: Building Reliable MapReduce Apps
Unit Testing MapReduce Applications
Testing Mappers
Testing Reducers
Integration Testing
Local Application Testing with Eclipse
Using Logging for Hadoop Testing
Processing Applications Logs
Reporting Metrics with Job Counters
Defensive Programming in MapReduce
Summary
Chapter 6: Automating Data Processing with Oozie
Getting to Know Oozie
Oozie Workflow
Executing Asynchronous Activities in Oozie Workflow
Oozie Recovery Capabilities
Oozie Workflow Job Life Cycle
Oozie Coordinator
Oozie Bundle
Oozie Parameterization with Exp...