The go-to guidebook for deploying Big Data solutions with Hadoop

Today's enterprise architects need to understand how the Hadoop
frameworks and APIs fit together, and how they can be integrated to
deliver real-world solutions. This book is a practical, detailed
guide to building and implementing those solutions, with code-level
instruction in the popular Wrox tradition. It covers storing data
with HDFS and HBase, processing data with MapReduce, and automating
data processing with Oozie. Hadoop security, running Hadoop with
Amazon Web Services, best practices, and automating Hadoop
processes in real time are also covered in depth.

With in-depth code examples in Java and XML and the latest on
recent additions to the Hadoop ecosystem, this complete resource
also covers the use of APIs, exposing their inner workings and
allowing architects and developers to better leverage and customize
them.

* The ultimate guide for developers, designers, and architects
who need to build and deploy Hadoop applications

* Covers storing and processing data with various technologies,
automating data processing, Hadoop security, and delivering
real-time solutions

* Includes detailed, real-world examples and code-level
guidelines

* Explains when, why, and how to use these tools effectively

* Written by a team of Hadoop experts in the
programmer-to-programmer Wrox style

Professional Hadoop Solutions is the reference enterprise
architects and developers need to maximize the power of Hadoop.



About the Authors

Boris Lublinsky is principal architect at Nokia and an
author of more than 70 publications, including Applied SOA:
Service-Oriented Architecture and Design Strategies.

Kevin T. Smith is Director of Technology Solutions for
the AMS division of Novetta Solutions, where he builds highly
secure, data-oriented solutions for customers.

Alexey Yakubovich is a system architect at Hortonworks
and a member of the Object Management Group SIG on SOA governance
and model-driven architecture.



Contents

Introduction

Chapter 1: Big Data and the Hadoop Ecosystem
Big Data Meets Hadoop
Hadoop: Meeting the Big Data Challenge
Data Science in the Business World
The Hadoop Ecosystem
Hadoop Core Components
Hadoop Distributions
Developing Enterprise Applications with Hadoop
Summary

Chapter 2: Storing Data in Hadoop
HDFS
HDFS Architecture
Using HDFS Files
Hadoop-Specific File Types
HDFS Federation and High Availability
HBase
HBase Architecture
HBase Schema Design
Programming for HBase
New HBase Features
Combining HDFS and HBase for Effective Data Storage
Using Apache Avro
Managing Metadata with HCatalog
Choosing an Appropriate Hadoop Data Organization for Your Applications
Summary

Chapter 3: Processing Your Data with MapReduce
Getting to Know MapReduce
MapReduce Execution Pipeline
Runtime Coordination and Task Management in MapReduce
Your First MapReduce Application
Building and Executing MapReduce Programs
Designing MapReduce Implementations
Using MapReduce as a Framework for Parallel Processing
Simple Data Processing with MapReduce
Building Joins with MapReduce
Building Iterative MapReduce Applications
To MapReduce or Not to MapReduce?
Common MapReduce Design Gotchas
Summary

Chapter 4: Customizing MapReduce Execution
Controlling MapReduce Execution with InputFormat
Implementing InputFormat for Compute-Intensive Applications
Implementing InputFormat to Control the Number of Maps
Implementing InputFormat for Multiple HBase Tables
Reading Data Your Way with Custom RecordReaders
Implementing a Queue-Based RecordReader
Implementing RecordReader for XML Data
Organizing Output Data with Custom Output Formats
Implementing OutputFormat for Splitting MapReduce Job's Output into Multiple Directories
Writing Data Your Way with Custom RecordWriters
Implementing a RecordWriter to Produce Output tar Files
Optimizing Your MapReduce Execution with a Combiner
Controlling Reducer Execution with Partitioners
Implementing a Custom Partitioner for One-to-Many Joins
Using Non-Java Code with Hadoop
Pipes
Hadoop Streaming
Using JNI
Summary

Chapter 5: Building Reliable MapReduce Apps
Unit Testing MapReduce Applications
Testing Mappers
Testing Reducers
Integration Testing
Local Application Testing with Eclipse
Using Logging for Hadoop Testing
Processing Applications Logs
Reporting Metrics with Job Counters
Defensive Programming in MapReduce
Summary

Chapter 6: Automating Data Processing with Oozie
Getting to Know Oozie
Oozie Workflow
Executing Asynchronous Activities in Oozie Workflow
Oozie Recovery Capabilities
Oozie Workflow Job Life Cycle
Oozie Coordinator
Oozie Bundle
Oozie Parameterization with Exp...

Title: Professional Hadoop Solutions
EAN: 9781118612545
ISBN: 978-1-118-61254-5
Format: E-book (PDF)
Manufacturer:
Publisher:
Publication date: 30 August 2013
Digital copy protection: Adobe DRM
File size: 7.93 MB
Number of pages: 504
Year: 2013
Subtitle: English