Apache Flume: Distributed Log Collection for Hadoop

by Steve Hoffman

Length: 108 pages
Edition: 1
Language: English
Publisher: Packt Publishing
Publication Date: 2013-07-16
ISBN-10: 1782167919
ISBN-13: 9781782167914
Sales Rank: #4088720

Description

Stream data to Hadoop using Apache Flume

Overview

Integrate Flume with your data sources
Transcode your data en route in Flume
Route and separate your data using regular expression matching
Configure failover paths and load-balancing to remove single points of failure
Use Gzip compression for files written to HDFS

In Detail

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop’s HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with many failover and recovery mechanisms.

Apache Flume: Distributed Log Collection for Hadoop covers the problems of streaming data and logs into HDFS and how Flume can resolve them. The book explains the generalized architecture of Flume, including moving data to and from databases and NoSQL-style data stores, as well as optimizing performance. It also includes real-world scenarios of Flume implementations.

Apache Flume: Distributed Log Collection for Hadoop starts with an architectural overview of Flume and then discusses each component in detail. It guides you through the complete installation process and compilation of Flume.

It also shows how to use channels and channel selectors. For each architectural component (Sources, Channels, Sinks, Channel Processors, Sink Groups, and so on), the various implementations are covered in detail along with their configuration options, so you can customize Flume to your specific needs. Pointers on writing custom implementations are included as well.

By the end, you should be able to construct a series of Flume agents to transport your streaming data and logs from your systems into Hadoop in near real time.

What you will learn from this book

Understand the Flume architecture
Download and install open source Flume from Apache
Discover when to use a memory or file-backed channel
Understand and configure the Hadoop File System (HDFS) sink
Learn how to use sink groups to create redundant data flows
Configure and use various sources for ingesting data
Inspect data records and route them to different or multiple destinations based on payload content
Transform data en route to Hadoop
Monitor your data flows

Approach

A starter guide that covers Apache Flume in detail.

Who this book is written for

Apache Flume: Distributed Log Collection for Hadoop is intended for people who are responsible for moving datasets into Hadoop in a timely and reliable manner, such as software engineers, database administrators, and data warehouse administrators.

Chapter 1: Overview and Architecture
Chapter 2: Flume Quick Start
Chapter 3: Channels
Chapter 4: Sinks and Sink Processors
Chapter 5: Sources and Channel Selectors
Chapter 6: Interceptors, ETL, and Routing
Chapter 7: Monitoring Flume
Chapter 8: There Is No Spoon – The Realities of Real-time Distributed Data Collection
