Learn advanced analytical techniques and leverage existing toolkits to make your analytic applications more powerful, precise, and efficient. This book provides the right combination of architecture, design, and implementation information to create analytical systems which go beyond the basics of classification, clustering, and recommendation.
In Pro Hadoop Data Analytics, best practices are emphasized to ensure coherent, efficient development. A complete example system is developed using standard third-party components: toolkits, libraries, visualization and reporting code, and the support glue needed to provide a working, extensible end-to-end system.
The book emphasizes four important topics:
- The importance of end-to-end, flexible, configurable, high-performance data pipeline systems with analytical components as well as appropriate visualization results.
- Best practices and structured design principles. This includes strategic topics as well as the how-to example portions.
- Use of existing third-party libraries is key to effective development. Deep-dive examples of the functionality of some of these toolkits are showcased as you develop the example system.
What You'll Learn
- The what, why, and how of building big data analytic systems with the Hadoop ecosystem
- Libraries, toolkits, and algorithms to make development easier and more effective
- Best practices to use when building analytic systems with Hadoop, and metrics to measure performance and efficiency of components and systems
- How to connect to standard relational databases, NoSQL data sources, and more
- Useful case studies and example components which assist you in creating your own systems
Who This Book Is For
Software engineers, architects, and data scientists with an interest in the design and implementation of big data analytical systems using Hadoop, the Hadoop ecosystem, and other associated technologies.
Table of Contents
Part I: Concepts
Chapter 1: Overview: Building Data Analytic Systems with Hadoop
Chapter 2: A Scala and Python Refresher
Chapter 3: Standard Toolkits for Hadoop and Analytics
Chapter 4: Relational, NoSQL, and Graph Databases
Chapter 5: Data Pipelines and How to Construct Them
Chapter 6: Advanced Search Techniques with Hadoop, Lucene, and Solr
Part II: Architectures and Algorithms
Chapter 7: An Overview of Analytical Techniques and Algorithms
Chapter 8: Rule Engines, System Control, and System Orchestration
Chapter 9: Putting It All Together: Designing a Complete Analytical System
Part III: Components and Systems
Chapter 10: Data Visualizers: Seeing and Interacting with the Analysis
Part IV: Case Studies and Applications
Chapter 11: A Case Study in Bioinformatics: Analyzing Microscope Slide Data
Chapter 12: A Bayesian Analysis Component: Identifying Credit Card Fraud
Chapter 13: Searching for Oil: Geographical Data Analysis with Apache Mahout
Chapter 14: “Image As Big Data” Systems: Some Case Studies
Chapter 15: Building a General Purpose Data Pipeline
Chapter 16: Conclusions and the Future of Big Data Analysis
Appendix A: Setting Up the Distributed Analytics Environment
Appendix B: Getting, Installing, and Running the Example Analytics System