Best Hadoop and Spark Books

In God we trust. All others must bring data 📶
~ W. Edwards Deming (statistician)

The goal is to turn data into information, and information into insight 🏆
~ Carly Fiorina (former president, and chair of Hewlett-Packard)

All in all it’s just another brick in the wall
All in all you’re just another brick in the wall 

~ Pink Floyd (lyrics from Another Brick in the Wall, Part 2)

My prior post was on Scala which—along with Java and Clojure—is a language that I find highly expressive and helpful for my programming needs. This weekend, let’s move on to another topic and see what can be done to help you in your journey to grokking the Big Data solution space 🙂

I do believe that the two key questions which are fueling the torrent that this age of Big Data has evolved into are these

  1. How best to handle and work with data at super-mega scale?
  2. How can one best decipher and understand that high-volume data and, in turn, convert it into a competitive advantage?

Living as we do today, well into the age of Big Data, it sure helps to have some guidance from those who are at the frontline of the endeavors that revolve around these two questions. Online resources are indispensable and fantastic in their own right, especially for cutting-edge updates. But what about times when you simply want to sit down and really absorb the wisdom of our Big Data sages (the underlying conceptual infrastructure that powers the Big Data machinery) in a more sustained and methodical way?

That leads me to share some thoughts on the finest books on the subject—primarily on Spark and Hadoop, plus a smattering of others—that have proved especially helpful to me as I drank from the Kool Aid of Big Data knowledge 😎

  1. Advanced Analytics with Spark: Patterns for Learning from Data at Scale (O’Reilly), by Josh Wills, Sandy Ryza, et al 🏆
  2. Learning Spark: Lightning-Fast Big Data Analysis (O’Reilly) by Holden Karau, et al 🎯
  3. Hadoop: The Definitive Guide, 4th Edition (O’Reilly) by Tom White 🐘
  4. Hadoop in Practice, 2nd Edition, (Manning), by Alex Holmes 🐻
  5. Professional Hadoop Solutions (Wrox), by Boris Lublinsky et al 🐙
  6. Data Scientists at Work (Apress) by Sebastian Gutierrez ☕
  7. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems (O’Reilly), by Donald Miner and Adam Shook 🍯
  8. Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data (Wiley), by Byron Ellis 🐝
  9. Big Data: Principles and Best Practices of Scalable Realtime Data Systems (Manning), by Nathan Marz 🐾

And lest anyone be dying with curiosity about the origins of the phantasmagorical rendition of “The Wall” in the pic above… Insights emerging from the brick-in-the-wall metaphor… And I never metaphor I didn’t like 😎

1. Advanced Analytics with Spark: Patterns for Learning from Data at Scale (O’Reilly), by Josh Wills, Sandy Ryza, et al 🏆

If you’re looking for the best-written and most exciting Big Data book of the year, look no further than this one: Advanced Analytics with Spark: Patterns for Learning from Data at Scale (O’Reilly), by Josh Wills, Sandy Ryza, et al. This book provides sparkling clear insights into the value proposition that Apache Spark brings to the Big Data (metaphorical) table 😉

You get to understand how this open source project makes distributed programming eminently accessible to data scientists. It goes on to show how Spark—while maintaining MapReduce’s linear scalability and fault tolerance—extends it in three important ways:

  1. Its engine can execute a more general directed acyclic graph (DAG) of operators.
  2. It complements this capability with a rich set of transformations.
  3. It extends its predecessors with in-memory processing. Its Resilient Distributed Dataset (RDD) abstraction enables developers to materialize any point in a processing pipeline into memory across the cluster.
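To make the third point concrete, here is a toy, single-machine sketch in plain Python. It is emphatically not Spark's API: the `ToyDataset` class and its method names are hypothetical, chosen only to mimic the idea of lazily chained transformations plus an explicit `cache()` step that materializes one point of the pipeline in memory so that downstream branches can reuse it.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class ToyDataset:
    """Toy stand-in for an RDD-like dataset (hypothetical, not Spark's API)."""

    def __init__(self, source: Callable[[], Iterable]):
        self._source = source                 # recipe that produces the records
        self._should_cache = False
        self._cached: Optional[List] = None   # filled on first use if cached

    def map(self, fn: Callable) -> "ToyDataset":
        # Lazy: only records how to compute the result, runs nothing yet.
        return ToyDataset(lambda: (fn(x) for x in self._iterate()))

    def filter(self, pred: Callable) -> "ToyDataset":
        return ToyDataset(lambda: (x for x in self._iterate() if pred(x)))

    def cache(self) -> "ToyDataset":
        # Mark this point of the pipeline to be kept in memory once computed.
        self._should_cache = True
        return self

    def collect(self) -> List:
        # An "action": this is what finally triggers the computation.
        return list(self._iterate())

    def _iterate(self) -> Iterator:
        if self._cached is not None:
            return iter(self._cached)
        records = self._source()
        if self._should_cache:
            self._cached = list(records)
            return iter(self._cached)
        return iter(records)


calls = {"parse": 0}


def parse(line: str) -> int:
    calls["parse"] += 1   # count how often the "expensive" step runs
    return int(line)


lines = ToyDataset(lambda: ["1", "2", "3", "4"])
parsed = lines.map(parse).cache()            # materialize after parsing
evens = parsed.filter(lambda n: n % 2 == 0)  # first branch of the graph
squares = parsed.map(lambda n: n * n)        # second branch reuses the cache

calls_before_action = calls["parse"]         # still 0: nothing has run yet
evens_result = evens.collect()
squares_result = squares.collect()
```

Both branches hang off the same cached node, so `parse` runs once per record rather than once per branch. That reuse is the intuition behind the in-memory materialization described above, although the real thing works across a cluster and handles partitioning and fault tolerance.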

One particularly telling remark that the authors make has to do with how, “…With respect to the pertinence of munging and ETL, Spark strives to be something closer to the Python of big data than the Matlab of big data”. Spark’s in-memory caching makes it equally ideal for programming in the large and small. And what’s possibly most exciting is how Spark bridges the gap between the avenues of exploratory analytics and production (i.e. operational) analytics! On top of that, Spark’s tight integration with the Hadoop ecosystem makes it an eminently accessible and attractive framework.

If the preceding themes strike a chord with you—and if you’re looking for deep dives to get a sense for the feel of using Spark to do complex analytics on massive data sets—look no further than this book. It covers the entire pipeline in an exceptionally clear and engaging style. A bunch of diverse domains are engagingly covered in no less than nine case studies, to which a chapter each is devoted. These chapters make up the bulk of this stellar book.

IMHO, Advanced Analytics with Spark: Patterns for Learning from Data at Scale is an ideal second book on Spark—for your initial forays into this subject, the next book on this list would be an excellent first book on Spark. But if you’re determined to drink from the proverbial firehose, you really can’t go wrong reading them side-by-side 🙂

Oh, and the most fun and standout chapters in this altogether stellar book are those on

  • Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data
  • Understanding Wikipedia with Latent Semantic Analysis
  • Analyzing Co-occurrence Networks with GraphX

Finally, I mention here the table of contents to give you a fuller flavor of the topics covered

  • Chapter 1. Analyzing Big Data
  • Chapter 2. Introduction to Data Analysis with Scala and Spark
  • Chapter 3. Recommending Music and the Audioscrobbler Data Set
  • Chapter 4. Predicting Forest Cover with Decision Trees
  • Chapter 5. Anomaly Detection in Network Traffic with K-means Clustering
  • Chapter 6. Understanding Wikipedia with Latent Semantic Analysis
  • Chapter 7. Analyzing Co-occurrence Networks with GraphX
  • Chapter 8. Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data
  • Chapter 9. Estimating Financial Risk through Monte Carlo Simulation
  • Chapter 10. Analyzing Genomics Data and the BDG Project
  • Chapter 11. Analyzing Neuroimaging Data with PySpark and Thunder

All in all, Advanced Analytics with Spark: Patterns for Learning from Data at Scale is a book that’s got me really excited about the possibilities of this remarkable platform!

2. Learning Spark: Lightning-Fast Big Data Analysis (O’Reilly) by Holden Karau, et al 🎯

The book Learning Spark: Lightning-Fast Big Data Analysis (O’Reilly) by Holden Karau, et al is a welcome addition to the library of those starting out in their quest to grok the amazing framework that Apache Spark is. What I appreciated the most about this book is the thorough and pragmatic coverage of Apache Spark, beginning with an invitation to understand the value that Spark offers by extending MapReduce

  1. Spark brings value by its ease-of-use (fire up Spark on your laptop, and start using its high-level API, which enables you to focus on your domain-specific computations).
  2. Spark enables interactive use for tackling complex algorithms.
  3. And you get in Spark a general-purpose computation engine (think here of combining multiple types of computations, such as ML, text processing, SQL querying, etc.) for work that would previously have necessitated a bunch of different engines.

For us software types, the following observations by the authors are worth bringing out so you can best decide whether the targeted value that this book offers is for you

This book targets data scientists and engineers. We chose these two groups because they have the most to gain from using Spark to expand the scope of problems they can solve. Spark’s rich collection of data-focused libraries (like MLlib) makes it easy for data scientists to go beyond problems that fit on a single machine while using their statistical background. Engineers, meanwhile, will learn how to write general-purpose distributed programs in Spark and operate production applications. Engineers and data scientists will both learn different details from this book, but will both be able to apply Spark to solve large distributed problems in their respective fields. 

The second group this book targets is software engineers who have some experience with Java, Python, or another programming language. If you are an engineer, we hope that this book will show you how to set up a Spark cluster, use the Spark shell, and write Spark applications to solve parallel processing problems (italicized by me for emphasis). If you are familiar with Hadoop, you have a bit of a head start on figuring out how to interact with HDFS and how to manage a cluster, but either way, we will cover basic distributed execution concepts.

The full chapter devoted to Spark’s core abstraction for doing data-intensive computations—the resilient distributed dataset (aka RDD)—is a standout. The other standout chapter is the one that gets into the nitty gritty of configuring a Spark application, and which also provides an overview of tuning and debugging Spark workloads in production.

Learning Spark: Lightning-Fast Big Data Analysis is richly illustrated with diagrams and tables, and there’s no shortage of helpful code snippets to get you going with Spark 🙂

3. Hadoop: The Definitive Guide, 4th Edition (O’Reilly) by Tom White 🐘

Let’s segue from Spark to Hadoop land now, beginning with a remarkable book: Hadoop: The Definitive Guide, 4th Edition (O’Reilly) by Tom White—This crystal clear and eminently readable book is perhaps the grand-daddy of all Big Data books out there! Now in its fourth edition, this book is the paragon of sparkling clear prose and unambiguous explanations of all things Hadoop, which of course we tech types crave 🙂

When reading books, we’ve all gotten used to doing the inevitable Google searches periodically (to compensate for the equally inevitable gaps in the narratives of any given technology book), but this book is mercifully free of the aforesaid read-some, search-online-some, resume-reading syndrome, yay!

So if you’re ready to drink deep at the Hadoop pool, you simply can’t go wrong with this book. Allow me to elaborate: In the Preface, the author elegantly traces the genesis of this very point—sparkling clear prose and unambiguous readability—to the works of the renowned mathematics writer, Martin Gardner, and adds

Its inner workings are complex, resting as they do on a mixture of distributed systems theory, practical engineering, and common sense. And to the uninitiated, Hadoop can appear alien.  

But it doesn’t need to be like this. Stripped to its core, the tools that Hadoop provides for working with big data are simple. If there’s a common theme, it is about raising the level of abstraction—to create building blocks for programmers who have lots of data to store and analyze, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.

You immediately get the sense that this book is a no-nonsense, friendly, and engaging guide to Hadoop and its ecosystem; rest assured that you’ll finish this book without the author letting you down one bit. In fact, elaborating on this very theme—that this is a no-nonsense, friendly, and engaging guide to Hadoop—the first chapter gives a pleasant tour (a lay of the land, if you will) to the entirety of Hadoop: The Definitive Guide, 4th Edition, which is made up of no less than 756 pages. Be sure to use the book’s indispensable first chapter to make the most of absorbing the contents of this remarkable book. As the author explains,

The book is divided into five main parts: Parts I to III are about core Hadoop, Part IV covers related projects in the Hadoop ecosystem, and Part V contains Hadoop case studies. You can read the book from cover to cover, but there are alternative pathways through the book that allow you to skip chapters that aren’t needed to read later ones.

Further along, a bird’s eye view is provided for each of the chapters in the five main parts that make up this book. This summary is accompanied by a lovely flowchart of the paths that can be taken through the contents—Thoughtful design, with the reader in mind, is the hallmark of the entire book. As a reader, I felt secure in the knowledge of learning Hadoop from a master of the art. In this regard, the following remarks (in the Foreword) by Doug Cutting—who, along with Mike Cafarella, created Hadoop in 2005—are quite telling, and reflect just how friendly and engaging a guide this book is to all things Hadoop

Tom is now a respected senior member of the Hadoop developer community. Though he’s an expert in many technical corners of the project, his specialty is making Hadoop easier to use and understand. 

Given this, I was very pleased when I learned that Tom intended to write a book about Hadoop. Who could be better qualified? Now you have the opportunity to learn about Hadoop from a master—not only of the technology, but also of common sense and plain talk.

Don’t miss this work (Hadoop: The Definitive Guide, 4th Edition) by the leading popularizer of Hadoop, who is doing for Hadoop what Martin Gardner has done for mathematics!

4. Hadoop in Practice, 2nd Edition, (Manning), by Alex Holmes 🐻

This next title is an excellent second book on Hadoop: Hadoop in Practice, 2nd Edition (Manning), by Alex Holmes. Now in its second edition, this book got a thorough update to cover changes and new features in Hadoop, including MapReduce 2. New chapters have been added to cover YARN, Kafka, Impala, and Spark SQL as they each relate to Hadoop. While sticking to the strengths of the first edition (approximately 100 intermediate-to-advanced Hadoop examples in a superb problem-and-solution format), the new edition continues to build on those strengths while maintaining the high quality of the code examples.

In the About this Book section, after mentioning how, with its distributed storage and compute capabilities, Hadoop is fundamentally an enabling technology for working with huge datasets, the author goes on to identify the target audience of this book:

This hands-on book targets users who have some practical experience with Hadoop and understand the basic concepts of MapReduce and HDFS. Manning’s Hadoop in Action by Chuck Lam contains the necessary prerequisites to understand and apply the techniques covered in this book.  

Many techniques in this book are Java-based, which means readers are expected to possess an intermediate-level knowledge of Java. An excellent text for all levels of Java users is Effective Java, Second Edition by Joshua Bloch (Addison-Wesley).

One thing I really, really like about this book is the abundance of useful diagrams and code snippets, all of which are profusely annotated with thoughtful comments! I would say that the barrier-to-entry to this book is not all that high—hastening to add that this is most emphatically not the same as saying that the contents are trifling—so if you’re determined, don’t shy away from tackling this book (along with, importantly, having an introductory book by your side, such as the fine book entitled Hadoop: The Definitive Guide, by Tom White, and which is also reviewed above).

Very briefly, here is a rundown of the topics covered in this book:

1. Background and fundamentals: Chapter 1. Hadoop in a heartbeat, Chapter 2. Introduction to YARN.  

2. Data logistics: Chapter 3. Data serialization—working with text and beyond, Chapter 4. Organizing and optimizing data in HDFS, Chapter 5. Moving data into and out of Hadoop.  

3. Big data patterns: Chapter 6. Applying MapReduce patterns to big data, Chapter 7. Utilizing data structures and algorithms at scale, Chapter 8. Tuning, debugging, and testing.  

4. Beyond MapReduce: Chapter 9. SQL on Hadoop, Chapter 10. Writing a YARN application.

This book (Hadoop in Practice, 2nd Edition) is packed with helpful material which, far from being cluttered in any way, is pleasingly organized and makes for smooth reading and a rewarding learning experience.

5. Professional Hadoop Solutions (Wrox), by Boris Lublinsky et al 🐙

Once comfortable with the Hadoop paradigm, you’ll be able to appreciate the gem of a book we’ve got in this next title: Professional Hadoop Solutions (Wrox), by Boris Lublinsky et al. The authors have assembled a first-class collection of design expertise narratives.

In my mind, the key to understanding the value in this book lies in appreciating the following observation, which the authors make in the introductory chapter

Although many publications emphasize the fact that Hadoop hides infrastructure complexity from business developers, you should understand that Hadoop extensibility is not publicized enough… Hadoop’s implementation was designed in a way that enables developers to easily and seamlessly incorporate new functionality into Hadoop’s execution. 

A significant portion of this book is dedicated to describing approaches to such customizations, as well as practical implementations. These are all based on the results of work performed by the authors.

They go on to explain cogently the reasons why great emphasis is placed on MapReduce code throughout the book. So if you approach this book with the mindset that the narratives will directly revolve around MapReduce, you’ll glean quite a bit of value out of this book. Their explanation of the MapReduce paradigm, as well as of its nuts-and-bolts mechanisms, really is top notch.

The standout chapters are the following:

  • Processing Your Data with MapReduce
  • Customizing MapReduce Execution
  • Hadoop Security
  • Building Enterprise Security Solutions for Hadoop Implementations

The Appendix toward the end of Professional Hadoop Solutions is especially rich and useful. Overall, I’m glad to have found this book!

6. Data Scientists at Work (Apress) by Sebastian Gutierrez ☕

And now let’s segue from Hadoop to a foray into Data Science kingdom proper 🏰

But first a fair warning is in order about this next book: Once you start reading it, you’re going to have a terribly hard time putting it down or, for that matter, doing anything else before you’ve read it all! Such was my experience of reading (and re-reading) this page-turner of a book: Data Scientists at Work (Apress) by Sebastian Gutierrez.

Consider this… We have these marvelous frameworks—in Spark, Hadoop, Storm and others—but surely they were not created in some ethereal vacuum. Right, these frameworks were of course created in the service of genuine business needs, and to solve pressing problems that folks were facing. So if you’re looking for the scoop on this nexus (i.e. the potent symbiosis between the aims of Data Science and what Big Data has to offer), this is the book for you.

The corpus of this book is made up of in-depth interviews of 16 gifted data scientists. What makes these interviews incredibly engaging is the spectacularly good job done by the interviewer (the author of this book), Sebastian Gutierrez. His academic training is from MIT—where he earned a BS in Mathematics—and he is a data entrepreneur who has founded three data-related companies.

The pointed and evocative questions asked throughout the book could only have come from someone who knows the pragmatics of the Data Science field inside-out! And therein lies the immense value of this book: Detailed answers by 16 top data scientists as they shed light on the human side of data science, their thoughts on how this field is evolving, where it’s headed, plus plenty of straight-from-the-trenches stories about their work.

While the quality of the interviews is uniformly excellent, the standout interviews in my mind are the ones with these data scientists who are doing stellar work

To give you a flavor of the interviews (each of which is given its own chapter), ever so briefly, here is something from Claudia, who is the Chief Scientist at Dstillery. She teaches a high-level overview course on data mining for the NYU Stern MBA program to, in her own words, “…give people a good understanding of what the opportunities are and how to manage them instead of really teaching them how to do it”. She has taught at NYU, MIT, Wharton, and Columbia. In response to the interview question in the book (“What about this work is interesting and exciting for you?”), Claudia noted

I have always been fascinated by math puzzles and puzzles in general. The work that I do is a real-world version of puzzles that life just presents. Data is the footprint of real life in some form, and so it is always interesting. It is like a detective game to figure out what is really going on. Most of my time I am debugging data with a sense of finding out what is wrong with it or where it disagrees with my assumption of what it was supposed to have meant. So these are games that I am just inherently getting really excited about.

In the end, here is the book’s author (Sebastian Gutierrez) himself, describing in the Introduction the essence of his approach in putting together the interviews for this book

My interviewing method was designed to ask open-ended questions so that the personalities and spontaneous thought processes of each interviewee would shine through clearly and accurately. My aim was to get at the heart of how they came to be data scientists, what they love about the field, what their daily work lives entail, how they built their careers, how they developed their skills, what advice they have for people looking to become data scientists, and what they think the future of the field holds.

Some 20 years ago, when I was finishing grad school—at that time, I earned an MS degree in electrical engineering from Texas A&M University—we didn’t call work such as my dissertation (Noise-tolerant Software Method for Traffic Sign Recognition) Data Science. But in several ways, while I was reading the fine interviews in this book, I sure was reminded of the algorithms I worked out back then: Various AI programming techniques (neural networks primarily, such as the Back-propagation Neural Network and the Adaptive Resonance Theory model, aka ART2). Good stuff, and enough reminiscing, for that matter 🙂

So Data Scientists at Work is a fantastic book overall, if this sort of thing piques your interest.

7. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems (O’Reilly), by Donald Miner and Adam Shook 🍯

Segueing right back to Hadoop now, the title of the next book is decidedly open-ended: MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems (O’Reilly), by Donald Miner and Adam Shook. Given the open-ended title, allow me to elaborate on the gist of this fine book…

The authors are clearly experts in the Hadoop ecosystem, and what they’ve put together is more than what you’ll find in the endearing O’Reilly “cookbook” series. Thus, they don’t call out specific problems and accompanying solutions. Instead, they share the lessons that they have learned along the way to becoming experts in the Hadoop ecosystem. Note, too, that this book is mostly about the analytics side of Hadoop and MapReduce.

And they assume that you’re already familiar with how Hadoop and MapReduce work, so they don’t dive into the details of the APIs which they use in this book—Those topics have already been covered thoroughly in other books, and they focus on analytics. In their own words

The motivation for us to write this book was to fill a missing gap we saw in a lot of new MapReduce developers. They had learned how to use the system, got comfortable with writing MapReduce, but were lacking the experience to understand how to do things right or well. The intent of this book is to prevent you from having to make some of your own mistakes by educating you on how experts have figured out how to solve problems with MapReduce. So, in some ways, this book can be viewed as an intermediate or advanced MapReduce developer resource, but we think early beginners and gurus will find use out of it.

One thing I appreciated a lot was the way the authors answer the question, “So why should we use Java MapReduce in Hadoop at all when we have options like Pig and Hive?”. They point out two core reasons for spending time explaining how to implement something in hundreds of lines of code when the same can be accomplished in a couple lines with, say, Pig and Hive. In their own words

First, there is conceptual value in understanding the lower-level workings of a system like MapReduce. The developer that understands how Pig actually performs a reduce-side join will make smarter decisions. Using Pig or Hive without understanding MapReduce can lead to some dangerous situations…. 

Second, Pig and Hive aren’t there yet in terms of full functionality and maturity (as of 2012). It is obvious that they haven’t reached their full potential yet. Right now, they simply can’t tackle all of the problems in the ways that Java MapReduce can.
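For readers wondering what the reduce-side join mentioned in that quote looks like under the hood, here is a minimal sketch in plain Python rather than Java MapReduce. The function names, the `"U"`/`"O"` tags, and the sample records are all made up for illustration and are not taken from the book. The idea is that each mapper tags its records with their source, the shuffle groups everything by join key, and the reducer pairs up the two sides.

```python
from collections import defaultdict
from itertools import product


def map_users(record):
    # Emit (join key, tagged value) so the reducer can tell the sources apart.
    user_id, name = record
    yield user_id, ("U", name)


def map_orders(record):
    order_id, user_id, amount = record
    yield user_id, ("O", (order_id, amount))


def shuffle(pairs):
    # Stand-in for the framework's shuffle phase: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, values):
    # Split the group by source tag, then emit every user/order combination.
    users = [v for tag, v in values if tag == "U"]
    orders = [v for tag, v in values if tag == "O"]
    for name, (order_id, amount) in product(users, orders):
        yield key, name, order_id, amount


def reduce_side_join(users, orders):
    mapped = []
    for record in users:
        mapped.extend(map_users(record))
    for record in orders:
        mapped.extend(map_orders(record))
    groups = shuffle(mapped)
    results = []
    for key in sorted(groups):
        results.extend(reduce_join(key, groups[key]))
    return results


users = [(1, "ada"), (2, "grace"), (3, "linus")]
orders = [(100, 1, 25.0), (101, 2, 40.0), (102, 1, 15.5), (103, 9, 8.0)]
joined = reduce_side_join(users, orders)
```

This sketch performs an inner join, so the user with no orders and the order with no matching user both drop out. Every record travels through the shuffle, which is why this style of join is general but comparatively expensive, and that cost is easy to overlook when a higher-level tool generates the job for you.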

Remaining mindful of the fact that the title of this book is admittedly open-ended, I mention here the table of contents to give you a flavor of the topics covered

  • Chapter 1. Design Patterns and MapReduce
  • Chapter 2. Summarization Patterns
  • Chapter 3. Filtering Patterns
  • Chapter 4. Data Organization Patterns
  • Chapter 5. Join Patterns
  • Chapter 6. Metapatterns
  • Chapter 7. Input and Output Patterns
  • Chapter 8. Final Thoughts and the Future of Design Patterns

With the caveats noted above, MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems is a book absolutely worth exploring!

8. Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data (Wiley), by Byron Ellis 🐝

Finally, let’s segue to the land of real-time, streaming data 🙂

This next book is impeccably written in an eminently thoughtful style—Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data (Wiley), by Byron Ellis. The author is the CTO of Spongecell, and he has a Ph.D. in Statistics from Harvard University.

No doubt, with enough determination and time, one can do online searches and cobble together a solution to handle real-time, high-volume mega data. But that raises the question, and I’m not questioning anyone’s tenacity here: Is that really the ideal strategy? And that’s where the book shines. What makes it stand out is the care and thought that have clearly been poured into making this book a one-stop resource for crafting end-to-end solutions for effectively grappling with real-time, high-volume mega data.

Much as I alluded to above, this book is impeccably written. The author has clearly honed his writing skills—quite likely while preparing his dissertation for the Ph.D. that he earned from Harvard University 🙂

Clearly written books are a heaven-send, and this superb book is one. In that vein, the author notes with razor-sharp precision the aim of this book

The goal of this book is to allow a fairly broad range of potential users and implementers in an organization to gain comfort with the complete stack of applications. When real-time projects reach a certain point, they should be agile and adaptable systems that can be easily modified, which requires that the users have a fair understanding of the stack as a whole in addition to their own areas of focus. “Real time” applies as much to the development of new analyses as it does to the data itself. Any number of well-meaning projects have failed because they took so long to implement that the people who requested the project have either moved on to other things or simply forgotten why they wanted the data in the first place. By making the projects agile and incremental, this can be avoided as much as possible.

The author weaves into the narratives a lot of pragmatic advice; he has clearly been in the development trenches and done it all. As with the prior book, I mention here the table of contents to give you a flavor of the topics covered in Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data

Introduction (Overview and Organization of This Book; Who Should Read This Book; Tools You Will Need; What’s on the Website; Time to Dive In)

Part I: Streaming Analytics Architecture
Chapter 1: Introduction to Streaming Data (Sources of Streaming Data; Why Streaming Data Is Different; Infrastructures and Algorithms; Conclusion)
Chapter 2: Designing Real-Time Streaming Architectures (Real-Time Architecture Components; Features of a Real-Time Architecture; Languages for Real-Time Programming; A Real-Time Architecture Checklist; Conclusion)
Chapter 3: Service Configuration and Coordination (Motivation for Configuration and Coordination Systems; Maintaining Distributed State; Apache ZooKeeper; Conclusion)
Chapter 4: Data-Flow Management in Streaming Analysis (Distributed Data Flows; Apache Kafka: High-Throughput Distributed Messaging; Apache Flume: Distributed Log Collection; Conclusion)
Chapter 5: Processing Streaming Data (Distributed Streaming Data Processing; Processing Data with Storm; Processing Data with Samza; Conclusion)
Chapter 6: Storing Streaming Data (Consistent Hashing; “NoSQL” Storage Systems; Other Storage Technologies; Choosing a Technology; Warehousing; Conclusion)

Part II: Analysis and Visualization
Chapter 7: Delivering Streaming Metrics (Streaming Web Applications; Visualizing Data; Mobile Streaming Applications; Conclusion)
Chapter 8: Exact Aggregation and Delivery (Timed Counting and Summation; Multi-Resolution Time-Series Aggregation; Stochastic Optimization; Delivering Time-Series Data; Conclusion)
Chapter 9: Statistical Approximation of Streaming Data (Numerical Libraries; Probabilities and Distributions; Working with Distributions; Random Number Generation; Sampling Procedures; Conclusion)
Chapter 10: Approximating Streaming Data with Sketching (Registers and Hash Functions; Working with Sets; The Bloom Filter; Distinct Value Sketches; The Count-Min Sketch; Other Applications; Conclusion)
Chapter 11: Beyond Aggregation (Models for Real-Time Data; Forecasting with Models; Monitoring; Real-Time Optimization; Conclusion)

In the end, do make a note of the author’s point when he reiterates that

The hope is that the reader of this book would feel confident taking a proof-of-concept streaming data project in their organization from start to finish with the intent to release it into a production environment.

All this makes Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data a book that shouldn’t be missed 😉

9. Big Data: Principles and Best Practices of Scalable Realtime Data Systems (Manning), by Nathan Marz 🐾

Last, but certainly not the least—continuing now in the spirit of frameworks that enable us developers to tackle real-time, streaming data—is a book by Nathan Marz: Big Data: Principles and Best Practices of Scalable Realtime Data Systems (Manning). The author happens to be the originator of the Lambda Architecture approach to programming in the world of Big Data, and he deploys his considerable knowledge of this approach in explaining the details.

This book dives deep into the concepts underlying Lambda Architecture (which is what the author dubbed the approach that he formalized during his years working at the startup BackType) along with, importantly, many illustrative examples which are nicely supplemented by code snippets. The author puts it succinctly when he notes that

This book is the result of my desire to spread the knowledge of the Lambda Architecture and how it avoids the complexities of traditional architectures. It is the book I wish I had when I started working with Big Data. I hope you treat this book as a journey—a journey to challenge what you thought you knew about data systems, and to discover that working with Big Data can be elegant, simple, and fun.
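If the Lambda Architecture is new to you, the gist (as I understand it) is that a batch layer periodically recomputes views from an immutable master dataset, a speed layer keeps an incremental view of events the batch has not absorbed yet, and queries merge the two. Here is a deliberately tiny sketch of that idea in plain Python. The function and class names, and the page-view example, are my own hypothetical choices and are not taken from the book.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute the view from scratch over all stored events."""
    return Counter(url for url, _timestamp in master_dataset)


class SpeedLayer:
    """Speed layer: incrementally track events the batch has not seen yet."""

    def __init__(self):
        self.realtime_view = Counter()

    def on_event(self, url, timestamp):
        self.realtime_view[url] += 1


def query(url, batch, realtime):
    # Serving side: merge the (stale but complete) batch view with the
    # (fresh but partial) realtime view. Counter yields 0 for unseen keys.
    return batch[url] + realtime[url]


# Events already stored in the master dataset when the last batch run started.
master = [("/home", 1), ("/docs", 2), ("/home", 3)]
batch = batch_view(master)

# Events that arrived after that batch run.
speed = SpeedLayer()
speed.on_event("/home", 4)
speed.on_event("/blog", 5)

home_views = query("/home", batch, speed.realtime_view)
blog_views = query("/blog", batch, speed.realtime_view)
docs_views = query("/docs", batch, speed.realtime_view)
```

In a real system the realtime view would be discarded once a later batch run has covered the same events. The sketch leaves that bookkeeping out to keep the merge step front and center.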

As an aside—confessing here my fondness for Clojure, the Lisp that runs on the JVM—I couldn’t help but resonate with the following sentiments echoed by Nathan Marz in the Acknowledgments section of Big Data: Principles and Best Practices of Scalable Realtime Data Systems, where he notes that

Rich Hickey has been one of my biggest inspirations during my programming career. Clojure is the best language I have ever used, and I’ve become a better programmer having learned it. I appreciate its practicality and focus on simplicity. Rich’s philosophy on state and complexity in programming has influenced me deeply.

In sum, this is a worthwhile book, nicely structured into theory and illustration chapters.

In the end, and as I mentioned at the outset, I invite your comments—Having now read my brief take each on the books above…

  • Do you find that your experience of reading any of these books was different? 
  • Perhaps some qualities that I did not cover are the ones that you found the most helpful as you learned Spark, Hadoop, and their ecosystems.
  • Did I omit any of your favorite Big Data book(s)? 
  • I’ve covered only a partial list of the Big Data books that I’ve read, limited as you can imagine I am by the time available…

As with my prior post, which contains a set of book vignettes—those pertaining to the finest and most useful books on Scala in print—my aim here, too, in sharing these brief reviews remains the same, albeit on a different subject (Big Data) this time: I hope these vignettes will help you in selecting your resources well, and help you in your journey to grokking the Big Data solution space!

Bon voyage, and I leave you with an obligatory photo of a section of one of my bookshelves—one that’s, um, rather biased toward Big Data material in a statistically significant way, eh 😉

