Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data

  • Wiley

English

Construct a robust end-to-end solution for analyzing and visualizing streaming data

Real-time analytics is the hottest topic in data analytics today. In Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data, expert Byron Ellis teaches data analysts the technologies they need to build an effective real-time analytics platform. This platform can then be used to make sense of constantly changing data, which is beginning to outpace traditional batch-based analysis platforms.

The author is one of very few leading experts in the field. He has a prestigious background in research, development, analytics, real-time visualization, and Big Data streaming and is uniquely qualified to help you explore this revolutionary field. Moving from a description of the overall architecture of real-time analytics to the use of specific tools to obtain targeted results, Real-Time Analytics leverages open source and modern commercial tools to construct robust, efficient systems that can provide real-time analysis in a cost-effective manner. The book includes:

  • A deep discussion of streaming data systems and architectures
  • Instructions for analyzing, storing, and delivering streaming data
  • Tips on aggregating data and working with sets
  • Information on data warehousing options and techniques

Real-Time Analytics includes in-depth case studies for website analytics, Big Data, visualizing streaming and mobile data, and mining and visualizing operational data flows. The book's "recipe" layout lets readers quickly learn and implement different techniques. All of the code examples presented in the book, along with their related data sets, are available on the companion website.

BYRON ELLIS is CTO of Spongecell, where he heads research and development. Previously the Chief Data Scientist for LivePerson and CTO at AdBrite, Ellis holds a Ph.D. in Statistics from Harvard University and a B.S. in Cybernetics from UCLA. He presents sessions on real-time analytics at Strata and other major conferences.

Introduction xv

Chapter 1 Introduction to Streaming Data 1

Sources of Streaming Data 2

Operational Monitoring 3

Web Analytics 3

Online Advertising 4

Social Media 5

Mobile Data and the Internet of Things 5

Why Streaming Data Is Different 7

Always On, Always Flowing 7

Loosely Structured 8

High-Cardinality Storage 9

Infrastructures and Algorithms 10

Conclusion 10

Part I Streaming Analytics Architecture 13

Chapter 2 Designing Real-Time Streaming Architectures 15

Real-Time Architecture Components 16

Collection 16

Data Flow 17

Processing 19

Storage 20

Delivery 22

Features of a Real-Time Architecture 24

High Availability 24

Low Latency 25

Horizontal Scalability 26

Languages for Real-Time Programming 27

Java 27

Scala and Clojure 28

JavaScript 29

The Go Language 30

A Real-Time Architecture Checklist 30

Collection 31

Data Flow 31

Processing 32

Storage 32

Delivery 33

Conclusion 34

Chapter 3 Service Configuration and Coordination 35

Motivation for Configuration and Coordination Systems 36

Maintaining Distributed State 36

Unreliable Network Connections 36

Clock Synchronization 37

Consensus in an Unreliable World 38

Apache ZooKeeper 39

The znode 39

Watches and Notifications 41

Maintaining Consistency 41

Creating a ZooKeeper Cluster 42

ZooKeeper’s Native Java Client 47

The Curator Client 56

Curator Recipes 63

Conclusion 70

Chapter 4 Data-Flow Management in Streaming Analysis 71

Distributed Data Flows 72

At Least Once Delivery 72

The “n+1” Problem 73

Apache Kafka: High-Throughput Distributed Messaging 74

Design and Implementation 74

Configuring a Kafka Environment 80

Interacting with Kafka Brokers 89

Apache Flume: Distributed Log Collection 92

The Flume Agent 92

Configuring the Agent 94

The Flume Data Model 95

Channel Selectors 95

Flume Sources 98

Flume Sinks 107

Sink Processors 110

Flume Channels 110

Flume Interceptors 112

Integrating Custom Flume Components 114

Running Flume Agents 114

Conclusion 115

Chapter 5 Processing Streaming Data 117

Distributed Streaming Data Processing 118

Coordination 118

Partitions and Merges 119

Transactions 119

Processing Data with Storm 119

Components of a Storm Cluster 120

Configuring a Storm Cluster 122

Distributed Clusters 123

Local Clusters 126

Storm Topologies 127

Implementing Bolts 130

Implementing and Using Spouts 136

Distributed Remote Procedure Calls 142

Trident: The Storm DSL 144

Processing Data with Samza 151

Apache YARN 151

Getting Started with YARN and Samza 153

Integrating Samza into the Data Flow 157

Samza Jobs 157

Conclusion 166

Chapter 6 Storing Streaming Data 167

Consistent Hashing 168

“NoSQL” Storage Systems 169

Redis 170

MongoDB 180

Cassandra 203

Other Storage Technologies 215

Relational Databases 215

Distributed In-Memory Data Grids 215

Choosing a Technology 215

Key-Value Stores 216

Document Stores 216

Distributed Hash Table Stores 216

In-Memory Grids 217

Relational Databases 217

Warehousing 217

Hadoop as ETL and Warehouse 218

Lambda Architectures 223

Conclusion 224

Part II Analysis and Visualization 225

Chapter 7 Delivering Streaming Metrics 227

Streaming Web Applications 228

Working with Node 229

Managing a Node Project with NPM 231

Developing Node Web Applications 235

A Basic Streaming Dashboard 238

Adding Streaming to Web Applications 242

Visualizing Data 254

HTML5 Canvas and Inline SVG 254

Data-Driven Documents: D3.js 262

High-Level Tools 272

Mobile Streaming Applications 277

Conclusion 279

Chapter 8 Exact Aggregation and Delivery 281

Timed Counting and Summation 285

Counting in Bolts 286

Counting with Trident 288

Counting in Samza 289

Multi-Resolution Time-Series Aggregation 290

Quantization Framework 290

Stochastic Optimization 296

Delivering Time-Series Data 297

Strip Charts with D3.js 298

High-Speed Canvas Charts 299

Horizon Charts 301

Conclusion 303

Chapter 9 Statistical Approximation of Streaming Data 305

Numerical Libraries 306

Probabilities and Distributions 307

Expectation and Variance 309

Statistical Distributions 310

Discrete Distributions 310

Continuous Distributions 312

Joint Distributions 315

Working with Distributions 316

Inferring Parameters 316

The Delta Method 317

Distribution Inequalities 319

Random Number Generation 319

Generating Specific Distributions 321

Sampling Procedures 324

Sampling from a Fixed Population 325

Sampling from a Streaming Population 326

Biased Streaming Sampling 327

Conclusion 329

Chapter 10 Approximating Streaming Data with Sketching 331

Registers and Hash Functions 332

Registers 332

Hash Functions 332

Working with Sets 336

The Bloom Filter 338

The Algorithm 338

Choosing a Filter Size 340

Unions and Intersections 341

Cardinality Estimation 342

Interesting Variations 344

Distinct Value Sketches 347

The Min-Count Algorithm 348

The HyperLogLog Algorithm 351

The Count-Min Sketch 356

Point Queries 356

Count-Min Sketch Implementation 357

Top-K and “Heavy Hitters” 358

Range and Quantile Queries 360

Other Applications 364

Conclusion 364

Chapter 11 Beyond Aggregation 367

Models for Real-Time Data 368

Simple Time-Series Models 369

Linear Models 373

Logistic Regression 378

Neural Network Models 380

Forecasting with Models 389

Exponential Smoothing Methods 390

Regression Methods 393

Neural Network Methods 394

Monitoring 396

Outlier Detection 397

Change Detection 399

Real-Time Optimization 400

Conclusion 402

Index 403