Emerging Technologies for the Enterprise

Druid: Real-Time Queries Meet Real-Time Data

Eric Tschetter - Lead Architect, Druid

Wed - 10:15-11:15 AM, Salon E

Infrastructure

This talk will focus on the design considerations and architecture of Druid, an open-source, distributed, column-oriented analytical data store in use at Metamarkets (http://www.metamarkets.com) to facilitate rapid exploration of high-dimensional spaces. We use Druid to expose impression monetization data to ad tech companies along any arbitrary combination of demographic, content and sales-based dimensions. One Druid cluster currently exposes a data set of >40 billion rows representing >2 trillion impressions in hypercubes of varying dimensionality (the largest has 30+ dimensions), while allowing exploration through top lists and timeseries at sub-second latencies. There will be a particular focus on how Druid ingests data in real time on the write side and provides real-time access to data on the read side.

The Druid code can be found at http://www.github.com/metamx/druid.
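
To make the two query shapes named in the abstract (top lists and timeseries) concrete, here is a minimal Python sketch over a tiny column-oriented table. It is not Druid's code or API: the column names, the sample rows, and the helper functions `group_sum`, `top_list` and `timeseries` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical column-oriented table: one list per column, rows aligned by index.
columns = {
    "hour":        [0, 0, 1, 1, 1],
    "publisher":   ["a.com", "b.com", "a.com", "c.com", "b.com"],
    "gender":      ["f", "m", "f", "f", "m"],
    "impressions": [100, 250, 300, 50, 125],
}


def group_sum(columns, dimension, metric, filters=None):
    """Sum `metric` per value of `dimension`, keeping only rows that match `filters`."""
    filters = filters or {}
    totals = defaultdict(int)
    for i, key in enumerate(columns[dimension]):
        # Only the columns named in the query are read; the others are never touched.
        if all(columns[col][i] == wanted for col, wanted in filters.items()):
            totals[key] += columns[metric][i]
    return dict(totals)


def top_list(columns, dimension, metric, n, filters=None):
    """Return the n largest (dimension value, total) pairs, largest first."""
    totals = group_sum(columns, dimension, metric, filters)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


def timeseries(columns, metric, filters=None):
    """Return (hour, total) pairs in time order."""
    return sorted(group_sum(columns, "hour", metric, filters).items())


top_publishers = top_list(columns, "publisher", "impressions", n=2)
female_by_hour = timeseries(columns, "impressions", filters={"gender": "f"})
```

Any combination of dimensions can go into `filters`, which is the sense in which the data can be sliced "along any arbitrary combination" of dimensions; a real system would add indexing, compression and distribution on top of this idea.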

Stream Processing: Philosophy, Concepts, and Technologies

Dan Frank - Software Engineer, bit.ly

Wed - 04:00-05:00 PM, Salon E

Infrastructure

Stream processing has emerged in recent years as a very fast-growing paradigm in data science infrastructure. This rise can be partly attributed to factors external to system design, such as business demands for near-real-time data or the inability of hardware to manage an ever-growing data set. However, this paradigm also possesses many inherent strengths, and there is good reason for it to be embraced, not simply tolerated. In this talk I’ll discuss some high-level advantages of processing data in streams, such as fault tolerance, horizontal scalability, and composability. I’ll then introduce NSQ, Bitly’s open source queueing system, and discuss how it provides us with these advantages and how it approaches the tradeoffs inherent in designing distributed systems. I’ll also discuss some of the burdens that NSQ places on developers, such as idempotent operations, and why they are necessary. Finally, I’ll discuss some new technologies that aim to abstract away the mechanism of communication between streaming programs, and talk about the powerful opportunities and risks that they offer.
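
The burden of idempotent operations mentioned above can be illustrated with a minimal Python sketch. It assumes a queue that may deliver the same message more than once and that each message carries a unique ID; it does not use any NSQ client library, and the class name `IdempotentCounter`, the message fields `id` and `url`, and the sample URLs are hypothetical.

```python
class IdempotentCounter:
    """Counts clicks per URL and ignores messages it has already applied."""

    def __init__(self):
        self.seen_ids = set()  # IDs of messages already applied
        self.clicks = {}       # url -> click count

    def handle(self, message):
        """Apply a message at most once; return False if it was a duplicate."""
        msg_id = message["id"]
        if msg_id in self.seen_ids:
            return False  # redelivery: applying it again would double-count
        url = message["url"]
        self.clicks[url] = self.clicks.get(url, 0) + 1
        self.seen_ids.add(msg_id)
        return True


# Hypothetical delivery sequence in which message "m1" arrives twice.
deliveries = [
    {"id": "m1", "url": "bit.ly/a"},
    {"id": "m2", "url": "bit.ly/b"},
    {"id": "m1", "url": "bit.ly/a"},
]

counter = IdempotentCounter()
results = [counter.handle(m) for m in deliveries]
```

Because handling a message twice leaves the state unchanged, the consumer can safely be restarted or sent the same message again. In a real deployment the set of seen IDs would need to be bounded or persisted, which this sketch leaves out.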

The Future of the JVM

Jamie Allen - Author, Effective Akka
Cliff Click - CTO & Founder, 0xdata
Charlie Hunt - Author, Java Performance
Doug Lea - Governing Board, OpenJDK
Michael Pilquist - Lead Software Architect, CCAD

Wed - 11:30-12:30 PM, Salon C

Infrastructure

In today’s production environments, tremendous amounts of work can be performed on servers running the JVM with dozens of cores, yet in just a few years we could have machines that have thousands of cores. Parallelizing work in such a “manycore” environment is a hot topic, as is managing concurrency with so many possible threads executing at the same time. Will deterministic results be impossible in such a world? Will the JVM evolve to have more hardware affinity, providing developers with tools to create applications with more specific performance profiles? Join us as we talk with experts Cliff Click, Charlie Hunt, Doug Lea and Mike Pilquist about the challenges facing developers using the JVM on tomorrow’s computing platforms, as well as discuss the future of the JVM itself.

The Fundamentals of JVM Tuning

Charlie Hunt - Author, Java Performance

Tue - 11:30-12:30 PM, Salon C

Infrastructure

When you are faced with the challenge of tuning the JVM, you can find a wide variety of information. Yet the information is almost always specific to one type of tuning or one type of problem. Seldom can you find information that abstracts the details to a higher level and simplifies them into a set of fundamentals and principles. This is what you can expect to hear and learn in this session.
The first step is to understand your application’s requirements for the performance metrics of throughput, latency and footprint. From there you can formulate a strategy, including choosing an appropriate garbage collector (GC). After that, it’s a matter of understanding the fundamentals of what affects GC behavior and what you can do about it. You will also learn what a Java developer should understand about JIT compilation and what he or she can do about it.

Building a Terabyte-scale Math Platform

Cliff Click - CTO & Founder, 0xdata

Tue - 04:00-05:00 PM, Salon C

Infrastructure

Datasets have grown to PB scale, but the modeling you can do has been limited to a single node (e.g. R, SAS), stuck inside the database, or takes hours on Hadoop-like technologies. We have built a simple clustering package and are using it to do distributed analytics on the sum of all RAM in a cluster. This talk focuses on how the clustering technology, plus a Java-based vector math API, is being used to build full algorithms like GLM/GLMNET, Random Forest and K-means. These algorithms are complex multi-pass programs, and traditional distributed programming models expose the distributed boundaries, making the algorithms hard to reason about. We have a basic JDK for doing at-scale math: we can run most Plain Olde Java in (distributed) inner loops and communicate via a K/V store with exact Java Memory Model consistency (not lazy consistency). Adding more CPUs makes these algorithms run faster, and adding more RAM allows larger datasets. We are bringing back Moore’s Law!

The Database as a Value

Rich Hickey - Author of Clojure, Designer of Datomic

Tue - 02:45-03:45 PM, Salon C

Infrastructure

Proponents of functional programming tout its many benefits, most of which are available only within a particular process, or afforded by a particular programming language feature. Anything outside of that is considered I/O, dangerous and difficult to reason about. But real systems almost always cross process and language boundaries, and most require, crucially, a very gnarly bit of shared state – a database. In this talk we will examine how Datomic renders the database into that most prized and easy-to-reason-about construct, a value, and makes it available to multiple processes in multiple languages, functional and not.

Along the way, we’ll discuss the importance of immutability and time in representing information, the reification of process, and the mechanisms of durable persistent data structures. No knowledge of functional programming is required.
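
The idea of treating a database as a value can be sketched in a few lines of Python. This is only an illustration of immutability and time, not Datomic's API or implementation: the class `Store` and its methods `current`, `at` and `commit` are hypothetical names, and each commit copies the whole map, whereas real persistent data structures share unchanged parts between versions.

```python
from types import MappingProxyType


class Store:
    """Keeps every past state as a read-only snapshot."""

    def __init__(self):
        # Snapshot 0 is the empty database.
        self._history = [MappingProxyType({})]

    def current(self):
        """Return the latest snapshot; it never changes after it is handed out."""
        return self._history[-1]

    def at(self, t):
        """Return the snapshot as of point in time t."""
        return self._history[t]

    def commit(self, changes):
        """Create a new snapshot with `changes` applied and return its time index."""
        new_state = dict(self._history[-1])  # full copy; no structural sharing here
        new_state.update(changes)
        self._history.append(MappingProxyType(new_state))
        return len(self._history) - 1


store = Store()
before = store.current()               # a value: the empty database
t1 = store.commit({"user/1": "alice"})
t2 = store.commit({"user/1": "alicia"})
```

After both commits, `before` is still empty and `store.at(t1)` still reports `"alice"`, so a reader holding a snapshot can reason about it without worrying about concurrent writers.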

Local Lightning Infra Talks

Mat Schaffer - co-founder, Mashion

Wed - 04:00-05:00 PM, Salon C

Infrastructure

Local Lightning talks

This session includes five rapid-fire talks showcasing local applications of technology and what they have accomplished for their businesses. Philadelphia-area speakers include:

Brian Flad – Messing with Sasquatch: Chef in the Woods
Chris Alfano – The Emergence Platform
Nate Bomberger – Enterprise Logistics using CloudMine
Angel Pizarro – On-demand high-throughput compute clusters with StarCluster
Brian O’Neill – The Big Data Quadfecta: Combining Cassandra, Storm, Kafka, and Elastic Search

Emerging Technologies
for the Enterprise 2013
