Putting Apache Kafka To Use: A Practical Guide to Building a Stream Data Platform (Part 2)

Originally posted on Confluent:

This is the second part of our guide on streaming data and Apache Kafka. In part one I talked about the uses for real-time data streams and explained our idea of a stream data platform. The remainder of this guide will contain specific advice on how to go about building a stream data platform in your organization.

This advice is drawn from our experience building and implementing Kafka at LinkedIn and rolling it out across all the data types and systems there. It also comes from four years working with tech companies in Silicon Valley to build Kafka-based stream data platforms in their organizations.

This is meant to be a living document. As we learn new techniques, or new tools become available, I’ll update it.

Getting Started

Much of the advice in this guide covers techniques that will scale to hundreds or thousands of well-formed data streams. No one starts with…

View original 6,089 more words

Posted in Nhật ký | Leave a comment

Putting Apache Kafka To Use: A Practical Guide to Building a Stream Data Platform (Part 1)

Originally posted on Confluent:

These days you hear a lot about “stream processing”, “event data”, and “real-time”, often related to technologies like Kafka, Storm, Samza, or Spark’s Streaming module. Though there is a lot of excitement, not everyone knows how to fit these technologies into their technology stack or how to put them to use in practical applications.

This guide is going to discuss our experience with real-time data streams: how to build a home for real-time data within your company, and how to build applications that make use of that data. All of this is based on real experience: we spent the last five years building Apache Kafka, transitioning LinkedIn to a fully stream-based architecture, and helping a number of Silicon Valley tech companies do the same thing.

The first part of the guide will give a high-level overview of what we came to call a “stream data platform”: a…

View original 4,180 more words


Play with Spark: Building Spark MLLib in a Play Spark Application

Originally posted on Knoldus:

In our last post of the Play with Spark! series, we saw how to integrate Spark SQL into a Play Scala application. In this post we will see how to add the Spark MLLib feature to a Play Scala application.

Spark MLLib is a new component under active development. It was first released with Spark 0.8.0. It contains some common machine learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as some optimization primitives. For a detailed list of available algorithms, click here.

To add the Spark MLLib feature to a Play Scala application, follow these steps:

1). Add the following dependencies to the build.sbt file

The dependency – "org.apache.spark" %% "spark-mllib" % "1.0.1" – is specific to Spark MLLib.

As you can see, we have upgraded to Spark 1.0.1 (the latest release of Apache Spark).
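The embedded gist with the dependencies was not preserved in this excerpt. A minimal build.sbt sketch for step 1, assuming Scala 2.10 and a hypothetical project name (only the spark-mllib line is confirmed by the post), might look like:

```scala
// build.sbt — minimal sketch; project name and companion dependencies are assumptions
name := "play-spark-mllib"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.0.1",
  // The dependency specific to Spark MLLib, as stated in the post:
  "org.apache.spark" %% "spark-mllib" % "1.0.1"
)
```

The `%%` operator makes sbt append the Scala binary version (e.g. `_2.10`) to the artifact name, so the Spark artifacts resolve against the project's Scala version.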

2). Create a file app/utils/SparkMLLibUtility.scala and add the following code to it
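The code for this file is not preserved in the excerpt either. A sketch of what such a utility might look like, assuming a Naive Bayes classification example (the object layout, data format, and method names are assumptions, not the post's actual code):

```scala
package utils

import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object SparkMLLibUtility {

  // A local Spark context for illustration; a real Play app
  // would typically create and share one context per application.
  val sc = new SparkContext("local", "SparkMLLibExample")

  // Trains a Naive Bayes model and returns accuracy on a held-out split.
  // Expected input lines: "label,feature1 feature2 feature3"
  def classify(dataFile: String): Double = {
    val data = sc.textFile(dataFile)
    val parsed = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble,
        Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }
    // 60/40 train/test split
    val Array(training, test) = parsed.randomSplit(Array(0.6, 0.4), seed = 11L)
    val model = NaiveBayes.train(training, lambda = 1.0)
    // Fraction of test points whose predicted label matches the true label
    test.map(p => (model.predict(p.features), p.label))
      .filter { case (predicted, actual) => predicted == actual }
      .count().toDouble / test.count()
  }
}
```

This could then be called from a Play controller action, returning the accuracy in the response; running it requires the Spark 1.0.1 dependencies from step 1 on the classpath.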

In the above…

View original 168 more words


Download the New Impala e-Book from O’Reilly Media

See on Scoop.it – pdg-technologies.com

Cloudera offers enterprises a powerful new data platform built on the popular Apache Hadoop open-source software package.

See on blog.cloudera.com



Rebound

See on Scoop.it – PDG Mobile Tech

Rebound is a Java library that models spring dynamics. Rebound spring models can be used to create animations that feel natural by introducing real-world physics to your application.

Rebound is not a general purpose physics library; however, spring dynamics can be used to drive a wide variety of animations. The simplicity of Rebound makes it easy to integrate and use as a building block for creating more complex components like pagers, toggles, and scrollers.

See on facebook.github.io


In-Stream Big Data Processing

Originally posted on Highly Scalable Blog:

The shortcomings and drawbacks of batch-oriented data processing were widely recognized by the Big Data community quite a long time ago. It became clear that real-time query processing and in-stream processing are an immediate need in many practical applications. In recent years, this idea got a lot of traction, and a whole bunch of solutions like Twitter’s Storm, Yahoo’s S4, Cloudera’s Impala, Apache Spark, and Apache Tez appeared and joined the army of Big Data and NoSQL systems. This article is an effort to explore techniques used by developers of in-stream data processing systems, trace the connections of these techniques to massive batch processing and OLTP/OLAP databases, and discuss how one unified query engine can support in-stream, batch, and OLAP processing at the same time.

At Grid Dynamics, we recently faced a necessity to build an in-stream data processing system that aimed to crunch about 8 billion events daily providing…

View original 5,219 more words


Log Analysis System Using Hadoop and MongoDB | CUBRID Blog

See on Scoop.it – pdg-technologies.com

See on www.cubrid.org
