Putting Apache Kafka To Use: A Practical Guide to Building a Stream Data Platform (Part 2)

Confluent

This is the second part of our guide on streaming data and Apache Kafka. In part one I talked about the uses for real-time data streams and explained our idea of a stream data platform. The remainder of this guide will contain specific advice on how to go about building a stream data platform in your organization.

This advice is drawn from our experience building and implementing Kafka at LinkedIn and rolling it out across all the data types and systems there. It also comes from four years working with tech companies in Silicon Valley to build Kafka-based stream data platforms in their organizations.

This is meant to be a living document. As we learn new techniques, or new tools become available, I’ll update it.

Getting Started

Much of the advice in this guide covers techniques that will scale to hundreds or thousands of well-formed data streams. No one starts with…

View original post (6,089 more words)

Posted in Journal | Leave a comment on this post

Putting Apache Kafka To Use: A Practical Guide to Building a Stream Data Platform (Part 1)

Confluent

These days you hear a lot about “stream processing”, “event data”, and “real-time”, often related to technologies like Kafka, Storm, Samza, or Spark’s Streaming module. Though there is a lot of excitement, not everyone knows how to fit these technologies into their technology stack or how to put them to use in practical applications.

This guide is going to discuss our experience with real-time data streams: how to build a home for real-time data within your company, and how to build applications that make use of that data. All of this is based on real experience: we spent the last five years building Apache Kafka, transitioning LinkedIn to a fully stream-based architecture, and helping a number of Silicon Valley tech companies do the same thing.

The first part of the guide will give a high-level overview of what we came to call a “stream data platform”: a…

View original post (4,180 more words)


Play with Spark: Building Spark MLLib in a Play Spark Application

Knoldus Blogs

In the last post of our Play with Spark! series, we saw how to integrate Spark SQL into a Play Scala application. In this post we will see how to add Spark MLLib to a Play Scala application.

Spark MLLib is a new component under active development. It was first released with Spark 0.8.0. It contains some common machine learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as some optimization primitives. For a detailed list of available algorithms, click here.

To add Spark MLLib to a Play Scala application, follow these steps:

1). Add the following dependencies to the build.sbt file

The dependency "org.apache.spark" %% "spark-mllib" % "1.0.1" is specific to Spark MLLib.

As you can see, we have upgraded to Spark 1.0.1 (the latest release of Apache Spark).
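The dependency listing itself is not reproduced in this excerpt. As a rough sketch only, the build.sbt addition might look like the fragment below. Only the spark-mllib line is confirmed by the text above; the spark-core and spark-sql entries are assumptions based on the earlier Spark SQL post in the series, and the fragment assumes a Scala 2.10 project so that `%%` resolves to the `_2.10` artifacts.

```scala
// build.sbt (fragment): a sketch, not the original post's listing
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.0.1", // assumption: carried over from earlier posts
  "org.apache.spark" %% "spark-sql"   % "1.0.1", // assumption: from the Spark SQL post
  "org.apache.spark" %% "spark-mllib" % "1.0.1"  // stated in the text above
)
```

Keeping all Spark modules on the same version avoids mixing incompatible binaries on the classpath.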

2). Create the file app/utils/SparkMLLibUtility.scala and add the following code to it

In above…

View original post (168 more words)


Download the New Impala e-Book from O’Reilly Media

See on Scoop.it: pdg-technologies.com

Cloudera offers enterprises a powerful new data platform built on the popular Apache Hadoop open-source software package.

See on blog.cloudera.com


rebound

See on Scoop.it: PDG Mobile Tech

Rebound is a Java library that models spring dynamics. Rebound spring models can be used to create animations that feel natural by introducing real-world physics to your application.

Rebound is not a general purpose physics library; however, spring dynamics can be used to drive a wide variety of animations. The simplicity of Rebound makes it easy to integrate and use as a building block for creating more complex components like pagers, toggles, and scrollers.

See on facebook.github.io


In-Stream Big Data Processing

Highly Scalable Blog

The shortcomings and drawbacks of batch-oriented data processing were widely recognized by the Big Data community quite a long time ago. It became clear that real-time query processing and in-stream processing are an immediate need in many practical applications. In recent years, this idea got a lot of traction and a whole bunch of solutions like Twitter’s Storm, Yahoo’s S4, Cloudera’s Impala, Apache Spark, and Apache Tez appeared and joined the army of Big Data and NoSQL systems. This article is an effort to explore techniques used by developers of in-stream data processing systems, trace the connections of these techniques to massive batch processing and OLTP/OLAP databases, and discuss how one unified query engine can support in-stream, batch, and OLAP processing at the same time.

At Grid Dynamics, we recently needed to build an in-stream data processing system designed to crunch about 8 billion events daily, providing…

View original post (5,219 more words)


Log Analysis System Using Hadoop and MongoDB | CUBRID Blog

See on Scoop.it: pdg-technologies.com

See on www.cubrid.org


Thoughts on NoSQL & Big Data Architecture

Adam's Deep Technology Blog

I recently had a webpage forwarded to me by someone at work. It’s a very complex diagram of a ‘typical’ Big Data architecture. It also contains a couple of NoSQL databases. I decided to do a critique of it from a pure NoSQL standpoint. The diagram should (if we in the computer industry are doing our jobs right) be able to be simplified if we use the correct products and approaches. I’ll detail my thoughts in this article…

View original post (4,143 more words)


FalconProposal – Incubator Wiki

See on Scoop.it: pdg-technologies.com

Kun Le’s insight:

Falcon is a data processing and management solution for Hadoop designed for data motion, coordination of data pipelines, lifecycle management, and data discovery. Falcon enables end consumers to quickly onboard their data and its associated processing and management tasks on Hadoop.

See on wiki.apache.org

