Apache Kafka is a distributed streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on an abstraction of a distributed commit log. Since being created and open sourced by LinkedIn in 2011, Kafka has quickly evolved from a messaging queue into a full-fledged streaming platform.
As a streaming platform, Apache Kafka provides low-latency, high-throughput, fault-tolerant publish and subscribe pipelines and is able to process streams of events. Kafka provides reliable, millisecond responses to support both customer-facing applications and connecting downstream systems with real-time data.
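The commit-log abstraction and the publish and subscribe model described above can be illustrated with a small in-memory sketch. This is not Kafka's client API: the class `CommitLog` and its methods `append` and `read` are hypothetical names chosen for illustration, and a single Python list stands in for what would be a distributed, replicated log. The sketch only shows that records receive sequential offsets and that each subscriber tracks its own read position.

```python
from dataclasses import dataclass, field


@dataclass
class CommitLog:
    """Toy append-only log (hypothetical; not a Kafka API)."""
    records: list = field(default_factory=list)

    def append(self, event):
        """Publish an event and return the offset it was stored at."""
        self.records.append(event)
        return len(self.records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records events starting at the given offset."""
        return self.records[offset:offset + max_records]


log = CommitLog()
for event in ["signup", "login", "purchase"]:
    log.append(event)

# Each subscriber keeps its own offset, so reading does not remove records
# and two subscribers can be at different positions in the same log.
analytics_offset = 0
billing_offset = 2
analytics_batch = log.read(analytics_offset)
billing_batch = log.read(billing_offset)
```

Because reads are driven by offsets rather than by removing messages, a new subscriber could start from offset 0 and replay the full history, which is one way the log differs from a traditional queue.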
Scala is a modern multi-paradigm programming language designed to express common programming patterns in a concise, elegant, and type-safe way. Scala was created by Martin Odersky, who released the first version in 2003. It smoothly integrates the features of object-oriented and functional languages.
Apache Spark is written in Scala, and because Scala scales well on the JVM, it is the programming language most prominently used by big data developers working on Spark projects.
The name Scala is a portmanteau of scalable and language, signifying that it is designed to grow with the demands of its users.
Java is a general-purpose computer programming language that is concurrent, class-based, object-oriented, and specifically designed to have as few implementation dependencies as possible. It is intended to let application developers "write once, run anywhere" (WORA), meaning that compiled Java code can run on all platforms that support Java without the need for recompilation. Java applications are typically compiled to bytecode that can run on any Java virtual machine (JVM) regardless of computer architecture.
KNIME Analytics Platform is written in Java and based on Eclipse. It uses the Eclipse extension mechanism to add plugins, also written in Java, that provide additional functionality.
TensorFlow is an open source software library for numerical computation using data-flow graphs. It was originally developed by the Google Brain Team within Google's Machine Intelligence research organization for machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well. It reached version 1.0 in February 2017, and has continued rapid development, with 21,000+ commits thus far, many from outside contributors.
Apache Spark is a powerful open source processing engine built around speed, ease of use, and sophisticated analytics. It was originally developed at UC Berkeley in 2009 and has quickly become the largest open source community in big data, with over 1000 contributors from 250+ organizations.
Engineered from the ground up for performance, Spark can be up to 100x faster than Hadoop MapReduce for large-scale data processing by exploiting in-memory computing and other optimizations.
Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Apache Cassandra offers robust support for clusters spanning multiple data centers, with asynchronous masterless replication allowing low latency operations for all clients. Evolved from work at Google, Amazon and Facebook, Apache Cassandra is used by leading companies such as Disney, IBM, New York Times, Spotify and Twitter.
Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together.
One of the most important features of Python is its rich set of utilities and libraries for data processing and analytics tasks (e.g. Scikit-Learn). In the current era of big data, Python is growing in popularity because its ease of use extends to big data processing.
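A short sketch may make the terms dynamic typing, dynamic binding, and built-in data structures more concrete. The sample records and the helper `describe` are made up for illustration; only the standard library is used.

```python
from collections import Counter

# Dynamic typing: the same name can be bound to values of different types.
value = 42
value = "forty-two"

# Built-in data structures: a list of dictionaries needs no declarations.
events = [
    {"user": "ada", "action": "login"},
    {"user": "bob", "action": "login"},
    {"user": "ada", "action": "purchase"},
]

# Count events per user with a standard-library helper.
events_per_user = Counter(event["user"] for event in events)


def describe(item):
    """Dynamic binding: len() is resolved at run time for any sized object."""
    return f"{type(item).__name__} of length {len(item)}"


descriptions = [describe(value), describe(events), describe(events_per_user)]
```

The same `describe` function works for a string, a list, and a `Counter` without any type annotations or interfaces, which is part of what makes the language convenient for quick data exploration.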
R is an extensible, open-source language and computing environment for Windows, Macintosh, UNIX, and Linux platforms. R is freely available under the GNU General Public License and is a free implementation of the S programming language, which was originally created and distributed by Bell Labs. R performs a wide variety of basic to advanced statistical and graphical techniques at little to no cost to the user.
While R has a command line interface, several graphical front-ends are available, such as KNIME or RStudio.
SAS is a software suite developed by SAS Institute for advanced analytics, multivariate analyses, business intelligence, data management, and predictive analytics.
With SAS/BASE and SAS/MACRO, SAS takes an extensive programming approach to data transformation and analysis rather than a pure drag-and-drop approach. SAS has a very large number of components customized for specific industries (e.g. Customer Intelligence Studio) and data analysis tasks (e.g. SAS Enterprise Miner).
Amazon Simple Storage Service is storage for the Internet and is designed to make web-scale computing easier for developers. Amazon S3 has a simple web services interface that you can use to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, fast, inexpensive data storage infrastructure that Amazon uses to run its own global network of web sites. The service aims to maximize benefits of scale and to pass those benefits on to developers.
Amazon S3 can store any type of object, which allows for uses such as storage for Internet applications, backup and recovery, disaster recovery, data archives, data lakes for analytics, and hybrid cloud storage.
Beats is the platform for single-purpose data shippers. They send data from hundreds or thousands of machines and systems to Logstash or Elasticsearch. Beats are great for gathering data. They sit on your servers, with your containers, or deploy as functions — and then centralize data in Elasticsearch. And if you want more processing muscle, Beats can also ship to Logstash for transformation and parsing. Beats gather the logs and metrics from your unique environments and document them with essential metadata from hosts, container platforms like Docker and Kubernetes, and cloud providers before shipping them to the Elastic Stack.
The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.
HDFS is a key part of many Hadoop ecosystem technologies, as it provides a reliable means for managing pools of big data and supporting related big data analytics applications.
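The NameNode and DataNode split described above can be sketched as a toy model. Everything here is a simplifying assumption made for illustration: the class names mirror the HDFS roles but are not HDFS APIs, the block size is a few bytes, placement is round-robin, and block replication (which real HDFS relies on for fault tolerance) is left out. The point is only that one component holds metadata while others hold the data blocks.

```python
from itertools import cycle

BLOCK_SIZE = 4  # bytes; deliberately tiny for the demo (assumption)


class DataNode:
    """Stores block contents keyed by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}


class NameNode:
    """Keeps metadata only: which blocks make up a file and where they live."""

    def __init__(self, datanodes):
        self.file_blocks = {}  # path -> list of (block_id, datanode)
        self._next_node = cycle(datanodes)

    def write(self, path, data):
        """Split data into blocks and spread them across the DataNodes."""
        placements = []
        for start in range(0, len(data), BLOCK_SIZE):
            block_id = f"{path}#{start // BLOCK_SIZE}"
            node = next(self._next_node)  # round-robin placement
            node.blocks[block_id] = data[start:start + BLOCK_SIZE]
            placements.append((block_id, node))
        self.file_blocks[path] = placements

    def read(self, path):
        """Look up block locations, then fetch and join the blocks in order."""
        return b"".join(
            node.blocks[block_id] for block_id, node in self.file_blocks[path]
        )


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
namenode = NameNode(nodes)
namenode.write("/logs/app.log", b"hello hdfs!")
restored = namenode.read("/logs/app.log")
```

In this sketch the 11-byte file becomes three blocks, one per DataNode, and the NameNode never stores file contents itself.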
Docker is a tool designed to make it easier to create, deploy, and run applications by using containers. Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and ship it all out as one package.
Because the container carries its dependencies with it, the developer can rest assured that the application will run on any other machine, even if that machine's settings differ from those of the machine used to write and test the code.
Elasticsearch is a distributed, open source search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured. Elasticsearch is built on Apache Lucene and was first released in 2010 by Elasticsearch N.V. (now known as Elastic). Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, a set of open source tools for data ingestion, enrichment, storage, analysis, and visualization. Commonly referred to as the ELK Stack (after Elasticsearch, Logstash, and Kibana), the Elastic Stack now includes a rich collection of lightweight shipping agents known as Beats for sending data to Elasticsearch.
The speed and scalability of Elasticsearch and its ability to index many types of content mean that it can be used for a number of use cases, such as application search, website search, enterprise search, logging and log analytics, infrastructure metrics and container monitoring, application performance monitoring, geospatial data analysis and visualization, security analytics, and business analytics.
Grafana is an open source metric analytics & visualization suite. It is most commonly used for visualizing time series data for infrastructure and application analytics but many use it in other domains including industrial sensors, home automation, weather, and process control.
Grafana connects to many data sources, typically databases such as Graphite, Prometheus, InfluxDB, Elasticsearch, MySQL, and PostgreSQL.
Because Grafana is open source, you can also write plugins from scratch to integrate additional data sources.
Kibana lets you visualize your Elasticsearch data and navigate the Elastic Stack so you can do anything from tracking query load to understanding the way requests flow through your apps.
Kibana gives you the freedom to select the way you give shape to your data. And you don’t always have to know what you’re looking for. With its interactive visualizations, start with one question and see where it leads you. Kibana core ships with the classics: histograms, line graphs, pie charts, sunbursts, and more. And, of course, you can search across all of your documents. Kibana also supports visualizations for location analysis, time series, machine learning, and graphs and networks.
Logstash is an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to your favorite "stash." Logstash dynamically ingests, transforms, and ships your data regardless of format or complexity. Derive structure from unstructured data with grok, decipher geo coordinates from IP addresses, anonymize or exclude sensitive fields, and ease overall processing.
As data travels from source to store, Logstash filters parse each event, identify named fields to build structure, and transform them to converge on a common format for more powerful analysis and business value. Logstash has a pluggable framework featuring over 200 plugins. Mix, match, and orchestrate different inputs, filters, and outputs to work in pipeline harmony.
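The filter stage described above (parse each event, identify named fields, anonymize sensitive values) can be approximated in a few lines. This is a rough sketch, not Logstash configuration or its grok syntax: the log format, the field names, the `parse_failure` tag, and the functions `parse_event`, `anonymize`, and `pipeline` are all assumptions made for illustration, with regular-expression named groups standing in for grok patterns.

```python
import re

# Hypothetical pattern for a simplified access-log line; the named groups
# play the role that named fields play in a grok-style filter.
LINE_PATTERN = re.compile(
    r"(?P<client_ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"(?P<method>[A-Z]+) "
    r"(?P<path>\S+) "
    r"(?P<status>\d{3})"
)


def parse_event(line):
    """Derive structure from an unstructured line, or tag it as unparsed."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return {"message": line, "tags": ["parse_failure"]}
    event = match.groupdict()
    event["status"] = int(event["status"])  # convert to a numeric field
    return event


def anonymize(event):
    """Mask the last octet of the client IP, if the field is present."""
    if "client_ip" in event:
        prefix, _, _ = event["client_ip"].rpartition(".")
        event = {**event, "client_ip": prefix + ".0"}
    return event


def pipeline(lines):
    """Input -> filters -> output, with a plain list as the output 'stash'."""
    return [anonymize(parse_event(line)) for line in lines]


events = pipeline(["203.0.113.7 GET /index.html 200", "not a log line"])
```

Lines that do not match are kept and tagged rather than dropped, so a downstream consumer could still inspect them.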
Kubernetes is an open-source container orchestration platform. It was designed and created by Google and is now maintained by the Cloud Native Computing Foundation. Container orchestration means that Kubernetes takes care of the deployment, scaling and management of containerized applications.
By using declarative, infrastructure-agnostic constructs to describe how applications are composed, how they interact, and how they are managed, Kubernetes enables an order-of-magnitude increase in operability of modern software systems.
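The declarative idea can be illustrated with a toy reconciliation step: the user states the desired result, and the system works out which actions close the gap to the observed state. The function `reconcile`, the application names, and the action tuples below are hypothetical; this is a sketch of the general pattern, not Kubernetes code or its API.

```python
def reconcile(desired, actual):
    """Return the actions needed to move actual replica counts to desired."""
    actions = []
    for app, want in sorted(desired.items()):
        have = actual.get(app, 0)
        if have < want:
            actions.append(("start", app, want - have))
        elif have > want:
            actions.append(("stop", app, have - want))
    # Anything running that is no longer declared gets stopped entirely.
    for app in sorted(set(actual) - set(desired)):
        actions.append(("stop", app, actual[app]))
    return actions


desired_state = {"web": 3, "worker": 2}
actual_state = {"web": 1, "worker": 2, "legacy": 1}
plan = reconcile(desired_state, actual_state)
```

Running the same step again once the actual state matches the desired state yields an empty plan, which is what makes the declarative approach safe to apply repeatedly.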
Rancher is an open source software platform that enables organizations to run containers in production. With Rancher, organizations no longer have to build a container services platform from scratch using a distinct set of open source technologies. Rancher supplies the entire software stack needed to manage containers in production.
Rancher takes in raw computing resources from any public or private cloud in the form of Linux hosts. Each Linux host can be a virtual machine or physical machine. Rancher does not expect more from each host than CPU, memory, local disk storage, and network connectivity. From Rancher’s perspective, a VM instance from a cloud provider and a bare metal server hosted at a colo facility are indistinguishable.
Julia is a free, open source, high-level, high-performance, dynamic programming language. While it is a general purpose language and can be used to write any application, many of its features are well-suited for high-performance numerical analysis and computational science.
Distinctive aspects of Julia's design include a type system with parametric polymorphism in a fully dynamic programming language, with multiple dispatch as its core programming paradigm. It allows concurrent, parallel (with or without the MPI package and/or the built-in "OpenMP-style" threads), and distributed computing, as well as direct calling of C and Fortran libraries without glue code. Julia uses a just-in-time compiler, referred to in the Julia community as "just-ahead-of-time".
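Multiple dispatch means that the implementation chosen for a call depends on the runtime types of all arguments, not just the first. Julia provides this natively; the sketch below only imitates the idea in Python with a hand-rolled registry, so `register`, `collide`, and the two classes are hypothetical names, and the lookup matches exact types only (no subtype resolution, unlike Julia).

```python
_registry = {}  # maps a tuple of argument types to an implementation


def register(*types):
    """Decorator that stores a function under the given argument types."""
    def decorator(func):
        _registry[types] = func
        return func
    return decorator


def collide(a, b):
    """Pick an implementation based on the runtime types of BOTH arguments."""
    impl = _registry.get((type(a), type(b)))
    if impl is None:
        raise TypeError(
            f"no method for ({type(a).__name__}, {type(b).__name__})"
        )
    return impl(a, b)


class Asteroid:
    pass


class Ship:
    pass


@register(Asteroid, Ship)
def _asteroid_ship(a, b):
    return "asteroid damages ship"


@register(Ship, Ship)
def _ship_ship(a, b):
    return "ships bounce apart"


first = collide(Asteroid(), Ship())
second = collide(Ship(), Ship())
```

Swapping the argument order to `collide(Ship(), Asteroid())` raises a `TypeError` here because no method was registered for that combination of types.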
KNIME Analytics Platform is the leading open solution for data-driven innovation, helping you discover the potential hidden in your data, mine for fresh insights, or predict new futures. The enterprise-grade, open source platform is fast to deploy, easy to scale, and intuitive to learn.
With more than 1500 modules, hundreds of ready-to-run examples, a comprehensive range of integrated tools, and the widest choice of advanced algorithms available, KNIME Analytics Platform is the perfect toolbox for any data scientist.