Client Inquiry:
We’re working on a new centralized data platform so that we can perform effective analytics with a quicker turnaround. We’d like to verify some aspects of a new architecture that we’re considering. Can you provide feedback?
Expert Takeaways:
IoT Streaming Data
For one of your use cases, your proposed architecture shows that you will be streaming Internet of Things (IoT) data from machinery on your factory floor for the purpose of predictive maintenance, and that you will be using Spark along with Kafka.
- Consider Apache Flink, which has better streaming capability than Spark, especially for stateful streaming. Flink processes events one at a time, whereas Spark's streaming engine works in micro-batches by default.
- Stateful streaming means that the “state” is shared between events, and, therefore, past events can influence the way current events are processed.
- You can integrate Flink with Kafka for stateful streaming, which improves processing speeds because the system can collapse thousands of records into a few records and generate alerts more quickly.
- Spark’s driver is a single point of failure: if the driver dies, processing stops with it. This isn’t optimal from an operational perspective. Kafka’s model is preferable: if a process dies, the partitions it was handling are moved to another process.
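The idea of stateful streaming described above can be illustrated with a minimal sketch in plain Python. It does not use the Flink or Kafka APIs; the `Reading` type, the `alerts` function, the window size, and the temperature limit are all hypothetical names and values chosen for illustration. The sketch keeps a small piece of state per machine, so past readings influence how the current reading is handled, and it collapses a long run of readings into a single alert.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Deque, Dict, Iterable, Iterator


@dataclass(frozen=True)
class Reading:
    machine_id: str      # hypothetical key: one state entry per machine
    temperature: float   # hypothetical sensor value


def alerts(readings: Iterable[Reading], window: int = 3, limit: float = 90.0) -> Iterator[str]:
    """Yield one alert each time a machine's rolling average rises above the limit."""
    # Keyed state: the last `window` temperatures seen for each machine.
    recent: Dict[str, Deque[float]] = defaultdict(lambda: deque(maxlen=window))
    # Keyed state: whether the machine is already in an alerting condition.
    alerting: Dict[str, bool] = defaultdict(bool)

    for reading in readings:
        history = recent[reading.machine_id]   # state carried over from past events
        history.append(reading.temperature)
        if len(history) < window:
            continue                            # not enough history yet
        hot = sum(history) / window > limit
        if hot and not alerting[reading.machine_id]:
            # Only the transition into the hot condition produces an alert,
            # so many hot readings collapse into a single record.
            yield f"{reading.machine_id}: rolling average above {limit}"
        alerting[reading.machine_id] = hot
```

Passing any iterable of `Reading` objects to `alerts` yields alert strings as they occur. In a real deployment the per-machine state would live in the streaming engine's managed state rather than in local dictionaries, so that it survives process failures.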
Optimally, use open-source tools, such as Apache Kafka, for IoT ingestion. Apache Pulsar is also popular.
AWS Kinesis and Apache Kafka
Your proposal shows both AWS Kinesis and Apache Kafka as real-time streaming processors, depending on the use case. These tools both act as middlemen between data-streaming sources and their intended data consumers.
- Generally, you would use one or the other, but not both.
- Both Kinesis and Kafka will cost you money, but Kafka is the better technology overall, especially now that it is available as a supported service on AWS.
- Check into the historical reason why you need Kinesis. There’s no operational reason to use Kinesis if you have the choice to use Kafka.
Sandbox Zone
Your proposal shows a sandbox zone for advanced analytics experimentation.
- Elasticsearch provides fast discovery for data scientists. Another candidate tool for the sandbox is Apache Druid.
- Druid has several benefits, including scalability and high concurrency. If you’re worried about slowdowns under concurrent load, consider Druid. In addition, it integrates with Kafka and handles real-time streaming data extremely well.
- Consult the data scientists before you configure the sandbox zone, not just about technology but also about how to expose the data in a way that’s most useful for them.
- A common issue with sandbox zones is that they’re not populated with enough data or the right data, which renders them essentially useless.
One question to ask yourself is how quickly you want to be able to act on an IoT streaming event, or, put differently, how many seconds of lag you can tolerate before that data is curated. With some technologies you are looking at 30 or 60 seconds, whereas with Druid you’re looking at milliseconds. This means that when data comes into the raw layer, it will be available sooner in the curated layer.
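One way to make the lag question concrete is to measure it. The sketch below is a rough illustration only, assuming each curated record carries both the time the machine emitted the event and the time the record became available in the curated layer. The `CuratedEvent` type, its field names, and the `over_budget` helper are hypothetical and not part of any specific tool.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class CuratedEvent:
    event_time: datetime     # assumed: when the machine emitted the reading
    curated_time: datetime   # assumed: when the record became available in the curated layer


def lag_seconds(event: CuratedEvent) -> float:
    """Lag between the event occurring and its availability in the curated layer."""
    return (event.curated_time - event.event_time).total_seconds()


def over_budget(events: Iterable[CuratedEvent], budget_seconds: float) -> List[CuratedEvent]:
    """Return the events whose lag exceeds the tolerated budget."""
    return [e for e in events if lag_seconds(e) > budget_seconds]
```

Running a check like this against a sample of real events, with the budget set to the lag you can tolerate, gives a quick indication of whether a candidate technology meets the requirement for a given use case.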
 “With the help of IIA Experts we were able to get multiple relevant perspectives on our issues quickly and efficiently without need for engaging expensive consultants.”
 Expert Network
IIA provides guided access to our network of over 150 analytics thought leaders, practitioners, executives, data scientists, and data engineers through curated, facilitated 1-on-1 interactions.
- Tailored support to address YOUR specific initiatives, projects and problems
- High-touch onboarding to curate 1-on-1 access to the most relevant experts
- On-demand inquiry support
- Plan validation and ongoing guidance to advance analytics priority outcomes
- Monthly roundtables facilitated by IIA experts on the latest analytics trends and developments
