Which Quality of Service (QoS) is Right for IIoT?

Quality of Service (QoS) is a general term to indicate the delivery contract from a sender to a receiver.  In some applications QoS talks about delivery time, reliability, latency or throughput.  In IIoT, QoS generally refers to the reliability of delivery.

Using MQTT as an example, there are three common Quality of Service levels for IIoT:

  • Level 0 – At most once.  Every message will be delivered on a best-effort basis, similar to UDP.  If the message is lost in transit for whatever reason, it is abandoned―the receiver never receives it, and the sender does not know that it was lost.
  • Level 1 – At least once.  Every message will be delivered to a receiver, though sometimes the same message will be delivered two or more times.  The receiver may be able to distinguish the duplicates, but perhaps not.  The sender is not aware that the receiver received multiple copies of the message.
  • Level 2 – Exactly once.  Every message will be delivered exactly once to the receiver, and the sender will be aware that it was received.

These QoS levels actually miss something important that comes up a lot in industrial systems, but let’s look at these three quickly.

First, QoS level 0 is simply unacceptable.  It is fine to lose a frame of a video once in a while, but not fine to lose a control signal that safely shuts down a stamping machine.  If the sender is transmitting data more quickly than the receiver can handle it, there will come a point where in-flight messages will fill the available queue positions, and new messages will be lost.  Essentially, QoS 0 will favor old messages over new ones.  In IIoT, this is a fatal flaw.  There’s no reason to discuss QoS 0 further.

QoS level 1 seems pretty reasonable at first glance.  Message duplication is not a problem in most cases, and where there is an issue the duplicates can be identified by the receiver and eliminated, assuming the client maintains enough history to be able to identify them.

However, problems arise when the sender is transmitting data more quickly than the receiver can process it.  Since there is a delivery guarantee at QoS 1, the sender must be able to queue an infinite number of packets waiting for an opportunity to deliver them.  Longer queues mean longer latencies.  For example, if I turn a light on and off three times, and the delivery latency is 5 seconds simply due to the queue volume, then it will take 30 seconds for the receiver to see that the light has settled into its final state.  In the meantime the client will be acting on false information.  In the case of a light, this may not matter much (unless it is a visual alarm), but in industrial systems timeliness matters.  The problem becomes even more severe if the client is aggregating data from multiple sources.  If some sources are delayed by seconds or minutes relative to other, then the client will be performing logic on data that are not only inconsistent with reality but also with each other.

Ultimately, QoS 1 cannot be used where any client could produce data faster than the slowest leg of the communication path can handle.  Beyond a certain data rate, the system will effectively “fall off a cliff” and become unusable.  I’ve personally seen this exact thing happen in a municipal waste treatment facility.  It wasn’t pretty.  The solution was to completely replace the communication mechanism.

QoS level 2 is similar to QoS 1, but more severe.  QoS 2 is designed for transactional systems, where every message matters, and duplication is equivalent to failure.  For example, a system that manages invoices and payments would not want to record a payment twice or emit multiple invoices for a single sale.  In that case, latency matters far less than guaranteed unique delivery.

Since QoS level 2 requires more communication to provide its guarantee, it requires more time to deliver each message.  It will exhibit the same problems under load as QoS level 1, but at a lower data rate.  That is, the maximum sustained data rate for QoS 2 will be lower than for QoS 1.  The “cliff” just happens sooner.

QoS Levels 1 and 2 Don’t Propagate

Both QoS level 1 and level 2 suffer from another big flaw – they don’t propagate.  Consider a trivial system where two clients, A and B, are connected to a single broker.  The goal is to ensure that B receives every message that A transmits, meaning that QoS 1 or 2 should apply between A and B.  Looking at QoS 1, A would send a message and wait for a delivery confirmation.  The broker would need to transmit the message to B before sending the confirmation to A.  That would imply that the broker knows that A needs to wait for B to respond.  Two problems arise: first, A cannot know that B is even connected to the broker.  That is a fundamental property of a one-to-many broker like MQTT.  Second, the broker cannot know that the intention of the system is to provide reliable communication between A and B.  Even if the broker were somehow programmed to wait like that, how would it deal with a third client, C, also listening for that message.  Would it wait for delivery on all clients?  What would it do about clients that are temporarily disconnected?  The answer is that it cannot.  If the intention of the system is to offer QoS 1 or 2 among clients then that QoS promise cannot be kept.

Some brokers have a server-to-server, or daisy-chain, mechanism that allows brokers to transfer messages to each other.  This allows clients on different brokers to intercommunicate.  In this configuration the QoS promise cannot be maintained beyond the connection between the original sender and the first broker in the chain.

Guaranteed Consistency

None of these QoS levels is really right for IIoT.  We need something else, and that is guaranteed consistency.  In a typical industrial system there are analog data points that move continuously, like flows, temperatures and levels.  A client application would like to see as much detail as it can, but most critical is the current value of these points.  If it misses a value that is already superseded by a new measurement, that is not generally a problem.  However, the client cannot accept missing the most recent value for a point.  For example, if I flick a light on and off 3 times, the client does not need to know how many times I did it, but it absolutely must know that the switch ended in the off position.  The communication path needs to guarantee that the final “off” message gets through, even if some intermediate states are lost.  This is the critical insight in IIoT.  The client is mainly interested in the current state of the system, not in every transient state that led up to it.

Guaranteed consistency for QoS is actually slightly more complex than that.  There are really three critical aspects that are too often ignored:

  1. Message queues must be managed for each data point and client. When communication is slow, old messages must be dropped from the queue in favor of new messages to avoid ever-lengthening latencies.  This queuing must occur on a per-point, per-client basis.  Only messages that are superseded for a specific point destined for a specific client can be dropped.  If we drop messages blindly then we risk dropping the last message in a sequence, as in the final switch status above.
  2. Event order must be preserved.  When a new value for a point enters the queue, it goes to the back of the queue even if it supersedes a message near the front of the queue.  If we don’t do this, the client could see the light turn on before the switch is thrown.  Ultimately the client gets a consistent view of the data, but for a short time it may have been inconsistent.
  3. The client must be notified when a value is no longer current.  For the client to trust its data, it must know when data consistency is no longer being maintained.  If a data source is disconnected for any reason, its data will no longer be updated in the client.  The physical world will move on, and the client will not be informed.  Although the data delivery mechanism cannot stop hardware from breaking, it can guarantee that the client knows that something is broken.  The client must be informed, on a per-point basis, whether the point is currently active and current or inaccessible and thus invalid.  In the industrial world this is commonly done using data quality, a per-point indication of the trustworthiness of each data value.

For those instances where it is critical to see every change in a process (that is, where QoS 1 or 2 is required), that critical information should be handled as close as possible to the data source, whether it’s a PLC or an embedded device.  That is, time-critical and event-critical information should be processed at its source, not transmitted via the network to a remote system for processing where that transmission could introduce latency or drop intermediate values. We will discuss this more when we talk about edge processing.

For the IIoT, the beauty of guaranteed consistency for QoS is that it can respond to changes in network conditions without slowing down, backing up or invalidating the client’s view of the system state.  It has a bounded queue size and is thus suitable for resilient embedded systems.  This quality of service can propagate through any number of intermediate brokers and still maintain its guarantee, as well as notify the client when any link in the chain is broken.

So there’s the answer.  For IIoT, you definitely don’t want QoS 0, and probably cannot accept the limitations and failure modes of QoS 1 or 2.  You want something altogether different—guaranteed consistency.

 

– Andrew Thomas, Skkynet