The Complete IoT Architecture Audit Checklist (2025 Edition)

I created a structured IoT Architecture Audit Checklist. It captures the core principles of reliability, security, and observability, and provides a consistent process for evaluating device -> gateway -> cloud pipelines.

IoT systems rarely fail because of one big issue. More often, they fail because of dozens of small reliability, security, and observability gaps that accumulate over time. These blind spots only become visible once devices are deployed in the real world - when outages, missing telemetry, or performance bottlenecks start affecting customers.

Over the years, working with IoT architectures across telecom data pipelines, secure communications systems, and IoT gateways, I’ve seen the same failure patterns repeat again and again. A minor typo in a script, a fragile retry strategy, a missing metric, or an incorrect gateway setting can cascade into major incidents: broken dashboards, SLA violations, regulatory exposure, or long nights on call fixing issues under pressure.

To avoid repeating these mistakes - both for myself and the teams I work with - I created a structured IoT Architecture Audit Checklist. It captures the core principles of reliability, security, and observability, and provides a consistent process for evaluating device -> gateway -> cloud pipelines. This is the same approach I use when auditing production-grade IoT systems, helping companies catch issues early, improve resilience, and prevent costly failures before they happen.

Common IoT Architecture Failures (What teams miss 90% of the time)

Below are the high-risk areas that many IoT teams overlook - not because they don’t care, but because these problems hide across layers and only show up under real-world load.

1. Device connection stability & certificate storage
Checks:
• How does the device authenticate (mTLS, token, static key)?
• Where is the certificate/key stored (flash, secure element)?
• Can certificates be rotated?
• What happens on expiration?
• Is clock drift handled?
• Is reconnect logic stable under poor connectivity?
• Is there exponential backoff?
• Can devices overwhelm the broker during network instability?
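
As a rough illustration of the backoff and reconnect checks above, here is a minimal Python sketch of capped exponential backoff with full jitter. The function name `backoff_delay` and the defaults (1 s base, 300 s cap) are my own assumptions for illustration, not values taken from any specific firmware.

```python
import random


def backoff_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Return a reconnect delay in seconds for the given attempt number (0-based)."""
    # Clamp the exponent so very long outages cannot overflow the float math.
    exponent = min(attempt, 32)
    # Exponential growth, capped so the delay never exceeds `cap`.
    ceiling = min(cap, base * (2 ** exponent))
    # Full jitter: pick uniformly in [0, ceiling] so a fleet of devices
    # does not reconnect in lockstep after a broker outage.
    return rng.uniform(0.0, ceiling)
```

With these defaults the ceiling grows 1 s, 2 s, 4 s and so on up to 300 s, and each device picks a random point below it. A device that reconnects on a fixed interval fails this check, because every device in the fleet hits the broker at the same moment.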

2. Broker topic misuse (wildcards, shared topics, poor partitioning)
Checks:
• Are topic patterns too broad?
• Do devices publish to shared or global topics?
• Are unsubscribe/wildcard patterns safe?
• Does the gateway subscribe to excessive hierarchical paths?
• Are ACLs applying least privilege?
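
To make the "too broad" question testable, the sketch below matches MQTT topic filters (`+` for one level, `#` for the remainder) and flags an ACL that lets one device reach another device's topic. The helper names and the `devices/<id>/...` layout are hypothetical, and the special handling of `$`-prefixed topics such as `$SYS` is deliberately left out.

```python
def topic_matches(topic_filter, topic):
    """Return True if an MQTT topic filter matches a concrete topic name."""
    f_levels = topic_filter.split("/")
    t_levels = topic.split("/")
    for i, f in enumerate(f_levels):
        if f == "#":
            # Multi-level wildcard: only valid as the last level; it matches
            # the parent level and everything below it.
            return i == len(f_levels) - 1
        if i >= len(t_levels):
            return False
        if f != "+" and f != t_levels[i]:
            return False
    # No '#' seen: the filter must cover every level of the topic.
    return len(f_levels) == len(t_levels)


def leaks_across_devices(acl_filters, other_device_topic):
    """Return True if any filter granted to one device matches another device's topic."""
    return any(topic_matches(f, other_device_topic) for f in acl_filters)
```

For example, an ACL of `devices/#` granted to device d1 also matches `devices/d2/telemetry`, while `devices/d1/#` does not. Running every ACL entry against a topic from a different device is a cheap way to spot least-privilege violations.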

3. Gateway persistence durability (SQLite/WAL/fsync)
Checks:
• Is SQLite in WAL mode?
• Is fsync enabled or disabled?
• Are writes batched or committed individually?
• What happens on power loss?
• Does the gateway corrupt or rotate WAL incorrectly?
• Is the queue depth monitored?
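
The durability checks above can be sketched with Python's built-in `sqlite3` module. The `outbox` table and the function names are assumptions for illustration; a real gateway will have its own schema. The settings shown are one reasonable durable configuration, not the only valid one.

```python
import sqlite3


def open_queue(path):
    """Open a file-backed queue database with WAL and durable commits."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a writer appends; the pragma returns the active mode.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    # synchronous=FULL syncs the WAL on every commit. NORMAL is faster in WAL mode,
    # but the most recent commits may be lost after a power cut.
    conn.execute("PRAGMA synchronous=FULL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outbox (id INTEGER PRIMARY KEY, payload BLOB NOT NULL)"
    )
    conn.commit()
    return conn, mode


def enqueue_batch(conn, payloads):
    """Insert a batch in one transaction: one commit and sync instead of one per message."""
    with conn:
        conn.executemany("INSERT INTO outbox (payload) VALUES (?)", [(p,) for p in payloads])


def queue_depth(conn):
    """Number of messages waiting for upload; worth exporting as a metric."""
    return conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
```

During an audit I would compare the returned `mode` and the value of `PRAGMA synchronous` against what the team believes is configured, since defaults differ between builds and wrappers.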

4. Retry storms & backpressure issues
Checks:
• Does device -> broker retry use jitter?
• Does gateway -> cloud retry use exponential backoff?
• Is there a retry cap?
• Does the gateway drop or queue messages when full?
• Is there a fallback / degraded mode?
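
For the "drop or queue when full" question, the important part is that the behavior is explicit and counted. Below is a minimal sketch of a bounded buffer with a drop-oldest policy. The class name and the choice of drop-oldest over drop-newest are assumptions; which policy is right depends on whether fresh or historical telemetry matters more.

```python
from collections import deque


class BoundedBuffer:
    """In-memory buffer that drops the oldest item when full and counts every drop."""

    def __init__(self, capacity):
        self._items = deque()
        self._capacity = capacity
        self.dropped_total = 0  # export this as a metric so drops are never silent

    def push(self, item):
        if len(self._items) >= self._capacity:
            # Buffer is full: evict the oldest entry to make room for the new one.
            self._items.popleft()
            self.dropped_total += 1
        self._items.append(item)

    def pop(self):
        """Return the oldest buffered item, or None if the buffer is empty."""
        return self._items.popleft() if self._items else None

    def __len__(self):
        return len(self._items)
```

An unbounded queue passes functional tests and then exhausts memory during a long cloud outage, which is why I look for a hard capacity and a drop counter together.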

5. Cloud ingestion deduplication
Checks:
• Is ingestion idempotent?
• Does the gateway generate UUIDs?
• Does the cloud store duplicates?
• Are messages hashed?
• How does the cloud return success/failure?
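
A minimal sketch of idempotent ingestion, assuming the gateway assigns a UUID once per message and reuses it on every retry. The in-memory dictionary stands in for whatever store the cloud side actually uses, and the three return values are hypothetical names for the outcomes the ingestion API should distinguish.

```python
import hashlib
import uuid


def new_message_id():
    """Generate the ID once at the gateway; retries must reuse it, not regenerate it."""
    return str(uuid.uuid4())


class IngestStore:
    """Toy ingestion endpoint that deduplicates on message ID and payload hash."""

    def __init__(self):
        self._seen = {}  # message_id -> SHA-256 hex digest of the stored payload

    def ingest(self, message_id, payload):
        digest = hashlib.sha256(payload).hexdigest()
        previous = self._seen.get(message_id)
        if previous is None:
            self._seen[message_id] = digest
            return "accepted"
        if previous == digest:
            # Same ID and same content: a retry. Acknowledge it without storing again.
            return "duplicate"
        # Same ID but different content: a bug upstream, surface it instead of overwriting.
        return "conflict"
```

The key design point is that a duplicate is reported as success to the gateway, so the gateway can safely delete the message from its queue after any acknowledged retry.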

6. Metrics missing -> blind operation
Checks:
• Queue depth visibility?
• Upload latency visibility?
• Retry counters?
• Reject counters?
• Device battery stats?
• Reconnection counts?
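
To show how little code these signals need, here is a standard-library-only stand-in for a metrics registry. In practice a metrics client library would do this job; the class and the metric names below are illustrative assumptions.

```python
from collections import Counter


class Metrics:
    """Tiny registry: counters only go up, gauges hold the latest value."""

    def __init__(self):
        self.counters = Counter()
        self.gauges = {}

    def inc(self, name, amount=1):
        self.counters[name] += amount

    def set_gauge(self, name, value):
        self.gauges[name] = value

    def render(self):
        """Render all metrics as 'name value' lines, sorted by name."""
        merged = {**self.counters, **self.gauges}
        return "".join(f"{name} {value}\n" for name, value in sorted(merged.items()))
```

A usage example: call `inc("retries_total")` in the retry path and `set_gauge("queue_depth", n)` after each enqueue, then expose `render()` on an internal endpoint. If a gateway cannot answer "how deep is the queue right now?", it fails this part of the audit.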

7. Secrets management
Checks:
• config.yaml or .env committed to repo?
• Credentials stored in plaintext?
• Cloud API keys exposed?
• Certificates stored unencrypted?
• Private keys readable on the gateway?
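
The "private keys readable" check can be partly automated on POSIX gateways by inspecting file mode bits. This sketch assumes keys live in regular files; the function name is hypothetical, and it says nothing about keys held in a secure element or encrypted at rest.

```python
import os
import stat


def key_file_problems(path):
    """Return a list of permission problems for a private key file (POSIX mode bits)."""
    mode = os.stat(path).st_mode
    problems = []
    # Any read bit beyond the owner means other local users can copy the key.
    if mode & (stat.S_IRGRP | stat.S_IROTH):
        problems.append("readable by group or others")
    # Any write bit beyond the owner means the key can be swapped out.
    if mode & (stat.S_IWGRP | stat.S_IWOTH):
        problems.append("writable by group or others")
    return problems
```

A key file with mode 0600 returns an empty list, while 0644 is reported as readable by group or others.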

8. Firmware update security
Checks:
• Do devices verify firmware signatures?
• Does the gateway manage OTA or does the cloud?
• Is rollback supported?
• Are OTA servers authenticated?
• Is there firmware version pinning?
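
A partial sketch of what a device-side update gate might look like. It covers only two pieces: checking the image against a SHA-256 digest from a manifest, and refusing downgrades. It assumes the manifest itself has already been signature-verified with an asymmetric key, which is not shown here, and the manifest field names are hypothetical.

```python
import hashlib
import hmac


def parse_version(version):
    """Turn '1.10.0' into (1, 10, 0) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in version.split("."))


def accept_update(image, manifest, installed_version):
    """Accept an image only if its digest matches the manifest and it is a newer version."""
    digest = hashlib.sha256(image).hexdigest()
    # Constant-time comparison of the computed and expected digests.
    if not hmac.compare_digest(digest, manifest["sha256"]):
        return False
    # Downgrade guard: reject anything that is not strictly newer than what is installed.
    return parse_version(manifest["version"]) > parse_version(installed_version)
```

Note that the downgrade guard is a separate concern from rollback support: a device still needs a way to return to its previous known-good image when a new one fails to boot.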

The Full Checklist (Broken Down by Layers)

Device Layer

Checks:
• Protocols (MQTT, CoAP, custom)?
• QoS levels correct?
• Are device IDs deterministic and stable?
• Payload schemas versioned?
• Certificates provisioned securely?
• Memory pressure on sensor loops?
• Retry logic under poor connectivity?
• Hard-coded secrets anywhere?

Broker Layer

Checks:
• Topic hierarchy (least privilege)?
• Shared topics usage?
• Wildcards used safely?
• Gateway subscribing too broadly?
• ACL enforcement correct?
• Retained messages misused?
• Connection logs available?

Gateway Layer

1. Reliability / Durability
• WAL/queue durability (fsync enabled)?
• Crash-safety behavior
• Backpressure mechanisms
• Batching implementation

2. Idempotency
• Gateway-generated UUIDs?
• Device IDs trusted?
• Duplicate prevention on reconnect?

3. Retry Strategy
• Exponential backoff
• Jitter
• Retry storms avoided

4. Security
• mTLS enabled?
• Root CA pinned?
• Certificate rotation supported?
• Secrets stored securely?

5. Observability
Scan for:
• publish_failures_total?
• db_write_latency_seconds?
• queue_depth?
• reconnect_count?
• cloud_upload_latency?
• OpenTelemetry spans tagged with device_id?

Cloud Layer

Checks:
• Ingestion API (HTTP, gRPC, MQTT bridge)
• Auth strategy (IAM, Cognito, custom token)
• Deduplication model
• Hot vs cold storage
• Retention & backup policies
• Observability coverage (ingest rate, error rate, backlog)
• Failure handling (rate limits, slow consumers)

UI / Analytics Layer

Checks:
• Dashboards show SLIs (ingest rate, latency, error rate)?
• Are graphs correlated across device -> gateway -> cloud to trace issues end-to-end?
• Is latency visible at each stage of the pipeline?
• Any error budget or performance thresholds defined and monitored?
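
One way to make the error-budget check concrete is a small calculation over a reporting window. The function below is a sketch under simple assumptions: the SLO is a success ratio over message counts, and the name `error_budget_remaining` is my own.

```python
def error_budget_remaining(slo_target, total, failed):
    """Return the fraction of the error budget left (1.0 = untouched, below 0 = overspent)."""
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures == 0:
        # A 100% target (or an empty window) tolerates no failures at all.
        return 1.0 if failed == 0 else 0.0
    return 1.0 - failed / allowed_failures
```

As a worked example, a 99.9% ingest SLO over 1,000,000 messages tolerates 1,000 failures. With 250 failed messages, about 75% of the budget remains; with 2,000 failures the result is negative, meaning the budget is overspent.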

Code-Level Review

Checks:
• MQTT client setup
• Database writer task
• Cloud ingestion uploader
• Shutdown / cancellation handlers
• Error-handling paths
• Backpressure logic
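
When reviewing shutdown handlers, the main thing I look for is whether buffered messages are drained before the process exits. The sketch below shows one possible shape using `asyncio`: a stop event plus a loop that only exits once the queue is empty. The `uploader` and `send` names and the 50 ms poll interval are assumptions for illustration.

```python
import asyncio


async def uploader(queue, stop, send):
    """Forward queued items until a stop is requested and the queue is fully drained."""
    while not (stop.is_set() and queue.empty()):
        try:
            # Short timeout so the loop re-checks the stop flag while idle.
            item = await asyncio.wait_for(queue.get(), timeout=0.05)
        except asyncio.TimeoutError:
            continue
        await send(item)


async def demo():
    """Queue five items, request shutdown immediately, and confirm nothing is lost."""
    queue, stop, sent = asyncio.Queue(), asyncio.Event(), []

    async def send(item):
        sent.append(item)  # stand-in for the real cloud upload call

    task = asyncio.create_task(uploader(queue, stop, send))
    for i in range(5):
        queue.put_nowait(i)
    stop.set()  # e.g. triggered from a SIGTERM handler
    await asyncio.wait_for(task, timeout=2)
    return sent
```

A handler that cancels the upload task outright, without a drain step or a persisted queue behind it, silently loses whatever was in memory at shutdown.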

If you want a second pair of eyes on your architecture, the IoT Infrastructure Audit gives you a full reliability, security and observability review.

Check out our IoT Audit services.

When to Perform an IoT Architecture Audit

There are many situations where an IoT Architecture Audit provides clarity and prevents costly failures. Some teams run audits periodically to maintain system health and compliance. Others perform them before key business or engineering milestones.

Common triggers:

  • Scaling from prototype to production.
  • Outages or system instability.
  • Missing or inconsistent telemetry.
  • Migrating from test to production.
  • Major customer rollout.
  • Upcoming compliance requirements (NIS2, CRA, ISO27001).
  • Legacy codebases becoming hard to reason about.
  • Multiple vendors or teams involved.

An audit provides a structured end-to-end view across the device -> gateway -> cloud pipeline, allowing teams to make informed decisions before issues become critical.

Why Most Teams Don’t Catch These Issues Internally

Teams working on IoT systems typically operate close to the code and the day-to-day delivery pressure. Over time, familiarity with the stack creates blind spots. Legacy code accumulates, assumptions remain undocumented, and “temporary fixes” become permanent parts of the architecture. When key people leave, undocumented decisions and forgotten constraints make the system harder to reason about.

On top of this, IoT systems span multiple layers — devices, firmware, MQTT brokers, gateways, cloud ingestion, storage, and observability. No single engineer owns the entire pipeline, so problems that cross boundaries often go unnoticed. Issues that look small in isolation can cascade into outages or data gaps when the system is under real-world load.

Bringing in an external, fresh set of eyes helps teams see the architecture from a different angle. An independent audit surfaces risks and reveals hidden dependencies that internal teams simply don’t have the bandwidth or distance to spot.

Case Study - Audit on Secure Edge IoT Gateway

In a recent audit of a secure edge IoT gateway built in Rust, I identified high-impact issues including missing certificate rotation, incomplete MQTT ACL enforcement, and limited observability due to missing tracing. These issues weren’t visible from tests but were affecting production reliability.

Read the full case study -> https://combotto.io/references/secure-edge-iot-gateway-audit

If you don’t want to spend weeks doing this checklist manually — or you want a second opinion — I can perform the complete audit for you.

Book a free discovery call here.

20-minute discovery call. No obligation.