Engineering Robust Hardware for Last-Mile Deployment

By Nithya Menon
October 12, 2022
Reliability is not something that’s ever “done.” But now we have systems in place to monitor, test, and improve as we deploy hardware in the most remote areas of the world.

When your technology is deployed in some of the hardest-to-reach parts of the world and provides an essential service to extremely resource-constrained people, product reliability is fundamental to the mission. Here’s a deep dive into our journey to deliver on this mission.

Samraong village

A bamboo house connected with an Okra Pod, in Samraong Village, Cambodia.

The Edamame Pod is built on numerous lessons from our first device (known as Pineapple), most notably the rising demand from households and energy companies for higher-power systems to run productive appliances. We worked as quickly as possible to get an MVP into the field, and three substantial generations of the Pod later, we hit a 1% annual failure rate. But getting here was no simple task.

V1 – High Power Failures

When the V1 Pod got into the field, we ran into a series of failures in which our 5V regulator failed on 24V batteries. We mitigated this by limiting deployments to 12V systems, but this was disappointing.

24V systems leverage the Pod’s full 1.2 kW power output, and we had customers eager to experiment with higher-power systems. The Pod hit a ~27% annual failure rate, and we raced to finalise the V2 Pod. Learning from the issues in V1, the hardware and QA teams levelled up their stress testing and addressed all the vulnerabilities V1 had shown at sustained high power usage.

V2 – Stressful Transients

Unfortunately, although Edamame V2 passed high-power tests in the lab, high-power appliances are usually paired with inverters, and the inverters introduced new transient stress conditions. Our grid and panel switch-mode power supplies started failing at even higher rates than in V1. At its worst, this Pod hit a 63% annual failure rate. As an early-stage startup, we were determined to absorb this pain for our customers by providing spares (free of charge) and avoiding the types of installation conditions that seemed most failure-prone. Internally, we embarked on yet another debugging mission.

The hunt to find the root cause of these V2 failures was painstaking and ultimately revealed that two lines of copper on the circuit board were a few millimeters too close together, causing transient feedback loops to spin out of control.

A lesson for all of us in hardware development: no change is too small to cause catastrophic effects.

In parallel, our firmware team hunted for ways to minimise the frequency of this failure mode. Eventually they managed to remotely push software updates that safely shut down the susceptible hardware in less than 200 µs in the event of one of these failure transients, preventing the dangerous, uncontrolled feedback loops from damaging the hardware. These updates brought the V2 failure rate down to < 5% per year.
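To make the idea of a fast protective shutdown concrete, here is a minimal sketch of a latched trip in Python. It is not Okra's firmware: the monitored signal, the threshold, the sample period, and the `LatchedTrip` class are all hypothetical, and only the 200 µs budget comes from the text. The point it illustrates is that the worst-case reaction time is roughly one sample period plus the time needed to act, and that this sum has to fit inside the budget.

```python
SHUTDOWN_BUDGET_US = 200.0  # shutdown time quoted in the text


class LatchedTrip:
    """Hypothetical protection check: once tripped, it stays tripped."""

    def __init__(self, threshold: float, sample_period_us: float) -> None:
        self.threshold = threshold                # assumed trip level for the signal
        self.sample_period_us = sample_period_us  # assumed time between samples
        self.tripped = False

    def update(self, value: float) -> bool:
        """Feed one sample; return True while the hardware may stay enabled."""
        if value > self.threshold:
            self.tripped = True  # latch: later normal samples do not re-enable
        return not self.tripped

    def worst_case_latency_us(self, action_time_us: float) -> float:
        """A transient starting just after a sample is seen one period later,
        and the shutdown action itself then takes action_time_us."""
        return self.sample_period_us + action_time_us


trip = LatchedTrip(threshold=10.0, sample_period_us=50.0)
enabled = [trip.update(v) for v in [1.0, 2.0, 12.0, 3.0]]
print(enabled)                                       # [True, True, False, False]
print(trip.worst_case_latency_us(20.0) < SHUTDOWN_BUDGET_US)  # True
```

The latch matters because a transient can fall back below the threshold within a sample or two; without it, the hardware would be re-enabled straight into the same unstable condition.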

V3 – Testing, Automation, and Stability

Thankfully, not only did the process of hunting down the instability in V2 result in a robust and reliable V3 product, but it also led to serious advancements in our testing infrastructure.

We built automated, accelerated life-cycle test machines that continuously trigger all known failure modes in a controlled environment. With these machines, we vetted the V3 improvements even before going to production. Where V2s failed after 20-30 repeated transient events, the V3 design withstood 12,000+ events over the course of 40 hours without any problems, and we have yet to see any of these failures in the field. For context, these transient events occur at most a couple of times a day in field conditions, so this level of stress testing gave us a lot of confidence. V3 also went through thermal testing, including creative thermal chambers built with toaster ovens at home during Covid.
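The stress-test figures above can be turned into a rough field-equivalent exposure. The sketch below uses only the numbers quoted in the text, plus one assumption: "a couple of times a day" is read as an upper bound of 2 events per day. The function names are illustrative.

```python
EVENTS_SURVIVED = 12_000   # transient events V3 withstood on the test machine
TEST_HOURS = 40            # duration of that test run
FIELD_EVENTS_PER_DAY = 2   # assumption: upper bound for "a couple of times a day"


def seconds_per_event(events: int, hours: float) -> float:
    """Average spacing between triggered events during the test."""
    return hours * 3600 / events


def equivalent_field_years(events: int, events_per_day: float) -> float:
    """Years of field operation that would produce the same number of events."""
    return events / events_per_day / 365


def acceleration_factor(events: int, hours: float, events_per_day: float) -> float:
    """How many times faster the test accumulates events than the field does."""
    test_events_per_day = events * 24 / hours
    return test_events_per_day / events_per_day


print(seconds_per_event(EVENTS_SURVIVED, TEST_HOURS))                      # 12.0
print(round(equivalent_field_years(EVENTS_SURVIVED, FIELD_EVENTS_PER_DAY), 1))  # 16.4
print(acceleration_factor(EVENTS_SURVIVED, TEST_HOURS, FIELD_EVENTS_PER_DAY))   # 3600.0
```

Under that assumption, the 40-hour run corresponds to roughly 16 years of transient events in the field, compressed by a factor of about 3,600. This equivalence covers only this one failure mode; it says nothing about ageing effects that depend on calendar time rather than event count.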

Our hardware testing journey.

Going from >60% failure rates to 1% took a massive effort from all our engineers.

On the software side, we built features that give us real-time data on failures and field removals, all visible in a reliability dashboard that numerous teams track closely. This dashboard helps us identify trends in failures across different hardware versions and calculates our failure rates on an ongoing basis.
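The article does not say how the dashboard defines an annual failure rate. One common definition, assumed here purely for illustration, is failures divided by device-years of exposure, computed per hardware version. The `Unit` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Unit:
    """One deployed device (hypothetical record layout)."""
    hardware_version: str
    installed: date
    failed: Optional[date] = None  # None means the unit is still in service


def annual_failure_rate(units: list[Unit], as_of: date) -> dict[str, float]:
    """Failures per device-year of exposure, grouped by hardware version."""
    failures: dict[str, int] = {}
    exposure_days: dict[str, int] = {}
    for unit in units:
        # A failed unit stops accumulating exposure on its failure date.
        end = unit.failed if unit.failed is not None else as_of
        version = unit.hardware_version
        exposure_days[version] = exposure_days.get(version, 0) + (end - unit.installed).days
        failures[version] = failures.get(version, 0) + (1 if unit.failed is not None else 0)
    # Versions with zero exposure are skipped to avoid dividing by zero.
    return {
        version: failures[version] / (days / 365)
        for version, days in exposure_days.items()
        if days > 0
    }


fleet = [
    Unit("V2", date(2021, 1, 1), failed=date(2022, 1, 1)),
    Unit("V3", date(2021, 1, 1)),
]
print(annual_failure_rate(fleet, as_of=date(2022, 1, 1)))  # {'V2': 1.0, 'V3': 0.0}
```

Dividing by exposure rather than by the number of installed units keeps the figure comparable between versions that have been in the field for very different lengths of time.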

Our reliability dashboard with key metrics and alerts to keep everyone vigilant.

On the quality assurance/control side, we built remotely controlled staging systems that automatically cycle loads and thoroughly stress systems all day and night.

This staging village helps us stress test pods, appliances, batteries, and all other equipment we sell, while also providing an environment to monitor new features or experimental ideas.

The setup includes 8 interconnected Pods, covering all hardware iterations supported in the field, almost 5 kWp of solar installed on the roof, and a complex array of relays that cycle vacuums, hair dryers, heaters, blenders, fans, and LEDs between the Pods.

The Testing Village

In addition to our staging infrastructure, one of our most crucial test-engineering projects is our automated test rig, which is equipped to test every component in our hardware and every key algorithm in our firmware across the full power range.

This rig is used both during manufacturing (two of these rigs live on the manufacturing floor for end-of-line testing) and by our team to automate a comprehensive set of regression tests, giving us confidence in each firmware release and each update to the product design.

And ultimately we rely heavily on all this infrastructure and knowledge to continuously improve our existing products as well as all new prototypes and products.

Reliability is not something that’s ever “done.” We’re proud of the progress so far, but more importantly, we’re equipped to keep improving. Our short-term goal is a < 1% annual failure rate, but as we scale to hundreds of thousands of households, our target will get even lower. Okra’s engineering mission lies at the intersection of positive social impact and cutting-edge technology, and our engineering teams treat reliability, bug-crushing, and low failure rates as core values. Delivering quality products to households and families depending on our technology for energy access is not something we take lightly.

If you’re in the same boat and want to collaborate on strategies to improve reliability, feel free to reach out! And if you’re the next bug-squasher who wants to join this journey with us, keep an eye on the careers page 🙂

Nithya Menon is an engineering graduate from Harvey Mudd College who has spent her career developing technology targeted towards empowering marginalized and developing communities worldwide. She has been pivotal in designing Okra's key power-sharing algorithms, IoT firmware, and grid management software - and now drives the direction and strategy of Okra's technology as Product Development Lead.

#PowerToThePeople