In the context of a microservice architecture, a message-driven, asynchronous, event-based design seems to be gaining popularity (see here and here for some examples, as well as the Reactive Manifesto - Message Driven trait) as opposed to a synchronous (possibly REST-based) mechanism.
Taking that context and imagining an overly simplified ordering system, as depicted below:
and the following message flow:
- Order is placed from some source (web/mobile etc.)
- Order service accepts order and publishes a CreateOrderEvent
- The InventoryService reacts on the CreateOrderEvent, does some inventory stuff and publishes an InventoryUpdatedEvent when it's done
- The Invoice service then reacts to the InventoryUpdatedEvent, sends an invoice and publishes an EmailInvoiceEvent
All services are up and we happily process orders... Everyone is happy. Then, the Inventory service goes down for some reason.
Assume that the events on the event bus flow in a "non-blocking" manner, i.e. the messages are published to a central topic and do not pile up on a queue if no service is reading from it (what I'm trying to convey is an event bus where an event, once published, flows "straight through" and does not queue up; ignore what messaging platform/technology is used at this point). That would mean that if the Inventory service were down for 5 minutes, the CreateOrderEvents passing through the event bus during that time are now "gone", i.e. never seen by the Inventory service, because in our overly simplified system no other system is interested in those events.
My question then is: How does the Inventory service (and the system as a whole) restore state in a way that no orders are missed/not processed?

I am going to give an architect's answer rather than drill down into details. I hope you don't mind.
The first suggestion is to decouple all of the concepts: events, messages, bus and/or queue, and async. This opens up possibilities, provided you have not already decided on the software you are using to implement your bus.
From an architecture standpoint, if you require a "must deliver" type of scenario, you will need to persist the messages when the service fails. Yes, you will likely need some sort of cleanup in the system, as stuff happens, but focus on the guaranteed delivery problem first. I see two basic options offhand which can be expanded on (there are likely more, but these are sufficient to start thinking about the problem).
Just because the system is asynchronous and event based does not mean you cannot implement some type of guaranteed delivery. A queue is one option (you seem to discard this idea?), but a bus that persists on failure and retries once subscribers are up again is another. And you can persist without blocking.
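To make the "bus that persists and retries" option a bit more concrete, here is a minimal in-memory sketch. It is not tied to any real messaging product: the class `PersistingBus`, its method names and the subscriber name `"inventory"` are all made up for illustration, and the Python list stands in for whatever durable storage a real bus would use. The idea is that every event is appended to a log, each subscriber has its own read position, and a subscriber that comes back up is simply handed whatever it missed.

```python
from typing import Callable, Dict, List, Set

Handler = Callable[[dict], None]


class PersistingBus:
    """Toy bus: events go into a log, each subscriber tracks its own offset."""

    def __init__(self) -> None:
        self.log: List[dict] = []               # stands in for durable storage
        self.offsets: Dict[str, int] = {}       # next log index per subscriber
        self.handlers: Dict[str, Handler] = {}  # only subscribers that are up
        self._draining: Set[str] = set()        # guards against re-entrant delivery

    def register(self, name: str) -> None:
        # The bus remembers a subscriber even while it is down.
        self.offsets.setdefault(name, len(self.log))

    def publish(self, event: dict) -> None:
        # Publishing appends and attempts delivery; it never waits for a
        # subscriber that is down.
        self.log.append(event)
        for name in list(self.handlers):
            self._drain(name)

    def connect(self, name: str, handler: Handler) -> None:
        self.register(name)
        self.handlers[name] = handler
        self._drain(name)  # replay everything missed while down

    def disconnect(self, name: str) -> None:
        self.handlers.pop(name, None)

    def _drain(self, name: str) -> None:
        if name in self._draining:
            return  # already delivering to this subscriber further up the stack
        self._draining.add(name)
        try:
            handler = self.handlers[name]
            while self.offsets[name] < len(self.log):
                handler(self.log[self.offsets[name]])
                self.offsets[name] += 1  # advance only after the handler returns
        finally:
            self._draining.discard(name)


bus = PersistingBus()
seen: List[dict] = []
bus.connect("inventory", seen.append)
bus.publish({"type": "CreateOrderEvent", "order_id": 1})
bus.disconnect("inventory")                 # Inventory service goes down
bus.publish({"type": "CreateOrderEvent", "order_id": 2})
bus.publish({"type": "CreateOrderEvent", "order_id": 3})
bus.connect("inventory", seen.append)       # back up: orders 2 and 3 are replayed
print([e["order_id"] for e in seen])
```

The publisher is never blocked by the outage, yet orders 2 and 3 are not lost. Because the offset only advances after the handler returns, a crash in mid-handling would cause a redelivery, so this sketch gives at-least-once rather than exactly-once delivery.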
One other issue is what tokens the messages carry so they can be tied back to the business function at hand, but I assume you have this handled in the system somehow. The only concept you may not have is having all systems respect the token, and respect the other systems by returning messages in cases of failure.
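As a rough illustration of the token idea, the sketch below assumes every event carries a `correlation_id` (a hypothetical field name, as are `Event`, `InventoryService` and `outbox`). The service copies the token onto the event it emits, so downstream services can tie it back to the same order, and it remembers which tokens it has already handled, so a message redelivered after a failure is not processed twice.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Event:
    type: str
    correlation_id: str  # hypothetical token tying every event to one order
    payload: dict = field(default_factory=dict)


class InventoryService:
    def __init__(self) -> None:
        self.processed: Set[str] = set()  # tokens already handled
        self.outbox: List[Event] = []     # events this service would publish

    def handle(self, event: Event) -> None:
        if event.type != "CreateOrderEvent":
            return
        if event.correlation_id in self.processed:
            return  # redelivered duplicate, ignore it
        self.processed.add(event.correlation_id)
        # Carry the same token forward on the outgoing event.
        self.outbox.append(
            Event("InventoryUpdatedEvent", event.correlation_id, event.payload)
        )


order = Event("CreateOrderEvent", str(uuid.uuid4()), {"sku": "A-1", "qty": 2})
service = InventoryService()
service.handle(order)
service.handle(order)  # simulated redelivery after a restart
print(len(service.outbox), service.outbox[0].correlation_id == order.correlation_id)
```

In a real system the set of processed tokens would have to survive a restart as well, otherwise the deduplication is lost along with the service.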
Note that asynchronous communication, from the business standpoint, does not mean fire and forget at the point of contact. You can return messages back without using the asynchronous path for every single bit of information. What I mean here is that the inventory system, on starting up, may process a message and send the result to the application on the UI end, which can return "forget about it, you were too slow", so the transaction is returned to its original state (nonexistent?).
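The "you were too slow" reply could look something like the following sketch. Everything here is an assumption made for illustration: the `OrderFrontend` class standing in for the application on the UI end, the two-minute limit, and the reply event names `OrderCancelledEvent` and `OrderConfirmedEvent`. Time is passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PendingOrder:
    order_id: str
    placed_at: float  # seconds on some shared clock
    status: str = "PENDING"


class OrderFrontend:
    """Stands in for 'the application on the UI end'."""

    def __init__(self, max_wait_seconds: float) -> None:
        self.max_wait_seconds = max_wait_seconds
        self.orders: Dict[str, PendingOrder] = {}

    def place(self, order_id: str, now: float) -> None:
        self.orders[order_id] = PendingOrder(order_id, placed_at=now)

    def on_inventory_updated(self, order_id: str, now: float) -> str:
        order = self.orders[order_id]
        if now - order.placed_at > self.max_wait_seconds:
            order.status = "CANCELLED"
            return "OrderCancelledEvent"  # hypothetical: inventory should undo its update
        order.status = "CONFIRMED"
        return "OrderConfirmedEvent"      # hypothetical


frontend = OrderFrontend(max_wait_seconds=120)
frontend.place("o-1", now=0)
frontend.place("o-2", now=290)
# Inventory was down for 5 minutes and works through its backlog at t=300.
replies = {oid: frontend.on_inventory_updated(oid, now=300) for oid in ("o-1", "o-2")}
print(replies)
```

Order `o-1` waited 300 seconds and is cancelled, while `o-2` waited only 10 seconds and is confirmed. Whether a late order should be cancelled or still honoured is a business decision, not a technical one.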
I don't have enough information (or time?) to suggest which method is best for your architecture, as the details are still a bit too high level, but hopefully this stirs some thought.
Hope this makes sense, as I basically did a brain to keyboard maneuver in my ADHD state. ;-)