Outbound email delays have been a problem for a while, and until today we mitigated them manually. We've determined that the actual fix is still some way off, so we have automated the mitigation to keep it consistent. The following script now runs every 15 minutes on our servers: https://github.com/mxroute/da_server_updates/blob/master/exim/fixqueue.sh
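For illustration only, a schedule like this could be expressed as a cron entry along the following lines. This is a minimal sketch: the cron file name and the install path of the script are placeholders, not a description of the production configuration.

```
# /etc/cron.d/fixqueue  (hypothetical file name)
# Run the queue mitigation script every 15 minutes as root.
# /usr/local/sbin/fixqueue.sh is an assumed install path.
*/15 * * * * root /usr/local/sbin/fixqueue.sh
```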
This automation is not the fix we want, but every solution to this problem in our stack has pros and cons, and this is the least evil automation available at the moment. The delays affect less than 0.5% of all email that our users send, but we don't want to disregard those emails as statistically irrelevant. The automation should minimize the impact to such a degree that users are unlikely to ever notice the already rare delay.
This solution also adds logging, so that we can audit the real impact of these delays somewhat more reliably. That will help us prioritize a more permanent fix.