The following is the incident report for the connectivity issues experienced by select clients of core PaymentEvolution services on June 8, 2020. It details the nature of the outage and our response. We understand this service issue has impacted our valued clients, and we apologize to everyone who was affected.
Issue Summary
On June 8, 2020, from 11:43:44 AM ET to 11:48:44 AM ET and again from 1:23:44 PM ET to 2:48:44 PM ET, requests to resolve certain *.paymentevolution.com DNS entries were unsuccessful. Some users were unable to access core PaymentEvolution services, including payroll processing and our public payroll calculator. Other services, including our help desk, blogs, and marketing sites, were not impacted. The root cause was a failed line card in a core network switch at our datacentre provider.
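For readers who want to see what this kind of failure looks like in practice, the sketch below shows one way to probe DNS resolution from Python. The hostname used is illustrative only and is not taken from this report.

```python
# Minimal sketch: check whether a hostname resolves.
# "payroll.paymentevolution.com" is an illustrative name, not a record confirmed by this report.
import socket

def resolve(hostname: str) -> list[str]:
    """Return the unique addresses a hostname resolves to; raises socket.gaierror on failure."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})

try:
    print(resolve("payroll.paymentevolution.com"))
except socket.gaierror as exc:
    # During the outage windows, lookups like this would have failed with a resolution error.
    print(f"DNS resolution failed: {exc}")
```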
Timeline (all times Eastern Time)
08 June 2020 11:43:44 AM DNS resolution issues started
08 June 2020 11:44:14 AM PE team alerted
08 June 2020 11:48:44 AM Service restored
08 June 2020 1:23:44 PM DNS resolution issues recurred
08 June 2020 1:24:14 PM PE team alerted
08 June 2020 1:30:00 PM Datacentre team attempted to replace failed line card
08 June 2020 2:48:44 PM Service restored
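The roughly 30-second gap between the onset of each issue and the PE team being alerted suggests automated monitoring of DNS resolution. The following is a purely illustrative sketch of such a health check, not a description of PaymentEvolution's actual tooling; the hostname list, check interval, and alert() hook are all assumptions.

```python
# Illustrative DNS health-check loop; hostnames, interval, and alerting are assumptions.
import socket
import time

HOSTNAMES = ["payroll.paymentevolution.com"]  # hypothetical set of monitored records
CHECK_INTERVAL_SECONDS = 30                   # roughly matches the alert delay in the timeline

def alert(message: str) -> None:
    # Placeholder: a real monitor would page the on-call team instead of printing.
    print(f"ALERT: {message}")

def check_once() -> None:
    for hostname in HOSTNAMES:
        try:
            socket.getaddrinfo(hostname, None)
        except socket.gaierror as exc:
            alert(f"{hostname} failed to resolve: {exc}")

if __name__ == "__main__":
    while True:
        check_once()
        time.sleep(CHECK_INTERVAL_SECONDS)
```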
Root Cause
On the morning of June 8, 2020, some parts of the datacentre network became unreachable. The datacentre team immediately assigned multiple staff members to investigate. The problem was traced to a single 48-port line card on one of the core network switches.
The datacentre team first attempted to re-seat the card twice, but it was not detected by the switch. The team then unplugged all network cables from the card and installed an identical spare card from the emergency on-site spare hardware. The replacement card was also not detected after being re-seated three times. The team then tried a third line card in the slot; it too was not detected after multiple re-seating attempts.
A decision was then made to power cycle the switch to see whether the card could be detected after a full hard boot. Prior to the power cycle, the switch had an uptime of over eight years. The power cycle did not solve the problem: the card was still not detected, although the rest of the switch functioned nominally.
Resolution and Recovery
To solve the problem, the team swapped in an identical spare switch chassis from the emergency on-site spare hardware. This required unplugging all cables (copper and fiber) from all 7 modules on the switch, pulling out all modules and power supplies, unracking the chassis, racking the spare chassis, reinstalling the modules and power supplies, powering on the switch, and reconnecting all cables to the modules. Three datacentre staff members helped move the hardware and plug in cables to complete the work as quickly as possible. A console cable was connected to the switch during boot to confirm there were no errors, and it showed everything functioning correctly. Services were restored by 2:48 PM.
Corrective and Preventative Measures
In the last two days, we’ve conducted an internal review and analysis of the outage. The following are actions we are taking to address the underlying causes of the issue, help prevent recurrence, and improve response times:
- The datacentre team enacted the recovery plan to ensure service was restored as quickly as possible.
- Emergency equipment performed as expected. Equipment has been restocked.
- Core PaymentEvolution services, data, and customer information were NOT impacted, and our redundant site redirection plans were not needed.
PaymentEvolution is committed to continually and quickly improving our technology and operational processes to prevent outages. We appreciate your patience and again apologize for the impact to you, your users, and your organization. We thank you for your business and continued support.