Resolved -
Dear Partners,
We have successfully completed the final round of compute resource expansion, and based on our monitoring all services are in a stable state; hence, this global notification will be marked as resolved.
We made good progress toward our goals on both traffic reinstatement and Web Administration performance improvement, and thus our focus will be shifting to the full restoration of our AMS cloud region.
What to expect next:
We will be announcing the migration activities to all partners who will be transferred back to the AMS cloud region via a global Service Management notification, in which we will share timelines and what can be expected as the initiative develops. We expect to start the process on Monday, the 18th of May.
The root cause analysis will be completed next week, and we will share details on the issues experienced, the actions taken to mitigate them, and the improvements we are planning to prevent or limit similar situations going forward.
We sincerely appreciate your understanding and support during the past week as we navigated the challenges and restored the service to nominal operation. Thank you!
In the meantime, should you have any questions or face difficulties, please reach out to the PROS Support team.
Sincerely,
PROS Team
May 15, 19:50 UTC
Update -
Dear Partners,
Today, we procured additional hardware capacity that will be set up and introduced in production over the next 24 hours to ensure continued stable performance as we head into the weekend.
At this moment the overall service is stable, traffic is very close to pre-incident levels, and the remaining focus will be:
*Bridging the gap from ~95% to 100% of usual traffic
*Stabilizing the performance of the Web Administration tool, as performance is not yet optimal and some configuration modules require support-assisted workarounds, depending on the goals
Given the above, we expect to close the Global Outage notification before EOD Friday, 15th of May, and to start reaching out to individual partners separately upon initiating the migration to the AMS data center, which is planned to commence and be completed within the upcoming week.
We will report on our progress tomorrow.
Sincerely,
PROS Team
May 14, 17:57 UTC
Update -
Dear Partners,
We managed to complete our goals, and service performance is now almost at pre-incident levels:
*FastSearch and PricingCache data quality is within expected levels (~90%)
*The Web Administration interface is live and functionality controls are available; however, performance is not yet optimal
*We are very close to the usual traffic load (currently at ~95%)
*We have outlined the plan to begin migrating traffic back to the AMS data region the moment it becomes available to us
The goals for the next 24 hours are:
*Continued monitoring to assure service stability
*Fully reinstate the traffic to the usual levels
*Improve Web Administration performance
*Close the Global Outage event as resolved
Please expect more news tomorrow.
Sincerely,
PROS Team
May 13, 19:13 UTC
Update -
Dear Partners,
After today's work we are very close to successfully restoring all services to pre-incident performance, and our current estimate is that we may achieve that by the end of the day tomorrow.
Within the past ~24 hours:
*We are currently close to 90% of usual traffic, with largely stable performance
*FastSearch environments & PricingCache exports continue to be updated, and overall data quality is now above 80%
*The Web Administration interface is fully operational; however, loading times may not be optimal
*We continue to work with customers, with the Support team assisting with questions, reported issues, and requested clarifications
*We are ready with a plan to gradually shift services back to the Amsterdam data region upon its full restoration; according to our cloud provider, that will happen within the next 72 hours
What is next:
*We will require an additional round of compute capacity expansion to get back to pre-incident traffic and performance levels
*Get to 90%+ data quality for all FastSearch environments by Wednesday, 13th of May
*Get to 90%+ data quality for all PricingCache exports by Wednesday, 13th of May
*Improve Web Administration performance
*Prepare for 100% service restoration sign-off and start planning the execution of the reinstatement of the AMS data region
We will be providing the next update on Wednesday, 13th of May.
Sincerely,
PROS Team
May 12, 19:44 UTC
Update -
Dear Partners,
We faced some scaling issues and had to resolve some unexpected bottlenecks; however, we still made good progress today.
Within the past ~24 hours we:
*Executed an additional compute resource increase; we are currently close to 80% of usual traffic with close to no overall degradation
*Powered up all (100%) FastSearch environments, which are now fully operational; we are seeing faster-than-anticipated data quality improvement across environments
*Powered up all (100%) PricingCache environments and resumed all data deliveries; data quality will gradually improve over the following exports, and all identified delivery issues were addressed
*Faced some issues with the Web Administration interface configuration; hence, the revised ETA is to bring it back to an operational level tomorrow
*Continue to work with customers, with the Support team assisting with questions, reported issues, and requested clarifications
What is next:
*Final round of compute capacity expansion and assessment of overall performance
*Get to 80-90% data quality for all FastSearch environments by Tuesday, 12th of May
*Get to 80-90% data quality for all PricingCache exports by Tuesday, 12th of May
*Restore Web Administration functionality and make it available to users
*Continue to address questions or issues reported by customers
*Prepare for 100% service restoration sign-off and plan the reinstatement of the primary data center
We will be providing the next update on Tuesday, 12th of May.
Sincerely,
PROS Team
May 11, 20:50 UTC
Update -
Dear Partners,
The PROS team successfully deployed additional compute capacity, and we remain on track to restore service performance to normal operating levels. Our current ETA for fully restoring the service to pre-incident performance is within the upcoming week. We should be in a position to share a more concrete target following the work that will continue throughout the night and on Monday.
Within the past ~24 hours we managed to deliver:
• A significant increase in additionally introduced compute resources, allowing us to serve more traffic and get closer to pre-incident levels; we are currently at around 60% of the usual traffic of the affected services
• ~60% of the FastSearch environments are now fully operational
• 50%+ of PricingCache data deliveries have been resumed; the initial deliveries are expected to be of outdated quality, and data quality will gradually improve over the following exports
• The majority of the configuration changes required to enable full functionality of the Web Administration interface have been completed
In the next 24 hours the PROS team aims to deliver:
• Further expansion of compute capacity to get us closer to pre-incident performance levels and allow concrete ETA on full performance restoration
• 100% of FastSearch environments to be made operational by Monday, 11th of May
• 100% of PricingCache environments to be made operational and all data deliveries to be resumed by Monday, 11th of May
• Web Administration functionality to be restored and made available to users
• PROS Support to work with customers to confirm satisfactory service restoration and address any issues
We will be providing another progress update on Monday, the 11th of May.
Sincerely,
PROS Customer Support
May 10, 19:24 UTC
Update -
Dear Partners,
The PROS team continues to successfully configure and deploy additional compute capacity, and we made solid progress toward restoring normal operations.
Within the past ~24 hours we:
• Resolved and/or provided workarounds for specifically reported customer issues
• Further stabilized the live-transaction services; intermittent transaction time-outs now constitute less than 1% of all transactions
• Restored availability services to their pre-issue state; they are operating optimally
• Configured additional compute resources and commenced the generation of a pool of FastSearch environments
• Began gradually restoring the PricingCache service; some partners may start receiving data, though initially of lower quality. We will reach out to all of these customers with additional details as they become available and once full restoration is achieved
• Performed additional work to support the full restoration of the Web Administration interface
In the next 24 hours the PROS team will be focused on:
• Completing another round of compute capacity expansion; while the provisioning speed is not ideal, we will continue to deliver gradual performance improvements
• Continuing FastSearch environment regeneration; the goal of restoring the service at limited capacity in the early hours of Monday, 11th of May, remains unchanged for the time being
• Continuing PricingCache service restoration, aiming to gradually restart data delivery as of Monday, 11th of May; we are currently ahead of schedule
• Resolving the remaining Web Administration issues, aiming to fully restore functionality as of Monday, 11th of May
We will be providing the next status update on Sunday, the 10th of May, to report on our progress.
Sincerely,
PROS Customer Support
May 9, 20:37 UTC
Update -
Dear Partners,
We are still waiting for our cloud provider to provision the remaining compute resources. While optimal performance has not yet been achieved, we are operating in a stable manner.
Throughout the day we were able to:
Resolve or mitigate the majority of the aftermath customer-reported issues for all live-transaction-based services
Further reduce the intermittent transaction time-outs and "no-results" situations as we introduced more compute resources
Process the availability data recaps, which will greatly limit availability discrepancy errors
Restore OneSearch, Merchandising, Repricer, and historical transaction services to full operation
Make significant progress in preparing the FastSearch instances to commence operation
Bring up the Web Administration interface with limited functionality
As we enter the weekend, here are the next steps we will be taking:
Continue expanding hardware capacity as it becomes available, thus further improving performance
Commence FastSearch environment regeneration, with the goal of restoring the service at limited capacity in the early hours of Monday, 11th of May
Continue PricingCache service restoration, aiming to gradually restart data delivery as of Monday, 11th of May
Enable additional Web Administration modules for customer usage
We will be providing the next status update on Saturday, the 9th of May, to report on our progress.
Sincerely,
PROS Support
May 8, 15:51 UTC
Update -
Dear Partners,
Our cloud provider provisioned a portion of the additional compute resources earlier today, and these have now been deployed to production. While this has already contributed to improved stability and reduced service interruptions, full restoration remains dependent on the provisioning of the remaining expected compute capacity.
Current Status:
• Intermittent transaction timeouts have been further reduced
• Intermittent “no-results” occurrences have decreased
• Availability data processing is operational
• Repricer and historical transaction services are operational
• FastSearch services remain partially impacted; however, rebuilding and configuration activities are actively underway to enable rapid restoration once the additional compute resources are provisioned
• The Web Administration interface is still inaccessible, pending additional compute resource availability
Ongoing Activities:
• The team continues active monitoring, validation, and remediation activities to address remaining service issues
• FastSearch environments are being rebuilt in preparation for full restoration as soon as sufficient compute resources are provisioned
• PROS continues to work closely with the cloud provider to secure the remaining expected compute resources. Although a limited portion has already been delivered, we are still awaiting the majority of the provisioning expected later today
• Preparation activities to restore the Web Administration interface are in progress
• A service impact assessment and root cause analysis are ongoing while awaiting the official report from the cloud provider
We will continue to provide updates as additional compute resources are provisioned and services are progressively restored.
Please expect the next update within the next two hours.
Sincerely,
PROS Support
May 8, 10:15 UTC
Update -
Dear partners,
Our cloud provider provisioned part of the additional compute resources earlier today, and we have just deployed them to production. While this will not restore optimal operation and we remain at limited capacity, it will further stabilize the services and noticeably improve current performance.
Current status:
• Intermittent Transaction Time-outs have been further reduced
• Intermittent "no-results" occurrences have been further reduced
• Availability data processing is operational
• Repricer and historical transaction services are operational
• The FastSearch service is not fully operational; however, we have started rebuilding and configuring the impacted FastSearch environments so they can be swiftly regenerated once the additional compute capacity is delivered (pending additional compute resource availability)
• The Web Administration interface is currently not accessible (pending additional compute resource availability)
Ongoing activities:
• The team continues to address issues identified through our validation activities and our monitoring of service performance
• FastSearch environment configurations are being rebuilt to prepare for service restoration once sufficient compute resources are provisioned
• PROS is working with the cloud provider to procure the additional expected compute resources; while a small portion has been provided, we still expect the majority of the resources to become available later today
• Preparation activities to bring the Web Administration interface back online have commenced
• A service impact assessment has been initiated while we await the official cloud provider report to support the root cause analysis process
Kindly expect further updates in the next two hours.
Sincerely,
PROS Support
May 8, 07:11 UTC
Update -
Dear Partners,
We have successfully completed the third round of provisioning and introducing compute capacity in production. This has resulted in a noticeable improvement in stability, and we are seeing a further reduction in the intermittent issues. With this, we have exhausted the spare resources the cloud provider is able to provide for the time being. We have arranged for significant additional capacity to be provisioned, yet realistically it will be available for configuration in approximately 7-8 hours.
Current status:
Intermittent transaction time-outs have been reduced
Intermittent "no-results" occurrences have been reduced
Availability discrepancies were resolved (the observed MQ connectivity issues were fixed)
Repricer and historical transaction services are operational
The FastSearch service is not fully operational; however, we have started rebuilding and configuring the impacted FastSearch environments so they can be swiftly regenerated once the additional compute capacity is delivered
While we await the distribution of these compute resources, the team's focus will remain on the observed customer-specific intermittent issues and on preparing and planning the deployment of the additional resources later today.
Sincerely,
PROS Support
May 7, 23:36 UTC
Update -
Dear Partners,
We have successfully completed another round of provisioning and introducing compute capacity in production, which allowed us to further stabilize the system and restore an additional portion of traffic. As of this moment, preliminary observations suggest that services are operating at limited capacity across all impacted partners.
The PROS team continues with the deployment and configuration of the third batch of compute resources, while also addressing customer-specific reported observations, such as:
Intermittent Transaction Time-outs
Intermittent "no-results"
Availability discrepancies (we are experiencing MQ connectivity issues)
Repricer and historical transaction services not being fully operational for some partners
We continue to troubleshoot and resolve these and all other discrepancies flagged by customers or identified during the validation process.
Kindly expect the next update in approximately 90 minutes.
Sincerely,
PROS Support
May 7, 21:28 UTC
Update -
Dear Partners,
We managed to configure and deploy a portion of the provisioned compute capacity in production in the alternative region. As a result, the first batch of traffic has been successfully shifted and is operational at limited capacity. We will be reaching out directly to the customers whose services are being restored.
The full spare compute capacity has also been delivered; therefore, we are proceeding with configuring the next batch of traffic transfer. For the time being, the ETA for mitigation remains unchanged.
We will continue to provide regular updates on our progress or in case there are changes.
Sincerely,
PROS Support
May 7, 18:29 UTC
Update -
Dear partners,
We successfully restored part of the affected traffic with additional resources in our alternative region.
We are working with our cloud provider to procure the full capacity needed to restore the service in full. The currently available compute resources will not be sufficient to deliver 100% of the usual service performance; however, they will be sufficient to bring the service back at reduced capacity while we await a permanent solution. We have filed the necessary requisitions for a significant increase in compute resources; however, the provider will only be able to deliver this in a matter of days.
Based on the cloud provider's provisioning speed so far, we expect all services to be restored at limited capacity within 6-12 hours. The process itself will consist of loops of:
Configure compute resources
Shift a portion of the traffic
Assure traffic volumes are manageable and the existing operational service is not degrading
Verify that performance, while not optimal, is sufficient
Proceed with the next "batch"
We will continue to provide regular updates on our progress or in case there are changes.
Sincerely,
PROS Support
May 7, 16:12 UTC
Investigating -
Dear partners,
We managed to shift a portion of the affected traffic to the alternative locations and were able to mitigate the impact for some of our partners.
The remaining transfer efforts continue. In parallel, we are assessing our hosting provider's ability to provision the resources required to fully resolve the situation, and we are exploring alternative options depending on their readiness.
Individual partners for whom we believe the issue has been resolved will be contacted directly.
Kindly expect a progress update in the next 30-60 minutes.
Sincerely,
PROS Support
May 7, 14:45 UTC