I have wanted to provide you with as much information as possible about RGMS since its problems in January and February and the disruptions to your preparation of applications to NHMRC. I understand that many of you will want to understand more about these circumstances and to have an assurance that NHMRC will be able to maintain the recent improvements in RGMS performance.
I have provided below an account of what happened over the period of disruption in January and February, including an explanation of how the performance issues were resolved.
I am very relieved that the fixes installed by our providers from 10 to 17 February 2011 appear to have restored RGMS to the level displayed when we tested it in October 2010. Staff and our providers are closely monitoring the performance but since RGMS came back on line for all users on 23 February, it has operated without interruption. Over the last few days, we have seen peak loads of more than 12,800 log-ins per day.
RGMS consists of the main application software, interacting at one end with software that controls the internet-based user interface and at the other end with Oracle databases. The system is housed across a number of servers supported by a separate company. These components are managed through a series of contractual arrangements, as IT services in Commonwealth Departments have been outsourced since the middle of the last decade.
NHMRC response to 2010 problems
After the unsatisfactory operation of RGMS in early 2010, I initiated an independent report on its problems in April 2010. We acted on the recommendations of this review over the following several months. In October 2010, we took RGMS off-line to install and test those changes.
These tests were rigorous, conducted by an independent company, and tested for load and endurance performance. The tests showed a stable system, able to handle most expected demands upon it, though we anticipated that some “load management” may be required for peak periods.
As described in NHMRC’s RGMS Updates communiqués from late 2010, load management would temporarily restrict new users from gaining access to the system when RGMS reached threshold levels, in order to maintain system performance and functionality for those users already in the system. We expected these times to be rare.
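For readers who want a concrete picture of load management, the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of how RGMS implements it: the class name `LoadManager`, its methods and the threshold value are all hypothetical.

```python
class LoadManager:
    """Hypothetical admission control: refuse new log-ins at a threshold.

    Users already in the system are never pushed out; only new
    arrivals are turned away while the system is at its limit.
    """

    def __init__(self, threshold):
        self.threshold = threshold  # assumed limit on concurrent users
        self.active = set()         # users currently in the system

    def try_login(self, user_id):
        if user_id in self.active:
            return True             # existing sessions are unaffected
        if len(self.active) >= self.threshold:
            return False            # temporarily restrict new users
        self.active.add(user_id)
        return True

    def logout(self, user_id):
        self.active.discard(user_id)


# Usage: with a threshold of 2, the third user must wait for a free slot.
manager = LoadManager(threshold=2)
results = [manager.try_login(u) for u in ("a", "b", "c")]
manager.logout("a")
retry = manager.try_login("c")
```

In this sketch the third user is refused at first and admitted once another user has logged out, which is the trade-off described above: a short wait for new arrivals in exchange for stable performance for those already working.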
Performance in early 2011
After opening the system in December and monitoring continuously, we began to see unexpected peaks in CPU use in January 2011. These peaks in CPU use did not correlate with the numbers of users logged in and had not been seen in the testing.
The CPU load peaks also tended to escalate very rapidly from low CPU use to a maximum level that would then bring the system down. Despite several changes made by our providers, the system continued to deteriorate through January into early February.
As you will be only too aware, RGMS was down frequently between 8 and 9 February 2011. This meant major disruption for all our applicants for project grants, career development fellowships, TRIP Fellowships and Centres of Research Excellence. It was also a major inconvenience for Research Office staff at universities and medical research institutes.
On 9th February, I decided that RGMS was performing much too poorly to continue, took it off-line and called in our providers to urgently work on the problem. Over the period 10 to 13 February 2011, NHMRC staff supported the work of our providers who brought international teams to diagnose the problems.
A major change was implemented as recommended by the providers, designed to fix the most apparent problem. The system was again tested thoroughly off-line but when we went back on line on 14th February, though the system initially performed well, we again experienced the sudden cascading of CPU use as we had seen in January.
At this point, the provider brought in its most experienced trouble-shooter from the USA. Between 17 and 21 February this expert, through working with the system while it was running live, was able to diagnose and provide fixes for a number of problems. The main problem, which caused the unexplained spikes in CPU usage, was only detected by live monitoring of actual users in the system. This problem had only been reported once before in the literature.
Since then the system has been providing access as we had hoped it would, given the changes we had made and the results of testing in 2010. RGMS is now supporting many thousands of logins every day.
Causes of the performance problems
I have been advised that degradation of access and performance in RGMS was caused by complex interactions across the entire RGMS system. The underlying causes were only identified and resolved by our supplier’s most experienced trouble-shooter, brought to NHMRC from the USA to interact with the system in real time.
This expert told us that the bugs did not appear during our rigorous load and stress testing in October last year because, while the testing was designed to simulate the activities of users as closely as possible (logging in, creating applications, uploading documents etc), it could not simulate the unique behaviours of real users. It turned out that the performance problems were unpredictable and difficult to identify because they resulted from rarely occurring, multiplying combinations of events. These are summarised below:
- The "row table lock" issues, which caused the freezing of the database server, were resolved by a hot-fix provided by our supplier and some other configuration changes.
- Through watching the live activities of users in the system, the expert identified and resolved a major problem that arose when users were undertaking combinations of tasks that required the software to initiate complex tasks that took many seconds to complete. He observed that the system multiplied those requests several times over before the request was completed. This in turn led to delays in the completion of other tasks in the system, which were then also amplified. On occasions, these multiplying and amplified requests hit a threshold when all processors became saturated and RGMS became unresponsive.
- Changes to the separate software system that handles log-in requests also removed a factor that was causing inefficiencies in some processes.
- Our service providers were also able to fine-tune a number of processes across the system to improve the system’s response time and automate some required maintenance processes.
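The multiplying and amplifying behaviour described in the second point above can be illustrated with a toy model. It is a sketch under loud assumptions and not the diagnosed mechanism: it assumes a slow request is re-issued at a fixed interval until the first copy completes, and that tasks slow down as the system approaches its capacity. The function names, the re-issue interval and the slowdown formula are all hypothetical.

```python
import math


def copies_issued(task_seconds, reissue_interval):
    """Copies of one request, if it is re-sent every reissue_interval
    seconds until the first copy finishes (assumed behaviour)."""
    return max(1, math.ceil(task_seconds / reissue_interval))


def simulate_cascade(base_seconds, reissue_interval, concurrent_tasks,
                     capacity, rounds=10):
    """Return the load seen in each round, stopping at saturation."""
    loads = []
    duration = base_seconds
    for _ in range(rounds):
        load = concurrent_tasks * copies_issued(duration, reissue_interval)
        loads.append(load)
        if load >= capacity:
            break  # all processors saturated: system unresponsive
        # Assumed slowdown: tasks take longer as spare capacity shrinks.
        duration = base_seconds / (1 - load / capacity)
    return loads


# Short tasks stay stable; tasks taking many seconds escalate quickly.
short_tasks = simulate_cascade(2, 5, 25, 100, rounds=5)  # [25, 25, 25, 25, 25]
long_tasks = simulate_cascade(6, 5, 25, 100)             # [50, 75, 125]
```

Even in this simplified model, the same number of users produces either a flat load or a rapid climb to saturation depending only on how long the underlying tasks take, which is consistent with CPU peaks that did not track the number of users logged in.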
RGMS as an end to end system
In providing these details, I am not seeking to diminish NHMRC’s recognition of the effects of these RGMS performance failures on you as applicants. However, I do want to describe some of the possibilities that RGMS does and will offer researchers in the future, as a complete grants management system.
RGMS is a “go to whoa” system. That is, it is designed to provide:
- an ability for applicants to submit grants electronically
- an ability for applicants to use their RGMS curriculum vitae in different applications each year, rather than having to create one for each grant application
- an ability for NHMRC-held information to be pre-populated in application forms (e.g. grants already held)
- a database for assigning external assessors to all applicants (at last; NHMRC has not had a useful assessors database available to it for many years)
- instant support for peer review panels during their meetings (this is yet to be fully installed)
- reporting on grant expenditure in line with government reporting requirements
- reporting to the general public on the use of taxpayers’ dollars in medical research, including in due course, sophisticated reports on what is being funded and achieved in their areas of interest – cancer, mental health, heart disease, and so on, and
- an ability for grant holders to quickly request and gain approval for changes in their award.
Perhaps the most important advantage that RGMS will offer, however, is the possibility of introducing more than one application date each year. A fully functioning RGMS can mean that researchers can apply more often than once per year to NHMRC for project grant funding.
Research Committee has been considering how this might be implemented, so that researchers are not confined to the current once-per-year applications for research grants. I am hopeful that we will be able to consult with the research community on this possibility later this year.
Despite the deadline for project grant applications being a month later than originally planned, there will be no delay in the peer review process this year. The Assigners’ panel will meet as originally planned in the first week of April, and RGMS will be able to provide its members with automatically generated lists of potential external assessors from its database of over 15,000 curricula vitae.
Then, we look forward to the assistance of thousands of researchers who each year provide peer review comments on applications, so that the GRPs have access to the best possible expert opinion upon which to score applications in July and August.
11 March 2011