Problem Background
Recently, I joined a government agency as a software engineer/scientist/analyst. Previously, I worked in the software industry, where I gained 3 years of software engineering experience (on top of about 7 years in computational science/scientific computing). My current job is to come up with a strategy for modernizing a legacy scientific program.
The scientific program to modernize is a large legacy computational system that basically does mathematical optimization. Development started in the 1990s and has not kept up with best practices, unfortunately. It was/is written by scientists and analysts.
The main component of the system is a Fortran-based program that does the optimization (various Fortran versions starting from 90, with some newer versions incorporated, compiled with a 2018 compiler). The program consists of 400K lines of Fortran code, 20K lines of shell scripts, and 60K lines of external math solver code. There is no test suite, hence the legacy label. The program can be thought of as a dozen modules, each describing a particular physical component's behavior in the optimization. The general flow of the Fortran program is described in a main routine, where these dozen modules are called sequentially. The main routine does some other data orchestration and I/O as well. There is some interface to commercial products and optimization solvers, probably through a home grown Fortran wrapper. One of the biggest issues IMO is the use of global variables: both main and the modules have access to these globals, so changes to the state can be made from anywhere (see my specific question).
There is a lot of home grown code for sub-systems or utilities that manage the main Fortran program, written mainly as shell scripts. These sub-systems include:
- a queuing system that manages the executions of the main Fortran program on internal on-prem Windows servers,
- post-processor that converts the Fortran UNF files to CSV and Excel format,
- custom visualization package written in Visual Basic that plots the results of the Fortran program,
- version control utilities as wrappers around RCS VCS,
- compiler utility that wraps the Fortran compilation.
Those are the main sub-systems or utilities necessary to work with the Fortran program and its input/output, but there are loads of other Fortran programs and shell scripts that do longer-term things like server space management and license management.
My immediate team is responsible for the Fortran code execution and integration with other modules. Not all 400K lines of Fortran are in our scope, just maybe 10-20%; the rest is with other groups responsible for the dozen modules, which introduces some organizational pains since we have no control over their code. My team consists of me and another software developer, both mid-level software developers converted from scientific computing. A junior software developer with a traditional background in software and CS is joining shortly. Our senior software developer (one of the original developers of the entire system) is retiring in 1 month, and we are in the process of trying to find a replacement.
Problem
My question is: What are the components and sequence of the modernization plan/strategy that I should consider? The modernization is basically the process of moving from legacy to a more modern process, both technically (e.g., architecture, frameworks) and organizationally (e.g., agile process management for development).
Proposed strategy
Currently, at a high level, my plan is to:
- assess extent of home grown code for systems that are not part of the main Fortran program;
- replace each of these home grown solutions with best practice open source solution, so we maintain as little code as possible;
- current order is modern VCS (Git/GitLab), then queuing system, then viz package, but the order will be determined by how much code there is per sub-system;
- with the remainder of the code - hopefully just the main Fortran program and not some vital sub-system that we cannot find an open source solution for - capture current behavior with characterization tests;
- refactor (update Fortran, port all functionality that doesn't do number crunching from Fortran to Python, etc.), make sure tests pass, repeat;
- "futurize" code by updating the architecture to enable cloud compute, using Docker for containerization (to avoid vendor lock-in).
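To make the characterization-test step above more concrete, here is a minimal sketch of one way it could look. It assumes the comparison is done on the CSV files produced by the post-processor, with a stored "known good" output from the current program as the baseline. The function names, file layout, and tolerance values are all hypothetical, not anything that exists in the system today.

```python
import csv
import math


def load_table(path):
    """Read a CSV file into a list of rows (each row is a list of strings)."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle))


def cells_match(expected, actual, rel_tol=1e-9, abs_tol=1e-12):
    """Compare numerically when both cells parse as floats, otherwise as text."""
    try:
        return math.isclose(float(expected), float(actual),
                            rel_tol=rel_tol, abs_tol=abs_tol)
    except ValueError:
        return expected.strip() == actual.strip()


def diff_tables(expected, actual, rel_tol=1e-9, abs_tol=1e-12):
    """Return a list of (row, column, expected, actual) mismatches."""
    mismatches = []
    if len(expected) != len(actual):
        mismatches.append(("row count", None, len(expected), len(actual)))
    for r, (exp_row, act_row) in enumerate(zip(expected, actual)):
        if len(exp_row) != len(act_row):
            mismatches.append((r, "column count", len(exp_row), len(act_row)))
            continue
        for c, (exp, act) in enumerate(zip(exp_row, act_row)):
            if not cells_match(exp, act, rel_tol, abs_tol):
                mismatches.append((r, c, exp, act))
    return mismatches


def check_against_baseline(baseline_csv, candidate_csv, **tolerances):
    """Compare a new run's CSV output against the stored baseline CSV.

    An empty list means the refactored code reproduced the old behavior
    within the given tolerances.
    """
    return diff_tables(load_table(baseline_csv),
                       load_table(candidate_csv),
                       **tolerances)
```

The idea would be to run the current program once per representative input, keep those outputs as baselines, and call `check_against_baseline` after every refactoring step. The tolerances matter because a compiler or library change can shift floating-point results slightly without changing behavior in any meaningful way; note that in this sketch a NaN in either file is always reported as a mismatch.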
Research
I've looked at some great discussion of similar topics:
- I've inherited 200K lines of spaghetti code -- what now?
- https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/
- How to deal with a large codebase with no requirements and the responsible person leaving the company soon
- How can a large, Fortran-based number crunching codebase be modernized?
- What are the key points of Working Effectively with Legacy Code?
But notice that some of these questions and answers are almost 10 years old, so I wonder if there are better approaches available now. Also, I am dealing with a procedural scientific computing environment rather than a heavy OOP business app, so perhaps the principles mentioned in the Stack Exchange links above don't carry over as well. I am also not a senior software engineer, so I am not sure I am even using the right terms in my searches and in formulating this question. Finally, the scripts and utilities in the system mean this effort is not just about porting or refactoring Fortran, which is what makes this situation and problem unique.
Thanks!