It always surprises me how little use is made of the simple but effective concept of Application Checkpointing.
Checkpointing is a simple technique that allows you to add fault tolerance to your application by saving a snapshot of the application’s state, allowing it to restart from a checkpoint should it fail. This is extremely important for a long-running process, especially if it is operating within a fragile environment.
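As a minimal sketch, checkpointing can be as simple as persisting a small state dictionary to disk and reading it back on restart. The file name, JSON format, and function names here are illustrative, not a prescription:

```python
import json
import os

CHECKPOINT_FILE = "checkpoint.json"  # illustrative location

def save_checkpoint(state):
    """Persist a snapshot of the application's state."""
    # Write to a temporary file first, then rename it into place,
    # so a crash mid-write cannot corrupt an existing checkpoint.
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT_FILE)

def load_checkpoint():
    """Return the last saved state, or None on a fresh start."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    return None
```

On start-up, the application calls `load_checkpoint()`: a `None` result means a fresh run, anything else tells it where to resume.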
Checkpointing does not have to be used in isolation; when used with other techniques, it is a powerful tool for improving the resilience of your application.
I always use application checkpointing when I’m working on a data migration or a similar type of batch process.
In a typical data migration, we will be migrating a number of related objects, with a parent-child relationship. This migration may be between relational databases, SaaS vendors such as Salesforce, or a combination of these and other disparate data sources.
The following list shows a simple, yet typical, example of the objects that may make up a data migration:

- Customer
- Invoice
Due to the volume of data and other constraints, it is typical for a data migration to process data object-by-object, rather than processing the data in a transactional manner.
This process usually involves the reading of Customer data from the source system, validation and restructuring of the data, and then loading the Customer data to the target system. The migration process may then move on to the next object, Invoice, and so on. In the real world, of course, this process may be more complex than the example shown here.
Checkpointing allows you to place a marker once you have successfully completed Customer. Should you encounter an issue while processing Invoice, your data migration process will be able to skip Customer and resume processing Invoice once the underlying issue has been resolved.
The processing of Invoice could, of course, be a complex process in its own right and this, too, may be broken down into a number of checkpoints. There is no reason why you should not have a hierarchy of checkpoints, allowing your application to always resume from a sensible recovery point.
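The object-by-object skip-and-resume behaviour described above can be sketched as a loop over named migration steps, recording each completed step in a checkpoint file. The file name and step names are illustrative:

```python
import json
import os

CHECKPOINT_FILE = "migration_checkpoint.json"  # illustrative location

def load_completed():
    """Return the set of step names already completed."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return set(json.load(f))
    return set()

def mark_completed(name, completed):
    """Record that a step finished, so a re-run will skip it."""
    completed.add(name)
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump(sorted(completed), f)

def run_migration(steps):
    """Run (name, task) pairs in order, skipping completed ones."""
    completed = load_completed()
    for name, task in steps:
        if name in completed:
            continue  # this object was migrated on a previous run
        task()
        mark_completed(name, completed)
```

If `migrate_customer` succeeds but `migrate_invoice` fails, the next run of `run_migration([("Customer", migrate_customer), ("Invoice", migrate_invoice)])` resumes directly at Invoice.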
Recovering from a Checkpoint Failure
In an ideal world, you can re-run from a checkpoint with impunity: you’ve corrected the underlying issue, and when you restart your application it will just do the right thing. You may, of course, have to do some work to make this happen.
For some checkpointed tasks, you will be able to re-run the entire task as though it had never been previously executed. If, for example, your checkpointed task is to convert an Excel spreadsheet to a CSV file, then you would probably have no issue with your task overwriting any previously (partially) created file.
If you’re writing to a database, then you may be in a position where some of your data has already been committed to the database.
When inserting into a database, a common technique is to upsert (sometimes referred to as auto-correct load).
Using this load mechanism, no assumption is made as to whether or not a row with a known key already exists. The loader will insert the record if it does not exist or update it if it does. This is helpful during your recovery from a previous load failure, since you can simply replay all of your upserts. You should be mindful of any consequence of updating a row that you have previously inserted; there may, for example, be update triggers that cause unexpected results. In many scenarios, upserts ensure that your recovery is simple and robust.
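As a concrete sketch of replayable upserts, here is SQLite’s `INSERT ... ON CONFLICT DO UPDATE` form via Python’s built-in `sqlite3` module; the table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")

def upsert_customer(cust_id, name):
    # Insert the row if the key is new, otherwise update it in place,
    # so replaying this after a failed load is harmless.
    conn.execute(
        """INSERT INTO customer (id, name) VALUES (?, ?)
           ON CONFLICT(id) DO UPDATE SET name = excluded.name""",
        (cust_id, name),
    )

upsert_customer(1, "Acme Ltd")
upsert_customer(1, "Acme Limited")  # a replay updates rather than duplicating
```

Most relational databases offer an equivalent (e.g. `MERGE`, or `INSERT ... ON DUPLICATE KEY UPDATE` in MySQL), and SaaS APIs such as Salesforce expose upsert operations keyed on an external ID.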
When using checkpoints, you need to ensure that your application is able to detect when it is recovering from a failed checkpoint. This will allow you to run recovery code that is specifically designed to roll back to the start of the checkpoint, should you need to, then allowing you to re-run the entire checkpointed logic from the beginning.
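One simple way to detect a failed checkpoint (the marker-file scheme here is illustrative) is to write a “started” marker before a task runs and promote it to “done” afterwards; a leftover “started” marker on restart means the previous attempt failed mid-task, so rollback code should run first:

```python
import os

def run_checkpointed(name, task, rollback):
    """Run a task once, rolling back a previously failed attempt."""
    done_marker = name + ".done"        # illustrative marker files
    started_marker = name + ".started"
    if os.path.exists(done_marker):
        return                          # already completed: skip
    if os.path.exists(started_marker):
        rollback()                      # previous run failed mid-task
    else:
        open(started_marker, "w").close()
    task()
    # Atomically promote the marker from "started" to "done".
    os.replace(started_marker, done_marker)
```

The `rollback` callable is where you undo any partial work, for example by deleting partially loaded rows, before the task is re-run from the beginning.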
Data consistency will already have been a concern during the design phase of your application, particularly when you are not processing data in a transactional manner. Bear it in mind when allowing your application to recover from a failed checkpoint too, as any delay in restarting your application may also need to be accounted for.
The first time that you make checkpointing an integral part of your application, you will quickly come to realise its usefulness and importance.
Each time you test the early builds of your application, it is likely that you will experience application failure. You will correct your code, or other underlying issues, and then watch your application correctly recover from the error: a considerable benefit over having to deal with an application that has failed in an uncontrolled manner, where recovery is difficult and time consuming.