Test Data Generation

There are many areas of application development and implementation where data plays a significant role. When you’re building and testing your application, you’ll invariably hit the question of where to get good data (and by that, I also mean bad data) in sufficient volume to test both the functionality of your application and its ability to handle your anticipated volume of data in a timely manner.

In this article, we’ll take a look at the challenges faced, and the different techniques that are employed for provisioning data for both development and testing purposes. We’ll also take a look at the auto-generation of test data and how this can help.

There are a number of solutions to acquiring test data including, but not limited to, the following:

  • Use production data
  • Use obfuscated production data
  • Use existing test data
  • Create new test data that is specifically for testing your application
  • Auto-generate test data

You may be looking to use a number of these techniques, depending on where you are in the life cycle of your project.

Production Data

For the most part, the use of production data should be a non-starter.

As convenient as it may seem, there are some good reasons why you should not do this. The exception to this rule may be when your application enters the User Acceptance Testing (UAT) phase, where your users may want to test and accept your product using real data. You should, however, exercise great caution when doing this, so that your data and third parties are protected.

There will, of course, be some strong cases put forward as to why you need to use production data during both your development and testing phases. Data migration may be one of the scenarios where the case for this is strongly argued. Whatever decision you take, you should exercise extreme caution and afford this test data all of the protections that you would give to your real production data.

Reasons not to use Production Data

Here are just some of the reasons why you may want to reconsider your plan to use your production data during your development and testing phases.

  • Data Protection

Your development and test systems are unlikely to come under the same scrutiny or have the same protection as your production systems.

A number of recent and high-profile losses of data by Software as a Service (SaaS) vendors have not been caused by their production servers being hacked. They have been caused by the penetration of lesser-protected systems, which just happened to hold a copy of production data.

  • Data Privacy

Your data is confidential. Should your developers and testers have access to it? Would you grant them the same access to your live production systems?

  • Unforeseen Consequences

What unforeseen consequences may result from your use of production data? Will you inadvertently send test Emails to your customers during the testing of your application’s Email functionality?

Obfuscated Production Data

Obfuscated production data is exactly what it sounds like. It is still your production data; however, it has now been obfuscated, usually by de-personalising the data in some way. Is this good enough? Often not.

It takes a lot of effort to obfuscate data, so this often results in poor obfuscation. Some values may be so lightly obfuscated that you can easily determine the original data, whilst other values may simply be removed, or replaced with garbage. These latter obfuscations may be good for your negative testing; however, they are of little use for anything else.

There are techniques that you can employ to successfully obfuscate production data, and there is software available to help you do this. One of these techniques is to combine production data with auto-generated data. We discuss auto-generated data later in this article.
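As a rough illustration of combining production data with generated data, here is a minimal Python sketch. The record layout (customer_id, name, email, order_total), the name pools and the anonymise_record function are all hypothetical; they are not taken from any particular product. It assumes that only the name and Email fields are personal, which will rarely be true of real data.

```python
import random

# Hypothetical name pools; real pools would be much larger.
FIRST_NAMES = ["Alex", "Sam", "Jordan"]
LAST_NAMES = ["Smith", "Jones", "Taylor"]


def anonymise_record(record, rng):
    """Return a copy of the record with personal fields replaced by generated values."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    masked = dict(record)  # copy, so the original record is left untouched
    masked["name"] = f"{first} {last}"
    # example.com is a domain reserved for documentation and testing.
    masked["email"] = f"{first}.{last}.{record['customer_id']}@example.com".lower()
    return masked


rng = random.Random(1)  # fixed seed so that runs are repeatable
original = {
    "customer_id": 1001,
    "name": "A Real Person",
    "email": "real.person@somewhere.invalid",
    "order_total": 42.50,
}
masked = anonymise_record(original, rng)
```

The non-personal values (here, order_total) keep their production shape, whilst the personal values no longer point at anyone real.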

Existing Test Data

Using existing test data would seem to be your natural choice.

You’re interfacing with a system where your organisation already has a designated test system, with carefully crafted test data that meets all of the use-cases that could be expected.

Unfortunately, people are not very good at creating test data. This usually manifests itself in some or all of the following:

  • Low quality data
  • Low volume data
  • Low entropy
  • Real data (again)

Create new Test Data

So, finally, you’ve come to the conclusion that one of your team has to start manually entering data into a test system; it’s a job that no one wants to do. You need some data for testing your application during your development and testing phases. You’ve agreed with the business that you’ll be using production data for UAT and you’ll take all of the necessary precautions; however, you still need to get through the next few months.

Hopefully, you’ll get some good quality data; however, you’ll be faced with many of the problems that you saw with the organisation’s existing test data.

I’m sure that David Beckham (real data) will be surprised at how many times he has featured in test data sets, along with a host of other celebrities. I have even seen celebrity names used, as test data, in a context that they would be more than disappointed to see.

You will also struggle to see some of your data move through its full life cycle. It is sometimes insufficient to simply enter your data. Data often has a life cycle that extends beyond this initial entry, and it is often difficult to take manually entered data through those later stages.

It will, of course, be an insurmountable task to create anything but the most trivial of data volumes.

Auto-generate Test Data

Auto-generation is an underutilised method for creating data of both high quality and high volume. You can use it both for anonymisation and for generating data in volume.

A simple example of anonymisation is people’s names. If you have a pool of 100 common male first names, 100 female first names and 100 last names, then you can generate up to 20,000 unique names (200 first names, each combined with 100 last names) that can form part of your test data set. You can then use this generated data in isolation, with other generated data, or you could combine it with other data including production data. Remember that, if you’re combining this data with production data, it is likely that there will be other values that you will also need to anonymise.
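The arithmetic behind the name pools can be sketched in a few lines of Python. The pools below are hypothetical and deliberately tiny (six first names and four last names, giving 24 combinations); with pools of the sizes described above, the same code would give 200 x 100 = 20,000. The function names are my own, for illustration only.

```python
import itertools
import random

# Hypothetical, deliberately tiny pools; the text above assumes 100 names in each.
MALE_FIRST_NAMES = ["James", "Oliver", "Thomas"]
FEMALE_FIRST_NAMES = ["Emily", "Sophie", "Grace"]
LAST_NAMES = ["Smith", "Jones", "Taylor", "Brown"]


def all_full_names(first_names, last_names):
    """Return every 'First Last' combination from the two pools."""
    return [f"{first} {last}"
            for first, last in itertools.product(first_names, last_names)]


def sample_full_names(count, seed=None):
    """Pick `count` different full names at random, repeatably if a seed is given."""
    rng = random.Random(seed)
    candidates = all_full_names(MALE_FIRST_NAMES + FEMALE_FIRST_NAMES, LAST_NAMES)
    # random.sample raises ValueError if count exceeds the number of candidates.
    return rng.sample(candidates, count)
```

Calling sample_full_names(10, seed=1) returns ten different names, and the same ten on every run. The "up to" matters: a first name that appears in both the male and the female pool would reduce the number of unique combinations.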

There is a difference between generating data that appears realistic, as opposed to generating data that is realistic.

With people’s names, for example, you will always be able to generate realistic names; but how would this work for other values such as Email Addresses, Postal Addresses and Telephone Numbers? One important factor is that your generated values may need to pass your application’s input validation, so, if you’re generating telephone numbers, then they should have the appearance of telephone numbers.
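To illustrate values that have the right appearance, here is a small Python sketch that generates telephone numbers and Email addresses and checks them against validation rules. The two regular expressions are hypothetical stand-ins for your own application’s validation. The 07700 900 prefix is, to my understanding, a block of UK mobile numbers set aside for use in drama, so please verify that before relying on it; example.com is a domain reserved for documentation and testing.

```python
import random
import re

# Hypothetical validation rules standing in for your application's own.
PHONE_PATTERN = re.compile(r"^07\d{3} \d{6}$")
EMAIL_PATTERN = re.compile(r"^[a-z0-9.]+@[a-z0-9.-]+\.[a-z]{2,}$")


def generate_phone(rng):
    """Return a UK-mobile-shaped number in the 07700 900xxx block."""
    return f"07700 900{rng.randrange(1000):03d}"


def generate_email(first_name, last_name, rng):
    """Build a lower-case address at example.com with a numeric suffix."""
    suffix = rng.randrange(100)
    return f"{first_name}.{last_name}{suffix}@example.com".lower()


rng = random.Random(42)  # fixed seed so that runs are repeatable
phone = generate_phone(rng)
email = generate_email("Emily", "Smith", rng)

# Both values should pass the (hypothetical) input validation.
phone_is_valid = PHONE_PATTERN.match(phone) is not None
email_is_valid = EMAIL_PATTERN.match(email) is not None
```

Generating into ranges that are reserved for fictional or test use is one way to reduce the chance that a generated value belongs to a real person.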

Sometimes you will, of course, generate values that are, coincidentally, valid. As they say in films: “The story, all names, characters, and incidents portrayed in this production are fictitious. No identification with actual persons, places, buildings, and products is intended or should be inferred”.

Other data types may be generated, including dates and numbers. You may want to specify upper and lower limits, and also provide algorithms should you require data hotspots. If you have a customer database whose customers are typically aged between 25 and 40, for example, you may choose to weight your generated dates of birth towards that range.
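One simple way to produce such a hotspot is to draw most values from the favoured range and the remainder from the full range. The Python sketch below does this for dates of birth. The limits (ages 18 to 90), the 70% share and the function name are all assumptions chosen for illustration, and ages are approximated as 365-day years, so leap days are ignored.

```python
import datetime
import random


def generate_date_of_birth(rng, today, min_age=18, max_age=90,
                           hotspot=(25, 40), hotspot_share=0.7):
    """Return a date of birth, favouring ages inside the hotspot range."""
    if rng.random() < hotspot_share:
        low, high = hotspot          # draw from the favoured age range
    else:
        low, high = min_age, max_age  # draw from the full permitted range
    # Work in days so that birthdays are spread across the whole year.
    age_in_days = rng.randint(low * 365, high * 365)
    return today - datetime.timedelta(days=age_in_days)


rng = random.Random(7)  # fixed seed so that runs are repeatable
today = datetime.date(2024, 1, 1)
dates_of_birth = [generate_date_of_birth(rng, today) for _ in range(2000)]
```

Because the full range also contains the hotspot, slightly more than 70% of the generated ages will fall between 25 and 40. A weighted or normal distribution would be an equally reasonable choice if you need a smoother shape.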

You may have dependencies in your generated data. If the Country is “USA”, then you may want the State to be from the list of real State names.
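A dependency like this can be handled by generating the parent value first and then choosing the dependent value from a lookup keyed on it. The following Python sketch shows the idea; the lookup table is hypothetical and heavily abbreviated, and the field names are my own.

```python
import random

# Hypothetical, abbreviated lookup; a real data set would list every region.
REGIONS_BY_COUNTRY = {
    "USA": ["California", "New York", "Texas", "Washington"],
    "Canada": ["Alberta", "Ontario", "Quebec"],
    "Australia": ["New South Wales", "Queensland", "Victoria"],
}


def generate_address_fields(rng):
    """Pick a country first, then a region that belongs to that country."""
    country = rng.choice(sorted(REGIONS_BY_COUNTRY))
    region = rng.choice(REGIONS_BY_COUNTRY[country])
    return {"country": country, "region": region}


rng = random.Random(3)  # fixed seed so that runs are repeatable
addresses = [generate_address_fields(rng) for _ in range(5)]
```

Every generated record pairs a country with one of its own regions, so a record such as USA with Ontario cannot occur unless you introduce it deliberately as invalid data.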

You may choose to introduce a percentage of invalid data: either missing data, or data that is of an invalid format.
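As a sketch of how a percentage of invalid data might be introduced, the Python below corrupts roughly one in ten Email addresses, either by blanking the value or by removing the @ sign. The function names, the 10% share and the two kinds of corruption are assumptions for illustration; your own negative tests will dictate which faults are worth generating.

```python
import random


def corrupt_email(value, rng):
    """Return either a missing value or a badly formatted one."""
    if rng.random() < 0.5:
        return ""                   # missing data
    return value.replace("@", "")   # invalid format: no @ sign


def with_invalid_share(values, invalid_share, rng):
    """Corrupt roughly `invalid_share` of the values, leaving the rest intact."""
    return [corrupt_email(value, rng) if rng.random() < invalid_share else value
            for value in values]


rng = random.Random(5)  # fixed seed so that runs are repeatable
valid_emails = [f"user{i}@example.com" for i in range(1000)]
mixed_emails = with_invalid_share(valid_emails, 0.1, rng)
invalid_count = sum("@" not in email for email in mixed_emails)
```

The share is approximate, because each value is corrupted independently; if you need exactly 10%, you could instead pick the positions to corrupt with random.sample.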

All of these choices are dependent on your own specific testing requirements.