|
Tech Forum
Generating Test Data
Article summary
When
you make a new application, there is no data in it. Therefore testing can not
be preformed with realistic data volumes. Due to this, the initial rollout may
succeed, but as data accumulates the system performance degrades. Creating realistic
test data is a difficult task. This article shows you how to create large amount
of test data easily using a tool not ordinarily used for this purpose
MS Excel.
In a new project, there is no past data. The database design is freshly done.
During the testing phase, there is hardly any data available. It is commonplace
to put few records to make sure that the UI and report code is running accurately.
Thus testing in the pre-rollout phase is done more for checking functional adequacy
than testing the actual database performance.
When do you test performance? It is often never done before rollout. The real
test happens in a production environment when sufficient data accumulates. Till
then everything is going fine. But as data accumulates (and user load also increases)
the performance starts degrading steadily.
At this stage there is very little modification possible to the data design,
application architecture or code. Thus we end up tweaking some external configuration,
indexes and so on. Finally if everything else fails, we upgrade the hardware.
Sounds familiar?
Now let us look at why this happened? Simply because there was no simple way
available during the initial testing phase to generate random, but real life-like
data of required volume.
Here is a very effective way of achieving this functionality.
The process
What do we need here?
1. A way of generating large amounts of data.
2. Database contains tables and tables contain fields.
3. Tables are related to each other.
4. Each field has a different datatype.
5. Each field will have some additional characteristics. For example, Age can
be between 1 to 120 or Octroi can be between 0 to 5, while both datatypes are
same, i.e. Integer.
6. Therefore, we need to generate random data field by field for each relevant
table.
7. We need a temporary work area where these data points can be generated, validated
and stored.
8. Now we need a mechanism to upload the information into the base table (which
is currently empty)
9. We may need to repeat the process to try out different values or to append
more values to generate larger and larger amount of data.
Now let us see what happens where. The purpose of this article is not to develop
a fully automated utility. The article aims at providing a step-by-step process
to demonstrate how large amounts of realistic test data can be generated quickly.
Third-party tools
There are a lot of third-party tools to generate random numbers. Tools for generating
test data for databases are fewer. The best one I knew, was available as a part
of Erwin Logic Works set products. However, now CA seems to have discontinued
it. In any case each third-party tool costs money. This article is free.
Why use Excel?
There are
many reasons for this.
1. Excel is almost always available. So you dont have to procure another
tool.
2. Excel has a comprehensive random number generator. Yes it is Pseudo-Random,
but it serves our purpose. It is flexible enough to provide most of the functionality
needed to create randomised data in large volumes.
3. Random number generators generate only numbers. Data in a database also has
other types like dates, characters, special codes and so on. Excel has enough
functions to make use of base random numbers and then utilise them to convert
these to other data types.
Excel Add-ins
Before we use some of the functions in Excel you have to enable some Add-ins
in it.
- Choose Tools Add-ins in Excel.
- Select Analysis ToolPak and Analysis ToolPak VBA and choose Ok.
Now the required functions are available in Excel.
Just check that the Tools menu has a new menu called Data Analysis and this
option leads to another dialog that contains Random Number Generator.
Rand() function
This is a simple function to use.
Type =rand() in any cell and see the results. It returns a number between 1
and 0. It is always a decimal number. If you press F9 to recalculate the Excel
sheet, the number changes. More on this later.
The bottom-line is that this function returns a random number everytime it is
called. Do you always require numbers between 0 and 1? Obviously not. That is
where the next function will be usable.
Randbetween() function
Randbetween function is a more usable version. It still uses the rand() function
internally. You can pass the low and high limit to the randbetween() function
and it will return a random number between the low and high limits. Every time
you refresh it, a new number is shown. This is very useful for quickly creating
random data between a known range.
So let us say you had a column called quantity and you wanted 100 rows with
the Quantity value randomly varying between 1000 and 5000. Here is how you manage
this.
- Choose the starting point of the data. Let us say B10.
- Type this formula in B10
=randbetween(1000,5000)
- Now a random value is shown there.
- Drag this formula (or cut paste it) to rows from B11 to B109
- Thats it. You have a list of 100 random values between 1000 and 5000
- Of course this is only one field and thus has limited usage.
- Press F9 and observe, ALL the 100 values will change. Why is this happening?
Recalculation in Excel
Excel is designed for multiple formulas and cells related to each other. Therefore,
whenever anything changes (data or formula), Excel tries to refresh all formulas
by recalculating them. Most formulas when recalculated lead to the same result,
unless the original values are changed.
In case of
random number generation functions, each recalculation results in a different
value. Therefore, everytime you force a recalculation (by pressing F9) or when
you change anything in Excel which causes it to recalculate, the values change.
If you dont want the values to change, you have to copy paste the content
of these formula results as values. I am not recommending it here because we
are going to make use of the automatic generation capabilities.
If you have large numbers generated using such random functions, Excel may take
substantial time to recalculate the results of all random functions. This may
become cumbersome. If you want to avoid this problem, simply turn off Automatic
Recalculation of Excel. Manual recalculation is invoked by F9 key. To turn Automatic
recalculation off, choose Tools Options Calculation Tab and select
the option Manual.
Vlookup function
So far we are only talking about numbers. What about other data types?
Let us assume an example. You manufacture six products.
1. Pencil
2. Paper
3. Eraser
4. Notepad
5. Glue
6. Pen
Now, you want
to create test data for a sales application. You expect there to be 10000 rows
in the table. Each row represents a product. Now, what do you need? You need
to create 10000 rows containing one of the 6 products in a random manner.
It is easy to create 10000 random numbers between 1 to 6. But, how do we translate
these numbers into the product name?
That is where the Vlookup function comes into picture. First we will use the
RandBetween() function to generate 10000 values between 1 and 6 and then use
Vlookup function to translate each value to the actual product name.
First let us put the base data.
Please note that the data is entered from E5 to F10 (excluding the column heading)
Now, in another location, make a list of random numbers between 1 and 6. Let
us say we put the RandBetween(1,6) function for 10 cells from A1 to A10.
Now what do we have just a list of 10 numbers between 1 and 6. Now we
want to translate these into corresponding product names. How do we do that?
This is where the Vlookup function comes into the picture.
Go to A2 (next to the first random value) and type the function
=vlookup(A1, $E$5:$f$10, 2)
What does this syntax mean?
The first argument A1 indicates the value to be looked up. The lookup table
is what we created earlier. It contains the number in first column and product
name in second column. For Vlookup to work, the first column must be sorted
(which is already so in our case).
The second argument is the range containing the entire lookup table. Please
note the $ signs for row and columns. This ensures that when you copy paste
the formula, Excel does not change the location of the lookup table we
have the same lookup table, is it not?
The third argument specifies which column value to return from the lookup table.
We are searching by the number. So what is the point in returning the number?
We obviously want the product name. Therefore, the value 2 appears indicating
that we are interested in the second column.
Of course the lookup table could have had many more columns (the real product-id
for example).
And now, we have the correct product name appearing in A2.
Copy and paste the formula for the remaining 9 rows.
What do you have now? All the random numbers are now translated into product
names.
Now here is the best part. Press F9 and another set of random numbers gets created.
The product names change automatically. Smart and fast. Is it not?
Now just for
completing the scenario, add another formula in the third column. Let us say,
these were the quantities of each product sold. Let us use the formula
=randbetween(100, 1000)
Here is a complete table with random values.
Now you can paste these the formulas for as many rows as you like. Great is
it not?
Excel has a limitation of 65000 rows. So what next? Start another column!
Coming up
This was just the starting point. How do you put this data
in the required table? Watch out for more on this in the next article.
 |
About
the Author:Dr Nitin Paranjape is the Chairman and MD of Maestros (Mediline).
He is a consultant with many organisations, covering appropriate technology
utilisation, business application of relevant technology, application architecture
and audit as well as knowledge transfer. He has authored more than 650 articles
on various technology-related subjects. He can be contacted at nitin@mediline.co.in |
|