Issue dated - 26th July 2004

-


Previous Issues

CURRENT ISSUE
INDIA NEWS
NEWSANALYSIS
COLUMNS
TECH FORUM

THE C# COLUMN

BETWEEN THE BYTES
TECHNOLOGY
SPECIALS <NEW>
Symantec Report
Security Headquarters
JobsDB
MINDPRINTS
HMA BANKBIZ
EC SERVICES
ARCHIVES/SEARCH
IT APPOINTMENTS
Openings At Jobstreet.com
WRITE TO US
SUBSCRIBE/RENEW
CUSTOMER SERVICE
ADVERTISE
ABOUT US

 Network Sites
  IT People
  Network Magazine
  Business Traveller
  Exp. Hotelier & Caterer
  Exp. Travel & Tourism
  Exp. Pharma Pulse
  Exp. Healthcare Mgmt.
  Express Textile
 Group Sites
  ExpressIndia
  Indian Express
  Financial Express

 
Front Page > TechSpace > Story Print this Page|  Email this page

Tech Forum

Generating Test Data

Article summary

When you make a new application, there is no data in it. Therefore testing can not be preformed with realistic data volumes. Due to this, the initial rollout may succeed, but as data accumulates the system performance degrades. Creating realistic test data is a difficult task. This article shows you how to create large amount of test data easily using a tool not ordinarily used for this purpose – MS Excel.

In a new project, there is no past data. The database design is freshly done. During the testing phase, there is hardly any data available. It is commonplace to put few records to make sure that the UI and report code is running accurately. Thus testing in the pre-rollout phase is done more for checking functional adequacy than testing the actual database performance.

When do you test performance? It is often never done before rollout. The real test happens in a production environment when sufficient data accumulates. Till then everything is going fine. But as data accumulates (and user load also increases) the performance starts degrading steadily.

At this stage there is very little modification possible to the data design, application architecture or code. Thus we end up tweaking some external configuration, indexes and so on. Finally if everything else fails, we upgrade the hardware. Sounds familiar?

Now let us look at why this happened? Simply because there was no simple way available during the initial testing phase to generate random, but real life-like data of required volume.

Here is a very effective way of achieving this functionality.

The process

What do we need here?

1. A way of generating large amounts of data.

2. Database contains tables and tables contain fields.

3. Tables are related to each other.

4. Each field has a different datatype.

5. Each field will have some additional characteristics. For example, Age can be between 1 to 120 or Octroi can be between 0 to 5, while both datatypes are same, i.e. Integer.

6. Therefore, we need to generate random data field by field for each relevant table.

7. We need a temporary work area where these data points can be generated, validated and stored.

8. Now we need a mechanism to upload the information into the base table (which is currently empty)

9. We may need to repeat the process to try out different values or to append more values to generate larger and larger amount of data.

Now let us see what happens where. The purpose of this article is not to develop a fully automated utility. The article aims at providing a step-by-step process to demonstrate how large amounts of realistic test data can be generated quickly.

Third-party tools

There are a lot of third-party tools to generate random numbers. Tools for generating test data for databases are fewer. The best one I knew, was available as a part of Erwin Logic Works set products. However, now CA seems to have discontinued it. In any case each third-party tool costs money. This article is free.

Why use Excel?

There are many reasons for this.

1. Excel is almost always available. So you don’t have to procure another tool.

2. Excel has a comprehensive random number generator. Yes it is Pseudo-Random, but it serves our purpose. It is flexible enough to provide most of the functionality needed to create randomised data in large volumes.

3. Random number generators generate only numbers. Data in a database also has other types like dates, characters, special codes and so on. Excel has enough functions to make use of base random numbers and then utilise them to convert these to other data types.

Excel Add-ins

Before we use some of the functions in Excel you have to enable some Add-ins in it.

- Choose Tools – Add-ins in Excel.

- Select Analysis ToolPak and Analysis ToolPak VBA and choose Ok.

Now the required functions are available in Excel.

Just check that the Tools menu has a new menu called Data Analysis and this option leads to another dialog that contains Random Number Generator.

Rand() function

This is a simple function to use.

Type =rand() in any cell and see the results. It returns a number between 1 and 0. It is always a decimal number. If you press F9 to recalculate the Excel sheet, the number changes. More on this later.

The bottom-line is that this function returns a random number everytime it is called. Do you always require numbers between 0 and 1? Obviously not. That is where the next function will be usable.

Randbetween() function

Randbetween function is a more usable version. It still uses the rand() function internally. You can pass the low and high limit to the randbetween() function and it will return a random number between the low and high limits. Every time you refresh it, a new number is shown. This is very useful for quickly creating random data between a known range.

So let us say you had a column called quantity and you wanted 100 rows with the Quantity value randomly varying between 1000 and 5000. Here is how you manage this.

- Choose the starting point of the data. Let us say B10.

- Type this formula in B10

=randbetween(1000,5000)

- Now a random value is shown there.

- Drag this formula (or cut paste it) to rows from B11 to B109

- That’s it. You have a list of 100 random values between 1000 and 5000

- Of course this is only one field and thus has limited usage.

- Press F9 and observe, ALL the 100 values will change. Why is this happening?

Recalculation in Excel

Excel is designed for multiple formulas and cells related to each other. Therefore, whenever anything changes (data or formula), Excel tries to refresh all formulas by recalculating them. Most formulas when recalculated lead to the same result, unless the original values are changed.

In case of random number generation functions, each recalculation results in a different value. Therefore, everytime you force a recalculation (by pressing F9) or when you change anything in Excel which causes it to recalculate, the values change.

If you don’t want the values to change, you have to copy paste the content of these formula results as values. I am not recommending it here because we are going to make use of the automatic generation capabilities.

If you have large numbers generated using such random functions, Excel may take substantial time to recalculate the results of all random functions. This may become cumbersome. If you want to avoid this problem, simply turn off Automatic Recalculation of Excel. Manual recalculation is invoked by F9 key. To turn Automatic recalculation off, choose Tools – Options – Calculation Tab and select the option Manual.

Vlookup function

So far we are only talking about numbers. What about other data types?

Let us assume an example. You manufacture six products.

1. Pencil

2. Paper

3. Eraser

4. Notepad

5. Glue

6. Pen

Now, you want to create test data for a sales application. You expect there to be 10000 rows in the table. Each row represents a product. Now, what do you need? You need to create 10000 rows containing one of the 6 products in a random manner.

It is easy to create 10000 random numbers between 1 to 6. But, how do we translate these numbers into the product name?

That is where the Vlookup function comes into picture. First we will use the RandBetween() function to generate 10000 values between 1 and 6 and then use Vlookup function to translate each value to the actual product name.

First let us put the base data.

Please note that the data is entered from E5 to F10 (excluding the column heading)

Now, in another location, make a list of random numbers between 1 and 6. Let us say we put the RandBetween(1,6) function for 10 cells from A1 to A10.

Now what do we have — just a list of 10 numbers between 1 and 6. Now we want to translate these into corresponding product names. How do we do that?

This is where the Vlookup function comes into the picture.

Go to A2 (next to the first random value) and type the function

=vlookup(A1, $E$5:$f$10, 2)

What does this syntax mean?

The first argument A1 indicates the value to be looked up. The lookup table is what we created earlier. It contains the number in first column and product name in second column. For Vlookup to work, the first column must be sorted (which is already so in our case).

The second argument is the range containing the entire lookup table. Please note the $ signs for row and columns. This ensures that when you copy paste the formula, Excel does not change the location of the lookup table – we have the same lookup table, is it not?

The third argument specifies which column value to return from the lookup table. We are searching by the number. So what is the point in returning the number? We obviously want the product name. Therefore, the value 2 appears indicating that we are interested in the second column.

Of course the lookup table could have had many more columns (the real product-id for example).

And now, we have the correct product name appearing in A2.

Copy and paste the formula for the remaining 9 rows.

What do you have now? All the random numbers are now translated into product names.

Now here is the best part. Press F9 and another set of random numbers gets created. The product names change automatically. Smart and fast. Is it not?

Now just for completing the scenario, add another formula in the third column. Let us say, these were the quantities of each product sold. Let us use the formula

=randbetween(100, 1000)

Here is a complete table with random values.

Now you can paste these the formulas for as many rows as you like. Great is it not?

Excel has a limitation of 65000 rows. So what next? Start another column!

Coming up…

This was just the starting point. How do you put this data in the required table? Watch out for more on this in the next article.

About the Author:Dr Nitin Paranjape is the Chairman and MD of Maestros (Mediline). He is a consultant with many organisations, covering appropriate technology utilisation, business application of relevant technology, application architecture and audit as well as knowledge transfer. He has authored more than 650 articles on various technology-related subjects. He can be contacted at nitin@mediline.co.in
<Back to top>


© Copyright 2003: Indian Express Group (Mumbai, India). All rights reserved throughout the world. This entire site is compiled in
Mumbai by The Business Publications Division of the Indian Express Group of Newspapers.
Please contact our Webmaster for any queries on this site.