Issue dated - 2nd August 2004

-


Previous Issues

CURRENT ISSUE
INDIA NEWS
NEWSANALYSIS
COLUMNS
TECH FORUM

THE C# COLUMN

BETWEEN THE BYTES
TECHNOLOGY
SPECIALS <NEW>
Symantec Report
Security Headquarters
JobsDB
MINDPRINTS
HMA BANKBIZ
EC SERVICES
ARCHIVES/SEARCH
IT APPOINTMENTS
Openings At Jobstreet.com
WRITE TO US
SUBSCRIBE/RENEW
CUSTOMER SERVICE
ADVERTISE
ABOUT US

 Network Sites
  IT People
  Network Magazine
  Business Traveller
  Exp. Hotelier & Caterer
  Exp. Travel & Tourism
  Exp. Pharma Pulse
  Exp. Healthcare Mgmt.
  Express Textile
 Group Sites
  ExpressIndia
  Indian Express
  Financial Express

 
Front Page > TechSpace > Story Print this Page|  Email this page

Tech Forum

Generating Test Data – II

Article summary

In last 0article, we saw how to create randomised test data of ‘numeric’ and ‘character’ types. This article explains how to create ‘data’ type of data. We also discuss controlling the distribution of data by specifying a pattern, and finally how to insert the data into a database to generate large volumes of data.

Generating dates

Primarily we use numbers, characters and dates for storing business data. We have already seen how to generate numbers within a range and how to generate randomised series of text values like product names or any other text oriented master data.

Now let us see how we can handle dates. There is no random function available for dates. But some smart thinking will give us the solution.

When we want to generate dates, what do we want really? Generally you will want random transaction dates between a start and end date. Is it not?

So let us see how to do this.

As an example, we will generate dates between 1st April 2004 and 23rd May 2004.

The approach is simple.

1. Subtract the start date from end date and find the number of elapsed days.

2. The difference is 52 days.

3. Now, generate random numbers between 1 and 52 first.

4. This is done using the our RANDBETWEEN() function.

5. Now in the next column add the start date to each of the randomly generated numbers and you immediately have the dates randomised exactly as we wanted.

Now, all dates have a time component also. If you apply the right format to these cells containing dates, you will realise that each date has the same time recorded.

23-May-2004 12:00 AM

Now, let us say we want to generate different time duration also, in addition to random dates. To complicate matters further, we have to only generate time values during office hours – 9.30 am to 5.30 pm. In order to make it realistic, the time values should be randomly changed at a ‘minute-level’ granularity. Hour-level would be too gross an approximation and seconds-level would be unnecessarily detailed. How do you achieve this?

To understand this concept we have to first understand how Excel stores dates. All date-time values are stored as numbers internally.

All that you have to do is to change the format of the cell containing the above date to General (Format, Cells, Numbers)

Now, the same value is shown as:

23-May-2004 12:03:19 AM 38130.0023

The integer value 38130 stores the date as number of days elapsed between 1st January 1900. Time value is shown in the fraction. What does this indicate? Time counting starts within a day from 12:00 am or midnight. If you look at the number value of the 12:00:00 am value, it will be shown as 38130 without a fraction.

Now, we want to find out the number of minutes since midnight for 9.30 am and 5.30 pm. To find out the number of minutes between two given date-time values, you use the formula:

What is this formula? When you subtract two date-time values, it gives you the difference. In this case, the day is same only time is different. So the difference will be in a decimal fraction. This is to be multiplied by minutes in a day to calculate how many minutes elapsed between these two date time values. That is why we are multiplying the fraction with 24 (hours) and 60 ( minutes per hour). Finally we don’t want fractional minute values, so we use the int function.

So we now know that 9.30 am means 570 minutes since midnight and 5.30 pm means 1049 minutes since midnight.

We can add hours, minutes or seconds to a given date-time value, without worrying about how it is managed internally. For example, let us add 133 minutes to the datetime value 23-May-2004 12:00 AM. There is a very useful and nice function in Excel called TIME(). It takes three parameters – hours, minutes and seconds.

In this example, we started off with the midnight date-time value 23-May-2004 12:00 am. We wanted to add 133 minutes to this value and get the resulting datetime value. In this example, we only are adding minutes. Therefore in the TIME function we keep the hour and second parameters empty.

The TIME function understands that the base value is date-time, the value being added is a number and also that this number is to be treated as minutes. Finally it gives you the proper datetime value which is 133 minutes from midnight.

Now we have all the knowledge required to generate random datetime values between 9.30 am to 5.30 pm on a given day.

Here are the steps.

1. Start with a date-time value for a given day which is a typical midnight value. Let us start with our same date-time, 23-May-2004 12:00 am.

2. Now we want to add minutes to these which will create a date-time value (at minute level resolution) between 9.30 am to 5.30 pm. We already know these time values actually should be between 570 and 1049 minutes since midnight.

3. Now we use our good old RANDBETWEEN(570, 1049) function to create random minute values in this range.

4. We then add these minutes to the base midnight date time value.

5. That’s it. You have real-life like randomised date-time values at minute level resolution.

Here is how the data would be. I have added this to our base exercise of generating random base date values (with 12:00 AM or midnight default time value).

Controlling the distribution of random values

This is a special case. In all above examples, we have been using the RANDBETWEEN() function. This way, all the values are randomly generated, between two numbers. Suppose you generate values between 1 and 5 randomly, after say 1000 random values are generated, you will observe that each value is generated approximately 25% of time.

What to do when you have to generate the some specific values randomly with a different distribution?

This is done using the Random Number Generation tool within the Data Analysis Add-in in Excel. You enable this add in exactly the same way as you added the earlier add-in in last article.

First you need two columns of data, first column containing the values and the second column contains the distribution weight you want to apply. The sum of all values in the second column should be equal to 1. The decimal values indicate what percentage of random values should be created for each of the base values.

Values Weight
1 0.05
2 0.15
3 0.3
4 0.2
5 0.3

Once you have this table ready, you can choose Tools – Data Analysis from the Excel menu. Within that choose the option, Random Number Generation. Now configure the dialog as follows:

- Number of random numbers: 100

- Distribution: Discrete

- Value and probability input range: Select the above two columns (without column headings).

- Output range, select an empty cell.

- Now click Ok.

- 100 random numbers will be generated with the specified distribution percentage.

This is a very powerful method but can still be used easily and effectively even by end users.

Uploading data to databases

Once you have all the data in Excel, using one or more of the methods shown above, we need this data to be uploaded to a database. This is actually managed easier if we manage this from the database side. To do so, follow these instructions. The assumption is that the database contains the target table containing the appropriate fields and field types.

1. Open Excel. Let us assume that there are three columns containing randomised data which correspond to the three fields of the base table in the database.

2. Select the entire block containing the three columns and all rows.

3. Choose, File – Save As.

4. Save the file as comma delimited file (CSV).

5. From your database use an ETL Tool or a bulk loading tool to upload the file containing real-life like data into the base table.

6. That’s it.

More possibilities

These examples are just that—examples. Apply your mind and you will be able to create realistic data of any volume you desire very easily.

About the Author:Dr Nitin Paranjape is the Chairman and MD of Maestros (Mediline). He is a consultant with many organisations, covering appropriate technology utilisation, business application of relevant technology, application architecture and audit as well as knowledge transfer. He has authored more than 650 articles on various technology-related subjects. He can be contacted at nitin@mediline.co.in
<Back to top>


© Copyright 2003: Indian Express Group (Mumbai, India). All rights reserved throughout the world. This entire site is compiled in
Mumbai by The Business Publications Division of the Indian Express Group of Newspapers.
Please contact our Webmaster for any queries on this site.