|
Tech Forum
Generating Test Data II
Article summary
In
last 0article, we saw how to create randomised test data of numeric
and character types. This article explains how to create data
type of data. We also discuss controlling the distribution of data by specifying
a pattern, and finally how to insert the data into a database to generate large
volumes of data.
Generating dates
Primarily we use numbers, characters and dates for storing business data. We
have already seen how to generate numbers within a range and how to generate
randomised series of text values like product names or any other text oriented
master data.
Now let us see how we can handle dates. There is no random function available
for dates. But some smart thinking will give us the solution.
When we want to generate dates, what do we want really? Generally you will want
random transaction dates between a start and end date. Is it not?
So let us see how to do this.
As an example, we will generate dates between 1st April 2004 and 23rd May 2004.
The approach is simple.
1. Subtract the start date from end date and find the number of elapsed days.
2. The difference is 52 days.
3. Now, generate random numbers between 1 and 52 first.
4. This is done using the our RANDBETWEEN() function.
5. Now in the next column add the start date to each of the randomly generated
numbers and you immediately have the dates randomised exactly as we wanted.
Now,
all dates have a time component also. If you apply the right format to these
cells containing dates, you will realise that each date has the same time recorded.
23-May-2004 12:00 AM
Now, let us say we want to generate different time duration also, in addition
to random dates. To complicate matters further, we have to only generate time
values during office hours 9.30 am to 5.30 pm. In order to make it realistic,
the time values should be randomly changed at a minute-level granularity.
Hour-level would be too gross an approximation and seconds-level would be unnecessarily
detailed. How do you achieve this?
To understand this concept we have to first understand how Excel stores dates.
All date-time values are stored as numbers internally.
All that you have to do is to change the format of the cell containing the above
date to General (Format, Cells, Numbers)
Now, the same value is shown as:
23-May-2004 12:03:19 AM 38130.0023
The integer value 38130 stores the date as number of days elapsed between 1st
January 1900. Time value is shown in the fraction. What does this indicate?
Time counting starts within a day from 12:00 am or midnight. If you look at
the number value of the 12:00:00 am value, it will be shown as 38130 without
a fraction.
Now, we want to find out the number of minutes since midnight for 9.30 am and
5.30 pm. To find out the number of minutes between two given date-time values,
you use the formula:
What is this formula? When you subtract two date-time values, it gives you the
difference. In this case, the day is same only time is different. So the difference
will be in a decimal fraction. This is to be multiplied by minutes in a day
to calculate how many minutes elapsed between these two date time values. That
is why we are multiplying the fraction with 24 (hours) and 60 ( minutes per
hour). Finally we dont want fractional minute values, so we use the int
function.
So we now know that 9.30 am means 570 minutes since midnight and 5.30 pm means
1049 minutes since midnight.
We can add
hours, minutes or seconds to a given date-time value, without worrying about
how it is managed internally. For example, let us add 133 minutes to the datetime
value 23-May-2004 12:00 AM. There is a very useful and nice function in Excel
called TIME(). It takes three parameters hours, minutes and seconds.
In this example, we started off with the midnight date-time value 23-May-2004
12:00 am. We wanted to add 133 minutes to this value and get the resulting datetime
value. In this example, we only are adding minutes. Therefore in the TIME function
we keep the hour and second parameters empty.
The TIME function understands that the base value is date-time, the value being
added is a number and also that this number is to be treated as minutes. Finally
it gives you the proper datetime value which is 133 minutes from midnight.
Now we have all the knowledge required to generate random datetime values between
9.30 am to 5.30 pm on a given day.
Here are the steps.
1. Start with
a date-time value for a given day which is a typical midnight value. Let us
start with our same date-time, 23-May-2004 12:00 am.
2. Now we want to add minutes to these which will create a date-time value (at
minute level resolution) between 9.30 am to 5.30 pm. We already know these time
values actually should be between 570 and 1049 minutes since midnight.
3. Now we use our good old RANDBETWEEN(570, 1049) function to create random
minute values in this range.
4. We then add these minutes to the base midnight date time value.
5. Thats it. You have real-life like randomised date-time values at minute
level resolution.
Here is how the data would be. I have added this to our base exercise of generating
random base date values (with 12:00 AM or midnight default time value).
Controlling the distribution of random values
This is a
special case. In all above examples, we have been using the RANDBETWEEN() function.
This way, all the values are randomly generated, between two numbers. Suppose
you generate values between 1 and 5 randomly, after say 1000 random values are
generated, you will observe that each value is generated approximately 25% of
time.
What to do when you have to generate the some specific values randomly with
a different distribution?
This is done using the Random Number Generation tool within the Data Analysis
Add-in in Excel. You enable this add in exactly the same way as you added the
earlier add-in in last article.
First you need two columns of data, first column containing the values and the
second column contains the distribution weight you want to apply. The sum of
all values in the second column should be equal to 1. The decimal values indicate
what percentage of random values should be created for each of the base values.
| 1 |
0.05 |
| 2 |
0.15 |
| 3 |
0.3 |
| 4 |
0.2 |
| 5 |
0.3 |
Once you have this table ready, you can choose Tools Data Analysis from
the Excel menu. Within that choose the option, Random Number Generation. Now
configure the dialog as follows:
- Number of random numbers: 100
- Distribution: Discrete
- Value and probability input range: Select the above two columns (without column
headings).
- Output range, select an empty cell.
- Now click Ok.
- 100 random numbers will be generated with the specified distribution percentage.
This is a very powerful method but can still be used easily and effectively
even by end users.
Uploading data to databases
Once you have all the data in Excel, using one or more of the methods shown
above, we need this data to be uploaded to a database. This is actually managed
easier if we manage this from the database side. To do so, follow these instructions.
The assumption is that the database contains the target table containing the
appropriate fields and field types.
1. Open Excel. Let us assume that there are three columns
containing randomised data which correspond to the three fields of the base
table in the database.
2. Select the entire block containing the three columns and all rows.
3. Choose, File Save As.
4. Save the file as comma delimited file (CSV).
5. From your database use an ETL Tool or a bulk loading tool to upload the file
containing real-life like data into the base table.
6. Thats it.
More possibilities
These examples are just thatexamples. Apply your mind
and you will be able to create realistic data of any volume you desire very
easily.
 |
About
the Author:Dr Nitin Paranjape is the Chairman and MD of Maestros (Mediline).
He is a consultant with many organisations, covering appropriate technology
utilisation, business application of relevant technology, application architecture
and audit as well as knowledge transfer. He has authored more than 650 articles
on various technology-related subjects. He can be contacted at nitin@mediline.co.in |
|