|
Techforum
Managing data text files like database tables
ODBC
is now superseded by OLEDB. But still there are some drivers of ODBC which are
still extremely useful. In fact, these have no equivalent in OLEDB at all. One
such driver is the Text driver. This driver is not commonly known and used.
But I have used it extensively while building various data crunching applications.
Like any ODBC driver, the Text ODBC driver also makes the underlying data available
using standard data access syntax. If you use the OLEDB driver for ODBC, you
can access text data even through OLEDB. As you know it is also possible to
use linked server functionality of SQL Server to directly include text file
based data in a query using the driver.
Sample data
Let us see how we can configure this driver. For this you require a comma /
tab delimited file. Here is our sample data.
This is a tab delimited file. The data is actually a task list. It contains
the following columns:
1. Customer name
2. Task description
3. Delay (deadline - todays date )
4. Task Owner
5. Deadline for the task
Now as you know, such files are typically created as download/dump from some
existing systems. Thus there is no header information generally present.
Further,
note that we have taken various data types heretext, numeric and date.
One important thing to remember is that while using the text file date
type of data, it is important to ensure that the format of data matches with
the current date time setting on the Windows side. Otherwise, the parsing may
generate errors due to misinterpretation of the date type of data.
Typical text data handling techniques
Here is a list of commonly used techniques for handling such text data.
1. The file is submitted as input to another process / application.
2. The file is manually imported into Excel and data is analysed.
3. The data is merged with similar data originating elsewhere and the large
consolidated file is then processed
4. The file is parsed programmatically and the data is inserted into a database.
5. The file is loaded into database using some loading utility like BCP tool
of SQL Server.
6. A full-featured ETL tool is used to merge the file or process it in the required
manner. For example, Data Transformation Service of SQL Server is the ETL tool.
These tools have the ability of parsing, uploading, processing, merging, splitting,
lookup, error handling, format conversion and so on.
Now let us see how a new powerful method is available using the ODBC Text driver.
ODBC configuration
This is the most interesting part. Here are the steps:
1. Ensure that the date format is correct by checking in Control Panel
Regional Settings
2. Let us assume the file is stored at the path
C:\sample.txt
3. Now choose Control Panel ODBC Data sources. (If you are using Windows
2000 or above, you will need to choose Control Panel Administrative tools
ODBC data sources)
4. You can create three types of ODBC connections using DSNs. System DSNs are
available across the system. User DSNs are available for a particular user and
File DSNs are typically used when the DSN configuration needs to be applied
on another computer because the DSN definition itself is saved as a file.
5. For our
purpose we will use System DSN. Choose the System tab and click on Add
6. Now all available drivers are listed. Find and choose the Microsoft Text
Driver (*.csv, *.txt)
7. Now, the configuration dialog for the Text Driver will appear.
8. Unselect the Use current directory option. Select the directory C:\
9. Name the source TextDemo.
10. The concept here is simple. As you know databases contain multiple tables.
In case of the text driver, it considers the directory as a database and text
files within it as tables. Smart is it not?
By the way, in case of Excel, named ranges become tables and the entire Excel
workbook becomes the database. It is important to note these design approaches
because you can then apply these concepts while creating your own, custom OLEDB
providers.
11. Now close the dialog.
12. One important tip. Whenever you want to quickly check out any data source
(ODBC or OLEDB), the quickest and most useful way is to use Excel. Choose Data
Import External Data option to quickly connect to data sources. Use the
Import data option to get data from OLEDB providers. Use the New database query
option to get data from ODBC providers.
13. Now let us quickly try to access the sample.txt file from Excel.
14. Open Excel, create a new file and choose Data menu Import
External Data New database query
15. Select the TextDemo source from the list shown.
16. Now Excel
will query the text data source to make a list of available tables (in this
case text files) and display the list for you. In the example shown, there were
more text files in the C:\ directory. All these files are listed as tables.
17. Open the Sample.txt node in the list on the left side. As you can see, the
data is not shown properly. Ideally field names should have been listed in the
dialog. But here it shows a long text caption.
18. Why did this happen? Because we did not specify details of how the data
is delimited in the file. Further we did not specify that there was no header
row. Therefore, it considered the first row as the header row and converted
it to a single string containing the entire data from all fields.
19. So let us go back to our data source definition and provide this information.
Cancel the Excel import for the time being. We will come back to it later.
20. Now, open the System DSN list from ODBC Data sources and select the TextDemo
source which we just created.
21. Click on Configure. In the dialog, click on Options button. The dialog now
expands and displays more options. This is the most important, interesting and
powerful part of the humble Text ODBC Driver.
Defining formats in ODBC
22. First of all notice that it is possible to define particular file extensions.
By default it is *.*.
23. Unclear the Default option. Now the list will be enabled. This list shows
various types of extensions already. These include CSV, TAB, ASC and TXT. You
can add your custom extensions if any. This way, the ODBC users will get the
correct list of relevant files (tables) available in the data source.
24. Please note the ODBC text driver works only on one directory at a time.
Files stored in subdirectories are not accessible. For this you will need to
create another source which points to that directory.
25. Anyway. Now notice the small button called Define Formats
.
It is so inconspicuous that most of us miss it. Those who notice it may never
feel like clicking on it. But go ahead and click.
26. Now you are treated with a great collection of brilliant functionality which
can make your data processing tasks dramatically easier.
What is all
this? Look carefully and you will understand the concept. Now that we have selected
a directory and that contains various text files, it is possible that each text
file has a different format, delimiters and data fields. Who will decipher all
this?
Traditionally, we have done this type of decoding and deciphering at code level
or batch process level. But think about it. Why not let the ODBC driver provide
this information to all applications which want to use this data? Why hard code
these things within individual applications?
That is how this dialog was created. The concept is simple. This dialog allows
you to specify details about the format and structure of each text file within
the selected directory. Once this information is available it is made available
to any ODBC consuming application as a part of standard metadata information.
Brilliant way of handling things, I must say.
27. Now choose the sample.txt from the list. Select the Format as Tab
Delimited
28. Ensure that the option Column Name Header is cleared. This is
enabled only if the first row in the file contains column names. In our case
we do not have a header row.
29. Now on the right side there is a list of fields. By default it is empty.
Click on the Guess button. Now, the text driver will try to analyse the data
types of the fields available and guess their types and widths. If the delimiter
information is wrongly configured, the guess option may generate wrong results.
30. Now the Guess option should detect 5 fields in our file sample.txt
31. As the driver does not know what name to give to each field, it names them
F1, F2, F3 and so on.
32. Now you can select each of the fields, cross check that the data type is
right. You can override the guessed datatype if required. You can also change
the default names to more readable and legible names. Click on modify to store
the changes done to each field setting.
33. After you have defined name for each field, choose OK.
34. Now you have defined the complete format for our sample.txt in the text
driver.
Accessing data in Excel
35. Now you can return to the Excel data import option.
36. As you can see, the field details are now shown correctly.
37. Complete the wizard and then you will see the data being shown in Excel
in the appropriate format.
38. If the data in the text file changes let us say we get the current
data as sample.txt on a daily basis, you just need to right click on the data
in Excel and choose the Refresh button.
39. Internally it will access the text data again by firing a select *
command and refresh the data.
Accessing data in programs
This is very simple and similar to any other type of data access. You can use
any programmatic method you use for data handling with text files now.
This is a very
powerful feature because, without making any effort for parsing or uploading
the data, you are now in a position to edit the data, fire queries, compute
aggregates and even participate in heterogeneous database queries.
Imagine the amount of time, energy and code so many developers must have put
in writing their own sequential delimiter parsing code looking for particular
characters in a loop, till CR-LF is found and so on. Why did they not discover
this simple but effective feature of something as ancient as ODBC?
Rethink Text file processing
Think of all the places within your organization (or whatever context) where
text files (comma delimited, fixed width, tab delimited) are being used and
processed. Think of whether this method would provide a better approach. If
the answer is yes, go ahead and try this method out.
Think of new uses
Apart from enhancing existing text data handling, you can now think of completely
different new ways of utilising the text based data. Because now you have ODBC
access directly to the raw data, many new opportunities can be thought of, which
were simply not feasible earlier.
In short, a small but powerful set of features provides you
with a great opportunity to manage text files in ways that were traditionally
unthinkable. Make use of this and enhance your applications.
 |
About the Author:Dr Nitin
Paranjape is the Chairman and MD of Maestros (Mediline). He is a consultant
with many organisations, covering appropriate technology utilisation, business
application of relevant technology, application architecture and audit as
well as knowledge transfer. He has authored more than 650 articles on various
technology-related subjects. He can be contacted at nitin@mediline.co.in |
|