Issue dated - 22nd December 2003

-


Previous Issues

CURRENT ISSUE
NEWS ANALYSIS
INDIA NEWS
COLUMNS
TECH FORUM

THE C# COLUMN

BETWEEN THE BYTES
TECHNOLOGY
SPECIALS <NEW>
Symantec Report
Security Headquarters
JobsDB
MINDPRINTS
HMA BANKBIZ
EC SERVICES
ARCHIVES/SEARCH
IT APPOINTMENTS
Openings At Jobstreet.com
WRITE TO US
SUBSCRIBE/RENEW
CUSTOMER SERVICE
ADVERTISE
ABOUT US

 Network Sites
  IT People
  Network Magazine
  Business Traveller
  Exp. Hotelier & Caterer
  Exp. Travel & Tourism
  Exp. Pharma Pulse
  Exp. Healthcare Mgmt.
  Express Textile
 Group Sites
  ExpressIndia
  Indian Express
  Financial Express

 
Front Page > TechSpace > Story Print this Page|  Email this page

Techforum

Managing data text files like database tables

ODBC is now superseded by OLEDB. But still there are some drivers of ODBC which are still extremely useful. In fact, these have no equivalent in OLEDB at all. One such driver is the Text driver. This driver is not commonly known and used. But I have used it extensively while building various data crunching applications.

Like any ODBC driver, the Text ODBC driver also makes the underlying data available using standard data access syntax. If you use the OLEDB driver for ODBC, you can access text data even through OLEDB. As you know it is also possible to use linked server functionality of SQL Server to directly include text file based data in a query using the driver.

Sample data

Let us see how we can configure this driver. For this you require a comma / tab delimited file. Here is our sample data.

This is a tab delimited file. The data is actually a task list. It contains the following columns:

1. Customer name

2. Task description

3. Delay (deadline - today’s date )

4. Task Owner

5. Deadline for the task

Now as you know, such files are typically created as download/dump from some existing systems. Thus there is no header information generally present.

Further, note that we have taken various data types here—text, numeric and date. One important thing to remember is that while using the text file ‘date’ type of data, it is important to ensure that the format of data matches with the current date time setting on the Windows side. Otherwise, the parsing may generate errors due to misinterpretation of the ‘date’ type of data.

Typical text data handling techniques

Here is a list of commonly used techniques for handling such text data.

1. The file is submitted as input to another process / application.

2. The file is manually imported into Excel and data is analysed.

3. The data is merged with similar data originating elsewhere and the large consolidated file is then processed

4. The file is parsed programmatically and the data is inserted into a database.

5. The file is loaded into database using some loading utility like BCP tool of SQL Server.

6. A full-featured ETL tool is used to merge the file or process it in the required manner. For example, Data Transformation Service of SQL Server is the ETL tool. These tools have the ability of parsing, uploading, processing, merging, splitting, lookup, error handling, format conversion and so on.

Now let us see how a new powerful method is available using the ODBC Text driver.

ODBC configuration

This is the most interesting part. Here are the steps:

1. Ensure that the date format is correct by checking in Control Panel – Regional Settings

2. Let us assume the file is stored at the path

C:\sample.txt

3. Now choose Control Panel – ODBC Data sources. (If you are using Windows 2000 or above, you will need to choose Control Panel – Administrative tools – ODBC data sources)

4. You can create three types of ODBC connections using DSNs. System DSNs are available across the system. User DSNs are available for a particular user and File DSNs are typically used when the DSN configuration needs to be applied on another computer because the DSN definition itself is saved as a file.

5. For our purpose we will use System DSN. Choose the System tab and click on Add…

6. Now all available drivers are listed. Find and choose the Microsoft Text Driver (*.csv, *.txt)

7. Now, the configuration dialog for the Text Driver will appear.

8. Unselect the Use current directory option. Select the directory C:\

9. Name the source ‘TextDemo’.

10. The concept here is simple. As you know databases contain multiple tables. In case of the text driver, it considers the directory as a database and text files within it as tables. Smart – is it not?

By the way, in case of Excel, named ranges become tables and the entire Excel workbook becomes the database. It is important to note these design approaches because you can then apply these concepts while creating your own, custom OLEDB providers.

11. Now close the dialog.

12. One important tip. Whenever you want to quickly check out any data source (ODBC or OLEDB), the quickest and most useful way is to use Excel. Choose Data – Import External Data option to quickly connect to data sources. Use the Import data option to get data from OLEDB providers. Use the New database query option to get data from ODBC providers.

13. Now let us quickly try to access the sample.txt file from Excel.

14. Open Excel, create a new file and choose – Data menu – Import External Data – New database query

15. Select the TextDemo source from the list shown.

16. Now Excel will query the text data source to make a list of available tables (in this case text files) and display the list for you. In the example shown, there were more text files in the C:\ directory. All these files are listed as tables.

17. Open the Sample.txt node in the list on the left side. As you can see, the data is not shown properly. Ideally field names should have been listed in the dialog. But here it shows a long text caption.

18. Why did this happen? Because we did not specify details of how the data is delimited in the file. Further we did not specify that there was no header row. Therefore, it considered the first row as the header row and converted it to a single string containing the entire data from all fields.

19. So let us go back to our data source definition and provide this information. Cancel the Excel import for the time being. We will come back to it later.

20. Now, open the System DSN list from ODBC Data sources and select the TextDemo source which we just created.

21. Click on Configure. In the dialog, click on Options button. The dialog now expands and displays more options. This is the most important, interesting and powerful part of the humble Text ODBC Driver.

Defining formats in ODBC

22. First of all notice that it is possible to define particular file extensions. By default it is *.*.

23. Unclear the Default option. Now the list will be enabled. This list shows various types of extensions already. These include CSV, TAB, ASC and TXT. You can add your custom extensions if any. This way, the ODBC users will get the correct list of relevant files (tables) available in the data source.

24. Please note the ODBC text driver works only on one directory at a time. Files stored in subdirectories are not accessible. For this you will need to create another source which points to that directory.

25. Anyway. Now notice the small button called “Define Formats…”. It is so inconspicuous that most of us miss it. Those who notice it may never feel like clicking on it. But go ahead and click.

26. Now you are treated with a great collection of brilliant functionality which can make your data processing tasks dramatically easier.

What is all this? Look carefully and you will understand the concept. Now that we have selected a directory and that contains various text files, it is possible that each text file has a different format, delimiters and data fields. Who will decipher all this?

Traditionally, we have done this type of decoding and deciphering at code level or batch process level. But think about it. Why not let the ODBC driver provide this information to all applications which want to use this data? Why hard code these things within individual applications?

That is how this dialog was created. The concept is simple. This dialog allows you to specify details about the format and structure of each text file within the selected directory. Once this information is available it is made available to any ODBC consuming application as a part of standard metadata information. Brilliant way of handling things, I must say.

27. Now choose the sample.txt from the list. Select the Format as “Tab Delimited”

28. Ensure that the option “Column Name Header” is cleared. This is enabled only if the first row in the file contains column names. In our case we do not have a header row.

29. Now on the right side there is a list of fields. By default it is empty. Click on the Guess button. Now, the text driver will try to analyse the data types of the fields available and guess their types and widths. If the delimiter information is wrongly configured, the guess option may generate wrong results.

30. Now the Guess option should detect 5 fields in our file sample.txt

31. As the driver does not know what name to give to each field, it names them F1, F2, F3 and so on.

32. Now you can select each of the fields, cross check that the data type is right. You can override the guessed datatype if required. You can also change the default names to more readable and legible names. Click on modify to store the changes done to each field setting.

33. After you have defined name for each field, choose OK.

34. Now you have defined the complete format for our sample.txt in the text driver.

Accessing data in Excel

35. Now you can return to the Excel data import option.

36. As you can see, the field details are now shown correctly.

37. Complete the wizard and then you will see the data being shown in Excel in the appropriate format.

38. If the data in the text file changes – let us say we get the current data as sample.txt on a daily basis, you just need to right click on the data in Excel and choose the Refresh button.

39. Internally it will access the text data again by firing a ‘select *’ command and refresh the data.

Accessing data in programs

This is very simple and similar to any other type of data access. You can use any programmatic method you use for data handling with text files now.

This is a very powerful feature because, without making any effort for parsing or uploading the data, you are now in a position to edit the data, fire queries, compute aggregates and even participate in heterogeneous database queries.

Imagine the amount of time, energy and code so many developers must have put in writing their own sequential delimiter parsing code – looking for particular characters in a loop, till CR-LF is found and so on. Why did they not discover this simple but effective feature of something as ancient as ODBC?

Rethink Text file processing

Think of all the places within your organization (or whatever context) where text files (comma delimited, fixed width, tab delimited) are being used and processed. Think of whether this method would provide a better approach. If the answer is yes, go ahead and try this method out.

Think of new uses

Apart from enhancing existing text data handling, you can now think of completely different new ways of utilising the text based data. Because now you have ODBC access directly to the raw data, many new opportunities can be thought of, which were simply not feasible earlier.

In short, a small but powerful set of features provides you with a great opportunity to manage text files in ways that were traditionally unthinkable. Make use of this and enhance your applications.

About the Author:Dr Nitin Paranjape is the Chairman and MD of Maestros (Mediline). He is a consultant with many organisations, covering appropriate technology utilisation, business application of relevant technology, application architecture and audit as well as knowledge transfer. He has authored more than 650 articles on various technology-related subjects. He can be contacted at nitin@mediline.co.in
<Back to top>


© Copyright 2003: Indian Express Group (Mumbai, India). All rights reserved throughout the world. This entire site is compiled in
Mumbai by The Business Publications Division of the Indian Express Group of Newspapers.
Please contact our Webmaster for any queries on this site.