[an error occurred while processing this directive]
Issue dated - 20th October 2003

-


Previous Issues

CURRENT ISSUE
INDIA NEWS
STOCK FILE
INDIA TRENDS
NEWS ANALYSIS
OPINION
E-BUSINESS
FOCUS
TECHNOLOGY
TECHSPACE
PRODUCTS
COLUMNS
TECH FORUM

THE C# COLUMN

BETWEEN THE BYTES
TECHNOLOGY
SPECIALS <NEW>
Symantec Report
Security Headquarters
JobsDB
MINDPRINTS
HMA BANKBIZ
EC SERVICES
ARCHIVES/SEARCH
IT APPOINTMENTS
WRITE TO US
SUBSCRIBE/RENEW
CUSTOMER SERVICE
ADVERTISE
ABOUT US

 Network Sites
  IT People
  Network Magazine
  Business Traveller
  Exp. Hotelier & Caterer
  Exp. Travel & Tourism
  Exp. Pharma Pulse
  Exp. Healthcare Mgmt.
  Express Textile
 Group Sites
  ExpressIndia
  Indian Express
  Financial Express

 
Front Page > TechSpace > Story Print this Page|  Email this page

Eliminating duplicate rows in Excel

Tech Forum - Dr. Nitin Paranjpe

Most organisations have databases systems in place for business automation. However, most end users still have a habit of creating their own Excel sheets to store and manipulate data. Ideally Excel should be used for analysing data stored in database. However, you will be surprised to know that a very substantial percentage of companies still use Excel more as a data storage engine than a spreadsheet.

Spreadsheet is not designed to store data. It does not have adequate support for maintaining data integrity, validations, relations and so on. Therefore, problems that never arise with databases are common with Excel-based data storage.

One of the common problems (which does not seem to have a solution) is how to eliminate duplicate rows. This is a very common issue when Excel is used for entering data directly. If the data came from a database, duplicate rows are unlikely.

In this section I will provide a very useful and effective solution for this problem.

The solution

This solution processes a specified list of Excel data ranges and returns a list of unique rows. If there are duplicate rows, it simply eliminates them.

Here is some sample Excel data. As you can see, there are duplicate rows. Please note that this range has been named as SampleRange. Let us assume that this file is saved as Duplicate.xls.

How do you give a name to a range? Simple. Just highlight the range (including the headings) and then type the name in the Name Box (the area where you see the name “SampleRange” in the diagram)

Most users try to use some sort of built-in function to manage duplicates. However these methods are simply not effective. We have to look for something more sophisticated and smart to make this work.

If you know database programming, the solution is simple. You simply use the ‘distinct’ syntax of an SQL command to eliminate duplicates. The command would be very simple.

select distinct * from <tablename>

The problem is that our data is not in a database, it is in Excel. Now what do we do?

If we could treat Excel data like a database table, and run SQL commands onto it, our work would be done quickly. Fortunately this is possible.

It has been there for years. But very few people notice, know and use the ODBC driver for Excel. This is a very powerful mechanism of working on Excel data as though it was a database. The workbook is the database and specific ranges are like tables.

Now, the next issue is where to run the ‘Select’ command from?

That’s easier than you thought. You can actually run the query from the same Excel sheet! Don’t believe me?

Just follow these steps:

1. Create an ODBC data source for the Excel file.

2. Create an external data connection based upon this data source.

3. Run the SELECT DISTINCT command.

4. Return data to the same Excel sheet.

Now let us do this step by step.

Creating the ODBC data source

1. Choose Start menu – Settings – Control Panel – Administrative Tools – Data Sources (ODBC)

2. You will see an existing entry called Excel files. Select the entry and choose Configure.

3. In the dialog that appears, click on Select Workbook and choose the file ‘Duplicate.xls’. Click Ok.

4. Now we have created the data source. Even if this file is open in Excel it is fine. It still allows you to create it as a data source.

Creating an external data connection

1. Now come back to Excel and open ‘Duplicate.xls’ (if it is not already opened).

2. To get data from any source, you need to choose the Data - Import External Data – New Database Query option.

3. A dialog appears which lists all available data sources. Choose the Excel Files data source and choose Ok.

4. Now it tries to connect to the data source, which in this case is the same Excel file.

5. It tries to search for all named ranges. In this case only the ‘SampleRange’ range is available.

6. It shows this on the left side of the dialog. Click on the ‘SampleRange’ and click on the ‘>’ button.

7. Click Next twice (skip the Filter data and Sort Order steps).

8. Now in the last step (Dialog is now titled ‘Finish’) choose the option – ‘View data or edit query in Microsoft Query’.

9. Click Finish.

Using the SELECT DISTINCT clause

1. Now the MS Query application appears. Actually MS Query has been there for years. But now it is very well integrated with Excel query engine. MS Query shows all the records in the SampleRange range. This obviously shows two rows containing the same Item and Qty (Pen – 3). This is because the default query is ‘SELECT * from <tablename>’. We now want to insert the DISTINCT clause.

2. To do this, click on the View SQL button.

3. Now the default SQL statement will be shown. Add the word DISTINCT immediately after the SELECT clause and choose Ok.

4. Now the records will be shown filtered in such a way that only one occurrence of “Pen – 3” is seen.

5. Choose the File menu and select the ‘Return data to Microsoft Excel’ option.

Returning data to Excel

1. Now another dialog appears which allows you to specify whether you want the data to be returned to Excel as raw data, or you want to create a Pivot Table.

2. In this case we will choose the option – Existing worksheet. Click on any empty cell and click on Ok.

3. Now the data that you have in Excel contains no duplicates, but it is internally kept as a result of an external query. You actually want to convert it to actual Excel data (with duplicates removed) and then delete the raw data containing duplicates. To achieve this, copy the cleaned up data and choose ‘Paste Special…’ to paste it at another location or another Excel file with the Values option. This de-links data from the base query.

4. That’s all there is to it.

ASCII raw data

This same concept can be used to clean raw data available in ASCII format (comma delimited / tab delimited or Fixed width format). For this also there is an ODBC driver available. To create a data source to a simple ASCII text file…

1. Open the Data Sources menu.

2. Choose Add… button.

3. Choose Microsoft Text driver from the list of available drivers.

4. Specify the directory. Now click on Options.

5. Here you can specify the file extensions.

6. As you know ODBC requires a database that contains multiple tables. In this case the Directory itself acts like a database.

7. Individual ASCII files represent the tables.

8. Now, click on the Define Format … button.

9. Another dialog appears which shows all available files with selected extensions.

10. For each file you can specify how many columns it contains, whether it contains header row (first row containing the column headings, the format and so on.

11. The most interesting feature is the Guess button.

12. If you don’t know the exact data types of each column in the incoming text file, the ODBC driver can guess it for you. You can, of course, correct the guess, if required.

13. Here is a sample file ASCII.TXT, which is actually a comma-delimited file.

Item,Qty,SaleDate

x,10,12-1-03
y,3,11-11-03

14. Now if you click on the Guess button, ODBC will search for specified number of rows in the incoming data to try and find out what the column data types are. Default value is zero, which means all rows are searched. This is fine if the incoming data is small. For large data, it is better to specify a finite value of rows to scan for determining data types.

15. In this case, leave the value of ‘Rows to Scan’ as zero and click on Guess button. Choose the ‘Column Name Header’ option. This indicates that the first row contains the column names. This is a brilliant feature.

16. Now the Columns list will be populated to show Item, Qty and SaleDate columns. Click on each one of them to see that the appropriate data type has been guessed correctly.

17. Now choose Ok. Your ASCII file-based data source is now ready.

18. To import this data and eliminate duplicates follow the same steps as shown above. For Excel it does not matter whether the data is coming from an ASCII file, another Excel sheet range or an actual database. It simply works on ODBC. This is a real representation of Universal Data Access!

About the Author: Dr Nitin Paranjape is the Chairman and MD of Maestros (Mediline). He is a consultant with many organisations, covering appropriate technology utilisation, business application of relevant technology, application architecture and audit as well as knowledge transfer. He is a Certified Information Systems Auditor and has authored more than 650 articles on various technology-related subjects. He can be contacted at nitin@mediline.co.in.
<Back to top>


© Copyright 2003: Indian Express Group (Mumbai, India). All rights reserved throughout the world. This entire site is compiled in
Mumbai by The Business Publications Division of the Indian Express Group of Newspapers.
Please contact our Webmaster for any queries on this site.