|
Eliminating duplicate rows in Excel
Tech Forum - Dr. Nitin Paranjpe
Most
organisations have databases systems in place for business automation. However,
most end users still have a habit of creating their own Excel sheets to store
and manipulate data. Ideally Excel should be used for analysing data stored
in database. However, you will be surprised to know that a very substantial
percentage of companies still use Excel more as a data storage engine than a
spreadsheet.
Spreadsheet is not designed to store data. It does
not have adequate support for maintaining data integrity, validations, relations
and so on. Therefore, problems that never arise with databases are common with
Excel-based data storage.
One of the common problems (which does not seem to
have a solution) is how to eliminate duplicate rows. This is a very common issue
when Excel is used for entering data directly. If the data came from a database,
duplicate rows are unlikely.
In this section I will provide a very useful and effective
solution for this problem.
The solution
This solution processes a specified list of Excel data
ranges and returns a list of unique rows. If there are duplicate rows, it simply
eliminates them.
Here
is some sample Excel data. As you can see, there are duplicate rows. Please
note that this range has been named as SampleRange. Let us assume that this
file is saved as Duplicate.xls.
How do you give a name to a range? Simple. Just highlight
the range (including the headings) and then type the name in the Name Box (the
area where you see the name SampleRange in the diagram)
Most users try to use some sort of built-in function
to manage duplicates. However these methods are simply not effective. We have
to look for something more sophisticated and smart to make this work.
If you know database programming, the solution is simple.
You simply use the distinct syntax of an SQL command to eliminate
duplicates. The command would be very simple.
select distinct * from <tablename>
The problem is that our data is not in a database, it
is in Excel. Now what do we do?
If we could treat Excel data like a database table,
and run SQL commands onto it, our work would be done quickly. Fortunately this
is possible.
It has been there for years. But very few people notice,
know and use the ODBC driver for Excel. This is a very powerful mechanism of
working on Excel data as though it was a database. The workbook is the database
and specific ranges are like tables.
Now, the next issue is where to run the Select
command from?
Thats easier than you thought. You can actually
run the query from the same Excel sheet! Dont believe me?
Just follow these steps:
1. Create an ODBC data source for the Excel file.
2. Create an external data connection based upon this
data source.
3. Run the SELECT DISTINCT command.
4. Return data to the same Excel sheet.
Now let us do this step by step.
Creating the ODBC data source
1. Choose Start menu Settings Control
Panel Administrative Tools Data Sources (ODBC)
2. You will see an existing entry called Excel files.
Select the entry and choose Configure.
3. In the dialog that appears, click on Select Workbook
and choose the file Duplicate.xls. Click Ok.

4. Now we have created the data source. Even if this
file is open in Excel it is fine. It still allows you to create it as a data
source.
Creating an external data connection
1. Now come back to Excel and open Duplicate.xls
(if it is not already opened).
2. To get data from any source, you need to choose
the Data - Import External Data New Database Query option.
3. A dialog appears which lists all available data
sources. Choose the Excel Files data source and choose Ok.
4. Now it tries to connect to the data source, which
in this case is the same Excel file.
5. It tries to search for all named ranges. In this
case only the SampleRange range is available.
6. It shows this on the left side of the dialog. Click
on the SampleRange and click on the > button.
7. Click Next twice (skip the Filter data and Sort
Order steps).
8. Now in the last step (Dialog is now titled Finish)
choose the option View data or edit query in Microsoft Query.
9. Click Finish.
Using the SELECT DISTINCT clause
1. Now the MS Query application appears. Actually MS
Query has been there for years. But now it is very well integrated with Excel
query engine. MS Query shows all the records in the SampleRange range. This
obviously shows two rows containing the same Item and Qty (Pen 3). This
is because the default query is SELECT * from <tablename>.
We now want to insert the DISTINCT clause.
2. To do this, click on the View SQL button.

3. Now the default SQL statement will be shown. Add
the word DISTINCT immediately after the SELECT clause and choose Ok.
4. Now the records will be shown filtered in such a
way that only one occurrence of Pen 3 is seen.
5. Choose the File menu and select the Return
data to Microsoft Excel option.
Returning data to Excel
1. Now another dialog appears which allows you to specify
whether you want the data to be returned to Excel as raw data, or you want to
create a Pivot Table.
2. In this case we will choose the option Existing
worksheet. Click on any empty cell and click on Ok.
3. Now the data that you have in Excel contains no
duplicates, but it is internally kept as a result of an external query. You
actually want to convert it to actual Excel data (with duplicates removed) and
then delete the raw data containing duplicates. To achieve this, copy the cleaned
up data and choose Paste Special
to paste it at another location
or another Excel file with the Values option. This de-links data from the base
query.
4. Thats all there is to it.
ASCII raw data
This same concept can be used to clean raw data available
in ASCII format (comma delimited / tab delimited or Fixed width format). For
this also there is an ODBC driver available. To create a data source to a simple
ASCII text file
1. Open the Data Sources menu.
2. Choose Add
button.
3. Choose Microsoft Text driver from the list of available
drivers.
4. Specify the directory. Now click on Options.
5. Here you can specify the file extensions.
6. As you know ODBC requires a database that contains
multiple tables. In this case the Directory itself acts like a database.
7. Individual ASCII files represent the tables.
8. Now, click on the Define Format
button.
9. Another dialog appears which shows all available
files with selected extensions.
10. For each file you can specify how many columns
it contains, whether it contains header row (first row containing the column
headings, the format and so on.
11. The most interesting feature is the Guess button.
12. If you dont know the exact data types of
each column in the incoming text file, the ODBC driver can guess it for you.
You can, of course, correct the guess, if required.
13. Here is a sample file ASCII.TXT, which is actually
a comma-delimited file.
Item,Qty,SaleDate
x,10,12-1-03
y,3,11-11-03
14. Now if you click on the Guess button, ODBC will
search for specified number of rows in the incoming data to try and find out
what the column data types are. Default value is zero, which means all rows
are searched. This is fine if the incoming data is small. For large data, it
is better to specify a finite value of rows to scan for determining data types.
15. In this case, leave the value of Rows to
Scan as zero and click on Guess button. Choose the Column Name Header
option. This indicates that the first row contains the column names. This is
a brilliant feature.
16. Now the Columns list will be populated to show
Item, Qty and SaleDate columns. Click on each one of them to see that the appropriate
data type has been guessed correctly.

17. Now choose Ok. Your ASCII file-based data source
is now ready.
18. To import this data and eliminate duplicates follow
the same steps as shown above. For Excel it does not matter whether the data
is coming from an ASCII file, another Excel sheet range or an actual database.
It simply works on ODBC. This is a real representation of Universal Data Access!
 |
About the Author: Dr Nitin
Paranjape is the Chairman and MD of Maestros (Mediline). He is a consultant
with many organisations, covering appropriate technology utilisation, business
application of relevant technology, application architecture and audit as
well as knowledge transfer. He is a Certified Information Systems Auditor
and has authored more than 650 articles on various technology-related subjects.
He can be contacted at nitin@mediline.co.in. |
|