Issue dated - 9th September 2003


Quick and easy document indexing

Tech Forum - Dr. Nitin Paranjape

We create a lot of documents every day. Some of these are stored on servers, and some on client desktops. Ideally, all corporate documents should be stored centrally so that they can be shared, protected and backed up easily. Personal or draft documents can be kept on local disks.

When you centralise document storage, the number and size of the documents increase very quickly. Searching for documents across a large repository becomes more and more important.

However, searching becomes difficult due to the following:

  1. Users do not choose appropriate filenames. Although Windows allows 255-character names, we still tend to use short and cryptic names. I have even seen documents being stored as ‘document43.doc’ or ‘book13.doc’ (the default names!)
  2. We do not specify document properties. All Office documents, as well as many other types of documents, provide for extensive property lists. Common properties like document creation date and author are specified automatically. However, these properties are hardly useful for locating a specific document from a large list. Ideally, the title, subject, category and keywords should be specified to make searching easier. Unfortunately, this discipline simply does not exist in most user communities.
  3. Even if verbose names are used, we tend to forget names.
  4. Searching for a specific string across documents is possible but very slow.

The solution to all these issues can be found in a very nice feature of Windows 2000 and later versions, called ‘Indexing Service’. What’s more, it can be activated and used very easily, right from Explorer itself.

What is Indexing Service?

It is a service available on Windows 2000 and later versions. The scenario is simple. You have one or more directories that contain lots of documents and you want to make these searchable.

What does the Indexing Service do?

  1. It searches specified files when the computer is idle. This way, it does not affect the performance of regular applications running on the computer.
  2. It creates a catalog containing all the text in all files.
  3. This catalog can be searched much faster than searching individual files.
  4. When a file contains text and other non-text data, Indexing Service filters the text content only for indexing.
  5. Technically any file (even proprietary formats) can be indexed by this service if the appropriate filter is available.
  6. Catalogs contain all the words found in the files and an index that points to individual files.
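
The catalog idea described above can be sketched in a few lines of Python. This is only an illustration of the general technique (an inverted index), not the actual catalog format used by Indexing Service. The function names, the sample files and the tag-stripping "filter" are all assumptions made for the example.

```python
import re
from collections import defaultdict

def extract_text(raw: str) -> str:
    """Hypothetical filter: drop anything that looks like a markup tag."""
    return re.sub(r"<[^>]+>", " ", raw)

def build_catalog(files: dict[str, str]) -> dict[str, set[str]]:
    """Map every word to the set of file paths that contain it."""
    catalog = defaultdict(set)
    for path, raw in files.items():
        text = extract_text(raw).lower()
        for word in re.findall(r"[a-z0-9]+", text):
            catalog[word].add(path)
    return dict(catalog)

# Made-up sample files; the contents are given inline to keep the sketch runnable.
files = {
    r"c:\documents\budget.txt": "Annual budget for the sales team",
    r"c:\documents\notes.htm": "<b>Meeting</b> notes on the budget",
}
catalog = build_catalog(files)

# Looking up a word is a single dictionary access, not a scan of every file.
print(sorted(catalog["budget"]))
```

The lookup at the end shows why searching a catalog is so much faster than opening each file in turn.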

Features of Indexing Service

  1. Search by content (text – individual words or phrases)
  2. Search by document properties. These can be default properties or custom properties.
  3. Boolean search (AND, OR, NOT). This makes searching more powerful and specific.
  4. No specific syntax is required for searching. This is called Free Text search.
  5. However, more complex and sophisticated searching is also possible using special query syntax.
  6. For developers it also offers an OLEDB provider.
  7. It can work on files on local machine, network shares as well as UNIX
  8. It understands security. For example, if you have a central server for storing multiple user documents, it will only search in files for which the currently logged-on user has access rights.
  9. It also integrates with IIS for indexing Web content.
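
The security feature in the list above can be pictured as a filter applied to the search results. The sketch below is only a toy model: the access lists, user names and the `visible_results` function are invented for illustration, and the real service relies on Windows file permissions rather than a Python dictionary.

```python
def visible_results(hits, acl, user):
    """Keep only the hits whose access list includes the given user."""
    return [path for path in hits if user in acl.get(path, set())]

# Hypothetical search hits and access lists.
hits = [r"\\server\docs\plan.doc", r"\\server\docs\salary.xls"]
acl = {
    r"\\server\docs\plan.doc": {"asha", "ravi"},
    r"\\server\docs\salary.xls": {"asha"},
}

# ravi has no rights on salary.xls, so it never appears in his results.
print(visible_results(hits, acl, "ravi"))
```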

So much for features! Now let us see how to activate and use it.

Enabling the service

  1. Open Explorer.
  2. Click on Search (or press F3).
  3. Click on the hyperlink Indexing Service.
  4. Now the following dialog will appear. Choose "Yes, enable Indexing Service…."
  5. Click on the Advanced button.
  6. Now the Indexing Service management console will appear.
  7. Click on the Show/Hide Console Tree toolbar button. This will open a tree on the left side.
  8. There will be two existing catalogs listed, called System and Web. However, in this example, we will create a new catalog for our own directories. I am assuming that there is a separate directory called ‘c:\documents’ that contains your files.
  9. From the Action menu, choose New – Catalog. Specify a name and choose the place where the catalog will be stored. Choose Ok.
  10. From the tree open the new catalog. Right-click on Directories and choose New – Directory.
  11. In the dialog that appears, specify the required directory (in this case ‘c:\documents’). Ensure that the button "Include in index" is selected as Yes.
  12. You can add server shares by specifying the UNC path like ‘\\servername\sharename’. You can specify the username/password where required.
  13. You can add more directories using the same method.
  14. Now restart the service.
  15. Now click on the root of the tree on left side "Index service on …"
  16. The right pane will now show the status of the indexing process. Total documents, remaining documents and overall status are displayed.
  17. When the ‘Docs to index’ column shows 0, it means the indexing on that catalog is completed. Some documents may not be indexed if they are in use.

How to search?

When the catalog is created, we are ready to search. Searching is very simple and very, very fast.

  1. Open the catalog node from the left side and click on "Query the catalog" node.
  2. Now the following form appears on the right side.
  3. Just type the words that you want to search documents for and choose the Search button. The files that contain the search criteria are listed below as hyperlinks. When you click on the hyperlink the document opens. It is very simple and very fast—much better than using the base Explorer search.
  4. While searching for text, the Indexing Service assigns a rank to each search result. You can sort by rank or by Title, Path, Size or Timestamp.
  5. For advanced searches you need to learn the query syntax. Click on the ‘Tips for Searching’ hyperlink. Now click on the ‘Query Syntax’ button. A separate help file opens and provides complete reference on the query options.
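
Sorting the result list by rank or by another column, as mentioned in the steps above, amounts to the following. The result rows and rank values here are made up for the example; they are not output from a real catalog.

```python
from operator import itemgetter

# Hypothetical search results with the columns named in the article.
results = [
    {"title": "Budget 2003", "path": r"c:\documents\budget.doc", "size": 48128, "rank": 620},
    {"title": "Annual report", "path": r"c:\documents\report.doc", "size": 90112, "rank": 870},
    {"title": "Meeting notes", "path": r"c:\documents\notes.doc", "size": 12288, "rank": 310},
]

# Best match first.
by_rank = sorted(results, key=itemgetter("rank"), reverse=True)
# Alphabetical by title.
by_title = sorted(results, key=itemgetter("title"))

print([row["title"] for row in by_rank])
print([row["title"] for row in by_title])
```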

Advanced query

The syntax is complex and comprehensive. It is not possible to cover all of it here. However, I will offer an overview of the types of queries available.

| Type | Details |
| --- | --- |
| Free-text query | This is the simple query. Just type words and phrases and search. |
| Phrase query | Here phrases, rather than words, are searched. Just enclose the phrase in inverted commas. |
| Pattern matching query | This uses wildcard-based patterns to find words / phrases. |
| Document properties | These queries work on the properties rather than the content of the files. |
| Vector space queries | Here each word in the search phrase can have a different weight (or importance). |

Here are some examples of various types of queries possible.

  • Document Property based search

The properties can be specified by prefixing these with an @ sign.

@DocTitle= "sample" will find all documents whose title is "sample".

@DocTitle contains "sample" will find all documents whose title contains the word "sample".

@filename jan*.doc will find all files whose names match jan*.doc. Multiple properties can be combined with AND / OR / NOT.

All documents contain the ALL property. This includes the text in all properties, which are Content, Filename, Size and Write (datetimestamp).
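
The difference between an exact match on a property and a "contains" match can be illustrated as follows. This is a sketch of the idea only. The document records are invented, and the comparisons are plain Python, not the way Indexing Service evaluates property queries.

```python
# Hypothetical documents with two of their properties.
docs = [
    {"DocTitle": "sample", "DocAuthor": "asha"},
    {"DocTitle": "sample report", "DocAuthor": "ravi"},
    {"DocTitle": "minutes", "DocAuthor": "asha"},
]

# Exact match: the whole title must be "sample".
title_equals = [d for d in docs if d["DocTitle"].lower() == "sample"]

# Contains: "sample" only has to be one of the words in the title.
title_contains = [d for d in docs if "sample" in d["DocTitle"].lower().split()]

# Two property tests combined, in the spirit of AND / NOT.
combined = [
    d for d in docs
    if "sample" in d["DocTitle"].lower().split() and not d["DocAuthor"] == "asha"
]

print(len(title_equals), len(title_contains), len(combined))
```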

  • With Boolean operators

AND / OR / NOT keywords are used. Each can be used in various combinations and grouped using brackets.
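
If each word in a catalog points to the set of files that contain it, the Boolean operators map naturally onto set operations. The sketch below assumes a tiny made-up catalog and is meant only to show the logic, not the internals of the service.

```python
# Hypothetical catalog: word -> set of files containing that word.
catalog = {
    "budget": {"a.doc", "b.doc"},
    "sales": {"b.doc", "c.doc"},
    "draft": {"c.doc"},
}
all_files = {"a.doc", "b.doc", "c.doc"}

both = catalog["budget"] & catalog["sales"]    # budget AND sales
either = catalog["budget"] | catalog["sales"]  # budget OR sales
not_draft = all_files - catalog["draft"]       # NOT draft

# Brackets group the operators: (budget OR sales) AND NOT draft
grouped = (catalog["budget"] | catalog["sales"]) - catalog["draft"]

print(sorted(both), sorted(grouped))
```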

  • Near keyword

If you specify "one" NEAR "two", it will search for all documents where these two words are within 50 words of each other.
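
A proximity test of this kind can be sketched as below. The `near` function is hypothetical, and I am assuming that "within 50 words" means the word positions differ by at most 50. The real service may count differently.

```python
import re

def near(text: str, first: str, second: str, window: int = 50) -> bool:
    """True if some occurrence of `first` lies within `window` words of `second`."""
    words = re.findall(r"\w+", text.lower())
    pos_first = [i for i, w in enumerate(words) if w == first.lower()]
    pos_second = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= window for i in pos_first for j in pos_second)

# Two made-up documents: the words are 11 and 81 positions apart.
close = "one " + "filler " * 10 + "two"
far = "one " + "filler " * 80 + "two"

print(near(close, "one", "two"), near(far, "one", "two"))
```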

  • Pattern matching

This is based upon wildcards. ‘*’ means zero or more characters and ‘?’ means a single character.

A handy use of this is finding word forms. If you query on "walk*", it will find walking, walked and walker.
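
Python's standard `fnmatch` module uses the same two wildcard characters, so it is a convenient way to try out patterns. Note that this shows Python's matching rules, where ‘*’ also matches zero characters. The word list is made up for the example.

```python
from fnmatch import fnmatchcase

# A made-up vocabulary to match against.
words = ["walk", "walking", "walked", "walker", "talk", "wall"]

# '*' stands for any run of characters, including none at all.
prefix_hits = [w for w in words if fnmatchcase(w, "walk*")]

# '?' stands for exactly one character.
single_hits = [w for w in words if fnmatchcase(w, "?alk")]

print(prefix_hits)
print(single_hits)
```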

  • Vector space queries

Here you can specify how important it is to find each individual word being searched. When multiple words are searched in simple queries, each of the words must be found.

However, in vector queries, you can specify importance (or weight) for each search word. The weight can be from 0 to 1. Default is 1.

{weight value=.250} express AND
{weight value=.500} computers

This means that finding the word ‘computers’ is twice as important as finding the word ‘express’.
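
One simple way to picture weighted matching is to add up the weights of the query words that a document contains and sort by the total. This is a deliberately naive sketch under that assumption. It is not the ranking formula used by Indexing Service, and the documents are invented.

```python
def score(doc_words: set[str], weights: dict[str, float]) -> float:
    """Sum the weights of the query words that the document contains."""
    return sum(weight for term, weight in weights.items() if term in doc_words)

# The weights from the query above.
weights = {"express": 0.25, "computers": 0.5}

# Hypothetical documents, each reduced to its set of words.
docs = {
    "a.doc": {"express", "delivery"},
    "b.doc": {"computers", "hardware"},
    "c.doc": {"express", "computers"},
}

# A document need not contain every word; it simply scores lower.
ranked = sorted(docs, key=lambda name: score(docs[name], weights), reverse=True)
print(ranked)
```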

Programmatic querying

Other than the query syntax, you also have full-featured SQL syntax for querying the catalog programmatically.

The OLEDB Provider and the query syntax can be used to build advanced knowledge management solutions. In fact, IIS, Exchange and SharePoint internally use the Indexing Service as the base catalog management and search engine.
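
As a rough idea of what a programmatic query might look like, the sketch below only builds the SQL text. It does not connect to a catalog. I am assuming the SQL dialect accepts a SELECT over SCOPE() with a CONTAINS predicate on the Contents property, and that the OLEDB provider is named MSIDXS. Please verify both against the provider documentation before relying on them. The `build_catalog_query` helper is hypothetical.

```python
def build_catalog_query(phrase: str, columns=("Filename", "Path", "Rank")) -> str:
    """Compose an Indexing Service style SQL query for a phrase search."""
    # Double any single quotes so the phrase stays inside the SQL literal.
    safe = phrase.replace("'", "''")
    return (
        f"SELECT {', '.join(columns)} FROM SCOPE() "
        f"WHERE CONTAINS(Contents, '\"{safe}\"') "
        "ORDER BY Rank DESC"
    )

query = build_catalog_query("annual budget")
print(query)
```

The resulting string would then be passed to the provider through whichever data access library you use on Windows.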

About the Author: Dr Nitin Paranjape is the Chairman and MD of Maestros (Mediline). He is a consultant with many organisations, covering appropriate technology utilisation, business application of relevant technology, application architecture and audit as well as knowledge transfer. He has authored more than 650 articles on various technology-related subjects. He can be contacted at nitin@mediline.co.in


© Copyright 2003: Indian Express Group (Mumbai, India). All rights reserved throughout the world. This entire site is compiled in
Mumbai by The Business Publications Division of the Indian Express Group of Newspapers.