Quick and easy document indexing
Tech Forum - Dr. Nitin Paranjpe
We create a lot of documents every day. Some of these are stored on servers, and
some on client desktops. Ideally, all corporate documents should be stored centrally
so that they can be shared, protected and backed up easily. Personal or draft
documents can be kept on local disks.
When you centralise document storage, the
number and size of the documents increase very quickly. Searching for documents
across a large repository becomes more and more important.
However, searching becomes difficult due
to the following:
- Users do not choose appropriate filenames. Although
Windows allows 255-character names, we still tend to use short and cryptic
names. I have even seen documents stored as ‘document43.doc’ or ‘book13.doc’
(the default names!)
- We do not specify document properties. All Office
documents, as well as many other types of documents, provide extensive
property lists. Common properties like document creation date, author, etc.
are filled in automatically. However, these properties are hardly useful for
locating a specific document in a large list. Ideally, the title, subject, category
and keywords should be specified to make searching easier. Unfortunately, this discipline
simply does not exist in most user communities.
- Even if verbose names are used, we tend to forget
them.
- Searching for a specific string across documents
is possible but very slow.
The solution to all these issues can be
found in a very nice feature of Windows 2000 and later versions, called ‘Indexing
Service’. What’s more, it can be activated and used very easily, right from Explorer
itself.
What is Indexing Service?
It is a service available on Windows 2000
and later versions. The scenario is simple. You have one or more directories
that contain lots of documents and you want to make these searchable.
What does the Indexing Service do?
- It scans the specified files when the computer
is idle. This way, it does not affect the performance of regular applications
running on the computer.
- It creates a catalog containing all the text
in all files.
- This catalog can be searched much faster than
searching individual files.
- When a file contains text and other non-text
data, Indexing Service extracts only the text content for indexing.
- Technically any file (even proprietary formats)
can be indexed by this service if the appropriate filter is available.
- Catalogs contain all the words found in the files
and an index that points to individual files.
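To make the idea of a catalog concrete, here is a minimal sketch of a word-to-file index in Python. It is only an illustration of the general technique (an inverted index), not how Indexing Service is actually implemented. The file names and contents are hypothetical and stand in for the text that filters would extract.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_catalog(documents):
    """Map every word to the set of file names that contain it."""
    catalog = defaultdict(set)
    for filename, text in documents.items():
        for word in tokenize(text):
            catalog[word].add(filename)
    return catalog


# Hypothetical file contents, standing in for text extracted by filters.
documents = {
    "budget.doc": "Quarterly budget for the sales team",
    "minutes.doc": "Minutes of the sales review meeting",
    "policy.doc": "Travel policy and budget limits",
}

catalog = build_catalog(documents)

# A lookup is a single dictionary access; no file is opened at search time.
print(sorted(catalog["budget"]))  # ['budget.doc', 'policy.doc']
print(sorted(catalog["sales"]))   # ['budget.doc', 'minutes.doc']
```

The work of reading the files is done once, while building the catalog. That is why searching the catalog is so much faster than opening every file for every search.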
Features of Indexing Service
- Search by content (text – individual words or
phrases)
- Document properties. These can be default properties
or custom properties.
- Boolean search (AND, OR, NOT). This makes searching
more powerful and specific.
- No specific syntax is required for searching.
This is called Free Text search.
- However, more complex and sophisticated searching
is also possible using special query syntax.
- For developers, it also offers an OLE DB provider.
- It can work on files on the local machine, network
shares as well as UNIX
- It understands security. For example, if you
have a central server for storing documents of multiple users, it will only search
in files to which the currently logged-on user has access rights.
- It also integrates with IIS for indexing Web
content.
So much for features! Now let us see how
to activate and use it.
Enabling the service
- Open Explorer.
- Click on Search (or press F3).
- Click on the hyperlink Indexing Service.
- Now the following dialog will appear. Choose
"Yes, enable Indexing Service…."

- Click on the Advanced button.
- Now the Indexing Service management console will
appear.
- Click on the Show/Hide Console Tree toolbar button.
This will open a tree on the left side.
- Two existing catalogs, System and Web, will be
listed. However, in this example, we will create a new catalog for
our own directories. I am assuming that there is a separate directory called
‘c:\documents’ that contains your files.
- From the Action menu, choose New – Catalog. Specify
a name and choose the place where the catalog will be stored. Choose OK.
- From the tree open the new catalog. Right-click
on Directories and choose New – Directory.
- In the dialog that appears, specify the required
directory (in this case ‘c:\documents’). Ensure that Yes is selected for
"Include in index".
- You can add server shares by specifying the UNC
path like ‘\\servername\sharename’. You can specify the username/password
where required.
- You can add more directories using the same method.
- Now restart the service.
- Now click on the root of the tree on left side
"Index service on …"
- The right pane will now show the status of the
indexing process. Total documents, remaining documents and overall status
are displayed.
- When the ‘Docs to index’ column shows 0, it means
the indexing of that catalog is complete. Some documents may not be indexed
if they are in use.
How to search?
When the catalog is created, we are ready
to search. Searching is very simple and very, very fast.
- Open the catalog node from the left side and
click on the "Query the catalog" node.
- Now the following form appears on the right side.
- Just type the words that you want to search documents
for and choose the Search button. The files that contain the search criteria
are listed below as hyperlinks. When you click on the hyperlink the document
opens. It is very simple and very fast—much better than using the base Explorer
search.
- While searching for text, the Indexing Service
assigns a rank to each search result. You can sort by rank or by Title, Path,
Size or Timestamp.
- For advanced searches you need to learn the query
syntax. Click on the ‘Tips for Searching’ hyperlink. Now click on the ‘Query
Syntax’ button. A separate help file opens and provides a complete reference
on the query options.
Advanced query
The syntax is complex and comprehensive. It is
not possible to cover all of it here. However, I will offer an overview of the
types of queries available.
| Free-text query | This is the simple query. Just type words and phrases and search. |
| Phrase query | Here phrases, rather than words, are searched. Just enclose the phrase in inverted commas. |
| Pattern matching query | This uses wildcard-based patterns to find words / phrases. |
| Document properties | These queries work on the properties rather than the content of the files. |
| Vector space queries | Here each word in the search phrase can have a different weight (or importance). |
Here are some examples of various types
of queries possible.
- Document property based search
The properties can be specified by prefixing
these with an @ sign.
@DocTitle = "sample" will find all documents whose title is "sample".
@DocTitle contains "sample" will find all documents where the title contains
the word "sample".
@filename jan*.doc will search for all files whose filename matches jan*.doc.
Multiple properties can be combined with AND / OR / NOT.
All documents contain the ALL property.
This includes the text in all properties, which are Content, Filename, Size
and Write (datetimestamp).
- Boolean search
The AND / OR / NOT keywords are used. These
can be used in various combinations and grouped using brackets.
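One way to picture what AND / OR / NOT do is as set operations on the lists of files that contain each word. The sketch below is an illustration in Python, not the service's own implementation. The small word-to-files index in it is hypothetical.

```python
# Hypothetical index: each word maps to the files that contain it.
index = {
    "budget": {"budget.doc", "policy.doc"},
    "sales": {"budget.doc", "minutes.doc"},
    "travel": {"policy.doc"},
}


def files_with(word):
    """Return the set of files containing the word (empty if unknown)."""
    return index.get(word, set())


# budget AND sales: files that contain both words
both = files_with("budget") & files_with("sales")

# budget OR travel: files that contain at least one of the words
either = files_with("budget") | files_with("travel")

# (sales OR travel) AND NOT budget: brackets become ordinary grouping
not_budget = (files_with("sales") | files_with("travel")) - files_with("budget")

print(sorted(both))        # ['budget.doc']
print(sorted(either))      # ['budget.doc', 'policy.doc']
print(sorted(not_budget))  # ['minutes.doc']
```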
- Proximity search (NEAR)
If you specify "one" NEAR "two", it will search for all documents where these two
words are within 50 words of each other.
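The 50-word rule can be sketched in a few lines of Python. This is only an illustration of the idea. How the service counts the distance between two words internally is an assumption here, and the function name `near` is hypothetical.

```python
import re


def near(text, first, second, max_distance=50):
    """Return True if the two words occur within max_distance words of each other."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    # Compare every occurrence of the first word with every occurrence of the second.
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )


close_text = "one " + "filler " * 10 + "two"  # the words are 11 positions apart
far_text = "one " + "filler " * 80 + "two"    # the words are 81 positions apart

print(near(close_text, "one", "two"))  # True
print(near(far_text, "one", "two"))    # False
```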
- Pattern matching
This is based upon wildcards. ‘*’ means
one or more characters and ‘?’ means a single character.
One nice use of this is to find word
forms. If you query on "walk*", it will
find walking, walked and walker.
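If you want to try wildcard patterns outside the service, Python's standard `fnmatch` module follows similar conventions. This is a sketch for experimenting with patterns only. Note that in `fnmatch` the ‘*’ also matches zero characters, which may differ from the behaviour of Indexing Service. The file names and word list are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical file names to test the jan*.doc pattern against.
filenames = ["jan2003.doc", "january-sales.doc", "june.doc", "jan.xls"]
jan_docs = [name for name in filenames if fnmatchcase(name, "jan*.doc")]
print(jan_docs)  # ['jan2003.doc', 'january-sales.doc']

# Hypothetical word list to test word-form style patterns.
words = ["walk", "walking", "walked", "walker", "talk"]

# '*' here matches any run of characters, including none at all.
walk_forms = [word for word in words if fnmatchcase(word, "walk*")]
print(walk_forms)  # ['walk', 'walking', 'walked', 'walker']

# '?' matches exactly one character.
one_letter = [word for word in words if fnmatchcase(word, "?alk")]
print(one_letter)  # ['walk', 'talk']
```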
- Vector space queries
Here you can specify how important it is
to find each individual word being searched. When multiple words are searched in
simple queries, each of the words must be found.
However, in vector queries, you can specify
the importance (or weight) of each search word. The weight can be from 0 to 1.
The default is 1.
{weight value=.250} express AND
{weight value=.500} computers
This means that finding the word ‘computers’
is twice as important as finding the word ‘express’.
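The effect of weights can be illustrated with a simple scoring sketch in Python. The actual ranking formula used by Indexing Service is not covered here, so the additive score below is purely an assumption. The function name `weighted_score` and the file contents are hypothetical.

```python
import re


def weighted_score(text, weights):
    """Add up the weights of the search words that appear in the text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return sum(weight for word, weight in weights.items() if word in words)


# Same weights as in the query above.
weights = {"express": 0.25, "computers": 0.5}

# Hypothetical file contents.
documents = {
    "a.doc": "Express delivery of computers",
    "b.doc": "Computers for schools",
    "c.doc": "Express train timetable",
    "d.doc": "Annual report",
}

scores = {name: weighted_score(text, weights) for name, text in documents.items()}

# Keep only files that matched something, highest score first.
ranked = sorted(
    (name for name, score in scores.items() if score > 0),
    key=lambda name: scores[name],
    reverse=True,
)
print(ranked)  # ['a.doc', 'b.doc', 'c.doc']
```

A file that contains only ‘computers’ ranks above one that contains only ‘express’, which is the practical meaning of the higher weight.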
Programmatic querying
Apart from the query syntax, a full-featured
SQL syntax is also available for querying the catalog programmatically.
The OLE DB provider and the query syntax can be used
together to build advanced knowledge management solutions.
In fact, IIS, Exchange and SharePoint internally use the Indexing Service as the
base catalog management and search engine.
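As a rough sketch of what a programmatic query might look like, the Python helpers below only build the connection string and the SQL text. They assume the provider name MSIDXS, the SCOPE() function and the CONTAINS predicate of the Indexing Service SQL dialect. Please check these against the product documentation before relying on them. The function names and the catalog name ‘Docs’ are hypothetical.

```python
# Assumed shape of an Indexing Service SQL query; verify against the documentation.
QUERY_TEMPLATE = (
    "SELECT {columns} FROM SCOPE() "
    "WHERE CONTAINS(Contents, '\"{phrase}\"') "
    "ORDER BY Rank DESC"
)


def build_connection_string(catalog_name):
    """Build an OLE DB connection string for the given catalog (assumed format)."""
    return "Provider=MSIDXS;Data Source={0}".format(catalog_name)


def build_content_query(phrase, columns=("Filename", "Path", "Rank")):
    """Build a SQL statement that searches file contents for a phrase."""
    if '"' in phrase:
        raise ValueError("double quotes are not supported in this sketch")
    escaped = phrase.replace("'", "''")  # double single quotes inside the SQL literal
    return QUERY_TEMPLATE.format(columns=", ".join(columns), phrase=escaped)


print(build_connection_string("Docs"))
print(build_content_query("annual budget"))
```

On a Windows machine with the service running, these two strings could then be passed to ADO, for example through `win32com.client.Dispatch("ADODB.Connection")` from the pywin32 package. That step is left out here because it cannot run without the service.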
About the Author: Dr Nitin
Paranjape is the Chairman and MD of Maestros (Mediline). He is a consultant
with many organisations, covering appropriate technology utilisation, business
application of relevant technology, application architecture and audit as
well as knowledge transfer. He has authored more than 650 articles on various
technology-related subjects. He can be contacted at nitin@mediline.co.in