|
Lead
Managing unstructured data
The exponential growth of unstructured data and the need for
regulatory compliance have made the management of unstructured information
a critical imperative in the enterprise.
By Vinita Gupta
Managing unstructured data is a problem that has been around for as long
as people have been using computers. For much of that time, most
organizations were not even aware that they had a problem on hand. Today,
however, it has become a pressing concern. The reason is the rapid growth
of unstructured and semi-structured information in the enterprise.
Unstructured data is now four times larger in volume than structured data.
While the ability to exploit unstructured data has turned into a competitive
differentiator, little effort is being spent on managing and analyzing it.
US federal regulations, such as Sarbanes-Oxley and HIPAA,
have made organizations responsible for these mammoth, scattered piles of information.
It is a gigantic task that they cannot accomplish easily. Before looking
at the challenges faced by companies in managing unstructured data, it is
important to understand what exactly unstructured data is, where it comes from
and where it is stored.
|
Apart from core applications, there is a large volume of unstructured data
which is critical for business as a great deal of processing is done in the
form of e-mails.
- Sanjay Sharma
CTO,
IDBI Bank
|
Data management is not an IT job. It is the top management who should take
initiatives to formulate a suitable policy taking into account
the nature of the business.
- Babu V
Nodal officer-CashTree/BANCS group of ATMs with eFunds International
|
Defining unstructured data
Wikipedia defines unstructured data as unstructured information, referring to
masses of (usually) computerized information which either have no data
structure or have one that is not easily readable by a machine. Examples of
unstructured data include audio, video and unstructured text such as the body
of an e-mail message or word processor document. Data with some form of
structure may also be referred to as unstructured data if the structure is not
helpful in processing it. For example, an HTML web page is highly structured,
but this structure is often oriented towards formatting, rather than towards
performing more complex tasks with the contents of a page.
A great deal of confusion surrounds the subject of data structures. General
agreement exists that database information is structured information as data
and its associated metadata are tightly coupled in this instance. The way to
determine whether data is structured or not is to ask whether or not the data
can be sorted. If the answer is yes, the data is structured.
Substantial portions of data can be classified as semi-structured data. How
would you differentiate semi-structured data from unstructured data? It is generally
believed that e-mail is semi-structured while videos, pictures, audio files,
and medical images are unstructured. While word processing documents and presentations
are considered unstructured, they are actually semi-structured documents. Consequently,
if a search is possible using standard tools like Google, the data is either
structured or semi-structured.
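The two rules of thumb above (can the data be sorted, and can it be searched
with a standard tool) can be written down as a small decision function. The
following Python sketch is only an illustration: the `DataSource` class, the
`classify` function and the three sample sources are hypothetical names
invented here, and the yes/no answers for each source are assumptions that an
organization would have to settle for its own data.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    """A hypothetical description of one kind of data held by a company."""
    name: str
    sortable: bool    # can the records be sorted on a field?
    searchable: bool  # can a standard search tool find content in it?


def classify(source: DataSource) -> str:
    """Apply the sort test first, then the search test."""
    if source.sortable:
        return "structured"
    if source.searchable:
        return "semi-structured"
    return "unstructured"


# Illustrative sources; the flags are assumptions, not measurements.
examples = [
    DataSource("customer table", sortable=True, searchable=True),
    DataSource("e-mail archive", sortable=False, searchable=True),
    DataSource("medical images", sortable=False, searchable=False),
]

labels = {source.name: classify(source) for source in examples}
```

Here `labels` maps the customer table to "structured", the e-mail archive to
"semi-structured" and the medical images to "unstructured", in line with the
examples given in the text.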
A clear understanding of the different types of data helps in
managing them differently. As a substantial portion of data in the public domain
is either unstructured or semi-structured, special analytical tools are needed
to analyze and understand it. Nevertheless, semi-structured and unstructured
data are at present generally both treated as unstructured in the IT world.
- How and where to store the large volumes
of unstructured data, and also the hardware costs, management overhead
costs, etc., associated with storage.
- How to manage retention of unstructured
data and the archival policies for it.
- How to secure unstructured data and ensure
consistent security.
- Finally, how to ensure the availability and recoverability
of unstructured data.
|
A difficult objective
In any organization a large number of electronic transactions take place. Due
to technology inroads in all fields, a data explosion is visible. Just going
in for a centralized architecture will not yield results unless proper data
management takes place through the establishment of a data warehouse.
T G Dhandapani, Corporate CIO, SCL-TVS Group, said, "Nowadays, more importance
is being attached to understanding, analyzing and adding intelligence to unstructured
data." This is because valuable information about an organization's products
and services is available in unstructured form on Web sites, blogs, mails,
minutes, notes, etc. Merrill Lynch estimates that more than 85% of business
information exists as unstructured data. The challenge is to understand where
such data exists and to have an appropriate tool available to convert unstructured
data into meaningful information.
"Unstructured data mostly comes from sources like e-mail messages. Sometimes
customers complain that online transactions are not reflected in their accounts
in case of transactions routed through different delivery channels," stated
Babu V, Nodal officer-CashTree/BANCS group of ATMs with eFunds International.
He pointed out that companies typically have a number of information systems
in place, which adds to the challenge. This is more so in the case of organizations
that have legacy systems. Converting them to suit current needs through data
enrichment is a bigger challenge that some of the banks are facing today.
Sanjay Sharma, CTO, IDBI Bank, revealed, "Apart from
core applications, there is a large volume of unstructured data which is critical
for businesses as a great deal of processing is done in the form of e-mail messages."
Extracting the required data from the unstructured lot at the right time is difficult,
hence it is crucial for organizations to have clear policies and discipline
for storing data.
|
Nowadays, more importance is being attached to understanding, analyzing and
adding intelligence to unstructured data.
- T G Dhandapani
Corporate CIO,
SCL-TVS Group
|
Unstructured data management in organizations is mostly done by
individuals themselves.
- Vijay Sethi
VP-Information Systems,
Hero Honda
|
Data management
The way unstructured data is managed may differ from one institution to
another, depending upon the nature of the business and the access controls
put in place at the time of customization. The three Rs of data management
are to make the Right information available to the Right people at the
Right time.
Acknowledging the fact that most organizations struggle with data and knowledge
stored in unstructured form, Vijay Sethi, VP-Information Systems, Hero Honda,
stated, "Unstructured data management in organizations is mostly done by
individuals themselves." They put data in folders and files as per their convenience
and then spend a fair amount of time searching for the right data. Common techniques
for structuring text usually involve manual tagging with metadata, followed by
further text mining-based structuring.
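Manual tagging of the kind described above can be pictured as attaching a few
key-value labels to each document and then looking documents up by those
labels instead of by folder. The Python sketch below is a minimal illustration
under stated assumptions: the file names, the tag keys (`type`, `department`)
and the helper functions `build_tag_index` and `find` are all hypothetical and
do not describe any product or any company's actual practice.

```python
from collections import defaultdict

# Hypothetical documents with manually assigned metadata tags.
tags_by_doc = {
    "minutes_2007_03.doc": {"type": "minutes", "department": "finance"},
    "dealer_complaint.eml": {"type": "e-mail", "department": "sales"},
    "board_notes.doc": {"type": "minutes", "department": "board"},
}


def build_tag_index(tags_by_doc):
    """Map each (key, value) tag pair to the set of documents carrying it."""
    index = defaultdict(set)
    for doc, tags in tags_by_doc.items():
        for key, value in tags.items():
            index[(key, value)].add(doc)
    return index


def find(index, **criteria):
    """Return the documents that match every key=value criterion given."""
    matches = [index.get((key, value), set()) for key, value in criteria.items()]
    if not matches:
        return []
    return sorted(set.intersection(*matches))


tag_index = build_tag_index(tags_by_doc)
finance_minutes = find(tag_index, type="minutes", department="finance")
all_minutes = find(tag_index, type="minutes")
```

With these sample tags, `finance_minutes` holds only the finance minutes file,
while `all_minutes` holds both minutes documents. The tags, once assigned, are
also the kind of structured input that a text mining tool could build on.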
The basic solution that Hero Honda currently uses is to store its data in directory
formats and structures which give some meaning. On the systems front, the company
microfilms and archives data and files, and indexes them for easier
search and retrieval. It has also considered some knowledge management solutions.
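Indexing files for search and retrieval is commonly done with an inverted
index, which maps each word to the files that contain it. The Python sketch
below shows the idea on a handful of in-memory texts. It is an assumption-laden
illustration, not a description of Hero Honda's system: the paths, the sample
texts and the functions `build_inverted_index` and `search` are invented for
this example.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")


def build_inverted_index(documents):
    """Map each lowercase word to the set of document paths containing it."""
    index = defaultdict(set)
    for path, text in documents.items():
        for word in WORD.findall(text.lower()):
            index[word].add(path)
    return index


def search(index, query):
    """Return the paths of documents that contain every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


# Hypothetical archive: directory-style paths mapped to extracted text.
archive = {
    "sales/2006/dealer_letter.txt": "Dealer requests revised delivery schedule",
    "hr/policies/leave.txt": "Revised leave policy for plant staff",
    "sales/2007/schedule.txt": "Delivery schedule for northern dealers",
}

index = build_inverted_index(archive)
results = search(index, "delivery schedule")
```

In this sample, `results` contains the two files under `sales/`, since both
mention "delivery" and "schedule". The directory path still carries some
meaning, but the index removes the need to remember where a file was put.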
Babu said, "In our organization the data which has originated
from transactions at the data center is mostly in a structured format and the
same is kept in a centralized place for access by the employees. Much importance
is not given at present to e-mail activity by employees, even though as per
the Sarbanes-Oxley Act the top management should be aware of and is responsible
for all the activities of the employees." He asserted that management of
data will help financial institutions improve their efficiency. Availability
of correct data has assumed much importance with the implementation of Basel
II knocking at the doors of financial institutions. This is a major challenge
faced by some public sector banks possessing legacy data which is not of much
importance in the current scenario. Some of the banks have started data warehouse
activities to keep their data in one place and in a structured format.
| Attributes | Structured | Semi-structured | Unstructured |
| Sorting feasibility | Possible | Not possible | Not possible |
| Searching facility using standard tool | Possible | Possible | Not possible |
| Sensing | Possible | Possible | Possible |
"Data management is not an IT job. It is the top management who should take
initiatives to formulate a suitable policy taking into account the nature of
the business, and use technology to make data available as per the policy,"
added Babu.
IDBI has a document management system in place which takes care of the data
in Word documents and images. Now the bank is looking at some solutions that
can help in the easy retrieval of unstructured data. Sharma said, "If cost
prevails, instead of upgrading the existing solutions, we will migrate to a
new solution."
Dhandapani revealed that the major problem in the management of unstructured data
has been the limited availability, or outright lack, of tools to extract data
from the public domain systematically and continuously. There is also a lack of
tools to analyze data available in various languages. Consequently, what is known
is known, and what is not picked up remains under the carpet.
In the near future there will be a growing awareness of the need for correct data
in a structured format, mostly online, to take care of compliance issues.
vinita.gupta@expressindia.com
|