|
Recap
The purpose of Learning from (others) mistakes
series is to make sure that we prevent similar mistakes from
occurring rather than repeating them. TechForum will present
the causes, and suggest corrective and preventive steps for
commonly observed mistakes.
In the last article, I described a scenario of business discontinuity
due to a very harmless and unintentional error on the part
of a developer.
The entire system stopped responding to 1000+ users simply
because one of the important transactional table was BLOCKED
by a simple SELECT type of query.
In this article, we will explore the causes and remedies for
blocking of tables / rows.
What is blocking?
Blocking simply means a state where one query is waiting for
another query to release a resource which it has locked. When
the running query finishes execution, the locked resource
is released. The waiting query can now lock the resource and
execute.
One of the most important things to understand here is that
BLOCKING IS NORMAL. SQL commands could originate from
various sources like individual user connections, SQL engine
processes, scheduled jobs, components, ASP pages and so on.
SQL Server works on requests for each connection in the order
in which they arrive. Therefore, in a site with lots of users
working in parallel, it is very difficult to predict which
resources will be locked in which order by which connection.
Locking of a resource while updating data, ensures that data
can be updated only by one connection, and no other connection
can interfere with this process.
Granularity of locking
While updating data, the entire table need not be locked.
Only the relevant part of the table needs to be blocked. The
idea is to lock the smallest possible area of a table and
to lock it for the shortest possible time. Various regions
of tables which can be locked are:
Locking granularity is decided at execution time by query
optimiser. The decision is taken depending upon very sophisticated
analysis of the query. The optimiser calculates the number
and distribution of rows which is required to be locked. Based
on these parameters, the cheapest (with least overhead) type
of lock is chosen. Even after initially deciding and using
a particular type of lock, the query optimizer keeps analysing
the actual number of locks required to lock the affected entity
(a page, for example).
There is an internal setting called Escalation Threshold.
This is the total number of locks which are permitted for
the currently selected granularity. If the actual number of
locks exceeds the threshold, the query optimiser automatically
chooses the next higher (broader) granularity available. This
way, the overheads are reduced, query execution time is improved
and the most appropriate locking level is achieved.
Lock timeout
When a resource is locked, other transactions (connections)
can not update it. Others have to wait, in a queue, for the
lock to be released. If the original transaction takes too
long, other waiting connections (and their end users) will
perceive a degradation in performance and may even complain
that the machine / application appears to be hanging.
The only way to know whether a resource is currently locked
is to try to lock it.
There is no islocked() function! Now, if you want to know
if something is locked, and you try to lock it, you are taking
the risk of getting blocked, if the original lock is not released
quickly. This waiting is ENDLESS. Because the default value
of Lock Timeout is -1, that means no timeout periodwait
endlessly.
If you put some value for lock timeout property, the blocked
connection will be released after that period elapses. The
blocked connection is now free to do something else.
How will the application which is handling the connection
know that the transaction is actually aborted due to lock
timeout?
To inform the client application about a forcible cancellation
of blocking, a special error number is returned.
1222 Lock request time-out
period exceeded
This error needs to be trapped and acted upon. The current
transaction will NOT be aborted when locking timeout occurs.
The application developer needs to decide whether to wait,
retry or rollback. Therefore, blindly putting a timeout value
will not necessarily solve all blocking problems. It may even
introduce more problems due to unhandled errors and unpredictable
transaction management.
Reasons for blocking
Something which leads to a critical, often used resource being
locked for a long time will lead to blocking of other connections.
The underlying cause is immaterial. What is important is the
blockade of a shared, critical resource. Here are some common
causes:
1. Long running queries
2. Canceling uncommitted queries
3. Partial processing of query results by application
4. Distributed deadlock
When do you call a transaction LONG RUNNING?
Unfortunately, there is no fixed time limit to call a transaction
long running. The first thing to understand is that whether
a transaction is long running depends upon whether it was
expected or unexpected. It has nothing to do with the actual
time taken to execute the query.
Let me explain. Consider a monthly batch process which churns
lots of data and generates complex reports. This process is
scheduled to run at midnight of the last day of every month.
This process contains a very complex query which takes 30
minutes to execute. The fact that it takes 30 minutes is known
and is acceptable from a business perspective. This query
is expected to take that much time. Therefore, it is NOT a
long running query.
Now let us take another scenario. Consider a Web-based application
supporting live transactions from 1,000 connected customers.
Customer entered data is updated in the database using a stored
procedure. This usually takes 100 milliseconds. This performance
is adequate enough to support the peak load. Now, if this
query takes 10 seconds, IT IS a long running query. Because,
it was not expected to take much time.
Very often the query text is generated at runtime by concatenating
multiple string pieces. This type of requirement is often
present when the query content changes based on various parameters
which are known or controllable only at run-time, not at design-time.
This type of query generation is potentially dangerous. Generally
the query should generate some SQL statement like:
Select <some fields> from a
, b where <join condition> <filter condition>
Now, under some rare conditions, which were not thought of
during testing (do we think during testing ?!!) it could generate
a statement like:
Select <some fields> from a
, b
This still does not sound dangerous enough. If earlier query
resulted in 10 records, how many records do you think will
the second query return? It depends on the total number of
records in both tables. So if both a and b tables had substantial
number of rows (say 1 lac and 10000) the returned number of
rows would be 1000000000.
That would be a realllly long running query!
Prevention and cure
Set Query Timeout
Put a query timeout to avoid long running queries. This should
be done at least for queries which are likely to be long running.
Set lock timeout
This is also a good setting which prevents endless waiting
for locks to be released. Remember to handle the error and
rollback the transaction or retry.
Keep transactions as short as possible
This is easier said than done. Testing with data which is
of real-life-like peak size and connection load is very important.
Most often testing is done in development scenarios with few
rows and no load. This almost always leads to query optimization
using TABLE level locks. When the size grows, different type
of locks may be required. There is no other way of predicting
behavior and performance. You must do stress test with peak
data and user load.
Use Query Governer.... setting
Set Query governer timelimit in SQL Server
This prevents ANY query which exceeds the execution cost mentioned
from running at all. The query timeout setting kills the query
AFTER it has been detected as long running. Query governor
setting simply DOES NOT EXECUTE the query. It detects the
problem BEFORE it occurs.
This is a very useful setting. However, please ensure that
the value entered is tested and proven to high enough based
upon the typical cost of longest possible queries. Otherwise,
genuine queries may also be prevented from running due to
this setting. This setting can be modified using the Server
Properties dialog in Enterprise Manager.
Read
more about the usage before you use the setting on production
server.
Processing
all results
If the application does not process all results returning
from a query, the locks are retained. Ensure that all query
results are processed completely.
Rollback explicitly after query cancellation.
This is a commonly unknown fact. Query cancellation is done
through ODBC. However it DOES NOT implicitly mean rollback.
You need to issue a ROLLBACK in addition to canceling the
query.
Find
out the right setting of Isolation level
The higher the isolation level, the more stringent is the
locking. Therefore, understand what isolation levels are available
and decide the most appropriate one for your application.
In addition, understand the impact of the isolation level
on the locking behavior for various types of SQL commands
by ACTUALLY TRYING OUT EACH TYPE OF COMMAND AND CHECKING FOR
LOCKS PRODUCED.
Never have user activity between BEGIN TRAN and COMMIT
Begin Tran and Commit must be issued after all information
required to process transaction is available. This may sound
obvious and too simple a thing to mention here. However, I
have myself seen BEGIN TRAN being issued on a dynamic recordset
in the FORM - LOAD Event and COMMIT in the FORM - UNLOAD event!
So dont take anything for granted. Check it out.
Tip of the week for the IT Industry
Please press F1
Sounds weird? Not really. All the content I have written in
this and the current article did not come from some costly
book or a secret web site or some great genius. It was , sorry,
IT IS available in Books Online of the SQL Server product!
It is really pitiable that most of the so called great IT
Professionals want more salaries, better perks, higher positions,
overseas opportunities but forget to press F1. Expecting the
best in the world is fine. But for that you need to be (or
at least attempt to be) the best. Fortunately, being the best
in IT is simpler than you thought. Just press F1 often enough
and utilize that knowledge effectively in your work. Thats
it. Sorry Thats IT.
Feedback
Your feedback, suggestions, requests for covering specific
topics or issues are welcome. Please send feedback to techforum@mediline.co.in
 |
About
the Author Dr Nitin Paranjape is the Chairman and MD of
Maestros (Mediline). He is a consultant with many organisations,
covering appropriate technology utilisation, business
application of relevant technology, application architecture
and audit as well as knowledge transfer. He has authored
more than 650 articles on various technology-related subjects.
He can be contacted at nitin@mediline.co.in |
|