
The mystery of rdb$db_key II.

    Probably the first questions that come to mind are why a person would want to use db_keys, and when and for how long they are valid in the database. As a general answer, people drop to low levels when they can't do what they want at higher levels of abstraction, or when they need better performance than standard SQL gives. Now, on IB specifically:

  a) A db_key is faster than even a PK. Don't be surprised: a PK must be mapped through an index into a raw position, whereas a db_key already is this raw record position. Jump to the explanation.
  b) If for some special reason a table has no PK, or its indexes are inactive (so exact duplicates may exist), db_keys are the only way to identify each record in a table unambiguously. Jump to the explanation.
  c) Several statements run faster when put in a stored procedure using a db_key than as the original SQL statement. Typical examples are updates and deletions with complex conditions. Jump to the explanation.
  d) Some tricks can be used to write special statements that, without db_keys, would require procedural logic instead of declarative logic (SQL) to be resolved. Jump to the explanation (on page IV).

The duration of db_keys is a point that's not very clear in the manuals unless you read the API Reference Guide with a sharp eye. By default, a db_key is valid only for the duration of the current transaction. After you use Commit or Rollback, the db_key values you had are no longer reliable. If you are using auto-commit, you can still use db_keys, because this mode uses an implicit CommitRetaining: the transaction context is retained, so garbage collection cannot proceed and hence the db_keys remain valid until you do a "hard" Commit.
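
As a minimal sketch of that default scope (the table clients and its columns are invented for the example), an application grabs a db_key and uses it inside the same transaction:

select rdb$db_key, client_id, name from clients;
/* keep the 8-byte value of the row you want to touch... */
update clients set name = 'ACME Ltd.' where rdb$db_key = ?;
/* ...and pass it as the parameter, in the SAME transaction */
commit;
/* from here on, the value grabbed above may point to a record
version that was garbage collected, so it must be fetched again */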

At the risk of being repetitive, the explanation is that when a txn is committed, garbage collection may proceed, so "old record versions" can be collected and "compacted", namely, marked as free to use. Also, another txn might have deleted a record you were using, but your txn isolation level prevented you from being aware of this change. A db_key grabbed while the txn was active may now point to a record version that doesn't exist anymore. If you took a long time to update the record identified by the db_key, you'll want to check that the record has not been changed by another transaction in the meantime. IB helps track some of these conflicts; see writing behavior among conflicting transactions.

You can change the default duration of db_key values at connection time by using the API (the database parameter block item isc_dpb_dbkey_scope) or IBO. You can specify that the values must be kept for the entire session, that is, from the time you connect to the db until the time you disconnect. However, keep in mind that this means garbage collection is stalled for all this time: internally, IB keeps a transaction open that stops garbage collection. In highly interactive environments, this may result in your database file growing at a high rate and the operations in the database becoming slower and slower. As a corollary, keeping an auto-commit transaction working all day, using only CommitRetaining, or simply leaving a transaction uncommitted for several hours without need is not a wise solution.

Now, we will briefly analyze the four points on using db_keys that were shown above as possible advantages or workarounds:

Point a) A db_key is faster than a PK: one less level of indirection. However, if the database pages needed are already in main memory, and the requested subset is small and exhibits locality (all the records are near each other), the difference in access speed shouldn't be noticeable.
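
To make the difference concrete, here's a sketch with the same invented clients table; both statements do the same thing, but the first walks the PK index while the second goes straight to the raw position:

update clients set credit = 0 where client_id = 1234;
/* the PK value must be mapped through the index to a raw position */
update clients set credit = 0 where rdb$db_key = ?;
/* the parameter, a db_key fetched earlier in this txn, IS the raw position */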

Point b) Sometimes indexes are turned off, or not defined at all, for efficiency reasons. For example, a table used as an audit trail may be left unindexed, so inserting the actions and data related to the operations on other tables carries only a small penalty. Should you need to make interactive alterations to that data, your only means of identifying a record is its raw position, i.e., the db_key. When there's no key, IBO automatically uses the db_key. This method works well with updates and deletions. However, newly inserted records cannot be shown: how do you know what value of db_key was assigned to the new record? You can only refresh your query and fetch the records again.
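
For instance, here is a sketch of purging exact duplicates from such an unkeyed table, keeping one copy of each (the table audit_log and its columns are invented names); since no column distinguishes the copies, only the db_key can:

delete from audit_log t1
where t1.rdb$db_key < (select max(t2.rdb$db_key)
from audit_log t2
where t2.action = t1.action and t2.stamp = t1.stamp);
/* rows with identical action and stamp are duplicates; only the copy
with the highest raw position survives; assumes no NULLs in the columns */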

Point c) The IB optimizer still has trouble with some statements. If you try to run something like

update tableA A
set sumfield = (select sum(B.valfield)
from tableB B where B.FK = A.PK)
where condition

against a huge table, you are likely to find performance problems. If you run the same operation often, then it's worth the time to write a stored procedure doing

for select B.FK, sum(B.valfield) from tableB B
group by B.FK
into :b_fk, :sum_vf do
   update tableA A set sumfield = :sum_vf
   where A.PK = :b_fk and condition
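
For completeness, here's a sketch of the full procedure that first loop would live in; the variable types are my assumptions, since the definitions of tableA and tableB aren't shown:

create procedure recalc_sums
as
declare variable b_fk integer;         /* assumed type of tableB.FK */
declare variable sum_vf numeric(15,2); /* assumed type of sum(B.valfield) */
begin
  for select B.FK, sum(B.valfield) from tableB B
  group by B.FK
  into :b_fk, :sum_vf do
    update tableA A set sumfield = :sum_vf
    where A.PK = :b_fk /* and condition */;
end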

Although speedier, it still has the problem that records in A have to be located by PK each time a new pass of the for/do loop happens. Some people claim better results with this alien syntax:

for select [B.FK,] sum(B.valfield), A.rdb$db_key
from tableB B join tableA A on A.PK=B.FK
where condition
group by B.FK
into [:b_fk,] :sum_vf, :dbk do
   update tableA A set sumfield = :sum_vf
   where A.rdb$db_key = :dbk

The last form seems to have some advantages:

  • First, the filtering of the records common to A and B can be done in an efficient manner if the optimizer can derive a good filter from the explicit join.
  • Second, the WHERE comes next to do additional filtering (depending on the clause it contains, the user may append it with an AND to the basic JOIN condition) before the update checks its own condition.
  • Third, the dependent table (A) has its records located by raw db_key values, extracted at the time of the join, so it's faster than looking them up through the PK.

I put two parts between square brackets because IB6 doesn't require them, but older IB versions might insist that the field used in the GROUP BY be present in the SELECT clause. The second optional part depends on the former, because all fields listed in the SELECT must be received in a variable after the INTO clause, in their order of appearance. As mentioned before, dbk, the variable holding the db_key value in the example, must be defined as char(8) because we are dealing with a table.
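
Inside the procedure, the declaration that loop needs would simply be:

declare variable dbk char(8); /* the db_key of a plain table is 8 bytes */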

Usually, it's not possible to specify an explicit query plan in insertions, updates or deletions, so the only way to proceed, in case you get slow performance, seems to be using a procedure. You may wonder why I include insertions. Well, an insert statement is not constrained to a single sequence of values sent by the client application. In fact, it can be as complicated as "insert into this table the summary, average and count of distinct values from the join of these two tables while condition C.e=somevalue is met, but only if the item inserted is not already in this table; otherwise perform an update". There's a special syntax for inserting the results of a select statement:

insert into tableA(pkey, summary, average, count_of_distinct)
select C.pkey, sum(B.a), avg(B.b), count(distinct C.c) 
from tableB B join tableC C on B.a = C.h and C.e = somevalue
where C.pkey not in (select pkey from tableA)
group by C.pkey 

Of course, the update part has to be left for another statement. To compare relative performance, you can change
where C.pkey not in (select pkey from tableA)
into
where not exists(select pkey from tableA A where A.pkey = C.pkey)
It's possible to rewrite the whole statement as a stored procedure. You have to test whether this is faster or slower, and remember that the statement shown above covers only the insert portion, not the update portion as the following procedure does:

for select C.pkey, sum(B.a), avg(B.b), count(distinct C.c)
from tableB B join tableC C on B.a = C.h and C.e = somevalue
group by C.pkey
into :c_key, :sum_a, :avg_b, :count_c do
begin
   dbk = NULL; /* reset: a singleton select that finds no row leaves
   the variable untouched, keeping the previous iteration's value */
   select A.rdb$db_key from tableA A where A.pkey = :c_key
   into :dbk;
   [if (sum_a is NULL) then sum_a = 0;
   if (avg_b is NULL) then avg_b = 0;]
   if (dbk is NULL)
   then insert into
   tableA(pkey, summary, average, count_of_distinct)
   values(:c_key, :sum_a, :avg_b, :count_c);
   else update tableA A set
   summary = summary + :sum_a, average = average + :avg_b,
   count_of_distinct = count_of_distinct + :count_c
   where A.rdb$db_key = :dbk;
end

It might be the case that B.a and B.b are allowed to be NULL, and then this procedure would be the only native way of dealing with that condition before the insert or update. That is the purpose of the two lines between square brackets: the brackets mean they must be enabled if NULLs can occur. Remember that the accepted comment delimiters in InterBase procedures and triggers are the same as in the C language, so if you really want to leave these lines in the code but without effect, you must enclose them between /* and */ so they are bypassed. Conversely, count() can never be NULL, because it only counts records. This is a rule you'll want to remember:

  • When you make a count(*) in a table, you are counting all records.
  • When you make a count(field), you are counting all records where field is not NULL.
  • When you make a count(distinct field), you are counting only the different values in that field, namely, all repetitions of the same value account for one item. Warning: if the field allows NULLs, they are ignored no matter how many NULLs there are, so all the repetitions of NULL together don't even account for one item.
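
A tiny worked example, assuming a table t whose column f holds the five values 1, 1, 2, NULL and NULL:

select count(*) from t;          /* returns 5: all records */
select count(f) from t;          /* returns 3: the NULLs are skipped */
select count(distinct f) from t; /* returns 2: only the values 1 and 2 */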

If you don't mind writing convoluted statements, you can count all NULLs as one more occurrence in a "distinct" clause with a weird syntax like this (tableD stands for your table, since TABLE itself is a reserved word):
select count(distinct tableD.field) +
(select count(*) from rdb$database
where (select count(*) from tableD t where t.field is null)>0)
from tableD

or in a more standard way:
select count(distinct tableD.field) +
(select count(*) from rdb$database
where exists (select * from tableD t where t.field is null))
from tableD
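
Incidentally, the counting rules above give a direct way to obtain the number of rows where the field is NULL, without any subquery:

select count(*) - count(field) from tableD
/* all records minus the records where field is not NULL */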

So, I hope you understood that in the worst case the result of count is zero; it can be neither NULL nor negative.

More on db_keys.

 
