
The mystery of rdb$db_key IV.

    As in any practical case, the extra power of db_key is needed when the utmost performance is required. In the last example on the prior page, if there are only 42 employees, as is the case in the sample employee.gdb database that comes with IB, using db_key to find and delete employees is clearly overkill. The idea was to use an example with a database that any IB user can play with. But in real life there are cases with multiple dependencies on tables that have millions of records, and then taking the time to write a small stored procedure to automate and optimize a job that may be run several times a month is an effort that pays for itself.

Point d) This point is pending from page II of this theme. There are some interesting tricks. For example, let's assume we want to make a column unique, but several records have the same value in this column. We will simply get rid of the duplicates and leave only one record for each distinct value. Let's assume the column is already defined as NOT NULL (if it accepted NULLs, the default at definition time, changing it to NOT NULL is another story). A typical table like this needs some amendment:
create table repet(a int not null);

However, trying to put a PK on it fails because there are repeated values. We could build a stored procedure to get rid of the duplicates, but there's a neat trick from Ruslan Strelba. In his own words, "use undocumented rdb$db_key column so you can use such statement", and this is the tip:

select
c0.some_value
from your_table c0, your_table c1
where
c0.some_value=c1.some_value and
c0.rdb$db_key>c1.rdb$db_key
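To see concretely which rows that self-join picks out, here is a toy model in Python (the data is hypothetical; an integer stands in for rdb$db_key):

```python
# A toy model of the self-join trick: each row is (db_key, value).
# The condition "c0.some_value = c1.some_value and
# c0.rdb$db_key > c1.rdb$db_key" selects every row of a group
# except the one with the smallest db_key.
def duplicates_selected(rows):
    return [k0 for (k0, v0) in rows
               for (k1, v1) in rows
               if v0 == v1 and k0 > k1]

table = [(1, 'a'), (2, 'b'), (3, 'a'), (4, 'a')]
hits = duplicates_selected(table)
# 'a' occurs three times (keys 1, 3, 4): key 3 is selected once,
# but key 4 twice, so the plain SELECT repeats rows when a value
# occurs three or more times.
```

With exactly two occurrences of a value, the join yields that value's single surplus row, which is why the plain form works only for that case.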

This works when you have two occurrences of the same value in the table: it will show only one of the two records, so you can wipe it out and you'll be left with one occurrence. However, keep in mind that this simple form doesn't work when a value is repeated three or more times. You can put this statement in a stored procedure and it will clean the table:

for select
c0.rdb$db_key
from your_table c0 join your_table c1
on c0.some_value=c1.some_value and
c0.rdb$db_key>c1.rdb$db_key
into :dbk do
   delete from your_table yt where yt.rdb$db_key  = :dbk;

There's no need for a begin-end block in the loop because the DELETE is the only instruction. Now, let's assume you don't want to build a stored procedure. First, let's use the original statement, but with the table "repet" shown previously:

delete from repet r
where r.rdb$db_key in (
select c0.rdb$db_key
from repet c0, repet c1
where
c0.a=c1.a
and c0.rdb$db_key>c1.rdb$db_key)

But after executing it, oh surprise, no record has been deleted! Whether this is a bug in the IB optimizer or an esoteric problem is left as an exercise. I will take advantage of this to preach the use of the EXPLICIT JOIN SYNTAX, and we will see that this time the deletion succeeds:

delete from repet r
where r.rdb$db_key in (
select c0.rdb$db_key
from repet c0 join repet c1
on c0.a=c1.a
and c0.rdb$db_key>c1.rdb$db_key)

You are probably interested in getting a listing of the records to delete while leaving only one representative of each repeated value, no matter how many times it appears. I've found that this is the general statement that must be put in the procedure instead of the one shown above:

for select distinct c0.a, c0.rdb$db_key
from repet c0 join repet c1
on c0.a=c1.a and
c0.rdb$db_key>c1.rdb$db_key
into :dummy, :dbk do
   delete from repet r2 where r2.rdb$db_key = :dbk;
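What the DISTINCT version achieves can be modeled in a few lines of Python (a sketch with hypothetical data, not the engine's actual behavior):

```python
# Model of the general dedup delete: rows are (db_key, value) pairs.
# "doomed" mirrors the DISTINCT subselect: every db_key that has a
# twin with the same value and a smaller db_key.  Deleting that set
# leaves exactly one representative per value, however many times
# the value was repeated.
def dedupe(rows):
    doomed = {k0 for (k0, v0) in rows
                 for (k1, v1) in rows
                 if v0 == v1 and k0 > k1}
    return [(k, v) for (k, v) in rows if k not in doomed]
```

Note that the survivor of each group is always the row with the smallest db_key, since it is the only one with no smaller twin.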

The other alternative is to build a stored procedure that walks the table in order and always skips the first value of a group of identical values, but this requires a flag and a temporary variable.
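That walking alternative can be sketched as follows (Python standing in for the PSQL loop, with hypothetical data):

```python
# Sketch of the "walk in order" alternative: scan rows sorted by the
# column, keep the first row of each group of equal values, and mark
# the rest for deletion.  "prev" plays the role of the temporary
# variable; comparing against the sentinel plays the role of the flag
# for the first record.
def dedupe_by_walking(rows):
    prev = object()                 # sentinel: equal to nothing
    keep, doomed = [], []
    for k, v in sorted(rows, key=lambda r: r[1]):
        (doomed if v == prev else keep).append(k)
        prev = v
    return keep, doomed
```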

Now on to another example, with data you can test for yourself. I've decided to use the salary_history table again, because its primary key is
(EMP_NO, CHANGE_DATE, UPDATER_ID)
This table is intended to keep the history of salaries that an employee has had while working for the company. Of course, the employee's number doesn't suffice to make up a PK, because the same employee will appear several times as his/her salary changes. The identification of the user_id making the update is part of the PK to preserve consistency. Now, our historical table has grown too much, so we backed up its information and intend to leave only the most recent change for each employee. Hence, the simplest way is to ensure that emp_no is not repeated. First, we want to know how many repetitions there are:

select count(*)-count(distinct emp_no) from salary_history
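What this expression computes can be checked on a toy emp_no column (the data below is made up):

```python
# count(*) - count(distinct emp_no) over a toy emp_no column: the
# difference is exactly the number of rows that must go so that each
# employee number is left only once.
emp_no = [10, 24, 10, 36, 10, 24]
extra = len(emp_no) - len(set(emp_no))
# six rows, three distinct values -> three rows to delete
```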

If you haven't fiddled with this database before, you should get 16 as the answer, so we know we have to delete 16 records. This can be accomplished easily in a grid, but what if a real case returns 1500, for example? Clearly, a manual operation would only be chosen by a bureaucratic DB admin seeking ways to justify his full-time contract. At the risk of being boring, we will walk through several alternatives:

1.- A convoluted statement can be the most immediate solution. For each group of records that have the same emp_no, we need to preserve the one that has the most recent date. In terms of date comparisons, this means the one with the highest date. By default, indexes are ascending, so getting the minimum is fast, but getting the maximum won't get any help from the index. Fortunately, the designers of the sample employee.gdb database already defined the index we need:
CREATE DESCENDING INDEX CHANGEX ON SALARY_HISTORY (CHANGE_DATE)

Now, the problem is to write the SQL statement itself. Clearly, we need a DELETE with a nested SELECT clause. This doesn't suffice, however: we need the maximum not for the whole table but for the employee that's being deleted, so the inner statement depends upon the outer statement. This is called a correlated query, although the common case involves two SELECTs, not a DELETE plus a SELECT statement.
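Before writing the SQL, the intended effect can be sketched in Python (a toy model over hypothetical (emp_no, change_date) pairs):

```python
# The correlated delete in miniature: keep, for each emp_no, only the
# rows whose change_date equals that employee's maximum, i.e. delete
# every row older than the per-employee max(change_date).
def keep_latest(history):
    latest = {}
    for emp, date in history:
        if emp not in latest or date > latest[emp]:
            latest[emp] = date
    return [(e, d) for (e, d) in history if d == latest[e]]
```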

delete from salary_history es
where change_date < (select max(change_date) from salary_history es2
   where es2.emp_no=es.emp_no)

and IB will answer with the following plan:
PLAN (ES2 ORDER CHANGEX)
PLAN (ES NATURAL)

This looks both good and bad. MAX is using the descending index for the inner select, but the outer scan performing the deletion is not using it. It could, because the LESS THAN condition would walk the same index again to find the first value less than the inner select's result and then continue walking the index to retrieve the rest of the values. Instead, IB has decided to make a natural scan, too slow if the table has thousands of records. The problem is the correlation, namely the fact that the inner SELECT depends on the outer statement. To verify that IB can use such a descending index with LESS THAN, just prepare the statement
delete from salary_history es
where change_date < 'today'

and IB will answer PLAN (ES INDEX (CHANGEX)), so there's room for enhancement. However, in this case such an index doesn't help our goal, so we will leave it. Let's build a procedure, then:

set term ^;
create procedure del_old_salary_history
as
declare variable dbk char(8);
declare variable dbk_prev char(8);
declare variable emp_no smallint;
declare variable emp_no_prev smallint;
begin
   emp_no_prev = NULL;
   FOR select sh.rdb$db_key, emp_no
   from salary_history sh
   order by emp_no, change_date
   INTO :dbk, :emp_no
   DO BEGIN
      IF (emp_no = emp_no_prev)
      then delete from salary_history where rdb$db_key = :dbk_prev;
      emp_no_prev = emp_no;
      dbk_prev = dbk;
   END
end ^
set term ;^

Let's explain what it does: first, we need to wipe out all past salaries for a person, namely all salaries but the newest for each employee. So we order by emp_no and change_date. Since there's an index due to the PK that spans emp_no, change_date and updater_id, ordering even by emp_no only will cause the engine to use the index and hence the correct order of the three fields. Since a PK generates an ascending index automatically and we're using that index, we need to pick the latest record for each employee, which makes the logic more complex.

We trap the db_key and the emp_no of the current record. If the employee number has changed, we have a new employee. This means the previous record is the latest for the previous employee and the one we should keep. Since we track the prior employee number and the previous db_key in auxiliary variables, we know what the previous record was. Conversely, if the employee number didn't change, then we have the same employee and the prior record (not the current one) can be deleted, since it was not the latest for that employee. This is the reason we keep the previous db_key: to delete the prior record by position. Now, on to the boundary cases:

  • The first record: we could assume that an employee number is positive. However, to make the case generic, the auxiliary variable is assigned NULL. Since emp_no is part of the PK, it cannot be NULL, hence the first record won't be deleted by accident, since any value compared with NULL is unknown and this means FALSE in practice. Using a negative value for the auxiliary variable would work in this case, but in other cases negative values in the PK might be allowed, hence NULL is safer. 
  • The last record: when inspecting the last employee, there may be one or more entries in salary_history. With N being the last record and N-1 the prior one, we can see: if N-1 and N are records for the same employee, N-1 is deleted in the last iteration; otherwise, N-1 is preserved. Whether or not N is the only record for the last employee, it won't be compared with anything, since there won't be another iteration, hence it will be preserved.
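The whole procedure, boundary cases included, can be simulated in Python (a sketch over hypothetical rows, with an integer standing in for rdb$db_key):

```python
# Simulation of del_old_salary_history: rows are
# (db_key, emp_no, change_date) triples.  Scan in (emp_no, change_date)
# order; whenever emp_no repeats, the *previous* record cannot be that
# employee's latest, so its db_key is marked for deletion.
def del_old_salary_history(rows):
    ordered = sorted(rows, key=lambda r: (r[1], r[2]))
    emp_prev, dbk_prev = None, None   # NULL start: first row survives
    doomed = set()
    for dbk, emp, _date in ordered:
        if emp == emp_prev:
            doomed.add(dbk_prev)
        emp_prev, dbk_prev = emp, dbk
    return [r for r in rows if r[0] not in doomed]
```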

It has been said that ordering by an index is not a very good idea, since the engine has to read random pages and load them into the cache to fetch each record in the order mandated by the index. However, the assumption in this case is different: each time our condition matches, we delete the record that was loaded previously, so there's a great chance that its page is still loaded in memory when the deletion happens. Since we are using db_key, there's no index search. Also, since the procedure does the whole job, the operation should be fast, so there's not much chance that other requests unload our previously accessed page before we need it.

Still pending:

Other examples: drawbacks.

Never use db_key+summary

More on db_keys.

 

This page was last updated on 2001-01-12 23:57:46