The mystery of rdb$db_key IV

As in any practical case, the extra power of db_key is needed when the utmost performance is required. In the last example on the prior page, if there are 42 employees, as is the case in the sample employee.gdb database that comes with IB, using db_key to find and delete employees is clearly overkill. The idea was to use an example with a database that any IB user can play with. But in real life there are cases with multiple dependencies on tables that have millions of records, and there, taking the time to write a small stored procedure to automate and optimize a job that may be run several times a month is an effort that pays for itself.

Point d) This is pending from page II of this theme. There are some interesting tricks. For example, let's assume we want to make a column unique, but several records have the same value in this field (column). We will simply get rid of the duplicates and leave only one record with each distinct value. Let's assume the column is already defined as NOT NULL (if it accepted NULLs, the default at definition time, changing it to NOT NULL is another story). A typical table like this needs some amendment, as in the first sketch below. Trying to put a PK on it fails because there are repeated values. We could build a stored procedure to get rid of the duplicates, but there's a neat trick from Ruslan Strelba, in his own words: "use undocumented rdb$db_key column so you can use such statement"; the tip is the second sketch below. This works when you have two occurrences of the same value in the table: it will show only one record, so you can wipe it out and you'll be left with one occurrence. Keep in mind, however, that this trick doesn't work when a value is repeated three or more times. You can put this statement in a stored procedure (third sketch below) and it will clean the table.
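First, the table with duplicates; a minimal example, assuming the table is named repet, as the text below does, and that its single NOT NULL column is an integer I'll call a:

create table repet (a int not null);
insert into repet values (1);
insert into repet values (2);
insert into repet values (2);
insert into repet values (3);
/* this fails: the value 2 is duplicated */
alter table repet add primary key (a);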
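Second, the tip itself; I'm reconstructing it here as a sketch. Note that the inner SELECT must be a singleton, which is consistent with the pair-only limitation just described: with three or more occurrences of the same value it returns more than one row and the statement fails.

select r1.rdb$db_key
from repet r1
where r1.rdb$db_key <
  (select r2.rdb$db_key
   from repet r2
   where r2.a = r1.a and r2.rdb$db_key > r1.rdb$db_key);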
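Third, the procedure; a sketch, where the name and the declaration of the db_key variable (CHAR(8) CHARACTER SET OCTETS, the natural match for a table's 8-byte db_key) are my choices:

set term ^;
create procedure clean_repet
as
declare variable dk char(8) character set octets;
begin
  for select r1.rdb$db_key
      from repet r1
      where r1.rdb$db_key <
        (select r2.rdb$db_key
         from repet r2
         where r2.a = r1.a and r2.rdb$db_key > r1.rdb$db_key)
      into :dk
  do
    delete from repet where rdb$db_key = :dk;
end ^
set term ;^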
There's no need for a begin-end block in the loop because "delete" is the only instruction. Now, let's assume you don't want to build a stored procedure. First, let's use the original sentence, but against the table "repet" shown previously:
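A sketch of that sentence in standalone form:

delete from repet
where repet.rdb$db_key <
  (select r2.rdb$db_key
   from repet r2
   where r2.a = repet.a and r2.rdb$db_key > repet.rdb$db_key);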
But after executing it, oh surprise, no record has been deleted! Whether this is a bug in the IB optimizer or an esoteric problem is left as an exercise. I will take advantage of this to preach the use of the EXPLICIT JOIN SYNTAX, and we will see that this time it succeeds:
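One way to rewrite it with an explicit self-join (a sketch; the join syntax is the point):

delete from repet
where repet.rdb$db_key in
  (select r1.rdb$db_key
   from repet r1 join repet r2
     on r1.a = r2.a and r1.rdb$db_key < r2.rdb$db_key);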
Probably you are interested in getting a listing of the records to delete, leaving only one representative of each repeated value no matter how many times it appears. I've found this is the general sentence that must be put in the procedure instead of the one shown above:
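A sentence with that behavior, using MAX so the inner SELECT is always a singleton:

select r1.rdb$db_key
from repet r1
where r1.rdb$db_key <
  (select max(r2.rdb$db_key) from repet r2 where r2.a = r1.a);

Because the inner SELECT returns the highest db_key of each group, every other occurrence is listed, no matter how many times the value repeats.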
The other alternative is to build a stored procedure that walks the table in order and always skips the first value of a group of identical values, but this requires a flag and a temporary variable. Now, on to another example, with data you can test for yourself. I've decided again to use the table salary_history, because its primary key is composed of emp_no, change_date and updater_id. The task will be to keep only the most recent salary record for each employee and delete the rest.
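One way to count the doomed records (a sketch; it subtracts the number of distinct employees from the total row count):

select count(emp_no) - count(distinct emp_no)
from salary_history;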
If you haven't fiddled with this database before, you should get 16 as the answer, so we know we have to delete 16 records. This could be accomplished easily in a grid, but what if a real case returns 1500, for example? Clearly, a manual operation would only be chosen by a bureaucratic DB admin seeking ways to justify his full-time contract. At the risk of being boring, we will walk through several alternatives:

1.- A convoluted statement can be the most immediate solution. For each group of records that have the same emp_no, we need to preserve the one that has the most recent date; in terms of date comparisons, this means the one with the highest date. By default, indexes are ascending, so getting the minimum is fast but getting the maximum won't get any help from the index. Fortunately, the designers of the sample employee.gdb database already defined the index we need (first sketch below). Now, the problem is to write the SQL statement itself. Clearly, we need a DELETE with a nested SELECT clause. That alone doesn't suffice, however: we need the maximum, but for the employee that's being deleted, not the maximum date of the whole table, so the inner statement depends on the outer statement. This is called a correlated query, although the common case is two SELECTs, not a DELETE plus a SELECT (second sketch below).
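If memory serves, the index ships with employee.gdb as the descending one shown here; double-check the exact name in your copy:

create descending index changex on salary_history (change_date);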
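And the correlated statement itself, as a sketch:

delete from salary_history
where change_date <
  (select max(sh.change_date)
   from salary_history sh
   where sh.emp_no = salary_history.emp_no);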
When this statement is prepared, IB answers with a plan that looks good and bad at the same time. First, MAX is using the descending index for the inner SELECT, but not for the outer scan that performs the deletion. It could use it: the LESS THAN condition would walk the same index to find the first value less than the result of the inner SELECT and then continue walking the index to retrieve the rest of the values. Instead, IB decides on a NATURAL scan, too slow if the table has thousands of records. The problem is the correlation, namely the fact that the inner SELECT depends on the outer statement. To verify that IB can use such a descending index with LESS THAN, just prepare a statement like the one below:
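Any cutoff date will do, since the point is only to inspect the plan:

select *
from salary_history
where change_date < '12/31/1992';

This one does show the descending index in its plan, so the engine is able to use it with LESS THAN; it simply declines to do so when the inner SELECT is correlated.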
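2.- The procedural way. Here's a sketch of the kind of stored procedure the next paragraph dissects; the procedure and variable names, and the exact declarations, are my assumptions:

set term ^;
create procedure kill_old_salaries
as
declare variable cur_key char(8) character set octets;
declare variable old_key char(8) character set octets;
declare variable cur_emp smallint;
declare variable old_emp smallint;
begin
  old_emp = -1; /* no employee has this number, so the first record never matches */
  for select sh.rdb$db_key, sh.emp_no
      from salary_history sh
      order by sh.emp_no /* the PK index delivers emp_no, change_date, updater_id order */
      into :cur_key, :cur_emp
  do
  begin
    /* same employee as the previous record: the previous one is older, delete it */
    if (cur_emp = old_emp) then
      delete from salary_history where rdb$db_key = :old_key;
    old_emp = cur_emp;
    old_key = cur_key;
  end
  /* when the loop ends, the last record read is the newest for the last
     employee, so it's correctly left in place */
end ^
set term ;^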
Let's explain what it does: first, we need to wipe out all past salaries for a person, namely all salaries but the newest for each employee. So we order by emp_no and change_date. Since there's an index due to the PK that spans emp_no, change_date and updater_id, ordering even by emp_no alone will cause the engine to use the index and hence deliver the correct order on all three fields. Since a PK generates an ascending index automatically and we're using that index, the latest record for each employee comes last in its group, and we need to pick it; this makes the logic more complex. We trap the db_key and the emp_no of the current record. If the employee number has changed, we have a new employee. This means the previous record was the latest for the previous employee and the one we should keep. Since we track the prior employee and the previous db_key in auxiliary variables, we know which record that was. Conversely, if the employee number didn't change, then we have the same employee, and the prior record (not the current one) can be deleted, since it was not the latest for this employee. This is the reason we keep the previous db_key: to delete the prior record by position. Now, on boundary cases: at the first record there's no prior db_key yet, so the tracking variable must start with a value no real employee can have, and no deletion is attempted; and when the loop ends, the last record read is the newest one for the last employee, so it is correctly left in place.
It has been said that ordering by an index is not a very good idea, since the engine has to read random pages and load them into the cache to fetch each record in the order mandated by the index. However, the situation here is different: each time our condition matches, we delete the record that was loaded immediately before, so there's a great chance that its page is still in memory when the deletion happens. Since we are using db_key, there's no index search for the delete itself. Also, since the procedure does the whole job, the operation should be fast, so there's not much chance that other requests will unload our previously accessed page before we need it.

Still pending:
- Other examples: drawbacks.
- Never use db_key + summary.
- More on db_keys.
This page was last updated on 2001-01-12 23:57:46