Part of lp.services.database.multitablecopy
This allows data from a combination of tables, possibly with foreign-key references between them, to be copied to a set of corresponding "holding tables"; processed and modified there; and then be inserted back to the original tables. The holding tables are created on demand and dropped upon completion.
You can tell the algorithm to redirect foreign keys. Say you're copying a row x1 in a table X, and x1 has a foreign key referring to a row y1 in a table Y that you're also copying. You will get copied rows x2 and y2 respectively. But if you declare the foreign-key relationship between X and Y to the algorithm, then x2's instance of that foreign key will refer not to y1 but to the new y2. Any rows in X whose associated rows of Y are not copied are also not copied. This can be useful when copying data in entire sub-trees of the schema graph, e.g. "one distroseries and all the translations associated with it."
All this happens in a two-stage process:
1. Extraction stage. Use the extract() method to copy selected data to a holding table, one table at a time. Ordering matters: always do this in such an order that the table you are extracting has no foreign-key references to another table that you are yet to extract.
This stage is relatively fast and holds no locks on the database. Do any additional processing on the copied rows in the holding tables, during or after the extraction stage, so you do not hold any locks on the source tables yourself. It's up to you to make sure that all of the rows in the holding tables can be inserted into their source tables: if you leave primary keys and such unchanged, unique constraints will be violated in the next stage.
2. Pouring stage. All data from the holding tables is inserted back into the source tables. This entire stage, which normally takes the bulk of the copying time, is performed by calling the pour method.
This stage will lock the rows that are being inserted in the source tables, if the database is so inclined (e.g. when using postgres with REPEATABLE READ isolation level). For that reason, the pouring is done in smaller, controlled batches. If you give the object a database transaction to work with, that transaction will be committed and restarted between batches.
A MultiTableCopy is restartable. If the process fails for any reason, the holding tables will be left in one of two states. If stage 1 has not completed, needsRecovery will return False; in that case, drop the holding tables using dropHoldingTables and either start again or give up. But if a previous run did complete the extraction stage, the holding tables will remain and contain valid data; in that case, just pour again to continue the work (and hopefully complete it this time).
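The restart decision above can be sketched as a small driver function. This is a hypothetical helper, not part of the class; it assumes only the method names described in this document (needsRecovery, dropHoldingTables, pour), and takes the sequence of extract() calls as an opaque callable:

```python
def resume_or_restart(copier, extract_all):
    """Drive a MultiTableCopy-style two-stage copy, resuming if possible.

    `copier` is assumed to expose needsRecovery(), dropHoldingTables()
    and pour(), as described above. `extract_all` is a callable that
    performs all the extract() calls in dependency order.
    """
    if copier.needsRecovery():
        # A previous run completed the extraction stage: the holding
        # tables contain valid data, so just pour again.
        copier.pour()
        return "resumed"
    # Clean start, or extraction was interrupted: any holding tables
    # are not trustworthy, so drop them and redo both stages.
    copier.dropHoldingTables()
    extract_all()
    copier.pour()
    return "restarted"
```

The point of the sketch is that pouring is the only stage worth resuming; a half-finished extraction is always discarded.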
Holding tables will have names like "temp_POMsgSet_holding_ubuntu_feisty", in this case for one holding data extracted from source table POMsgSet by a MultiTableCopy called "ubuntu_feisty". The holding tables are dropped when we're done, but they are not temp tables. They have to be persistent so we get a chance to resume interrupted copies and analyse failures.
The tables to be copied must meet a number of conventions:
- First column must be an integer primary key called "id".
- id values must be assigned by a sequence, with a name that can be used in SQL without quoting.
- Every foreign-key column that refers to a table that is also being copied must have the same name as the table it refers to. This can be changed by subclassing and overriding the _pointsToTable method.
- Foreign-key column names and the tables they refer to can be used in SQL without quoting.
- For any foreign-key column "x" referring to another table that is also being copied, there must not be a column called "new_x".
If you use this class, it is up to you to ensure that the data you copy remains valid between the time it is extracted from its source tables and the moment it has been poured back. For instance, rows that the data refers to by foreign keys must not be deleted while the multi-table copy is running.
Methods:
- __init__: Define a MultiTableCopy, including an in-order list of tables.
- dropHoldingTables: Drop any holding tables that may exist for this MultiTableCopy.
- getRawHoldingTableName: Name for a holding table, but without quotes. Use with care.
- getHoldingTableName: Name for a holding table to hold data being copied in tablename.
- extract: Extract (selected) rows from source_table into a holding table.
- needsRecovery: Do we have holding tables with recoverable data from a previous run?
- pour: Pour data from holding tables back into source tables.
- _pointsToTable: Name of the table that source_table.foreign_key refers to.
- _selectToHolding: Create a holding table based on data from the source table.
- _indexIdColumn: Index the id column on the holding table.
- _retargetForeignKeys: Replace foreign keys in the new holding table.
- _pourTable: Pour the contents of a holding table back into its source table.
- _checkExtractionOrder: Verify the order in which tables are extracted against the tables list.
- _checkForeignKeyOrder: Verify that we're in a position to "retarget" a foreign key.
- _commit: Commit our transaction and create a replacement cursor.
Parameters:
- name: a unique identifier for this MultiTableCopy operation, e.g. "ubuntu_feisty". The name will be included in the names of holding tables.
- tables: a list of tables that will be extracted and poured, in the order in which they will be extracted (and later, poured). This is essential when analyzing recoverable state. You may attempt multiple extractions from the same table, but all tables listed must be extracted. If you do not wish to copy any rows from a source table, extract with "false" as its where clause.
- seconds_per_batch: a time goal (in seconds) to define how long, ideally, the algorithm should be allowed to hold locks on the source tables. It will try to do the work in batches that take about this long.
- minimum_batch_size: minimum number of rows to pour in one batch. You may want to set this to stop the algorithm from resorting to endless single-row batches in situations where response times are too slow but batch size turns out not to matter much.
- restartable: whether you want the remaining data to be available for recovery if the connection (or the process) fails while in the pouring stage. If False, the operation will extract to temp tables. CAUTION: our connections currently get reset every time we commit a transaction, obliterating any temp tables possibly in the middle of the pouring process!
- logger: a logger to write informational output to. If none is given, the default is used.
The return value of getHoldingTableName is properly quoted for use as an SQL identifier.
By default, all foreign keys that play a role in the MultiTableCopy are expected to have the same names as the tables they refer to. If that is not the case for your copy, subclass this class and override this method with a version that has specific knowledge of what points where.
The holding table gets an additional new_id column with identifiers generated by id_sequence. Apart from this extra column (and indexes and constraints), the holding table is schematically identical to source_table. A unique index is created for the original id column.
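As a rough illustration of the extraction step, the holding-table creation can be pictured as a single CREATE TABLE ... AS SELECT that copies the selected rows and draws new_id values from the sequence. The helper below is hypothetical and simplified, not the actual implementation; it only builds the SQL string:

```python
def extraction_sql(source_table, holding_table, id_sequence, where_clause):
    """Sketch (assumed, simplified) of the statement that builds a
    holding table: a copy of the selected rows plus a new_id column
    drawn per row from id_sequence."""
    return (
        "CREATE TABLE %s AS "
        "SELECT source.*, nextval('%s') AS new_id "
        "FROM %s source WHERE %s"
        % (holding_table, id_sequence, source_table, where_clause))
```

Because nextval() is evaluated once per selected row, every copied row gets a fresh identifier while keeping its original id column intact.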
There is a special facility for redirecting foreign keys to other tables in the same copy operation. The joins argument can pass a list of foreign keys. The foreign keys given in joins must be columns of source_table, and refer to tables that are also being copied. The selection used in populating the holding table for source_table will be joined on each of the foreign keys listed in joins. The foreign keys in the holding table will point to the new_ids of the copied rows, rather than the original ids. Rows in source_table will only be copied to their holding table if all rows they are joined with through the joins parameter are also copied.

When joining, the added tables' columns are not included in the holding table, but where_clause may still select on them.
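The join semantics can be sketched in the same simplified style: for each foreign key in joins, the extraction query joins against the already-extracted holding table of the target, captures its new_id as a new_<fk> column, and the inner join naturally drops rows whose referenced rows were not copied. The helper below is hypothetical and handles a single join:

```python
def extraction_sql_with_join(source_table, holding_table, fk,
                             target_holding):
    """Sketch (assumed, simplified) of extraction with one redirected
    foreign key `fk`: the holding table gains a new_<fk> column taken
    from the target's holding table, and the INNER JOIN excludes rows
    whose referenced rows were not copied."""
    return (
        "CREATE TABLE %s AS "
        "SELECT source.*, %s.new_id AS new_%s "
        "FROM %s source JOIN %s ON source.%s = %s.id"
        % (holding_table, target_holding, fk, source_table,
           target_holding, fk, target_holding))
```

Note that this is why extraction order matters: the target table's holding table must already exist before any table whose foreign keys point at it is extracted.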
Parameters:
- source_table: table to extract data from (and where the extracted data will ultimately be poured back to).
- joins: list of foreign keys from source_table to other source tables being copied in the same operation. By default, each of these foreign keys is assumed to refer to a table of the same name. When breaking this convention, override the _pointsToTable method to provide the right target table for each foreign key that the operation should know about.
- where_clause: Boolean SQL expression characterizing rows to be extracted. The WHERE clause may refer to rows from the table being extracted as "source".
- id_sequence: SQL sequence that should assign new identifiers for the extracted rows. Defaults to source_table with "_id_seq" appended, which by SQLObject/Launchpad convention is the sequence that provides source_table's primary key values. Used verbatim, without quoting.
- inert_where: Boolean SQL expression characterizing rows that are extracted, but should not be poured back into source_table during the pouring stage. For these rows, new_id will be null. This clause is executed in a separate query, and therefore has no access to any tables other than the newly created holding table. The clause can reference the holding table under the name "holding". Any foreign keys from joins will still contain the values they had in source_table, but for each "x" of these foreign keys, the holding table will have a column "new_x" that holds the redirected foreign key.
- pre_pouring_callback: a callback that is called just before pouring this table. At that time the holding table will no longer have its new_id column, its values having been copied to the regular id column. This means that the copied rows' original ids are no longer known. The callback takes as arguments the holding table's name and the source table's name. The callback may be invoked more than once if pouring is interrupted and later resumed.
- batch_pouring_callback: a callback that is called before each batch of rows is poured, within the same transaction that pours those rows. It takes as arguments the holding table's name; the source table's name; the current batch size; the id of the first row being poured; and the id where the batch stops. The "end id" is exclusive, so a row with that id is not copied (and in fact may not even exist). The time spent by batch_pouring_callback is counted as part of the batch's processing time.
- external_joins: a list of tables to join into the extraction query, so that the extraction condition can select on them. Each entry must be a string either simply naming a table ("Person"), or naming a table and providing a name for it in the query (e.g., "Person p"). Do the latter if the same table also occurs elsewhere in the same query. Your where_clause can refer to the rows included in this way by the names you give or by their table name, SQL rules permitting. If your join yields multiple rows that have the same source_table id, only one arbitrary pick of those will be extracted.
Create a holding table based on data from the source table. We don't need to know exactly what's in the source table; we just create a new table straight from a "SELECT *". Except we also add a few columns for the purpose of retargeting foreign keys, as well as a new_id column.
Creates a unique index on "id" column in holding table. The index gets the name of the holding table with "_id" appended.
Replace foreign keys in new holding table. Set the values of the foreign keys that are to be retargeted to those in their respective "new_" variants, then drop those "new_" columns we added from the holding table.
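These two steps (copy each "new_x" over "x", then drop the scratch column) can be sketched as SQL built by a hypothetical helper; this is an illustration of the documented behaviour, not the actual implementation:

```python
def retarget_sql(holding_table, fk):
    """Sketch of retargeting one foreign key `fk` in a holding table:
    point it at the copied rows' new ids, then drop the scratch
    new_<fk> column added during extraction."""
    return [
        "UPDATE %s SET %s = new_%s" % (holding_table, fk, fk),
        "ALTER TABLE %s DROP COLUMN new_%s" % (holding_table, fk),
    ]
```

After this runs for every joined foreign key, the holding table's rows are self-consistent and ready to be poured back verbatim.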
Returns Boolean answer.
Rows in the holding table that have their new_id set to null are skipped.
The transaction manager is committed and re-opened after every batch run.
Batch sizes are dynamically adjusted to meet the stated time goal.
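One plausible shape for that adjustment (hypothetical; the real implementation may differ) is to scale the batch size by the ratio of the time goal to the last batch's elapsed time, clamped at the configured minimum:

```python
def next_batch_size(current_size, elapsed_seconds, seconds_per_batch,
                    minimum_batch_size):
    """Hypothetical sketch of time-goal batch sizing: aim for the next
    batch to take roughly seconds_per_batch, never dropping below
    minimum_batch_size."""
    if elapsed_seconds <= 0:
        # No usable timing data yet; grow cautiously.
        return current_size * 2
    scaled = int(current_size * seconds_per_batch / elapsed_seconds)
    return max(scaled, minimum_batch_size)
```

A fast batch thus grows the next one, a slow batch shrinks it, and minimum_batch_size stops the loop from degenerating into single-row batches when the database is simply slow.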
This will commit transaction_manager, typically multiple times.
Check that the caller follows the stated plan, extracting from tables in the same order as in self.tables.