l.s.d.m.MultiTableCopy : class documentation

Part of lp.services.database.multitablecopy

Copy interlinked data spanning multiple tables in a coherent fashion.

This allows data from a combination of tables, possibly with foreign-key references between them, to be copied to a set of corresponding "holding tables"; processed and modified there; and then inserted back into the original tables. The holding tables are created on demand and dropped upon completion.

You can tell the algorithm to redirect foreign keys. Say you're copying a row x1 in a table X, and x1 has a foreign key referring to a row y1 in a table Y that you're also copying. You will get copied rows x2 and y2 respectively. But if you declare the foreign-key relationship between X and Y to the algorithm, then x2's instance of that foreign key will refer not to y1 but to the new y2. Any rows in X whose associated rows of Y are not copied, are also not copied. This can be useful when copying data in entire sub-trees of the schema graph, e.g. "one distroseries and all the translations associated with it."
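The redirection idea can be illustrated with a self-contained sketch. Here sqlite3 stands in for the PostgreSQL tables the class actually targets, and all table names, ids, and the join query are illustrative, not the class's real code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE y (id INTEGER PRIMARY KEY);
    CREATE TABLE x (id INTEGER PRIMARY KEY, y INTEGER REFERENCES y(id));
    INSERT INTO y VALUES (1), (2);
    INSERT INTO x VALUES (10, 1), (11, 2);
""")

# "Extract" only y row 1, assigning it a new id; y row 2 is not copied.
cur.execute("CREATE TABLE y_holding (id INTEGER, new_id INTEGER)")
cur.execute("INSERT INTO y_holding VALUES (1, 101)")

# Extract x by joining on y's holding table: the foreign key is redirected
# to y's new_id, and x rows whose y row was not copied drop out of the join.
cur.execute("""
    CREATE TABLE x_holding AS
    SELECT x.id AS id, y_holding.new_id AS y, 110 + x.id AS new_id
    FROM x JOIN y_holding ON x.y = y_holding.id
""")
rows = cur.execute("SELECT id, y, new_id FROM x_holding").fetchall()
print(rows)  # only x row 10 survives, its y key now pointing at 101
```

The inner join is what makes "rows whose associated rows are not copied, are also not copied" fall out naturally.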

All this happens in a two-stage process:

1. Extraction stage. Use the extract() method to copy selected data to a holding table, one table at a time. Ordering matters: always do this in such an order that the table you are extracting has no foreign-key references to another table that you have yet to extract.

This stage is relatively fast and holds no locks on the database. Do any additional processing on the copied rows in the holding tables, during or after the extraction stage, so you do not hold any locks on the source tables yourself. It's up to you to make sure that all of the rows in the holding tables can be inserted into their source tables: if you leave primary keys and such unchanged, unique constraints will be violated in the next stage.

2. Pouring stage. All data from the holding tables is inserted back into the source tables. This entire stage, which normally takes the bulk of the copying time, is performed by calling the pour method.

This stage will lock the rows that are being inserted in the source tables, if the database is so inclined (e.g. when using PostgreSQL with the REPEATABLE READ isolation level). For that reason, the pouring is done in smaller, controlled batches. If you give the object a database transaction to work with, that transaction will be committed and restarted between batches.
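The batch loop can be pictured with this simplified, hypothetical sketch; sqlite3 stands in for the real database, batches are delimited on the id column with an exclusive upper bound, and the transaction is committed between batches:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE source (id INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE holding (new_id INTEGER)")
cur.executemany("INSERT INTO holding VALUES (?)", [(i,) for i in range(1, 11)])

batch_size = 4
lowest = cur.execute("SELECT min(new_id) FROM holding").fetchone()[0]
highest = cur.execute("SELECT max(new_id) FROM holding").fetchone()[0]
while lowest <= highest:
    end = lowest + batch_size  # exclusive upper bound
    cur.execute(
        "INSERT INTO source SELECT new_id FROM holding "
        "WHERE new_id >= ? AND new_id < ?", (lowest, end))
    cur.execute(
        "DELETE FROM holding WHERE new_id >= ? AND new_id < ?",
        (lowest, end))
    conn.commit()  # release locks between batches
    lowest = end

count = cur.execute("SELECT count(*) FROM source").fetchone()[0]
print(count)  # 10
```

Committing inside the loop is the point: each batch holds its locks only for the duration of one small insert-and-delete.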

A MultiTableCopy is restartable. If the process fails for any reason, the holding tables will be left in one of two states: if stage 1 has not completed, needsRecovery will return False. In that case, drop the holding tables using dropHoldingTables and either start again or give up. But if a previous run did complete the extraction stage, the holding tables will remain and contain valid data. In that case, just pour again to continue the work (and hopefully complete it this time).

Holding tables will have names like "temp_POMsgSet_holding_ubuntu_feisty", in this case for one holding data extracted from source table POMsgSet by a MultiTableCopy called "ubuntu_feisty". The holding tables are dropped when we're done, but they are not temp tables. They have to be persistent so we get a chance to resume interrupted copies and analyse failures.
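The naming scheme can be reproduced with a small helper mirroring what getRawHoldingTableName presumably does; this is an illustrative reconstruction, not the class's actual code:

```python
def raw_holding_table_name(tablename, copy_name, suffix=""):
    """Build a holding-table name: temp_<table>_holding_<copy name>."""
    name = "temp_%s_holding_%s" % (tablename, copy_name)
    if suffix:
        name += "_%s" % suffix
    return name

print(raw_holding_table_name("POMsgSet", "ubuntu_feisty"))
# temp_POMsgSet_holding_ubuntu_feisty
```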


The tables to be copied must meet a number of conventions.

If you use this class, it is up to you to ensure that the data that you copy does not break compliance between the time it is extracted from its source tables and the moment it has been poured back there. This means that rows that the data refers to by foreign keys must not be deleted while the multi-table copy is running, for instance.

Method __init__ Define a MultiTableCopy, including an in-order list of tables.
Method dropHoldingTables Drop any holding tables that may exist for this MultiTableCopy.
Method getRawHoldingTableName Name for a holding table, but without quotes. Use with care.
Method getHoldingTableName Name for a holding table to hold data being copied in tablename.
Method extract Extract (selected) rows from source_table into a holding table.
Method needsRecovery Do we have holding tables with recoverable data from previous run?
Method pour Pour data from holding tables back into source tables.
Method _pointsToTable Name of table that source_table.foreign_key refers to.
Method _selectToHolding Create holding table based on data from source table.
Method _indexIdColumn Index id column on holding table.
Method _retargetForeignKeys Replace foreign keys in new holding table.
Method _pourTable Pour contents of a holding table back into its source table.
Method _checkExtractionOrder Verify order in which tables are extracted against tables list.
Method _checkForeignKeyOrder Verify that we're in a position to "retarget" a foreign key.
Method _commit Commit our transaction and create replacement cursor.
def __init__(self, name, tables, seconds_per_batch=2.0, minimum_batch_size=500, restartable=True, logger=None):
Define a MultiTableCopy, including an in-order list of tables.
Parameters
    name: a unique identifier for this MultiTableCopy operation, e.g. "ubuntu_feisty". The name will be included in the names of holding tables.
    tables: a list of tables that will be extracted and poured, in the order in which they will be extracted (and later, poured). This is essential when analyzing recoverable state. You may attempt multiple extractions from the same table, but all tables listed must be extracted. If you do not wish to copy any rows from a source table, extract with "false" as its where clause.
    seconds_per_batch: a time goal (in seconds) defining how long, ideally, the algorithm should be allowed to hold locks on the source tables. It will try to do the work in batches that take about this long.
    minimum_batch_size: minimum number of rows to pour in one batch. You may want to set this to stop the algorithm from resorting to endless single-row batches in situations where response times are too slow but batch size turns out not to matter much.
    restartable: whether you want the remaining data to be available for recovery if the connection (or the process) fails while in the pouring stage. If False, the copy will extract to temp tables. CAUTION: our connections currently get reset every time we commit a transaction, obliterating any temp tables possibly in the middle of the pouring process!
    logger: a logger to write informational output to. If none is given, the default is used.
def dropHoldingTables(self):
Drop any holding tables that may exist for this MultiTableCopy.
def getRawHoldingTableName(self, tablename, suffix=''):
Name for a holding table, but without quotes. Use with care.
def getHoldingTableName(self, tablename, suffix=''):
Name for a holding table to hold data being copied in tablename.

Return value is properly quoted for use as an SQL identifier.

def _pointsToTable(self, source_table, foreign_key):
Name of table that source_table.foreign_key refers to.

By default, all foreign keys that play a role in the MultiTableCopy are expected to have the same names as the tables they refer to. If that is not the case for your copy, subclass this class and override this method with a version that has specific knowledge of what points where.
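A subclass override might look like the following sketch. The stub base class stands in for MultiTableCopy, and the table and key names are hypothetical:

```python
class BaseCopy:
    """Stub standing in for MultiTableCopy's default convention."""
    def _pointsToTable(self, source_table, foreign_key):
        # Default: a foreign key named "x" points to a table named "x".
        return foreign_key

class TranslationsCopy(BaseCopy):
    """Hypothetical copy where some keys break the naming convention."""
    _targets = {("TranslationMessage", "submitter"): "Person"}

    def _pointsToTable(self, source_table, foreign_key):
        return self._targets.get(
            (source_table, foreign_key),
            BaseCopy._pointsToTable(self, source_table, foreign_key))

copy = TranslationsCopy()
print(copy._pointsToTable("TranslationMessage", "submitter"))  # Person
print(copy._pointsToTable("POFile", "language"))  # language (default)
```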

def extract(self, source_table, joins=None, where_clause=None, id_sequence=None, inert_where=None, pre_pouring_callback=None, batch_pouring_callback=None, external_joins=None):
Extract (selected) rows from source_table into a holding table.

The holding table gets an additional new_id column with identifiers generated by id_sequence. Apart from this extra column, (and indexes and constraints), the holding table is schematically identical to source_table. A unique index is created for the original id column.

There is a special facility for redirecting foreign keys to other tables in the same copy operation. The joins argument can pass a list of foreign keys. The foreign keys given in joins must be columns of source_table, and refer to tables that are also being copied. The selection used in populating the holding table for source_table will be joined on each of the foreign keys listed in joins. The foreign keys in the holding table will point to the new_ids of the copied rows, rather than the original ids. Rows in source_table will only be copied to their holding table if all rows they are joined with through the joins parameter are also copied.

When joining, the added tables' columns are not included in the holding table, but where_clause may still select on them.

Parameters
    source_table: table to extract data from (and where the extracted data will ultimately be poured back to).
    joins: list of foreign keys from source_table to other source tables being copied in the same operation. By default, each of these foreign keys is assumed to refer to a table of the same name. When breaking this convention, override the _pointsToTable method to provide the right target table for each foreign key the operation should know about.
    where_clause: Boolean SQL expression characterizing rows to be extracted. The WHERE clause may refer to rows from the table being extracted as "source".
    id_sequence: SQL sequence that should assign new identifiers for the extracted rows. Defaults to source_table with "_id_seq" appended, which by SQLObject/Launchpad convention is the sequence that provides source_table's primary key values. Used verbatim, without quoting.
    inert_where: Boolean SQL expression characterizing rows that are extracted, but should not be poured back into source_table during the pouring stage. For these rows, new_id will be null. This clause is executed in a separate query and therefore has no access to any tables other than the newly created holding table. The clause can reference the holding table under the name "holding". Any foreign keys from joins will still contain the values they had in source_table, but for each "x" of these foreign keys, the holding table will have a column "new_x" that holds the redirected foreign key.
    pre_pouring_callback: a callback invoked just before pouring this table. At that time the holding table will no longer have its new_id column, its values having been copied to the regular id column. This means that the copied rows' original ids are no longer known. The callback takes as arguments the holding table's name and the source table's name. It may be invoked more than once if pouring is interrupted and later resumed.
    batch_pouring_callback: a callback invoked before each batch of rows is poured, within the same transaction that pours those rows. It takes as arguments the holding table's name; the source table's name; the current batch size; the id of the first row being poured; and the id where the batch stops. The "end id" is exclusive, so a row with that id is not copied (and in fact may not even exist). The time spent by batch_pouring_callback is counted as part of the batch's processing time.
    external_joins: a list of tables to join into the extraction query, so that the extraction condition can select on them. Each entry must be a string either simply naming a table ("Person"), or naming a table and providing a name for it in the query (e.g. "Person p"). Do the latter if the same table also occurs elsewhere in the same query. Your where_clause can refer to the rows included in this way by the names you give or by their table name, SQL rules permitting. If your join yields multiple rows with the same source_table id, only one arbitrary pick of those will be extracted.
def _selectToHolding(self, source_table, joins, external_joins, where_clause, holding_table, id_sequence, inert_where):
Create holding table based on data from source table.

We don't need to know what's in the source table exactly; we just create a new table straight from a "select *", except we also add a few columns for the purpose of retargeting foreign keys, as well as a new_id column.
def _indexIdColumn(self, holding_table, source_table, cur):
Index id column on holding table.

Creates a unique index on the "id" column in the holding table. The index gets the name of the holding table with "_id" appended.

def _retargetForeignKeys(self, holding_table, joins, cur):
Replace foreign keys in new holding table.

Set the values of the foreign keys that are to be retargeted to those
in their respective "new_" variants, then drop those "new_" columns we
added from the holding table.
def needsRecovery(self):
Do we have holding tables with recoverable data from a previous run?

Returns Boolean answer.

def pour(self, transaction_manager):
Pour data from holding tables back into source tables.

Rows in the holding table that have their new_id set to null are skipped.

The transaction manager is committed and re-opened after every batch run.

Batch sizes are dynamically adjusted to meet the stated time goal.
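One plausible way to adjust the batch size toward the time goal is to scale it by the ratio of the goal to the last batch's elapsed time, clamped at the minimum. This is a guess at the approach, not the class's actual arithmetic:

```python
def next_batch_size(current_size, elapsed_seconds,
                    seconds_per_batch=2.0, minimum_batch_size=500):
    """Scale the batch so the next one takes about seconds_per_batch."""
    if elapsed_seconds <= 0:
        return current_size  # no timing signal; keep the current size
    scaled = int(current_size * seconds_per_batch / elapsed_seconds)
    return max(scaled, minimum_batch_size)

print(next_batch_size(1000, 4.0))   # too slow: shrink to 500
print(next_batch_size(1000, 0.5))   # fast: grow to 4000
print(next_batch_size(100, 10.0))   # clamped at minimum_batch_size
```

The minimum_batch_size floor is what prevents the degenerate single-row batches mentioned in the constructor's documentation.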

def _pourTable(self, holding_table, table, has_new_id, transaction_manager):
Pour contents of a holding table back into its source table.

This will commit transaction_manager, typically multiple times.

def _checkExtractionOrder(self, source_table):
Verify order in which tables are extracted against tables list.

Check that the caller follows the stated plan, extracting from tables in the same order as in self.tables.

def _checkForeignKeyOrder(self, fk, referenced_table):
Verify that we're in a position to "retarget" a foreign key.

We've been asked to retarget a foreign key while copying. Check that the table it refers to has already been copied.

def _commit(self, transaction_manager):
Commit our transaction and create replacement cursor.

Use this as "cur = self._commit(transaction_manager)" to commit a transaction, restart it, and create a cursor that lives within the new transaction.
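The idiom reads roughly like this sketch, with a sqlite3 connection standing in for the transaction manager and the stub class standing in for MultiTableCopy:

```python
import sqlite3

class CopySketch:
    """Minimal illustration of the commit-and-get-fresh-cursor idiom."""
    def _commit(self, conn):
        conn.commit()          # end the current transaction
        return conn.cursor()   # hand back a cursor for the next one

conn = sqlite3.connect(":memory:")
copy = CopySketch()
cur = conn.cursor()
cur.execute("CREATE TABLE t (id INTEGER)")
cur = copy._commit(conn)       # "cur = self._commit(transaction_manager)"
cur.execute("INSERT INTO t VALUES (1)")
n = cur.execute("SELECT count(*) FROM t").fetchone()[0]
print(n)  # 1
```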

API Documentation for Launchpad, generated by pydoctor at 2022-06-16 00:00:12.