Archive and restore media with Python
This tutorial will guide you through archiving and restoring media using tator-py. We assume you already have a project with media in it.
Determine which media to archive
In this example, we will archive all media objects from a section. First, get the list of media ids that will be archived:
import tator
# Connect to the Tator API; replace the token with your own API token.
api = tator.get_api(host="https://cloud.tator.io", token="YOUR_TOKEN")
# project_id and section_id are placeholders for your project and section ids.
media_list = api.get_media_list(project_id, section=section_id)
media_ids = [m.id for m in media_list]
The section id is shown in the web UI when you select the section from the list on the left side of the project detail page.
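Alternatively, the section id can be looked up through the API. A minimal sketch, assuming a section named "My Section" exists in your project:
# Look up a section id by name instead of reading it from the web UI.
# "My Section" is a placeholder; use the name shown in your project.
sections = api.get_section_list(project_id)
section_id = next(s.id for s in sections if s.name == "My Section")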
Archive the media
Users do not perform the archive operation directly; instead, they mark media objects as ready to be archived, and a nightly cron job tags the corresponding objects in the bucket to trigger a lifecycle rule. To mark objects as ready to archive, perform a bulk update that sets the archive_state property to to_archive:
bulk_update = {"archive_state": "to_archive"}
response = api.update_media_list(project_id, media_bulk_update=bulk_update)
print(response)
What happens next
After the user has set the archive_state flag to to_archive, the following happens:
- The next time the nightly cron job runs, the archive tag is added to the object in the bucket (e.g. S3) and is given the value true.
- The next time the lifecycle rule polls for new archive: true tags, it will run on those objects. Amazon S3 runs lifecycle rules once every day, but the exact timing is undocumented.
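You can track this progress from the client by re-reading the archive_state field on the media objects; a minimal sketch, reusing the media_ids gathered earlier and assuming the field reports archived once the process completes:
# Check whether the nightly job and lifecycle rule have finished; archive_state
# should move from "to_archive" to "archived" once archival completes.
for media in api.get_media_list(project_id, media_id=media_ids):
    print(media.id, media.archive_state)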
The actual result of the bucket's lifecycle rule depends on the project configuration. If
the project does not have a backup bucket defined (either a deployment-wide default or a
project-specific backup bucket), then the rule will be set up to transition the storage class from
the previously defined live storage class to the archive storage class (e.g. DEEP_ARCHIVE
for S3).
If the project does have a backup bucket defined, then the rule will delete the object from the live bucket, leaving only the backup copy in the backup bucket (which usually defaults to a high-latency, low-cost storage class like DEEP_ARCHIVE).
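For context, the tag-filtered transition described above corresponds roughly to the following S3 lifecycle configuration. This is only an illustrative sketch with placeholder names (the bucket name and rule id are not part of the tutorial); in practice these rules are managed by the Tator deployment, not by the end user.
import boto3

s3 = boto3.client("s3")

# Transition any object tagged archive=true to DEEP_ARCHIVE on the next
# lifecycle run. "my-live-bucket" and the rule id are placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-live-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-tagged-media",
                "Status": "Enabled",
                "Filter": {"Tag": {"Key": "archive", "Value": "true"}},
                "Transitions": [{"Days": 0, "StorageClass": "DEEP_ARCHIVE"}],
            }
        ]
    },
)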
Restore the media
Once archived, it is possible to restore media to the live
state. The user does this the same way
they archived the media, by performing a bulk update:
bulk_update = {"archive_state": "to_live"}
response = api.update_media_list(project_id, media_bulk_update=bulk_update)
print(response)
This value will be read by a cron job that will request that the object store temporarily move the
object in question into the live storage class (e.g. STANDARD
for S3). This request is
asynchronous and can take up to 48 hours, so there is a second job that looks for the completion of
this request and performs the final step to permanently restore the object in the live bucket and at
the live storage class. After a user performs the bulk update to to_live, the order and rough timing of these steps is as follows:
- The next time the nightly request restoration cron job runs, it sends a request to temporarily restore the object to the live storage class in its current bucket (e.g. in the backup bucket, if the project has one, otherwise in the live bucket). It also sets the restoration_requested flag on the media object to True, signaling the finish restoration cron job to run on this media. This process may take up to 48 hours, so it might take more than one day before the next step runs.
- Once the object is restored to the live (i.e. accessible) storage class, the next run of the finish restoration cron job will permanently restore the object to the live storage class. If the project has a backup bucket, this means the object will be copied from the backup bucket to the live bucket and the object in the backup bucket will "expire" and drop back into the archived storage class, leaving the backup intact.
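To know when restoration has finished, poll the media objects until their archive_state returns to live; a minimal sketch, reusing the media_ids gathered earlier:
import time

# Poll until every requested media object reports the "live" archive state.
# Because the restore request is asynchronous, this can take up to ~48 hours.
while True:
    media_list = api.get_media_list(project_id, media_id=media_ids)
    if all(m.archive_state == "live" for m in media_list):
        print("All media restored to live.")
        break
    time.sleep(3600)  # check again in an hour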