[[_TOC_]]

## Procedural note

The workflow described below assumes that file fixity validation is required as part of the processing. It also assumes that each tranche is populated with asset-files on the drive on which the tranche is initially created, rather than the tranche being copied to an external drive, given to the digitisation provider to populate, and then returned so that it can be copied back to the original drive.

If the latter is the case, the checksum-validation script should be run each time the tranche is copied, to ensure that no corruption has resulted from the copy. The workflow in that scenario is therefore slightly more complex.

Instructions on how to form the command line for each script can be found at the top of that script.

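As a purely illustrative sketch, an invocation might look like the example below; the placeholder and the argument order are assumptions, not the scripts' actual interface, so always check the header of the script itself.

```bat
:: Hypothetical example only - see the top of gfs_validate_project_csv.py for the actual command line format
python gfs_validate_project_csv.py <path to project csv file>
```
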
## Workflow

Determine the required child-item-set granularity for the material to be included in the project.

Conceptually divide the project into tranches according to subject matter, manageable size, etc.

Enter rows for the tranches in the project csv file and add the parent-child metadata to the tranche csv file(s).

Run gfs_validate_project_csv.py.

The following section of the workflow should be repeated for each tranche within the project.

Run gfs_validate_tranche_csv.py.

Run gfs_create_tranche_folder.py.

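As with the project-level script, the per-tranche invocations below are only a hedged sketch; the placeholders and argument order are assumptions, and the comment block at the top of each script defines the real command line.

```bat
:: Hypothetical invocations only - consult the headers of the scripts for the actual parameters
python gfs_validate_tranche_csv.py <path to tranche csv file>
python gfs_create_tranche_folder.py <path to tranche csv file>
```
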
Run gfs_create_or_delete_tranche_asset_folder.py with the <action> set to “create_folder” and list all the types of asset-folders that should be in the tranche.

Run gfs_create_or_delete_checksum_folder.py with the <action> set to “create_folder” and list all the types of asset-folders that are present in the tranche.

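A minimal sketch of these two steps, assuming the <action> value and the asset-folder types are passed as positional arguments; the <tranche identifier> and <asset-folder type …> placeholders, and the argument order, are assumptions rather than the scripts' actual interface.

```bat
:: Hypothetical examples only - the real parameter order is documented at the top of each script
python gfs_create_or_delete_tranche_asset_folder.py <tranche identifier> create_folder <asset-folder type 1> <asset-folder type 2>
python gfs_create_or_delete_checksum_folder.py <tranche identifier> create_folder <asset-folder type 1> <asset-folder type 2>
```
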
Populate the asset-folders within the tranche with the digitised asset-files using whatever digitisation software is appropriate, preferably giving the asset-files sequenced filenames that accord with the GFS filenaming convention.

Run gfs_validate_tranche_folder.py with <gfs filenaming convention flag> set to “y”.

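A hedged sketch of this validation step; the position of the <gfs filenaming convention flag> on the command line, and the <tranche identifier> placeholder, are assumptions.

```bat
:: Hypothetical example only - "y" requests checking of filenames against the GFS filenaming convention
python gfs_validate_tranche_folder.py <tranche identifier> y
```
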
Run gfs_create_tranche_checksum_file.py with <asset file location> set to “gfs” for all the asset-folders.

Run gfs_validate_tranche_checksum.py with the <asset file location> parameter set to “gfs”.

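A sketch of these two fixity steps, again with a hypothetical argument order and placeholders; “gfs” indicates that the asset-files being checksummed and validated are in their GFS locations.

```bat
:: Hypothetical examples only - see the script headers for the real command lines
python gfs_create_tranche_checksum_file.py <tranche identifier> gfs
python gfs_validate_tranche_checksum.py <tranche identifier> gfs
```
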
Once the workflow indicated above has been repeated for all the tranches within the project, the digitisation workflow can be considered complete.

## Possible post-digitisation actions

The user might wish to upload the tranches to a preservation and dissemination platform.

One example, in relation to the Arkivum platform, would be to run the gfs_create_arkivum_upload.py script, which creates a BagIt folder package that can be ingested into Arkivum.

The user will also probably wish to validate the file fixity of the asset-files periodically by adding the appropriate lines for the tranches to the gfs_batch_validate_checksum.bat script. This script can then be run at regular intervals to check the file fixity of all the asset-files in the entire GFS.

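A hedged illustration of what the added lines might look like inside gfs_batch_validate_checksum.bat, assuming each line simply re-runs the tranche-level validation script for one tranche; the placeholders and argument order are assumptions, not the batch file's actual content.

```bat
:: Hypothetical content for gfs_batch_validate_checksum.bat - one line per tranche in the GFS
python gfs_validate_tranche_checksum.py <tranche 1 identifier> gfs
python gfs_validate_tranche_checksum.py <tranche 2 identifier> gfs
```
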
## General notes

Tranches can be processed in any order, so if a project faces a challenging completion deadline, priority may be a factor in determining the order in which the tranches are processed.

However, it would not be a good idea to let priority determine the division of project content into tranches. The primary determining factor for this should be the subject matter of the content, so that the material displays in a meaningful way when it is uploaded to a dissemination target. For example, if a project were to contain sets of issues for multiple journals, it would be appropriate to divide the project up so that there is one tranche per journal. It would not be a good idea to combine short runs of several journals in one tranche.

Although the scripts make it easy to manipulate asset-files that are stored within the GFS, once a tranche folder structure for a project has been created and populated, it is not a simple task to change that folder structure. It is therefore important to perform test workflows through to upload and dissemination before committing to a final structure for the project and going into “production mode”.

Material type can cut across subject type. For example, it may be appropriate to provide the digitisation provider with all the material that falls within a particular set of dimensions in one batch, because the digitisation provider has a specific equipment set-up for digitising materials of that dimension.

However, it is not advisable to divide up the tranches of a project according to material type. In such a scenario, it would therefore be necessary to brief the digitisation provider fully in advance that, for a particular batch of material, they will need to switch between tranche folders when populating the child-item-sets with the digitised asset-files.

Entering the metadata for the child-item-sets in advance of the digitisation provider becoming involved in a project has the useful by-product of allowing the identification of physical items that are missing from a collection, so that replacements can be sourced in good time.

### Note for users of Arkivum’s Perpetua

When using Arkivum as an upload target, the gfs_create_arkivum_upload.py script has a “type of data pool” command-line-parameter. This parameter can be set to either “pres_and_acc” (preservation_and_access) or “pres” (preservation).

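A hedged sketch of the two variants; the argument order and the other placeholders are assumptions, and the script header documents the actual command line.

```bat
:: Hypothetical examples only
:: Upload package destined for the preservation-and-access datapool
python gfs_create_arkivum_upload.py <tranche identifier> <asset-folder types> pres_and_acc
:: Upload package destined for the preservation-only datapool
python gfs_create_arkivum_upload.py <tranche identifier> <asset-folder types> pres
```
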
If it is set to the former, the IDs that are created in the metadata.csv file (which is part of the upload package), and that are displayed in AtoM, correspond to the format that can be seen in the online example. However, if “pres” is entered as the parameter, all the IDs have the prefix “PV01-” applied to them.

The prefix is applied so that, if different sets of file-types contained within the same tranche are uploaded to the two distinct datapools via two separate executions of the script (one “pres_and_acc”, the other “pres”), it is clear when viewing the resulting records in the Archivematica module of Perpetua that the records for the two uploads are distinct, but that there is a connection between the two.

Testing with V6.0 of Arkivum has flagged up three issues that should be taken into account when dividing a project into separate tranches. The first of these is that beyond a certain size, the file transfer software (e.g. WinSCP) can fail to upload all the folders contained in an unzipped folder structure to the Amazon S3 Bucket. Various factors may influence the cut-off size at which this may occur. On the LSE’s system, this happened with an upload of 0.5TB. This problem can be circumvented by setting the relevant command line parameter on the gfs_create_arkivum_upload.py script to trigger the zipping of the folder structure.

The second issue is that beyond a certain size of upload, the Arkivum software may divide the upload into multiple segments. When this happens, if the upload is of the “preservation and access” datapool-type, the ordering of entries in the AtoM module may be incorrect. There are probably multiple factors that can influence the cut-off point at which this segmentation occurs. Therefore, to be on the safe side, when deciding on the division of tranches within a project, it should be borne in mind that a tranche upload should be no larger than 100GB. At this size, it is not necessary to set the command line parameter to trigger the zipping of the whole folder structure, and the Arkivum software will not segment the uploaded data.

The figure of 100GB is a “best guess” arrived at on the basis of a limited amount of testing in V6.0 of the Arkivum software. On your particular system, the cut-off point may be higher or lower.

The third issue is that the gfs_create_arkivum_upload.py script creates a folder structure suitable for upload to Arkivum and copies files from the asset-folders specified on the command line into this folder structure, so sufficient space must be available on the drive to accommodate this extra requirement. If the command line specifies that the entire upload structure should be zipped, then an amount of space equivalent to that occupied by the upload folder structure will also be required on the drive until construction of the zip file is complete and the corresponding upload folder structure can be automatically deleted by the script.