[[_TOC_]]
## Overview
Asset-files can become corrupted while being copied or transferred, or simply while residing on a hard drive, due to factors such as operating system defects, application defects, hardware malfunction, or cosmic rays flipping bits within the bytes that make up an asset-file. It is therefore advisable to store a checksum value for each asset-file. If an asset-file's content changes in any way, its calculated checksum value will change, so regularly comparing the stored checksum value with a newly calculated one will flag any change in the asset-file's content.

The toolkit provides a mechanism for storing the checksum values of asset-files and for checking these values against freshly calculated checksum values. Users who wish to make use of this functionality can incorporate it into their workflows.

The toolkit scripts involved in file fixity checking operate upon checksum manifest files, which are held in checksum-folders within the child-folders of the GFS. Some of the scripts create these folders and files, some validate their content, and some, whose primary purpose is some other function, nevertheless take account of file fixity checking as part of their processing.

Those scripts whose primary purpose relates to checksum-file processing are listed below:
- gfs_batch_validate_checksum.bat
- gfs_create_or_delete_tranche_checksum_folder.py
- gfs_create_tranche_checksum_file.py
- gfs_generate_checksum.py
- gfs_validate_tranche_checksum.py

Those scripts that include some file fixity checking as part of their processing are listed below:

- gfs_create_arkivum_upload.py
- gfs_distribute_arkivum_export_to_tranche.py
- gfs_copy_tranche_files_to_folder.py
- gfs_copy_tranche_to_tranche.py

The length of the hexadecimal checksum string varies according to the type of checksum used. The checksum-types catered for in the toolkit are md5, sha256, and sha512, although others could be added as needed.
The md5 checksum is used as the default type in example command lines because it takes around 30% less time to create than a sha256 checksum.
The creation and validation scripts as a whole, however, run only 10% faster with md5 than with sha256, because of all the other processing that the scripts have to perform.
The exception is the script gfs_distribute_arkivum_export_to_tranche.py, for which sha256 should be used, because it is the checksum-type used in the BagIt exports from Arkivum.
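
The relationship between checksum-type and checksum length, and the way a single flipped bit changes the checksum, can be illustrated with Python's standard hashlib module. This is a minimal sketch and is not taken from the toolkit; the helper name `hex_checksum` and the sample content are assumptions made for illustration.

```python
import hashlib


def hex_checksum(data: bytes, checksum_type: str) -> str:
    """Return the hexadecimal checksum of data (hypothetical helper)."""
    return hashlib.new(checksum_type, data).hexdigest()


# Sample content standing in for an asset-file (illustrative only).
original = b"example asset-file content"

# Simulate corruption by flipping the lowest bit of the first byte.
corrupted = bytearray(original)
corrupted[0] ^= 0x01
corrupted = bytes(corrupted)

checksum_types = ("md5", "sha256", "sha512")

# Length of the hexadecimal string for each checksum-type.
lengths = {t: len(hex_checksum(original, t)) for t in checksum_types}

# Whether the one-bit change altered the checksum for each checksum-type.
changed = {
    t: hex_checksum(original, t) != hex_checksum(corrupted, t)
    for t in checksum_types
}

print(lengths)  # {'md5': 32, 'sha256': 64, 'sha512': 128}
print(changed)
```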
## Functionality
The script gfs_create_or_delete_tranche_checksum_folder.py can be used to create checksum-folders within the child-folders of a tranche of the GFS. The folder names have the form:

"<asset folder name>_checksum_<checksum type>"

For example, if the corresponding asset-folder has the name "jpg", the checksum-folder that is created will have the name "jpg_checksum_md5".
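
The naming convention above can be sketched in a few lines of Python. This is an illustration only, not the code of gfs_create_or_delete_tranche_checksum_folder.py; the function names, the `child_folder` argument, and the idea of creating the checksum-folder directly inside it are assumptions.

```python
from pathlib import Path

# Checksum-types named in this page; the tuple itself is an assumption.
SUPPORTED_CHECKSUM_TYPES = ("md5", "sha256", "sha512")


def checksum_folder_name(asset_folder_name: str, checksum_type: str) -> str:
    """Build "<asset folder name>_checksum_<checksum type>" (hypothetical helper)."""
    if checksum_type not in SUPPORTED_CHECKSUM_TYPES:
        raise ValueError(f"Unsupported checksum type: {checksum_type}")
    return f"{asset_folder_name}_checksum_{checksum_type}"


def create_checksum_folder(
    child_folder: Path, asset_folder_name: str, checksum_type: str
) -> Path:
    """Create the checksum-folder inside child_folder if it does not yet exist."""
    folder = child_folder / checksum_folder_name(asset_folder_name, checksum_type)
    folder.mkdir(parents=True, exist_ok=True)
    return folder


print(checksum_folder_name("jpg", "md5"))  # jpg_checksum_md5
```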

Once the checksum-folders have been created for a particular tranche, the gfs_create_tranche_checksum_file.py script can be run. It creates checksum manifest files within the checksum-folders, with names of the form:

"manifest-<checksum type>.txt"

For example, inside the checksum-folder named "jpg_checksum_md5" mentioned above, a file is created that has the name "manifest-md5.txt". This file contains a list of paired values: a checksum followed by the corresponding asset-file-name.

An example of the content of a manifest-md5.txt file that has three rows can be seen below.

116907a4ca1efc40a57d48ab1db7adfc5 UKLSE_EX1_ZT01_001_001_0001_0001.jpg
5501bc5ef3f7dc0d09e7e4d073d4902d7 UKLSE_EX1_ZT01_001_001_0001_0002.jpg
f9fd6bd53b67bf22188ba1597ced3ee7d UKLSE_EX1_ZT01_001_001_0001_0003.jpg
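
A manifest of this shape could be produced along the following lines. This is a hedged sketch rather than the implementation in gfs_create_tranche_checksum_file.py: the function names, the single-space separator (inferred from the example rows), and the sorting of rows by file name are all assumptions.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, checksum_type: str = "md5") -> str:
    """Calculate a file's checksum, reading in chunks to limit memory use."""
    digest = hashlib.new(checksum_type)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(
    asset_folder: Path, checksum_folder: Path, checksum_type: str = "md5"
) -> Path:
    """Write "manifest-<checksum type>.txt" with one "<checksum> <file name>" row per file."""
    manifest_path = checksum_folder / f"manifest-{checksum_type}.txt"
    rows = [
        f"{file_checksum(asset_path, checksum_type)} {asset_path.name}\n"
        for asset_path in sorted(asset_folder.iterdir())
        if asset_path.is_file()
    ]
    manifest_path.write_text("".join(rows), encoding="utf-8")
    return manifest_path
```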

Once the checksum-folders and checksum-manifest-files have both been created for a tranche, the gfs_validate_tranche_checksum.py script can be run to check that newly calculated checksums for the asset-files match the checksum values contained in the checksum-manifest-file.
If the two values are not identical for any asset-file, the script will report the discrepancy, and action can be taken to replace the corrupted asset-file with an uncorrupted version from a backup system.
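
The validation step amounts to recalculating each checksum and comparing it with the stored value. The sketch below shows one way this could look; it is not the code of gfs_validate_tranche_checksum.py, and the function names, the returned list of problems, and the treatment of a missing asset-file as a reportable problem are assumptions.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, checksum_type: str = "md5") -> str:
    """Calculate a file's checksum, reading in chunks to limit memory use."""
    digest = hashlib.new(checksum_type)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_manifest(
    asset_folder: Path, manifest_path: Path, checksum_type: str = "md5"
) -> list:
    """Return (file name, problem) pairs; an empty list means every file matched."""
    problems = []
    for line in manifest_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # Skip blank rows.
        # Split on the first run of whitespace only, so that file names
        # containing spaces are kept intact.
        stored_checksum, file_name = line.split(maxsplit=1)
        asset_path = asset_folder / file_name
        if not asset_path.is_file():
            problems.append((file_name, "missing"))
        elif file_checksum(asset_path, checksum_type) != stored_checksum.lower():
            problems.append((file_name, "checksum mismatch"))
    return problems
```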

The gfs_batch_validate_checksum.bat script can be set up so that, with just one click of a file-icon, the gfs_validate_tranche_checksum.py script is run against all the tranches in the entire GFS.

If the asset-files are located on a legacy drive and are to be migrated into the GFS using the gfs_migrate_tranche_folder.py script, it is possible to check that the fixity of the asset-files has been retained as they are copied or moved. This is done by first creating the appropriate checksum-folders, and then running the gfs_create_tranche_checksum_file.py script with the <asset file location> parameter set to "legacy" (rather than "gfs").
This results in the path for the asset-files being taken from the "gfs.legacyPath" column within the tranche.csv file, rather than from the asset-folders within the GFS.
After the asset-files have been copied or moved into the asset-folders of the GFS, the gfs_validate_tranche_checksum.py script can then be run to check that the pre-migration checksums match the post-migration checksums.
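
The migration check described above can be sketched as follows. Only the "legacy" and "gfs" values and the "gfs.legacyPath" column name come from this page; the function names, the representation of a tranche.csv row as a dictionary, and the comparison of two name-to-checksum mappings are assumptions made for illustration, not the toolkit's implementation.

```python
import hashlib
from pathlib import Path


def resolve_asset_folder(
    tranche_row: dict, asset_file_location: str, gfs_asset_folder: Path
) -> Path:
    """Choose where asset-files are read from: the legacy path or the GFS."""
    if asset_file_location == "legacy":
        return Path(tranche_row["gfs.legacyPath"])
    if asset_file_location == "gfs":
        return Path(gfs_asset_folder)
    raise ValueError(f"Unknown asset file location: {asset_file_location}")


def folder_checksums(folder: Path, checksum_type: str = "md5") -> dict:
    """Map each file name in folder to its checksum.

    Files are read whole for brevity; large files would be read in chunks.
    """
    return {
        path.name: hashlib.new(checksum_type, path.read_bytes()).hexdigest()
        for path in sorted(folder.iterdir())
        if path.is_file()
    }


def changed_after_migration(pre_migration: dict, post_migration: dict) -> list:
    """List the files whose checksum is missing or different after migration."""
    return sorted(
        name
        for name, checksum in pre_migration.items()
        if post_migration.get(name) != checksum
    )
```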
## Policy
It is recommended that fixity checking be applied to all the asset-folders within a tranche.
It could be argued that it is only worth checking the fixity of asset-files such as tiff and wav files, because the asset-files that are derived from them (such as jpg and mp3) could be regenerated. However, it is simpler from an operational point of view to check them all, rather than having to decide which types of asset-file merit inclusion.
The storage space required for the checksum-manifest-files is trivial, and the derived files tend to be small, so the time taken to create their checksums is unlikely to be a significant component of the time taken for the entire process.

Since the md5 checksum was developed, other checksum types of greater length have been developed. The need for increased length came from applications of checksums relating to security.
File fixity validation does not benefit from checksums of greater length, and the longer checksum types take more time to calculate.
It is therefore recommended that the md5 checksum type be used, because the scripts will take less time to complete their processing than when using longer checksums.
The only other factor to take into account when selecting a checksum type is whether the platform into which a tranche may be ingested validates the checksum values and, if so, which checksum types it caters for.

[Return to documentation home page](https://git.lse.ac.uk/hub/lse_digital_toolkit/-/wikis/LSE-Digital-Toolkit)