|
|
|
|
|
Specific:
|
|
|
|
|
|
To provide a mechanism to control the end-to-end workflow for the digitisation and upload of large collections. In addition, to provide a mechanism to control some aspects of the workflow for born-digital materials.
|
|
|
|
|
|
**Attributes**
|
|
|
|
|
|
|
|
|
|
**Resources**
|
|
|
|
|
|
[v2.0.1 release](https://itsagit.lse.ac.uk/hub/lse_digital_toolkit/-/releases/v2.0.1)
|
|
|
|
|
|
[Repository](https://itsagit.lse.ac.uk/hub/lse_digital_toolkit)
|
|
|
|
|
|
|
|
|
|
The documentation is currently skewed towards archival processing, and specifically towards upload to, and download from, Arkivum’s Digital Preservation Platform (Perpetua), which uses the ISAD(G) schema. However, if external developers wish to write scripts for other upload targets and download sources that use different schemas, the documentation could become more generic, with each such script given its own section. For example, if there were a need to migrate a legacy collection of images of algae to the GFS, the Darwin Core schema could be used, and an upload script could be written that has a biological database as its target.
|
|
|
|
|
|
One of the first things to consider when embarking on either a digitisation project or a migration of born-digital material is the level to which the material should be divided up (the granularity). For example, in a digitisation project, should a bound volume of pamphlets be treated as a single item and given just one metadata entry, or should each pamphlet have its own metadata entry? If the latter, the discoverability of the material will be improved once it has been uploaded to a website, and the files will be a more convenient size to download; however, it will take more cataloguer time to achieve this outcome. The toolkit provides a mechanism for expressing the required granularity: the smallest division becomes the child of a parent. So in the example above, each pamphlet would be a child of the parent, which would be the bound volume of pamphlets. The tranche csv files contain columns that allow this parent-child relationship to be expressed.
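A minimal sketch of how such a parent-child grouping might be read from a tranche csv file. The column names used here (`parent_ref`, `child_ref`, `title`) are hypothetical stand-ins; the toolkit's own documentation defines the actual tranche csv schema.

```python
# Illustrative only: the column names below are hypothetical, not the
# toolkit's actual tranche csv schema.
import csv
from collections import defaultdict

def group_children_by_parent(tranche_csv_path):
    """Group child references under their parent reference."""
    children = defaultdict(list)
    with open(tranche_csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            children[row["parent_ref"]].append(row["child_ref"])
    return children

# Example rows for the bound volume of pamphlets described above:
# parent_ref,child_ref,title
# VOL-001,VOL-001-01,"Pamphlet on free trade"
# VOL-001,VOL-001-02,"Pamphlet on tariff reform"
```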
|
|
|
|
|
|
|
|
For those familiar with archival terminology, a project might equate to the collection level, a tranche to the series level, a parent to the subseries level, and a child to the file level; the archival "file level" may itself contain one or more digital files.
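Expressed as a simple mapping (illustrative only; the toolkit does not ship this table):

```python
# Hypothetical mapping of toolkit levels to ISAD(G) archival levels,
# summarising the correspondence described above.
TOOLKIT_TO_ISADG = {
    "project": "collection",
    "tranche": "series",
    "parent": "subseries",
    "child": "file",  # an archival "file" may hold several digital files
}
```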
|
|
|
|
|
|
It is important to consider the granularity of a project before a tranche folder structure is created and populated, because it is very time-consuming to rectify mistakes in this aspect of a project post-digitisation. Assessing the granularity will also allow a project manager to gauge the resourcing levels that the project will require. See the [Workflows](https://itsagit.lse.ac.uk/hub/lse_digital_toolkit/-/wikis/LSE-Digital-Toolkit#workflows) section for more information about assessing the appropriate level of granularity for material.
|
|
|
|
|
|
Once the fields in the tranche csv file(s) have been filled out and validated, a script is run that creates a corresponding folder structure. It is this folder structure (along with the tranche csv file) that can either be given to the Digitisation Provider to populate, or be the receptacle for migrated files.
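A minimal sketch of what such a folder-creation step might look like, assuming the same hypothetical `parent_ref`/`child_ref` columns as above; the toolkit's actual script may differ.

```python
# Sketch only: creates one folder per parent with one subfolder per child,
# mirroring the parent-child rows of a validated tranche csv file.
import csv
from pathlib import Path

def create_tranche_folders(tranche_csv_path, tranche_root):
    root = Path(tranche_root)
    with open(tranche_csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # e.g. <tranche_root>/VOL-001/VOL-001-01/
            (root / row["parent_ref"] / row["child_ref"]).mkdir(
                parents=True, exist_ok=True
            )
```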
|
|
|
|
|
|
There is a script to validate that the Digitisation Provider (or migration script, or the archivist responsible for born-digital material) has populated the tranche folder structure correctly: it checks that files exist in the correct folders and, if the GFS naming convention is used for the files, that the filenames have continuous sequence numbers. It also checks that the number of files in each set of derivatives matches the number of master files; for example, that the number of jpg files matches the number of tif files in each child-item-set.
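A sketch of the kinds of checks described, applied to a single child-item-set folder. It assumes tif masters, jpg derivatives, and a trailing sequence number in each filename; the toolkit's actual validation script lives in the repository and is more thorough.

```python
# Illustrative checks only, not the toolkit's real validation script.
import re
from pathlib import Path

def validate_child_item_set(folder):
    """Return a list of problems found in one child-item-set folder."""
    problems = []
    masters = sorted(Path(folder).glob("*.tif"))
    derivatives = sorted(Path(folder).glob("*.jpg"))

    # The derivative count should match the master count (e.g. jpg vs tif).
    if len(derivatives) != len(masters):
        problems.append(
            f"{folder}: {len(derivatives)} jpg files vs {len(masters)} tif files"
        )

    # Trailing sequence numbers in the filenames should be continuous.
    seqs = sorted(
        int(m.group(1))
        for f in masters
        if (m := re.search(r"(\d+)$", f.stem))
    )
    if seqs and seqs != list(range(seqs[0], seqs[0] + len(seqs))):
        problems.append(f"{folder}: sequence numbers are not continuous: {seqs}")

    return problems
```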
|
|
|
|
|
|
The toolkit is a relatively mature piece of software in some respects. The LSE has used it to process [many collections](https://lse-atom.arkivum.net/informationobject/browse) over the last five years, both for digitised and born-digital material.
|
|
|
|
|
|
The [Economic History Collection](https://lse-atom.arkivum.net/uklse-dl1eh01) is an example of a large collection that the LSE has processed using the toolkit. It contains around 6300 child-item-sets. Each child-item-set contains ten to twenty alto, jpg, msword, text and tif files, plus one pdf file. Its total disk-space usage is around 7 TB. Only the pdf files are disseminated through to the AtoM module of Arkivum's Perpetua Platform.
|
|
|
|
|