Changes

Nick Bywell · 4067b069
--- a/LSE-Digital-Toolkit.md
+++ b/LSE-Digital-Toolkit.md
+[[_TOC_]]
+## Purpose, attributes, caveats and resources
+**Purpose**
+General:
+To provide a generic framework, independent of schema or upload target, in which files can be stored and easily manipulated.
+Specific:
+To provide a mechanism to control the end-to-end workflow for the digitisation and upload of large collections that have granular and detailed metadata.
+**Attributes**
+To be configurable so that any schema, and any upload target, can be accommodated.
+To be scalable. The maximum number of child-item-sets per project is just under 10 billion. Each child-item-set can contain a maximum of 9999 files per file-type.
+To possess a folder structure that has a fixed number of levels and yet which can accommodate a sufficient number of archival levels to cater for most requirements. This structure is referred to in this document as the [Generic Folder Structure (GFS)](https://git.lse.ac.uk/bywell/lse-digital-toolkit-perl-version/-/wikis/LSE-Digital-Toolkit#the-generic-folder-structure-gfs).
+To be configurable so that it is possible to manage the collections of multiple organisations that each have different requirements for schema, upload target and download source at organisational, departmental, project and tranche level (a tranche being a sub-component of a project, and the “unit” that is processed by the toolkit scripts).
+The toolkit is “open-source software” and is released under the [GNU General Public License v3.0](https://git.lse.ac.uk/bywell/lse-digital-toolkit-perl-version/-/blob/master/LICENSE). It is currently written in the Perl 5 scripting language but will soon be rewritten in the Python scripting language. The transition from Perl to Python should be seamless for the user.
+For users of [Arkivum's Digital Preservation Platform (Perpetua)](https://arkivum.com/) this version of the toolkit will work for all V5 customers, although Arkivum will have to configure the customer's system.
+Column headers in the csv files (that are associated with the internal functioning of the toolkit) use an internal "schema of convenience" called "gfs" that has no relation to any other schema of that name.
+**Caveats**
+The toolkit currently only caters for metadata text that is written in the English language and uses the standard ASCII character set. Even when the metadata text is in the English language, if metadata is copied and pasted into, or migrated into, the toolkit’s csv file manifests, and it contains non-ASCII characters, the scripts are likely to fail in an uncontrolled manner. It is therefore only advisable to use this toolkit if someone on your team is familiar with script-writing, metadata-wrangling, and character sets. There will be an attempt to cater for other languages in future versions but nothing can be guaranteed in this regard.
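A csv manifest can be screened for non-ASCII characters before it is given to the scripts. The following is a minimal sketch in Python, not part of the toolkit: the function name `find_non_ascii` and the sample column headers are hypothetical, and it assumes the manifest has already been read into a string.

```python
import csv
import io


def find_non_ascii(csv_text):
    """Return (row, column, character) for every non-ASCII character found."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        for column_number, cell in enumerate(row, start=1):
            for character in cell:
                # ASCII covers code points 0 to 127 only.
                if ord(character) > 127:
                    problems.append((row_number, column_number, character))
    return problems


# Hypothetical manifest content; the real tranche csv columns are not shown here.
sample = "identifier,title\nex1,Café minutes\nex2,Plain title\n"

for row_number, column_number, character in find_non_ascii(sample):
    print(f"row {row_number}, column {column_number}: "
          f"{character!r} (U+{ord(character):04X})")
```

Reporting the row and column, rather than silently replacing the character, leaves the decision about a suitable substitute with the cataloguer.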
+The toolkit only works with the Windows operating system. It is hoped that it will also work with macOS and Linux in the next (Python) version. 
+If legacy collections are migrated into the GFS, there are three different outcomes depending on the nature of the filenames in the collection:
+- if the filenames contain spaces or have no extension, they can reside in the GFS but the toolkit scripts cannot process them
+- if the filenames have no spaces in them and they have an extension, all the scripts except one will be able to process them (see the [Script groups](https://git.lse.ac.uk/bywell/lse-digital-toolkit-perl-version/-/wikis/LSE-Digital-Toolkit#script-groups) section)
+- if the migrated filenames can be renamed using the relevant toolkit script, all the scripts will be able to process them
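The first two outcomes above depend only on whether a filename contains spaces and whether it has an extension. A minimal sketch of that check in Python follows; the function name is hypothetical, and it does not test compliance with the GFS file-naming convention, which is described elsewhere.

```python
from pathlib import PurePath


def meets_minimum_requirements(filename):
    """True if the filename contains no spaces and has an extension."""
    has_space = " " in filename
    has_extension = PurePath(filename).suffix != ""
    return not has_space and has_extension


# Hypothetical legacy filenames.
for name in ["scan_001.tif", "annual report.tif", "README"]:
    status = "processable" if meets_minimum_requirements(name) else "not processable"
    print(f"{name}: {status}")
```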
+The advantages and disadvantages of using the GFS file-naming convention for migrated files are indicated in [The Generic Folder Structure (GFS)](https://git.lse.ac.uk/bywell/lse-digital-toolkit-perl-version/-/wikis/Version-1-LSE-Digital-Toolkit-(Perl-Version)#the-generic-folder-structure-gfs) section.
+If the toolkit is used to manage large digitisation projects with a digitisation company performing the digitisation, it is advisable to include certain items in the contract. These items are detailed in the [Managing your relationship with your Digitisation Provider](https://git.lse.ac.uk/bywell/lse-digital-toolkit-perl-version/-/wikis/LSE-Digital-Toolkit#managing-your-relationship-with-your-digitisation-provider) section.
+**Resources**
+[v1.0.2 release](https://git.lse.ac.uk/bywell/lse-digital-toolkit-perl-version/-/releases/v1.0.2)
+[Repository](https://git.lse.ac.uk/bywell/lse-digital-toolkit-perl-version)
+[Example Generic Folder Structure](https://drive.google.com/file/d/17F8rtleD-213YfIMEC5hhHSTQ4qwQIr3/view)
+[Twitter](https://twitter.com/LSEDigitalTK)
+## Overview of the toolkit
+The toolkit consists of a suite of scripts that are executed via command line, plus some configuration files. The scripts operate on [The Generic Folder Structure (GFS)](https://git.lse.ac.uk/bywell/lse-digital-toolkit-perl-version/-/wikis/LSE-Digital-Toolkit#the-generic-folder-structure-gfs) which provides an organisational hierarchy (organisation, department, project, tranche), represented by codes and numbers.
+The use of codes and numbers allows for the automatic creation of unique IDs at every level, and of unique upload slugs (a slug being a string of characters that forms part of a [URL](https://en.wikipedia.org/wiki/URL)). The easiest way to understand the capabilities of the toolkit, and to determine its utility for your organisation, is to follow the instructions in the [Getting started with the toolkit](https://git.lse.ac.uk/bywell/lse-digital-toolkit-perl-version/-/wikis/Getting-started-with-the-toolkit) section.
+The levels of the GFS are reflected in an (optional) naming convention for the asset files. If this naming convention is adopted, the files can be manipulated more easily by the scripts. It also ensures that each filename is unique. All but one of the scripts will still function if the file-naming convention is not adopted for a tranche within a project so long as the filenames abide by some minimum requirements that are listed in [The Generic Folder Structure (GFS)](https://git.lse.ac.uk/bywell/lse-digital-toolkit-perl-version/-/wikis/LSE-Digital-Toolkit#the-generic-folder-structure-gfs) section.
+The documentation is currently skewed towards archival processing, and specifically, upload to, and download from, Arkivum’s Digital Preservation Platform (Perpetua), which uses the ISAD(G) schema. However, if external developers wish to write scripts for other upload targets and download sources that use different schema, the documentation could become more generic, with all such scripts given their own sections in the documentation. For example, if there is a need to migrate a legacy collection of images of algae to the GFS, the Darwin Core Schema could be used and an upload script could be written that has a biological database as a target.
+One of the first things to consider when embarking on either a digitisation project, or a migration, is to what level the material should be divided up (the granularity). For example, should a bound volume of pamphlets be treated as a single item, and given just one catalogue entry, or should each pamphlet have its own catalogue entry? If it is the latter, the discoverability of the material will be improved once it has been uploaded to a website, and the download size of the files will be more convenient. However, it will require more cataloguing time to achieve this outcome. The toolkit provides a mechanism for expressing the required granularity. The smallest division becomes the child of a parent. So in the example above, each pamphlet would be the child of the one bound volume, which is the parent. The tranche csv files contain columns that allow this relationship to be expressed.  
+It is important that the granularity aspect of a project is considered before a tranche folder structure is created and populated because it is very time-consuming to rectify mistakes made in this aspect of a project post-digitisation. The information gained from assessing the granularity will allow a project manager to take account of the resourcing levels that will be required for a project. See the [Workflows](https://git.lse.ac.uk/bywell/lse-digital-toolkit-perl-version/-/wikis/LSE-Digital-Toolkit#workflows) section for more information about assessing the appropriate level of granularity for material.
+Once the tranche csv file(s) have been filled out and validated, a script is run that creates a corresponding folder structure. It is this folder structure, along with the tranche csv file, that can either be given to the Digitisation Provider to populate, or be the receptacle for migrated files.
+There is a script to validate that the Digitisation Provider (or migration script) has populated the tranche folder structure correctly, in terms of both the existence of files in the correct folders, and also whether the number of each set of derivative files matches the number of master files. For example, it will check that the number of jpg files matches the number of tif files in each child-item-set.  
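The count comparison described above can be illustrated with a short Python sketch. This is not the toolkit's validation script: the function name, the in-memory dictionary of filenames, and the sample names are all assumptions made for illustration, with tif taken as the master file-type and jpg as the derivative, as in the example.

```python
from collections import Counter
from pathlib import PurePath


def count_mismatches(child_item_sets, master_ext="tif", derivative_exts=("jpg",)):
    """Return {set_name: {ext: (derivative_count, master_count)}} for mismatches."""
    mismatches = {}
    for set_name, filenames in child_item_sets.items():
        # Count files per extension, ignoring case and the leading dot.
        counts = Counter(
            PurePath(name).suffix.lower().lstrip(".") for name in filenames
        )
        master_count = counts[master_ext]
        bad = {
            ext: (counts[ext], master_count)
            for ext in derivative_exts
            if counts[ext] != master_count
        }
        if bad:
            mismatches[set_name] = bad
    return mismatches


# Hypothetical child-item-sets; real GFS names follow the file-naming convention.
tranche = {
    "set_001": ["a_0001.tif", "a_0002.tif", "a_0001.jpg", "a_0002.jpg"],
    "set_002": ["b_0001.tif", "b_0002.tif", "b_0001.jpg"],
}
print(count_mismatches(tranche))
```

Here `set_002` is reported because it holds one jpg file against two tif files.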
+The toolkit is a relatively mature piece of software in some respects. The LSE has used it to process [many collections](https://lse-atom.arkivum.net/informationobject/browse) over the last three years, both for digitised and born-digital material.
+The [Economic History Collection](https://lse-atom.arkivum.net/uklse-dl1eh01) is an example of a large collection that the LSE has processed using the toolkit. It contains around 6300 child-item-sets. Each child-item-set contains ten to twenty alto, jpg, msword, text and tif files, plus one pdf file. It has a total disk space usage of about 7TB. Only the pdf files are disseminated through to the AtoM module of Arkivum's Perpetua Platform.
+The toolkit is only mature in the relatively narrow band of activity for which the LSE has used it. The toolkit has fourteen scripts, but only about eight of these are used in day-to-day work.
+When the toolkit is used with the ISAD(G) schema, it can be configured for "Library Processing". This allows certain fields, which are commonly used in bibliographic cataloguing, but are not present in the ISAD(G) schema, such as “Personal author”, “Corporate author”, “Publisher”, and “Note” to have their own columns in the tranche csv file manifests.
+When an upload target such as Arkivum's Perpetua is used, the content of these columns is combined and formatted within the “isadg.scopeAndContent” column.
+Tags, plus their content, can be created “on the fly” by entering them in the “gfs.contextualInformation” column of the tranche csv files. These tags are formatted and added to the content of the “isadg.scopeAndContent” column, as can be seen [here](https://lse-atom.arkivum.net/uklse-ex1zt01001001). This is a dissemination to Perpetua’s AtoM module that contains the uploaded content of the example GFS used in the [Getting started with the toolkit](https://git.lse.ac.uk/bywell/lse-digital-toolkit-perl-version/-/wikis/Getting-started-with-the-toolkit) guide.
+This feature is documented in the "Library Processing" sub-section of the Word document that can be downloaded from the [Generic Folder Structure (GFS)](https://git.lse.ac.uk/bywell/lse-digital-toolkit-perl-version/-/wikis/LSE-Digital-Toolkit#the-generic-folder-structure-gfs) section.
+## Getting started with the toolkit
+[Getting started with the toolkit](https://git.lse.ac.uk/bywell/lse-digital-toolkit-perl-version/-/wikis/Getting-started-with-the-toolkit)
+## The Generic Folder Structure (GFS)
+[The_Generic_Folder_Structure.docx](uploads/ad529954af77081656b28917d8372c9c/The_Generic_Folder_Structure.docx)
+## Configuration
+[Configuration](https://git.lse.ac.uk/bywell/lse-digital-toolkit-perl-version/-/wikis/Configuration)
+## Workflows
+**Digitisation Workflow**
+[Digitisation_workflow.docx](uploads/e8d843c40668950d940fb0aba5d89ee1/Digitisation_workflow.docx)
+**Migration workflow**
+[Migration_workflow.docx](uploads/3fcd72dad43a714d5e23914b3cc7a9e6/Migration_workflow.docx)
+**Arkivum tranche-cycle workflow**
+[Arkivum_tranche_cycle_workflow.docx](uploads/5dfa077936667eddc7bbba0aaf80e2dc/Arkivum_tranche_cycle_workflow.docx)
+**Temporary-folder tranche-cycle workflow**
+[Temporary-folder_tranche-cycle_workflow.docx](uploads/c2f02989cf88d715623c3e2f465fd38e/Temporary-folder_tranche-cycle_workflow.docx)
+## Cataloguing guide
+The cataloguing guide that can be downloaded via the link below is the LSE's internal cataloguing guide and is specific to its own requirements. 
+[LSE_Digital_Toolkit_v-1_User_Guide.docx](uploads/067f3ad7e69ad0745cf4562204a67468/LSE_Digital_Toolkit_v-1_User_Guide.docx)
+You may wish to consult this guide while evaluating the toolkit and then, if you decide to use the toolkit on a production basis, modify the document so that it suits the requirements of your own organisation.
+The content of this guide matches the default configuration of the [example GFS](https://drive.google.com/file/d/17F8rtleD-213YfIMEC5hhHSTQ4qwQIr3/view) that is used in the [Getting started with the toolkit](https://git.lse.ac.uk/bywell/lse-digital-toolkit-perl-version/-/wikis/Getting-started-with-the-toolkit) section.
+## Script groups
+[Script groups](https://git.lse.ac.uk/bywell/lse-digital-toolkit-perl-version/-/wikis/Script-groups)
+## Managing your relationship with your Digitisation Provider
+The Digitisation Provider could be a department within your own organisation, or it could be a company that provides such services. If it is the former, although it will be necessary to communicate a clear set of requirements to the department, contracts are unlikely to be involved.
+If you use this toolkit and sign a contract with a company to provide digitisation services, it is advisable for the contract to state that the company will be expected to populate the relevant folder-types within tranches of the LSE's Generic Folder Structure with the required master files and derivative files according to the granularity indicated in the tranche csv files.
+It should also be stated that the files should be named according to the toolkit's Generic File-naming Convention, and a table listing required file-types and derivative files should be included in the contract.
+Finally, it should be stated that one small test tranche will be populated by the company prior to commencing "production mode". This is so that the personnel of the Digitisation Provider have a chance to develop their own workflows and any teething problems encountered can be resolved. The customer can also verify that the outcome has met with expectations.
+The legalistic approach indicated above is not indicative of the LSE having a problem with contracting a Digitisation Provider. However, populating the GFS is likely to require the provider to develop new workflows, so ensuring that the provider is bound into this requirement is advisable.
+In fact, the LSE found that its Digitisation Provider took a positive view of the GFS, and requested that the scripts be installed on its own devices so that its personnel could run the validation script to ensure the tranches had been populated correctly. If the intention is to install scripts on the devices of the Digitisation Provider, it should be noted that the toolkit currently only functions on devices that use the Windows operating system. 
+Having a validation script can prove beneficial to both parties. In the course of implementing a large digitisation project, with thousands of child-item-sets, it would be easy for the staff of the Digitisation Provider to miss out some items. The Digitisation Provider will not want to have to bring all its equipment and staff back on site a month or two after the project has finished, just to digitise a small number of items that have been missed, so it is in its interest to have the output validated.
+The LSE's Digitisation Provider made suggestions for script enhancements, and evolved its own workflows for populating the folder-types with the master and derivative files. These workflows involved the use of the following scripts:
+- gfs_copy_folder_type_to_target.pl
+- gfs_distribute_files_to_tranche.pl
+There are a number of scenarios that could be applicable to the population of the folder-types in the tranches of the GFS by the Digitisation Provider:
+**Scenario 1**
+Direct population of the folder-types in the tranches of the GFS by the Digitisation Provider using the GFS File-naming convention. This is the ideal scenario.
+**Scenario 2**
+The Digitisation Provider is not willing to directly populate the folder-types within the GFS, but is willing to create the files with names that accord with the GFS File-naming Convention. In this scenario the desired outcome can still be achieved if the Digitisation Provider can deliver an entire tranche's worth of digitised files in one temporary folder. The gfs_distribute_files_to_tranche.pl script can be used to move the files to the folder-types within the tranche of the GFS. This scenario relies on the staff of the Digitisation Provider being very accurate in their naming of the files.
+**Scenario 3**
+The requirement is for the Digitisation Provider to directly populate the folder-types within the GFS, but not to create the files with names that accord with the GFS File-naming Convention. This can be done, but with the proviso that those scripts listed in the "Scripts that can only be used if all the files in a tranche comply with the GFS File-naming Convention" sub-section within the [Script groups](https://git.lse.ac.uk/bywell/lse-digital-toolkit-perl-version/-/wikis/Script-groups) section will not function.
+If subsequently deemed appropriate, the files could be renamed to comply with the GFS File-naming Convention by using the gfs_rename_tranche_files.pl script. This is perhaps not an ideal scenario because of potential issues with renaming files.  These issues are indicated in the [Generic Folder Structure (GFS)](https://git.lse.ac.uk/bywell/lse-digital-toolkit-perl-version/-/wikis/LSE-Digital-Toolkit#the-generic-folder-structure-gfs) section. 
+**Scenario 4**
+The Digitisation Provider cannot directly populate the folder-types within the GFS, or create the files with names that accord with the GFS File-naming Convention. In such a scenario, a migration script would have to be written to move the digitised files to the GFS. The gfs_migrate_tranche_files.pl script might form the basis of such a script but the degree of customisation that would be required would depend upon the nature of the folder structure created by the Digitisation Provider, and the file-naming convention that was used. 
+**Note**
+When a Digitisation Provider is in "production mode", it may have multiple staff on site digitising the material. It is therefore advisable to minimise any delays to production by having someone on your team who is familiar with the toolkit, script-writing, metadata-wrangling, and character sets, and who is available at short notice to trouble-shoot any problems encountered by the Digitisation Provider.
+## Future developments
+A rewrite of the toolkit into the Python Scripting Language
+Development of a configurable utility to delete and substitute non-ASCII characters in a file
+An attempt to improve the ability of the toolkit to cater for non-English metadata text. Hopefully, a partial improvement will be an outcome of the rewrite of the toolkit in the Python scripting language. Unfortunately, nothing can be guaranteed in this regard.
+Addition of checksum processing to the toolkit
+Enable the toolkit to cater for archival levels of unlimited depth
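As an indication of what the checksum item above might involve, the following Python sketch builds a SHA-256 manifest for every file under a folder. It is an assumption about one possible approach, not the planned design: the function names, the choice of SHA-256, and the temporary example folder are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so that large master files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(folder):
    """Map each file's path (relative to folder) to its SHA-256 checksum."""
    folder = Path(folder)
    return {
        path.relative_to(folder).as_posix(): sha256_of_file(path)
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


# Demonstration on a throwaway folder rather than a real tranche.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example.txt").write_bytes(b"hello")
    for name, checksum in checksum_manifest(tmp).items():
        print(checksum, name)
```

Comparing a manifest made before a transfer with one made afterwards would show whether any file had changed or gone missing.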
+## Contact
+Neil Stewart (Digital Library Manager)
+Email: n.stewart@lse.ac.uk
+Nick Bywell (Digital Library Developer)
+Email: n.bywell@lse.ac.uk
+## Author's note
+When I first joined the Digital Library Team at the LSE, four years ago, there was a pressing need to control the workflow for the digitisation of the LSE's [Economic History Collection](https://lse-atom.arkivum.net/uklse-dl1eh01). I looked around for a suitable tool but could not find one, and so began creating this toolkit. At the same time, the library was in the latter stages of a tender process for access to a hosted digital preservation platform. The library opted for Arkivum's Perpetua, which provided for both the preservation and dissemination of digital assets. Creating a suitable upload package for this platform became another requirement of the toolkit.
+The toolkit is written in the Perl 5 scripting language, which is rather ancient technology. I used it because I was in a hurry, and it was a scripting language with which I was familiar. Publication of the toolkit could have been delayed until after it had been rewritten in the Python scripting language, but as it has sufficient functionality to potentially be of use to other organisations, it seems appropriate to publish it now.
+The rewrite of the toolkit into Python will allow it to integrate more easily with other software packages and, hopefully, allow metadata text that is not in the English language to be used.
+I am grateful to the following colleagues for their input into the development of the toolkit:
+- **Fabi Barticioti**, whose archival expertise has been invaluable in developing various aspects of the toolkit. Fabi also developed the [LSE's internal cataloguing guide](https://git.lse.ac.uk/bywell/lse-digital-toolkit-perl-version/-/wikis/LSE-Digital-Toolkit#cataloguing-guide)
+- **Neil Stewart**, my line manager, who has always proved to be a wise sounding board, and who (along with those further up the management hierarchy) gave me the time to develop a generic toolkit, rather than one that was specific to the LSE's requirements
+- **Sylvia Gallotti**
+- **Emma Pizarro**
+- **Clare Mulhall**
+- **George Jukes**
+- **Andy Jack**
+- **Wendy Lynwood**
+I am also grateful to:
+- the staff of our Digitisation Provider, who took a positive approach to the GFS, when I feared that the population of the tranches, using the GFS's naming convention, might prove to be a stumbling block
+- the staff at Arkivum, who kindly made some modifications to their system so that the upload-file could be processed and displayed appropriately
+It would be interesting to hear from anyone who starts using the toolkit, or has problems with it, or is willing to give some feedback on its functionality. It would also be encouraging to hear from any developers who wish to contribute new scripts for additional upload targets. I can be contacted at n.bywell@lse.ac.uk.
+Nick Bywell (18th October 2021)
\ No newline at end of file