CarLibrary.org - Greenstone Collections Created with Embedded Metadata

June 15, 2016

CarLibrary.org promotes the creation of digital archives by car historians, collectors, museums and collections with the use of the open-source Greenstone Digital Library program and other software.

This webpage describes using Google's Picasa and other Windows programs to improve the identification - and eventual classification - of digital photos, scanned photos and scanned documents. This is useful not only for Greenstone, but also for many other programs that classify and display digital assets.

A further CarLibrary.org webpage/topic on "Embedded Metadata" describes how to use the open-source ExifTool program, which has more advanced functions for adding and modifying metadata than Picasa or any other photo imaging software tested so far.

Any experience gained from the steps below will be very useful in understanding and using the ExifTool program.

1. Classifying and Archiving Photos for Greenstone

Other topics/webpages on this website - and many Internet guides - offer good instructions on downloading and installing Greenstone.

A. After the program is installed, digital files which have been captioned and tagged can be dragged ("gathered") into Greenstone (version 2.85 for Windows was used for the following examples).

B. No other metadata needs to be added to any photograph in the Greenstone "Enrich" panel.

C. In the "Design" panel, configure the Greenstone Image "plugin" to extract "OIDmetadata dc:Description"

D. "Create" the collection. When this is complete, "Preview the Collection" and you should see Picasa captions displayed in the "ex.XMP.Description" metadata.

E. This metadata category can be renamed in the Format panel as "Caption" for the search results display. Greenstone will display the Picasa caption with each digital photograph.

F. The screenshot below (Figure 1) shows that Greenstone has also extracted the photo's file name from "ex.Source" and the date it was taken from the EXIF metadata "ex.EXIF.DateTimeOriginal".

G. Photos in this trial collection can also be found by searching for any desired text in the "Captions" or "Photo Dates" categories.

Figure 1 - Screenshot showing browse results on Captions starting with "T"

H. Picasa Captions and Tags (keywords) can be used directly in Greenstone. Figure 2, below, shows the results of browsing the keywords category to display the photo ID numbers, which were embedded as Picasa "tags."

Figure 2 - A display of keywords starting with "1", which shows the photos with trial ID numbers starting with "12", etc. The Greenstone "search" function can also be used to locate a specific ID number - or other keyword.

A Greenstone test archive of 190 personal photos taken at the Mullin Automotive Museum was made using only embedded metadata added with Picasa. The metadata includes location data (latitude and longitude) for each photo. The newest Greenstone, version 3.0, can use this data for map displays.

Figure 3 - An archive of Mullin museum photos; this is the initial display of "Captions". Picasa was used to identify each image with captions and with "car make" and "car year" as tags/keywords. The file names and photo dates are standard metadata embedded by the digital camera and extracted by Greenstone automatically.
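
As a rough way to check which caption and keywords a photo actually carries before it is gathered into Greenstone, the sketch below reads them from the file's XMP packet using only Python's standard library. It assumes the caption is stored in the XMP dc:description field and the tags in dc:subject (which is what the "ex.XMP.Description" display above suggests). The function names and the sample file name are hypothetical, and this is not how Greenstone itself extracts metadata.

```python
import re
import xml.etree.ElementTree as ET

# Namespace URIs defined by the RDF and Dublin Core specifications.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def find_xmp_packet(data: bytes):
    """Return the first <x:xmpmeta>...</x:xmpmeta> block in data, or None."""
    match = re.search(rb"<x:xmpmeta.*?</x:xmpmeta>", data, re.DOTALL)
    return match.group(0) if match else None


def read_caption_and_keywords(data: bytes):
    """Return (caption, keywords) from the XMP packet found in data.

    caption is None and keywords is [] when the fields are absent.
    """
    packet = find_xmp_packet(data)
    if packet is None:
        return None, []
    root = ET.fromstring(packet)
    # dc:description is a language-alternative list; take its first entry.
    caption_node = root.find(".//dc:description/rdf:Alt/rdf:li", NS)
    caption = caption_node.text if caption_node is not None else None
    # dc:subject is an unordered bag of keywords (tags).
    keyword_nodes = root.findall(".//dc:subject/rdf:Bag/rdf:li", NS)
    keywords = [node.text for node in keyword_nodes]
    return caption, keywords


def read_photo(path):
    """Read a photo from disk and return its (caption, keywords)."""
    with open(path, "rb") as handle:
        return read_caption_and_keywords(handle.read())


# Example use (hypothetical file name):
# caption, keywords = read_photo("mullin_0012.jpg")
```

If the caption comes back as None, the photo has no XMP description, and Greenstone would have nothing to show in that metadata field.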

2. Digital File Recommendations

These recommendations are based on the trials above, practical considerations and archive guidelines (U.S. National Archives and the Smithsonian Institution):

Digital Photographs:

a. Use software such as the ExifToolGUI, Picasa or Photoshop to put descriptions (captions) on each photo. Captions will be very useful for later identifying the photo.

b. Use tags to add "keywords" to each photo or group of photos.

c. A unique ID number can be added to each photo, preferably as the first "tag". Best museum/archive practice is to make this unique number an "accession number". If you use the ExifToolGUI to add an accession number, put it in the DC:Identifier category.

d. Photos can be geo-tagged (Picasa red pin) to locate each photo or group of photos on Google's maps.

e. Save or export the photos for use in Greenstone or other archive software at a resolution suitable for the archive's use.
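
For readers who prefer the ExifTool command line to the ExifToolGUI, steps a to c above could be scripted roughly as follows. This is only a sketch: it assumes ExifTool is installed and on the PATH, and that the caption, keywords and accession number belong in the XMP Dublin Core fields Description, Subject and Identifier. The function names and sample values are hypothetical.

```python
import shutil
import subprocess


def build_exiftool_command(photo_path, caption, keywords, accession_number=None):
    """Build an exiftool argument list that writes caption, keywords and ID."""
    command = ["exiftool", f"-XMP-dc:Description={caption}"]
    # Repeating the assignment for a list-type tag writes one entry per value.
    for keyword in keywords:
        command.append(f"-XMP-dc:Subject={keyword}")
    if accession_number is not None:
        command.append(f"-XMP-dc:Identifier={accession_number}")
    command.append(photo_path)
    return command


def tag_photo(photo_path, caption, keywords, accession_number=None):
    """Run exiftool on one photo; raises if exiftool is missing or fails."""
    if shutil.which("exiftool") is None:
        raise FileNotFoundError("exiftool was not found on the PATH")
    command = build_exiftool_command(photo_path, caption, keywords, accession_number)
    subprocess.run(command, check=True)


# Example use (hypothetical values):
# tag_photo("mullin_0012.jpg", "1937 Talbot-Lago", ["12001", "Talbot-Lago", "1937"], "12001")
```

By default ExifTool keeps a backup of the unmodified file with "_original" appended to its name, so a trial run on a few photos is easy to undo.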

Scanned images:

a. Scan at least at 300 dpi; best museum practice recommends 600 dpi.

b. If "archive quality" is not a concern, scanning to JPG format is "OK".

c. If museum standards are desired, scan to TIFF or PDF/A format (Note: "PDF/A" is an ISO standard for long-term archive storage and use of digital images and documents).

d. Use the same techniques as for digital photos above to add captions, tags and geo-tags to each image.

e. Export the images. Some programs will convert TIFF images to JPG format; image metadata will be preserved.

f. However, many programs do not recognize PDF/A formatted images. Subject, keyword and other identifying data can be added in the ExifToolGUI or a PDF editor, such as Adobe Acrobat, ABBYY FineReader, or Lightning PDF Editor.

Scanned Slides and Negatives:

a. If you have the original negative or slide for any image, scanning the slide or negative directly will almost always give better results than scanning a print made in a darkroom or on a digital printer.

b. The same steps for scanned images apply, except the most common negative/slide format - 35 mm - should be scanned at 2800-4000 dpi. This resolution should be within the optical resolution of your scanner. Better quality scanners (usually those costing more than $100 or scanners that are not part of an "all-in-one" printer) will give better, near-professional archive quality results.

c. The software included with your scanner may be adequate. You should scan several slides or negatives and check the results to determine whether you need a software upgrade or alternative.
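
To see what the 2800-4000 dpi recommendation means in practice, the short calculation below converts a scan resolution into pixel dimensions. It assumes a standard full-frame 35 mm image area of 36 x 24 mm; the function names are illustrative only.

```python
MM_PER_INCH = 25.4


def scan_pixels(width_mm, height_mm, dpi):
    """Return (width, height) in pixels for an original scanned at dpi."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px


def megapixels(width_mm, height_mm, dpi):
    """Return the size of the resulting scan in millions of pixels."""
    width_px, height_px = scan_pixels(width_mm, height_mm, dpi)
    return width_px * height_px / 1_000_000


# Assumed image area of a full-frame 35 mm negative or slide.
FRAME_35MM = (36, 24)

for dpi in (2800, 4000):
    width_px, height_px = scan_pixels(*FRAME_35MM, dpi)
    print(f"{dpi} dpi: {width_px} x {height_px} pixels, "
          f"{megapixels(*FRAME_35MM, dpi):.1f} megapixels")
```

Under that assumption, a 35 mm frame comes out at about 3969 x 2646 pixels (10.5 megapixels) at 2800 dpi and about 5669 x 3780 pixels (21.4 megapixels) at 4000 dpi.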

Scanned Documents:

a. If text recognition (and later text searching) is not a concern, scan as described above for images.

b. However, text recognition is important! Therefore scan to PDF, multi-image TIFF or PDF/A at 300 dpi or higher.

c. Process each document with good optical character recognition (OCR) software such as Adobe Acrobat, ABBYY FineReader or other tested and proven OCR software. This software will also convert 600 dpi TIFF files to OCR-text files.

d. Add identifying information after the OCR stage with the same OCR software. This information will be located in the XMP metadata category.

e. The ExifToolGUI can also be used to add metadata to a PDF or PDF/A file if the Workspace manager is configured for this function.

Note: TIFF files are a long-recognized standard for archiving photos and scanned images/documents. In a white paper, "Guidelines for TIFF Metadata, Recommended Elements and Format", a US government standards organization recommends using "ImageDescription" for the subject of the item and "ImageUniqueID" for a unique file identifier. In these trials, however, the seemingly logical "ImageUniqueID" was empty in TIFF files, although it is very much in use in digital camera images.
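
To check whether a TIFF file carries an "ImageDescription" at all, a minimal sketch such as the following can read it directly from the file. It relies on the TIFF layout (an 8-byte header followed by a directory of 12-byte entries) and on ImageDescription being tag 270; it only looks at the first image directory of a classic (non-BigTIFF) file. It does not read "ImageUniqueID", which is stored in the separate Exif sub-directory that this sketch does not follow. The function name and file name are hypothetical.

```python
import struct

IMAGE_DESCRIPTION = 270  # TIFF tag 0x010E, stored as ASCII text
ASCII_TYPE = 2           # TIFF field type code for ASCII


def read_image_description(data: bytes):
    """Return the ImageDescription text from the first IFD of a TIFF, or None."""
    byte_order = data[:2]
    if byte_order == b"II":
        endian = "<"  # little-endian ("Intel") byte order
    elif byte_order == b"MM":
        endian = ">"  # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (entry_count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    for index in range(entry_count):
        start = ifd_offset + 2 + index * 12
        tag, field_type, count = struct.unpack(endian + "HHI", data[start:start + 8])
        if tag != IMAGE_DESCRIPTION or field_type != ASCII_TYPE:
            continue
        if count <= 4:
            # Short values are stored inline in the entry itself.
            raw = data[start + 8:start + 8 + count]
        else:
            # Longer values are stored elsewhere; the entry holds their offset.
            (value_offset,) = struct.unpack(endian + "I", data[start + 8:start + 12])
            raw = data[value_offset:value_offset + count]
        return raw.rstrip(b"\x00").decode("ascii", errors="replace")
    return None


# Example use (hypothetical file name):
# with open("scan_0001.tif", "rb") as handle:
#     print(read_image_description(handle.read()))
```

A result of None means the first image directory has no ImageDescription entry, so the subject would need to be added with a metadata editor before archiving.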

Email me with any comments, suggestions or questions: Bob Schmitt, rgschmitt@gmail.com