The Car Library Project: Trials and Tests with Picasa, Metadata, Greenstone and the ExifTool
The CarLibrary.org digital archive project promotes the use of the open-source Greenstone digital library program by car historians, collectors, museums and collections, to encourage the creation of digital archives. This webpage describes trials using Picasa and other Windows programs to improve the identification - and eventual archive-style classification - of digital photos and scanned documents and photos. These trials started in early December 2012. The results are shown in chronological order; the most recent trials were completed in March 2013.

Google announced on February 12, 2016 that Picasa, its desktop photo editing and management program, would not be supported after March 2016. Picasa Web Albums, the online feature of this program, would transition to Google Photos. This webpage still recommends using Picasa for basic photo captioning and metadata tagging, including location tagging. "Desktop" Picasa will function indefinitely for this use. Software ("app") recommendations will be updated as replacements become known.

Captioning in Picasa: Trials

I've personally used the Picasa photo editing/organizing program (free download from Google) for several years for modest edits and to organize - by year and month of creation - thousands of digital photographs, many images scanned from slides or negatives, and images of scanned documents. But only a small fraction of my digital assets are fully identified, even though the Picasa captioning step is not difficult. However, if Picasa captions can be used by other programs (such as the Greenstone digital library/archive program) to make more complex archives, this secondary use would be a strong incentive to always caption digital photos.

Tests were made with a small group of digital photos which were captioned in Picasa. After captioning, the photos were inspected in the Adobe Photoshop Elements software program.
The "File info" command shows the Picasa captions are embedded metadata, as expected. In each photo the captions showed up in three places:
Adding the Images to a Greenstone Collection

To see if this metadata had further use, a new Greenstone "collection" was started:
Another Use for Captions

One preliminary technique put a unique ID (accession number) in the Picasa Caption field as embedded data. It was thought this could be useful for archives or collections that follow the museum "best practices" technique for object numbering. For example, the first photo in Figure 1 below could have "The three Italian vehicles 12090043" in the "caption" field. If this number had been written on the original photograph or slide to be scanned, the digital representation could be readily cross-referenced and located through Greenstone's search function.

Picasa photo editing features are very compatible with archive standards. No photo edit - except the caption, tags and geotagging - makes an actual change to the original photo until it is "saved" or "exported" to a different folder. Image edits are stored in a small, separate Picasa file. After image edits have been made in Picasa, it is easy to export a folder or group of photos to a folder (intended for adding to or archiving in Greenstone) in the original photo size or a choice of a smaller photo size.

The screenshot below (Figure 1) - from a third trial - shows that Greenstone can also extract the photo file name from "ex.Source" and the date it was taken from the EXIF metadata "ex.EXIF.DateTimeOriginal".

Figure 1 - Screen shot shows browse results on Captions starting with "T"

This was an encouraging result even though this metadata did not appear in Greenstone's "Enrich" panel for the images tested to date. We know it's there! Further trials were conducted and the search results display was customized and improved.

Using Picasa "Tags" (Keywords)

Further trials with Picasa and digital photos were made in January 2013 to determine if Picasa alone can be used to embed more metadata elements. Captions were added in Picasa as previously described. Several photos were geotagged using the "red pin" (lower right), and multiple tags (keywords) were added using the "Tag" function (also lower right).
The ExifToolGUI (see below) shows the embedded metadata results:
Picasa can easily add photo locations to images, and this "geotagging" was confirmed to be stored in a standard EXIF metadata location. Greenstone can extract and display this data, but tests to display it in a map program are pending.

Further Tests for Scanned Photos and Documents

As noted above, Picasa stores its captions in XMP and IPTC metadata fields. Initial tests were done with the XMP and IPTC metadata. TIFF files also use a subset of the EXIF metadata specification; the next paragraph shows why this is important. To become more consistent - and permit Greenstone to extract captions from a single metadata category - metadata extraction trials were done using EXIF metadata:
For TIFF files - a long-recognized standard for archiving photos and scanned images/documents - the white paper "Guidelines for TIFF Metadata, Recommended Elements and Format", from a US government standards organization, recommends using "ImageDescription" for the subject of the item and "ImageUniqueID" for a unique file identifier. When this was tested, the seemingly logical "ImageUniqueID" was empty for TIFF files, but very much in use for digital camera images. Perhaps the initial choice of which identifiers to use is not critical if data can be copied between metadata categories, as promised by ExifTool.

What happens when documents are scanned as TIFF files - as recommended by many archive standards - and then converted with OCR software to PDF files with accessible text? This was tested; the results were not encouraging:
ABBYY FineReader 6.0 Sprint Plus OCR software was used to convert the JPG files and TIFF files to separate PDF files. No metadata was evident when inspected by ExifTool. It isn't clear whether this lack of metadata is a quirk of the file structure, of Picasa, or of FineReader.

A second document was scanned directly into a 6-page PDF file. There were no obvious steps in the scan process, either through the Epson scan software or through Picasa, to put metadata into the PDF file. None was found through ExifTool. However, the Lightning PDF editor was able to add metadata to the file through the "properties" function. The metadata was confirmed in ExifTool and was further edited (added to) in this program.

A further test was conducted with an 8-page advertising brochure, scanned to a multipage TIFF file. Again, the metadata added in Picasa was not visible in ExifTool, nor did it show up after conversion by FineReader to a PDF file. However, the scanning to TIFF and conversion seemed faster than the single-page trials, and the result was an archive TIFF file (sizable!) and a Greenstone-ready PDF file, to which metadata was added in the Lightning PDF editor. This process is workable.

Although the TIFF file format has been recommended as an archive standard, archive organizations have also recently stated that the "PDF/A" format is a reliable archive standard. PDF/A is an open, internationally recognized document standard for completely self-contained documents that can have embedded metadata (XMP format) and searchable text. LibreOffice can create PDF/A documents through an option in its "Export as PDF..." menu choice. A 2-page Word document was exported as a PDF/A, with metadata added in the LibreOffice "Properties" menu choice. The metadata was confirmed in ExifTool, the document was added to the "Vespa" Greenstone collection (same as above) and was correctly classified and displayed through this embedded metadata.
A reply from ABBYY FineReader technical support stated that the "Professional" version supports PDF/A and has a menu option to embed metadata. A trial version of this software was tested with an Epson scanner on image, text and mixed text/image documents - see below.

Adding Metadata With Other Programs

Two other free programs seem potentially very useful to better identify digital photos: "ReNamer" and "ExifTool". ReNamer can change the file names for an entire folder of photos in many ways, including adding metadata. For this sample group of photos, the caption was temporarily added to the file name as a prefix or suffix by selecting the "IPTC Caption" choice. If only the "accession number" was put in the Picasa caption, this data could be easily added to the file names of a group of photos.

The ExifTool

The second program, Phil Harvey's ExifTool, is described as able to extract, add, copy or move nearly all types of metadata. The basic program must be run from a command line, but with the correct configuration it is very powerful and promises to "do everything". A download and very complete explanation of its functions are here. It will soon be tested using the command line functions.

Bogdan Hrastnik has written a GUI (Graphical User Interface, Windows only) for ExifTool, which allows very easy access to many of the ExifTool functions. The "how to" and download page for ExifToolGUI is here.

Ten more digital images were added and ExifToolGUI was used to see if "captions" and an "ID number" could be readily embedded in each image. This was very easy to do in the "IPTC edit function" window. These images were then brought into a new Greenstone collection, titled "Vespa", and the (new) captions were extracted from the "IPTC.Caption-Abstract" metadata item. The trial ID numbers were extracted and displayed from the "IPTC.ObjectName" metadata category. This is one solution for adding important data in digital photo archiving!
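For readers who prefer the command line to the GUI, the same two IPTC fields can in principle be written by ExifTool directly. The sketch below is only an illustration, not the procedure used in these trials: it builds the ExifTool argument list in Python, using ExifTool's standard "-GROUP:TAG=value" assignment syntax. The file name "vespa_01.jpg" and the helper function names are hypothetical; the caption and ID are the example values from earlier on this page.

```python
import subprocess


def build_write_command(image_path, caption, object_name):
    """Build an ExifTool argument list that writes two IPTC tags to one image."""
    return [
        "exiftool",
        f"-IPTC:Caption-Abstract={caption}",  # the caption text
        f"-IPTC:ObjectName={object_name}",    # the ID / accession number
        image_path,
    ]


def write_tags(image_path, caption, object_name):
    """Run ExifTool; raises CalledProcessError if ExifTool reports a failure.

    Assumes the exiftool program is installed and on the PATH.
    """
    subprocess.run(build_write_command(image_path, caption, object_name), check=True)


# Hypothetical file name; caption and ID are the example from this page.
command = build_write_command("vespa_01.jpg", "The three Italian vehicles", "12090043")
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems with captions that contain spaces. By default ExifTool keeps a copy of the unmodified image with "_original" appended to its name, which suits archive work where the source file should not be lost.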
Figure 2 - Note the file now has "Object ID", added by using the "ExifToolGUI". This screen shot shows browse results on "Alternate Captions" starting with "1". The "Alternate Caption" is from "IPTC.Caption-Abstract" and the "Caption" is from "XMP.Description", as described above.

The ExifTool and ExifToolGUI will display and edit the metadata for any file. PDF files, Word and Excel documents, music and video files all have embedded metadata. Check your files and you may be surprised!

The ExifToolGUI was used to correct some old captions embedded by Picasa, and then this software created an Excel file (similar to Figure 3 below) showing each record in the file with its metadata. The Excel "text to columns" function was used to separate multiple Tags into separate columns. Therefore, a unique ID/accession number can be added to a digital photo as a "Tag", along with as many other "Tags" as desired, and these Tags can be readily separated into Excel columns. This results in a very useful Excel file that shows the file name, directory, caption, ID number, photo date and location. This is certainly sufficient for a digital photo archive. A revised Greenstone archive was created with this new data.

ExifToolGUI was customized (not difficult!) using its Workspace Manager and User Defined file display to review files, to quickly check their metadata and to make necessary additions or changes. Trials confirm that metadata added in ExifTool can also be viewed in Photoshop Elements and survives file conversions, such as from TIFF to JPG.

The screen shot below (Figure 3) shows the particular metadata embedded in these test images, extracted by ExifTool from the command line, through the GUI. Note the "SourceFile" shows that these are image files in a Greenstone collection. Although this sample has only image (JPG) files in this directory, this function of the GUI will show the file name and metadata (if correctly specified) for all files.
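The Excel "text to columns" step can also be done in a few lines of code once the list exists as a CSV file. The sketch below is a hedged illustration only: the sample rows, file names and captions are invented, and it assumes the tags arrive in a single "Subject" column as one comma-separated string. The list itself would come from an ExifTool export along the lines of "exiftool -csv -FileName -Description -Subject DIR", where "-csv" is ExifTool's CSV output option.

```python
import csv
import io

# Hypothetical sample of an exported list (invented rows, for illustration).
SAMPLE_CSV = '''SourceFile,FileName,Description,Subject
import/vespa_01.jpg,vespa_01.jpg,Vespa at the show,"12120108, Vespa, scooter"
import/vespa_02.jpg,vespa_02.jpg,Second Vespa,"12120109, Vespa"
'''


def split_tags(cell):
    """Split one comma-separated tag cell into a clean list of tags."""
    return [tag.strip() for tag in cell.split(",") if tag.strip()]


def read_records(csv_text):
    """Return one dict per image, with the Subject column split into a list."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # A missing or empty Subject cell becomes an empty tag list.
        row["Subject"] = split_tags(row.get("Subject") or "")
        records.append(row)
    return records


records = read_records(SAMPLE_CSV)
for record in records:
    print(record["FileName"], record["Subject"])
```

The csv module handles the quoted cells, so commas inside the tag list do not break the columns apart the way a naive split on the whole line would.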
This GUI function is a handy way to make a list in the CSV file format, which can then be opened in Excel to annotate or mark files for further action. From this type of list, the photo unique ID - here "ObjectName" - can be reviewed and updated if desired. The FAQ on the ExifTool website, under question 13, shows examples of using the command parameters to make this export. Although it may seem like there is much to learn, my experience with ExifTool shows that it does exactly what it claims to do - a great program!

Figure 3 - The ExifToolGUI was used to request a list of the "FileName", "Caption-Abstract", "ObjectName" and "Description" metadata from a single directory. This is only a small subset of the metadata in any digital image.

More on ExifTool and ExifToolGUI

There is an ExifTool function to copy each image's metadata from one category to another, when the program is used from the command line. This may be the solution for images whose metadata needs to be copied or transferred between metadata descriptors, so that the files conform to any "standard".

Another ExifTool function can add or replace specific metadata in image (and other!) files with new text from a CSV file (a standard export from Excel). This really works as described - a "unique ID" can be very easily added as a column in an Excel table (as above), converted to a CSV file and then used to update a full directory of images. More on these more advanced ExifTool functions is reported here.

Digital Photograph Recommendations Confirmed

The webpage "Guide to Improved Identification and Classification of Digital Photos, Images and Documents", part of this site, includes recommendations for captioning and tagging digital assets - images and documents. The same group of Vespa test images was reviewed in Picasa and both tags and geotags were added. The first tag for a few of the images was an ID number, such as "12120108", which would be the 108th photo taken during December 2012.
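The ID pattern in that example can be checked or generated mechanically. The sketch below assumes, based only on the "12120108" example, that an ID is eight digits: a two-digit year, a two-digit month and a four-digit sequence number. It further assumes all photos date from 2000 or later. The function and class names are hypothetical.

```python
from typing import NamedTuple


class AccessionID(NamedTuple):
    year: int
    month: int
    sequence: int


def parse_accession_id(text: str) -> AccessionID:
    """Split an 8-digit ID such as '12120108' into year, month and sequence."""
    if len(text) != 8 or not (text.isascii() and text.isdigit()):
        raise ValueError(f"expected 8 digits, got {text!r}")
    year = 2000 + int(text[0:2])  # assumption: two-digit years mean 20xx
    month = int(text[2:4])
    sequence = int(text[4:8])
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {text!r}")
    return AccessionID(year, month, sequence)


def format_accession_id(year: int, month: int, sequence: int) -> str:
    """Build the 8-digit ID for the Nth photo of a given month."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    if not 1 <= sequence <= 9999:
        raise ValueError("sequence must fit in four digits")
    return f"{year % 100:02d}{month:02d}{sequence:04d}"


print(parse_accession_id("12120108"))      # AccessionID(year=2012, month=12, sequence=108)
print(format_accession_id(2012, 12, 108))  # 12120108
```

A check like this could be run over the first tag of each image to flag IDs that were mistyped during tagging, before the images are gathered into a Greenstone collection.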
The image below, Figure 4, shows the metadata in all these files, extracted by ExifToolGUI and displayed in an Excel file.

Figure 4 - Excel file, showing the metadata added to a group of files by Picasa. The Picasa "tags" become XMP.Subject metadata, easily separated in Excel. The ExifTool command is shown in row 1.

These "new" images were "gathered" into a Greenstone archive. Figure 5, below, shows some results from browsing in the keywords category, specifically to display the photo ID numbers embedded as Picasa "tags".

Figure 5 - A display of keywords starting with "1", which shows the photos with the trial ID numbers starting with "12", etc. The Greenstone "search" function can also be used to locate a specific ID number - or other keyword.

This trial shows that Picasa can be effectively used to caption and tag photos, with good results when the same photos are "archived" in Greenstone. The ExifTool (and the ExifToolGUI) can export an entire folder of JPG files to Excel. This produces a list of images and their metadata. The Excel list shows metadata gaps that can be easily filled by copying or by direct data entry into the blank cells. See the ExifTool webpage for detailed examples.

Further, the Figure 4 Excel file can be "exploded" (imported) into Greenstone to provide the same metadata for use in Greenstone's "Enrich" function. For example, the Description can become the "dc.Description", the photo ID can become a "dc.Resource Identifier", etc. The Excel file can also be used as an "inventory list" of images with their descriptive data: captions and tags. These trials worked well with an initial small group of images and in a further test with 190 images. The new release of Greenstone 3.0 will also be used for trials.

February 15, 2013 update: the Greenstone lab team at the University of Waikato helped me fix a bug that was preventing more than 145 images from being added to the collection.
It's a simple fix - select "unicode" as an option of "input_encoding" in the plugin for embedded metadata. Another glitch prevented viewing the embedded metadata in the "Enrich" panel for files with the uppercase "JPG" extension. This was fixed by a simple edit to the "util.pm" file in the "perllib" directory. Contact me for the fix details sent to me from the Greenstone Users Group.

An archive of personal photos taken at the Mullin Automotive Museum is currently the largest collection of images built only with embedded metadata. The metadata includes location data for each photo: latitude and longitude. Greenstone 3.0 can use this data for map displays.

Figure 6 - An archive of Mullin museum photos; this is the initial display of "Captions". Picasa was used to identify each image with captions and with "car make" and "car year" as tags/keywords. The file names and photo dates are standard metadata embedded by the digital camera and extracted by Greenstone automatically.

A trial version of ABBYY FineReader 11 Professional OCR software was used to scan a single page from a book, with both text and color images. The "Scan, Options" menu had a function to add title, description, subject and author metadata. The file was saved in the PDF/A format. The ExifTool confirmed that this metadata was in the "PDF" title, subject and keywords, in the "XMP dc" title and description and in the "XMP pdf" keywords locations. The ExifToolGUI could also be used to add metadata at any time.

When imported into the same Greenstone archive, the document was not located using the metadata categories previously used for extraction. When ex.Subject, ex.Title and "text search" were added as Search Indexes, the document was found in a normal Greenstone search, but with limited identifying information. This can be improved in the Format panel. The trial version of this software is limited to a single scanned page at a time. However, these results are encouraging.
Summary

My technical knowledge of Greenstone is "advanced beginner", so improvements to this process will be by trial and error. I have asked the Greenstone user group (technical support) about a more efficient and clearer method to reach these results. At the very least, these tests seem to be on the right track - Picasa is a viable recommendation to initially organize and identify images, especially for reviewers and classifiers with average computer skills. Also, FineReader, the ExifTool and ExifToolGUI are a powerful combination to improve the embedded metadata of digital images; I will use these tools for my archives.

Email me with any comments, suggestions or questions!

Bob Schmitt, rgschmitt@gmail.com
October 31, 2013
Revised February 15, 2016

Note: The Greenstone collections of CarLibrary.org are hosted on a ThinkPad T61 system located in Burbank, using (free!) Linux Ubuntu server software and (free!) Greenstone 2.85 (Linux) on a 60 GB SSD (OCZ) disk drive.