FL-Islandora Guide: OAI-PMH And Harvesting

A guide for FL-Islandora users.

Introduction

Harvesting metadata from your site is an important aspect of discoverability for your content. Harvesting is mainly accomplished via the Open Archives Initiative Protocol for Metadata Harvesting, or OAI-PMH. This page gives basic instructions for managing your OAI feed. For a more technical look at OAI management, visit the Harvesting page of the Islandora Git Documentation.

Harvesting

There are many ways of having your OAI-PMH data harvested. Below are some links to different harvesting organizations and their requirements.

Islandora 2.0 API Endpoints

Islandora only provides two additional API endpoints:

    /media/{mid}/source

    PUT a file to this endpoint to create/update a Media’s file

    /node/{nid}/media/{media_type}/{taxonomy_term}

    PUT a file to this endpoint to create/update a Media for a Node
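As a rough illustration of the first endpoint, the sketch below builds (but does not send) a PUT request for /media/{mid}/source using only the Python standard library. The host name, media ID, file bytes, and the use of a Content-Location header to name the destination file are assumptions made for the example; authentication is left out entirely, so check your own site's requirements before adapting it.

```python
from urllib.request import Request


def build_media_source_put(base_url, mid, payload, mime_type, location):
    """Build (but do not send) a PUT request for /media/{mid}/source."""
    url = f"{base_url.rstrip('/')}/media/{mid}/source"
    headers = {
        "Content-Type": mime_type,
        # Assumption: the destination file URI travels in Content-Location.
        "Content-Location": location,
    }
    return Request(url, data=payload, headers=headers, method="PUT")


# Hypothetical values, for illustration only.
req = build_media_source_put(
    "https://example.org", 12, b"fake-bytes", "image/png", "public://cover.png"
)
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would actually send it; that step is omitted here because it needs a live site and credentials.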

How to Set Up an OAI View

If you want only part of your collection harvested via OAI-PMH, follow the steps below to use the Views module to designate a particular collection for harvesting. By default, the OAI-PMH View provides three sets: one set per "Collection" object, containing that object's children; one set of all Repository Item objects that are not members of any Collection and are not themselves Collections; and, disabled by default, one set of all Repository Item objects that are not Collections. Your OAI-PMH stream is available to crawlers by default, but you will want to edit the view so that crawlers capture all necessary metadata and only your desired collections.

1. Navigate to Structure > Views

2. Scroll down the list to the OAI-PMH Module and click Edit.

3. In the All Repository Items display, click “Duplicate All Repository Items”.

4. Click on “Edit View Name/Description” and edit the appropriate info. Then click on “Display Name: All repository Items” and rename appropriately.

5. You can save your changes at the bottom of the page at this point.

6. While in the view of the set you just created, go to Filter Criteria and click “Add”.

7. In the “For” Field choose “This entity_reference (override)”. There are a number of ways you can filter by different metadata fields and such, but if you are trying to filter by a particular collection, scroll down to “Member of (field_member_of)” and check the box. Then hit Apply.

8. For this next step you will need the node number for the collection you wish to limit your view to. You can find this by visiting the collection on your website and looking at the URL in your browser's address bar. The number after the path “node” will be the number of that collection.

9. In the boxes select “Is equal to” and then put the node number by itself in the next box. Hit Apply.

10. You can add multiple collections. For this to work, you must change the criteria And/Or designations. Click the drop-down next to the Add button and select “And/Or Rearrange”.

11. Here, click on “Create new filter group”. Then scroll down to the bottom where the new blank group is. Ensure that the operator between GROUPS is set to “AND” and change the operator WITHIN the new group to “OR”, then drag and drop your collection filters into this group. Click Apply.

12. Return to the Filter criteria, and this time search at the top for “model”, then select “Model (field_model)” and hit Apply.

13. On the next screen select “Islandora Models” and click Apply.

14. For the next screen select “Is none of” and enter into the box “page”, then hit Apply.

15. In the first box go to the section which says "Fields" and hit "Add"

16. Search for the fields you want to have displayed with your objects, such as description and date. Best practice is to adhere to all the metadata needs of the institutions you want to harvest your data, such as Google crawlers and the Sunshine State Digital Network. See first tab for requirements. Once you have selected them all hit "Apply"

17. For each data field you selected, you will get this screen to select how it is displayed. For now you can leave them all at default, but these settings can be changed if necessary. Hit Apply on each screen until finished.

18. In the middle boxes, go to Pager and click on the text beside “Items to Display”

19. Select “Display all items” and then hit Apply.

20. Hit Save on the OAI page to save all your selections.

21. Next, visit the “Configuration” menu, go to “Web Services” and then “OAI-PMH Settings” to set exposure to OAI-PMH.

22. By default, all repository items are exposed to OAI-PMH. To limit to just the collection you just designated, UNCHECK the box for All Repository Items and check the box for your new view.

23. Finally, click over to the “Rebuild” tab and click the “Rebuild OAI-PMH” button. This may take some time if you have a large number of objects.
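The filter groups built in steps 8 through 14 can be read as a single condition: an item is in the set if it belongs to any of the chosen collections (the OR group) and its model is not a page (the AND between groups). The Python sketch below only mirrors that logic for illustration; the item dictionaries, node numbers, and URLs are invented sample data, not anything produced by the Views module.

```python
import re


def node_id_from_url(url):
    """Return the number that follows /node/ in a collection URL, or None."""
    match = re.search(r"/node/(\d+)", url)
    return int(match.group(1)) if match else None


def in_oai_set(item, collection_ids):
    """(member of collection A OR collection B ...) AND model is not a page."""
    in_a_collection = any(cid in item["member_of"] for cid in collection_ids)
    not_a_page = item["model"].lower() != "page"
    return in_a_collection and not_a_page


# Hypothetical collections and items, for illustration only.
collections = [node_id_from_url("https://example.org/node/15"), 22]
items = [
    {"title": "Yearbook 1950", "member_of": [15], "model": "Paged Content"},
    {"title": "Yearbook 1950, p. 1", "member_of": [15], "model": "Page"},
    {"title": "Campus map", "member_of": [40], "model": "Image"},
]
selected = [item["title"] for item in items if in_oai_set(item, collections)]
print(selected)
```

If the operator between groups were set to OR instead of AND, every non-page item on the site would pass, which is why step 11 asks you to check both operators.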

Your OAI-PMH Feed is now ready to be harvested.
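One way to spot-check the feed is to request it the way a harvester would, with the standard OAI-PMH ListRecords verb, and look at what comes back. The sketch below builds such a URL and pulls identifiers and titles out of a response. The /oai/request path, the set name, and the sample XML are assumptions for illustration; use the feed link and set names shown on your own OAI-PMH Settings page.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, set_spec=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL, optionally limited to one set."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def record_titles(xml_text):
    """Return (identifier, title) pairs found in a ListRecords response."""
    root = ET.fromstring(xml_text)
    pairs = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        pairs.append((identifier, title))
    return pairs


# Hand-written sample response, not output from a real site.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:node-15</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Yearbook 1950</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://example.org/oai/request", set_spec="my_set")
print(url)
print(record_titles(SAMPLE))
```

A record whose title comes back as None is a hint that the corresponding field was not added to the view or is not mapped.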

Exporting OAI-PMH Data to File

The instructions below let you export your OAI-PMH data to a downloadable file for in-house use. There are several file types available, including CSV, JSON, and XML. These can be used for a variety of purposes, including manually submitting to various databases. As a warning, the OAI_DC file format does not appear to work at this time.

1. Go to Structure, and then Views to get to your list of Views. Scroll down to select the OAI-PMH View. Select Edit to the right of the screen.

2. Click the "Add" button and select "Data Export".

3. In the first box, under “Format” you can select the format you would like the export to be in. Click on “Settings” and select the format from the pop-up menu. There are a variety of options such as CSV, JSON, and XML. For this example I chose CSV. After you’re done, hit Apply.

4. In the middle box on the module page, there are options you can change. The first one to change is the Path option. Click where it says “No Path is Set”

5. In the Path box, put the file path for the module you want to export. In this case I am just using the default All Repository Items path.

6. Scroll down to the check box for Store in Public Files Directory. If the content is not proprietary, check this box. Below it, check the box for “Download Immediately” and then click Apply.

7. Below Path is an option that says “Attach To”. Click where it says None

8. In this box, choose the module to attach it to. In this case I only have the “Master” to select. Click Apply.

9. In the Module page scroll down to the section titled Export Settings. If you are exporting a small number of objects, you can leave the Method as Standard. But if you are exporting a large amount of data it’s best to change this option to “Batch”. Below this is also where you can set a limit if you only want to export part of the data.

10. Scroll a little further and hit “Save” to save your new module.

11. In the address bar of your browser, type in the path you set for the view. In this case it was /All-repository-items, so I take the domain of my Islandora instance, https://sandbox.islandora.ca, and add /All-repository-items to the end. Click go to navigate to that address.

12. You should get a pop up when your file has downloaded depending on your operating system and browser. If you have a large amount of data, this may take some time. Once it tells you it’s downloaded, the file will appear wherever your downloaded files automatically save to. It will be titled after the path you chose. Now you can do whatever you want with the OAI data you have exported.
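Once you have a CSV export, a short script can confirm that it holds what you expect before you submit it anywhere. The sketch below counts the rows and reports any row missing a required column. The column names and sample rows are hypothetical; the real header names depend on the fields you added to the view.

```python
import csv
import io


def summarize_export(csv_file, required=("title",)):
    """Return (row_count, missing), where missing lists (line_number, column)."""
    rows = list(csv.DictReader(csv_file))
    missing = [
        (line_no, column)
        for line_no, row in enumerate(rows, start=2)  # the header is line 1
        for column in required
        if not (row.get(column) or "").strip()
    ]
    return len(rows), missing


# Hand-written sample standing in for a downloaded export file.
sample = io.StringIO(
    "title,field_description\n"
    "Yearbook 1950,Annual yearbook\n"
    ",Untitled scan\n"
)
count, missing = summarize_export(sample, required=("title", "field_description"))
print(count, missing)
```

To run it against a real download, open the file with open(path, newline="", encoding="utf-8") and pass the file object in place of the sample.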

 

RDF Mapping for OAI-PMH

In order for the correct fields to be exposed in the OAI-PMH feed for ingest into FOAL, you will need to follow these steps.

1. Go to Structure > Views > OAI-PMH to open the OAI-PMH interface. In whichever set you intend to have as your main feed, go to where it says Fields, and hit "Add"

2. In the Add Fields box, use the search box to find the required fields and check the box beside each one. You may select multiple at once. Note: The necessary fields for FOAL are "Title", "Date", "Description", "Owning Institution", and "Purl".

3. In the next screens you can configure these fields, or you can just hit "Apply" to leave them in the default configuration. You will do this once for each field selected.

4. Next, because "owning institution" and "purl" are not Dublin Core fields, you will need to set up an RDF mapping for them. Start by going to Configuration > Development > Configuration synchronization.

5. Go to the Export tab then the Single Item tab.

6. On this page, in the Configuration Type dropdown choose "RDF Mapping", then in the Configuration Name dropdown choose "node.islandora_object". This will populate the box below with all the RDF mappings as they currently are. Press Ctrl+A to select all the text, then Ctrl+C to copy it to your clipboard.

7. Now go to the Import tab and then the Single Item tab. In the Configuration Type dropdown choose "RDF Mapping" again, and press Ctrl+V to paste all the copied text into the empty box below.

8. Scroll to where it says "field_linked_agent" and copy the entire set of the mapping for that term, making sure to copy all spaces before the text, then paste the copy directly below the original. Now edit the pasted copy to say "field_purl" and the DC mapping to read "dcterms:identifier".

9. Repeat step 8, but this time change the text to read "field_owning_institution" and the DC mapping to be "dcterms:provenance". Then click "Import" at the bottom of the page.

10. Check to make sure the mapping took by going back to the Export Single Item tabs and repeating the steps to retrieve the RDF mapping.

11. Now you need to rebuild your OAI-PMH. Go to Configuration > Web Services > OAI-PMH Settings.

12. Go to the Rebuild tab and click on "Rebuild OAI-PMH".

13. This may take a little while depending on the size of your archive, but once it's complete you should receive a status message with a link to your OAI-PMH feed. The link will remain the same throughout the life of the archive, and you can periodically check it to ensure all of the data about your objects is being exposed properly.
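Steps 8 and 9 above amount to duplicating one indented block of the mapping text and changing two values in the copy. The Python sketch below does the same thing to a small sample, mainly to show why the leading spaces matter: the block is identified purely by its indentation. The sample excerpt is a simplified, hand-written stand-in; a real node.islandora_object export has more keys, and the property currently listed under field_linked_agent may differ. The sketch also assumes one property per field.

```python
def duplicate_mapping(yaml_text, source_field, new_field, new_property):
    """Copy source_field's indented block and insert an edited copy below it."""
    lines = yaml_text.splitlines()
    # Find the "source_field:" line (raises StopIteration if it is absent).
    start = next(
        i for i, line in enumerate(lines) if line.strip() == f"{source_field}:"
    )
    indent = len(lines[start]) - len(lines[start].lstrip())
    # The block continues while the following lines are indented more deeply.
    end = start + 1
    while end < len(lines) and len(lines[end]) - len(lines[end].lstrip()) > indent:
        end += 1
    copy = [lines[start].replace(source_field, new_field)]
    for line in lines[start + 1:end]:
        if line.lstrip().startswith("- "):
            pad = line[: len(line) - len(line.lstrip())]
            line = f"{pad}- '{new_property}'"  # keep the original indentation
        copy.append(line)
    return "\n".join(lines[:end] + copy + lines[end:])


# Simplified, hypothetical excerpt of an RDF mapping export.
SAMPLE = """fieldMappings:
  title:
    properties:
      - 'dcterms:title'
  field_linked_agent:
    properties:
      - 'dcterms:contributor'
"""

step8 = duplicate_mapping(SAMPLE, "field_linked_agent", "field_purl", "dcterms:identifier")
step9 = duplicate_mapping(step8, "field_linked_agent", "field_owning_institution", "dcterms:provenance")
print(step9)
```

Editing by hand in the Import box works just as well; the point is that each new field must sit at the same indentation level as field_linked_agent, with its properties lines indented beneath it.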