Welcome to the FISHlink project website

You can find general information about the project and the project team in the About FISHlink section. Please read the blog below for the latest information and updates on the FISHlink project.

Blogs
FISH.Link

FISH.Link is drawing to a close and it's time to reflect on the outputs of the project. Our key output has been the development of tooling that extends the workflow identified in the FISHNet project, supporting the conversion of datasets to RDF. FISHNet allows datasets to be decorated with metadata, but not at a level of detail that supports annotation of individual columns. It can thus be hard to understand what a particular column in a dataset means. This is even more of an issue when datasets are to be processed or linked by machine. As a simple example, datasets may well use terms such as "MASL", "m.a.s.l.", "Map_alt_m" and "altitude", all to refer to the same measurement (height above sea level). This use of different terms is often due to established working practices within communities.
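As a rough illustration of the problem, the sketch below maps the four header variants mentioned above onto one shared term. The canonical name `altitude_m` and the function name are hypothetical choices made for this example; they are not terms from the FISH.Link vocabularies.

```python
# Header variants taken from the post; all of them mean "height above sea level".
ALTITUDE_SYNONYMS = {"masl", "m.a.s.l.", "map_alt_m", "altitude"}

# Hypothetical canonical term, chosen only for this example.
CANONICAL_ALTITUDE = "altitude_m"


def canonical_column(header: str) -> str:
    """Return the shared term for a known synonym, or the header unchanged."""
    key = header.strip().lower()
    if key in ALTITUDE_SYNONYMS:
        return CANONICAL_ALTITUDE
    return header


headers = ["Site", "MASL", "m.a.s.l.", "Map_alt_m", "altitude"]
normalised = [canonical_column(h) for h in headers]
print(normalised)
```

A lookup table like this only works once somebody has decided which headers are equivalent, which is exactly the knowledge that column-level annotation is meant to capture.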

The FISH.Link tools (working title WaterColumn) have been developed to address these issues, allowing the association of additional column level information with a dataset. The tools have specifically been developed to work with spreadsheet data. Spreadsheets are commonly used to capture or record data, and there are a multitude of tools (commercial, free and open source) that can handle and manipulate formats such as CSV. A number of the datasets targeted during the project were already in spreadsheet form, or could easily be massaged into such a format. Our experiences in other projects supporting scientists (for example the development of tools such as RightField and Populous) reinforce this use of spreadsheets.

The tool allows a user to annotate a spreadsheet with column level metadata that provides additional information about the data contained in the column including the type of item, the type of data and any associated units. These annotations make use of terms from shared vocabularies. Annotated spreadsheets can then be converted to RDF using additional tooling (based on the open source XLWrap spreadsheet-to-RDF wrapper). The conversion process validates that the metadata conforms to the vocabulary. Converted datasets can then be uploaded to a triple store and queried via SPARQL or published as Linked Data.
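To make the idea of column level annotation more concrete, here is a minimal sketch of an annotation record and a check that its terms come from a shared vocabulary. The vocabulary contents, the class name and the field names are all assumptions made for illustration; the real tool stores annotations in the spreadsheet itself and validates them during conversion.

```python
from dataclasses import dataclass

# Hypothetical vocabulary: placeholder terms, not the project's actual vocabularies.
VOCABULARY = {
    "item": {"Lake", "Species"},
    "datatype": {"decimal", "integer", "string"},
    "unit": {"metre", "hectare", "none"},
}


@dataclass(frozen=True)
class ColumnAnnotation:
    """Type of item, type of data and units for one spreadsheet column."""
    column: str
    item: str
    datatype: str
    unit: str


def validate(annotation: ColumnAnnotation) -> list:
    """Return a list of problems; an empty list means the annotation conforms."""
    problems = []
    for field_name in ("item", "datatype", "unit"):
        value = getattr(annotation, field_name)
        if value not in VOCABULARY[field_name]:
            problems.append(f"{annotation.column}: unknown {field_name} '{value}'")
    return problems


good = ColumnAnnotation("MASL", "Lake", "decimal", "metre")
bad = ColumnAnnotation("Area", "Lake", "decimal", "acre")
print(validate(good))
print(validate(bad))
```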

The tools are primarily aimed at working freshwater biologists without particular technical expertise. Such users are comfortable with the use of spreadsheet based tools, but would not be expected to be familiar with, for example, the tooling required to convert datasets to RDF.

Integration with the FISHNet repository has been achieved through the use of Fedora PIDs in the tooling.

The picture below shows the basic workflow supported by the tool.

RDF Conversion Workflow

  1. A spreadsheet is pulled from the repository in CSV, OpenOffice or Excel format. The tool adds empty dropdown selections and embeds the vocabulary in the sheet. The new Excel sheet is placed in the repository.
  2. A domain expert pulls the amended sheet from the repository and annotates columns with appropriate vocabulary terms. A reviewer can review the process and enhance the vocabulary if necessary. The result is an annotated Excel sheet, which is pushed back to the repository.
  3. The annotated sheet is passed to the tool, which generates an XLWrap mapping file.
  4. XLWrap converts the annotated sheet to RDF using the mapping file. The RDF is added to the triple store.
  5. SPARQL queries against the triple store produce (integrated) tabular data for analysis.
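The conversion step can be pictured with a small hand-rolled sketch. This is not XLWrap and does not use its mapping language; it only shows the general idea of turning annotated columns into RDF statements. The base URI, the predicate names and the sample rows are invented for the example.

```python
import csv
import io

# Assumed base URI and column-to-predicate mapping, for illustration only.
BASE = "http://example.org/fishlink/"
MAPPING = {
    "MASL": BASE + "altitude_m",
    "Richness": BASE + "speciesRichness",
}

# Made-up sheet contents standing in for an annotated spreadsheet exported as CSV.
SHEET = "Site,MASL,Richness\nlake1,120,14\nlake2,455,9\n"


def sheet_to_ntriples(text: str) -> list:
    """Emit one N-Triples line per mapped cell; the Site column names the subject."""
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"<{BASE}site/{row['Site']}>"
        for column, predicate in MAPPING.items():
            lines.append(f'{subject} <{predicate}> "{row[column]}" .')
    return lines


triples = sheet_to_ntriples(SHEET)
for line in triples:
    print(line)
```

In the real workflow the mapping is generated from the annotations rather than written by hand, which is what keeps the process usable by people who are not RDF specialists.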

Using this triple store/RDF based approach enabled us to integrate a number of data sources. The sources used were chosen to support our prime case study, which was focused around a paper by Jones, Li and Maberly investigating relationships between species richness and altitude. The analysis in the paper requires combination of information from different data sources. SPARQL queries can now be used to pull out "wide tables" of data for further analysis. A key point to note here is that the analysis is still left in the hands of the scientist -- the FISH.Link infrastructure is simply about supporting the integration/extraction of the data.
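The notion of a "wide table" can be sketched without a triple store. The example below merges statements from two made-up sources and pivots them into one row per site, keeping only sites for which every requested column is present (much as a SPARQL query with no OPTIONAL clauses would). The site names, property names and values are invented; they are not data from the paper.

```python
from collections import defaultdict

# Made-up (subject, property, value) statements standing in for two converted datasets.
source_a = [("lake1", "altitude_m", 120), ("lake2", "altitude_m", 455)]
source_b = [
    ("lake1", "species_richness", 14),
    ("lake2", "species_richness", 9),
    ("lake3", "species_richness", 21),
]


def wide_table(statements, columns):
    """Pivot statements into rows of [subject, col1, col2, ...]."""
    by_subject = defaultdict(dict)
    for subject, prop, value in statements:
        by_subject[subject][prop] = value
    rows = []
    for subject in sorted(by_subject):
        values = by_subject[subject]
        # Skip subjects that lack any requested column.
        if all(column in values for column in columns):
            rows.append([subject] + [values[column] for column in columns])
    return rows


table = wide_table(source_a + source_b, ["altitude_m", "species_richness"])
for row in table:
    print(row)
```

The resulting rows are what would be handed to the scientist; any statistics on them remain a separate step.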

In the final weeks of the project we hope to complete the set up of a repository at FBA containing the converted datasets. Tools will also be made available via the project web site.

FISHlink@JISC MRD

Sean Bechhofer gave a presentation on FISHLink at the JISC MRD International Workshop in Birmingham at the end of March. The presentation was in the "Case studies in linking and integrating research data" session, which also included talks from Mark Hedges on SPQR and Adrian Stevenson on the LOCAH project and Linked Data issues in general.

Slides from the presentation are available on slideshare.

Experiences with ISA

To try and understand more about the datasets in FISH.Link and how they might be used to help answer our freshwater biology research questions, the Manchester team have been looking at one of Iwan Jones' papers [1]. The exercise involves picking apart the methods, materials and results sections of the paper to pull out the atomic claims that are being made, and to identify the data sources that provide the evidence for those claims.

We analysed the results section of the paper down to the sentence level, working back from the answers given there to the implicit questions being asked. The paper explores the relationship between altitude, surface area of water bodies and species richness. Some 20 or so questions were extracted; these determined the data we needed to have in the KB and the tables to draw out for our collaborating scientists to analyse.

This is an interesting exercise in a number of ways.

First of all, we need to understand and represent the rhetorical structure of the paper. This is related to work such as the Ontology of Rhetorical Blocks from the W3C's HCLS. On first sight though, the level of granularity that we are trying to obtain (down to individual sentences) may be finer than that provided by existing models.

We also need to understand and represent the structure of the underlying investigation or study. To help us with this, we've been considering the ISA model. ISA provides a representation for experimental (meta)data in terms of Investigations, Studies and Assays. Investigations contain Studies, each of which contains a number of Assays, where Assays are characterized as the smallest complete unit of experimentation producing data associated to a subject [2].

ISA is used extensively in SysMO-DB and we've been making use of the local expertise from the SysMO team in Manchester. They have produced the JERM (Just Enough Results Model) which is an ontology that supports modelling of experiments and data, and is built on ISA. One observation that we've made is that the ISA approach seems biased towards a situation where an investigation is planned "top-down", with each Assay being performed within the context of a Study.

With our FISH.Link data, we have a slightly different situation. For example, the data that Iwan and colleagues use in their paper comes from a number of different sources, which would have been parts of different studies or investigations. Thus the study here is _assembled_ from existing data sources, rather than being the result of a planned sequence of measurements. This raises questions about how one might interpret the notion of "part of" in the JERM/ISA model -- for our purposes, the intention is that a Study makes use of an Assay, rather than having the Assay as a part.
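The difference between "part of" and "makes use of" can be sketched in a few lines. This is not the JERM or ISA-Tab schema; the class and attribute names are hypothetical and only illustrate that the same Assay can be referenced by more than one Study when studies are assembled from existing data.

```python
from dataclasses import dataclass, field


@dataclass
class Assay:
    """Smallest unit of experimentation producing data (per the ISA description)."""
    name: str
    source: str


@dataclass
class Study:
    name: str
    # "uses" holds references to Assays; a Study does not own them.
    uses: list = field(default_factory=list)


@dataclass
class Investigation:
    name: str
    studies: list = field(default_factory=list)


# Invented example: one survey reused by a later, assembled study.
survey = Assay("altitude survey", source="dataset A")
original = Study("original monitoring study", uses=[survey])
assembled = Study("study assembled from existing data", uses=[survey])

# Both studies point at the very same Assay object.
print(original.uses[0] is assembled.uses[0])
```

Under a strict "part of" reading the survey would belong to exactly one Study, which is the constraint that does not fit the assembled case described above.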

References

  1. J. Iwan Jones, Wei Li and Stephen C. Maberly. Area, altitude and aquatic plant diversity. Ecography 26:4, pp 411--420 (2003).
  2. ISA-Tab v1 (Nov 2008) http://isatab.sourceforge.net/docs/ISA-TAB_release-candidate-1_v1.0_24nov08.pdf

Ice Fishing

Brrrr! The FISHNet and FISH.Link teams battled through the icy conditions this week for a meeting in Manchester. The main topic of discussion was how FISHNet and FISH.Link will work together. This corresponds to Phase II of our development plan for FISH.Link: integration with the repository.

Over the last couple of weeks, the team in Manchester have been dissecting a paper from Iwan Jones, pulling out references to data sets, and using this as a driver for the "RDFisation" of some of the freshwater data sets. More of this to come in another blog post, but this has helped us identify a number of questions about the ways in which the data sets are mapped to our triple store.

During the Manchester meeting, we identified a "workflow" that supports the ingestion of data sets into FISHNet, and the transition from red to orange to green using the Traffic Light System identified in FISHNet. The workflow involves a number of steps including:

  1. Creation of description metadata;
  2. Identification of file types;
  3. Conversion to normal forms;
  4. Standardisation of content;
  5. Content review;
  6. Data publication;
  7. Linked Data publication.

Next steps involve a formalisation and implementation of this workflow and more detailed descriptions of the standardisation process, which includes harmonisation of column descriptions and vocabularies.

Workflow

We aim to have the first three steps (which essentially cover transition from red to orange) covered by our January meeting.

FISHNet Traffic Lights system for data and how FISH.Link relates to it.

A related blog post on the FISHNet project website (http://www.fishnetonline.org/home/-/blogs/a-simple-traffic-light-system-for-data) describes the plan for a basic structure governing how datasets are added to and maintained in the FISHNet repository.

The FISH.Link project is involved in the Green category of the FISHNet traffic lights system. This is where datasets are already in a reusable format with a DOI for citation (and provenance). To be moved from the Orange to the Green category, the data must be made freely available under the CC0 Licence and may be mapped into an RDF Triple Store using tools developed by the FISH.Link Project.
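The traffic light rules can be summarised as a small classification function. This is a simplified sketch under stated assumptions: it combines the remark in the Ice Fishing post that the first three workflow steps essentially cover the move from red to orange with the DOI and CC0 conditions given here. The function name, its parameters and the exact thresholds are hypothetical, and FISHNet's own rules may well be richer.

```python
def category(steps_done: int, has_doi: bool, cc0_licensed: bool) -> str:
    """Classify a dataset as 'red', 'orange' or 'green' (assumed, simplified rule)."""
    # Assumption: fewer than three completed workflow steps means still red.
    if steps_done < 3:
        return "red"
    # Assumption: green needs both a DOI and release under CC0.
    if has_doi and cc0_licensed:
        return "green"
    return "orange"


print(category(2, False, False))
print(category(3, False, False))
print(category(7, True, True))
```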
