Functional Requirements for Vocabulary and Information model registry systems

Stephen Richard 2016-01-20 08:32

Vocabulary registry functions:

Scope is a concept vocabulary: each item represents a unique concept. The definition of the item in natural language, along with examples, images, etc., conveys the intention of the concept in a manner that should make it clear to all within the target community for the vocabulary.

 

Target applications:

+        Reference for communities to document the meaning of terminology they adopt

+        Annotation of resources (keywords)

+        Semantic search: concept expansion, synonymy

+        Terminological-property domain validation in data

+        Pick lists for user interfaces

Functions:

+        Basic CRUD operations on concepts; the registry needs to allow various policies on update and deletion (e.g. no delete, only deprecate)

+        Concepts have URI, prefLabel, altLabels, definition, source (minimum)

+        Resolve URI to definition, source and labels for human consumption

+        Language localization?

+        Status properties for concepts, e.g. proposed, adopted, deprecated, superseded (with links to successor).

+        CRUD operations on Relations/objectProperties (add new relationships). Relationships/associations are first order concepts

+        Get related concepts;

o   the simplest case is SKOS-type relations;

o   general relation navigation (CRUD operations on relations, essentially becomes an ontology?),

o   transitive relations/transitive closure,

o   relation hierarchies (e.g. 'chapter' is a kind of 'partOf' relation).

+        Get URI associated with Label

+        Find concepts related to text strings (full-text search over labels and definitions)

+        CRUD operations for mapping relations between concepts in different collections or conceptSchemes

+        Define collections of concepts, e.g. as a domain for some property value;

+        CRUD operations on collections and collection items;

+        Formal concept of a 'conceptScheme' that has properties like intended scope (concept space), isCovering, isComplete, uniqueConcepts, hierarchical...

+        Given concept and collection, report if concept is member of collection (for domain validation processes)

+        Get all members of a collection (tools use this to construct pick lists...)

+        Track source of concept (citation), steward for concept (who put in, who is responsible for maintenance and status)

+        Auto-complete functionality

+        OpenRefine resolution services
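
As a rough illustration of several functions listed above (minimum concept properties, a no-delete update policy, label lookup, transitive closure, and collection membership), the following Python sketch models a small in-memory registry. The class and method names (Concept, VocabularyRegistry, deprecate, uris_for_label, broader_closure, is_member) and the example.org URIs are hypothetical and chosen only for illustration; a real registry would sit on a persistent store and would likely expose these operations as web services.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concept:
    """Minimum concept record: URI, prefLabel, altLabels, definition, source."""
    uri: str
    pref_label: str
    definition: str
    source: str
    alt_labels: list = field(default_factory=list)
    status: str = "proposed"          # proposed | adopted | deprecated | superseded
    successor: Optional[str] = None   # URI of the successor when superseded


class VocabularyRegistry:
    """Hypothetical in-memory registry with a 'no delete, only deprecate' policy."""

    def __init__(self):
        self.concepts = {}      # uri -> Concept
        self.broader = {}       # uri -> set of broader concept URIs
        self.collections = {}   # collection name -> set of member URIs

    def add(self, concept):
        self.concepts[concept.uri] = concept

    def deprecate(self, uri, successor=None):
        # The concept stays resolvable; only its status (and successor) change.
        concept = self.concepts[uri]
        concept.status = "superseded" if successor else "deprecated"
        concept.successor = successor

    def uris_for_label(self, label):
        # Match preferred and alternate labels, ignoring case.
        wanted = label.lower()
        return [c.uri for c in self.concepts.values()
                if wanted in [c.pref_label.lower()] + [a.lower() for a in c.alt_labels]]

    def add_broader(self, narrower, broader):
        self.broader.setdefault(narrower, set()).add(broader)

    def broader_closure(self, uri):
        # Transitive closure of the 'broader' relation, found by graph traversal.
        seen, stack = set(), [uri]
        while stack:
            for parent in self.broader.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def is_member(self, uri, collection):
        # Domain validation: is the concept a member of the named collection?
        return uri in self.collections.get(collection, set())


# Usage with made-up concepts.
reg = VocabularyRegistry()
base = "http://example.org/concept/"   # hypothetical URI base
for name in ("rock", "igneous-rock", "granite"):
    reg.add(Concept(base + name, name.replace("-", " "), "definition text", "source citation"))
reg.concepts[base + "granite"].alt_labels.append("Granitic rock")
reg.add_broader(base + "granite", base + "igneous-rock")
reg.add_broader(base + "igneous-rock", base + "rock")
reg.collections["lithology-picklist"] = {base + "granite", base + "igneous-rock"}
```

Keeping status on the concept rather than removing the record means that a URI already used to annotate data continues to resolve after the concept is deprecated.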

 

Information model registry

Scope: the registry considered here is the formal representation of data types. The term data type is used to mean "A specification of the representation of a single value in an information system" (http://earthlexicon.sdsc.edu/wiki/Data_type, http://en.wikipedia.org/wiki/Data_type). Use of the term 'data type' often leads to confusion because it is applied to representations at a conceptual, logical, and physical implementation level, as well as across a wide spectrum of granularity, ranging from primitive types like 'integer, character' to complex structured data types like 'metadata record', and to resource container types like MIME types. The data type concept is also used to denote both an information item representing some 'thing' in the domain of interest (an entity or object), and an information item representing a value for a property of some thing in the domain of interest. The applications of the various denotations of the term 'data type' require a variety of different information models and relationships. It is thus imperative that the registry make clear distinctions about what kind of 'data type' a model element represents.

 

The formal representation of a data type would include a definition of the types or entities, and a specification of a collection of attributes, with domains and cardinalities for those attributes, constituting the representation of instances of that type/entity. These could be implemented as JSON objects, XML elements, rows in a relation, RDF graphs, etc., all different representations of the same fundamental type. The information models need to be documented at the conceptual level to support interaction with domain users, as well as at the logical and physical level to support development of new, interoperable models, and to enable automation of some data integration steps.

Target applications:

+        Reference for communities to document the meaning of entities and attributes in data that they share.

+        Discover existing data type and attribute definitions for use in constructing data models, to foster interoperability.

+        Discover resource containing information about a particular entity or property.

+        Machine-assisted data integration, based on identification of matching or 'integratable' attribute content.

+        Validation of data instances against a type definition.

+        Tools that spin up a UI for a particular data type.

Information concepts:

+        Concept. A mental phenomenon that human beings use in their internal representation of the world. Webster's dictionary [1996] uses the terms 'idea' and 'object of thought' to convey the meaning of 'concept'. Concepts exist in the mind of human observers.

+        'Integratable' is a somewhat tortured adjective, used here to mean 'capable of integration', i.e. values from different sources can be used in an application as if they were part of a single data set to obtain scientifically sound results. The data integration process might involve transformations, such as conversion of measurement units, that do not inherently change the meaning of the measurement. See discussion of property values, below.

+        Entities, Attributes, and Properties are concepts, and thus inherit properties from concepts as defined for the vocabulary registry (above).

+        Attribute is a logical implementation for representing a Property value. A given property, e.g. temperature, may be represented in various ways, e.g. as a number (Celsius, Fahrenheit), or a term (high, medium, low). Attributes representing the same Property should be integratable.

+        Property values may be measured or reported using a variety of different methods, e.g. measured with mercury thermometer, alcohol thermometer, infrared sensor, reported as 'Average temperature (over some interval)', 'Peak temperature (over some interval)', 'instantaneous temperature'. These all relate to a general 'temperature' concept, but may not be integratable, and for the purposes of data integration these distinctions need to be documented. There is a continuum from the most general concept of temperature (least likely to be integratable) to a property representing a temperature measurement by a particular observer using a single measurement and reporting method (most likely to be integratable).

+        Entity and Type. CSDGM metadata defines entities and attributes, but this entity concept is essentially a type definition. Type is concerned with the definition of a data structure; any entity has an inherent type, so the concepts are closely related. Thus this text mostly uses both terms together, as 'entity/type' (subject to getting a better idea...).

+        ConceptSpace. An abstract space in which the bases are composed of quality dimensions, which denote basic features by which concepts and objects can be compared, such as weight, colour, taste, etc. (Gärdenfors, 2000 [ISBN:0262071991]). In this view, natural categories are convex regions in a conceptual space, in that if x and y are elements of a category, and if z is between x and y, then z is also likely to belong to the category. The notion of concept convexity allows the interpretation of the focal points of regions as category prototypes. (https://en.wikipedia.org/wiki/Conceptual_Spaces)
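
The relationship between a Property and the Attributes that represent it can be sketched in Python. Everything named here (Attribute, to_canonical, integrate, the example.org URI, and the choice of Celsius as a shared unit) is an assumption made for illustration and not part of any specified registry model. The sketch only shows that two attributes become integratable when each carries a transformation to a common representation of the same Property.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Attribute:
    """A logical representation of a Property value (hypothetical model)."""
    name: str
    property_uri: str                        # the Property this attribute represents
    unit: str
    to_canonical: Callable[[float], float]   # transform into the shared unit


TEMPERATURE = "http://example.org/property/temperature"  # hypothetical URI

temp_c = Attribute("temp_c", TEMPERATURE, "Celsius", lambda v: v)
temp_f = Attribute("temp_f", TEMPERATURE, "Fahrenheit",
                   lambda v: (v - 32.0) * 5.0 / 9.0)


def integrate(observations):
    """Merge (attribute, value) pairs into one list of values in the shared unit.

    Refuses to mix attributes that represent different Properties, because
    those cannot be made integratable by unit conversion alone.
    """
    properties = {attr.property_uri for attr, _ in observations}
    if len(properties) > 1:
        raise ValueError("attributes represent different properties")
    return [attr.to_canonical(value) for attr, value in observations]


# Boiling and freezing points of water reported through different attributes.
merged = integrate([(temp_c, 100.0), (temp_f, 212.0), (temp_f, 32.0)])
```

A term-valued attribute (high, medium, low) would need a different kind of transformation, and attributes that differ in measurement or reporting method may need more documentation than a unit conversion before they can be combined.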

Functions:

+        Get property definition

+        Get all attributes related to a property

+        Get related properties

+        Get schema for entity/type. Possible representations (content negotiation?): RDF schema, XML schema, JSON schema, others...

+        Get entities that include an attribute

+        Get entities that include a property

+        Get related entities/types. Allow inheritance of attributes from parent to child types.

+        Register new type

o   Create through forms

o   Ingest schema document (XML schema, RDF schema, JSON schema, ISO19110 feature catalog, others...)

+        Assert relationships between entities, properties, attributes (similar to general relation functions in vocabulary; basically constructing an ontology...)

+        Find transformation between attributes

+        Find similar entities (compare attributes)
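
Some of the functions above (attribute inheritance, finding entities that include a property, and comparing entities by their attributes) can be sketched in Python as follows. The EntityType class, the helper names, and the sample types are hypothetical. The comparison uses Jaccard similarity over the properties each type represents, which is just one plausible measure; nothing in these requirements prescribes a particular one.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EntityType:
    """Hypothetical registry record for an entity/type."""
    name: str
    attributes: dict = field(default_factory=dict)  # attribute name -> property name
    parent: Optional["EntityType"] = None

    def all_attributes(self):
        # Child types inherit attributes from the parent chain; a child's own
        # attribute overrides an inherited attribute of the same name.
        inherited = self.parent.all_attributes() if self.parent else {}
        return {**inherited, **self.attributes}


def entities_with_property(entities, prop):
    """Get entities that include an attribute representing the given property."""
    return [e.name for e in entities if prop in e.all_attributes().values()]


def similarity(a, b):
    """Jaccard similarity over the sets of properties two types represent."""
    pa = set(a.all_attributes().values())
    pb = set(b.all_attributes().values())
    return len(pa & pb) / len(pa | pb) if pa | pb else 0.0


# Made-up types: both children inherit 'id' and 'geom' from Feature.
feature = EntityType("Feature", {"id": "identifier", "geom": "location"})
well = EntityType("Well", {"depth_m": "depth", "temp_c": "temperature"}, parent=feature)
spring = EntityType("Spring", {"temp_f": "temperature", "flow": "discharge"}, parent=feature)
```

Comparing on properties rather than attribute names lets temp_c and temp_f count as a match, which is the case that matters for machine-assisted integration.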

Summary

An Information Model Registry builds on a Vocabulary Registry, because the DataObject/entity/type, attribute, and property concepts that are its basis are all concepts, and are thus manageable using the vocabulary registry. The information model registry requires a variety of more complex relationships, functions, and additional attributes on the concepts. For instance, an attribute will need a data type and a value domain. An attribute in the context of a type/entity will need a cardinality, and perhaps a restriction on its value domain. Properties will be a complex vocabulary rooted in abstract 'phenomenon' concepts like temperature, density, length, with more granular properties defined based on context (observer?, environment, intention), reporting method, and measurement method. Transformation methods linking attributes will also be useful. Processes for finding transformations between attributes, determining similarity between entity/type definitions, and ingesting new schemas will also be needed.

Registry items

This section enumerates a set of registry item types that could be used to implement a vocabulary and information model registry. For more detail on the information model for the registry, see https://github.com/usgin/usginspecs/raw/gh-pages/DataTypeModelDraft.pdf

 

Concept. The registry item for a concept includes a definition in natural language that explains the idea for people to understand, and a unique, machine-parseable identifier (operationally, it's a string...) for use by computer software. The item also includes labels (words, designations) that are used in natural language communication to signify the concept. These labels are language-localized, and ideally would be context-localized to account for different community practice using the same language. A preferred label in at least one language is required. Finally, the concept item should include source information citing the intellectual origin of the definition. Concept register items are the base representations in the vocabulary registry for various classes that are extended with additional properties for the Information Model Registry. These classes include DataType (logical and syntactic/primitive), MeasureClass, ArrayDimension, ObjectClass, Property, and UnitOfMeasure.

ConceptScheme: A collection of concepts defined within a single conceptSpace. The implication of definition in the concept space is that every concept in the scheme denotes a value or value range on each axis (base dimension) of the concept space.

ControlledVocabulary: a collection of concepts and labels (designations) for those concepts that will be used to populate a DataElement value. Typically will equate with a ConceptScheme or some subset of a ConceptScheme, but may include values from different schemes.

Source (Citation). An information object that identifies a resource and provides information for standard scholarly citation of that resource.  The intention is that the citation provides sufficient information to be dereferenced and acquire a representation of the resource.

Contact (Agent). A contact registry item identifies an agent. "An agent is something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity." (W3C PROV). In this view, software can also be an agent. Responsibility does not have to be 'conscious' or intentional. An agent is an identifiable entity; could be an organization, an individual who may or may not be associated with an organization, an entity identified via a role (position) relative to an organization, or an artificial entity (software, a machine).

DataObject. An information object that represents an entity of interest (ObjectClass in this model, based on ISO11179) in some domain; the representation consists of a collection of DataElements that are used to quantify properties of instances of the entity. Corresponds to 'dataType' in ISO11179, Entity in Entity-Relationship models, Object in object models, 'document' in document-type NoSQL databases (e.g. CouchDB, MongoDB), 'Variable' in the netCDF common data model (OGC 10-090r3) or a 'Feature' in GML. DataObject can be thought of as an information model for a particular entity of interest, like the Content Models in NGDS (http://schemas.usgin.org/models), exchange models in NIEM, or ontology design patterns. A DataObject may have additional metadata attributes (metaAttributes) necessary to understand the data object, for example sampling basis, geometry type (for spatial data), coordinate reference system.

DataElement. An information object representing a unit of data that quantifies a property in the context of an ObjectClass. The identity of a DataElement is based on its meaning and domain. The intention is that a DataElement does not denote a particular implementation environment, corresponding to 'logical model' data modeling approaches. An ArrayVariable is a special case of a DataElement in a gridded (discrete coverage) data set that assigns a property value for each cell in a space defined by intervals on one or more dimension axes. DataElements may have associated metadata attributes (metaAttributes) that specify additional information necessary to understand the DataElement, for example units of measure.

Domain. A representation of the restrictions on values that are allowed for quantification of a property. A Conceptual Domain restricts the values based on the modeler's conception of what is of interest in the realm that a model is intended to represent. A Value Domain restricts values in terms that can be implemented in an information system, independently of a specific implementation. Domains can be specified through enumeration, based on rules (a description), or with other constraints.
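
A minimal Python sketch of the two ways of specifying a domain mentioned above, by enumeration and by rule, together with a cardinality check of the kind an attribute needs in the context of an entity/type. The ValueDomain class, the check_cardinality helper, and the sample domains are hypothetical illustrations, not part of the registry information model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ValueDomain:
    """Hypothetical value domain: an enumeration, a rule, or both."""
    description: str
    allowed: Optional[frozenset] = None                # enumerated domain
    rule: Optional[Callable[[object], bool]] = None    # rule-based domain

    def contains(self, value):
        # A value must pass every constraint the domain actually specifies.
        if self.allowed is not None and value not in self.allowed:
            return False
        if self.rule is not None and not self.rule(value):
            return False
        return True


def check_cardinality(values, minimum, maximum=None):
    """Check the number of values against a cardinality; maximum=None means unbounded."""
    return len(values) >= minimum and (maximum is None or len(values) <= maximum)


# Enumerated domain: a controlled list of terms.
qualitative = ValueDomain("qualitative temperature",
                          allowed=frozenset({"high", "medium", "low"}))

# Rule-based domain: any number at or above absolute zero, in Celsius.
celsius = ValueDomain("temperature in Celsius",
                      rule=lambda v: isinstance(v, (int, float)) and v >= -273.15)
```

An enumerated domain of this kind would typically be populated from a collection or ControlledVocabulary in the vocabulary registry, so the membership test is the same operation as the collection membership function described for vocabularies.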

ImplementationObject. Information objects representing artifacts that map DataElements and DataObjects to specific software environments and physical datatypes. Includes ImplementationElement and ImplementationObject.

InterchangeFormat. Register item representing a document format used to serialize one or more ImplementationObjects for information exchange.