Advanced Queries
Special Query Parameters
For extended query functionality, the NBA offers a set of special query parameters:
Filtering results: _fields
The _fields parameter allows for filtering certain fields of interest in query results. Suppose one would like to query all taxa in the genus Hydrochoerus but is only interested in the title of the scientific publication associated with that taxon:
This query will yield a minimal JSON document containing the desired data. Note that multiple fields can be chosen if separated by commas, e.g.:
Basic sorting: _sortFields
To structure a query result, it is possible to sort the result by the values of user-defined fields. This example extracts all geo area records with areaType
Country and sorts the results by the name of the locality (country):
https://api.biodiversitydata.nl/v2/geo/query/?areaType=Country&_fields=locality&_sortFields=locality
Note that more than one field can be passed to _sortFields, separated by commas. The results are then sorted on the first field, and where several results share the same value for that field, the subsequent fields are used to order them.
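As a minimal sketch, the geo example URL above could be assembled programmatically. The helper name `build_query_url` is hypothetical and not part of the NBA; the base URL and parameter names are taken from the example, and no request is sent.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.biodiversitydata.nl/v2"  # base URL as used in the examples


def build_query_url(doc_type, params):
    """Build a human-readable query URL (hypothetical helper, not part of the NBA)."""
    # List values (e.g. several _fields or _sortFields) are joined with commas.
    flat = {
        key: ",".join(value) if isinstance(value, (list, tuple)) else value
        for key, value in params.items()
    }
    # Keep commas readable instead of percent-encoding them.
    return f"{BASE_URL}/{doc_type}/query/?{urlencode(flat, safe=',')}"


url = build_query_url(
    "geo",
    {"areaType": "Country", "_fields": ["locality"], "_sortFields": ["locality"]},
)
print(url)
```

Passing a list with more than one entry for `_sortFields` produces the comma-separated form described above.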
It is also possible to specify a sort direction. To sort in ascending order:
To sort in descending order:
Controlling result size: _size and _from
By default, the NBA returns the 10 best-scoring matches. The total size is always the first number in the result set when using the query endpoint. Note that, by default, the results returned from a query are not sorted on any field, so controlling the size of the result makes most sense on sorted data. The example below uses the _size parameter:
to return the first 100 geo areas that are countries. The scrolling parameter _from controls the offset from which results are retrieved. The query
thus returns 100 results starting from the 100th result. The first item in this set is therefore ‘Hungary’ and not ‘Afghanistan’, which would be the first hit of the query without _from.
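A minimal sketch of how _size and _from might be combined to walk through a sorted result set page by page. It assumes that _from is a zero-based offset, which is not confirmed here; the function name `page_windows` is hypothetical, and no request is sent.

```python
def page_windows(total, size):
    """Yield (_from, _size) pairs covering `total` results in pages of `size`.

    Assumes `_from` is a zero-based offset (an assumption for this sketch).
    """
    if size <= 0:
        raise ValueError("size must be positive")
    for offset in range(0, total, size):
        # The last page may be smaller than `size`.
        yield offset, min(size, total - offset)


for page_from, page_size in page_windows(total=250, size=100):
    print(f"_from={page_from}&_size={page_size}")
```

The total number of results would come from the first number in the result set, as described above.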
Ignore case: _ignoreCase
By default, search fields are case-sensitive in the NBA. As the name indicates, this parameter makes a query term case-insensitive.
Search on multiple fields: _logicalOperator
By default, when querying multiple fields, the query terms are conjoined by the operator AND, meaning that each condition has to be met. To use an OR conjunction, the query parameter _logicalOperator can be used, e.g.:
Complex Queries
Human-readable queries, as outlined above, provide an intuitive way to search on one or multiple fields in the data. However, these queries have limited power, and therefore the NBA allows for a detailed query specification in JSON format, a so-called QuerySpec object. A QuerySpec generally consists of an array of QueryCondition objects (please refer to the NBA reference documentation for detailed documentation of these object types) that, in turn, have the fields field, operator and value. A simple QuerySpec for specimens of the genus Hydrochoerus looks as follows:
{
"conditions": [
{
"field": "identifications.defaultClassification.genus",
"operator": "EQUALS",
"value": "Hydrochoerus"
}
]
}
This is the entire query (URL encoded).
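As a rough sketch, a QuerySpec like the one above could be URL-encoded with Python's standard library. The endpoint path `specimen/query/` combined with a `_querySpec` parameter is an assumption based on the other URLs in this document; no request is sent.

```python
import json
from urllib.parse import quote, unquote

query_spec = {
    "conditions": [
        {
            "field": "identifications.defaultClassification.genus",
            "operator": "EQUALS",
            "value": "Hydrochoerus",
        }
    ]
}

# Serialise compactly, then percent-encode every reserved character.
encoded = quote(json.dumps(query_spec, separators=(",", ":")), safe="")

# Assumed endpoint: the specimen query service with a _querySpec parameter.
url = f"https://api.biodiversitydata.nl/v2/specimen/query/?_querySpec={encoded}"
print(url)

# Decoding the parameter restores the original object.
decoded = json.loads(unquote(encoded))
```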
The QuerySpec object allows for partial matching of a query term. To find, for instance, all specimens of mammal genera that start with Hydro, we could query with two conditions:
{
"conditions": [
{
"field": "identifications.defaultClassification.genus",
"operator": "STARTS_WITH",
"value": "Hydro"
},
{
"field": "collectionType",
"operator": "EQUALS",
"value": "Mammalia"
}
]
}
QuerySpec in the NBA Scratchpad
The NBA Query Scratchpad is a convenient tool to design, optimise and debug QuerySpec JSON strings. Its URL can even take a QuerySpec as parameter, and the query will be automatically visible in the scratchpad:
https://api.biodiversitydata.nl/scratchpad//?_querySpec={my_queryspec_here}
Further querySpec options
Next to the QueryCondition(s), the parameters fields, size, from, logicalOperator and sortFields can be set. Note that these parameters are identical to the parameters _fields, _size, etc. introduced above for human-readable queries. Suppose we want to query with the above conditions, return at most 100 results sorted by the country where the specimen was found, and retrieve only a subset of the specimen properties:
{
"conditions": [
{
"field": "identifications.defaultClassification.genus",
"operator": "STARTS_WITH",
"value": "Hydro"
},
{
"field": "collectionType",
"operator": "EQUALS",
"value": "Mammalia"
}
],
"logicalOperator": "AND",
"size": 100,
"from": 0,
"sortFields": [
{
"path": "gatheringEvent.country",
"sortOrder": "ASC"
}
],
"fields": [
"identifications.defaultClassification.genus",
"identifications.scientificName.fullScientificName",
"gatheringEvent.country"
]
}
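The following sketch assembles such a QuerySpec in Python and rejects option names that are not in the list given in this section. The helper `make_query_spec` is hypothetical; the set of allowed names only reflects the options named here and would need extending for other QuerySpec fields.

```python
import json

# Option names listed in this section for a QuerySpec.
ALLOWED_OPTIONS = {"fields", "size", "from", "logicalOperator", "sortFields"}


def make_query_spec(conditions, **options):
    """Combine conditions and options into a QuerySpec dict (hypothetical helper)."""
    # "from" is a Python keyword, so callers pass it as "from_".
    renamed = {("from" if k == "from_" else k): v for k, v in options.items()}
    unknown = set(renamed) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError(f"unknown QuerySpec option(s): {sorted(unknown)}")
    return {"conditions": list(conditions), **renamed}


spec = make_query_spec(
    [
        {
            "field": "identifications.defaultClassification.genus",
            "operator": "STARTS_WITH",
            "value": "Hydro",
        }
    ],
    logicalOperator="AND",
    size=100,
    from_=0,
    sortFields=[{"path": "gatheringEvent.country", "sortOrder": "ASC"}],
)
print(json.dumps(spec, indent=2))
```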
Comparison Operators
The available operators differ between search fields, depending on whether they hold words, numbers, or dates. Most fields (except geo shapes) have the matching operators EQUALS and NOT_EQUALS. For numbers and dates, comparison operators such as LT and GT (less than and greater than, respectively) are usually important, as well as querying numbers within or outside a certain range (BETWEEN, NOT_BETWEEN). Partial matching operators for text fields are very powerful. With the operator CONTAINS, a search term must be an exact substring (upper/lower case is ignored) of a data field. The operator MATCHES breaks up the query string into single search terms which are conjoined with OR. The metadata service {doctype}/metadata/getFieldInfo lists all operators for every field in a data type:
https://api.biodiversitydata.nl/v2/specimen/metadata/getFieldInfo
Alternatively, it is possible to ask the API whether an operator is valid for a specific field, e.g.:
https://api.biodiversitydata.nl/v2/specimen/metadata/isOperatorAllowed/collectionType/EQUALS
Logical Conjunctions
While the parameter logicalOperator can logically combine multiple QueryConditions, this can also be accomplished in a nested manner, using the fields and and or to add one or more conditions within a QueryCondition. The example below will return specimens from both genera, Saguinus and Cebus:
{
"conditions": [
{
"field": "identifications.defaultClassification.genus",
"operator": "EQUALS",
"value": "Saguinus",
"or": [
{
"field": "identifications.defaultClassification.genus",
"operator": "EQUALS",
"value": "Cebus"
}
]
}
]
}
Note that the above query can also be implemented with two non-nested QueryConditions. Please refer to the Java client documentation for details on nesting.
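A minimal sketch of the non-nested alternative: two top-level conditions combined with logicalOperator set to OR. The helper name `any_of` is hypothetical; the field and values are those of the nested example.

```python
import json


def any_of(field, values):
    """Build a non-nested QuerySpec matching `field` against any of `values`."""
    return {
        "conditions": [
            {"field": field, "operator": "EQUALS", "value": value}
            for value in values
        ],
        # OR across the top-level conditions replaces the nested "or" field.
        "logicalOperator": "OR",
    }


spec = any_of("identifications.defaultClassification.genus", ["Saguinus", "Cebus"])
print(json.dumps(spec, indent=2))
```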
Empty/Non-empty Fields
The pre-defined parameter value @NULL@ matches a non-existent value. Suppose we want to know which specimens are of rank species, but do not have a default classification on the genus level:
Conversely, it is possible to query for non-empty fields with the parameter value @NOT_NULL@. For example, to get all specimens that have lat-long coordinates:
Be aware that the parameter values @NULL@ and @NOT_NULL@ are intended for human-readable queries only. In QuerySpec JSON notation, we can use "value" : null in combination with "operator" : "EQUALS" or "operator" : "NOT_EQUALS". Example: Search for all taxa that have no full scientific name in their synonyms:
{
"conditions": [
{
"field": "synonyms.fullScientificName",
"operator": "EQUALS",
"value": null
}
]
}
Please note that if "value" : null is omitted in the above example, the result will be the same. However, it seems more intuitive to explicitly specify the ‘null’ value in such queries.
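The correspondence between the two notations can be sketched as a small translation function. The name `condition_from_param` is hypothetical; the mapping follows the description above (@NULL@ becomes EQUALS with a null value, @NOT_NULL@ becomes NOT_EQUALS with a null value), and any other value is treated as a plain EQUALS match for the purpose of this sketch.

```python
import json


def condition_from_param(field, value):
    """Translate a human-readable parameter value into a QueryCondition dict."""
    if value == "@NULL@":
        return {"field": field, "operator": "EQUALS", "value": None}
    if value == "@NOT_NULL@":
        return {"field": field, "operator": "NOT_EQUALS", "value": None}
    # Any other value: a plain equality condition (assumption for this sketch).
    return {"field": field, "operator": "EQUALS", "value": value}


spec = {"conditions": [condition_from_param("synonyms.fullScientificName", "@NULL@")]}
# json.dumps writes Python's None as JSON null.
print(json.dumps(spec, indent=2))
```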
Scoring and Boosting
By default, query results are sorted by relevance, so hits that match the query conditions best will appear first. The relevance is reflected in the score which is returned in any NBA search result. The higher the score, the more relevant the search result. Please refer to the Elasticsearch documentation (https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html) on how the score is calculated. Suppose we look for taxa with genus name matching "uinus" or family names matching "alsa", using the operator CONTAINS:
{
"conditions": [
{
"field": "defaultClassification.genus",
"operator": "CONTAINS",
"value": "uinus"
},
{
"field": "defaultClassification.family",
"operator": "CONTAINS",
"value": "alsa"
}
],
"logicalOperator": "OR",
"fields": [
"defaultClassification.genus",
"defaultClassification.family"
],
"size": 100
}
The best-scoring hits are taxa of genus Saguinus, Guinusia and Pinguinus while taxa with matching family names (e.g. Valsaceae) are further behind, due to a larger edit distance. To raise the importance of a match in the family name relative to a match in the genus name, the boost parameter can be added to a QueryCondition:
{
"conditions": [
{
"field": "defaultClassification.genus",
"operator": "CONTAINS",
"value": "uinus"
},
{
"field": "defaultClassification.family",
"operator": "CONTAINS",
"value": "alsa",
"boost": 2
}
],
"logicalOperator": "OR",
"fields": [
"defaultClassification.genus",
"defaultClassification.family"
],
"size": 100
}
A boost parameter higher than 1 increases the relative weight of a QueryCondition while a value between 0 and 1 decreases it. The default boost of a QueryCondition is 1. More about boosting in ElasticSearch can be found here.
In many applications, scoring might not be relevant at all. To turn off scoring, and with it the sorting of query results by score, the parameter constantScore can be set in a QuerySpec object (i.e. "constantScore" : "true"). Disabling the calculation of a relevance score generally results in faster queries.
GET vs POST requests
All NBA services can process GET requests with a (URL-encoded) query string. Queries with complex QuerySpec objects can, however, result in long strings, which can cause size issues in certain REST clients. The NBA therefore also accepts POST requests for the endpoints that can take a QuerySpec object: query and count. The query spec is then passed as the payload. Below is a cURL example to query for all taxa of genus Sempervivum with POST:
curl -X POST https://api.biodiversitydata.nl/v2/taxon/query/ -d '_querySpec={
"conditions": [
{
"field": "defaultClassification.genus",
"operator": "EQUALS",
"value": "Sempervivum"
}
]
}'
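The same POST request could be prepared in Python with the standard library, as sketched below. The request is only constructed, not sent. Note that this sketch percent-encodes the JSON in the form body, which the cURL call above does not do explicitly; whether the service requires this is an assumption.

```python
import json
from urllib.parse import parse_qs, urlencode
from urllib.request import Request

query_spec = {
    "conditions": [
        {
            "field": "defaultClassification.genus",
            "operator": "EQUALS",
            "value": "Sempervivum",
        }
    ]
}

# Form-encode the body so that it carries a single _querySpec field.
body = urlencode({"_querySpec": json.dumps(query_spec)}).encode("utf-8")
request = Request(
    "https://api.biodiversitydata.nl/v2/taxon/query/",
    data=body,
    method="POST",
)
print(request.get_method(), request.full_url)

# Decoding the body again yields the original QuerySpec.
round_trip = json.loads(parse_qs(body.decode("utf-8"))["_querySpec"][0])

# To actually send it (requires network access):
#     from urllib.request import urlopen
#     with urlopen(request) as response:
#         result = json.load(response)
```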
Searching with Dates
All date fields of specimen, taxon and multimedia data types (geo areas do not have a date field) are indexed and therefore searchable. Dates can be entered in different levels of precision; the supported formats can be queried as follows:
https://api.biodiversitydata.nl/v2/metadata/getAllowedDateFormats
The most precise format is yyyy-MM-dd'T'HH:mm:ss.SSSZ, with resolution in milliseconds (SSS) and specification of time zone (Z). For example: 1981-01-09T08:40:59.880+02:00. The least precise format is yyyy. See also here for information and examples on NBA date formats. Date fields support the same operators as numeric fields (see here e.g. for specimens), e.g. EQUALS, IN, GT, LT and BETWEEN.
Example: Find all specimens that were gathered before 1800, sorted in descending order:
{
"conditions": [
{
"field": "gatheringEvent.dateTimeBegin",
"operator": "LT",
"value": "1800"
}
],
"sortFields": [
{
"path": "gatheringEvent.dateTimeBegin",
"sortOrder": "DESC"
}
]
}
CAUTION: When querying with less precise date strings, such as yyyy or yyyy-MM, the missing precision is added and set to the beginning of the unit. When querying for 2001 with the EQUALS operator, for instance, only dates that match 2001-01-01 are returned, not all dates in the year 2001. Therefore, querying for a range of values should be done with the BETWEEN operator.
For example, in order to retrieve all specimens collected in April 2001:
{
"conditions": [
{
"field": "gatheringEvent.dateTimeBegin",
"operator": "BETWEEN",
"value": [
"2001-04-01",
"2001-04-30"
]
}
]
}
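The bounds for such a month query can be computed rather than typed by hand. The sketch below uses Python's `calendar` module to find the last day of the month; the helper name `month_between_condition` is hypothetical, and the yyyy-MM-dd strings mirror the example above.

```python
import calendar


def month_between_condition(field, year, month):
    """Build a BETWEEN condition covering one calendar month (hypothetical helper)."""
    # monthrange returns (weekday of the first day, number of days in the month).
    last_day = calendar.monthrange(year, month)[1]
    return {
        "field": field,
        "operator": "BETWEEN",
        "value": [
            f"{year:04d}-{month:02d}-01",
            f"{year:04d}-{month:02d}-{last_day:02d}",
        ],
    }


condition = month_between_condition("gatheringEvent.dateTimeBegin", 2001, 4)
print(condition["value"])
```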
Aggregation
Aggregation functions summarise data from multiple query results based on certain conditions in order to get a higher-level view on the whole dataset. The NBA features two different types of aggregation: Aggregation on distinct field values and aggregation on scientific names.
Distinct values for fields
For a given field in a data type, services with path /{documentType}/getDistinctValues/{field} return all distinct values of that field in the data. Additionally, the frequency of each value is given, and the result is ordered by frequency. This functionality is thus a simple count aggregation. Example: To identify the number of specimens per collection, one has to aggregate on all distinct values for the field collectionType:
https://api.biodiversitydata.nl/v2/specimen/getDistinctValues/collectionType
Similarly, the service /{documentType}/countDistinctValues/{field} aggregates over a user-defined field, but returns the mere counts of records instead of returning the actual data.
Distinct values per group
Nested aggregation over two fields can be accomplished with the /{documentType}/getDistinctValuesPerGroup/{field} service. To retrieve, for example, the continents of specimen collection, grouped by collection type, one can query:
Note that the counts for the inner (second) aggregation can be higher than the first one; this stems from the cardinality of vernacularName fields in a taxon document - more documentation will follow.
The /{documentType}/countDistinctValuesPerGroup/{field} service does the same aggregation as above, but returns the counts of the second field grouped within the first field. To see how many different values for sex there are per collection, we can query:
https://api.biodiversitydata.nl/v2/specimen/countDistinctValuesPerGroup/collectionType/sex
Combining aggregations and the _querySpec parameter
All aggregation functions can be combined with a _querySpec parameter to add further detail to your query. For instance, to get the frequency of gathering event countries per collection type, taking into account only specimens from the kingdom Animalia, and restricting ourselves to the first ten collections, we can combine the aggregation query:
with the QueryObject:
{
"conditions": [
{
"field": "identifications.defaultClassification.kingdom",
"operator": "EQUALS_IC",
"value": "Animalia"
}
],
"size": 10
}
This is the entire query (URL encoded).
Correctly interpreting frequencies
Please note that when executing aggregation queries, different frequencies may sometimes be returned for the same variable when altering the value of the size parameter. This is an effect of the distributed architecture of the underlying datastore. When choosing a value for the size parameter that is smaller than the unique number of terms you are aggregating on, there is a chance the reported frequencies are too low. Increasing size can help minimize the effect, but as a rule, the frequencies in the output of the aggregation functions are not to be considered exact values. If you need an exact count, it is advised to use a regular search query and count the returned documents.
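The effect can be illustrated with a toy model in Python. This is not the NBA's or the datastore's actual implementation: it simply assumes that the data is split over two partitions, that each partition reports only its own `size` most frequent terms, and that the reported counts are summed. All term names and counts are made up.

```python
from collections import Counter


def sharded_top_terms(shards, size):
    """Merge per-partition top-`size` term counts (toy model of a distributed count)."""
    merged = Counter()
    for shard in shards:
        # Each partition only reports its own `size` most frequent terms.
        for term, count in Counter(shard).most_common(size):
            merged[term] += count
    return dict(merged.most_common(size))


# Made-up term frequencies on two partitions.
shards = [
    {"a": 10, "b": 6, "c": 5},
    {"a": 10, "c": 7, "b": 1},
]
print(sharded_top_terms(shards, size=2))  # "c" is undercounted
print(sharded_top_terms(shards, size=3))  # all three terms counted in full
```

With `size=2`, the first partition never reports its five occurrences of "c", so the merged frequency for "c" is lower than the true total of 12. With `size=3`, which equals the number of distinct terms, the counts are complete.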
Scientific Name Groups
The identification of a museum specimen is its assignment to a certain taxon of a certain rank (e.g. species or subspecies), and the taxon must be defined in some taxonomic reference resource. The concept of scientific name groups in the NBA establishes this link between a specimen and a taxon. This makes it possible to query which specimens are associated with which taxa and vice versa. The services available in the path {specimen|taxon}/groupByScientificName/{query} allow for aggregation of search results over both specimens and taxa at the same time. Aggregation is carried out on the field scientificNameGroup (identifications.scientificName.scientificNameGroup for specimens and acceptedName.scientificName.scientificNameGroup/synonyms.scientificName.scientificNameGroup for taxa). The query result gives a list of objects which contain both specimens and taxa that share the same value for scientificNameGroup. For example, to retrieve all specimens and their associated taxa for the genus Felis:
https://api.biodiversitydata.nl/v2/taxon/groupByScientificName/?acceptedName.genusOrMonomial=Felis
The output of the scientificNameGroup queries can be fine-tuned with some additional parameters when using advanced queries:
- groupSort determines how buckets are sorted. Valid values are COUNT_DESC, COUNT_ASC (sort by the number of documents in each bucket), NAME_ASC, NAME_DESC (sort by the scientific name by which the buckets are grouped) and TOP_HIT_SCORE (sort by the highest score-value within each bucket).
- groupFilter allows for filtering of documents based on the scientific name. It can be configured to either accept or reject, and has an implementation for a regular expression, or an array of values:
  { "acceptRegexp" : ".*larus*." }
  { "acceptValues" : [ "meles meles", "larus fuscus" ] }
  { "rejectRegexp" : ".*\?.*" }
  { "rejectValues" : [ "? ?", "? meles", "unknown" ] }
- specimensSortFields controls the sorting of specimens within each bucket. Same syntax as 'sortFields'.
- specimensFrom & specimensSize allow you to control which and how large a subset of the documents within each bucket are included in the result set. Same syntax as 'from' and 'size'. Note that these parameters control the contents of all buckets at once.
- noTaxa: the specimen groupBy-query also returns corresponding taxon records with the results. Set noTaxa to true to suppress them.
See the Elasticsearch documentation for their implementation of the regular expressions-syntax.
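As a hedged sketch, these parameters could be combined with conditions into one query object in Python. That groupSort, groupFilter and noTaxa sit at the top level of the QuerySpec next to conditions is an assumption, not confirmed by this section; the helper name `group_by_spec` is hypothetical. The set of valid sort values is the one listed above.

```python
import json

# groupSort values listed in this section.
VALID_GROUP_SORTS = {"COUNT_DESC", "COUNT_ASC", "NAME_ASC", "NAME_DESC", "TOP_HIT_SCORE"}


def group_by_spec(conditions, group_sort, group_filter=None, no_taxa=False):
    """Assemble a groupByScientificName query object (hypothetical helper).

    Assumes the grouping parameters are top-level keys of the QuerySpec.
    """
    if group_sort not in VALID_GROUP_SORTS:
        raise ValueError(f"invalid groupSort: {group_sort!r}")
    spec = {"conditions": list(conditions), "groupSort": group_sort, "noTaxa": no_taxa}
    if group_filter is not None:
        spec["groupFilter"] = group_filter
    return spec


spec = group_by_spec(
    [{"field": "acceptedName.genusOrMonomial", "operator": "EQUALS", "value": "Felis"}],
    group_sort="NAME_ASC",
    group_filter={"rejectValues": ["unknown"]},
)
print(json.dumps(spec, indent=2))
```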