About

The WDC-API provides a REST interface for the WebDataCollector framework. It gives external applications the means to work with the data collected and prepared by the WebDataCollector.

General definitions

This API shares some common conventions and definitions which are not stated explicitly at each endpoint.

Base-URL, Authentication and Authorization

The API can be accessed at the URL https://dss-wdc.wiso.uni-hamburg.de/api.

The API can only be accessed with access tokens. An access token is passed in the HTTP header as the parameter "Token".

curl 'https://dss-wdc.wiso.uni-hamburg.de/api/snapshot/list?page=0&size=5' -i -X GET -H "Token:MyToken"
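
The same authenticated call can be sketched in Python using only the standard library. This is a minimal illustration, not the official client: the helper names build_request and fetch_json are hypothetical, and "MyToken" is a placeholder for your own token.

```python
import json
import urllib.request

BASE_URL = "https://dss-wdc.wiso.uni-hamburg.de/api"


def build_request(path, token):
    """Create a GET request for `path` that carries the token in the "Token" header."""
    return urllib.request.Request(
        BASE_URL + path,
        headers={"Token": token},
        method="GET",
    )


def fetch_json(request):
    """Send the request and decode the JSON body of the response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage (performs a network call, therefore shown as a comment):
# response = fetch_json(build_request("/snapshot/list?page=0&size=5", "MyToken"))
# print(response["responseHeader"]["state"])
```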

Snapshots and Panels are secured objects. A user only sees the snapshots for which access has been configured. If you think a snapshot is missing, you can check your permissions via an API call.

You need an access-token? Please get in touch with us by email.
For brevity, the examples in this documentation omit the authentication token. In your own code you have to be authenticated.
Please note that we log your access to the API. We use that information to identify bottlenecks and problems within the API.

Responses, Status-Codes and Paging

The WDC-API aims to produce structurally stable Response-Objects in JSON. The basic form of such a Response-Object is as follows:

{
  "responseHeader" : { (1)
    "query" : "...",
    "state" : "OK",
    "msg" : "",
    "httpStatus" : "OK"
  },
  "content" : [ { (2)
    "domainName" : "www.dfg.de"
  }, {
    "domainName" : "www.oaq.ch"
  }, {
    "domainName" : "www.europace.org"
  } ],
  "page" : { (3)
    "size" : 3,
    "number" : 0,
    "totalElements" : 110,
    "totalPages" : 37
  },
  "links" : { (4)
    "next" : "https://dss-wdc.wiso.uni-hamburg.de/api/snapshot/20121227_intermediaries/domains?page=1&size=3"
  }
}
1 The responseHeader represents information about the query, the state of the response and potential messages and warnings.
2 The content consists of an array of objects. The type of these objects depends on the query.
3 The page-object describes the overall size of the data and provides the details needed for paging through large data sets. To actually consume paged resources you should use the links-object.
4 The links give the link for the next or the previous page. This information should be used to implement paging. If there is no next- or prev-page, the property does not exist.
The maximum number of elements in one page is 1000. If you specify a larger page size, for example 2000, it will be overridden.
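
Paging via the links-object can be sketched as a small generator. The function fetch_page is a hypothetical, caller-supplied callable that takes a URL and returns the decoded Response-Object as a dict; how it authenticates and performs the HTTP call is left open here.

```python
def iter_content(first_page_url, fetch_page):
    """Yield every item of a paged resource by following links.next."""
    url = first_page_url
    while url is not None:
        response = fetch_page(url)
        yield from response["content"]
        # The "next" property is absent on the last page, which ends the loop.
        url = response.get("links", {}).get("next")
```

Because the generator only follows the "next" link, it does not need to compute page numbers itself.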

Complex Datatypes for the API-Requests

The API defines arguments on different endpoints. Some arguments, such as those for paging or SnapshotSelections, must or can be used together and form a special datatype.

Datatype Description Arguments

Paging

Used to provide a means to "page" through larger results. See above. You should not use the paging directly. Instead use the prev and next-links.

  • page: the number of the page

  • size: the number of items on each page

SnapshotSelection

A SnapshotSelection expresses a subset of the domains in a snapshot. Various endpoints offer the possibility to work on such subsets.

  • snapshot: the machine-name of the snapshot

  • selection: Optional. The machine-name of the selection. If not provided all domains in the snapshot are used.
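
As an illustration, a SnapshotSelection could be modelled as a small value object that turns into request arguments. The class and method names are assumptions made for this sketch; it applies to endpoints that take snapshot and selection as query arguments, such as /api/texts/search.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode


@dataclass(frozen=True)
class SnapshotSelection:
    """A snapshot plus an optional selection (a subset of its domains)."""

    snapshot: str
    selection: Optional[str] = None

    def to_params(self):
        """Return the request arguments; `selection` is omitted when unset."""
        params = {"snapshot": self.snapshot}
        if self.selection is not None:
            params["selection"] = self.selection
        return params


# Without a selection, all domains of the snapshot are used.
query_string = urlencode(SnapshotSelection("20121227_intermediaries").to_params())
```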

Integration: Access from Python

For tighter integration we publish the Python package dss.wdc_client. This package automatically handles the paging of large results and transforms them into JSON arrays or directly into DataFrames.

We highly recommend using this approach, as we develop and test this package in sync with the rest of the WDC-API.

  • The package is published on PyPi as dss.wdc_client and can be used in the usual ways using pip, poetry or any other package manager you prefer.

  • The documentation with examples and the API-Reference is located at dss.wdc_client-API

Snapshots

A set of Endpoints to discover information about available snapshots.

/api/snapshot/list

Returns a list of Snapshots, which are accessible for the given user.

Parameter Description

filter

A simple filter which checks if the name contains the given String

Example
$ curl 'http://localhost:8080/api/snapshot/list?filter=intermediaries' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 551

{
  "responseHeader" : {
    "query" : "http://localhost:8080/api/snapshot/list?filter=intermediaries",
    "state" : "OK",
    "msg" : "",
    "httpStatus" : null
  },
  "content" : [ {
    "name" : "20121227_intermediaries",
    "description" : null,
    "indexed" : true,
    "textExtracted" : true
  }, {
    "name" : "20240701_intermediaries",
    "description" : null,
    "indexed" : false,
    "textExtracted" : false
  } ],
  "page" : {
    "size" : 1000,
    "number" : 0,
    "totalElements" : 2,
    "totalPages" : 1
  },
  "links" : { }
}

/api/snapshot/{snapshot}/domains

Returns the set of Domains included in the specified Snapshot. The information about Domains reflects the imported status of a crawl.

Information can only be obtained about crawled domains (Seeds).

Parameter Description

page

The number of the requested page.

size

The number of objects of the requested page.

Example
$ curl 'http://localhost:8080/api/snapshot/20121227_intermediaries/domains?page=0&size=2' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 587

{
  "responseHeader" : {
    "query" : "http://localhost:8080/api/snapshot/20121227_intermediaries/domains?page=0&size=2",
    "state" : "OK",
    "msg" : "",
    "httpStatus" : null
  },
  "content" : [ {
    "domainName" : "www.cpu.fr",
    "type" : "SEED",
    "pages" : 10008
  }, {
    "domainName" : "www.srhe.ac.uk",
    "type" : "SEED",
    "pages" : 9995
  } ],
  "page" : {
    "size" : 2,
    "number" : 0,
    "totalElements" : 114,
    "totalPages" : 57
  },
  "links" : {
    "next" : "http://localhost:8080/api/snapshot/20121227_intermediaries/domains?page=1&size=2"
  }
}

/api/snapshot/{snapshot}/seeds

Returns concise information about seeds, including their crawl status and possible redirects.

The information of seeds is generated from the Heritrix seed-reports. Status codes can be found here: https://heritrix.readthedocs.io/en/latest/glossary.html#status-codes

Fields of one seed-item:

Field Description

httpStatusCode

Extended httpStatusCode for the current uri

status

A more human-readable status code

uri

The actual URI.

redirectsTo

A possible redirect. Is null if there was no redirect. Please note that such a redirect creates a following seed-item, which in turn can create another redirect.

Example
$ curl 'http://localhost:8080/api/snapshot/20121227_intermediaries/seeds?size=3' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 791

{
  "responseHeader" : {
    "query" : "http://localhost:8080/api/snapshot/20121227_intermediaries/seeds?size=3",
    "state" : "OK",
    "msg" : "",
    "httpStatus" : null
  },
  "content" : [ {
    "httpStatusCode" : -6,
    "status" : "NOTCRAWLED",
    "uri" : "http://www.esib.org/",
    "redirectsTo" : null
  }, {
    "httpStatusCode" : -6,
    "status" : "NOTCRAWLED",
    "uri" : "http://www.forum.eua.be/",
    "redirectsTo" : null
  }, {
    "httpStatusCode" : -6,
    "status" : "NOTCRAWLED",
    "uri" : "http://www.www2.esf.org/",
    "redirectsTo" : null
  } ],
  "page" : {
    "size" : 3,
    "number" : 0,
    "totalElements" : 157,
    "totalPages" : 53
  },
  "links" : {
    "next" : "http://localhost:8080/api/snapshot/20121227_intermediaries/seeds?page=1&size=3"
  }
}
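
Since a redirect creates a following seed-item, chains of redirects can be resolved on the client side. The sketch below assumes that redirectsTo matches the uri of the following seed-item exactly; the function name and the max_hops guard are illustrative and not part of the API.

```python
def resolve_redirects(seed_items, start_uri, max_hops=10):
    """Follow redirectsTo from start_uri and return the chain of visited URIs."""
    by_uri = {item["uri"]: item for item in seed_items}
    chain = [start_uri]
    current = by_uri.get(start_uri)
    while current is not None and current["redirectsTo"] is not None:
        target = current["redirectsTo"]
        if target in chain or len(chain) > max_hops:
            break  # guard against redirect loops and overly long chains
        chain.append(target)
        current = by_uri.get(target)
    return chain
```

The last element of the returned list is the URI at which the chain ends.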

/api/snapshot/{snapshot}/searchDomains

Queries the SearchIndex of the crawled documents with a given Query and returns a list of hits in each domain. Only domains which actually have at least one hit are returned.

The number of hits of a domain is calculated as the sum of hits in each document. Internally a faceted Solr query of the index is created which uses facet.method=fc (see https://solr.apache.org/guide/solr/latest/query-guide/faceting.html).

Parameter Description

query

A query to search for. Can be an arbitrary Solr-Query.

selection

Optional. A machineName of a Selection. If specified only results of Domains in the Selection will be returned.

Example
$ curl 'http://localhost:8080/api/snapshot/20121227_intermediaries/searchDomains?query=uni&size=2' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 566

{
  "responseHeader" : {
    "query" : "http://localhost:8080/api/snapshot/20121227_intermediaries/searchDomains?query=uni&size=2",
    "state" : "OK",
    "msg" : "",
    "httpStatus" : null
  },
  "content" : [ {
    "domainName" : "www.acquin.org",
    "hits" : 2386
  }, {
    "domainName" : "www.che.de",
    "hits" : 1476
  } ],
  "page" : {
    "size" : 2,
    "number" : 0,
    "totalElements" : 51,
    "totalPages" : 26
  },
  "links" : {
    "next" : "http://localhost:8080/api/snapshot/20121227_intermediaries/searchDomains?query=uni&page=1&size=2"
  }
}

Selections

A Selection represents a subset of Domains of a Snapshot. They can be used as a filter for various endpoints.

Filtering on Selections is done on a best-effort basis. Take a search request as an example: the filter includes all search results whose domain ends with a domain in the selection. This is necessary to include search results of redirected crawled data. Yet this simple approach might lead to undesired results:

Domain in Selection    Domain in Seed-List    Crawled, indexed Domain    Matches

bimid.de               bimid.de               www.bimid.de               true

www.tageszeitung.de    www.tageszeitung.de    www.taz.de                 false

The filtering of Selections will be reworked to use a more sophisticated strategy based on the redirect data, which will also match the second case.
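
The best-effort matching described above can be mirrored with a plain suffix check. This is only a sketch of the described behaviour, not the server-side implementation; the function name is hypothetical.

```python
def matches_selection(indexed_domain, selection_domains):
    """Return True if the indexed domain ends with any domain of the selection."""
    return any(indexed_domain.endswith(domain) for domain in selection_domains)


# The two cases discussed above:
first_case = matches_selection("www.bimid.de", ["bimid.de"])
second_case = matches_selection("www.taz.de", ["www.tageszeitung.de"])
```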

/api/selection/list

Returns a list of available selections.

Parameter Description

page

The number of the requested page.

size

The number of objects of the requested page.

Example
$ curl 'http://localhost:8080/api/selection/list' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 524

{
  "responseHeader" : {
    "query" : "http://localhost:8080/api/selection/list",
    "state" : "OK",
    "msg" : "",
    "httpStatus" : null
  },
  "content" : [ {
    "machineName" : "createWithSelection",
    "title" : "the title"
  }, {
    "machineName" : "SelectionControllerTest.SET",
    "title" : ""
  }, {
    "machineName" : "wdc.crawler.test.SelectionServiceTest#set",
    "title" : ""
  } ],
  "page" : {
    "size" : 1000,
    "number" : 0,
    "totalElements" : 3,
    "totalPages" : 1
  },
  "links" : { }
}

/api/selection/{selection}/domains

Returns the set of Domains included in the specified Selection.

Parameter Description

page

The number of the requested page.

size

The number of objects of the requested page.

Example
$ curl 'http://localhost:8080/api/selection/createWithSelection/domains' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 963

{
  "responseHeader" : {
    "query" : "http://localhost:8080/api/selection/createWithSelection/domains",
    "state" : "OK",
    "msg" : "",
    "httpStatus" : null
  },
  "content" : [ {
    "name" : "www.esf.org"
  }, {
    "name" : "www.oecd.org"
  }, {
    "name" : "www.nuffic.nl"
  }, {
    "name" : "www.eua.be"
  }, {
    "name" : "www.enqa.eu"
  }, {
    "name" : "www.eqar.eu"
  }, {
    "name" : "www.inqaahe.org"
  }, {
    "name" : "www.esmu.be"
  }, {
    "name" : "www.eaie.org"
  }, {
    "name" : "www.britishcouncil.org"
  }, {
    "name" : "eacea.ec.europa.eu"
  }, {
    "name" : "www.chea.org"
  }, {
    "name" : "www.aca-secretariat.be"
  }, {
    "name" : "www.iau-aiu.net"
  }, {
    "name" : "www.iie.org"
  }, {
    "name" : "www.aucc.ca"
  }, {
    "name" : "www.aau.org"
  }, {
    "name" : "www.nafsa.org"
  } ],
  "page" : {
    "size" : 1000,
    "number" : 0,
    "totalElements" : 18,
    "totalPages" : 1
  },
  "links" : { }
}

/api/selection/{selection}/set (beta)

Sets the Domains of the Selection. If the Selection does not exist, it will be created.

This endpoint is still under evaluation and will not be useful for "normal" users. As a normal user you won’t be able to edit a newly created Selection.
Selections are SecuredObjects, so changes to them are access-controlled. To edit an existing selection, make sure you have the corresponding access rights.
Example
$ curl 'http://localhost:8080/api/selection/SelectionControllerTest.SET/set' -i -X PUT \
    -H 'Content-Type: text/plain' \
    -d '	www.eua.be
	www.oecd.org
	www.enqa.eu
'
HTTP/1.1 201 Created
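
A Python counterpart of the PUT request might look as follows. It is a sketch under assumptions: the body is sent as plain text with one domain per line, the leading tabs of the curl example are treated as formatting residue and omitted, and the helper name is hypothetical.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "https://dss-wdc.wiso.uni-hamburg.de/api"


def build_set_selection_request(selection, domains, token):
    """Create a PUT request that sets the domains of `selection`."""
    body = "\n".join(domains) + "\n"
    return urllib.request.Request(
        # Machine-names may contain characters such as "#", so quote the path part.
        f"{BASE_URL}/selection/{quote(selection, safe='')}/set",
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/plain", "Token": token},
        method="PUT",
    )
```

Passing the request to urllib.request.urlopen would send it; that step is left out here because the endpoint modifies data.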

Panels

A set of Endpoints to discover information about available Panels.

/api/panel/list

Returns a list of Panels, which are accessible for the given user.

Example
$ curl 'http://localhost:8080/api/panel/list' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 368

{
  "responseHeader" : {
    "query" : "http://localhost:8080/api/panel/list",
    "state" : "OK",
    "msg" : "",
    "httpStatus" : null
  },
  "content" : [ {
    "name" : "intermediaries",
    "description" : null,
    "snapshotCount" : 1
  } ],
  "page" : {
    "size" : 1000,
    "number" : 0,
    "totalElements" : 1,
    "totalPages" : 1
  },
  "links" : { }
}

/api/panel/{name}/list

Returns the set of Snapshots included in the specified Panel.

The list of returned Snapshots is secured and filtered with your access rules.
Parameter Description

page

Optional. The number of the requested page.

size

Optional. The number of objects of the requested page.

Example
$ curl 'http://localhost:8080/api/panel/intermediaries/list?page=0&size=1' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 428

{
  "responseHeader" : {
    "query" : "http://localhost:8080/api/panel/intermediaries/list?page=0&size=1",
    "state" : "OK",
    "msg" : "",
    "httpStatus" : null
  },
  "content" : [ {
    "name" : "20121227_intermediaries",
    "description" : null,
    "indexed" : true,
    "textExtracted" : true
  } ],
  "page" : {
    "size" : 1,
    "number" : 0,
    "totalElements" : 1,
    "totalPages" : 1
  },
  "links" : { }
}

Index

Internally the documents (pages) of a Snapshot are indexed using a full-text search-engine.

For search requests based on documents you can use the fields in the table below:

Fieldname Description

domain_s

The domain of the web site hosting this page

path_s

The complete path of the page, including query parameters

title_t

The title of the web page.

description_t

A description extracted from the page.

language_s

The language of the identified text

_text_

The hidden field for the full-text. It can be queried but the values are not stored in the index. If you need full-texts please use the appropriate API-Call. Normally, you do not have to state this field in queries.

Examples for querying the full-text index
	CSR (1)
	"Corporate Social Responsibility" (2)
	language_s:de && "Corporate Social Responsibility" (3)
	language_s:en && path_s:"/about" (4)
1 Searches for the term CSR in the default full-text field _text_.
2 Searches for the phrase "Corporate Social Responsibility". Use " to combine single words into a longer phrase.
3 Same as above, but searches only in German documents.
4 Searches for pages with the specified path in English documents.
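
When such queries are sent as the query argument of an endpoint, they have to be URL-encoded, otherwise characters like && would be read as argument separators. The helpers below are a hypothetical sketch of how a query could be composed and encoded.

```python
from urllib.parse import urlencode


def phrase(text):
    """Wrap text in double quotes so that it is searched as one phrase."""
    return '"' + text.replace('"', '\\"') + '"'


def field_filter(field, value):
    """Restrict the query to documents whose `field` matches `value`."""
    return f"{field}:{value}"


# Example (3) from the list above.
query = " && ".join(
    [field_filter("language_s", "de"), phrase("Corporate Social Responsibility")]
)
# urlencode percent-encodes the operators, quotes and spaces of the query.
encoded = urlencode({"query": query, "size": 2})
```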
Notes and complete Solr-Query-Syntax

As you have probably noticed, the examples do not contain any information about a snapshot or panel. This information is added to your query automatically to make sure that only reasonable queries can be submitted to the API.

The index is currently backed by Solr, so you can use the Solr query syntax to query the full-text index. For further reference, please consult the original Solr documentation.

/api/index/status

Provides an overview of the indexed documents per Snapshot and, optionally, per Selection.

Parameter Description

snapshot

Optional. The name of a snapshot. Can be specified multiple times.

panel

Optional. The name of a panel. One of 'snapshot' or 'panel' has to be specified.

selection

Optional. The name of a selection to filter the results.

Example
$ curl 'http://localhost:8080/api/index/status?snapshot=20121227_intermediaries' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 412

{
  "responseHeader" : {
    "query" : "http://localhost:8080/api/index/status?snapshot=20121227_intermediaries",
    "state" : "OK",
    "msg" : "",
    "httpStatus" : null
  },
  "content" : [ {
    "snapshot" : "20121227_intermediaries",
    "selection" : "",
    "indexedDocs" : 223664
  } ],
  "page" : {
    "size" : 1,
    "number" : 0,
    "totalElements" : 1,
    "totalPages" : 1
  },
  "links" : { }
}

Texts

Texts of web pages are prepared in various ways. This chapter describes what actually happens to those texts and how you can access this information.

/api/texts/search

Returns a subset of pages with extracted text.

Parameter Description

snapshot

The name of the snapshot

selection

Optional. The machine-name of the selection.

query

The Solr-Query which is used to search for the pages

textsInclude

Optional. If 'true', the extracted text is included. Please note that texts are not extracted for all Snapshots.

textsAbbreviate

Optional. Used for debugging. Abbreviates exported texts.

page

Optional. The number of the requested page.

size

Optional. The number of objects of the requested page.

Example
$ curl 'http://localhost:8080/api/texts/search?snapshot=20121227_intermediaries&query=news&textsInclude=true&textsAbbreviate=true&size=2' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 1650

{
  "responseHeader" : {
    "query" : "http://localhost:8080/api/texts/search?snapshot=20121227_intermediaries&query=news&textsInclude=true&textsAbbreviate=true&size=2",
    "state" : "OK",
    "msg" : "solrQuery: q=news&q.op=OR&fq=snapshot_id_i:+1&sort=domain_id_i+asc,reference_id_i+asc&start=0&rows=2",
    "httpStatus" : null
  },
  "content" : [ {
    "snapshot" : "20121227_intermediaries",
    "language" : "en",
    "textId" : 43,
    "domain" : "www.dfg.de",
    "path" : "/en/index.jsp",
    "textInfo" : {
      "title" : "DFG, German Research Foundation",
      "description" : null,
      "text" : "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDFG, German Research Foundation\n\n\r\n\r\n\r\n    \r\n\r\n      ...",
      "contentType" : "text/plain",
      "language" : "en"
    }
  }, {
    "snapshot" : "20121227_intermediaries",
    "language" : "de",
    "textId" : 95,
    "domain" : "www.dfg.de",
    "path" : "/dfg_profil/geschaeftsstelle/dfg_praesenz_ausland/beijing/index.jsp",
    "textInfo" : {
      "title" : "DFG - Deutsche Forschungsgemeinschaft - Chinesisch-Deutsches Zentrum für Wissenschaftsförderung Beijing",
      "description" : null,
      "text" : "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDFG - Deutsche Forschungsgemeinschaft - Chinesisch-...",
      "contentType" : "text/plain",
      "language" : "de"
    }
  } ],
  "page" : {
    "size" : 2,
    "number" : 0,
    "totalElements" : 75772,
    "totalPages" : 37886
  },
  "links" : {
    "next" : "http://localhost:8080/api/texts/search?snapshot=20121227_intermediaries&query=news&textsInclude=true&textsAbbreviate=true&page=1&size=2"
  }
}

/api/texts/get

Returns extracted text from a set of given text-ids.

Parameter Description

snapshot

The name of the snapshot

id

Id of the text. Can be specified multiple times.

Example
$ curl 'http://localhost:8080/api/texts/get?snapshot=20121227_intermediaries&id=9094&id=9095' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 7093

{
  "responseHeader" : {
    "query" : "http://localhost:8080/api/texts/get?snapshot=20121227_intermediaries&id=9094&id=9095",
    "state" : "OK",
    "msg" : "",
    "httpStatus" : null
  },
  "content" : [ {
    "snapshot" : "20121227_intermediaries",
    "language" : "en",
    "textId" : 9094,
    "textInfo" : {
      "title" : "Making Conferences Greener : European Science Foundation",
      "description" : null,
      "text" : "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nMaking Conferences Greener : European Science Foundation\n\n\n\n\n\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\tBookmark this pageFAQMember pagesRSSSitemapSubscribe\n\n\t\t\t\t\t\t\t\t\t\n\t\t\t     \n                            \n                            \n                            \n\t\t\n\n\t\t\n\t\t\t\n\t\t\n\n\n\n\t\t\t\t\t\t\n\n\t\t\t\t\t\n\n\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\n\n\t\t\t\t\t\n\n\t\t\t\t\n\n\t\t\t\t\tHome\n\tAbout ESF\n\tActivities\n\tResearch Areas\n\tPublications\n\tMedia Centre\n\tJobs\n\tContact\n\n\n\n\n\t\t\t\tHome  > Activities  > ESF Research Conferences  > Making Conferences Greener\n\n\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\tESF Research Conferences\n\n\n\t\t\t\t\t\t\tMaking Conferences Greener: A project of the ESF Conferences Unit \n\n\n\n\n\t\n\n\nWe are proud to inform you that the ESF Research Conferences Forest has become bigger! \r\n\n3777 exemplars of Moringa trees will join the 5833 ones that we planted last year.\nThey will be planted in Dosso (Niger) to face the  desertification of the area and to offset 1020 tons of CO2 during their life span.\r\n\nMore information will be posted soon!\n\r\n\n** Autumn update! Read here the Latest News! **\n\n\n\n\n\n\t\n\n\nA 140m deep water well was drilled during the last months and its electric pump at a depth of 80m to pump the water up to the surface is run by soler panels. Click here to read the whole report and look at the picture.\n\nThe two hectares of moringa trees planted last year gave a good weekly leaf harvests over the summer and the products were sold right away at the market in the nearby city of Dosso. Do you wonder what does the moringa leaf taste like? Click here to discover it! \n\n\n\n\n\n\t\n\n\nThe idea to turn our conferences ‘green’ started in 2009 and involves various aspects of the conference organisation. 
In order to conform to UNEP recommendations, amongst others, we are working on making our conferences more sustainable and environmentally friendly. This will be an ongoing project that will affect our participants, our material, our venues and our office. This page aims to keep our attendees informed about the status of the project and further green activities.\r\n\nWe would also like to encourage participants to use this page to write tips and recommendations that will help to make our green project successful.\n\tResearch Conferences Forest and Green Fee\n\tGreen Travel\n\tGreen Venues\n\tGreen Material\n\tGreen Office\n\n\n\n\n\n\n\n\n\n\n\t\t\t\t\t\t\n\n\n\n\t\t\t\t\t\t\tEuroBioFund\n\tEUROCORES\n\tExploratory Workshops\n\tForward Looks\n\tCalls and Funding\n\tMO Fora\n\tResearch Networking Programmes\n\tESF Research Conferences\tUpcoming Events\n\tNews\n\tCall for Proposals\n\tPartnerships\n\tMaking Conferences Greener\tResearch Conferences Forest\n\tGreen Travel\n\n\n\tSponsor Resource Center\n\tVenues\n\tPublications\n\tRestricted Pages\n\tContacts\n\tFAQ\n\tPast Events\n\tSearch\n\tOther Meetings \n\tConferences Email Alerts\n\n\n\tScience Policy\n\tESF Meetings\n\tEuropean Latsis Prize 2012\n\tPeer Review\n\tESF Symposia\n\tESF at ESOF 2012 Dublin\n\n\n\n\n\n\t\t\t\t\t\n\n\t\t\t\t\n\n\t\t\t\tData protection | Disclaimer\n\n© 2012 European Science Foundation - page last updated: 20.12.2011\n\nESF provides the scientific, administrative and technical secretariat for COST (European Cooperation in Science and Technology).\n\n\n\n\n\t\t\t\n\n\t\t\n\n\t\t\n\t\t\n\t\n\n\n\n\n\n\n",
      "contentType" : "text/plain",
      "language" : "en"
    }
  }, {
    "snapshot" : "20121227_intermediaries",
    "language" : "en",
    "textId" : 9095,
    "textInfo" : {
      "title" : "Forward Looks : European Science Foundation",
      "description" : null,
      "text" : "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nForward Looks : European Science Foundation\n\n\n\n\n\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\tBookmark this pageFAQMember pagesRSSSitemapSubscribe\n\n\t\t\t\t\t\t\t\t\t\n\t\t\t     \n                            \n                            \n                            \n\t\t\n\n\t\t\n\t\t\t\n\t\t\n\n\n\n\t\t\t\t\t\t\n\n\t\t\t\t\t\n\n\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\n\n\t\t\t\t\t\n\n\t\t\t\t\n\n\t\t\t\t\tHome\n\tAbout ESF\n\tActivities\n\tResearch Areas\n\tPublications\n\tMedia Centre\n\tJobs\n\tContact\n\n\n\n\n\t\t\t\tHome  > Activities  > Forward Looks\n\n\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\tForward Looks\n\n\n\t\t\t\t\t\tThe flagship activity of ESF’s strategic arm, Forward Looks enable Europe’s scientific community, in interaction with policy makers, to develop medium to long-term views and analyses of future research developments with the aim of defining research agendas at national and European level.  Forward Looks are driven by ESF’s Member Organisations and, by extension, the European research community. Quality assurance mechanisms, based on peer review where appropriate, are applied at every stage of the development and delivery of a Forward Look to ensure its quality and impact. 
\n\nPlease see the left navigational bar for all current Forward Looks.\n\nFor enquiries about Forward Looks please contact:\n\n\tMs.LauraMarinE-Mail\n\tScience Officer MOs Relations & Partnerships\n\n\tMs.MadeliseBlumenroederE-Mail\n\tSenior Administrator\n\n\nPlease click here to see all our Forward Look reports \n\n\t\n\n\t\n\n\n\n\n\n\n\n\n\n\t\t\t\t\t\t \n\n\t\t\t\t\t\t\tEuroBioFund\n\tEUROCORES\n\tExploratory Workshops\n\tForward Looks\tNews\n\tAll Current and Completed Forward Looks\n\tSpace Sciences (SSU)\n\tHumanities (SCH)\n\tLife, Earth and Environmental Sciences (LESC)\n\tMedical Sciences (EMRC)\n\tPhysical and Engineering Sciences (PESC)\n\tSocial Sciences (SCSS)\n\tWorkshop scheme\n\n\n\tCalls and Funding\n\tMO Fora\n\tResearch Networking Programmes\n\tESF Research Conferences\n\tScience Policy\n\tESF Meetings\n\tEuropean Latsis Prize 2012\n\tPeer Review\n\tESF Symposia\n\tESF at ESOF 2012 Dublin\n\n\n\n\n\n\t\t\t\t\t\n\n\t\t\t\t\n\n\t\t\t\tData protection | Disclaimer\n\n© 2012 European Science Foundation - page last updated: 26.12.2012\n\nESF provides the scientific, administrative and technical secretariat for COST (European Cooperation in Science and Technology).\n\n\n\n\n\t\t\t\n\n\t\t\n\n\t\t\n\t\t\n\t\n\n\n\n\n\n\n",
      "contentType" : "text/plain",
      "language" : "en"
    }
  } ],
  "page" : {
    "size" : 2,
    "number" : 0,
    "totalElements" : 2,
    "totalPages" : 1
  },
  "links" : { }
}
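
Arguments that can be specified multiple times, such as id, are simply repeated in the query string. A minimal sketch of building such a URL (the function name is hypothetical):

```python
from urllib.parse import urlencode

BASE_URL = "https://dss-wdc.wiso.uni-hamburg.de/api"


def texts_get_url(snapshot, text_ids):
    """Build the /texts/get URL, repeating the `id` argument once per text id."""
    params = [("snapshot", snapshot)] + [("id", text_id) for text_id in text_ids]
    return f"{BASE_URL}/texts/get?{urlencode(params)}"
```

Passing a list of pairs to urlencode keeps the order of the arguments and allows the same key to appear more than once.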

DomainGraph

A set of Endpoints to export already prepared DomainGraphs. DomainGraphs represent the graph of Domains (Nodes) and their linkages (Edges). Edges additionally have a weight to count how often one Domain links to another.

The API for DomainGraphs uses a Node-Edge representation. Thus you have to use 2 calls to the API to get the data for a DomainGraph.

There may be multiple different DomainGraphs for one Snapshot. DomainGraphs may be created from Variants and Selections.

As for the Variants, currently only one Variant ONLY_SEEDS exists.

  • ONLY_SEEDS: Contains all nodes and edges from the crawled Snapshot.

Currently new DomainGraphs can only be created from the backend.

/api/domaingraph/list

A list with all existing DomainGraphs.

Parameter Description

snapshot

Optional. The name of a snapshot. Can be specified multiple times.

panel

Optional. The name of a panel.

selection

Optional. The machine-name of the selection. If not specified does not filter the result.

variant

Optional. The variant of the DomainGraph. Currently only 'ONLY_SEEDS' is supported. If not specified, does not filter the result.

page

The number of the requested page.

size

The number of objects of the requested page.

Example
$ curl 'http://localhost:8080/api/domaingraph/list?page=0&size=3' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 431

{
  "responseHeader" : {
    "query" : "http://localhost:8080/api/domaingraph/list?page=0&size=3",
    "state" : "OK",
    "msg" : "",
    "httpStatus" : null
  },
  "content" : [ {
    "id" : 16,
    "snapshotName" : "20121227_intermediaries",
    "variant" : "ONLY_SEEDS",
    "selectionMachineName" : null
  } ],
  "page" : {
    "size" : 3,
    "number" : 0,
    "totalElements" : 1,
    "totalPages" : 1
  },
  "links" : { }
}

/api/domaingraph/{id}/nodes

A list with all nodes in the referenced DomainGraph.

Parameter Description

page

The number of the requested page.

size

The number of objects of the requested page.

Example
$ curl 'http://localhost:8080/api/domaingraph/16/nodes?page=0&size=3' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 892

{
  "responseHeader" : {
    "query" : "http://localhost:8080/api/domaingraph/16/nodes?page=0&size=3",
    "state" : "OK",
    "msg" : "",
    "httpStatus" : null
  },
  "content" : [ {
    "id" : "www.dfg.de",
    "url" : "www.dfg.de",
    "type" : "SEED",
    "indegree" : 8,
    "outdegree" : 4,
    "degree" : 12,
    "outdegree_seeds" : 4
  }, {
    "id" : "www.oaq.ch",
    "url" : "www.oaq.ch",
    "type" : "SEED",
    "indegree" : 10,
    "outdegree" : 20,
    "degree" : 30,
    "outdegree_seeds" : 20
  }, {
    "id" : "www.europace.org",
    "url" : "www.europace.org",
    "type" : "SEED",
    "indegree" : 5,
    "outdegree" : 6,
    "degree" : 11,
    "outdegree_seeds" : 6
  } ],
  "page" : {
    "size" : 3,
    "number" : 0,
    "totalElements" : 113,
    "totalPages" : 38
  },
  "links" : {
    "next" : "http://localhost:8080/api/domaingraph/16/nodes?page=1&size=3"
  }
}

/api/domaingraph/{id}/edges

A list with all edges in the referenced DomainGraph.

Parameter Description

page

The number of the requested page.

size

The number of objects of the requested page.

Example
$ curl 'http://localhost:8080/api/domaingraph/16/edges?page=0&size=3' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 637

{
  "responseHeader" : {
    "query" : "http://localhost:8080/api/domaingraph/16/edges?page=0&size=3",
    "state" : "OK",
    "msg" : "",
    "httpStatus" : null
  },
  "content" : [ {
    "source" : "www.dfg.de",
    "target" : "www.esf.org",
    "weight" : 20
  }, {
    "source" : "www.dfg.de",
    "target" : "erc.europa.eu",
    "weight" : 6
  }, {
    "source" : "www.dfg.de",
    "target" : "www.ciee.org",
    "weight" : 2
  } ],
  "page" : {
    "size" : 3,
    "number" : 0,
    "totalElements" : 1126,
    "totalPages" : 376
  },
  "links" : {
    "next" : "http://localhost:8080/api/domaingraph/16/edges?page=1&size=3"
  }
}
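
Once the content of both calls has been collected, nodes and edges can be combined on the client side. The following sketch builds a weighted adjacency mapping from the two lists; the function names are illustrative and not part of the API.

```python
def build_adjacency(nodes, edges):
    """Map each node id to a dict {target id: weight} of its outgoing edges."""
    adjacency = {node["id"]: {} for node in nodes}
    for edge in edges:
        # Sources that are missing from the node list are added on the fly.
        adjacency.setdefault(edge["source"], {})[edge["target"]] = edge["weight"]
    return adjacency


def weighted_outdegree(adjacency, node_id):
    """Sum of the weights of all edges leaving `node_id`."""
    return sum(adjacency.get(node_id, {}).values())
```

Applied to the three edges of the example response, weighted_outdegree for www.dfg.de adds up the weights 20, 6 and 2.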

Statistics

These endpoints provide access to basic statistical information about snapshots (and panels).

/api/stats

Returns a list of Stats-Objects describing various descriptive indicators of snapshots.

Parameter Description

snapshot

Optional. The name of a snapshot. Can be specified multiple times.

panel

Optional. The name of a panel. One of 'snapshot' or 'panel' has to be specified.

Example
$ curl 'http://localhost:8080/api/stats?snapshot=20121227_intermediaries' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 632

{
  "responseHeader" : {
    "query" : "http://localhost:8080/api/stats?snapshot=20121227_intermediaries",
    "state" : "OK",
    "msg" : "",
    "httpStatus" : null
  },
  "content" : [ {
    "snapshot" : "20121227_intermediaries",
    "selection" : null,
    "seedsInitial" : 122,
    "seedsActual" : 157,
    "seedsCrawled" : 141,
    "seedsNotCrawled" : 16,
    "importedDomains" : 114,
    "importedHtmlDocs" : 274126,
    "importedKBytes" : -1,
    "indexedDomains" : 110,
    "indexedDocs" : 223664
  } ],
  "page" : {
    "size" : 1000,
    "number" : 0,
    "totalElements" : 1,
    "totalPages" : 1
  },
  "links" : { }
}
Table 1. Stats-Object
Name Description

snapshot

The snapshot.

selection

The selection. Might be null when no selection is present.

seedsInitial

The number of seeds which have been used as input for the crawl.

seedsActual

The number of seeds which have been used for the crawl. Includes possible redirects and seeds which could not be crawled.

seedsCrawled and seedsNotCrawled

The number of seeds which have been crawled and the number of seeds which could not be crawled, respectively.

importedDomains

The number of seeds which have actually been imported.

importedHtmlDocs

The number of imported documents with the mime-type "text/html".

importedKBytes (currently not computed)

The complete size of the imported documents. Includes possible duplicated documents.

indexedDomains

The number of domains (sites) which have been indexed.

indexedDocs

The number of indexed documents.
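
The indicators of a Stats-Object can be combined into simple derived figures. The sketch below is only an illustration: the helper names `crawl_ratio` and `imported_kbytes` are hypothetical, and treating a negative importedKBytes as "not computed" is an assumption based on the -1 in the example response.

```python
from typing import Optional


def crawl_ratio(stats: dict) -> Optional[float]:
    """Share of the actual seeds that were crawled, or None without seeds."""
    actual = stats["seedsActual"]
    return stats["seedsCrawled"] / actual if actual else None


def imported_kbytes(stats: dict) -> Optional[int]:
    """Return importedKBytes, or None when the value looks uncomputed.

    Assumption: a negative value (the example shows -1) is a placeholder.
    """
    value = stats["importedKBytes"]
    return value if value >= 0 else None


# Values copied from the example response above.
stats = {
    "snapshot": "20121227_intermediaries",
    "seedsInitial": 122,
    "seedsActual": 157,
    "seedsCrawled": 141,
    "seedsNotCrawled": 16,
    "importedKBytes": -1,
}

ratio = crawl_ratio(stats)
size = imported_kbytes(stats)
```

In the example data, seedsCrawled and seedsNotCrawled add up to seedsActual (141 + 16 = 157), so the ratio is 141 / 157.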

/api/stats/domains

Returns a list of DomainStats-Objects describing descriptive indicators for all seed-domains of a given snapshot.

Parameter Description

snapshot

Optional. The name of a snapshot. Can be specified multiple times.

panel

Optional. The name of a panel. One of 'snapshot' or 'panel' has to be specified.

Example
$ curl 'http://localhost:8080/api/stats/domains?snapshot=20121227_intermediaries&page=0&size=2' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 806

{
  "responseHeader" : {
    "query" : "http://localhost:8080/api/stats/domains?snapshot=20121227_intermediaries&page=0&size=2",
    "state" : "OK",
    "msg" : "",
    "httpStatus" : null
  },
  "content" : [ {
    "snapshot" : "20121227_intermediaries",
    "selection" : null,
    "domain" : "www.dfg.de",
    "importedHtmlDocs" : 9877,
    "importedKBytes" : 0,
    "indexedDocs" : 7623
  }, {
    "snapshot" : "20121227_intermediaries",
    "selection" : null,
    "domain" : "www.oaq.ch",
    "importedHtmlDocs" : 2613,
    "importedKBytes" : 0,
    "indexedDocs" : 1221
  } ],
  "page" : {
    "size" : 2,
    "number" : 0,
    "totalElements" : 114,
    "totalPages" : 57
  },
  "links" : {
    "next" : "http://localhost:8080/api/stats/domains?snapshot=20121227_intermediaries&page=1&size=2"
  }
}
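
Since the snapshot parameter can be specified multiple times, the query string has to repeat the key. A minimal sketch of building such a URL with the Python standard library follows; the function name `stats_domains_url` and the second snapshot name `another_snapshot` are hypothetical and only serve as an illustration.

```python
from typing import Iterable
from urllib.parse import urlencode

BASE_URL = "https://dss-wdc.wiso.uni-hamburg.de/api"


def stats_domains_url(snapshots: Iterable[str], page: int = 0, size: int = 100) -> str:
    """Build a /stats/domains URL with one "snapshot" key per snapshot."""
    # A list of pairs (instead of a dict) lets the same key appear repeatedly.
    params = [("snapshot", name) for name in snapshots]
    params += [("page", page), ("size", size)]
    return f"{BASE_URL}/stats/domains?{urlencode(params)}"


# "another_snapshot" is a made-up name used only for this example.
url = stats_domains_url(["20121227_intermediaries", "another_snapshot"], size=2)
```

The access token is not part of the URL; it still has to be sent in the "Token" header of the request.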

Embeddings

TODO:

/api/embeddings/definitions

Returns a list of EmbeddingDefinitions. An EmbeddingDefinition combines a specific embedding model and a reference to a Chunker.

Parameter Description

page

Optional. The number of the requested page.

size

Optional. The number of objects of the requested page.

Example
$ curl 'http://localhost:8080/api/embeddings/definitions?size=1' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 628

{
  "responseHeader" : {
    "query" : "http://localhost:8080/api/embeddings/definitions?size=1",
    "state" : "OK",
    "msg" : "",
    "httpStatus" : null
  },
  "content" : [ {
    "machineName" : "default",
    "modelName" : "sentence-transformers/paraphrase-MiniLM-L12-v2",
    "dimensions" : 384,
    "chunkerMachineName" : "default"
  }, {
    "machineName" : "tests",
    "modelName" : "sentence-transformers/paraphrase-MiniLM-L12-v2",
    "dimensions" : 384,
    "chunkerMachineName" : "default"
  } ],
  "page" : {
    "size" : 2,
    "number" : 0,
    "totalElements" : 2,
    "totalPages" : 1
  },
  "links" : { }
}

/api/embeddings/status

TODO:

/api/embeddings/search

Returns a list of "nearby" document chunks for the given query text, based on the selected embedding and sorted in ascending order of distance. The distance is derived from the cosine similarity.
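
This documentation does not state how exactly the distance is derived from the cosine similarity. A common convention, assumed in the sketch below, is distance = 1 - cosine similarity, so that identical directions have distance 0 and smaller values mean closer chunks. The function name `cosine_distance` is hypothetical and not part of the API.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


same = cosine_distance([1.0, 0.0], [2.0, 0.0])        # same direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])  # unrelated direction
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # opposite direction
```

Under this convention the distance ranges from 0 to 2, and a maxDistance of 0.5 would correspond to a cosine similarity of at least 0.5.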

Parameter Description

snapshot

The snapshot.

selection

The selection (currently not used).

domain

A domain to which the search should be restricted. Can be specified multiple times.

embeddingsDef

The machineName of the EmbeddingDefinition to use.

query

The text to compare with the embeddings.

maxDistance

The maximum distance of text chunks. Defaults to 0.5.

limit

The maximum number of matching chunks to return. Defaults to 1000.

minTokenCount

Only consider chunks with more than minTokenCount tokens. Defaults to 3.

page

Optional. The number of the requested page.

size

Optional. The number of objects of the requested page.

Example
$ curl 'http://localhost:8080/api/embeddings/search?snapshot=20121227_intermediaries&embeddingsDef=default&query=Nachhaltigkeit&size=2' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 893

{
  "responseHeader" : {
    "query" : "http://localhost:8080/api/embeddings/search?snapshot=20121227_intermediaries&embeddingsDef=default&query=Nachhaltigkeit&size=2",
    "state" : "OK",
    "msg" : "",
    "httpStatus" : null
  },
  "content" : [ {
    "domain" : "www.dfg.de",
    "reference" : "/dfg_magazin/index.jsp",
    "textChunk" : "Jahr der Nachhaltigkeit",
    "embedding" : [ ],
    "dist" : 0.1292587
  }, {
    "domain" : "www.dfg.de",
    "reference" : "/dfg_magazin/wissenschaft_oeffentlichkeit/index.html",
    "textChunk" : "Jahr der Nachhaltigkeit",
    "embedding" : [ ],
    "dist" : 0.1292587
  } ],
  "page" : {
    "size" : 2,
    "number" : 0,
    "totalElements" : 135,
    "totalPages" : 68
  },
  "links" : {
    "next" : "http://localhost:8080/api/embeddings/search?snapshot=20121227_intermediaries&embeddingsDef=default&query=Nachhaltigkeit&page=1&size=2"
  }
}
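
As the example shows, the same text chunk can be returned several times when it occurs in more than one document. The following sketch groups hits by their text so that each chunk is listed once with all of its locations. The function name `group_by_chunk` is hypothetical, and keeping the smallest distance per chunk is a design choice of this sketch rather than behaviour of the API.

```python
from typing import Dict, Iterable


def group_by_chunk(hits: Iterable[dict]) -> Dict[str, dict]:
    """Map each distinct textChunk to its best distance and all locations."""
    grouped: Dict[str, dict] = {}
    for hit in hits:
        entry = grouped.setdefault(
            hit["textChunk"], {"dist": hit["dist"], "locations": []}
        )
        # Keep the smallest distance seen for this chunk text.
        entry["dist"] = min(entry["dist"], hit["dist"])
        entry["locations"].append((hit["domain"], hit["reference"]))
    return grouped


# The two hits from the example response above.
hits = [
    {
        "domain": "www.dfg.de",
        "reference": "/dfg_magazin/index.jsp",
        "textChunk": "Jahr der Nachhaltigkeit",
        "embedding": [],
        "dist": 0.1292587,
    },
    {
        "domain": "www.dfg.de",
        "reference": "/dfg_magazin/wissenschaft_oeffentlichkeit/index.html",
        "textChunk": "Jahr der Nachhaltigkeit",
        "embedding": [],
        "dist": 0.1292587,
    },
]

grouped = group_by_chunk(hits)
```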
