About
The WDC-API provides a REST interface for the WebDataCollector framework and gives external applications the means to work with the collected and prepared data of the WebDataCollector.
General definitions
This API shares some common conventions and definitions which are not stated explicitly at each endpoint.
Base-URL, Authentication and Authorization
The API can be accessed at the URL https://dss-wdc.wiso.uni-hamburg.de/api.
The API can only be accessed with access tokens. An access token is passed in the HTTP header as the parameter "Token".
curl 'https://dss-wdc.wiso.uni-hamburg.de/api/snapshot/list?page=0&size=5' -i -X GET -H "Token:MyToken"
Snapshots and Panels are secured objects. A user only gets the snapshots for which access has been configured. If you think a snapshot is missing, you can check your permissions via an API call.
You need an access-token? Please get in touch with us by email. |
For brevity, the examples in this documentation omit the authentication token. In your own code you have to be authenticated. |
Please note that we log your access to the API. We use that information to identify bottlenecks and problems within the API. |
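The curl call above can be mirrored in Python with the standard library alone. The following is a minimal sketch, not an official client: the helper name `build_request` is hypothetical, and `MyToken` is the placeholder from the curl example.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://dss-wdc.wiso.uni-hamburg.de/api"


def build_request(path, token, params=None):
    """Build a GET request that carries the access token in the "Token" header."""
    url = BASE_URL + path
    if params:
        url += "?" + urlencode(params)
    return Request(url, headers={"Token": token}, method="GET")


# Same call as the curl example: first page of the snapshot list, 5 items.
request = build_request("/snapshot/list", "MyToken", {"page": 0, "size": 5})
# To send it, pass `request` to urllib.request.urlopen() and parse the body as JSON.
```

Building the request separately from sending it keeps the token handling in one place and makes it easy to inspect the URL and headers before anything goes over the network.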
Responses, Status-Codes and Paging
The WDC-API aims to produce structurally stable Response-Objects in JSON. The basic form of such a Response-Object is as follows:
{
"responseHeader" : { (1)
"query" : "...",
"state" : "OK",
"msg" : "",
"httpStatus" : "OK"
},
"content" : [ { (2)
"domainName" : "www.dfg.de"
}, {
"domainName" : "www.oaq.ch"
}, {
"domainName" : "www.europace.org"
} ],
"page" : { (3)
"size" : 3,
"number" : 0,
"totalElements" : 110,
"totalPages" : 37
},
"links" : { (4)
"next" : "https://dss-wdc.wiso.uni-hamburg.de/api/snapshot/20121227_intermediaries/domains?page=1&size=3"
}
}
1 | The responseHeader represents information about the query, the state of the response and potential messages and warnings. |
2 | The content consists of an array of objects. The type of these objects depends on the query. |
3 | The page-object gives information about the overall size of the data and details that are important for paging through large data sets. To actually consume paged resources you should use the links-object. |
4 | The links give the link for the next or the previous page. This information should be used to implement paging. If there is no next- or prev-page the property does not exist. |
The maximum number of elements in one page is 1000. Thus, if you specify a page size of 2000, it will be overridden. |
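Paging via the links-object can be sketched as a small generator. This is an illustration under assumptions: `iter_content` and the `fetch` callable are hypothetical names, and the two fake pages stand in for real API responses so that the example runs without network access.

```python
def iter_content(first_url, fetch):
    """Yield every content object, following links.next until it is absent.

    `fetch` is any callable that takes a URL and returns the parsed JSON
    response as a dict (hypothetical; plug in your own HTTP code).
    """
    url = first_url
    while url is not None:
        response = fetch(url)
        yield from response.get("content", [])
        # The "next" property does not exist on the last page.
        url = response.get("links", {}).get("next")


# Two fake pages standing in for real API responses.
FAKE_PAGES = {
    "page0": {"content": [{"domainName": "www.dfg.de"}], "links": {"next": "page1"}},
    "page1": {"content": [{"domainName": "www.oaq.ch"}], "links": {}},
}

domains = [item["domainName"] for item in iter_content("page0", FAKE_PAGES.__getitem__)]
```

Because the loop only relies on `links.next`, it keeps working if the server overrides the requested page size.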
Complex Datatypes for the API-Requests
The API defines arguments on different endpoints. Some arguments, such as the arguments for paging or SnapshotSelections, must or can be used jointly and refer to a special datatype.
Datatype | Description | Arguments |
---|---|---|
Paging |
Used to provide a means to "page" through larger results. See above. You should not use the paging directly. Instead use the prev and next-links. |
|
SnapshotSelection |
A SnapshotSelection is used to express a subset of domains in a snapshot. Various endpoints offer the possibility to work on such subsets. |
|
Integration: Access from Python
For tighter integration we publish the Python package dss.wdc_client. This package supports automatic handling of paging of large results and transforming these into JSON-Arrays or directly into DataFrames.
We highly recommend using this approach, as we develop and test this package in sync with the rest of the WDC-API.
Snapshots
A set of Endpoints to discover information about available snapshots.
/api/snapshot/list
Returns a list of Snapshots, which are accessible for the given user.
Parameter | Description |
---|---|
|
A simple filter which checks if the name contains the given String |
$ curl 'http://localhost:8080/api/snapshot/list?filter=intermediaries' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 551
{
"responseHeader" : {
"query" : "http://localhost:8080/api/snapshot/list?filter=intermediaries",
"state" : "OK",
"msg" : "",
"httpStatus" : null
},
"content" : [ {
"name" : "20121227_intermediaries",
"description" : null,
"indexed" : true,
"textExtracted" : true
}, {
"name" : "20240701_intermediaries",
"description" : null,
"indexed" : false,
"textExtracted" : false
} ],
"page" : {
"size" : 1000,
"number" : 0,
"totalElements" : 2,
"totalPages" : 1
},
"links" : { }
}
/api/snapshot/{snapshot}/domains
Returns the set of Domains included in the specified Snapshot. The information about Domains reflects the imported status of a crawl.
Information can only be obtained about crawled domains (Seeds).
Parameter | Description |
---|---|
|
The number of the requested page. |
|
The number of objects of the requested page. |
$ curl 'http://localhost:8080/api/snapshot/20121227_intermediaries/domains?page=0&size=2' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 587
{
"responseHeader" : {
"query" : "http://localhost:8080/api/snapshot/20121227_intermediaries/domains?page=0&size=2",
"state" : "OK",
"msg" : "",
"httpStatus" : null
},
"content" : [ {
"domainName" : "www.cpu.fr",
"type" : "SEED",
"pages" : 10008
}, {
"domainName" : "www.srhe.ac.uk",
"type" : "SEED",
"pages" : 9995
} ],
"page" : {
"size" : 2,
"number" : 0,
"totalElements" : 114,
"totalPages" : 57
},
"links" : {
"next" : "http://localhost:8080/api/snapshot/20121227_intermediaries/domains?page=1&size=2"
}
}
/api/snapshot/{snapshot}/seeds
Returns concise information about seeds, including their crawled status and possible redirects.
The information of seeds is generated from the Heritrix seed-reports. Status codes can be found here: https://heritrix.readthedocs.io/en/latest/glossary.html#status-codes |
Fields of one seed-item:
Field | Description |
---|---|
httpStatusCode |
Extended httpStatusCode for the current uri |
status |
A more human-readable status code |
uri |
The actual URI. |
redirectsTo |
A possible redirect. Returns "null" if there was no redirect. Please note that such a redirect creates a following seed-item, which in turn could create another redirect. |
$ curl 'http://localhost:8080/api/snapshot/20121227_intermediaries/seeds?size=3' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 791
{
"responseHeader" : {
"query" : "http://localhost:8080/api/snapshot/20121227_intermediaries/seeds?size=3",
"state" : "OK",
"msg" : "",
"httpStatus" : null
},
"content" : [ {
"httpStatusCode" : -6,
"status" : "NOTCRAWLED",
"uri" : "http://www.esib.org/",
"redirectsTo" : null
}, {
"httpStatusCode" : -6,
"status" : "NOTCRAWLED",
"uri" : "http://www.forum.eua.be/",
"redirectsTo" : null
}, {
"httpStatusCode" : -6,
"status" : "NOTCRAWLED",
"uri" : "http://www.www2.esf.org/",
"redirectsTo" : null
} ],
"page" : {
"size" : 3,
"number" : 0,
"totalElements" : 157,
"totalPages" : 53
},
"links" : {
"next" : "http://localhost:8080/api/snapshot/20121227_intermediaries/seeds?page=1&size=3"
}
}
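Since a redirect creates a following seed-item that may itself redirect, a chain can be reconstructed from the `uri` and `redirectsTo` fields. The sketch below is illustrative only: `follow_redirects` is a hypothetical helper, and the `example.org` items are made-up data that carry just the two fields the function needs.

```python
def follow_redirects(seed_items, start_uri):
    """Return the chain of URIs reached from start_uri via redirectsTo."""
    by_uri = {item["uri"]: item for item in seed_items}
    chain = [start_uri]
    seen = {start_uri}
    current = by_uri.get(start_uri)
    while current is not None and current["redirectsTo"] is not None:
        target = current["redirectsTo"]
        if target in seen:  # guard against redirect loops
            break
        chain.append(target)
        seen.add(target)
        current = by_uri.get(target)
    return chain


# Made-up seed-items; real items also carry httpStatusCode and status.
seeds = [
    {"uri": "http://example.org/", "redirectsTo": "http://www.example.org/"},
    {"uri": "http://www.example.org/", "redirectsTo": "https://www.example.org/"},
    {"uri": "https://www.example.org/", "redirectsTo": None},
]

chain = follow_redirects(seeds, "http://example.org/")
```

The last element of the chain is the URI that was not redirected any further.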
/api/snapshot/{snapshot}/searchDomains
Queries the SearchIndex of the crawled documents with a given Query and returns a list of hits in each domain. Only domains which actually have at least one hit are returned.
The number of hits of a domain is calculated as the sum of hits in each document. Internally, a faceted SolrQuery of the index is created which uses facet.method=fc (see https://solr.apache.org/guide/solr/latest/query-guide/faceting.html). |
Parameter | Description |
---|---|
|
A query to search for. Can be an arbitrary Solr-Query. |
|
Optional. A machineName of a Selection. If specified only results of Domains in the Selection will be returned. |
$ curl 'http://localhost:8080/api/snapshot/20121227_intermediaries/searchDomains?query=uni&size=2' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 566
{
"responseHeader" : {
"query" : "http://localhost:8080/api/snapshot/20121227_intermediaries/searchDomains?query=uni&size=2",
"state" : "OK",
"msg" : "",
"httpStatus" : null
},
"content" : [ {
"domainName" : "www.acquin.org",
"hits" : 2386
}, {
"domainName" : "www.che.de",
"hits" : 1476
} ],
"page" : {
"size" : 2,
"number" : 0,
"totalElements" : 51,
"totalPages" : 26
},
"links" : {
"next" : "http://localhost:8080/api/snapshot/20121227_intermediaries/searchDomains?query=uni&page=1&size=2"
}
}
Selections
A Selection represents a subset of Domains of a Snapshot. They can be used as a filter for various endpoints.
Filtering on Selections is done on a best-effort basis. Take a search request as an example: the filter includes all search results whose domain ends with a domain in the selection. This is necessary to include search results of redirected crawled data. Yet this simple approach might lead to undesired results: |
Domain in Selection | Domain in Seed-List | Crawled, indexed Domain | Matches |
---|---|---|---|
bimid.de |
bimid.de |
www.bimid.de |
true |
www.tageszeitung.de |
www.tageszeitung.de |
www.taz.de |
false |
The filtering of Selections will be reworked to use a more sophisticated strategy based on the redirect data, which will also match the second case.
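The matching rule described above can be reproduced with a plain suffix comparison. This is only a sketch of the described behaviour: the server-side implementation is not shown in this documentation, and `matches_selection` is a hypothetical name.

```python
def matches_selection(indexed_domain, selection_domains):
    """True if the indexed domain ends with any domain of the selection."""
    return any(indexed_domain.endswith(domain) for domain in selection_domains)


# The two rows of the table above.
first_case = matches_selection("www.bimid.de", ["bimid.de"])
second_case = matches_selection("www.taz.de", ["www.tageszeitung.de"])
```

The second case fails because the crawled domain shares no suffix with the selected domain, even though the seed redirects to it.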
/api/selection/list
Returns a list of available selections.
Parameter | Description |
---|---|
|
The number of the requested page. |
|
The number of objects of the requested page. |
$ curl 'http://localhost:8080/api/selection/list' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 524
{
"responseHeader" : {
"query" : "http://localhost:8080/api/selection/list",
"state" : "OK",
"msg" : "",
"httpStatus" : null
},
"content" : [ {
"machineName" : "createWithSelection",
"title" : "the title"
}, {
"machineName" : "SelectionControllerTest.SET",
"title" : ""
}, {
"machineName" : "wdc.crawler.test.SelectionServiceTest#set",
"title" : ""
} ],
"page" : {
"size" : 1000,
"number" : 0,
"totalElements" : 3,
"totalPages" : 1
},
"links" : { }
}
/api/selection/{selection}/domains
Returns the set of Domains included in the specified Selection.
Parameter | Description |
---|---|
|
The number of the requested page. |
|
The number of objects of the requested page. |
$ curl 'http://localhost:8080/api/selection/createWithSelection/domains' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 963
{
"responseHeader" : {
"query" : "http://localhost:8080/api/selection/createWithSelection/domains",
"state" : "OK",
"msg" : "",
"httpStatus" : null
},
"content" : [ {
"name" : "www.esf.org"
}, {
"name" : "www.oecd.org"
}, {
"name" : "www.nuffic.nl"
}, {
"name" : "www.eua.be"
}, {
"name" : "www.enqa.eu"
}, {
"name" : "www.eqar.eu"
}, {
"name" : "www.inqaahe.org"
}, {
"name" : "www.esmu.be"
}, {
"name" : "www.eaie.org"
}, {
"name" : "www.britishcouncil.org"
}, {
"name" : "eacea.ec.europa.eu"
}, {
"name" : "www.chea.org"
}, {
"name" : "www.aca-secretariat.be"
}, {
"name" : "www.iau-aiu.net"
}, {
"name" : "www.iie.org"
}, {
"name" : "www.aucc.ca"
}, {
"name" : "www.aau.org"
}, {
"name" : "www.nafsa.org"
} ],
"page" : {
"size" : 1000,
"number" : 0,
"totalElements" : 18,
"totalPages" : 1
},
"links" : { }
}
/api/selection/{selection}/set (beta)
Sets the Domains of the Selection. If the Selection does not exist, it will be created.
This Endpoint is still in evaluation and will not be useful for "normal" users. As a normal user you won’t be able to edit a newly created Selection. |
Selections are SecuredObjects, so changes to them are secured. To edit an existing selection you have to make sure you have the corresponding access rights. |
$ curl 'http://localhost:8080/api/selection/SelectionControllerTest.SET/set' -i -X PUT \
-H 'Content-Type: text/plain' \
-d ' www.eua.be
www.oecd.org
www.enqa.eu
'
HTTP/1.1 201 Created
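The PUT request above can be prepared in Python as follows. This is a hedged sketch: `build_set_request` is a hypothetical helper, the body format (one domain per line, plain text) is inferred from the curl example, and the endpoint is still in beta.

```python
from urllib.parse import quote
from urllib.request import Request


def build_set_request(base_url, selection, domains, token):
    """Build a PUT request that sends the domains as a text/plain body."""
    body = "\n".join(domains) + "\n"  # one domain per line (assumed format)
    return Request(
        f"{base_url}/selection/{quote(selection, safe='')}/set",
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/plain", "Token": token},
        method="PUT",
    )


set_request = build_set_request(
    "http://localhost:8080/api",
    "SelectionControllerTest.SET",
    ["www.eua.be", "www.oecd.org", "www.enqa.eu"],
    "MyToken",
)
# To send it, pass `set_request` to urllib.request.urlopen().
```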
Panels
A set of Endpoints to discover information about available Panels.
/api/panel/list
Returns a list of Panels, which are accessible for the given user.
$ curl 'http://localhost:8080/api/panel/list' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 368
{
"responseHeader" : {
"query" : "http://localhost:8080/api/panel/list",
"state" : "OK",
"msg" : "",
"httpStatus" : null
},
"content" : [ {
"name" : "intermediaries",
"description" : null,
"snapshotCount" : 1
} ],
"page" : {
"size" : 1000,
"number" : 0,
"totalElements" : 1,
"totalPages" : 1
},
"links" : { }
}
/api/panel/{name}/list
Returns the set of Snapshots included in the specified Panel.
The list of returned Snapshots is secured and filtered with your access rules. |
Parameter | Description |
---|---|
|
Optional. The number of the requested page. |
|
Optional. The number of objects of the requested page. |
$ curl 'http://localhost:8080/api/panel/intermediaries/list?page=0&size=1' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 428
{
"responseHeader" : {
"query" : "http://localhost:8080/api/panel/intermediaries/list?page=0&size=1",
"state" : "OK",
"msg" : "",
"httpStatus" : null
},
"content" : [ {
"name" : "20121227_intermediaries",
"description" : null,
"indexed" : true,
"textExtracted" : true
} ],
"page" : {
"size" : 1,
"number" : 0,
"totalElements" : 1,
"totalPages" : 1
},
"links" : { }
}
Index
Internally the documents (pages) of a Snapshot are indexed using a full-text search-engine.
For search requests based on documents you can use the fields in the table below:
Fieldname | Description |
---|---|
domain_s |
The domain of the web site hosting this page |
path_s |
The complete path of the page, including query parameters |
title_t |
The title of the web page. |
description_t |
A description extracted from the page. |
language_s |
The language of the identified text |
_text_ |
The hidden field for the full-text. It can be queried but the values are not stored in the index. If you need full-texts please use the appropriate API-Call. Normally, you do not have to state this field in queries. |
CSR (1)
"Corporate Social Responsibility" (2)
language_s:de && "Corporate Social Responsibility" (3)
language_s:en && path_s:"/about" (4)
1 | Searches for a term in the field _text_. |
2 | Searches for the phrase "Corporate Social Responsibility". Use " to combine single words to a longer phrase. |
3 | Same as above, but searches only in German documents. |
4 | Searches for pages with the specified path in English documents. |
Notes and complete Solr-Query-Syntax
As you have probably noticed, the examples do not contain information about a snapshot or panel. This information is added to your query automatically to make sure that only reasonable queries can be submitted to the API. The index is currently backed by Solr, so you can use the Solr query syntax to query the full-text index. For further reference, please consult the original documentation of Solr: |
/api/index/status
Provides an overview of indexed documents per Snapshot and, if specified, per Selection.
Parameter | Description |
---|---|
|
Optional. The name of a snapshot. Can be specified multiple times. |
|
Optional. The name of a panel. One of 'snapshot' or 'panel' has to be specified. |
|
Optional. The name of a selection to filter the results. |
$ curl 'http://localhost:8080/api/index/status?snapshot=20121227_intermediaries' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 412
{
"responseHeader" : {
"query" : "http://localhost:8080/api/index/status?snapshot=20121227_intermediaries",
"state" : "OK",
"msg" : "",
"httpStatus" : null
},
"content" : [ {
"snapshot" : "20121227_intermediaries",
"selection" : "",
"indexedDocs" : 223664
} ],
"page" : {
"size" : 1,
"number" : 0,
"totalElements" : 1,
"totalPages" : 1
},
"links" : { }
}
Texts
Texts of web pages are prepared in various ways. This chapter describes what actually happens to those texts and how you can access this information.
/api/texts/search
Returns a subset of pages with extracted text.
Parameter | Description |
---|---|
|
The name of the snapshot |
|
Optional. The machine-name of the selection. |
|
The Solr-Query which is used to search for the pages |
|
Optional. If 'true' includes the extracted text. Please note, that texts are not extracted on all Snapshots. |
|
Optional. Used for debugging. Abbreviates exported texts. |
|
Optional. The number of the requested page. |
|
Optional. The number of objects of the requested page. |
$ curl 'http://localhost:8080/api/texts/search?snapshot=20121227_intermediaries&query=news&textsInclude=true&textsAbbreviate=true&size=2' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 1650
{
"responseHeader" : {
"query" : "http://localhost:8080/api/texts/search?snapshot=20121227_intermediaries&query=news&textsInclude=true&textsAbbreviate=true&size=2",
"state" : "OK",
"msg" : "solrQuery: q=news&q.op=OR&fq=snapshot_id_i:+1&sort=domain_id_i+asc,reference_id_i+asc&start=0&rows=2",
"httpStatus" : null
},
"content" : [ {
"snapshot" : "20121227_intermediaries",
"language" : "en",
"textId" : 43,
"domain" : "www.dfg.de",
"path" : "/en/index.jsp",
"textInfo" : {
"title" : "DFG, German Research Foundation",
"description" : null,
"text" : "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDFG, German Research Foundation\n\n\r\n\r\n\r\n \r\n\r\n ...",
"contentType" : "text/plain",
"language" : "en"
}
}, {
"snapshot" : "20121227_intermediaries",
"language" : "de",
"textId" : 95,
"domain" : "www.dfg.de",
"path" : "/dfg_profil/geschaeftsstelle/dfg_praesenz_ausland/beijing/index.jsp",
"textInfo" : {
"title" : "DFG - Deutsche Forschungsgemeinschaft - Chinesisch-Deutsches Zentrum für Wissenschaftsförderung Beijing",
"description" : null,
"text" : "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDFG - Deutsche Forschungsgemeinschaft - Chinesisch-...",
"contentType" : "text/plain",
"language" : "de"
}
} ],
"page" : {
"size" : 2,
"number" : 0,
"totalElements" : 75772,
"totalPages" : 37886
},
"links" : {
"next" : "http://localhost:8080/api/texts/search?snapshot=20121227_intermediaries&query=news&textsInclude=true&textsAbbreviate=true&page=1&size=2"
}
}
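Solr queries such as those in the Index section contain spaces, quotes and `&&`, so they have to be URL-encoded before they are sent. The sketch below builds the request URL with `urllib.parse.urlencode`; `texts_search_url` is a hypothetical helper, and the parameter names are the ones that appear in the curl example above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def texts_search_url(base_url, snapshot, query, size=2, texts_include=False):
    """Build a /texts/search URL with a safely encoded Solr query."""
    params = {"snapshot": snapshot, "query": query, "size": size}
    if texts_include:
        params["textsInclude"] = "true"
    return f"{base_url}/texts/search?{urlencode(params)}"


solr_query = 'language_s:de && "Corporate Social Responsibility"'
url = texts_search_url("http://localhost:8080/api", "20121227_intermediaries", solr_query)

# Decoding the query string again returns the original Solr query.
decoded = parse_qs(urlsplit(url).query)
```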
/api/texts/get
Returns extracted text from a set of given text-ids.
Parameter | Description |
---|---|
|
The name of the snapshot |
|
Id of the text. Can be specified multiple times. |
$ curl 'http://localhost:8080/api/texts/get?snapshot=20121227_intermediaries&id=9094&id=9095' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 7093
{
"responseHeader" : {
"query" : "http://localhost:8080/api/texts/get?snapshot=20121227_intermediaries&id=9094&id=9095",
"state" : "OK",
"msg" : "",
"httpStatus" : null
},
"content" : [ {
"snapshot" : "20121227_intermediaries",
"language" : "en",
"textId" : 9094,
"textInfo" : {
"title" : "Making Conferences Greener : European Science Foundation",
"description" : null,
"text" : "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nMaking Conferences Greener : European Science Foundation\n\n\n\n\n\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\tBookmark this pageFAQMember pagesRSSSitemapSubscribe\n\n\t\t\t\t\t\t\t\t\t\n\t\t\t \n \n \n \n\t\t\n\n\t\t\n\t\t\t\n\t\t\n\n\n\n\t\t\t\t\t\t\n\n\t\t\t\t\t\n\n\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\n\n\t\t\t\t\t\n\n\t\t\t\t\n\n\t\t\t\t\tHome\n\tAbout ESF\n\tActivities\n\tResearch Areas\n\tPublications\n\tMedia Centre\n\tJobs\n\tContact\n\n\n\n\n\t\t\t\tHome > Activities > ESF Research Conferences > Making Conferences Greener\n\n\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\tESF Research Conferences\n\n\n\t\t\t\t\t\t\tMaking Conferences Greener: A project of the ESF Conferences Unit \n\n\n\n\n\t\n\n\nWe are proud to inform you that the ESF Research Conferences Forest has become bigger! \r\n\n3777 exemplars of Moringa trees will join the 5833 ones that we planted last year.\nThey will be planted in Dosso (Niger) to face the desertification of the area and to offset 1020 tons of CO2 during their life span.\r\n\nMore information will be posted soon!\n\r\n\n** Autumn update! Read here the Latest News! **\n\n\n\n\n\n\t\n\n\nA 140m deep water well was drilled during the last months and its electric pump at a depth of 80m to pump the water up to the surface is run by soler panels. Click here to read the whole report and look at the picture.\n\nThe two hectares of moringa trees planted last year gave a good weekly leaf harvests over the summer and the products were sold right away at the market in the nearby city of Dosso. Do you wonder what does the moringa leaf taste like? Click here to discover it! \n\n\n\n\n\n\t\n\n\nThe idea to turn our conferences ‘green’ started in 2009 and involves various aspects of the conference organisation. 
In order to conform to UNEP recommendations, amongst others, we are working on making our conferences more sustainable and environmentally friendly. This will be an ongoing project that will affect our participants, our material, our venues and our office. This page aims to keep our attendees informed about the status of the project and further green activities.\r\n\nWe would also like to encourage participants to use this page to write tips and recommendations that will help to make our green project successful.\n\tResearch Conferences Forest and Green Fee\n\tGreen Travel\n\tGreen Venues\n\tGreen Material\n\tGreen Office\n\n\n\n\n\n\n\n\n\n\n\t\t\t\t\t\t\n\n\n\n\t\t\t\t\t\t\tEuroBioFund\n\tEUROCORES\n\tExploratory Workshops\n\tForward Looks\n\tCalls and Funding\n\tMO Fora\n\tResearch Networking Programmes\n\tESF Research Conferences\tUpcoming Events\n\tNews\n\tCall for Proposals\n\tPartnerships\n\tMaking Conferences Greener\tResearch Conferences Forest\n\tGreen Travel\n\n\n\tSponsor Resource Center\n\tVenues\n\tPublications\n\tRestricted Pages\n\tContacts\n\tFAQ\n\tPast Events\n\tSearch\n\tOther Meetings \n\tConferences Email Alerts\n\n\n\tScience Policy\n\tESF Meetings\n\tEuropean Latsis Prize 2012\n\tPeer Review\n\tESF Symposia\n\tESF at ESOF 2012 Dublin\n\n\n\n\n\n\t\t\t\t\t\n\n\t\t\t\t\n\n\t\t\t\tData protection | Disclaimer\n\n© 2012 European Science Foundation - page last updated: 20.12.2011\n\nESF provides the scientific, administrative and technical secretariat for COST (European Cooperation in Science and Technology).\n\n\n\n\n\t\t\t\n\n\t\t\n\n\t\t\n\t\t\n\t\n\n\n\n\n\n\n",
"contentType" : "text/plain",
"language" : "en"
}
}, {
"snapshot" : "20121227_intermediaries",
"language" : "en",
"textId" : 9095,
"textInfo" : {
"title" : "Forward Looks : European Science Foundation",
"description" : null,
"text" : "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nForward Looks : European Science Foundation\n\n\n\n\n\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\tBookmark this pageFAQMember pagesRSSSitemapSubscribe\n\n\t\t\t\t\t\t\t\t\t\n\t\t\t \n \n \n \n\t\t\n\n\t\t\n\t\t\t\n\t\t\n\n\n\n\t\t\t\t\t\t\n\n\t\t\t\t\t\n\n\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\n\n\t\t\t\t\t\n\n\t\t\t\t\n\n\t\t\t\t\tHome\n\tAbout ESF\n\tActivities\n\tResearch Areas\n\tPublications\n\tMedia Centre\n\tJobs\n\tContact\n\n\n\n\n\t\t\t\tHome > Activities > Forward Looks\n\n\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\tForward Looks\n\n\n\t\t\t\t\t\tThe flagship activity of ESF’s strategic arm, Forward Looks enable Europe’s scientific community, in interaction with policy makers, to develop medium to long-term views and analyses of future research developments with the aim of defining research agendas at national and European level. Forward Looks are driven by ESF’s Member Organisations and, by extension, the European research community. Quality assurance mechanisms, based on peer review where appropriate, are applied at every stage of the development and delivery of a Forward Look to ensure its quality and impact. 
\n\nPlease see the left navigational bar for all current Forward Looks.\n\nFor enquiries about Forward Looks please contact:\n\n\tMs.LauraMarinE-Mail\n\tScience Officer MOs Relations & Partnerships\n\n\tMs.MadeliseBlumenroederE-Mail\n\tSenior Administrator\n\n\nPlease click here to see all our Forward Look reports \n\n\t\n\n\t\n\n\n\n\n\n\n\n\n\n\t\t\t\t\t\t \n\n\t\t\t\t\t\t\tEuroBioFund\n\tEUROCORES\n\tExploratory Workshops\n\tForward Looks\tNews\n\tAll Current and Completed Forward Looks\n\tSpace Sciences (SSU)\n\tHumanities (SCH)\n\tLife, Earth and Environmental Sciences (LESC)\n\tMedical Sciences (EMRC)\n\tPhysical and Engineering Sciences (PESC)\n\tSocial Sciences (SCSS)\n\tWorkshop scheme\n\n\n\tCalls and Funding\n\tMO Fora\n\tResearch Networking Programmes\n\tESF Research Conferences\n\tScience Policy\n\tESF Meetings\n\tEuropean Latsis Prize 2012\n\tPeer Review\n\tESF Symposia\n\tESF at ESOF 2012 Dublin\n\n\n\n\n\n\t\t\t\t\t\n\n\t\t\t\t\n\n\t\t\t\tData protection | Disclaimer\n\n© 2012 European Science Foundation - page last updated: 26.12.2012\n\nESF provides the scientific, administrative and technical secretariat for COST (European Cooperation in Science and Technology).\n\n\n\n\n\t\t\t\n\n\t\t\n\n\t\t\n\t\t\n\t\n\n\n\n\n\n\n",
"contentType" : "text/plain",
"language" : "en"
}
} ],
"page" : {
"size" : 2,
"number" : 0,
"totalElements" : 2,
"totalPages" : 1
},
"links" : { }
}
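Parameters that can be specified multiple times, such as the text ids above, are repeated in the query string. With `urlencode(..., doseq=True)` a Python list is expanded into repeated parameters; `texts_get_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def texts_get_url(base_url, snapshot, text_ids):
    """Build a /texts/get URL with one id parameter per text id."""
    params = {"snapshot": snapshot, "id": list(text_ids)}
    # doseq=True turns the list into id=...&id=... instead of a single value.
    return f"{base_url}/texts/get?{urlencode(params, doseq=True)}"


get_url = texts_get_url("http://localhost:8080/api", "20121227_intermediaries", [9094, 9095])
```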
DomainGraph
A set of Endpoints to export already prepared DomainGraphs. DomainGraphs represent the graph of Domains (Nodes) and their linkages (Edges). Edges additionally have a weight to count how often one Domain links to another.
The API for DomainGraphs uses a Node-Edge representation. Thus you have to use 2 calls to the API to get the data for a DomainGraph.
There may be multiple different DomainGraphs for one Snapshot. DomainGraphs may be created from Variants and Selections.
As for the Variants, currently only one Variant, ONLY_SEEDS, exists:
- ONLY_SEEDS: Contains all nodes and edges from the crawled Snapshot.
Currently new DomainGraphs can only be created from the backend. |
/api/domaingraph/list
A list with all existing DomainGraphs.
Parameter | Description |
---|---|
|
Optional. The name of a snapshot. Can be specified multiple times. |
|
Optional. The name of a panel. |
|
Optional. The machine-name of the selection. If not specified does not filter the result. |
|
Optional. The variant of the DomainGraph. Currently only 'ONLY_SEEDS' is supported. If not specified, does not filter the result. |
|
The number of the requested page. |
|
The number of objects of the requested page. |
$ curl 'http://localhost:8080/api/domaingraph/list?page=0&size=3' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 431
{
"responseHeader" : {
"query" : "http://localhost:8080/api/domaingraph/list?page=0&size=3",
"state" : "OK",
"msg" : "",
"httpStatus" : null
},
"content" : [ {
"id" : 16,
"snapshotName" : "20121227_intermediaries",
"variant" : "ONLY_SEEDS",
"selectionMachineName" : null
} ],
"page" : {
"size" : 3,
"number" : 0,
"totalElements" : 1,
"totalPages" : 1
},
"links" : { }
}
/api/domaingraph/{id}/nodes
A list with all nodes in the referenced DomainGraph.
Parameter | Description |
---|---|
|
The number of the requested page. |
|
The number of objects of the requested page. |
$ curl 'http://localhost:8080/api/domaingraph/16/nodes?page=0&size=3' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 892
{
"responseHeader" : {
"query" : "http://localhost:8080/api/domaingraph/16/nodes?page=0&size=3",
"state" : "OK",
"msg" : "",
"httpStatus" : null
},
"content" : [ {
"id" : "www.dfg.de",
"url" : "www.dfg.de",
"type" : "SEED",
"indegree" : 8,
"outdegree" : 4,
"degree" : 12,
"outdegree_seeds" : 4
}, {
"id" : "www.oaq.ch",
"url" : "www.oaq.ch",
"type" : "SEED",
"indegree" : 10,
"outdegree" : 20,
"degree" : 30,
"outdegree_seeds" : 20
}, {
"id" : "www.europace.org",
"url" : "www.europace.org",
"type" : "SEED",
"indegree" : 5,
"outdegree" : 6,
"degree" : 11,
"outdegree_seeds" : 6
} ],
"page" : {
"size" : 3,
"number" : 0,
"totalElements" : 113,
"totalPages" : 38
},
"links" : {
"next" : "http://localhost:8080/api/domaingraph/16/nodes?page=1&size=3"
}
}
/api/domaingraph/{id}/edges
A list with all edges in the referenced DomainGraph.
Parameter | Description |
---|---|
|
The number of the requested page. |
|
The number of objects of the requested page. |
$ curl 'http://localhost:8080/api/domaingraph/16/edges?page=0&size=3' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 637
{
"responseHeader" : {
"query" : "http://localhost:8080/api/domaingraph/16/edges?page=0&size=3",
"state" : "OK",
"msg" : "",
"httpStatus" : null
},
"content" : [ {
"source" : "www.dfg.de",
"target" : "www.esf.org",
"weight" : 20
}, {
"source" : "www.dfg.de",
"target" : "erc.europa.eu",
"weight" : 6
}, {
"source" : "www.dfg.de",
"target" : "www.ciee.org",
"weight" : 2
} ],
"page" : {
"size" : 3,
"number" : 0,
"totalElements" : 1126,
"totalPages" : 376
},
"links" : {
"next" : "http://localhost:8080/api/domaingraph/16/edges?page=1&size=3"
}
}
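Once the nodes and edges have been retrieved with the two calls, they can be combined into one structure on the client side. The following sketch builds a weighted adjacency dictionary from the `content` arrays; `build_adjacency` is a hypothetical helper, and the sample data is taken from the example responses above.

```python
def build_adjacency(nodes, edges):
    """Map each source domain to a dict of {target domain: weight}."""
    adjacency = {node["id"]: {} for node in nodes}
    for edge in edges:
        # Targets may be missing from the node list, e.g. on a partial page.
        adjacency.setdefault(edge["source"], {})[edge["target"]] = edge["weight"]
    return adjacency


nodes = [{"id": "www.dfg.de"}, {"id": "www.oaq.ch"}, {"id": "www.europace.org"}]
edges = [
    {"source": "www.dfg.de", "target": "www.esf.org", "weight": 20},
    {"source": "www.dfg.de", "target": "erc.europa.eu", "weight": 6},
    {"source": "www.dfg.de", "target": "www.ciee.org", "weight": 2},
]

graph = build_adjacency(nodes, edges)
# Sum of link weights leaving www.dfg.de in this three-edge sample.
dfg_link_weight = sum(graph["www.dfg.de"].values())
```

The same dictionary can be handed to a graph library of your choice if you need more than simple lookups.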
Statistics
These endpoints provide access to basic statistical information about snapshots (and panels).
/api/stats
Returns a list of Stats-Objects describing various descriptive indicators of snapshots.
Parameter | Description |
---|---|
|
Optional. The name of a snapshot. Can be specified multiple times. |
|
Optional. The name of a panel. One of 'snapshot' or 'panel' has to be specified. |
$ curl 'http://localhost:8080/api/stats?snapshot=20121227_intermediaries' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 632
{
"responseHeader" : {
"query" : "http://localhost:8080/api/stats?snapshot=20121227_intermediaries",
"state" : "OK",
"msg" : "",
"httpStatus" : null
},
"content" : [ {
"snapshot" : "20121227_intermediaries",
"selection" : null,
"seedsInitial" : 122,
"seedsActual" : 157,
"seedsCrawled" : 141,
"seedsNotCrawled" : 16,
"importedDomains" : 114,
"importedHtmlDocs" : 274126,
"importedKBytes" : -1,
"indexedDomains" : 110,
"indexedDocs" : 223664
} ],
"page" : {
"size" : 1000,
"number" : 0,
"totalElements" : 1,
"totalPages" : 1
},
"links" : { }
}
Name | Description |
---|---|
snapshot |
The snapshot. |
selection |
The selection. Might be null when no Selection is present. |
seedsInitial |
The number of seeds which have been used as input for the crawl. |
seedsActual |
The number of seeds which have been used for the crawl. Includes possible redirects and seeds which could not be crawled. |
seedsCrawled and seedsNotCrawled |
The number of seeds which have been crawled and the number of seeds which could not be crawled. |
importedDomains |
The number of seeds which have been actually imported. |
importedHtmlDocs |
The number of imported documents with the mime-type "text/html" |
importedKBytes (currently not computed) |
The complete size of the imported documents. Includes possible duplicated documents. |
indexedDomains |
The number of sites which have been indexed. |
indexedDocs |
The number of indexed documents. |
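The indicators can be combined into simple derived figures. The sketch below uses the values of the example response above; the helper names are hypothetical, and the seed-count check is an observation about this example, not a documented guarantee.

```python
# Values copied from the example response for 20121227_intermediaries.
stats = {
    "seedsActual": 157,
    "seedsCrawled": 141,
    "seedsNotCrawled": 16,
    "importedHtmlDocs": 274126,
    "indexedDocs": 223664,
}


def indexed_share(stats_object):
    """Fraction of imported HTML documents that ended up in the index."""
    return stats_object["indexedDocs"] / stats_object["importedHtmlDocs"]


def seeds_add_up(stats_object):
    """Check whether crawled plus not-crawled seeds equal the actual seeds."""
    crawled = stats_object["seedsCrawled"]
    not_crawled = stats_object["seedsNotCrawled"]
    return crawled + not_crawled == stats_object["seedsActual"]


share = indexed_share(stats)
consistent = seeds_add_up(stats)
```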
/api/stats/domains
Returns a list of DomainStats-Objects describing descriptive indicators for all seed-domains of a given snapshot.
Parameter | Description |
---|---|
|
Optional. The name of a snapshot. Can be specified multiple times. |
|
Optional. The name of a panel. One of 'snapshot' or 'panel' has to be specified. |
$ curl 'http://localhost:8080/api/stats/domains?snapshot=20121227_intermediaries&page=0&size=2' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 806
{
"responseHeader" : {
"query" : "http://localhost:8080/api/stats/domains?snapshot=20121227_intermediaries&page=0&size=2",
"state" : "OK",
"msg" : "",
"httpStatus" : null
},
"content" : [ {
"snapshot" : "20121227_intermediaries",
"selection" : null,
"domain" : "www.dfg.de",
"importedHtmlDocs" : 9877,
"importedKBytes" : 0,
"indexedDocs" : 7623
}, {
"snapshot" : "20121227_intermediaries",
"selection" : null,
"domain" : "www.oaq.ch",
"importedHtmlDocs" : 2613,
"importedKBytes" : 0,
"indexedDocs" : 1221
} ],
"page" : {
"size" : 2,
"number" : 0,
"totalElements" : 114,
"totalPages" : 57
},
"links" : {
"next" : "http://localhost:8080/api/stats/domains?snapshot=20121227_intermediaries&page=1&size=2"
}
}
Embeddings
TODO:
/api/embeddings/definitions
Returns a list of EmbeddingDefinitions. An EmbeddingDefinition combines a specific embedding model and a reference to a Chunker.
Parameter | Description |
---|---|
|
Optional. The number of the requested page. |
|
Optional. The number of objects of the requested page. |
$ curl 'http://localhost:8080/api/embeddings/definitions?size=1' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 628
{
"responseHeader" : {
"query" : "http://localhost:8080/api/embeddings/definitions?size=1",
"state" : "OK",
"msg" : "",
"httpStatus" : null
},
"content" : [ {
"machineName" : "default",
"modelName" : "sentence-transformers/paraphrase-MiniLM-L12-v2",
"dimensions" : 384,
"chunkerMachineName" : "default"
}, {
"machineName" : "tests",
"modelName" : "sentence-transformers/paraphrase-MiniLM-L12-v2",
"dimensions" : 384,
"chunkerMachineName" : "default"
} ],
"page" : {
"size" : 2,
"number" : 0,
"totalElements" : 2,
"totalPages" : 1
},
"links" : { }
}
/api/embeddings/status
TODO:
/api/embeddings/search
Creates a list of "nearby" chunks of documents based on the used embedding, sorted by ascending distance. Cosine similarity is used as the distance measure.
Parameter | Description |
---|---|
|
The snapshot |
|
The selection. (currently not used) |
|
A Domain on which the search should be restricted. Can be specified multiple times. |
|
The machineName of the EmbeddingsDef |
|
The text for comparing with embeddings |
|
The maximum distance of text chunks. Defaults to 0.5 |
|
The limit of matching Chunks to return. Defaults to 1000 |
|
Only consider Chunks with more than minTokensCount. Defaults to 3 |
|
Optional. The number of the requested page. |
|
Optional. The number of objects of the requested page. |
$ curl 'http://localhost:8080/api/embeddings/search?snapshot=20121227_intermediaries&embeddingsDef=default&query=Nachhaltigkeit&size=2' -i -X GET
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 893
{
"responseHeader" : {
"query" : "http://localhost:8080/api/embeddings/search?snapshot=20121227_intermediaries&embeddingsDef=default&query=Nachhaltigkeit&size=2",
"state" : "OK",
"msg" : "",
"httpStatus" : null
},
"content" : [ {
"domain" : "www.dfg.de",
"reference" : "/dfg_magazin/index.jsp",
"textChunk" : "Jahr der Nachhaltigkeit",
"embedding" : [ ],
"dist" : 0.1292587
}, {
"domain" : "www.dfg.de",
"reference" : "/dfg_magazin/wissenschaft_oeffentlichkeit/index.html",
"textChunk" : "Jahr der Nachhaltigkeit",
"embedding" : [ ],
"dist" : 0.1292587
} ],
"page" : {
"size" : 2,
"number" : 0,
"totalElements" : 135,
"totalPages" : 68
},
"links" : {
"next" : "http://localhost:8080/api/embeddings/search?snapshot=20121227_intermediaries&embeddingsDef=default&query=Nachhaltigkeit&page=1&size=2"
}
}
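To give a feeling for the `dist` values, the sketch below computes a cosine-based distance between two vectors. It assumes the common convention `distance = 1 - cosine similarity`; whether the API uses exactly this formula is not stated here, and `cosine_distance` is a hypothetical helper.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

Under this convention, vectors pointing in the same direction have a distance of 0 and orthogonal vectors a distance of 1, so smaller `dist` values mean closer chunks.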