Introduction
INSPIRE is the next-generation High Energy Physics (HEP) information system, which empowers scientists with innovative tools to promote successful research. It is based on Invenio digital library technology developed at CERN, and provides access to almost one million records.
INSPIRE uses advanced computing services to provide enhanced functionality to the worldwide community of over 30,000 scientists. D4Science helps INSPIRE provide faster and more sophisticated functionality than earlier HEP repositories, which were held back by technical, architectural and computational limitations.

OCR
Many digital libraries contain large numbers of scanned documents where textual information is not available. Optical Character Recognition (OCR) enables textual analysis and indexing of these documents.
Traditionally, OCR has been carried out with commercial tools or services that libraries have to pay for. INSPIRE now provides tools that enable free OCR on the D4Science infrastructure using open-source software.
The INSPIRE team has implemented a procedure and packaged the software so that the OCR of scanned documents can be executed as a grid job in the D4Science Process Execution Engine.
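As an illustration of how scanned documents might be divided into independent units of work, the following minimal sketch splits pages into fixed-size batches, each of which could be submitted as one job. The function name plan_ocr_jobs, the batch size and the document identifiers are hypothetical; the source does not describe how INSPIRE actually partitions its OCR workload or which OCR engine it runs.

```python
def plan_ocr_jobs(page_counts, pages_per_job):
    """Split scanned pages into batches of at most `pages_per_job` pages.

    `page_counts` maps a document id to its number of scanned pages.
    Each batch is a list of (doc_id, page_number) tuples that one job
    could process independently of the others. (Hypothetical helper,
    not part of any INSPIRE or D4Science API.)
    """
    if pages_per_job < 1:
        raise ValueError("pages_per_job must be at least 1")
    pages = [
        (doc_id, page)
        for doc_id, count in page_counts.items()
        for page in range(1, count + 1)
    ]
    return [
        pages[start:start + pages_per_job]
        for start in range(0, len(pages), pages_per_job)
    ]


# Illustrative input: two scanned documents with 3 and 2 pages.
jobs = plan_ocr_jobs({"scan-a": 3, "scan-b": 2}, pages_per_job=2)
for number, batch in enumerate(jobs, start=1):
    print(number, batch)
```

Because the batches share no state, they can run in any order and on any number of worker nodes.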
Full-text Indexing
Indexing of full-text documents enables efficient searching in large collections. Combined with OCR, it allows full-text searching in the entire HEP literature, including scanned papers. Once the indexes have been computed, the INSPIRE collection can be searched within a fraction of a second. The indexing jobs are executed in parallel, using the MapReduce model on Hadoop clusters, interfaced through the D4Science Process Execution Engine.
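The map and reduce steps behind an inverted index can be sketched in a few lines of plain Python. This is an in-memory illustration of the MapReduce pattern only, not INSPIRE's Hadoop code; the function names, the tokenisation rule and the sample records are assumptions made for the example.

```python
import re
from collections import defaultdict


def map_document(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct term."""
    terms = set(re.findall(r"[a-z0-9]+", text.lower()))
    return [(term, doc_id) for term in terms]


def reduce_postings(pairs):
    """Reduce step: group document ids by term into sorted posting lists."""
    grouped = defaultdict(set)
    for term, doc_id in pairs:
        grouped[term].add(doc_id)
    return {term: sorted(ids) for term, ids in grouped.items()}


def build_index(documents):
    """Run the map step over every document, then reduce the output."""
    pairs = []
    for doc_id, text in documents.items():
        pairs.extend(map_document(doc_id, text))
    return reduce_postings(pairs)


def search(index, query):
    """Return the ids of documents that contain every query term."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Illustrative records, invented for this example.
documents = {
    "rec1": "Higgs boson searches at the LHC",
    "rec2": "Lattice QCD and the Higgs mechanism",
    "rec3": "Neutrino oscillation measurements",
}
index = build_index(documents)
print(search(index, "higgs"))
print(search(index, "higgs lhc"))
```

The map step needs only a single document at a time, which is what allows indexing jobs to run in parallel; a query then touches only the posting lists of its terms rather than the full text.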
Reference extraction
Automatically extracting reference lines from documents is challenging, given the variety of reference formats found in academia. A tool capable of doing so is nevertheless useful for bibliometric citation analysis and for enriching the meta-data of the associated documents in a digital library.
INSPIRE's Refextract tool is capable of recognising references and their content, for generic PDF and text documents. Knowledge bases provide Refextract with the ability to identify key, domain-specific data found inside reference lines. A combination of identified reference data helps with the accurate extraction of references in a range of formats.
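The role of a knowledge base can be illustrated with a small sketch that recognises a journal title in a reference line and then reads the volume, year and page that follow it. This is not Refextract's implementation or API: the JOURNAL_KB entries, the regular expressions and the parse_reference function are assumptions, and only one numeration layout, "volume (year) page", is handled.

```python
import re

# Hypothetical knowledge base: journal title variants -> standard form.
JOURNAL_KB = {
    "physical review letters": "Phys. Rev. Lett.",
    "phys. rev. lett.": "Phys. Rev. Lett.",
    "nuclear physics b": "Nucl. Phys. B",
    "nucl. phys. b": "Nucl. Phys. B",
}

MARKER_RE = re.compile(r"^\s*\[(\d+)\]")
# Numeration written as: volume (year) page
NUMERATION_RE = re.compile(r"\s*(\d+)\s*\((\d{4})\)\s*(\d+)")


def parse_reference(line):
    """Extract marker, journal, volume, year and page from one line."""
    ref = {"marker": None, "journal": None,
           "volume": None, "year": None, "page": None}
    marker = MARKER_RE.match(line)
    if marker:
        ref["marker"] = marker.group(1)
    # Try longer title variants first so the most specific one wins.
    for variant in sorted(JOURNAL_KB, key=len, reverse=True):
        found = re.search(re.escape(variant), line, re.IGNORECASE)
        if found is None:
            continue
        ref["journal"] = JOURNAL_KB[variant]
        # Look for the numeration immediately after the journal title.
        numeration = NUMERATION_RE.match(line, found.end())
        if numeration:
            ref["volume"], ref["year"], ref["page"] = numeration.groups()
        break
    return ref


parsed = parse_reference("[12] S. Weinberg, Phys. Rev. Lett. 19 (1967) 1264.")
print(parsed)
```

Anchoring the numeration search to the position of a recognised journal title shows how one identified element can help locate the others in the same line.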
Authors' names extraction
Reading the authors' names from a paper is trivial for humans, but extracting that information automatically and generically from a document is a difficult problem. Accurate extraction is desirable in many contexts, and especially in digital libraries.
Author extraction extends the notion of identifying document references, with the ability to find and extract authors of a document. By making use of a variety of heuristics, Refextract is extended to output a list of author names for PDF and text documents, without the need for domain-specific knowledge bases.
Author extraction benefits INSPIRE by extending the range of full-text meta-data which can be automatically generated, and then used for indexing and search purposes.
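A heuristic of the kind described above can be sketched as follows: scan the lines before the abstract and keep those in which every comma- or "and"-separated item looks like a personal name. The patterns, the affiliation keyword list and the extract_authors function are assumptions for illustration, not the heuristics Refextract uses, and a short capitalised title such as "Higgs Searches" would be misread as a name by this simple version.

```python
import re

NAME_RE = re.compile(
    r"^(?:[A-Z]\.(?:\s?[A-Z]\.)*|[A-Z][a-z]+)"  # initials or a first name
    r"(?:\s+[A-Z]\.)*"                          # optional middle initials
    r"\s+[A-Z][a-zA-Z'\-]+$"                    # surname
)
STOP_RE = re.compile(r"^\s*(abstract|introduction)\b", re.IGNORECASE)
# Generic (not domain-specific) words that signal an affiliation line.
AFFILIATION_RE = re.compile(
    r"\b(university|institute|laboratory|department)\b", re.IGNORECASE
)


def extract_authors(text, max_lines=20):
    """Return name-like strings found before the abstract."""
    authors = []
    for line in text.splitlines()[:max_lines]:
        if STOP_RE.match(line):
            break
        if AFFILIATION_RE.search(line):
            continue
        # Drop affiliation markers such as digits and asterisks.
        cleaned = re.sub(r"[\d*]", "", line)
        parts = [p.strip() for p in re.split(r",|;|\band\b", cleaned)]
        names = [p for p in parts if p]
        if names and all(NAME_RE.match(name) for name in names):
            authors.extend(names)
    return authors


# Invented front matter, used only to exercise the heuristic.
sample = "\n".join([
    "Measurement of a Hypothetical Cross Section",
    "J. Smith1, A. B. Jones2 and Maria Rossi1",
    "1 Example University",
    "Abstract",
    "We present a measurement.",
])
print(extract_authors(sample))
```

The output is a flat list of names that could be added to a record's meta-data and indexed alongside the full text.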
Bibliometrics
The computing resources of the D4Science infrastructure allow computationally intensive processing of bibliometric information, improving search and ranking and enabling new services for INSPIRE users.
The bibliometrics tools developed by INSPIRE also become available to other communities facing similar problems in other environments.
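Two simple bibliometric computations, citation counting and the h-index, are sketched below together with a ranking of records by citation count. The source does not state which metrics INSPIRE computes, so the choice of metrics, the function names and the toy citation graph are all assumptions.

```python
from collections import Counter


def citation_counts(citations):
    """Count how many distinct records cite each record.

    `citations` maps a citing record id to the list of ids it cites.
    """
    counts = Counter()
    for citing, cited_list in citations.items():
        for cited in set(cited_list):  # ignore duplicate references
            if cited != citing:        # ignore a record citing itself
                counts[cited] += 1
    return counts


def rank_by_citations(record_ids, counts):
    """Order records by descending citation count, then by id."""
    return sorted(record_ids, key=lambda rid: (-counts[rid], rid))


def h_index(paper_citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ordered = sorted(paper_citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ordered, start=1):
        if count < rank:
            break
        h = rank
    return h


# Toy citation graph, invented for this example.
citations = {"A": ["B", "C"], "B": ["C"], "D": ["C", "B", "B"]}
counts = citation_counts(citations)
print(rank_by_citations(["A", "B", "C", "D"], counts))
print(h_index([10, 8, 5, 4, 3]))
```

On a collection approaching one million records, computations like these touch every citation link, which is the kind of workload that benefits from a distributed infrastructure.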