GROBID – PDF into structured XML/TEI

GROBID – GeneRation Of BIbliographic Data.

GROBID is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured XML/TEI encoded documents with a particular focus on technical and scientific publications.

The following functionalities are available:

  • Header extraction and parsing from articles in PDF format. The extraction here covers the bibliographical information (e.g. title, abstract, authors, affiliations, keywords).
  • References extraction and parsing from articles in PDF format. References in footnotes are supported, although this is still a work in progress. Such references are rare in technical and scientific articles, but frequent in publications in the humanities and social sciences.
  • Parsing of references in isolation.
  • Extraction of patent and non-patent references in patent publications.
  • Parsing of names, in particular author names in header, and author names in references (two distinct models).
  • Parsing of affiliation and address blocks.
  • Parsing of dates.
  • Full text extraction from PDF articles, including a model for the overall document segmentation and a model for the structuring of the text body.
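
To give a sense of what "structured XML/TEI" means in practice, here is a minimal Python sketch that reads a title, author names and an abstract out of a TEI header. The sample document is hand-written for illustration and is not actual GROBID output; the element paths follow common TEI conventions and are an assumption about what a given GROBID result contains, and the names `SAMPLE_TEI` and `read_header` are hypothetical.

```python
import xml.etree.ElementTree as ET

# TEI documents use this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample for illustration only; not actual GROBID output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">An Example Article</title>
      </titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author>
              <persName>
                <forename type="first">Ada</forename>
                <surname>Lovelace</surname>
              </persName>
            </author>
            <author>
              <persName>
                <forename type="first">Alan</forename>
                <surname>Turing</surname>
              </persName>
            </author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <abstract><p>A short abstract.</p></abstract>
    </profileDesc>
  </teiHeader>
</TEI>"""


def read_header(tei_xml: str) -> dict:
    """Collect title, author names and abstract from a TEI header (assumed layout)."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(
        ".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS
    )
    authors = []
    for pers in root.findall(".//tei:analytic/tei:author/tei:persName", TEI_NS):
        forename = pers.findtext("tei:forename", default="", namespaces=TEI_NS)
        surname = pers.findtext("tei:surname", default="", namespaces=TEI_NS)
        authors.append(f"{forename} {surname}".strip())
    abstract = root.findtext(
        ".//tei:profileDesc/tei:abstract/tei:p", default="", namespaces=TEI_NS
    )
    return {"title": title, "authors": authors, "abstract": abstract}


if __name__ == "__main__":
    print(read_header(SAMPLE_TEI))
```

Only the Python standard library is used, so the sketch runs without a GROBID installation; real output should be inspected before relying on any of these paths.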

https://github.com/kermitt2/grobid

https://grobid.readthedocs.io/en/latest/

http://cloud.science-miner.com/grobid/

Cite this entry: "GROBID – PDF into structured XML/TEI," in OARESOURCES, September 19, 2019, https://oaresources.org/grobid-pdf-into-structured-xml-tei/

Substance

Substance is a JavaScript library for web-based content editing. It provides building blocks for realizing custom text editors and web-based publishing systems.

https://substance.io/

H2O

H2O is a suite of online classroom tools developed and provided by the Berkman Center for Internet & Society in collaboration with the Harvard Law School Library. H2O allows professors to freely develop, remix, and share online textbooks under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License (per the Terms of Service).

H2O is based on the open-source model: instead of locking down materials in formalized textbooks, we believe that course books can be free (as in “free speech”) for everyone to access and, just as important, to build upon. Currently, H2O is geared primarily toward law professors, though the platform can be used across intellectual domains.

https://h2o.law.harvard.edu/

Open Research Library

The Open Research Library (ORL) will include all Open Access scholarly book content worldwide on one platform for user-friendly discovery, offering a seamless experience navigating more than 20,000 Open Access books. This comprehensive collection of peer-reviewed Open Access books will be openly accessible for everyone. Libraries investing in the Open Research Library contribute to the development of a vital infrastructure for the global research community, while participating libraries have the opportunity to benefit from a set of exclusive services.

https://openresearchlibrary.org

WorldCat

WorldCat is the world’s largest network of library content and services. WorldCat libraries are dedicated to providing access to their resources on the Web, where most people start their search for information.

WorldCat.org lets you search the collections of libraries in your community and thousands more around the world. WorldCat grows every day thanks to the efforts of librarians and other information professionals.

https://www.worldcat.org

Folio

FOLIO is a collaboration of libraries, developers and vendors building an open source library services platform. It supports traditional resource management functionality and can be extended into other institutional areas.

The FOLIO project aims to facilitate a sustainable, community-driven collaboration around the creation of a modern technology ecosystem that empowers libraries through open source applications to manage library resources and expand library value.

FOLIO is hosted by the Open Library Foundation, an independent not-for-profit organization designed to ensure the availability, accessibility and sustainability of open source and open access projects for and by libraries.

https://www.folio.org/