PubFactory

PubFactory is built from the ground up to support books, reference works and journals in a variety of XML formats, with full support for PDF, images and other rich media.

http://www.pubfactory.com

Citation Style Language

Citation Style Language (CSL) is a key project for publishing: it is widely adopted by major platforms and plays a critical role in the scholarly publishing landscape. It is also notable as an open source project whose contributors bring a diverse range of skills and research perspectives.

The Citation Style Language project developed an XML-based format to define citation formats. Originally it was built to support the OpenOffice platform but has since been adopted on a wide scale.
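To give a feel for what an XML-based style definition looks like, here is a minimal sketch in Python that reads a few properties out of a CSL-like style. The style text is a hypothetical, heavily simplified fragment written for this illustration (it is not taken from the CSL repository and is not checked against the CSL schema), and the helper name describe_style is likewise an assumption.

```python
import xml.etree.ElementTree as ET

# Namespace used by CSL style files.
CSL_NS = {"csl": "http://purl.org/net/xbiblio/csl"}

# Hypothetical, simplified style for illustration only; real styles carry
# more metadata and usually a <bibliography> section as well.
MINIMAL_STYLE = """
<style xmlns="http://purl.org/net/xbiblio/csl" class="in-text" version="1.0">
  <info>
    <title>Example Author-Date Style</title>
    <id>http://example.org/styles/example-author-date</id>
  </info>
  <citation>
    <layout prefix="(" suffix=")" delimiter="; ">
      <names variable="author">
        <name form="short"/>
      </names>
      <date variable="issued" prefix=", ">
        <date-part name="year"/>
      </date>
    </layout>
  </citation>
</style>
"""


def describe_style(xml_text):
    """Return a small summary of a CSL style given as an XML string."""
    root = ET.fromstring(xml_text.strip())
    layout = root.find("csl:citation/csl:layout", CSL_NS)
    return {
        "title": root.findtext("csl:info/csl:title", namespaces=CSL_NS),
        "class": root.get("class"),
        "version": root.get("version"),
        # Punctuation that wraps and separates in-text citations.
        "citation_prefix": layout.get("prefix", ""),
        "citation_suffix": layout.get("suffix", ""),
        "citation_delimiter": layout.get("delimiter", ""),
    }
```

Calling describe_style(MINIMAL_STYLE) returns a dictionary with the style title, its class, and the punctuation used around in-text citations. A CSL processor does far more than this (name formatting, disambiguation, localization); the sketch only shows that a style is ordinary XML that any XML library can read.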

The project maintains a free and open source repository that currently holds over 9000 CSL citation styles for major style guides and individual journals (see https://github.com/citation-style-language/styles/ and https://pinux.info/csls_counter/). Dozens of software products, including popular reference managers such as Zotero, Mendeley, and Papers, have adopted CSL and its style library to let their users easily generate citations in a wide variety of formats.

https://citationstyles.org

GROBID – PDF into structured XML/TEI

GROBID – GeneRation Of BIbliographic Data.

GROBID is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured XML/TEI encoded documents with a particular focus on technical and scientific publications.

The following functionalities are available:

  • Header extraction and parsing from articles in PDF format. The extraction covers the bibliographical information (e.g. title, abstract, authors, affiliations, keywords).
  • Reference extraction and parsing from articles in PDF format. References in footnotes are supported, although this is still a work in progress. Such references are rare in technical and scientific articles but frequent in publications in the humanities and social sciences.
  • Parsing of references in isolation.
  • Extraction of patent and non-patent references in patent publications.
  • Parsing of names, in particular author names in header, and author names in references (two distinct models).
  • Parsing of affiliation and address blocks.
  • Parsing of dates.
  • Full text extraction from PDF articles, including a model for the overall document segmentation and a model for the structuring of the text body.
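As an illustration of "parsing of references in isolation", the following Python sketch prepares and sends a raw citation string to GROBID's REST service. It assumes a GROBID server is already running locally on port 8070 (that base URL is an assumption, not something stated above), and the helper names build_citation_request and parse_citation are invented for this example.

```python
from urllib import parse, request

# Assumption: a GROBID server running locally on its usual port.
GROBID_URL = "http://localhost:8070"


def build_citation_request(citation, base_url=GROBID_URL):
    """Build a POST request for the processCitation service.

    The raw reference string is sent as the form field "citations".
    """
    data = parse.urlencode({"citations": citation}).encode("utf-8")
    return request.Request(
        f"{base_url}/api/processCitation", data=data, method="POST"
    )


def parse_citation(citation, base_url=GROBID_URL, timeout=30):
    """Send one raw reference to GROBID and return the response body as text."""
    req = build_citation_request(citation, base_url)
    with request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

With a server running, parse_citation("Graff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113") should return a TEI XML fragment describing the reference. The header and full text services work on uploaded PDF files rather than form strings, so they need a multipart request instead; see the GROBID documentation linked below for the exact parameters.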

https://github.com/kermitt2/grobid

https://grobid.readthedocs.io/en/latest/

http://cloud.science-miner.com/grobid/

Cite this entry: "GROBID – PDF into structured XML/TEI," in OARESOURCES, September 19, 2019, https://oaresources.org/grobid-pdf-into-structured-xml-tei/

Pandoc

The universal markup converter

Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. It can convert between a large number of markup and document formats.

Pandoc can also produce PDF output via LaTeX, Groff ms, or HTML.
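A minimal sketch of driving the command-line tool from Python to produce a PDF is shown below. It assumes pandoc and the chosen PDF engine are installed; the function names and file names are hypothetical. The engine names listed in the comment are examples of engines for each of the three routes, not a complete list.

```python
import shutil
import subprocess

# Example engines for each route (assumed to be installed separately):
#   LaTeX:    pdflatex, xelatex, lualatex
#   Groff ms: pdfroff
#   HTML:     wkhtmltopdf, weasyprint


def pandoc_pdf_command(source, target, pdf_engine="xelatex"):
    """Return the argument list for a pandoc run that writes a PDF."""
    return ["pandoc", source, "-o", target, f"--pdf-engine={pdf_engine}"]


def convert_to_pdf(source, target, pdf_engine="xelatex"):
    """Run pandoc, raising an error if it is missing or the conversion fails."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    subprocess.run(pandoc_pdf_command(source, target, pdf_engine), check=True)
```

For example, convert_to_pdf("paper.md", "paper.pdf", pdf_engine="pdfroff") would go through Groff ms instead of LaTeX. Building the argument list separately from running it keeps the command easy to inspect or log before anything is executed.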

Pandoc’s enhanced version of Markdown includes syntax for tables, definition lists, metadata blocks, footnotes, citations, math, and much more. See the section on Pandoc’s Markdown in the User’s Manual.

Pandoc has a modular design: it consists of a set of readers, which parse text in a given format and produce a native representation of the document (an abstract syntax tree or AST), and a set of writers, which convert this native representation into a target format. Thus, adding an input or output format requires only adding a reader or writer. Users can also run custom pandoc filters to modify the intermediate AST (see the documentation for filters and lua filters).
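The idea of a filter that modifies the intermediate AST can be sketched in a few lines of Python. The example works on pandoc's JSON serialization of the AST, in which each element is an object with a type tag "t" and optional contents "c". The sample document, the API version number in it, and the function names walk, shout and apply_filter are all assumptions made for this illustration; it does not use any pandoc filter library.

```python
import json

# Hand-written sample of pandoc's JSON AST for the paragraph "hello world".
# The API version shown is illustrative only.
SAMPLE = json.dumps({
    "pandoc-api-version": [1, 23],
    "meta": {},
    "blocks": [
        {"t": "Para", "c": [
            {"t": "Str", "c": "hello"},
            {"t": "Space"},
            {"t": "Str", "c": "world"},
        ]}
    ],
})


def walk(node, fn):
    """Rebuild the tree, letting fn replace any tagged element it chooses."""
    if isinstance(node, list):
        return [walk(item, fn) for item in node]
    if isinstance(node, dict):
        if "t" in node:
            replaced = fn(node)
            if replaced is not None:
                return replaced
        return {key: walk(value, fn) for key, value in node.items()}
    return node


def shout(elem):
    """Upper-case plain text elements; return None to leave others alone."""
    if elem["t"] == "Str":
        return {"t": "Str", "c": elem["c"].upper()}
    return None


def apply_filter(doc_json, fn):
    """Parse a JSON AST, transform its blocks with fn, and serialize it again."""
    doc = json.loads(doc_json)
    doc["blocks"] = walk(doc["blocks"], fn)
    return json.dumps(doc)
```

Used as a real filter, a script like this would read the JSON document from standard input, write the transformed JSON to standard output, and be passed to pandoc with the --filter option. Because the filter only sees the AST, the same script works regardless of which reader produced the document and which writer will consume it.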

Because pandoc’s intermediate representation of a document is less expressive than many of the formats it converts between, one should not expect perfect conversions between every format and every other. Pandoc attempts to preserve the structural elements of a document, but not formatting details such as margin size. And some document elements, such as complex tables, may not fit into pandoc’s simple document model. While conversions from pandoc’s Markdown to all formats aspire to be perfect, conversions from formats more expressive than pandoc’s Markdown can be expected to be lossy.

https://pandoc.org/