GROBID – PDF into structured XML/TEI

GROBID – GeneRation Of BIbliographic Data.

GROBID is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured XML/TEI encoded documents with a particular focus on technical and scientific publications.

The following functionalities are available:

  • Header extraction and parsing from articles in PDF format. The extraction covers the bibliographical information (title, abstract, authors, affiliations, keywords, etc.).
  • Reference extraction and parsing from articles in PDF format. References in footnotes are supported, although this is still a work in progress. Such references are rare in technical and scientific articles, but frequent in publications in the humanities and social sciences.
  • Parsing of references in isolation.
  • Extraction of patent and non-patent references in patent publications.
  • Parsing of names, in particular author names in the header and author names in references (two distinct models).
  • Parsing of affiliation and address blocks.
  • Parsing of dates.
  • Full text extraction from PDF articles, including a model for the overall document segmentation and a model for the structuring of the text body.
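
To make the header extraction output more concrete, here is a minimal Python sketch of reading a title and author names from a TEI header using only the standard library. The sample XML is hand-written for illustration and is not actual GROBID output; the element paths (titleStmt/title, analytic/author/persName) follow common TEI practice, and the function name parse_header is hypothetical. Real GROBID output should be checked before relying on these paths.

```python
import xml.etree.ElementTree as ET

# TEI documents use this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample for illustration only; NOT actual GROBID output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">An Example Article</title>
      </titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author>
              <persName><forename type="first">Ada</forename><surname>Lovelace</surname></persName>
            </author>
            <author>
              <persName><forename type="first">Alan</forename><surname>Turing</surname></persName>
            </author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def parse_header(tei_xml: str) -> dict:
    """Return the title and author names found in a TEI header (assumed layout)."""
    root = ET.fromstring(tei_xml)

    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""

    authors = []
    for pers in root.iterfind(".//tei:analytic/tei:author/tei:persName", TEI_NS):
        forenames = [f.text.strip() for f in pers.findall("tei:forename", TEI_NS) if f.text]
        surname = pers.findtext("tei:surname", default="", namespaces=TEI_NS).strip()
        authors.append(" ".join(forenames + [surname]).strip())

    return {"title": title, "authors": authors}


header = parse_header(SAMPLE_TEI)
print(header["title"])
print(header["authors"])
```

The same function could be applied to the XML string returned by a GROBID run, provided the output uses the element layout assumed here.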

Cite this entry: "GROBID – PDF into structured XML/TEI," in OARESOURCES, September 19, 2019,

Typademic, an academic publishing pipeline

Mähr, Moritz. (2018). Typademic, collaborative academic publishing. 10.3929/ethz-b-000311815.

How do humanities scholars collaborate today?

Most humanities scholars use e-mail and Word documents (with track changes). This generally leads to comically colored documents, funny file names, mislabeled images, time-consuming text merges, and awful-looking layouts.

How humanities scholars should collaborate!

With the simple Markdown text format and Git version control software, you will

– focus on the writing process (typesetting and text production are strictly separated)
– separately edit text and data (figures, tables, etc.)
– divide tasks, define roles and track progress
– work simultaneously on a common project (online and offline)


If you want to self-publish, use Typademic to turn Markdown text, images and BibTeX references into professionally typeset articles, reports or books (PDF). Choose from different templates and hundreds of fonts, or design with LaTeX according to your own wishes.

Typademic is an open source web-based graphical user interface for Pandoc, LaTeX and Google Fonts developed and maintained by Moritz Mähr.
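
As a rough idea of what such a pipeline does under the hood, the following Python sketch assembles a Pandoc command line that turns a Markdown file and a BibTeX file into a PDF via LaTeX. This is an assumption about a typical invocation, not Typademic's actual implementation: the function name build_pandoc_command and the file names are hypothetical, and the --citeproc flag requires Pandoc 2.11 or later.

```python
import shlex


def build_pandoc_command(markdown_file, bibliography, output_pdf, pdf_engine="xelatex"):
    """Assemble a pandoc argument list for Markdown + BibTeX -> PDF (hypothetical helper)."""
    return [
        "pandoc",
        str(markdown_file),
        "--citeproc",                       # resolve citations (Pandoc 2.11+)
        f"--bibliography={bibliography}",   # BibTeX file with the references
        f"--pdf-engine={pdf_engine}",       # LaTeX engine used to produce the PDF
        "-o",
        str(output_pdf),
    ]


# File names below are placeholders for illustration.
command = build_pandoc_command("paper.md", "refs.bib", "paper.pdf")
print(shlex.join(command))
```

With Pandoc and a LaTeX distribution installed, the list can be passed to subprocess.run(command, check=True) to perform the conversion.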


The universal markup converter

Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library.

Pandoc can also produce PDF output via LaTeX, Groff ms, or HTML.

Pandoc’s enhanced version of Markdown includes syntax for tables, definition lists, metadata blocks, footnotes, citations, math, and much more. See the User’s Manual below under Pandoc’s Markdown.

Pandoc has a modular design: it consists of a set of readers, which parse text in a given format and produce a native representation of the document (an abstract syntax tree or AST), and a set of writers, which convert this native representation into a target format. Thus, adding an input or output format requires only adding a reader or writer. Users can also run custom pandoc filters to modify the intermediate AST (see the documentation for filters and Lua filters).
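
The idea of a filter operating on the intermediate AST can be sketched in a few lines of Python. Pandoc's JSON serialization represents each AST element as an object with a "t" (type) key and, usually, a "c" (content) key; the sketch below walks such a structure and upper-cases every Str element. The sample document is hand-written, its API version number is illustrative, and the helper names walk and upper_str are hypothetical.

```python
import json


def walk(node, action):
    """Return a copy of `node` with `action` applied to every AST element.

    An AST element is a dict with a "t" key. If `action` returns a value
    other than None, that value replaces the element; otherwise the walk
    descends into the element's children.
    """
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        if "t" in node:
            replaced = action(node)
            if replaced is not None:
                return replaced
        return {key: walk(value, action) for key, value in node.items()}
    return node


def upper_str(element):
    """Upper-case the text of Str elements; leave everything else alone."""
    if element["t"] == "Str":
        return {"t": "Str", "c": element["c"].upper()}
    return None


# Hand-written sample in the shape of pandoc's JSON AST (version is illustrative).
SAMPLE_DOC = {
    "pandoc-api-version": [1, 22],
    "meta": {},
    "blocks": [
        {"t": "Para", "c": [
            {"t": "Str", "c": "Hello"},
            {"t": "Space"},
            {"t": "Emph", "c": [{"t": "Str", "c": "world"}]},
        ]}
    ],
}

filtered = walk(SAMPLE_DOC, upper_str)
print(json.dumps(filtered["blocks"]))
```

To use this as an actual filter, the script would read the JSON document from standard input, apply walk, and write the result to standard output, so that it can be passed to pandoc with the --filter option.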

Because pandoc’s intermediate representation of a document is less expressive than many of the formats it converts between, one should not expect perfect conversions between every format and every other. Pandoc attempts to preserve the structural elements of a document, but not formatting details such as margin size. And some document elements, such as complex tables, may not fit into pandoc’s simple document model. While conversions from pandoc’s Markdown to all formats aspire to be perfect, conversions from formats more expressive than pandoc’s Markdown can be expected to be lossy.