meTypeset

meTypeset is a tool to covert from Microsoft Word .docx format to NLM/JATS-XML for scholarly/scientific article typesetting.

meTypeset is a fork of the OxGarage stack and uses TEI as an intemediary format to facilitate interchange. Components are licensed under the GPL v2 and the Creative Commons Attribution-Sharealike 3.0 clauses, according to the various original sources of the fork.

The transforms within this software can be invoked in several different ways:

  1. Run through a *nix compatible command line with the bin/meTypeset.py script.
  2. A limited set of the transforms can be called through any renderer capable of parsing XSLT 1.0. This will not include the meTypeset classifier stages.

The modfications to the OxGarage stack contained within this project are Copyright Martin Paul Eve 2014 and released under the licenses specified in each respective file.

https://github.com/MartinPaulEve/meTypeset

GROBID – PDF into structured XML/TEI

GROBID – GeneRation Of BIbliographic Data.

GROBID is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured XML/TEI encoded documents with a particular focus on technical and scientific publications.

The following functionalities are available:

  • Header extraction and parsing from article in PDF format. The extraction here covers the bibliographical information (e.g. title, abstract, authors, affiliations, keywords, etc.).
  • References extraction and parsing from articles in PDF format. References in footnotes are supported, although still work in progress. They are rare in technical and scientific articles, but frequent for publications in the humanities and social sciences.
  • Parsing of references in isolation.
  • Extraction of patent and non-patent references in patent publications.
  • Parsing of names, in particular author names in header, and author names in references (two distinct models).
  • Parsing of affiliation and address blocks.
  • Parsing of dates.
  • Full text extraction from PDF articles, including a model for the the overall document segmentation and a model for the structuring of the text body.

https://github.com/kermitt2/grobid

https://grobid.readthedocs.io/en/latest/

http://cloud.science-miner.com/grobid/

Cite this entry: "GROBID – PDF into structured XML/TEI," in OARESOURCES, September 19, 2019, https://oaresources.org/grobid-pdf-into-structured-xml-tei/