Section: btparse (3)
Updated: 2003-10-25


btparse - C library for parsing and processing BibTeX data files 


   #include <btparse.h>

   /* Basic library initialization / cleanup */
   void bt_initialize (void);
   void bt_free_ast (AST *ast);
   void bt_cleanup (void);

   /* Input / interface to parser */
   void   bt_set_stringopts (bt_metatype_t metatype, ushort options);
   AST * bt_parse_entry_s (char *    entry_text,
                           char *    filename,
                           int       line,
                           ushort    options,
                           boolean * status);
   AST * bt_parse_entry   (FILE *    infile,
                           char *    filename,
                           ushort    options,
                           boolean * status);
   AST * bt_parse_file    (char *    filename,
                           ushort    options,
                           boolean * overall_status);

   /* AST traversal/query */
   AST * bt_next_entry (AST * entry_list,
                        AST * prev_entry);
   AST * bt_next_field (AST *entry, AST *prev, char **name);
   AST * bt_next_value (AST *head,
                        AST *prev,
                        bt_nodetype_t *nodetype,
                        char **text);

   bt_metatype_t bt_entry_metatype (AST *entry);
   char *bt_entry_type (AST *entry);
   char *bt_entry_key (AST *entry);
   char *bt_get_text (AST *node);

   /* Splitting names and lists of names */
   bt_stringlist * bt_split_list (char *   string,
                                  char *   delim,
                                  char *   filename,
                                  int      line,
                                  char *   description);
   void bt_free_list (bt_stringlist *list);
   bt_name * bt_split_name (char *  name,
                            char *  filename,
                            int     line,
                            int     name_num);
   void bt_free_name (bt_name * name);

   /* Formatting names */
   bt_name_format * bt_create_name_format (char * parts, boolean abbrev_first);
   void bt_free_name_format (bt_name_format * format);
   void bt_set_format_text (bt_name_format * format,
                            bt_namepart part,
                            char * pre_part,
                            char * post_part,
                            char * pre_token,
                            char * post_token);
   void bt_set_format_options (bt_name_format * format,
                               bt_namepart part,
                               boolean abbrev,
                               bt_joinmethod join_tokens,
                               bt_joinmethod join_part);
   char * bt_format_name (bt_name * name, bt_name_format * format);

   /* Construct tree from TeX groups */
   bt_tex_tree * bt_build_tex_tree (char * string);
   void          bt_free_tex_tree (bt_tex_tree **top);
   void          bt_dump_tex_tree (bt_tex_tree *node, int depth, FILE *stream);
   char *        bt_flatten_tex_tree (bt_tex_tree *top);

   /* Miscellaneous string utilities */
   void bt_purify_string (char * string, ushort options);
   void bt_change_case (char transform, char * string, ushort options);


btparse is a C library for parsing and processing BibTeX files. It provides a lexical scanner and LR parser (constructed by PCCTS), both of which are efficient and offer good error detection and recovery; a set of functions for traversing the AST (abstract syntax tree) generated by the parser; and utility functions for manipulating strings according to BibTeX conventions. (Note that nothing in the library assumes that you're using BibTeX files for their original purpose of bibliographic data for scholarly publications; you could use the file format for any conceivable purpose that fits it. However, there is some code in the library that is really only appropriate for use with strings meant to be processed the way BibTeX itself processes them. This is all entirely optional, though.)

Note that the interface provided by btparse, while complete, is fairly low-level. If you have more sophisticated needs, you might be interested in my "Text::BibTeX" module for Perl 5 (available on CPAN).


To understand this document and use btparse, you should already be familiar with the BibTeX language, or more specifically, the BibTeX data description language. (BibTeX being the complex beast that it is, one can conceive of the term applying to the program, the data language, the particular database structure described in the original BibTeX documentation, the ``.bst'' formatting language, and the set of conventions embodied in the standard styles included with the BibTeX distribution. In this document, I'll stick to the first two meanings: the data language because that's what btparse deals with, and the program because it's occasionally necessary to explain differences between my parser and BibTeX's.)

In particular, you should have a good idea what's going on in the following:

   @string{and = { and },
           joe = "Blow, Joe",
           john = "John Smith"}

   @book(ourbook,
         author = joe # and # john,
         title = {Our Little Book})

If this looks like something you want to parse, but don't want to have to write your own parser for, you've come to the right place.

Before going much further, though, you're going to have to learn some of the terminology I use for describing BibTeX data. Most of it's the same as you'll find in any BibTeX documentation, but it's important to be sure that we're talking about the same things here. So, some definitions:

top-level
All text in a BibTeX file from the start of the file to the start of the first entry, and between entries thereafter.
name
A string of letters, digits, and the following characters:

   ! $ & * + - . / : ; < > ? [ ] ^ _ ` |

A ``name'' is a catch-all used for entry types, entry keys, and field and macro names. For BibTeX compatibility, there are slightly different rules for these four entities; currently, the only such rule actually implemented is that field and macro names may not begin with a digit. Some names in the above example: "string", "and".

entry
A chunk of text starting with an ``at'' sign ("@") at top-level, followed by a name (the entry type), an entry delimiter ("{" or "("), and proceeding to the matching closing delimiter. Also, the data structure that results from parsing this chunk of text. There are two entries in the above example.
entry type
The name that comes right after an "@" at top-level. Examples from above: "string", "book".
entry metatype
A classification of entry types that allows us to group one or more entry types under the same heading. With the standard BibTeX database structure, "article", "book", "inbook", etc. all fall under the ``regular entry'' metatype. Other metatypes are ``macro definition'' (for "string" entries), ``preamble'' (for "preamble" entries), and ``comment'' (for "comment" entries). In fact, any entry whose type is not one of "string", "preamble", or "comment" is called a ``regular'' entry.
entry delimiters
"{" and "}", or "(" and ")": the pair of characters that (almost) mark the boundaries of an entry. ``Almost'' because the start of an entry is marked by an "@", not by the ``entry open'' delimiter.
entry key
(Or just key when it's clear what we're speaking of.) The name immediately following the entry open delimiter in a regular entry, which uniquely identifies the entry. Example from above: "ourbook". Only regular entries have keys.
field name
A name to the left of an equals sign in a regular or macro-definition entry. In the latter context, might also be called a macro name. Examples from above: "joe", "author".
field list
In a regular entry, everything between the entry delimiters except for the entry key. In a macro definition entry, everything between the entry delimiters (possibly also called a macro list).
compound value
(Usually just ``value''.) The text that follows an equals sign ("=") in a regular or macro definition entry, up to a comma or the entry close delimiter; a list of one or more simple values joined by hash signs ("#").
simple value
A string, macro, or number.
string
(Or, sometimes, ``quoted string.'') A chunk of text between double quotes (") or braces ("{" and "}"). Braces must balance: "{this is a {string}" is not a BibTeX string, but "{this is a {string}}" is. ("this is a {string" is also illegal, mainly to avoid the possibility of generating bogus TeX code, which BibTeX will do in certain cases.)
macro
A name that appears on the right-hand side of an equals sign (i.e. as one simple value in a compound value). Implies that this name was defined as a macro in an earlier macro definition entry, but this is only checked if btparse is being asked to expand macros to their full definitions.
number
An unquoted string of digits.
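
Two of the lexical rules in these definitions are mechanical enough to sketch in standalone C: the character set for names, and the brace-balancing requirement for strings. The code below is only an illustration written against the definitions above, not code taken from btparse; the function names (is_name, is_field_or_macro_name, braces_balanced) are hypothetical, and it assumes the C locale, so that only ASCII letters and digits count.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Punctuation allowed in a name, copied from the list above. */
static const char name_punct[] = "!$&*+-./:;<>?[]^_`|";

/* Hypothetical helper: true if s is a non-empty string made up only of
   letters, digits, and the punctuation characters listed above. */
static bool is_name(const char *s)
{
    assert(s != NULL);
    if (*s == '\0')
        return false;
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char) *s;
        if (!isalnum(c) && strchr(name_punct, c) == NULL)
            return false;
    }
    return true;
}

/* Field and macro names additionally may not begin with a digit. */
static bool is_field_or_macro_name(const char *s)
{
    return is_name(s) && !isdigit((unsigned char) s[0]);
}

/* Hypothetical helper: true if every "{" in s has a matching "}" and no
   "}" appears before its "{".  Backslashes get no special treatment. */
static bool braces_balanced(const char *s)
{
    int depth = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (*s == '{')
            depth++;
        else if (*s == '}' && --depth < 0)
            return false;
    }
    return depth == 0;
}
```

With these helpers, "ourbook" and "1999" both pass is_name(), but "1999" fails is_field_or_macro_name(). Likewise, braces_balanced() accepts "{this is a {string}}" and rejects both "{this is a {string}" and "this is a {string".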

Working with btparse generally consists of passing the library some BibTeX data (or a source for some BibTeX data, such as a filename or a file pointer). The library lexically scans and parses the data and constructs an abstract syntax tree (AST) from it. It returns this AST to you, and you call other btparse functions to traverse and query the tree.

The contents of AST nodes are the private domain of the library, and you shouldn't go poking into them. This being C, though, there's nothing to prevent you from doing so except good manners and the possibility that I might change the AST structure in future releases, breaking any badly-behaved code. Also, it's not necessary to know the structural relationships between nodes in the AST; that's taken care of by the query/traversal functions.

However, it's useful to know some of the things that btparse deposits in the AST and returns to you through those query/traversal functions. First off, each node has a ``node type,'' which records the syntactic element corresponding to that node. For instance, the entry

   @book{mybook, author = "Joe Blow", title = "My Little Book"}

is rooted by an ``entry'' node; under this would be found a ``key'' node (for the entry key), two ``field'' nodes (for the ``author'' and ``title'' fields); and associated with each field node would be a ``string'' node. The only time this concerns you is when you ask the library for a simple value; just looking at the text is not enough to distinguish quoted strings, numbers, and macro names, so btparse returns the nodetype as well.

In addition to the nodetype, btparse records the metatype of each ``entry'' node. This allows you (and the library) to distinguish, say, regular entries from comment entries. Not only do they have very different structures, which the library must traverse differently, but certain traversal functions make no sense on certain entry metatypes, so it's necessary for you to be able to make the distinction as well.

That said, everything you need to know to work with the AST is explained in bt_traversal.


btparse defines several types required for the external interface. First, it trivially defines a "boolean" type (along with "TRUE" and "FALSE" macros). This might affect you when including the btparse.h header in your own code: since it's not possible for the code to detect if there is already a "boolean" type defined, you might have to define the "HAVE_BOOLEAN" pre-processor token to deactivate btparse.h's "typedef" of "boolean".

Next, two enumeration types are defined: "bt_metatype" and "bt_nodetype". Both of these are used extensively in the library itself, and are made available to users of the library because they can be found in nodes of the "btparse" AST (abstract syntax tree). (I.e., querying the AST can give you "bt_metatype" and "bt_nodetype" values, so the "typedef"s must be available to your code.)

Entry metatype enum

The values of "bt_metatype_t" are determined by the ``entry type'' token: @string entries have the "BTE_MACRODEF" metatype; @comment and @preamble correspond to "BTE_COMMENT" and "BTE_PREAMBLE"; and any other entry type has the "BTE_REGULAR" metatype.

AST nodetype enum

Of the "bt_nodetype" values, you'll only ever deal with the three that identify simple values: a quoted string, a number, or a macro. They are returned when you query the AST for a simple value; just seeing the text isn't enough to distinguish between a quoted string, a number, and a macro, so the AST nodetype is supplied along with the text.

String processing option macros

Since BibTeX is essentially a system for gluing strings together in a wide variety of ways, the processing done to its strings is fairly important. Most of the string transformations are done outside of the lexer/parser; this reduces their complexity, and makes it easier to switch different transformations on and off. This switching is done with an ``options'' bitmap which can be specified on a per-entry-metatype basis. (That is, you can have one set of transformations done to the strings in all regular entries, another set done to the strings in all macro definition entries, and so on.) If you need finer control than that, it's currently unavailable outside of the library (but it's just a matter of making a couple of functions available and documenting them, so bug me if you need this feature).

There are four basic macros for constructing this bitmap:

BTO_CONVERT
Convert ``number'' values to strings. (The conversion is trivial, involving changing the type of the AST node representing the number from "BTAST_NUMBER" to "BTAST_STRING". ``Number'' values are stored as strings of digits, just as they are in the input data.)
BTO_EXPAND
Expand macro invocations to the full macro text.
BTO_PASTE
Paste simple values together.
BTO_COLLAPSE
Collapse whitespace according to the BibTeX rules.

For instance, supplying "BTO_CONVERT | BTO_EXPAND" as the string options bitmap for the "BTE_REGULAR" metatype means that all simple values in ``regular'' entries will be converted to strings: numbers will simply have their ``nodetype'' changed, and macros will be expanded. Nothing else will be done to the simple values, though: they will not be concatenated, nor will whitespace be collapsed. See the "bt_set_stringopts()" and "bt_parse_*()" functions in bt_input for more information on the various options for parsing; see bt_postprocess for details on the post-processing.
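
To make the effect of such a bitmap concrete, here is a small standalone model of the conversion and expansion steps. It is a sketch under stated assumptions rather than the library's code: the OPT_ and VAL_ names, their bit values, the two structs, and both functions are hypothetical stand-ins invented for illustration. The sketch treats an expanded macro as a string, and does not model pasting or whitespace collapsing at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the option macros and nodetypes; the names
   and bit values are illustrative, not copied from btparse.h. */
enum { OPT_CONVERT = 1, OPT_EXPAND = 2, OPT_PASTE = 4, OPT_COLLAPSE = 8 };
enum value_type { VAL_STRING, VAL_NUMBER, VAL_MACRO };

struct simple_value { enum value_type type; const char *text; };
struct macro_def    { const char *name; const char *text; };

/* Look up a macro by name; returns NULL if it is undefined. */
static const char *lookup_macro(const struct macro_def *table, size_t n,
                                const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(table[i].name, name) == 0)
            return table[i].text;
    return NULL;
}

/* Apply the convert and expand options to each simple value in place.
   OPT_PASTE and OPT_COLLAPSE are accepted but deliberately ignored. */
static void apply_options(struct simple_value *vals, size_t nvals,
                          unsigned options,
                          const struct macro_def *table, size_t nmacros)
{
    assert(vals != NULL || nvals == 0);
    for (size_t i = 0; i < nvals; i++) {
        if (vals[i].type == VAL_NUMBER && (options & OPT_CONVERT)) {
            vals[i].type = VAL_STRING;   /* text is already a digit string */
        } else if (vals[i].type == VAL_MACRO && (options & OPT_EXPAND)) {
            const char *text = lookup_macro(table, nmacros, vals[i].text);
            if (text != NULL) {          /* undefined macros are left alone */
                vals[i].text = text;
                vals[i].type = VAL_STRING;
            }
        }
    }
}
```

Applied with OPT_CONVERT | OPT_EXPAND to the three simple values of "joe # and # john" from the earlier example, this turns each macro into its defining string but leaves them as three separate values, since nothing is pasted.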


The following code is a skeletal example of using the btparse library:

    #include <stdlib.h>       /* for exit() */
    #include <btparse.h>

    int main (void)
    {
       bt_initialize ();

       /* process some data */

       bt_cleanup ();
       exit (0);
    }

Please note the call to "bt_initialize()"; this is very important! Without it, the library may crash or fail mysteriously. You must call "bt_initialize()" before calling any other btparse functions. "bt_cleanup()" just frees the memory allocated by "bt_initialize()"; if you are careful to call it before exiting, and "bt_free_ast()" on any abstract syntax trees generated by btparse when you are done with them, then your program shouldn't have any memory leaks. (Unless they're due to your own code, of course!)


btparse has several inherent limitations that are due to the lexical scanner and parser generated by PCCTS 1.x. In short, the scanner and parser are both heavily dependent on global variables, meaning that thread safety (or even the ability to have two files open and being parsed at the same time) is well-nigh impossible. This will not change until I get with the times and adopt ANTLR 2.0, the successor to PCCTS, presuming of course that it can generate more modular C scanners and parsers.

Another limitation that is due to PCCTS: entries with a large number of fields (more than about 90, if each field value is just a single string) will cause the parser to crash. This is unavoidable due to the parser using statically-allocated stacks for attributes and abstract-syntax tree nodes. I could increase the static allocation, but that would just decrease the likelihood of encountering the problem, not make it go away. Again, the chances of this changing as long as I'm using PCCTS 1.x are nil.

Apart from those inherent limitations, there are no known bugs in btparse. Any segmentation faults or bus errors from the library should be considered bugs. They probably result from using the library incorrectly (e.g., attempting to interleave the parsing of two files), but I do make an attempt to catch all such mistakes, and if I've missed any I'd like to know about it.

Any memory leaks from the library are also a concern; as long as you are conscientious about calling the cleanup functions ("bt_free_ast()" and "bt_cleanup()"), then the library shouldn't leak.


To read and parse BibTeX data files, see bt_input.

To traverse the syntax tree that results, see bt_traversal.

To learn what is done to values in parsed entries, and how to customize that munging, see bt_postprocess.

To learn how btparse deals with strings, see bt_strings (oops, I haven't written this one yet!).

To manipulate and access the btparse macro table, see bt_macros.

For splitting author names and lists ``the BibTeX way'' using btparse, see bt_split_names.

To put author names back together again, see bt_format_names.

For miscellaneous functions for processing strings ``the BibTeX way'', see bt_misc.

A semi-formal language definition is in bt_language. 


Greg Ward


Copyright (c) 1996-97 by Gregory P. Ward.

This library is free software; you can redistribute it and/or modify it under the terms of the GNU Library General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Library General Public License for more details.

You should have received a copy of the GNU Library General Public License along with this library; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.


The btOOL home page is the place to get up-to-date information about btparse (and to download the latest version).

You will also find the latest version of Text::BibTeX, the Perl library that provides a high-level front-end to btparse, there. btparse is needed to build "Text::BibTeX", and must be downloaded separately.

Both libraries are also available on CTAN (the Comprehensive TeX Archive Network) and CPAN (the Comprehensive Perl Archive Network). Look in biblio/bibtex/utils/btOOL/ on CTAN, and authors/Greg_Ward/ on CPAN. Either location will get you to the latest version of "Text::BibTeX" and btparse, but of course you should always access busy sites like CTAN and CPAN through a mirror.


