On BibTeX

Only a few decades ago, I hear, CS papers were typewritten, with hand-drawn figures and hand-written Greek letters. With the advent of computers, people began to use tools like troff, until we got tex, and then latex, making life much easier. I have no clue how people not at the same physical location co-wrote papers before the Internet. Maybe they made a long-distance call to read out and discuss the introduction? Or maybe they started writing weeks in advance to allow shipping/faxing the paper draft back and forth! Email makes it possible to do near real-time co-writing, and today version control tools make this task significantly easier. I think it is safe to say that with these tools and a little bit of experience, nearly all the time spent on writing is spent on figuring out and polishing the content: if we had a genie to lay out the text and all we had to do was conjure up the words and the preferred layout, it would not save us much time.

There is one aspect, however, that I think still lags behind and adds overhead: the creation of bibliographies. Needless to say, bibtex makes it a lot easier than it would be otherwise. But the engineer in us is greedy, and wants to remove inefficiencies wherever it sees them. It wants to get even closer to the genie ideal. While I understand this is a “first-world-problems” post, I am hoping some of you will suggest solutions that we will all find useful.

Let me elaborate on these inefficiencies. First is the question of creating the bibtex entries. This, especially a few hours before the deadline, is a tedious process, involving translation of the mental pointer I have to a particular paper, into a list of its distinguishing features that would allow the reader to locate the paper. Today that list is usually in the form of a bibtex entry. For historical reasons, this list contains journal/conference name, year, volume, edition, and page numbers! Some of these, especially the last few, have become fairly irrelevant for recent publications. Any time spent on filling these in could easily be saved by the hypothetical genie. Even worse, doing this in the middle of writing the paper is a context switch that adds its own overhead. Of course technology and amortization help; more on that in a minute.

Once I have the bibtex entry, I have effectively converted the mental pointer to a short bibtex tag that I can use while writing. So when I am writing and want to cite a paper, I can simply use that. Not so fast: I still need to translate the mental pointer to this bibtex tag. So unless I remember the tag, I have to go to the bibtex file and search for the paper (usually by the authors) to remind myself of it. Another context switch, another break in the train of thought.

Okay, so now on to some of the “solutions” I know of.

1. Obtaining the bibtex entries: Typically to get the volume, edition and page numbers, I need to use a search engine. In most cases, I can find a pre-packaged bibtex entry, almost solving the problem. There are issues of quality and consistency however. Citeseer bibtexs often have missing or wrong entries. DBLP usually splits up the entry, sometimes creating separate citations for the proceedings and the paper. ACM I find has decent bibtex entries (often even for non-ACM papers) but smaller coverage. Many publisher websites work too. Google Scholar requires too many clicks. Microsoft Academic Search is another option that I have only recently started using, and I like it so far. I am sure there are others. Is there one that you particularly like?

2. Amortization: Very often there is a large intersection between the papers you cite today, and the ones you cited last year. So starting with bibtex files from last year is a good idea. I find this appealing in theory, but haven’t really been very successful in implementing it. I find it helps a little, but not greatly so.
Related is the idea of sharing mega bib files that cover the whole field. Maybe I just haven’t found the right big bibtex file. Where do you get yours?
Also, when you take a bunch of these mega-bib files, there is still the problem of duplicate entries, which can translate to papers appearing twice in the references. Maybe there is a simple tool to de-dup! Additionally, these need to be kept current, as papers often go from @unpublished to @inproceedings to @article.


3. Mnemonics: To help with the map-mental-pointer-to-bibtex-tag, we often use bibtex tags that are close to our mental pointers: {LSBook} for the Lovasz Schrijver book, {LLR} for a paper with those initials. This works well if you always use your own bib file, but shared bibs may fail if we have different mental pointers, or different conventions. There is also a collision problem: BLR in complexity means something very different from BLR in privacy; adding a year often makes them harder to remember. What naming convention do you use?

4. Reference Management Software: apparently, there are software packages that will manage your bib files, and help share bibs. I briefly tried one, but did not find it sufficiently useful. Have you used one that actually makes you faster?

5. Autocomplete: I imagine it would help if we just wrote long descriptive tags, and used autocomplete features of the tex editors to find what we are looking for. It would be even nicer if the tex editor could do a smart autocomplete: I write \cite{ and start typing Lovasz, and it gives a drop down of the bibtex entries in my bib file that contain the string Lovasz in either author or title fields. If this happens, then the tags will be essentially irrelevant and using large shared megabibs would be easy. Is there an emacs package or Winedt script to do that?
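Two of the wishes above, de-duplicating merged bib files (item 2) and looking up entries by author or title instead of by tag (item 5), are simple enough to sketch. The Python below is only an illustration under loud assumptions: the parser is a rough regex that handles braced or quoted field values without nested braces, duplicates are detected purely by normalized title, and the rule "prefer @article over @inproceedings over @unpublished" is my own guess at what "keeping entries current" should mean. All function names and sample entries are hypothetical placeholders.

```python
import re

# Assumption: entries look like "@type{key," and fields like name = {value}
# or name = "value", with no nested braces inside the value.
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
FIELD_RE = re.compile(r"(\w+)\s*=\s*(?:\{([^{}]*)\}|\"([^\"]*)\")")

# Assumed preference order when two entries share a title.
RANK = {"unpublished": 0, "misc": 0, "inproceedings": 1, "article": 2}


def parse_bib(text):
    """Return a list of {'type', 'key', 'fields'} dicts (rough parser)."""
    starts = list(ENTRY_RE.finditer(text))
    entries = []
    for i, m in enumerate(starts):
        end = starts[i + 1].start() if i + 1 < len(starts) else len(text)
        body = text[m.end():end]
        fields = {name.lower(): (braced or quoted).strip()
                  for name, braced, quoted in FIELD_RE.findall(body)}
        entries.append({"type": m.group(1).lower(),
                        "key": m.group(2),
                        "fields": fields})
    return entries


def normalize_title(title):
    """Lowercase and collapse punctuation so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def dedup(entries):
    """Keep one entry per normalized title, preferring the higher-ranked type."""
    best = {}
    for e in entries:
        k = normalize_title(e["fields"].get("title", "")) or e["key"]
        if k not in best or RANK.get(e["type"], 1) > RANK.get(best[k]["type"], 1):
            best[k] = e
    return list(best.values())


def search(entries, needle):
    """Return keys of entries whose author or title contains the needle."""
    needle = needle.lower()
    return [e["key"] for e in entries
            if needle in e["fields"].get("author", "").lower()
            or needle in e["fields"].get("title", "").lower()]


# Placeholder entries, not real references.
SAMPLE = r"""
@inproceedings{conf-version,
  author = {Ada Example and Bob Sample},
  title = {A Placeholder Title},
  year = {1990}
}
@article{journal-version,
  author = {Ada Example and Bob Sample},
  title = {A placeholder title},
  year = {1991}
}
@article{other,
  author = {Carol Demo},
  title = {Something Else Entirely},
  year = {2000}
}
"""

if __name__ == "__main__":
    entries = dedup(parse_bib(SAMPLE))
    print([e["key"] for e in entries])
    print(search(entries, "example"))
```

An editor plugin could call something like `search` on whatever has been typed after \cite{ and show the hits in a drop-down; with that in place the tags themselves matter much less.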

I am sure there are other better solutions. How do you manage your *.bib files?

20 thoughts on “On BibTeX”

  1. I use Lyx to write my papers, and it has a very convenient interface to add citations – this solves the “mental pointer to tag” for me.

  2. Try the package reftex for emacs (which is a part of auctex – the standard latex package for emacs). The command reftex-citation (which can be accessed with the shortcut “C-c [”) lets you enter a regular expression and shows you all relevant papers.

  3. I use BibDesk. I keep multiple “mega” bibtex files split by (large) topic: optimization, data structures, topology, computational geometry, etc, each with several hundred entries. As soon as I find a paper I think might be useful later, I enter its BibTeX data. To cite a paper, I always copy from BibDesk (to make sure I get exactly the right paper(s)) and paste into TeXShop (which inserts the necessary \cite{…} macro).

    I am brutally completist and perfectionist about BibTeX entries, because I find sloppy bibliographies as unprofessional as spelling and grammar errors. **ALL** publisher- and index-supplied BibTeX entries have errors and gaps: Incorrect proceedings titles, poor capitalization, inconsistent journal/conference abbreviations, missing first names, missing accents on names, bad math formatting in titles, using – instead of — between page numbers, even misspelling the word “symposium”. BibDesk auto-completes author names, journals, and proceedings titles as I enter them (unless I’m entering them for the first time), which cuts down on my own inconsistencies.

    I inherited my citation key standard from the old computational geometry community bibliography project (which sadly went dormant about 15 years ago): aa-ttttt-yy, where aa are the last initials of ALL authors, ttttt are first letters of the first five non-stop-words in the title, and yy is the last two digits of the year (except I use all four digits for years before about 1920). This is more a loose mechanism for duplicate detection than a mnemonic, but because I cite by copy-paste I don’t really need a mnemonic.

    1. I’m with Jeff on this one — I basically use BibDesk and can sort things out on my own. I find the overhead of bib management is not nearly as time consuming as, say, the research itself, and actually the process of making sure the bib entries are correct/consistent is sort of a one-time cost.

      As far as mental reference management goes, I think it depends on how you write. I don’t through-compose much, so I end up writing and rewriting — it’s rare that a reference just ends up missing.

      And for citation keys, BibDesk has cut-and-paste so that’s pretty easy. I am not sure if JabRef is as nice to use on Windows.
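The aa-ttttt-yy key convention described in comment 3 is mechanical enough to automate. Below is a minimal Python sketch, not the community bibliography project's actual tool: the stop-word list, the lowercase output, and the assumption that authors are given as "First Last" strings are all my own choices, and `citation_key` is a hypothetical name.

```python
import re

# Assumed stop-word list; the original convention's list is not given in the thread.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "for", "and", "or",
              "to", "with", "by", "from", "at", "via"}


def citation_key(authors, title, year):
    """Build an aa-ttttt-yy style key.

    authors: list of "First Last" strings (assumption: last token is the surname).
    title:   paper title.
    year:    integer year; years before 1920 keep all four digits.
    """
    # aa: last-name initials of ALL authors.
    initials = "".join(a.split()[-1][0].lower() for a in authors)
    # ttttt: first letters of the first five non-stop-words in the title.
    words = [w for w in re.findall(r"[A-Za-z0-9]+", title.lower())
             if w not in STOP_WORDS]
    letters = "".join(w[0] for w in words[:5])
    # yy: last two digits of the year, or four digits for old papers.
    yy = str(year) if year < 1920 else f"{year % 100:02d}"
    return f"{initials}-{letters}-{yy}"


if __name__ == "__main__":
    # Placeholder data, not a real reference.
    print(citation_key(["Ada Example", "Bob Sample"],
                       "On the Complexity of Finding Short Paths in Large Planar Graphs",
                       2004))
```

Because the key is derived from the entry itself, two people generating keys for the same paper with the same stop-word list should get the same key, which is what makes it useful for spotting duplicates.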

  4. I think a service similar to citeulike could be useful, but I feel like their interface is kind of clunky and their database is not suited for the CS Theory community.

    Picture something like the following: You register an account on the site and configure some preferences, such as preferred \cite{} mnemonic. The website’s database scrapes results from several of the usual sites with citations (DBLP, ACM, Google Scholar, etc.). You can quickly search and filter based on paper name, author, conference, etc. You simply check a box to add it to your saved citations. When you’re done saving all the citations you need, you can export to a .bib file.

    This sounds like exactly what I’d want for dealing with bibliographies. Does something like this already exist?

    1. Apart from a .bib export, I’d also advocate for a .bbl directly (that is, just a list of \bibitem’s). For very short reports, having to add a .bib file, compile it and then extract the bibitems from the .bbl is a hassle.

  5. Great post. I find that reftex solves the “mental tag to pointer” problem for me as well, but the rest of it is still problematic.

    I’ve shared your frustration about this for a long time. In those situations, I find myself imagining what an ideal solution would look like. Below is one such fantasy that I’ve had, which to my mind solves two problems at once: maintaining a paper reading list, and maintaining the associated bibtex entries. Something like this could conceivably be built on top of something like Google Scholar or Bing academic search.

    Let’s say you find a paper you want to read on Google Scholar. You then click on a drop-down menu and select “add to reading list”, at which point you are given the option to apply gmail-style tags. Such tags could include things like “approximation algorithms”, “TSP”, “urgent”, “to review,” etc. This then automatically adds a pdf of the paper to, say, a Dropbox folder, and also updates a bibtex file in that same folder with a best guess for the associated bibtex entry. This Dropbox folder is synced to your PC, or any tablet or e-reader you use to read papers, and moreover includes a hierarchy of “soft links” that allow you to browse papers by tag. That way, those papers are available and suitably tagged the next time you sit down to read. Moreover, the up-to-date bibtex file of everything you have ever read is sitting there on your machine for use in your paper when needed. On top of all this, you could have a (web?) application that provides a gmail-like interface to your reading list, with tagging, untagging, archiving, and bibtex-editing functions.

    Something like the above seems totally achievable given sufficient will in the academic community. Obviously, with more thought I’m sure someone could think of better variants of what I described.

    1. Zotero provides almost all of these features: it installs as an add-on to Firefox, can automatically download papers, populates citation data from most of the popular sources (such as arxiv, ACM library, JSTOR, most Springer journals, etc.), gives you the option of tagging, adding notes etc. to the papers, and can later export citation data to bibtex and other formats.

  6. I find the bibtex entries I get from MathSciNet are often good, though you’ll need to be a member to get them. There’s even a command-line script bibweb that you may enjoy, if that’s how you roll.

    Another set of tools I’ve occasionally found useful: bibtools. In particular the aux2bib script takes, say, large.bib and paper.aux, and outputs small.bib that contains only those bib entries cited in the paper. (Maybe this is what Clement was looking for?)
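The large.bib plus paper.aux to small.bib idea in comment 6 can be sketched in a few lines. This is not the actual aux2bib script, just a rough Python illustration of the same idea. It assumes the .aux file was produced by a classic latex + bibtex run, so that cited keys appear in `\citation{...}` lines, and it ignores @string definitions, crossref fields, and `\nocite{*}`. The function names are hypothetical, and the example works on strings rather than files to stay self-contained.

```python
import re

CITATION_RE = re.compile(r"\\citation\{([^}]*)\}")
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")


def cited_keys(aux_text):
    """Collect every key mentioned in \\citation{...} lines of an .aux file."""
    keys = set()
    for group in CITATION_RE.findall(aux_text):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    return keys


def extract_cited(bib_text, aux_text):
    """Return the text of only those bib entries whose key is cited."""
    wanted = cited_keys(aux_text)
    starts = list(ENTRY_RE.finditer(bib_text))
    kept = []
    for i, m in enumerate(starts):
        # An entry is assumed to run until the next "@type{key," or end of file.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(bib_text)
        if m.group(2) in wanted:
            kept.append(bib_text[m.start():end].strip())
    return "\n\n".join(kept) + ("\n" if kept else "")


# Placeholder inputs, not real references.
AUX = r"""
\relax
\citation{alpha,beta}
\bibstyle{plain}
\bibdata{large}
\citation{alpha}
"""

BIB = r"""
@article{alpha,
  title = {First Placeholder}
}
@book{beta,
  title = {Second Placeholder}
}
@article{gamma,
  title = {Never Cited}
}
"""

if __name__ == "__main__":
    print(extract_cited(BIB, AUX))
```

For real use one would read the two files from disk and write the result to small.bib; a proper BibTeX parser would also be safer than the regex used here.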

  7. Reftex works reasonably well for me and I’d expect that some reference management software would make handling bib files somewhat easier. But the main issue is what to do if you write a paper with coauthors. Everyone has their own big bib file, naming conventions, possibly reference management software etc. What software gives you a nice workflow when working with multiple coauthors, possibly a different set of coauthors in each project? The whole discussion should be in this context.

  8. 1. I mainly use Google Scholar, but it always needs some cleaning. I also check on the publishers’ pages but the same is true about cleaning…

    2. I only have such a large database for books; I never managed to get such a thing for papers. I guess the evolution from @unpublished to @inproceedings and @article makes it harder to maintain. For books, I have clean bibtex entries, and a .bib file of bibtex entries that need cleaning. When I have time, I clean my toClean.bib file. Maybe this could also work for papers?

    4. BibDesk seems a very nice solution from what I heard. Unfortunately, it is only available to Mac users, too bad!

    1. 3. For Mnemonics, I used the FSTYY convention but as stated, there are a lot of conflicts then. Thus I switched to FiSeThiYY to reduce the number of these conflicts. Of course, there still are conflicts and I should maybe add some letters from the title…

  9. On Windows, Jabref is okay for managing a .bib file. Not outstanding, but definitely better than nothing. I also thumbs-up MathSciNet for getting .bib entries.

  10. I use ultratex mode for emacs. It does all the auto-completion you could want. As soon as I type \ref{, it starts autocompleting on the bib file for that paper. It also auto-completes on names of theorems, lemmas, etc.

  11. I use Mendeley in general to keep track of papers in my field, and it comes in handy for reference management as well. It has an option to maintain a complete bib file of all added papers, and provides citation keys that are easy to find for any paper in your database.

  12. I keep all my papers in Zotero, since I can just drop PDFs on it and it will get the bibtex information automatically (no idea how, though! must be based on the DOI). When I want to cite a paper I just find it in Zotero, and drag it to the text editor where I edit the bib file.

    Usually the “mental pointer” effort is very small, since I only need to remember a keyword or author name to find something in my library.

  13. It seems to me that keeping track of personalized citation databases should be a thing of the past.

    There is something I always imagined should exist, but of course am too lazy/incompetent to do so myself –

    this would be a script that would automatically search DBLP/Google Scholar/MS Academic Search etc. and find the bibitems for you. Ideally it would find the best version of them, and even more ideally it would get both the journal version and conference version, and then cite the journal version adding a note such as “preliminary version in FOCS 86” (and add a Prelimyear = “1986” field so that smarter bibstyles could have the key look like ABC86 instead of ABC99 etc.)

    What you write in your document would be a cite with some keywords you remember about the paper e.g.

    \cite{Goldwasser-Micali-Rackoff-198?}

    \cite{Goldreich-Mental-Game}

    etc., and it would find the best match, or if there are 2-3 matches that are pretty good then it would find them all.

    The script would create the bibfile for you and modify the \cite{..} command to cite that bibfile in some standardized form.

    Any thoughts? volunteers?

    1. Inserting a citation should be more interactive. When you want to insert a citation,
      what you really need is an interactive tool that allows you to search for references
      (from your local database or DBLP etc.) For example, if you remember only one
      coauthor and that the paper was in FOCS, then you want to see all the matches and
      select the correct paper. A script trying to do this automatically would probably do more
      harm than good.

      With emacs+reftex, you have an interactive tool for searching the .bib file of the current
      paper, select one or more references, and insert the \cite command. Of course, this
      assumes that the entry is already in the .bib file. If anyone cares, here is my usual workflow
      for citations.

      1. The reference was already cited in the paper (already in the .bib file). Then I simply
      use reftex: ctrl-c [ goldwasser, up-down arrows to select the correct entry of Goldwasser,
      enter, done.

      2. The reference appears in the big .bib file that I used to maintain, but not anymore.
      I open this big file (I have a shortcut for that), find the entry, select and copy it,
      open the .bib file of the current paper (I have a shortcut for that), paste the entry,
      go to 1.

      3. Reference does not appear in the big .bib file. I search for “dblp goldwasser” in a
      browser, find the reference I’m looking for, copy the bibtex entry, open the .bib file of
      the current paper (I have a shortcut for that), paste the entry, go to 1.

      A better tool could streamline this process a bit by making the online search and
      appending to the .bib file more automatic, perhaps managing a big .bib file
      automatically etc. Searching only within the entries already cited in the paper
      (as in 1 above) or within the entries I have ever cited (as in 2 above) seem to
      be useful features that a good tool should provide.

      1. I usually add the bibs only at the end, close to submission/posting time, so usually just write \cite{..} with something that would remind me of the paper.

        In any case, if you have ambiguous \cite{..} that identify more than one paper, the script could also simply add all potentially matching references to the bib file, and then you would always be in step 1.

        (An advantage of a simple file based script is that it would work for all platforms and editors.)
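The keyword-style \cite{..} idea from comment 13 (hyphen-separated hints, with ? standing for an unknown character, and "add all potentially matching references" when the hint is ambiguous) can be prototyped against a local list of entries before worrying about online search. The Python below is only a sketch of the matching step under my own assumptions: entries are already parsed into dicts with author, title, and year, every hint must appear somewhere in those fields, and hints themselves contain no hyphens. `match_cite` and the sample data are hypothetical.

```python
import re


def token_pattern(token):
    """Turn one hint into a regex: '?' matches any single character,
    everything else is matched literally, case-insensitively."""
    parts = ["." if c == "?" else re.escape(c) for c in token]
    return re.compile("".join(parts), re.IGNORECASE)


def match_cite(placeholder, entries):
    """Return the keys of all entries matching every hyphen-separated hint."""
    tokens = [t for t in placeholder.split("-") if t]
    hits = []
    for e in entries:
        haystack = " ".join([e.get("author", ""),
                             e.get("title", ""),
                             str(e.get("year", ""))])
        if all(token_pattern(t).search(haystack) for t in tokens):
            hits.append(e["key"])
    return hits


# Placeholder entries, not real references.
ENTRIES = [
    {"key": "ES90", "author": "Ada Example and Bob Sample",
     "title": "A Placeholder Title", "year": 1990},
    {"key": "ES91", "author": "Ada Example and Bob Sample",
     "title": "A Placeholder Title, Revisited", "year": 1991},
    {"key": "D00", "author": "Carol Demo",
     "title": "Something Else Entirely", "year": 2000},
]

if __name__ == "__main__":
    # An ambiguous hint returns every candidate, so all of them can be
    # added to the bib file and the author picks one later.
    print(match_cite("Example-Sample-199?", ENTRIES))
    print(match_cite("Example-Revisited", ENTRIES))
```

A full version would still need the online lookup and the rewriting of the \cite{..} commands in the tex source, which are the harder parts and are not attempted here.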
