One of the things we have to do in meTypeset is capture parenthetical citations. These vary in style, but the following are good examples:
- Some text (Martin Eve, p. 45)
- Eve notes (345).
- Eve notes (p. 345)
- A great thing (Alex, P. 45)
- Here is one of them: (Silva, Rodrigues, Oliveira, & da F. Costa, 2013)
- Eve says (Eve 54)
- Eve says (Eve, 54)
However, there are often also cases where the parentheses serve a quite different purpose:
- As I Note (AIN)
- The Large Hadron Collider (LHC) is great (Eve, p. 3)
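How can we distinguish between acronyms, asides, and genuine parenthetical citations? The difficulty is compounded by authors whose names are conventionally written wholly or partly in lowercase, such as: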
- eden ahbez, American musician
- bill bissett, Canadian poet
- danah boyd, American scholar
- e e cummings, American poet
- mc chris, American rapper
- Arthur fforde, British solicitor
- Charles ffoulkes, British historian
- Rose ffrench, 1st Baroness ffrench, Irish peer
- Michael ffrench-O’Carroll, Irish politician
- brian d foy, American magazine publisher
- debbie tucker green, British playwright
- jack green, American critic
- dream hampton, American filmmaker
- bell hooks, American feminist
- k.d. lang, Canadian singer
- Conrad O’Brien-ffrench, British military officer
- ruth weiss, American writer
How can one distinguish between “This led to greater possibilities to resistance (hooks)” and “This led to greater possibilities to resistance (ohms)”? There is the additional challenge that checking for acronym-like behaviour is unlikely to succeed: authors make typos and similar slips that are easy to spot by eye but difficult to judge computationally across a wide range of situations.
What we’ve been doing so far is pulling out text between brackets that doesn’t look like a mathematical expression and then checking whether we can find a corresponding bibliographic entry. This is prone to bugs and doesn’t work very well. In practice, many documents end up with parenthetical asides that are erroneously wrapped in a ref tag but left unlinked, because we’re unable to parse the entry.
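To make the approach just described concrete, here is a minimal Python sketch. It is not the meTypeset code: the function names, the "looks like maths" test (any operator character) and the word-overlap lookup against the bibliography are all assumptions chosen for illustration.

```python
import re

# Innermost, non-nested parentheses only (an assumption for this sketch).
PAREN = re.compile(r"\(([^()]*)\)")
# Hypothetical heuristic: any operator character marks the text as maths.
MATH_CHARS = re.compile(r"[=+*/^<>]")
# Runs of letters (Unicode-aware), ignoring digits and underscores.
WORD = re.compile(r"[^\W\d_]+")


def candidate_parentheticals(text):
    """Return bracketed text that does not look like a mathematical expression."""
    found = []
    for match in PAREN.finditer(text):
        content = match.group(1).strip()
        if content and not MATH_CHARS.search(content):
            found.append(content)
    return found


def find_entry(candidate, bibliography):
    """Return the first bibliography string sharing a word with the candidate."""
    # Single letters are dropped so that the "p" of "p. 45" is ignored.
    words = [w.lower() for w in WORD.findall(candidate) if len(w) > 1]
    for entry in bibliography:
        entry_words = {w.lower() for w in WORD.findall(entry)}
        if any(w in entry_words for w in words):
            return entry
    return None
```

With a bibliography containing an entry for Eve, `find_entry("Eve, p. 3", ...)` returns that entry while `find_entry("LHC", ...)` returns `None`. The weakness is also visible: any aside that happens to share a word with an entry would be linked, and any citation whose entry could not be parsed stays unlinked.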
To begin to solve this problem more thoroughly, I wrote the following regular expression:
This does several things. It matches a set of parentheses that contain one of the following:
- A set of words, each capitalised on the first letter only, followed by a series of other words, numbers, or special characters.
- A set of numbers following an optional p. or P.
- A URL, so that links included within the brackets can also be matched.
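As an illustration only, and not the expression used in meTypeset, the three cases listed above could be approximated in Python as follows. The group names (`url`, `pages`, `names`) and the `classify` helper are hypothetical, and the URL pattern is deliberately crude.

```python
import re

CITATION = re.compile(
    r"""
    \(
    (?:
        (?P<url>https?://[^\s()]+)                        # a URL inside the brackets
      | (?P<pages>(?:[pP]\.\s*)?\d+)                      # numbers after an optional p. or P.
      | (?P<names>[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b[^()]*) # capitalised words, then anything
    )
    \)
    """,
    re.VERBOSE,
)


def classify(text):
    """Return (kind, content) for each parenthetical that matches one of the cases."""
    return [(m.lastgroup, m.group(m.lastgroup)) for m in CITATION.finditer(text)]


print(classify("The Large Hadron Collider (LHC) is great (Eve, p. 3)"))
# [('names', 'Eve, p. 3')]
print(classify("Eve notes (p. 345)"))
# [('pages', 'p. 345')]
print(classify("This led to greater possibilities to resistance (hooks)"))
# []
```

Acronyms such as (LHC) fall through because no lowercase letter follows the initial capital. The sketch still accepts any capitalised aside, and it rejects lowercase names such as hooks, which is exactly the gap discussed next.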
The next thing to do, I think, is to embed the above authors as special cases that we know may occur and that must be handled separately. There is, so far as I can see, no easy way to determine whether we’re referring to bell hooks or to some other arbitrary lowercase word without her name being explicitly encoded. I had thought about whether we could do some keyword scanning or topic modelling of documents to determine likely authors and parenthetical asides, but this gets very complicated very quickly.
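A rough sketch of what such special-casing might look like, again in Python and again purely hypothetical: the set below holds only a few of the names listed earlier, and `is_known_lowercase_author` is an invented helper, not part of meTypeset.

```python
import re

# A subset of the lowercase-styled names listed above (illustrative only).
KNOWN_LOWERCASE_AUTHORS = {
    "eden ahbez",
    "bill bissett",
    "danah boyd",
    "e e cummings",
    "bell hooks",
    "k.d. lang",
    "ruth weiss",
}
# Index by surname so that "(hooks)" is recognised as well as "(bell hooks)".
KNOWN_SURNAMES = {name.split()[-1] for name in KNOWN_LOWERCASE_AUTHORS}


def is_known_lowercase_author(parenthetical):
    """True if the text before the first comma or digit is a known lowercase name."""
    content = parenthetical.strip("() ")
    head = re.split(r"[,\d]", content, maxsplit=1)[0].strip()
    return head in KNOWN_LOWERCASE_AUTHORS or head in KNOWN_SURNAMES


print(is_known_lowercase_author("(hooks)"))         # True
print(is_known_lowercase_author("(hooks, p. 12)"))  # True
print(is_known_lowercase_author("(ohms)"))          # False
```

A list like this only helps with names that have been encoded in advance, and surnames that are also ordinary words (green, for instance) remain ambiguous without a check against the bibliography.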
If anyone has any better ideas for how to distinguish between various use cases/false positives here, please do open an issue on the meTypeset page.