Towards an XML-free future for the digital humanities

I was pleasantly surprised that my talk on the AustESE (Australian scholarly editing) infrastructure went down so well with the audience. Less surprising perhaps was their negative reaction to my suggestion that there might be a life for the digital humanities outside of XML. XML has for a long time (since 1998 at least) defined what the digital humanities were all about, and so cultivating an alternative that would overcome its fundamental limitations may indeed seem like heresy. Not only does practically every tool in DH depend on XML (the TEI Guidelines, XSLT, XQuery, XPath, Oxygen, etc.), but the skills of digital humanists are also based on those same technologies. To suggest that XML may not be the way forward seems to imply two unpalatable consequences:

  1. all the texts we have encoded so far may have to be redone
  2. all the tools we have developed on top of XML would have to be thrown away

This seems crazy, as well as heretical. But let me explain why I think it is not.

In answer to the first objection, a fully-featured import facility would overcome any fears that encodings would have to be revised. The ability to ‘round-trip’ the data back to XML (albeit with some loss) would also quell fears of ‘lock-in’ to a possibly unstable alternative.

In answer to the second objection, the skills of digital humanists and all other technicians evolve continuously. We are at the mercy of the software industry, and learn whatever tools they offer us to do our work. What I am suggesting is that we instead devise our own tools to do our specialised job far better. As an added bonus such a suite of tools would be under our control and not subject to commercial whims.

The industrial future of XML

XML was created by the W3C with help from Microsoft, who saw it as a way of implementing web services. Messages would be passed from the client to the server about actions that the service could or would perform. Since then, the ‘bloated, opaque and insane complexity’ of XML web services, as Tim Bray put it, has led many technologists to reject them in favour of a simpler, noun-based methodology called REST. REST in a nutshell treats a service as if it were composed of static web pages. ‘Get me this’, ‘here, have that’, ‘delete this’ and so on is what REST services are all about. Although originally designed to work with XML, REST services are increasingly being crafted with pure JSON, a much simpler encoding strategy that is gaining some powerful advocates. How much longer programmers will support XML remains unknown; it’s very deeply entrenched. But that they will eventually replace it with something simpler can hardly be in doubt. And when they do, the tools on which we rely will cease to be maintained and will soon die. With Microsoft rapidly moving its desktop metaphor towards a predominantly mobile one based on JSON, HTML5 and JavaScript, there seems no room for old-style ‘enterprisey’ XML in a future that is rushing towards us.
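
To make the REST style concrete, here is a minimal sketch; the endpoint and payload are invented for illustration, and nothing here is specific to any real API:

```python
import json
import urllib.request

# REST addresses a service like static web pages: nouns in the URI,
# verbs in the HTTP method. The endpoint below is hypothetical.
base = "https://example.org/editions/42"

get_req = urllib.request.Request(base, method="GET")        # 'get me this'
put_req = urllib.request.Request(                           # 'here, have that'
    base,
    method="PUT",
    data=json.dumps({"title": "Revised edition"}).encode(),
    headers={"Content-Type": "application/json"},
)
del_req = urllib.request.Request(base, method="DELETE")     # 'delete this'
print(get_req.method, put_req.method, del_req.method)
```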


About Desmond Schmidt

I was originally trained as a classicist, working in Ancient Greek papyrology and epigraphy, but also as a software engineer. I have worked on several digital humanities projects, including an edition of Wittgenstein. I currently work in Information Security and in my spare time I try to contribute to various projects in the DH field. I am working on ways to develop innovative representations of variant versions and markup.

10 thoughts on “Towards an XML-free future for the digital humanities”

  1. Hi Desmond!

    I was at your session (I asked the question about “round-tripping” from XML – I was thinking of TEI XML, in particular). A question I didn’t ask at the time was about what other (“native”) serialization formats are supported, but I was assuming (from having read about the project somewhere else) that the markup is modelled as RDF, using URIs to refer into the base text, and hence that it could be interchanged in whatever RDF serialization. Is that true?
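
    To make my question concrete, here is a rough sketch of what I imagine that might look like, using Python’s rdflib; the vocabulary and URIs are just illustrative guesses on my part, not AustESE’s actual model:

    ```python
    from rdflib import Graph, Literal, Namespace, URIRef

    # An annotation whose target is a character range in a base text,
    # addressed by URI. Vocabulary and URIs are purely illustrative.
    EX = Namespace("http://example.org/terms/")
    g = Graph()
    ann = URIRef("http://example.org/annotations/1")
    g.add((ann, EX.property, Literal("emphasis")))
    g.add((ann, EX.target, URIRef("http://example.org/texts/7#char=120,135")))
    print(g.serialize(format="turtle"))
    ```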

    I think the AustESE project is fascinating, and I look forward to having a play with the tools as they come on stream. The use of stand-off semantic markup as an alternative method to embedding XML markup (as in TEI) has a lot to be said for it, but I do question whether it should be seen as a negation of the value of TEI and XML, rather than an addition to it. It seems to me that as a strategic matter, it would be valuable for AustESE to try to interoperate with TEI XML as far as possible, including being able to produce TEI not just as an emergency escape-route or hedge against getting “locked-in”, but precisely in order to facilitate the use of XML-oriented tools with data produced using AustESE. It may well be that at some stage in the future, such stand-off tools predominate over embedded markup, but it seems to me that in the meantime the best path to getting to that point would involve as great a degree of interoperability as possible, so that textual scholars can begin to see some benefits of the stand-off approach in an incremental way, without requiring major disruption to their tool-set.

    I am a bit flummoxed by the anti-XML spin in the post above. To be honest I feel it rather detracts from the really positive aspects of the work you presented on the day. Obviously there are issues around the use of XML in DH, but I don’t think this post deals with those issues at all adequately. I appreciate the value of a bit of iconoclastic provocation, but in this case I think your characterisation of XML here is a bit of a straw man. I think the relationship of XML to SOAP is irrelevant to textual scholarship, and I think leaving out the SGML pre-history of XML gives the mistaken impression that it was entirely the creation of software industry giants with an agenda unrelated to dealing with text. Similarly, the bit about JSON I think is a red herring. While it’s true that JSON is simpler than XML, it would be a mistake to assume that this simplicity will lead to JSON replacing XML entirely. In fact the complexities of XML have a lot to do with making it usable for large-scale distributed systems. JSON is adequate for dealing with data within quite restricted contexts (e.g. within one particular “API”), but it has no real support for interoperable semantics (because it lacks the equivalent of XML’s namespaces), which makes it inadequate to encode a distributed web of knowledge. Where people have used JSON in this way, they’ve had to layer semantics on top of it, as in JSON-LD (see the sketch below).
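
    To illustrate: the @context block in JSON-LD is what supplies the interoperable semantics that bare JSON keys lack, doing roughly the job XML namespaces do for element names (the Dublin Core URI is just one example of such a mapping):

    ```python
    import json

    # Plain JSON: the key "title" has no globally agreed meaning.
    plain = {"title": "The Man from Snowy River"}

    # JSON-LD layers semantics on top: @context maps keys to URIs.
    json_ld = {
        "@context": {"title": "http://purl.org/dc/terms/title"},
        "title": "The Man from Snowy River",
    }
    print(json.dumps(json_ld, indent=2))
    ```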

    More to the point, there’s nothing in XML which prevents it being used to model non-hierarchical structures in a stand-off way; isn’t that what RDF/XML and XML Topic Maps do, for instance? Even in rather hierarchical markup languages such as TEI there are non-hierarchical parts such as Feature Structures and Critical Apparatuses. That’s why I think that dissing “XML” as such is painting with too broad a brush.

    Cheers!

    Con

    • Hi Conal,
      long post! I’ll try for a brief response, especially as I have answered most of it in my two responses below.
      The RDF bit refers to annotation. This is implemented separately via the LORE tool, and in my brief talk I didn’t mention that. The technology I was referring to is standoff properties, which have a basis in LMNL and are also being implemented in tools like eComma and CATMA. The idea of using RDF as a way to literally mark up texts for everyday use has been proposed by Di Iorio, but I think it is too heavyweight for the task.
      Let me clarify the TEI-XML vs standoff properties argument. At the moment I see TEI as a useful input format for AustESE, and we can also export back to XML for those who want to use it. I agree that we must meet user requirements whatever they are. So I’m not opposed to people continuing to use XML in whatever way suits them. But in the long run I think it has too many drawbacks to survive as a permanent format for DH data. By the way, TEI XML isn’t interoperable, only XML is. You should read Syd Bauman’s piece in Balisage 2011. He was co-editor of the TEI P5 guidelines.
      “I don’t think this post deals with those issues at all adequately”. No, it doesn’t, and couldn’t. I’ve already explained at length elsewhere why I think XML isn’t a good choice for digital humanists, most recently in Historical Social Research 37.3 (2012), 125-146.

      • Addendum: we did indeed start off using XML as a format for standoff properties. Later nearly everyone asked for JSON, so we switched. I don’t think the serialization really matters for a standoff format; CATMA, for example, defines its properties in XML. My objections are confined to the original embedded form of XML. Sorry to be unclear.

  2. @desmond.schmidt: I think you’re mixing paradigms and misrepresenting what James Clark (and subsequently Norm Walsh) said in his post. JClark is describing the fact that some of the interests present when XML was created came from this message-centric world; however, it was devised primarily as a way to encode content. XML must be taken in this light first (in a programming-language-agnostic light). XML is designed to encode human content with human context, and to track both for all time. In this context XML is very unlikely to be matched soon. Your post argues that because there are downsides to using XML for messaging, it should be thrown out for content encoding; that is not a proof, and in many ways it misses the point for the humanities, which is that you still need the human context with the content. A format like JSON loses this context and thus misses the main use case for the humanities.

    If your problem is tools for editing content, then this can be handled by tools like oXygen Author (http://www.oxygenxml.com/), Serna (http://www.syntext.com/products/serna/), etc. These tools can be made to get closer to the feel of traditional editing paradigms while respecting the need to encode context.

    • Hi Paul,
      in addition to what I say below in reply to Michael’s post, I don’t think I misrepresented what James Clark said all that much. What he said was: “JSON shines as a programming language-independent representation of typical programming language data structures. This is an incredibly important use case and it would be hard to overstate how appallingly bad XML is for this…. Microsoft was certainly pushing XML as a representation for exactly this kind of data. Consider SOAP and XML Schema; a lot of the hype about XML and a lot of the specs built on top of XML for many years were focused on using XML for exactly this sort of thing.” For comparison, look at what he said four years ago on the same topic. Quite a change: at that point he wasn’t contemplating JSON replacing XML any time soon, but in the last paragraph of his 2010 post (already two years ago) he was thinking exactly that: “In the longer term, I think the challenge is how to use our collective experience from building the XML stack to create technologies that work natively with HTML, JSON and JavaScript.” It’s not over for XML yet by a long shot, but I think in a couple of years we are going to see a major decline in its use in enterprises, one that is already gathering pace. As for mixed content, industry doesn’t need XML (although digital humanists might). HTML will do just fine.

    • I’m going to reply again because I realise I didn’t answer the ‘user-friendly XML editor’ argument. I think this is an interesting point, as many DH projects are trying, or have tried, to build TEI-XML-friendly editors (e.g. Son of Suda), but they don’t seem to get much beyond the prototype stage. The problem for DH texts, as I see it, is threefold:
      1) many of the XML constructs they use (e.g. alternatives like sic/corr, abbrev/expan, choice, app, subst, add, del, etc., and linking between elements, which is widely used in TEI) can’t be mapped to visual formats to achieve an effective ‘tagless’ display (see the sketch after this list). You either have to leave out these more complex elements, including their contents and attributes, or you get unreadable nonsense. So the editor of the text ends up interacting mostly with naked XML.
      2) humanists don’t want to learn a specific encoding scheme. In our experience of training people to do so, we found that no two people encode the same original analog document in the same way. That’s not useful as a way of gathering consistent input to a program.
      3) What we are mostly building these days are web applications. The tools you mention are both desktop applications. We couldn’t find a web-based plugin for context-sensitive XML editing that worked in the browser. Ideally it would have to come down the wire with the page and not require installation. But that’s what you’d need if you allow people to edit the source of an XML document: if they made a mistake, you’d have to handle it right there, not on the server, if you want a good user interface. So basically my objection is that it requires too much technical knowledge on the part of the user to be truly user-friendly.
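
      To illustrate point 1, here is a minimal (and entirely typical) TEI alternative; a ‘tagless’ display has to choose one reading, hide the other, and still let the editor reach both:

      ```python
      import xml.etree.ElementTree as ET

      # A TEI <choice> records both the source reading (<sic>) and an
      # editorial correction (<corr>); only one can be shown at a time
      # in a 'tagless' view.
      tei = "<choice><sic>recieve</sic><corr>receive</corr></choice>"
      elem = ET.fromstring(tei)
      print("source reading:", elem.findtext("sic"))
      print("corrected:", elem.findtext("corr"))
      ```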

  3. You’ve missed the point. JSON is good for exchanging data between programs because it faithfully models the kind of data structures that programs use. XML is good for exchanging documents between people because it faithfully models the kind of information that people use.

    • Hi Michael,
      I don’t think I missed the point. Of course there’s a lot more to this than fits in a small post. I think what you’re arguing is the ‘mixed content’ role for XML, as James Clark put it (first link in original post). I would agree with his objections to that argument: a) if programmers adopt JSON as the primary format for program data, this vital use case will eventually lead them to drop support for the maintenance of XML tools; b) mixed content used to be served by XML because of the desire to have one input and many outputs. Since there is now more or less only one output for mixed content, namely HTML, why encode as XML only to convert it to HTML? I think it is significant that the guy who invented the acronym XML, who wrote the XSLT standard and who was the technical lead in the development of XML now thinks along these lines.

      • Hi Desmond,
        I think you’re right that many of the transformations taking place are from XML to HTML. But first, exchanging information and passing it on to platforms (e.g. Europeana) requires XML-to-XML transformations, so your argument doesn’t fully hold. Second, if we consider (X)HTML to be a subset of XML, one has to reckon with HTML’s generic approach to marking up texts. Wouldn’t we lose something (or make processing more complex) if we turned from elements like listBibl or incipit to ul or section? Third, you argued against embedded markup in LLC; how would you replace that in HTML?

        • Hi Torsten,
          thanks for your comments. It’s great to see that people are interested in these questions.
          However, I don’t think I said that humanists would be as well served by HTML as XML. Far from it. I only said that industry has no need of XML for representing mixed content when they already have HTML.
          And I don’t think I would agree with your argument that just because everyone else uses XML we must do likewise. The trend is clearly towards dropping XML from web services. Soon those XML to XML transformations you mentioned may become XML to JSON or JSON to something else.
          In LLC I didn’t yet have a replacement for the light embedded markup we still used at that time. Since then I have developed a working model of “standoff properties”, which you can see working at http://austese.net/tests/. This is a totally XML-free service, though currently only for tests. We use plain text with sets of freely combinable properties, which can be added to or subtracted from the text in layers, since there is no syntax and no embedding (see the sketch below). There is a more detailed description of this proposal in Historical Social Research 37.3 (2012), “The Role of Markup in the Digital Humanities”, 125-146.
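
          A minimal sketch of the idea; the text, property names and offsets are invented for illustration, and the real service is of course more elaborate:

          ```python
          # Plain text plus freely combinable (name, offset, length) ranges,
          # kept in separate layers; no embedded syntax, so layers never collide.
          text = "The Man from Snowy River"

          base_layer = [("title", 0, len(text))]
          analysis_layer = [("place-name", 13, 11)]   # "Snowy River"

          # Layers can be added or subtracted without touching the text itself.
          for name, offset, length in base_layer + analysis_layer:
              print(f"{name!r} -> {text[offset:offset + length]!r}")
          ```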