Explainability / Interpretability

How to Improve DH Research

Thorsten Ries

DH Budapest 2019, 26 September 2019.




Start

  • I would like to thank you for the invitation and the introduction.
  • I will speak about Explainability and Interpretability, and about how exploring the theory around these concepts - from the perspectives of the philosophy of science, machine learning / AI, and the Humanities - may improve DH research.
  • In order to get there, and to show why we should explore this, I will take a relevant detour via a historical-theoretical reflection on the recent debate triggered by Prof. Nan Z. Da's article "The Computational Case against Computational Literary Studies", published in the Spring 2019 issue of "Critical Inquiry".
  • I think that the DH community has so far failed to respond to her intervention in the right way, choosing to react to the polemical tone and harsh critique, to dismiss her argument on the basis of a few errors, and in part to answer with unnecessary ad hominem remarks, instead of trying to learn from the good questions she asks and responding to them in a productive and balanced way. The most balanced responses so far came from Katherine Bode, who I understand is here today, and Fotis Jannidis, and I would like to continue the conversation in that spirit.

@nanzhda

Thank you.

Yes, you may tweet this.

Thank you:

  • I think Nan Z Da deserves a clear "Thank You" for giving the DH community the opportunity to grow in this discussion. I think she formulates very well the concerns and resistances out there, which are relevant for DH - either in substance or for the relationship that "DH" has with its "H" part.
  • I will not go into the specifics of the critique and the responses today, due to shortage of time.

TOC:

  • Instead, I will try to shed some light on the anatomy of the debate by taking a theoretical, science-historical perspective with the Hungarian philosopher of mathematics and science Imre Lakatos, and a few of his interpreters in the history of science community (Jürgen Mittelstraß, Wolfgang Stegmüller, and others), in order to understand the dynamics of the debate and the underlying theoretical and methodological integration problem between the Humanities, Literary Studies, and Digital Humanities, whose literary analysis branch Da refers to as CLS.
  • This historical-theoretical perspective will then enable me to come to the basic question of explainability and interpretability, which I believe lies at the core of this debate - not at the merely technical level, but at the level of the theory of science - a debate that is a symptom of frictions in the process of DH growing together with everyday Humanities practice.
  • I will then conclude with some remarks on what can be done technically - in the wake of AI and machine learning explainability research - on which tools we have to improve our research programme, but also on how our procedures and methods have to improve beyond the merely technical. I will illustrate the relevance with an example from my own research with Mike Kestemont and Gunther Martens on the authorship verification of Goethe's contributions to the Frankfurter gelehrte Anzeigen 1772/73.

Debate 1:

  • Nan Z Da's article was published 14 March 2019 in the Spring 2019 issue of "Critical Inquiry".
  • The sweeping and, frankly, in part polemical tone (this is not a judgement; there can be good, ultimately productive polemic) as well as the broad spectrum of address caught scholars' attention: "cultural analytics, literary data mining, quantitative formalism, literary text mining, computational textual analysis, computational criticism, algorithmic literary studies, social computing for literary studies, and computational literary studies - CLS"
  • She writes "the papers I study divide into no-result papers—those that haven’t statistically shown us anything—and papers that do produce results but that are wrong."
  • She could be sure people were listening, especially because her critique of CLS was aimed at the methods from a reproducibility and robustness point of view, and not from an ideological point of view blaming DH as a messenger of neoliberalism - a distinction Katherine Bode pointed out before.
  • The sharp critique of Computational Literary Studies - CLS - did not have to wait long for defensive and dismissive reactions from the side of the criticised and of those who felt their work was unjustly co-targeted by some rather sweeping generalisations made by Da.

Debate 2 ORDER:

  • In some cases, the reactions were quite personal and ad hominem, speaking of Da's attempted "takedown" of the field, being "suspicious" of her and insinuating that "Critical Inquiry" staged this event, or, in the case of one participant, casting doubt on her abilities in "basic math" (referring to a rule-of-thumb ballpark figure she mentions).
  • In general, the first phase of responses was quite emotional and mixed legitimate corrections - where obvious glitches, errors, and some, if one wants to be fussy, misconceptions had slipped into the article - with emotional remarks that could be seen as strategies aimed at collectively undermining the reputation of the (female) scholar who dared to criticise a whole field.
  • On the other hand, it is probably fair to say that Da fuelled this type of reaction by personalising the critique, for which Andrew Piper called her out legitimately, even if his blog post displays some emotional overreaction.
  • Reading through the debate, at times I missed someone like John Bercow, Speaker of the British House of Commons, in the room, whose voice restores ORDER by fiat. He will certainly be missed after October 31st.
  • Katherine Bode's contributions stood out as more balanced than others, I want to say, during this first phase of the discussion.
  • In the meantime, further contributions emerged, which offer much less "overexcited" (a favourite expression borrowed from Bercow) and more considered views of Da's article. I would especially like to highlight the blog posts by Fotis Jannidis and Chris Beausang.

Multiple layers

Takes aim especially at eight of the 15 articles checked.

Debate 3:

  • Nan Z. Da's critique addresses multiple layers of DH theory and methodology; it "works at the empirical level to isolate a series of technical problems, logical fallacies, and conceptual flaws"
  • In my view - and this seems to be the general perception - there are three layers she addresses, in an exemplary rather than a systematic way across the board:
  • Science and humanities theory
  • Methodology, concepts, tools
  • Data interpretation theory (and lack thereof)
  • This multiple-layer approach in itself became a point of criticism, as it may introduce inconsistencies, and her scepticism on the theoretical level seemed to preclude methodological amendments.

Key points theory

Debate 4:

  • On a science and humanities theoretical level, Nan Z. Da takes issue with
  • Reduction of complexity - at the cost of historical context and nuance. She does recognise, though, that this is part of the trade-off when doing things at scale.
  • She diagnoses overconfidence in the DH community in terms of filling the "evidence gap", and a failure to deliver on this.
  • Accuracy requirements and margins are relative - the argument about the 5% standard accuracy margin was itself based on a misunderstanding and lacked accuracy, which has been called out and publicly corrected, but it points to the fact that accuracy requirements themselves have to be defined and argued for.
  • "Exploratory" research mode as an excuse for imprecisions and misclassification.

Key points methodology, concepts, tools

Debate 5:

  • At the methodology, concepts and tools layer, Da mainly presents methodological issues she encountered when she attempted to reproduce results of the studies she looked at.
  • These include:
  • Neglect of important good practice (in some cases): bootstrapping and calibration - well, this is my interpretation of her accuracy trade-off argument - and built-in evaluation of the impact of upstream errors (OCR, coreference resolution, disambiguation, etc.); a minimal sketch of such a bootstrap check follows this list.
  • Robustness, or the lack thereof (specifically topic modeling, LDA) - Fotis Jannidis brilliantly agrees on this one and argues that exploratory tools, too, need to be robust.
  • Reproducibility, or the lack thereof (specifically topic modeling, LDA) - Actually, this point was an eye-opener for me, because it made me aware of how many things can go wrong when we try to reimplement a method or reuse the same tool on a different platform, with perhaps different underlying parameters, etc.
  • Methodological misconception of measures such as correlation, models, and entropy as valuable in themselves (without calibration).
  • Lack of clarity about the impact of parameters on the results (e.g. stopword lists, stemming), and lack of standards.
  • Tools and methods from other sectors with different accuracy requirements are repurposed for textual scholarly use ("true function"). They may also introduce margins of error that were acceptable in their original usage scenario (recommendation systems).
  • Especially this phrasing attracted quite some sneering criticism. I agree that it is clumsy, but the point that repurposing tools from a different domain, with a different purpose and even data type, that you buy into as a black box is complicated, even potentially risky, from a methodological and epistemological point of view, is indisputable.
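To make the bootstrapping point concrete, here is a minimal sketch in Python (my own illustration, not taken from any of the criticised studies; the function name and toy labels are hypothetical): instead of reporting a bare accuracy figure, one resamples the test set to attach a confidence interval to it.

import numpy as np

def bootstrap_accuracy_ci(y_true, y_pred, n_boot=10000, alpha=0.05, seed=42):
    # Percentile bootstrap: resample test items with replacement and
    # recompute accuracy, yielding a distribution instead of a point value.
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    scores = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)
        scores[b] = np.mean(y_true[idx] == y_pred[idx])
    lower, upper = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return float(np.mean(y_true == y_pred)), (float(lower), float(upper))

# Toy usage: report the accuracy together with its interval.
acc, (lo, hi) = bootstrap_accuracy_ci([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 0, 1])
print(f"accuracy = {acc:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")

On small humanities test sets the interval is often wide, which is exactly the information a bare accuracy score hides.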

Key points interpretation theory

Debate 6:

  • Da diagnoses:
  • A tendency to interpret non-results as results and failure as "beta", and to use this as a reason to apply for more funding.
  • A lack of theory and of valid concepts to read and interpret data, patterns, and models in CLS and literary criticism. Examples: A. Piper's references to structuralist, poststructuralist and cultural-historical theoretical terms when analysing data; the use of the concept of "influence" in the article on Goethe's Werther.

Debate 7:

  • Due to shortage of time, it is impossible to summarise the whole debate around Nan Z. Da's article and give all responses their due credit.
  • I would like to highlight, though, a few core lines of criticism and aspects of the anatomy of this discussion.

Lines of Response

Debate 8:

  • In general, some aspects seem to have been widely accepted, especially the problem of reproducibility and robustness of methods. Only a few respondents (e.g. Piper) dispute the usefulness of the reproducibility requirement (pointing to the example of psychology) and criticised the fact that Nan Z. Da's experiment was conducted on an exemplary, in-depth basis instead of as a field-wide study (Piper: "confirmation bias"; Jannidis).
  • Jannidis also notes a North American / Canadian bias in the selection of studies.
  • While Underwood and others contested the accuracy and robustness requirement for exploratory methods in CLS/DH as stated by Da, suggesting she did not understand this, Jannidis reaffirms Da's point about robustness. By the way, Jannidis uses topic-modeling-type methods in his own studies for genre detection, which has further methodological consequences, and therefore he indeed relies on robustness here.
  • Overgeneralisation - Da focused on data- and word-count-driven CLS; her notion of CLS is very narrow, as Bode notes, ignoring other fields, yet Da seems to criticise them as well with the same analysis. Bode has pointed out that in this way Da blocks improvement and necessary debates, also by interlocking the levels of her argument, and that she picked the rather methodologically rigid word-count field and its paradigm studies (which happened to provide code and data for reproduction tests), ignoring the whole rest of the diverse field of computational literary studies.
  • Occasional errors and misrepresentations of methods used in the studies (Underwood, Piper; Da corrected publicly), e.g. where she diagnosed that Underwood had compared the wrong values to one another.
  • Some respondents suggested a conceptual misunderstanding of statistical methods in general and in practice, but gave little detail on what they meant.
  • Misunderstanding of altering the method or parameters for robustness testing (change of stopword lists, stemming) - but how else would that work? (A minimal sketch of such a test follows this list.)
  • Da has also been called out for applying a rigid (literary) understanding of "complexity" and of the value of methodologically controlled reduction thereof (Jannidis), as well as for disregarding the theoretical and practical difference between "counting words" in linguistic and forensic computational authorship attribution and stylometry.
  • People seem to agree that a theory and protocol of the DH/CLS research process is needed - Jannidis states this most clearly.
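As an illustration of the robustness-testing point, a minimal sketch in Python with scikit-learn (my own toy example; the corpus and all parameter choices are hypothetical): fit the same topic model under two preprocessing settings - here, two stopword lists - and measure how much the top-word profiles of the topics shift. Unstable topics under such small perturbations signal low robustness.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus standing in for a real one.
docs = [
    "the young hero reads a novel about love and sorrow",
    "love and sorrow drive the plot of the sentimental novel",
    "the court case turns on a letter and a legal argument",
    "a legal argument about the letter decides the court case",
]

def top_words(docs, stop_words, n_topics=2, n_top=5, seed=0):
    # Fit LDA under one preprocessing setting; return the top-word set per topic.
    vec = CountVectorizer(stop_words=stop_words)
    X = vec.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed).fit(X)
    vocab = vec.get_feature_names_out()
    return [frozenset(vocab[i] for i in topic.argsort()[-n_top:])
            for topic in lda.components_]

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Vary ONE parameter (the stopword list), hold everything else fixed.
run_a = top_words(docs, stop_words="english")
run_b = top_words(docs, stop_words=None)

# Best-match overlap between the two runs' topics: 1.0 = identical top words.
stability = [max(jaccard(ta, tb) for tb in run_b) for ta in run_a]
print("mean best-match topic overlap:", sum(stability) / len(stability))

If an interpretation survives only under one particular stopword list, it is the preprocessing, not the corpus, that is being interpreted.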

Anatomy of the Reactions

Debate 9:

  • We often find ambivalent arguments: partial refutations, especially on details, and respectful praise mixed with criticism. For instance, Jannidis writes: "Eight replication studies is indeed a lot of work, and I am certain the field will learn something in the long run from this endeavour. But as Da has pointed out, this is of no interest to her: "It is not a method paper" (Da 2019b). She wants to show to literary studies that CLS is and will be flawed. Her statements, based on this very small sample, condemn the whole field, without any limitations, in the strongest words."
  • In many cases, respondents refer the indicated problem to another subfield ("this is not a problem in my research"), or suggest that only minor changes need to be made to deal with it. This is also part of the overgeneralisation criticism.

Sounds familiar

Lakatos:

  • Enter this man.

Imre Lakatos (1922-74)

Lakatos:

  • Popper's "naive falsificationism" vs sophisticated methodological falsificationism.
  • Pro and against T.S. Kuhn: paradigm (revolution: incommensurability of paradigms, replacement); P. Feyerabend's "epistemological anarchism".
  • Methodology of scientific research programmes: a hard core of theoretical assumptions that cannot be abandoned or altered, protected with expendable auxiliary hypotheses / explanations; 'positive / negative heuristics'.
  • Pierre Duhem (Duhem-Quine thesis): one can always protect a theory from evidence by redirecting the criticism toward other theories or parts thereof; there is no such thing as an experimentum crucis, because evidence exists only together with an interpretation theory.
  • Progressive or degenerative problem shifts: defend the 'hard core', explain anomalies, produce new facts, predictions, additional explanations. This ability serves as a demarcation criterion against pseudoscience.
  • Externalist reading (social structures, groups, institutions embodying research programmes - resilience), "rational reconstruction" of the history of science, dialectic in Lakatos / Stegmüller (individualist reading); maybe also think of Luhmann (Zeidler, Benetka), L. Fleck ...
  • To me, the whole anatomy of the debate around this "attempted takedown" of a research subfield seems to be a perfect example of how a Lakatos-type research programme defends itself against anomalies in "riddle solving" and produces new avenues for finding new facts, predictions, new hypotheses and working paradigms. We can even find the analogies in the rhetoric, when Underwood responds to another article by Da in the Chronicle under the title "Dear Humanists: Fear Not the Digital Revolution" - triggering associations of sweeping paradigm change à la T.S. Kuhn.

SHAP:

  • The good news is that the field of CLS / DH seems to behave like a research programme, despite its inner divisions and fragmentation.
  • The not-so-good news is that the field of CLS / DH seems to behave like a research programme that has to defend itself with negative heuristics and has difficulties producing paradigms - in the Kuhn sense - that are compatible with the "H" part of "DH".
  • Some scholars have tried to tackle this by reviving legacy literary theory and methodology such as Moscow / Russian Formalism to make literature computable on the basis of literary theory.
  • I think, though, that the main problem we have - and the one whose solution would address many of the underlying issues that Nan Z. Da speaks about - has to do with black boxes.
  • We have to think about Explainability and Interpretability - data interpretation was a point made in the debate - but we also need to look at method Explainability and Interpretability, especially with the advent of AI and the propagating use of machine learning in DH research.
  • And this is where we should look to computer science again, because they work on the same problem; SHAP, introduced below, is one of their answers.
  • Kudos to Beatrice Fazi and David Berry / DH2019.
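For reference, the standard definition from the explainability literature (Lundberg & Lee 2017, building on Shapley 1953), which is not spelled out on the slides: SHAP explains an individual prediction by assigning each feature i its Shapley value, the feature's average marginal contribution over all subsets S of the remaining features F \ {i}:

    \phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!} \left[ f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) \right]

The \phi_i are additive: the model output equals a base value plus the sum of all feature contributions, which is what makes the per-text, per-feature explanations shown on the following slides possible.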

SHAP:

  • [Four slides of SHAP figures]

Goethe I:

  • Normally our results are really convincing.

Goethe II:

  • Sometimes there are doubtful cases, e.g. when a verified Goethe text is not recognised as such.

Goethe III:

  • In this case the text is very short, and there is a quotation in the middle.
  • The quotation might distort the Goethe style signal massively, precisely because the text is so short.
  • But which part of the text caused the relatively high Goethe score, and which part was responsible for it not crossing the attribution threshold?

SHAP:

  • Wouldn't it be great to ... know which features impacted the decision and weighed the score in one or the other direction? (A minimal sketch of what this could look like follows.)
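A minimal sketch in Python (my illustration, not our actual verification pipeline; the toy corpus, the TF-IDF / logistic regression model and all parameters are hypothetical stand-ins): train a simple word-frequency classifier and let the shap library attribute one disputed review's score to individual word features.

import numpy as np
import shap
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in corpus: label 1 = Goethe, 0 = other reviewers.
texts = [
    "toy review text in the manner of goethe on art and nature",
    "another toy goethe review on poetry genius and nature",
    "a toy review by a different hand on theology and doctrine",
    "a further toy review by another contributor on law and doctrine",
]
labels = np.array([1, 1, 0, 0])

vec = TfidfVectorizer()
X = vec.fit_transform(texts).toarray()
clf = LogisticRegression().fit(X, labels)

# Attribute the score of one (here: the first) text to its word features.
explainer = shap.LinearExplainer(clf, X)
phi = explainer.shap_values(X[:1])[0]   # one Shapley value per feature

# Rank features by signed contribution: positive values push the score
# towards "Goethe", negative ones pull it below the attribution threshold.
vocab = vec.get_feature_names_out()
for i in np.argsort(np.abs(phi))[::-1][:5]:
    print(f"{vocab[i]:>12s}  {phi[i]:+.4f}")

In a case like ours, a token-level explanation of this kind would let us test the hypothesis that the quotation in the middle of the short review depressed the Goethe score.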

Thank you for your attention!




Closing:

  • Thank you for your attention!