Over the past six months I have worked with Johannes Schafer from the University of Hildesheim on detecting offensive and toxic comments in machine-generated language. This is a protracted and nuanced task that should become an integral part of my PhD. So far we have been able to detect conversation 'Offence Direction', i.e. whether a conversation is becoming more or less offensive. This novel result uses a bespoke Reddit dataset, the code for which we have published on GitHub.
In 2020 we will publish a long paper on the details of this metric:
Future Paper Abstract
This paper outlines an ongoing corpus study and gradient-based search of conversations by offence probability direction, that is, whether they are becoming more or less offensive. The motivation is to retrieve examples of humans' conversational tactics against offence. Such a database would be a valuable resource for evaluating dialogue systems on the same task, where a central challenge is dealing with offensive input. This problem is compounded when working in ethically sensitive areas, where replies should be validated and context-appropriate. Literature from care and pedagogy advises professionals that when subjects become offensive, methods like expansion, clarification, and adaptation are necessary rather than warning alone. By handling the offence appropriately, the professional carer removes blame and opens the door for the subject to process their intention. However, these nuanced and delicate forms of language are challenging to implement in a dialogue system from conceptual literature alone. We use state-of-the-art offence classification from the recent OffensEval 2019 shared task (SemEval-2019 Task 6) and predict an offence direction from linear gradients of the classification probability.
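To make the idea of a linear gradient over classification probabilities concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not the method from the paper: it assumes each turn in a conversation has already been given an offence probability by some classifier, fits an ordinary least-squares line to those probabilities against the turn index, and reads the direction from the sign of the slope. The function names `offence_slope` and `offence_direction`, the `tolerance` threshold, and the three direction labels are all hypothetical.

```python
from typing import Sequence


def offence_slope(probabilities: Sequence[float]) -> float:
    """Least-squares slope of offence probability against turn index.

    `probabilities` is assumed to hold one classifier score per turn,
    in conversation order (an assumption made for this sketch).
    """
    n = len(probabilities)
    if n < 2:
        raise ValueError("need at least two turns to estimate a direction")
    mean_x = (n - 1) / 2  # mean of the turn indices 0..n-1
    mean_y = sum(probabilities) / n
    covariance = sum(
        (i - mean_x) * (p - mean_y) for i, p in enumerate(probabilities)
    )
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance


def offence_direction(probabilities: Sequence[float], tolerance: float = 0.01) -> str:
    """Label a conversation by the sign of its fitted slope.

    `tolerance` is a hypothetical dead zone so that near-flat
    conversations are not labelled as moving in either direction.
    """
    slope = offence_slope(probabilities)
    if slope > tolerance:
        return "more offensive"
    if slope < -tolerance:
        return "less offensive"
    return "stable"


if __name__ == "__main__":
    # Made-up scores for a four-turn conversation that escalates.
    print(offence_direction([0.1, 0.3, 0.5, 0.9]))
```

A single fitted slope is the simplest reading of "linear gradient"; a windowed or per-reply gradient would be an equally plausible variant, and the threshold would need tuning on real data.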