Same Author or Just Same Topic? Towards Content-Independent Style Representations

About

Linguistic style is an integral component of language. Recent advances in the development of style representations have increasingly used training objectives from authorship verification (AV): Do two texts have the same author? The assumption underlying the AV training task (same author approximates same writing style) enables self-supervised and, thus, extensive training. However, good performance on the AV task does not ensure good "general-purpose" style representations. For example, as the same author might typically write about certain topics, representations trained on AV might also encode content information instead of style alone. We introduce a variation of the AV training task that controls for content using conversation or domain labels. We evaluate whether known style dimensions are represented and preferred over content information through an original variation of the recently proposed STEL framework. We find that representations trained by controlling for conversation are better than representations trained with domain or no content control at representing style independently of content.
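To make the idea of content control concrete, the following is a minimal sketch of how training triples for a contrastive AV task could be sampled with conversation labels. It is not the authors' implementation. The sampling rule is an assumption for illustration: the different-author text is drawn from the same conversation as the anchor, and the same-author text from another conversation, so that topic alone cannot separate the two. The names `Utterance`, `Triple`, and `build_triples` are hypothetical.

```python
import random
from collections import defaultdict
from typing import List, NamedTuple


class Utterance(NamedTuple):
    author: str
    conversation: str  # content label; a domain label could be used instead
    text: str


class Triple(NamedTuple):
    anchor: Utterance
    same_author: Utterance
    different_author: Utterance


def build_triples(utterances: List[Utterance],
                  control_content: bool = True,
                  seed: int = 0) -> List[Triple]:
    """Sample (anchor, same-author, different-author) triples.

    Assumed rule, not taken from the paper: with control_content=True the
    different-author text shares the anchor's conversation and the
    same-author text comes from a different conversation.
    """
    rng = random.Random(seed)
    by_author = defaultdict(list)
    by_conversation = defaultdict(list)
    for u in utterances:
        by_author[u.author].append(u)
        by_conversation[u.conversation].append(u)

    triples = []
    for anchor in utterances:
        # Candidate positives: other texts by the anchor's author.
        positives = [u for u in by_author[anchor.author] if u != anchor]
        if control_content:
            positives = [u for u in positives
                         if u.conversation != anchor.conversation]
        # Candidate negatives: texts by other authors, optionally restricted
        # to the anchor's conversation.
        pool = by_conversation[anchor.conversation] if control_content else utterances
        negatives = [u for u in pool if u.author != anchor.author]
        if positives and negatives:
            triples.append(Triple(anchor, rng.choice(positives), rng.choice(negatives)))
    return triples
```

Under this assumed rule, a model that relies on topic would tend to prefer the different-author text, so only stylistic cues help it identify the same author. Anchors with no eligible positive or negative are skipped.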

Anna Wegmann, Marijn Schraagen, Dong Nguyen • 2022

Related benchmarks

Task	Dataset	Result
AI-generated text detection	M4	AUROC 81	41
Single-target AI-generated Text Detection	M4	AUROC@184	25
AI Text Detection	MAGE in-distribution (test)	AUROC 58	16
Authorship Verification	LDC Hard	AUC 87.2	6
Authorship Verification	BBN Base	AUC 76.4	6
Authorship Verification	BBN Harder	AUC 90.9	6
Authorship Verification	LDC Base	AUC 84.4	6
Authorship Verification	LDC Harder	AUC 0.894	6
Authorship Verification	BBN Hard	AUC 0.801	6
Style Representation Evaluation	STEL-or-Content Multilingual (averaged across languages)	Simplicity Score 23	5

Showing 10 of 25 rows
