According to some posts I saw there is a paper on Mastodon that scraped public posts. Anyone know what it is called or where to find it?

It's also nice that all the conclusions in the paper are wrong because they start with a mistaken premise that content warnings mean that a post is "inappropriate".

@jaranta Thanks for the actual paper link, let's see it. :P I'll look at the ethics once I see it, but that title and its "inappropriate" is hilarious in itself from a methods perspective - this is why you don't rely on automated data collection without actually having a living human enter the context and look around first.

@werekat It's as if you need to understand the topic you're researching.

@jaranta Yes. I only read the abstract but this is painfully clear.

@jaranta @socrates Are any of the paper authors active Mastodon users on or any other instance?

@drbjork @jaranta Not that I'm aware of

And based on how well they seem to have understood CWs on the Fediverse, I doubt that any of them used Mastodon at all, ever

@socrates @jaranta As far as I understand it is a conference paper. I really hope it wouldn't have passed peer review...

@jaranta They anonymized the data, right ... and those were public posts.

What am I missing?

@qcat Textual data can't really be anonymised: as long as the original wording is kept, the original post can still be found afterwards. There are methodological work-arounds that are used in internet research, but these computer scientists probably did not know about them.

@jaranta Yeah, they said they anonymised the user data, but if nobody knows who posted what, then how relevant is the content? Also, I don't see them using any explicit text in their analysis. Of course if they keep that dataset lying around (or even worse, the non-anonymised version) and provide it to third parties, that would indeed be problematic.

@qcat They published the dataset. It's since been taken down.

@jaranta Ah, ok. And the implication being that anyone can find the original posts, even without knowing the user and instance, by searching through the instance list they used. That's the concern, I guess?

