The Seven Deadly Sins of PCG Research
At the Procedural Content Generation workshop held with FDG this year, I had the pleasure of being on a panel talking about evaluation methods in PCG, alongside Julian Togelius, Staffan Bjork, and Adam Smith. As part of preparing for the panel, I quickly put together a list of questionable claims that I frequently see in PCG papers (including my own!) as a starting point for talking about evaluation: what should we be evaluating? how do we compare our work? how do we know if we are getting better at PCG? what does it mean to get “better” at it? do we really need to be evaluating our work at all?
Since the panel, I have been thinking about these topics a lot more. My work from a few years ago on evaluating the expressive range of a content generator still interests me, but I think it just barely scratches the surface of what we need to be doing in evaluating PCG. And what we value in “evaluation” depends so much on our goals and our backgrounds. Are we scientists? designers? engineers? artists? a little from column A and a little from column B?
So, I’m still processing these topics (especially related to disciplinary identity). But in the meantime, here are my Seven Deadly Sins of PCG Research: Questionable Claims Edition, intended as a starting point for my thoughts (and hopefully others’) on what it means to evaluate our work, and why we’re even doing this research in the first place.
1. I’ve made a generator that can create an interesting and wide variety of content! Here is a single hand-picked picture (or, like, four) of a level it made, or a trace of a quest it can produce.
This is one of the most common forms of evaluation in PCG papers I’ve read, and it drives me completely nuts. It’s what I was trying to start to solve in the expressive range evaluation work (which, admittedly, has plenty of flaws too). Seeing a handful of examples of generated content does not help me understand the capabilities of a generator that can create thousands, millions, or infinitely many pieces of content. How meaningfully different are the pieces of content from each other? Are there certain kinds of content that simply cannot be created? Are there certain kinds of content that the generator seems to be biased towards creating? None of these important questions is answered by providing a single data point.
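To make the contrast with a handful of screenshots concrete, here is a minimal Python sketch of sampling many outputs and binning them by two metrics. Everything in it is an assumption for illustration: the toy generator, the `flatness` and `mean_height` metrics, and the bin count are hypothetical stand-ins, not the metrics or method from the expressive range work mentioned above.

```python
import random
from collections import Counter

MAX_HEIGHT = 9  # assumed tallest platform in the toy level format


def generate_level(rng, length=20):
    """Toy stand-in for a real generator: a random walk of platform heights."""
    heights = [5]
    for _ in range(length - 1):
        step = rng.choice([-1, 0, 0, 1])
        heights.append(min(MAX_HEIGHT, max(0, heights[-1] + step)))
    return heights


def flatness(level):
    """Hypothetical metric in [0, 1]: share of adjacent columns at equal height."""
    same = sum(1 for a, b in zip(level, level[1:]) if a == b)
    return same / (len(level) - 1)


def mean_height(level):
    """Hypothetical metric in [0, 1]: average height scaled by MAX_HEIGHT."""
    return sum(level) / (MAX_HEIGHT * len(level))


def expressive_range(generator, metric_x, metric_y, samples=1000, bins=5, seed=0):
    """Sample many levels and count how many land in each 2D metric bin."""
    rng = random.Random(seed)
    histogram = Counter()
    for _ in range(samples):
        level = generator(rng)
        # Clamp so a metric value of exactly 1.0 falls in the last bin.
        bx = min(bins - 1, int(metric_x(level) * bins))
        by = min(bins - 1, int(metric_y(level) * bins))
        histogram[(bx, by)] += 1
    return histogram


def print_histogram(histogram, bins=5):
    """Print counts as a grid: x is metric_x, y is metric_y (top row is highest)."""
    for by in reversed(range(bins)):
        print(" ".join(f"{histogram.get((bx, by), 0):4d}" for bx in range(bins)))


hist = expressive_range(generate_level, flatness, mean_height)
print_histogram(hist)
```

Empty cells in the grid hint at content the generator cannot or rarely does produce, and crowded cells hint at bias. The picture is only as informative as the metrics chosen, which is one of the flaws of this style of evaluation.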
2. I’ve made a generator that can create an interesting and wide variety of content! But I haven’t actually tested that it works in a game (or other appropriate form of media).
Now, I don’t think we all need to go out and make Galactic Arms Race for every game content generator we make (though that would be pretty amazing!), nor do I think every PCG system requires a user study of some playable experience for its evaluation. But I do think it’s crucial that, if we’re making content that should be experienced interactively by a player, we at least show that it can be used that way. All of our work requires abstracting certain aspects of the content we are creating, and without showing that a generator is capable of existing within a particular game context, it’s not actually creating game content yet. For generators that aren’t intended to create game content (e.g. Aaron Reed’s use of grammars to create incidental, context-setting story content for human-authored stories), it should be shown that the generator works within the context it was created for.
3. I’ve made a generator that can create an interesting and wide variety of content! And I am here to prove it to you by showing that it creates fun levels, for a depressing numerical definition of fun (bonus sin points if this is also the definition of fun you used during generation).
This is probably where I start to make some enemies in the PCG community, but I genuinely do not believe it is possible or even worthwhile to try to boil down “fun” to a number in [0, 1]. It’s too subjective a concept, even beyond player personality: there are too many uncontrollable factors that go into player enjoyment. Maybe I had an argument with a friend right before playing, or maybe I am more tired than usual, or maybe that sandwich I ate for lunch was just the best sandwich ever created and so everything is amazing right now.
Even beyond the subjectivity problem: what designer is ever going to ask for a level that scores 0.742 on the “fun” scale? What does it mean for a level I create to be 0.2183 fun and a level that you create to be 0.5312 fun? How is this meaningful feedback? Didn’t we both just fail? Do we always want 100% “fun” content? And please don’t rename “fun” to “engagement” and hope that it fixes everything.
And again, even beyond all these issues, if the content generator was created to optimize for a particular measure (be it “fun” or “engagement” or “frustration” or “how pink it is”), it is problematic to use that measure during generation and then produce an evaluation that shows the generator creates content that meets that measure. This is just a proof that the optimization algorithm works, not a proof that the generator creates satisfactory content.
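The circularity can be shown with a toy example. The Python sketch below is entirely hypothetical: `density_score` stands in for whatever measure a generator optimizes, `is_playable` stands in for an independent check, and `MAX_JUMP` is an assumed game constant. The generator hill-climbs on `density_score`, so scoring its output with `density_score` can only confirm that the hill-climbing worked. The independent check is the one whose result is not guaranteed by construction.

```python
import random

MAX_JUMP = 3  # assumed longest run of gap tiles a player can clear


def density_score(level):
    """Measure used during generation: 1.0 when exactly half the tiles are gaps."""
    return 1.0 - abs(sum(level) / len(level) - 0.5) * 2


def is_playable(level):
    """Independent check: no run of gaps (1s) longer than MAX_JUMP."""
    run = 0
    for tile in level:
        run = run + 1 if tile == 1 else 0
        if run > MAX_JUMP:
            return False
    return True


def random_level(rng, length=20):
    """Starting point: a mostly solid level with a few random gaps."""
    return [1 if rng.random() < 0.2 else 0 for _ in range(length)]


def optimize(level, rng, steps=200):
    """Hill-climb on density_score by flipping one tile at a time."""
    best = list(level)
    for _ in range(steps):
        candidate = list(best)
        i = rng.randrange(len(candidate))
        candidate[i] = 1 - candidate[i]
        # Keep the flip only if the optimized measure does not get worse.
        if density_score(candidate) >= density_score(best):
            best = candidate
    return best


rng = random.Random(0)
starts = [random_level(rng) for _ in range(200)]
levels = [optimize(start, rng) for start in starts]

mean_score = sum(density_score(lv) for lv in levels) / len(levels)
playable_rate = sum(is_playable(lv) for lv in levels) / len(levels)
print(f"mean optimized measure: {mean_score:.2f}")
print(f"independently checked playable rate: {playable_rate:.2f}")
```

Reporting only the first number says something about the search procedure. The second number, or any evaluation that was not also the optimization target, is what says something about the content.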
4. I’ve made a generator that can create an interesting and wide variety of content! Other generators have also tried to create content like this, but I’m not going to tell you why mine is better or in what situations it has strengths/weaknesses.
We have a fascinatingly large number of level generators for Mario-style levels at this point, but right now I don’t think we can really understand their strengths and weaknesses beyond looking at screenshots of the content they create. A competition that has players declare their “favorite” generator is likely to say as much about the players as it does about the generators. Imagine that we took 20 artists and had them each paint a depiction of haystacks. Would we accept an “evaluation” of this work as 50 people voting on their favorite depiction of haystacks, or would we rather be able to have a vocabulary to critique each individual piece of art and compare them to each other?
5. I’ve made a generator that can create an interesting and wide variety of content! I have not defined what it means to be interesting or have wide variety, nor can I tell you how controllable the generator is or what kind of control (if applicable) a particular designer or player might want to have.
Most of us probably aren’t in the market of putting human designers completely out of jobs, nor do most of us believe that there is a future in which a single unified AI system will create an entire game without needing to trade off different art and design considerations. That means it’s crucial to understand how steerable a generator is — what can a human (or other AI) designer do to influence the kind of content it creates? How does that impact the kind of game it can be used in? What were the assumptions made when creating the generator — do you assume a particular art style, or range of acceptable game physics values? Do you want this system to be used in an exploratory design tool that helps designers brainstorm, or a tool that helps designers refine their ideas into a single perfect level?
6. I’ve made a generator that can create an interesting and wide variety of content! There is no clear reason why this is a desirable thing to do.
Related to the above. While it’s pretty cool to just make little content generators, and the pleasure of creating a generative system and playing with it is often a worthy motivation in itself, others may want to create a generator that can replace a human, or augment a human designer, or create a new kind of experience, or create personalized content for players. Communicating that motivation is absolutely crucial, and I think it helps inform the kind of evaluation that you might want to be doing.
7. I’ve made a generator that can create an interesting and wide variety of content! Now any game that it is used in will be instantly more replayable.
We re-read books, re-watch movies, re-listen to music, and even re-play games that have absolutely zero procedurally generated content in them. Maybe we find things in the static, human-authored content that we hadn’t noticed before, or maybe we come back to gain a new perspective. Or maybe it’s just really pleasurable to re-experience something that was enjoyed the first time. But given that current games are already replayable for a variety of complicated reasons, it seems clear that PCG isn’t the sole factor that leads to replayability, and that introducing PCG into a game does not automatically make it more replayable than before. Nor am I convinced that replayability is necessarily the most desirable result to come from PCG: I would rather we be looking at ways we can use PCG to make entirely new kinds of playable experiences.