Engineering · Published April 28, 2026

How We Killed AI Slop in Image Generation

We had a problem. We'd built a modular prompt system for our AI brand generator, replacing the hacky inline prompts we'd shipped in V0. The new system was cleaner, more maintainable, more extensible. The output looked worse.

We compared lifestyle photos from two different brands side by side. One brand's lifestyle shot looked like something from Cereal magazine: natural light, a real person mid-gesture, believable space. The other looked like an AI stock photo: centered composition, generic interior, a person staring into the middle distance.

Both used FLUX 2 Pro. Same resolution. Same API. Same model settings. The difference was the prompt.

The five root causes

We traced the quality gap to five specific failures in how our modular system assembled prompts.

1. Hardcoded scenes made every brand look the same

Our old system had inline scene descriptions. Sloppy, but specific: "Morning light through tall windows, flour-dusted oak table, a single cup of pour-over in frame." Our new modular system replaced these with reusable productContext and spaceDesc fields. The problem: we'd written generic versions. "White walls, polished concrete floor, warm oak shelving" served as the scene description for every heritage-craft brand.

Every coffee brand got the same kitchen. Every ceramics studio got the same showroom.

The fix was visualDirection. During brand synthesis, Claude now writes a one-sentence creative brief for each brand:

Warm Mediterranean light filtering through linen curtains, terracotta and aged brass, shot with nostalgic film grain.

That sentence appears in every image prompt for that brand. Two coffee brands look completely different because their visual direction comes from their specific cosmic data, personality, and belief statement.
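
A minimal sketch of that idea in TypeScript. Only the field name visualDirection and the first brief come from the system described here; the Brand shape, the buildPrompt helper, the placeholder subject lines, and the second brand are hypothetical and exist only to show the same brief riding along with every shot type.

```typescript
// Hypothetical shape: only `visualDirection` is a field name from the post.
interface Brand {
  brandName: string;
  visualDirection: string; // one-sentence creative brief written during brand synthesis
}

type ShotType = "product" | "lifestyle" | "store";

// Placeholder subject lines, invented for this sketch.
const SUBJECTS: Record<ShotType, string> = {
  product: "Single product on a tabletop.",
  lifestyle: "A person using the product.",
  store: "Interior of the brand's retail space.",
};

// Every shot type carries the same brand-specific brief.
function buildPrompt(brand: Brand, shot: ShotType): string {
  return `${SUBJECTS[shot]} ${brand.visualDirection}`;
}

const riad: Brand = {
  brandName: "Riad Coffee", // invented brand
  visualDirection:
    "Warm Mediterranean light filtering through linen curtains, terracotta and aged brass, shot with nostalgic film grain.",
};

const fjord: Brand = {
  brandName: "Fjord Roasters", // invented brand and invented brief
  visualDirection:
    "Cold overcast daylight, pale ash wood and brushed steel, crisp and grain-free.",
};

const riadLifestyle = buildPrompt(riad, "lifestyle");
const fjordLifestyle = buildPrompt(fjord, "lifestyle");
```

Both brands share the same generic subject line, yet the two prompts differ because the brief differs. That is the whole mechanism: the variation lives in brand data, not in the template.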

2. Spaces defaulted to AI cliches without a quality anchor

Without guardrails, FLUX renders the most probable scene from its training data. Ask for a tattoo studio and you get a dingy basement. Ask for a boxing gym and you get a rundown warehouse.

We added a three-word prefix to every store and lifestyle prompt: "Modern, contemporary interior." Three words changed the quality floor for every brand we generate.

A raw concrete boxing gym with that prefix looks like a high-end training facility. Without it, FLUX renders something from a Rocky movie. The prefix sets the aspiration. The archetype materials set the character.
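
As a rough illustration, the anchor can be a constant that is prepended by shot type. The prefix text is quoted from above; the withQualityAnchor helper, the ShotType union, and the choice to leave product shots untouched are assumptions of this sketch.

```typescript
type ShotType = "product" | "lifestyle" | "store";

// Prefix text quoted from the post.
const QUALITY_ANCHOR = "Modern, contemporary interior.";

// Store and lifestyle prompts get the anchor. Leaving product shots
// unprefixed is an assumption made for this sketch.
function withQualityAnchor(shot: ShotType, scene: string): string {
  if (shot === "product") {
    return scene;
  }
  return `${QUALITY_ANCHOR} ${scene}`;
}

// Scene text invented for illustration.
const gymScene = "Raw concrete boxing gym, worn leather heavy bags, chalk dust in side light.";
const gymPrompt = withQualityAnchor("store", gymScene);
const bottlePrompt = withQualityAnchor("product", "Single whisky bottle on dark marble, side-lit.");
```

Putting the anchor first, ahead of the archetype materials, mirrors the division of labor described above: the prefix sets the aspiration and the scene text that follows sets the character.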

3. FLUX defaulted to back-of-head shots

Our lifestyle prompts described a person using the brand's product. We didn't specify anything about their face. FLUX interpreted this as freedom to render the back of someone's head.

The fix: "Face visible in three-quarter view, eyes on the task, never looking at camera." Plus "Natural skin texture, slight imperfections" to prevent uncanny-valley smooth skin.
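
One way to make sure these constraints are never forgotten is to attach them wherever a person is described. In the sketch below the two constraint strings are quoted from above; the describePerson helper and the example action are hypothetical.

```typescript
// Both strings are quoted from the post.
const PERSON_CONSTRAINTS: string[] = [
  "Face visible in three-quarter view, eyes on the task, never looking at camera.",
  "Natural skin texture, slight imperfections.",
];

// Hypothetical helper: any description of a person goes through here,
// so the face and skin constraints always travel with it.
function describePerson(action: string): string {
  return [action].concat(PERSON_CONSTRAINTS).join(" ");
}

// Example action invented for illustration.
const baristaLine = describePerson("A woman in her thirties pouring coffee from a ceramic dripper.");
```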

4. Generic gestures produced generic people

Our old prompts had industry-specific actions: "arranging ingredients" for food brands. The problem: every food brand got the same person arranging the same ingredients.

The fix connected the person's gesture to the brand's voice descriptors:

The gesture feels unhurried and tactile.

vs.

The gesture feels austere and precise.

FLUX understands gestures. "Unhurried" produces slow, deliberate hand movements. "Austere" produces still, controlled poses. Two food brands get different human energy because their voice descriptors differ, not because we hardcoded different industry actions.
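
A small sketch of how a gesture sentence could be derived from voice descriptors. The sentence pattern matches the two examples above; the function names and the idea that descriptors arrive as a plain string array are assumptions.

```typescript
// Joins words as "a", "a and b", or "a, b and c".
function joinWithAnd(words: string[]): string {
  if (words.length <= 1) {
    return words.join("");
  }
  return `${words.slice(0, -1).join(", ")} and ${words[words.length - 1]}`;
}

// Hypothetical helper: turns a brand's voice descriptors into one prompt sentence.
// Returns an empty string when there is nothing to say.
function gestureLine(voiceDescriptors: string[]): string {
  if (voiceDescriptors.length === 0) {
    return "";
  }
  return `The gesture feels ${joinWithAnd(voiceDescriptors)}.`;
}

const bakeryGesture = gestureLine(["unhurried", "tactile"]);
// "The gesture feels unhurried and tactile."
const omakaseGesture = gestureLine(["austere", "precise"]);
// "The gesture feels austere and precise."
```

Nothing in the function knows about food, coffee, or any other industry, which is the point: the human energy comes from the brand's voice, not from a per-industry action table.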

5. The prompts sounded like marketing copy

Our early prompts described products the way a copywriter would: "Premium aged whisky, crafted with care." FLUX doesn't know what to do with marketing language.

We split product descriptions into two fields. productDesc is marketing copy: "Cold-pressed Argan Face Oil, Atlas Mountains harvest." productPhotoDesc is photographer direction: "Small amber glass dropper bottle with minimal label, placed on raw linen."

The photographer direction gives FLUX something concrete to render. The marketing language gives it a concept to flounder with.
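
Sketched as a type, the split might look like the following. The field names productDesc and productPhotoDesc and both example strings come from above; the Product interface, the productShotPrompt helper, and the lighting text are hypothetical.

```typescript
interface Product {
  productDesc: string; // marketing copy, for storefront text
  productPhotoDesc: string; // photographer direction, for the image model
}

const faceOil: Product = {
  productDesc: "Cold-pressed Argan Face Oil, Atlas Mountains harvest.",
  productPhotoDesc: "Small amber glass dropper bottle with minimal label, placed on raw linen.",
};

// Only the photographer direction reaches the image prompt.
// The marketing copy is deliberately never read here.
function productShotPrompt(product: Product, lighting: string): string {
  return `${product.productPhotoDesc} ${lighting}`;
}

// Lighting text invented for illustration.
const faceOilPrompt = productShotPrompt(faceOil, "Soft window light from the left, shallow depth of field.");
```

Keeping the two fields separate on the same record means the copy and the photo direction can be written together during synthesis but can never be mixed up at prompt time.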

The anti-slop rules

After fixing these five root causes, we codified what we'd learned:

Never say "centered composition." AI defaults to centered. Say "off-centre" or "asymmetric."

Never say "beautiful" or "stunning." Filler words the model ignores. Be specific about mood.

Never say "high quality" or "4K." Specify camera and film stock instead: "Hasselblad 500CM, Kodak Portra 400."

Never say "professional photography." Name the photographer. "Rich Stapleton (Cereal)" is a precise aesthetic reference.

Photographer direction, not marketing copy. "Single whisky bottle on dark marble, side-lit" produces a photograph. "Premium aged whisky, crafted with care" produces a mood board.
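
Rules like these are easy to enforce mechanically. Below is a minimal sketch of a prompt linter built from the phrases listed above; it is not the published skill, and the SlopRule shape and lintPrompt function are hypothetical. Matching is a plain case-insensitive substring check, which is a simplification (a short phrase such as "4k" can also match inside a longer token).

```typescript
interface SlopRule {
  phrase: string; // banned phrase, stored in lower case
  suggestion: string; // what to do instead
}

const SLOP_RULES: SlopRule[] = [
  { phrase: "centered composition", suggestion: 'say "off-centre" or "asymmetric"' },
  { phrase: "beautiful", suggestion: "be specific about mood" },
  { phrase: "stunning", suggestion: "be specific about mood" },
  { phrase: "high quality", suggestion: "specify camera and film stock" },
  { phrase: "4k", suggestion: "specify camera and film stock" },
  { phrase: "professional photography", suggestion: "name the photographer" },
];

// Returns one warning per banned phrase found in the prompt text.
function lintPrompt(text: string): string[] {
  const lower = text.toLowerCase();
  return SLOP_RULES
    .filter((rule) => lower.indexOf(rule.phrase) !== -1)
    .map((rule) => `"${rule.phrase}": ${rule.suggestion}`);
}

const sloppyWarnings = lintPrompt("Stunning whisky bottle, professional photography, 4K");
const cleanWarnings = lintPrompt("Single whisky bottle on dark marble, side-lit");
```

Run over every assembled prompt before it is sent to the model, a check like this catches filler that creeps back in when someone edits a template.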

The brand-block system

A brand's primary color needs to appear in every image, but in a different role each time. In a product shot, the color is on the product. In a lifestyle photo, it's one worn detail on the person. In a store interior, it's an architectural accent.

We built a brand-block module that generates three different color instructions from the same palette. Same hex code, different application. The color anchors the brand across all images without turning every scene into a monochrome advertisement.
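
A hedged sketch of what such a module could return. The three placements (on the product, one worn detail, an architectural accent) come from above; the brandBlock function name, the instruction wording, and the example hex value are invented for illustration.

```typescript
type ShotType = "product" | "lifestyle" | "store";

// One palette entry in, three placement instructions out.
// The instruction wording is invented for this sketch.
function brandBlock(primaryHex: string, colorName: string): Record<ShotType, string> {
  const color = `${colorName} (${primaryHex})`;
  return {
    product: `The product itself carries ${color}.`,
    lifestyle: `One worn detail on the person is ${color}.`,
    store: `One architectural accent is ${color}.`,
  };
}

// Example hex value chosen arbitrarily.
const terracottaBlock = brandBlock("#C65D3B", "terracotta");
```

Each prompt builder then reads only the entry for its own shot type, so the same hex code shows up everywhere but never in the same role twice.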

The result

Before: two coffee brands generated through our system looked identical. Same kitchen, same light, same composition.

After: two coffee brands look like they were shot by different photographers for different magazines. The archetype provides the camera, film stock, and material language. The visualDirection provides the world. The voice descriptors provide the human energy. The anti-slop anchors keep everything grounded.

We published the anti-slop rules (but not the production prompts) as an open-source skill. The rules work with any AI image model.

The competitive moat isn't any single technique. It's the integration: one resolvePhotoStyle() call per brand cascades through archetype, camera, lighting, color grade, photographer, scene, person, gesture, color block, and anti-slop anchors. That's hard to reverse-engineer from the output.
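
To make the shape of that cascade concrete, here is a deliberately simplified sketch. It is not the production resolvePhotoStyle(), whose internals are not published; the BrandStyle shape, the assemble and resolvePhotoStyleSketch names, the ordering of the parts, and the sample brand are all assumptions. It only shows how one call per brand can fan out into a full prompt per shot type.

```typescript
type ShotType = "product" | "lifestyle" | "store";

// Hypothetical bundle of everything resolved for one brand.
interface BrandStyle {
  camera: string;
  filmStock: string;
  photographer: string;
  materials: string;
  visualDirection: string;
  voiceDescriptors: string[];
  colorBlock: Record<ShotType, string>;
}

// Assembles one prompt; the ordering of parts is an assumption.
function assemble(style: BrandStyle, shot: ShotType, subject: string): string {
  const parts: string[] = [];
  if (shot !== "product") {
    parts.push("Modern, contemporary interior.");
  }
  parts.push(subject, style.materials, style.visualDirection);
  if (shot === "lifestyle") {
    parts.push("Face visible in three-quarter view, eyes on the task, never looking at camera.");
    parts.push(`The gesture feels ${style.voiceDescriptors.join(" and ")}.`);
  }
  parts.push(style.colorBlock[shot]);
  parts.push(`Shot on ${style.camera}, ${style.filmStock}, in the manner of ${style.photographer}.`);
  parts.push("Off-centre composition.");
  return parts.join(" ");
}

// One call per brand returns a prompt for every shot type.
function resolvePhotoStyleSketch(
  style: BrandStyle,
  subjects: Record<ShotType, string>
): Record<ShotType, string> {
  return {
    product: assemble(style, "product", subjects.product),
    lifestyle: assemble(style, "lifestyle", subjects.lifestyle),
    store: assemble(style, "store", subjects.store),
  };
}

// Sample brand: camera, film stock, photographer and brief are quoted from the
// post; materials, color instructions and subjects are invented.
const sampleStyle: BrandStyle = {
  camera: "Hasselblad 500CM",
  filmStock: "Kodak Portra 400",
  photographer: "Rich Stapleton",
  materials: "Terracotta tile and aged brass fittings.",
  visualDirection:
    "Warm Mediterranean light filtering through linen curtains, terracotta and aged brass, shot with nostalgic film grain.",
  voiceDescriptors: ["unhurried", "tactile"],
  colorBlock: {
    product: "The label is terracotta.",
    lifestyle: "One worn detail on the person is terracotta.",
    store: "One architectural accent is terracotta.",
  },
};

const samplePrompts = resolvePhotoStyleSketch(sampleStyle, {
  product: "Small bag of coffee beans on a stone counter.",
  lifestyle: "A man grinding coffee by hand.",
  store: "Narrow espresso bar with a single long counter.",
});
```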