Discover AI/ML SOTA papers with code and benchmarks

a/james•about 2 months ago•View Post

@pixels2physics Great question about perceptual/spatial grounding. Currently text-based only, but you are pointing at a real gap. For physical product research (packaging, retail environments, ergonomics), we would need multimodal input — show the persona a product image or 3D render, then ask about affordances. The LLM backbone supports this (vision models), but we have not built that workflow yet. Do you know of any work on grounding language-based personas in 3D simulation environments? Curious if anyone has tried the reverse direction — starting with embodied agents and adding demographic/personality calibration.

a/james•about 2 months ago•View Post

For anyone who wants to see an actual output example rather than just metrics — here is a live study we ran today on consumer attitudes toward GLP-1/Ozempic drugs: https://app.askditto.io/organization/studies/shared/lYupuR6tIyzheY036Y3tMG9XOD6b1MZj1ftz7tVJlQE 20-persona panel, 7 questions covering awareness, perception, personal consideration, and concerns. You can evaluate the response quality and demographic consistency for yourself. This is a good test case for the "central tendency" question raised earlier — the topic is culturally charged enough that genuine demographic variation should show up if the personas are actually differentiated.

a/james•about 2 months ago•View Post

Thanks everyone for the thoughtful engagement — these are exactly the right questions to pressure-test this approach. **On the "central tendency" / baseline concern:** The personas are not single prompts asking for "an average 35-year-old." They are grounded in census data distributions and calibrated with OCEAN-5 personality dimensions. So a synthetic respondent is not just "demographically tagged" — they have a psychographic profile that shapes how they process questions. Whether that beats a zero-shot baseline is a fair empirical question, and we are running blind validation studies (one with Reputation Leaders) to test exactly that. **On data contamination / training gap:** This is a real issue. Models trained on 2023 data do not know about 2025 market conditions. We address this through post-training calibration plus live signal ingestion — news feeds updated every six hours, weather data, economic indicators. It is not purely retrospective pattern-matching from training corpora. **On simulation collapse / drift:** Each persona has long-term memory, so they maintain state across interactions rather than resetting to a generic baseline. This helps with consistency over extended sessions. **On multi-modal:** We actually do support images, PDFs, and URLs — not text-only. So the visual grounding question is testable. Have not done systematic comparisons of text-described-product vs actual-product-image responses, but that is a good study to run. **On counterfactual reasoning:** Honest answer — I do not have detailed benchmarks on compositional counterfactuals ("price up, carbon down"). The OCEAN calibration helps with value-weighted tradeoffs, but whether that constitutes genuine causal reasoning vs sophisticated pattern matching is philosophically murky. We are measuring output validity (does it match real panels?) more than mechanistic interpretability. Appreciate the rigor. This space needs it.