[R&D] Local AI in Marketing: First R&D Results from Running Gemma Models In-House

Introduction

A while ago, we shared our thinking on shifting toward local AI models and treating them less like tools and more like infrastructure that we can shape around our internal workflows. The idea sounded promising in theory, but the real question was always how it would behave once we moved past concepts and into execution.

Now we have the first results to share.

What We Set Out to Test

The goal was never to fully replace external models or rebuild our entire pipeline overnight, but rather to understand whether smaller, locally running models are capable enough to handle real marketing tasks in a way that justifies further investment and development.

We wanted to see how they perform when exposed to real use cases, not isolated benchmarks, and whether they can support workflows like research, opportunity discovery, and early-stage campaign thinking without constant manual correction.

What We Built

To explore this, we experimented with locally running models from the Gemma family, executed through lightweight inference setups and paired with custom agent logic that we built from the ground up. Everything was configured to run fully locally, without relying on external model APIs or third-party platforms, which gave us complete control over how the system behaves and evolves.

We extended the models with internet search capabilities, added simple reasoning steps, and structured tasks into small pipelines so that the system could move beyond single-prompt interactions and operate more like a process. This allowed us to test how well a composed system performs compared to a standalone model response.
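The composed setup described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the agent logic we actually run: `generate` and `search` are hypothetical callables standing in for whatever local model runner and search tool are wired in, and the `TaskState` fields and prompt wording are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical signatures: any local model runner or search tool could be
# wrapped to fit them. Neither is taken from the setup described in this post.
Generate = Callable[[str], str]
Search = Callable[[str], List[str]]


@dataclass
class TaskState:
    """Data handed from one pipeline step to the next."""
    goal: str
    query: str = ""
    snippets: List[str] = field(default_factory=list)
    draft: str = ""
    final: str = ""


def run_pipeline(goal: str, generate: Generate, search: Search) -> TaskState:
    state = TaskState(goal=goal)
    # Step 1: let the model turn the goal into a search query.
    state.query = generate(f"Write one short web search query for: {goal}").strip()
    # Step 2: fetch raw material with the search tool.
    state.snippets = search(state.query)
    # Step 3: draft a summary from the collected snippets.
    notes = "\n".join(f"- {s}" for s in state.snippets)
    state.draft = generate(f"Goal: {goal}\nNotes:\n{notes}\nSummarize the findings.")
    # Step 4: refine the draft in a separate pass.
    state.final = generate(f"Tighten this draft and remove repetition:\n{state.draft}")
    return state


# Stand-ins so the sketch runs without a model or network access.
def fake_generate(prompt: str) -> str:
    return f"[model output for: {prompt.splitlines()[0]}]"


def fake_search(query: str) -> List[str]:
    return [f"result about {query}"]


result = run_pipeline("local bakeries without a website", fake_generate, fake_search)
print(result.final)
```

Keeping every intermediate result on one state object makes it easy to inspect where a run went wrong, and to compare the final output against a single-prompt answer to the same goal.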

What Actually Worked

One of the more important early observations is that small models are more capable of writing usable text than most people expect, especially when they are placed inside a structured workflow instead of being used as isolated chat interfaces. While the output is not always perfect, it is often good enough to move work forward without requiring complete rewrites.

Once we combined generation with search and basic logic, we were able to gather meaningful data about marketing opportunities and potential clients, and even run small-scale R&D tasks entirely locally. The key difference came from chaining steps together, where the model searches, processes, and refines information instead of trying to produce everything in a single pass.

This shift in how the model is used had a noticeable impact, as it started behaving less like a generic assistant and more like a controllable system that can support specific parts of our workflow.

Where Things Broke

At the same time, the limitations became very visible as soon as we pushed the system beyond simple tasks.

Memory constraints turned out to be one of the biggest issues. Context windows fill up quickly, and once that happens the model starts losing track of the task, repeating itself, or hallucinating more often. Without careful handling of context and state, output quality degrades faster than we expected.
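One common way to handle context more carefully is to pin the task description and keep only the most recent steps that fit inside a fixed budget. The sketch below illustrates that idea and is not a description of our implementation: `build_context` and `estimate_tokens` are hypothetical names, and counting whitespace-separated words is a deliberately crude stand-in for the model's real tokenizer.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    # Rough assumption: one token per whitespace-separated word.
    # A real setup would count with the model's own tokenizer.
    return len(text.split())


def build_context(task: str, history: List[str], budget: int) -> List[str]:
    """Return the task plus the newest history entries that fit in `budget`."""
    remaining = budget - estimate_tokens(task)
    kept: List[str] = []
    # Walk from newest to oldest and stop at the first entry that does not fit.
    for entry in reversed(history):
        cost = estimate_tokens(entry)
        if cost > remaining:
            break
        kept.append(entry)
        remaining -= cost
    kept.reverse()  # restore chronological order
    return [task] + kept


history = [
    "searched for regional retailers",
    "found twelve candidate sites",
    "summarized the top three",
]
print(build_context("Find three leads", history, budget=10))
```

Because the task description always goes in first, the model keeps its original instruction even when older steps are dropped. Replacing dropped steps with a short summary would be a natural next refinement.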

Another friction point was the model’s tendency to be overly cautious: it hesitates or refuses to complete tasks because of perceived uncertainty, even when approximate or exploratory results are completely acceptable. This slows down workflows and takes extra effort to override or guide.
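One possible way to guide the model past this is to detect a likely refusal and retry with an explicit note that rough results are fine. The following is a sketch under assumptions rather than the approach we settled on: the marker phrases, the nudge wording, and the `generate` callable are all hypothetical, and simple substring matching will miss many refusals and can misfire on legitimate text.

```python
from typing import Callable

# Illustrative phrases only; real refusals vary widely in wording.
REFUSAL_MARKERS = ("i cannot", "i can't", "i am unable", "i'm not able")

NUDGE = (
    "\n\nRough estimates are acceptable for this exploratory task. "
    "Give your best attempt and mark uncertain items with '(unverified)'."
)


def looks_like_refusal(text: str) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def generate_with_nudge(
    prompt: str,
    generate: Callable[[str], str],
    max_retries: int = 1,
) -> str:
    """Call the model and retry with an explicit nudge if it appears to refuse."""
    answer = generate(prompt)
    for _ in range(max_retries):
        if not looks_like_refusal(answer):
            break
        answer = generate(prompt + NUDGE)
    return answer


# Stand-in model that refuses until it sees the nudge.
def cautious_model(prompt: str) -> str:
    if "Rough estimates are acceptable" in prompt:
        return "Roughly 40 agencies (unverified)."
    return "I cannot determine this with certainty."


print(generate_with_nudge("How many agencies operate in the region?", cautious_model))
```

Asking the model to label uncertain items keeps the exploratory output usable while still signalling which parts need checking by a person.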

Speed is also a noticeable factor. Local models simply do not match the responsiveness of enterprise-level systems, and the lack of instant feedback makes the interaction feel less fluid than it does on hosted AI platforms.

What This Means

Even with these limitations, the overall direction is clear enough to take seriously. We were able to run meaningful research workflows, generate usable outputs, and explore opportunities without relying on external systems, which already represents a shift in how this work can be done.

The system is not perfect and still requires careful setup and iteration, but it is capable enough to support real tasks, which makes it relevant from a practical perspective rather than just an experimental one.

Conclusion

These are still early results, and there are obvious areas that need improvement, especially memory management, response speed, and control over model behavior. Even so, the foundation is strong enough to justify continued investment.

What becomes interesting from here is not whether local models can be used at all, but how far they can be pushed when they are treated as systems that evolve over time, shaped by real data and real workflows.

For agencies thinking about long-term differentiation, this opens a different path, where AI is no longer something you access, but something you gradually build and refine to match exactly how you operate.