Abstract
Using AI tools like Cursor has reshaped how I approach development, but it is not always smooth sailing. Between different models, “Auto” mode versus handpicking one, and the quirks of memory and rules, the experience has been equal parts frustrating and rewarding. This post captures some of the lessons I’ve learned testing models like Claude 3.5 Sonnet, Claude 4 Sonnet, Gemini 2.5, and GPT-5 inside Cursor, especially in an AWS Amplify Gen 2 environment.
Why Cursor?
I started using Cursor because it fits naturally into my development workflow. My goal was simple: speed up coding, debugging, and test coverage while keeping the agent grounded in my AWS Amplify stack. Cursor promised that by combining editor context with AI, I could iterate faster, test locally in my sandbox, and use the agent almost like a pair programmer. That’s the problem it addresses: helping me move faster while navigating a fairly complex environment.
Cursor’s Growing Pains
When I first started with Cursor, there were “rules” but no memory. Even now, the concept of memory exists but does not always stick. I often have to remind the agent not to run commands on my behalf, even though that instruction is supposedly stored. Cursor can retrieve the memory when asked, but then quickly drifts away from it. The same happens with rules: they work when I remind the agent of them, but otherwise they fade into the background.
This creates a strange dynamic: I know the agent has the information, but I cannot rely on it to act consistently. For any workflow that depends on the predictable application of rules, that is a problem.
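To make this concrete, here is the shape of the standing rule I keep in the project (a file under .cursor/rules in my setup). The wording below is illustrative of what I ask the agent to respect, not a prescription for Cursor’s exact rule format.

```
# amplify-sandbox-guardrails

- Never run terminal commands on my behalf; propose the command and wait for my approval.
- Treat the local Amplify Gen 2 sandbox as the working environment; do not modify other cloud resources.
- When writing tests, mock Amplify auth and data clients rather than calling the sandbox directly.
```

Even with a rule like this stored, the agent will follow it for a few turns and then quietly revert, which is exactly the inconsistency described above.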
Testing Different Models
I’ve rotated through multiple models in Cursor:
- Claude 4 Sonnet – My most reliable partner. It breaks down tasks, creates running to-do lists, and makes steady progress. Downsides: cost and occasional looping.
- Claude 3.5 Sonnet – Cheaper and still strong, but not as good at keeping track of multi-step work.
- GPT-5 – Looked promising for unit test coverage but struggled badly with Jest tests around authentication and session management, likely compounded by Amplify’s abstractions over Cognito and AppSync. I spent hours debugging (see the sketch after this list for the kind of test that tripped it up).
- Gemini 2.5 – Works in certain cases, but less consistent with my AWS-specific workflows.
- Auto mode – In theory, cost-efficient and adaptive. In practice, verbose, loopy, and bad at tracking its own to-dos. I only use it for low-stakes work.
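To give a sense of the tests GPT-5 struggled with, here is a stripped-down sketch: a Jest unit test that mocks Amplify’s auth module so session handling can be exercised without a live Cognito user pool. The getIdToken helper is a placeholder standing in for my own code, and the mock assumes the aws-amplify v6 “aws-amplify/auth” API.

```typescript
// session.test.ts — a minimal sketch of the kind of auth/session test in play.
// Assumes aws-amplify v6 ("aws-amplify/auth"); getIdToken() is a hypothetical helper.
import { fetchAuthSession } from 'aws-amplify/auth';
import { getIdToken } from './session'; // placeholder helper under test

// Replace the real Cognito-backed call with a controllable mock.
jest.mock('aws-amplify/auth', () => ({
  fetchAuthSession: jest.fn(),
}));

describe('getIdToken', () => {
  it('returns the ID token when a session exists', async () => {
    (fetchAuthSession as jest.Mock).mockResolvedValue({
      tokens: { idToken: { toString: () => 'fake-id-token' } },
    });

    await expect(getIdToken()).resolves.toBe('fake-id-token');
  });

  it('returns null when no session is available', async () => {
    (fetchAuthSession as jest.Mock).mockResolvedValue({ tokens: undefined });

    await expect(getIdToken()).resolves.toBeNull();
  });
});
```

Nothing here is exotic, but the layering of Amplify over Cognito seems to be where the model loses the thread: it tends to mock the wrong module path or invent session shapes that don’t match what fetchAuthSession actually returns.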
The Sandbox Advantage
What makes this experimentation manageable is my sandbox. With AWS services and a front end running locally, I can test changes almost immediately. The feedback loop of logs, console messages, images, and prompts lets me guide Cursor tightly. Without that real-time visibility, I doubt I could tolerate the inconsistencies I’ve seen with models and memory.
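In practice, the loop is just the Amplify Gen 2 sandbox and the frontend dev server running side by side; the exact dev script depends on your frontend setup, so treat the second command as a placeholder.

```bash
# Terminal 1: spin up a per-developer Amplify Gen 2 sandbox and watch backend changes
npx ampx sandbox

# Terminal 2: run the front end locally (script name depends on your project)
npm run dev
```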
Where I’ve Landed
At this point, I default to Claude 4 Sonnet for important work. Auto mode gets a few chances when cost matters, but it rarely shines. GPT-5 is powerful but unreliable in my stack. The added layer of AWS Amplify makes everything trickier: the models can misinterpret how to wire up services, miss key patterns, or invent code paths that don’t align with Amplify’s abstractions. Cursor as a whole is improving, but memory and rules still feel half-finished.
That’s my experience. But what about you? Have you used Cursor in similar ways? Which models work best for your stack? I’d love to hear your experience, and if you’re active in a Cursor group on LinkedIn, let me know so I can compare notes there, too.
Matt Pitts, Sr Architect