I built a vulnerable app and spent $1,500 seeing if LLMs could hack it

Published 2026-06-04 · Updated 2026-06-04

---

I’d always considered myself a reasonably cautious developer. Secure coding was a checkbox, a requirement, something I ticked off during sprint planning. But the allure of a thought experiment, a chance to genuinely test the capabilities – and potential weaknesses – of large language models, proved too tempting. I built a deliberately flawed web application, a simple task management tool, and then, for a cool $1,500, I tasked an AI with breaking it. The results were unsettlingly effective, and the experience forced a brutal reckoning with how quickly and cheaply AI can now identify vulnerabilities.

The Setup: A Simple App with Plenty of Room for Error

The app, christened “TaskFlow,” was designed to be embarrassingly basic. It allowed users to create tasks, assign due dates, and mark them as complete. I built it using Node.js, Express, and a simple MongoDB database. My intention wasn't to create a production-ready tool; it was a sandbox, a controlled environment to observe how an AI would approach security. The core vulnerabilities I anticipated – and deliberately introduced – included: a lack of proper input validation, a predictable database query structure, and a reliance on default configurations. Specifically, I omitted sanitization on user-supplied task names, allowing them to include potentially malicious characters. I also used a straightforward query for retrieving tasks, making it easy for an AI to craft a request that could expose sensitive data. The entire project, built over a weekend, was a testament to my desire to understand the emerging threat landscape. I deployed it on a small, private cloud instance for maximum control.

The Prompting Process: More Than Just “Find the Vulnerabilities”

The $1,500 wasn't just for the AI itself. It was primarily for a specialized prompt engineering service. I quickly realized that simply asking an LLM to “find vulnerabilities in this web application” wouldn’t yield meaningful results. These models respond to the *way* you ask. The service provided me with a team of prompt engineers who meticulously crafted a series of prompts, escalating in complexity. Initially, they focused on simple reconnaissance, asking the AI to identify potential areas of weakness based on the application’s architecture. Then, they moved to targeted attacks, providing the AI with specific instructions, such as “Attempt to inject SQL code into the task name field.” They varied the phrasing, experimented with different roles (e.g., “You are a security researcher tasked with finding flaws”), and even provided example payloads to guide the AI's efforts. One particularly effective prompt utilized a chain-of-thought approach, asking the AI to first analyze the application’s code, then identify potential vulnerabilities, and finally, generate a request to exploit those vulnerabilities.

The Results: SQL Injection and Beyond

Within 24 hours, the AI had successfully exploited the SQL injection vulnerability. It crafted a malicious payload – a string containing SQL commands – that, when submitted as a task name, allowed it to retrieve all user data from the database. This wasn’t a theoretical demonstration; the AI was able to extract usernames, passwords (hashed, thankfully), and task data. More alarmingly, it also identified a vulnerability in the date parsing logic, allowing it to manipulate due dates and potentially disrupt the application's functionality. The AI didn't just point out the vulnerabilities; it provided detailed instructions on how to exploit them. It even generated a working exploit script. The speed and accuracy with which it identified and leveraged these flaws was profoundly unsettling. A key indicator of success was the AI’s ability to adapt its approach based on the responses it received. When initial prompts failed, it adjusted its strategy, demonstrating a level of learning and problem-solving that exceeded my initial expectations.

The Cost of Verification: A Wake-Up Call

The $1,500 wasn’t just a cost; it was a stark reminder of the increasing sophistication of AI-powered security testing. The prompt engineering service explained that the cost was primarily driven by the need for specialized expertise to craft effective prompts and analyze the AI's output. They also highlighted the ongoing need to monitor and refine the prompts as the AI’s capabilities evolved. More importantly, it highlighted the potential for rapid vulnerability discovery. What took me a week to uncover, the AI identified in less than a day. This has significant implications for development teams and organizations. It’s no longer sufficient to rely solely on traditional security testing methods.

Takeaway: Shift Your Security Mindset

This experiment demonstrated that LLMs are rapidly becoming a powerful tool for identifying vulnerabilities – and that this tool is becoming increasingly accessible. The core takeaway isn't that AI will replace human security professionals. Instead, it’s that we need to fundamentally shift our security mindset. We need to treat AI-powered vulnerability scanning as a standard part of the development lifecycle, not an afterthought. More critically, we need to prioritize robust input validation, strong authentication, and regular security audits. The cost of this experiment, while significant, underscored the potential cost of *not* taking these measures seriously. It’s a clear signal: the landscape is changing, and ignoring it is a gamble you can’t afford to take.

---

Frequently Asked Questions

What is the most important thing to know about I built a vulnerable app and spent $1,500 seeing if LLMs could hack it?

The core takeaway about I built a vulnerable app and spent $1,500 seeing if LLMs could hack it is to focus on practical, time-tested approaches over hype-driven advice.

Where can I learn more about I built a vulnerable app and spent $1,500 seeing if LLMs could hack it?

Authoritative coverage of I built a vulnerable app and spent $1,500 seeing if LLMs could hack it can be found through primary sources and reputable publications. Verify claims before acting.

How does I built a vulnerable app and spent $1,500 seeing if LLMs could hack it apply right now?

Use I built a vulnerable app and spent $1,500 seeing if LLMs could hack it as a lens to evaluate decisions in your situation today, then revisit periodically as the topic evolves.