
Building AI-Powered Web Applications: What Actually Works in Production

JRHMSquare Team · AI & Full-Stack Engineers · April 1, 2026 · 9 min read

AI integration has moved from a differentiator to a baseline expectation in modern web applications. Customers now expect intelligent search, document summarisation, automated workflows, and conversational interfaces as standard features — not premium add-ons.

But there is a gap between AI features that work in a demo and those that hold up under real users, real data, and real cost constraints. This article covers what we have learned from building production AI applications across customer support, document intelligence, and enterprise automation.

The Architecture Decisions That Matter Most

Streaming Is Non-Negotiable for LLM Responses

The biggest UX mistake teams make when integrating large language models is waiting for the complete response before showing anything. LLM inference takes 5–30 seconds for a substantive response. Users will interpret that as a broken application.

Streaming — using Server-Sent Events (SSE) or WebSockets — is essential. In Next.js App Router:

// app/api/chat/route.ts
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function POST(req: Request) {
  const { messages } = await req.json();

  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages,
    stream: true,
  });

  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      for await (const chunk of stream) {
        const text = chunk.choices[0]?.delta?.content || '';
        // Frame each token as an SSE "data:" event so clients can
        // parse the stream incrementally.
        controller.enqueue(encoder.encode(`data: ${JSON.stringify(text)}\n\n`));
      }
      controller.enqueue(encoder.encode('data: [DONE]\n\n'));
      controller.close();
    },
  });

  return new Response(readable, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
    },
  });
}

The perceived responsiveness improvement is dramatic — users start reading the response within 500ms even when the full completion takes 20 seconds.
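On the client, the response body arrives as raw text chunks that must be reassembled into SSE events. A minimal parser sketch, assuming the server frames each token as a `data: <JSON string>` line per the SSE convention and signals end-of-stream with `[DONE]` (adjust to match your actual wire format):

```typescript
// Minimal SSE frame parser: feed it raw text chunks as they arrive,
// get back the decoded `data:` payloads. Assumes each payload is a
// JSON-encoded string and that "[DONE]" marks end-of-stream.
function createSSEParser() {
  let buffer = '';
  return function parse(chunk: string): string[] {
    buffer += chunk;
    const tokens: string[] = [];
    // SSE events are separated by a blank line.
    const events = buffer.split('\n\n');
    buffer = events.pop() ?? ''; // keep any incomplete trailing event
    for (const event of events) {
      for (const line of event.split('\n')) {
        if (!line.startsWith('data: ')) continue;
        const payload = line.slice('data: '.length);
        if (payload === '[DONE]') continue;
        tokens.push(JSON.parse(payload));
      }
    }
    return tokens;
  };
}
```

Wire this up to `response.body.getReader()` and a `TextDecoder` in the browser, appending each returned token to the visible message as it arrives.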

Vector Search Over Keyword Search for Semantic Retrieval

Traditional keyword search fails for AI use cases because users describe concepts, not keywords. "How do I cancel my subscription" will miss documents titled "Terminating your account" unless you use semantic search.

Vector embeddings solve this. The pattern we use:

  1. At content ingestion, generate embeddings for each document chunk using a model like text-embedding-3-small
  2. Store embeddings in a vector database (we use pgvector with PostgreSQL for simplicity, or Pinecone for scale)
  3. At query time, embed the user's query and find the nearest-neighbour documents
  4. Inject those documents as context into the LLM prompt (Retrieval-Augmented Generation — RAG)

This approach keeps LLM responses grounded in your actual data and dramatically reduces hallucinations.
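The query-time half of the pattern reduces to a nearest-neighbour search over embedding vectors. A vector database does this at scale with approximate indexes, but the core operation is just cosine similarity — a toy sketch with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```typescript
type Doc = { id: string; embedding: number[] };

// Cosine similarity: dot product of the vectors divided by the
// product of their magnitudes. 1 = identical direction, 0 = orthogonal.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank documents by similarity to the query embedding, keep the top k.
function topK(query: number[], docs: Doc[], k: number): Doc[] {
  return [...docs]
    .sort((x, y) =>
      cosineSimilarity(query, y.embedding) - cosineSimilarity(query, x.embedding))
    .slice(0, k);
}
```

The `topK` results are what you inject into the prompt as retrieved context.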

Context Window Management

LLMs have token limits. Unmanaged conversation histories will hit those limits, causing errors or expensive truncation. We use a sliding window with summarisation:

  • Keep the last N turns verbatim
  • Summarise older turns into a compact context block
  • Inject static system context (business rules, user profile) separately from conversation history

This keeps token consumption predictable and costs controllable.
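The sliding-window half of this can be sketched in a few lines. Summarisation itself would be another LLM call, represented here by a placeholder function, and token counts are estimated at roughly four characters per token — a common rule of thumb; use a real tokenizer such as tiktoken in production:

```typescript
type Turn = { role: 'user' | 'assistant'; content: string };

// Rough token estimate: ~4 characters per token for English text.
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

function buildContext(
  history: Turn[],
  keepLastN: number,
  summarise: (turns: Turn[]) => string, // in production, an LLM call
): { summary: string; recent: Turn[]; tokenEstimate: number } {
  // Keep the last N turns verbatim; compress everything older.
  const recent = history.slice(-keepLastN);
  const older = history.slice(0, -keepLastN);
  const summary = older.length > 0 ? summarise(older) : '';
  const tokenEstimate =
    estimateTokens(summary) +
    recent.reduce((sum, t) => sum + estimateTokens(t.content), 0);
  return { summary, recent, tokenEstimate };
}
```

Static system context (business rules, user profile) goes in the system prompt, not through this function, so it is never at risk of being summarised away.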

Prompt Engineering Patterns That Work

System Prompt as Contract

The system prompt is where you establish the model's persona, constraints, and output format. Treat it as a contract, not a suggestion:

You are a customer support assistant for Acme Corp.
CONSTRAINTS:
- Only answer questions about Acme products and services.
- If you do not know the answer, say "I don't have that information. Let me connect you with our support team."
- Never disclose internal pricing structures.
- Always respond in the same language the user writes in.
OUTPUT FORMAT:
- Keep responses under 200 words unless the user asks for detail.
- Use bullet points for step-by-step instructions.

Explicit constraints outperform "be helpful and honest" instructions by a wide margin in our testing.

Few-Shot Examples for Consistent Output Format

When you need structured output (JSON, specific markdown formats), few-shot examples are more reliable than instructions alone:

Extract the following from the support ticket. Return JSON only.

Example input: "My order #12345 hasn't arrived after 2 weeks"
Example output: {"orderId": "12345", "issue": "delayed_delivery", "urgency": "high"}

Input: "{userMessage}"
Output:
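A small helper that assembles this template for a given message — a sketch only; in production you would also validate the model's JSON output before trusting it:

```typescript
// Builds the few-shot extraction prompt for a support-ticket message.
// JSON.stringify quotes and escapes the user message so stray quotes
// or newlines cannot break the template.
function buildExtractionPrompt(userMessage: string): string {
  return [
    'Extract the following from the support ticket. Return JSON only.',
    '',
    'Example input: "My order #12345 hasn\'t arrived after 2 weeks"',
    'Example output: {"orderId": "12345", "issue": "delayed_delivery", "urgency": "high"}',
    '',
    `Input: ${JSON.stringify(userMessage)}`,
    'Output:',
  ].join('\n');
}
```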

Temperature and Top-P Tuning

  • Creative tasks (marketing copy, brainstorming): temperature 0.7–0.9
  • Factual retrieval (document Q&A, data extraction): temperature 0.0–0.2
  • Conversational AI: temperature 0.3–0.5

Getting this wrong means either robotic responses in conversational interfaces or hallucinated "facts" in knowledge base tools.

Real-Time AI Features: The Infrastructure Layer

Rate Limiting and Queuing

LLM APIs have rate limits and are expensive. In production you need:

  • Per-user rate limiting — prevent a single user from burning your monthly budget in an afternoon
  • Request queuing — during traffic spikes, queue requests rather than failing them
  • Retry logic with exponential backoff — transient API errors are common

We use Upstash Redis for lightweight rate limiting in Next.js edge functions.
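The retry logic is simple enough to sketch directly — exponential backoff plus jitter so that many clients retrying at once do not synchronise into a thundering herd (per-user limiting and queuing are separate concerns):

```typescript
// Retry a flaky async call with exponential backoff and jitter.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      // 500ms, 1s, 2s, ... plus jitter to avoid synchronised retries.
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 100;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```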

Caching Identical Requests

Many AI requests are actually identical. A FAQ bot will get "What are your opening hours?" hundreds of times per day. Cache the response:

const cacheKey = `ai:${hashQuery(userMessage)}`;
const cached = await redis.get(cacheKey);
if (cached) return cached;

const response = await generateAIResponse(userMessage);
await redis.setex(cacheKey, 3600, response); // Cache for 1 hour
return response;

This can cut your LLM API spend by 30–60% for use cases with repeated queries.
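One possible shape for the `hashQuery` helper assumed above — normalising the message first (lowercase, collapsed whitespace) lets trivially different phrasings share a cache entry. This is a hypothetical sketch; whether normalisation is safe depends on your use case:

```typescript
import { createHash } from 'node:crypto';

// Normalise the query, then hash it to a fixed-length cache key.
function hashQuery(message: string): string {
  const normalised = message.toLowerCase().replace(/\s+/g, ' ').trim();
  return createHash('sha256').update(normalised).digest('hex');
}
```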

Cost Attribution and Monitoring

At scale, AI costs are significant. Track them:

  • Log token counts (prompt tokens + completion tokens) per request
  • Tag requests by feature, user tier, and user ID
  • Set budget alerts at the API level
  • Monitor for prompt injection attempts (users trying to override system prompts)

We surface this data in a simple internal dashboard so product teams can make informed decisions about which features to expand or constrain.
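A sketch of the per-request accounting behind such a dashboard. The per-token prices here are placeholders — look up your provider's current pricing; the point is the shape of the record, not the numbers:

```typescript
type UsageRecord = {
  feature: string;
  userId: string;
  promptTokens: number;
  completionTokens: number;
};

// Placeholder prices in USD per 1M tokens — substitute your provider's rates.
const PRICE_PER_MILLION = { prompt: 2.5, completion: 10 };

function costUSD(r: UsageRecord): number {
  return (
    (r.promptTokens * PRICE_PER_MILLION.prompt +
      r.completionTokens * PRICE_PER_MILLION.completion) / 1_000_000
  );
}

// Aggregate spend per feature for the dashboard.
function spendByFeature(records: UsageRecord[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const r of records) {
    totals.set(r.feature, (totals.get(r.feature) ?? 0) + costUSD(r));
  }
  return totals;
}
```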

The Applications We Build Most Often

1. AI Customer Support

Conversational interface over your knowledge base. Pattern: RAG retrieval → LLM response → escalation to human if confidence is low. Key metric: containment rate (% of queries resolved without human handoff).

2. Document Intelligence

Upload documents (PDFs, contracts, reports), ask questions, extract structured data. Pattern: document chunking → embedding → vector store → query-time retrieval. We have built this for legal due diligence, insurance underwriting, and compliance review workflows.
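The chunking step can be as simple as fixed-size windows with overlap — a sketch; in practice sentence- or heading-aware splitting usually retrieves better:

```typescript
// Split text into fixed-size chunks with overlap, so content that
// spans a chunk boundary appears intact in at least one chunk.
function chunkText(text: string, chunkSize: number, overlap: number): string[] {
  if (overlap >= chunkSize) throw new Error('overlap must be smaller than chunkSize');
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

Each chunk is then embedded and stored alongside its source-document metadata so retrieved passages can be cited back to the original file.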

3. Workflow Automation with AI Agents

Multi-step tasks that combine LLM reasoning with tool use — web search, API calls, database queries, email drafting. We use LangChain or Vercel AI SDK's tool-calling primitives to orchestrate these flows. The key engineering challenge is reliability: agents need careful error handling and human-in-the-loop checkpoints for consequential actions.

4. Semantic Search and Recommendation

Replace keyword search in your product or content catalogue with embeddings-based semantic search. Users describe what they want in natural language; the system finds the best matches even when exact keywords don't appear.

What Fails in Production (And How to Prevent It)

Hallucination in knowledge base Q&A: Prevent with RAG + explicit instructions to say "I don't know" when no relevant context is retrieved. Never let the model answer from training data for domain-specific questions.

Slow cold starts on serverless: LLM calls from serverless functions have cold start overhead. Use edge functions where possible, or pre-warm critical paths.

Context injection attacks: Users will try to override your system prompt. Sanitise inputs, use a separate system prompt layer, and test with adversarial inputs before shipping.
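No single filter stops prompt injection, but cheap heuristics catch the obvious attempts before they reach the model. A minimal sketch — the phrase list is illustrative, not exhaustive, and attackers paraphrase freely, so treat this as one layer alongside role separation and adversarial testing, never as the defence:

```typescript
// Flag user input containing common prompt-override phrases.
// Illustrative only — combine with role separation and output
// validation; never rely on a blocklist alone.
const SUSPICIOUS_PATTERNS = [
  /ignore (all )?(previous|prior|above) instructions/i,
  /you are now/i,
  /disregard (the )?system prompt/i,
];

function looksLikeInjection(userMessage: string): boolean {
  return SUSPICIOUS_PATTERNS.some((p) => p.test(userMessage));
}
```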

Runaway costs from verbose prompts: Profile your average prompt size. It is common for poorly designed prompts to be 3× larger than necessary. Token count is your primary cost driver.

Getting Started: The Fastest Path to a Working AI Feature

  1. Start with a well-defined, narrow use case — not "AI for our whole product" but "AI-powered answers on our docs page"
  2. Use a managed LLM API (OpenAI, Anthropic, Google) — no need to self-host at early stage
  3. Build evaluation into your workflow from day one — you need a way to score response quality or you can't measure improvement
  4. Ship to a small beta group first — real users find failure modes that internal testing never catches

At JRHMSquare we have integrated AI into customer support systems, document processing pipelines, and enterprise automation workflows. If you are building an AI-powered product and want to talk through the architecture, we are happy to share what we know.

Building something?

We are a team of full-stack, mobile, and AI engineers who ship production products. Let's talk about your project.

Get in touch