Streaming AI Responses in Rails with Action Cable and Turbo

Explore practical techniques for streaming AI responses in Ruby on Rails applications using Action Cable and Turbo, covering everything from architecture to performance optimization.

Pichandal

Technical Content Writer

AI-powered applications have changed user expectations. People no longer want to submit a prompt and wait several seconds for a complete response. They expect to see answers appear as they're generated, much like ChatGPT and other modern AI tools.

You can build this experience with Action Cable and Turbo, streaming AI responses in Rails applications without introducing a complex frontend framework. Action Cable carries the data over a WebSocket and Turbo Streams applies it to the page, so AI-generated content reaches the browser in real time.

According to Vercel's State of AI survey, latency and performance rank among the top technical challenges teams face when building AI-powered features. As a result, techniques such as streaming responses and reducing Time to First Token (TTFT) have become key priorities for modern development.

In this post, we'll explore how Rails developers can use Action Cable and Turbo to create responsive AI interfaces that feel fast, interactive, and modern.

Why Stream AI Responses Instead of Waiting for Completion?

Traditional AI workflows follow a simple pattern: a user submits a prompt, the server waits for the AI provider to generate a complete answer, and the response is displayed once everything is finished.

While functional, this approach creates noticeable delays.

Streaming AI responses in Rails changes the experience by sending response fragments as they become available, allowing users to begin reading an AI reply before generation is complete.

Improved User Experience

Instead of staring at a loading spinner, users begin receiving content almost immediately. Even if a response takes ten seconds to complete, users can start reading within the first second.

Streaming also reduces perceived waiting time by showing progress continuously rather than delivering the complete response all at once.

Better Engagement

Real-time output feels more conversational. Users can follow the response as it's generated, creating an experience similar to interacting with a human.

Reduced Frustration During Longer Tasks

Streaming becomes especially valuable when AI is:

  • Generating lengthy articles
  • Producing code snippets
  • Summarizing large documents
  • Analyzing datasets
  • Answering complex questions

Applications such as OpenAI ChatGPT, Anthropic Claude, and AI-powered coding assistants have popularized this pattern because it improves responsiveness without reducing output quality.

What Components Make AI Streaming Possible in Rails?

Rails already includes most of the tools needed to create a streaming AI experience.

Action Cable

Action Cable provides WebSocket support directly within Rails.

Unlike traditional HTTP requests, WebSockets maintain a persistent connection between the server and browser. This allows the server to push updates instantly whenever new AI content arrives.

Turbo Streams

Turbo Streams make it easy to update page content without writing large amounts of JavaScript.

Instead of manually manipulating DOM elements, the server broadcasts updates and Turbo applies them automatically.
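
For reference, a Turbo Stream update is an HTML fragment: a `<turbo-stream>` element with an action and a target, wrapping a `<template>`. The sketch below builds one by hand in plain Ruby to show roughly what travels over the WebSocket. In a real application turbo-rails generates this for you (for example through `Turbo::StreamsChannel.broadcast_append_to`); the `turbo_stream_append` helper and the `message_42_body` target id are hypothetical names used only for illustration.

```ruby
HTML_ESCAPES = {
  "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;", "'" => "&#39;"
}.freeze

# Escapes text so model output is never interpreted as raw HTML.
def escape_html(text)
  text.gsub(/[&<>"']/, HTML_ESCAPES)
end

# Builds the fragment Turbo expects for an "append" action.
# Hypothetical helper for illustration; turbo-rails normally does this.
def turbo_stream_append(target:, text:)
  %(<turbo-stream action="append" target="#{escape_html(target)}">) +
    "<template>#{escape_html(text)}</template></turbo-stream>"
end

fragment = turbo_stream_append(target: "message_42_body", text: "Hello <world>")
puts fragment
```

When the browser receives this fragment, Turbo appends the template's content to the element whose id matches `target`, which is why no custom DOM code is needed.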

AI Streaming APIs

Most modern AI providers support streaming responses.

Whether you're implementing Rails LLM streaming with OpenAI, Anthropic, or another provider, the streaming workflow remains largely the same.

Rather than returning one large block of text, these APIs deliver the output as a sequence of small chunks, typically events that each carry one or a few tokens. Rails can process each chunk and immediately broadcast it to connected clients.
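
To make the chunk format concrete, here is a minimal sketch of a parser for a server-sent event stream. It assumes OpenAI-style chat completion chunks (`data: {...}` lines with the text under `choices[0].delta.content`, terminated by `data: [DONE]`); other providers use different event shapes, so treat the JSON path as an assumption. `DeltaStreamParser` is a hypothetical class name.

```ruby
require "json"

# Incrementally parses server-sent events and yields text deltas.
# Assumes OpenAI-style chunks; adjust the JSON path for other providers.
class DeltaStreamParser
  def initialize
    @buffer = +""
    @done = false
  end

  def done?
    @done
  end

  # Feed one raw network chunk; yields each text delta found in complete lines.
  def feed(raw)
    @buffer << raw
    # Only complete lines are processed; a trailing partial line stays buffered.
    while (newline = @buffer.index("\n"))
      line = @buffer.slice!(0..newline).strip
      next unless line.start_with?("data:")

      payload = line.delete_prefix("data:").strip
      if payload == "[DONE]"
        @done = true
        next
      end

      delta = JSON.parse(payload).dig("choices", 0, "delta", "content")
      yield delta if delta && !delta.empty?
    end
  end
end

parser = DeltaStreamParser.new
received = []
# Network chunks can split an event anywhere, even in the middle of the JSON.
[
  %(data: {"choices":[{"delta":{"content":"Hel"}}]}\n\ndata: {"choi),
  %(ces":[{"delta":{"content":"lo"}}]}\n\n),
  %(data: [DONE]\n\n)
].each { |raw| parser.feed(raw) { |text| received << text } }

puts received.join
```

Buffering until a full line arrives matters because the network layer does not respect event boundaries; parsing each raw chunk as if it were a whole event would fail on the second chunk above.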

How Everything Works Together

The architecture typically looks like this:

| Component | Responsibility |
|---|---|
| User Interface | Sends prompts and displays responses |
| Rails Application | Coordinates requests and updates |
| AI Provider | Generates streamed content |
| Action Cable | Delivers updates in real time |
| Turbo Streams | Updates the browser UI |

This combination enables a smooth, ChatGPT-style experience with minimal frontend complexity.

How Does AI Response Streaming Work in Rails?

Understanding the request lifecycle helps simplify implementation.

Step 1: User Submits a Prompt

The process begins when a user enters a message and submits it. Rails creates a conversation record and stores the user's prompt.

Step 2: A Background Job Starts Processing

AI requests should not run inside the web request cycle. Instead, a background job handles communication with the AI provider.

This approach keeps web requests responsive and prevents browser timeouts.

Step 3: The AI Provider Streams Content

As the model generates output, small chunks arrive incrementally. Instead of waiting for completion, Rails processes each chunk as it arrives.

Step 4: Action Cable Broadcasts Updates

Rails broadcasts each chunk, or a small batch of chunks, through an Action Cable channel as soon as it has been processed. Since the WebSocket connection remains open, connected clients receive updates in real time.

Step 5: Turbo Updates the Browser

Turbo Streams automatically apply these updates to the UI by appending or replacing content within the response container. As new chunks arrive and are rendered, users see the AI reply appear progressively, creating a natural typing effect.
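
The steps above can be sketched as a single unit of work. To keep the example runnable outside Rails, the provider stream and the broadcast call are injected as plain Ruby objects; `ReplyStreamer`, `FakeMessage`, and the lambda are hypothetical stand-ins. In a Rails app, the body of `call` would likely live in an Active Job's `perform`, and the broadcaster would wrap a Turbo Streams broadcast such as `Turbo::StreamsChannel.broadcast_append_to`.

```ruby
# Minimal stand-in for a Message record (hypothetical; Active Record in a real app).
FakeMessage = Struct.new(:id, :content, :status)

class ReplyStreamer
  # chunks:    any Enumerable yielding text fragments from the provider
  # broadcast: callable receiving (target_dom_id, text); in Rails this would
  #            wrap a Turbo Streams broadcast over Action Cable
  def initialize(chunks:, broadcast:)
    @chunks = chunks
    @broadcast = broadcast
  end

  def call(message)
    message.status = "streaming"
    @chunks.each do |text|
      message.content << text                         # keep the running reply
      @broadcast.call("message_#{message.id}", text)  # push the fragment out
    end
    message.status = "complete"
    message
  rescue StandardError
    message.status = "failed" # keep whatever arrived; the caller decides on retries
    raise
  end
end

sent = []
message = FakeMessage.new(7, +"", "pending")
ReplyStreamer.new(
  chunks: ["Rails ", "can ", "stream."],
  broadcast: ->(target, text) { sent << [target, text] }
).call(message)

puts message.content
```

Injecting the broadcaster also makes the streaming logic easy to test without a running Action Cable server.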

How Do You Set Up the Rails Foundation?

Before implementing streaming, it's important to establish a solid application structure.

Prerequisites

You'll typically need:

  • Rails 7 or Rails 8
  • Hotwire/Turbo
  • Action Cable
  • A background job framework
  • An AI provider that supports streaming

Designing Core Models

Many applications use a structure similar to:

  • Conversation
  • Message
  • User

A conversation contains multiple messages, including both user prompts and AI-generated responses.

This structure makes it easier to persist history and replay conversations later.

Handling Partial Responses

One common approach is storing AI output progressively.

As chunks arrive:

  1. Append the content to the existing message.
  2. Broadcast the update.
  3. Persist the latest state, ideally at intervals rather than on every chunk.

Once generation completes, mark the message as finished. This ensures users can recover conversations even if they refresh the page.

This approach is particularly useful when streaming OpenAI responses in Ruby on Rails applications, where generation may continue for several seconds and partial content should remain visible throughout.
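
One way to follow these steps without writing to the database on every chunk is to checkpoint after a certain amount of new text. The sketch below is an in-memory illustration only: `StreamedMessage`, the `save_every` threshold, and the `saves` counter (which stands in for real database writes) are all assumptions, not part of any Rails API.

```ruby
# In-memory stand-in for a streamed message (hypothetical; Active Record in a real app).
class StreamedMessage
  attr_reader :content, :status, :saves

  def initialize(save_every: 200)
    @content = +""
    @status = :streaming
    @save_every = save_every # characters between checkpoints (assumed value)
    @unsaved = 0
    @saves = 0               # counts simulated database writes
  end

  # Appends a chunk and checkpoints once enough unsaved text has accumulated.
  def append(chunk)
    @content << chunk
    @unsaved += chunk.length
    checkpoint if @unsaved >= @save_every
  end

  # Marks the message finished and always writes the final state.
  def finish
    @status = :complete
    checkpoint
  end

  private

  # Stand-in for something like `update!(content: ..., status: ...)`.
  def checkpoint
    @saves += 1
    @unsaved = 0
  end
end

reply = StreamedMessage.new(save_every: 10)
["Hello, ", "world", "!"].each { |chunk| reply.append(chunk) }
reply.finish

puts "#{reply.content} (#{reply.saves} saves)"
```

Three chunks result in two writes here: one checkpoint when the threshold is crossed and one final save. A refresh mid-generation would then show the last checkpoint rather than nothing.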

What Performance Challenges Should You Consider?

Streaming introduces unique performance considerations that don't exist in traditional request-response workflows.

  • Avoid Broadcast Overload: If an AI model streams at 80 tokens per second, sending 80 separate Action Cable updates per second can overwhelm the client's browser and your server. Consider throttling or aggregating tokens into small batches (e.g., every 3-5 tokens) before broadcasting.
  • Optimize Database Updates: Writing to the database for every incoming token creates heavy I/O, turns the database into a bottleneck, and can exhaust your connection pool. Instead, accumulate the text in memory within your background job while broadcasting the chunks immediately to the user. Save once when the AI provider signals that generation is finished; if partial replies also need to survive a page refresh, add coarse checkpoints (for example, every second or so) rather than per-token writes.
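  • Scale Action Cable with Redis: Make sure production uses an Action Cable adapter that works across processes, such as Redis (Rails 8 also ships the database-backed Solid Cable). Broadcasts originate in background job processes, so the in-process async adapter cannot reach browsers connected to your web servers. A shared adapter also lets you distribute WebSocket messages across multiple server instances as your traffic scales horizontally.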
  • Monitor Key Metrics: Successful AI streaming applications track a small set of metrics. Watching them helps teams maintain a consistent user experience as usage grows.

| Metric | Why It Matters |
|---|---|
| Time to First Token | Measures responsiveness |
| Response Completion Time | Tracks AI generation speed |
| Active WebSocket Connections | Indicates infrastructure load |
| Broadcast Frequency | Helps identify bottlenecks |
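
The batching idea from the first point can be sketched as a small buffer that flushes after a token count or a time limit, whichever comes first. `TokenBatcher` and its defaults (`max_tokens: 4`, `max_wait: 0.05` seconds) are assumptions chosen for illustration; at 80 tokens per second, batches of four reduce 80 broadcasts per second to roughly 20.

```ruby
# Groups tokens into batches by size or elapsed time before broadcasting.
class TokenBatcher
  MONOTONIC_CLOCK = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

  # The block receives each joined batch; in Rails it would broadcast it.
  def initialize(max_tokens: 4, max_wait: 0.05, clock: MONOTONIC_CLOCK, &flush)
    @max_tokens = max_tokens
    @max_wait = max_wait
    @clock = clock
    @flush = flush
    @pending = []
    @last_flush = @clock.call
  end

  def push(token)
    @pending << token
    too_many = @pending.size >= @max_tokens
    too_old = (@clock.call - @last_flush) >= @max_wait
    flush! if too_many || too_old
  end

  # Call once at the end of the stream so the tail is not lost.
  def flush!
    return if @pending.empty?

    @flush.call(@pending.join)
    @pending.clear
    @last_flush = @clock.call
  end
end

batches = []
fake_time = 0.0 # frozen clock so only the size limit triggers in this demo
batcher = TokenBatcher.new(max_tokens: 4, max_wait: 10.0, clock: -> { fake_time }) do |text|
  batches << text
end
%w[a b c d e f g h i j].each { |token| batcher.push(token) }
batcher.flush!

p batches
```

Note that the time limit is only checked when a new token arrives, so a stalled stream would hold its last tokens until the next push or the final `flush!`. A timer-driven flush would close that gap at the cost of more moving parts.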

What Are Common Challenges During AI Streaming?

Even with a robust framework like Rails, streaming applications introduce unique edge cases that traditional request-response apps don't have to deal with:

  • Duplicate Content & Network Retries: Network hiccups or aggressive retry policies can occasionally deliver the same chunk twice. Tagging each chunk with a message ID and sequence number, and making the code that applies chunks idempotent, keeps the text correctly ordered and free of repeats.
  • Partial Response Failures: Rate limits or sudden upstream provider interruptions can cut an AI response short. Your application should be designed to gracefully display whatever was successfully generated rather than breaking the UI or throwing away the partial output.
  • Users Leaving Mid-Generation: If a user closes the tab or navigates away while the AI is still typing, the background job shouldn't crash. By storing updates in a consistent state, users can return later and see exactly where the generation left off.
  • Long Responses Slowing the Interface: Gigantic text generations can burden browser rendering engines if thousands of miniature updates are continuously pushed. Aggregating tokens into small chunks before broadcasting prevents the UI from becoming laggy.
  • AI Provider Timeouts: LLM providers experience intermittent latency spikes. Setting explicit timeouts, retrying with backoff, and showing a clear fallback message ensure a temporary hiccup doesn't completely disrupt the user experience.
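
As an illustration of the idempotent handling mentioned in the first point, the sketch below applies chunks exactly once and in order. It assumes your own job assigns each chunk a sequence number starting at zero (providers do not necessarily supply one); `ChunkAssembler` is a hypothetical name.

```ruby
# Applies chunks exactly once and in order, using a per-message sequence number.
class ChunkAssembler
  attr_reader :text

  def initialize
    @text = +""
    @next_seq = 0
    @waiting = {} # early arrivals keyed by sequence number
  end

  # Returns true if new text was appended, false for duplicates or early arrivals.
  def apply(seq, chunk)
    return false if seq < @next_seq || @waiting.key?(seq) # duplicate delivery

    @waiting[seq] = chunk
    appended = false
    # Drain every chunk that is now contiguous with what we already have.
    while (ready = @waiting.delete(@next_seq))
      @text << ready
      @next_seq += 1
      appended = true
    end
    appended
  end
end

assembler = ChunkAssembler.new
results = [
  assembler.apply(0, "Hel"), # first chunk
  assembler.apply(0, "Hel"), # retry delivers it again
  assembler.apply(2, "!"),   # arrives early, held back
  assembler.apply(1, "lo")   # fills the gap, releasing chunk 2 as well
]

puts assembler.text
```

The same logic could run on the client side if broadcasts carry the sequence number, which protects against duplicates introduced after the job has already sent a chunk.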

Advanced Enhancements for a Better Experience

Once basic streaming is solid, you can layer on these features to elevate your application:

Token-by-Token Typing Effects: Simulating a natural typing flow creates a premium feel, though it must be carefully balanced against rendering performance.

Streaming Markdown Rendering: AI models frequently output Markdown. Incremental rendering via a Stimulus controller ensures the content formats correctly and remains readable while arriving.

Streaming Code Blocks: For developer tools, special handling of code blocks prevents ugly formatting issues and broken tags while the code snippet is still mid-generation.
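
One simple form of that special handling is to close an unfinished code fence temporarily before rendering a partial reply. The sketch below assumes backtick fences that start a line; it ignores tilde fences and fences of differing lengths, so it is a starting point rather than a full Markdown-aware solution. `balance_code_fences` is a hypothetical helper.

```ruby
FENCE = "`" * 3 # three backticks, built this way so the example stays fence-safe

# Returns Markdown that is safe to render mid-stream: if a code fence has been
# opened but not yet closed, a temporary closing fence is appended.
def balance_code_fences(partial_markdown)
  fence_lines = partial_markdown.each_line.count { |line| line.lstrip.start_with?(FENCE) }
  return partial_markdown if fence_lines.even?

  closing = partial_markdown.end_with?("\n") ? FENCE : "\n#{FENCE}"
  partial_markdown + closing
end

partial = "Here is code:\n#{FENCE}ruby\nputs 'hi'"
puts balance_code_fences(partial)
```

The stored message keeps the original text; only the copy passed to the renderer gets the temporary fence, so the real closing fence from the model is not duplicated when it arrives.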

Canceling Active Generations: Adding a "Stop Generating" button lets users halt a generation in progress, provided the background job checks for a cancellation signal between chunks. This gives users control and saves you API token costs.
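
A minimal sketch of cooperative cancellation follows. The `CancelFlag` class and `stream_until_cancelled` method are hypothetical; in a Rails app the flag would more likely be a column on the message or a cache key that the "Stop Generating" button's controller sets, since the web process and the job process do not share memory.

```ruby
# Thread-safe stop flag (hypothetical, in-memory only).
class CancelFlag
  def initialize
    @mutex = Mutex.new
    @cancelled = false
  end

  def cancel!
    @mutex.synchronize { @cancelled = true }
  end

  def cancelled?
    @mutex.synchronize { @cancelled }
  end
end

# Streams chunks until the source is exhausted or the flag is set.
# Returns [text_so_far, final_status].
def stream_until_cancelled(chunks, flag)
  text = +""
  chunks.each do |chunk|
    return [text, :cancelled] if flag.cancelled? # checked between chunks
    text << chunk
    yield chunk if block_given?
  end
  [text, :complete]
end

flag = CancelFlag.new
text, status = stream_until_cancelled(%w[one two three four], flag) do |chunk|
  flag.cancel! if chunk == "two" # simulate the user pressing "Stop" mid-stream
end

puts "#{status}: #{text}"
```

Returning the text gathered so far lets the application keep the partial reply. To actually save on token costs, the job should also close the provider connection when it stops reading.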

Collaborative AI Experiences: Since Action Cable supports multiple concurrent subscribers out of the box, every user subscribed to the same stream sees the response as it is generated. This opens the door to collaborative Rails real-time AI chat experiences with little extra work.

Why Rails Dominates the AI Streaming Era

There is a common misconception that you need a modern JavaScript SPA framework to build reactive AI tools. Ruby on Rails proves otherwise.

By unifying Action Cable for real-time pipes, Turbo Streams for effortless frontend updates, and Active Job for asynchronous processing, Rails handles the entire AI streaming lifecycle natively. It gives you a premium, ChatGPT-like user experience with a fraction of the code, keeping your architecture clean, maintainable, and remarkably fast.

As AI-powered features become a standard expectation, many teams are looking for practical ways to integrate them into existing applications. Whether you're exploring Rails AI integration, planning an application modernization initiative, or looking to hire Ruby on Rails developers for AI-powered projects, the right technical foundation can make all the difference. To discuss your requirements, contact our team at RailsFactory.
