IBM Content Aware Storage

The Enterprise Answer to Dark Data and AI Readiness

We sat down with Patrick Kay, IBM's Product Manager for Content Aware Storage, to understand how CAS turns decades of dark, scattered, unstructured data into a fully AI-ready knowledge repository without moving a single byte.

What Is IBM Content Aware Storage, and Why Does It Exist?

IBM's CTO Vincent Sue spotted a collision coming. On one side: an explosion of unstructured data (documents, PDFs, presentations, audio recordings, images) that had been accumulating across enterprise storage estates for decades. On the other: a generation of AI tools that needed precisely that information to be useful.

The problem was that the two couldn't communicate. AI tools needed data in a specific format: vectorised, indexed, semantically rich. Enterprise data sat in storage pools, file servers, S3 buckets, tape archives, NFS shares. Raw, disconnected, unreadable by AI in any practical sense.

IBM Content Aware Storage (CAS) was built to bridge that gap. As Patrick describes it: "It's the semantic interface for AI-ready data." Its job is to take whatever unstructured data your organisation already has, transform it into an AI-ready vector database, and make it available to any AI application that needs it, all without requiring a data migration.

Our CTO really saw we had an opportunity to unleash what he calls the holy grail of storage, being able to use natural language to query information in storage. That's where the concept of CAS was born.

Patrick Kay, Product Manager, Content Aware Storage, IBM

The Dark Data Problem

If you've been in enterprise IT for more than a decade, you'll recognise the pattern Patrick describes. An organisation accumulates storage pools over years: a NAS here for the finance team, an S3 bucket for the product department, an on-premise archive from a system that was decommissioned in 2018. Nobody is quite sure what's in half of them. Nobody has the time to find out.

Patrick's colleague coined the phrase "dark data" for exactly this: vast pools of organisational information sitting in storage that nobody knows how to unlock. The value is enormous (historical decisions, institutional knowledge, legal records, client history, product documentation), but it's effectively invisible to anyone who needs it.

Real-World Example from the Conversation

Patrick shared a striking example. An insurance company in the US was settling any claim under $1 million automatically, not because the claims were necessarily valid, but because performing the data discovery to evaluate them properly was too expensive and time-consuming.

With CAS, that same discovery becomes a natural language search against their entire historical knowledge base, a task that would previously have taken weeks of manual work, or simply not happened at all.

How CAS Works: The Technical Picture, Explained Simply

To understand what CAS does, it helps to first understand RAG (Retrieval Augmented Generation), because CAS is, in Patrick's words, "embedding the RAG framework deep into the storage layer."

What Is RAG?

Retrieval Augmented Generation: Plain English

When you ask an AI a question, it normally generates an answer from what it was trained on, which is generic knowledge from the internet. RAG changes that by giving the AI access to your specific data at the moment it answers.

Here's how it works in practice. Your documents are broken into chunks and converted into mathematical vectors: representations that capture the meaning of the content, not just the words. These live in a vector database. When a user asks a question, their prompt is also vectorised and compared against the database. The most relevant chunks are retrieved and fed into the AI's context window alongside the question, so the answer it generates is grounded in your actual data, not a hallucination.

The result: an AI that doesn't just answer generically, but answers with knowledge of your organisation, your products, your history, your procedures.
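The retrieval step described above can be sketched in a few lines of Python. This is a toy illustration only, not how CAS is implemented: the `embed` function here is a simple word-count vector standing in for a learned embedding model (so it matches words, not meaning), and the function names, sample documents, and chunk size are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 20) -> list[str]:
    """Split text into fixed-size word windows (real pipelines chunk more carefully)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A stand-in for a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Vectorise the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Hypothetical sample documents.
documents = [
    "Claims under the policy must be filed within thirty days of the incident.",
    "The cafeteria menu changes every Monday and includes a vegetarian option.",
]

# Build the "vector database": one (chunk, vector) pair per chunk.
index = [(c, embed(c)) for doc in documents for c in chunk_text(doc)]

# Retrieve the best chunk and place it in the prompt alongside the question.
question = "When must claims be filed?"
context = retrieve(question, index, k=1)
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question
print(prompt)
```

The assembled `prompt` is what would be handed to the language model. In a production system the word-count vectors would be replaced by embeddings from a trained model, and the linear scan by an approximate nearest-neighbour index.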

What CAS adds to this picture is a solution to the problems that make RAG hard to implement at enterprise scale. How do you connect all your disparate data sources? How do you process petabytes of information efficiently? How do you keep the vector database current as your data changes? How do you ensure that the AI only returns data to people who are authorised to see it? CAS answers all of these.

Connecting Disparate Data Without Migration

CAS uses a technology called Active File Management (AFM), built into IBM Storage Scale, to connect to third-party storage systems: S3-compatible cloud storage, NFS shares, NetApp, and others. Crucially, it doesn't require data migration. Your data stays exactly where it is. CAS caches it temporarily to process it, builds the vector database, and then releases the cache. The data never moves; only its vectorised representation does.

For organisations with petabytes in the cloud, this is significant. Pulling a petabyte back to on-premise to process it would take weeks and cost a fortune in egress fees. CAS processes it where it lives.

Incremental Processing: The Efficiency Breakthrough

Traditional RAG implementations used batch processing. Every update to the data meant reprocessing everything. On a petabyte of data, that's enormously expensive: Patrick estimates you might need 20 GPUs running continuously to keep a batch-processed database current.

CAS uses incremental processing instead. The initial ingest processes everything once. After that, AFM's watch folder mechanism monitors for changes and only the changed data gets reprocessed. If only 1% of a petabyte changes in a month, CAS processes 10 terabytes instead of 1,000. The GPU requirement drops from 20 down to the CAS minimum of two. That's a 90% reduction in ongoing compute cost.
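The idea behind incremental processing can be illustrated with a short Python sketch. It is not AFM's watch folder mechanism, whose internals the conversation doesn't describe; instead it assumes a simple manifest of content hashes recorded at the last ingest, and every file name and value below is hypothetical. The last few lines reproduce the arithmetic from the paragraph above.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Content hash used to detect whether a file has changed."""
    return hashlib.sha256(content).hexdigest()


def files_to_reprocess(current: dict[str, bytes], manifest: dict[str, str]) -> list[str]:
    """Return paths whose content is new or changed since the last ingest."""
    return sorted(
        path for path, content in current.items()
        if manifest.get(path) != fingerprint(content)
    )


# Hypothetical state recorded after the initial full ingest.
manifest = {
    "finance/q1.pdf": fingerprint(b"q1 report"),
    "legal/contract.docx": fingerprint(b"contract v1"),
}

# Current contents: one file unchanged, one edited, one newly added.
current = {
    "finance/q1.pdf": b"q1 report",
    "legal/contract.docx": b"contract v2",
    "product/spec.txt": b"new spec",
}

changed = files_to_reprocess(current, manifest)
print(changed)  # only the edited and the new file

# The article's arithmetic: 1% monthly change on 1 PB (1,000 TB), 20 GPUs vs 2.
total_tb = 1000
change_percent = 1
changed_tb = total_tb * change_percent / 100
gpu_reduction_percent = (20 - 2) * 100 // 20
print(changed_tb, gpu_reduction_percent)
```

A fuller version would also handle deleted files, whose vectors need removing from the database, which this sketch leaves out.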

90%+ recall accuracy achieved in IBM Research testing at 100 billion vectors, with sub-second latency

10:3 ratio of source data to vector data created: 10TB of documents generates ~3TB of vector database (text only)

90% reduction in ongoing GPU compute requirements after initial ingest, from 20 GPUs down to 2 for incremental updates

Security and Governance: Designed In, Not Bolted On

One of the most common objections to enterprise AI projects is data governance: who can see what, and how do you prevent the AI from surfacing confidential information to the wrong user? Patrick describes this as something IBM heard consistently from customers, from their own CIO office, and from NVIDIA during the design phase. The response was to build governance into the CAS architecture from the ground up, not add it later.

About IBM CAS

Full name: IBM Content Aware Storage (CAS)

Released: GA as software, GTC 2025 · Appliance support Dec 2025

Product Manager: Patrick Kay, IBM

Runs on: IBM Fusion HCI · Storage Scale System 6000 · BYO hardware

Minimum compute: 2 GPUs for incremental processing

Data types: Unstructured: PDFs, docs, images (OCR), audio/video

Storage ratio: ~10:3 source to vector (text) · up to 1:2+ multimodal

Tested scale: 100 billion vectors, 90%+ recall, sub-second latency

Governance: ACL preservation · file-level security · Fusion Data Catalog

Key Concepts Discussed

RAG: Retrieval Augmented Generation, grounding AI in your data

Dark data: Unstructured data in storage that organisations cannot access or use

Vector database: Mathematical representation of document meaning, enabling semantic search

AFM: Active File Management, which connects third-party storage without migration

Incremental ingest: Only changed data is reprocessed, a 90% GPU saving vs batch

Fusion Data Catalog: Metadata tagging and governance layer on the CAS pipeline

Three layers of governance are built in:

Access control preservation. Existing security policies from your source data (whatever permissions were set on files in your S3 bucket or NFS share) flow through the vectorisation process and persist in the vector database. When an AI application queries the database, it only retrieves vectors that the requesting user is authorised to see. Existing security doesn't need to be reimplemented; it comes along for the ride.

File-level security from IBM Storage Scale. Because CAS runs at the storage layer, it inherits and enforces the file-level security policies already in place within Storage Scale. These aren't an add-on; they're structural.

Fusion Data Catalog metadata filtering at the front end. IBM's Fusion Data Catalog sits in front of the CAS processing pipeline. It indexes metadata and allows administrators to set rules about what should and shouldn't be vectorised. PII-tagged data? Filter it out before it ever enters the vector database. Duplicate files? Removed automatically. The result is a processing pipeline that's not only safer, but more efficient: Patrick notes that filtering out 50% of the content as unnecessary also saves 50% of your processing costs.
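Two of the ideas above, filtering before ingest and permission-aware retrieval, can be sketched as follows. This is an illustrative model only, not the CAS or Fusion Data Catalog API: the `FileRecord` class, its fields, the tag names, and the group-based permission model are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """Hypothetical catalogue entry for one source file."""
    path: str
    content_hash: str
    tags: set[str] = field(default_factory=set)            # e.g. {"pii"}
    allowed_groups: set[str] = field(default_factory=set)  # copied from the source ACL


def select_for_ingest(files: list[FileRecord], blocked_tags: set[str]) -> list[FileRecord]:
    """Drop files carrying a blocked tag and skip exact duplicates (same content hash)."""
    seen: set[str] = set()
    selected = []
    for f in files:
        if f.tags & blocked_tags or f.content_hash in seen:
            continue
        seen.add(f.content_hash)
        selected.append(f)
    return selected


def visible_to(user_groups: set[str], indexed: list[FileRecord]) -> list[str]:
    """Keep only entries whose ACL shares at least one group with the user."""
    return [f.path for f in indexed if f.allowed_groups & user_groups]


files = [
    FileRecord("hr/salaries.xlsx", "a1", {"pii"}, {"hr"}),
    FileRecord("legal/nda.pdf", "b2", set(), {"legal"}),
    FileRecord("legal/nda_copy.pdf", "b2", set(), {"legal"}),  # duplicate content
    FileRecord("product/manual.pdf", "c3", set(), {"legal", "support"}),
]

# Front-end filtering: the PII-tagged file and the duplicate never get vectorised.
indexed = select_for_ingest(files, blocked_tags={"pii"})
print([f.path for f in indexed])

# Query-time filtering: a support user only sees what their groups allow.
print(visible_to({"support"}, indexed))
```

In this sketch half of the four files are filtered out before ingest, which is the situation Patrick describes: less content to vectorise means proportionally less processing.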

Data governance in the world of AI is even more critical than before because eventually it's going to be automated, and you won't even have a human watching these data flows. We designed CAS to make sure the right data gets to the right user. Full stop.

Patrick Kay, IBM

The Appliance Model vs DIY: An Honest Comparison

CAS can be deployed as software on your own hardware. Patrick is clear about this: if you have the infrastructure and the engineering resource, that route is available. But the conversation with Rick surfaced what 40 years of enterprise infrastructure experience makes obvious: DIY is almost never as simple as it sounds.

The IBM appliance model, deployed natively on IBM Fusion HCI or the Storage Scale System 6000, offers something that DIY cannot guarantee: a fully validated, stress-tested, supported stack. IBM's expert lab services install everything and confirm readiness. You're not debugging driver incompatibilities six months into a project. You're not waiting for replacement parts in a supply chain crisis. You get to a working system, faster.

As Patrick puts it: the primary reason to choose the appliance is time to value. In the enterprise technology landscape of 2026, that is often the only metric that matters.

A Point from Rick: 40 Years of Infrastructure Experience

The conversation gets candid here. Rick notes that in his career, DIY solutions that "should just work" have consistently taken six months where engineered solutions took one. The incompatibilities aren't hypothetical; they happen. Drivers. Cooling. Power. Vendor finger-pointing when something goes wrong.

The cost of a CIO pulling their hair out over cost overruns and data migration headaches doesn't appear in any TCO spreadsheet. But it's real. The appliance model prices that risk out of the equation.

What Can CAS Actually Search? Data Types and Limitations

CAS is primarily designed for unstructured data — and that covers the vast majority of enterprise dark data. PDFs, Word documents, PowerPoint presentations, HTML, XML, plain text, and crucially, images containing text (via OCR) and audio or video files (via transcription). Any document format that contains readable text is a candidate.

The processing pipeline supports multimodal content through NVIDIA's NIM (NVIDIA Inference Microservices) architecture: containerised tools for OCR, chart extraction, table extraction, and other data preparation tasks that can be deployed in sequence to handle complex documents.

Structured data (databases, spreadsheets with millions of rows) is where Patrick advises a different approach. CAS is exceptional at unstructured content; IBM's broader portfolio handles structured data management. For smaller spreadsheets and tabular content at a reasonable scale, vectorisation works fine. At billions of rows, it's a different conversation.

CAS doesn't currently connect directly to databases, but content management systems are on the roadmap: Box, OneDrive, and IBM's own FileNet among them.

Scalability: How Far Does It Go?

IBM Research recently published testing that reached 100 billion vectors while maintaining sub-second query latency and over 90% recall accuracy. Patrick is measured about this (lab conditions aren't production conditions), but the results are genuinely impressive for a technology that's been generally available for under a year.

In practice, most organisations in early CAS deployments are working with tens of millions to hundreds of millions of vectors. Easily manageable. For organisations with petabytes of unstructured data, the point where CAS becomes not just useful but essentially necessary, the scale is there to meet them.

The honest performance trade-off is one worth naming: semantic search, unlike keyword search, is not exact. It operates on meaning and similarity, not literal string matching. Above 90% accuracy is the target and CAS achieves it consistently. But organisations used to deterministic database queries need to understand that semantic search is probabilistic in nature. That's not a weakness; it's how the technology works, and why results feel more natural and contextually relevant than keyword search ever could.
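To make a figure like "90% recall" concrete, here is one common way such a number is computed: compare what an approximate search returns against the exact nearest neighbours for the same query. Whether IBM Research measured recall exactly this way is an assumption; the function and the document IDs below are purely illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: list[str]) -> float:
    """Fraction of the truly relevant items that the search actually returned."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(set(relevant))


# Hypothetical ground truth: the exact top-10 neighbours for one query.
exact_top10 = [f"doc{i}" for i in range(10)]

# Hypothetical approximate result: nine of the ten, plus one near miss.
approx_top10 = exact_top10[:9] + ["doc42"]

print(recall_at_k(approx_top10, exact_top10))  # 0.9
```

A keyword or database query has no equivalent metric because it either matches or it doesn't, which is the deterministic-versus-probabilistic distinction drawn above.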

Storage as the Hero of the AI Stack

Perhaps the most interesting thread in the conversation is a reframing of where storage sits in the AI architecture. Historically, storage was infrastructure: important but invisible, something that needed to just work and stay out of the way. In the AI era, Patrick argues, storage becomes central.

The reason is mathematical. For every 10TB of text source data, CAS generates approximately 3TB of vector data. For multimodal content, that ratio can approach 1:2 or higher. An organisation processing a petabyte of unstructured text data needs to plan for 300TB of additional storage, minimum. This surprises customers. It shouldn't, but it does, because storage has always been treated as a commodity afterthought rather than a strategic layer in the AI stack.
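The sizing arithmetic is simple enough to put in a small helper. The ratios are the approximate ones quoted in the conversation (10:3 for text, around 1:2 for multimodal); the function name and parameters are just illustrative, and real figures will vary with content and embedding settings.

```python
def vector_storage_tb(source_tb: float, vector_parts: float = 3, source_parts: float = 10) -> float:
    """Estimate vector storage from source size and a source:vector ratio.

    The default is the ~10:3 text ratio; pass vector_parts=2, source_parts=1
    for the ~1:2 multimodal case.
    """
    return source_tb * vector_parts / source_parts


# 1 PB (1,000 TB) of text at 10:3.
print(vector_storage_tb(1000))        # 300.0

# The same petabyte if it were multimodal content at 1:2.
print(vector_storage_tb(1000, 2, 1))  # 2000.0
```

The second figure shows why the multimodal case matters for planning: at 1:2, the vector data is larger than the source it was derived from.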

IBM's position with Storage Scale, CAS, tape (including S3 Glacier compatibility on IBM's enterprise tape systems), and tiering from flash through to tape means the entire storage hierarchy can participate in an AI pipeline. Cold data on tape need no longer be inaccessible to AI. That capability is on the roadmap, and it changes the economics of AI data management at enterprise scale considerably.

Where to Go Next

If this conversation has raised questions about your own data estate (about the dark data sitting in storage pools you haven't been able to make sense of, or about how to build an AI layer that actually uses your institutional knowledge), we'd welcome that conversation. As an IBM partner, Fortuna Data works with IBM's storage and AI portfolio alongside the FSAS Technologies Private GPT appliance, and we can help you understand how these pieces fit together for your specific situation.

Patrick's suggestion: start with a proof of concept. IBM and Fortuna Data can work with you to design and demonstrate a CAS deployment against a representative sample of your data before any significant commitment is made. The technology is moving fast. The best way to understand what it can do for your organisation is to see it working on your own data.
