In today’s rapidly evolving technological landscape, businesses are increasingly grappling with a challenge known as “dark data”: unstructured or hidden information that, if not managed effectively, can create significant barriers to the successful adoption of AI technologies. This is particularly true in the context of Generative AI, where improperly managed data can introduce various vulnerabilities. Sathish Murthy, Senior Systems Engineering Lead for Cohesity India and a key figure in Cohesity’s mission to tackle dark data, discussed in a recent conversation with Tech Achieve Media how organizations can overcome these obstacles. He also shared insights into how Cohesity’s solutions, including the Cohesity Data Cloud and the AI-powered Gaia platform, keep data clean, secure, and accessible, paving the way for successful AI model deployment, especially for cutting-edge technologies like Generative AI.
TAM: Can you elaborate on the concept of dark data and the key challenges businesses face in managing unstructured data, especially in the context of leveraging Generative AI for competitive advantage?
Sathish Murthy: Take any customer today, whether an enterprise or a mid-sized business, and you’ll see they handle vast amounts of data they consider highly valuable. This data often includes critical assets like customer records, production data, financial information, and other mission-critical databases—like hospital records—that are essential to their operations. Companies process and generate this high-performance data every day, creating what we can call “live” or “production” data, sometimes amounting to hundreds of terabytes.
For every 100 terabytes of production data, companies accumulate even more in what we refer to as data lakes or massive secondary repositories. This exponential growth in secondary data has led to the creation of data “avalanches” on an industry-wide scale. Globally, production data now spans hundreds of exabytes, while secondary or archival data extends into the thousands of exabytes. This secondary data may contain records going back 10, 15, even 20 years, including medical records, Social Security information, credit card numbers, and more.
Years ago, this type of data was stored on tapes. Companies saw it as insurance—something to retrieve only if necessary. But today, with a need for instantaneous access, things have changed. For example, imagine tracking someone’s passport usage across different countries or checking old credentials—these require quick access, not something that tape backups can provide effectively.
Additionally, managing security is a challenge. Many people, myself included, only use a small set of passwords derived from family names and other familiar words. With every app prompting us to change passwords periodically, keeping track becomes difficult. Many resort to storing these in files, adding to the backup load and creating what’s known as “dark data”—critical information stored in this massive data lake but rarely accessed or even noticed until needed.
Retrieving such dark data is a significant focus for customers now. For instance, if a bank faces an audit and needs to retrieve all transactions from 2020 for a specific customer group, they may face issues. The data is likely stored on outdated tape backups, perhaps created with backup software running on now-obsolete platforms like Windows NT. Finding someone with the skills to retrieve it, along with compatible technology, can take months: a daunting process, especially when the original team may have retired.
The challenge of managing this avalanche of dark data is enormous. As we look toward modern solutions, the industry is excited about technologies that can tackle this burden and unlock value from data that would otherwise remain buried. This massive footprint of dark data is particularly prominent among large enterprises, and addressing it has become a critical goal in today’s data-driven landscape.
TAM: How does a lack of data visibility increase the risk of ransomware attacks, in what ways can organizations address these vulnerabilities, and how are you helping them do that?
Sathish Murthy: Ransomware operates as a business model, though a malicious one. Skilled attackers use evolving algorithms to target specific customers and demand ransom payments. They’re not looking to do anything else—just get paid. Understanding this mindset can help us address the way ransomware itself has evolved, much like enterprises have.
Initially, ransomware 1.0 attacks were simple, like Ryuk, which encrypted a company’s primary data—its critical databases or financial applications—and demanded payment. Attackers often used phishing to find entry points, or what we call ‘surface areas.’ The more surface areas a system has, the more vulnerable it is—like a house with 20 doors, where you can’t be sure if all are closed.
Then came ransomware 2.0, where attackers advanced by targeting backup data as well, encrypting or deleting it along with primary data. This forced companies to rethink their recovery plans. With ransomware 3.0, attackers started to exfiltrate data, threatening public exposure if payment wasn’t made, a tactic known as ‘double extortion.’ Now, it wasn’t just about business disruption but also about protecting corporate reputations.
At Cohesity, we’re at the forefront of data security and management, recognizing that effective security is a team effort. We don’t just provide a shield and say, ‘we’ve got this.’ Instead, we partner with security giants like CrowdStrike, Palo Alto, Cisco, and Mandiant, building layered defenses around our customers’ ‘crown jewel’—their data. This layered approach spans application, physical, and network security, all working together to protect against ransomware.
With ransomware now in its 4.x phase, Cohesity has developed several technologies to help defend against these evolving threats:
- Immutability: Our proprietary operating system ensures data immutability, which means data can be written but not modified. This prevents ransomware from altering stored data.
- Zero Trust Protocols: We’ve integrated zero-trust principles. For instance, if an attacker impersonates a trusted administrator, the system will require additional approval for sensitive actions. Our ‘quorum’ feature, for example, mandates that multiple administrators approve actions like data deletion, adding a layer of security against unauthorized changes (see the sketch after this list).
- DataLock Compliance: We offer data-locking capabilities that allow customers to lock data for set periods, such as 1, 3, or even 10 years, depending on compliance needs. Once locked, it is tamper-proof—even from us.
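To make the immutability and quorum ideas concrete, here is a minimal, hypothetical sketch; it is not Cohesity’s implementation, and the class and method names are invented for illustration:

```python
# Hypothetical illustration only: an append-only ("write once, read many") store
# whose delete path is gated behind multi-administrator (quorum) approval.

class ImmutableBackupStore:
    def __init__(self, required_approvals=2):
        self._objects = {}                      # object_id -> bytes (write-once)
        self.required_approvals = required_approvals

    def write(self, object_id: str, data: bytes) -> None:
        if object_id in self._objects:
            # Immutability: existing objects can never be overwritten or altered.
            raise PermissionError(f"{object_id} already exists and is immutable")
        self._objects[object_id] = data

    def read(self, object_id: str) -> bytes:
        return self._objects[object_id]

    def delete(self, object_id: str, approvals: set) -> None:
        # Quorum rule: one (possibly compromised) admin cannot destroy backups alone.
        if len(approvals) < self.required_approvals:
            raise PermissionError("quorum not met: deletion requires multiple admins")
        del self._objects[object_id]


store = ImmutableBackupStore(required_approvals=2)
store.write("backup-2024-01-01", b"...snapshot bytes...")
store.delete("backup-2024-01-01", approvals={"admin-a", "admin-b"})  # allowed
```

The point of the quorum gate is that a single stolen or impersonated credential is not enough to delete or tamper with protected data.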
Additionally, we can replicate data and detect ransomware, protect against it, and—most critically—focus on recovery. No matter how many sophisticated tools companies have, recovery is paramount, as daily news reports on ransomware incidents show. Cohesity’s approach emphasizes rapid recovery to minimize downtime, as we’ve learned from real-world incidents and customer feedback.
TAM: In your view, what are the critical roles of data indexing, aggregation, and searchability in preparing backup data for Generative AI applications, and how can outdated data management practices hinder AI adoption?
Sathish Murthy: For data management, let’s start with a simple scenario: a small enterprise might only need a computer, a desk, and a basic file system. But as we consider larger enterprises—like a bank, hospital, or military deployment—the data needs grow exponentially. Data never stops growing; it scales from a hundred terabytes to an exabyte or beyond. So, the key question is, does the underlying technology have the capacity to handle this scale?
One approach is using cloud storage, which provides financial flexibility and the ability to scale up or down. However, it doesn’t solve all the challenges. We still need a file system that can grow with the business, without vendor lock-in, allowing customers to use various hardware brands without being tied to a single one.
To manage data effectively, we need a platform that can scale from hundreds of terabytes to an exabyte in a matter of years. Another crucial feature is indexing, similar to how Google works, enabling users to search and retrieve data efficiently. Just as “Googling” has become synonymous with searching, we need a system that can index data instantly as it comes in, allowing fast retrieval on demand.
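As a rough illustration of ingest-time indexing (a simplified sketch, not any vendor’s implementation), an inverted index maps each term to the objects that contain it, so lookups stay fast no matter how much data has been ingested:

```python
from collections import defaultdict

# Minimal inverted index: built as data arrives, queried on demand.
index = defaultdict(set)   # term -> set of object ids

def ingest(object_id: str, text: str) -> None:
    """Index an object at write time so later searches need no full scan."""
    for term in text.lower().split():
        index[term].add(object_id)

def search(term: str) -> set:
    """Return every object that contains the term via a direct dictionary lookup."""
    return index.get(term.lower(), set())

ingest("doc-1", "quarterly financial report 2020")
ingest("doc-2", "customer passport records archive")
print(search("passport"))   # {'doc-2'}
```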
Reliability is also essential: if any hardware component fails—a server, node, or network—it shouldn’t disrupt the business. With modern technologies like flash storage and direct-to-memory writing, a data management system should be able to scale seamlessly, even in case of failure.
This is where Cohesity’s file system, SpanFS, comes in. Cohesity was founded by Mohit Aron, who also helped create the Google File System, and it leverages his expertise in scalability and indexing. After co-founding Nutanix, a primary storage company, Aron saw the need to address secondary data storage. SpanFS is designed to be hardware-agnostic, meaning it doesn’t rely on specific hardware to function. If a component fails, Cohesity’s system automatically manages it, so customers can focus on managing their data without concern over hardware brand or compatibility.
With Cohesity’s indexing, you can search data based on context—whether by file name, content, or specific details like passwords or credit card numbers. And it maintains multiple data copies for redundancy, so in case of a disaster, recovery is smooth. Cohesity’s system also allows data to be stored on various mediums, such as cheaper SATA drives or cloud storage on Amazon, Azure, or Google, while maintaining a unified index across all locations. This enables seamless access to data, whether it’s on-premises, at the edge, or in the cloud, with a common software interface to manage it all.
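As an example of the kind of content-aware search described above, a scanner might flag candidate credit card numbers with a regular expression plus a Luhn checksum. This is a generic sketch, not Cohesity’s classifier:

```python
import re

CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_valid(number: str) -> bool:
    """Standard Luhn checksum, used to filter out random digit strings."""
    digits = [int(d) for d in number][::-1]
    total = sum(digits[0::2]) + sum(sum(divmod(d * 2, 10)) for d in digits[1::2])
    return total % 10 == 0

def find_card_numbers(text: str) -> list:
    """Return digit strings that look like card numbers and pass the checksum."""
    hits = []
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits

print(find_card_numbers("invoice paid with 4111 1111 1111 1111 yesterday"))
```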
Cohesity serves numerous Fortune 500 companies, which value its scalability, performance, and, importantly, its ability to manage and index massive amounts of data. What was once stored on tape is now on disk or in the cloud, but with Cohesity, it remains fully searchable, regardless of where it’s stored. This provides organizations with an efficient way to locate, contextualize, and utilize their data across platforms.
TAM: How does Cohesity’s Gaia enterprise search assistant improve data retrieval and reduce dark data, and what impact does this have on the accuracy and reliability of AI models used in Generative AI?
Sathish Murthy: Gaia is a new product launched by Cohesity, designed to address challenges around dark data, ransomware readiness, and data management. So, where does Gaia fit in? Picture data management as having everything you need: a data repository and built-in indexing. Now, key considerations arise, such as, “Can I build something AI-driven using RAG models (retrieval-augmented generation) to pull the data I need?”
For AI to work effectively, you need scalability, CPU power, distributed architecture, indexing, and large language models (LLMs). Cohesity’s infrastructure offers these elements: it is built on a distributed architecture that allows CPU scaling and has indexing built in. That native indexing underpins the retrieval step that RAG and vector modeling depend on. It also allows the use of LLMs tailored to the needs of enterprise customers and supports scaling to data volumes as large as an exabyte, which is now a typical requirement.
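At a high level, the retrieval step in a RAG pipeline looks roughly like the sketch below. It is an illustrative example using cosine similarity over precomputed embeddings; the `embed` and `llm` callables are placeholders supplied by the caller, not Cohesity APIs:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, doc_vectors, top_k=3):
    """Rank indexed documents by similarity to the query embedding."""
    ranked = sorted(doc_vectors.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]

def answer(question, embed, llm, doc_vectors, doc_texts):
    """RAG: retrieve relevant context, then let the LLM answer grounded in it."""
    context_ids = retrieve(embed(question), doc_vectors)
    context = "\n".join(doc_texts[i] for i in context_ids)
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")

# Toy usage with dummy embedding/LLM callables (assumptions for illustration):
docs = {"d1": "backup policy for Oracle", "d2": "M365 mailbox retention rules"}
vecs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0]}
dummy_embed = lambda q: [0.1, 0.9]                    # pretend the query is about mailboxes
dummy_llm = lambda prompt: prompt.splitlines()[1]     # echo the top-ranked context line
print(answer("How long are mailboxes retained?", dummy_embed, dummy_llm, vecs, docs))
```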
Cohesity realized the opportunity to help customers manage and leverage vast amounts of backup data, which makes up 80-90% of enterprise data storage. This data can include everything from unstructured files and Oracle databases to M365 emails and other enterprise data sources. With Cohesity’s indexing capabilities, AI models (such as Gen AI and LLMs) can leverage this extensive file system.
A unique feature Cohesity introduced is the concept of apps within the data ecosystem, similar to smartphone apps. Imagine if you could perform tasks like photo editing directly within your phone’s file system—similarly, Gaia allows customers to run apps directly in the data repository, accessing and leveraging data without external processes. For example, Gaia can work with Microsoft M365 to back up massive volumes of emails. With a simple query, Gaia can quickly locate specific emails from a particular user, including those containing certain keywords.
This ability to interact with data in a conversational way is transformative for customers. Instead of complex console searches, customers can ask Gaia questions in plain English, like, “How many backups failed last week that included Microsoft email data?” In seconds, Gaia provides answers that previously took hours or even days to gather manually. This change in how enterprises can interact with their data is leading to greater adaptability in managing data and enhancing operational efficiency.
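Under the hood, a question like the one above ultimately resolves to a filter over indexed backup metadata. A hypothetical sketch of that final step follows; the record fields and values are invented for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical backup-job records as they might appear in an index.
jobs = [
    {"source": "M365-Exchange", "status": "failed",  "ended": datetime(2024, 11, 4)},
    {"source": "Oracle-ERP",    "status": "success", "ended": datetime(2024, 11, 5)},
    {"source": "M365-Exchange", "status": "failed",  "ended": datetime(2024, 10, 1)},
]

def failed_m365_backups_last_week(jobs, now=datetime(2024, 11, 6)):
    """Translate the plain-English question into a metadata filter."""
    cutoff = now - timedelta(days=7)
    return [j for j in jobs
            if j["status"] == "failed"
            and j["source"].startswith("M365")
            and j["ended"] >= cutoff]

print(len(failed_m365_backups_last_week(jobs)))   # 1
```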
Many of Cohesity’s customers in regions like ASEAN and India have started requesting the ability to build custom LLMs to better suit their needs. This is the future direction Cohesity is working toward—enabling customers to manage, control, and derive value from data in ways that are reshaping the industry.
TAM: Can you share examples of how Indian enterprises are leveraging Cohesity’s solutions to tackle dark data challenges, successfully integrate AI technologies, and capitalize on the opportunities that Generative AI presents?
Sathish Murthy: Let me give you a classic example with banks. I like using banks as they handle massive amounts of data; hospitals are another good example, along with other companies in India that use Cohesity. For these customers, the first question is often: how do I transition from traditional tools to a scalable approach? That’s where Cohesity steps in. Customers can standardize their backups with Cohesity, moving from older tools to a platform that provides enhanced performance, scalability, and indexing capabilities.
As they migrate to Cohesity, their “dark data” is indexed and organized. This step addresses a huge challenge, giving them access to vast amounts of data that would otherwise remain unstructured. Now, with a more advanced backup system, they can leverage this data for building LLM (Large Language Model) capabilities.
Here’s another example, again with a bank (different from the previous example). They mentioned that when a new employee joins, they receive a laptop. After logging in, this employee can access a structured, guided experience thanks to data from Cohesity. For large organizations with around 50,000 employees, finding specific resources or people can be challenging. This AI-powered guide helps new hires locate everything from schematic diagrams and team information to the nearest pantry or restroom. The Cohesity platform makes this possible, providing an intuitive, interactive onboarding experience. The AI can even suggest colleagues to reach out to or relevant teams within the company.
Imagine the impact on productivity! New employees can become productive much faster, potentially reducing ramp-up time by a significant factor. This idea of an “AI-driven copilot” for onboarding is inspiring, and it’s where AI is heading—transforming onboarding into a seamless process where employees can learn and integrate without depending on a single person for guidance.
Returning to the bank example, with massive amounts of backup data at their disposal, they can now perform timely audits, ensure GDPR compliance for European clients, and meet India’s and Singapore’s data protection laws. Compliance verification becomes easier with Cohesity’s capabilities.
There’s another valuable aspect here: Cohesity can also help detect ransomware. Leveraging AI, we identify anomalies in backup data. Since Cohesity performs daily backups, any deviation from the norm is noticeable. Think of it this way: just as you might notice if your child is unwell because you see them every day, AI detects patterns in data. For instance, if an Oracle backup typically takes up 1TB but suddenly jumps to 5TB, that’s an immediate red flag. Unlike traditional tools, Cohesity can alert users to unusual changes.
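The size-deviation check described here can be as simple as comparing today’s backup against a rolling baseline. A minimal sketch follows; the threshold factor is an assumption for illustration, not Cohesity’s detection logic:

```python
def size_anomaly(history_gb, todays_gb, factor=3.0):
    """Flag a backup whose size deviates sharply from the recent average."""
    baseline = sum(history_gb) / len(history_gb)
    return todays_gb > baseline * factor or todays_gb < baseline / factor

recent = [1000, 980, 1010, 995]          # ~1 TB Oracle backups, in GB
print(size_anomaly(recent, 5000))        # True: a sudden jump to 5 TB is a red flag
print(size_anomaly(recent, 1020))        # False: within the normal range
```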
Our AI doesn’t stop there. We assess data entropy, read-write ratios, and even compare data signatures (hashing) with those on the cloud to check for potential ransomware encryption. If something seems amiss, Cohesity can trigger alerts, not only for the IT team but for the Security Operations Center (SOC) as well. It integrates seamlessly with security tools like Palo Alto XSOAR, Cisco SecureX, and CrowdStrike, enabling a proactive approach to security. To sum it up, Cohesity’s modern tools allow enterprise IT to move toward a fully digitalized, secure ecosystem, delivering substantial benefits for the customer.
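Entropy is a useful signal here because encrypted (or heavily compressed) data looks nearly random. A generic sketch of the measurement; the 7.5 bits-per-byte threshold is an assumption for illustration:

```python
import math
import os
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Bits per byte: close to 8.0 for random/encrypted data, lower for typical files."""
    if not data:
        return 0.0
    counts = Counter(data)
    return -sum((c / len(data)) * math.log2(c / len(data)) for c in counts.values())

def looks_encrypted(data: bytes, threshold=7.5) -> bool:
    return shannon_entropy(data) > threshold

print(looks_encrypted(b"hello hello hello hello"))   # False: low entropy, plain text
print(looks_encrypted(os.urandom(4096)))             # True: near-random bytes
```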