
March 30, 2021

Achieving Data Accessibility and Durability, Without Making Copies

Glen Shok


As the volume of unstructured data grows across the enterprise, there is a commensurate need for scalable file storage and management. As organizations shift from on-premises storage to multi-cloud solutions, legacy approaches to storing and managing that data have contributed to data bloat.

Data-intensive applications are growing rapidly in healthcare, financial services and manufacturing, for example. Organizations must store the data those applications generate, and they typically must also keep multiple copies of it for compliance, business continuity and/or disaster recovery. Companies in industries such as energy, media and entertainment need to transfer large files between disparate locations, which requires complex synchronization, replication and backup in perpetuity.

Traditional solutions built for peer-to-peer replication can take hours to copy these files, leaving otherwise productive employees waiting. Cloud infrastructure is further complicated by an array of local storage, replication tools, WAN accelerators, on-site backup and, often, off-site tape backup. Consequently, organizations are forced to pay cloud providers such as Amazon, Google and Microsoft for capacity well ahead of when it is needed, and costs multiply quickly.

NAS Deployments and Legacy Topology

The fact is that legacy file storage was designed for single sites. Allowing users to collaborate on files across repositories, or meeting Recovery Point Objectives (RPOs), the business targets for how much data may be lost between the last usable copy and a disaster or comparable data loss event, is only possible through scheduled replication of files between those sites. Moreover, latency makes cross-site collaboration unworkable: users may be able to access data remotely, but they cannot do it in a performant way.
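
To make the RPO idea concrete, here is a minimal Python sketch of how a scheduled replication window bounds potential data loss. The interval and cycle duration are hypothetical values chosen only for illustration.

```python
from datetime import timedelta

def worst_case_rpo(replication_interval: timedelta,
                   replication_duration: timedelta) -> timedelta:
    """Worst-case data-loss window for scheduled site-to-site replication.

    If a site is lost just before the next replication cycle completes,
    every change made since the start of the previous successful cycle is
    gone. Illustrative model only; real schedules vary.
    """
    return replication_interval + replication_duration

# Hypothetical example: replication runs every 4 hours, a cycle takes 45 minutes.
print(worst_case_rpo(timedelta(hours=4), timedelta(minutes=45)))  # 4:45:00
```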


Traditional network-attached storage (NAS) can add cost and complexity to even a simple two-site deployment. A primary datacenter houses user files and replicates them to a secondary site, which typically gives users at both sites access to the same files. In addition to that secondary copy, enterprises must also back up files so they can recover should they fall victim to data corruption or a malicious attack.

Furthermore, business units are often required to integrate with cloud partners, a burden that legacy storage solutions, which were never designed around cloud-native requirements, cannot easily carry. Applications written to work with files do not handle object storage, which limits their ability to work with data in cloud storage unless they are rewritten. It should be noted that most cloud providers offer cloud-native SMB and NFS file systems. While these do support file protocols, they are intended to ingest data into that provider's cloud, and each file system is local to that provider. They do not eliminate the forced copy operations from on-premises to cloud, and back again once cloud-native data processing is complete, and they in no way provide hybrid or multi-cloud functionality.
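
To illustrate the gap between file and object semantics described above, here is a hedged sketch contrasting a POSIX file write with the equivalent object upload through the S3 API via the boto3 SDK. The mount path, bucket and key names are hypothetical, and the snippet assumes cloud credentials are already configured.

```python
import boto3

# A file-based application reads and writes through the POSIX file API:
with open("/mnt/share/reports/q1.csv", "w") as f:   # hypothetical NAS mount
    f.write("region,revenue\nwest,1200\n")

# The same data in object storage is a whole-object PUT/GET over HTTP:
# no directory paths, no in-place partial updates, no file locking.
s3 = boto3.client("s3")
s3.put_object(Bucket="example-archive-bucket",       # hypothetical bucket
              Key="reports/q1.csv",
              Body=b"region,revenue\nwest,1200\n")
obj = s3.get_object(Bucket="example-archive-bucket", Key="reports/q1.csv")
print(obj["Body"].read().decode())
```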

In practice, S3-compatible storage buckets are frequently used only as an archive tier. There is no way to consume cloud-native services with this type of deployment, so migration of critical file storage and associated workflows to the cloud is limited. In short, legacy storage requires multiple copies of every file, and costs proliferate accordingly.
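
A back-of-the-envelope sketch shows how quickly those copies add up. The copy counts below are illustrative assumptions, not figures from any particular deployment.

```python
def raw_capacity_tb(primary_tb, replica_sites, backup_copies, cloud_archive):
    """Rough raw capacity needed when every protection layer is a full copy.

    Illustrative model: one primary copy, one full copy per replica site,
    one per retained backup generation, plus an optional archive copy.
    Ignores compression, deduplication and incremental backups.
    """
    copies = 1 + replica_sites + backup_copies + (1 if cloud_archive else 0)
    return primary_tb * copies

# Hypothetical example: 100 TB of files, one DR site, two backup generations,
# and an S3-compatible archive tier.
print(raw_capacity_tb(100, replica_sites=1, backup_copies=2, cloud_archive=True))
# -> 500.0 TB of raw capacity to protect 100 TB of unique data
```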

Accessibility Without Replication

An effective strategy for accessing files in hybrid cloud configurations, without relying on replication to meet regulatory compliance or business continuity and disaster recovery RPO/RTO targets, will centralize file data and present a single, consistent file system across multiple sites. Ideally, the data itself is efficiently distributed across a geographically dispersed network in real time, without being replicated. Durability is also requisite: data must remain protected and recoverable for backup and disaster recovery without keeping extra copies of files.


One way of achieving this is via a global file system that, instead of replicating files across locations, uses public, private or dark cloud storage as a single authoritative data source. Virtual machines at the edge, that is, on-premises or in multi-cloud regions, overcome latency by holding file system metadata and caching the most frequently used files locally for performance.
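
As a rough illustration of that division of labor, the sketch below models an edge node that holds the full namespace as metadata but caches only hot file data, fetching everything else from an authoritative object store. It is a toy in-memory model, not any vendor's implementation.

```python
class EdgeCache:
    """Toy sketch of an edge node in a global file system.

    The object store (a plain dict here) is the single authoritative copy of
    file data; the edge holds metadata for every file but caches only the
    bytes of recently used files.
    """

    def __init__(self, object_store, capacity=100):
        self.object_store = object_store   # authoritative source of truth
        self.metadata = {}                 # full namespace, held locally
        self.cache = {}                    # hot file data only
        self.capacity = capacity

    def refresh_metadata(self):
        # The edge sees the whole namespace without pulling any file data.
        self.metadata = {path: len(data) for path, data in self.object_store.items()}

    def read(self, path):
        if path in self.cache:                 # cache hit: local-speed access
            return self.cache[path]
        data = self.object_store[path]         # cache miss: fetch from the cloud
        if len(self.cache) >= self.capacity:   # naive eviction of one entry
            self.cache.pop(next(iter(self.cache)))
        self.cache[path] = data
        return data

# Hypothetical usage with an in-memory "object store".
store = {"projects/design.dwg": b"...large CAD file..."}
edge = EdgeCache(store)
edge.refresh_metadata()
print(edge.metadata)                       # full listing, no file data moved
print(edge.read("projects/design.dwg"))    # first read pulls from the store
```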

Changes users make at the edge then need to be deduplicated, compressed and encrypted, and synced with the cloud object store and every other location. This kind of immediate file consistency, for every location in the file system and across peer-to-peer connections whenever users open files, has historically been difficult to achieve.
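
A minimal sketch of that deduplicate-compress-encrypt pipeline is shown below, using Python's standard hashlib and zlib modules plus the third-party cryptography package for encryption. The upload call is a hypothetical placeholder; this illustrates the technique, not any product's sync protocol.

```python
import hashlib
import zlib
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # illustrative key handling only
cipher = Fernet(key)
uploaded_blocks = set()          # content hashes already in the object store

def sync_block(raw):
    """Deduplicate, compress and encrypt one data block before upload.

    Returns the block's content hash if it was (conceptually) uploaded,
    or None if an identical block already exists in the object store.
    """
    digest = hashlib.sha256(raw).hexdigest()
    if digest in uploaded_blocks:                 # dedupe: block already stored
        return None
    payload = cipher.encrypt(zlib.compress(raw))  # compress, then encrypt
    # upload_to_object_store(digest, payload)     # hypothetical upload call
    uploaded_blocks.add(digest)
    return digest

print(sync_block(b"quarterly figures"))   # new block: uploaded, hash returned
print(sync_block(b"quarterly figures"))   # duplicate: nothing uploaded (None)
```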

Achieving this often requires specialized software to find and remove duplicate files. With Google Drive's Backup and Sync app, for instance, a copy of the files must first be downloaded to a local drive; once duplicates are removed, the app syncs the changes and deletes the duplicates in the cloud. Clearly, immediacy is not achievable in this example.

Storing unstructured data in object storage as immutable data blocks has proven effective because, once written, those blocks cannot be modified, encrypted or overwritten. As files are changed at the edge, new immutable data blocks are written and file system pointers are updated, so the system always knows which data blocks form each file at any given point in time.
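
The sketch below shows the basic mechanics: content-addressed blocks that are written once and never changed, plus a per-file list of pointers that gets a new version on every write. It is a simplified illustration, not a description of any specific product's format.

```python
import hashlib

blocks = {}      # immutable, content-addressed blocks: hash -> bytes
manifests = {}   # file path -> list of manifest versions (lists of block hashes)

def write_file(path, data, block_size=4):
    """Store a file as immutable blocks and record a new pointer manifest.

    Existing blocks are never modified or overwritten; each change adds any
    new blocks and appends a fresh list of pointers for the file.
    """
    hashes = []
    for i in range(0, len(data), block_size):
        chunk = data[i:i + block_size]
        digest = hashlib.sha256(chunk).hexdigest()
        blocks.setdefault(digest, chunk)       # write-once, never overwritten
        hashes.append(digest)
    manifests.setdefault(path, []).append(hashes)

def read_file(path, version=-1):
    """Reassemble a file from the blocks named by one manifest version."""
    return b"".join(blocks[h] for h in manifests[path][version])

write_file("notes.txt", b"draft one")
write_file("notes.txt", b"draft two")       # unchanged blocks are shared
print(read_file("notes.txt"))               # b'draft two' (latest pointers)
print(read_file("notes.txt", version=0))    # b'draft one' (older pointers)
```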

This approach has been implemented in various ways. Azure Files can take snapshots of file shares as protection against data corruption and application errors; share snapshots capture the state of the data at a point in time. In a similar vein, Panzura offers granular restores by taking lightweight, read-only snapshots of the file system at configurable intervals. These snapshots record the point-in-time data blocks used by every file.

In the latter case, because the data blocks themselves cannot be overwritten, snapshots allow single files, folders or the complete file system to be restored in very little time after an error or disaster event, supporting Recovery Time Objectives (RTOs) and last-change RPOs.
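
A hedged sketch of that kind of pointer-based snapshot and restore follows. Because only the pointer table is copied, the snapshot is lightweight, and restoring a file simply points it back at blocks that still exist; the structures and block names here are hypothetical.

```python
import copy
import time

live_pointers = {}   # path -> list of block hashes currently forming the file
snapshots = []       # (timestamp, frozen copy of the pointer table)

def take_snapshot():
    """Record a lightweight, read-only view of the file system.

    Only the pointer table is copied; the immutable data blocks themselves
    are shared with the live file system, so snapshots stay small.
    """
    snapshots.append((time.time(), copy.deepcopy(live_pointers)))

def restore(path, snapshot_index):
    """Point a single file back at the blocks it used at snapshot time."""
    _, frozen = snapshots[snapshot_index]
    live_pointers[path] = list(frozen[path])   # no data is copied or moved

live_pointers["report.docx"] = ["b1", "b2"]
take_snapshot()
live_pointers["report.docx"] = ["b9"]          # e.g. a corrupted newer version
restore("report.docx", snapshot_index=0)
print(live_pointers["report.docx"])            # ['b1', 'b2'], the original blocks
```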

Not only is this process faster and far more precise than restoring from traditional backups, but cloud providers themselves replicate data across regions or buckets to provide up to "13 nines" of durability. This exceeds the durability many organizations can achieve even with multiple copies of data, and it relieves IT teams of maintaining separate backup and replication processes, since they can rely on the cloud object store itself.


Defensive and Offensive Data Posture

An immutable data architecture also provides an ancillary defense against data loss and corruption. Cyber threats have risen as remote work and collaboration put a growing share of workers outside the security perimeter. Immutability prevents stored data from being encrypted by ransomware, and it allows files to be restored to fixed points in time as required.

Ransomware can conceivably be slowed as well, because only frequently used files are cached at the edge. In this scenario, files and directories that are not cached must be retrieved from cloud storage when they are accessed, and that takes time.

In the event of a malware attack, data is written to object storage as new objects, so defensive alerts on spikes in cloud ingress and egress can provide early detection, limiting contamination and allowing faster recovery. This is familiar territory for cloud-based firewalls as well as advanced file-sharing systems, both of which aim to capture traffic at cloud-to-cloud and on-premises ingress and egress points. The value of these forward-looking strategies is well understood within IT departments.
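
As a simple illustration of such an alert, the sketch below flags an egress measurement that sits far outside the recent baseline. The thresholds and sample values are hypothetical; production monitoring would be considerably more sophisticated.

```python
from statistics import mean, stdev

def egress_spike(samples_gb, latest_gb, threshold_sigmas=3.0):
    """Flag an unusual spike in cloud egress volume.

    samples_gb: recent per-interval egress measurements in GB
    latest_gb:  the newest measurement to evaluate
    Returns True when the new value sits well outside the recent baseline,
    which may indicate bulk reads of uncached data during an attack.
    """
    baseline, spread = mean(samples_gb), stdev(samples_gb)
    return latest_gb > baseline + threshold_sigmas * max(spread, 0.1)

history = [12.0, 14.5, 11.8, 13.2, 12.9, 14.1]   # hypothetical hourly egress, GB
print(egress_spike(history, 13.5))   # False: within the normal range
print(egress_spike(history, 95.0))   # True: probably worth an alert
```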

The cloud offers tremendous potential for enterprises to reduce storage costs, improve productivity and reduce data availability risk. However, enterprises attempting to fully integrate the cloud as a storage tier have had to cobble together limited point solutions from various vendors, most of them built on technologies that were never designed for cloud storage.

Harnessing the potential of cloud storage too often becomes an exercise in consuming precious IT resources for implementation and management. By contrast, immutable data architectures have demonstrated significant competitive advantage while reducing both business and technological risk.

Deployed correctly, this type of real-time data productivity and protection can also break the unending cycle of on-site storage refresh and expansion, rescuing stranded islands of storage that make it difficult for people in different locations to work together.

About the author: Glen Shok is Director of Alliances at Panzura, which provides the fabric that transforms cloud storage into a global file system, allowing enterprises to use the cloud as a high performance, globally available data center. 

