    Ransomware Detection Model: A Use Case for Nutanix Hyperconverged Infrastructure (AOS) and Azure Machine Learning Studio

    Javier E. Rodriguez, PE
    School of Cybersecurity and Privacy
    Georgia Institute of Technology
    javirodz@gatech.edu

    Abstract

    This paper presents the design and implementation of a behavioral ransomware detector built with machine learning. The system models the input and output (I/O) patterns of a virtual machine in two states: at steady state, and while its files are being encrypted by a typical ransomware attack. The model relies on key performance indicators (KPIs) that are available in most modern storage arrays, together with Azure Machine Learning Studio in the Microsoft Azure cloud. This combination keeps the method accessible to practitioners who do not have specialized knowledge of ransomware internals or machine learning.

    Data collection takes place at the storage array level through the Nutanix distributed data cluster. Observing I/O at this layer makes the measurement invisible to adversarial ransomware running inside the guest operating system. Because the method is behavioral, it can be expressed as anomaly detection, which allows it to provide a general detection capability against previously unseen, zero day ransomware.

    The experiments show that, once a virtual machine reaches steady state I/O, the model reacts to the anomalies caused by active encryption with very high accuracy.

    1. Introduction

    According to several industry reports, including the CrowdStrike Global Threat Report [1] and guidance from the Cybersecurity and Infrastructure Security Agency (CISA) [2], ransomware remains one of the most visible cybersecurity risks. The practice will continue for as long as it stays profitable. Estimates of its cost vary, but they consistently exceed the billion dollar mark, and industry coverage describes a threat that keeps evolving in both scale and technique [3].

    The average ransom payment nearly doubled in the year preceding this study, yet that figure is small next to the cost of downtime. PurpleSec reported that the average cost of downtime per incident in 2020 was approximately $283,000 [4]; this is a vendor reported estimate and should be verified against a primary source before it is cited independently. The growth in attacks reached every sector, public and private.

    The central difficulty in detecting ransomware is that it uses the same libraries and system calls as legitimate applications and routine operating system tasks. By taking a generic approach built on off the shelf tools, the system described here aims to address that difficulty without depending on signatures specific to any one family.

    2. Background

    Ransomware is a class of malware that denies a user access to their data until a ransom is paid [5]. The threat actor demands payment in exchange for restoring the data, which may be anything held in the system’s storage. The goal is to prevent victims from carrying out their normal activities (see Figure 1).

    Figure 1. Steps in ransomware activity [6].

    Ransomware is commonly divided into two basic types, locker and crypto, with hybrid variants in some cases [6]. The approach in this study targets crypto ransomware, which encrypts the victim’s original data and renders it unavailable. The scheme typically includes a ransom note with instructions for paying and for obtaining the key needed to decrypt the data. This is an attack against the availability of the system.

    In some cases the threat actor also exfiltrates the data and threatens to publish it unless the ransom is paid. That tactic attacks the confidentiality of the data and can expose victims to regulatory fines, for example when the data includes payment card information or medical history.

    2.1 Technology Overview

    Figure 2 shows the technology used in the laboratory setting. The top layer, labeled App, holds the virtual machine running the Windows 10 operating system. The left side of the figure shows the conceptual layout of a hyperconverged system, and the right side shows the physical equipment used in the experiments.

    Figure 2. The legacy three tier infrastructure consists of three layers: the compute layer, the storage area network or fabric (SAN), and the storage array or arrays [7].

    Hyperconverged infrastructure (HCI) is a software defined, unified system that combines the conventional data center elements of storage, compute, networking, and management. It uses software and x86 servers in place of expensive, purpose built hardware, which reduces data center complexity and improves scalability through a single, simple console. The laboratory equipment used in these experiments is a Nutanix hyperconverged cluster running the Acropolis hypervisor. The second part of the laboratory environment runs in the Azure cloud and is described in the Analysis section.

    3. Literature Review

    Dozens of studies address ransomware detection using a wide range of techniques. A substantial body of work applies machine learning and dynamic analysis to the problem, including wrapper based feature selection [8], network traffic analysis [9], [10], software defined networking [11], behavioral classification of variants [12], layered machine learning defenses [13], finite state machine models [14], and broad surveys of the field [15], [6], [16], [17]. Behavior based automated malware analysis has also been studied in depth [18], [19], [20], [21]. Hypervisor based and disk or storage level monitoring has been explored as well [22], [23], [24], [25], along with self healing, ransomware aware file systems [26] and data centric stopping techniques [27]. The approach in this study aims to distinguish itself by collecting data only at the storage array level and by using the hypervisor to keep the attacker unaware that it is being observed. Two studies that rely on dynamic behavior are worth discussing in more detail.

    The detection system described by Kharraz et al. [28] is based on the disk access actions performed by a process. It observes the change in entropy between a read and a write to the same region of a file, the proportion of file content that is overwritten, and whether the process deletes files. It also collects metadata about disk access, including whether a process writes to many files and whether those files span very different types or come from a single application. It measures the time between write requests and assigns higher risk as that interval shortens. These features are combined into a risk score through a linear function whose weights are determined by recursive feature elimination.
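
    The entropy feature described above can be illustrated with a short sketch. It computes the Shannon entropy of a block of bytes and the change between what was read and what was written back. The function names are hypothetical and this is not the implementation from [28]; it only shows why encrypted output stands out.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_delta(read_block: bytes, written_block: bytes) -> float:
    """Entropy gained between the data read and the data written to the same region."""
    return shannon_entropy(written_block) - shannon_entropy(read_block)
```

    A large positive delta suggests that structured content was replaced with data that looks random, which is what encryption produces.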

    In a second study, Baek et al. [24] proposed a detection model based on a set of lightweight behavioral features that describe the overwriting pattern of ransomware, a pattern that is largely invariant across families.

    Figure 3. Ransomware overwriting pattern contrasted with valid applications [24].

    4. Methodology

    This section presents the design of the I/O pattern analyzer. By drawing on the key performance indicators present in most modern storage arrays, the detection model is independent of the operating system installed on top of the hypervisor. The design has two goals: first, to create an efficient monitoring tool, and second, to remain hidden beneath the operating system layer so that it resists ransomware evasion techniques.

    4.1 Threat Model

    The threat model considered in this experiment is an attacker who can infect the operating system of the virtual machine. The attacker has evaded the static detection techniques and has begun the encryption process. The attacker has no access, physical or remote, to the hypervisor. Framing the problem this way follows established threat modeling guidance for ransomware [29].

    4.2 Ransomware

    There are close to 400 families of ransomware. The behavioral characteristics of each family matter in the design of a detection model. The characteristics relevant to this experiment are the way the data is encrypted and the type of evasion techniques used. Other characteristics, such as the network flow and the attack vector, are outside the scope of this study.

    After a certain period, a guest operating system reaches a steady state of I/O access patterns. A typical application is unlikely to behave in the same way a malicious payload does, at least not continuously. Everything a ransomware executable does requires resources such as CPU and memory, and it requires access to files, because the primary goal of a crypto locker is to encrypt all of the data in a way that makes it unusable to the victim.

    I/O access pattern | I/O characteristics | Typical applications
    Streaming Reads | 100% reads; large contiguous requests; 1-64 concurrent requests; may be threaded. | Media Servers (Video-on-demand, etc.), Virtual Tape Libraries (VTL), Application Servers
    Streaming Writes | 100% writes; large contiguous requests; 1-64 concurrent requests; may be threaded. | Media Capture, VTL, Medical Imaging, Archiving, Backup, Video Surveillance, Reference Data
    OLTP | Typically 2KB to 16KB request sizes; read, modify, write, verify operations resulting in 2 reads for every write; primarily random accesses; large number of concurrent requests. When running SQL statements in parallel, the database will typically perform large random I/Os. | Databases (SAP, Oracle, SQL), Online Transaction Servers
    File Server | Moderate distribution of request sizes from 4KB to 64KB, with 4KB and 64KB comprising 70% of requests; primarily random; generally four reads for every write operation; a large number of concurrent requests during peak operational periods. | File and Printer Servers, e-mail (Exchange, Notes), Decision Support Systems
    Web Server | A wide distribution of request sizes from 512 bytes to 512KB; primarily random accesses; a large number of concurrent requests during peak operational periods. | Web Services, Blogs, RSS Feeds, Shopping Carts, Search Engines, Storage Services
    Workstations | Primarily small to medium request sizes; 80% sequential and 20% random; generally four reads for every write operation; 1-4 concurrent requests. | Business Productivity, Scientific/Engineering Applications

    Table 1. Application I/O characteristics by access pattern [30].

    Table 1 summarizes common application types together with their typical I/O patterns and behavior. Other characteristics, such as streaming versus batch access, serial versus random access, and the block size histogram, also change during a ransomware attack. Figures 4 and 5 show examples of how data processing and access patterns differ.

    Figure 4. Data processing model [30].

    Figure 5. Access pattern contrast [31].

    Kharraz et al. divide the characteristics of ransomware I/O access patterns into three main categories:

    The attacker overwrites the user’s file with the encrypted version.

    The attacker reads the file, writes a new encrypted file, and then deletes the original.

    The attacker reads the file, writes a new encrypted file, and then overwrites the original.

    Figure 6. I/O pattern categories according to Kharraz et al. [28].

    Most families use a specific file extension for the encrypted output. For example, some Mespinoza variants of ransomware use the .pysa extension. Taking these access patterns into account, some families list and then randomly encrypt the files, which is a more advanced evasion technique. Detecting this kind of malicious behavior reliably requires several orthogonal methods of monitoring, a point expanded in the Discussion and Limitations section.

    With the experimental setup ready, data collection began. A simulated ransomware script (see Appendix III) traverses the Documents folder in the Windows 10 test VM and encrypts, from top to bottom, every file with one of the following extensions:

    ".pptx", "txt", "csv", ".db", ".mdb", ".log", ".sav", ".sql", ".xml", ".key", ".cert",
    ".pem", ".doc", ".pdf", ".email", ".eml", ".msg", ".oft", ".ost",
    ".pst", ".vcf", ".apk", ".bat", ".pl", "ps1", ".pl", ".vsd", ".vss", ".vst", ".vdx",
    ".vsx", ".vtx", ".vsw", ".vsl", ".dot", ".xls", ".py", ".jpg", ".jpeg", ".png",
    ".pgp", ".tiff", "sys", ".pfx", "plist", ".vmx", ".gif", ".lic", ".kit", ".ctx",
    ".sh", ".conf", ".ttf", ".ico", ".exe", ".dmg", "kdbx", ".java", ".jar", ".yml", ".json",
    "kdb", ".dll", ".img", ".msi", ".wsf", ".htm", ".php", ".vb", ".c", ".pcap"

    A complete traversal of the Documents folder takes approximately seventeen minutes. Appendix IV shows a timestamped sequence of performance metric snapshots captured during a traversal.

    4.3 Testbed

    Configuring a laboratory setting involves several considerations. The first is to provide an environment that resembles production. In this case, several tools were used to populate a Windows 10 VM with the data needed for a ransomware attack. Building such a testbed is not trivial, and additional observations appear in the Future Work section. The design followed prudent practices for malware experiments [32], and drew on isolated analysis environments such as Cuckoo Sandbox [33], [34].

    For this scenario, the operating system is assumed to be free of ransomware during the time it takes to reach steady state. That period is when the monitor reads the I/O patterns to create a clean baseline.

    The operating system used for these experiments is Windows 10 Enterprise. Figure 7 shows the layout of the laboratory. The guest operating system runs on a Type 1 hypervisor, which in this case is Nutanix Acropolis. The hypervisor isolates the guest VM and prevents the ransomware from reaching anything outside the experimental environment.

    Figure 7. Analysis layout with a Type 1 hypervisor.

    To populate the test VM with files, two Python programs were developed to generate data. The first program, shown in Figure 8, accepts a root folder path as a starting point and creates folders to a chosen depth.

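    import os

    def gen_tree(depth, parent_dir):
        """Create a chain of nested folders, depth levels deep, under parent_dir."""
        path = parent_dir  # returned unchanged when depth is 0
        while depth > 0:
            depth = depth - 1
            # random_line is not shown in this excerpt; it is assumed to
            # return one random word from the given file.
            new_directory = random_line('words.txt')
            path = os.path.join(parent_dir, new_directory)
            try:
                os.mkdir(path)
            except OSError:
                pass  # folder already exists or cannot be created; keep going
            parent_dir = path
        return path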

    Figure 8. Python routine that creates a folder tree using common English words.

    The code uses a list of the one thousand most common words in the English language [35]. According to Kharraz et al. [28], some ransomware variants compute the entropy of a file or folder name and will not trigger if the name appears too random, so realistic names matter.

    For each path created, a second Python program generates Word files using the python-docx library [36]. To add images, two techniques were combined, one from Arrington [37] and one from Zita [38]. Finally, to increase the data volume, older documents were added, including PowerPoint presentations, PDF files, and additional images. Because only simulated ransomware was used, there was no risk of anyone stealing real data.

    The sample data consisted of 16,447 files (see Appendix I). The final number of encrypted files was lower, approximately 12,000, because the system stalled on very large files such as .zip and .ova archives.

    A second important consideration is the hardware isolation of the system in which the ransomware is triggered. Hardware isolation refers mainly to the network and to the ability of the monitoring environment to inject the ransomware without any risk of spreading it. To close network access, an isolated virtual switch with no uplink connections was used. The setup is flexible enough to move eth0 to a switch with internet access when software needs to be added to the operating system. In Figure 9, the br0 virtual switch has physical uplinks to a physical switch, while br1 is isolated.

    Figure 9. Laboratory network diagram.

    Because the test VMs run in a virtual environment, the console is available at any time without the risk of spreading the ransomware. With the environment and configuration described, the next section covers the key performance indicators that are available and how the data is collected.

    4.4 Features

    Feature identification is a broad subject. Features are the foundation of the dataset, and the dataset is only as useful as the features selected. The insight gained from the observations improves when the features chosen are well suited to the problem. This experiment had a rich set of features available; Appendix II lists them in full.

    Dataset quality improves when features are selected through a formal process such as feature engineering [39]. In this case, a combination of heuristics and the findings of the research papers reviewed for this problem guided the selection. Table 2 lists the features used.

    # | Selected feature
    1 | ctl_random_ops_per_sec
    2 | ctl_read_io_bandwidth_kBps
    3 | ctl_write_io_bandwidth_kBps
    4 | ctl_num_read_iops
    5 | ctl_num_write_iops
    6 | hv_avg_read_io_latency_usecs
    7 | hv_avg_write_io_latency_usecs
    8 | ctl_total_read_io_size_kbytes
    9 | ctl_read_size_histogram_4kB
    10 | ctl_read_size_histogram_8kB
    11 | ctl_read_size_histogram_16kB
    12 | ctl_read_size_histogram_32kB
    13 | ctl_read_size_histogram_64kB
    14 | ctl_read_size_histogram_512kB
    15 | ctl_read_size_histogram_1024kB
    16 | ctl_write_size_histogram_4kB
    17 | ctl_write_size_histogram_8kB
    18 | ctl_write_size_histogram_16kB
    19 | ctl_write_size_histogram_32kB
    20 | ctl_write_size_histogram_64kB
    21 | ctl_write_size_histogram_512kB
    22 | ctl_write_size_histogram_1024kB

    Table 2. Features selected for the detection model.

    The intuition behind this selection is that, during a ransomware attack, the I/O statistics rise above their normal levels and the characteristics of the steady state I/O pattern change. The two clearest signals were the block size and the randomness of access. For a complete list of candidate features, see Appendix II.

    4.5 Dataset

    A dataset is a collection of data samples. The dataset in this experiment contains measurements collected every 120 seconds through a REST API. There are several ways to collect this data, as shown in Figure 10. A REST API request was chosen because the results can be written to a comma separated value (.csv) file for use in training. Most modern storage arrays expose similar measurements, so the approach should carry over to other platforms without vendor lock in.

    Figure 10. Monitoring tools for the Nutanix Acropolis cluster.

    Figure 10 shows Prism, the built in monitoring tool, which includes I/O and network flow monitoring. The hyperconverged system provides an HTML5 user interface, a REST API, and a command line utility. The experiment assumes that the operating system, in this case Windows 10 Enterprise, has reached a steady state I/O pattern and is free of any ransomware infection.

    When building a machine learning dataset, the ground truth data is split into a training dataset and a testing dataset. The algorithm is trained on the training data and then evaluated on its ability to perform on the testing data [40].

    Figure 11. REST API access to the Nutanix data platform [41].

    To retrieve the performance indicators, the Nutanix cluster is queried with a request that includes:

    the unique identifier of the virtual disk (UUID);

    the metric, or KPI, being requested;

    the start time and end time, expressed as Unix epoch timestamps in microseconds; and

    the interval in seconds, where the minimum for this version of Nutanix is 120 seconds.

    Appendix V contains the code used to retrieve the KPI through the REST API.
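
    The four request components listed above can be sketched as a small URL builder. This is not the code from Appendix V. The base URL and the query parameter names are assumptions made for illustration and should be checked against the Nutanix REST API reference for the version in use.

```python
from urllib.parse import urlencode

# Hypothetical endpoint for illustration only; the real host and path
# must come from the Nutanix REST API reference.
BASE_URL = "https://cluster.example.local:9440/api/stats"


def to_usecs(epoch_seconds):
    """Convert Unix epoch seconds to microseconds."""
    return int(epoch_seconds * 1_000_000)


def build_stats_url(vdisk_uuid, metric, start_usecs, end_usecs, interval_secs=120):
    """Assemble a stats query from the UUID, metric, time range, and interval."""
    if interval_secs < 120:
        raise ValueError("the text gives 120 seconds as the minimum interval")
    if end_usecs <= start_usecs:
        raise ValueError("end time must be after start time")
    # Parameter names below are assumed, not taken from the source.
    query = urlencode({
        "metric": metric,
        "start_time_in_usecs": start_usecs,
        "end_time_in_usecs": end_usecs,
        "interval_in_secs": interval_secs,
    })
    return f"{BASE_URL}/{vdisk_uuid}?{query}"
```

    Converting the time range once with to_usecs and validating the interval before the request is sent avoids silent empty responses caused by a range given in seconds.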

    One of the main challenges in behavioral detection is distinguishing a valid application from an actual ransomware attack. Some families go further and become adversarial by using several evasion techniques. One technique that would make this approach less robust [42] is for the malware to observe its environment and imitate normal behavior.

    To model a valid application workload, the experiment used DISKSPD, a command line tool for micro benchmarking [43], [44]. The following options were used:

    diskspd -b8K -d30 -o4 -t8 -h -r -w25 -L -Z1G -c20G C:\iotest.dat > iotestResults.txt

    This command runs a 30 second random I/O test (-d30, -r) against a 20 GB file on the C: drive (-c20G), with a 25 percent write and 75 percent read ratio (-w25) and an 8 KB block size (-b8K). It uses eight worker threads (-t8), each with four outstanding I/Os (-o4), and a 1 GB write source buffer filled with random data (-Z1G). The -h flag disables software caching and hardware write caching, -L collects latency statistics, and the output is redirected to a text file. The equivalent utility on Linux is fio [45], [46].

    The DISKSPD emulator was used to model a SQL database as a representative application workload. The Future Work section returns to the need to model many suitable applications in order to build a more robust model.

    5. Analysis

    Once collected (see Appendix VI for an example), the data must be prepared before it can train the model. Two columns were added. The first is a VM identifier, which keeps the model ready for future experiments with additional test VMs. The second is the target metric, a column that indicates whether the data was collected during a ransomware attack. The value is zero for a normal operating system and one for an operating system under a ransomware attack. The word controller was shortened to ctl_, because the dataset import process appears to limit the length of a feature name and the characters it can contain.
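
    The preparation step can be sketched as follows. The function name is hypothetical, and the sketch assumes that the raw metric names begin with the prefix controller_; the column names VM and Ranso follow the sample data shown later in this section.

```python
def prepare_rows(rows, vm_id, under_attack):
    """Shorten metric prefixes and add the VM identifier and the Ranso label."""
    prepared = []
    for row in rows:
        new_row = {"VM": vm_id}
        for name, value in row.items():
            # Assumption: raw metric names start with "controller_".
            if name.startswith("controller_"):
                name = "ctl_" + name[len("controller_"):]
            new_row[name] = value
        # 0 = normal operating system, 1 = under ransomware attack.
        new_row["Ranso"] = 1 if under_attack else 0
        prepared.append(new_row)
    return prepared
```

    Rows collected during the baseline period would be passed with under_attack=False, and rows collected during an encryption run with under_attack=True.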

    5.1 Azure Machine Learning

    There are two ways to apply Azure Machine Learning here. In the first, the collected data trains a model that classifies an operating system as either clean or infected. This is a binary classification problem, which can use Azure Automated Machine Learning. In the second, the collected data serves as a baseline and Azure Anomaly Detection identifies departures from it. Open source workbenches such as WEKA [47] offer comparable modeling capabilities, but a managed cloud service keeps the workflow accessible without local setup [48].

    5.2 Automated Machine Learning

    Automated machine learning, also called automated ML or AutoML, automates the time consuming, iterative tasks of model development. It lets data scientists, analysts, and developers build models at scale with efficiency and productivity while preserving model quality [49]. The automated workflow used here was as follows:

    A .csv file with the collected data was uploaded. The data includes the I/O information of the test VM both with and without ransomware [50].

    The target metric for the classification is the Ranso column.

    Azure Automated ML evaluated several algorithms, trained the corresponding models, and recommended the best model based on accuracy. This process is time consuming.

    The recommended pipeline used MaxAbsScaler with a random forest [51].

    The data was flagged as imbalanced, most likely because there were far more samples without ransomware than with it.

    Automated ML ran for close to an hour and recommended a random forest (Figures 12 and 13).

    Figure 12. Top ranked algorithms reported by Azure Automated ML.

    Figure 13. Lowest ranked algorithms reported by Azure Automated ML.

    Figure 14. Supervised learning pipeline using a two class decision forest.

    In Figure 14, the trained model uses the two class decision forest. The steps are:

    upload the .csv to create a dataset, in this case Win10WithRanso;

    normalize the data using min and max scaling;

    split the data into 70 percent for training and 30 percent for evaluation;

    train a model using the two class decision forest algorithm; and

    score and evaluate the model (see the Findings section for details).
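
    The normalization and split steps above can be sketched in plain Python. This is an illustration of the two operations, not the Azure pipeline itself; the function names and the fixed seed are assumptions.

```python
import random


def min_max_scale(rows):
    """Scale each column of equal-length numeric rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant column maps to 0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return scaled


def split_rows(rows, train_fraction=0.7, seed=0):
    """Shuffle with a fixed seed, then split into training and evaluation sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

    With very few attack rows in the data, a random split can leave the evaluation set with almost no positive examples, so the class counts in each part are worth checking after the split.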

    At this point there is a trained model that can detect at least the type of ransomware in which encryption proceeds by reading, overwriting, and renaming the file. Because there are many families of ransomware, broader coverage would require additional models.

    5.3 Azure Anomaly Detection

    Because a ransomware event is not common, collecting data about it is difficult, and by the nature of malicious activity the datasets are imbalanced. To handle imbalanced data, Azure Machine Learning provides a category called anomaly detection.

    The collected data fits that category well: it is numerical data gathered as a uniformly spaced time series. Azure ML can detect trends and spikes and report the changes as anomaly scores. It uses principal component analysis (PCA), a technique often used in exploratory data analysis because it reveals the inner structure of the data and explains its variance [52].
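
    The idea behind PCA based anomaly scoring can be sketched with a single principal component. This is a generic illustration, not the Azure implementation: it fits the dominant direction of the clean baseline by power iteration and scores a new reading by its distance from that direction. All names are hypothetical.

```python
import math


def fit_pca_1d(rows, iterations=100):
    """Fit the mean and first principal component of baseline rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Covariance matrix (d x d) of the centered baseline data.
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
           for i in range(d)]
    # Power iteration from a unit start vector; this assumes the start
    # vector is not orthogonal to the dominant direction.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # zero variance; keep the current direction
        v = [x / norm for x in w]
    return mean, v


def anomaly_score(row, mean, component):
    """Reconstruction error: distance from the row to the principal axis."""
    c = [x - m for x, m in zip(row, mean)]
    proj = sum(a * b for a, b in zip(c, component))
    residual = [a - proj * b for a, b in zip(c, component)]
    return math.sqrt(sum(x * x for x in residual))
```

    Readings that follow the baseline relationship between features score near zero, while readings that break it, as encryption traffic is expected to, score higher.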

    Figure 15. Learning pipeline for the anomaly detection approach.

    Once the model is trained with data collected while the operating system has no ransomware, future KPI readings can be evaluated through a deployed API. The Python code below tests the model with new data, and the JSON sample that follows shows the source KPI data.

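    import json

    from azureml.core import Workspace
    from azureml.core.webservice import Webservice


    def test_model(sample_file_path='_samples.json'):
        """Send the KPI samples in a JSON file to the deployed anomaly model."""
        service_name = 'ransomaly'
        ws = Workspace.get(
            name='RansoML',
            subscription_id='e7af3a72-63c8-4a9c-a78c-d28c017f238a',
            resource_group='Ranso'
        )
        service = Webservice(ws, service_name)
        with open(sample_file_path, 'r') as f:
            sample_data = json.load(f)
        # The samples are sent to the scoring service as a JSON string.
        score_result = service.run(json.dumps(sample_data))
        print(f'Inference result = {score_result}')
        return score_result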

    Figure 16. Python source used to query the anomaly model.

    [
      {
        "VM": 1,
        "ctl_random_ops_per_sec": 612,
        "ctl_read_io_bandwidth_kBps": 3438,
        "ctl_write_io_bandwidth_kBps": 1187,
        "ctl_num_read_iops": 428,
        "ctl_num_write_iops": 145,
        "hv_avg_read_io_latency_usecs": 0,
        "hv_avg_write_io_latency_usecs": 0,
        "ctl_total_read_io_size_kbytes": 412656,
        "ctl_read_size_histogram_4kB": 0,
        "ctl_read_size_histogram_8kB": 41125,
        "ctl_read_size_histogram_16kB": 0,
        "ctl_read_size_histogram_32kB": 43,
        "ctl_read_size_histogram_64kB": 0,
        "ctl_read_size_histogram_512kB": 0,
        "ctl_read_size_histogram_1024kB": 0,
        "ctl_write_size_histogram_4kB": 127,
        "ctl_write_size_histogram_8kB": 13831,
        "ctl_write_size_histogram_16kB": 5,
        "ctl_write_size_histogram_32kB": 13,
        "ctl_write_size_histogram_64kB": 8,
        "ctl_write_size_histogram_512kB": 0,
        "ctl_write_size_histogram_1024kB": 1280,
        "Ranso": 0
      },

    Figure 17. Sample JSON file with source KPI data.

    6. Findings

    This section interprets the data and points to directions for further research. The results are presented with a confusion matrix, the standard way to evaluate a classification model [53]. In these matrices, cases where both the predicted and actual values are one (true positives) appear at the top left, and cases where both the predicted and actual values are zero (true negatives) appear at the bottom right.
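
    The four cells of a confusion matrix, and two rates derived from them, can be computed as in the sketch below. The function names are hypothetical, and any labels passed in are for illustration rather than data from the experiments.

```python
def confusion_counts(actual, predicted):
    """Count outcomes for binary labels, where 1 means encryption in progress."""
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a == 1:
            counts["tp" if p == 1 else "fn"] += 1
        else:
            counts["fp" if p == 1 else "tn"] += 1
    return counts


def recall_and_precision(counts):
    """Return (recall, precision); an empty denominator is reported as 0.0."""
    positives = counts["tp"] + counts["fn"]
    flagged = counts["tp"] + counts["fp"]
    recall = counts["tp"] / positives if positives else 0.0
    precision = counts["tp"] / flagged if flagged else 0.0
    return recall, precision
```

    On imbalanced data such as this, recall and precision are more informative than overall accuracy, because a model that never predicts an attack would still score a high accuracy.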

    Data was collected for two ransomware encryption events. The first dataset contained 1,441 rows collected while the operating system was normal and nine rows collected during encryption. Azure Machine Learning trained the model by splitting the data into 70 percent for training and 30 percent for evaluation. Figure 18 shows the results.

    Figure 18. Confusion matrix for the first run.

    As Figure 18 shows, the model classified the evaluation rows correctly with one exception: on one occasion it predicted no encryption while encryption was in progress, a false negative. In a second, fully independent experiment, the model was retrained using 63 rows collected during encryption. This time the output was 100 percent true positives and 100 percent true negatives, as shown in Figure 19.

    Figure 19. Confusion matrix for the second run.

    The more actions that are considered, and the more of them that are present during ransomware activity, the higher the identification rate. Collecting all of these actions, however, requires letting the ransomware run freely long enough to encrypt and destroy many files. In these experiments, the ransomware encrypted approximately seven hundred files per minute.
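
    The encryption rate can be cross checked against figures reported earlier in the text. The sketch below uses the approximate values of 12,000 encrypted files, a seventeen minute traversal, and a 120 second collection interval; treating them as exact is an assumption.

```python
import math

ENCRYPTED_FILES = 12_000     # approximate count reported in the Testbed section
TRAVERSAL_MINUTES = 17       # approximate traversal time reported in the text
SAMPLE_INTERVAL_SECS = 120   # collection interval reported in the Dataset section

files_per_minute = ENCRYPTED_FILES / TRAVERSAL_MINUTES
samples_per_traversal = math.ceil(TRAVERSAL_MINUTES * 60 / SAMPLE_INTERVAL_SECS)

print(round(files_per_minute))   # 706
print(samples_per_traversal)     # 9
```

    The result of about 706 files per minute agrees with the rate stated above, and nine samples per traversal is consistent with the nine encryption rows in the first dataset, although the text does not state that connection explicitly.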

    A second model was configured with Azure anomaly detection. For ransomware encryption, it identified the anomaly in 100 percent of cases. The anomaly model was trained on data from the operating system without ransomware. Table 3 shows the output of the anomaly model on the collected data.

    RansoScored LabelProbability
    110.938535
    110.927582
    110.923774
    110.93596
    110.914103
    110.930306
    110.911739
    110.938852
    110.761196
    110.938535
    110.927582
    110.923774
    110.93596
    110.914103
    110.930306
    110.911739
    110.938852
    110.761196
    110.938535
    110.927582
    110.923774
    110.93596
    110.914103
    110.930306
    110.911739
    110.938852
    110.761196
    110.938535
    110.927582
    110.923774
    110.93596
    110.914103
    110.930306
    110.911739
    110.938852
    110.761196
    110.938535
    110.927582
    110.923774
    110.93596
    110.914103
    110.930306
    110.911739
    110.938852
    110.761196
    110.938535
    110.927582
    110.923774
    110.93596
    110.914103
    110.930306
    110.911739
    110.938852
    110.761196
    110.938535
    110.927582
    110.923774
    110.93596
    110.914103
    110.930306
    110.911739
    110.938852
    110.761196

    Table 3. Output of the anomaly model on the collected data.

    The decision threshold can be tuned between zero and one; the default is 0.5. In this experiment the lowest certainty probability was 0.76, which stayed well above the default threshold.
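
    The effect of the threshold can be sketched with the distinct probabilities that appear in Table 3. The comparison rule, a label of one when the probability meets or exceeds the threshold, is an assumption about how the scored label is derived, and the function names are hypothetical.

```python
# Distinct probabilities that appear in Table 3.
PROBABILITIES = [0.938535, 0.927582, 0.923774, 0.93596, 0.914103,
                 0.930306, 0.911739, 0.938852, 0.761196]


def scored_labels(probabilities, threshold=0.5):
    """Label a reading 1 (anomalous) when its probability meets the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def margin(probabilities, threshold=0.5):
    """Distance between the weakest detection and the threshold."""
    return min(probabilities) - threshold
```

    At the default threshold every reading is labeled anomalous. Raising the threshold to 0.8 would drop the weakest reading, which shows how the setting trades missed detections against false alarms.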

    7. Discussion and Limitations

    Both trained models, the binary classifier and the anomaly detector, detected the simulated attack in these experiments. Once detection occurs, rapid response techniques and good operational practices can support recovery. A short script can take a snapshot of the system as soon as an anomaly is detected; on most modern storage arrays a snapshot has little effect on performance.

    After several weeks of research, reading, and testing, a number of limitations of this approach became clear:

    The model was trained on the behavior of one specific VM, so it is not a generic model. Addressing this would require an automated training process that builds one model per VM.

    Collecting data from a VM that is in production is difficult. In the laboratory it was possible to infect the test VM and take it down to collect data, but a production approach would need to clone the production VM, infect the clone, aggregate the data from both the production VM and the infected clone into a dataset, train and deploy the model, and then periodically update the model by repeating those steps.

    The number of observations in these experiments is small. To confirm the encouraging early findings, the observations and the collected data should be extended to many operating systems running multiple workloads.

    For these reasons, the anomaly detector, which is trained only on data from a clean system, is likely a better choice than the two class algorithm.

    8. Future Work

    This section offers a few ideas for stronger protection and higher detection accuracy. Rather than relying on a single way to detect ransomware after the static defenses have been defeated, the proposal is to combine several subsystems that together form a layered defense around the environment:

    I/O to the storage array. This is the approach presented in this study. It would be worth adding both per VM and total storage array performance, since that combination did not appear in the reviewed material.

    File decoys. A honey file technique helps to reduce false positive results [54], [55].

    Compression and deduplication. Both measurements drop during encryption, because encrypted files are poor candidates for compression and deduplication. The idea is promising, although by the time the change is visible it may be too late to stop the encryption.

    Backup verification. Most attacks try to stop the backup system; the challenge is that backup systems vary from one environment to another.

    Network communication. This signal supports the overall strategy and is very effective when combined with the other layers.
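
    The compression signal in the list above can be illustrated with a small, self-contained sketch. It uses zlib as a stand in for the array's compression engine and random bytes as a stand in for ciphertext; both are assumptions made only for illustration, and the helper name is hypothetical.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Original size divided by zlib-compressed size (higher means more compressible)."""
    return len(data) / len(zlib.compress(data))


plain = b"quarterly report: revenue, cost, margin\n" * 2000   # repetitive plaintext
cipher_like = os.urandom(len(plain))                          # stand in for encrypted data

print(f"plaintext ratio:   {compression_ratio(plain):.1f}")
print(f"ciphertext ratio:  {compression_ratio(cipher_like):.2f}")
```

    The plaintext ratio is large while the random data stays close to one, which is the kind of drop a storage array would report once a meaningful share of the data has been encrypted.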

    There is also a newer way to consume storage in a virtualized environment. The underlying storage is divided into chunks using virtual volumes (VVols) [56]. As the limitations show, it would help for the system to be aware of the specific files the ransomware accesses. Correlating file metadata with VVol utilization could give the model more insight and raise confidence in detection.

    Figure 20. Proposed system for future work.

    In Figure 20, the proposed system has two dynamic behavior monitors on the left. When an anomaly occurs, the message queue receives the alert. The top right shows a backup process monitor, and the lower right shows the honey file, or canary file, check. The bottom center shows the two outputs for a positive ransomware detection: on the left, the process that snapshots the system, and on the right, the alert module.
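
    The flow in Figure 20 can be sketched in a few lines, with Python's standard queue module standing in for the message queue. Everything named here is hypothetical, and the decision rule (a behavioral alert must be corroborated by a stopped backup process or a modified honey file) is an assumption; the study does not specify how the signals are combined.

```python
import hashlib
import queue


def canary_intact(content: bytes, expected_sha256: str) -> bool:
    """Honey file check: the decoy is intact while its hash matches the recorded one."""
    return hashlib.sha256(content).hexdigest() == expected_sha256


def evaluate(alerts, backup_running, canary_ok, take_snapshot, send_alert):
    """Drain monitor alerts and confirm them with the backup and canary checks."""
    sources = set()
    while True:
        try:
            sources.add(alerts.get_nowait())
        except queue.Empty:
            break
    # Assumed rule: at least one behavioral alert plus one corroborating signal.
    corroborated = (not backup_running) or (not canary_ok)
    positive = bool(sources) and corroborated
    if positive:
        take_snapshot()              # left output in Figure 20
        send_alert(sorted(sources))  # right output in Figure 20
    return positive


# Two behavior monitors raise alerts; the decoy file has been overwritten.
alerts = queue.Queue()
alerts.put("vm_io_monitor")
alerts.put("array_io_monitor")
decoy = b"decoy contents"
recorded_hash = hashlib.sha256(decoy).hexdigest()

actions = []
detected = evaluate(
    alerts,
    backup_running=True,
    canary_ok=canary_intact(b"\x8f\x11 overwritten", recorded_hash),
    take_snapshot=lambda: actions.append("snapshot"),
    send_alert=lambda sources: actions.append(("alert", sources)),
)
print(detected, actions)
```

    Requiring corroboration is one way to use the extra layers to reduce false positives; a stricter or looser rule would fit the same structure.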

    Acknowledgments

    I thank my family for their support and patience during this research. I also thank Dr. Mustaque Ahamad for his guidance during the semester; his feedback and recommendations helped me meet the learning objectives of the course. Finally, I thank my fellow students, whose positive attitude and strong work in the weekly progress reports kept motivation high.

    Appendix

    Appendix I. File Extension Detailed Count

    The sample dataset used to exercise the simulated ransomware contained the file types listed below, grouped by extension with total size and file count.

    File Extension    Total Size (MB)    File Count
    Total107.846421
    .al0.6721
    .at0.0532
    .backup0.1212
    .bak34.4167
    .basex0.00710
    .bash_history0.011
    .bashrc0.0031
    .bat0.03517
    .bin3.0996
    .boot01
    .bz20.8321
    .c2.4627
    .c320.3384
    .cat0.7065
    .cfg0.2886
    .changed01
    .class1.942794
    .clb0.0462
    .com0.1211
    .common0.0051
    .conf0.696
    .config0.0873
    .controlio0.1725
    .cpgz24.6451
    .cpp0.0291
    .crash0.0241
    .crdownload292.1163
    .crit01
    .css0.14143
    .csv3.082114
    .ctd0.0094
    .ctx0.5755
    .db0.4061
    .deb57.3755
    .debug0.1764
    .default01
    .der0.0011
    .dir0.0341
    .diskdefines01
    .dll0.7872
    .dmg88.353
    .doc15.56932
    .docm0.3162
    .docx518.988486
    .dotx0.3983
    .drt0.7761
    .dtd0.0883
    .dump0.0312
    .dylib0.0732
    .EFI2.3672
    .eml0.0532
    .ena0.463
    .ent0.0293
    .eps14.63717
    .epub5.3891
    .err0.0196
    .exe2074.40227
    .factoryio0.3863
    .FCD04
    .flake80.0011
    .gif0.11943
    .gpg5.2523
    .grp20.7642
    .gz1374.106205
    .h0.065
    .hpp0.0081
    .htc0.0022
    .htm0.0283
    .html5.55725
    .icns0.2492
    .ico0.0492
    .ics0.0196
    .img9.6045
    .in01
    .info02
    .ini0.0027
    .input_i0.3122
    .iso20275.22116
    .jar156.356342
    .jnlp0.05615
    .jpeg3.90618
    .jpg157.814162
    .jpg_large0.3251
    .js4.105106
    .json24.039164
    .kdb0.0084
    .kdbx0.03813
    .keystream0.8553
    .kit0.0332
    .lbb0.2793
    .len07
    .lic0.20993
    .license0.0011
    .lock01
    .log204.64302
    .lst0.018
    .manifest0.0713
    .md0.05315
    .md502
    .mgmtd0.711
    .mod2.012236
    .mp4493.2191
    .mpp0.1331
    .msg0.0921
    .msi125.7177
    .names0.0041
    .nar107.01932
    .ndp-proxy0.0061
    .netconfig0.0111
    .netrwhist01
    .nib0.34622
    .notice01
    .nvram0.0711
    .old1881.35711
    .omsg0.0841
    .one0.8361
    .out31.37927
    .ova20349.0799
    .pak92.2252
    .pcap0.0031
    .pcf0.0012
    .pdf1264.3231457
    .pem0.0032
    .pf20.0051
    .pfx0.0021
    .pg_dump0.0171
    .pkg16.1062
    .plist0.0052
    .png108.4061415
    .policy0.0131
    .potx5.9451
    .ppt34.6544
    .pptx529.425171
    .profile01
    .properties0.0248
    .psd1.42710
    .pxd0.0053
    .py6.193647
    .pyc0.5463
    .pyd4.31316
    .pyi0.463184
    .pyx0.0853
    .rar5.2781
    .rdp0.0116
    .rll1.3571
    .rpc0.0412
    .rpm1764.36527
    .rpm-utils0.0021
    .rsrc0.0011
    .rtf1.823102
    .run140.4761
    .s0.0182
    .sample0.0212
    .sb1987.0472
    .SET0.0373
    .sh223.58720
    .SHData0.29512
    .size01
    .slf0.0077
    .so12.77259
    .sql0.0041
    .sqlite1.1251
    .st0.0139
    .strings02
    .symbolMap50.9156
    .sys0.0661
    .tar933.6215
    .template1.8858
    .tex01
    .tgz1520.1227
    .thrift0.0011
    .tif2.0373
    .tiff5.49975
    .TORRENT0.0181
    .ts0.22114
    .ttf3.28821
    .txt91.105567
    .url01
    .vdx8.4487
    .vib103.0356
    .viminfo0.0111
    .vmdk26319.8764
    .vmsd02
    .vmsn0.0281
    .vmx0.0062
    .vmxf0.0042
    .vscodeignore0.00141
    .vsd116.92850
    .vsdx11.48214
    .vss216.99413
    .vssx5.8462
    .war3.8332
    .warn02
    .x320.1991
    .x640.21
    .xls64.106203
    .xlsm98.673121
    .xlsx242.441415
    .xml70.8624454
    .xq8.211735
    .xqm0.19631
    .xsd0.12220
    .xslt0.0012
    .yml0.0011
    .zip16207.155232

    Appendix II. Available Features

    The Nutanix platform exposes the performance indicators below at the VM, cluster, and storage container levels. The subset used for the model appears in Table 2.

    VM: CPU Usage (%); CPU Ready Time (%); Memory Usage (%); Storage Controller IOPS (IOPS); Storage Controller Read IOPS (IOPS); Storage Controller Write IOPS (IOPS); Storage Controller Latency (ms); Storage Controller Read Latency (ms); Storage Controller Write Latency (ms); Storage Controller I/O Bandwidth (Mbps); Storage Controller Read Bandwidth (Mbps); Storage Controller Write Bandwidth (Mbps); Disk Usage (GiB); Disk Usage (%); Snapshot Usage (GiB); Shared Data (GiB); I/O Working Set size (MiB); Read I/O Working Set size (MiB); Write I/O Working Set size (MiB); Read Size Distribution (bytes/%); Write Size Distribution (bytes/%); Network Receive Packets Dropped (# packets); Network Transmit Packets Dropped (# packets); Network Rx (KiB); Network Tx (KiB)

    Cluster: CPU Usage (%); Memory Usage (%); Controller IOPS (IOPS); Controller Read IOPS (IOPS); Controller Write IOPS (IOPS); Controller AVG Latency (ms); Controller AVG Read Latency (ms); Controller AVG Write Latency (ms); Controller I/O Bandwidth (Mbps); Controller Read Bandwidth (Mbps); Controller Write Bandwidth (Mbps)

    Storage Container: Storage Controller IOPS (IOPS); Storage Controller Read IOPS (IOPS); Storage Controller Write IOPS (IOPS); Storage Controller Latency (ms); Storage Controller Read Latency (ms); Storage Controller Write Latency (ms); Storage Controller I/O Bandwidth (Mbps); Storage Controller Read Bandwidth (Mbps); Storage Controller Write Bandwidth (Mbps)

    Virtual Disk: Random I/O (%); Read Source Cache (KBps); Read Working Set size (MiB); Write Working Set size (MiB); Union Working Set Size

    Appendix III. Simulated Ransomware Script (Python)

    The script below traverses a target folder and, for each file whose extension matches the encryption list, encrypts the file in place using a symmetric key. It was used only against the isolated test VM, and the equivalent PowerShell technique is described by Rayner [57].

    import os

    from cryptography.fernet import Fernet

    KEY_FILE = 'my_key.key'

    # Extension list kept exactly as in the original script.
    EXTENSIONS = [".pptx", "txt", "csv", ".db", ".mdb", ".log", ".sav", ".sql", ".xml", ".key",
                  ".cert", ".pem", ".doc", ".pdf", ".email", ".eml", ".msg", ".oft", ".ost",
                  ".pst", ".vcf", ".apk", ".bat", ".pl", "ps1", ".pl", ".vsd", ".vss", ".vst",
                  ".vdx", ".vsx", ".vtx", ".vsw", ".vsl", ".dot", ".xls", ".py", ".jpg", ".jpeg",
                  ".png", ".pgp", ".tiff", "sys", ".pfx", "plist", ".vmx", ".gif", ".lic", ".kit",
                  ".ctx", ".sh", ".conf", ".ttf", ".ico", ".exe", ".dmg", "kdbx", ".java", ".jar",
                  ".yml", ".json", "kdb", ".dll", ".img", ".msi", ".wsf", ".htm", ".php", ".vb",
                  ".c", ".pcap"]


    def load_or_create_key():
        # Reuse one key for the whole run. Generating a new key for every file
        # overwrites my_key.key each time, so only the last file could be decrypted.
        if os.path.exists(KEY_FILE):
            with open(KEY_FILE, 'rb') as my_key:
                return my_key.read()
        key = Fernet.generate_key()
        with open(KEY_FILE, 'wb') as my_key:
            my_key.write(key)
        return key


    def encrypt_file(filename, key):
        # Initialize the Fernet object
        fernet_object = Fernet(key)
        # Read the file
        with open(filename, 'rb') as original_file:
            original = original_file.read()
        # Encrypt the contents
        encrypted = fernet_object.encrypt(original)
        # Overwrite the file; skip files that cannot be written
        try:
            with open(filename, 'wb') as encrypted_file:
                encrypted_file.write(encrypted)
        except OSError:
            pass


    def decrypt_file(filename, key):
        # Initialize the Fernet object
        fernet_object = Fernet(key)
        # Read the encrypted file
        with open(filename, 'rb') as encrypted_file:
            encrypted = encrypted_file.read()
        # Decrypt the contents
        decrypted = fernet_object.decrypt(encrypted)
        # Overwrite the file
        with open(filename, 'wb') as decrypted_file:
            decrypted_file.write(decrypted)


    def get_file_list(root_folder):
        file_list = []
        # Use os.walk(root_folder, topdown=False) to list bottom-up instead
        for root, dirs, files in os.walk(root_folder):
            for name in files:
                file_list.append(os.path.join(root, name))
        return file_list


    def test_file_extension(file_name):
        # Substring match against the file name (the part after the last backslash),
        # as in the original script, so ".doc" also matches ".docx" and ".xls" matches ".xlsx".
        base_name = file_name.rpartition('\\')[2]
        return any(ext in base_name for ext in EXTENSIONS)


    if __name__ == '__main__':
        root_folder = 'C:\\Users\\Win\\Documents\\'
        if os.path.exists(root_folder):
            file_list = get_file_list(root_folder)

            # Encryption pass. The encrypt and rename calls are commented out,
            # so the script as listed only prints the matching files.
            #key = load_or_create_key()
            for file_name in file_list:
                if test_file_extension(file_name):
                    print(file_name)
                    #encrypt_file(file_name, key)
                    #os.rename(file_name, file_name + ".pysa")

            # Decryption pass: comment out the loop above and enable this one.
            #key = load_or_create_key()
            #for file_name in file_list:
            #    decrypt_file(file_name, key)
            #    print(file_name.split('.'))
            #    #os.rename(file_name, file_name.rsplit('.pysa', 1)[0])
        else:
            print("Folder does not exist")

    Appendix IV. Sequence of Graphical Data During Ransomware Encryption

    The snapshots below show the Prism performance metrics captured at successive timestamps while the simulated ransomware encrypted the Documents folder.

    Figure A4.1. Performance metrics snapshot 1 of 9 during encryption.

    Figure A4.2. Performance metrics snapshot 2 of 9 during encryption.

    Figure A4.3. Performance metrics snapshot 3 of 9 during encryption.

    Figure A4.4. Performance metrics snapshot 4 of 9 during encryption.

    Figure A4.5. Performance metrics snapshot 5 of 9 during encryption.

    Figure A4.6. Performance metrics snapshot 6 of 9 during encryption.

    Figure A4.7. Performance metrics snapshot 7 of 9 during encryption.

    Figure A4.8. Performance metrics snapshot 8 of 9 during encryption.

    Figure A4.9. Performance metrics snapshot 9 of 9 during encryption.

    Appendix V. Python Code to Retrieve KPI Using the REST API

    import json
    import sys

    import requests


    class AHVRestApi():
        # Holds the connection parameters and an authenticated session.
        def __init__(self):
            self.serverIpAddress = "NUTANIX SERVER IP ADDRESS"
            self.username = "USERNAME"
            self.password = "PASSWORD"
            # Base URL at which REST services are hosted in Prism Gateway.
            BASE_URL = 'https://%s:9440/api/nutanix/v2.0/'
            self.base_url = BASE_URL % self.serverIpAddress
            self.session = self.get_server_session(self.username, self.password)

        def get_server_session(self, username, password):
            # Not part of the original listing, which calls this method without
            # defining it. Assumed here: HTTP basic authentication, JSON content,
            # and no certificate verification (for a self-signed Prism certificate).
            # Adjust these settings to the environment.
            session = requests.Session()
            session.auth = (username, password)
            session.verify = False
            session.headers.update({'Content-Type': 'application/json; charset=utf-8'})
            return session

        def getVirtualDiskInformation(self, virtual_disk_id, start_time_usecs, end_time_usecs, interval_secs, metric):
            # Request one metric for one virtual disk over the given time window.
            URL = self.base_url + "virtual_disks/" + virtual_disk_id + "/stats/"
            params = {
                "metrics": metric,
                "start_time_in_usecs": start_time_usecs,
                "end_time_in_usecs": end_time_usecs,
                "interval_in_secs": interval_secs,
            }
            serverResponse = self.session.get(URL, params=params)
            return json.loads(serverResponse.text)


    if __name__ == "__main__":
        try:
            ahvRestApi = AHVRestApi()
            ckoo_virtual_disk_id = 'c2193bad-29f2-4156-94d8-7bfc928f25c0'
            #win10_virtual_disk_id = '8a337f0a-d6d4-4157-a26a-93729680fb70' #old id
            win10_virtual_disk_id = '5065fba7-0671-409c-a746-eba05c38dda9'
            win2019_virtual_disk_id = 'd2e69200-82c8-4f7f-bc4a-8de856f905cc'
            # Times are microseconds since the Unix epoch; the comments give the
            # UTC time each value decodes to.
            #start_time_usecs = 1614429000000000 # Sat, Feb 27, 2021 12:30:00 UTC
            #start_time_usecs = 1614774600000000 # Wed, Mar 3, 2021 12:30:00 UTC
            #start_time_usecs = 1615077000000000 # Sun, Mar 7, 2021 00:30:00 UTC
            start_time_usecs = 1616247600000000  # Sat, Mar 20, 2021 13:40:00 UTC
            end_time_usecs = 1616248620000000    # Sat, Mar 20, 2021 13:57:00 UTC
            interval_secs = "120"
            metrics = ["controller.random_ops_per_sec",
                       "controller_read_io_bandwidth_kBps",
                       "controller_write_io_bandwidth_kBps",
                       "controller_num_read_iops",
                       "controller_num_write_iops",
                       "hypervisor_avg_read_io_latency_usecs",
                       "hypervisor_avg_write_io_latency_usecs",
                       "controller_total_read_io_size_kbytes",
                       "controller.read_size_histogram_4kB",
                       "controller.read_size_histogram_8kB",
                       "controller.read_size_histogram_16kB",
                       "controller.read_size_histogram_32kB",
                       "controller.read_size_histogram_64kB",
                       "controller.read_size_histogram_512kB",
                       "controller.read_size_histogram_1024kB",
                       "controller.write_size_histogram_4kB",
                       "controller.write_size_histogram_8kB",
                       "controller.write_size_histogram_16kB",
                       "controller.write_size_histogram_32kB",
                       "controller.write_size_histogram_64kB",
                       "controller.write_size_histogram_512kB",
                       "controller.write_size_histogram_1024kB"]
            # Write one line per metric: the metric name followed by its list of samples.
            with open("data.txt", 'w') as my_file:
                for metric in metrics:
                    win10_virtual_disk = ahvRestApi.getVirtualDiskInformation(
                        win10_virtual_disk_id, str(start_time_usecs), str(end_time_usecs),
                        interval_secs, metric)
                    this_value = win10_virtual_disk['stats_specific_responses'][0]['values']
                    print(metric + "," + str(this_value) + "\n")
                    my_file.write(metric + "," + str(this_value) + "\n")
        except Exception as ex:
            print(ex)
            sys.exit(1)

    Appendix VI. Collected Data

    Figure A6.1 shows an example of the data collected through the REST API and prepared for training.

    Figure A6.1. Example of the collected and prepared dataset.
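
    As a rough sketch of the preparation step, the lines written to data.txt by the script in Appendix V (a metric name, a comma, and a Python-style list of samples) can be pivoted into one row per sampling interval. The function names and the sample values below are hypothetical, and the actual preparation used for Figure A6.1 may have differed.

```python
import ast


def parse_metric_lines(lines):
    """Map each metric name to its list of samples."""
    series = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, _, raw_values = line.partition(",")
        series[name] = ast.literal_eval(raw_values)
    return series


def pivot(series):
    """Turn {metric: [samples]} into one dict per sampling interval."""
    names = list(series)
    # zip stops at the shortest series, so ragged inputs are truncated.
    return [dict(zip(names, samples)) for samples in zip(*series.values())]


# Made-up lines in the same shape as data.txt.
sample_lines = [
    "controller_num_read_iops,[12, 15, 240]",
    "controller_num_write_iops,[8, 9, 310]",
]
for row in pivot(parse_metric_lines(sample_lines)):
    print(row)
```

    Each resulting row holds every metric for one interval, which is the shape a tabular training dataset needs before a label column is added.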

    Appendix VII. DISKSPD Output

    Command Line: C:\DISKSPD\x86\diskspd.exe -b8k -d30 -o4 -t4 -h -r -w25 -Z1G -L -c20G c:\iotest.dat
    Input parameters:
    timespan:   1
    -------------
    duration: 30s
    warm up time: 5s
    cool down time: 0s
    measuring latency
    random seed: 0
    path: 'c:\iotest.dat'
    think time: 0ms
    burst size: 0
    software cache disabled
    hardware write cache disabled, writethrough on
    write buffer size: 1073741824
    performing mix test (read/write ratio: 75/25)
    block size: 8192
    using random I/O (alignment: 8192)
    number of outstanding I/O operations: 4
    thread stride size: 0
    threads per file: 4
    using I/O Completion Ports
    IO priority: normal
    System information:
    computer name: Win
    start time: 2021/02/27 13:53:11 UTC
    Results for timespan 1:
    *******************************************************************************
    actual test time: 30.01s
    thread count: 4
    proc count: 2
    CPU |  Usage |  User  |  Kernel |  Idle
    -------------------------------------------
       0|  23.02%|   7.92%|   15.10%|  76.98%
       1|  24.43%|  14.64%|    9.79%|  75.57%
    -------------------------------------------
    avg.|  23.72%|  11.28%|   12.45%|  76.28%
    Total IO
    thread |       bytes     |     I/Os     |    MiB/s   |  I/O per s |  AvgLat  | LatStdDev |  file
    -----------------------------------------------------------------------------------------------------
         0 |        84271104 |        10287 |       2.68 |     342.73 |   11.662 |    16.062 | c:\iotest.dat (20GiB)
         1 |        81010688 |         9889 |       2.57 |     329.47 |   12.127 |    17.012 | c:\iotest.dat (20GiB)
         2 |        84172800 |        10275 |       2.67 |     342.33 |   11.676 |    16.164 | c:\iotest.dat (20GiB)
         3 |        80904192 |         9876 |       2.57 |     329.04 |   12.142 |    17.595 | c:\iotest.dat (20GiB)
    -----------------------------------------------------------------------------------------------------
    total:         330358784 |        40327 |      10.50 |    1343.58 |   11.897 |    16.710
    Read IO
    thread |       bytes     |     I/Os     |    MiB/s   |  I/O per s |  AvgLat  | LatStdDev |  file
    -----------------------------------------------------------------------------------------------------
         0 |        62570496 |         7638 |       1.99 |     254.48 |   11.267 |    16.182 | c:\iotest.dat (20GiB)
         1 |        60710912 |         7411 |       1.93 |     246.91 |   11.872 |    16.114 | c:\iotest.dat (20GiB)
         2 |        63102976 |         7703 |       2.01 |     256.64 |   11.461 |    16.900 | c:\iotest.dat (20GiB)
         3 |        60448768 |         7379 |       1.92 |     245.85 |   12.000 |    18.401 | c:\iotest.dat (20GiB)
    -----------------------------------------------------------------------------------------------------
    total:         246833152 |        30131 |       7.84 |    1003.88 |   11.645 |    16.920
    Write IO
    thread |       bytes     |     I/Os     |    MiB/s   |  I/O per s |  AvgLat  | LatStdDev |  file
    -----------------------------------------------------------------------------------------------------
         0 |        21700608 |         2649 |       0.69 |      88.26 |   12.802 |    15.654 | c:\iotest.dat (20GiB)
         1 |        20299776 |         2478 |       0.65 |      82.56 |   12.891 |    19.429 | c:\iotest.dat (20GiB)
         2 |        21069824 |         2572 |       0.67 |      85.69 |   12.321 |    13.705 | c:\iotest.dat (20GiB)
         3 |        20455424 |         2497 |       0.65 |      83.19 |   12.560 |    14.952 | c:\iotest.dat (20GiB)
    -----------------------------------------------------------------------------------------------------
    total:          83525632 |        10196 |       2.65 |     339.70 |   12.643 |    16.050
    total:
      %-ile |  Read (ms) | Write (ms) | Total (ms)
    ----------------------------------------------
        min |      0.442 |      1.430 |      0.442
       25th |      7.628 |      8.612 |      7.870
       50th |      9.215 |     10.198 |      9.463
       75th |     10.993 |     11.980 |     11.277
       90th |     14.605 |     15.977 |     14.952
       95th |     22.197 |     24.319 |     22.712
       99th |     68.312 |     70.150 |     68.543
    3-nines |    285.154 |    274.683 |    285.154
    4-nines |    468.722 |    467.886 |    468.722
    5-nines |    473.159 |    472.866 |    473.159
    6-nines |    473.159 |    472.866 |    473.159
    7-nines |    473.159 |    472.866 |    473.159
    8-nines |    473.159 |    472.866 |    473.159
    9-nines |    473.159 |    472.866 |    473.159
        max |    473.159 |    472.866 |    473.159

    References

    [1] CrowdStrike, 2020 Global Threat Report. Sunnyvale, CA, USA: CrowdStrike, Inc., 2020.

    [2] Cybersecurity and Infrastructure Security Agency, “Protecting against ransomware,” Security Tip ST19-001, Apr. 11, 2019. [Online]. Available: https://www.cisa.gov/news-events/news/protecting-against-ransomware

    [3] The Hacker News, “Everything you need to know about evolving threat of ransomware,” thehackernews.com, Feb. 2021. [Online]. Available: https://thehackernews.com/2021/02/everything-you-need-to-know-about.html

    [4] PurpleSec, “Ransomware statistics, data, and trends,” 2021. [Online]. Available: https://purplesec.us/resources/cyber-security-statistics/ransomware/

    [5] G. Hull, H. John, and B. Arief, “Ransomware deployment methods and analysis: Views from a predictive model and human responses,” Crime Science, vol. 8, no. 2, 2019, doi: 10.1186/s40163-019-0097-9.

    [6] E. Berrueta, D. Morato, E. Magana, and M. Izal, “A survey on detection techniques for cryptographic ransomware,” IEEE Access, vol. 7, pp. 144925-144944, 2019, doi: 10.1109/ACCESS.2019.2945839.

    [7] B. Scott, “Case for HCI in the modern datacenter,” MyPureSupport Community, 2017. [Online]. Available: https://community.mypuresupport.com/case-for-hci-over-legacy-3-tier/

    [8] M. S. Abbasi, H. Al-Sahaf, and I. Welch, “Particle swarm optimization: A wrapper-based feature selection method for ransomware detection and classification,” in Applications of Evolutionary Computation (EvoApplications 2020), Lecture Notes in Computer Science, vol. 12104. Cham, Switzerland: Springer, 2020, pp. 181-196, doi: 10.1007/978-3-030-43722-0_12.

    [9] O. M. K. Alhawi, J. Baldwin, and A. Dehghantanha, “Leveraging machine learning techniques for Windows ransomware network traffic detection,” in Cyber Threat Intelligence, Advances in Information Security, vol. 70. Cham, Switzerland: Springer, 2018, pp. 93-106, doi: 10.1007/978-3-319-73951-9_5.

    [10] R. Moussaileb, N. Cuppens, J.-L. Lanet, and H. Le Bouder, “Ransomware network traffic analysis for pre-encryption alert,” in Foundations and Practice of Security (FPS 2019), Lecture Notes in Computer Science, vol. 12056. Cham, Switzerland: Springer, 2020, pp. 20-38, doi: 10.1007/978-3-030-45371-8_2.

    [11] G. Cusack, O. Michel, and E. Keller, “Machine learning-based detection of ransomware using SDN,” in Proc. 2018 ACM Int. Workshop on Security in Software Defined Networks & Network Function Virtualization (SDN-NFV Sec), 2018, pp. 1-6, doi: 10.1145/3180465.3180467.

    [12] H. Daku, P. Zavarsky, and Y. Malik, “Behavioral-based classification and identification of ransomware variants using machine learning,” in Proc. 2018 17th IEEE Int. Conf. Trust, Security and Privacy in Computing and Communications / 12th IEEE Int. Conf. Big Data Science and Engineering (TrustCom/BigDataSE), 2018, pp. 1560-1564, doi: 10.1109/TrustCom/BigDataSE.2018.00224.

    [13] S. K. Shaukat and V. J. Ribeiro, “RansomWall: A layered defense system against cryptographic ransomware attacks using machine learning,” in Proc. 2018 10th Int. Conf. Communication Systems & Networks (COMSNETS), 2018, pp. 356-363, doi: 10.1109/COMSNETS.2018.8328219.

    [14] G. Ramesh and A. Menen, “Automated dynamic approach for detecting ransomware using finite-state machine,” Decision Support Systems, vol. 138, art. 113400, 2020, doi: 10.1016/j.dss.2020.113400.

    [15] B. A. S. Al-rimy, M. A. Maarof, and S. Z. M. Shaid, “Ransomware threat success factors, taxonomy, and countermeasures: A survey and research directions,” Computers & Security, vol. 74, pp. 144-166, 2018, doi: 10.1016/j.cose.2018.01.001.

    [16] D. W. Fernando, N. Komninos, and T. Chen, “A study on the evolution of ransomware detection using machine learning and deep learning techniques,” IoT, vol. 1, no. 2, pp. 551-604, 2020, doi: 10.3390/iot1020030.

    [17] O. Or-Meir, N. Nissim, Y. Elovici, and L. Rokach, “Dynamic malware analysis in the modern era: a state of the art survey,” ACM Computing Surveys, vol. 52, no. 5, art. 88, pp. 1-48, 2019, doi: 10.1145/3329786.

    [18] A. Mohaisen, O. Alrawi, and M. Mohaisen, “AMAL: High-fidelity, behavior-based automated malware analysis and classification,” Computers & Security, vol. 52, pp. 251-266, 2015, doi: 10.1016/j.cose.2015.04.001.

    [19] D. Sgandurra, L. Munoz-Gonzalez, R. Mohsen, and E. C. Lupu, “Automated dynamic analysis of ransomware: Benefits, limitations and use for detection,” arXiv:1609.03020, Sep. 2016.

    [20] M. E. Ahmed, H. Kim, S. Camtepe, and S. Nepal, “Peeler: Profiling kernel-level events to detect ransomware,” in Computer Security: ESORICS 2021, Lecture Notes in Computer Science, vol. 12972. Cham, Switzerland: Springer, 2021, pp. 240-260, doi: 10.1007/978-3-030-88418-5_12.

    [21] A. Y. Huang, “Towards robust malware detection,” M.Eng. thesis, Dept. Electr. Eng. Comput. Sci., Massachusetts Inst. Technol., Cambridge, MA, USA, 2018.

    [22] A. Fattori, A. Lanzi, D. Balzarotti, and E. Kirda, “Hypervisor-based malware protection with AccessMiner,” Computers & Security, vol. 52, pp. 33-50, 2015, doi: 10.1016/j.cose.2015.03.007.

    [23] N. Paul, S. Gurumurthi, and D. Evans, “Towards disk-level malware detection,” in Proc. Workshop on Code Based Software Security Assessments (CoBaSSA), 2005.

    [24] S. Baek, Y. Jung, A. Mohaisen, S. Lee, and D. Nyang, “SSD-Insider: Internal defense of solid-state drive against ransomware with perfect data recovery,” in Proc. 2018 IEEE 38th Int. Conf. Distributed Computing Systems (ICDCS), 2018, pp. 875-884, doi: 10.1109/ICDCS.2018.00089.

    [25] W. Xie, N. Chen, and B. Chen, “Poster: Incorporating malware detection into flash translation layer,” in Proc. 2020 IEEE Symp. Security and Privacy (Poster Session), 2020.

    [26] A. Continella, A. Guagnelli, G. Zingaro, G. De Pasquale, A. Barenghi, S. Zanero, and F. Maggi, “ShieldFS: A self-healing, ransomware-aware filesystem,” in Proc. 32nd Annu. Computer Security Applications Conf. (ACSAC), 2016, pp. 336-347, doi: 10.1145/2991079.2991110.

    [27] N. Scaife, H. Carter, P. Traynor, and K. R. B. Butler, “CryptoLock (and drop it): Stopping ransomware attacks on user data,” in Proc. 2016 IEEE 36th Int. Conf. Distributed Computing Systems (ICDCS), 2016, pp. 303-312, doi: 10.1109/ICDCS.2016.46.

    [28] A. Kharraz, W. Robertson, D. Balzarotti, L. Bilge, and E. Kirda, “Cutting the Gordian knot: A look under the hood of ransomware attacks,” in Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA 2015), Lecture Notes in Computer Science, vol. 9148. Cham, Switzerland: Springer, 2015, pp. 3-24, doi: 10.1007/978-3-319-20550-2_1.

    [29] D. Sebayan, “How threat modeling can prevent your next ransomware attack,” ThreatModeler, 2019. [Online]. Available: https://threatmodeler.com/

    [30] Datacadamia, “I/O: workload (access pattern),” 2019. [Online]. Available: https://datacadamia.com/io/access_pattern

    [31] J. Layton, “IO patterns: what you do not know can hurt you,” Enterprise Storage Forum, 2013. [Online]. Available: https://www.enterprisestorageforum.com/management/io-patterns-what-you-dont-know-can-hurt-you/

    [32] C. Rossow, C. J. Dietrich, C. Grier, C. Kreibich, V. Paxson, N. Pohlmann, H. Bos, and M. van Steen, “Prudent practices for designing malware experiments: Status quo and outlook,” in Proc. 2012 IEEE Symp. Security and Privacy, 2012, pp. 65-79, doi: 10.1109/SP.2012.14.

    [33] Cuckoo Foundation, “Preparing the host: Cuckoo Sandbox v2.0.7 book,” 2019. [Online]. Available: https://cuckoo.readthedocs.io/en/latest/installation/host/

    [34] D. Murchison, “Home lab series: Cuckoo Sandbox on ESXi,” murchisd.github.io, Jan. 25, 2019. [Online]. Available: https://murchisd.github.io/pr0j3cts/2019/01/25/Cuckoo-Sandbox-and-ESXi.html

    [35] EF Education First, “1000 most common words in English,” 2015. [Online]. Available: https://www.ef.com/wwen/english-resources/english-vocabulary/top-1000-words/

    [36] S. Canny, “python-docx documentation,” 2013. [Online]. Available: https://python-docx.readthedocs.io/

    [37] A. Arrington, “Automate Google image downloads with Python,” Medium, Apr. 19, 2020. [Online]. Available: https://medium.com/@austin_9875/automate-google-image-downloads-with-python-91b633130ba9

    [38] C. Zita, “How to download Google images using Python (2021),” Level Up Coding (Medium), Jan. 25, 2021. [Online]. Available: https://levelup.gitconnected.com/how-to-download-google-images-using-python-2021-82e69c637d59

    [39] DataRobot, “Feature variables,” DataRobot AI Wiki, 2019. [Online]. Available: https://www.datarobot.com/wiki/

    [40] W. Arbash, “Dataset vs ground-truth dataset,” wao.ai, 2019. [Online]. Available: https://wao.ai/blog/dataset-vs-ground-truth-dataset

    [41] Nutanix, “API reference,” Nutanix.dev, 2020. [Online]. Available: https://www.nutanix.dev/api-reference/

    [42] Wikipedia contributors, “Robustness (computer science),” Wikipedia, The Free Encyclopedia, 2019. [Online]. Available: https://en.wikipedia.org/wiki/Robustness_(computer_science)

    [43] G. Berry, “Using Microsoft DiskSpd to test your storage subsystem,” SQLPerformance.com, Aug. 4, 2015. [Online]. Available: https://sqlperformance.com/2015/08/io-subsystem/diskspd-test-storage

    [44] J. Yi, “Use DISKSPD to test workload storage performance,” Azure Stack HCI Documentation, Microsoft Learn, 2020. [Online]. Available: https://learn.microsoft.com/azure-stack/hci/manage/diskspd-overview

    [45] B. Sjerps, “Pinpointing I/O bottlenecks on Linux,” Dirty Cache, Mar. 4, 2011. [Online]. Available: https://bartsjerps.wordpress.com/2011/03/04/io-bottleneck-linux/

    [46] Wikipedia contributors, “Memory access pattern,” Wikipedia, The Free Encyclopedia, 2020. [Online]. Available: https://en.wikipedia.org/wiki/Memory_access_pattern

    [47] G. Holmes, A. Donkin, and I. H. Witten, “WEKA: A machine learning workbench,” in Proc. 2nd Australia and New Zealand Conf. Intelligent Information Systems (ANZIIS), 1994, pp. 357-361.

    [48] Microsoft, “Create machine learning models,” Microsoft Learn Training, 2020. [Online]. Available: https://learn.microsoft.com/training/paths/create-machine-learn-models/

    [49] Microsoft, “What is automated machine learning (AutoML)?,” Azure Machine Learning Documentation, Microsoft Learn, 2020. [Online]. Available: https://learn.microsoft.com/azure/machine-learning/concept-automated-ml

    [50] Microsoft, “Tutorial: Train a classification model with no-code automated ML in the Azure Machine Learning studio,” Microsoft Learn, 2020. [Online]. Available: https://learn.microsoft.com/azure/machine-learning/tutorial-first-experiment-automated-ml

    [51] F. Lazzeri, “How to select algorithms for Azure Machine Learning,” Microsoft Learn, 2020. [Online]. Available: https://learn.microsoft.com/azure/machine-learning/how-to-select-algorithms

    [52] Microsoft, “PCA-based anomaly detection (ML Studio classic),” Azure Machine Learning Studio Module Reference, 2019. [Online]. Available: https://learn.microsoft.com/previous-versions/azure/machine-learning/studio-module-reference/pca-based-anomaly-detection

    [53] Microsoft, “Train and evaluate classification models,” Microsoft Learn Training, 2020. [Online]. Available: https://learn.microsoft.com/training/modules/train-evaluate-classification-models/

    [54] C. Moore, “Detecting ransomware with honeypot techniques,” in Proc. 2016 Cybersecurity and Cyberforensics Conf. (CCC), 2016, pp. 77-81, doi: 10.1109/CCC.2016.14.

    [55] Kaspersky, “What is a honeypot?,” Kaspersky Resource Center, 2020. [Online]. Available: https://usa.kaspersky.com/resource-center/threats/what-is-a-honeypot

    [56] C. Hosterman, “The case for vVols and ransomware,” codyhosterman.com, Mar. 17, 2020. [Online]. Available: https://www.codyhosterman.com/2020/03/the-case-for-vvols-and-ransomware/

    [57] T. Rayner, “Simulating a ransomware attack with PowerShell,” CanITPro Blog, Microsoft TechNet, Jan. 27, 2016. [Online]. Available: https://learn.microsoft.com/archive/blogs/canitpro/simulating-a-ransomware-attack-with-powershell

  • Deploying Nutanix AHV with Pure Storage FlashArray: A Practical Field Guide

    Author: Javier Rodriguez, Managing Technical Architect, ePlus Technology  |  javier.rodriguez@eplus.com

    Why This Architecture Matters Now

    For years, Nutanix was almost synonymous with HCI, where compute and storage live together in the same nodes. That model works exceptionally well for general-purpose workloads, but it has always had a ceiling: when you need more storage capacity or performance, you also have to buy more compute, whether you need it or not.

    The formal partnership between Nutanix and Pure Storage, announced at .NEXT 2025 in May, changes that equation. Nutanix AHV can now run as a compute only platform backed by Pure Storage FlashArray over NVMe/TCP. Each Nutanix AOS vDisk maps directly to a FlashArray volume, which means per VM granularity for snapshots, quality-of-service controls, and replication. You get the operational simplicity of Prism as a unified management plane while the FlashArray handles the storage heavy lifting underneath.

    This post walks through what it actually takes to build that environment, based on the Cisco FlashStack with Nutanix Installation Field Guide (v1.0, December 2025) and supplemental information from Nutanix and Pure Storage documentation.


    Architecture Overview

    The deployment model covered here uses Cisco UCS servers (X-series, C-series, or B-series) as compute only nodes, managed by Cisco Intersight in Intersight Managed Mode (IMM). The nodes connect to Cisco UCS Fabric Interconnects (FIs), and those FIs connect upstream to top of rack (ToR) switches. The Pure Storage FlashArray sits off to the side as a dedicated external storage array, connected to those same ToR switches over NVMe/TCP.

    • No local storage is used for AOS datastores. The nodes are diskless or have local drives used only for the hypervisor boot.
    • Storage traffic travels over dedicated VLANs and dedicated vNIC pairs, separate from management and guest VM traffic.
    • The Nutanix Controller VM (CVM) on each node handles the NVMe/TCP initiator connections to the FlashArray automatically. Administrators do not need to manually configure NVMe initiators.
    • Prism Central (or Prism Element) is the primary management interface for the cluster, while Cisco Intersight manages the UCS hardware layer.

    Software Version Requirements

    Before any hardware gets racked, confirm that all components meet the minimum software versions. Using mismatched versions is one of the most common causes of failed deployments.

    Component                            | Minimum Version          | Notes
    Nutanix AOS                          | 7.5 or later             |
    Nutanix AHV                          | 11.0 or later            |
    Foundation Central                   | 1.10 only                | Do not use 2.x at this time
    Prism Central                        | 7.5 or later             | Required for Licensing
    Nutanix LCM                          | 3.3                      | Included with AOS 7.5
    Pure Storage Purity/FA               | 6.10.3 or later          | Upgrade must be done before installation begins
    Cisco Fabric Interconnect            | 4.3(4.240066) or later   |
    Cisco Intersight Virtual Appliance   | 1.1.5-1 or later         | Older CVA/PVA versions will cause failures
    Cisco UCS X210c-M7 Firmware          | 5.4(0.250048) or later   |
    Cisco UCS C-series M6/M7 Firmware    | 4.3(6.250053) or later   |
    Cisco UCS B-series M5/M6 Firmware    | 5.3(0.250021) or later   |

    Important: Only Foundation Central version 1.10 should be used. The Appliance VM version 2.x is explicitly not supported for this deployment type. Do not use it.
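
    One way to catch a mismatch before anything is racked is to compare the versions you have recorded against the minimums in the table. The sketch below is a plain shell helper, not part of any Nutanix, Pure Storage, or Cisco tooling. The function names version_ge and check and the sample "found" values are hypothetical, and the comparison only handles purely numeric dotted versions such as 7.5 or 6.10.3. The Cisco firmware strings with parentheses would need their own parsing.

```shell
#!/bin/sh
# version_ge FOUND MINIMUM
# Succeeds (exit 0) when FOUND >= MINIMUM, comparing dot-separated numeric
# fields left to right. Missing fields count as 0, so 7.5 equals 7.5.0.
version_ge() (
    a=$1
    b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        fa=${a%%.*}
        fb=${b%%.*}
        # Drop the field just read; clear the string once no dot remains.
        if [ "$a" = "$fa" ]; then a=""; else a=${a#*.}; fi
        if [ "$b" = "$fb" ]; then b=""; else b=${b#*.}; fi
        fa=${fa:-0}
        fb=${fb:-0}
        if [ "$fa" -gt "$fb" ]; then exit 0; fi
        if [ "$fa" -lt "$fb" ]; then exit 1; fi
    done
    exit 0
)

# check NAME FOUND MINIMUM: print a one-line verdict for a component.
check() {
    if version_ge "$2" "$3"; then
        echo "OK    $1 $2 (minimum $3)"
    else
        echo "FAIL  $1 $2 (minimum $3)"
    fi
}

# Hypothetical "found" values; the minimums come from the table above.
check "Nutanix AOS" "7.5"    "7.5"
check "Nutanix AHV" "11.0"   "11.0"
check "Purity/FA"   "6.9.12" "6.10.3"
```

    Comparing field by field as integers matters here: a plain string comparison would rank 6.9.12 above 6.10.3, which is exactly the kind of error that lets an out-of-date array slip through.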


    IP Address Planning

    IP address planning should be completed before any configuration begins. Retrofitting addressing after the fact wastes time and introduces risk.

    Infrastructure

    • 2 addresses for the Fabric Interconnects
    • 1 address for the Foundation Central Appliance VM
    • 1 optional address for Prism Central / Foundation Central VM

    Per Nutanix Host (five addresses each)

    1. AHV hypervisor management address
    2. Controller VM (CVM) management address
    3. CIMC management address (assigned as a pool in Intersight)
    4. Storage interface address, VLAN 1 (assigned as a pool in Prism Element)
    5. Storage interface address, VLAN 2 (assigned as a pool in Prism Element)

    Pure Storage FlashArray (seven addresses)

    • 1 per controller management interface
    • 1 roaming array management address
    • 1 per NVMe/TCP storage interface (minimum 4 across two controllers and two VLANs)

    Storage addresses must be Layer 2 adjacent to the hosts and cannot traverse a router. Use two separate storage VLANs, one for the A side controller interfaces and one for the B side.
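
    The counts above add up quickly, so it can help to total them before requesting address space. The following is a minimal sketch of that arithmetic in shell. The function names are hypothetical, and the per VLAN figure assumes the layout used in this post: one storage address per host plus one FlashArray interface per controller in each storage VLAN.

```shell
#!/bin/sh
# total_ips NODES [WITH_PC]
# Counts addresses from the plan above: 2 Fabric Interconnects, 1 Foundation
# Central Appliance VM, 5 per Nutanix host, 7 for the FlashArray, plus 1 more
# when WITH_PC is "yes" (the optional Prism Central / Foundation Central VM).
total_ips() (
    nodes=$1
    with_pc=${2:-no}
    total=$((2 + 1 + 5 * nodes + 7))
    if [ "$with_pc" = "yes" ]; then
        total=$((total + 1))
    fi
    echo "$total"
)

# storage_ips_per_vlan NODES
# Minimum addresses needed in each storage VLAN: one per host plus two
# FlashArray interfaces (one on each controller).
storage_ips_per_vlan() {
    echo $(($1 + 2))
}

echo "4 nodes, no optional VM:   $(total_ips 4) addresses"
echo "4 nodes, with optional VM: $(total_ips 4 yes) addresses"
echo "4 nodes, per storage VLAN: $(storage_ips_per_vlan 4) addresses"
```

    These are minimums. The storage interface pools in Prism Element should be sized larger than the per VLAN figure if the cluster is expected to grow.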


    Step 1: Pure Storage FlashArray Configuration

    The FlashArray setup is best done via CLI. Some configuration tasks cannot be completed through the Purity GUI. This assumes the array is already racked, cabled, powered on, and reachable on its management network, with Purity/FA 6.10.3 or later already running.

    Enable and Configure NVMe/TCP Interfaces

    Four Ethernet interfaces need to be assigned to NVMe/TCP. The example below uses interfaces eth10 and eth11 on each controller.

    # Enable the four storage interfaces
    purenetwork eth enable ct0.eth10
    purenetwork eth enable ct0.eth11
    purenetwork eth enable ct1.eth10
    purenetwork eth enable ct1.eth11
    # Assign addresses, MTU 9000, and NVMe/TCP service
    purenetwork eth setattr --address 10.1.61.100/24 --mtu 9000 --servicelist nvme-tcp ct0.eth10
    purenetwork eth setattr --address 10.1.62.100/24 --mtu 9000 --servicelist nvme-tcp ct0.eth11
    purenetwork eth setattr --address 10.1.61.101/24 --mtu 9000 --servicelist nvme-tcp ct1.eth10
    purenetwork eth setattr --address 10.1.62.101/24 --mtu 9000 --servicelist nvme-tcp ct1.eth11

    Use MTU 9000 (jumbo frames) wherever possible. If jumbo frames are not supported end to end in your network, set MTU to 1500 and ensure consistency across all components.

    Create the Realm, Pod, and Administrative User

    # Create the Realm for RBAC segmentation
    purerealm create <realm_name>
    # Create a Pod within the Realm for this Nutanix cluster
    purepod create <realm_name>::<pod_name>
    # Create the management access policy granting admin rights to the Realm
    purepolicy management-access create --role admin --realm <realm_name> <policy_name>
    # Create the user Nutanix will use to authenticate to the array
    pureadmin create <username> --access-policy <policy_name>

    Important: Set a storage quota on the Pod after creation. Without a quota, the array will not accurately report available storage to Nutanix.


    Step 2: Cisco UCS and Intersight Configuration

    Configure Fabric Interconnect A first via serial console or HTTPS Express Setup, setting the management mode to Intersight. After FI-A shows a login prompt, configure FI-B. FI-B will detect the peer and prompt to join the cluster.

    Once the FIs are up, log into Cisco Intersight and claim the UCS domain using the Device ID and Claim Code from the FI web console. Create Resource Groups and an Organization, create and deploy a Domain Profile to the Fabric Interconnects, and set the System QoS Best Effort MTU to 9216 to allow jumbo frames.

    The vNIC layout is where compute only Nutanix deployments require careful attention. Each server needs at least two vNIC pairs: an infrastructure pair for AHV and CVM management traffic, and a dedicated storage pair carrying the NVMe/TCP storage VLANs.

    Infrastructure vNIC naming is case sensitive. Foundation Central will reject the deployment if these names are wrong. For a server with a single VIC, use:

    • ntnx-infra-1-A: Slot MLOM, PCIe Order 0, Fabric A, Failover disabled
    • ntnx-infra-1-B: Slot MLOM, PCIe Order 1, Fabric B, Failover disabled
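
    Because a single wrong character is enough to fail validation, it can be worth checking the names you plan to type into the LAN Connectivity Policy before you create it. This is a small local sketch only: is_valid_infra_vnic is a hypothetical helper, it does not query Intersight, and it accepts just the two single VIC names given above.

```shell
#!/bin/sh
# is_valid_infra_vnic NAME
# Succeeds only for the two exact names listed above for a single-VIC server.
# Shell case patterns are case sensitive, which matches the requirement.
is_valid_infra_vnic() {
    case "$1" in
        ntnx-infra-1-A|ntnx-infra-1-B) return 0 ;;
        *) return 1 ;;
    esac
}

# Two deliberately wrong spellings alongside one correct name.
for name in ntnx-infra-1-A ntnx-infra-1-b NTNX-infra-1-B; do
    if is_valid_infra_vnic "$name"; then
        echo "ok:       $name"
    else
        echo "rejected: $name"
    fi
done
```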


    Step 3: Foundation Central Deployment

    For first-time cluster deployments with no existing Nutanix infrastructure, the Appliance VM is the simplest path. Deploy it with 2 vCPUs, 4 GB of RAM, and a static IP address; DHCP is not supported. After the VM boots, run the setup script from the local console, then access the GUI at https://<FC_IP>:9440.

    Upload the AOS installation package, its metadata JSON file, and the AHV ISO via API calls to the Appliance VM. Retrieve the hosted file URLs by browsing to http://<FC_IP>:8053/files/images and enter those URLs into the cluster deployment wizard.


    Step 4: Nutanix Cluster Deployment

    Connect Foundation Central to Cisco Intersight by entering the API Key ID and Secret Key under Settings. The API key user must have at minimum Server Administrator privileges in the relevant Intersight Organization.

    1. Onboard the Cisco UCS servers by selecting Intersight Managed Mode and choosing the target nodes
    2. Select the onboarded nodes and click Create Cluster
    3. Select Compute Cluster (not HCI Cluster, since there is no local storage)
    4. Configure the infrastructure vNIC pair and at least one dedicated storage vNIC pair
    5. Assign IP addresses and hostnames. Use Bulk Configuration to set sequential addresses efficiently
    6. Enter the download URLs for AOS, the AOS metadata file, and the AHV ISO
    7. Set NTP servers, DNS servers, and timezone
    8. Select the Foundation Central API Key and click Create Deployment

    Deployments without firmware changes typically complete in 75 to 90 minutes. If firmware upgrades are required, add 60 to 90 minutes.


    Step 5: External Storage Connectivity

    After the Nutanix cluster is up and accessible in Prism Element, select I’ll Do This Later when prompted to set up external storage. The virtual switch configuration must be done first.

    1. Edit the default virtual switch vs0 to remove any storage vNICs, leaving only the infrastructure vNIC pairs as uplinks
    2. Create a new dedicated storage virtual switch, assign it the storage vNIC pair, set MTU to 9000, and Bond Type to Active-Active with MAC pinning
    3. Create one External Storage Interface per storage VLAN, associated with the new storage virtual switch, with an IP pool large enough for one address per node plus room for growth. Enable the External Storage option and set MTU to 9000.

    Before attaching the array, verify jumbo frame connectivity from the CVMs to all four FlashArray storage interfaces:

    ping -M do -s 8972 10.1.61.100
    ping -M do -s 8972 10.1.62.100
    ping -M do -s 8972 10.1.61.101
    ping -M do -s 8972 10.1.62.101

    All four tests should complete with 0% packet loss. Then, from Prism Element, click Attach External Storage and select Pure Storage FlashArray. Enter the array's clustered management IP (the roaming management address from the IP plan) and the Realm administrative username and password, select the Realm and Pod, and click Attach. The connection typically completes within 30 to 60 seconds.
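
    The 8972 byte payload in the ping test is not arbitrary. It is the 9000 byte MTU minus the 20 byte IPv4 header and the 8 byte ICMP header, and the -M do flag forbids fragmentation so an undersized hop fails loudly instead of silently splitting the packet. The sketch below wraps that arithmetic and the loop over interfaces. The helper names icmp_payload and check_path are hypothetical, and it assumes the Linux iputils ping already used above and IPv4 headers without options.

```shell
#!/bin/sh
# icmp_payload MTU
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header (no options) minus 8 bytes of ICMP header.
icmp_payload() {
    echo $(($1 - 20 - 8))
}

# check_path MTU ADDRESS...
# Pings each address with fragmentation prohibited (-M do, Linux iputils)
# and reports pass or fail per address. Exits non-zero if any address fails.
check_path() (
    size=$(icmp_payload "$1")
    shift
    failed=0
    for addr in "$@"; do
        if ping -M do -c 3 -s "$size" "$addr" >/dev/null 2>&1; then
            echo "pass $addr ($size byte payload)"
        else
            echo "FAIL $addr ($size byte payload)"
            failed=1
        fi
    done
    exit "$failed"
)

echo "MTU 9000 -> payload $(icmp_payload 9000)"
echo "MTU 1500 -> payload $(icmp_payload 1500)"

# Example run from a CVM, using the addresses from the FlashArray example:
# check_path 9000 10.1.61.100 10.1.62.100 10.1.61.101 10.1.62.101
```

    If the environment falls back to MTU 1500, the same helper gives the matching test size, so the check stays meaningful either way.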


    Step 6: Post-Installation Tasks

    Change the default passwords on three accounts on AHV (root, admin, and nutanix) and on the CVM nutanix account. Afterward, run the NCC default password check:

    ncc health_checks system_checks default_password_check

    Run a full NCC health check from Prism Element and resolve all failures and warnings before the cluster goes into production. LCM 3.3 ships with AOS 7.5. Run an inventory job to see available Nutanix software updates. Note that LCM will not perform server firmware updates for compute only nodes connected to external storage.


    Things Worth Calling Out

    vNIC naming. The ntnx-infra-1-A and ntnx-infra-1-B names in the LAN Connectivity Policy are case sensitive. A single capitalization error will cause the deployment to fail at validation. Fix it in the Intersight policy and resubmit.

    Foundation Central version. Version 1.10 only. The Appliance VM version 2.x exists and is available, but it does not work with Cisco UCS hardware in this context. Do not use it.

    Jumbo frames. The MTU 9000 setting at the vNIC level and the virtual switch level only permits jumbo frames to pass; it does not enforce them. All switching infrastructure between the hosts and the FlashArray interfaces must also support 9000 byte frames. Use the ping test above to verify before attaching the array.

    Storage VLAN design. Use two separate storage VLANs, one for the A side controller interfaces and one for the B side. Storage addresses within each VLAN must be in the same Layer 2 domain as the hosts and cannot be routed.

    Pod quota. Without setting a storage quota on the Pure Storage Pod, Nutanix will not accurately display available storage capacity. Set it immediately after verifying the Realm and Pod are created.


    Summary

    The Nutanix AHV compute only model backed by Pure Storage FlashArray over NVMe/TCP represents a meaningful shift in how converged infrastructure is deployed. It separates the scaling concerns for compute and storage, delivers per VM storage granularity at the array level, and maintains a single management plane through Prism for day to day operations.

    The installation process involves more moving parts than a traditional HCI cluster. Cisco Intersight, Foundation Central, Purity CLI, and Prism Element all play distinct roles, and the sequencing matters. Following the steps in order and confirming each layer before moving to the next is the most reliable path to a successful deployment.

    For questions about this architecture or assistance planning a deployment, reach out at javier.rodriguez@eplus.com.

  • Migrating Virtualized Workloads to Nutanix AHV: A Phased Approach That Works in Production

    Every VMware to Nutanix AHV migration project comes with the same fundamental tension: you want to move workloads to a better platform without disrupting the people who depend on those workloads every day. The good news is that Nutanix Move, when paired with a well defined phased methodology, handles that tension well. This post walks through how we approach these engagements at ePlus, covering the migration mechanics, database specific considerations, and the operational steps that close out each phase cleanly.

    How Nutanix Move Works


    Nutanix Move is a cross hypervisor mobility tool that automates VM migrations from VMware ESXi, Hyper-V, or public cloud sources to Nutanix AHV. The core model is straightforward: Seed, Sync, Cutover.

    1. Discovery: Connect Move to the source environment (vCenter, standalone ESXi, or Hyper-V) and the target Nutanix cluster. Move inventories the VMs and validates compatibility before anything touches production data.
    2. Data Seeding: Move creates a placeholder VM on the AHV side and begins copying virtual disks from the source. This initial seed runs in the background while the source VM stays live.
    3. Changed Block Tracking (CBT): After the initial copy, Move uses CBT to replicate only blocks that have changed since the last sync. This keeps the replication delta small and the eventual cutover window short.

    Why Daytime Replication Is Safe

    A common concern when planning migrations is whether running replication during business hours will hurt production performance. In practice, it generally does not, for four reasons.

    • Non Disruptive Snapshots: Move reads source data through the hypervisor's native snapshot and change tracking mechanisms (VMware snapshots with CBT, for example). The VM stays powered on and users experience no interruption.
    • Network Throttling: Move supports bandwidth throttling on migration traffic so replication does not compete with production traffic on shared links during peak hours.
    • Background Operation: The seeding phase is a background task. End users are isolated from the process because their application is still running on the source hypervisor.
    • Incremental Efficiency: After the initial seed, subsequent syncs only move changed blocks, so the bandwidth consumption of ongoing replication is a fraction of the initial transfer.

    The Cutover Process

    The cutover is the only step that involves any downtime, and even that window is typically measured in minutes per VM. The sequence is deterministic and should be documented in the project plan before any work begins.

    1. Final Sync: Move performs one last incremental sync to capture the most recent changed blocks.
    2. Graceful Shutdown: The source VM is powered off cleanly, not forcefully terminated.
    3. Final Delta: A final incremental pass captures any blocks written during the shutdown sequence.
    4. Activation: Move installs the required VirtIO drivers for AHV, optionally reconfigures IP addressing, and powers the VM on within the Nutanix cluster.

    Practical Note
    For most general purpose VMs, the combined downtime from final sync through power-on in AHV is under five minutes. Database VMs with large in flight transactions may take slightly longer depending on the final delta size.

    Rollback Strategy

    One of the most important things to communicate to stakeholders before a cutover is that rollback is not a complex recovery procedure. It is simply reversing a power state.

    Because Move does not delete or modify the source VM during cutover (it only powers it off and disconnects its network interface), the path back to the original state requires no data restoration. If a migrated VM does not perform as expected on AHV, the steps are:

    1. Power off the VM on the Nutanix AHV side.
    2. Reconnect the network interface on the source VM.
    3. Power on the source VM in the original environment.

    The source disks remain completely untouched throughout the process, so this rollback takes seconds rather than hours. It also means stakeholder sign off on a cutover carries much lower risk than it would in a traditional migration approach.

    Special Migration Scenarios

    Not every VM is a candidate for a straightforward Move migration. A few categories require a different approach:

    • Legacy Operating Systems: Windows Server 2003 and older Linux distributions with out of date kernels are not supported by modern versions of Nutanix Move or by the standard AHV VirtIO driver set. These workloads cannot use the standard Move migration path.
    • Physical Hardware Pass through: VMs with PCI pass through devices or Raw Device Mappings (RDMs) require manual reconfiguration on the target side.
    • Shared Disk Clustering: Certain older Oracle RAC or MSCS configurations that rely on shared SCSI bus emulation need architectural review before migration.

    For these cases, the alternatives range from a manual cold clone, to an application level migration, to a fresh OS installation with data restoration from backup. The right path depends on the workload, and that decision should be made during technical discovery before the project schedule is finalized.


    Database Migration Methodology

    Databases deserve a separate treatment because the consequences of a failed migration, or even a migration that succeeds but lands on a poorly configured target, are higher than for stateless application servers. We cover both Microsoft SQL Server and Oracle here.

    Storage Architecture for Database VMs

    Nutanix gives database workloads two primary storage paths: native vDisks and Nutanix Volume Groups.

    • Native vDisks are the default for AHV VMs and are simple to manage through Prism. Starting with AOS 6.x, the Autonomous Extent Store (AES) improved local sharding for native vDisks, so they are no longer as constrained as they were in earlier releases. That said, a single CVM still serves as the primary I/O path for a given vDisk, which means very high throughput workloads can reach a performance ceiling at the CVM level.
    • Nutanix Volume Groups (VG) are collections of vDisks presented as block devices. For AHV, VGs can be direct attached, appearing as native SCSI devices to the guest OS. When Volume Group Load Balancing (VGLB) is enabled, the system shards vDisks across all CVMs, removing the single CVM I/O path and allowing the database to draw on the aggregate throughput of the entire cluster’s Stargate processes.

    iSER Support: For the highest performance requirements, Nutanix supports iSER (iSCSI Extensions for RDMA), which bypasses the TCP/IP stack entirely to reduce latency and CPU overhead between the guest and the CVM. This is worth evaluating for latency sensitive OLTP workloads.

    AHV Specific Tuning for Databases

    Several AHV configuration decisions have a direct and measurable impact on database performance.

    • vCPU to pCPU Ratio: For production databases, size assuming 1 vCPU equals 1 physical core, not one hyperthreaded thread. Oversubscription introduces CPU Ready Time, which is particularly harmful to latency sensitive query workloads. Target below 5% CPU Ready.
    • Memory Reservations: Reserve 100% of assigned VM memory for SQL Server and Oracle VMs. AHV memory reclamation through ballooning or swapping can cause significant and hard to diagnose latency spikes in database workloads.
    • Huge Pages: AHV uses 2 MB Huge Pages to reduce Translation Lookaside Buffer (TLB) pressure. Ensure the guest OS is configured to use large page allocations to take advantage of this.
    • vNUMA: For VMs larger than a single physical socket, enable vNUMA and match the virtual topology to the physical hardware. This allows the database engine to schedule threads and memory access with NUMA awareness. Disable CPU hot add, as enabling it disables vNUMA and can cause performance degradation of up to 30%.

    AOS Features That Matter for Databases

    • Data Locality: AOS stores a VM’s data on the same physical node where the VM runs. Read I/O is served locally without network traversal, which reduces database read latency materially.
    • AHV Turbo (Frodo I/O Path): Bypasses traditional QEMU emulation with a multi-queue I/O path that scales with the number of vCPUs, delivering higher I/O capacity and lower CPU overhead for storage intensive workloads.
    • Nutanix Blockstore: A block management system that moves device interactions into user space, eliminating context switching and kernel driver overhead for data disks.
    • VGLB for OLAP: Volume Group Load Balancing distributes I/O across all CVMs in the cluster. Critical for high throughput OLAP and reporting workloads that can saturate a single CVM.

    Microsoft SQL Server Migration Options

    There are three viable paths for SQL Server migrations, and the right choice depends on the deployment type and the acceptable downtime window.

    • Nutanix Move: The simplest path. Move handles disk conversion to AHV RAW format, VirtIO driver injection, and IP configuration. Best suited for standalone instances where a brief cutover window is acceptable.
    • Always On Availability Groups: Build a new SQL VM on AHV, join it to the existing Windows Server Failover Cluster (WSFC), and add it as a new secondary AG replica. Once synchronized, perform a planned manual failover to promote the Nutanix based node, then decommission the old nodes. This approach reduces cutover risk for business critical SQL workloads and can achieve near zero application downtime.
    • Backup and Restore: Take a full backup of the source database and restore it on a pre staged SQL VM on AHV using WITH NORECOVERY. During the cutover window, take a tail log backup, restore it using WITH RECOVERY, and redirect applications to the new instance.

    Oracle Migration Options

    • Nutanix Move: Recommended for migrating the Oracle VM as is from vSphere to AHV when the VM itself is in Move’s compatibility matrix. Move handles VirtIO driver injection automatically.
    • RMAN Active Duplication: Use Oracle Recovery Manager to perform an active duplication from the source to a new Oracle VM on AHV. The source database remains online until the final switchover, minimizing the downtime window.
    • Data Guard: Set up a physical standby on the Nutanix cluster, synchronize it via RMAN, and then perform a Data Guard switchover to promote the Nutanix instance to primary. This is the lowest risk option for Oracle databases with strict RPO/RTO requirements.
    • Oracle RAC with Nutanix Volumes: For RAC deployments, Nutanix Volumes provide the shared block storage required by clusterware. Volume Groups should be attached via iSCSI and configured with SCSI-3 Persistent Reservations.

    SQL Server Best Practices on AHV

    These configurations should be treated as baseline for any production SQL Server on Nutanix, whether migrated or newly deployed.

    Storage Layout

    • Use at least four vDisks to distribute data files, log files, TempDB, and the OS independently.
    • Format all data and log volumes with a 64 KB NTFS allocation unit size.
    • Do not use Windows Dynamic Disks or in guest volume managers. Add vDisks directly to the VM instead.
    • Keep OS, SQL binaries, user database data, logs, and TempDB on separate volumes.

    Instance Level Tuning

    • Instant File Initialization (IFI): Grant the SQL Server service account the “Perform Volume Maintenance Tasks” privilege to enable IFI. This eliminates zero initialization overhead during data file creation and auto growth events. IFI applies only to data files (.mdf and .ndf). Log files (.ldf) are always zero initialized regardless of this setting. Starting with SQL Server 2016, IFI can also be enabled directly from the installation wizard.
    • Lock Pages in Memory (LPIM): Enable LPIM to prevent Windows from paging the SQL Server buffer pool to disk. Max Server Memory must be set correctly before enabling LPIM to avoid starving the guest OS.
    • Max Server Memory: For mid to large VMs, leave 6 to 8 GB for the OS. For VMs under 32 GB of RAM, 4 GB is often sufficient. A practical formula: reserve 10% of total RAM for the OS, with a ceiling of around 8 GB unless SSIS or SSRS also run on the same VM.
    • MAXDOP: Set MAXDOP to the number of logical cores within a single vNUMA node. For SQL Server 2016 and later, the updated guidance is to use either 8 or the number of cores per NUMA node, whichever is smaller.
    • Cost Threshold for Parallelism (CTFP): Increase from the default of 5. A value of 50 is the usual baseline for OLTP workloads. Hybrid environments sometimes use a value in the 25 to 50 range.
    • TempDB: Match the number of data files to the logical processor count when that count is 8 or fewer. Start at 8 data files when the logical processor count exceeds 8. Only increase beyond 8 (in increments of 4) if PAGELATCH_UP or PAGELATCH_SH waits confirm actual contention.
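
    The sizing rules in the list above reduce to a few lines of arithmetic, sketched here in shell so the numbers can be worked out before the VM is built. The function names are hypothetical, the results are starting points rather than tuned values, and the 4 GB floor on the OS reserve is this sketch's reading of the "4 GB is often sufficient" guidance for smaller VMs. The SSIS and SSRS exception is not modeled.

```shell
#!/bin/sh
# os_reserve_gb TOTAL_GB
# 10% of VM RAM for the OS (integer GB), kept between 4 GB and 8 GB.
os_reserve_gb() (
    r=$(($1 / 10))
    if [ "$r" -lt 4 ]; then r=4; fi
    if [ "$r" -gt 8 ]; then r=8; fi
    echo "$r"
)

# max_server_memory_mb TOTAL_GB
# Candidate value, in MB, for SQL Server's "max server memory (MB)" option.
max_server_memory_mb() (
    reserve=$(os_reserve_gb "$1")
    echo $(( ($1 - reserve) * 1024 ))
)

# maxdop CORES_PER_NUMA_NODE: the smaller of 8 and the cores per NUMA node.
maxdop() {
    if [ "$1" -lt 8 ]; then echo "$1"; else echo 8; fi
}

# tempdb_files LOGICAL_CPUS: one data file per logical processor, capped at 8
# as the starting point. Going beyond 8 is a decision based on observed waits.
tempdb_files() {
    if [ "$1" -lt 8 ]; then echo "$1"; else echo 8; fi
}

# Hypothetical VM: 64 GB RAM, 12 cores per vNUMA node, 24 logical processors.
echo "max server memory (MB): $(max_server_memory_mb 64)"
echo "MAXDOP:                 $(maxdop 12)"
echo "TempDB data files:      $(tempdb_files 24)"
```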

    SQL Server Baseline Configuration Summary

    Setting           | Recommended Baseline | Reason
    IFI               | Enabled | Eliminates zero initialization overhead for data files during creation and auto growth.
    LPIM              | Enabled | Prevents Windows from reclaiming the SQL Server buffer pool. Requires Max Server Memory to be set first.
    Max Server Memory | Total RAM minus 4 to 8 GB (or 10% of total RAM) | Prevents SQL Server from starving the guest OS.
    MAXDOP            | 8 or cores per NUMA node, whichever is smaller | Keeps parallel query execution within a single NUMA domain.
    CTFP              | 50 (or 25 to 50 for hybrid workloads) | Prevents low cost queries from triggering parallelism on modern multi core hardware.
    TempDB            | Match logical processor count up to 8; increase by 4 only when contention is confirmed | Reduces allocation contention. All files must be equally sized with identical growth settings.

    Oracle Best Practices on AHV

    Oracle on Nutanix AHV benefits from the same platform level advantages as any other workload, but the database engine has enough specific tuning requirements that it warrants its own treatment.

    Memory Allocation: SGA and PGA

    Reserve approximately 10 percent of the total VM memory for the guest OS and file cache. Of the remaining 90 percent, allocate 80 percent to the System Global Area (SGA) and the remaining 20 percent to the Program Global Area (PGA). Memory reservations should be set to 100% of the assigned VM memory. Memory overcommit is not recommended for Oracle workloads.
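
    Worked through, the split above comes to roughly 10 percent of VM memory for the OS, 72 percent for the SGA, and 18 percent for the PGA. The sketch below does that arithmetic in whole megabytes. The function name oracle_memory_split and the 64 GB example VM are hypothetical, and integer division rounds each figure down.

```shell
#!/bin/sh
# oracle_memory_split TOTAL_MB
# Prints "os=<MB> sga=<MB> pga=<MB>" using integer arithmetic:
# 10% for the guest OS, then 80% / 20% of the remainder for SGA / PGA.
oracle_memory_split() (
    total=$1
    os=$((total * 10 / 100))
    db=$((total - os))
    sga=$((db * 80 / 100))
    pga=$((db - sga))
    echo "os=$os sga=$sga pga=$pga"
)

# A hypothetical 64 GB VM, expressed in MB.
oracle_memory_split 65536
```

    Treat the output as a starting allocation. The three values always sum to the VM's assigned memory, which is also the figure to use for the 100% memory reservation.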

    Storage Layout and Disk Groups

    Nutanix Database Service (NDB) provisions multiple vDisks spread across ASM disk groups to maximize throughput across the Distributed Storage Fabric. The two primary disk groups are DATADG for database data files and RECODG for redo logs and archive files. For Oracle RAC, a third disk group CRSDG is required for Grid Infrastructure and clusterware files.

    Disk Group         | Small or Medium (500 GB and under) | Large (501 GB and above)
    CRSDG (RAC only)   | 3 vDisks                           | 3 vDisks
    DATADG             | 4 vDisks                           | 8 vDisks
    RECODG             | 2 vDisks                           | 4 vDisks
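
    The table can be expressed as a simple lookup, which is handy when sizing several databases at once. This is an illustrative sketch only: asm_vdisks is a hypothetical helper that mirrors the table and does not call NDB.

```shell
#!/bin/sh
# asm_vdisks DB_SIZE_GB [rac]
# Prints the vDisk count per ASM disk group from the table above.
# Pass "rac" as the second argument to include the CRSDG disk group.
asm_vdisks() (
    size=$1
    mode=${2:-single}
    if [ "$size" -le 500 ]; then
        data=4
        reco=2
    else
        data=8
        reco=4
    fi
    if [ "$mode" = "rac" ]; then
        echo "CRSDG=3 DATADG=$data RECODG=$reco"
    else
        echo "DATADG=$data RECODG=$reco"
    fi
)

asm_vdisks 300        # small single instance database
asm_vdisks 2048 rac   # large RAC database
```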

    ASM Configuration Options

    Nutanix supports ASMFD (ASM Filter Driver), ASMLib, and udev rules for ASM disk mappings. ASMFD is the preferred method on modern Linux distributions. All ASM disks should be placed on vDisks in an AOS storage container with inline compression enabled and deduplication disabled.

    Network Design for Oracle RAC

    Oracle RAC requires two networks on separate VLANs: a public network for client connections and a private interconnect for cache fusion. Mixing them on the same VLAN introduces the risk of cache fusion traffic competing with client traffic. When using NDB to provision Oracle RAC, NDB manages IP address assignment across public, private, and virtual (SCAN and VIP) network types.

    RAC and Nutanix Volumes: Oracle RAC requires shared storage for the CRSDG disk group. On AHV, this is provided through Nutanix Volume Groups attached via iSCSI with SCSI-3 Persistent Reservations enabled. This is a prerequisite for RAC clusterware to function correctly.

    Oracle Patching with NDB

    NDB uses an out of place patching model for Oracle. Rather than patching a running Oracle home directly, the process involves provisioning a new database VM from an existing software profile, manually applying the patch set to that VM, and then creating a new software profile version from the patched VM. Once published, that version becomes available to all Oracle VMs managed by NDB. Patching can be performed in either a rolling or non rolling fashion for Oracle RAC environments.

    Time Machine Backup and Recovery for Oracle

    NDB Time Machine creates application consistent snapshots of Oracle databases along with copies of transaction log files. An SLA attached to the time machine controls snapshot frequency and retention. Point in time recovery is available as long as both a base snapshot and the covering transaction logs exist for the target timestamp. NDB restores the vDisks from the appropriate snapshot and then applies log files forward to bring the database to a consistent state.

    Decommissioning Protocol

    The migration is not complete when the VM powers on successfully on AHV. A structured decommissioning process ensures the legacy environment is cleaned up safely.

    Step | Action | Owner
    1    | Source VMs remain powered off with NIC disconnected for a 48 to 72 hour burn in period to prevent IP conflicts. | Infrastructure Team
    2    | Confirm with Application Owners that performance and stability on AHV are acceptable after the burn in period. | Project Lead
    3    | Archive a final backup of the source VM according to the organization’s retention policy before deletion. | Backup Admin
    4    | Remove the VM from the source cluster inventory. | Infrastructure Team
    5    | Update the CMDB or asset tracker to reflect the VM’s new hypervisor and decommission the legacy record. | IT Operations

    Technical Discovery Requirements

    The quality of the discovery work done before migration determines how smoothly everything else goes. At a minimum, the following information should be gathered before any migration plan is finalized.

    General Infrastructure

    • Specific vSphere version and ESXi build number in use on source hosts
    • Networking configuration: LACP, Jumbo Frames (MTU 9000), or standard configuration
    • IP retention requirement: retain existing IPs after migration or assign new IPs on AHV
    • Guest OS list with versions and BIOS/UEFI boot mode for each VM in scope

    SQL Server Environments

    • SQL Server versions and editions (Standard vs. Enterprise) deployed
    • Deployment type: Standalone, Failover Cluster Instance (FCI), or Always On AG
    • Current vCPU to physical core allocation and whether LPIM is already configured
    • vDisk layout per VM: number of disks, purpose (Data, Log, TempDB), and whether any single large data files exist that should be split
    • Dependencies on MSDTC, Linked Servers, or SQL Agent Jobs that require documentation before cutover

    Oracle Environments

    • Oracle versions in scope and whether instances are Single Instance or RAC
    • Shared storage configuration for RAC: ASM with ASMLib, ASMFD, or udev rules
    • Huge Pages configuration status in the guest OS
    • Existing RMAN backup workflows or Data Guard standbys that can be leveraged
    • Source platform architecture: if any workloads currently run on AIX or Solaris (SPARC), be aware that Nutanix Move is strictly an x86-to-x86 tool and cannot be used for these migrations. AIX and Solaris on SPARC are Big-Endian, while Nutanix AHV runs exclusively on x86-64 (Little-Endian). Cross-endian migrations require a fully manual path using RMAN CONVERT for Oracle or an application level export and restore, and should be scoped separately from the rest of the Move migration plan.

    Migration Constraints

    • Maximum acceptable maintenance window for final cutover
    • Average daily change rate for production databases (drives seeding bandwidth planning)
    • Top 10 application functions or queries to validate Day 1 performance after migration
    • Total allocated versus used storage per database environment, plus expected annual growth
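    To show how the used capacity and daily change rate feed into bandwidth planning, here is a rough back-of-the-envelope sketch. The function names, the 70% link efficiency, and the sample figures are all assumptions for illustration; real seeding throughput depends on the source hosts, the network, and the migration tool.

```python
def transfer_hours(data_gib, link_mbps, efficiency=0.7):
    """Hours needed to copy data_gib GiB over a link rated at link_mbps megabits/s.

    efficiency is an assumed fraction of the nominal link rate that is usable.
    """
    bits = data_gib * 1024**3 * 8
    usable_bits_per_second = link_mbps * 1_000_000 * efficiency
    return bits / usable_bits_per_second / 3600


def sustained_mbps_for_change(change_gib_per_day):
    """Average megabits/s needed just to keep pace with the daily change rate."""
    bits_per_day = change_gib_per_day * 1024**3 * 8
    return bits_per_day / 86_400 / 1_000_000


if __name__ == "__main__":
    # Hypothetical environment: 2 TiB used, 100 GiB/day of change, 1 Gbps link.
    print(f"Initial seed: {transfer_hours(2048, 1000):.1f} hours")
    print(f"Keep-up rate: {sustained_mbps_for_change(100):.1f} Mbps")
```

    If the keep-up rate approaches the usable link rate, the delta syncs may never catch up, which is a signal to revisit the cutover window or the network path before finalizing the plan.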
  • Designing VMware Cloud Foundation 9.1: The 31 Decisions You Need to Make

    Every VCF deployment starts the same way: someone hands you a blank whiteboard and says design it. The problem is that VCF 9.1 is a broad platform, and without a structured approach it is easy to make decisions out of order, miss dependencies, or find out three phases in that an early choice locked you into something you did not intend.

    Broadcom organizes the VCF 9.1 design process into nine phases covering 31 distinct decisions. This post walks through each phase, what the decisions are, and why they matter in practice. If you are using the VCF Designer tool, this maps directly to the decision schema it uses.

    Phase 1: Starting Point and Profile

    Before touching any configuration, you need two things nailed down: the design blueprint and the scope.

    The Design Blueprint is your baseline deployment profile. Broadcom defines several: single site minimal, single site, multi-site single region, multi-region, and others covering application and security modernization. This is not a technical decision as much as it is a business one. It defines the complexity ceiling for everything that follows.

    Scope and Use Cases is where you gate the rest of the design. VCF 9.1 can cover private cloud IaaS, Kubernetes via Supervisor, Private AI Foundation, vDefend lateral security, VCF Edge, and disaster recovery. What you check here enables or disables options in later phases. Do not mark something in scope unless there is a real requirement behind it.
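    The gating effect of scope can be pictured as a simple filter. The sketch below is not how any Broadcom tool is implemented; it only encodes the conditional decisions this post names (vDefend, VCF Edge, and Private AI Foundation), and the dictionary and function names are made up for the example.

```python
# Decisions that only apply when the matching Phase 1 scope item is selected.
# Only the conditions mentioned in this post are listed; everything else is
# treated as always applicable (an assumption of this sketch).
SCOPE_GATED = {
    "Lateral Security with vDefend": "vDefend lateral security",
    "VCF Edge Model": "VCF Edge",
    "Private AI Foundation Platform Model": "Private AI Foundation",
    "Private AI Foundation Compute Model": "Private AI Foundation",
}


def applicable_decisions(decisions, scope):
    """Keep the decisions that are ungated or whose scope item is in scope."""
    result = []
    for decision in decisions:
        required = SCOPE_GATED.get(decision)
        if required is None or required in scope:
            result.append(decision)
    return result


if __name__ == "__main__":
    in_scope = {"private cloud IaaS", "Private AI Foundation"}
    todo = applicable_decisions(
        ["Storage Model", "Lateral Security with vDefend",
         "VCF Edge Model", "Private AI Foundation Compute Model"],
        in_scope,
    )
    print(todo)
```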

    Phase 2: Fleet-Level Decisions

    The VCF Fleet Deployment Model defines how the fleet is laid out. A single VCF instance is the most common for customers starting out or running a standalone private cloud. A connected fleet with multiple instances comes into play when you have multiple sites or organizational boundaries that require separate management planes.

    The VCF Fleet Sizing Model covers appliance sizing: Small, Medium, HA Medium, Large, and HA Large. Sizing here is not about your workload VMs. It is about the management plane itself. Undersizing the fleet appliances is one of the most common mistakes in early VCF deployments.

    Phase 3: Consumption Decisions

    This phase covers how cloud consumers interact with the platform. Five decisions, and they are tightly interconnected.

    The VCF Automation Model decides whether VCF Automation is deployed and in what topology. If your organization needs self-service provisioning or catalog-driven deployments, you need this. If not, skip it. Running it just because it is available adds operational overhead without benefit.

    The Network Consumption Model is one of the most consequential decisions in the entire design. VLAN, NSX Overlay Segments, VPC, or Transit Gateway. This drives downstream decisions on edge clusters, load balancers, and how workloads connect. Get this wrong and you are rearchitecting the network mid-project.

    Workload Connectivity and Load Balancer Model follow from the network consumption choice. For load balancing, NSX Native covers most use cases. Avi (VCF Advanced LB) is needed when you require full L7 with advanced policies, SSL offload, or WAF capabilities.

    Phase 4: Operations Decisions

    Six decisions covering management services, management networking, operations tooling, logging, network observability, and recovery.

    The VCF Management Services Model defines availability for SDDC Manager, vCenter, and NSX Manager. Standard vs. Highly Available. For production environments, the answer is almost always HA. The cost of an HA management plane is small compared to the cost of a failed SDDC Manager during a critical operation.

    The VCF Management Network Model determines whether management components share a VLAN, use isolated VLANs per component, or run on NSX segments. NSX segments require NSX to be up before management components can communicate, which creates a chicken-and-egg risk during recovery scenarios. Plan this carefully.

    The VCF Recovery Option aligns to your RPO and RTO requirements. Backup and restore, component-level recovery, and instance-level recovery each have different complexity and cost profiles. Define your recovery requirements before choosing this, not after.

    Phase 5: Security and Compliance

    Identity Broker and SSO decisions define how users authenticate to VCF components. Most enterprise environments will federate to Active Directory or an external IdP. Plan this early since it affects every component that needs authentication.

    vDefend Lateral Security only applies if it was included in scope in Phase 1. If deployed, the Security Services Platform adds distributed IDS/IPS and east-west traffic inspection.

    Phase 6: Virtual Infrastructure

    Seven decisions covering domains, clusters, networking, and storage. This is where the design gets concrete.

    The VCF Domain Model defines your management and workload domain topology. Single-AZ with one management plus one workload domain is the most common starting point. Stretched (multi-AZ) adds complexity but is required for metro HA.

    The Storage Model is one of the decisions with the most downstream impact. VCF 9.1 supports vSAN OSA, vSAN ESA, NFS, VMFS on Fibre Channel, iSCSI, and NVMe variants. vSAN ESA is the recommended path for new deployments using compatible hardware. If you are connecting to an existing SAN or NAS, the external storage options apply.

    NSX Manager topology and NSX Edge Cluster decisions define the control plane and data plane for your overlay network. Edge cluster sizing depends on the volume and type of north-south traffic. A shared NSX Manager cluster across domains reduces overhead. Dedicated per domain gives you blast radius isolation.

    Phase 7: Physical Infrastructure

    One decision: the Network Fabric Model. Routed VLAN fabric, Leaf-Spine VXLAN underlay, or EVPN-VXLAN fabric. This needs to be made in coordination with the network team. The fabric model affects how VLANs are extended across the environment and how the NSX overlay integrates with the underlay. EVPN-VXLAN provides the most flexibility for multi-site and stretched cluster scenarios.

    Phase 8: Optional Workload Capabilities

    VCF Edge and Private AI Foundation, both conditional on Phase 1 scope. For VCF Edge, single-host is suitable for small remote sites where HA is not required. Three-host provides local HA at the edge.

    For Private AI Foundation, the compute model selection depends heavily on the type of workloads. Training workloads typically want full GPU passthrough or MIG. Inference workloads can often share via vGPU.

    Phase 9: Closeout

    Two workflow tasks, not configuration decisions. First, reconcile every decision made in Phases 1 through 8 against the Broadcom VCF Design Library to confirm alignment with supported patterns. Second, translate the finalized design into the VCF Planning and Preparation Workbook, which is the actual input consumed by the VCF Installer during bring-up. A clean design that does not translate into a properly completed workbook will cause bring-up failures. Budget time for this step.

    The Full Decision Index

    Step | Phase | Decision
    1 | Phase 1 | Design Blueprint
    2 | Phase 1 | Scope and Use Cases
    3 | Phase 2 | VCF Fleet Deployment Model
    4 | Phase 2 | VCF Fleet Sizing Model
    5 | Phase 3 | VCF Automation Model
    6 | Phase 3 | vSphere Supervisor Model
    7 | Phase 3 | Network Consumption Model
    8 | Phase 3 | Workload Connectivity Model
    9 | Phase 3 | Load Balancer Model
    10 | Phase 4 | VCF Management Services Model
    11 | Phase 4 | VCF Management Network Model
    12 | Phase 4 | VCF Operations Model
    13 | Phase 4 | Log Management Model
    14 | Phase 4 | VCF Operations for Networks Model
    15 | Phase 4 | VCF Recovery Option
    16 | Phase 5 | Identity Broker Model
    17 | Phase 5 | VCF Single Sign-On Model
    18 | Phase 5 | Lateral Security with vDefend
    19 | Phase 6 | VCF Domain Model
    20 | Phase 6 | vSphere Cluster Model
    21 | Phase 6 | Distributed Switch Model
    22 | Phase 6 | Storage Model
    23 | Phase 6 | NSX Manager and Control Plane Model
    24 | Phase 6 | NSX Edge Cluster Model
    25 | Phase 6 | Virtual Network Appliance Cluster Model
    26 | Phase 7 | Network Fabric Model
    27 | Phase 8 | VCF Edge Model
    28 | Phase 8 | Private AI Foundation Platform Model
    29 | Phase 8 | Private AI Foundation Compute Model
    30 | Phase 9 | Reconcile Against Broadcom Design Library
    31 | Phase 9 | Produce the Planning and Preparation Workbook
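    If you want to track the index in your own tooling, it fits in a small data structure. The sketch below simply restates the table; the names `DECISIONS` and `numbered_index` are my own and are not part of the VCF Designer schema.

```python
# Phase number -> decisions in that phase, copied from the index above.
DECISIONS = {
    1: ["Design Blueprint", "Scope and Use Cases"],
    2: ["VCF Fleet Deployment Model", "VCF Fleet Sizing Model"],
    3: ["VCF Automation Model", "vSphere Supervisor Model",
        "Network Consumption Model", "Workload Connectivity Model",
        "Load Balancer Model"],
    4: ["VCF Management Services Model", "VCF Management Network Model",
        "VCF Operations Model", "Log Management Model",
        "VCF Operations for Networks Model", "VCF Recovery Option"],
    5: ["Identity Broker Model", "VCF Single Sign-On Model",
        "Lateral Security with vDefend"],
    6: ["VCF Domain Model", "vSphere Cluster Model", "Distributed Switch Model",
        "Storage Model", "NSX Manager and Control Plane Model",
        "NSX Edge Cluster Model", "Virtual Network Appliance Cluster Model"],
    7: ["Network Fabric Model"],
    8: ["VCF Edge Model", "Private AI Foundation Platform Model",
        "Private AI Foundation Compute Model"],
    9: ["Reconcile Against Broadcom Design Library",
        "Produce the Planning and Preparation Workbook"],
}


def numbered_index(decisions):
    """Flatten the per-phase lists into (step, phase, decision) rows."""
    rows = []
    step = 0
    for phase in sorted(decisions):
        for name in decisions[phase]:
            step += 1
            rows.append((step, phase, name))
    return rows


if __name__ == "__main__":
    for step, phase, name in numbered_index(DECISIONS):
        print(f"{step:>2}  Phase {phase}  {name}")
```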
  • Insights from a Nutanix Migration Specialist

    My work life as an IT specialist has always been quite varied.

    I spent part of my time installing traditional datacenter infrastructure, some of my time implementing cybersecurity solutions, and bits and pieces here and there, working on projects with a number of different technology vendors.

    But over the past 18 months, my main focus has been migrating customers’ virtualization environments to Nutanix.

    The timing lines up with some big shakeups in the tech industry, as well as the continued growth of hyperconverged infrastructure (HCI). I heard my customers worry that support quality would decline for their existing environments, or that innovation might stall. In reality, what my customers have mostly seen is severe sticker shock on their renewal bills—partly due to inflation that has hit all sectors, but also due to dramatic changes to vendor licensing agreements. 

    Some customers have seen 3x, 5x, or even 10x increases in their virtualization costs, practically overnight. These are customers that have been with a vendor for 15 or 20 years, in many cases, and many had come to view their virtualization environments as something of a commodity with a stable pricing structure. But changes to licensing agreements have upended this stability. Before, customers could mostly purchase individual product licenses as needed, but they’re now being funneled into bundled packages with add-on features they don’t want and can’t use.

    Some large enterprises are able to absorb these new costs. But for others—especially small and medium-sized companies—the impact to their business is comparable to tripling their rent, or adding a zero to their monthly utility bills. These smaller customers also find themselves in a poor negotiating position with tech giants. 

    For example, we recently worked with Norfolk Public Schools in Virginia to migrate to Nutanix. The district was facing an eye-popping 680% cost increase if it stayed with its previous provider, but a five-year licensing agreement saved it approximately $2 million.

    For customers like Norfolk Public Schools, the numbers of the new virtualization landscape simply don’t add up. And for the first time, many of these organizations are willing to seriously consider a change.

    Even non-technical people can understand the anxiety that comes with switching technology platforms. (Think of how rarely people change to a phone with a different operating system.) Most of my customers never even considered switching from their existing virtualization provider until recently. After all, virtualization is a foundational technology that supports their entire business. Many system administrators have built their careers and expertise around the environment they know, developed their own workflows around its interface and capabilities, and integrated their entire application environment with that platform.

    Most importantly, businesses have come to rely on the stability of their virtualization environments to keep their mission-critical systems up and running. So, it’s understandable why many approach a change with a degree of trepidation. They want to know whether their applications will work the same way, how much downtime to expect, and whether their teams will need extensive retraining.

    However, once customers make the move, they tend to find that Nutanix infrastructure provides everything they need, often in a more intuitive way, and at a cost that essentially amounts to what they were paying before the market shifts of the past couple of years. During the pre-sales process, I sit with customers to walk them through the Nutanix interface. We spend much of this time exploring the equivalent functionality between the platforms, which is often mostly a matter of learning new terminology for familiar features.

    At Norfolk Public Schools, we conducted site assessments, installed and configured new hardware, configured the Nutanix platform, and migrated more than 400 virtual machines—all in just over a month. The cutover to the new operating environment was seamless, and the district saw immediate improvements in performance and reliability.

    For most organizations, the migration is just as painless. Some clients prefer to migrate in small batches of just a few virtual machines, while others are ready to move hundreds of virtual machines over a single weekend. The actual cutover process for each virtual machine takes only about five to ten minutes—comparable to the standard maintenance window for most security patches. Post-migration, customers typically notice improved performance (mostly due to new hardware). In addition to the cost savings, many also cite Nutanix’s simplified disaster recovery capabilities as a major benefit of the move.

    After we start the migration, I can see the anxiety on my customers’ faces melt away, replaced by relief. Recently, one even started laughing. “This is so amazing!” he kept repeating. “This is so easy!”

  • Cloning Linux: A Step-by-Step Guide to Booting from iSCSI LUN

    In this comprehensive guide, we demystify the process of cloning a Linux operating system (Ubuntu) and guide you through the intricacies of booting directly from an iSCSI LUN. We’ll walk you through the entire process, from selecting the right tools for cloning to configuring your system for iSCSI boot. Whether you’re a seasoned Linux administrator or a curious enthusiast, this step-by-step guide is tailored to empower you with the knowledge and skills needed to successfully clone and boot Linux from an iSCSI LUN.

    Let’s begin with a summary of the technology prerequisites for accomplishing this task. Firstly, you’ll require a Linux box, whether physical or virtual. It’s essential to note that the method I propose involves system downtime, so scheduling a maintenance window is advisable, particularly if your system is in production. As part of this approach, a connection between the source system and the target Volume/LUN is crucial. I’ll explore the concept of cloning to a file from the source and transporting it to the target side in a future post.

    Lastly, a target system capable of providing the iSCSI volume is indispensable for the successful execution of this process. Keep these key components in mind as we delve into the steps for cloning a Linux OS and booting from an iSCSI LUN in our detailed guide.

    If you want to try the cloning in a lab, you’ll need three things:

    1. Linux Box: I will be using Ubuntu, you can download Ubuntu here: https://ubuntu.com/
    2. If you want to boot a Virtual Machine (VM) from iSCSI you will need iPXE: https://ipxe.org/download
    3. For the iSCSI server I used the Nutanix Community Edition: https://next.nutanix.com/discussion-forum-14/download-community-edition-38417 (you’ll need a Nutanix Next Community login)

    Here we go. On your Linux box, gather a few nuggets of information by executing these commands:

    1. Become a super user with: sudo su
    2. List your disk drives with: fdisk -l
    3. Verify the boot device with: df -h, cat /etc/fstab, and blkid
    lab@lab-vm:~$ sudo su
    [sudo] password for lab:*****

    root@lab-vm:/home/lab# fdisk -l
    ...
    Disk /dev/sda: 20 GiB, 21474836480 bytes, 41943040 sectors
    Disk model: Virtual disk
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disklabel type: gpt
    Disk identifier: 2F122466-CF57-4DAB-A441-276FFFFE87BD
    ...
    Device Start End Sectors Size Type
    /dev/sda1 2048 4095 2048 1M BIOS boot
    /dev/sda2 4096 41940991 41936896 20G Linux filesystem

    root@lab-vm:/home/lab# df -h
    Filesystem Size Used Avail Use% Mounted on
    tmpfs 391M 1.2M 390M 1% /run
    /dev/sda2 20G 6.1G 13G 33% /
    tmpfs 2.0G 0 2.0G 0% /dev/shm
    tmpfs 5.0M 0 5.0M 0% /run/lock
    tmpfs 391M 4.0K 391M 1% /run/user/1000
    root@lab-vm:/home/lab#

    root@lab-vm:/home/lab# cat /etc/fstab
    # /etc/fstab: static file system information.
    #
    # Use 'blkid' to print the universally unique identifier for a
    # device; this may be used with UUID= as a more robust way to name devices
    # that works even if disks are added and removed. See fstab(5).
    #
    # <file system> <mount point> <type> <options> <dump> <pass>
    # / was on /dev/sda2 during curtin installation
    /dev/disk/by-uuid/91cf8b5a-2c4d-49c4-bcb5-57b59339a2c0 / ext4 defaults 0 1
    /swap.img none swap sw 0 0

    root@lab-vm:/home/lab# blkid
    /dev/sda2: UUID="91cf8b5a-2c4d-49c4-bcb5-57b59339a2c0" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="853d57ea-8b35-4dc7-bb28-813d1a2e4769"
    ...
    /dev/sda1: PARTUUID="3d8c878b-7431-4732-acd4-ba2a21f5458a"
    root@lab-vm:/home/lab#
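    The same cross-check between /etc/fstab and blkid can be scripted. The following is a minimal sketch that works on the text captured above rather than on a live system; `parse_blkid` and `root_device` are hypothetical helper names, and only the by-uuid and UUID= forms of the fstab entry are handled.

```python
import re

FSTAB_TEXT = """\
# / was on /dev/sda2 during curtin installation
/dev/disk/by-uuid/91cf8b5a-2c4d-49c4-bcb5-57b59339a2c0 / ext4 defaults 0 1
/swap.img none swap sw 0 0
"""

BLKID_TEXT = """\
/dev/sda2: UUID="91cf8b5a-2c4d-49c4-bcb5-57b59339a2c0" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="853d57ea-8b35-4dc7-bb28-813d1a2e4769"
/dev/sda1: PARTUUID="3d8c878b-7431-4732-acd4-ba2a21f5458a"
"""


def parse_blkid(text):
    """Turn blkid output into {device: {TAG: value}}."""
    devices = {}
    for line in text.splitlines():
        device, sep, rest = line.partition(": ")
        if sep:
            devices[device] = dict(re.findall(r'(\w+)="([^"]*)"', rest))
    return devices


def root_device(fstab_text, blkid_text):
    """Return the block device that fstab mounts on /, or None if not found."""
    for line in fstab_text.splitlines():
        fields = line.split()
        if line.lstrip().startswith("#") or len(fields) < 2 or fields[1] != "/":
            continue
        spec = fields[0]
        if spec.startswith("/dev/disk/by-uuid/"):
            uuid = spec.rsplit("/", 1)[1]
        elif spec.startswith("UUID="):
            uuid = spec[len("UUID="):]
        else:
            return spec  # already a plain device path
        for device, tags in parse_blkid(blkid_text).items():
            if tags.get("UUID") == uuid:
                return device
    return None


if __name__ == "__main__":
    print(root_device(FSTAB_TEXT, BLKID_TEXT))
```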

    Based on the results of the earlier commands, we’ve identified that our system is installed directly on /dev/sda. With this understanding, let’s proceed to boot Linux from a Live Ubuntu ISO Image and open a Terminal window. See the following slideshow of the process:

    While in the live Ubuntu session, you can enable SSH to make everything easier. From the terminal, execute the following commands:

    1. sudo su
    2. apt install openssh-server -y
    3. systemctl enable ssh
    4. ufw allow ssh

    I already have my Nutanix CE deployed and an iSCSI LUN configured. The discovery IP address in my case is 192.168.1.51 and the name of the iSCSI LUN is iqn.2010-06.com.nutanix:lab-boot-lun-ee392c61-6958-4be2-88fc-636bed265e09. To configure the live Ubuntu session to access the LUN, execute these commands (while logged in as root using ‘sudo su’):

    • apt install open-iscsi -y
    • apt install multipath-tools -y
    • service multipath-tools start
    • iscsiadm -m discovery -t sendtargets -p 192.168.1.51
    • iscsiadm -m node --op=update -n node.conn[0].startup -v automatic
    • iscsiadm -m node --op=update -n node.startup -v automatic
    • systemctl enable open-iscsi
    • systemctl enable iscsid
    • systemctl restart iscsid.service
    • iscsiadm -m node --loginall=automatic
    • iscsiadm -m session -o show

    The output of the last command should show something like this:

    root@ubuntu:/home/ubuntu# iscsiadm -m session -o show
    tcp: [1] 192.168.1.51:3260,1 iqn.2010-06.com.nutanix:lab-boot-lun-ee392c61-6958-4be2-88fc-636bed265e09-tgt0 (non-flash)
    root@ubuntu:/home/ubuntu#
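    If you repeat this often, the iscsiadm sequence can be wrapped in a short script. The sketch below only assembles the same iscsiadm invocations listed above and prints them by default; the function names and the dry-run switch are my own additions, and actually running the commands requires root and the open-iscsi package.

```python
import shlex
import subprocess


def iscsi_login_commands(portal_ip):
    """Return the iscsiadm invocations from this post as argument lists."""
    return [
        ["iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", portal_ip],
        ["iscsiadm", "-m", "node", "--op=update",
         "-n", "node.conn[0].startup", "-v", "automatic"],
        ["iscsiadm", "-m", "node", "--op=update",
         "-n", "node.startup", "-v", "automatic"],
        ["iscsiadm", "-m", "node", "--loginall=automatic"],
        ["iscsiadm", "-m", "session", "-o", "show"],
    ]


def run_all(commands, dry_run=True):
    """Print each command, and execute it only when dry_run is False."""
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            # check=True stops at the first failing command.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all(iscsi_login_commands("192.168.1.51"))
```

    Keeping the arguments as lists avoids shell quoting problems with values such as node.conn[0].startup.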

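    In my case, the original drive is /dev/sda and the new iSCSI LUN is /dev/sdb. Double-check the device names on your own system before continuing, because dd overwrites the target without asking. To start the cloning, execute the following command: dd if=/dev/sda of=/dev/sdb bs=32M status=progress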
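    dd reports progress but does not verify the copy. One optional way to check the clone is to hash the same number of bytes from both devices and compare the results. This is a sketch under the assumption that you pass the source disk size as the length (the target LUN may be larger than the source); `sha256_prefix` is a hypothetical helper, and reading the devices requires root.

```python
import hashlib


def sha256_prefix(path, length, chunk_size=4 * 1024 * 1024):
    """SHA-256 of the first `length` bytes of a file or block device."""
    digest = hashlib.sha256()
    remaining = length
    with open(path, "rb") as handle:
        while remaining > 0:
            chunk = handle.read(min(chunk_size, remaining))
            if not chunk:
                break  # reached the end before `length` bytes were read
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()


def clone_matches(source, target, length):
    """True when both paths hold identical data in their first `length` bytes."""
    return sha256_prefix(source, length) == sha256_prefix(target, length)


if __name__ == "__main__":
    # 21474836480 is the size of /dev/sda reported by fdisk earlier in this post.
    print(clone_matches("/dev/sda", "/dev/sdb", 21474836480))
```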

    The next step must be done before the reboot. It will configure the system to boot from the new iSCSI Lun.

    • mount /dev/sdb2 /mnt
    • mount --bind /dev /mnt/dev
    • mount --bind /sys /mnt/sys
    • chroot /mnt
    • mount -t proc none /proc
    • hostname -F /etc/hostname
    • echo "nameserver 8.8.8.8" >> /etc/resolv.conf
    • apt-get install initramfs-tools -y
    • apt-get install open-iscsi -y
    • echo "iscsi" >> /etc/initramfs-tools/modules
    • touch /etc/iscsi/iscsi.initramfs
    • update-initramfs -u
    • Edit /etc/default/grub:
      • Replace:
        • GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
      • With:
        • GRUB_CMDLINE_LINUX_DEFAULT="quiet splash ip=dhcp ISCSI_INITIATOR=iqn.2004-10.com.ubuntu:01:a3ea501f8a8 ISCSI_TARGET_NAME=iqn.2010-06.com.nutanix:lab-boot-lun-ee392c61-6958-4be2-88fc-636bed265e09 ISCSI_TARGET_IP=192.168.1.51 ISCSI_TARGET_PORT=3260"
    • update-grub
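    The GRUB line is long and easy to mistype, so it can help to generate it. The sketch below only assembles the same kernel parameters used in this post from their parts; `iscsi_grub_line` is a hypothetical helper name, and the values shown are the ones from my lab.

```python
def iscsi_grub_line(initiator, target_name, target_ip,
                    target_port=3260, base_args="quiet splash"):
    """Build the GRUB_CMDLINE_LINUX_DEFAULT line with the iSCSI boot parameters."""
    args = [
        base_args,
        "ip=dhcp",
        f"ISCSI_INITIATOR={initiator}",
        f"ISCSI_TARGET_NAME={target_name}",
        f"ISCSI_TARGET_IP={target_ip}",
        f"ISCSI_TARGET_PORT={target_port}",
    ]
    return 'GRUB_CMDLINE_LINUX_DEFAULT="' + " ".join(args) + '"'


if __name__ == "__main__":
    print(iscsi_grub_line(
        initiator="iqn.2004-10.com.ubuntu:01:a3ea501f8a8",
        target_name="iqn.2010-06.com.nutanix:lab-boot-lun-ee392c61-6958-4be2-88fc-636bed265e09",
        target_ip="192.168.1.51",
    ))
```

    Paste the printed line into /etc/default/grub and run update-grub as described above.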

    Now shut down and boot the Linux system from the iPXE ISO. Follow the same steps as in the slideshow above, except that you will use the ipxe.iso image this time:

    Be ready to press Ctrl-B early in the boot process:

    Now, type ‘dhcp’ to acquire an IP address and type ‘show net0/ip’ to verify it.

    It is time to boot from the iSCSI Lun using this command:

    sanboot iscsi:192.168.1.51::::iqn.2010-06.com.nutanix:lab-boot-lun-ee392c61-6958-4be2-88fc-636bed265e09
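    The argument to sanboot follows iPXE's SAN URI layout, iscsi:<servername>:<protocol>:<port>:<LUN>:<targetname>, where empty fields fall back to iPXE's defaults. That is why the command above has four colons in a row. A small sketch that builds the string (the function name is my own):

```python
def ipxe_iscsi_uri(server, target_iqn, protocol="", port="", lun=""):
    """Build an iPXE iSCSI SAN URI; empty fields use iPXE's defaults."""
    return f"iscsi:{server}:{protocol}:{port}:{lun}:{target_iqn}"


if __name__ == "__main__":
    uri = ipxe_iscsi_uri(
        "192.168.1.51",
        "iqn.2010-06.com.nutanix:lab-boot-lun-ee392c61-6958-4be2-88fc-636bed265e09",
    )
    print("sanboot " + uri)
```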

    At this point you should have a system booting from the iSCSI LUN. If you don’t have access to both the source and target drives at the same time, you can pipe the dd output to gzip and save it to a file that can be read on the target system. For example:

    • dd if=/dev/sda | gzip > file.gz
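    The same image-to-file idea can be written in Python, including the restore step on the target side. This is a minimal sketch using only the standard library; the function names and file names are placeholders, and reading or writing real devices requires root.

```python
import gzip
import shutil

CHUNK = 1024 * 1024  # copy in 1 MiB pieces so large disks do not fill memory


def image_to_gzip(source_path, archive_path):
    """Stream a disk (or any file) into a gzip archive."""
    with open(source_path, "rb") as src, gzip.open(archive_path, "wb") as dst:
        shutil.copyfileobj(src, dst, CHUNK)


def gzip_to_device(archive_path, target_path):
    """Stream a gzip archive back onto a disk (or any file)."""
    with gzip.open(archive_path, "rb") as src, open(target_path, "wb") as dst:
        shutil.copyfileobj(src, dst, CHUNK)


if __name__ == "__main__":
    import os
    import tempfile

    # Demonstration with temporary files instead of real devices.
    workdir = tempfile.mkdtemp()
    source = os.path.join(workdir, "disk.img")
    archive = os.path.join(workdir, "disk.img.gz")
    restored = os.path.join(workdir, "restored.img")
    with open(source, "wb") as handle:
        handle.write(b"\0" * 4096 + b"boot data")
    image_to_gzip(source, archive)
    gzip_to_device(archive, restored)
    with open(source, "rb") as a, open(restored, "rb") as b:
        print("round trip ok:", a.read() == b.read())
```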

    I hope you find this post useful and remember to have a good backup before attempting the cloning procedure.

  • CCNA DevNet Study Guide – Describe parsing of common data format (XML, JSON, YAML) to Python data structures

    This is the second post in a series about the new CCNA DevNet certification (Previous Post Here). In this post, we will look at how to manage in Python the three formats that we previously discussed.

    Example of XML parsing in Python

    from __future__ import print_function
    import xml.etree.ElementTree as ET
    def main():
        # create element tree object
        with open('xmlfile.xml', 'r') as xmlFile:
            tree = ET.parse(xmlFile)
        # get root element
        root = tree.getroot()
        print("Root Tag: " + root.tag)
        print("Using a for Loop:")
        for child in root:
            print(child.tag)
            for attrib in child:
                print(attrib.tag, end=' ')
                print(attrib.text)
    
        print("Using Indexes:")
        print(root[0].tag)
        print(root[0][0].tag,end=' ')
        print(root[0][0].text)
        print(root[0][1].tag,end=' ')
        print(root[0][1].text)
    
        print(root[1].tag)
        print(root[1][0].tag,end=' ')
        print(root[1][0].text)
        print(root[1][1].tag,end=' ')
        print(root[1][1].text)
    
        print(root[2].tag)
        print(root[2][0].tag,end=' ')
        print(root[2][0].text)
        print(root[2][1].tag,end=' ')
        print(root[2][1].text)
    
        print("Other:")
        for hostname in root.iter('hostname'):
            print(hostname.tag,end=' ')
            print(hostname.text)
    
    if __name__ == "__main__":
        # calling main function
        main()

    Example run of the previous code:

    Root Tag: esx
    Using a for Loop:
    XX1
    hostname ESXi01
    ipaddress 10.10.10.101
    XX2
    hostname ESXi02
    ipaddress 10.10.10.102
    XX3
    hostname ESXi03
    ipaddress 10.10.10.103
    Using Indexes:
    XX1
    hostname ESXi01
    ipaddress 10.10.10.101
    XX2
    hostname ESXi02
    ipaddress 10.10.10.102
    XX3
    hostname ESXi03
    ipaddress 10.10.10.103
    Other:
    hostname ESXi01
    hostname ESXi02
    hostname ESXi03

    This is the XML file we used:

    <?xml version="1.0" encoding="UTF-8"?>
    <esx>
    <XX1>
    <hostname>ESXi01</hostname>
    <ipaddress>10.10.10.101</ipaddress>
    </XX1>
    <XX2>
    <hostname>ESXi02</hostname>
    <ipaddress>10.10.10.102</ipaddress>
    </XX2>
    <XX3>
    <hostname>ESXi03</hostname>
    <ipaddress>10.10.10.103</ipaddress>
    </XX3>
    </esx>
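    Since the exam topic is about parsing into Python data structures, it is worth showing the step from the element tree to a plain dictionary. The sketch below parses the same XML from a string instead of a file; the function name `hosts_from_xml` is my own choice.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """
<esx>
  <XX1><hostname>ESXi01</hostname><ipaddress>10.10.10.101</ipaddress></XX1>
  <XX2><hostname>ESXi02</hostname><ipaddress>10.10.10.102</ipaddress></XX2>
  <XX3><hostname>ESXi03</hostname><ipaddress>10.10.10.103</ipaddress></XX3>
</esx>
"""


def hosts_from_xml(text):
    """Return {host tag: {field tag: field text}} for each child of the root."""
    root = ET.fromstring(text)
    return {
        host.tag: {field.tag: field.text for field in host}
        for host in root
    }


if __name__ == "__main__":
    hosts = hosts_from_xml(XML_TEXT)
    print(hosts["XX2"]["hostname"])
    for name, fields in hosts.items():
        print(name, fields["ipaddress"])
```

    Once the data is in a dictionary, it can be handled the same way as the result of the JSON and YAML loaders.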

    To achieve something similar with the .json file we would use “import json”

    import json
    
    def main():
        with open('jason.json', 'r') as jsonFile:
            # load the JSON file into Python data structures (dict, list, str, ...)
            myJasonFile = json.load(jsonFile)
        print(myJasonFile)
    
    if __name__ == "__main__":
        # calling main function
        main()
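    What comes back from the json module is ordinary Python data. The short sketch below shows the type mapping using an inline string; the "count", "tags", "enabled", and "owner" keys are made up for the example and are not part of the HyperFlex file.

```python
import json

JSON_TEXT = """
{
  "esx": {"XX1": {"ipaddress": "10.10.10.101", "hostname": "ESXi01"}},
  "count": 3,
  "tags": ["lab", "prod"],
  "enabled": true,
  "owner": null
}
"""

data = json.loads(JSON_TEXT)

# JSON object -> dict, array -> list, number -> int/float,
# true/false -> bool, null -> None
for key, value in data.items():
    print(key, type(value).__name__)

# Nested objects are nested dictionaries.
print(data["esx"]["XX1"]["hostname"])
```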

    And to parse YAML

    import yaml
    
    with open("yamlfile.yaml", 'r') as yamlFile:
        try:
            print(yaml.safe_load(yamlFile))
        except yaml.YAMLError as exc:
            print(exc)

    Let us read the RSS feed (XML) directly from a Website (https://vwannabe.com/feed/)*

    from urllib.request import urlopen
    from xml.etree.ElementTree import parse

    myURL = urlopen("https://vwannabe.com/feed/")
    myXML=parse(myURL)

    for item in myXML.iterfind('channel/item'):
        title = item.findtext('title')
        date = item.findtext('pubDate')
        link = item.findtext('link')
    
        print(title)
        print(date)
        print(link)
        print()

    *adapted from “Python – How to Read XML from URL?” by Vinish Kapoor

    This is the result of the previous code:

    CCNA DevNet Study Guide – Part 1
    Sun, 19 Jan 2020 17:17:45 +0000
    CCNA DevNet Study Guide – Part 1
    
    vSphere Upgrade 6.0 to 6.5 Fails with Replace Process Level Token error.
    Wed, 19 Jun 2019 15:32:13 +0000
    vSphere Upgrade 6.0 to 6.5 Fails with Replace Process Level Token error.
    
    Vembu now supports Hyper-V Cluster
    Thu, 01 Nov 2018 12:25:42 +0000
    Vembu now supports Hyper-V Cluster
    
    vCenter 6.7 upgrade walkthrough
    Fri, 20 Apr 2018 19:26:54 +0000
    vCenter 6.7 upgrade walkthrough
    
    Vembu
    Wed, 28 Mar 2018 19:11:09 +0000
    Vembu
    
    How to re-register the embedded VMware Update Manager (VUM) to its vCenter (VCSA) 6.5
    Wed, 21 Feb 2018 23:06:36 +0000
    How to re-register the embedded VMware Update Manager (VUM) to its vCenter (VCSA) 6.5
    
    How to spin up a Linux instance in AWS
    Thu, 08 Feb 2018 20:57:06 +0000
    How to spin up a Linux instance in AWS
    
    CCNA Cyber Ops – SECOPS 1.0
    Tue, 02 Jan 2018 20:23:21 +0000
    CCNA Cyber Ops – SECOPS 1.0
    
    Hacking Public Speaking
    Wed, 30 Aug 2017 16:50:28 +0000
    Hacking Public Speaking
    
    VMworld 2017 General Session Day Two
    Tue, 29 Aug 2017 17:44:05 +0000
    https://vwannabe.com/2017/08/29/vmworld-2017-general-session-day-two/

    The Cisco Blueprint uses REST calls to a site and parses the JSON. You can find the example here. In the next post for this series, I will talk about “Describe the concepts of test-driven development”.

  • CCNA DevNet Study Guide – Part 1

    I will start a series of posts on the new CCNA DevNet certification. I will keep my SOP of going through the curriculum and googling the concepts for you. I will try to include YouTube videos for some of the topics that have more hands-on exercises. The certification name is Cisco Certified DevNet Associate. The first topic is about software development and design. It includes some essential topics that I summarize in the rest of this post.

    1.0 Software Development and Design – Compare data formats (XML, JSON, YAML)

    The formats XML, JSON, and YAML are data-serialization formats, from Wikipedia: In computer science, in the context of data storage, serialization (or serialisation) is the process of translating data structures or object state into a format that can be stored (for example, in a file or memory buffer) or transmitted (for example, across a network connection link) and reconstructed later (possibly in a different computer environment). When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object.
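    The round trip in that definition (structure to text and back to an equivalent structure) can be sketched with Python's json module. The sample dictionary is only an illustration.

```python
import json

original = {"name": "Jason", "last_name": "Parser"}

# Serialize: Python dict -> text that can be stored or sent over a network.
serialized = json.dumps(original, sort_keys=True)
print(serialized)

# Deserialize: text -> a new, equal Python dict.
restored = json.loads(serialized)
print(restored == original)      # equal content
print(restored is original)      # but a separate object
```

    The restored dictionary compares equal to the original, yet it is a different object, which is the "semantically identical clone" the definition refers to.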

    In most cases, you can accomplish your task with any of the three. If you are a JavaScript developer, you will probably feel more comfortable with JSON (JavaScript Object Notation), and if you code in Python, you might stick to YAML (YAML Ain’t Markup Language). The XML (eXtensible Markup Language) format comes from the World Wide Web Consortium (W3C).

    One difference between them is the format used by each. The XML uses tags like HTML, JSON uses objects in attribute-value pairs, and YAML uses indentation like Python.

    Here is a JSON snippet that I use as part of the Cisco HyperFlex Installation.

    {
        "esx": {
            "XX1": {
                "ipaddress": "10.10.10.101",
                "hostname": "ESXi01"
            },
            "XX2": {
                "ipaddress": "10.10.10.102",
                "hostname": "ESXi02"
            },
            "XX3": {
                "ipaddress": "10.10.10.103",
                "hostname": "ESXi03"
            }
        }
    }
    

    The previous example means that I have something called “esx”, which is the hypervisor, and that I have three of them (XX1-XX3). Each has an IP address and a hostname. In XML, it should look like this:

    <?xml version="1.0" encoding="UTF-8"?> 
    <esx> 
      <XX1> 
        <hostname>ESXi01</hostname> 
        <ipaddress>10.10.10.101</ipaddress> 
      </XX1> 
      <XX2> 
        <hostname>ESXi02</hostname> 
        <ipaddress>10.10.10.102</ipaddress> 
      </XX2> 
      <XX3> 
        <hostname>ESXi03</hostname>
        <ipaddress>10.10.10.103</ipaddress> 
      </XX3>
    </esx>
    

    And in YAML, it should be something like this:

    ---
    esx:
      XX1:
        ipaddress: 10.10.10.101
        hostname: ESXi01
      XX2:
        ipaddress: 10.10.10.102
        hostname: ESXi02
      XX3:
        ipaddress: 10.10.10.103
        hostname: ESXi03

    I used two free online tools to convert one format to the other.

    1. https://www.freeformatter.com/json-to-xml-converter.html
    2. https://www.json2yaml.com/
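    The JSON to XML conversion can also be done locally with the Python standard library. The sketch below is a simplified converter that only handles nested objects with scalar values, which is all the snippet in this post needs; it makes no attempt to cover lists or attributes, and the function names are my own.

```python
import json
import xml.etree.ElementTree as ET


def dict_to_element(tag, value):
    """Recursively turn a dict (or scalar) into an XML element named `tag`."""
    element = ET.Element(tag)
    if isinstance(value, dict):
        for child_tag, child_value in value.items():
            element.append(dict_to_element(child_tag, child_value))
    else:
        element.text = str(value)
    return element


def json_to_xml(json_text):
    """Convert a JSON document with a single top-level key into an XML string."""
    data = json.loads(json_text)
    (root_tag, root_value), = data.items()  # expects exactly one top-level key
    return ET.tostring(dict_to_element(root_tag, root_value), encoding="unicode")


if __name__ == "__main__":
    sample = '{"esx": {"XX1": {"ipaddress": "10.10.10.101", "hostname": "ESXi01"}}}'
    print(json_to_xml(sample))
```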

    It is recommended to use the built-in libraries rather than writing your own parser, to avoid mistakes. For example, JavaScript uses the JSON.parse() method, and Python uses the json module. Example of use of the json module:

    import json
    json_string = '{"name": "Jason", "last_name":"Parser"}'
    parsed_json = json.loads(json_string)
    print(parsed_json['name'])
    # prints: Jason

    That is all for this post, I will publish periodically to add more sections to the software development and design topic:

    • Describe parsing of common data format (XML, JSON, YAML) to Python data structures
    • Describe the concepts of test-driven development
    • Compare software development methods (agile, lean, waterfall)
    • Explain the benefits of organizing code into methods/ functions, classes, and modules
    • Identify the advantages of common design patterns (MVC and Observer)
    • Explain the advantages of version control
    • Utilize common version control operations with Git
  • vSphere Upgrade 6.0 to 6.5 Fails with Replace Process Level Token error.

    Recently I was upgrading a vCenter 6.0 U2 to 6.5, and after successfully upgrading the external PSC, the process failed at the vCenter. After some troubleshooting, I found out that I needed to add the user to the “Replace a Process Level Token” attribute under the Local Security Policy. You can find the Microsoft documentation here. The process is simple:

    1. Log in to the vCenter Windows Server with your Administrator account.
    2. Open the Control Panel and click Administrative Tools.
    3. Browse to and double-click Local Security Policy, and then expand Local Policies.
    4. Click on User Rights Assignment and open the Replace a Process Level Token attribute.

    5. Click the Add User or Group button to add the service account.

  • Vembu now supports Hyper-V Cluster

    With the release of version 4.0, Vembu now extends support to Hyper-V clusters. Vembu already supports both physical and virtual environments, covering all your needs for backups and disaster recovery. Please check their website at https://www.vembu.com, and request a demo to experience the different features here: https://www.vembu.com/vembu-product-demo/. There are a couple of interesting new features in version 4.0 that are worth trying, not to mention that the free tier comes with the protection of up to three VMs in your environment. One of these features is Hyper-V cluster support.

    Hyper-V Failover Cluster

    To view the latest webinars, including one on how to manage a highly available cluster, check the webinars page here: https://www.vembu.com/webinars/#