    Ransomware Detection Model: A Use Case for Nutanix Hyperconverged Infrastructure (AOS) and Azure Machine Learning Studio

    Javier E. Rodriguez, PE
    School of Cybersecurity and Privacy
    Georgia Institute of Technology
    javirodz@gatech.edu

    Abstract

    This paper presents the design and implementation of a behavioral ransomware detector built with machine learning. The system models the input and output (I/O) patterns of a virtual machine in two states: at steady state, and while its files are being encrypted by a typical ransomware attack. The model relies on key performance indicators (KPIs) that are available in most modern storage arrays, together with Azure Machine Learning Studio in the Microsoft Azure cloud. This combination keeps the method accessible to practitioners who do not have specialized knowledge of ransomware internals or machine learning.

    Data collection takes place at the storage array level through the Nutanix distributed data cluster. Observing I/O at this layer makes the measurement invisible to adversarial ransomware running inside the guest operating system. Because the method is behavioral, it can be expressed as anomaly detection, which allows it to provide a general detection capability against previously unseen, zero day ransomware.

    The experiments show that, once a virtual machine reaches steady state I/O, the model detects the anomalies caused by active encryption. On the small datasets collected for this study, the anomaly detector flagged every sample recorded during encryption.

    1. Introduction

    According to several industry reports, including the CrowdStrike Global Threat Report [1] and guidance from the Cybersecurity and Infrastructure Security Agency (CISA) [2], ransomware remains one of the most visible cybersecurity risks. The practice will continue for as long as it stays profitable. Estimates of its cost vary, but they consistently exceed the billion dollar mark, and industry coverage describes a threat that keeps evolving in both scale and technique [3].

    The average ransom payment nearly doubled in the year preceding this study, yet that figure is small next to the cost of downtime. PurpleSec reported that the average cost of downtime per incident in 2020 was approximately $283,000 [4]; this is a vendor reported estimate and should be verified against a primary source before it is cited independently. The growth in attacks reached every sector, public and private.

    The central difficulty in detecting ransomware is that it uses the same libraries and system calls as legitimate applications and routine operating system tasks. By taking a generic approach built on off the shelf tools, the system described here aims to address that difficulty without depending on signatures specific to any one family.

    2. Background

    Ransomware is a class of malware that denies a user access to their data until a ransom is paid [5]. The threat actor demands payment in exchange for restoring the data, which may be anything held in the system’s storage. The goal is to prevent victims from carrying out their normal activities (see Figure 1).

    Figure 1. Steps in ransomware activity [6].

    Ransomware is commonly divided into two basic types, locker and crypto, with hybrid variants in some cases [6]. The approach in this study targets crypto ransomware, which encrypts the victim’s original data and renders it unavailable. The scheme typically includes a ransom note with instructions for paying and for obtaining the key needed to decrypt the data. This is an attack against the availability of the system.

    In some cases the threat actor also exfiltrates the data and threatens to publish it unless the ransom is paid. That tactic attacks the confidentiality of the data and can expose victims to regulatory fines, for example when the data includes payment card information or medical history.

    2.1 Technology Overview

    Figure 2 shows the technology used in the laboratory setting. The top layer, labeled App, holds the virtual machine running the Windows 10 operating system. The left side of the figure shows the conceptual layout of a hyperconverged system, and the right side shows the physical equipment used in the experiments.

    Figure 2. The legacy three tier infrastructure consists of three layers: the compute layer, the storage area network or fabric (SAN), and the storage array or arrays [7].

    Hyperconverged infrastructure (HCI) is a software defined, unified system that combines the conventional data center elements of storage, compute, networking, and management. It uses software and x86 servers in place of expensive, purpose built hardware, which reduces data center complexity and improves scalability through a single, simple console. The laboratory equipment used in these experiments is a Nutanix hyperconverged cluster running the Acropolis hypervisor. The second part of the laboratory environment runs in the Azure cloud and is described in the Analysis section.

    3. Literature Review

    Dozens of studies address ransomware detection using a wide range of techniques. A substantial body of work applies machine learning and dynamic analysis to the problem, including wrapper based feature selection [8], network traffic analysis [9], [10], software defined networking [11], behavioral classification of variants [12], layered machine learning defenses [13], finite state machine models [14], and broad surveys of the field [15], [6], [16], [17]. Behavior based automated malware analysis has also been studied in depth [18], [19], [20], [21]. Hypervisor based and disk or storage level monitoring has been explored as well [22], [23], [24], [25], along with self healing, ransomware aware file systems [26] and data centric stopping techniques [27]. The approach in this study aims to distinguish itself by collecting data only at the storage array level and by using the hypervisor to keep the attacker unaware that it is being observed. Two studies that rely on dynamic behavior are worth discussing in more detail.

    The detection system described by Kharraz et al. [28] is based on the disk access actions performed by a process. It observes the change in entropy between a read and a write to the same region of a file, the proportion of file content that is overwritten, and whether the process deletes files. It also collects metadata about disk access, including whether a process writes to many files and whether those files span very different types or come from a single application. It measures the time between write requests and assigns higher risk as that interval shortens. These features are combined into a risk score through a linear function whose weights are determined by recursive feature elimination.
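    The two ideas in this description, an entropy change between a read and a write and a weighted linear risk score, can be sketched in a few lines of Python. This is an illustration only: the feature names and the weights below are hypothetical placeholders, not the values used by Kharraz et al., whose weights come from recursive feature elimination.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_delta(read_block: bytes, write_block: bytes) -> float:
    """Entropy increase from read to write, scaled to 0..1 (decreases clamp to 0)."""
    delta = shannon_entropy(write_block) - shannon_entropy(read_block)
    return max(0.0, delta) / 8.0


# Hypothetical weights, chosen only so that they sum to 1.0 for this sketch.
EXAMPLE_WEIGHTS = {
    "entropy_delta": 0.5,    # entropy rise between read and write of a region
    "overwrite_ratio": 0.3,  # share of the file content that was overwritten
    "deletes_files": 0.2,    # 1.0 if the process deletes files, else 0.0
}


def risk_score(features: dict, weights: dict = EXAMPLE_WEIGHTS) -> float:
    """Combine features (each already scaled to 0..1) into a linear risk score."""
    return sum(weights[name] * features[name] for name in weights)
```

    Plain text usually has low entropy and ciphertext sits close to eight bits per byte, so a process that reads the former and writes the latter to the same region produces a large entropy_delta value.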

    In a second study, Baek et al. [24] proposed a detection model based on a set of lightweight behavioral features that describe the overwriting pattern of ransomware, a pattern that is largely invariant across families.

    Figure 3. Ransomware overwriting pattern contrasted with valid applications [24].

    4. Methodology

    This section presents the design of the I/O pattern analyzer. By drawing on the key performance indicators present in most modern storage arrays, the detection model is independent of the operating system installed on top of the hypervisor. The design has two goals: first, to create an efficient monitoring tool, and second, to remain hidden beneath the operating system layer so that it resists ransomware evasion techniques.

    4.1 Threat Model

    The threat model considered in this experiment is an attacker who can infect the operating system of the virtual machine. The attacker has evaded the static detection techniques and has begun the encryption process. The attacker has no access, physical or remote, to the hypervisor. Framing the problem this way follows established threat modeling guidance for ransomware [29].

    4.2 Ransomware

    There are close to 400 families of ransomware. The behavioral characteristics of each family matter in the design of a detection model. The characteristics relevant to this experiment are the way the data is encrypted and the type of evasion techniques used. Other characteristics, such as the network flow and the attack vector, are outside the scope of this study.

    After a certain period, a guest operating system reaches a steady state of I/O access patterns. A typical application is unlikely to behave in the same way a malicious payload does, at least not continuously. Everything a ransomware executable does requires resources such as CPU and memory, and it requires access to files, because the primary goal of a crypto locker is to encrypt all of the data in a way that makes it unusable to the victim.

    I/O access pattern | I/O characteristics | Typical applications
    Streaming reads | 100% reads; large contiguous requests; 1-64 concurrent requests; may be threaded. | Media servers (video on demand, etc.), virtual tape libraries (VTL), application servers
    Streaming writes | 100% writes; large contiguous requests; 1-64 concurrent requests; may be threaded. | Media capture, VTL, medical imaging, archiving, backup, video surveillance, reference data
    OLTP | Typically 2 KB to 16 KB request sizes; read, modify, write, verify operations resulting in two reads for every write; primarily random accesses; large number of concurrent requests. When running SQL statements in parallel, the database typically performs large random I/Os. | Databases (SAP, Oracle, SQL), online transaction servers
    File server | Moderate distribution of request sizes from 4 KB to 64 KB, although 4 KB and 64 KB comprise 70% of requests; primarily random; generally four reads for every write; a large number of concurrent requests during peak operational periods. | File and printer servers, e-mail (Exchange, Notes), decision support systems
    Web server | A wide distribution of request sizes from 512 bytes to 512 KB; primarily random accesses; a large number of concurrent requests during peak operational periods. | Web services, blogs, RSS feeds, shopping carts, search engines, storage services
    Workstations | Primarily small to medium request sizes; 80% sequential and 20% random; generally four reads for every write; 1-4 concurrent requests. | Business productivity, scientific/engineering applications

    Table 1. Application I/O characteristics by access pattern [30].

    Table 1 summarizes common application types together with their typical I/O patterns and behavior. Other characteristics, such as streaming versus batch access, serial versus random access, and the block size histogram, also change during a ransomware attack. Figures 4 and 5 show examples of how data processing and access patterns differ.

    Figure 4. Data processing model [30].

    Figure 5. Access pattern contrast [31].

    Kharraz et al. divide the characteristics of ransomware I/O access patterns into three main categories:

    The attacker overwrites the user’s file with the encrypted version.

    The attacker reads the file, writes a new encrypted file, and then deletes the original.

    The attacker reads the file, writes a new encrypted file, and then overwrites the original.

    Figure 6. I/O pattern categories according to Kharraz et al. [28].

    Most families use a specific file extension for the encrypted output. For example, some Mespinoza variants of ransomware use the .pysa extension. Taking these access patterns into account, some families list and then randomly encrypt the files, which is a more advanced evasion technique. Detecting this kind of malicious behavior reliably requires several orthogonal methods of monitoring, a point expanded in the Discussion and Limitations section.

    With the experimental setup ready, data collection began. A simulated ransomware script (see Appendix III) traverses the Documents folder in the Windows 10 test VM and encrypts, from top to bottom, every file with one of the following extensions:

    “.pptx”, “.txt”, “.csv”, “.db”, “.mdb”, “.log”, “.sav”, “.sql”, “.xml”, “.key”, “.cert”,
    “.pem”, “.doc”, “.pdf”, “.email”, “.eml”, “.msg”, “.oft”, “.ost”,
    “.pst”, “.vcf”, “.apk”, “.bat”, “.pl”, “.ps1”, “.vsd”, “.vss”, “.vst”, “.vdx”,
    “.vsx”, “.vtx”, “.vsw”, “.vsl”, “.dot”, “.xls”, “.py”, “.jpg”, “.jpeg”, “.png”,
    “.pgp”, “.tiff”, “.sys”, “.pfx”, “.plist”, “.vmx”, “.gif”, “.lic”, “.kit”, “.ctx”,
    “.sh”, “.conf”, “.ttf”, “.ico”, “.exe”, “.dmg”, “.kdbx”, “.java”, “.jar”, “.yml”, “.json”,
    “.kdb”, “.dll”, “.img”, “.msi”, “.wsf”, “.htm”, “.php”, “.vb”, “.c”, “.pcap”

    A complete traversal of the Documents folder takes approximately seventeen minutes. Appendix IV shows a timestamped sequence of performance metric snapshots captured during a traversal.

    4.3 Testbed

    Configuring a laboratory setting involves several considerations. The first is to provide an environment that resembles production. In this case, several tools were used to populate a Windows 10 VM with the data needed for a ransomware attack. Building such a testbed is not trivial, and additional observations appear in the Future Work section. The design followed prudent practices for malware experiments [32], and drew on isolated analysis environments such as Cuckoo Sandbox [33], [34].

    For this scenario, the operating system is assumed to be free of ransomware during the time it takes to reach steady state. That period is when the monitor reads the I/O patterns to create a clean baseline.

    The operating system used for these experiments is Windows 10 Enterprise. Figure 7 shows the layout of the laboratory. The guest operating system runs on a Type 1 hypervisor, which in this case is Nutanix Acropolis. The hypervisor isolates the guest VM and prevents the ransomware from reaching anything outside the experimental environment.

    Figure 7. Analysis layout with a Type 1 hypervisor.

    To populate the test VM with files, two Python programs were developed to generate data. The first program, shown in Figure 8, accepts a root folder path as a starting point and creates folders to a chosen depth.

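    import os

    def gen_tree(depth, parent_dir):
        """Create a chain of nested folders, depth levels deep, under parent_dir."""
        # random_line() is assumed to be a helper defined elsewhere in the same
        # program; it returns one random word from the given word list file.
        while depth > 0:
            depth = depth - 1
            new_directory = random_line('words.txt')
            path = os.path.join(parent_dir, new_directory)
            try:
                os.mkdir(path)
            except OSError:
                # The folder may already exist; reuse it and keep descending.
                pass
            parent_dir = path
        # Return the deepest folder, which is parent_dir itself when depth is 0.
        return parent_dir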

    Figure 8. Python routine that creates a folder tree using common English words.

    The code uses a list of the one thousand most common words in the English language [35]. According to Kharraz et al. [28], some ransomware variants compute the entropy of a file or folder name and will not trigger if the name appears too random, so realistic names matter.
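    The routine in Figure 8 calls a random_line helper that is not reproduced in the figure. A minimal sketch of such a helper follows, assuming the word list is a plain text file with one word per line; the body is an assumption made for illustration and is not the author's code.

```python
import random


def random_line(file_name: str) -> str:
    """Return one randomly chosen non-empty line from a word list file."""
    with open(file_name, encoding="utf-8") as handle:
        words = [line.strip() for line in handle if line.strip()]
    if not words:
        raise ValueError(f"{file_name} contains no words")
    return random.choice(words)
```

    Reading the whole file on every call is acceptable for a list of one thousand words; a larger list would be better loaded once and reused.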

    For each path created, a second Python program generates Word files using the python-docx library [36]. To add images, two techniques were combined, one from Arrington [37] and one from Zita [38]. Finally, to increase the data volume, older documents were added, including PowerPoint presentations, PDF files, and additional images. Because only simulated ransomware was used, there was no risk of anyone stealing real data.

    The sample data consisted of 16,447 files (see Appendix I). The final number of encrypted files was lower, approximately 12,000, because the system stalled on very large files such as .zip and .ova archives.

    A second important consideration is the hardware isolation of the system in which the ransomware is triggered. Hardware isolation refers mainly to the network and to the ability of the monitoring environment to inject the ransomware without any risk of spreading it. To close network access, an isolated virtual switch with no uplink connections was used. The setup is flexible enough to move eth0 to a switch with internet access when software needs to be added to the operating system. In Figure 9, the br0 virtual switch has physical uplinks to a physical switch, while br1 is isolated.

    Figure 9. Laboratory network diagram.

    Because the test VMs run in a virtual environment, the console is available at any time without the risk of spreading the ransomware. With the environment and configuration described, the next section covers the key performance indicators that are available and how the data is collected.

    4.4 Features

    Feature identification is a broad subject. Features are the foundation of the dataset, and the dataset is only as useful as the features selected. The insight gained from the observations improves when the features chosen are well suited to the problem. This experiment had a rich set of features available; Appendix II lists them in full.

    Dataset quality improves when features are selected through a formal process such as feature engineering [39]. In this case, a combination of heuristics and the findings of the research papers reviewed for this problem guided the selection. Table 2 lists the features used.

    # | Selected feature
    1 | ctl_random_ops_per_sec
    2 | ctl_read_io_bandwidth_kBps
    3 | ctl_write_io_bandwidth_kBps
    4 | ctl_num_read_iops
    5 | ctl_num_write_iops
    6 | hv_avg_read_io_latency_usecs
    7 | hv_avg_write_io_latency_usecs
    8 | ctl_total_read_io_size_kbytes
    9 | ctl_read_size_histogram_4kB
    10 | ctl_read_size_histogram_8kB
    11 | ctl_read_size_histogram_16kB
    12 | ctl_read_size_histogram_32kB
    13 | ctl_read_size_histogram_64kB
    14 | ctl_read_size_histogram_512kB
    15 | ctl_read_size_histogram_1024kB
    16 | ctl_write_size_histogram_4kB
    17 | ctl_write_size_histogram_8kB
    18 | ctl_write_size_histogram_16kB
    19 | ctl_write_size_histogram_32kB
    20 | ctl_write_size_histogram_64kB
    21 | ctl_write_size_histogram_512kB
    22 | ctl_write_size_histogram_1024kB

    Table 2. Features selected for the detection model.

    The intuition behind this selection is that, during a ransomware attack, the I/O statistics rise above their normal levels and the characteristics of the steady state I/O pattern change. The two clearest signals were the block size and the randomness of access. For a complete list of candidate features, see Appendix II.
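    One way to make the block size signal easier to compare across samples is to convert the histogram counters into shares of all requests. The sketch below does that for a single KPI sample, using the feature names from Table 2; the helper names are hypothetical and this step was not part of the original pipeline.

```python
# Bucket suffixes taken from the histogram feature names in Table 2.
BUCKETS = ["4kB", "8kB", "16kB", "32kB", "64kB", "512kB", "1024kB"]


def histogram_shares(sample: dict, direction: str) -> dict:
    """Return each bucket's share of requests for direction 'read' or 'write'."""
    prefix = f"ctl_{direction}_size_histogram_"
    counts = {bucket: sample.get(prefix + bucket, 0) for bucket in BUCKETS}
    total = sum(counts.values())
    if total == 0:
        return {bucket: 0.0 for bucket in BUCKETS}
    return {bucket: count / total for bucket, count in counts.items()}


def dominant_bucket(sample: dict, direction: str) -> str:
    """Return the block size bucket that holds the most requests."""
    shares = histogram_shares(sample, direction)
    return max(shares, key=shares.get)
```

    A steady state baseline whose dominant bucket is stable gives a simple reference point: a sample in which the dominant bucket changes, or in which its share drops sharply, is a candidate for closer inspection.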

    4.5 Dataset

    A dataset is a collection of data samples. The dataset in this experiment contains measurements collected every 120 seconds through a REST API. There are several ways to collect this data, as shown in Figure 10. A REST API request was chosen because the results can be written to a comma separated value (.csv) file for use in training. Most modern storage arrays expose comparable measurements, so the approach should carry over to other platforms and to enterprises of different sizes without vendor lock in.

    Figure 10. Monitoring tools for the Nutanix Acropolis cluster.

    Figure 10 shows Prism, the built in monitoring tool, which includes I/O and network flow monitoring. The hyperconverged system provides an HTML5 user interface, a REST API, and a command line utility. The experiment assumes that the operating system, in this case Windows 10 Enterprise, has reached a steady state I/O pattern and is free of any ransomware infection.

    When building a machine learning dataset, the ground truth data is split into a training dataset and a testing dataset. The algorithm is trained on the training data and then evaluated on its ability to perform on the testing data [40].

    Figure 11. REST API access to the Nutanix data platform [41].

    To retrieve the performance indicators, the Nutanix cluster is queried with a request that includes:

    the unique identifier of the virtual disk (UUID);

    the metric, or KPI, being requested;

    the start time and end time, expressed as Unix epoch timestamps in microseconds; and

    the interval in seconds, where the minimum for this version of Nutanix is 120 seconds.

    Appendix V contains the code used to retrieve the KPI through the REST API.
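    As a rough illustration of the four request elements listed above, the sketch below converts timestamps to epoch microseconds and assembles a query URL. The endpoint path, the port, and the query parameter names are assumptions made for this example and should be checked against the API reference for the installed Nutanix version; the code in Appendix V remains the authoritative version.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MIN_INTERVAL_SECS = 120  # minimum sampling interval stated in the text

# Assumed endpoint layout; verify against the API reference before use.
STATS_PATH = "/PrismGateway/services/rest/v1/virtual_disks/{uuid}/stats/"


def to_epoch_usecs(moment: datetime) -> int:
    """Convert a timezone-aware datetime to whole microseconds since the epoch."""
    return (moment - EPOCH) // timedelta(microseconds=1)


def build_stats_url(host, vdisk_uuid, metric, start, end,
                    interval_secs=MIN_INTERVAL_SECS, port=9440):
    """Build the query URL for one metric of one virtual disk."""
    if interval_secs < MIN_INTERVAL_SECS:
        raise ValueError("interval is below the minimum of 120 seconds")
    if end <= start:
        raise ValueError("end time must be after start time")
    # The parameter names below are assumptions for this sketch.
    query = urlencode({
        "metrics": metric,
        "startTimeInUsecs": to_epoch_usecs(start),
        "endTimeInUsecs": to_epoch_usecs(end),
        "intervalInSecs": interval_secs,
    })
    return f"https://{host}:{port}{STATS_PATH.format(uuid=vdisk_uuid)}?{query}"
```

    Keeping URL construction separate from the HTTP call makes the time conversion easy to check in isolation, which matters because a timestamp supplied in seconds or milliseconds instead of microseconds would request the wrong time range.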

    One of the main challenges in behavioral detection is distinguishing a valid application from an actual ransomware attack. Some families go further and become adversarial by using several evasion techniques. One technique that would make this approach less robust [42] is for the malware to observe its environment and imitate normal behavior.

    To model a valid application workload, the experiment used DISKSPD, a command line tool for micro benchmarking [43], [44]. The following options were used:

    diskspd -b8K -d30 -o4 -t8 -h -r -w25 -L -Z1G -c20G C:\iotest.dat > iotestResults.txt

    This command runs a 30 second random I/O test (-d30, -r) against a 20 GB file on the C: drive (-c20G), with a 25 percent write and 75 percent read ratio (-w25) and an 8 KB block size (-b8K). It uses eight worker threads (-t8), each with four outstanding I/Os (-o4), disables software caching and hardware write caching (-h), records latency statistics (-L), and fills a 1 GB write source buffer with random data (-Z1G). The results are saved to a text file. The equivalent utility on Linux is fio [45], [46].

    The DISKSPD emulator was used to model a SQL database as a representative application workload. The Future Work section returns to the need to model many suitable applications in order to build a more robust model.

    5. Analysis

    Once the data has been collected (see Appendix VI for an example), it must be prepared before it can train the model. Two columns were added. The first is a VM identifier, which keeps the model ready for future experiments with additional test VMs. The second is the target column, Ranso, which indicates whether the data was collected during a ransomware attack: the value is zero for a normal operating system and one for an operating system under a ransomware attack. In the feature names, the word controller was shortened to ctl_, because the dataset import process appears to limit the length of a feature name and the characters it can contain.
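    The preparation step can be sketched as a small function that adds the VM and Ranso columns and shortens the feature names. The timestamp_usecs key, the controller_ prefix of the raw metric names, and the idea of labeling rows from known attack time windows are assumptions made for this illustration.

```python
def prepare_rows(rows, vm_id, attack_windows, time_key="timestamp_usecs"):
    """Return copies of rows with 'VM' and 'Ranso' columns and shortened names.

    rows           -- list of dicts, one per collected sample
    vm_id          -- identifier stored in the 'VM' column
    attack_windows -- list of (start, end) times during which encryption ran
    time_key       -- hypothetical name of the sample timestamp column
    """
    prepared = []
    for row in rows:
        clean = {}
        for name, value in row.items():
            # Shorten the assumed 'controller_' prefix to 'ctl_'.
            if name.startswith("controller_"):
                name = "ctl_" + name[len("controller_"):]
            clean[name] = value
        stamp = row[time_key]
        clean["VM"] = vm_id
        # 1 if the sample falls inside any attack window, otherwise 0.
        clean["Ranso"] = int(any(start <= stamp < end
                                 for start, end in attack_windows))
        prepared.append(clean)
    return prepared
```

    Labeling from recorded start and end times of each simulated attack keeps the target column reproducible, which is useful when the same collection run is repeated for additional VMs.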

    5.1 Azure Machine Learning

    There are two ways to apply Azure Machine Learning here. In the first, the collected data trains a model that classifies an operating system as either clean or infected. This is a binary classification problem, which can use Azure Automated Machine Learning. In the second, the collected data serves as a baseline and Azure Anomaly Detection identifies departures from it. Open source workbenches such as WEKA [47] offer comparable modeling capabilities, but a managed cloud service keeps the workflow accessible without local setup [48].

    5.2 Automated Machine Learning

    Automated machine learning, also called automated ML or AutoML, automates the time consuming, iterative tasks of model development. It lets data scientists, analysts, and developers build models at scale with efficiency and productivity while preserving model quality [49]. The automated workflow used here was as follows:

    A .csv file with the collected data was uploaded. The data includes the I/O information of the test VM both with and without ransomware [50].

    The target metric for the classification is the Ranso column.

    Azure Automated ML evaluated several algorithms, trained the corresponding models, and recommended the best model based on accuracy. This process is time consuming.

    The recommended pipeline used MaxAbsScaler with a random forest [51].

    The data was flagged as imbalanced, most likely because there were far more samples without ransomware than with it.

    Automated ML ran for close to an hour and recommended a random forest (Figures 12 and 13).

    Figure 12. Top ranked algorithms reported by Azure Automated ML.

    Figure 13. Lowest ranked algorithms reported by Azure Automated ML.

    Figure 14. Supervised learning pipeline using a two class decision forest.

    In Figure 14, the trained model uses the two class decision forest. The steps are:

    upload the .csv to create a dataset, in this case Win10WithRanso;

    normalize the data using min and max scaling;

    split the data into 70 percent for training and 30 percent for evaluation;

    train a model using the two class decision forest algorithm; and

    score and evaluate the model (see the Findings section for details).
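    The normalization and split steps above can be illustrated without any machine learning library. The sketch below applies min and max scaling to one column and performs a 70/30 split separately for each class, so that the few encryption samples appear in both parts. Splitting per class is a choice made for this example; the text does not state whether the original pipeline stratified the split.

```python
import random


def min_max_scale(column):
    """Scale a list of numbers linearly to the range 0..1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column carries no information
    return [(value - lo) / (hi - lo) for value in column]


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)
```

    With a dataset as imbalanced as the one described here, a purely random split could leave the evaluation part with no encryption samples at all, which would make the reported metrics meaningless for the class of interest.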

    At this point there is a trained model that can detect at least the type of ransomware in which encryption proceeds by reading, overwriting, and renaming the file. Because there are many families of ransomware, broader coverage would require additional models.

    5.3 Azure Anomaly Detection

    Because a ransomware event is not common, collecting data about it is difficult, and by the nature of malicious activity the datasets are imbalanced. To handle imbalanced data, Azure Machine Learning provides a category called anomaly detection.

    The collected data fits that category well: it is numerical data gathered as a uniformly spaced time series. Azure ML can detect trends and spikes and report the changes as anomaly scores. It uses principal component analysis (PCA), a technique often used in exploratory data analysis because it reveals the inner structure of the data and explains its variance [52].
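    For readers unfamiliar with PCA as an anomaly detector, the usual formulation scores a sample by how poorly it is reconstructed from the principal components of the clean training data. The expression below is that generic textbook form, stated here as an assumption about the approach; the Azure component may normalize or combine the score differently. Here x is a vector of KPI readings, \mu is the mean of the training samples, and W_k holds the top k principal directions as columns.

```latex
\hat{x} = \mu + W_k W_k^{\top} (x - \mu),
\qquad
s(x) = \lVert x - \hat{x} \rVert_2^{2}
```

    A sample taken at steady state lies close to the subspace learned from the baseline, so s(x) is small; a sample taken during encryption changes the mix of block sizes and access randomness, moves away from that subspace, and produces a larger score.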

    Figure 15. Learning pipeline for the anomaly detection approach.

    Once the model is trained with data collected while the operating system has no ransomware, future KPI readings can be evaluated through a deployed API. The Python code below tests the model with new data, and the JSON sample that follows shows the source KPI data.

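    import json

    from azureml.core import Workspace
    from azureml.core.webservice import Webservice

    def test_model(sample_file_path='_samples.json'):
        """Send the KPI samples in a JSON file to the deployed anomaly model."""
        service_name = 'ransomaly'
        # Look up the workspace that hosts the deployed web service.
        ws = Workspace.get(
            name='RansoML',
            subscription_id='e7af3a72-63c8-4a9c-a78c-d28c017f238a',
            resource_group='Ranso'
        )
        service = Webservice(ws, service_name)
        with open(sample_file_path, 'r') as f:
            sample_data = json.load(f)
        # The service receives the samples as a JSON string.
        score_result = service.run(json.dumps(sample_data))
        print(f'Inference result = {score_result}')
        return score_result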

    Figure 16. Python source used to query the anomaly model.

    [
      {
        "VM": 1,
        "ctl_random_ops_per_sec": 612,
        "ctl_read_io_bandwidth_kBps": 3438,
        "ctl_write_io_bandwidth_kBps": 1187,
        "ctl_num_read_iops": 428,
        "ctl_num_write_iops": 145,
        "hv_avg_read_io_latency_usecs": 0,
        "hv_avg_write_io_latency_usecs": 0,
        "ctl_total_read_io_size_kbytes": 412656,
        "ctl_read_size_histogram_4kB": 0,
        "ctl_read_size_histogram_8kB": 41125,
        "ctl_read_size_histogram_16kB": 0,
        "ctl_read_size_histogram_32kB": 43,
        "ctl_read_size_histogram_64kB": 0,
        "ctl_read_size_histogram_512kB": 0,
        "ctl_read_size_histogram_1024kB": 0,
        "ctl_write_size_histogram_4kB": 127,
        "ctl_write_size_histogram_8kB": 13831,
        "ctl_write_size_histogram_16kB": 5,
        "ctl_write_size_histogram_32kB": 13,
        "ctl_write_size_histogram_64kB": 8,
        "ctl_write_size_histogram_512kB": 0,
        "ctl_write_size_histogram_1024kB": 1280,
        "Ranso": 0
      },

    Figure 17. Sample JSON file with source KPI data.

    6. Findings

    This section interprets the data and points to directions for further research. The results are presented with a confusion matrix, the standard way to evaluate a classification model [53]. In these matrices, cases where both the predicted and actual values are one (true positives) appear at the top left, and cases where both the predicted and actual values are zero (true negatives) appear at the bottom right.
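    For reference, the matrix layout described above can be reproduced from a list of actual and predicted labels. The function names are hypothetical; the layout follows the convention stated in this section, with true positives at the top left and true negatives at the bottom right.

```python
def confusion_matrix(actual, predicted):
    """Return [[TP, FN], [FP, TN]]; rows are actual classes, positive first."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1          # encryption present and detected
        elif a == 1 and p == 0:
            fn += 1          # encryption present but missed
        elif a == 0 and p == 1:
            fp += 1          # normal sample flagged as encryption
        else:
            tn += 1          # normal sample classified as normal
    return [[tp, fn], [fp, tn]]


def rates(matrix):
    """Return (true_positive_rate, true_negative_rate) for a 2x2 matrix."""
    (tp, fn), (fp, tn) = matrix
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr, tnr
```

    With very few positive samples in the evaluation split, a single missed sample changes the true positive rate by a large step, so the rates should always be read together with the raw counts.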

    Data was collected for two ransomware encryption events. The first dataset contained 1,441 rows collected while the operating system was normal and nine rows collected during encryption. Azure Machine Learning trained the model by splitting the data into 70 percent for training and 30 percent for evaluation. Figure 18 shows the results.

    Figure 18. Confusion matrix for the first run.

    As Figure 18 shows, the model misclassified only one evaluation sample: on one occasion it predicted no encryption while encryption was in progress, which is a false negative. In a second, fully independent experiment, the model was retrained using 63 rows collected during encryption. This time the output was 100 percent true positives and 100 percent true negatives, as shown in Figure 19.

    Figure 19. Confusion matrix for the second run.

    The more actions that are considered, and the more of them that are present during ransomware activity, the higher the identification rate. Collecting all of these actions, however, requires letting the ransomware run freely long enough to encrypt and destroy many files. In these experiments, the ransomware encrypted approximately seven hundred files per minute.
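    The encryption rate quoted above can be cross-checked against the other figures reported in this paper. The short calculation below uses only the approximate values given in the text (about 12,000 encrypted files, a traversal of about seventeen minutes, and a 120 second collection interval).

```python
files_encrypted = 12_000     # approximate number of files encrypted per run
traversal_minutes = 17       # approximate duration of one full traversal
sample_interval_secs = 120   # collection interval of the REST API queries

# Average encryption rate over one traversal.
files_per_minute = files_encrypted / traversal_minutes

# Number of KPI samples that fit into one traversal at this interval.
samples_per_traversal = traversal_minutes * 60 / sample_interval_secs

print(f"{files_per_minute:.0f} files per minute")
print(f"{samples_per_traversal:.1f} samples per traversal")
```

    The result is roughly 700 files per minute and between eight and nine samples per traversal, which agrees with the nine encryption rows in the first dataset. It also shows the cost of the 120 second interval: at that rate, more than a thousand files can be encrypted before the first sample of an attack is even collected.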

    A second model was configured with Azure anomaly detection. For ransomware encryption, it identified the anomaly in 100 percent of cases. The anomaly model was trained on data from the operating system without ransomware. Table 3 shows the output of the anomaly model on the collected data.

    RansoScored LabelProbability
    110.938535
    110.927582
    110.923774
    110.93596
    110.914103
    110.930306
    110.911739
    110.938852
    110.761196
    110.938535
    110.927582
    110.923774
    110.93596
    110.914103
    110.930306
    110.911739
    110.938852
    110.761196
    110.938535
    110.927582
    110.923774
    110.93596
    110.914103
    110.930306
    110.911739
    110.938852
    110.761196
    110.938535
    110.927582
    110.923774
    110.93596
    110.914103
    110.930306
    110.911739
    110.938852
    110.761196
    110.938535
    110.927582
    110.923774
    110.93596
    110.914103
    110.930306
    110.911739
    110.938852
    110.761196
    110.938535
    110.927582
    110.923774
    110.93596
    110.914103
    110.930306
    110.911739
    110.938852
    110.761196
    110.938535
    110.927582
    110.923774
    110.93596
    110.914103
    110.930306
    110.911739
    110.938852
    110.761196

    Table 3. Output of the anomaly model on the collected data.

    The decision threshold can be tuned between zero and one; the default is 0.5. In this experiment the lowest scored probability was approximately 0.76, well above the default threshold.
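    The effect of moving the threshold can be checked directly against the distinct probabilities in Table 3. The sketch below assumes that a sample is labeled as an anomaly when its probability is greater than or equal to the threshold; that comparison rule is an assumption made for this example.

```python
# The nine distinct probabilities that appear in Table 3.
PROBABILITIES = [0.938535, 0.927582, 0.923774, 0.93596, 0.914103,
                 0.930306, 0.911739, 0.938852, 0.761196]


def scored_label(probability: float, threshold: float = 0.5) -> int:
    """Return 1 (anomaly) when the probability reaches the threshold, else 0."""
    return int(probability >= threshold)


def detected(probabilities, threshold: float = 0.5) -> int:
    """Count how many samples are labeled as anomalies at this threshold."""
    return sum(scored_label(p, threshold) for p in probabilities)
```

    At the default threshold all nine values are labeled as anomalies. Raising the threshold to 0.8 would drop the sample scored at 0.761196, so any tuning intended to reduce false alarms has to be weighed against the weakest true detection.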

    7. Discussion and Limitations

    Both trained models, the binary classifier and the anomaly detector, detected the simulated attack in these experiments. Once detection occurs, rapid response techniques and good operational practices can support recovery. A short script can take a snapshot of the system as soon as an anomaly is detected, and on most modern storage arrays a snapshot has little effect on performance.

    After several weeks of research, reading, and testing, a number of limitations of this approach became clear:

    The model was trained on the behavior of one specific VM, so it is not a generic model. Addressing this would require an automated training process that builds one model per VM.

    Collecting data from a VM that is in production is difficult. In the laboratory it was possible to infect the test VM and take it down to collect data, but a production approach would need to clone the production VM, infect the clone, aggregate the data from both the production VM and the infected clone into a dataset, train and deploy the model, and then periodically update the model by repeating those steps.

    The number of observations in these experiments is small. To confirm the encouraging early findings, the observations and the collected data should be extended to many operating systems running multiple workloads.

    For these reasons, the anomaly detector, which is trained only on data from the clean operating system, is likely a better choice than the two class algorithm.

    8. Future Work

    This section offers a few ideas for stronger protection and higher detection accuracy. Rather than relying on a single way to detect ransomware after the static defenses have been defeated, the proposal is to combine several subsystems that together form a layered defense around the environment:

    I/O to the storage array. This is the approach presented in this study. It would be worth adding both per VM and total storage array performance, since that combination did not appear in the reviewed material.

    File decoys. A honey file technique helps to reduce false positive results [54], [55].

    Compression and deduplication. Both measurements drop during encryption, because encrypted files are poor candidates for compression and deduplication. The idea is promising, although by the time the change is visible it may be too late to stop the encryption.

    Backup verification. Most attacks try to stop the backup system; the challenge is that backup systems vary from one environment to another.

    Network communication. This signal supports the overall strategy and is very effective when combined with the other layers.
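    The compression signal in the list above can be demonstrated with a short experiment. Random bytes stand in for ciphertext here, which is an assumption made to keep the sketch free of external libraries; the sample text is invented.

```python
import os
import zlib


def compression_ratio(data):
    """Compressed size divided by original size (lower means more savings)."""
    return len(zlib.compress(data)) / len(data)


# Repetitive text stands in for ordinary documents.
plain = b"quarterly report: revenue, cost, margin\n" * 2000
# Random bytes stand in for encrypted content of the same length.
ciphertext_like = os.urandom(len(plain))

plain_ratio = compression_ratio(plain)
cipher_ratio = compression_ratio(ciphertext_like)
```

    The text compresses to a small fraction of its size, while the random data does not shrink at all, which is the drop in savings a storage array would report as files are encrypted.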

    There is also a newer way to consume storage in a virtualized environment. The underlying storage is divided into chunks using virtual volumes (VVols) [56]. As the limitations show, it would help for the system to be aware of the specific files the ransomware accesses. Correlating file metadata with VVol utilization could give the model more insight and raise confidence in detection.

    Figure 20. Proposed system for future work.


    In Figure 20, the proposed system has two dynamic behavior monitors on the left. When an anomaly occurs, the message queue receives the alert. The top right shows a backup process monitor, and the lower right shows the honey file, or canary file, check. The bottom center shows the two outputs for a positive ransomware detection: on the left, the process that snapshots the system, and on the right, the alert module.
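    One branch of this design, the canary file check feeding the message queue, could be sketched as below. The in-process `queue.Queue`, the SHA-256 comparison, and the decoy file name are assumptions chosen for illustration; the proposal does not prescribe a specific queue or hashing method.

```python
import hashlib
import queue
import tempfile
from pathlib import Path


def fingerprint(path):
    """SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def check_canaries(known_digests, alerts):
    """Queue an alert for every canary file that changed or disappeared."""
    for path, digest in known_digests.items():
        if not Path(path).exists() or fingerprint(path) != digest:
            alerts.put({"source": "canary", "path": str(path)})


def drain(alerts):
    """Collect queued alerts; a real consumer would snapshot and notify."""
    found = []
    while not alerts.empty():
        found.append(alerts.get())
    return found


with tempfile.TemporaryDirectory() as folder:
    canary = Path(folder) / "budget_2021.docx"   # hypothetical decoy name
    canary.write_bytes(b"decoy content")
    digests = {canary: fingerprint(canary)}
    alerts = queue.Queue()

    check_canaries(digests, alerts)
    before = drain(alerts)            # untouched canary

    canary.write_bytes(b"\x8f\x02 simulated overwrite")
    check_canaries(digests, alerts)
    after = drain(alerts)             # modified canary
```

    The behavior monitors and the backup process monitor would publish to the same queue, so the snapshot and alert modules only need to consume one stream.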

    Acknowledgments

    I thank my family for their support and patience during this research. I also thank Dr. Mustaque Ahamad for his guidance during the semester; his feedback and recommendations helped me meet the learning objectives of the course. Finally, I thank my fellow students, whose positive attitude and strong work in the weekly progress reports kept motivation high.

    Appendix

    Appendix I. File Extension Detailed Count

    The sample dataset used to exercise the simulated ransomware contained the file types listed below, grouped by extension with total size and file count.

    File Extension    Total Size (MB)    File Count
    Total107.846421
    .al0.6721
    .at0.0532
    .backup0.1212
    .bak34.4167
    .basex0.00710
    .bash_history0.011
    .bashrc0.0031
    .bat0.03517
    .bin3.0996
    .boot01
    .bz20.8321
    .c2.4627
    .c320.3384
    .cat0.7065
    .cfg0.2886
    .changed01
    .class1.942794
    .clb0.0462
    .com0.1211
    .common0.0051
    .conf0.696
    .config0.0873
    .controlio0.1725
    .cpgz24.6451
    .cpp0.0291
    .crash0.0241
    .crdownload292.1163
    .crit01
    .css0.14143
    .csv3.082114
    .ctd0.0094
    .ctx0.5755
    .db0.4061
    .deb57.3755
    .debug0.1764
    .default01
    .der0.0011
    .dir0.0341
    .diskdefines01
    .dll0.7872
    .dmg88.353
    .doc15.56932
    .docm0.3162
    .docx518.988486
    .dotx0.3983
    .drt0.7761
    .dtd0.0883
    .dump0.0312
    .dylib0.0732
    .EFI2.3672
    .eml0.0532
    .ena0.463
    .ent0.0293
    .eps14.63717
    .epub5.3891
    .err0.0196
    .exe2074.40227
    .factoryio0.3863
    .FCD04
    .flake80.0011
    .gif0.11943
    .gpg5.2523
    .grp20.7642
    .gz1374.106205
    .h0.065
    .hpp0.0081
    .htc0.0022
    .htm0.0283
    .html5.55725
    .icns0.2492
    .ico0.0492
    .ics0.0196
    .img9.6045
    .in01
    .info02
    .ini0.0027
    .input_i0.3122
    .iso20275.22116
    .jar156.356342
    .jnlp0.05615
    .jpeg3.90618
    .jpg157.814162
    .jpg_large0.3251
    .js4.105106
    .json24.039164
    .kdb0.0084
    .kdbx0.03813
    .keystream0.8553
    .kit0.0332
    .lbb0.2793
    .len07
    .lic0.20993
    .license0.0011
    .lock01
    .log204.64302
    .lst0.018
    .manifest0.0713
    .md0.05315
    .md502
    .mgmtd0.711
    .mod2.012236
    .mp4493.2191
    .mpp0.1331
    .msg0.0921
    .msi125.7177
    .names0.0041
    .nar107.01932
    .ndp-proxy0.0061
    .netconfig0.0111
    .netrwhist01
    .nib0.34622
    .notice01
    .nvram0.0711
    .old1881.35711
    .omsg0.0841
    .one0.8361
    .out31.37927
    .ova20349.0799
    .pak92.2252
    .pcap0.0031
    .pcf0.0012
    .pdf1264.3231457
    .pem0.0032
    .pf20.0051
    .pfx0.0021
    .pg_dump0.0171
    .pkg16.1062
    .plist0.0052
    .png108.4061415
    .policy0.0131
    .potx5.9451
    .ppt34.6544
    .pptx529.425171
    .profile01
    .properties0.0248
    .psd1.42710
    .pxd0.0053
    .py6.193647
    .pyc0.5463
    .pyd4.31316
    .pyi0.463184
    .pyx0.0853
    .rar5.2781
    .rdp0.0116
    .rll1.3571
    .rpc0.0412
    .rpm1764.36527
    .rpm-utils0.0021
    .rsrc0.0011
    .rtf1.823102
    .run140.4761
    .s0.0182
    .sample0.0212
    .sb1987.0472
    .SET0.0373
    .sh223.58720
    .SHData0.29512
    .size01
    .slf0.0077
    .so12.77259
    .sql0.0041
    .sqlite1.1251
    .st0.0139
    .strings02
    .symbolMap50.9156
    .sys0.0661
    .tar933.6215
    .template1.8858
    .tex01
    .tgz1520.1227
    .thrift0.0011
    .tif2.0373
    .tiff5.49975
    .TORRENT0.0181
    .ts0.22114
    .ttf3.28821
    .txt91.105567
    .url01
    .vdx8.4487
    .vib103.0356
    .viminfo0.0111
    .vmdk26319.8764
    .vmsd02
    .vmsn0.0281
    .vmx0.0062
    .vmxf0.0042
    .vscodeignore0.00141
    .vsd116.92850
    .vsdx11.48214
    .vss216.99413
    .vssx5.8462
    .war3.8332
    .warn02
    .x320.1991
    .x640.21
    .xls64.106203
    .xlsm98.673121
    .xlsx242.441415
    .xml70.8624454
    .xq8.211735
    .xqm0.19631
    .xsd0.12220
    .xslt0.0012
    .yml0.0011
    .zip16207.155232

    Appendix II. Available Features

    The Nutanix platform exposes the performance indicators below at the VM, cluster, and storage container levels. The subset used for the model appears in Table 2.

    VM: CPU Usage (%); CPU Ready Time (%); Memory Usage (%); Storage Controller IOPS (IOPS); Storage Controller Read IOPS (IOPS); Storage Controller Write IOPS (IOPS); Storage Controller Latency (ms); Storage Controller Read Latency (ms); Storage Controller Write Latency (ms); Storage Controller I/O Bandwidth (Mbps); Storage Controller Read Bandwidth (Mbps); Storage Controller Write Bandwidth (Mbps); Disk Usage (GiB); Disk Usage (%); Snapshot Usage (GiB); Shared Data (GiB); I/O Working Set size (MiB); Read I/O Working Set size (MiB); Write I/O Working Set size (MiB); Read Size Distribution (bytes/%); Write Size Distribution (bytes/%); Network Receive Packets Dropped (# packets); Network Transmit Packets Dropped (# packets); Network Rx (KiB); Network Tx (KiB)

    Cluster: CPU Usage (%); Memory Usage (%); Controller IOPS (IOPS); Controller Read IOPS (IOPS); Controller Write IOPS (IOPS); Controller AVG Latency (ms); Controller AVG Read Latency (ms); Controller AVG Write Latency (ms); Controller I/O Bandwidth (Mbps); Controller Read Bandwidth (Mbps); Controller Write Bandwidth (Mbps)

    Storage Container: Storage Controller IOPS (IOPS); Storage Controller Read IOPS (IOPS); Storage Controller Write IOPS (IOPS); Storage Controller Latency (ms); Storage Controller Read Latency (ms); Storage Controller Write Latency (ms); Storage Controller I/O Bandwidth (Mbps); Storage Controller Read Bandwidth (Mbps); Storage Controller Write Bandwidth (Mbps)

    Virtual Disk: Random I/O (%); Read Source Cache (KBps); Read Working Set size (MiB); Write Working Set size (MiB); Union Working Set Size

    Appendix III. Simulated Ransomware Script (Python)

    The script below traverses a target folder and, for each file whose extension matches the encryption list, encrypts the file in place using a symmetric key. It was used only against the isolated test VM, and the equivalent PowerShell technique is described by Rayner [57].

    import os
    from cryptography.fernet import Fernet
    def encrypt_file(filename):
        # Generate a key. A new key is generated for every file and my_key.key
        # is overwritten each time, so only the most recent key is kept on disk.
        key = Fernet.generate_key()
        # Save the key to the file my_key.key
        with open('my_key.key', 'wb') as my_key:
            my_key.write(key)
        # Initialize fernet object
        fernet_object = Fernet(key)
        # Read the file
        with open(filename, 'rb') as original_file:
            original = original_file.read()
        # Encrypt the file
        encrypted = fernet_object.encrypt(original)
        # Overwrite the file; skip files that cannot be opened for writing
        try:
            with open(filename, 'wb') as encrypted_file:
                encrypted_file.write(encrypted)
        except OSError:
            pass
    def decrypt_file(filename):
        # Read the key from the file "my_key.key"
        with open('my_key.key', 'rb') as my_key:
            key = my_key.read()
        # Initialize fernet object
        fernet_object = Fernet(key)
        # Read the encrypted file
        with open(filename, 'rb') as encrypted_file:
            encrypted = encrypted_file.read()
        # Decrypt the file
        decrypted = fernet_object.decrypt(encrypted)
        # Overwrite the file
        with open(filename, 'wb') as decrypted_file:
            decrypted_file.write(decrypted)
    def get_file_list(root_folder):
        # Walk the folder tree top-down and collect the full path of every file
        file_list = []
        # for root, dirs, files in os.walk(root_folder, topdown=False): #to list bottom-up
        for root, dirs, files in os.walk(root_folder):
            for name in files:
                #print("Filename ", os.path.join(root, name))
                file_list.append(os.path.join(root, name))
            # for folder in dirs:
            #     print("Folder :", os.path.join(root, folder))
        return file_list
    def test_file_extension(file_name):
        # Matching is a substring test against the file name (the part after the
        # last backslash), and several entries have no leading dot, so any name
        # that contains one of these strings is treated as encryptable.
        encryptable = False
        extensions = [".pptx", "txt", "csv", ".db", ".mdb", ".log", ".sav", ".sql", ".xml",
                      ".key", ".cert", ".pem", ".doc", ".pdf", ".email", ".eml", ".msg",
                      ".oft", ".ost", ".pst", ".vcf", ".apk", ".bat", ".pl", "ps1", ".pl",
                      ".vsd", ".vss", ".vst", ".vdx", ".vsx", ".vtx", ".vsw", ".vsl",
                      ".dot", ".xls", ".py", ".jpg", ".jpeg", ".png", ".pgp", ".tiff",
                      "sys", ".pfx", "plist", ".vmx", ".gif", ".lic", ".kit", ".ctx",
                      ".sh", ".conf", ".ttf", ".ico", ".exe", ".dmg", "kdbx", ".java",
                      ".jar", ".yml", ".json", "kdb", ".dll", ".img", ".msi", ".wsf",
                      ".htm", ".php", ".vb", ".c", ".pcap"]
        for ext in extensions:
            if ext in file_name.rpartition('\\')[2]:
                encryptable = True
        return encryptable
    if __name__ == '__main__':
        root_folder = 'C:\\Users\\Win\\Documents\\'
        if os.path.exists(root_folder):
            file_list = get_file_list(root_folder)
            count = 0
            #'''
            for file_name in file_list:
                #print(file_name)
                if test_file_extension(file_name):
                    print(file_name)
                    #encrypt_file(file_name)
                    #os.rename(file_name, file_name + ".pysa")
            #'''
            '''
            # To decrypt: remove the triple quotes around this block and disable the loop above
            for file_name in file_list:
                decrypt_file(file_name)
                print(file_name.split('.'))
                #os.rename(file_name, file_name.split('.pysa')[0])
            '''
        else:
            print("Folder does not exist")

    Appendix IV. Sequence of Graphical Data During Ransomware Encryption

    The snapshots below show the Prism performance metrics captured at successive timestamps while the simulated ransomware encrypted the Documents folder.

    Figure A4.1. Performance metrics snapshot 1 of 9 during encryption.

    Figure A4.2. Performance metrics snapshot 2 of 9 during encryption.

    Figure A4.3. Performance metrics snapshot 3 of 9 during encryption.

    Figure A4.4. Performance metrics snapshot 4 of 9 during encryption.

    Figure A4.5. Performance metrics snapshot 5 of 9 during encryption.

    Figure A4.6. Performance metrics snapshot 6 of 9 during encryption.

    Figure A4.7. Performance metrics snapshot 7 of 9 during encryption.

    Figure A4.8. Performance metrics snapshot 8 of 9 during encryption.

    Figure A4.9. Performance metrics snapshot 9 of 9 during encryption.

    Appendix V. Python Code to Retrieve KPI Using the REST API

    import json
    import sys
    import requests
    class AHVRestApi():
        # Holds the connection parameters and the HTTP session used for the requests.
        def __init__(self):
            self.serverIpAddress = "NUTANIX SERVER IP ADDRESS"
            self.username = "USERNAME"
            self.password = "PASSWORD"
            # Base URL at which REST services are hosted in Prism Gateway.
            BASE_URL = 'https://%s:9440/api/nutanix/v2.0/'
            self.base_url = BASE_URL % self.serverIpAddress
            self.session = self.get_server_session(self.username, self.password)
        def get_server_session(self, username, password):
            # This helper is not shown in the original listing; it is reconstructed
            # here as an assumption. It uses HTTP basic authentication and skips
            # certificate verification, which is only appropriate for a lab cluster
            # with a self-signed certificate.
            session = requests.Session()
            session.auth = (username, password)
            session.verify = False
            session.headers.update({'Content-Type': 'application/json; charset=utf-8'})
            return session
        def getVirtualDiskInformation(self, virtual_disk_id, start_time_usecs, end_time_usecs, interval_secs, metric):
            # Query one metric for one virtual disk over the given time window.
            URL = (self.base_url + "virtual_disks/" + virtual_disk_id + "/stats/"
                   + "?metrics=" + metric
                   + "&start_time_in_usecs=" + start_time_usecs
                   + "&end_time_in_usecs=" + end_time_usecs
                   + "&interval_in_secs=" + interval_secs)
            serverResponse = self.session.get(URL)
            return json.loads(serverResponse.text)
    if __name__ == "__main__":
        try:
            ahvRestApi = AHVRestApi()
            ckoo_virtual_disk_id = 'c2193bad-29f2-4156-94d8-7bfc928f25c0'
            #win10_virtual_disk_id = '8a337f0a-d6d4-4157-a26a-93729680fb70' #old id
            win10_virtual_disk_id = '5065fba7-0671-409c-a746-eba05c38dda9'
            win2019_virtual_disk_id = 'd2e69200-82c8-4f7f-bc4a-8de856f905cc'
            #start_time_usecs = 1614429000000000 #Saturday, February 27, 2021 7:30:00 AM GMT-05:00
            #start_time_usecs = 1614774600000000 #Wednesday, March 3, 2021 7:30:00 AM GMT-05:00
            #start_time_usecs = 1615077000000000 #Saturday, March 6, 2021 7:30:00 PM GMT-05:00
            start_time_usecs = 1616247600000000 #Saturday, March 20, 2021 8:40:00 AM GMT-05:00
            end_time_usecs = 1616248620000000 #Saturday, March 20, 2021 8:57:00 AM GMT-05:00
            interval_secs = "120"
            metrics = ["controller.random_ops_per_sec",
                       "controller_read_io_bandwidth_kBps",
                       "controller_write_io_bandwidth_kBps",
                       "controller_num_read_iops",
                       "controller_num_write_iops",
                       "hypervisor_avg_read_io_latency_usecs",
                       "hypervisor_avg_write_io_latency_usecs",
                       "controller_total_read_io_size_kbytes",
                       "controller.read_size_histogram_4kB",
                       "controller.read_size_histogram_8kB",
                       "controller.read_size_histogram_16kB",
                       "controller.read_size_histogram_32kB",
                       "controller.read_size_histogram_64kB",
                       "controller.read_size_histogram_512kB",
                       "controller.read_size_histogram_1024kB",
                       "controller.write_size_histogram_4kB",
                       "controller.write_size_histogram_8kB",
                       "controller.write_size_histogram_16kB",
                       "controller.write_size_histogram_32kB",
                       "controller.write_size_histogram_64kB",
                       "controller.write_size_histogram_512kB",
                       "controller.write_size_histogram_1024kB"]
            # Write one line per metric in the form "metric,[values]".
            with open("data.txt", 'w') as my_file:
                for metric in metrics:
                    win10_virtual_disk = ahvRestApi.getVirtualDiskInformation(win10_virtual_disk_id, str(start_time_usecs), str(end_time_usecs), interval_secs, metric)
                    this_value = win10_virtual_disk['stats_specific_responses'][0]['values']
                    print(metric + "," + str(this_value) + "\n")
                    my_file.write(metric + "," + str(this_value) + "\n")
        except Exception as ex:
            print(ex)
            sys.exit(1)

    Appendix VI. Collected Data

    Figure A6.1 shows an example of the data collected through the REST API and prepared for training.

    Figure A6.1. Example of the collected and prepared dataset.
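    One way to turn the `data.txt` file written by the script in Appendix V into a table with one row per sampling interval is sketched below. It assumes each line has the form `metric,[v1, v2, ...]` produced by that script; the function names and the sample values are hypothetical, and any labeling or cleaning applied before training is not reproduced here.

```python
import ast
import csv
import io


def parse_metric_lines(lines):
    """Map each metric name to its list of sampled values."""
    series = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, _, values = line.partition(",")
        series[name] = ast.literal_eval(values)
    return series


def to_rows(series):
    """Transpose the per-metric series into one row per sampling interval."""
    names = list(series)
    return names, [list(row) for row in zip(*(series[n] for n in names))]


# Hypothetical excerpt in the "metric,[values]" layout.
sample = [
    "controller_num_read_iops,[12, 15, 480]",
    "controller_num_write_iops,[8, 9, 455]",
]
header, rows = to_rows(parse_metric_lines(sample))

# Write the table as CSV text, ready to upload as a dataset.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()
```

    Each column then corresponds to one KPI and each row to one interval, which is the layout a tabular training tool expects.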

    Appendix VII. DISKSPD Output

    Command Line: C:\DISKSPD\x86\diskspd.exe -b8k -d30 -o4 -t4 -h -r -w25 -Z1G -L -c20G c:\iotest.dat
    Input parameters:
    timespan:   1
    -------------
    duration: 30s
    warm up time: 5s
    cool down time: 0s
    measuring latency
    random seed: 0
    path: 'c:\iotest.dat'
    think time: 0ms
    burst size: 0
    software cache disabled
    hardware write cache disabled, writethrough on
    write buffer size: 1073741824
    performing mix test (read/write ratio: 75/25)
    block size: 8192
    using random I/O (alignment: 8192)
    number of outstanding I/O operations: 4
    thread stride size: 0
    threads per file: 4
    using I/O Completion Ports
    IO priority: normal
    System information:
    computer name: Win
    start time: 2021/02/27 13:53:11 UTC
    Results for timespan 1:
    *******************************************************************************
    actual test time: 30.01s
    thread count: 4
    proc count: 2
    CPU |  Usage |  User  |  Kernel |  Idle
    -------------------------------------------
       0|  23.02%|   7.92%|   15.10%|  76.98%
       1|  24.43%|  14.64%|    9.79%|  75.57%
    -------------------------------------------
    avg.|  23.72%|  11.28%|   12.45%|  76.28%
    Total IO
    thread |       bytes     |     I/Os     |    MiB/s   |  I/O per s |  AvgLat  | LatStdDev |  file
    -----------------------------------------------------------------------------------------------------
         0 |        84271104 |        10287 |       2.68 |     342.73 |   11.662 |    16.062 | c:\iotest.dat (20GiB)
         1 |        81010688 |         9889 |       2.57 |     329.47 |   12.127 |    17.012 | c:\iotest.dat (20GiB)
         2 |        84172800 |        10275 |       2.67 |     342.33 |   11.676 |    16.164 | c:\iotest.dat (20GiB)
         3 |        80904192 |         9876 |       2.57 |     329.04 |   12.142 |    17.595 | c:\iotest.dat (20GiB)
    -----------------------------------------------------------------------------------------------------
    total:         330358784 |        40327 |      10.50 |    1343.58 |   11.897 |    16.710
    Read IO
    thread |       bytes     |     I/Os     |    MiB/s   |  I/O per s |  AvgLat  | LatStdDev |  file
    -----------------------------------------------------------------------------------------------------
         0 |        62570496 |         7638 |       1.99 |     254.48 |   11.267 |    16.182 | c:\iotest.dat (20GiB)
         1 |        60710912 |         7411 |       1.93 |     246.91 |   11.872 |    16.114 | c:\iotest.dat (20GiB)
         2 |        63102976 |         7703 |       2.01 |     256.64 |   11.461 |    16.900 | c:\iotest.dat (20GiB)
         3 |        60448768 |         7379 |       1.92 |     245.85 |   12.000 |    18.401 | c:\iotest.dat (20GiB)
    -----------------------------------------------------------------------------------------------------
    total:         246833152 |        30131 |       7.84 |    1003.88 |   11.645 |    16.920
    Write IO
    thread |       bytes     |     I/Os     |    MiB/s   |  I/O per s |  AvgLat  | LatStdDev |  file
    -----------------------------------------------------------------------------------------------------
         0 |        21700608 |         2649 |       0.69 |      88.26 |   12.802 |    15.654 | c:\iotest.dat (20GiB)
         1 |        20299776 |         2478 |       0.65 |      82.56 |   12.891 |    19.429 | c:\iotest.dat (20GiB)
         2 |        21069824 |         2572 |       0.67 |      85.69 |   12.321 |    13.705 | c:\iotest.dat (20GiB)
         3 |        20455424 |         2497 |       0.65 |      83.19 |   12.560 |    14.952 | c:\iotest.dat (20GiB)
    -----------------------------------------------------------------------------------------------------
    total:          83525632 |        10196 |       2.65 |     339.70 |   12.643 |    16.050
    total:
      %-ile |  Read (ms) | Write (ms) | Total (ms)
    ----------------------------------------------
        min |      0.442 |      1.430 |      0.442
       25th |      7.628 |      8.612 |      7.870
       50th |      9.215 |     10.198 |      9.463
       75th |     10.993 |     11.980 |     11.277
       90th |     14.605 |     15.977 |     14.952
       95th |     22.197 |     24.319 |     22.712
       99th |     68.312 |     70.150 |     68.543
    3-nines |    285.154 |    274.683 |    285.154
    4-nines |    468.722 |    467.886 |    468.722
    5-nines |    473.159 |    472.866 |    473.159
    6-nines |    473.159 |    472.866 |    473.159
    7-nines |    473.159 |    472.866 |    473.159
    8-nines |    473.159 |    472.866 |    473.159
    9-nines |    473.159 |    472.866 |    473.159
        max |    473.159 |    472.866 |    473.159

    References

    [1] CrowdStrike, 2020 Global Threat Report. Sunnyvale, CA, USA: CrowdStrike, Inc., 2020.

    [2] Cybersecurity and Infrastructure Security Agency, “Protecting against ransomware,” Security Tip ST19-001, Apr. 11, 2019. [Online]. Available: https://www.cisa.gov/news-events/news/protecting-against-ransomware

    [3] The Hacker News, “Everything you need to know about evolving threat of ransomware,” thehackernews.com, Feb. 2021. [Online]. Available: https://thehackernews.com/2021/02/everything-you-need-to-know-about.html

    [4] PurpleSec, “Ransomware statistics, data, and trends,” 2021. [Online]. Available: https://purplesec.us/resources/cyber-security-statistics/ransomware/

    [5] G. Hull, H. John, and B. Arief, “Ransomware deployment methods and analysis: Views from a predictive model and human responses,” Crime Science, vol. 8, no. 2, 2019, doi: 10.1186/s40163-019-0097-9.

    [6] E. Berrueta, D. Morato, E. Magana, and M. Izal, “A survey on detection techniques for cryptographic ransomware,” IEEE Access, vol. 7, pp. 144925-144944, 2019, doi: 10.1109/ACCESS.2019.2945839.

    [7] B. Scott, “Case for HCI in the modern datacenter,” MyPureSupport Community, 2017. [Online]. Available: https://community.mypuresupport.com/case-for-hci-over-legacy-3-tier/

    [8] M. S. Abbasi, H. Al-Sahaf, and I. Welch, “Particle swarm optimization: A wrapper-based feature selection method for ransomware detection and classification,” in Applications of Evolutionary Computation (EvoApplications 2020), Lecture Notes in Computer Science, vol. 12104. Cham, Switzerland: Springer, 2020, pp. 181-196, doi: 10.1007/978-3-030-43722-0_12.

    [9] O. M. K. Alhawi, J. Baldwin, and A. Dehghantanha, “Leveraging machine learning techniques for Windows ransomware network traffic detection,” in Cyber Threat Intelligence, Advances in Information Security, vol. 70. Cham, Switzerland: Springer, 2018, pp. 93-106, doi: 10.1007/978-3-319-73951-9_5.

    [10] R. Moussaileb, N. Cuppens, J.-L. Lanet, and H. Le Bouder, “Ransomware network traffic analysis for pre-encryption alert,” in Foundations and Practice of Security (FPS 2019), Lecture Notes in Computer Science, vol. 12056. Cham, Switzerland: Springer, 2020, pp. 20-38, doi: 10.1007/978-3-030-45371-8_2.

    [11] G. Cusack, O. Michel, and E. Keller, “Machine learning-based detection of ransomware using SDN,” in Proc. 2018 ACM Int. Workshop on Security in Software Defined Networks & Network Function Virtualization (SDN-NFV Sec), 2018, pp. 1-6, doi: 10.1145/3180465.3180467.

    [12] H. Daku, P. Zavarsky, and Y. Malik, “Behavioral-based classification and identification of ransomware variants using machine learning,” in Proc. 2018 17th IEEE Int. Conf. Trust, Security and Privacy in Computing and Communications / 12th IEEE Int. Conf. Big Data Science and Engineering (TrustCom/BigDataSE), 2018, pp. 1560-1564, doi: 10.1109/TrustCom/BigDataSE.2018.00224.

    [13] S. K. Shaukat and V. J. Ribeiro, “RansomWall: A layered defense system against cryptographic ransomware attacks using machine learning,” in Proc. 2018 10th Int. Conf. Communication Systems & Networks (COMSNETS), 2018, pp. 356-363, doi: 10.1109/COMSNETS.2018.8328219.

    [14] G. Ramesh and A. Menen, “Automated dynamic approach for detecting ransomware using finite-state machine,” Decision Support Systems, vol. 138, art. 113400, 2020, doi: 10.1016/j.dss.2020.113400.

    [15] B. A. S. Al-rimy, M. A. Maarof, and S. Z. M. Shaid, “Ransomware threat success factors, taxonomy, and countermeasures: A survey and research directions,” Computers & Security, vol. 74, pp. 144-166, 2018, doi: 10.1016/j.cose.2018.01.001.

    [16] D. W. Fernando, N. Komninos, and T. Chen, “A study on the evolution of ransomware detection using machine learning and deep learning techniques,” IoT, vol. 1, no. 2, pp. 551-604, 2020, doi: 10.3390/iot1020030.

    [17] O. Or-Meir, N. Nissim, Y. Elovici, and L. Rokach, “Dynamic malware analysis in the modern era: a state of the art survey,” ACM Computing Surveys, vol. 52, no. 5, art. 88, pp. 1-48, 2019, doi: 10.1145/3329786.

    [18] A. Mohaisen, O. Alrawi, and M. Mohaisen, “AMAL: High-fidelity, behavior-based automated malware analysis and classification,” Computers & Security, vol. 52, pp. 251-266, 2015, doi: 10.1016/j.cose.2015.04.001.

    [19] D. Sgandurra, L. Munoz-Gonzalez, R. Mohsen, and E. C. Lupu, “Automated dynamic analysis of ransomware: Benefits, limitations and use for detection,” arXiv:1609.03020, Sep. 2016.

    [20] M. E. Ahmed, H. Kim, S. Camtepe, and S. Nepal, “Peeler: Profiling kernel-level events to detect ransomware,” in Computer Security: ESORICS 2021, Lecture Notes in Computer Science, vol. 12972. Cham, Switzerland: Springer, 2021, pp. 240-260, doi: 10.1007/978-3-030-88418-5_12.

    [21] A. Y. Huang, “Towards robust malware detection,” M.Eng. thesis, Dept. Electr. Eng. Comput. Sci., Massachusetts Inst. Technol., Cambridge, MA, USA, 2018.

    [22] A. Fattori, A. Lanzi, D. Balzarotti, and E. Kirda, “Hypervisor-based malware protection with AccessMiner,” Computers & Security, vol. 52, pp. 33-50, 2015, doi: 10.1016/j.cose.2015.03.007.

    [23] N. Paul, S. Gurumurthi, and D. Evans, “Towards disk-level malware detection,” in Proc. Workshop on Code Based Software Security Assessments (CoBaSSA), 2005.

    [24] S. Baek, Y. Jung, A. Mohaisen, S. Lee, and D. Nyang, “SSD-Insider: Internal defense of solid-state drive against ransomware with perfect data recovery,” in Proc. 2018 IEEE 38th Int. Conf. Distributed Computing Systems (ICDCS), 2018, pp. 875-884, doi: 10.1109/ICDCS.2018.00089.

    [25] W. Xie, N. Chen, and B. Chen, “Poster: Incorporating malware detection into flash translation layer,” in Proc. 2020 IEEE Symp. Security and Privacy (Poster Session), 2020.

    [26] A. Continella, A. Guagnelli, G. Zingaro, G. De Pasquale, A. Barenghi, S. Zanero, and F. Maggi, “ShieldFS: A self-healing, ransomware-aware filesystem,” in Proc. 32nd Annu. Computer Security Applications Conf. (ACSAC), 2016, pp. 336-347, doi: 10.1145/2991079.2991110.

    [27] N. Scaife, H. Carter, P. Traynor, and K. R. B. Butler, “CryptoLock (and drop it): Stopping ransomware attacks on user data,” in Proc. 2016 IEEE 36th Int. Conf. Distributed Computing Systems (ICDCS), 2016, pp. 303-312, doi: 10.1109/ICDCS.2016.46.

    [28] A. Kharraz, W. Robertson, D. Balzarotti, L. Bilge, and E. Kirda, “Cutting the Gordian knot: A look under the hood of ransomware attacks,” in Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA 2015), Lecture Notes in Computer Science, vol. 9148. Cham, Switzerland: Springer, 2015, pp. 3-24, doi: 10.1007/978-3-319-20550-2_1.

    [29] D. Sebayan, “How threat modeling can prevent your next ransomware attack,” ThreatModeler, 2019. [Online]. Available: https://threatmodeler.com/

    [30] Datacadamia, “I/O: workload (access pattern),” 2019. [Online]. Available: https://datacadamia.com/io/access_pattern

    [31] J. Layton, “IO patterns: what you do not know can hurt you,” Enterprise Storage Forum, 2013. [Online]. Available: https://www.enterprisestorageforum.com/management/io-patterns-what-you-dont-know-can-hurt-you/

    [32] C. Rossow, C. J. Dietrich, C. Grier, C. Kreibich, V. Paxson, N. Pohlmann, H. Bos, and M. van Steen, “Prudent practices for designing malware experiments: Status quo and outlook,” in Proc. 2012 IEEE Symp. Security and Privacy, 2012, pp. 65-79, doi: 10.1109/SP.2012.14.

    [33] Cuckoo Foundation, “Preparing the host: Cuckoo Sandbox v2.0.7 book,” 2019. [Online]. Available: https://cuckoo.readthedocs.io/en/latest/installation/host/

    [34] D. Murchison, “Home lab series: Cuckoo Sandbox on ESXi,” murchisd.github.io, Jan. 25, 2019. [Online]. Available: https://murchisd.github.io/pr0j3cts/2019/01/25/Cuckoo-Sandbox-and-ESXi.html

    [35] EF Education First, “1000 most common words in English,” 2015. [Online]. Available: https://www.ef.com/wwen/english-resources/english-vocabulary/top-1000-words/

    [36] S. Canny, “python-docx documentation,” 2013. [Online]. Available: https://python-docx.readthedocs.io/

    [37] A. Arrington, “Automate Google image downloads with Python,” Medium, Apr. 19, 2020. [Online]. Available: https://medium.com/@austin_9875/automate-google-image-downloads-with-python-91b633130ba9

    [38] C. Zita, “How to download Google images using Python (2021),” Level Up Coding (Medium), Jan. 25, 2021. [Online]. Available: https://levelup.gitconnected.com/how-to-download-google-images-using-python-2021-82e69c637d59

    [39] DataRobot, “Feature variables,” DataRobot AI Wiki, 2019. [Online]. Available: https://www.datarobot.com/wiki/

    [40] W. Arbash, “Dataset vs ground-truth dataset,” wao.ai, 2019. [Online]. Available: https://wao.ai/blog/dataset-vs-ground-truth-dataset

    [41] Nutanix, “API reference,” Nutanix.dev, 2020. [Online]. Available: https://www.nutanix.dev/api-reference/

    [42] Wikipedia contributors, “Robustness (computer science),” Wikipedia, The Free Encyclopedia, 2019. [Online]. Available: https://en.wikipedia.org/wiki/Robustness_(computer_science)

    [43] G. Berry, “Using Microsoft DiskSpd to test your storage subsystem,” SQLPerformance.com, Aug. 4, 2015. [Online]. Available: https://sqlperformance.com/2015/08/io-subsystem/diskspd-test-storage

    [44] J. Yi, “Use DISKSPD to test workload storage performance,” Azure Stack HCI Documentation, Microsoft Learn, 2020. [Online]. Available: https://learn.microsoft.com/azure-stack/hci/manage/diskspd-overview

    [45] B. Sjerps, “Pinpointing I/O bottlenecks on Linux,” Dirty Cache, Mar. 4, 2011. [Online]. Available: https://bartsjerps.wordpress.com/2011/03/04/io-bottleneck-linux/

    [46] Wikipedia contributors, “Memory access pattern,” Wikipedia, The Free Encyclopedia, 2020. [Online]. Available: https://en.wikipedia.org/wiki/Memory_access_pattern

    [47] G. Holmes, A. Donkin, and I. H. Witten, “WEKA: A machine learning workbench,” in Proc. 2nd Australia and New Zealand Conf. Intelligent Information Systems (ANZIIS), 1994, pp. 357-361.

    [48] Microsoft, “Create machine learning models,” Microsoft Learn Training, 2020. [Online]. Available: https://learn.microsoft.com/training/paths/create-machine-learn-models/

    [49] Microsoft, “What is automated machine learning (AutoML)?,” Azure Machine Learning Documentation, Microsoft Learn, 2020. [Online]. Available: https://learn.microsoft.com/azure/machine-learning/concept-automated-ml

    [50] Microsoft, “Tutorial: Train a classification model with no-code automated ML in the Azure Machine Learning studio,” Microsoft Learn, 2020. [Online]. Available: https://learn.microsoft.com/azure/machine-learning/tutorial-first-experiment-automated-ml

    [51] F. Lazzeri, “How to select algorithms for Azure Machine Learning,” Microsoft Learn, 2020. [Online]. Available: https://learn.microsoft.com/azure/machine-learning/how-to-select-algorithms

    [52] Microsoft, “PCA-based anomaly detection (ML Studio classic),” Azure Machine Learning Studio Module Reference, 2019. [Online]. Available: https://learn.microsoft.com/previous-versions/azure/machine-learning/studio-module-reference/pca-based-anomaly-detection

    [53] Microsoft, “Train and evaluate classification models,” Microsoft Learn Training, 2020. [Online]. Available: https://learn.microsoft.com/training/modules/train-evaluate-classification-models/

    [54] C. Moore, “Detecting ransomware with honeypot techniques,” in Proc. 2016 Cybersecurity and Cyberforensics Conf. (CCC), 2016, pp. 77-81, doi: 10.1109/CCC.2016.14.

    [55] Kaspersky, “What is a honeypot?,” Kaspersky Resource Center, 2020. [Online]. Available: https://usa.kaspersky.com/resource-center/threats/what-is-a-honeypot

    [56] C. Hosterman, “The case for vVols and ransomware,” codyhosterman.com, Mar. 17, 2020. [Online]. Available: https://www.codyhosterman.com/2020/03/the-case-for-vvols-and-ransomware/

    [57] T. Rayner, “Simulating a ransomware attack with PowerShell,” CanITPro Blog, Microsoft TechNet, Jan. 27, 2016. [Online]. Available: https://learn.microsoft.com/archive/blogs/canitpro/simulating-a-ransomware-attack-with-powershell

  • Deploying Nutanix AHV with Pure Storage FlashArray: A Practical Field Guide

    Deploying Nutanix AHV with Pure Storage FlashArray: A Practical Field Guide

    Author: Javier Rodriguez, Managing Technical Architect, ePlus Technology  |  javier.rodriguez@eplus.com

    Why This Architecture Matters Now

    For years, Nutanix was almost synonymous with HCI, where compute and storage live together in the same nodes. That model works exceptionally well for general-purpose workloads, but it has always had a ceiling: when you need more storage capacity or performance, you also have to buy more compute, whether you need it or not.

    The formal partnership between Nutanix and Pure Storage, announced at .NEXT 2025 in May, changes that equation. Nutanix AHV can now run as a compute only platform backed by Pure Storage FlashArray over NVMe/TCP. Each Nutanix AOS vDisk maps directly to a FlashArray volume, which means per VM granularity for snapshots, quality-of-service controls, and replication. You get the operational simplicity of Prism as a unified management plane while the FlashArray handles the storage heavy lifting underneath.

    This post walks through what it actually takes to build that environment, based on the Cisco FlashStack with Nutanix Installation Field Guide (v1.0, December 2025) and supplemental information from Nutanix and Pure Storage documentation.


    Architecture Overview

    The deployment model covered here uses Cisco UCS servers (X-series, C-series, or B-series) as compute only nodes, managed by Cisco Intersight in Intersight Managed Mode (IMM). The nodes connect to Cisco UCS Fabric Interconnects (FIs), and those FIs connect upstream to top of rack (ToR) switches. The Pure Storage FlashArray sits off to the side as a dedicated external storage array, connected to those same ToR switches over NVMe/TCP.

    • No local storage is used for AOS datastores. The nodes are diskless or have local drives used only for the hypervisor boot.
    • Storage traffic travels over dedicated VLANs and dedicated vNIC pairs, separate from management and guest VM traffic.
    • The Nutanix Controller VM (CVM) on each node handles the NVMe/TCP initiator connections to the FlashArray automatically. Administrators do not need to manually configure NVMe initiators.
    • Prism Central (or Prism Element) is the primary management interface for the cluster, while Cisco Intersight manages the UCS hardware layer.

    Software Version Requirements

    Before any hardware gets racked, confirm that all components meet the minimum software versions. Using mismatched versions is one of the most common causes of failed deployments.

    Component                            | Minimum Version          | Notes
    Nutanix AOS                          | 7.5 or later             |
    Nutanix AHV                          | 11.0 or later            |
    Foundation Central                   | 1.10 only                | Do not use 2.x at this time
    Prism Central                        | 7.5 or later             | Required for Licensing
    Nutanix LCM                          | 3.3                      | Included with AOS 7.5
    Pure Storage Purity/FA               | 6.10.3 or later          | Upgrade must be done before installation begins
    Cisco Fabric Interconnect            | 4.3(4.240066) or later   |
    Cisco Intersight Virtual Appliance   | 1.1.5-1 or later         | Older CVA/PVA versions will cause failures
    Cisco UCS X210c-M7 Firmware          | 5.4(0.250048) or later   |
    Cisco UCS C-series M6/M7 Firmware    | 4.3(6.250053) or later   |
    Cisco UCS B-series M5/M6 Firmware    | 5.3(0.250021) or later   |

    Important: Only Foundation Central version 1.10 should be used. The Appliance VM version 2.x is explicitly not supported for this deployment type. Do not use it.
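
    A pre-flight version check along these lines can catch a mismatch before anything is racked. This is a minimal sketch, not part of the field guide: the helper names version_ge and check are hypothetical, the installed versions are placeholders, and it relies on sort -V (version sort) being available on the admin workstation. It only handles plain dotted versions; the Cisco firmware strings with parentheses and the "1.10 only" rule for Foundation Central need their own exact-match handling.

```shell
#!/bin/sh
# version_ge INSTALLED MINIMUM -> succeeds when INSTALLED >= MINIMUM.
# Assumes `sort -V` (version sort) is available; dotted numeric versions only.
version_ge() (
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
)

# check NAME INSTALLED MINIMUM -> prints a one-line verdict
check() {
    if version_ge "$2" "$3"; then
        echo "OK       $1 $2 (minimum $3)"
    else
        echo "TOO OLD  $1 $2 (minimum $3)"
    fi
}

# Installed versions below are placeholders, not real inventory output.
check "Nutanix AOS" "7.5"   "7.5"
check "Nutanix AHV" "11.0"  "11.0"
check "Purity/FA"   "6.9.2" "6.10.3"
```

    Version sort matters here because a plain string comparison would rank 6.9.2 above 6.10.3.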


    IP Address Planning

    IP address planning should be completed before any configuration begins. Retrofitting addressing after the fact wastes time and introduces risk.

    Infrastructure

    • 2 addresses for the Fabric Interconnects
    • 1 address for the Foundation Central Appliance VM
    • 1 optional address for Prism Central / Foundation Central VM

    Per Nutanix Host (five addresses each)

    1. AHV hypervisor management address
    2. Controller VM (CVM) management address
    3. CIMC management address (assigned as a pool in Intersight)
    4. Storage interface address, VLAN 1 (assigned as a pool in Prism Element)
    5. Storage interface address, VLAN 2 (assigned as a pool in Prism Element)

    Pure Storage FlashArray (seven addresses)

    • 1 per controller management interface
    • 1 roaming array management address
    • 1 per NVMe/TCP storage interface (minimum 4 across two controllers and two VLANs)

    Storage addresses must be Layer 2 adjacent to the hosts and cannot traverse a router. Use two separate storage VLANs, one for the A side controller interfaces and one for the B side.
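
    The address counts above add up quickly, so it can help to total them per cluster size before requesting subnets. The sketch below simply sums the lists in this section; the function names, the four node example, and the spare count are assumptions for illustration.

```shell
#!/bin/sh
# Address counts taken from the planning lists in this section.
INFRA=4        # 2 Fabric Interconnects + 1 Foundation Central VM + 1 optional Prism Central VM
PER_HOST=5     # AHV, CVM, CIMC, storage VLAN 1, storage VLAN 2
FLASHARRAY=7   # 2 controller management + 1 roaming management + 4 NVMe/TCP interfaces

# total_addresses NODE_COUNT -> total IP addresses to reserve
total_addresses() {
    echo $(( INFRA + PER_HOST * $1 + FLASHARRAY ))
}

# storage_pool_size NODE_COUNT SPARE -> addresses per storage VLAN pool
storage_pool_size() {
    echo $(( $1 + $2 ))
}

nodes=4   # example cluster size (assumption)
echo "Total addresses for $nodes nodes: $(total_addresses "$nodes")"
echo "Per VLAN storage pool with 4 spare: $(storage_pool_size "$nodes" 4)"
```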


    Step 1: Pure Storage FlashArray Configuration

    The FlashArray setup is best done via the CLI, because some configuration tasks cannot be completed through the Purity GUI. This section assumes the array is already racked, cabled, powered on, and reachable on its management network, with Purity/FA 6.10.3 or later already running.

    Enable and Configure NVMe/TCP Interfaces

    Four Ethernet interfaces need to be assigned to NVMe/TCP. The example below uses interfaces eth10 and eth11 on each controller.

    # Enable the four storage interfaces
    purenetwork eth enable ct0.eth10
    purenetwork eth enable ct0.eth11
    purenetwork eth enable ct1.eth10
    purenetwork eth enable ct1.eth11
    # Assign addresses, MTU 9000, and NVMe/TCP service
    purenetwork eth setattr --address 10.1.61.100/24 --mtu 9000 --servicelist nvme-tcp ct0.eth10
    purenetwork eth setattr --address 10.1.62.100/24 --mtu 9000 --servicelist nvme-tcp ct0.eth11
    purenetwork eth setattr --address 10.1.61.101/24 --mtu 9000 --servicelist nvme-tcp ct1.eth10
    purenetwork eth setattr --address 10.1.62.101/24 --mtu 9000 --servicelist nvme-tcp ct1.eth11

    Use MTU 9000 (jumbo frames) wherever possible. If jumbo frames are not supported end to end in your network, set MTU to 1500 and ensure consistency across all components.

    Create the Realm, Pod, and Administrative User

    # Create the Realm for RBAC segmentation
    purerealm create <realm_name>
    # Create a Pod within the Realm for this Nutanix cluster
    purepod create <realm_name>::<pod_name>
    # Create the management access policy granting admin rights to the Realm
    purepolicy management-access create --role admin --realm <realm_name> <policy_name>
    # Create the user Nutanix will use to authenticate to the array
    pureadmin create <username> --access-policy <policy_name>

    Important: Set a storage quota on the Pod after creation. Without a quota, the array will not accurately report available storage to Nutanix.


    Step 2: Cisco UCS and Intersight Configuration

    Configure Fabric Interconnect A first via serial console or HTTPS Express Setup, setting the management mode to Intersight. After FI-A shows a login prompt, configure FI-B. FI-B will detect the peer and prompt to join the cluster.

    Once the FIs are up, log into Cisco Intersight and claim the UCS domain using the Device ID and Claim Code from the FI web console. Create Resource Groups and an Organization, create and deploy a Domain Profile to the Fabric Interconnects, and set the System QoS Best Effort MTU to 9216 to allow jumbo frames.

    The vNIC layout is where compute only Nutanix deployments require careful attention. Each server needs at least two vNIC pairs: an infrastructure pair for AHV and CVM management traffic, and a dedicated storage pair carrying the NVMe/TCP storage VLANs.

    Infrastructure vNIC naming is case sensitive. Foundation Central will reject the deployment if these names are wrong. For a server with a single VIC, use ntnx-infra-1-A (Slot MLOM, PCIe Order 0, Fabric A, Failover disabled) and ntnx-infra-1-B (Slot MLOM, PCIe Order 1, Fabric B, Failover disabled).
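
    Because a single wrong capital letter fails validation, a quick check of the names before submitting the deployment can save a retry. The function below is a hypothetical helper, not a Foundation Central or Intersight tool, and it only covers the two single VIC names given above.

```shell
#!/bin/sh
# valid_infra_vnic NAME -> succeeds only on an exact, case-sensitive match.
valid_infra_vnic() {
    case "$1" in
        ntnx-infra-1-A|ntnx-infra-1-B) return 0 ;;
        *) return 1 ;;
    esac
}

# Two of these sample names are deliberately wrong to show the rejection path.
for name in ntnx-infra-1-A ntnx-infra-1-b NTNX-infra-1-A; do
    if valid_infra_vnic "$name"; then
        echo "$name: ok"
    else
        echo "$name: rejected (check capitalization)"
    fi
done
```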


    Step 3: Foundation Central Deployment

    For first-time cluster deployments with no existing Nutanix infrastructure, the Appliance VM is the simplest path. Deploy it with 2 vCPUs and 4 GB RAM with a static IP address. DHCP is not supported. Run the setup script from the local console after booting and access the GUI at https://<FC_IP>:9440.

    Upload the AOS installation package, its metadata JSON file, and the AHV ISO via API calls to the Appliance VM. Retrieve the hosted file URLs by browsing to http://<FC_IP>:8053/files/images and enter those URLs into the cluster deployment wizard.


    Step 4: Nutanix Cluster Deployment

    Connect Foundation Central to Cisco Intersight by entering the API Key ID and Secret Key under Settings. The API key user must have at minimum Server Administrator privileges in the relevant Intersight Organization.

    1. Onboard the Cisco UCS servers by selecting Intersight Managed Mode and choosing the target nodes
    2. Select the onboarded nodes and click Create Cluster
    3. Select Compute Cluster (not HCI Cluster, since there is no local storage)
    4. Configure the infrastructure vNIC pair and at least one dedicated storage vNIC pair
    5. Assign IP addresses and hostnames. Use Bulk Configuration to set sequential addresses efficiently
    6. Enter the download URLs for AOS, the AOS metadata file, and the AHV ISO
    7. Set NTP servers, DNS servers, and timezone
    8. Select the Foundation Central API Key and click Create Deployment

    Deployments without firmware changes typically complete in 75 to 90 minutes. If firmware upgrades are required, add 60 to 90 minutes.


    Step 5: External Storage Connectivity

    After the Nutanix cluster is up and accessible in Prism Element, select I’ll Do This Later when prompted to set up external storage. The virtual switch configuration must be done first.

    1. Edit the default virtual switch vs0 to remove any storage vNICs, leaving only the infrastructure vNIC pairs as uplinks
    2. Create a new dedicated storage virtual switch, assign it the storage vNIC pair, set MTU to 9000, and Bond Type to Active-Active with MAC pinning
    3. Create one External Storage Interface per storage VLAN, associated with the new storage virtual switch, with an IP pool large enough for one address per node plus room for growth. Enable the External Storage option and set MTU to 9000.

    Before attaching the array, verify jumbo frame connectivity from the CVMs to all four FlashArray storage interfaces:

    ping -M do -s 8972 10.1.61.100
    ping -M do -s 8972 10.1.62.100
    ping -M do -s 8972 10.1.61.101
    ping -M do -s 8972 10.1.62.101

    All four tests should complete with 0% packet loss. From Prism Element, click Attach External Storage, select Pure Storage FlashArray, enter the clustered management IP, the Realm administrative username and password, select the Realm and Pod, and click Attach. The connection typically completes within 30 to 60 seconds.
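
    The 8972 byte payload in the ping tests comes from the 9000 byte MTU minus the 20 byte IPv4 header and the 8 byte ICMP header. If the environment runs at MTU 1500 instead, the payload changes accordingly. The sketch below derives the size and prints the test commands rather than running them; the function names are illustrative and the target addresses are the example interfaces from Step 1.

```shell
#!/bin/sh
# icmp_payload MTU -> largest ICMP echo payload that fits in one unfragmented IPv4 packet
# (MTU minus 20-byte IPv4 header minus 8-byte ICMP header).
icmp_payload() {
    echo $(( $1 - 20 - 8 ))
}

# Example FlashArray storage interfaces from Step 1.
TARGETS="10.1.61.100 10.1.62.100 10.1.61.101 10.1.62.101"

# print_ping_tests MTU -> prints one ping command per target (does not execute them)
print_ping_tests() {
    size=$(icmp_payload "$1")
    for ip in $TARGETS; do
        # -M do forbids fragmentation, so an undersized path MTU shows up as a failure
        echo "ping -M do -c 4 -s $size $ip"
    done
}

print_ping_tests 9000
```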


    Step 6: Post-Installation Tasks

    Change the default passwords on three accounts on AHV (root, admin, and nutanix) and on the CVM nutanix account. Afterward, run the NCC default password check:

    ncc health_checks system_checks default_password_check

    Run a full NCC health check from Prism Element and resolve all failures and warnings before the cluster goes into production. LCM 3.3 ships with AOS 7.5. Run an inventory job to see available Nutanix software updates. Note that LCM will not perform server firmware updates for compute only nodes connected to external storage.


    Things Worth Calling Out

    vNIC naming. The ntnx-infra-1-A and ntnx-infra-1-B names in the LAN Connectivity Policy are case sensitive. A single capitalization error will cause the deployment to fail at validation. Fix it in the Intersight policy and resubmit.

    Foundation Central version. Version 1.10 only. The Appliance VM version 2.x exists and is available, but it does not work with Cisco UCS hardware in this context. Do not use it.

    Jumbo frames. The MTU 9000 setting at the vNIC level and the virtual switch level only permits jumbo frames to pass; it does not enforce them. All switching infrastructure between the hosts and the FlashArray interfaces must also support 9000 byte frames. Use the ping test above to verify before attaching the array.

    Storage VLAN design. Use two separate storage VLANs, one for the A side controller interfaces and one for the B side. Storage addresses within each VLAN must be in the same Layer 2 domain as the hosts and cannot be routed.

    Pod quota. Without setting a storage quota on the Pure Storage Pod, Nutanix will not accurately display available storage capacity. Set it immediately after verifying the Realm and Pod are created.


    Summary

    The Nutanix AHV compute only model backed by Pure Storage FlashArray over NVMe/TCP represents a meaningful shift in how converged infrastructure is deployed. It separates the scaling concerns for compute and storage, delivers per VM storage granularity at the array level, and maintains a single management plane through Prism for day to day operations.

    The installation process involves more moving parts than a traditional HCI cluster. Cisco Intersight, Foundation Central, Purity CLI, and Prism Element all play distinct roles, and the sequencing matters. Following the steps in order and confirming each layer before moving to the next is the most reliable path to a successful deployment.

    For questions about this architecture or assistance planning a deployment, reach out at javier.rodriguez@eplus.com.

  • Migrating Virtualized Workloads to Nutanix AHV: A Phased Approach That Works in Production

    Migrating Virtualized Workloads to Nutanix AHV: A Phased Approach That Works in Production


    Every VMware to Nutanix AHV migration project comes with the same fundamental tension: you want to move workloads to a better platform without disrupting the people who depend on those workloads every day. The good news is that Nutanix Move, when paired with a well defined phased methodology, handles that tension well. This post walks through how we approach these engagements at ePlus, covering the migration mechanics, database specific considerations, and the operational steps that close out each phase cleanly.

    How Nutanix Move Works


    Nutanix Move is a cross hypervisor mobility tool that automates VM migrations from VMware ESXi, Hyper-V, or public cloud sources to Nutanix AHV. The core model is straightforward: Seed, Sync, Cutover.

    1. Discovery: Connect Move to the source environment (vCenter, standalone ESXi, or Hyper-V) and the target Nutanix cluster. Move inventories the VMs and validates compatibility before anything touches production data.
    2. Data Seeding: Move creates a placeholder VM on the AHV side and begins copying virtual disks from the source. This initial seed runs in the background while the source VM stays live.
    3. Changed Block Tracking (CBT): After the initial copy, Move uses CBT to replicate only blocks that have changed since the last sync. This keeps the replication delta small and the eventual cutover window short.

    Why Daytime Replication Is Safe

    A common concern when planning migrations is whether running replication during business hours will hurt production performance. In practice, it does not, and here is why.

    Non Disruptive Snapshots
    Move reads source data through the hypervisor's native snapshot and change tracking mechanisms (VMware snapshots with CBT, for example). The VM stays powered on and users experience no interruption.
    Network Throttling
    Move supports bandwidth throttling on migration traffic so replication does not compete with production traffic on shared links during peak hours.
    Background Operation
    The seeding phase is a background task. End users are fully isolated from the process because their application is still running on the source hypervisor.
    Incremental Efficiency
    After the initial seed, subsequent syncs only move changed blocks, so the bandwidth consumption of ongoing replication is a fraction of the initial transfer.
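
    A rough estimate makes the difference between the initial seed and the ongoing deltas concrete. The sketch below is back-of-the-envelope arithmetic, not a Move feature: the sizes and the throttle value are assumptions, it uses decimal units (1 GB = 8000 megabits), and it ignores protocol overhead and compression.

```shell
#!/bin/sh
# transfer_minutes SIZE_GB LINK_MBPS -> whole minutes to move SIZE_GB at LINK_MBPS
# Decimal units: 1 GB = 8000 megabits. Result is rounded down.
transfer_minutes() {
    echo $(( $1 * 8000 / $2 / 60 ))
}

seed_gb=2000         # total used capacity to seed (assumption)
daily_change_gb=50   # average daily change (assumption)
throttle_mbps=1000   # bandwidth cap applied to migration traffic (assumption)

echo "Initial seed: about $(transfer_minutes "$seed_gb" "$throttle_mbps") minutes"
echo "Daily delta:  about $(transfer_minutes "$daily_change_gb" "$throttle_mbps") minutes"
```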

    The Cutover Process

    The cutover is the only step that involves any downtime, and even that window is typically measured in minutes per VM. The sequence is deterministic and should be documented in the project plan before any work begins.

    1. Final Sync: Move performs one last incremental sync to capture the most recent changed blocks.
    2. Graceful Shutdown: The source VM is powered off cleanly, not forcefully terminated.
    3. Final Delta: A final incremental pass captures any blocks written during the shutdown sequence.
    4. Activation: Move installs the required VirtIO drivers for AHV, optionally reconfigures IP addressing, and powers the VM on within the Nutanix cluster.

    Practical Note
    For most general purpose VMs, the combined downtime from the final sync through power on in AHV is under five minutes. Database VMs with large in flight transactions may take slightly longer depending on the final delta size.

    Rollback Strategy

    One of the most important things to communicate to stakeholders before a cutover is that rollback is not a complex recovery procedure. It is simply reversing a power state.

    Because Move does not delete or modify the source VM during cutover (it only powers it off and disconnects its network interface), the path back to the original state requires no data restoration. If a migrated VM does not perform as expected on AHV, the steps are:

    1. Power off the VM on the Nutanix AHV side.
    2. Reconnect the network interface on the source VM.
    3. Power on the source VM in the original environment.

    The source disks remain completely untouched throughout the process, so this rollback takes seconds rather than hours. It also means stakeholder sign off on a cutover carries much lower risk than it would in a traditional migration approach.

    Special Migration Scenarios

    Not every VM is a candidate for a straightforward Move migration. A few categories require a different approach:

    • Legacy Operating Systems: Windows Server 2003 and Linux distributions running older kernel versions are not supported by modern versions of Nutanix Move or by the standard AHV VirtIO driver set. These workloads cannot use the standard Move migration path and require an alternative approach such as a cold clone, a bare metal backup restoration, or an application level migration to a newly provisioned VM.
    • Physical Hardware Pass through: VMs with PCI pass through devices or Raw Device Mappings (RDMs) require manual reconfiguration on the target side.
    • Shared Disk Clustering: Certain older Oracle RAC or MSCS configurations that rely on shared SCSI bus emulation need architectural review before migration.

    For these cases, the alternatives range from a manual cold clone, to an application level migration, to a fresh OS installation with data restoration from backup. The right path depends on the workload, and that decision should be made during technical discovery before the project schedule is finalized.


    Database Migration Methodology

    Databases deserve separate treatment because the consequences of a failed migration, or even a migration that succeeds but lands on a poorly configured target, are higher than for stateless application servers. We cover both Microsoft SQL Server and Oracle here.

    Storage Architecture for Database VMs

    Nutanix gives database workloads two primary storage paths: native vDisks and Nutanix Volume Groups.

    • Native vDisks are the default for AHV VMs and are simple to manage through Prism. Starting with AOS 6.x, the Autonomous Extent Store (AES) improved local sharding for native vDisks, so they are no longer as constrained as they were in earlier releases. That said, a single CVM still serves as the primary I/O path for a given vDisk, which means very high throughput workloads can reach a performance ceiling at the CVM level.
    • Nutanix Volume Groups (VG) are collections of vDisks presented as block devices. For AHV, VGs can be direct attached, appearing as native SCSI devices to the guest OS. When Volume Group Load Balancing (VGLB) is enabled, the system shards vDisks across all CVMs, removing the single CVM I/O path and allowing the database to draw on the aggregate throughput of the entire cluster’s Stargate processes.

    iSER Support: For the highest performance requirements, Nutanix supports iSER (iSCSI Extensions for RDMA), which bypasses the TCP/IP stack entirely to reduce latency and CPU overhead between the guest and the CVM. This is worth evaluating for latency sensitive OLTP workloads.

    AHV Specific Tuning for Databases

    Several AHV configuration decisions have a direct and measurable impact on database performance.

    • vCPU to pCPU Ratio: For production databases, size assuming 1 vCPU equals 1 physical core, not one hyperthreaded thread. Oversubscription introduces CPU Ready Time, which is particularly harmful to latency sensitive query workloads. Target below 5% CPU Ready.
    • Memory Reservations: Reserve 100% of assigned VM memory for SQL Server and Oracle VMs. AHV memory reclamation through ballooning or swapping can cause significant and hard to diagnose latency spikes in database workloads.
    • Huge Pages: AHV uses 2 MB Huge Pages to reduce Translation Lookaside Buffer (TLB) pressure. Ensure the guest OS is configured to use large page allocations to take advantage of this.
    • vNUMA: For VMs larger than a single physical socket, enable vNUMA and match the virtual topology to the physical hardware. This allows the database engine to schedule threads and memory access with NUMA awareness. Disable CPU hot add, as enabling it disables vNUMA and can cause performance degradation of up to 30%.

    AOS Features That Matter for Databases

    Data Locality
    AOS stores a VM’s data on the same physical node where the VM runs. Read I/O is served locally without network traversal, which reduces database read latency materially.
    AHV Turbo (Frodo I/O Path)
    Bypasses traditional QEMU emulation with a multi-queue I/O path that scales with the number of vCPUs, delivering higher I/O capacity and lower CPU overhead for storage intensive workloads.
    Nutanix Blockstore
    A block management system that moves device interactions into user space, eliminating context switching and kernel driver overhead for data disks.
    VGLB for OLAP
    Volume Group Load Balancing distributes I/O across all CVMs in the cluster. Critical for high throughput OLAP and reporting workloads that can saturate a single CVM.

    Microsoft SQL Server Migration Options

    There are three viable paths for SQL Server migrations, and the right choice depends on the deployment type and the acceptable downtime window.

    • Nutanix Move: The simplest path for standalone instances. Move handles disk conversion to AHV RAW format, VirtIO driver injection, and IP configuration. Best suited for standalone instances where a brief cutover window is acceptable.
    • Always On Availability Groups: Build a new SQL VM on AHV, join it to the existing Windows Server Failover Cluster (WSFC), and add it as a new secondary AG replica. Once synchronized, perform a planned manual failover to promote the Nutanix based node, then decommission the old nodes. This approach reduces cutover risk for business critical SQL workloads and can achieve near zero application downtime.
    • Backup and Restore: Take a full backup of the source database, restore it on a pre staged SQL VM on AHV using WITH NORECOVERY, and during the cutover window take a tail log backup, restore it with WITH RECOVERY, and redirect applications to the new instance.
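
    The backup and restore option follows a fixed order of T-SQL statements, which the sketch below prints for a given database. The function name, database name, and share path are placeholders, and the script only prints text. It assumes no intermediate log backups were taken between the full backup and the tail log backup; any that exist would need to be restored in sequence WITH NORECOVERY before the final step, and file relocation (WITH MOVE) may also be required on the target.

```shell
#!/bin/sh
# print_restore_plan DB BACKUP_DIR -> prints the T-SQL sequence for a backup and restore cutover.
# Nothing is executed against SQL Server; this only generates text for review.
print_restore_plan() {
    cat <<EOF
-- 1. On the source, ahead of the cutover window
BACKUP DATABASE [${1}] TO DISK = N'${2}\\${1}_full.bak';
-- 2. On the AHV target, leave the database in the restoring state
RESTORE DATABASE [${1}] FROM DISK = N'${2}\\${1}_full.bak' WITH NORECOVERY;
-- 3. On the source, during the cutover window (tail log backup)
BACKUP LOG [${1}] TO DISK = N'${2}\\${1}_tail.trn' WITH NORECOVERY;
-- 4. On the AHV target, bring the database online
RESTORE LOG [${1}] FROM DISK = N'${2}\\${1}_tail.trn' WITH RECOVERY;
EOF
}

# Placeholder database name and UNC path.
print_restore_plan SalesDB '\\fileserver\sqlbackup'
```

    Taking the tail log backup WITH NORECOVERY also leaves the source database in the restoring state, which prevents new writes from landing on the old instance after cutover.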

    Oracle Migration Options

    • Nutanix Move: Recommended for migrating the Oracle VM as is from vSphere to AHV when the VM itself is in Move’s compatibility matrix. Move handles VirtIO driver injection automatically.
    • RMAN Active Duplication: Use Oracle Recovery Manager to perform an active duplication from the source to a new Oracle VM on AHV. The source database remains online until the final switchover, minimizing the downtime window.
    • Data Guard: Set up a physical standby on the Nutanix cluster, synchronize it via RMAN, and then perform a Data Guard switchover to promote the Nutanix instance to primary. This is the lowest risk option for Oracle databases with strict RPO/RTO requirements.
    • Oracle RAC with Nutanix Volumes: For RAC deployments, Nutanix Volumes provide the shared block storage required by clusterware. Volume Groups should be attached via iSCSI and configured with SCSI-3 Persistent Reservations.

    SQL Server Best Practices on AHV

    These configurations should be treated as baseline for any production SQL Server on Nutanix, whether migrated or newly deployed.

    Storage Layout

    • Use at least four vDisks to distribute data files, log files, TempDB, and the OS independently.
    • Format all data and log volumes with a 64 KB NTFS allocation unit size.
    • Do not use Windows Dynamic Disks or in guest volume managers. Add vDisks directly to the VM instead.
    • Keep OS, SQL binaries, user database data, logs, and TempDB on separate volumes.

    Instance Level Tuning

    • Instant File Initialization (IFI): Grant the SQL Server service account the “Perform Volume Maintenance Tasks” privilege to enable IFI. This eliminates zero initialization overhead during data file creation and auto growth events. IFI applies only to data files (.mdf and .ndf). Log files (.ldf) are always zero initialized regardless of this setting. Starting with SQL Server 2016, IFI can also be enabled directly from the installation wizard.
    • Lock Pages in Memory (LPIM): Enable LPIM to prevent Windows from paging the SQL Server buffer pool to disk. Max Server Memory must be set correctly before enabling LPIM to avoid starving the guest OS.
    • Max Server Memory: For mid to large VMs, leave 6 to 8 GB for the OS. For VMs under 32 GB of RAM, 4 GB is often sufficient. A practical formula: reserve 10% of total RAM for the OS, with a ceiling of around 8 GB unless SSIS or SSRS also run on the same instance.
    • MAXDOP: Set MAXDOP to the number of logical cores within a single vNUMA node. For SQL Server 2016 and later, the updated guidance is to use either 8 or the number of cores per NUMA node, whichever is smaller.
    • Cost Threshold for Parallelism (CTFP): Increase from the default of 5. A value of 50 is a common baseline for OLTP workloads; hybrid environments sometimes use a value in the 25 to 50 range.
    • TempDB: Match the number of data files to the logical processor count when that count is 8 or fewer. Start at 8 data files when the logical processor count exceeds 8. Only increase beyond 8 (in increments of 4) if PAGELATCH_UP or PAGELATCH_SH waits confirm actual contention.
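
    The sizing rules in this list reduce to a few lines of arithmetic. The sketch below applies them to an example VM shape and prints matching sp_configure statements. The helper names and the VM shape are assumptions, the 10% OS reserve is rounded up to a whole GB, and the output is a starting point to review rather than a script to run unmodified.

```shell
#!/bin/sh
# min A B -> smaller of two integers
min() { if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi; }

# os_reserve_gb TOTAL_GB -> 10% of RAM rounded up, capped at 8 GB
os_reserve_gb() { min $(( ($1 + 9) / 10 )) 8; }

# max_server_memory_mb TOTAL_GB -> value for 'max server memory (MB)'
max_server_memory_mb() { echo $(( ($1 - $(os_reserve_gb "$1")) * 1024 )); }

# maxdop CORES_PER_NUMA_NODE -> 8 or cores per NUMA node, whichever is smaller
maxdop() { min "$1" 8; }

# tempdb_files LOGICAL_CPUS -> starting number of TempDB data files
tempdb_files() { min "$1" 8; }

ram_gb=128; numa_cores=12; logical_cpus=24   # example VM shape (assumption)

cat <<EOF
EXEC sp_configure 'show advanced options', 1; RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', $(max_server_memory_mb "$ram_gb"); RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', $(maxdop "$numa_cores"); RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 50; RECONFIGURE;
-- TempDB data files to start with: $(tempdb_files "$logical_cpus")
EOF
```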

    SQL Server Baseline Configuration Summary

    Setting | Recommended Baseline | Reason
    IFI | Enabled | Eliminates zero initialization overhead for data files during creation and auto growth.
    LPIM | Enabled | Prevents Windows from reclaiming the SQL Server buffer pool. Requires Max Server Memory to be set first.
    Max Server Memory | Total RAM minus 4 to 8 GB (or 10% of total RAM) | Prevents SQL Server from starving the guest OS.
    MAXDOP | 8 or cores per NUMA node, whichever is smaller | Keeps parallel query execution within a single NUMA domain.
    CTFP | 50 (or 25 to 50 for hybrid workloads) | Prevents low cost queries from triggering parallelism on modern multi core hardware.
    TempDB | Match logical processor count up to 8; increase by 4 only when contention is confirmed | Reduces allocation contention. All files must be equally sized with identical growth settings.

    Oracle Best Practices on AHV

    Oracle on Nutanix AHV benefits from the same platform level advantages as any other workload, but the database engine has enough specific tuning requirements that it warrants its own treatment.

    Memory Allocation: SGA and PGA

    Reserve approximately 10 percent of the total VM memory for the guest OS and file cache. Of the remaining 90 percent, allocate 80 percent to the System Global Area (SGA) and the remaining 20 percent to the Program Global Area (PGA). Memory reservations should be set to 100% of the assigned VM memory. Memory overcommit is not recommended for Oracle workloads.
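
    As a worked example of the split described above, the sketch below computes the three allocations for a given VM size. The function name and the 100 GB example are assumptions, and values are rounded down to whole megabytes.

```shell
#!/bin/sh
# oracle_memory_split TOTAL_MB -> prints "os_mb sga_mb pga_mb"
# 10% for the guest OS, then an 80/20 split of the remainder between SGA and PGA.
oracle_memory_split() (
    os=$(( $1 * 10 / 100 ))
    rest=$(( $1 - os ))
    sga=$(( rest * 80 / 100 ))
    pga=$(( rest - sga ))
    echo "$os $sga $pga"
)

vm_mb=$(( 100 * 1024 ))   # example: 100 GB VM (assumption)
oracle_memory_split "$vm_mb" | {
    read -r os sga pga
    echo "Guest OS: $os MB, SGA: $sga MB, PGA: $pga MB"
}
```

    In effect the SGA receives 72 percent and the PGA 18 percent of total VM memory.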

    Storage Layout and Disk Groups

    Nutanix Database Service (NDB) provisions multiple vDisks spread across ASM disk groups to maximize throughput across the Distributed Storage Fabric. The two primary disk groups are DATADG for database data files and RECODG for redo logs and archive files. For Oracle RAC, a third disk group CRSDG is required for Grid Infrastructure and clusterware files.

    Disk Group         | Small or Medium (500 GB and under) | Large (501 GB and above)
    CRSDG (RAC only)   | 3 vDisks                           | 3 vDisks
    DATADG             | 4 vDisks                           | 8 vDisks
    RECODG             | 2 vDisks                           | 4 vDisks
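
    When scripting layout checks, the sizing table can be expressed as a small lookup. This is an illustrative helper only, with a hypothetical function name; it is not how NDB itself decides the layout.

```shell
#!/bin/sh
# vdisk_count GROUP DB_SIZE_GB -> number of vDisks from the sizing table
vdisk_count() {
    case "$1" in
        CRSDG)  echo 3 ;;   # RAC only, same for both size classes
        DATADG) if [ "$2" -le 500 ]; then echo 4; else echo 8; fi ;;
        RECODG) if [ "$2" -le 500 ]; then echo 2; else echo 4; fi ;;
        *)      echo "unknown disk group: $1" >&2; return 1 ;;
    esac
}

db_gb=750   # example database size (assumption)
for group in DATADG RECODG; do
    echo "$group: $(vdisk_count "$group" "$db_gb") vDisks"
done
```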

    ASM Configuration Options

    Nutanix supports ASMFD (ASM Filter Driver), ASMLib, and udev rules for ASM disk mappings. ASMFD is the preferred method on modern Linux distributions. All ASM disks should be placed on vDisks in an AOS storage container with inline compression enabled and deduplication disabled.

    Network Design for Oracle RAC

    Oracle RAC requires a public network for client connections and a private interconnect for cache fusion, each on its own VLAN. Mixing them on the same VLAN introduces the risk of cache fusion traffic competing with client traffic. When using NDB to provision Oracle RAC, NDB manages IP address assignment across public, private, and virtual (SCAN and VIP) network types.

    RAC and Nutanix Volumes: Oracle RAC requires shared storage for the CRSDG disk group. On AHV, this is provided through Nutanix Volume Groups attached via iSCSI with SCSI-3 Persistent Reservations enabled. This is a prerequisite for RAC clusterware to function correctly.

    Oracle Patching with NDB

    NDB uses an out of place patching model for Oracle. Rather than patching a running Oracle home directly, the process involves provisioning a new database VM from an existing software profile, manually applying the patch set to that VM, and then creating a new software profile version from the patched VM. Once published, that version becomes available to all Oracle VMs managed by NDB. Patching can be performed in either a rolling or non rolling fashion for Oracle RAC environments.

    Time Machine Backup and Recovery for Oracle

    NDB Time Machine creates application consistent snapshots of Oracle databases along with copies of transaction log files. An SLA attached to the time machine controls snapshot frequency and retention. Point in time recovery is available as long as both a base snapshot and the covering transaction logs exist for the target timestamp. NDB restores the vDisks from the appropriate snapshot and then applies log files forward to bring the database to a consistent state.
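
    The recoverability condition can be modeled as a simple range check: the target time must fall at or after a base snapshot and no later than the end of the captured logs. The sketch below is a simplified model with hypothetical names and example timestamps, not an NDB API call, and it assumes the log chain between the snapshot and the target is unbroken.

```shell
#!/bin/sh
# pitr_possible SNAPSHOT_TS LOG_END_TS TARGET_TS (all Unix epoch seconds)
# Succeeds when the target lies at or after the base snapshot and within log coverage.
pitr_possible() {
    [ "$3" -ge "$1" ] && [ "$3" -le "$2" ]
}

snapshot=1700000000   # base snapshot time (example value)
log_end=1700007200    # last second covered by captured logs (example value)
target=1700003600     # requested recovery point (example value)

if pitr_possible "$snapshot" "$log_end" "$target"; then
    echo "Recoverable: restore the snapshot, then roll logs forward $(( target - snapshot )) seconds"
else
    echo "Not recoverable: no snapshot and log coverage for the target time"
fi
```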

    Decommissioning Protocol

    The migration is not complete when the VM powers on successfully on AHV. A structured decommissioning process ensures the legacy environment is cleaned up safely.

    Step | Action | Owner
    1 | Source VMs remain powered off with NIC disconnected for a 48 to 72 hour burn in period to prevent IP conflicts. | Infrastructure Team
    2 | Confirm with Application Owners that performance and stability on AHV are acceptable after the burn in period. | Project Lead
    3 | Archive a final backup of the source VM according to the organization’s retention policy before deletion. | Backup Admin
    4 | Remove the VM from the source cluster inventory. | Infrastructure Team
    5 | Update the CMDB or asset tracker to reflect the VM’s new hypervisor and decommission the legacy record. | IT Operations

    Technical Discovery Requirements

    The quality of the discovery work done before migration determines how smoothly everything else goes. At a minimum, the following information should be gathered before any migration plan is finalized.

    General Infrastructure

    • Specific vSphere version and ESXi build number in use on source hosts
    • Networking configuration: LACP, Jumbo Frames (MTU 9000), or standard configuration
    • IP retention requirement: retain existing IPs after migration or assign new IPs on AHV
    • Guest OS list with versions and BIOS/UEFI boot mode for each VM in scope

    SQL Server Environments

    • SQL Server versions and editions (Standard vs. Enterprise) deployed
    • Deployment type: Standalone, Failover Cluster Instance (FCI), or Always On AG
    • Current vCPU to physical core allocation and whether LPIM is already configured
    • vDisk layout per VM: number of disks, purpose (Data, Log, TempDB), and whether any single large data files exist that should be split
    • Dependencies on MSDTC, Linked Servers, or SQL Agent Jobs that require documentation before cutover

    Oracle Environments

    • Oracle versions in scope and whether instances are Single Instance or RAC
    • Shared storage configuration for RAC: ASM with ASMLib, ASMFD, or udev rules
    • Huge Pages configuration status in the guest OS
    • Existing RMAN backup workflows or Data Guard standbys that can be leveraged
    • Source platform architecture: if any workloads currently run on AIX or Solaris (SPARC), be aware that Nutanix Move is strictly an x86-to-x86 tool and cannot be used for these migrations. AIX and Solaris on SPARC are Big-Endian, while Nutanix AHV runs exclusively on x86-64 (Little-Endian). Cross-endian migrations require a fully manual path using RMAN CONVERT for Oracle or an application level export and restore, and should be scoped separately from the rest of the Move migration plan.

    Migration Constraints

    • Maximum acceptable maintenance window for final cutover
    • Average daily change rate for production databases (drives seeding bandwidth planning)
    • Top 10 application functions or queries to validate Day 1 performance after migration
    • Total allocated versus used storage per database environment, plus expected annual growth
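    To show how the change rate and the maintenance window turn into a bandwidth figure, here is a small worked sketch. The helper name seed_mbps and the sample numbers are hypothetical, and the arithmetic assumes decimal units (1 GB = 8,000 megabits) with no allowance for protocol overhead or compression.

```shell
# Print the sustained throughput in Mbit/s needed to move DATA_GB within HOURS.
# Assumption: decimal units, so 1 GB = 8 * 1000 megabits.
seed_mbps() {
  LC_ALL=C awk -v gb="$1" -v hours="$2" \
    'BEGIN { printf "%.1f\n", (gb * 8 * 1000) / (hours * 3600) }'
}

# Keeping up with a 500 GB daily change rate, spread over 24 hours:
#   seed_mbps 500 24    # prints 46.3
# Moving the same 500 GB inside an 8 hour window:
#   seed_mbps 500 8     # prints 138.9
```

    If the link cannot sustain the first number, replication never catches up with the source; the second number shows how quickly the requirement grows as the window shrinks.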
  • Insights from a Nutanix Migration Specialist

    My work life as an IT specialist has always been quite varied.

    I spent part of my time installing traditional datacenter infrastructure, some of my time implementing cybersecurity solutions, and bits and pieces here and there, working on projects with a number of different technology vendors.

    But over the past 18 months, my main focus has been migrating customers’ virtualization environments to Nutanix.

    The timing lines up with some big shakeups in the tech industry, as well as the continued growth of hyperconverged infrastructure (HCI). I heard my customers worry that support quality would decline for their existing environments, or that innovation might stall. In reality, what my customers have mostly seen is severe sticker shock on their renewal bills—partly due to inflation that has hit all sectors, but also due to dramatic changes to vendor licensing agreements. 

    Some customers have seen 3x, 5x, or even 10x increases in their virtualization costs, practically overnight. These are customers that have been with a vendor for 15 or 20 years, in many cases, and many had come to view their virtualization environments as something of a commodity with a stable pricing structure. But changes to licensing agreements have upended this stability. Before, customers could mostly purchase individual product licenses as needed, but they’re now being funneled into bundled packages with add-on features they don’t want and can’t use.

    Some large enterprises are able to absorb these new costs. But for others, especially small and medium-sized companies, the impact on their business is comparable to tripling their rent or adding a zero to their monthly utility bills. These smaller customers also find themselves in a poor negotiating position with tech giants.

    For example, we recently worked with Norfolk Public Schools in Virginia to migrate to Nutanix. The district was facing an eye-popping 680% cost increase if it stayed with its previous provider, but a five-year licensing agreement saved it approximately $2 million.

    For customers like Norfolk Public Schools, the numbers of the new virtualization landscape simply don’t add up. And for the first time, many of these organizations are willing to seriously consider a change.

    Even non-technical people can understand the anxiety that comes with switching technology platforms. (Think of how rarely people change to a phone with a different operating system.) Most of my customers never even considered switching from their existing virtualization provider until recently. After all, virtualization is a foundational technology that supports their entire business. Many system administrators have built their careers and expertise around the environment they know, developed their own workflows around its interface and capabilities, and integrated their entire application environment with that platform.

    Most importantly, businesses have come to rely on the stability of their virtualization environments to keep their mission-critical systems up and running. So, it’s understandable why many approach a change with a degree of trepidation. They want to know whether their applications will work the same way, how much downtime to expect, and whether their teams will need extensive retraining.

    However, once customers make the move, they tend to find that Nutanix infrastructure provides everything they need, often in a more intuitive way, at a cost that is essentially what they were paying before the market shifts of the past couple of years. During the pre-sales process, I sit with customers to walk them through the Nutanix interface. We spend much of this time exploring the equivalent functionality between the platforms, which is mostly a matter of learning new terminology for familiar features.

    At Norfolk Public Schools, we conducted site assessments, installed and configured new hardware, configured the Nutanix platform, and migrated more than 400 virtual machines—all in just over a month. The cutover to the new operating environment was seamless, and the district saw immediate improvements in performance and reliability.

    For most organizations, the migration is just as painless. Some clients prefer to migrate in small batches of just a few virtual machines, while others are ready to move hundreds of virtual machines over a single weekend. The actual cutover process for each virtual machine takes only about five to ten minutes—comparable to the standard maintenance window for most security patches. Post-migration, customers typically notice improved performance (mostly due to new hardware). In addition to the cost savings, many also cite Nutanix’s simplified disaster recovery capabilities as a major benefit of the move.

    After we start the migration, I can see the anxiety on my customers’ faces melt away, replaced by relief. Recently, one even started laughing. “This is so amazing!” he kept repeating. “This is so easy!”

  • Cloning Linux: A Step-by-Step Guide to Booting from iSCSI LUN

    In this comprehensive guide, we demystify the process of cloning a Linux operating system (Ubuntu) and guide you through the intricacies of booting directly from an iSCSI LUN. We’ll walk you through the entire process, from selecting the right tools for cloning to configuring your system for iSCSI boot. Whether you’re a seasoned Linux administrator or a curious enthusiast, this step-by-step guide is tailored to empower you with the knowledge and skills needed to successfully clone and boot Linux from an iSCSI LUN.

    Let’s begin with a summary of the technology prerequisites for accomplishing this task. Firstly, you’ll require a Linux box, whether physical or virtual. It’s essential to note that the method I propose involves system downtime, so scheduling a maintenance window is advisable, particularly if your system is in production. As part of this approach, a connection between the source system and the target Volume/LUN is crucial. I’ll explore the concept of cloning to a file from the source and transporting it to the target side in a future post.

    Lastly, a target system capable of providing the iSCSI volume is indispensable for the successful execution of this process. Keep these key components in mind as we delve into the steps for cloning a Linux OS and booting from an iSCSI LUN in our detailed guide.

    If you want to try the cloning in a lab, you’ll need three things:

    1. Linux box: I will be using Ubuntu, which you can download here: https://ubuntu.com/
    2. If you want to boot a Virtual Machine (VM) from iSCSI you will need iPXE: https://ipxe.org/download
    3. For the iSCSI server I used the Nutanix Community Edition: https://next.nutanix.com/discussion-forum-14/download-community-edition-38417 (you’ll need a Nutanix Next Community login)

    Here we go. On your Linux box, gather a few pieces of information by executing these commands:

    1. Become a super user with: sudo su
    2. List your disk drives with: fdisk -l
    3. Verify the boot device with: df -h, cat /etc/fstab, and blkid
    lab@lab-vm:~$ sudo su
    [sudo] password for lab:*****

    root@lab-vm:/home/lab# fdisk -l
    ...
    Disk /dev/sda: 20 GiB, 21474836480 bytes, 41943040 sectors
    Disk model: Virtual disk
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes
    Disklabel type: gpt
    Disk identifier: 2F122466-CF57-4DAB-A441-276FFFFE87BD
    ...
    Device Start End Sectors Size Type
    /dev/sda1 2048 4095 2048 1M BIOS boot
    /dev/sda2 4096 41940991 41936896 20G Linux filesystem

    root@lab-vm:/home/lab# df -h
    Filesystem Size Used Avail Use% Mounted on
    tmpfs 391M 1.2M 390M 1% /run
    /dev/sda2 20G 6.1G 13G 33% /
    tmpfs 2.0G 0 2.0G 0% /dev/shm
    tmpfs 5.0M 0 5.0M 0% /run/lock
    tmpfs 391M 4.0K 391M 1% /run/user/1000
    root@lab-vm:/home/lab#

    root@lab-vm:/home/lab# cat /etc/fstab
    # /etc/fstab: static file system information.
    #
    # Use 'blkid' to print the universally unique identifier for a
    # device; this may be used with UUID= as a more robust way to name devices
    # that works even if disks are added and removed. See fstab(5).
    #
    # <file system> <mount point> <type> <options> <dump> <pass>
    # / was on /dev/sda2 during curtin installation
    /dev/disk/by-uuid/91cf8b5a-2c4d-49c4-bcb5-57b59339a2c0 / ext4 defaults 0 1
    /swap.img none swap sw 0 0

    root@lab-vm:/home/lab# blkid
    /dev/sda2: UUID="91cf8b5a-2c4d-49c4-bcb5-57b59339a2c0" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="853d57ea-8b35-4dc7-bb28-813d1a2e4769"
    ...
    /dev/sda1: PARTUUID="3d8c878b-7431-4732-acd4-ba2a21f5458a"
    root@lab-vm:/home/lab#

    Based on the results of the earlier commands, we’ve identified that our system is installed directly on /dev/sda. With this understanding, let’s proceed to boot Linux from a Live Ubuntu ISO Image and open a Terminal window. See the following slideshow of the process:

    While in the live Ubuntu session, you can enable ssh to make everything easier. From the terminal, execute the following commands:

    1. sudo su
    2. apt install openssh-server -y
    3. systemctl enable ssh
    4. ufw allow ssh

    I already have my Nutanix CE deployed and an iSCSI LUN configured. The discovery IP address in my case is 192.168.1.51 and the name of the iSCSI LUN is iqn.2010-06.com.nutanix:lab-boot-lun-ee392c61-6958-4be2-88fc-636bed265e09. To configure the live Ubuntu session to access the LUN, execute these commands (while logged in as root using ‘sudo su’):

    • apt install open-iscsi -y
    • apt install multipath-tools -y
    • service multipath-tools start
    • iscsiadm -m discovery -t sendtargets -p 192.168.1.51
    • iscsiadm -m node --op=update -n node.conn[0].startup -v automatic
    • iscsiadm -m node --op=update -n node.startup -v automatic
    • systemctl enable open-iscsi
    • systemctl enable iscsid
    • systemctl restart iscsid.service
    • iscsiadm -m node --loginall=automatic
    • iscsiadm -m session -o show

    The output of the last command should show something like this:

    root@ubuntu:/home/ubuntu# iscsiadm -m session -o show
    tcp: [1] 192.168.1.51:3260,1 iqn.2010-06.com.nutanix:lab-boot-lun-ee392c61-6958-4be2-88fc-636bed265e09-tgt0 (non-flash)
    root@ubuntu:/home/ubuntu#
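    Before cloning, it is worth confirming which block device the new session exposes. The sketch below is one hedged way to do that. It assumes the detailed session listing (iscsiadm -m session -P 3) contains lines of the form "Attached scsi disk sdb State: running", and the helper name attached_disks is hypothetical.

```shell
# Read a detailed iscsiadm session listing on stdin and print the
# device path of every disk attached through an iSCSI session.
# Assumption: the listing contains lines like
#   "Attached scsi disk sdb    State: running"
attached_disks() {
  awk '/Attached scsi disk/ { print "/dev/" $4 }'
}

# Usage as root in the live session:
#   iscsiadm -m session -P 3 | attached_disks
```

    If the device printed here is not the one you planned to use as the dd target, stop and recheck before writing anything.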

    In my case the original drive is /dev/sda and the new iSCSI LUN is /dev/sdb. To start the cloning, execute the following command: dd if=/dev/sda of=/dev/sdb bs=32M status=progress
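    Because dd copies the source byte for byte, the target LUN has to be at least as large as the source disk. A minimal pre-flight sketch follows, assuming blockdev from util-linux is available in the live session; the helper name can_clone is hypothetical.

```shell
# Succeeds (exit 0) when the target is at least as large as the source.
# Both arguments are sizes in bytes.
can_clone() {
  local src_bytes=$1 dst_bytes=$2
  [ "$dst_bytes" -ge "$src_bytes" ]
}

# Usage as root in the live session (device names as in this post):
#   can_clone "$(blockdev --getsize64 /dev/sda)" "$(blockdev --getsize64 /dev/sdb)" \
#     && dd if=/dev/sda of=/dev/sdb bs=32M status=progress
```

    Chaining the check with && means dd only starts when the size comparison passes.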

    The next step must be done before the reboot. It will configure the system to boot from the new iSCSI Lun.

    • mount /dev/sdb2 /mnt
    • mount --bind /dev /mnt/dev
    • mount --bind /sys /mnt/sys
    • chroot /mnt
    • mount -t proc none /proc
    • hostname -F /etc/hostname
    • echo "nameserver 8.8.8.8" >> /etc/resolv.conf
    • apt-get install initramfs-tools -y
    • apt-get install open-iscsi -y
    • echo "iscsi" >> /etc/initramfs-tools/modules
    • touch /etc/iscsi/iscsi.initramfs
    • update-initramfs -u
    • Edit /etc/default/grub:
      • Replace:
        • GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
      • With:
        • GRUB_CMDLINE_LINUX_DEFAULT="quiet splash ip=dhcp ISCSI_INITIATOR=iqn.2004-10.com.ubuntu:01:a3ea501f8a8 ISCSI_TARGET_NAME=iqn.2010-06.com.nutanix:lab-boot-lun-ee392c61-6958-4be2-88fc-636bed265e09 ISCSI_TARGET_IP=192.168.1.51 ISCSI_TARGET_PORT=3260"
    • update-grub
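    The long GRUB line is easy to mistype. As a sketch, it can be assembled from its four values; on Ubuntu the open-iscsi package normally keeps the initiator name in /etc/iscsi/initiatorname.iscsi. The helper names read_initiator and grub_iscsi_cmdline are hypothetical, and the output still has to be placed in /etc/default/grub by hand.

```shell
# Print the initiator IQN stored in an open-iscsi initiator name file,
# which holds a line of the form "InitiatorName=iqn...".
read_initiator() {
  sed -n 's/^InitiatorName=//p' "$1"
}

# Print the GRUB_CMDLINE_LINUX_DEFAULT line used in this post.
# Arguments: initiator IQN, target IQN, target IP, target port.
grub_iscsi_cmdline() {
  printf 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash ip=dhcp ISCSI_INITIATOR=%s ISCSI_TARGET_NAME=%s ISCSI_TARGET_IP=%s ISCSI_TARGET_PORT=%s"\n' \
    "$1" "$2" "$3" "$4"
}

# Usage inside the chroot (target values as in this post):
#   grub_iscsi_cmdline "$(read_initiator /etc/iscsi/initiatorname.iscsi)" \
#     iqn.2010-06.com.nutanix:lab-boot-lun-ee392c61-6958-4be2-88fc-636bed265e09 \
#     192.168.1.51 3260
```

    Printing the line first and pasting it into the file keeps the edit reviewable before update-grub is run.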

    Now shut down and boot the machine from the iPXE ISO. Follow the same steps as in the slideshow above, except that you will use the ipxe.iso image this time:

    Be ready to press Ctrl-B early in the boot process to reach the iPXE command line:

    Now, type ‘dhcp’ to acquire an IP address and type ‘show net0/ip’ to verify it.

    It is time to boot from the iSCSI Lun using this command:

    sanboot iscsi:192.168.1.51::::iqn.2010-06.com.nutanix:lab-boot-lun-ee392c61-6958-4be2-88fc-636bed265e09
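    The string passed to sanboot follows the iPXE iSCSI URI layout iscsi:<server>:<protocol>:<port>:<LUN>:<target name>. Fields left empty fall back to their defaults, which is why the command above has four consecutive colons. A small sketch that builds the string; the helper name ipxe_iscsi_uri is hypothetical.

```shell
# Print an iPXE iSCSI SAN URI.
# Arguments: server IP or name, target IQN, optional port, optional LUN.
# The protocol field is always left empty; omitted port and LUN stay empty too.
ipxe_iscsi_uri() {
  local server=$1 target=$2 port=${3:-} lun=${4:-}
  printf 'iscsi:%s::%s:%s:%s\n' "$server" "$port" "$lun" "$target"
}

# Usage (values as in this post):
#   ipxe_iscsi_uri 192.168.1.51 \
#     iqn.2010-06.com.nutanix:lab-boot-lun-ee392c61-6958-4be2-88fc-636bed265e09
```

    Passing a port and LUN explicitly, for example 3260 and 0, fills the third and fourth fields instead of leaving them empty.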

    And you should have a system booting from the iSCSI LUN. If you don’t have access to both the source and target drives at the same time, you can pipe the dd output to gzip and save it to a file that can be read on the target system. An example is:

    • dd if=/dev/sda | gzip > file.gz
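    The compressed image is only half of the procedure; on the target side it has to be decompressed and written back with dd. The following sketch shows both directions under the assumption that GNU dd and gzip are installed. The helper names save_image and restore_image are hypothetical.

```shell
# Compress a disk (or any file) into a gzip image.
# Arguments: source device or file, output .gz path.
save_image() {
  dd if="$1" bs=1M status=none | gzip -c > "$2"
}

# Decompress a gzip image and write it to a device (or file).
# Arguments: input .gz path, destination device or file.
restore_image() {
  gunzip -c "$1" | dd of="$2" bs=1M status=none
}

# Usage as root (device names as in this post):
#   save_image /dev/sda file.gz        # on the source system
#   restore_image file.gz /dev/sdb     # on the system that sees the iSCSI LUN
```

    Because restore_image overwrites its destination without asking, double-check the device name before running it.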

    I hope you find this post useful. Remember to have a good backup before attempting the cloning procedure.