
Windows Azure Drive Demo at MIX 2010


With a Windows Azure Drive, applications running in the Windows Azure cloud can use existing NTFS APIs to access a network attached durable drive.  The durable drive is actually a Page Blob formatted as a single volume NTFS Virtual Hard Drive (VHD). 

In the MIX 2010 talk on Windows Azure Storage I gave a short 10-minute demo (starts at 19:20) on using Windows Azure Drives.  Credit goes to Andy Edwards, who put together the demo. The demo focused on showing how easy it is to use a VHD on Windows 7, upload that VHD to a Windows Azure Page Blob, and then mount it for a Windows Azure application to use.

The following recaps what was shown in the demo and points out a few things glossed over.

Using VHDs in Windows 7

You can easily create and use VHDs on your local Windows 7 machine.  To do this, run “diskmgmt.msc”.  This brings up the Disk Management program in Windows 7.  You can then use the “Action” menu to create VHDs and attach (mount) VHDs.  Note, Windows Azure Drives only supports “Fixed Size” VHDs (we do not support “Dynamic” VHDs).  So when creating a VHD in Disk Management, use the “Fixed Size” option.

[Screenshot: Disk Management in Windows 7 showing the Create VHD dialog with the “Fixed Size” option]

Once the VHD is attached, you can store data in it as you would with any normal NTFS drive. 

You can also use Disk Management to detach a drive by right-clicking the mounted disk and then choosing “Detach VHD”. This is shown below, where right-clicking Disk 1 brings up the menu to “Detach VHD” for the X: drive we had mounted.

[Screenshot: Disk Management context menu on Disk 1 showing the “Detach VHD” option for the mounted X: drive]

Uploading VHDs into a Windows Azure Page Blob

The demo then showed using a command line tool Andy wrote to upload the local VHD file to a Page Blob in a Windows Azure Storage account using the Storage Client Library.  The next blog posting will go into Page Blobs and provide the code for this tool.

Using a Windows Azure Drive in the Cloud Application

The demo then showed how to code up and use a Windows Azure Drive in a Windows Azure role.   To do this we used the Windows Azure SDK and the Windows Azure Storage Client Library provided with it, so make sure you have the Windows Azure SDK installed.

  • Setting up the storage account

The first step in your application is to create a configuration setting to specify the storage account name and its secret key.  To do this you would add a configuration setting to your ServiceConfiguration.cscfg as shown below:

<ConfigurationSettings>
  <Setting name="CloudStorageAccount"
           value="DefaultEndpointsProtocol=http;AccountName=xdrivedemo;AccountKey=vhz/HSBZzBwiNpx9399hnWAe3zJGX … f+VvVA4OQ==" />
</ConfigurationSettings>

The storage account name we are using for the demo is “xdrivedemo” (the storage AccountKey isn’t fully listed for obvious reasons).

Now, the same configuration setting can be specified by clicking on your web or worker role in Visual Studio to get the following settings screen.  To create the storage account configuration setting, select “Settings” and then use “Add Setting”.

[Screenshot: Visual Studio role Settings tab with the CloudStorageAccount setting added as a Connection String]

In the above, we added the CloudStorageAccount configuration setting with the storage account name and key.  When creating the setting, we specified its type to be “Connection String”.  When you do this, you’ll see a “…” at the end of the value field.  When you click on that you’ll see the following dialog box, where you can enter your storage account name and account key.

[Screenshot: Storage connection string dialog for entering the account name and account key]

Once the storage account configuration setting has been entered, we need to write the code that allows us to access the storage account in our application.  To do this, in the application we create a CloudStorageAccount object or static member.  We can then use this to access our Windows Azure Storage account where the Page Blobs (drives) will be stored. This is done with the following code:

public CloudStorageAccount account;
account = CloudStorageAccount.FromConfigurationSetting(
"CloudStorageAccount");

At this point, we have an “account” object that we can use to perform actions against our storage account in the cloud.

 

  • Initializing the Drive Cache

Now before the application can use a Windows Azure Drive it first needs to initialize the Drive Cache.  This initialization only needs to be done once per running VM instance.  Once this initialization is done successfully for the VM instance, then all processes/threads running under that instance can mount and manipulate drives.

To perform the drive cache initialization we first need to reserve part of the local disk resources for the VM instance to be used by the Drive Cache.  To do this, we first specify a Local Storage resource as part of the ServiceDefinition.csdef as follows:

<LocalResources>
  <LocalStorage name="AzureDriveCache" cleanOnRoleRecycle="false" sizeInMB="1000" />
</LocalResources>
 

Alternatively, you can use the web/worker role property pages provided with the SDK in Visual Studio to create the local storage resource.  To do this, click on your web or worker role in Visual Studio to get the following settings screen, select “Local Storage”, and then select “Add Local Storage”.  Give the resource a name and specify the amount of local storage you want it to have in MBytes, as shown here:

[Screenshot: Visual Studio Local Storage tab with the AzureDriveCache resource set to 1000 MB and “Clean on Role Recycle” unchecked]

In the above, we’ve reserved 1 GByte of local disk space for this resource.  We also left “Clean on Role Recycle” unchecked because we don’t want the cache flushed if/when the role restarts. 

The disk space for this local storage resource comes out of the disk space allocated to the VM instance you are running your role on.  The maximum size you can specify for the local storage resource is therefore bounded by the local disk space your VM instance can have, which depends on the VM size you have chosen (Small, Medium, Large or Extra Large).  For example, a Small VM instance provides up to 250 GBytes of local disk space.

Now that we have the local storage resource defined, we need to initialize the drive cache.  To do this, we add the following code to run when the role starts up:

LocalResource azureDriveCache = RoleEnvironment.GetLocalResource("AzureDriveCache");
CloudDrive.InitializeCache(azureDriveCache.RootPath + "cache",
    azureDriveCache.MaximumSizeInMegabytes);

For this, we pass into InitializeCache the RootPath of our local storage resource, which specifies the local directory to use for the cache. The second parameter is the amount of storage space from this local storage resource we want to use for the drive cache.  Note, this drive cache is used across all of the drives mounted on this VM instance.  You can mount up to 16 drives, and all of these mounted drives share the drive cache disk space set aside by this call to InitializeCache.

One point here is that there is currently a bug in InitializeCache when its first parameter (the path) ends in “\”.  This is why we append “cache” above, which puts the drive cache files under a cache directory in that RootPath. Another option to get around this bug would be to trim the trailing “\” off as follows:

CloudDrive.InitializeCache(azureDriveCache.RootPath.TrimEnd('\\'),
    azureDriveCache.MaximumSizeInMegabytes);

At this point we have initialized the drive cache, so we can start accessing and using Windows Azure Drives in our application.

  • Creating a drive

Before we mount a drive, we first need to have either uploaded a single volume NTFS VHD into a Page Blob (as we did in the demo) or created a drive with our application in the cloud.  Creating a drive can be done with the following:

CloudDrive drive = account.CreateCloudDrive(containerName+"/"+blobName);
drive.Create(sizeOfDriveInMegabytes); 

We first create a CloudDrive object specifying the container name and blob name for the Page Blob we want to create.  The CloudDrive Create method call will then create the Page Blob in the storage account under “containerName/blobName”, and it will be created with the size in megabytes passed into the Create.   Note, the “containerName” Blob Container must already exist in the storage account for the call to succeed, so please make sure you have created that ahead of time.

In addition to creating the Page Blob, the CloudDrive Create method will also format the contents of the Page Blob as a single volume NTFS Fixed Size VHD.   Note, the smallest drive you can create is 16MBytes, and the maximum size is 1TByte.

One important point here is that for a Page Blob, all pages that do not have any data stored in them are treated as initialized to zero, and you are only charged for pages you write data into.  We take advantage of this when using Page Blobs as VHDs, since when you create and initialize a Fixed Size VHD most of it consists of empty blocks.   As we showed in the demo, we had a VHD of 100MBytes and we only had to upload 3MBs of pages, since the rest of the VHD was effectively still zeros.   Therefore, when you create a VHD using CloudDrive Create, the formatting only stores data in the pages that need to be updated in order to represent the VHD format and single volume NTFS initialization.   This means that your drive after creation will cost just a small fraction of the size you created, since most of the pages are empty.
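
To see this concretely, you can ask the blob service how much of a freshly created drive is actually stored.  The following is a minimal sketch (the “drives” container and “demo.vhd” blob name are made up for illustration) that uses GetPageRanges on the drive’s underlying Page Blob and sums the ranges that hold data:

CloudBlobClient blobClient = account.CreateCloudBlobClient();
CloudPageBlob driveBlob = blobClient
    .GetContainerReference("drives")
    .GetPageBlobReference("demo.vhd");
driveBlob.FetchAttributes();

long storedBytes = 0;
foreach (PageRange range in driveBlob.GetPageRanges())
{
    // EndOffset is inclusive, so add 1 to get the length of the range
    storedBytes += range.EndOffset + 1 - range.StartOffset;
}

Console.WriteLine("Pages stored: " + (storedBytes / 1024) + " KB of " +
    (driveBlob.Properties.Length / 1024) + " KB allocated");

For the 100 MByte drive from the demo, this would report only a few MBytes of stored pages right after creation.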

  • Mounting a drive

Once the drive cache has been initialized, the application can now start to use Windows Azure Drives.  To mount a drive one just needs to create a CloudDrive object using the “account” created earlier.  When doing this, we pass in the container name and blob name for the Page Blob we want to access using the CloudDrive object as follows:

CloudDrive drive = account.CreateCloudDrive(containerName + "/" + blobName);
string path = drive.Mount(sizeOfDriveCacheToUse, DriveMountOptions.None);

When calling Mount, you pass in the amount of the drive cache space to use for this specific drive being mounted, and the drive mount options you want to use.  Most applications will want to use the default “DriveMountOptions.None”.  A successful mount returns a string specifying the drive letter the VHD was mounted to, so the application can start using the NTFS drive.

Remember that a given VM instance can mount up to 16 drives, and all of these mounted drives share the drive cache space reserved by the earlier call to InitializeCache.  This means that the sum of the sizeOfDriveCacheToUse parameters across all of the currently mounted drives on a VM instance has to be less than or equal to the amount of local disk space set aside for the drive cache in InitializeCache.
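
As a rough illustration of how that cache reservation is shared (the blob names here are hypothetical and the 600/400 split is arbitrary), the following sketch mounts two drives whose cache sizes together fit within the 1000 MB reserved by InitializeCache above, and then writes a file through normal NTFS APIs:

CloudDrive dataDrive = account.CreateCloudDrive("drives/data.vhd");
CloudDrive logDrive = account.CreateCloudDrive("drives/logs.vhd");

// 600 MB + 400 MB <= the 1000 MB passed to InitializeCache earlier
string dataPath = dataDrive.Mount(600, DriveMountOptions.None);
string logPath = logDrive.Mount(400, DriveMountOptions.None);

// The returned strings are drive letters (e.g. "a:\"), usable with normal NTFS APIs
File.WriteAllText(Path.Combine(dataPath, "hello.txt"), "Hello from a Windows Azure Drive");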

  • Unmounting a drive

To unmount the drive, it is as easy as creating a CloudDrive object and calling Unmount as follows:

CloudDrive drive = account.CreateCloudDrive(containerName+"/"+blobName);
drive.Unmount(); 
  • Snapshotting a drive

Then to create a snapshot of the drive you just need to do the following:

CloudDrive drive = account.CreateCloudDrive(containerName + "/" + blobName);
Uri snapshotUri = drive.Snapshot();

Snapshots are read-only, and they can be created while the drive is mounted or unmounted.  Snapshotting a drive is useful for creating backups of your drives.  Under the covers this uses the Page Blob Snapshot command.  A nice advantage of this is that you only pay for the unique pages across all of the snapshots and the baseline version of the Page Blob, so as you keep updating the Page Blob you are only paying for the delta changes made since the last stored snapshot.

Summary

Finally, there are a few areas that are worth summarizing and re-emphasizing about Windows Azure Drives:

  • Only Fixed Size VHDs are supported.
  • The minimum drive size supported is 16MBytes and the maximum is 1TByte.
  • A VM instance can mount up to 16 drives, and all the active mounts share the drive disk cache space reserved by InitializeCache.
  • A Page Blob Drive can only be mounted to a single VM instance at a time for read/write access.
  • Snapshot Page Blob Drives are read-only and can be mounted as read-only drives by multiple different VM instances at the same time.
  • The Page Blob Drives stored in the Cloud can only be mounted by Windows Azure applications running in the Cloud.   We do not provide a driver with the SDK that allows local mounting of the Page Blob Drives in the Cloud.
  • When uploading a VHD to a Page Blob, make sure you do not upload the empty pages in the VHD, since the Page Blob in the Cloud will by default treat these as initialized to zero.  This will reduce your costs of uploading and storing the VHD Page Blob in the Cloud.  

Brad Calder
Windows Azure Storage


Using Windows Azure Page Blobs and How to Efficiently Upload and Download Page Blobs


This post refers to the Storage Client Library shipped in SDK 1.2. Windows Azure SDK  1.3 provides additional Page Blob functionality via the CloudPageBlob class. The current release can be downloaded here.

We introduced Page Blobs at PDC 2009 as a type of blob for Windows Azure Storage.   With the introduction of Page Blobs, Windows Azure Storage now supports the following two blob types:

  1. Block Blobs (introduced PDC 2008)–  targeted at streaming workloads. 
    • Each blob consists of a sequence/list of blocks. The following are properties of Block Blobs:
      • Each Block has a unique ID, scoped by the Blob Name
      • Blocks can be up to 4MBs in size, and the blocks in a Blob do not have to be the same size
      • A Block blob can consist of up to 50,000 blocks
      • Max block blob size is 200GB
    • Commit-based Update Semantics – Modifying a block blob is a two-phase update process.  It first consists of uploading the blocks to add or modify as uncommitted blocks for a blob.  Then, after they are all uploaded, the blocks to add/change/remove are committed via a PutBlockList to create a new readable version of the blob.  In other words, you upload all changes and then commit them atomically (see the sketch after this list).
    • Range reads can be from any byte offset in the blob.
  2. Page Blobs (introduced PDC 2009)– targeted at random write workloads. 
    • Each blob consists of an array/index of pages. The following are properties of Page Blobs:
      • Each page is of size 512 bytes, so all writes must be 512 byte aligned, and the blob size must be a multiple of 512 bytes. 
      • Writes have a starting offset and can write up to 4MBs worth of pages at a time.  These are range-based writes that consist of a sequential range of pages.
      • Max page blob size is 1TB
    • Immediate Update Semantics – As soon as a write request for a sequential set of pages succeeds in the blob service, the write has committed, and success is returned back to the client.  The update is immediate, so there is no commit step as there is for block blobs.
    • Range reads can be done from any byte offset in the blob.
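
To make the two-phase commit of block blobs concrete before we move on to page blobs, here is a minimal sketch (it assumes a CloudBlobContainer named container, as created later in this post; the blob name and block sizes are made up): blocks are uploaded first, and nothing becomes readable until PutBlockList commits them.

CloudBlockBlob blockBlob = container.GetBlockBlobReference("streamed.dat");
List<string> blockIds = new List<string>();

for (int i = 0; i < 3; i++)
{
    // Block IDs must be Base64 strings of equal length within a blob
    string blockId = Convert.ToBase64String(BitConverter.GetBytes(i));
    blockIds.Add(blockId);

    byte[] data = new byte[1024];   // each block can be up to 4MBs
    blockBlob.PutBlock(blockId, new MemoryStream(data), null);
}

// Phase two: commit the block list atomically to create the new readable version
blockBlob.PutBlockList(blockIds);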

Unique Characteristics of Page Blobs

We created Page Blobs out of a need to have a cloud storage data abstraction for files that supports:

  1. Fast range-based reads and writes – need a data abstraction with single-update writes, to provide a fast update alternative to the two-phase update of Block Blobs.
  2. Index-based data structure – need a data abstraction that supports index-based access, in comparison to the list-based approach of block blobs.
  3. Efficient sparse data structure – since the data object can represent a large sparse index, we wanted an efficient way to manage empty pages and to avoid charging for parts of the index that do not have any data pages stored in them.

Uses for Page Blobs

The following are some of the scenarios Page Blobs are being used for:

  • Windows Azure Drives - One of the key scenarios for Page Blobs was to support Windows Azure Drives. Windows Azure Drives allows Windows Azure cloud applications to mount a network attached durable drive, which is actually a Page Blob (see prior post).
  • Files with Range-Based Updates – An application can treat a Page Blob as a file, updating just the parts of the file/blob that have changed using ranged writes.   In addition, to deal with concurrency, the application can obtain and renew a Blob Lease to maintain an exclusive write lease on the Page Blob for updating.
  • Logging - Another use of Page Blobs is custom application logging.  For example, when a given role instance starts up, it can create a Page Blob of some MaxSize, which is the maximum amount of log space the role wants to use for a day.  The role instance then writes its logs using up to 4MB range-based writes, where a header provides metadata for the size of the log entry, timestamp, etc.  When the Page Blob is filled up, the role can treat the Page Blob as a circular buffer and start writing from the beginning again, or create a new Page Blob, depending upon how the application wants to manage the log files (blobs).  With this approach you can use a different Page Blob for each role instance so that there is just a single writer to each Page Blob.  To know where to start writing logs after a failover, the application can simply create a new Page Blob if a role restarts, and GC the older Page Blobs after a given number of hours or days.  Since you are not charged for pages that are empty, it doesn’t matter if you don’t fill the Page Blob up.  A rough sketch of this pattern follows this list.
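
The following is a rough sketch of that logging pattern, not a complete implementation: the helper class, the fixed 512-byte record format, and the write-wrapping policy are all made up for illustration, and error handling and role failover are left out.

public class PageBlobLogger
{
    private readonly CloudPageBlob logBlob;
    private readonly long maxSize;
    private long writeOffset;

    public PageBlobLogger(CloudPageBlob logBlob, long maxSizeInBytes)
    {
        // maxSizeInBytes must be a multiple of 512
        this.logBlob = logBlob;
        this.maxSize = maxSizeInBytes;
        logBlob.Create(maxSizeInBytes);
    }

    public void Append(string message)
    {
        // One 512-byte page per record: 8-byte timestamp (ticks) + UTF-8 message, zero padded
        byte[] page = new byte[512];
        BitConverter.GetBytes(DateTime.UtcNow.Ticks).CopyTo(page, 0);
        byte[] text = Encoding.UTF8.GetBytes(message);
        Array.Copy(text, 0, page, 8, Math.Min(text.Length, page.Length - 8));

        if (writeOffset >= maxSize)
        {
            writeOffset = 0;   // treat the page blob as a circular buffer
        }
        logBlob.WritePages(new MemoryStream(page), writeOffset);
        writeOffset += page.Length;
    }
}

A role instance would create one such logger per log blob and simply call Append for each log entry.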

Using Storage Client Library to Access Page Blobs

We’ll now walk through how to use the Windows Azure Storage Client library to create, update and read Page Blobs.

  • Creating a Page Blob and Page Blob Size

To create a Page Blob, we first create a CloudBlobClient object, passing in the base Uri for accessing blob storage for your storage account along with a StorageCredentialsAccountAndKey object, as shown below.  The CloudBlobClient object is then used to derive all of your requests to the blob service for that storage account.  The example then creates a reference to a CloudBlobContainer object and creates the container if it doesn’t already exist.  From the CloudBlobContainer object, we create a reference to a CloudPageBlob object by specifying the page blob name we want to access.

using Microsoft.WindowsAzure.StorageClient;

StorageCredentialsAccountAndKey creds = new StorageCredentialsAccountAndKey(accountName, key);
string baseUri = string.Format("http://{0}.blob.core.windows.net", accountName);
CloudBlobClient blobStorage = new CloudBlobClient(baseUri, creds);

CloudBlobContainer container = blobStorage.GetContainerReference(containerName);
container.CreateIfNotExist();

CloudPageBlob pageBlob = container.GetPageBlobReference(blobName);
pageBlob.Create(blobSize);

Then to create the page blob we call CloudPageBlob.Create, passing in the max size for the blob we want to create.  Note that the blobSize has to be a multiple of 512 bytes.

Right after the page blob is created, no pages are actually stored, but you can read from any page range within the blob and you will get back zeros.  This is because “empty pages” are treated by the page blob service as if they were filled with zeros when they are read.  This also means that after creating a blob, you are not charged for any pages even if you specify a 1TB page blob.  You are only charged for pages that have data stored in them.
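
As a small sketch of this behavior (the blob name and sizes are arbitrary), reading any range right after Create returns zeros, and GetPageRanges reports that no pages are stored yet:

CloudPageBlob emptyBlob = container.GetPageBlobReference("sparse-demo");
emptyBlob.Create(1024 * 1024);            // 1 MB page blob, nothing stored yet

byte[] buffer = new byte[4096];
BlobStream stream = emptyBlob.OpenRead();
stream.Seek(512 * 100, SeekOrigin.Begin); // reads can start at any offset
stream.Read(buffer, 0, buffer.Length);    // buffer comes back filled with zeros

// GetPageRanges returns no ranges at this point, so there is nothing to be charged for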

Make sure when uploading a blob that you don’t upload pages that are full of zeros, and instead skip over those pages leaving them empty.   This will ensure that you aren’t charged for those empty pages.  See the example below for uploading VHDs to page blobs, where we only upload pages that are non-zero.  Similarly, when reading a page blob, if you have a lot of empty pages, you may want to first get the valid page ranges with GetPageRanges, and then download just those pages.  This is used in the downloading VHD example below.

  • Writing Pages

To write pages, you use the CloudPageBlob.WritePages method.  This allows you to write a sequential set of pages of up to 4MBs, where the starting offset must be 512-byte aligned (startingOffset % 512 == 0) and the data length must be a multiple of 512 bytes (so the write ends one byte before a 512-byte boundary).  The below shows an example of calling WritePages for a blob object we are accessing:

CloudPageBlob pageBlob = container.GetPageBlobReference(blobName);
pageBlob.WritePages(dataStream, startingOffset); 

In the above example, if the dataStream is larger than 4MBs or it does not end aligned to 512 bytes then an exception is thrown.

A word of caution here is that if you get a “500” or “504” error (e.g., timeout, connection closed, etc.) back for a WritePages request, then the write may or may not have succeeded on the server.  In this case, it is best to retry the WritePages to make sure the contents are updated.
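
A minimal retry sketch for this case follows (the retry count and backoff are arbitrary, and it assumes you are not already relying on the client library's built-in RetryPolicy): since writing the same bytes to the same offset is idempotent, simply retrying WritePages is safe.

int attempts = 0;
while (true)
{
    try
    {
        dataStream.Seek(0, SeekOrigin.Begin);   // rewind the stream before each attempt
        pageBlob.WritePages(dataStream, startingOffset);
        break;
    }
    catch (StorageServerException)              // covers timeout/server errors from the blob service
    {
        if (++attempts >= 4)
        {
            throw;
        }
        System.Threading.Thread.Sleep(TimeSpan.FromSeconds(Math.Pow(2, attempts)));
    }
}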

  • Reading Pages

To read pages, you use the CloudPageBlob.OpenRead method to get a BlobStream and then use BlobStream.Read to read the pages.  This allows you to stream the full blob or a range of pages from any offset in the blob. Ranged reads can start and end at any byte offset (they do not have to be 512-byte aligned as writes do).

CloudPageBlob pageBlob = container.GetPageBlobReference(blobName);
BlobStream blobStream = pageBlob.OpenRead();

byte[] buffer = new byte[rangeSize];
blobStream.Seek(blobOffset, SeekOrigin.Begin);
int numBytesRead = blobStream.Read(buffer, bufferOffset, rangeSize);

In the above, we use CloudPageBlob.OpenRead (inherited from CloudBlob) to get a BlobStream for reading the contents of the blob.  When the blob stream is created, it is positioned at the start of the blob. To start reading at a different byte offset, call blobStream.Seek with that offset. The read will then download the page blob bytes for the given rangeSize passed in, storing them into the buffer at bufferOffset.  Remember that if you do a read over pages without any data stored in them, the blob service will return 0s for those pages.

One of the key concepts we talked about earlier is that if you have a sparsely populated blob, you may want to download just the valid page regions.  To do this you can use CloudPageBlob.GetPageRanges to get an enumerable of PageRange objects.  Calling GetPageRanges returns the list of valid page range regions for the page blob.  You can then enumerate these and download just the pages with data in them.  The below is an example of doing this:

CloudBlobClient blobStorage = new CloudBlobClient(accountName, creds);
blobStorage.ReadAheadInBytes = 0;

CloudBlobContainer container = blobStorage.GetContainerReference(containerName);
CloudPageBlob pageBlob = container.GetPageBlobReference(blobName);

IEnumerable<PageRange> pageRanges = pageBlob.GetPageRanges();
BlobStream blobStream = pageBlob.OpenRead();

foreach (PageRange range in pageRanges)
{
    // EndOffset is inclusive... so need to add 1
    int rangeSize = (int)(range.EndOffset + 1 - range.StartOffset);

    // Seek to the correct starting offset in the page blob stream
    blobStream.Seek(range.StartOffset, SeekOrigin.Begin);

    // Read the next page range into the buffer
    byte[] buffer = new byte[rangeSize];
    blobStream.Read(buffer, 0, rangeSize);

    // Then use the buffer for the page range just read
}

The above example gets the list of valid page ranges, then reads each valid page range into a local buffer to be used by the application as it sees fit.  An important step here is the “blobStream.Seek”, which moves the blob stream to the correct starting position (offset) for the next valid page range.

One thing to realize when using GetPageRanges is that you get back a list of contiguous ranges covering the current valid regions in the page blob.  You do not get back the regions in the granularity or order in which you wrote them.  For example, assume you wrote pages in the following order: [512, 2048), then [4096, 5120), then [2048, 2560), and then [0, 1024).  Calling GetPageRanges would return the two ranges [0, 2560) and [4096, 5120).

Note, in the above code we set CloudBlobClient.ReadAheadInBytes to 0.  If we did not, the blobStream.Read calls would read ahead from the blob service into page ranges without any data in them.  Setting the read-ahead to zero ensures that we only download the exact page ranges we want (the pages with data in them).

Advanced Functionality – Clearing Pages and Changing Page Blob Size

There are a few advanced Page Blob features that are not exposed at the StorageClient level, but are accessible at the REST or StorageClient.Protocol level.  We’ll briefly touch on two of them here: clearing pages and changing the size of the page blob.

If you need to delete or zero out a set of pages in a Page Blob, writing zeros to those pages will result in data pages (full of zeros) being stored and charged for.  Instead, it is better to call Put Page with the header x-ms-page-write: clear in the REST API.  This clears the set of pages from the page blob, removing them from the set of pages being charged.  The following is an example ClearPages routine from Jai Haridas to use until we add support for clearing pages at the Storage Client level:

/// Jai Haridas, Microsoft 2010
using System;
using System.Net;
using Microsoft.WindowsAzure.StorageClient;
using Microsoft.WindowsAzure.StorageClient.Protocol;

public static void ClearPages(CloudPageBlob pageBlob, int timeoutInSeconds,
    long start, long end, string leaseId)
{
    if (start % 512 != 0 || start >= end)
    {
        throw new ArgumentOutOfRangeException("start");
    }
    if ((end + 1) % 512 != 0)
    {
        throw new ArgumentOutOfRangeException("end");
    }
    if (pageBlob == null)
    {
        throw new ArgumentNullException("pageBlob");
    }
    if (timeoutInSeconds <= 0)
    {
        throw new ArgumentOutOfRangeException("timeoutInSeconds");
    }

    UriBuilder uriBuilder = new UriBuilder(pageBlob.Uri);
    uriBuilder.Query = string.Format("comp=page&timeout={0}", timeoutInSeconds);
    Uri requestUri = uriBuilder.Uri;

    // Take care of SAS query parameters if required
    if (pageBlob.ServiceClient.Credentials.NeedsTransformUri)
    {
        requestUri = new Uri(pageBlob.ServiceClient.Credentials.TransformUri(requestUri.ToString()));
    }

    // Create the request and set all the required headers
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(requestUri);
    request.Method = "PUT";
    // let the HTTP web request time out 30s after the timeout we give Azure Storage
    request.Timeout = (int)Math.Ceiling(TimeSpan.FromSeconds(timeoutInSeconds + 30).TotalMilliseconds);
    request.ContentLength = 0;
    request.Headers.Add("x-ms-version", "2009-09-19");
    request.Headers.Add("x-ms-page-write", "Clear");
    request.Headers.Add("x-ms-range", string.Format("bytes={0}-{1}", start, end));
    if (!string.IsNullOrEmpty(leaseId))
    {
        request.Headers.Add("x-ms-lease-id", leaseId);
    }

    // We have all the headers in place - let us add the auth and date headers
    pageBlob.ServiceClient.Credentials.SignRequest(request);

    using (HttpWebResponse clearResponse = (HttpWebResponse)request.GetResponse())
    {
        // Add your own logging here; the call succeeded if
        // clearResponse.StatusCode == HttpStatusCode.Created
    }
}

public static void ClearPageWithRetries(CloudPageBlob pageBlob, int timeoutInSeconds,
    long start, long end, string leaseId)
{
    int retry = 0;
    int maxRetries = 4;
    for (; ; )
    {
        retry++;
        try
        {
            ClearPages(pageBlob, timeoutInSeconds, start, end, leaseId);
            break;
        }
        catch (WebException e)
        {
            // Log the WebException status, since that tells what the error may be

            // Let us re-throw the error on 3xx, 4xx, 501 and 505 errors OR
            // if we exceed the retry count
            HttpWebResponse response = e.Response as HttpWebResponse;
            if (retry == maxRetries ||
                (response != null &&
                 (((int)response.StatusCode >= 300 && (int)response.StatusCode < 500) ||
                  (response.StatusCode == HttpStatusCode.NotImplemented) ||
                  (response.StatusCode == HttpStatusCode.HttpVersionNotSupported))))
            {
                throw;
            }
        }

        // Backoff: 3s, 9s, 27s ...
        int retryInterval = (int)Math.Pow(3, retry);
        System.Threading.Thread.Sleep(retryInterval * 1000);
    }
}

When creating the page blob, the max size specified is primarily used for bounds checking the updates to the blob.  You can actually change the size of the Page Blob at any time using the REST API (via Set Blob Properties and x-ms-blob-content-length) or the Protocol interfaces (via BlobRequest.SetProperties and newBlobSize).  If you shrink the blob size, then the pages past the new max size at the end of the blob will be deleted.  If you increase the size of the page blob, then empty pages will effectively be added at the end of the Page Blob.

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;
using Microsoft.WindowsAzure.StorageClient.Protocol;
using System.Net;

// leaving out account/container creation

CloudPageBlob pageBlob = container.GetPageBlobReference(config.Blob);

Uri requestUri = pageBlob.Uri;
if (blobStorage.Credentials.NeedsTransformUri)
{
    requestUri = new Uri(blobStorage.Credentials.TransformUri(requestUri.ToString()));
}

HttpWebRequest request = BlobRequest.SetProperties(requestUri, timeout,
    pageBlob.Properties, null, newBlobSize);
request.Timeout = timeout;
blobStorage.Credentials.SignRequest(request);

using (WebResponse response = request.GetResponse())
{
    // call succeeded
}

Putting it All Together to Efficiently Upload VHDs to Page Blobs

Now we want to tie everything together by providing an example command-line application, written using the Storage Client Library, that allows you to efficiently upload VHDs to Page Blobs.  It actually works for any file (nothing in the program is specific to VHDs), as long as you are OK with the end of the file being padded out to a 512-byte boundary when stored into the Page Blob.

The application was written by Andy Edwards for the Windows Azure Drive MIX 2010 demo (please see prior post).   The program takes 3 parameters:

  1. The local file you want to upload
  2. The full uri for the page blob you want to store in the blob service
  3. The name of a local file that has the storage account key stored in it

An example command line for running the program looks like:

c:\> vhdupload.exe input-file http://accountname.blob.core.windows.net/container/blobname key.txt

The program reads over the local file, finding the regions of non-zero pages, and then uses WritePages to write them to the page blob.  As described above, it skips over pages that are empty (filled with zeros), so they are not uploaded.  Also, if the last buffer to be uploaded is not 512-byte aligned, we resize it so that it is aligned to 512 bytes.

Here is the code:

// Andy Edwards, Microsoft 2010using System;using System.Collections.Generic;using System.Text;using System.IO;using Microsoft.WindowsAzure.StorageClient;using Microsoft.WindowsAzure;public class VhdUpload{public static void Main(string [] args)
    {Config config = Config.Parse(args);try{Console.WriteLine("Uploading: " + config.Vhd.FullName + "\n" +"To:        " + config.Url.AbsoluteUri);
            UploadVHDToCloud(config);
        }catch (Exception e)
        {Console.WriteLine("Error uploading vhd:\n" + e.ToString());
        }
    }private static bool IsAllZero(byte[] range, long rangeOffset, long size)
    {for (long offset = 0; offset < size; offset++)
        {if (range[rangeOffset + offset] != 0)
            {return false;
            }
        }return true;
    }private static void UploadVHDToCloud(Config config)
    {StorageCredentialsAccountAndKey creds = new 
StorageCredentialsAccountAndKey(config.Account, config.Key);CloudBlobClient blobStorage = new CloudBlobClient(config.AccountUrl, creds);CloudBlobContainer container = blobStorage.GetContainerReference(config.Container); container.CreateIfNotExist();CloudPageBlob pageBlob = container.GetPageBlobReference(config.Blob);Console.WriteLine("Vhd size: " + Megabytes(config.Vhd.Length));long blobSize = RoundUpToPageBlobSize(config.Vhd.Length); pageBlob.Create(blobSize);FileStream stream = new FileStream(config.Vhd.FullName, FileMode.Open, FileAccess.Read);BinaryReader reader = new BinaryReader(stream);long totalUploaded = 0;long vhdOffset = 0;int offsetToTransfer = -1;while (vhdOffset < config.Vhd.Length) {byte[] range = reader.ReadBytes(FourMegabytesAsBytes);int offsetInRange = 0;// Make sure end is page size alignedif ((range.Length % PageBlobPageSize) > 0) {int grow = (int)(PageBlobPageSize - (range.Length % PageBlobPageSize));Array.Resize(ref range, range.Length + grow); }// Upload groups of contiguous non-zero page blob pages. while (offsetInRange <= range.Length) {if ((offsetInRange == range.Length) || IsAllZero(range, offsetInRange, PageBlobPageSize)) {if (offsetToTransfer != -1) {// Transfer up to this pointint sizeToTransfer = offsetInRange - offsetToTransfer;MemoryStream memoryStream = new MemoryStream(range,
offsetToTransfer, sizeToTransfer, false, false); pageBlob.WritePages(memoryStream, vhdOffset + offsetToTransfer);Console.WriteLine("Range ~" + Megabytes(offsetToTransfer + vhdOffset)
+ " + " + PrintSize(sizeToTransfer)); totalUploaded += sizeToTransfer; offsetToTransfer = -1; } }else{if (offsetToTransfer == -1) { offsetToTransfer = offsetInRange; } } offsetInRange += PageBlobPageSize; } vhdOffset += range.Length; }Console.WriteLine("Uploaded " + Megabytes(totalUploaded) + " of " + Megabytes(blobSize)); }private static int PageBlobPageSize = 512;private static int OneMegabyteAsBytes = 1024 * 1024;private static int FourMegabytesAsBytes = 4 * OneMegabyteAsBytes;private static string PrintSize(long bytes) {if (bytes >= 1024*1024) return (bytes / 1024 / 1024).ToString() + " MB";if (bytes >= 1024) return (bytes / 1024).ToString() + " kb";return (bytes).ToString() + " bytes"; }private static string Megabytes(long bytes) {return (bytes / OneMegabyteAsBytes).ToString() + " MB"; }private static long RoundUpToPageBlobSize(long size) {return (size + PageBlobPageSize - 1) & ~(PageBlobPageSize - 1); } }public class Config{public Uri Url;public string Key;public FileInfo Vhd;public string AccountUrl {get {return Url.GetLeftPart(UriPartial.Authority); } }public string Account {get{string accountUrl = AccountUrl; accountUrl = accountUrl.Substring(Url.GetLeftPart(UriPartial.Scheme).Length); accountUrl = accountUrl.Substring(0, accountUrl.IndexOf('.'));return accountUrl; } }public string Container {get{string container = Url.PathAndQuery; container = container.Substring(1); container = container.Substring(0, container.IndexOf('/'));return container; } }public string Blob {get{string blob = Url.PathAndQuery; blob = blob.Substring(1); blob = blob.Substring(blob.IndexOf('/') + 1);int queryOffset = blob.IndexOf('?');if (queryOffset != -1) { blob = blob.Substring(0, queryOffset); }return blob; } }public static Config Parse(string [] args) {if (args.Length != 3) { WriteConsoleAndExit("Usage: vhdupload <file> <url> <keyfile>"); }Config config = new Config(); config.Url = new Uri(args[1]); config.Vhd = new FileInfo(args[0]);if (!config.Vhd.Exists) { WriteConsoleAndExit(args[0] + " does not exist"); } config.ReadKey(args[2]); return config; }public void ReadKey(string filename) {try{ Key = File.ReadAllText(filename); Key = Key.TrimEnd(null); Key = Key.TrimStart(null); }catch (Exception e) { WriteConsoleAndExit("Error reading key file:\n" + e.ToString()); } }private static void WriteConsoleAndExit(string s) {Console.WriteLine(s); System.Environment.Exit(1); } }

Putting it All Together to Efficiently Download Page Blobs to VHDs

Now we want to finish tying everything together by providing a command line program to also efficiently download page blobs using the Storage Client Library.   The application was also written by Andy Edwards.  The program takes 3 parameters:

  1. The full uri for the page blob you want to download from the blob service
  2. The local file you want to store the blob to
  3. The name of a local file that has the storage account key stored in it

An example command line for running the program looks like:

c:\> vhddownload.exe http://accountname.blob.core.windows.net/container/blobname output-file key.txt

The program gets the valid page ranges with GetPageRanges, and then it reads just those ranges, and writes those ranges at the correct offset (by seeking to it) in the local file. 

Note that the reads are broken up into 4MB reads, because (a) there is a bug in the storage client library where ReadPages uses a default timeout of 90 seconds, which may not be large enough if you are downloading page ranges in the size of 100s of MBs or larger, and (b) breaking up reads into smaller chunks allows more efficient continuation and retries of the download if there are connectivity issues for the client.

Here is the code:

// Andy Edwards, Microsoft 2010using System;using System.Collections.Generic;using System.Text;using System.IO;using Microsoft.WindowsAzure.StorageClient;using Microsoft.WindowsAzure;public class VhdDownload{public static void Main(string [] args)
    {Config config = Config.Parse(args);try{Console.WriteLine("Downloading: " + config.Url.AbsoluteUri + "\n" +"To:          " + config.Vhd.FullName);
            DownloadVHDFromCloud(config);
        }catch (Exception e)
        {Console.WriteLine("Error downloading vhd:\n" + e.ToString());
        }
    }private static void DownloadVHDFromCloud(Config config)
    {StorageCredentialsAccountAndKey creds = 
new StorageCredentialsAccountAndKey(config.Account, config.Key);CloudBlobClient blobStorage = new CloudBlobClient(config.AccountUrl, creds); blobStorage.ReadAheadInBytes = 0;CloudBlobContainer container = blobStorage.GetContainerReference(config.Container);CloudPageBlob pageBlob = container.GetPageBlobReference(config.Blob);// Get the length of the blobpageBlob.FetchAttributes();long vhdLength = pageBlob.Properties.Length;long totalDownloaded = 0;Console.WriteLine("Vhd size: " + Megabytes(vhdLength));// Create a new local file to write intoFileStream fileStream = new FileStream(config.Vhd.FullName, FileMode.Create, FileAccess.Write); fileStream.SetLength(vhdLength);// Download the valid ranges of the blob, and write them to the fileIEnumerable<PageRange> pageRanges = pageBlob.GetPageRanges();BlobStream blobStream = pageBlob.OpenRead();foreach (PageRange range in pageRanges) {// EndOffset is inclusive... so need to add 1int rangeSize = (int)(range.EndOffset + 1 - range.StartOffset);// Chop range into 4MB chucks, if neededfor (int subOffset = 0; subOffset < rangeSize; subOffset += FourMegabyteAsBytes) {int subRangeSize = Math.Min(rangeSize - subOffset, FourMegabyteAsBytes); blobStream.Seek(range.StartOffset + subOffset, SeekOrigin.Begin); fileStream.Seek(range.StartOffset + subOffset, SeekOrigin.Begin);Console.WriteLine("Range: ~" + Megabytes(range.StartOffset + subOffset)
+ " + " + PrintSize(subRangeSize));byte[] buffer = new byte[subRangeSize]; blobStream.Read(buffer, 0, subRangeSize); fileStream.Write(buffer, 0, subRangeSize); totalDownloaded += subRangeSize; } } Console.WriteLine("Downloaded " + Megabytes(totalDownloaded) + " of " + Megabytes(vhdLength)); }private static int OneMegabyteAsBytes = 1024 * 1024;private static int FourMegabyteAsBytes = 4 * OneMegabyteAsBytes;private static string Megabytes(long bytes) {return (bytes / OneMegabyteAsBytes).ToString() + " MB"; }private static string PrintSize(long bytes) {if (bytes >= 1024*1024) return (bytes / 1024 / 1024).ToString() + " MB";if (bytes >= 1024) return (bytes / 1024).ToString() + " kb";return (bytes).ToString() + " bytes"; } }public class Config{public Uri Url;public string Key;public FileInfo Vhd;public string AccountUrl {get {return Url.GetLeftPart(UriPartial.Authority); } }public string Account {get{string accountUrl = AccountUrl; accountUrl = accountUrl.Substring(Url.GetLeftPart(UriPartial.Scheme).Length); accountUrl = accountUrl.Substring(0, accountUrl.IndexOf('.'));return accountUrl; } }public string Container {get{string container = Url.PathAndQuery; container = container.Substring(1); container = container.Substring(0, container.IndexOf('/'));return container; } }public string Blob {get{string blob = Url.PathAndQuery; blob = blob.Substring(1); blob = blob.Substring(blob.IndexOf('/') + 1);int queryOffset = blob.IndexOf('?');if (queryOffset != -1) { blob = blob.Substring(0, queryOffset); }return blob; } }public static Config Parse(string [] args) {if (args.Length != 3) { WriteConsoleAndExit("Usage: vhddownload <url> <file> <keyfile>"); }Config config = new Config(); config.Url = new Uri(args[0]); config.Vhd = new FileInfo(args[1]);if (config.Vhd.Exists) {try{ config.Vhd.Delete(); }catch (Exception e) { WriteConsoleAndExit("Failed to delete vhd file:\n" + e.ToString()); } } config.ReadKey(args[2]);return config; }public void ReadKey(string filename) {try{ Key = File.ReadAllText(filename); Key = Key.TrimEnd(null); Key = Key.TrimStart(null); }catch (Exception e) { WriteConsoleAndExit("Error reading key file:\n" + e.ToString()); } }private static void WriteConsoleAndExit(string s) {Console.WriteLine(s); System.Environment.Exit(1); } }

Summary

The following are a few areas worth summarizing about Page Blobs:

  • When creating a Page Blob you specify the max size, but are only charged for pages with data stored in them.
  • When uploading a Page Blob, do not store empty pages.
  • When you need to zero out pages, clear them with ClearPages instead of writing zeros.
  • Reading from empty pages returns zeros.
  • When downloading a Page Blob, first use GetPageRanges, and only download the page ranges with data in them.

Brad Calder

 

Windows Azure Storage Explorers


We get a few queries every now and then on the availability of utilities for Windows Azure Storage and decided to put together a list of the storage explorers we know of. The tools are all Windows Azure Storage explorers that can be used to enumerate and/or transfer data to and from blobs, tables or queues. Many of these are free and some come with evaluation periods.

I should point out that we have not verified the functionality claimed by these utilities and their listing does not imply an endorsement by Microsoft. Since these applications have not been verified, it is possible that they could exhibit undesirable behavior.

Do also note that the table below is a snapshot of what we are currently aware of and we fully expect that these tools will continue to evolve and grow in functionality. If there are corrections or updates, please click on the email link on the right and let us know. Likewise if you know of tools that ought to be here, we’d be happy to add them.

In the below table, we list each Windows Azure Storage explorer, and then put an “X” in each block if it provides the ability to either enumerate and/or access the data abstraction. The last column indicates if the explorer is free or not.

(Table updated on 1/5/2011)

| Windows Azure Storage Explorer | Block Blob | Page Blob | Tables | Queues | Free |
|---|---|---|---|---|---|
| Azure Blob Studio 2011 | X | X | | | Y |
| Azure Blob Compressor (enables compressing blobs for upload and download) | X | | | | Y |
| Azure Blob Explorer | X | | | | Y |
| Azure Storage Explorer | X | X | X | X | Y |
| Azure Storage Explorer for Eclipse | X | X | X | X | Y |
| Azure Storage Simple Viewer | X | | X | X | Y |
| Azure Web Storage Explorer (a portal to access blobs, tables and queues) | X | X | X | X | Y |
| Cerebrata Cloud Storage Studio | X | X | X | X | Y/N |
| Cloud Berry Explorer | X | X | | | Y |
| Clumsy Leaf Azure Explorer (Visual Studio plug-in) | X | X | X | X | Y/N |
| Factonomy Azure Utility | X | | | | Y |
| Gladinet Cloud Desktop | X | | | | Y |
| MyAzureStorage.com (a portal to access blobs, tables and queues) | X | X | X | X | Y |
| Space Block | X | | | | Y |
| Windows Azure Management Tool | X | X | X | X | Y |
| Windows Azure Storage Explorer for Visual Studio 2010 | X | X | X | | Y |

Dinesh Haridas

SaveChangesWithRetries and Batch Option


The issue is resolved in the Windows Azure SDK  1.3 release which can be downloaded here.

We recently found a bug in the SaveChangesWithRetries overload of our Storage Client Library that takes in a SaveChangesOptions parameter.

public DataServiceResponse 
SaveChangesWithRetries(SaveChangesOptions options)

The problem is that SaveChangesWithRetries does not propagate the SaveChangesOptions to the OData client’s SaveChanges method, which leads to each entity being sent in a separate request in a serial fashion. This can clearly cause problems for clients since it does not give transaction semantics. This will be fixed in the next release of our SDK, but until then we wanted to provide a workaround.

The goal of this post is to provide a workaround for the above-mentioned bug so as to allow “Entity Group Transactions” (EGT) to be used with retries. If your application handles retries in a custom manner, you can use the OData client library directly to issue the EGT operation; this works as expected with no known issues, as shown here:

context.SaveChanges(SaveChangesOptions.Batch); 

We will now describe a BatchWithRetries method that uses the OData API “SaveChanges” and provides the required Batch option to it. As you can see below, the majority of the code is to handle the exceptions and retry logic. But before we delve into more details of BatchWithRetries, let us quickly see how it can be used.

 
StorageCredentialsAccountAndKey credentials =
    new StorageCredentialsAccountAndKey(accountName, key);
CloudStorageAccount account = new CloudStorageAccount(credentials, false);
CloudTableClient tableClient = new CloudTableClient(
    account.TableEndpoint.AbsoluteUri, account.Credentials);

// create the table if it does not exist
tableClient.CreateTableIfNotExist(tableName);

// Get the context and add entities (use UpdateObject/DeleteObject if
// you want to update/delete entities respectively)
TableServiceContext context = tableClient.GetDataServiceContext();
for (int i = 0; i < 5; i++)
{
    // Add entity with same partition key
    context.AddObject(tableName, new Entity() { PartitionKey = partitionKey, RowKey = Guid.NewGuid().ToString() });
}

try
{
    // Use the routine below with our own custom retry policy
    // that retries only on certain exceptions
    context.BatchWithRetries(TableExtensions.RetryExponential());
}
catch (Exception e)
{
    // Handle exception here as required by your application scenario
    // since we have already retried the batch on intermittent errors.
    // This exception can be interpreted as
    //   a) Retry count exceeded, OR
    //   b) the request failed because of an application error like Conflict, Pre-Condition Failed, etc.
}

BatchWithRetries takes a RetryPolicy which dictates whether we should retry on an error and how long we should wait between retries. We have also provided a RetryPolicy here which reuses the exponential backoff strategy that the StorageClient uses to wait between each retry. In addition, it retries only on certain exceptions. We handle the following exceptions:

  • DataServiceRequestException – This exception can return a list of operation responses. We will use the status code of the operation response and if it does not exist, then we will use the batch status code.
  • DataServiceClientException – We will use the status code in the exception
  • WebException – We will just use the status code if it exists. If it does not, it implies that the request never went out and we will assume bad gateway.

The code to decide if we will retry is simple – for any 2xx, 3xx, 4xx, 501 and 505 errors, we will not retry. The 2xx is a special case for Batch. Generally, 2xx means success and an exception is not thrown; however, Batch requests can return “Accepted” even for certain failures (for example, a bad request). Because of this, we include 2xx in the list of status codes on which we will not retry.

Here is the complete code that can be used for EGT with retries and please note that this will be fixed in our next SDK release. As always, please provide feedback using the email link on the right.

Jai Haridas

 

// NOTE: You will need to add System.Data.Services.Client and 
// Microsoft.WindowsAzure.StorageClient to your project references
using System;using System.Collections.Generic;using System.Data.Services.Client;using System.Linq;using System.Net;using Microsoft.WindowsAzure.StorageClient;public static class TableExtensions{/// <summary> /// Extension method invokes SaveChanges using the batch option and handle any retry errors./// Please note that you can get errors that indicate: /// 1> Entity already exists for inserts/// 2> Entity does notexist for deletes/// 3> Etag mismatch for updates/// on retries because the previous attempt that failed may have actually succeeded /// on the server and the second attempt may fail with a 4xx error. /// </summary> /// <param name="context"></param> /// <param name="retryPolicy">The retry policy to use</param>public static void BatchWithRetries(this TableServiceContext context,
RetryPolicy retryPolicy) {if (context == null) {throw new ArgumentNullException("context"); }if (retryPolicy == null) {throw new ArgumentNullException("retryPolicy"); }ShouldRetry shouldRetry = retryPolicy();// we will wait at most 40 seconds since Azure request can take at most 30s.
// We will reset the timeout before exiting this method
int oldTimeout = context.Timeout; context.Timeout = 40;int currentRetryCount = -1;TimeSpan delay;try{for (; ; ) { currentRetryCount++;try{// Directly use the OData’s SaveChanges apicontext.SaveChanges(SaveChangesOptions.Batch);break; }catch (InvalidOperationException e) {// TODO: Log the exception here for debugging // Check if we need to retry using the required policy if (!shouldRetry(currentRetryCount, e, out delay)) {throw; } System.Threading.Thread.Sleep((int)delay.TotalMilliseconds); } } } finally{ context.Timeout = oldTimeout; } }/// <summary> /// This is the ShouldRetry delegate that StorageClient uses. This can be easily wrapped in a /// RetryPolicy as shown in RetryExponential and used in places where a retry policy is required/// </summary> /// <param name="currentRetryCount"></param> /// <param name="lastException"></param> /// <param name="retryInterval"></param> /// <returns></returns>public static bool ShouldRetryOnException(int currentRetryCount, Exception lastException,
out TimeSpan retryInterval) {int statusCode = TableExtensions.GetStatusCodeFromException(lastException);// Let us not retry if 2xx, 3xx, 4xx, 501 and 505 errors OR if we exceed our retry count // The 202 error code is one such possibility for batch requests.if (currentRetryCount == RetryPolicies.DefaultClientRetryCount || statusCode == -1 || (statusCode >= 200 && statusCode < 500) || statusCode == (int)HttpStatusCode.NotImplemented || statusCode == (int)HttpStatusCode.HttpVersionNotSupported) { retryInterval = TimeSpan.Zero;return false; }// The following is an exponential backoff strategyRandom r = new Random();int increment = (int)( (Math.Pow(2, currentRetryCount) - 1) *
r.Next((int)(RetryPolicies.DefaultClientBackoff.TotalMilliseconds * 0.8), (int)(RetryPolicies.DefaultClientBackoff.TotalMilliseconds * 1.2)));int timeToSleepMsec = (int)Math.Min(RetryPolicies.DefaultMinBackoff.TotalMilliseconds + increment,RetryPolicies.DefaultMaxBackoff.TotalMilliseconds); retryInterval = TimeSpan.FromMilliseconds(timeToSleepMsec);return true; }/// <summary> /// This retry policy follows an exponential backoff and in addition retries
/// only on required HTTP status codes
/// </summary>public static RetryPolicy RetryExponential() {return () => {return ShouldRetryOnException; }; }/// <summary> /// Get the status code from exception/// </summary> /// <param name="e"></param> /// <returns></returns>private static int GetStatusCodeFromException(Exception e) {DataServiceRequestException dsre = e as DataServiceRequestException;if (dsre != null) {// Retrieve the status code: // - if we have an operation response, then it is the status code of that response. // We can only have one response on failure and we can ignore the batch status // - otherwise it is the batch status code OperationResponse opResponse = dsre.Response.FirstOrDefault();if (opResponse != null) {return opResponse.StatusCode; }return dsre.Response.BatchStatusCode; }DataServiceClientException dsce = e as DataServiceClientException;if (dsce != null) {return dsce.StatusCode; }WebException we = e as WebException;if (we != null) {HttpWebResponse response = we.Response as HttpWebResponse;// if we do not get a response, we will assume bad gateway.
// This is not completely true, but since it is better to retry on such errors, // we make up an error code here
return response != null ? (int)response.StatusCode : (int)HttpStatusCode.BadGateway; }// let us not retry on any other exceptionsreturn -1; } }

Updating Metadata during Copy Blob and Snapshot Blob


This issue has been resolved for copy blob operations in the Windows Azure SDK  1.3 release which can be downloaded here. The snapshot issue will be resolved in a future release.

When you copy or snapshot a blob in the Windows Azure Blob Service, you can specify the metadata to be stored on the snapshot or the destination blob. If no metadata is provided in the request, the server will copy the metadata from the source blob (the blob to copy from or to snapshot). However, if metadata is provided, the service will not copy the metadata from the source blob and just store the new metadata provided. Additionally, you cannot change the metadata of a snapshot after it has been created.

The current Storage Client Library in the Windows Azure SDK allows you to specify the metadata to send for CopyBlob, but there is an issue that causes the updated metadata not actually to be sent to the server. This can cause the application to incorrectly believe that it has updated the metadata on CopyBlob when it actually hasn’t been updated.

Until this is fixed in a future version of the SDK, you will need to use your own extension method if you want to update the metadata during CopyBlob. The extension method takes in the metadata to send to the server. If the application wants to add metadata on top of the existing metadata of the blob being copied, it should first fetch the metadata of the source blob and then pass all of the metadata, including the new entries, to the extension CopyBlob operation.

In addition, the SnapshotBlob operation in the SDK does not allow sending metadata. For SnapshotBlob, we gave an extension method that does this in our post on protecting blobs from application errors. This can be easily extended for “CopyBlob” too and we have the code at the end of this post, which can be copied into the same extension class.

We now explain the issue with CopyBlob using some code examples. If you want to update the metadata of the destination of the CopyBlob, you would want to set the metadata on the destination blob instance before invoking CopyBlob, as shown below. In the current SDK when doing this, the metadata is not sent to the server. This results in the server copying all the metadata from the source to destination. But on the client end, the destination instance continues to have the new metadata set by the application. The following code shows the problem:

CloudBlob destinationBlob = cloudContainer.GetBlobReference("mydocs/PDC09_draft2.ppt");

// set metadata “version” on destinationBlob so that once copied, the destinationBlob instance
// should only have “version” as the metadata
destinationBlob.Attributes.Metadata.Add("version", "draft2");

// BUG: CopyBlob does not send “version” in the REST protocol so the server goes ahead and copies
// any metadata present in the sourceBlob
destinationBlob.CopyFromBlob(sourceBlob);

Solution: The solution is to use the CopyBlob extension method at the end of this post as follows:

CloudBlob destinationBlob = cloudContainer.GetBlobReference("mydocs/PDC09_draft2.ppt");

// Get the metadata from the source blob if you want to copy them too to the destination blob
// and add the new version metadata
NameValueCollection metadata = new NameValueCollection();
metadata.Add(sourceBlob.Metadata);
metadata.Add("version", "draft2");

BlobRequestOptions options = new BlobRequestOptions()
{
    Timeout = TimeSpan.FromSeconds(45),
    RetryPolicy = RetryPolicies.RetryExponential(
        RetryPolicies.DefaultClientRetryCount, RetryPolicies.DefaultClientBackoff)
};

// Send the metadata too with the copy operation
destinationBlob.CopyBlob(
    sourceBlob,
    metadata,
    Microsoft.WindowsAzure.StorageClient.Protocol.ConditionHeaderKind.None,
    null /*sourceConditionValues*/,
    null /*leaseId*/,
    options);

Another issue to be aware of is that if a CopyFromBlob operation results in the metadata being copied from source to destination, an application developer may expect the destination blob instance to have the metadata set. However, since the copy operation does not return the metadata, the operation does not set the metadata on the destination blob instance in your application. The application will therefore need to call FetchAttributes explicitly to retrieve the metadata from the server as shown in the following:

// Let us assume sourceBlob already has metadata “author” set and we are just copying the blob
CloudBlob destinationBlob = cloudContainer.GetBlobReference("mydocs/PDC09_draft3.ppt");
destinationBlob.CopyFromBlob(sourceBlob);

// Before FetchAttributes is called, the destinationBlob instance has no metadata even though the
// server has copied metadata from the source blob.
destinationBlob.FetchAttributes();

// Now the destinationBlob instance has the metadata “author” available.

 

Jai Haridas

 

Here is the code for the CopyBlob extension method.

/// <summary>
/// Copy blob with new metadata
/// </summary>
public static CloudBlob CopyBlob(this CloudBlob destinationBlob, CloudBlob sourceBlob,
    NameValueCollection destinationBlobMetadata,
    ConditionHeaderKind sourceConditions, string sourceConditionValues, string leaseId, BlobRequestOptions options)
{
    if (sourceBlob == null)
    {
        throw new ArgumentNullException("sourceBlob");
    }

    if (destinationBlob == null)
    {
        throw new ArgumentNullException("destinationBlob");
    }

    ShouldRetry shouldRetry = options.RetryPolicy == null ?
        RetryPolicies.RetryExponential(RetryPolicies.DefaultClientRetryCount, RetryPolicies.DefaultClientBackoff)() :
        options.RetryPolicy();

    int currentRetryCount = -1;
    for (; ; )
    {
        currentRetryCount++;
        try
        {
            TimeSpan timeout = options.Timeout.HasValue ? options.Timeout.Value : TimeSpan.FromSeconds(30);
            return BlobExtensions.CopyBlob(
                        destinationBlob,
                        sourceBlob,
                        destinationBlobMetadata,
                        timeout,
                        sourceConditions,
                        sourceConditionValues,
                        leaseId);
        }
        catch (InvalidOperationException e)
        {
            // TODO: Log the exception here for debugging

            // Check if we need to retry using the required policy
            TimeSpan delay;
            if (!IsExceptionRetryable(e) || !shouldRetry(currentRetryCount, e, out delay))
            {
                throw;
            }

            System.Threading.Thread.Sleep((int)delay.TotalMilliseconds);
        }
    }
}

private static CloudBlob CopyBlob(this CloudBlob destinationBlob, CloudBlob sourceBlob,
    NameValueCollection destinationBlobMetadata, TimeSpan timeout,
    ConditionHeaderKind sourceConditions, string sourceConditionValues, string leaseId)
{
    StringBuilder canonicalName = new StringBuilder();
    canonicalName.AppendFormat(
        CultureInfo.InvariantCulture,
        "/{0}{1}",
        sourceBlob.ServiceClient.Credentials.AccountName,
        sourceBlob.Uri.AbsolutePath);

    if (sourceBlob.SnapshotTime.HasValue)
    {
        canonicalName.AppendFormat("?snapshot={0}", sourceBlob.SnapshotTime.Value.ToString("yyyy'-'MM'-'dd'T'HH':'mm':'ss'.'fffffff'Z'", CultureInfo.InvariantCulture));
    }

    HttpWebRequest request = BlobRequest.CopyFrom(
        destinationBlob.Uri,
        (int)timeout.TotalSeconds,
        canonicalName.ToString(),
        null,
        sourceConditions,
        sourceConditionValues,
        leaseId);

    // Adding 2 seconds to have the timeout on the web request slightly more than the timeout we set
    // for the server to complete the operation
    request.Timeout = (int)timeout.TotalMilliseconds + 2000;

    if (destinationBlobMetadata != null)
    {
        foreach (string key in destinationBlobMetadata.Keys)
        {
            request.Headers.Add("x-ms-meta-" + key, destinationBlobMetadata[key]);
        }
    }

    destinationBlob.ServiceClient.Credentials.SignRequest(request);

    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {
        destinationBlob.FetchAttributes();
        return destinationBlob;
    }
}

WCF Data Service Asynchronous Issue when using Windows Azure Tables from SDK 1.0/1.1


The issue is resolved in the latest version of the Windows Azure SDK which can be downloaded here.

We have received a few reports of problems when using the following routines in the Windows Azure Storage Client Library (WA SCL) for Windows Azure Tables and in the WCF Data Service client library:

  • SaveChangesWithRetries
  • BeginSaveChangesWithRetries/EndSaveChangesWithRetries
  • Using CloudTableQuery to iterate query results, or using BeginExecuteSegmented/EndExecuteSegmented
  • BeginSaveChanges/EndSaveChanges in WCF Data Service Client library
  • BeginExecuteQuery/EndExecuteQuery in WCF Data Service Client library

The problems can surface in any one of these forms:

  • Incomplete callbacks in WCF Data Service client library which can lead to 90 second delays
  • NotSupported Exception – the stream does not support concurrent IO read or write operations
  • System.IndexOutOfRangeException – probable I/O race condition detected while copying memory

This issue stems from a bug in asynchronous APIs (BeginSaveChanges/EndSaveChanges and BeginExecuteQuery/EndExecuteQuery) provided by WCF Data Service which has been fixed in .NET 4.0 and .NET 3.5 SP1 Update. However, this version of .NET 3.5 is not available in the Guest OS and SDK 1.1 does not support hosting your application in .NET 4.0.

The available options for users to deal with this are:

  1. Rather than using the WA SCL Table APIs that provide continuation token handling and retries, use the WCF Data Service synchronous APIs directly until .NET 3.5 SP1 Update is available in the cloud.
  2. As a workaround, if the application can tolerate the occasional delay from using the unpatched version of .NET 3.5, it can look for a StorageClientException with the message “Unexpected Internal Storage Client Error” and status code “HttpStatusCode.Unused”. When such an exception occurs, the application should dispose of the context that was being used and must not reuse that context for any other operation.
  3. Use the next version of the SDK when it comes out, since it will allow your application to use .NET 4.0

To elaborate a little on option 1, when a CUD operation needs to be performed, one can use the synchronous WCF Data Service API to avoid the problem mentioned above. The following code shows an example:

CloudTableClient tableClient = new CloudTableClient(
    account.TableEndpoint.AbsoluteUri,
    account.Credentials);
TableServiceContext context = tableClient.GetDataServiceContext();

// Replace context.SaveChangesWithRetries() with the following
context.SaveChanges();

The WA SCL handles retries for you; if you use the bare-bones WCF Data Services client library, you will need to provide your own retry logic. Our previous posts have examples of how to wrap your own functionality with retry logic and provide the basic building blocks for doing so. Specifically, you can wrap the call in an extension method that implements the retry logic, similar to how we added retries to the CreateSnapshot method in the “Protecting Your Blobs Against Application Errors” post. The methods GetStatusCodeFromException and IsExceptionRetryable are generic enough to be used for Tables, and the logic of retrying the operation on all “retryable” errors using the retry strategy is useful when implementing your own extension methods.
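To make that more concrete, here is a minimal, hedged sketch of such an extension method. It is not part of the Storage Client Library; the method name, the retry counts, and the simple 5xx-only retry check are illustrative placeholders for the GetStatusCodeFromException/IsExceptionRetryable logic from the posts mentioned above.

using System;
using System.Data.Services.Client;
using System.Threading;
using Microsoft.WindowsAzure.StorageClient;

public static class TableServiceContextRetryExtensions
{
    /// <summary>Wraps the synchronous SaveChanges call with a simple exponential backoff.</summary>
    public static DataServiceResponse SaveChangesWithSimpleRetry(this TableServiceContext context, int maxRetries)
    {
        TimeSpan delay = TimeSpan.FromMilliseconds(500);
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return context.SaveChanges();
            }
            catch (DataServiceRequestException e)
            {
                // Only retry server-side (5xx) failures, and only up to maxRetries times.
                // Replace this check with your own IsExceptionRetryable implementation.
                if (attempt >= maxRetries || !IsServerError(e))
                {
                    throw;
                }
            }

            Thread.Sleep(delay);
            delay = TimeSpan.FromMilliseconds(delay.TotalMilliseconds * 2);  // exponential backoff
        }
    }

    private static bool IsServerError(DataServiceRequestException e)
    {
        foreach (OperationResponse op in e.Response)
        {
            if (op.StatusCode < 500)
            {
                return false;
            }
        }

        return true;
    }
}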

For queries, you can enumerate over the IQueryable rather than converting it to a CloudTableQuery, which uses the WCF Data Service async APIs. A crucial point to note here is that the application would then have to handle continuation tokens and retry each query request itself. For continuation token handling, please refer to the Windows Azure Table technical documentation, which provides an example. The above-mentioned retry strategy can be used for queries too.
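As a rough, hedged sketch of manual continuation handling (the MyEntity type and the "mytable" table name are placeholders, retries are omitted, and the x-ms-continuation-* headers and NextPartitionKey/NextRowKey query options are the ones documented for Windows Azure Tables; System.Data.Services.Client is assumed to be referenced):

// Hypothetical entity type for illustration only
public class MyEntity : Microsoft.WindowsAzure.StorageClient.TableServiceEntity
{
    public string Data { get; set; }
}

// ... inside your query routine, given an existing TableServiceContext "context":
string nextPartitionKey = null;
string nextRowKey = null;

do
{
    DataServiceQuery<MyEntity> query = context.CreateQuery<MyEntity>("mytable");
    if (nextPartitionKey != null)
    {
        // Echo the continuation tokens back as query options on the next request
        query = query.AddQueryOption("NextPartitionKey", nextPartitionKey);
        if (nextRowKey != null)
        {
            query = query.AddQueryOption("NextRowKey", nextRowKey);
        }
    }

    QueryOperationResponse<MyEntity> response = (QueryOperationResponse<MyEntity>)query.Execute();
    foreach (MyEntity entity in response)
    {
        // process entity
    }

    // Continuation tokens, if any, come back as response headers
    response.Headers.TryGetValue("x-ms-continuation-NextPartitionKey", out nextPartitionKey);
    response.Headers.TryGetValue("x-ms-continuation-NextRowKey", out nextRowKey);
}
while (nextPartitionKey != null);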

Hope this post helps and as always, please do send feedback our way.

Jai Haridas

Stream Position Not Reset on Retries in PageBlob WritePages API


 The issue is resolved in the Windows Azure SDK  1.3 release which can be downloaded here.

We recently came across a bug in the StorageClient library in which WritePages fails on retries because the stream is not reset to the beginning before a retry is attempted. This causes the StorageClient to read from an incorrect position, and hence WritePages fails on retries.

The workaround is to implement a custom retry in which we reset the stream to the start before we invoke WritePages. We have taken note of this bug and it will be fixed in the next release of the StorageClient library. Until then, Andrew Edwards, an architect on the Storage team, has provided a workaround which can be used to implement WritePages with retries. The workaround saves the selected retry option and sets the retry policy to “None”. It then implements its own retry mechanism using the saved policy, and before issuing each request it rewinds the stream position to the beginning, which ensures that WritePages reads from the correct stream position.

This solution should also be used in the VHD upload code provided in blog “Using Windows Azure Page Blobs and How to Efficiently Upload and Download Page Blobs”. Please replace pageBlob.WritePages with the call to the below static method to get around the bug mentioned in this post.

Jai Haridas

 

static void WritePages(CloudPageBlob pageBlob, MemoryStream memoryStream, long offset)
{
    CloudBlobClient client = pageBlob.ServiceClient;
    RetryPolicy savedPolicy = client.RetryPolicy;
    client.RetryPolicy = RetryPolicies.NoRetry();

    int retryCount = 0;
    for (;;)
    {
        TimeSpan delay = TimeSpan.FromMilliseconds(-1);

        // pageBlob.WritePages doesn't do this for retries
        memoryStream.Seek(0, SeekOrigin.Begin);

        try
        {
            pageBlob.WritePages(memoryStream, offset);
            break;
        }
        catch (TimeoutException e)
        {
            bool shouldRetry = savedPolicy != null ?
                savedPolicy()(retryCount++, e, out delay) : false;
            if (!shouldRetry)
            {
                throw;
            }
        }
        catch (StorageServerException e)
        {
            bool shouldRetry = savedPolicy != null ?
                savedPolicy()(retryCount++, e, out delay) : false;
            if (!shouldRetry)
            {
                throw;
            }
        }

        if (delay > TimeSpan.Zero)
        {
            System.Threading.Thread.Sleep(delay);
        }
    }

    client.RetryPolicy = savedPolicy;
}
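For example, in the upload loop from that post the direct call would simply be swapped for the helper above (pageBlob, memoryStream, and the offset variable are names assumed to come from your own upload code):

// Before: pageBlob.WritePages(memoryStream, totalUploaded);
// After, using the static helper above, which rewinds the stream before every attempt:
WritePages(pageBlob, memoryStream, totalUploaded);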

Updates to Windows Azure Drive (Beta) in OS 1.4


This post refers to OS 1.4; this issue has been resolved for all subsequent OS versions.

During internal testing we discovered an issue that might impact Windows Azure Drive (Beta) users under heavy load, causing I/O errors. All users need to upgrade to OS 1.4 immediately, which has a fix for the problem.

Note, this issue does not apply to Windows Azure Blobs, Tables or Queues.

To ensure that your role is always upgraded to the most recent OS version, you can set the value of the osVersion attribute to “*” in the service configuration element.  Here’s an example of how to specify the most recent OS version for your role.

<ServiceConfiguration serviceName="<service-name>" osVersion="*">
  <Role name="<role-name>">
    ….
  </Role>
</ServiceConfiguration>

If you would like to upgrade to OS 1.4 without being auto-updated beyond that, you must replace the “*” above and set the value of the osVersion attribute to “WA-GUEST-OS-1.4_201005-01”.

Additional details on configuring operating system versions and the service configuration are available here.

Dinesh Haridas


Storage Client Hotfix Release – September 2010


The issues are resolved in the Windows Azure SDK  1.3 release which can be downloaded here.

1. Application crashes with unhandled NullReferenceException that is raised on a callback thread

Storage Client uses a timer object to keep track of timeouts when getting the web response. There is a race condition in which a disposed timer object can be accessed, causing a NullReferenceException with the following call stack:

System.NullReferenceException: Object reference not set to an instance of an object.
at Microsoft.WindowsAzure.StorageClient.Tasks.DelayTask.<BeginDelay>b__0(Object state)
at System.Threading.ExecutionContext.runTryCode(Object userData)
at System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(TryCode code, CleanupCode backoutCode, Object userData)
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading._TimerCallback.PerformTimerCallback(Object state)

This can impact any API in the library, and because the exception is raised on a callback thread where it is not handled, it results in the application crashing. We have fixed this issue now.

2. MD5 Header is not passed to Azure Blob Service in block blob uploads

The Storage Client routines UploadText, UploadFromStream, UploadFile, and UploadByteArray use block blobs to upload the files. The client library used an incorrect header (x-ms-blob-content-md5 rather than the standard Content-MD5 header) while uploading individual blocks. Therefore, the MD5 was not checked when the block was stored in the Azure Blob service. We have fixed this by using the correct header.

Please download the update from here.

Windows Azure Storage Client Library: CloudBlob.DownloadToFile() may not entirely overwrite file contents



Update 3/09/2011: The bug is fixed in the Windows Azure SDK March 2011 release.

Summary

There is an issue in the Windows Azure Storage Client Library that can lead to unexpected behavior when utilizing the CloudBlob.DownloadToFile() methods.

The current implementation of CloudBlob.DownloadToFile() does not erase or clear any preexisting data in the file. Therefore, if you download a blob which is smaller than the existing file, preexisting data at the tail end of the file will still exist.

Example: Let’s say I have a blob titled movieblob that currently contains all the movies that I would like to watch in the future. I want to download this blob to a local file moviesToWatch.txt, which currently contains a lot of romantic comedies that my wife recently watched. However, when I overwrite that file with the action movies I want to watch (which happens to be a smaller list), the existing text is not completely overwritten, which may lead to a somewhat random movie selection.

moviesToWatch.txt

You've Got Mail;P.S. I Love You.;Gone With The Wind;Sleepless in Seattle;Notting Hill;Pretty Woman;The Runaway Bride;The Holiday;Little Women;When Harry Met Sally

movieblob

The Dark Knight;The Matrix;Braveheart;The Core;Star Trek 2:The Wrath of Khan;The Dirty Dozen;

moviesToWatch.txt (updated)

The Dark Knight;The Matrix;Braveheart;The Core;Star Trek 2:The Wrath of Khan;The Dirty Dozen;Woman;The Runaway Bride;The Holiday;Little Women;When Harry Met Sally

As you can see in the updated local moviesToWatch.txt file, the last section of the previous movie data still exists on the tail end of the file.

This issue will be addressed in a forthcoming release of the Storage Client Library.

Workaround

In order to avoid this behavior you can use the CloudBlob.DownloadToStream() method and pass in the stream for a file that you have already called File.Create on, see below.

using(var stream = File.Create("myFile.txt"))
{
            myBlob.DownloadToStream(stream);
}

To reiterate, this issue only affects scenarios where the file already exists, and the downloaded blob contents are less than the length of the previously existing file. If you are using CloudBlob.DownloadToFile() to write to a new file then you will be unaffected by this issue. Until the issue is resolved in a future release, we recommend that users follow the pattern above.

Joe Giardino

Windows Azure Storage Client Library: Potential Deadlock When Using Synchronous Methods


Update 11/06/11:  The bug is fixed in the Windows Azure SDK September release.


Summary

In certain scenarios, using the synchronous methods provided in the Windows Azure Storage Client Library can lead to deadlock; specifically, scenarios where the system is using all of the available ThreadPool threads while performing synchronous method calls via the Windows Azure Storage Client Library. Examples of such behavior are ASP.NET requests served synchronously, as well as a simple worker client that queues up N work items on the ThreadPool. Note that if you are manually creating threads outside of the managed ThreadPool, or executing code off of a ThreadPool thread, then this issue will not affect you.

When calling synchronous methods, the current implementation of the Storage Client Library blocks the calling thread while performing an asynchronous call. This blocked thread will be waiting for the asynchronous result to return. As such, if one of the asynchronous results requires an available ThreadPool thread (e.g. MemoryStream.BeginWrite, FileStream.BeginWrite, etc.) and no ThreadPool threads are available, its callback will be added to the ThreadPool work queue to wait until a thread becomes available for it to run on. This leads to a condition where the calling thread is blocked until that asynchronous result (callback) unblocks it, but the callback will not execute until threads become unblocked; in other words the system is now deadlocked.

Affected Scenarios

This issue could affect you if your code is executing on a ThreadPool thread and you are using the synchronous methods from the Storage Client Library. Specifically, this issue will arise when the application has used all of its available ThreadPool threads. To find out if your code is executing on a ThreadPool thread you can check System.Threading.Thread.CurrentThread.IsThreadPoolThread at runtime. Some specific methods in the Storage Client Library that can exhibit this issue include the various blob download methods (file, byte array, text, stream, etc.)
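For example, a quick check you could add while diagnosing whether a given code path runs on the ThreadPool:

if (System.Threading.Thread.CurrentThread.IsThreadPoolThread)
{
    // Running on a ThreadPool thread: avoid blocking synchronous Storage Client calls here,
    // or apply the concurrency limits described in the workarounds below.
}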

Example

For example let’s say that we have a maximum of 10 ThreadPool threads in our system which can be set using ThreadPool.SetMaxThreads. If each of the threads is currently blocked on a synchronous method call waiting for the async wait handle to be set which will require a ThreadPool thread to set the wait handle, we are deadlocked since there are no available threads in the ThreadPool that can set the wait handle.

Workarounds

The following workarounds will avoid this issue:

  • Use asynchronous methods instead of their synchronous equivalents (i.e. use BeginDownloadByteArray/ EndDownloadByteArray rather than DownloadByteArray). Utilizing the Asynchronous Programming Model is really the only way to guarantee performant and scalable solutions which do not perform any blocking calls. As such, using the Asynchronous Programming Model will avoid the deadlock scenario detailed in this post.
  • If you are unable to use asynchronous methods, limit the level of concurrency in the system at the application layer to reduce simultaneous in-flight operations, using a semaphore construct such as System.Threading.Semaphore or Interlocked.Increment/Decrement. For example, have each incoming client perform an Interlocked.Increment on an integer and a corresponding Interlocked.Decrement when the synchronous operation completes (a short sketch of this counting approach follows this list). With this mechanism in place you can limit the number of simultaneous in-flight operations below the ThreadPool limit and return “Server Busy” rather than blocking more worker threads. When setting the maximum number of ThreadPool threads via ThreadPool.SetMaxThreads, be cognizant of any additional ThreadPool threads you are using in the app domain (via ThreadPool.QueueUserWorkItem or otherwise) so you can accommodate them in your scenario. The goal is to limit the number of blocked threads in the system at any given point, so make sure that you do not block the thread prior to calling synchronous methods, since that results in the same number of overall blocked threads. Instead, when the application reaches its concurrency limit, ensure that no additional ThreadPool threads become blocked.
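A minimal sketch of the counting approach described above; the class, method, and constant names are illustrative (not part of the Storage Client Library), and the limit should be tuned to stay comfortably below your ThreadPool ceiling:

using System.IO;
using System.Threading;
using Microsoft.WindowsAzure.StorageClient;

public static class ThrottledBlobOperations
{
    // Keep this comfortably below the limit configured via ThreadPool.SetMaxThreads.
    private const int MaxConcurrentOperations = 48;
    private static int inFlightOperations = 0;

    // Returns false (caller can respond "Server Busy") instead of blocking another ThreadPool thread.
    public static bool TryDownloadToStream(CloudBlob blob, Stream target)
    {
        if (Interlocked.Increment(ref inFlightOperations) > MaxConcurrentOperations)
        {
            Interlocked.Decrement(ref inFlightOperations);
            return false;
        }

        try
        {
            blob.DownloadToStream(target);   // synchronous Storage Client call
            return true;
        }
        finally
        {
            Interlocked.Decrement(ref inFlightOperations);
        }
    }
}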

Mitigations

If you are experiencing this issue and the options above are not viable in your scenario, you might try one of the mitigations below. Please ensure you fully understand the implications of these actions, as they will result in additional threads being created on the system.

  • Increasing the number of ThreadPool threads can mitigate this to some extent (a one-line example follows this list); however, deadlock will always be a possibility without a limit on simultaneous operations.
  • Offload work to non ThreadPool threads (make sure you understand the implications before doing this, the main purpose of the ThreadPool is to avoid the cost of constantly spinning up and killing off threads which can be expensive in code that runs frequently or in a tight loop).
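For the first mitigation, the call itself is a one-liner placed early in application or role startup; the numbers here are purely illustrative, and raising them does not remove the underlying risk:

// workerThreads, completionPortThreads
System.Threading.ThreadPool.SetMaxThreads(200, 200);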

Summary

We are currently investigating long term solutions for this issue for an upcoming release of the Windows Azure SDK. As such if you are currently affected by this issue please follow the workarounds contained in this post until a future release of the SDK is made available. To summarize, here are some best practices that will help avoid the potential deadlock detailed above.

  • Use asynchronous methods for applications that need to scale. Simply stated, synchronous calls do not scale well because they require the system to lock a thread for some amount of time. For applications with low demand this is acceptable; however, threads are a finite resource in a system and should be treated as such. Applications that need to scale should use simultaneous asynchronous calls so that a given thread can service many calls.
  • Limit concurrent requests when using synchronous APIs - Use semaphores/counters to control concurrent requests.
  • Perform a stress test - where you purposely saturate the ThreadPool workers to ensure your application responds appropriately.

References

MSDN ThreadPool documentation: http://msdn.microsoft.com/en-us/library/y5htx827(v=VS.90).aspx

Implementing the CLR Asynchronous Programming Model: http://msdn.microsoft.com/en-us/magazine/cc163467.aspx

Developing High-Performance ASP.NET Applications: http://msdn.microsoft.com/en-us/library/aa719969(v=VS.71).aspx

Joe Giardino

Page Blob Writes in Windows Azure Storage Client Library does not support Streams with non-zero Position



Update 3/09/2011: The bug is fixed in the Windows Azure SDK March 2011 release.

The current Windows Azure Storage Client Library does not support passing in a stream to CloudPageBlob.[Begin]WritePages where the stream position is a non-zero value. In such a scenario the Storage Client Library will incorrectly calculate the size of the data range which will cause the server to return HTTP 500: Internal Server Error. This is surfaced to the client via a StorageServerException with the message "Server encountered an internal error. Please try again after some time.”

HTTP 500 errors are generally retryable by the client as they relate to an issue on the server side, however in this instance it is the client which is supplying the invalid request. As such this request will be retried by the storage client library N times according to the RetryPolicy specified on the CloudBlobClient (default is 3 retries with an exponential backoff delay). However these requests will not succeed and all subsequent retries will fail with the same error.

Workarounds

In the code below I have included a set of extension methods that provide a safe way to invoke [Begin]WritePages, which will throw an exception if the source stream is at a non-zero position. You can alter these methods for your specific scenario, for example if you wish to rewind the stream instead. Future releases of the Storage Client Library will accommodate this scenario as we continue to expand support for page blobs at the convenience layer.

public static class StorageExtensions
{
    /// <summary>
    /// Begins an asynchronous operation to write pages to a page blob while enforcing that the stream is at the beginning
    /// </summary>
    /// <param name="pageData">A stream providing the page data.</param>
    /// <param name="startOffset">The offset at which to begin writing, in bytes. The offset must be a multiple of 512.</param>
    /// <param name="callback">The callback delegate that will receive notification when the asynchronous operation completes.</param>
    /// <param name="state">A user-defined object that will be passed to the callback delegate.</param>
    /// <returns>An <see cref="IAsyncResult"/> that references the asynchronous operation.</returns>
    public static IAsyncResult BeginWritePagesSafe(this CloudPageBlob blobRef, Stream pageData, long startOffset, AsyncCallback callback, object state)
    {
        if (pageData.Position != 0)
        {
            throw new InvalidOperationException("Stream position must be set to zero!");
        }

        return blobRef.BeginWritePages(pageData, startOffset, callback, state);
    }

    /// <summary>
    /// Writes pages to a page blob while enforcing that the stream is at the beginning
    /// </summary>
    /// <param name="pageData">A stream providing the page data.</param>
    /// <param name="startOffset">The offset at which to begin writing, in bytes. The offset must be a multiple of 512.</param>
    public static void WritePagesSafe(this CloudPageBlob blobRef, Stream pageData, long startOffset)
    {
        if (pageData.Position != 0)
        {
            throw new InvalidOperationException("Stream position must be set to zero!");
        }

        blobRef.WritePages(pageData, startOffset, null);
    }

    /// <summary>
    /// Writes pages to a page blob while enforcing that the stream is at the beginning
    /// </summary>
    /// <param name="pageData">A stream providing the page data.</param>
    /// <param name="startOffset">The offset at which to begin writing, in bytes. The offset must be a multiple of 512.</param>
    /// <param name="options">An object that specifies any additional options for the request.</param>
    public static void WritePagesSafe(this CloudPageBlob blobRef, Stream pageData, long startOffset, BlobRequestOptions options)
    {
        if (pageData.Position != 0)
        {
            throw new InvalidOperationException("Stream position must be set to zero!");
        }

        blobRef.WritePages(pageData, startOffset, options);
    }
}

Summary

The current Storage Client Library requires an additional check prior to passing a Stream to CloudPageBlob.[Begin]WritePages in order to avoid producing an invalid request. Using the code above or applying similar checks at the application level can avoid this issue. Please note that other types of blobs are unaffected by this issue (e.g. CloudBlob.UploadFromStream), and we will be addressing it in a future release of the Storage Client Library.

Joe Giardino

Windows Azure Blob MD5 Overview


Overview

Windows Azure Blob service provides mechanisms to ensure data integrity both at the application and transport layers. This post details these mechanisms from the service and client perspective. MD5 checking is optional on both PUT and GET operations; it provides a convenient way to ensure data integrity across the network when using HTTP. Since HTTPS already provides transport layer security, additional MD5 checking is not needed while connecting over HTTPS, as it would be redundant.

To ensure data integrity the Windows Azure Blob service uses MD5 hashes of the data in a couple different manners. It is important to understand how these values are calculated, transmitted, stored, and eventually enforced in order to appropriately design your application to utilize them to provide data integrity.

Please note, the Windows Azure Blob service provides a durable storage medium, and uses its own integrity checking for stored data. The MD5's that are used when interacting with an application are provided for checking the integrity of the data when transferring that data between the application and service via HTTP. For more information regarding the durability of the storage system please refer to the Windows Azure Storage Architecture Overview.

The following table shows the Windows Azure Blob service REST APIs and the MD5 checks provided for them:

| REST API | Header | Value | Validated By | Notes |
| --- | --- | --- | --- | --- |
| Put Blob | x-ms-blob-content-md5 | MD5 value of the blob's bits | Server | Full blob |
| Put Blob | Content-MD5 | MD5 value of the blob's bits | Server | Full blob; if x-ms-blob-content-md5 is present, Content-MD5 is ignored |
| Put Block | Content-MD5 | MD5 value of the block's bits | Server | Validated prior to storing the block |
| Put Page | Content-MD5 | MD5 value of the page's bits | Server | Validated prior to storing the page |
| Put Block List | x-ms-blob-content-md5 | MD5 value of the blob's bits | Client on subsequent download | Stored as the Content-MD5 blob property to be downloaded with the blob for client-side checks |
| Set Blob Properties | x-ms-blob-content-md5 | MD5 value of the blob's bits | Client on subsequent download | Sets the blob's Content-MD5 property |
| Get Blob | Content-MD5 | MD5 value of the blob's bits | Client | Returns the Content-MD5 property if one was stored/set with the blob |
| Get Blob (range) | Content-MD5 | MD5 value of the blob range's bits | Client | If the client specifies x-ms-range-get-content-md5: true, the Content-MD5 header is dynamically calculated over the requested range; restricted to ranges <= 4 MB |
| Get Blob Properties | Content-MD5 | MD5 value of the blob's bits | Client | Returns the Content-MD5 property if one was stored/set with the blob |

Table 1 : REST API MD5 Compatibility

Service Perspective

From the Windows Azure Blob Storage service perspective, the only MD5 values that are explicitly calculated and validated on each transaction are the transport layer (HTTP) MD5 values. MD5 checking is optional on both PUT and GET operations. (Again, since HTTPS provides transport layer security, additional MD5 checking is not needed when using HTTPS.) We will be discussing two separate MD5 values, which provide checks at different layers:

  • PUT with Content-MD5: When a content MD5 header is specified, the storage service calculates an MD5 of the data sent and checks that with the Content-MD5 that was also sent. If the two hashes do not match, the operation will fail with error code 400 (Bad Request). These values are transmitted via the Content-MD5 HTTP header. This validation is available for PutBlob, PutBlock and PutPage. Note, when uploading a block, page, or blob the service will return the Content-MD5 HTTP header in the response populated with the MD5 it calculated for the data received.
  • PUT with x-ms-blob-content-md5: The application can also set the Content-MD5 property that is stored with a blob. The application can pass this in with the header x-ms-blob-content-md5, and the value is stored as the Content-MD5 header to be returned on subsequent GETs for the blob. This can be set when using PutBlob, PutBlockList or SetBlobProperties for the blob. If a user provides this value on upload, all subsequent GET operations will return this header with the client-provided value. The x-ms-blob-content-md5 header is a header we introduced for scenarios where we want to specify the hash for the blob content when the HTTP request content is not fully indicative of the actual blob data, such as in PutBlockList. In a PutBlockList, the Content-MD5 header provides transactional integrity for the message contents (the block list in the request body), while the x-ms-blob-content-md5 header sets the service-side blob property. To reiterate, if an x-ms-blob-content-md5 header is provided it supersedes the Content-MD5 header on a PutBlob operation; for a PutBlock or PutPage operation it is ignored.
  • GET: On a subsequent GET operation the service will optionally populate the Content-MD5 HTTP header if a value was previously stored with the blob via a PutBlob, PutBlockList, or SetBlobProperties. For range GETs an optional x-ms-range-get-content-md5 header can be added to the request. When this header is set to true and specified together with the Range header for a range GET, the service dynamically calculates an MD5 for the range and returns it in the Content-MD5 header, as long as the range is less than or equal to 4 MB in size. If this header is specified without the Range header, the service returns status code 400 (Bad Request). If this header is set to true when the range exceeds 4 MB in size, the service returns status code 400 (Bad Request).

Client Perspective

We have already discussed above how the Windows Azure Blob service can provide transport layer security via the Content-MD5 HTTP header or HTTPS. In addition to this the client can store and manually validate MD5 hashes on the blob data from the application layer. The Windows Azure Storage Client library provides this calculation functionality via the exposed object model and relevant abstractions such as BlobWriteStream.

Storing Application layer MD5 when Uploading Blobs via the Storage Client Library

When utilizing the CloudBlob Convenience layer methods in most cases the library will automatically calculate and transmit the application layer MD5 value. However, there is an exception to this behavior when a call to an upload method results in

  • A single PUT operation to the Blob service, which will occur when source data is less than CloudBlobClient.SingleBlobUploadThresholdInBytes.
  • A parallel upload (length > CloudBlobClient.SingleBlobUploadThresholdInBytes and CloudBlobClient.ParallelOperationThreadCount > 1).

In both of the above cases an MD5 value is not passed in to be checked, so in these scenarios, if the client requires data integrity checking it needs to make sure to use HTTPS. (HTTPS can be enabled when constructing a CloudStorageAccount via the constructor, or by specifying HTTPS as part of the baseAddress when manually constructing a CloudBlobClient.)
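For reference, a small example of enabling HTTPS with the same credentials type used later in this post; the account name and key are placeholders, and the second constructor argument selects the https endpoints:

StorageCredentialsAccountAndKey creds =
    new StorageCredentialsAccountAndKey("myaccount", "myaccountkey");

// true => use the https endpoints for the blob, table and queue services
CloudStorageAccount httpsAccount = new CloudStorageAccount(creds, true);
CloudBlobClient httpsBlobClient = httpsAccount.CreateCloudBlobClient();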

All other blob upload operations from the convenience layer in the SDK send MD5’s that are checked at the blob service.

In addition to the exposed object methods, you can also provide the x-ms-blob-content-md5 header via the Protocol layer on a PutBlob or PutBlockList request.

The table below lists the convenience layer methods used to upload blobs, which of them support sending MD5 checks, and when the checks are sent.

| Layer | Method | Notes |
| --- | --- | --- |
| Convenience | CloudBlob.OpenWrite | MD5 is sent. Note, this method is not currently supported for page blobs |
| Convenience | CloudBlob.UploadByteArray | MD5 is sent if: length >= CloudBlobClient.SingleBlobUploadThresholdInBytes AND CloudBlobClient.ParallelOperationThreadCount == 1 |
| Convenience | CloudBlob.UploadFile | MD5 is sent if: length >= CloudBlobClient.SingleBlobUploadThresholdInBytes AND CloudBlobClient.ParallelOperationThreadCount == 1 |
| Convenience | CloudBlob.UploadText | MD5 is sent if: length >= CloudBlobClient.SingleBlobUploadThresholdInBytes AND CloudBlobClient.ParallelOperationThreadCount == 1 |
| Convenience | CloudBlob.UploadFromStream | MD5 is sent if: length >= CloudBlobClient.SingleBlobUploadThresholdInBytes AND CloudBlobClient.ParallelOperationThreadCount == 1 |
Table 2 : Blob upload methods MD5 compatibility

Validating Application Layer MD5 when downloading Blobs via the Storage Client Library

The CloudBlob Download methods do not provide application layer MD5 validation; as such it is up to the application to verify the Content-MD5 returned against the data returned by the service. If the application layer MD5 value was specified on upload the Windows Azure Storage Client Library will populate it in CloudBlob.Properties.ContentMD5 on any download (i.e. DownloadText, DownloadByteArray, DownloadToFile, DownloadToStream, and OpenRead).

The example below shows how a client can validate the blobs MD5 hash once all the data is retrieved.

Example

// Initialization
string blobName = "md5test" + Guid.NewGuid().ToString();
long blobSize = 8 * 1024 * 1024;

StorageCredentialsAccountAndKey creds =
        new StorageCredentialsAccountAndKey(AccountName, AccountKey);
CloudStorageAccount account = new CloudStorageAccount(creds, false);
CloudBlobClient bClient = account.CreateCloudBlobClient();

// Set CloudBlobClient.SingleBlobUploadThresholdInBytes, all blobs above this
// length will be uploaded using blocks
bClient.SingleBlobUploadThresholdInBytes = 4 * 1024 * 1024;

// Create Blob Container
CloudBlobContainer container = bClient.GetContainerReference("md5blobcontainer");
Console.WriteLine("Validating the Container");
container.CreateIfNotExist();

// Populate Blob Data
byte[] blobData = new byte[blobSize];
Random rand = new Random();
rand.NextBytes(blobData);
MemoryStream retStream = new MemoryStream(blobData);

// Upload Blob
CloudBlob blobRef = container.GetBlobReference(blobName);

// Any upload method will work here: byte array, file, text, stream
blobRef.UploadByteArray(blobData);

// Download will re-populate the client MD5 value from the server
byte[] retrievedBuffer = blobRef.DownloadByteArray();

// Validate MD5 Value
var md5Check = System.Security.Cryptography.MD5.Create();
md5Check.TransformBlock(retrievedBuffer, 0, retrievedBuffer.Length, null, 0);
md5Check.TransformFinalBlock(new byte[0], 0, 0);

// Get Hash Value
byte[] hashBytes = md5Check.Hash;
string hashVal = Convert.ToBase64String(hashBytes);

if (hashVal != blobRef.Properties.ContentMD5)
{
    throw new InvalidDataException("MD5 Mismatch, Data is corrupted!");
}

Figure 1: Validating a Blobs MD5 value

A note about Page Blobs

Page blobs are designed to provide a durable storage medium that can perform a high rate of IO. Data can be accessed in 512 byte pages allowing a high rate of non-contiguous transactions to complete efficiently. If HTTP needs to be used with MD5 checks, then the application should pass in the Content-MD5 on PutPage, and then use the x-ms-range-get-content-md5 on each subsequent GetBlob using ranges less than or equal to 4MBs.

Considerations

Currently the convenience layer of the Windows Azure Storage Client Library does not support passing in MD5 values for PageBlobs, nor returning Content-MD5 for getting PageBlob ranges. As such, if your scenario requires data integrity checking at the transport level it is recommended that you use HTTPS or utilize the Protocol Layer and add the additional Content-MD5 header.

In the following example we will show how to perform page blob range GETs with an optional x-ms-range-get-content-md5 via the protocol layer in order to provide transport layer security over HTTP.

Example

// Initialization
string blobName = "md5test" + Guid.NewGuid().ToString();
long blobSize = 8 * 1024 * 1024;

// Must be divisible by 512
int writeSize = 1 * 1024 * 1024;

StorageCredentialsAccountAndKey creds =
    new StorageCredentialsAccountAndKey(AccountName, AccountKey);
CloudStorageAccount account = new CloudStorageAccount(creds, false);
CloudBlobClient bClient = account.CreateCloudBlobClient();
bClient.ParallelOperationThreadCount = 1;

// Create Blob Container
CloudBlobContainer container = bClient.GetContainerReference("md5blobcontainer");
Console.WriteLine("Validating the Container");
container.CreateIfNotExist();

int uploadedBytes = 0;

// Upload Blob
CloudPageBlob blobRef = container.GetBlobReference(blobName).ToPageBlob;
blobRef.Create(blobSize);

// Populate Blob Data
byte[] blobData = new byte[writeSize];
Random rand = new Random();
rand.NextBytes(blobData);
MemoryStream retStream = new MemoryStream(blobData);

while (uploadedBytes < blobSize)
{
    blobRef.WritePages(retStream, uploadedBytes);
    uploadedBytes += writeSize;
    retStream.Position = 0;
}

HttpWebRequest webRequest = BlobRequest.Get(
                                        blobRef.Uri,        // URI
                                        90,                 // Timeout
                                        null,               // Snapshot (optional)
                                        1024 * 1024,        // Start Offset
                                        3 * 1024 * 1024,    // Count
                                        null);              // Lease ID (optional)

webRequest.Headers.Add("x-ms-range-get-content-md5", "true");
bClient.Credentials.SignRequest(webRequest);
WebResponse resp = webRequest.GetResponse();

Figure 2: Transport Layer security via optional x-ms-range-get-content-md5 header on a PageBlob

Summary

This article has detailed various strategies when utilizing MD5 values to provide data integrity. As with many cases the correct solution is dependent on your specific scenario.

We will be evaluating this topic in future releases of the Windows Azure Storage Client Library as we continue to improve the functionality offered. Please leave comments below if you have questions.

Joe Giardino


Windows Azure Storage Client Library: Parallel Single Blob Upload Race Condition Can Throw an Unhandled Exception


Update 11/06/11:  The bug is fixed in the Windows Azure SDK September release.

There is a race condition in the current Windows Azure Storage Client Library that could potentially throw an unhandled exception under certain circumstances. Essentially, the parallel upload executes by dispatching up to N (N = CloudBlobClient.ParallelOperationThreadCount) simultaneous block uploads at a time and waiting on one of them to return via WaitHandle.WaitAny (note: CloudBlobClient.ParallelOperationThreadCount is initialized by default to the number of logical processors on the machine, meaning an XL VM will be initialized to 8). Once an operation returns it will attempt to kick off more operations until it satisfies the desired parallelism or there is no more data to write. This loop continues until all data is written and a subsequent PutBlockList operation is performed.

The bug is that there is a race condition in the parallel upload feature resulting in the termination of this loop before it gets to the PutBlockList. The net result is that some blocks will be added to a blob's uncommitted block list, but the exception will prevent the PutBlockList operation. Subsequently it will appear to the client as if the blob exists on the service with a size of 0 bytes. However, if you retrieve the block list you will be able to see the blocks that were uploaded to the uncommitted block list.

Mitigations

When looking at performance, it is important to distinguish between throughput and latency. If your scenario requires low latency for a single blob upload, then the parallel upload feature is designed to meet this need. To get around the above issue, which should be a rare occurrence, you could catch the exception and retry the operation using the current Storage Client Library. Alternatively, the following code can be used to perform the necessary PutBlock / PutBlockList operations to work around this issue:

/// Joe Giardino, Microsoft 2011
/// <summary>
/// Extension class to provide ParallelUpload on CloudBlockBlobs.
/// </summary>
public static class ParallelUploadExtensions
{
    /// <summary>
    /// Performs a parallel upload operation on a block blob using the associated service client configuration
    /// </summary>
    /// <param name="blobRef">The reference to the blob.</param>
    /// <param name="sourceStream">The source data to upload.</param>
    /// <param name="blockIdSequenceNumber">The initial block ID; each subsequent block will be an increment of this value.</param>
    /// <param name="options">BlobRequestOptions to use for each upload, can be null.</param>
    public static void ParallelUpload(this CloudBlockBlob blobRef, Stream sourceStream, long blockIdSequenceNumber, BlobRequestOptions options)
    {
        // Parameter Validation & Locals
        if (null == blobRef.ServiceClient)
        {
            throw new ArgumentException("Blob Reference must have a valid service client associated with it");
        }

        if (sourceStream.Length - sourceStream.Position == 0)
        {
            throw new ArgumentException("Cannot upload empty stream.");
        }

        if (null == options)
        {
            options = new BlobRequestOptions()
            {
                Timeout = blobRef.ServiceClient.Timeout,
                RetryPolicy = RetryPolicies.RetryExponential(RetryPolicies.DefaultClientRetryCount, RetryPolicies.DefaultClientBackoff)
            };
        }

        bool moreToUpload = true;
        List<IAsyncResult> asyncResults = new List<IAsyncResult>();
        List<string> blockList = new List<string>();

        using (MD5 fullBlobMD5 = MD5.Create())
        {
            do
            {
                int currentPendingTasks = asyncResults.Count;
                for (int i = currentPendingTasks; i < blobRef.ServiceClient.ParallelOperationThreadCount && moreToUpload; i++)
                {
                    // Step 1: Create block streams in a serial order as stream can only be read sequentially
                    string blockId = null;

                    // Dispense Block Stream
                    int blockSize = (int)blobRef.ServiceClient.WriteBlockSizeInBytes;
                    int totalCopied = 0, numRead = 0;
                    MemoryStream blockAsStream = null;
                    blockIdSequenceNumber++;

                    int blockBufferSize = (int)Math.Min(blockSize, sourceStream.Length - sourceStream.Position);
                    byte[] buffer = new byte[blockBufferSize];
                    blockAsStream = new MemoryStream(buffer);

                    do
                    {
                        numRead = sourceStream.Read(buffer, totalCopied, blockBufferSize - totalCopied);
                        totalCopied += numRead;
                    }
                    while (numRead != 0 && totalCopied < blockBufferSize);

                    // Update Running MD5 Hashes
                    fullBlobMD5.TransformBlock(buffer, 0, totalCopied, null, 0);
                    blockId = GenerateBase64BlockID(blockIdSequenceNumber);

                    // Step 2: Fire off consumer tasks that may finish on other threads
                    blockList.Add(blockId);
                    IAsyncResult asyncresult = blobRef.BeginPutBlock(blockId, blockAsStream, null, options, null, blockAsStream);
                    asyncResults.Add(asyncresult);

                    if (sourceStream.Length == sourceStream.Position)
                    {
                        // No more upload tasks
                        moreToUpload = false;
                    }
                }

                // Step 3: Wait for 1 or more put blocks to finish and finish operations
                if (asyncResults.Count > 0)
                {
                    int waitTimeout = options.Timeout.HasValue ? (int)Math.Ceiling(options.Timeout.Value.TotalMilliseconds) : Timeout.Infinite;
                    int waitResult = WaitHandle.WaitAny(asyncResults.Select(result => result.AsyncWaitHandle).ToArray(), waitTimeout);

                    if (waitResult == WaitHandle.WaitTimeout)
                    {
                        throw new TimeoutException(String.Format("ParallelUpload Failed with timeout = {0}", options.Timeout.Value));
                    }

                    // Optimize away any other completed operations
                    for (int index = 0; index < asyncResults.Count; index++)
                    {
                        IAsyncResult result = asyncResults[index];
                        if (result.IsCompleted)
                        {
                            // Dispose of memory stream
                            (result.AsyncState as IDisposable).Dispose();
                            asyncResults.RemoveAt(index);
                            blobRef.EndPutBlock(result);
                            index--;
                        }
                    }
                }
            }
            while (moreToUpload || asyncResults.Count != 0);

            // Step 4: Calculate MD5 and do a PutBlockList to commit the blob
            fullBlobMD5.TransformFinalBlock(new byte[0], 0, 0);
            byte[] blobHashBytes = fullBlobMD5.Hash;
            string blobHash = Convert.ToBase64String(blobHashBytes);
            blobRef.Properties.ContentMD5 = blobHash;
            blobRef.PutBlockList(blockList, options);
        }
    }

    /// <summary>
    /// Generates a unique Base64 encoded blockID
    /// </summary>
    /// <param name="seqNo">The block's sequence number in the given upload operation.</param>
    /// <returns></returns>
    private static string GenerateBase64BlockID(long seqNo)
    {
        // 9 bytes needed since base64 encoding requires 6 bits per character (6*12 = 8*9)
        byte[] tempArray = new byte[9];
        for (int m = 0; m < 9; m++)
        {
            tempArray[8 - m] = (byte)((seqNo >> (8 * m)) & 0xFF);
        }

        return Convert.ToBase64String(tempArray);
    }
}

Note: In order to prevent potential block collisions when uploading to a pre-existing blob, use a non-constant blockIdSequenceNumber. To generate a random starting ID you can use the following code.

Random rand = new Random();
long blockIdSequenceNumber = (long)rand.Next() << 32;
blockIdSequenceNumber += rand.Next();

Instead of uploading a single blob in parallel, if your target scenario is uploading many blobs you may consider enforcing parallelism at the application layer. This can be achieved by performing a number of simultaneous uploads on N blobs while setting CloudBlobClient.ParallelOperationThreadCount = 1 (which will cause the Storage Client Library to not utilize the parallel upload feature). When uploading many blobs simultaneously, applications should be aware that the largest blob may take longer than the smaller blobs and start uploading the larger blob first. In addition, if the application is waiting on all blobs to be uploaded before continuing, then the last blob to complete may be the critical path and parallelizing its upload could reduce the overall latency.
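As a hedged sketch of that application-layer approach: the files list, container, and largest-first ordering below are illustrative, dedicated threads are used so the synchronous upload calls do not tie up ThreadPool threads (per the earlier deadlock post), and one thread per blob is only reasonable for a modest number of blobs. The usual System.IO, System.Linq, System.Threading and System.Collections.Generic namespaces are assumed.

bClient.ParallelOperationThreadCount = 1;   // disable the parallel single-blob upload feature

List<Thread> uploadThreads = new List<Thread>();
foreach (string path in files.OrderByDescending(f => new FileInfo(f).Length))
{
    string localPath = path;   // capture for the closure
    Thread worker = new Thread(() =>
    {
        CloudBlob blob = container.GetBlobReference(Path.GetFileName(localPath));
        blob.UploadFile(localPath);   // synchronous; safe here since these are not ThreadPool threads
    });
    worker.Start();
    uploadThreads.Add(worker);
}

uploadThreads.ForEach(t => t.Join());   // wait for all uploads to complete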

Lastly, it is important to understand the implications of using the parallel single blob upload feature at the same time as parallelizing multiple blob uploads at the application layer. If your scenario initiates 30 simultaneous blob uploads using the parallel single blob upload feature, the default CloudBlobClient settings will cause the Storage Client Library to use potentially 240 simultaneous put block operations (8 x 30) on a machine with 8 logical processors. In general it is recommended to use the number of logical processors to determine parallelism; in this case setting CloudBlobClient.ParallelOperationThreadCount = 1 should not adversely affect your overall throughput, as the Storage Client Library will still be performing 30 operations (in this case put blocks) simultaneously. Additionally, an excessively large number of concurrent operations will have an adverse effect on overall system performance due to ThreadPool demands as well as frequent context switches. In general, if your application is already providing parallelism you may consider avoiding the parallel upload feature altogether by setting CloudBlobClient.ParallelOperationThreadCount = 1.

This race condition in parallel single blob upload will be addressed in a future release of the SDK. Please feel free to leave comments or questions,

Joe Giardino

Windows Azure Storage Client Library: Rewinding stream position less than BlobStream.ReadAheadSize can result in lost bytes from BlobStream.Read()



Update 3/09/2011: The bug is fixed in the Windows Azure SDK March 2011 release.

In the current Windows Azure storage client library, BlobStream.Read() may read less than the requested number of bytes if the user rewinds the stream position. This occurs when using the seek operation to a position which is equal or less than BlobStream.ReadAheadSize byte(s) away from the previous start position. Furthermore, in this case, if BlobStream.Read() is called again to read the remaining bytes, data from an incorrect position will be read into the buffer.

What does ReadAheadSize property do?

BlobStream.ReadAheadSize is used to define how many extra bytes to prefetch in a get blob request when BlobStream.Read() is called. This design is supposed to ensure that the storage client library does not need to send another request to the blob service if BlobStream.Read() is called again to read N bytes from the current position, where N < BlobStream.ReadAheadSize. It is an optimization for reading blobs in the forward direction, which reduces the number of get blob requests sent to the blob service when reading a blob.

This bug impacts users only if their scenario involves rewinding the stream to read, i.e. using Seek operation to seek to a position BlobStream.ReadAheadSize bytes less than the previous byte offset.

The root cause of this issue is that the number of bytes to read is incorrectly calculated in the storage client library when the stream position is rewound by N bytes using Seek, where N <=BlobStream.ReadAheadSize bytes away from the previous read’s start offset. (Note, if the stream is rewound more than BlobStream.ReadAheadSize bytes away from the previous start offset, the stream reads work as expected.)

To understand this issue better, let us explain this using an example of user code that exhibits this bug.

We begin with getting a BlobStream that we can use to read the blob, which is 16 MB in size. We set the ReadAheadSize to 16 bytes. We then seek to offset 100 and read 16 bytes of data:

BlobStream stream = blob.OpenRead();
stream.ReadAheadSize = 16;
int bufferSize = 16;
int readCount;
byte[] buffer1 = new byte[bufferSize];
stream.Seek(100, System.IO.SeekOrigin.Begin);
readCount = stream.Read(buffer1, 0, bufferSize);

BlobStream.Read() works as expected: buffer1 is filled with 16 bytes of the blob data from offset 100. Because ReadAheadSize is set to 16, the Storage Client issues a read request for 32 bytes of data, as seen in the “x-ms-range” header set to 100-131 in the request trace. The response, as the Content-Length shows, returns the 32 bytes:

Request and response trace:

Request header:
GET http://foo.blob.core.windows.net/test/blob?timeout=90 HTTP/1.1
x-ms-version: 2009-09-19
User-Agent: WA-Storage/0.0.0.0
x-ms-range: bytes=100-131

Response header:
HTTP/1.1 206 Partial Content
Content-Length: 32
Content-Range: bytes 100-131/16777216
Content-Type: application/octet-stream

We will now rewind the stream to 10 bytes away from the previous read’s start offset (previous start offset was at 100 and so the new offset is 90). It is worth noting that 10 is < ReadAheadSize which exhibits the problem (note, if we had set the seek to be > ReadAheadSize back from 100, then everything would work as expected). We then issue a Read for 16 bytes starting from offset 90.

byte[] buffer2 = new byte[bufferSize];
stream.Seek(90, System.IO.SeekOrigin.Begin);
readCount = stream.Read(buffer2, 0, bufferSize);

BlobStream.Read() does not work as expected here. It is called to read 16 bytes of the blob data from offset 90 into buffer2, but only 9 bytes of blob data is downloaded because the Storage Client has a bug in calculating the size it needs to read as seen in the trace below. We see that x-ms-range is set to 9 bytes (range = 90-98 rather than 90-105) and the content-length in the response set to 9.

Request and response trace:

Request header:
GET http://foo.blob.core.windows.net/test/blob?timeout=90 HTTP/1.1
x-ms-version: 2009-09-19
User-Agent: WA-Storage/0.0.0.0
x-ms-range: bytes=90-98

Response header:
HTTP/1.1 206 Partial Content
Content-Length: 9
Content-Range: bytes 90-98/16777216
Content-Type: application/octet-stream

Now, since the previous request for reading 16 bytes just returned 9 bytes, the client will issue another Read request to continue reading the remaining 7 bytes:

readCount = stream.Read(buffer2, readCount, bufferSize - readCount);

BlobStream.Read() still does not work as expected here. It is called to read the remaining 7 bytes into buffer2, but the whole blob is downloaded, as seen in the request and response trace below. Due to a bug in the Storage Client, an incorrect range is sent to the service, which then returns the entire blob data, resulting in incorrect data being read into the buffer. The request trace shows that the range is invalid: 99-98. The invalid range causes the Windows Azure Blob service to return the entire content, as seen in the response trace. Since the client does not check the range and was expecting the starting offset to be 99, it copies the 7 bytes from the beginning of the stream, which is incorrect.

Request and response trace:

Request header:
GET http://foo.blob.core.windows.net/test/blob?timeout=90 HTTP/1.1
x-ms-version: 2009-09-19
User-Agent: WA-Storage/0.0.0.0
x-ms-range: bytes=99-98

Response header:
HTTP/1.1 200 OK
Content-Length: 16777216
Content-Type: application/octet-stream

Mitigation

The workaround is to set the value of BlobStream.ReadAheadSize to 0 before BlobStream.Read() is called if a rewind operation is required:

BlobStream stream = blob.OpenRead();
stream.ReadAheadSize = 0;

As we explained above, the property BlobStream.ReadAheadSize is an optimization which can reduce the number of the requests to send when reading blobs in the forward direction, and setting it to 0 removes that benefit.

Summary

To summarize, the bug in the client library can result in data from an incorrect offset being read. This happens only when the user code seeks to a position less than the previous offset, where the distance is < ReadAheadSize. The bug will be fixed in a future release of the Windows Azure SDK and we will post a link to the download here once it is released.

Justin Yu


Getting the Page Ranges of a Large Page Blob in Segments

One of the blob types supported by Windows Azure Storage is the Page Blob. Page Blobs provide efficient storage of sparse data by physically storing only pages that have been written and not cleared. Each page is 512 bytes in size. The Get Page Ranges REST service call returns a list of all contiguous page ranges that contain valid data. In the Windows Azure Storage Client Library, the method GetPageRanges exposes this functionality.
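
For reference, a single call to the library method looks like the following sketch (the container and blob names are placeholders):

CloudPageBlob pageBlob = container.GetPageBlobReference("disk.vhd");

// Enumerate every valid (written and not cleared) page range in one service call.
foreach (PageRange range in pageBlob.GetPageRanges())
{
    Console.WriteLine("Valid data: bytes {0}-{1}", range.StartOffset, range.EndOffset);
}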

Get Page Ranges may fail when the service takes too long to process the request. Like all Blob REST APIs, Get Page Ranges takes a timeout parameter that specifies the total time allowed for the request, including reading and writing over the network. However, the server is allowed only a fixed amount of time to process the request and begin sending the response. If this server timeout expires, the request fails even if the time specified by the API timeout parameter has not elapsed.

In a highly fragmented page blob with a large number of writes, populating the list returned by Get Page Ranges may take longer than the server timeout, and the request will fail. Therefore, if your application’s usage pattern produces page blobs with a large number of writes and you want to call GetPageRanges, we recommend that your application retrieve a subset of the page ranges at a time.

For example, suppose a 500 GB page blob was populated with 500,000 writes throughout the blob. By default the storage client specifies a timeout of 90 seconds for the Get Page Ranges operation. If Get Page Ranges does not complete within the server timeout interval then the call will fail. This can be solved by fetching the ranges in groups of, say, 50 GB. This splits the work into ten requests. Each of these requests would then individually complete within the server timeout interval, allowing all ranges to be retrieved successfully.

To be certain that the requests complete within the server timeout interval, fetch ranges in segments spanning 150 MB each. This is safe even for maximally fragmented page blobs. If a page blob is less fragmented then larger segments can be used.

Client Library Extension

We present below a simple extension method for the storage client that addresses this issue by providing a rangeSize parameter and splitting the requests into ranges of the given size. The resulting IEnumerable object lazily iterates through page ranges, making service calls as needed.

As a consequence of splitting the request into ranges, any page ranges that span across the rangeSize boundary are split into multiple page ranges in the result. Thus for a range size of 10 GB, the following range spanning 40 GB

[0 – 42949672959]

would be split into four ranges spanning 10 GB each:

[0 – 10737418239]
[10737418240 – 21474836479]
[21474836480 – 32212254719]
[32212254720 – 42949672959].

With a range size of 20 GB the above range would be split into just two ranges.

Note that a custom timeout may be used by specifying a BlobRequestOptions object as a parameter, but the method below does not use any retry policy. The specified timeout is applied to each of the service calls individually. If a service call fails for any reason then GetPageRanges throws an exception.

namespace Microsoft.WindowsAzure.StorageClient
{
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Net;
    using Microsoft.WindowsAzure.StorageClient.Protocol;

    /// <summary>
    /// Class containing an extension method for the <see cref="CloudPageBlob"/> class.
    /// </summary>
    public static class CloudPageBlobExtensions
    {
        /// <summary>
        /// Enumerates the page ranges of a page blob, sending one service call as needed for each
        /// <paramref name="rangeSize"/> bytes.
        /// </summary>
        /// <param name="pageBlob">The page blob to read.</param>
        /// <param name="rangeSize">The range, in bytes, that each service call will cover. This must be a multiple of
        ///     512 bytes.</param>
        /// <param name="options">The request options, optionally specifying a timeout for the requests.</param>
        /// <returns>An <see cref="IEnumerable"/> object that enumerates the page ranges.</returns>
        public static IEnumerable<PageRange> GetPageRanges(
            this CloudPageBlob pageBlob,
            long rangeSize,
            BlobRequestOptions options)
        {
            int timeout;

            if (options == null || !options.Timeout.HasValue)
            {
                timeout = (int)pageBlob.ServiceClient.Timeout.TotalSeconds;
            }
            else
            {
                timeout = (int)options.Timeout.Value.TotalSeconds;
            }

            if ((rangeSize % 512) != 0)
            {
                throw new ArgumentOutOfRangeException("rangeSize", "The range size must be a multiple of 512 bytes.");
            }

            long startOffset = 0;
            long blobSize;

            do
            {
                // Generate a web request for getting page ranges
                HttpWebRequest webRequest = BlobRequest.GetPageRanges(
                    pageBlob.Uri,
                    timeout,
                    pageBlob.SnapshotTime,
                    null /* lease ID */);

                // Specify a range of bytes to search
                webRequest.Headers["x-ms-range"] = string.Format(
                    "bytes={0}-{1}",
                    startOffset,
                    startOffset + rangeSize - 1);

                // Sign the request
                pageBlob.ServiceClient.Credentials.SignRequest(webRequest);

                List<PageRange> pageRanges;

                using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
                {
                    // Refresh the size of the blob
                    blobSize = long.Parse(webResponse.Headers["x-ms-blob-content-length"]);

                    GetPageRangesResponse getPageRangesResponse = BlobResponse.GetPageRanges(webResponse);

                    // Materialize the response so we can close the webResponse
                    pageRanges = getPageRangesResponse.PageRanges.ToList();
                }

                // Lazily return each page range in this result segment.
                foreach (PageRange range in pageRanges)
                {
                    yield return range;
                }

                startOffset += rangeSize;
            }
            while (startOffset < blobSize);
        }
    }
}

Usage Examples:

pageBlob.GetPageRanges(10L * 1024 * 1024 * 1024 /* 10 GB */, null);
pageBlob.GetPageRanges(150 * 1024 * 1024 /* 150 MB */, options /* custom timeout in options */);
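
Because the extension method returns a lazily evaluated IEnumerable, a service call is issued only when the enumeration reaches the next segment. For example:

// Fetch ranges in 150 MB segments; one Get Page Ranges request is sent per segment as it is reached.
foreach (PageRange range in pageBlob.GetPageRanges(150 * 1024 * 1024, null))
{
    Console.WriteLine("bytes {0}-{1}", range.StartOffset, range.EndOffset);
}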

Summary

For some fragmented page blobs, the GetPageRanges API call might not complete within the maximum server timeout interval. To solve this, the page ranges can be incrementally fetched for a fraction of the page blob at a time, thus decreasing the time any single service call takes. We present an extension method implementing this technique in the Windows Azure Storage Client Library.

Michael Roberson

Azure Files Preview Update

At Build 2015 we announced that technical support is now available for Azure Files customers who have support subscriptions. We are also pleased to announce several additional updates to the Azure Files service, made in response to customer feedback. Please check them out below:

New REST API Features

Server Side Copy File

Copy File allows you to copy a blob or file to a destination file within the same storage account or across different storage accounts, entirely on the server side. Before this update, performing a copy operation with the REST API or SMB required you to download the file or blob and re-upload it to its destination.
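
As a rough illustration, the sketch below shows what a server-side copy from a blob to a file might look like with the .NET client library. It is a sketch, not the definitive API: it assumes a CloudFile.StartCopy overload that accepts a source URI, and the connection string, SAS token, account, share, and file names are all placeholders.

// Hypothetical sketch of a server-side copy from a blob to a file; names are placeholders.
CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
CloudFileClient fileClient = account.CreateCloudFileClient();
CloudFileShare share = fileClient.GetShareReference("myshare");
CloudFile destFile = share.GetRootDirectoryReference().GetFileReference("copied-data.bin");

// The source must be readable by the service, e.g. a blob URI with a SAS token appended.
Uri sourceUri = new Uri("https://myaccount.blob.core.windows.net/mycontainer/source.bin" + sasToken);
string copyId = destFile.StartCopy(sourceUri);   // the copy runs on the server; its status can be polled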

File SAS

You can now provide access to file shares and individual files by using SAS (shared access signatures) in REST API calls.
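
For example, a sketch of generating a read-only SAS for a single file might look like the following. It assumes the SharedAccessFilePolicy type exposed by the .NET client; the share reference and file name are placeholders (share as in the previous sketch).

// Hypothetical sketch: generate a read-only SAS for one file, valid for 24 hours.
CloudFile file = share.GetRootDirectoryReference().GetFileReference("report.pdf");

SharedAccessFilePolicy policy = new SharedAccessFilePolicy
{
    Permissions = SharedAccessFilePermissions.Read,
    SharedAccessExpiryTime = DateTimeOffset.UtcNow.AddHours(24)
};

string sasToken = file.GetSharedAccessSignature(policy);
Uri readOnlyUri = new Uri(file.Uri + sasToken);   // can be handed out for direct, time-limited read access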

Share Size Quota

Another new feature for Azure Files is the ability to set the “share size quota” via the REST API. This means that you can now set limits on the size of file shares. When the sum of the sizes of the files on the share exceeds the quota set on the share, you will not be able to increase the size of the files in the share.
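
A sketch of setting a quota from the .NET client might look like this. It assumes the share quota is surfaced on the share’s properties; the 100 GB value and share name are placeholders (fileClient as in the earlier sketch).

// Hypothetical sketch: cap a share at 100 GB.
CloudFileShare share = fileClient.GetShareReference("myshare");
share.FetchAttributes();          // load the current properties
share.Properties.Quota = 100;     // quota is expressed in GB
share.SetProperties();            // persist the new quota on the service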

Get/Set Directory Metadata

The new Get/Set Directory Metadata operation allows you to get/set all user-defined metadata for a specified directory.
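
For example (a sketch assuming the .NET client surfaces directory metadata the same way it does for files and shares; the directory name and metadata key are placeholders):

// Hypothetical sketch: set user-defined metadata on a directory.
CloudFileDirectory directory = share.GetRootDirectoryReference().GetDirectoryReference("reports");
directory.FetchAttributes();                   // populate any existing metadata
directory.Metadata["department"] = "finance";  // user-defined key/value pair
directory.SetMetadata();                       // persist the metadata on the service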

CORS Support

Cross-Origin Resource Sharing (CORS) has been supported in the Blob, Table, and Queue services since November 2013. We are pleased to announce that CORS will now be supported in Files.

Learn more about these new features by checking out the Azure Files REST API documentation.

Library and Tooling Updates

The client libraries that support these new features are .NET (desktop), Node.js, Java, Android, ASP.NET 5, Windows Phone, and Windows Runtime. Azure PowerShell and Azure CLI also support all of these features, except for Get/Set Directory Metadata. In addition, the newest version of AzCopy now uses the server-side Copy File feature.

If you’d like to learn more about using the client libraries and tooling with Azure Files, a great way to get started is to check out our tutorial for using Azure Files with PowerShell and .NET.

As always, if you have any feature requests please let us know by submitting your ideas to Azure Storage Feedback.

Thanks!

Azure Storage Team

AzCopy – Introducing Append Blob, File Storage Asynchronous Copying, File Storage Share SAS, Table Storage data exporting to CSV and more

We are pleased to announce that AzCopy 3.2.0 and AzCopy 4.2.0-preview are now released! These two releases introduce the following new features:

Append Blob

Append Blob is a new Microsoft Azure Storage blob type that is optimized for fast append operations, making it ideal for scenarios where data must be added to an existing blob without modifying its existing contents (e.g. logging and auditing). For more details, please see Introducing Azure Storage Append Blob.

Both AzCopy 3.2.0 and 4.2.0-preview will include the support for Append Blob in the following scenarios:

  • Download an append blob (same as downloading a block or page blob):
AzCopy /Source:https://myaccount.blob.core.windows.net/mycontainer /Dest:C:\myfolder /SourceKey:key /Pattern:appendblob1.txt
  • Upload an append blob (add the option /BlobType:Append to specify the blob type):
AzCopy /Source:C:\myfolder /Dest:https://myaccount.blob.core.windows.net/mycontainer /DestKey:key /Pattern:appendblob1.txt /BlobType:Append
  • Copy an append blob (there is no need to specify /BlobType):
AzCopy /Source:https://myaccount.blob.core.windows.net/mycontainer1 /Dest:https://myaccount.blob.core.windows.net/mycontainer2 /SourceKey:key /DestKey:key /Pattern:appendblob1.txt

Note that when uploading or copying append blobs whose names already exist in the destination, AzCopy will prompt you to either overwrite or skip them. Trying to overwrite a blob with a mismatched blob type will fail; for example, AzCopy will report a failure when overwriting a block blob with an append blob.

AzCopy does not support appending data to an existing append blob, and if you are using an older version of AzCopy, download and copy operations will fail with the following error message when the source container includes append blobs.

Error parsing the source location “[the source URL specified in the command line]”: The remote server returned an error: (409) Conflict. The type of a blob in the container is unrecognized by this version.

 

File Storage Asynchronous Copy (4.2.0 only)

Azure Storage File Service adds several new features with Storage Service REST version 2015-02-21; please find more details at Azure Storage File Preview Update.

In the previous version, AzCopy 4.1.0, we introduced synchronous copy for Blob and File; now AzCopy 4.2.0-preview includes support for the following File Storage asynchronous copy scenarios.

Unlike synchronous copy, which simulates the copy by downloading the blobs from the source storage endpoint into local memory and then uploading them to the destination storage endpoint, File Storage asynchronous copy is a server-side copy that runs in the background, and you can query the copy status programmatically. Please find more details at Server Side Copy File.

  • Asynchronous copying from File Storage to File Storage
AzCopy /Source:https://myaccount1.file.core.windows.net/myfileshare1/ /Dest:https://myaccount2.file.core.windows.net/myfileshare2/ /SourceKey:key1 /DestKey:key2 /S
  • Asynchronous copying from File Storage to Block Blob
AzCopy /Source:https://myaccount1.file.core.windows.net/myfileshare/ /Dest:https://myaccount2.blob.core.windows.net/mycontainer/ /SourceKey:key1 /DestKey:key2 /S
  • Asynchronous copying from Block/Page Blob Storage to File Storage
AzCopy /Source:https://myaccount1.blob.core.windows.net/mycontainer/ /Dest:https://myaccount2.file.core.windows.net/myfileshare/ /SourceKey:key1 /DestKey:key2 /S

Note that asynchronous copying from File Storage to Page Blob is not supported.

 

File Storage Share SAS (Preview version 4.2.0 only)

Besides asynchronous copy for File Storage, another new File Storage feature, ‘File Share SAS’, is supported in AzCopy 4.2.0-preview as well.

You can now use the /SourceSAS and /DestSAS options to authenticate the file transfer request.

AzCopy /Source:https://myaccount1.file.core.windows.net/myfileshare1/ /Dest:https://myaccount2.file.core.windows.net/myfileshare2/ /SourceSAS:SAS1 /DestSAS:SAS2 /S

For more details about File Storage share SAS, please visit Azure Storage File Preview Update.

 

Export Table Storage entities to CSV (Preview version 4.2.0 only)

AzCopy has allowed end users to export Table entities to local files in JSON format since the 4.0.0 preview version; now you can specify the new option /PayloadFormat:<JSON | CSV> to export data to CSV files. If this option is not specified, AzCopy exports Table entities to JSON files.

AzCopy /Source:https://myaccount.table.core.windows.net/myTable/ /Dest:C:\myfolder\ /SourceKey:key /PayloadFormat:CSV

Besides the data files with the .csv extension that will be written to the location specified by the /Dest parameter, AzCopy generates a schema file with the .schema.csv extension for each data file.

Note that AzCopy does not support importing CSV data files; you can continue to use the JSON format to export and import as in previous versions of AzCopy.

 

Specify the manifest file name when exporting Table entities (Preview version 4.2.0 only)

AzCopy requires end users to specify the /Manifest option when importing table entities. In previous versions, the manifest file name was generated by AzCopy during the export (for example, “myaccount_mytable_timestamp.manifest”), and users had to find that name in the destination folder before writing the import command line.

Now you can specify the manifest file name during export with the /Manifest option, which brings more flexibility and convenience to your import scenarios.

AzCopy /Source:https://myaccount.table.core.windows.net/myTable/ /Dest:C:\myfolder\ /SourceKey:key /Manifest:abc.manifest

 

Enable FIPS compliant MD5 algorithm

By default, AzCopy uses the .NET MD5 implementation to calculate the MD5 hash when copying objects. We now include support for a FIPS-compliant MD5 setting to meet the security requirements of some scenarios.

You can create an app.config file named “AzCopy.exe.config” with the “AzureStorageUseV1MD5” property and place it alongside AzCopy.exe.

<?xml version="1.0" encoding="utf-8" ?> 
<configuration>
<appSettings>
<add key="AzureStorageUseV1MD5" value="false"/>
</appSettings>
</configuration>

For the “AzureStorageUseV1MD5” property:

  • true – the default value; AzCopy uses the .NET MD5 implementation.
  • false – AzCopy uses a FIPS-compliant MD5 algorithm.

Note that FIPS-compliant algorithms are disabled by default on your Windows machine. You can type secpol.msc in the Run window and check this switch under “Security Settings -> Local Policies -> Security Options -> System cryptography: Use FIPS compliant algorithms for encryption, hashing and signing”.

 

Reference

Azure Storage File Preview Update

Microsoft Azure Storage Release – Append Blob, New Azure File Service Features and Client Side Encryption General Availability

Introducing Azure Storage Append Blob

Enable FISMA MD5 setting via Microsoft Azure Storage Client Library for .NET

Getting Started with the AzCopy Command-Line Utility

As always, we look forward to your feedback.

Microsoft Azure Storage Team

Issue in Azure Storage Client Library 5.0.0 and 5.0.1 preview in AppendBlob functionality

An issue was recently discovered in the Azure Storage Client Library 5.0.0 for .NET and in the Azure Storage Client Library 5.0.1 preview for .NET. It impacts the Windows desktop and phone targets. The details of the issue are as follows:

When CloudAppendBlob.AppendTextAsync(), the method that appends a string of text to an append blob asynchronously, is invoked with only the content parameter specified, or with only the content and CancellationToken parameters specified, the call overwrites the blob content instead of appending to it. Other synchronous and asynchronous invocations for appending a string of text to an append blob (CloudAppendBlob.AppendText(), and CloudAppendBlob.AppendTextAsync() with additional parameters) do not manifest the issue.
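
For illustration, the sketch below contrasts the affected and unaffected calls. The container reference and blob name are placeholders, and the asynchronous calls are assumed to run inside an async method.

// Illustrative sketch; "container" and "app.log" are placeholders.
CloudAppendBlob appendBlob = container.GetAppendBlobReference("app.log");

// Affected in 5.0.0 and 5.0.1-preview: these overloads overwrite the blob instead of appending.
await appendBlob.AppendTextAsync("line 1\n");
await appendBlob.AppendTextAsync("line 2\n", CancellationToken.None);

// Not affected: for example, the synchronous overload on desktop targets.
appendBlob.AppendText("line 3\n");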

The Azure Storage team has released hotfixes for both releases: updated versions 5.0.2 and 5.0.3-preview, respectively. If you installed either the Azure Storage Client Library 5.0.0 for .NET or the Azure Storage Client Library 5.0.1 preview for .NET, please make sure to update your references to the corresponding package. You can install these versions from:

  1. The Visual Studio NuGet Package Manager UI.
  2. The Package Manager console using the following command (the released version for instance): Install-Package WindowsAzure.Storage -Version 5.0.2
  3. The NuGet gallery web page that houses the package: here for the released version and here for the preview version.

Please note the following:

  1. The older versions will be unlisted in the Visual Studio NuGet Package Manager UI.
  2. If you attempt to launch the web page that contained the original package, you may encounter a 404 error.
  3. We recommend that you do not install the older versions through the Package Manager console, so that you don’t run into this issue.

Thank you for your support of Azure Storage. We look forward to your continued feedback.

Microsoft Azure Storage Team

Introducing the Azure Storage Client Library for iOS (Public Preview)

We are excited to announce the public preview of the Azure Storage Client Library for iOS!

Having a client library for iOS is essential to providing a complete mobile story for developers. With this release, developers can now take advantage of Azure Storage on all major mobile platforms: Windows Phone, iOS, Android, and Xamarin.

Currently, this library supports iOS 9, iOS 8 and iOS 7 and can be used with both Objective-C and Swift. This library also supports the latest Azure Storage service version 2015-02-21.

Since this is the first release, we want to make sure we take advantage of the wealth of knowledge provided by the iOS developer community. For this reason, we’ll be releasing block blob support first, with the goal of soliciting feedback and better understanding the additional scenarios you would like to see supported.

Please check out How to use Blob Storage from iOS to get started. You can also download the sample app to quickly see the use of Azure Storage in an iOS application.

As always, if you have any feature requests please let us know by submitting your ideas to Azure Storage Feedback.

We’d also like to give a special thanks to all those who joined our preview program and contributed their ideas and suggestions.

Thanks!

Azure Storage Team
