Storage¶

Multiple storage options are available; which one to use is determined by your requirements, such as:

Data Model
- Do you need a SQL pipeline?
- Do you use NoSQL?
- Do you just have objects?
Time To Access?
- Nanoseconds
- Microseconds
- Milliseconds

Options include:

Cache
Persistent disks for VMs
Object storage
Memorystore (Redis cache service for caching)
Archival storage

Structured data:

Non-structured data:

Cloud Datastore
Cloud Firestore
Cloud Bigtable

Cache¶

While a cache is an in-memory store with high-speed access, it can’t be considered storage as such, because its contents do not persist if the machine is shut down.

It is possible to save cache contents to a persistent storage solution at intervals, but the recovery point would not be the same as the shutdown point.

Memorystore¶

This managed Redis service provides a larger cache, which may be configured for high availability. Memorystore integrates with:

Compute Engine
App Engine
Kubernetes Engine

As with most GCP services, you choose a region and zone. A basic instance is cheapest, but does not support replicas. The Redis instance will be available on your default network; optionally, you can define its IP range or add labels.

Persistent Disks¶

Persistent storage can be attached to your VMs in Compute Engine and Kubernetes Engine, with a capacity of up to 64 TB. Data is encrypted at rest. Zonal or regional storage is available. If zonal is chosen, the data is stored across multiple physical drives. If regional storage is selected, the data is replicated across zones, providing redundancy.

As block storage, such disks can support file systems. The drive is virtual and accessible via your VM. Local SSD (solid state, i.e. high-performance, low latency) drives are an option, but these do not persist through shutdown of the VM.

SSDs are ideal for high-performance, IOPS-intensive (input/output operations per second) workloads, whether with random or sequential access patterns. They are more expensive than the higher-latency, spinning hard disk drive (HDD).

HDDs can be a better option for storing large amounts of data and undertaking batch processing.

Compare:

NB Read/write rates are measured in I/O operations per second (IOPS) per GB:

Drive	Read	Write	Notes
SSD	local ~300	local ~300	network-attached SSD is a factor of 10 lower
HDD	0.75	1.5	
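
As a worked example of reading the table, the per-second, per-GB figures scale with the provisioned disk size. The sketch below simply multiplies the table's rates by a size in GB. The function and dictionary names are illustrative, the network-attached SSD figure of 30 is derived from the table's note (about 300 reduced by a factor of 10), and any per-instance limits that apply to real disks are not modelled.

```python
# Per-GB rates copied from the table above (operations per second per GB).
# "ssd_network" is derived from the note: local ~300 reduced by a factor of 10.
RATES_PER_GB = {
    "ssd_local": {"read": 300.0, "write": 300.0},
    "ssd_network": {"read": 30.0, "write": 30.0},
    "hdd": {"read": 0.75, "write": 1.5},
}


def estimate_rates(drive: str, size_gb: float) -> dict:
    """Scale the per-GB read/write rates by the provisioned size."""
    per_gb = RATES_PER_GB[drive]
    return {op: rate * size_gb for op, rate in per_gb.items()}


if __name__ == "__main__":
    for drive in RATES_PER_GB:
        print(drive, estimate_rates(drive, 500))
```

The point of the arithmetic is that a larger HDD buys more throughput as well as more capacity, which is why the size of the disk matters even when the data itself is small.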

Snapshots of disks can be created as data backups. Once a snapshot disk has been made into an image and mounted to a new machine, it behaves like a normal disk, i.e. supports read and write access.

Object Storage¶

A Cloud Storage bucket provides simple object storage for exabyte-scale volumes of data, or for data that needs to be shared widely. No data structure is required; each item is simply an “object”. Buckets share a global namespace and, therefore, each bucket name must be unique. Using the project ID as part of the name is a simple way to arrive at a unique name.
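
The project-ID naming tip can be sketched as a small helper. This is a minimal illustration, not an official validator: the function name is hypothetical, and the pattern checks only a simplified subset of the bucket naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores and dots; starting and ending with a letter or digit).

```python
import re

# Simplified subset of the bucket naming rules; the real rules include
# further restrictions that are not checked here.
_NAME_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def make_bucket_name(project_id: str, suffix: str) -> str:
    """Combine a project ID and a descriptive suffix into a candidate name."""
    name = f"{project_id}-{suffix}".lower()
    if not _NAME_RE.fullmatch(name):
        raise ValueError(f"not a valid bucket name: {name!r}")
    return name


if __name__ == "__main__":
    print(make_bucket_name("my-project-123", "uploads"))
```

A name that passes this check can still be rejected at creation time if another project already holds it, so the check only catches formatting mistakes early.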

Buckets are regional resources and are replicated across zones in a region.

Using Buckets¶

How objects in a bucket behave depends on their metadata. For example, if you are going to make a PDF document available to all users and want the browser to offer more than just a download option, set the object’s Content-Type metadata to application/pdf.

This will enable a PDF viewer.
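
One way to set such metadata, assuming the relevant tag is the object's Content-Type, is the `gsutil setmeta` command. The sketch below only builds and prints the command line, so it runs without gsutil installed; the function name, bucket name and file name are illustrative.

```python
import mimetypes
import shlex


def setmeta_command(bucket: str, object_name: str) -> list:
    """Build a `gsutil setmeta` argument list that sets Content-Type."""
    content_type, _ = mimetypes.guess_type(object_name)
    if content_type is None:
        raise ValueError(f"cannot guess a content type for {object_name!r}")
    return [
        "gsutil", "setmeta",
        "-h", f"Content-Type:{content_type}",
        f"gs://{bucket}/{object_name}",
    ]


if __name__ == "__main__":
    # Printed rather than executed; paste the output into Cloud Shell to run it.
    print(shlex.join(setmeta_command("my-bucket", "guide.pdf")))
```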

To ensure that all web traffic has access to the PDF in the bucket, grant read access to all users (allUsers).

The gsutil mb command makes buckets, like so:

gsutil mb -l [location] gs://[bucket-name]

For example:

gsutil mb -l US gs://[projectID-name]

Cloud Storage Fuse¶

There is no file system or folder system with buckets; each object sits at the same level. To create a file structure, Cloud Storage FUSE can be used on Linux to mount a bucket. This allows a file structure to be created from the point of view of the VM(s) on which the bucket is mounted.

Bucket URL¶

Each object in Cloud Storage has its own URL. The objects are held in buckets. These objects are immutable (unchangeable). If you want to edit an image object, for example, you overwrite the image held in that bucket with a new one, with no impact on the URL.

GCP Cloud Storage is like an API that gives you POST, GET, and DELETE but no PUT or PATCH. No fragment of an object may be updated; it is all or nothing.

Security

So, how are my objects kept safe? When in transit, objects are transferred over HTTPS. When in storage, IAM controls who or what can access the object.

Access

An Access Control List (ACL) gives fine-grained control over who/what can access the object.

ACLs have:

Scope = Who/What has access
Permissions = What can be done

Auditing Access logs

Data access logging is not enabled by default and needs to be enabled when setting up the bucket. Data Access audit logs do not record all the data-access operations on resources:
- not those that are publicly shared (available to All Users or All Authenticated Users)
- not those that can be accessed without logging into Google Cloud

Logs can be tracked with Stackdriver and accessed through reports and filters.

Version Control

A history of modifications can be kept if you turn on object versioning for your bucket(s).

If you don’t turn on versioning, a new file will always overwrite the old one with no recourse.

Bucket Lifecycle Policies¶

A set of rules can be applied to buckets. For example, once an object reaches a specified age, it can be moved to Nearline or Coldline storage.

Multi-regional and regional objects can move to Nearline or Coldline
Nearline objects can move to Coldline
The transition is one-way: Nearline can move to Coldline but not back, and Coldline can’t be reverted.

You can assign a lifecycle-management configuration to a bucket. When an object meets the criteria of the rules, Cloud Storage automatically performs the specified action on the object. One of the supported actions is to Delete objects.

The gsutil lifecycle set command enables you to set the lifecycle configuration on a bucket based on a JSON configuration file. The config-json-file specified on the command line should be a path to a local file containing this document.
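
As an illustration of what such a JSON document can look like, the sketch below builds a two-rule configuration and writes it to a local file. The ages, the file name and the helper name are arbitrary examples, not values from the original notes.

```python
import json

# Two illustrative rules: move multi-regional/regional objects to Nearline
# after 30 days, then delete any object after 365 days.
lifecycle_config = {
    "rule": [
        {
            "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
            "condition": {
                "age": 30,
                "matchesStorageClass": ["MULTI_REGIONAL", "REGIONAL"],
            },
        },
        {
            "action": {"type": "Delete"},
            "condition": {"age": 365},
        },
    ]
}


def write_lifecycle_file(path: str) -> None:
    """Write the configuration as the JSON document gsutil expects."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(lifecycle_config, fh, indent=2)


if __name__ == "__main__":
    write_lifecycle_file("lifecycle.json")
    # Then apply it with: gsutil lifecycle set lifecycle.json gs://[BUCKET_NAME]
```

The `age` condition is counted in days, and each rule pairs exactly one action with the conditions that trigger it.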

Storage Classes¶

As with the rest of GCP, location matters. When you create a bucket, you set a region that will minimize latency for your typical user or access point. Or, for global access, choose multi-regional. Multi-regional and regional storage are for buckets that are accessed frequently. The cost drops as you enter the long-term storage options.

GCP Storage Options¶

The various Cloud Storage options:

Multi-regional

Regional

Nearline

30-day minimum storage for data that is accessed less than once per month

Coldline

90-day minimum storage for data accessed less than once a year

Storage class can be changed on the fly to some extent.

Multi-regional is intended for data that is accessed frequently; regional is the same, with the expectation that access occurs from a particular region.

Charges are applied per GB of data stored per month, varying according to the storage class. Data access is also charged for Nearline and Coldline.
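
The rules of thumb above can be written down as a small decision function. This is a rough sketch using only the access-frequency thresholds quoted in this section; the function name and parameters are illustrative, and it ignores pricing, minimum storage durations and retrieval charges.

```python
def suggest_storage_class(accesses_per_year: float, single_region: bool) -> str:
    """Map an expected access frequency onto one of the four classes above."""
    if accesses_per_year < 1:
        return "Coldline"   # accessed less than once a year
    if accesses_per_year < 12:
        return "Nearline"   # accessed less than once a month
    # Frequently accessed data: choose by where the access comes from.
    return "Regional" if single_region else "Multi-regional"


if __name__ == "__main__":
    print(suggest_storage_class(0.5, single_region=True))
    print(suggest_storage_class(6, single_region=True))
    print(suggest_storage_class(1000, single_region=False))
```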

Have a Go¶

Set up a bucket from the console

GCP > Storage > Browser

Click Create bucket.
Provide a globally unique bucket name (think project ID + name)
Click Create.

Set up a bucket from Cloud Shell

gsutil mb gs://<BUCKET_NAME>

Upload a file via Cloud Shell

Click the three dots icon in the Cloud Shell toolbar to display further options.
Click Upload file.
In Cloud Shell’s CLI, type ls to confirm that the file was uploaded.

Copy the file into a pre-existing bucket

gsutil cp [MY_FILE] gs://[BUCKET_NAME]

NB If your filename contains whitespace, place single quotes around the filename. For example, gsutil cp 'uploaded file.txt' gs://[BUCKET_NAME]
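
When commands are generated by a script rather than typed, the quoting can be left to the standard library. The sketch below is an illustration only: the helper name and bucket name are hypothetical, and the command is returned as a string rather than executed.

```python
import shlex


def cp_command(local_path: str, bucket: str) -> str:
    """Return a shell-safe `gsutil cp` command line for one local file."""
    args = ["gsutil", "cp", local_path, f"gs://{bucket}"]
    # shlex.join quotes only the arguments that need it.
    return shlex.join(args)


if __name__ == "__main__":
    print(cp_command("uploaded file.txt", "my-bucket"))
    # prints: gsutil cp 'uploaded file.txt' gs://my-bucket
```

Passing the argument list directly to `subprocess.run` would avoid shell quoting altogether; the string form is shown here because it matches what you would type in Cloud Shell.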

Copy via CLI

gsutil cp {location-on-your-machine} gs://{bucket-name}/

Or grab a bucket’s contents with:

It is just as simple to copy data from one bucket to another, e.g.:

gsutil cp gs://$MY_BUCKET_NAME_1/image.jpg \
gs://$MY_BUCKET_NAME_2/image.jpg

Note that we did not need to specify ‘image.jpg’ in the destination, as the filename was left unchanged. You could specify a new filename at this step, or omit it from the destination to leave the filename as is.

Or, to move that data, use gsutil mv in place of gsutil cp.

Move can also be used, in effect, to rename a bucket: create a new bucket, then move the old bucket’s contents into it.

If you need to verify who has access to a file (access control list, or ACL):

gsutil acl get gs://$MY_BUCKET_NAME_1/image.jpg  > acl.txt

cat acl.txt

To change who has access use:

gsutil acl set private gs://$MY_BUCKET_NAME_1/image.jpg

A publicly hosted piece of content would require more open access, e.g.:

gsutil iam ch allUsers:objectViewer gs://{MY_BUCKET_NAME_1}

From the Storage section of the GCP console, you will be able to pick up the publicly available URL for this item.
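
The public URL can also be assembled by hand. The sketch below assumes the `https://storage.googleapis.com/BUCKET/OBJECT` form and percent-encodes the object name; the function name and the example bucket and object are illustrative.

```python
from urllib.parse import quote


def public_url(bucket: str, object_name: str) -> str:
    """Build the storage.googleapis.com URL for a publicly readable object."""
    # Percent-encode the object name, leaving "/" separators intact.
    return f"https://storage.googleapis.com/{bucket}/{quote(object_name)}"


if __name__ == "__main__":
    print(public_url("my-bucket", "docs/annual report.pdf"))
    # prints: https://storage.googleapis.com/my-bucket/docs/annual%20report.pdf
```

The URL only resolves for anonymous visitors once read access has been granted, as in the gsutil iam ch command above.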

Moving Data¶

The gsutil command is all well and good if you have small requirements that can be handled by your bandwidth via the Chrome browser. If you want to schedule batch transfers there is an HTTPS endpoint service that can connect to an upload facility. Up to a petabyte of data may be transferred this way. Or, you can post your data on a drive (!).

It gets fancier: BigQuery and App Engine can both write data to Cloud Storage.