Google Cloud Storage is often used along with CloudML to manage and serve training data. This article provides details on:
Copying and synchronizing files between your local workstation and Google Cloud.
Reading data from Google Cloud Storage buckets from within a training script.
Varying data source configuration between local script development and CloudML training.
Google Cloud Storage is organized around storage units named “buckets”, which are roughly analogous to filesystem directories. You can copy data between your local system and cloud storage using the
gs_copy() function. For example:
library(cloudml) # copy from a local directory to a bucket gs_copy("training-data", "gs://quarter-deck-529/training-data") # copy from a bucket to a local directory gs_copy("gs://quarter-deck-529/training-data", "training-data")
You can also use the
gs_rsync() function to syncrhonize a local directory and a bucket in Google Storage (this is much more efficient than copying the data each time):
# synchronize a bucket and a local directory gs_rsync("gs://quarter-deck-529/training-data", "training-data")
Note that to use these functions you need to import the cloudml package with
library(cloudml) as illustrated above.
There are two distinct ways to read data from Google Storage. Which you use will depend on whether the TensorFlow API you are using supports direct references to
gs:// bucket URLs.
If you are using the TensorFlow Datasets API, then you can use
gs:// bucket URLs directly. In this case you'll want to use the
gs:// URL when running on CloudML, and a synchonized copy of the bucket when running locally. You can use the
gs_data_dir() function to accomplish this. For example:
library(tfdatasets) library(cloudml) data_dir <- gs_data_dir("gs://mtcars-data") mtcars_csv <- file.path(data_dir, "mtcars.csv") mtcars_dataset <- csv_dataset(mtcars_csv) %>% dataset_prepare(x = c(mpg, disp), y = cyl)
While some TensorFlow APIs can take
gs:// URLs directly, in many cases a local filesystem path will be required. If you want to store data in Google Storage but still use it with APIs that require local paths you can use the
gs_data_dir_local() function to provide the local path.
For example, this code reads CSV files from Google Storage:
library(cloudml) library(readr) data_dir <- gs_data_dir_local("gs://quarter-deck-529/training-data") train_data <- read_csv(file.path(data_dir, "train.csv")) test_data <- read_csv(file.path(data_dir, "test.csv"))
Under the hood this function will rsync data from Google Storage as required to provide the local filesystem interface to it.
Here's another example which creates a Keras image data generator from a bucket:
train_generator <- flow_images_from_directory( gs_data_dir_local("gs://quarter-deck-529/images/train"), image_data_generator(rescale = 1/255), target_size = c(150, 150), batch_size = 32, class_mode = "binary" )
Note that if the path passed to
gs_data_dir_local() is from the local filesystem it will be returned unmodified.
It's often useful to do training script development with a local subsample of data that you've extracted from the complete set of training data. In this configuration, you'll want your training script to dynamically use the local subsample during development then use the complete dataset stored in Google Cloud Storage when running on CloudML. You can accomplish this with a combination of training flags and the
gs_local_dir() function described above.
Here's a complete example. We start with a training script that declares a flag for the location of the training data:
library(keras) library(cloudml) # define a flag for the location of the data directory FLAGS <- flags( flag_string("data_dir", "data") ) # determine the location of the directory (during local development this will # be the default "data" subdirectory specified in the FLAGS declaration above) data_dir <- gs_data_dir_local(FLAGS$data_dir) # read the data train_data <- read_csv(file.path(FLAGS$data_dir, "train.csv"))
Note that the
data_dir R variable is computed by passing
FLAGS$data_dir to the
gs_data_dir_local() function. This enables it to take on a dynamic value depending upon the training environment.
The way to vary this value when running on CloudML is by adding a
flags.yml configuration file to your project directory. For example:
cloudml: data_dir: "gs://quarter-deck-529/training-data"
With the addition of this config file, your script will resolve the
data_dir flag to specified the Google Storage bucket, but only when it is running on CloudML.
You can view and manage data within Google Cloud Storage buckets using either a web based user-interface or via command line utilities included with the Google Cloud SDK.
To access the web-bqsed UI, navigate to https://console.cloud.google.com/storage/browser.
Here's what the storage browser looks like for a sample project:
The Google Cloud SDK includes the
gsutil utility program for managing cloud storage buckets. Documentation for
gsutil can be found here: https://cloud.google.com/storage/docs/gsutil.
gsutil from within a terminal. If you are running within RStudio v1.1 or higher you can activate a terminal with the
Here is an example of using the
gsutil ls command to list the contents of a bucket within a terminal: