Term
|
Definition
- SQL-like wrapper for querying in Hadoop
- Very useful for BI/OLAP |
|
|
Term
|
Definition
Database for Fast sequential scans of static data. |
|
|
Term
|
Definition
Data manipulation language for scripting to transform unstructured data to structured |
|
|
Term
|
Definition
Distributed computing with Hadoop. Uses RDDs (resilient distributed datasets). |
|
|
Term
|
Definition
Hadoop Orchestration. Not really very popular. Dataproc does somewhat the same |
|
|
Term
|
Definition
Stream processing of unbounded data sets |
|
|
Term
Google replacement for Hive |
|
Definition
|
|
Term
What are the GCP Compute options? |
|
Definition
- App Engine: PaaS, serverless, ops-free
- Container Engine: clusters of machines running Kubernetes
- Compute Engine: IaaS, fully controllable down to the OS |
|
|
Term
What option should I choose for a simple static web site |
|
Definition
|
|
Term
What option should I choose for a web site needing SSL, HTTPS, CDN |
|
Definition
Firebase Hosting with Cloud Storage |
|
|
Term
What option should I choose for a web site requiring load balancing and autoscaling with fine grain control |
|
Definition
GCP - Google Compute Engine |
|
|
Term
What kinds of services are available when choosing GCE for web hosting? |
|
Definition
- Cloud Launcher for web app deployment
- choose machine sizes and disk sizes
- storage options: cloud buckets, persistent disk, local SSD
- storage technologies: Cloud SQL (MySQL, PostgreSQL; NoSQL)
- Load Balancing at any stack level by GCP or 3rd party products
- DevOps tools |
|
|
Term
How long do you have to react to a preemptible GCE instance shutdown notice? |
|
Definition
|
|
Term
What is the longest period a preemptible GCE instance can be used? |
|
Definition
|
|
Term
What are the storage options for a GCE |
|
Definition
- A small root persistent disk with the OS is included
- Additional options:
- Persistent disk: standard or SSD
- Local SSD
- Cloud Storage buckets |
|
|
Term
What is available with GCE for logging and monitoring? |
|
Definition
|
|
Term
How can you keep data available on a container restart? |
|
Definition
|
|
Term
What two environment choices are available for AppEngine |
|
Definition
- Standard: Java 7, Python 2.7, Go, PHP
- Flexible: Java 8, Python 3.x, .NET, other choices |
|
|
Term
What are the three levels of abstraction for choosing a platform for running your application? |
|
Definition
- Compute Engine
- Container Engine
- App Engine |
|
|
Term
How many VM instances does AppEngine use? |
|
Definition
|
|
Term
Which of the following is a PaaS option for hosting web apps?
- Compute Engine VM
- Container Engine instance
- Cloud storage with Firebase hosting
- App Engine standard or flexible environment
|
|
Definition
App Engine standard or flexible environment |
|
|
Term
Which of the following is an IaaS option for hosting web apps on GCP?
- App Engine standard environment
- Container Engine instance
- Compute Engine instance
- Cloud storage with Firebase hosting
|
|
Definition
|
|
Term
Rank the following storage options from most expensive to cheapest (per GB)
- Cloud storage > SSD persistent disks > Local SSD > standard persistent
- Local SSD > SSD persistent disks > standard persistent > Cloud storage
- Local SSD > standard persistent disks > standard SSD > Cloud storage
- SSD (any type) > Cloud Storage > standard persistent
|
|
Definition
Local SSD > SSD persistent disks > standard persistent > Cloud storage |
|
|
Term
Rank the following options in scope of access
- Cloud storage - global, persistent (SSD and standard ) - zonal, local SSD - instance
- Cloud storage - regional, persistent (SSD and standard ) - regional, local SSD - instance
- All storage options offer global access (but billing rates vary)
- Cloud storage - global, persistent (SSD and standard ) - regional, local SSD - zonal
|
|
Definition
Cloud storage - global, persistent (SSD and standard ) - zonal, local SSD - instance |
|
|
Term
How do storage options differ with container engine instances relative to compute engine instances?
- Cloud storage is global access for container engine instances; regional for compute engine VMs
- No difference in storage options
- BigQuery and BigTable can be used from containers but not from raw compute engine
- Container disks are ephemeral by default; need to use a specific abstraction to make them persistent
|
|
Definition
Container disks are ephemeral by default; need to use a specific abstraction to make them persistent |
|
|
Term
Use cases and Hadoop storage technologies |
|
Definition
|
|
Term
Use cases and GCP storage technologies |
|
Definition
|
|
Term
Name the storage options for Compute, Block Storage |
|
Definition
- Persistent Disks
- Local SSD
|
|
|
Term
Name Storage for media, blob storage |
|
Definition
|
|
Term
Name the bucket storage classes for Cloud Storage |
|
Definition
- Multi-regional - for frequent access globally
- Regional - for frequent access regionally
- Nearline - access once a month max
- Coldline - access once a year
|
|
|
Term
To store media and blob storage, what are the storage technologies in Hadoop and GCP? What is the advantage of the GCP option |
|
Definition
- Hadoop - HDFS
- GCP - Cloud storage
HDFS requires a name node. GCP Cloud Storage does not. |
|
|
Term
What are Hadoop and GCP storage technologies for SQL interface atop file data? What is the advantage of the GCP option? |
|
Definition
- Hadoop - Hive
- GCP - BigQuery
BigQuery is far faster than Hive. It uses columnar storage. Hive sits on top of HDFS. |
|
|
Term
Can BigQuery be used for OLTP requiring ACID? |
|
Definition
No. ACID transactions are not supported by BigQuery. |
|
|
Term
What GCP storage technologies are available for OLTP transaction processing? |
|
Definition
|
|
Term
What GCP technology is available for OLAP? |
|
Definition
|
|
Term
What GCP relational databases technologies are available? |
|
Definition
|
|
Term
What GCP offering supports open source RDBMSs, and which are supported? Which GCP offering is not open source, and what are the differences |
|
Definition
- Cloud SQL supports these
- Cloud Spanner is proprietary
- Cloud Spanner supports auto horizontal scaling.
|
|
|
Term
What GCP offering is available for document data storage? What are its properties? |
|
Definition
DataStore
- Multi-key value store
- Very fast hash based indexes
- Very fast lookups of non-sequential keys
- Same time for queries regardless of data size; query performance depends on result size
- best for low write/high read-intensive needs
- Offers transaction support
|
|
|
Term
What are the mobile specific GCP offerings? |
|
Definition
- Cloud Storage for Firebase: compute, block storage with mobile SDK access
- Firebase Realtime DB: fast, random access with mobile SDK access.
|
|
|
Term
What is the command line tool for managing Google Cloud Storage? |
|
Definition
|
|
Term
What feature can be used to set up automatic deletion of objects in Cloud Storage after a given time period? |
|
Definition
|
|
Term
When is the GCP Transfer service preferred over gsutil for loading data? |
|
Definition
- When transferring from another cloud provider
- When copying files from on-premise for the first time.
|
|
|
Term
What is the GCP equivalent of HDFS |
|
Definition
Cloud Storage. Dataproc uses Cloud Storage instead of HDFS. |
|
|
Term
|
Definition
- Hotspotting
- Interleaving
- Splits
- Primary indexes required; Secondary Indices available (Not a feature in HBase)
- Index directives (force a query to use an index)
- STORING clause: force a column in an index
- Non-normal data types: arrays; arrays of arrays; Structs in queries only.
- Stronger than ACID: guarantees order of commitment
- Transaction modes: locking read-write; read-only; single read call -- doesn't use locking.
- Staleness timestamp bounds: Latest; bounded; exact
|
|
|
Term
What are advantages of BigTable over HBase? |
|
Definition
- Scalability
- Low admin burden
- Cluster resize without downtime
- Many more column families before performance drops
|
|
|
Term
Are HBase and BigTable NoSQL databases? Can SQL be used to query these? |
|
Definition
|
|
Term
Are these supported in BigTable
- Multiple table operations?
- Indexes
- Constraints
- Grouping
- Joins
- Aggregates
- CRUD operations
|
|
Definition
No for all of them, except that basic CRUD operations are supported. |
|
|
Term
What are the 4 dimensions of data in HBase/BigTable? |
|
Definition
- Row ID
- Column family
- Column
- Timestamp
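A minimal sketch of how the four dimensions address one cell, using nested Python dictionaries as a mental model only. The row key, column family, and column names are invented for illustration, and this says nothing about how the data is stored internally.

```python
from collections import defaultdict

# table[row_id][column_family][column][timestamp] -> value
table = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))


def put(row_id, family, column, timestamp, value):
    """Write one cell version addressed by all four dimensions."""
    table[row_id][family][column][timestamp] = value


def get_latest(row_id, family, column):
    """Return the value stored under the highest timestamp for a cell."""
    versions = table[row_id][family][column]
    return versions[max(versions)]


# Two versions of the same cell, distinguished only by timestamp.
put("user#42", "profile", "email", 1, "old@example.com")
put("user#42", "profile", "email", 2, "new@example.com")
latest = get_latest("user#42", "profile", "email")
```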
|
|
|
Term
In what order is the Big Table Row Key stored? |
|
Definition
|
|
Term
What are two approaches to avoid hotspotting in BigTable |
|
Definition
- Field Promotion: Reverse URL order
- Salting: Hash the key value
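The two ideas above can be sketched as row-key helpers. This is an illustration only: the function names, the SHA-256 hash, and the bucket count of 4 are assumptions, not part of any Bigtable API.

```python
import hashlib


def reverse_domain(hostname):
    """Reverse the dot-separated parts so related hosts sort together."""
    return ".".join(reversed(hostname.split(".")))


def salted_key(key, buckets=4):
    """Prefix the key with a bucket number derived from a hash of the key.

    Keys that would otherwise be adjacent (for example, increasing
    timestamps) are spread over `buckets` separate key ranges.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return f"{bucket}#{key}"
```

The trade-off with salting is that a range scan over the original key order now has to read every bucket.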
|
|
|
Term
How does BigTable improve performance over time? |
|
Definition
It observes read and write patterns and redistributes data among shards. |
|
|
Term
What are things to look for when BigTable is not performing? |
|
Definition
- Poor schema design
- workload too small (< 300 GB)
- used in short bursts
- cluster too small
- cluster just started
- using HDD instead of SSD
|
|
|
Term
What are two fast operations in BigTable? |
|
Definition
- Lookup by row key
- sequential scans
|
|
|
Term
What is the GCP document database product? |
|
Definition
|
|
Term
Which of these affects response time in datastore?
- Database size
- Resultset size
|
|
Definition
|
|
Term
What is the index called that is used in a DataStore query, and how is it chosen? |
|
Definition
It is called the Perfect index, and the index is chosen in this order:
- Choose equality indexes
- If no equality condition, choose an inequality index (only one inequality condition is allowed in a search)
- Choose an index that satisfies a sort condition.
|
|
|
Term
What are some restrictions and limitations of using DataStore? |
|
Definition
- Restrictions
- No Joins
- Only one inequality condition allowed
- Cannot filter based on subquery results
- Limitations
- Updates are slow
- limited ACID support
|
|
|
Term
What consistency options are available for DataStore? |
|
Definition
- Strong consistency
- Eventual consistency
|
|
|
Term
When can a schema be created in BigQuery? |
|
Definition
- At creation time
- During initial load.
|
|
|
Term
What are two ways of loading data to Big Query? |
|
Definition
- Batch loads
- Streaming loads
|
|
|
Term
What data formats are supported for data loading? |
|
Definition
- CSV
- JSON
- Avro
- Cloud DataStore backups
|
|
|
Term
What feature is available above and beyond HIVE to enhance schema-on-read? |
|
Definition
|
|
Term
What are the four ways of querying BigQuery? Give important details. |
|
Definition
- Interactive queries
- Batch queries - which BigQuery will run when resources allow within 24 hours.
- Views
- Authorized view for security
- Row level permissions available
- Can't export data from a view
- can't mix standard and legacy SQL
- No functions or wildcards may be used
- Limit of 1000 views
- Partitioned tables
- Automatically created based on load datetime
- Automatic discarding
|
|
|
Term
How would you load data from Cloud Storage into BigQuery tables? |
|
Definition
From the command line, use the bq load command. |
|
|
Term
How can multiple tables be queried using wildcards in BigQuery using Standard SQL |
|
Definition
Enclose the table name in backticks (`) |
|
|
Term
What are called the inputs and outputs of a transformation in Apache Beam / DataFlow? |
|
Definition
|
|
Term
What are the components of Apache Beam? |
|
Definition
- Directed-acyclic graph: DAG
- Pipeline: A single Data Flow job
- PCollection: The data sets that are inputs and outputs of transforms
- Transform: transforms the data
- Source, Sink - source and destinations
- Driver: Defines the computation DAG (pipeline)
- Runner: executes the DAG on the backend
- Backends supported:
- Apache Spark
- Apache Flink
- Google Cloud Dataflow
- Beam Model
|
|
|
Term
What are the elements in Pub / Sub? |
|
Definition
- Publisher
- Subscriber
- Messages
- Queues: one per subscription
- Acknowledgement
- Planes
- Data plane: moves messages between publishers and subscribers. Servers here are forwarders.
- Control plane: handles assignment of publishers and subscribers to servers on the data plane. Servers here are routers.
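A toy in-memory model of the publisher, subscriber, queue, and acknowledgement elements listed above. It is a sketch for building intuition: the class and method names are hypothetical and do not correspond to the Cloud Pub/Sub client library, and the data and control planes are not modelled.

```python
from collections import deque


class Topic:
    """In-memory stand-in for a topic with one queue per subscription."""

    def __init__(self):
        self.queues = {}    # subscription name -> deque of (id, message)
        self.unacked = {}   # subscription name -> {id: message}
        self.next_id = 0

    def subscribe(self, name):
        """Create a subscription with its own queue."""
        self.queues[name] = deque()
        self.unacked[name] = {}

    def publish(self, message):
        """Copy the message into every subscription's queue."""
        msg_id = self.next_id
        self.next_id += 1
        for queue in self.queues.values():
            queue.append((msg_id, message))
        return msg_id

    def pull(self, name):
        """Hand out the next message; it stays outstanding until acked."""
        msg_id, message = self.queues[name].popleft()
        self.unacked[name][msg_id] = message
        return msg_id, message

    def ack(self, name, msg_id):
        """Acknowledge a message so it is not delivered again."""
        del self.unacked[name][msg_id]

    def redeliver_unacked(self, name):
        """Requeue outstanding messages (simulates an ack deadline expiring)."""
        for msg_id, message in self.unacked[name].items():
            self.queues[name].append((msg_id, message))
        self.unacked[name].clear()
```

Each subscription receives its own copy of a message, and a message that is pulled but never acknowledged becomes eligible for delivery again.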
|
|
|
Term
What are the two types of subscriptions in Pub Sub? How do those subscriptions connect? |
|
Definition
- Push subscriptions use a WebHook endpoint
- Pull subscriptions use an HTTPS request to an endpoint
|
|
|
Term
What interface is used to publish messages to pub sub? |
|
Definition
HTTPS request to googleapis.com |
|
|
Term
In what order are messages in Pub Sub delivered to a subscriber |
|
Definition
Random order -- no order guaranteed |
|
|
Term
Define the pub sub:
- Sliding window
- Sliding interval
|
|
Definition
- Sliding window: the window of time from which all data is gathered for processing.
- Sliding interval: The amount of time a window will shift for processing the next sliding window
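The relationship between the two can be sketched in plain Python. The function name and the `(timestamp, value)` event format are assumptions for illustration; a real streaming engine would also deal with late and out-of-order data.

```python
def sliding_windows(events, window, interval):
    """Group (timestamp, value) events into overlapping windows.

    Each window covers `window` time units, and consecutive windows
    start `interval` time units apart.
    """
    if not events:
        return []
    start = min(t for t, _ in events)
    end = max(t for t, _ in events)
    windows = []
    while start <= end:
        values = [v for t, v in events if start <= t < start + window]
        windows.append((start, start + window, values))
        start += interval
    return windows


events = [(0, "a"), (5, "b"), (12, "c"), (18, "d")]
result = sliding_windows(events, window=10, interval=5)
```

When the interval is shorter than the window, as here, windows overlap and one event can appear in more than one of them.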
|
|
|
Term
What GCP product is used for Notebooks? |
|
Definition
|
|
Term
What does the python kernel do for Data Lab notebooks |
|
Definition
It manages the notebook session and variables |
|
|
Term
What is a Representation ML-based system? What names are commonly used for such systems? |
|
Definition
They figure out by themselves what features to pay attention to. These are commonly called "Deep Learning" systems, which generally refer to Neural Networks. |
|
|
Term
Describe a neural network model |
|
Definition
A neural network model is made up of layers that feed one another. Each layer is composed of neurons. The outer layers, that is, the input and output layers, are called the "visible layers". Other layers are the "hidden" layers. It is the depth of the layers that gives Deep Learning its name. |
|
|
Term
|
Definition
- Rank - The number of dimensions
- 0 - Scalar
- 1 - Vector
- 2 - Matrix
- >2 - n-dimensional
- Shape - The number of elements in each dimension
- Data Type
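Rank and shape can be illustrated with nested Python lists standing in for tensors. This is a sketch that assumes rectangular nesting (every sublist at a given depth has the same length); the helper names are made up and are not a TensorFlow API.

```python
def shape(tensor):
    """Return the number of elements in each dimension of a nested list."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        # Descend into the first element; stop at an empty list or a scalar.
        tensor = tensor[0] if tensor else None
    return tuple(dims)


def rank(tensor):
    """Return the number of dimensions (0 for a scalar)."""
    return len(shape(tensor))
```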
|
|
|
Term
What are, in Tensorflow
- Constants
- Placeholders
- Variables
|
|
Definition
- Constants: Immutable values which do not change
- Placeholders: Assigned once and do not change after
- Variables: are constantly recomputed
|
|
|
Term
What is a TensorFlow feed dictionary? |
|
Definition
A way to specify the graph's input values. |
|
|
Term
In Tensorflow, what are:
- Coordinators
- QueueRunners
|
|
Definition
- Coordinators - a class to manage and work with multiple threads.
- QueueRunner - allows you to work with multiple elements from a queue in parallel using multiple threads.
|
|
|
Term
What does tf.stack do in TensorFlow? |
|
Definition
Converts multiple tensors to one tensor by adding a dimension, which becomes the index. |
|
|
Term
Name some distance measures for measuring ML model accuracy |
|
Definition
- Euclidean: sqrt(sum((xj-xij)^2)), as the crow flies
- L1/Manhattan/Snake/City block: absolute values of number of horizontal and vertical steps summed
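Both measures are short enough to write out directly. A minimal sketch, assuming points are given as equal-length sequences of numbers:

```python
import math


def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City-block distance: sum of the absolute differences per coordinate."""
    return sum(abs(x - y) for x, y in zip(a, b))
```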
|
|
|
Term
What is K-nearest neighbor |
|
Definition
K-nearest neighbor is a machine learning algorithm for classification and regression. It uses distance measures to find the closest matching class -- the most similar -- for a given input |
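A minimal sketch of the classification case, assuming Euclidean distance and a simple majority vote among the k closest training points. The function name and data layout are illustrative, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Return the majority label among the k training points nearest to query.

    `train` is a list of (point, label) pairs.
    """
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]
```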
|
|
Term
|
Definition
One-hot notation is a way of encoding values by using a vector where all but one of the values is zero, and the value of the vector is the one that is not zero. |
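A short sketch of encoding and decoding, assuming the set of possible values is known up front as a list. The helper names are illustrative.

```python
def one_hot(value, categories):
    """Return a vector with 1 at the position of `value` and 0 elsewhere."""
    return [1 if c == value else 0 for c in categories]


def decode(vector, categories):
    """Recover the value from the position of the single 1."""
    return categories[vector.index(1)]
```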
|
|
Term
What are the two functions applied in each neuron? |
|
Definition
- A linear (affine) transformation
- Applies Wx + b to the input
- W: Weight
- b: bias
- These are variables finalized by the training process
- An activation function
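The two steps can be sketched for a single neuron as follows. This is plain Python rather than TensorFlow, and the default activation (identity) is an assumption chosen only to keep the example small.

```python
def neuron(inputs, weights, bias, activation=lambda z: z):
    """Apply the affine transformation Wx + b, then the activation function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)
```

In training, `weights` and `bias` are the values being learned; the activation function is chosen up front.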
|
|
|
Term
|
Definition
- Rectified Linear Unit
- A very commonly used activation function in a neural network neuron.
- Returns the max of the result from the Affine Transformation, or zero.
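Written out, the function is a one-liner. A sketch in plain Python, where `z` stands for the output of the affine transformation:

```python
def relu(z):
    """Return z when it is positive, otherwise zero."""
    return max(z, 0.0)
```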
|
|
|
Term
In machine learning regression, what are all of the x variables together called? |
|
Definition
|
|
Term
Describe the process by which a linear regression ML model converges to the final variable values |
|
Definition
Training uses a gradient descent optimizer. At each step the variables are adjusted against the gradient of the cost, scaled by the learning rate, and this repeats over epochs until the values converge. |
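A minimal sketch of that convergence for a one-variable model y = w*x + b, assuming mean squared error as the cost and a fixed learning rate. The function name, the learning rate, and the epoch count are illustrative choices, not values from any library.

```python
def fit_line(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every training sample with the current w, b.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Data generated from y = 2x + 1, so w and b should approach 2 and 1.
w, b = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```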
|
|
Term
What are three gradient descent optimizer choices, and how do they differ in batch size? |
|
Definition
- Stochastic: Uses only one training sample for each update
- Mini-batch: Uses a subset of the whole training data set for each update
- Batch: Uses the whole training set for each update
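The three choices differ only in how many samples are fed to each gradient computation, which the following sketch makes concrete. The helper name and the sample data are illustrative.

```python
def batches(samples, batch_size):
    """Yield consecutive slices of `samples` with at most `batch_size` items."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


samples = list(range(10))
stochastic = list(batches(samples, 1))             # one sample at a time
mini_batch = list(batches(samples, 4))             # small subsets
full_batch = list(batches(samples, len(samples)))  # everything at once
```

With 10 samples, one pass over the data yields 10, 3, and 1 gradient computations respectively.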
|
|
|
Term
What three things must you decide and provide to the training model? |
|
Definition
- Batch size
- Number of steps
- optimizer function
|
|
|
Term
Describe logistic regression |
|
Definition
- Provides probability of a y value given an x value
- produces an S curve
- x variable can be continuous, but y variables can only be categorical, e.g., binary
- p(yi) = 1/(1+e^-(A+Bxi))
- Often used for linear classification
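The formula above translates directly into code. A sketch in plain Python, with `a` and `b` standing for the A and B coefficients:

```python
import math


def logistic_probability(x, a, b):
    """Return p(y) = 1 / (1 + e^-(A + B*x))."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))
```

Evaluating it over a range of x values traces the S curve: outputs stay between 0 and 1 and cross 0.5 where A + Bx is zero.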
|
|
|
Term
What does the TensorFlow Logit function do? |
|
Definition
Transforms a logistic regression S curve into a linear regression straight line. |
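Mathematically, the logit is the inverse of the logistic (sigmoid) function, which is why applying it straightens the curve. The sketch below uses plain Python math to show the round trip; it is not TensorFlow's API.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Map a probability back to the log-odds, log(p / (1 - p))."""
    return math.log(p / (1.0 - p))
```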
|
|
Term
What Affinity function is used with
- Linear regression
- Logistic regression
|
|
Definition
- Linear regression: identity (does nothing)
- Logistic regression: Softmax (transforms the affine transformation into probabilities)
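A minimal sketch of softmax in plain Python, showing how raw scores become probabilities that sum to 1. Subtracting the maximum score before exponentiating is a common numerical-stability step added here as an assumption; it does not change the result.

```python
import math


def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```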
|
|
|
Term
What is the activation function used in logistic regression? |
|
Definition
|
|
Term
What cost function is used for logistic regression |
|
Definition
|
|