Term
What cluster managers does Spark support? |
|
Definition
Standalone (Spark's built-in cluster manager), Hadoop YARN, and Apache Mesos. |
|
|
Term
What storage systems does Spark support? |
|
Definition
Any supported by the Hadoop APIs. |
|
|
Term
Local mode is also known as |
|
Definition
|
|
Term
How do you change the logging level of Spark? |
|
Definition
via the conf/log4j.properties file |
|
|
Term
How do you execute a standalone Spark application? |
|
Definition
Via the bin/spark-submit script, e.g. bin/spark-submit my_script.py |
|
|
Term
How do you set the Spark application's name? |
|
Definition
Through:
- SparkConf.setAppName("...")
- spark-submit --name "..." |
|
|
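Note: a minimal Scala sketch of the two options above, assuming the Spark 1.x API; the application name and jar are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Option 1: set the name programmatically before creating the context.
    val conf = new SparkConf().setAppName("MyApp")
    val sc = new SparkContext(conf)

    // Option 2 (equivalent, from the shell):
    //   spark-submit --name "MyApp" myjob.jar
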
Term
Can you run a Spark application without setting any SparkConf properties? |
|
Definition
Yes. Spark then falls back to the values supplied via spark-submit and its built-in defaults. |
|
|
Term
What pseudocode creates an accumulator with initial value of 1 and adds 2 to it? |
|
Definition
sc.accumulator(1).add(2) |
|
|
Term
What method is used to obtain an accumulator's value? |
|
Definition
The value method (e.g. acc.value), readable only from the driver. |
|
|
Term
What guarantees does Spark provide with respect to applying updates to accumulators? |
|
Definition
Updates to accumulators in actions are only applied once. There is no guarantee for updates in transformations. |
|
|
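Note: a minimal Scala sketch tying the three accumulator cards together, assuming the Spark 1.x API (sc.accumulator) and an existing SparkContext named sc:

    // Create an accumulator with initial value 1.
    val acc = sc.accumulator(1)

    // add() inside an action (foreach) is applied exactly once per element,
    // even if a task is re-run; inside a transformation (e.g. map) there is
    // no such guarantee.
    sc.parallelize(Seq(10, 20, 30)).foreach(_ => acc.add(2))

    // Only the driver can read the result, via the value method.
    println(acc.value) // 1 + 3 * 2 = 7
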
Term
If your application has high network traffic, what might you adjust? |
|
Definition
The partitioning of your RDDs: good partitioning reduces communication in key-oriented operations (the serialization format is another common lever). |
|
|
Term
When can partitioning provide performance benefits? |
|
Definition
When a dataset is reused multiple times in key-oriented operations such as joins; partitioning (plus caching) avoids re-shuffling it by key on every use. |
|
|
Term
On what is partitioning available? |
|
Definition
Pair (key/value) RDDs. |
|
|
Term
How can partitioning provide performance benefits? |
|
Definition
By ensuring a static dataset is only hashed once, assuming that dataset is cached and reused. |
|
|
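Note: a Scala sketch of the pattern described above, assuming hypothetical input files keyed by user ID; partitionBy and HashPartitioner are the real Spark APIs:

    import org.apache.spark.HashPartitioner

    // Hash the static dataset once, into 100 partitions, and cache it.
    val userData = sc.textFile("userdata.txt")
      .map(line => (line.split(",")(0), line))
      .partitionBy(new HashPartitioner(100))
      .persist()

    // Each later join reuses userData's partitioner, so only the smaller
    // events RDD is shuffled over the network.
    val events = sc.textFile("events.txt").map(line => (line.split(",")(0), line))
    val joined = userData.join(events)
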
Term
What is the number of partitions an upper bound for? |
|
Definition
The number of tasks / degree of parallelism |
|
|
Term
What is a good guideline for the number of partitions you should have for an RDD? |
|
Definition
At least as large as the number of cores in your cluster. |
|
|
Term
If an operation modifies a single RDD and that RDD is partitioned and cached, what data is transferred over the network? |
|
Definition
Only the output of the operation. The input is operated upon locally. |
|
|
Term
What partitioner will be selected for an operation on two RDDs that sets a partitioner on its output? |
|
Definition
It depends on the parent RDDs' partitioners:
1. If neither parent has a partitioner, a HashPartitioner is used, with the number of partitions set by the operation's level of parallelism
2. If only one parent has a partitioner, that partitioner is used
3. If both parents have a partitioner, the first parent's partitioner is used. |
|
|
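Note: a small Scala sketch illustrating rule 2 above; partitioner is the real Option[Partitioner] field on RDDs:

    import org.apache.spark.HashPartitioner

    val a = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(new HashPartitioner(8))
    val b = sc.parallelize(Seq((1, "p"), (3, "q"))) // no partitioner

    // Only one parent (a) has a partitioner, so the join inherits it.
    println(a.join(b).partitioner) // Some(...HashPartitioner...)
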
Term
In local mode, how many processes does a Spark application have? |
|
Definition
One - the driver and a single executor run in the same process. |
|
|
Term
In distributed mode, how many processes does a Spark application have? |
|
Definition
One for the driver, and one for each executor. |
|
|
Term
On encountering an action, what does Spark's scheduler do? |
|
Definition
It creates a physical execution plan, working backward from the final RDD being computed. |
|
|
Term
What is a stage? |
|
Definition
One or more transformations/actions that are divided into N tasks, where N is the number of partitions. |
|
|
Term
What is pipelining? |
|
Definition
When multiple transformations/actions are collapsed into a single stage |
|
|
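Note: a Scala sketch of pipelining, assuming a hypothetical input file; toDebugString is a real RDD method that prints the lineage with stage boundaries:

    val words = sc.textFile("input.txt")
      .flatMap(_.split(" "))      // pipelined with textFile: no data movement
      .map(word => (word, 1))     // still the same stage

    val counts = words.reduceByKey(_ + _) // shuffle: starts a new stage

    // Indentation changes in this output mark the stage boundaries.
    println(counts.toDebugString)
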
Term
In the simplest case, how many stages will a Spark application have? |
|
Definition
One for each transformation and action |
|
|
Term
When is pipelining performed? |
|
Definition
When an RDD can be computed from its parents without any data movement |
|
|
Term
When is data shuffling avoided? |
|
Definition
When the needed data is already available, either in a persisted RDD or in shuffle outputs that are still on disk from an earlier computation. |
|
|
Term
In what order are Spark stages executed? |
|
Definition
In the order defined by RDD lineage |
|
|
Term
How does Spark handle the loss of a persisted RDD? |
|
Definition
It determines what is necessary to calculate that RDD through the lineage graph, and then recalculates it. |
|
|
Term
Why shouldn't we use collect() on large datasets? |
|
Definition
Because the entire dataset will have to fit in the Driver program's memory, which may not be feasible. |
|
|
Term
What can you do to mitigate slowdowns due to lost persisted RDDs? |
|
Definition
Replicate persisted data to multiple nodes |
|
|
Term
What persistence options are available? |
|
Definition
MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER, DISK_ONLY, OFF_HEAP |
|
|
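Note: a Scala sketch of choosing a persistence level, using the real StorageLevel constants listed above:

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(1 to 1000000)
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER) // spill to disk rather than recompute
    rdd.count() // the first action materializes the cache
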
Term
What happens if you try to persist more data than you have memory for? |
|
Definition
Spark will evict partitions using an LRU policy. With a MEMORY_AND_DISK level, evicted partitions spill to disk; otherwise they are simply recomputed the next time they are needed. |
|
|
Term
How do you remove data from the cache? |
|
Definition
Call the .unpersist() method on the RDD |
|
|
Term
What can hinder performance when using broadcast variables? |
|
Definition
Choosing a slow serialization format for large broadcast values; the default Java serialization can be inefficient, so consider Kryo or another fast, compact format. |
|
|
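Note: a minimal Scala sketch of a broadcast variable; the lookup table here is a toy stand-in for a large value worth shipping once per node instead of once per task:

    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

    val total = sc.parallelize(Seq("a", "b", "a"))
      .map(key => lookup.value.getOrElse(key, 0)) // read via .value
      .reduce(_ + _)
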
Term
Where can external program scripts be loaded from? |
|
Definition
Local file system, any Hadoop supported file system, HTTP, HTTPS, or FTP |
|
|
Term
How can we set environment variables for external scripts? |
|
Definition
Pass them in as a map to the second argument of the pipe() command. |
|
|
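Note: a Scala sketch of the pipe() overload described above; the script path and environment variable are placeholders:

    val piped = sc.parallelize(Seq("line1", "line2"))
      .pipe(Seq("./myscript.sh"),      // command to run
            Map("SEPARATOR" -> ","))   // environment variables for the script
    piped.collect().foreach(println)
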
Term
When are object files commonly used? |
|
Definition
To save Spark job data to be used by other code/jobs |
|
|
Term
What is a danger of using object files? |
|
Definition
It requires programmer effort to maintain backwards compatibility when changing serialized classes |
|
|
Term
What is a caveat when serializing an RDD of Writable objects? |
|
Definition
Many Writable objects are not serializable. You may need to use a map function to unwrap them before serializing. |
|
|
Term
What is a caveat when caching an RDD of Writable objects? |
|
Definition
Hadoop's RecordReader reuses the same Writable instance for every record, so caching directly can store many references to one object. Map the Writables to standard types before caching. |
|
|
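Note: a Scala sketch of the unwrap-then-cache pattern, assuming a hypothetical SequenceFile of Text/IntWritable pairs:

    import org.apache.hadoop.io.{IntWritable, Text}

    // Convert Writables to plain Scala types before caching; Hadoop's
    // RecordReader reuses one Writable instance for every record.
    val converted = sc.sequenceFile("data.seq", classOf[Text], classOf[IntWritable])
      .map { case (k, v) => (k.toString, v.get) }
      .cache()
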
Term
What type of Hadoop formats generally support compression? |
|
Definition
|
|
Term
What happens if Spark reads data from a source with an unsplittable compression? |
|
Definition
Spark reads all the data on a single node. |
|
|
Term
How does textFile handle sources with splittable compression? |
|
Definition
textFile ignores splittability altogether, treating the file as if it were unsplittable. |
|
|
Term
If you want to read a text file with a splittable compression, what should you do? |
|
Definition
Use the Hadoop file APIs directly and specify the correct compression codec. |
|
|
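Note: a hedged Scala sketch of going through the newer Hadoop API; the path is a placeholder, and for a splittable codec such as LZO you would substitute a codec-aware input format (e.g. hadoop-lzo's LzoTextInputFormat):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val lines = sc.newAPIHadoopFile(
        "input.txt.lzo",
        classOf[TextInputFormat], // swap in a codec-aware format here
        classOf[LongWritable],
        classOf[Text])
      .map { case (_, text) => text.toString }
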
Term
How does Spark handle reading from local file systems? |
|
Definition
All files must be at the same path on all cluster nodes |
|
|
Term
What is a URI Spark would recognize for an S3 object? |
|
Definition
s3n://bucket/path-within-bucket |
|
|
Term
What is a URI Spark would recognize for an HDFS object? |
|
Definition
hdfs://master:port/path |
|
|
Term
What happens after a user submits an application with spark-submit? |
|
Definition
1. spark-submit launches the driver program, which invokes the main method
2. The driver program asks the cluster manager for resources to launch executors
3. The cluster manager launches executors
4. The driver process runs, sending tasks to executors based on the program's transformations/actions
5. Executors run the tasks and save/transmit results
6. On exit, the executors are terminated and the cluster manager's resources are released |
|
|
Term
When are executors terminated and cluster manager resources released by a Spark application? |
|
Definition
When the driver's main() method exits or when SparkContext.stop() is called |
|
|
Term
In what mode does the following execute?
spark-submit my_script.py |
|
Definition
Local mode: with no --master flag, spark-submit runs the program locally. |
|
|
Term
How would you run myjob.jar locally with 2 cores? |
|
Definition
spark-submit --master local[2] myjob.jar |
|
|
Term
How would you run myjob.jar locally with the maximum number of cores? |
|
Definition
spark-submit --master local[*] myjob.jar |
|
|
Term
How would you run myjob.jar on a yarn cluster? |
|
Definition
spark-submit --master yarn myjob.jar
Additionally, set the HADOOP_CONF_DIR environment variable to the location of your Hadoop configuration directory |
|
|
Term
How would you run myjob.jar on a mesos cluster? |
|
Definition
spark-submit --master mesos://host:port myjob.jar |
|
|
Term
How would you run myjob.jar on a standalone cluster? |
|
Definition
spark-submit --master spark://host:port myjob.jar |
|
|
Term
Where does the driver program run when spark-submit is executed? |
|
Definition
By default, it will run on the machine where spark-submit is executed (client mode). To instead run it on one of the worker nodes, use:
--deploy-mode cluster |
|
|
Term
How do you set the name of your application via spark-submit? |
|
Definition
spark-submit --name "..." |
|
|
Term
How do you put JAR files on the classpath and transmit those JARs to the cluster nodes? |
|
Definition
spark-submit --jars jar,jar,... |
|
|
Term
How do you put non-jar files in the working directory of your application for each cluster node? |
|
Definition
spark-submit --files file,file,... |
|
|
Term
How do you put python files on the PYTHONPATH of the application? |
|
Definition
spark-submit --py-files file.py,pkg.egg,lib.zip (accepts .py, .egg, and .zip files) |
|
|
Term
How do you specify 512 megabytes of executor memory for your application? |
|
Definition
spark-submit --executor-memory 512m |
|
|
Term
How do you specify 5 gigabytes of driver memory for your application? |
|
Definition
spark-submit --driver-memory 5g |
|
|
Term
How can you provide arbitrary configuration properties via spark-submit? |
|
Definition
spark-submit --conf prop=value --conf prop=value ... |
|
|
Term
How can you provide a properties file via spark-submit? |
|
Definition
spark-submit --properties-file ... |
|
|
Term
If you have many library dependencies, what is an alternative to providing them via spark-submit? |
|
Definition
Create a fat jar containing all dependencies, typically using a build tool. |
|
|
Term
What shouldn't you include as a dependency in a fat jar? |
|
Definition
Spark itself. Mark Spark as a 'provided' dependency, since it is already installed on the cluster. |
|
|
Term
What primarily governs resource sharing between (inter) Spark applications? |
|
Definition
The cluster manager. |
|
|
Term
If a Spark application asks for 5 executors, how many is it guaranteed to get? |
|
Definition
No guarantee. It may receive fewer, or more, depending on availability and contention in the cluster. |
|
|
Term
What governs resource sharing within a long lived Spark application (intra)? |
|
Definition
Spark's internal scheduler |
|
|
Term
What does Spark's Fair Scheduler provide? |
|
Definition
Applications can define priority queues for tasks |
|
|
Term
How do most cluster managers handle scheduling between jobs? |
|
Definition
By defining priority queues and/or capacity limits for jobs |
|
|
Term
When is the standalone cluster manager appropriate? |
|
Definition
When you only want Spark to run on the cluster. |
|
|
Term
What are the steps to stand up a standalone cluster? |
|
Definition
1. Put Spark at the same location on all cluster nodes
2. Enable password-less SSH access from the master to the workers
3. Add the worker hostnames to the master's conf/slaves
4. Run sbin/start-all.sh on the master |
|
|
Term
On what port does a standalone cluster master run by default? |
|
Definition
7077 (the master's web UI defaults to port 8080). |
|
|
Term
How does Spark handle a request for more memory than an executor node has available? |
|
Definition
It does not add that executor node to the cluster. |
|
|
Term
If you have a *standalone*, 20-node cluster with 4-core machines, how can you limit your job to running on eight machines? |
|
Definition
spark-submit --total-executor-cores 8 |
|
|
Term
For multiple applications how many executors will run on a single node in a *standalone* cluster? |
|
Definition
By default no more than 1 per application |
|
|
Term
If you have a *standalone* 20-node cluster with 4-core machines, how can you have 8 executors running on as few nodes as possible? |
|
Definition
Set spark.deploy.spreadOut to false |
|
|
Term
Can you have multiple masters with a *standalone* cluster? |
|
Definition
Yes. You can run standby masters and use Apache ZooKeeper to handle failover between them. |
|
|
Term
By default, how many executors are used for a YARN application? |
|
Definition
Two. |
|
|
Term
How can you change the number of executors an application will launch? |
|
Definition
spark-submit --num-executors ... |
|
|
Term
How do you set the number of cores each executor will use in a YARN cluster? |
|
Definition
spark-submit --executor-cores ... |
|
|
Term
How can you submit a Spark application to a specific YARN cluster queue? |
|
Definition
spark-submit --queue queuename ... |
|
|
Term
How can you use Zookeeper to elect a master node in a Mesos cluster? |
|
Definition
spark-submit --master mesos://zk://node1:2181,node2:2181,.../mesos myjob.jar |
|
|
Term
What modes does Mesos offer for scheduling and which is the default? |
|
Definition
Fine-grained (default)
Coarse-grained (set via spark.mesos.coarse=true) |
|
|
Term
What is Mesos' fine-grained mode? |
|
Definition
Executors dynamically scale the number of CPUs they claim, so cluster resources are shared among multiple jobs as they come and go. |
|
|
Term
When wouldn't you want to use Mesos' fine-grained mode? |
|
Definition
For applications with high latency sensitivity (e.g., Spark streaming) |
|
|
Term
What is Mesos' coarse-grained mode? |
|
Definition
Spark allocates a fixed number of CPUs to each executor which are not released until the application ends. |
|
|
Term
At what level is Mesos' scheduling mode set (per-cluster/per-job) |
|
Definition
Per-job: spark.mesos.coarse is an ordinary Spark configuration property, so each application chooses its own mode. |
|
|
Term
How many cores will Mesos use in the cluster by default? |
|
Definition
All available cores, by default; set a limit if you want to run multiple applications at once. |
|
|
Term
How can you set a limit for the number of cores Mesos will use? |
|
Definition
spark-submit --total-executor-cores ... |
|
|
Term
How can you start a standalone cluster in EC2? |
|
Definition
With the spark-ec2 script included in Spark's ec2 directory, e.g. ./spark-ec2 -k mykeypair -i mykeypair.pem launch mycluster (the key pair and cluster names are placeholders). |
|
|
Term
What does the Spark script for launching a EC2 cluster also put on the nodes? |
|
Definition
Ephemeral HDFS, Persistent HDFS, Tachyon, Ganglia |
|
|
Term
What config property sets the application name? |
|
Definition
spark.app.name |
|
|
Term
What config property sets the master? |
|
Definition
spark.master |
|
|
Term
What file does Spark read properties from by default? How is it overridden? |
|
Definition
SPARK_HOME/conf/spark-defaults.conf
Overridden with spark-submit --properties-file ... |
|
|
Term
What is the order of precedence for how Spark loads properties? |
|
Definition
1. Properties set explicitly in user code (highest)
2. Flags passed to spark-submit
3. Values in the properties file
4. Default values (lowest) |
|
|
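Note: a Scala sketch of setting properties in user code, the highest-precedence source above; all three property keys are real:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
    conf.set("spark.app.name", "MyApp")
    conf.set("spark.master", "local[4]")
    conf.set("spark.ui.port", "36000") // an arbitrary extra property
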
Term
How would you provide Java options to executors using spark-submit? |
|
Definition
spark-submit --conf "spark.executor.extraJavaOptions=..." |
|
|
Term
How would you provide library paths to executors using spark-submit? |
|
Definition
spark-submit --conf "spark.executor.extraLibraryPath=..." |
|
|
Term
How would you specify the local storage directories for executors? |
|
Definition
Standalone/Mesos: the SPARK_LOCAL_DIRS environment variable, falling back to the spark.local.dir property (a comma-separated list of directories)
YARN: the LOCAL_DIRS environment variable, falling back to the spark.local.dir property (a comma-separated list of directories) |
|
|
Term
Where are shuffle outputs written? |
|
Definition
To the executors' local storage directories (see the previous card), i.e. to local disk. |
|
|
Term
What is skew? |
|
Definition
When a small number of tasks take a large amount of time |
|
|
Term
Where are Spark's logs stored and accessed in Standalone and Mesos? |
|
Definition
Standalone: stored in the work/ directory on each worker; displayed in the master's web UI
Mesos: stored in the work/ directory on each Mesos slave; accessed via the Mesos master UI |
|
|
Term
How would you view application logs in YARN? |
|
Definition
yarn logs -applicationId ... |
|
|
Term
How can you easily provide a log4j.properties to your spark application? |
|
Definition
spark-submit --files log4j.properties |
|
|
Term
How do you specify to use the Kryo serializer? |
|
Definition
Set the 'spark.serializer' property to 'org.apache.spark.serializer.KryoSerializer' |
|
|
Term
What should you consider when using Kryo to serialize your custom classes? |
|
Definition
Register those classes with Kryo to save space via:
conf.registerKryoClasses(Array(classOf[...], ...)) |
|
|
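Note: a Scala sketch combining the two Kryo cards above; MyClass and MyOtherClass are hypothetical user classes:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Registration lets Kryo avoid writing full class names with each object.
    conf.registerKryoClasses(Array(classOf[MyClass], classOf[MyOtherClass]))
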
Term
What can help debug a NotSerializableException? |
|
Definition
Set the java option "-Dsun.io.serialization.extendedDebugInfo=true" |
|
|
Term
How is JVM memory distributed for executors by default? How is it changed? |
|
Definition
Divided between:
- Persisted RDD storage: 60% by default, set with spark.storage.memoryFraction
- Shuffle output: 20% soft limit, set with spark.shuffle.memoryFraction
- User code: whatever is left over (20% by default) |
|
|
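Note: a Scala sketch of shifting the default split; the values here are illustrative, not recommendations:

    val conf = new SparkConf()
    conf.set("spark.storage.memoryFraction", "0.4") // down from the 0.6 default
    conf.set("spark.shuffle.memoryFraction", "0.3") // up from the 0.2 default
    // User code then gets the remaining ~30%.
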
Term
What is one reason you might cache serialized objects? |
|
Definition
To reduce garbage-collection times, which scale with the number of objects on the heap rather than with their size. |
|
|