Term
Draw the diagram for Query Processing |
|
Definition
|
|
Term
What are the types of query plans and their characteristics? |
|
Definition
- Logical query plan
- Alegebraic representation of query
- Operators taken from relational algebra
- Is abstract
- Physical query plan
- Algorithms selected for each operator in plan
- Execution order for specific operators
|
|
|
Term
Please show in the diagra where does query optimization take place |
|
Definition
|
|
Term
What does the 'selecting logical query plan' stage consist of? |
|
Definition
It takes place after the parse query, before select physical query plan, and it consists of
- Generate Logical Query plan
- Rewrite logical query plan
|
|
|
Term
What are Physical Plan Operators? |
|
Definition
Algorithm that implements one of the basic relational operations that are used in query plans
e.g. Relational algebra has join operator. How the join is carried out depends on >Structure of relations, > Size of relations, > Presence of indexes and hashes, etc |
|
|
Term
What is the computational model for Pyhsical plan operators?
|
|
Definition
Need to choose good physical plan operators:
- Estimate "cost" of each operator
- Key measure of cost is the number of disk accesses (far more costly than main memory accesses)
Assumption: arguments of operator are on disk, result is in main memory |
|
|
Term
What are the cost parameters of Physical Plan operators? |
|
Definition
M: Main memory available for buffers
S(R): Size of a tuple of relation R
B(R): Blocks used to store relation R
T(R): Number of tuples in relation R (Cardinality of R)
V(R,a): Number of distinct values for attribute a in relation R |
|
|
Term
What is a clustered 1) File, 2) relation and 3) index? |
|
Definition
- Tuples from different relations that can be joined (on particular attribute values) stored in blocks together.
- [R1 R2 S1 S2] [R3 R4 S3 S4]
- Tuples from relation are stored together in blocks, but not necessarily sorted
- [R1 R2 R3 R4] [R5 R6 R7 R8]
- Index allows tuples to be read in an order that corresponde to physical order
|
|
|
Term
What does scanning consist of? |
|
Definition
- e.g. Read all tuples of relation R (that satisfy some predicate)
- Two variants
|
|
|
Term
What are the characteristics of Table scan? |
|
Definition
- Tuples arranged in blocks
- All blocks known to the system
- Possible to get blocks one at a time
- I/O Cost
- B(R) disk accesses, if R is clustered
- T(R) disk accesses, if R is not clustered
|
|
|
Term
What are the characteristics of Index Scan? |
|
Definition
- An index exists on some attribute of R
- Use index to find all blocks holding R
- Retreive blocks for R
- I/O Cost
- B(R) + B(IR) disk accesses if clustered
- B(R) >> B(IR), so treat as only B(R)
- T(R) disk accesses if not clustered
|
|
|
Term
What are the main characteristics of one-pass algorithms? |
|
Definition
- Read data from disk only once
- Typically require that at least one argument fits in main memory
- Three broad categories:
- Unary, tuple at a time (i.e. select, project)
- Unary, full relation (i.e. duplicate elimination, grouping)
- Binary, full relation
|
|
|
Term
Describe the one-pass unary, tuple at a time |
|
Definition
- Algorithm:
- Foreach block of R:
- Copy block to input buffer
- Perform operation (select, project) on each tuple in block
- Move selected/projected tuples to output buffer
- Cost:
- In general, B(R) or T(R) disk accesses depending on clustering
- If operator is a select that compares an attribute to a constant and index exists for attributes used in select, <<B(R) disk accesses
Requires M >= 1
|
|
|
Term
Describe the Unary, Full-relation, one pass algorithm and cost |
|
Definition
- Algorithm (Normal):
- foreach block of R
- Copy block to input buffer
- update accumulator
- move tuples to output buffer
- Algorithm (Duplicate Elimination)
- foreach block of R:
- copy block to input buffer
- foreach tuple in block
- If tuple is not in accumulator
- else copy to accumulator
- Requires M >= B(d(R))+1 block of main memory
- 1 block for input buffer
- M-1 blocks for accumulator (records each tuple seen so far)
- Accumulator implemented as in-memory data structure (tree, hash)
- If fewer than B(d(R)) blocks available, thrashing likely
- Cost is B(R) disk accesses
|
|
|
Term
What are the characteristics of grouping in a unary, full relation? |
|
Definition
Grouping operators: min, max, sum, count, avg
- Accumulator contains per-group values
- Output contains per-group values
- Cost is B(R) disk accesses
|
|
|
Term
What are the characteristics of the Binary, Full-relation? |
|
Definition
Union, intersection, difference, product, join
In general, cost is B(R) + B(S). (R, S are operand relations)
Requirement for one pass operation:
min(B(R), B(S)) < M-1 |
|
|
Term
What are the characteristics of the join, binary ful relation? |
|
Definition
- Two relations, R(X,Y) and S(Y,Z), B(S)<B(R)
- Uses main memory search structure keyed on Y
- foreach block of S:
- read block
- add tuples to search structure
- foreach block of R:
- copy block to input buffer
- foreach tuple in block:
- Find matching tuples in search structure
- Construct new tuples and copy to output
|
|
|
Term
What are the characteristics of nested-loop joins? |
|
Definition
- Also known as iteration join
- Assumming that we're joining the relations R,S on attribute C, the algorithm is as follows:
- foreach tuple r in R
- foreach tuple s in S
- if r.C = s.C then output r,s pair
|
|
|
Term
What are some factors that affect cost? |
|
Definition
- Tuples of a relation stored physically together (clustered)
- Relations sorted by join attribute
- Indexes exist?
|
|
|
Term
- Solve a join between relations R1, R2 on attribute C through a tuple-based nested loop join with the following elements:
- T(R1) = 10,000
- T(R2) = 5,000
- S(R1) = S(R2) = 1/10 block
- M = 101 blocks
|
|
Definition
- Relations not contiguous
- One disk access per tuple
- Cost for each tuple in R1=cost to read tuple + cost to read R2
- Total Cost =
- T(R1)*(1+T(R2)) =
- 10,000 * (1+5,000) =
- 50,010,000 disk accesses
It is possible to do better by using all available main memory (M=101), read outer relation R1 in chunks of 100 blocks, and read all inner relation2 (using 1 block) + join
|
|
|
Term
- Solve a join between relations R1, R2 on attribute C through a block-based nested loop join with the following elements:
- T(R1) = 10,000
- T(R2) = 5,000
- S(R1) = S(R2) = 1/10 block
- M = 101 blocks
|
|
Definition
- Cost to read one 100-block chunk of R1 = 100*1/S(R1) = 1,000 disk accesses
- Cost to process each chunk = 1000 + T(R2) = 6,000
- Total cost = T(R1)/ 1000 * 6000 = 60,000 disk accesses
It's possible to do better by reversing the join order ( i.e. R1 becomes the inner relation, R2 becomes the outer relation). |
|
|
Term
- Solve a join between relations R1, R2 on attribute C through a join reordering with the following elements:
- T(R1) = 10,000
- T(R2) = 5,000
- S(R1) = S(R2) = 1/10 block
- M = 101 blocks
|
|
Definition
We are reversing the join order ( i.e. R1 becomes the inner relation, R2 becomes the outer relation).
- Cost to read one 100-block chunk of R2=100*1/S(R2)=1,000 disk accesses
- Cost to process each chunk = 1000+T(R1) = 11,000
- Total cost = T(R1)/1000*11,000 = 55,000 disk accesses
It's possible to do better if the tuples in each relation are contiugous (i.e. clustered) |
|
|
Term
- Solve a join between relations R1, R2 on attribute C through Contiguous Relations with the following elements:
- T(R1) = 10,000
- T(R2) = 5,000
- S(R1) = S(R2) = 1/10 block
- M = 101 blocks
|
|
Definition
Tuples in each relation are contiugous (i.e. clustered)
- B(R1)=1000, B(R2)=500
- Cost to read one 100-block chunk of R2=100 disk accesses
- Cost to process each chunk = 100+B(R1)=1100
- Total cost=B(R2)/100*1,100=5,500 disk accesses
We can do better if both relations are contiguous and sorted by C, the join attribute |
|
|
Term
- Solve a join between relations R1, R2 on attribute C through a merge join with the following elements:
- T(R1) = 10,000
- T(R2) = 5,000
- S(R1) = S(R2) = 1/10 block
- M = 101 blocks
|
|
Definition
Both relations are contiguous and sorted by C, the join attribute
Total cost = B(R1)+B(R2)=1000+500
= 1,500 disk accesses
Problem is if R1 and R2 aren't sorted by C, as it is needed to sort them. |
|
|
Term
- Solve a join between relations R1, R2 on attribute C through a Two Pass Algorithm with the following elements:
- T(R1) = 10,000
- T(R2) = 5,000
- S(R1) = S(R2) = 1/10 block
- M = 101 blocks
|
|
Definition
If R1 and R2 are not sorted by C, we need to sort them - this can be done with mergesort:
- Foreach 100 block chunk of R
- Read chunk
- sort in memory
- Write to disk
- Read all chunks + merge + write out
|
|
|
Term
What is the cost of the merge sort? |
|
Definition
Each tuple is read, written, read, written... so:
- Sort cost R1: 4x1000 = 4,000 disk accesses
- Sort cost R2: 4x500 = 2,000 disk accesses
|
|
|
Term
- Solve a join between relations R1, R2 on attribute C through a Merge Join with sort with the following elements:
- T(R1) = 10,000
- T(R2) = 5,000
- S(R1) = S(R2) = 1/10 block
- M = 101 blocks
|
|
Definition
- R1,R2 contiguous but usorted:
- Total Cost = sort cost + join cost
- = 6,000 + 1,500
- = 7,500 disk accesses
- Nested loop cost = 5,500 disk accesses
- Merge join does not necessarily pay off
If R1=10,000blocks contiugous
and
R2=5,000blocks not ordered
Nested loop cost = 505,000 disk accesses
Merge join cost = 75,000 disk accesses
In this case merge join with sort is better
|
|
|
Term
|
Definition
Are trees that represent queries visually by breaking them down into components:
SELECT PNUMBER,DNUM,LNAME,ADDRESS,DATE FROM PROJECT,DEPARTMENT, EMPLOYEE WHERE DNUM=DNUMBER AND MGRSSN=SSN AND PLOCATION='Stafford'
[image] |
|
|
Term
What are the optimization steps? |
|
Definition
Heuristic approach
- Start with canonical form
- Move o operators down the tree
- Reorder subtrees to put most restrictive o first
- Combine x and o to form |><|
- Move pi operators down the tree
|
|
|
Term
Show the diagram for the Query plan, and show where to choose heuristics |
|
Definition
|
|