Shared Flashcard Set

Details

COMP3017 - Lecture 5 - Query Processing
COMP3017 - Lecture 5 - Query Processing
30
Computer Science
Undergraduate 3
05/02/2014

Additional Computer Science Flashcards

 


 

Cards

Term
Draw the diagram for Query Processing
Definition
[image]
Term
What are the types of query plans and their characteristics?
Definition
  • Logical query plan
    • Alegebraic representation of query
    • Operators taken from relational algebra
    • Is abstract
  • Physical query plan
    • Algorithms selected for each operator in plan
    • Execution order for specific operators
Term
Please show in the diagra where does query optimization take place
Definition
[image]
Term
What does the 'selecting logical query plan' stage consist of?
Definition
It takes place after the parse query, before select physical query plan, and it consists of
  • Generate Logical Query plan
  • Rewrite logical query plan
Term
What are Physical Plan Operators?
Definition

Algorithm that implements one of the basic relational operations that are used in query plans

 

e.g. Relational algebra has join operator. How the join is carried out depends on >Structure of relations, > Size of relations, > Presence of indexes and hashes, etc

Term

What is the computational model for Pyhsical plan operators?

 

Definition

Need to choose good physical plan operators:

  • Estimate "cost" of each operator
  • Key measure of cost is the number of disk accesses (far more costly than main memory accesses)

Assumption: arguments of operator are on disk, result is in main memory

Term
What are the cost parameters of Physical Plan operators?
Definition

M: Main memory available for buffers

S(R): Size of a tuple of relation R

B(R): Blocks used to store relation R

T(R): Number of tuples in relation R (Cardinality of R)

V(R,a): Number of distinct values for attribute a in relation R

Term
What is a clustered 1) File, 2) relation and 3) index?
Definition
  1. Tuples from different relations that can be joined (on particular attribute values) stored in blocks together.
    • [R1 R2 S1 S2] [R3 R4 S3 S4]
  2. Tuples from relation are stored together in blocks, but not necessarily sorted
    • [R1 R2 R3 R4] [R5 R6 R7 R8]
  3. Index allows tuples to be read in an order that corresponde to physical order
Term
What does scanning consist of?
Definition
  • e.g. Read all tuples of relation R (that satisfy some predicate)
  • Two variants
    • Table Scan 
    • Index Scan
Term
What are the characteristics of Table scan?
Definition
  • Tuples arranged in blocks
    • All blocks known to the system
    • Possible to get blocks one at a time
  • I/O Cost
    • B(R) disk accesses, if R is clustered
    • T(R) disk accesses, if R is not clustered
Term
What are the characteristics of Index Scan?
Definition
  • An index exists on some attribute of R
    • Use index to find all blocks holding R
    • Retreive blocks for R
  • I/O Cost
    • B(R) + B(IR) disk accesses if clustered
    • B(R) >> B(IR), so treat as only B(R)
    • T(R) disk accesses if not clustered
Term
What are the main characteristics of one-pass algorithms?
Definition
  • Read data from disk only once
  • Typically require that at least one argument fits in main memory
  • Three broad categories:
    • Unary, tuple at a time (i.e. select, project)
    • Unary, full relation (i.e. duplicate elimination, grouping)
    • Binary, full relation
Term
Describe the one-pass unary, tuple at a time
Definition
  • Algorithm:
    • Foreach block of R:
      • Copy block to input buffer
      • Perform operation (select, project) on each tuple in block
      • Move selected/projected tuples to output buffer
  • Cost:
    • In general, B(R) or T(R) disk accesses depending on clustering
    • If operator is a select that compares an attribute to a constant and index exists for attributes used in select, <<B(R) disk accesses
Requires M >= 1

 

Term
Describe the Unary, Full-relation, one pass algorithm and cost
Definition
  • Algorithm (Normal):
    • foreach block of R
      • Copy block to input buffer
      • update accumulator
      • move tuples to output buffer
  • Algorithm (Duplicate Elimination)
    • foreach block of R:
      • copy block to input buffer
      • foreach tuple in block
        • If tuple is not in accumulator
          • copy to output buffer
        • else copy to accumulator
  • Requires M >= B(d(R))+1 block of main memory
    • 1 block for input buffer
    • M-1 blocks for accumulator (records each tuple seen so far)
  • Accumulator implemented as in-memory data structure (tree, hash)
  • If fewer than B(d(R)) blocks available, thrashing likely
  • Cost is B(R) disk accesses
Term
What are the characteristics of grouping in a unary, full relation?
Definition

Grouping operators: min, max, sum, count, avg

  • Accumulator contains per-group values
  • Output contains per-group values
  • Cost is B(R) disk accesses
Term
What are the characteristics of the Binary, Full-relation?
Definition

Union, intersection, difference, product, join

 

In general, cost is B(R) + B(S). (R, S are operand relations)

 

Requirement for one pass operation:

min(B(R), B(S)) < M-1

Term
What are the characteristics of the join, binary ful relation?
Definition
  • Two relations, R(X,Y) and S(Y,Z), B(S)<B(R)
  • Uses main memory search structure keyed on Y
  • foreach block of S:
    • read block
    • add tuples to search structure
  • foreach block of R:
    • copy block to input buffer
    • foreach tuple in block:
      • Find matching tuples in search structure
      • Construct new tuples and copy to output
Term
What are the characteristics of nested-loop joins?
Definition
  • Also known as iteration join
  • Assumming that we're joining the relations R,S on attribute C, the algorithm is as follows:
    • foreach tuple r in R
      • foreach tuple s in S
        • if r.C = s.C then output r,s pair

 

Term
What are some factors that affect cost?
Definition
  • Tuples of a relation stored physically together (clustered)
  • Relations sorted by join attribute
  • Indexes exist?

 

Term
  • Solve a join between relations R1, R2 on attribute C through a tuple-based nested loop join with the following elements:
    • T(R1) = 10,000
    • T(R2) = 5,000
    • S(R1) = S(R2) = 1/10 block
    • M = 101 blocks
Definition
  • Relations not contiguous
    • One disk access per tuple
  • Cost for each tuple in R1=cost to read tuple + cost to read R2
    • Total Cost = 
    • T(R1)*(1+T(R2)) =
    • 10,000 * (1+5,000) = 
    • 50,010,000 disk accesses

It is possible to do better by using all available main memory (M=101), read outer relation R1 in chunks of 100 blocks, and read all inner relation2 (using 1 block) + join

 

Term
  • Solve a join between relations R1, R2 on attribute C through a block-based nested loop join with the following elements:
    • T(R1) = 10,000
    • T(R2) = 5,000
    • S(R1) = S(R2) = 1/10 block
    • M = 101 blocks
Definition
  • Cost to read one 100-block chunk of R1 = 100*1/S(R1) = 1,000 disk accesses
  • Cost to process each chunk = 1000 + T(R2) = 6,000
  • Total cost = T(R1)/ 1000 * 6000 = 60,000 disk accesses

It's possible to do better by reversing the join order ( i.e. R1 becomes the inner relation, R2 becomes the outer relation).

Term
  • Solve a join between relations R1, R2 on attribute C through a join reordering with the following elements:
    • T(R1) = 10,000
    • T(R2) = 5,000
    • S(R1) = S(R2) = 1/10 block
    • M = 101 blocks
Definition
We are reversing the join order ( i.e. R1 becomes the inner relation, R2 becomes the outer relation).
  • Cost to read one 100-block chunk of R2=100*1/S(R2)=1,000 disk accesses
  • Cost to process each chunk = 1000+T(R1) = 11,000
  • Total cost = T(R1)/1000*11,000 = 55,000 disk accesses

It's possible to do better if the tuples in each relation are contiugous (i.e. clustered)

Term
  • Solve a join between relations R1, R2 on attribute C through Contiguous Relations with the following elements:
    • T(R1) = 10,000
    • T(R2) = 5,000
    • S(R1) = S(R2) = 1/10 block
    • M = 101 blocks
Definition
Tuples in each relation are contiugous (i.e. clustered)
  • B(R1)=1000, B(R2)=500
  • Cost to read one 100-block chunk of R2=100 disk accesses
  • Cost to process each chunk = 100+B(R1)=1100
  • Total cost=B(R2)/100*1,100=5,500 disk accesses

We can do better if both relations are contiguous and sorted by C, the join attribute

Term
  • Solve a join between relations R1, R2 on attribute C through a merge join with the following elements:
    • T(R1) = 10,000
    • T(R2) = 5,000
    • S(R1) = S(R2) = 1/10 block
    • M = 101 blocks
Definition

Both relations are contiguous and sorted by C, the join attribute

 

Total cost = B(R1)+B(R2)=1000+500

= 1,500 disk accesses

 

Problem is if R1 and R2 aren't sorted by C, as it is needed to sort them.

Term
  • Solve a join between relations R1, R2 on attribute C through a Two Pass Algorithm with the following elements:
    • T(R1) = 10,000
    • T(R2) = 5,000
    • S(R1) = S(R2) = 1/10 block
    • M = 101 blocks
Definition

If R1 and R2 are not sorted by C, we need to sort them - this can be done with mergesort:

  1. Foreach 100 block chunk of R
    1. Read chunk 
    2. sort in memory
    3. Write to disk
  2. Read all chunks + merge + write out

 

Term
What is the cost of the merge sort?
Definition

Each tuple is read, written, read, written... so:

  • Sort cost R1: 4x1000 = 4,000 disk accesses
  • Sort cost R2: 4x500 = 2,000 disk accesses

 

Term
  • Solve a join between relations R1, R2 on attribute C through a Merge Join with sort with the following elements:
    • T(R1) = 10,000
    • T(R2) = 5,000
    • S(R1) = S(R2) = 1/10 block
    • M = 101 blocks
Definition
  • R1,R2 contiguous but usorted:
  • Total Cost = sort cost + join cost
  • = 6,000 + 1,500
  • = 7,500 disk accesses
  • Nested loop cost = 5,500 disk accesses
    • Merge join does not necessarily pay off

If R1=10,000blocks contiugous

and

R2=5,000blocks not ordered

 

Nested loop cost = 505,000 disk accesses

Merge join cost = 75,000 disk accesses

 

In this case merge join with sort is better

 

 

 

Term
What are query trees?
Definition

Are trees that represent queries visually by breaking them down into components:

 

SELECT PNUMBER,DNUM,LNAME,ADDRESS,DATE FROM PROJECT,DEPARTMENT, EMPLOYEE WHERE DNUM=DNUMBER AND MGRSSN=SSN AND PLOCATION='Stafford'

 

[image]

Term
What are the optimization steps?
Definition

Heuristic approach

  1. Start with canonical form
  2. Move o operators down the tree
  3. Reorder subtrees to put most restrictive o first
  4. Combine x and o to form |><|
  5. Move pi operators down the tree
Term
Show the diagram for the Query plan, and show where to choose heuristics
Definition
[image]
Supporting users have an ad free experience!