Please answer the following multiple-choice questions:
1. [5pts]
a. Continuous queries over incoming data and user subscriptions.
b. Ad hoc Information Retrieval for users as they see incoming data.
c. Streaming DOM trees.
d. Recursive-descent parsing of XML.
2. [5pts] optimized for:
Publish/subscribe systems like XFilter follow a model of supporting:
The Google File System differs from NFS (the Network File System) in that it is
a. Small files.
b. Random-access workloads.
c. Append-oriented workloads.
d. Text file-oriented workloads.
3.
[5pts] Algorithms for the MapReduce programming model can easily be executed on a distributed, sharded stream processing system such as Storm – except those MapReduce algorithms that make use of:
a. Mappers.
b. Drivers.
c. Reducers.
d. Sort-based shuffle.
[5pts] Remote procedure calls are: a. Typically synchronous.
b. Typically asynchronous. c. Always based on REST.
d. Always based on SOAP.
[5pts] REST (REpresentational State Transfer) makes use of:
a. HTTP operations such as GET and PUT.
b. Paths within a space of routes in a hierarchy.
c. Paths within a space of routes across remote sites.
d. A and B.
e. All of the above.
4.
5.
6. [5pts] a. b. c.
d.
7. [5pts] is:
a. b. c. d. e.
The document vector model makes a strong assumption that: Words are stemmed.
Word occurrences are conditionally independent of one another. Words are sparse.
Crawling is fast.
Out of all of the ACID properties, the one most associated with two-phase locking
Atomicity. Consistency. Isolation. Durability. Duration.
8. [5pts]
node has not replied, for the current transaction the subordinate node will:
In two-phase commit, if a subordinate node has voted to commit and the master
a. Deadlock.
b. Wait until the master responds.
c. Abort.
d. Perform two-phase locking.
9. Suppose we are building a sort-based MapReduce engine, as per the original Google paper. Show pseudocode for the distributed algorithm to do (a) the shuffle of the output (key, value) pairs from the mapper, and the (b) reduce with a single reducer.
a. A few different ways to do this. The Google paper does something like the following:
R = 4000 partitions
significantBytes = 2 // 2 bytes are enough to differentiate 4000 partitions
// I’m using bytes cause that’s easier to manipulate subrangeLength = 2^(significantBytes*8) / R = 16.384
// assuming map emits (word, “don’t-care”) EmitIntermediate (String key, String val):
keyPrefix = (key[0] << 8) + key[1] // assumes key.length > 2
partitionKey = keyPrefix % subrangeLength // %=floating point mod [x-(int)(x/y)*y] reducer = hash(partitionKey) % NUM_REDUCERS
sendTo(reducer, [key, val])
b.
// framework sorts all tuples by key reduce (String key, Iterator values):
Emit (key)
10. Consider the following transactions.
Transaction A
LOCK B
LOCK A WRITE B := 100 RELEASE A WRITE A := 100 RELEASE B
Transaction B
LOCK A
READ A
LOCK B
WRITE B := A – 10 WRITE A := A + 10 RELEASE B RELEASE A
(a) Identify 2 of the potential isolation-related issues that can arise.
Examples:
Deadlock if TA successfully acquires LOCK B and TB successfully acquires LOCK A. In a couple of commands, they will try to acquire the others’ lock and cannot proceed.
Lack of serializability when TA releases the lock on A, but then writes. During that time,:
TB can READ A (2nd command)
TA can WRITE A + RELEASE B
TB can then proceed with a value of A from an intermediate state of TA. That violates isolation!
(b) Suggest “fixes” that would address the above.
Examples:
Deadlock -> use a protocol like wait die or wound wait Lack of serializability -> release after writing. Use 2PL