Assignment 2: Counting due 6:00pm February 14
Setting Up Your Development Environment
First, just so everyone's VM is sync'ed up, update all packages via:sudo yum update
If it's the first time you've done this after downloading the VM image, it might take a bit, so grab a cup of coffee.
After the VM is updated, clone the Git repo for the course. If you have the repo already cloned, make sure you do a pull to get the latest updates. When in doubt, type in your console to pull up the most recent commits, and it should match the latest commit here.
You'll find a directory named , which provides a template for your assignment. Copy the contents of the directory into your own assignments private repo, under . Note that the source directory contains (normally invisible) dot-files e.g., , etc.; remember to copy these also. Go ahead and commit the contents so that you can revert to this point easily. Go into : you should be able to type and successfully build the project.
Since we're going to be working with this basic repository structure for subsequent assignments, you should familiarize yourself with the setup. Let's first take a tour: Ant is a build system, and through Ivy, it downloads all dependent jars and places them in . That is, all the jars in are automatically placed there—you shouldn't ever need to worry about copying jars there directly. Also, should not be placed under version control.
How does Ivy know what dependencies to pull in? This is specified in , in this line:<dependency org="edu.umd" name="cloud9" rev="1.4.10" conf="*->*,!sources,!javadoc"/>
Ivy automatically finds and downloads Cloud9 and transitively pulls its dependencies also. Add to this file if you want to use any external libraries.
Source code is kept in : main code goes into , JUnit tests go into . There are source code stubs to get you started. If you use Eclipse as your IDE, you should be able to directly import the project.
After Ant successfully completes the build, the packaged jar is created in . Note that should not be placed under version control since it is built automatically.
For your convenience, generates four run scripts in :
Use to run any normal Java class with a , e.g.:etc/run.sh HelloWorld
Use to run a specific JUnit test, e.g.:etc/junit.sh SampleTest
Use to run a Hadoop job in local (standalone) mode, e.g.:etc/hadoop-local.sh WordCount -input bible+shakes.nopunc.gz -output wc
Use to run a Hadoop job in the VM in pseudo-distributed mode, e.g.:etc/hadoop-cluster.sh WordCount -input bible+shakes.nopunc.gz -output wc -numReducers 5
Ant provides a couple other useful features. To run all test cases:ant test
If you're getting an error along the lines of "the class org.apache.tools.ant.taskdefs.optional.junit.JUnitTask was not found" or "java.lang.ClassNotFoundException: org.apache.tools.ant.taskdefs.optional.TraXLiaison", do:sudo yum install ant-junit sudo yum install ant-trax
To generate Javadoc:ant javadoc
The API docs will be deposited in .
This assignment begins with an optional but recommended component: complete the bigram counts exercise in Cloud9. The solution is already checked in the repo, so it won't be graded. Even if you decide not to write code for the exercise, take some time to sketch out what the solution would look like. The exercises are designed to help you learn: jumping directly to the solution defeats this purpose.
In this assignment you'll be computing pointwise mutual information, which is a function of two events x and y:
The larger the magnitude of PMI for x and y is, the more information you know about the probability of seeing y having just seen x (and vice-versa, since PMI is symmetrical). If seeing x gives you no information about seeing y, then x and y are independent and the PMI is zero.
Write a program that computes the PMI of words in the sample corpus. To be more specific, the event we're after is x occurring on a line in the file or x and y co-occurring on a line. That is, if a line contains A, A, B; then there are not two instances of A and B appearing together, only one. To reduce the number of spurious pairs, we are only interested in pairs of words that co-occur in ten or more lines. Use the same definition of "word" as in the word count demo: whatever Java's gives.
You will build two versions of the program:
- A "pairs" implementation. The implementation must use combiners. Name this implementation .
- A "stripes" implementation. The implementation must use combiners. .
If you feel compelled (for extra credit), you are welcome to try out the "in-mapper combining" technique for both implementations.
Since PMI is symmetrical, PMI(x, y) = PMI(y, x). However, it's actually easier in your implementation to compute both values, so don't worry about duplicates. Also, use so the results of your program are human readable.
Note: just so everyone's answer is consistent, please use log base 10.
Answer the following questions:
Question 0.Briefly describe in prose your solution, both the pairs and stripes implementation. For example: how many MapReduce jobs? What are the input records? What are the intermediate key-value pairs? What are the final output records? A paragraph for each implementation is about the expected length.
Question 1. What is the running time of the complete pairs implementation (in your VM)? What is the running time of the complete stripes implementation (in your VM)?
Question 2. Now disable all combiners. What is the running time of the complete pairs implementation now? What is the running time of the complete stripes implementation?
Question 3. How many distinct PMI pairs did you extract?
Question 4. What's the pair (x, y) with the highest PMI? Write a sentence or two to explain what it is and why it has such a high PMI.
Question 5. What are the three words that have the highest PMI with "cloud" and "love"? And what are the PMI values?
Note that you can compute the answer to questions 3—6 however you wish: a helper Java program, a Python script, command-line manipulation, etc.
Turning in the Assignment
Please follow these instructions carefully!
Make sure your repo has the following items:
- Similar to your first assignment, the answers to the questions go in .
- The pairs implementation should be in .
- The stripes implementation should be in .
- Of course, your repo may contain other Java code, which goes in in .
When grading, I will perform a clean clone of your repo in my VM, and type to build. Your code should build successfully.
Next, I'll type (exactly) the following command to run the pairs implementation (in the VM):etc/hadoop-cluster.sh PairsPMI -input bible+shakes.nopunc.gz -output YOURNAME-pairs -numReducers 5
You can assume that is already in HDFS but otherwise there is nothing else on HDFS. The final output should appear in a directory called . The part files in that directory should be human readable.
Similarly, I'll type the following command to run the stripes implementation (in the VM):etc/hadoop-cluster.sh StripesPMI -input bible+shakes.nopunc.gz -output YOURNAME-stripes -numReducers 5
As in the pairs case, you can assume that is already in HDFS but otherwise there is nothing else on HDFS. The final output should appear in a directory called . The part files in that directory should be human readable.
Before you consider the assignment "complete", I would recommend that you verify everything above works by performing a clean clone of your repo and going through the steps above.
One final suggestion: sometimes Ivy gets into a weird state due to multiple interacting repositories. Just to make sure I can pull in all dependencies, remove the Ivy cache with and make sure the build still works. Ivy should re-download all dependent jars from their original sources.
When you've done everything, commit to your repo and remember to push back to origin. You should be able to see your edits in the web interface. That's it! There's no need to send me anything—I already know your username from the first assignment. Note that everything should be committed and pushed to origin before the deadline (before class on February 14).
- Did you take a look at the bigram counts exercise?
- Your solution may require more than one MapReduce job.
- Recall from lecture techniques for loading in "side data"?
- Look in for a reference implementation of the pairs and stripes techniques.
- Note that you have access to everything that's in Cloud9, for example, there are many useful types in .
The entire assignment is worth 35 points:
- Each of the questions 1 to 5 is worth 2 points, for a total of 10 points.
- The pairs implementation is worth 10 points and the stripes implementation is worth 10 points. The purpose of question 0 is to help me understand your implementation.
- Getting your code to run is worth 5 points. That is, to earn all five points, I should be able to run your code (building and running), following exactly the procedure above. Therefore, if all the answers are correct and the implementation seems correct, but I cannot get your code to build and run inside my VM, you will only get a score of 30/35.
Back to top
There will be three programming projects and three written homework assignments.
|Project 1: Distributed Password Cracker||08/28/10||09/23/2010|
|Homework 1||9/8/2010||9/15/2010 11:59pm||Homework 1 Solutions|
|Project 2: Distributed File System||9/30/2010|
|Homework 2||11/1/2010||11/10/2010 11:59pm||Homework 2 Solutions|
|Project 3: Hadoop|
Project 3: Hadoop MapReduce Programming
Project 2: A Distributed File System
Project 2 files can be found here:
The project rubric is as follows:
- part1 15 points (5 for LOSSY)
- part2 10 points
- part3: 15 points
- part4: 15 points
- part5: 20 points (5 caching improvement, 5 for LOSSY)
- part6: 10 points
- style: 15 points
In Project 2, you will be doing your development in a Virtual Machine, and we will be supporting VirtualBox as the virtualization software. Please download VirtualBox for your system here: http://www.virtualbox.org. A brief introduction to VirtualBox for the purposes of this project can be found here: VirtualBox Intro for 15-440. You can find the Ubuntu image here: VirtualBox Ubuntu Image with FUSE support (Warning: 1.4GB file)
Project 1: A Simple Distributed Password Cracker
Using a trivially parallelizable, easy computation (brute-force cracking a password), this lab introduces students to the communication and coordination challenges involved in harnessing a cluster or wide-area distributed group of machines to accomplish a common goal.
- Using the provided protocol description, implement the coordinator process and slave processes that run on the worker machines. Your processes must be able to inter-operate with the course-supplied example binaries.
- Implement a reliable work assignment mechanism, operating on top of UDP, that successfully allocates jobs to slave machines, can deal with the loss or delay of UDP packets, and calculates timeouts to reschedule work that was allocated to failed machines.
Project 1 files can be found here:
All homework and the first project is to be done individually. The second and third programming projects will be done in groups of two students.
The later projects are done in groups for two reasons. The first is the size of the class. The second and more important reason is that this is an opportunity to experience the joys and frustrations of working with others. It's a skill you only get better at with practice.
Since 15-440 fulfills the project-class requirement of the CS degree, you will be expected to learn and practice good software engineering, as well as demonstrate mastery of the networking concepts. Both partners in a project group will need to fully understand the project and your solution in order to do well on those exam questions relating to the projects. For example, a typical question might be: "When you implemented X, you came across a particular situation Y that required some care. Explain why this simple solution Z doesn't work and describe how you solved it." We'll pick questions such that it will take some effort to figure out Y. If you didn't take the time to work the problem yourself and just relied on your partner, you won't have enough time during the test to figure it out. Be careful, the insights you'll need will come only from actually solving the problem as opposed to just seeing the solution.
By their nature, the assignments aren't going to be completely comprehensive of everything you'll encounter in the real world or in class. To assist you, we've compiled a list of suggested study problems that you may want to do in addition to the normal homework. They're not graded, but they'd make great topics to discuss with the course staff during office hours.-->
Notes on the Programming Projects
A key objective of this course is to provide a significant experience with system programming, where you must write programs that are robust and that must integrate with a large, installed software base. Oftentimes, these programs are the ones that other people will build upon or use as tools. Systems programming is very different from the application program development you have done in earlier courses:
- It is typically done in a low-level language, such as C, to ensure close control over system resources.
- Especially with server code, it must be designed to run indefinitely. It must handle reliably handle every possible error condition, and it must manage resources such as memory with care.
- It must be secure. Connecting a system to a network makes it vulnerable to malicious attacks initiated anywhere in the world. Poorly designed or implemented network software provides a common entrypoint for attack. System software must be invulnerable to flaws such as string overflows or malformed incoming messages. (This point bears repeating: Any system software must stringently check input it receives from the network or from the user. Do not trust either one! They're often out to get you.)
- The interfaces to other parts of the system are generally specified by documented protocols.
- Distributed systems nearly always involve concurrency, both within individual machines (multiple processes or threads) as well as among the different network components.
- An important part of system programming is to develop comprehensive test methods for the programs. A significant effort should be invested in writing programs that will thoroughly test the system code, including the handling of different error conditions.
Finally, please note that by design, the projects do not always specify every corner case bit of behavior or every design decision you may have to make. A major challenge in implementing real systems is in making the leap from a specification that is often slightly incomplete to a real-world implementation. Don't get frustrated -- our grading will not dock you for making reasonable design decisions! We suggest three general guidelines to follow:
- Be conservative in what you do, be liberal in what you accept from others.. This is the design guideline underlying many Internet services, first uttered as a robustness principle by Jon Postel in the first TCP RFC, RFC 793.
- Browse the newsgroup and FAQ, ask the course staff!
- Make a reasonable design decision and document it. In a perfect world, all aspects of a design would be comletely specified, but most real-world, large, complex systems do not achieve this goal. You will often hear the course staff reply: You may pick either alternative as long as your server does not crash. This advice applies particularly to error handling, where there are a nearly infinite number of possible errors with partially-specified error responses. The goal of the course is to gain experience with creating large systems; we don't expect students to be psychic, merely to exercise good judgement about creating a robust and usable system.
We'll go into more detail about each of these points during the recitation sections. But keep in mind: The programming assignments are larger and more open-ended than in other courses. Doing a good job on the project requires more than just producing code that runs: it should have a good overall organization, be well implemented and documented, and be thoroughly tested.
Last updated: Wed Dec 01 22:31:05 -0500 2010 [validate xhtml]