Project 3: Inverted Index

Due: November 13, 11:59 p.m.

The goal of this project is to become familiar with MapReduce, a popular programming model for distributed systems that was developed at Google, and with Hadoop, its open-source implementation released by Apache.

Part 0: Set Up

Eclipse Set Up

Clone the GitHub assignment repository.

Click on InvertedIndex within the Working Tree and check out the project.

You may need to set the source directory. Go to "Configure Build Path", remove the existing source directory, and add "src/main/java" as the source directory.

Word Count Code

Create a .jar file from your Java project and specify the main program as the WordCount class. In Eclipse, if you use "Export --> Java --> Jar", the wizard steps you through the creation of the .jar file.

AWS EMR Overview using Word Count Example

From AWS documentation: "When you finish working with your Starter Account, close your browser tab. Important: If you choose End Lab, you will lose access to your Starter Account. Do not choose End Lab unless you no longer want to use your Starter Account."

Google Cloud with Word Count Example

Setting up with Google Cloud Dataproc is similar to setting up with Amazon EMR. First, set up your account, following the emailed directions.

Part 1: Build an Inverted Index

An inverted index maps words to their locations in a set of documents. Most modern search engines use some form of inverted index to process user-submitted queries. In its most basic form, an inverted index is a hash table that maps each word in the documents to some sort of document identifier. For example, given the following two documents:

Doc1:
Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.
Doc2:
Buffalo are mammals.

We could construct the following inverted file index:

 
      buffalo -> Doc1, Doc2
      are -> Doc2
      mammals -> Doc2 
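To make the hash-table view concrete, here is a minimal in-memory sketch in Python (not MapReduce) that rebuilds the index above. The name build_index and the choice to lowercase the text and keep only runs of letters are assumptions made for illustration; the assignment does not prescribe them.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the sorted list of documents containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        # Assumption: a "word" is a run of ASCII letters after lowercasing.
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(name)  # a set ignores repeated occurrences
    return {word: sorted(names) for word, names in index.items()}


docs = {
    "Doc1": "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.",
    "Doc2": "Buffalo are mammals.",
}

for word, names in build_index(docs).items():
    print(word, "->", ", ".join(names))
```

Using a set per word is what collapses the eight occurrences of "buffalo" in Doc1 into a single entry.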
      

Your goal is to build an inverted index of words to the documents that contain them.

Create an index of the files in the prelimtest dataset and then the final dataset. The "starter" input files in the folder preliminput give you a smaller set to test on first.

Your end result will be of the form: (word, docname_list).
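One way to think about producing (word, docname_list) pairs is as a map step, a sort/group step, and a reduce step. The sketch below simulates those phases in plain Python on a single machine; it is not Hadoop code, and the names map_phase, reduce_phase, and run_job, as well as the sample file names, are hypothetical.

```python
import re
from itertools import groupby
from operator import itemgetter


def map_phase(docname, text):
    """Emit one (word, docname) pair per word occurrence, as a mapper would."""
    for word in re.findall(r"[a-z]+", text.lower()):
        yield word, docname


def reduce_phase(word, docnames):
    """Collapse every document name seen for a word into a sorted, duplicate-free list."""
    return word, sorted(set(docnames))


def run_job(docs):
    """Run map, then sort and group by word, then reduce, all in memory."""
    pairs = [pair for name, text in docs.items() for pair in map_phase(name, text)]
    pairs.sort(key=itemgetter(0))  # stands in for the framework's shuffle/sort
    return [
        reduce_phase(word, [doc for _, doc in group])
        for word, group in groupby(pairs, key=itemgetter(0))
    ]


if __name__ == "__main__":
    for word, names in run_job({"a.txt": "Cats purr", "b.txt": "cats nap"}):
        print(word, names)
```

Note that the simulated mapper is handed the document name directly, which is the piece the real job has to recover on its own.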

The big hiccup here is that the default file format doesn't provide you with the name of the file in your map function. You will have to figure out some way around this. I suggest checking out Mapper.Context, InputFormat, and InputSplit.

Inverted Index Requirements

Cluster

The default cluster is fine for the smaller input set. For the larger input, you may want to increase the number of nodes involved in the cluster.

Downloading the output files

You can click on each file to download it. If you want to download in bulk, you'll need gsutil.

Resources

Part 2: Querying Inverted Index

Write a query program that queries your inverted file index. Your program will take as input a user-specified word (or phrase) and return the IDs of the documents that contain that word.

Your program should take the directory location of your inverted index (output) files as a command-line argument.
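As a rough sketch of the loading step in Python, the code below reads every output file in a directory into a dictionary. It assumes the files are named part-* and that each line holds a word, a tab, and a comma-separated list of document names; your own output format may differ, and the names load_index, main, and query.py are hypothetical.

```python
import sys
from pathlib import Path


def load_index(index_dir):
    """Read all part-* files in index_dir into a {word: set_of_docnames} dict."""
    index = {}
    for part in sorted(Path(index_dir).glob("part-*")):
        with part.open(encoding="utf-8") as handle:
            for line in handle:
                # Assumed line layout: word<TAB>doc1,doc2,...
                word, sep, docs = line.rstrip("\n").partition("\t")
                if not sep:
                    continue  # skip lines that do not match the assumed layout
                names = [d.strip() for d in docs.split(",") if d.strip()]
                index.setdefault(word, set()).update(names)
    return index


def main(argv):
    """Expect exactly one argument: the directory holding the index files."""
    if len(argv) != 2:
        print("usage: query.py <index_dir>", file=sys.stderr)
        return 2
    index = load_index(argv[1])
    print(f"loaded {len(index)} distinct words")
    return 0
```

Calling sys.exit(main(sys.argv)) at the bottom of the script would wire this to the command line.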

Handle input in any letter case and with any punctuation, and handle stop words.
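A minimal sketch of that input handling in Python follows. The stop-word list is a small hypothetical sample, the names normalize and search are illustrative, and treating a phrase as "documents that contain every remaining word" is an assumption; the assignment leaves these choices to you.

```python
import string

# Hypothetical sample; substitute whatever stop-word list your index uses.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def normalize(query):
    """Lowercase the query, strip surrounding punctuation, and drop stop words."""
    terms = []
    for token in query.lower().split():
        token = token.strip(string.punctuation)
        if token and token not in STOP_WORDS:
            terms.append(token)
    return terms


def search(index, query):
    """Return the sorted documents that contain every non-stop-word term."""
    terms = normalize(query)
    if not terms:
        return []  # the query held only stop words or punctuation
    result = set(index.get(terms[0], ()))
    for term in terms[1:]:
        result &= set(index.get(term, ()))
    return sorted(result)


if __name__ == "__main__":
    sample = {"buffalo": {"Doc1", "Doc2"}, "mammals": {"Doc2"}}
    print(search(sample, "The BUFFALO, mammals!"))
```

The same normalization should match whatever the indexing job applied to the documents; otherwise lookups will miss words that are present.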

This code can be in either Java or Python. Put the code in the appropriate source location and document its use in the writeup.

Optional Extensions

If you get the above working and want to try something else, here are some optional extensions that you can experiment with:

Part 3: Writeup

As usual, describe the project: an overview/introduction, the architecture, and implementation. Show a snippet of the output (full output will be in GitHub).

Include general thoughts/reflection, including answers to the following questions:

Part 4: Submission

GitHub Classroom will make a snapshot of your repository at the deadline. Your repository should contain:

  1. Your writeup (PDF).
  2. All the files for your source code.
  3. A README, containing instructions for running your code and your jar file, e.g., what command-line arguments are required for your inverted index code and how to run your querying program (example call to run the program--including command-line arguments).
  4. Your output; it should be small enough to fit on GitHub. If it's not small enough, then share a folder with me on Box.