Assign 7: Working with Data
Due: Friday before class
Goals for Assignment 7
- practice using PostgreSQL
- practice using MongoDB
- practice using Elasticsearch
- analyze the tradeoffs between them
Background: Our Data
I created fake data about students--specifically, their graduation year, degree, and major(s)--for us to work with. (I'm not sure whether students' actual majors are private information, so I generated the data. It may not match reality in places, but bear with me.)
Querying Data
We are going to focus just on searches (read-only queries, no updates) so that everyone gets consistent results, regardless of when they do the various parts of the assignment.
For each of our data systems, you will answer the same set of questions:
- How many students are in the data set?
- How many students are graduating this year?
- Do any students have the same last name as you? If so, who are they?
- Who are the CSCI graduates this year?
- How many students in the junior class are getting a BS in CSCI?
- How many different last names does this set of students have?
Objective: Using PostgreSQL (50)
Try out your answers in the cs397 database.
Create an assign7.sql file that contains the SQL statements to answer the questions listed under Querying Data. I should then be able to run psql cs397 < assign7.sql and see all the answers.
Some PostgreSQL Resources:
Objective: Using MongoDB (50)
Try out your answers in the cs397 database. The collection is called students.
Create an assign7.mongo file that contains the commands to answer the questions listed under Querying Data. I should then be able to run mongo cs397 < assign7.mongo and see your answers.
Some MongoDB Resources
Objective: Using Elasticsearch (50)
Try out your queries on the index cs397. The type is student.
Create an assign7.sh file that contains the curl commands to answer the questions listed under Querying Data. I should then be able to run bash assign7.sh and see your answers.
If you get an error about fielddata, try using the field's keyword sub-field instead. For example, if you're trying to look at lastname, use lastname.keyword instead.
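As a rough illustration of the keyword hint above, the sketch below asks for the five most common last names with a terms aggregation on lastname.keyword. It is not one of the assignment questions. The host and port (localhost:9200) and the aggregation name by_lastname are assumptions; only the index cs397 and the field lastname come from this assignment.

```shell
#!/usr/bin/env bash
# Sketch only: the host/port and the aggregation name "by_lastname" are
# assumptions; the index (cs397) and field (lastname) come from the assignment.
ES_URL="${ES_URL:-http://localhost:9200}"

# Aggregating on the analyzed text field "lastname" can raise the fielddata
# error, so the aggregation targets the "lastname.keyword" sub-field instead.
QUERY='{
  "size": 0,
  "aggs": {
    "by_lastname": {
      "terms": { "field": "lastname.keyword", "size": 5 }
    }
  }
}'

# The top-level "size": 0 suppresses individual hits so only buckets come back.
curl -s --max-time 5 -H 'Content-Type: application/json' \
  "$ES_URL/cs397/_search?pretty" -d "$QUERY" \
  || echo "request failed; is Elasticsearch reachable at $ES_URL?" >&2
```

If the same request is sent with "field": "lastname" and the server reports a fielddata error, switching back to lastname.keyword is the fix described above.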
References for Elasticsearch
Objective: Analysis (50)
In a text file called analysis.txt, answer the following questions. When answering, also address whether the order in which you used the systems biased your analysis.
- Which system was easiest to query? Why? Did the ease of use change with the kinds of queries you did? The questioner may also introduce bias: if I had phrased the queries differently, would your answer change?
- Which system provided the easiest to understand results? Why? Discuss tradeoffs, with examples as appropriate.
- If you had to update these databases (e.g., as students declare their majors), how difficult would each be, respectively?
- If you were starting a new project that required data, which would you use? Please explain and support your answer with examples of your considerations.
Finishing up: What to turn in for this assignment
Copy the assignment to your turnin directory.
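Before copying, it may help to confirm that all four files named in this assignment are present. The sketch below is one possible check. The helper missing_files is hypothetical and not part of any course tooling, and the directory argument is whatever folder holds your work.

```shell
#!/usr/bin/env bash
# Sketch only: missing_files is a hypothetical helper, not course tooling.
# The file names are the ones this assignment asks for.
REQUIRED_FILES=(assign7.sql assign7.mongo assign7.sh analysis.txt)

# Print the name of every required file that is absent from directory $1
# (defaults to the current directory).
missing_files() {
  local dir="${1:-.}" f
  for f in "${REQUIRED_FILES[@]}"; do
    [ -f "$dir/$f" ] || echo "$f"
  done
}

missing_files "${1:-.}"
```

Run with your working directory as the argument, the script prints nothing when every file is present and otherwise lists the missing ones, one per line.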
Grading
See grade breakdown above.
- Sufficiently narrow queries that get (only) the data needed to answer the question.
- Clear, well-justified analysis.