
Assign 2: Unix Filters and Regular Expressions

Due: Wednesday before class

Goals for Assignment 2

After the assignment, you should

  1. Further customize your environment.
  2. Understand pipes and how to use them effectively.
  3. Know how to use filter commands, e.g., sort, uniq, cut, paste, grep, etc.
  4. Know how to analyze data files using the above tools.

Objective: Set Up

Create an assign2 subdirectory within your cs397/assignments directory.

Copy all the files from /csdept/courses/cs397/handouts/assign2 into your assign2 directory.

Objective: Customizing Your Environment (10 pts)

Open your ~/.bashrc file and create an alias called "peptalk" that repeatedly prints "You can do it!" Reload your configuration file by running source ~/.bashrc, then test that your new alias works. (I'll be able to see that you did this correctly from the next objective.)
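As a reminder of the alias syntax, here is a minimal sketch. The alias name greet and its message are hypothetical, chosen only for illustration; writing peptalk itself is left to you.

```shell
# Hypothetical alias for illustration only (not the peptalk alias).
# Single quotes keep the command text from being expanded when the alias is defined.
alias greet='echo "Hello from my alias"'

# Print the stored definition to confirm that the alias exists.
alias greet
```

A line like the first one, placed in ~/.bashrc, takes effect in new terminals or after you run source ~/.bashrc.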

Objective: UNIX Practice (20 pts)

In a new terminal, save the following commands and their output in a script file called practice.out by executing the following command:
script practice.out

Do the following operations:

  1. Show all your aliases.
  2. Display the first column of villains.txt and the first column of heroes.txt parallel to each other. You should see matchups between the heroes and villains.
  3. Rerun the previous command but put those matchups into a file called matchups.txt.
  4. Display the unique villains from villains.txt.
  5. How many unique villains are there?
  6. Exit the script.

View your practice.out file to make sure that you recorded all the above commands.
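If you want to warm up on the filters first, the sketch below runs cut, paste, sort, uniq, and wc on two tiny made-up files. The file names, their contents, and the tab-separated layout are assumptions for illustration; check the format of the handout files before reusing any of it.

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)

# Hypothetical sample data: tab-separated "name<TAB>detail" lines.
printf 'apple\tred\nbanana\tyellow\napple\tgreen\n' > "$tmp/fruits.txt"
printf 'carrot\torange\npea\tgreen\nleek\twhite\n' > "$tmp/veggies.txt"

# cut selects a column; tab is its default delimiter.
cut -f1 "$tmp/fruits.txt" > "$tmp/f1.txt"
cut -f1 "$tmp/veggies.txt" > "$tmp/v1.txt"

# paste joins corresponding lines of its inputs side by side.
pairs=$(paste "$tmp/f1.txt" "$tmp/v1.txt")
printf '%s\n' "$pairs"

# uniq only collapses adjacent duplicates, so sort first; wc -l counts lines.
distinct=$(sort "$tmp/f1.txt" | uniq | wc -l | tr -d ' ')
printf '%s\n' "$distinct"

rm -r "$tmp"
```

The tr -d ' ' at the end only strips the padding that some versions of wc add around the number.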

Objective: grep Family and Regular Expressions Practice (30 pts)

For the following questions, try the commands out first in another terminal (probably in intermediate steps), and then show me only the final correct command and answer in the script file. Use the most appropriate member of the grep family to solve each problem. Record the command for each question, along with its answer, in a file called regex.txt. Then, execute the command in a script file called regex_practice.out.

I worded these questions to encourage the use of options or regular expression tricks.

Unless otherwise stated, you can assume you're just looking for lowercase letters.

  1. How many words in /usr/share/dict/words contain "cei" somewhere in the word?
  2. Is "Aaronic" a word, according to /usr/share/dict/words? (Reminder: get as little output as possible to answer this question.)
  3. How many words in /usr/share/dict/words contain either the sequence "yes" or the sequence "no"?
  4. How many words in /usr/share/dict/words contain at least 3 vowels in a row?
  5. How many words in /usr/share/dict/words contain no upper or lowercase vowels?
  6. How many words in /usr/share/dict/words contain at least 4 o's (need not be consecutive)?
  7. How many words in /usr/share/dict/words begin and end with a vowel, but have no vowels in between?
  8. How many words in /usr/share/dict/words begin and end with the same vowel (a, e, i, o, or u)?
  9. How many words in /usr/share/dict/words begin and end with the same 3-letter sequence?
  10. How many words in /usr/share/dict/words contain 3 copies of the same 3-character sequence (not necessarily consecutively)?
  11. Display the words in /usr/share/dict/words that have 3 consecutive double-letter pairs (like "bookkeeper" has oo, kk, ee).
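The kinds of options and regular expression features these questions call for can be tried out on a small list first. The sketch below uses a made-up word list (an assumption for illustration, not /usr/share/dict/words), and the patterns are deliberately not the ones the questions ask for.

```shell
tmp=$(mktemp -d)
# Hypothetical mini word list, one word per line.
printf 'level\nbanana\nqueue\nrhythm\nmama\nApple\ncan\ntsk\narea\n' > "$tmp/words.txt"

# -c prints the number of matching lines instead of the lines themselves.
has_an=$(grep -c 'an' "$tmp/words.txt")

# -E enables extended regular expressions, such as {n,} repetition.
three_vowels=$(grep -cE '[aeiou]{3,}' "$tmp/words.txt")

# -v inverts the match and -i ignores case: lines containing no vowel at all.
no_vowels=$(grep -civ '[aeiou]' "$tmp/words.txt")

# A backreference (\1) matches whatever the first group matched:
# here, words whose first and last characters are the same.
same_ends=$(grep -c '^\(.\).*\1$' "$tmp/words.txt")

# -q prints nothing; only the exit status reports a match. -x matches whole lines.
if grep -qx 'queue' "$tmp/words.txt"; then found=yes; else found=no; fi

echo "$has_an $three_vowels $no_vowels $same_ends $found"
rm -r "$tmp"
```

Counting with -c is usually preferable to piping into wc -l when only the number of matching lines is wanted.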

To help you verify your answers, here are the answers for some of the questions.

Objective: Analyzing Student Information

Analyzing Names (50 pts)

For this objective, since there is the potential for so much output in each of the intermediate steps, first figure out the solution for each (numbered) problem. I want the least amount of output that shows me the answer. (For example, if I ask you for a number of something, don't show me all of them and make me count them up.) Then, after you have figured out the solutions to all of the problems, run script student_info.out and demonstrate the final command that shows me the answer, for each problem, sequentially. For each problem, you should be able to solve it using one command.

You should probably review the commands cut, sort, uniq, and grep to solve these problems.

students.csv contains the names of all currently enrolled W&L undergraduates. Check out the format of the file.

In a separate text file called name_analysis.txt, write a short report that makes it clear what your answers are to each of these problems. Also, address the question: how precise is the first name analysis, given this text file?

  1. How many names are listed in this file?
  2. What are the 5 most common last names at W&L? The final result/output should have the last names sorted in decreasing order of frequency.

    Example output:

      10 Sprenkle
       8 Smith
       5 Brown
       5 Jolie-Pitt
       4 Washington
    
  3. How many unique first names (i.e., only one person has that name) are there in the W&L undergraduate class?
  4. What are the 5 most popular first names at W&L?
  5. How common is your first name at W&L, i.e., how many students at W&L have the same name as you and where does it rank in popularity?

    Example output:

    52:      6 Sara

    From the above output, I know there are 6 Saras and it is the 52nd most popular name. (Note that this is not the answer that Sarah should get.)

  6. Pose a question that you'd like to answer with this data and answer it. Explain the question and your result in the analysis document.
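A common pattern behind ranked counts like the example outputs above is "select a field, sort, count runs, sort by count". The sketch below shows it on made-up data; the file name people.csv and its "last,first" layout are assumptions, so check how students.csv is really formatted.

```shell
tmp=$(mktemp -d)
# Hypothetical data in "last,first" form; the real students.csv may differ.
printf 'Lee,Ana\nKim,Ben\nLee,Cal\nOrr,Ana\nLee,Ana\nKim,Dee\n' > "$tmp/people.csv"

# Frequency table: pick a field, sort so duplicates are adjacent,
# count each run with uniq -c, then sort numerically, largest first.
ranked=$(cut -d, -f1 "$tmp/people.csv" | sort | uniq -c | sort -rn)
printf '%s\n' "$ranked" | head -n 2

# grep -n prefixes each match with its line number, which in a
# ranked list is the rank.
rank_line=$(printf '%s\n' "$ranked" | grep -n 'Kim')
printf '%s\n' "$rank_line"

# uniq -u keeps only the values that occur exactly once.
singles=$(cut -d, -f2 "$tmp/people.csv" | sort | uniq -u | wc -l | tr -d ' ')
printf '%s\n' "$singles"

rm -r "$tmp"
```

Note that a plain pattern such as Kim also matches longer names that contain it; grep -w restricts the match to whole words.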

Analyzing Majors Data (50 pts)

Follow the general process from the previous part. Answer the following questions, showing your work by running script majors_info.out, and write your analysis of the data in majors_analysis.txt. Your data file is majors.txt. You'll have to figure out what the contents of the file are, but I will say that BN entries are non-degree-seeking students, since those entries are "off" from the others.

  1. How many different degrees are being pursued at W&L? (e.g., BA, BS, ...)
  2. What is the most popular degree being pursued?
  3. How many students are expected to graduate in 2017?
  4. How many students are still undecided?
  5. How many students are pursuing a second major?
  6. How many students are pursuing CSCI as their first major and where does CSCI rank in popularity for first majors?
  7. The History department offers several different concentrations in their major, as indicated by the last two letters in the major name. What are the various concentration/majors that the history department offers?
  8. Pose one question that you'd like to answer about this data and answer it. Discuss in the analysis document.

Objective: Analyzing Log Files (40 pts)

Long-running applications/services often write output to log files so that people can analyze them or diagnose problems. One of my log files from the web server running on cswiki.wlu.edu is in the handouts directory. The file is called error.log.

Look at the file to get an idea of its contents. Then, answer the following questions, with the command you used to answer each question, in a file called log_analysis.txt.

  1. How large is the file (both in terms of size and number of lines)?
  2. What version of Apache is the web server running?
  3. There are errors in the file... How can you find all the errors described in the file?
  4. How many SSL errors are recorded in the log?
  5. Most importantly, there are fatal errors listed in the log file. What are the fatal errors?
  6. How can you fix the fatal error? For this question, I want you to search the web to find out how to fix the error. Tell me the page(s) you found and what on that page you would try to fix the error.
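The same filters apply to log files. The sketch below measures and searches a tiny made-up log; the file name sample.log and every line in it are invented for illustration, and the real error.log may use a different layout and different keywords.

```shell
tmp=$(mktemp -d)
# Hypothetical log lines for illustration only.
cat > "$tmp/sample.log" <<'EOF'
[Mon Jan 05 10:00:01 2015] [notice] server started
[Mon Jan 05 10:02:13 2015] [error] File does not exist: /tmp/missing
[Mon Jan 05 10:05:40 2015] [warn] slow response
[Mon Jan 05 10:07:02 2015] [error] SSL handshake failed
EOF

# wc -l counts lines and wc -c counts bytes; reading from stdin omits the file name.
lines=$(wc -l < "$tmp/sample.log" | tr -d ' ')
bytes=$(wc -c < "$tmp/sample.log" | tr -d ' ')

# -F treats the pattern as a fixed string, so [ and ] are not read as a bracket expression.
errors=$(grep -cF '[error]' "$tmp/sample.log")

# -i ignores case, which helps when a keyword's capitalization varies.
ssl=$(grep -ci 'ssl' "$tmp/sample.log")

echo "$lines lines, $bytes bytes, $errors errors, $ssl SSL"
rm -r "$tmp"
```

Dropping -c from either grep shows the matching lines themselves, which is useful when the question asks what the errors are and not how many there are.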

Finishing up: What to turn in for this assignment

Copy your directory assign2 and its contents into your turnin directory. (You may want to use your symbolic link or the environment variable.)

Grading (200 pts)