CSC 357 Programming Assignment 1
sgrep -- A Simplified Version of Grep
-- REVISED --



ISSUED: Monday, 2 April 2007
DUE: On or before 11:59:59PM Wednesday 11 April, via handin on falcon/hornet
POINTS POSSIBLE: 100
WEIGHT: 5% of total class grade
READING: Lecture Notes Weeks 1 and 2, K&R Chapters 1-4 and 7, selected parts of Stevens,
various cited man pages

Specification

The deliverable for this assignment is a simplified version of the very useful grep utility. The program is called "sgrep", for "simple grep".

The following are excerpts from the grep man page that describe the functionality relevant to sgrep:

NAME
     grep - search a file for a pattern

SYNOPSIS
     grep [-iln] regular-expression [filename...]

DESCRIPTION

     The grep utility searches text files for a pattern and prints all lines
     that contain that pattern.  If no files are specified, grep assumes
     standard input. Normally, each line found is copied to standard output.
     The file name is printed before each line found if there is more than one
     input file.

     Be careful using the characters $, *, ., [, ], and ^ in the patternlist
     because they are also meaningful to the shell.  It is safest to enclose
     the entire patternlist in single quotes '...'.

     The grep utility uses limited regular expressions like those described on
     the regexp(5) manual page to match the patterns.

OPTIONS
     The following options are supported:

     -i   Ignores upper/lower case distinction during comparisons.

     -l   Prints only the names of files with matching lines, separated by
          NEWLINE characters.  Does not repeat the names of files when the
          pattern is found more than once.

     -n   Precedes each line by its line number in the file (first line is 1).
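
As an illustration of what -i implies for a plain (non-pattern) search, here is a minimal sketch that folds both characters to lower case before comparing them. The function name contains_ignore_case is hypothetical, and the sketch covers literal strings only, not regular expressions.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Return 1 if the literal string pattern occurs somewhere in line when
 * upper/lower case distinctions are ignored, and 0 otherwise.
 * (Hypothetical helper; literal strings only, no special characters.)
 */
int contains_ignore_case(const char *line, const char *pattern)
{
    size_t start, k;

    for (start = 0; ; start++) {
        /* Compare pattern against line beginning at this start position. */
        for (k = 0; pattern[k] != '\0'; k++) {
            unsigned char a = (unsigned char) line[start + k];
            unsigned char b = (unsigned char) pattern[k];
            if (tolower(a) != tolower(b))
                break;
        }
        if (pattern[k] == '\0')
            return 1;   /* reached the end of pattern: match */
        if (line[start + k] == '\0')
            return 0;   /* ran out of line: no later start can match */
    }
}
```

The casts to unsigned char matter because passing a negative char value to tolower is undefined.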

The regular expression argument is a form of generalized pattern that provides flexibility in searching through files. For example, the regular expression "T.*e" matches all strings that start with the letter "T" and end with the letter "e", such as "The", "These", "There".

The sgrep program uses a simplified subset of the regular expressions recognized by grep. The following are excerpts from the regexp man page that describe the structure of regular expressions handled by sgrep:

DESCRIPTION

     A regular expression specifies a set of character strings. A member of
     this set of strings is said to be matched by the regular expression. Some
     characters have special meaning when used in a regular expression; other
     characters stand for themselves.

     The following characters have special meaning in a regular expression:
         .   *   ^   $   [   ]
     All other characters match themselves.

     A period (.) is a one-character RE that matches any character except
     newline.

     A one-character RE followed by an asterisk (*) is a RE that matches 0 or
     more occurrences of the one-character RE.  If there is any choice, the
     longest leftmost string that permits a match is chosen.

     The caret (^) is special only when it appears at the beginning of a RE, and
     means that the RE only matches at the beginning of a line.

     The dollar sign ($) is special only when it appears at the end of a RE, and
     means that the RE only matches at the end of a line.

     A non-empty string of characters enclosed in square brackets ([]) is a
     one-character RE that matches any one character in that string.  The four
     special characters listed above stand for themselves within such a string
     of characters.

     The concatenation of REs is a RE that matches the concatenation of the
     strings matched by each component of the RE.
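
To make these rules concrete, the following is a minimal sketch of a recursive matcher for ., *, ^, and $, adapted from the well-known matcher in Kernighan and Pike's "The Practice of Programming". It is an illustration only, not a prescribed design for sgrep: it omits bracket expressions, it ignores any sgrep-specific limits, and the names match, match_here, and match_star are hypothetical. It assumes the trailing newline has already been removed from the line.

```c
#include <stddef.h>

/* Forward declarations for the mutually recursive helpers. */
static int match_here(const char *re, const char *text);
static int match_star(int c, const char *re, const char *text);

/* Return 1 if the regular expression re matches anywhere in text, else 0. */
int match(const char *re, const char *text)
{
    if (re[0] == '^')                 /* anchored: try only at the start */
        return match_here(re + 1, text);
    do {                              /* unanchored: try every start position */
        if (match_here(re, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}

/* Return 1 if re matches at the beginning of text. */
static int match_here(const char *re, const char *text)
{
    if (re[0] == '\0')
        return 1;                     /* whole pattern consumed */
    if (re[1] == '*')
        return match_star(re[0], re + 2, text);
    if (re[0] == '$' && re[1] == '\0')
        return *text == '\0';         /* '$' is special only at the end */
    if (*text != '\0' && (re[0] == '.' || re[0] == *text))
        return match_here(re + 1, text + 1);
    return 0;
}

/* Return 1 if c* followed by re matches at the beginning of text. */
static int match_star(int c, const char *re, const char *text)
{
    do {                              /* c* may match zero or more characters */
        if (match_here(re, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}
```

This sketch only answers whether a line matches, which is all that is needed to decide whether to print it. It tries the shortest expansion of * first, which gives the same yes/no answer as the longest-leftmost rule quoted above.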


For sgrep, regular expressions are limited to the following forms:

The following are specific limitations of the sgrep program:

If the limitations are not met, sgrep writes an error message to stderr and terminates with no further output. In the case of the pattern length, program termination is immediate. In the case of the line length being exceeded or an unreadable file, sgrep produces output up to the point of detecting the error, then terminates without processing any further lines or files.

The sgrep program does NOT need to handle abbreviated command-line options, in which two or more option characters can be concatenated together. E.g., the UNIX grep utility accepts the argument "-iln" as an abbreviation of "-i -l -n". Again, sgrep does not need to support such abbreviations.
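
One possible way to recognize the separate option arguments is sketched below. The struct options type and the parse_options function are hypothetical names, and treating a combined argument such as "-iln" as an error is an assumption; the text above says only that sgrep need not support it. The sketch also assumes the pattern itself does not begin with '-'.

```c
#include <string.h>

/* Hypothetical container for the three sgrep flags. */
struct options {
    int ignore_case;   /* -i */
    int names_only;    /* -l */
    int line_numbers;  /* -n */
};

/*
 * Scan argv[1..] for separate -i, -l, and -n arguments.
 * Returns the index of the first non-option argument (the pattern),
 * or -1 if an unrecognized option such as "-iln" is seen.
 */
int parse_options(int argc, char *argv[], struct options *opts)
{
    int i;

    opts->ignore_case = opts->names_only = opts->line_numbers = 0;
    for (i = 1; i < argc && argv[i][0] == '-'; i++) {
        if (strcmp(argv[i], "-i") == 0)
            opts->ignore_case = 1;
        else if (strcmp(argv[i], "-l") == 0)
            opts->names_only = 1;
        else if (strcmp(argv[i], "-n") == 0)
            opts->line_numbers = 1;
        else
            return -1;   /* assumed: anything else is rejected */
    }
    return i;
}
```

If the returned index equals argc, no pattern was supplied; any arguments after the pattern are the input file names.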

Sample Inputs and Outputs

Given a file named "input1" with the following contents

Von Neumann was the subject of many dotty professor stories.  Von
Neumann supposedly had the habit of simply writing answers to homework
assignments on the board (the method of solution being, of course,
obvious) when he was asked how to solve problems.  One time one of his
students tried to get more helpful information by asking if there was
another way to solve the problem.  Von Neumann looked blank for a
moment, thought, and then answered, "Yes.".

The following table describes the output of various grep commands.

Command                    Output
grep V input1              matches lines 1 and 6
grep v input1              matches lines 4 and 6
grep x input1              matches no lines
grep -i v input1           matches lines 1, 4, and 6
grep 'd.*y' input1         matches lines 1, 2, and 5
grep '^a' input1           matches lines 3 and 6
grep ',$' input1           matches line 3
grep '[.,]' input1         matches lines 1, 3, 4, 6, and 7

The -n and -l arguments do not affect how the match is performed, but only the format of the output. Without either of these two arguments, sgrep outputs the entire contents of each matched line. With -n, the line number precedes each matched line. With -l, only the names of files containing a match are printed, without the line contents (so -l makes the most sense when there are multiple input files).
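
The following sketch shows one way to print a matched line with the optional prefixes. It assumes the colon-separated layout that standard grep uses ("name:number:line"); the testing plan and the behavior of grep itself are the authority on the exact format. The function name print_match and its parameters are hypothetical.

```c
#include <stdio.h>

/*
 * Print one matched line to out.
 * show_name is nonzero when there is more than one input file;
 * show_number is nonzero when -n was given.
 * line is expected to still end with the '\n' that fgets read.
 */
void print_match(FILE *out, const char *name, int show_name,
                 long number, int show_number, const char *line)
{
    if (show_name)
        fprintf(out, "%s:", name);      /* assumed grep-style separator */
    if (show_number)
        fprintf(out, "%ld:", number);
    fputs(line, out);
}
```

With -l, a program would skip this function entirely, print the file name once on the first match, and move on to the next file.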

You are encouraged to play around with grep, to see how it behaves on various inputs. The behavior of sgrep is a proper subset of UNIX grep. This means that sgrep produces exactly the same output as grep, for the subset of arguments and regular expressions supported by sgrep.

The complete set of required input files is in the online class directory, at


http://www.csc.calpoly.edu/~gfisher/classes/357/programs/1/testing/inputs

The corresponding correct outputs are in

http://www.csc.calpoly.edu/~gfisher/classes/357/programs/1/testing/expected-output

The program 1 testing plan has complete details of input/output behavior your program must exhibit. The plan is in the file

http://www.csc.calpoly.edu/~gfisher/classes/357/programs/1/testing/plan.html

All of these testing files will be available by Thursday 5 April.

Implementation Suggestions

The following C library functions may be particularly useful in your implementation. You can read about these in K&R, Stevens, and the man pages.

Function    Description
printf      print to stdout
fgets       read a line of characters from a FILE* stream, including stdin
strlen      calculate the length of a string
strstr      locate the first occurrence of one string in another
strtok      find delimited tokens in a string
strcpy      copy strings
strcmp      compare strings
fopen       open a file
fclose      close a file
feof        test to see if a given stream has encountered an end of file
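
As a rough sketch of how fgets and strstr fit together for the simplest case (a literal string, no special pattern characters), consider the function below. The name copy_matching_lines and the MAX_LINE value are hypothetical, and the sketch does no error handling for overlong lines.

```c
#include <stdio.h>
#include <string.h>

#define MAX_LINE 1024  /* hypothetical buffer size, not taken from the spec */

/*
 * Copy every line of in that contains the literal string pattern to out.
 * Returns the number of lines copied.
 */
long copy_matching_lines(FILE *in, FILE *out, const char *pattern)
{
    char line[MAX_LINE];
    long count = 0;

    /* Loop on the result of fgets itself rather than on feof, which
       becomes true only after a read has already failed. */
    while (fgets(line, sizeof line, in) != NULL) {
        if (strstr(line, pattern) != NULL) {
            fputs(line, out);
            count++;
        }
    }
    return count;
}
```

Calling copy_matching_lines(stdin, stdout, argv[1]) from main would give a first, limited version of the program to build on.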

Your implementation can use any of the string processing functions described in the UNIX string(3C) library. However, your implementation canNOT use the functions provided in the regular expression libraries regex(3C), regcmp(3C), or regexp(5).

Your implementation of sgrep will have to use string variables, and functions that take string parameters. Declaring the string-valued parameters is easy -- just use 'char *' as their type.

To be useful in a program, string variables must be declared as character arrays of a specific size. This applies in particular to the pattern and line string variables you will declare. Section 1.9 of K&R has some useful examples for dealing with string variables, i.e., character arrays.
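
A fixed-size line buffer also gives a simple way to notice a line that is too long: if fgets fills the buffer without reaching a newline, the line did not fit. The sketch below illustrates the idea. LINE_MAX_LEN is a hypothetical limit, not a value taken from this handout, and whether the limit counts the newline is likewise an assumption; read_line and the three status codes are hypothetical names.

```c
#include <stdio.h>
#include <string.h>

#define LINE_MAX_LEN 100  /* hypothetical limit, excluding the newline */

/* Possible outcomes of read_line. */
enum { LINE_OK, LINE_EOF, LINE_TOO_LONG };

/*
 * Read one line from in into buf, which must hold LINE_MAX_LEN + 2 chars
 * (the line, its newline, and the terminating '\0').
 * The trailing newline, if any, is removed from buf.
 */
int read_line(FILE *in, char buf[LINE_MAX_LEN + 2])
{
    size_t len;

    if (fgets(buf, LINE_MAX_LEN + 2, in) == NULL)
        return LINE_EOF;
    len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[len - 1] = '\0';   /* complete line, within the limit */
        return LINE_OK;
    }
    /* No newline: either the last line of the file lacks one, or the
       line did not fit in the buffer. */
    if (len == LINE_MAX_LEN + 1)
        return LINE_TOO_LONG;
    return LINE_OK;
}
```

The buffer is two characters larger than the limit so that a line of exactly LINE_MAX_LEN characters still fits together with its newline and the terminating null character.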

The specification of sgrep is written to preclude the use of dynamic memory allocation, i.e., malloc, in the sgrep implementation. We will discuss this issue further in the coming weeks.

Deliverables

You must submit a single file named "sgrep.c" as the deliverable. This file contains the implementation of the sgrep program that meets the above specification.

Scoring Details

The testing plan cited above has the precise point breakdown for this program. This plan has all required test cases that your program must pass. There are no extra "hidden" input files that will be used.

The handout on coding conventions specifies point deduction categories for violations of the conventions. Since Program 1 does not require a .h file as a deliverable, the conventions regarding .h files do not apply to Program 1. For this program, the top-level program comment and the comments for each function should appear in the .c file.

Throughout the quarter, your instructor will stress the utility of incremental development, for the purposes of scoring points on your programs. The idea of this is to build a working program in a step-by-step fashion, starting with the simple functionality, and incrementally adding harder functionality. If you get your program to work for some of the simple cases, but not all of the harder ones, you can still score a decent number of points on the assignment.

As a concrete example of incremental development for this assignment, the following are steps you can use to implement sgrep. Included for each step is the number of points (out of 100) that successful completion of the step earns.

Step                                                                 Points
Step 1: simple string matches, input only from stdin, no patterns    15
Step 2: read from one file given on command line                     5
Step 3: read from multiple files given on command line               10
Step 4: -n option                                                    5
Step 5: -i option                                                    5
Step 6: -l option                                                    5
Step 7: patterns with '^'                                            5
Step 8: patterns with '$'                                            8
Step 9: patterns with '.'                                            8
Step 10: patterns with '[...]'                                       10
Step 11: patterns with '.' and '*'                                   12
Step 12: patterns with various combinations of operators             8
Step 13: error handling                                              4

You do not have to follow exactly these steps, in exactly this order. However, if you do, your development will likely go more smoothly, and you can earn some partial credit on the assignment if you cannot get everything to work.

Collaboration

NO collaboration is allowed on this assignment. Everyone must do their own individual work.

How to Submit the Deliverable

Submit your deliverable using the handin program on falcon/hornet. The specific command is

handin gfisher prog1 sgrep.c

Run this command from the directory where your copy of sgrep.c is stored.

You can resubmit your files as many times as you like, up to the submission deadline. Each new submission completely replaces the previously submitted file(s). If you follow an incremental development strategy, you can submit as many partially-working versions as you like, as each step is completed.


