CSC 103 Lecture Notes Week 5
Hashing



  1. Collection classes.
    1. The list and tree structures we have considered so far in 103 are typically referred to as collection classes.
    2. Simply put, a collection class is a data type that holds zero or more elements.
    3. As we have seen, different collection class implementations provide their own advantages and disadvantages in terms of performance.
      1. Using an unsorted array to implement a collection provides O(1) performance for accessing the ith element, but O(N) performance for most other operations.
      2. Using a linked list to implement a collection provides O(1) performance for operations at the boundaries of the collection, but O(N) performance for other operations.
      3. Using a balanced tree to implement a collection provides O(log N) performance for most key operations, but not O(1) for anything.
    4. These various performance characteristics help the implementor of a collection determine which is the most appropriate implementation under a certain set of circumstances.
      1. E.g., if one has a fixed amount of data where finding an element by its index position is the most important operation, then using an array can be a good choice.
      2. If one wants to implement a stack or queue with potentially unbounded size, a linked list is good.
      3. For a sorted collection that has a lot of inserts and deletes, a balanced tree is a good choice.

  2. The niche of hashed data structures in collection class performance.
    1. In these notes, and in Chapter 5 of the book, we consider another implementation technique for collection classes called hashing.
    2. A hashed data structure, typically referred to as a hash table, provides the following performance pattern:
      1. O(1) insertion, deletion, and search for a specific element, on average
      2. O(N) search for the next item in sorted order
    3. A hash table is useful in an application that needs to rapidly store and look up collection elements, without concern for the sorted order of the elements in the collection.
    4. A hash table is not good for storing a collection in sorted order.
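
The contrast between the two performance figures above can be illustrated with the standard java.util collections. This is only a sketch: the class SuccessorDemo and the method nextLarger are hypothetical names introduced here, and HashSet and TreeSet stand in for a hash table and a balanced tree.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical demo class; not part of the course code. */
class SuccessorDemo {

    /**
     * Find the smallest key greater than 'key' in an unordered set.
     * Every element must be examined, so this takes O(N) time.
     */
    static Integer nextLarger(Set<Integer> keys, int key) {
        Integer best = null;
        for (int k : keys) {
            if (k > key && (best == null || k < best)) {
                best = k;
            }
        }
        return best;   // null if no larger key exists
    }

    public static void main(String[] args) {
        Set<Integer> hashed = new HashSet<>(Arrays.asList(5, 1, 9, 3));
        TreeSet<Integer> sorted = new TreeSet<>(hashed);

        System.out.println(hashed.contains(9));      // fast lookup of a specific element
        System.out.println(nextLarger(hashed, 3));   // successor found by a full scan
        System.out.println(sorted.higher(3));        // a sorted tree finds it in O(log N)
    }
}
```
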

  3. Lookup tables.
    1. Before we look into the details of hash tables, we should consider the more general topic of a lookup table -- a collection in which information is located by some lookup key.
    2. In the collection class examples we've studied so far, the elements of the collection have been simple strings or numbers.
    3. In real-world applications, the elements of a collection are frequently more complicated than this, i.e., they're some form of record structure.
      1. For example, some applications may need a collection to store a simple list of string names, such as
        [ "Baker", "Doe", "Jones", "Smith" ]
        
      2. In many other cases, there may be additional information associated with names, such as age, id, and address; e.g.,
        [ {"Baker, Mary", 51, 549886295, "123 Main St."},
          {"Doe, John", 28, 861483372,"456 1st Ave."},
          ...
        ]
        
        1. In such a structure, collection elements are typically called information records.
        2. Information records can be implemented using a class in the normal way, e.g.,
          class InformationRecord {
              String name;        // Person name
              int age;            // Age
              int id;             // Unique id
              String address;     // Home address
          }
          
      3. When one needs to search for information records in a collection, one of the fields is designated as a unique key.
        1. This key uniquely identifies each record so it can be located unambiguously.
        2. In the above InformationRecord, the id field is a good choice for unique key.
    4. A collection structure that holds keyed information records of some form is generally referred to as a lookup table.
      1. The unique key is used to look up an entry in the table.
      2. If an entry of a given key is found, the entire entry is retrieved.
    5. The implementations we have studied for linked lists and trees can be used as lookup tables, since the type of element has been Object.
    6. In the case of a hash table, the structure is specifically suited for use as a lookup table.
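
As a minimal sketch of a lookup table keyed by id, the following uses the standard java.util.HashMap. The class name LookupTableDemo, the method buildTable, and the constructor added to the record class are assumptions made for illustration; they are not part of the course code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical demo class; not part of the course code. */
class LookupTableDemo {

    /** Same fields as the InformationRecord class above, plus a constructor. */
    static class InformationRecord {
        final String name;
        final int age;
        final int id;
        final String address;

        InformationRecord(String name, int age, int id, String address) {
            this.name = name;
            this.age = age;
            this.id = id;
            this.address = address;
        }
    }

    /** Build a small lookup table that uses the unique id field as the key. */
    static Map<Integer, InformationRecord> buildTable() {
        Map<Integer, InformationRecord> table = new HashMap<>();
        table.put(549886295,
            new InformationRecord("Baker, Mary", 51, 549886295, "123 Main St."));
        table.put(861483372,
            new InformationRecord("Doe, John", 28, 861483372, "456 1st Ave."));
        return table;
    }

    public static void main(String[] args) {
        Map<Integer, InformationRecord> table = buildTable();

        // Looking up by key retrieves the entire record.
        InformationRecord r = table.get(861483372);
        System.out.println(r.name + ", " + r.age + ", " + r.address);

        // A key with no entry yields null.
        System.out.println(table.get(12345));
    }
}
```
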

  4. The basic idea of hashing.
    1. Suppose we have a collection of personal information records of the form shown above in the InformationRecord class, where the size of the collection is a maximum of 10,000 records.
    2. Suppose further that we want rapid access to these records by id.
      1. A linked list would be a pretty poor choice for implementing the collection, since search by id would take O(N) time.
      2. If we kept the collection sorted by id, a balanced search tree would give us O(log N) access time.
      3. If we put the records in an array of 1,000,000,000 elements we could get O(1) access by id, but we'd waste a lot of space since we only have 10,000 active records.
    3. Is there some way that we could get O(1) access as in an array without wasting a lot of space?
    4. The answer is hashing, and it's based on the following idea:
      Allocate an array of the desired table size and provide a function that maps any key into the range 0 to TableSize-1.
      1. The function that performs the key mapping is called the hashing function.
      2. This idea works well when the hashing function evenly distributes the keys over the range of the table size.
    5. To ensure good performance of a hash table, we must consider the following issues:
      1. choosing a good hashing function that evenly maps keys to table indices;
      2. choosing an appropriate table size that's big enough but does not waste too much space;
      3. deciding what to do when the hashing function maps two different keys to the same table location, a condition called a collision.
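
The central idea can be sketched in a few lines. The class name KeyMapping, the constant TABLE_SIZE, and the helper collide are hypothetical. Math.floorMod is used here as one way to keep the index in range even for a negative key, a case these notes do not discuss at this point.

```java
/** Hypothetical sketch: map any int key to an index in 0..TABLE_SIZE-1. */
class KeyMapping {
    static final int TABLE_SIZE = 10_000;

    /** The hashing function: always returns a valid index into the table array. */
    static int hash(int key) {
        // Unlike the % operator, Math.floorMod never returns a negative
        // result when the modulus is positive.
        return Math.floorMod(key, TABLE_SIZE);
    }

    /** Two different keys collide when they map to the same table location. */
    static boolean collide(int key1, int key2) {
        return key1 != key2 && hash(key1) == hash(key2);
    }

    public static void main(String[] args) {
        Object[] table = new Object[TABLE_SIZE];      // the table itself
        table[hash(549886295)] = "Baker, Mary";       // store at the hashed index

        System.out.println(hash(549886295));          // 6295
        System.out.println(hash(-7));                 // 9993
        System.out.println(collide(6295, 16295));     // true
    }
}
```
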

  5. A simple example.
    1. Suppose again we need a table of 10,000 InformationRecords with the id field used as the lookup key.
      1. We'll choose a hash table size of 10,000 elements.
      2. For the hashing function, we'll use the simple modulus computation of id mod 10000; if keys are randomly distributed this will give a good distribution.
      3. To resolve collisions, we'll use the simple technique of searching down from the point of collision for the first free table entry.
    2. If we insert the entries listed above for Mary Baker and John Doe, the table will look like the illustration in Figure 1.


      Figure 1: Hash table with entries at 6295 and 3372.




      1. The hashing function computes the indices 6295 and 3372, respectively, for the two keys.
      2. The records are placed at these locations in the table array.
    3. Suppose we were next to add the record {"Smith, Jane", 39, 861493372, "789 Front St."}
      1. In this case, the hashing function will compute the same location for this record as for John Doe, 3372, since the id keys for John Doe and Jane Smith happen to differ by exactly 10,000.
      2. To resolve the collision, we'll put the Jane Smith entry at the next available location in the table, which is 3373.
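
The example can be traced with a small sketch of such a table. To stay short it stores only the id keys rather than whole records. The class name SimpleIdTable and its methods are hypothetical, and wrapping around to index 0 at the end of the table is an assumption, since the notes only say to search from the point of collision for the first free entry.

```java
/** Hypothetical sketch of the id table from the example; stores only the keys. */
class SimpleIdTable {
    private final int[] keys;           // the stored ids
    private final boolean[] occupied;   // which table entries are in use

    SimpleIdTable(int tableSize) {
        keys = new int[tableSize];
        occupied = new boolean[tableSize];
    }

    /** The hashing function from the example: id mod table size. */
    int hash(int id) {
        return id % keys.length;        // ids are assumed to be non-negative
    }

    /**
     * Insert id and return the index where it was placed. On a collision,
     * search forward (wrapping at the end) for the first free entry.
     * Duplicate ids are not checked for in this sketch.
     */
    int insert(int id) {
        int index = hash(id);
        for (int probes = 0; probes < keys.length; probes++) {
            if (!occupied[index]) {
                keys[index] = id;
                occupied[index] = true;
                return index;
            }
            index = (index + 1) % keys.length;
        }
        throw new IllegalStateException("table is full");
    }

    /** Return the index holding id, or -1 if id is not in the table. */
    int find(int id) {
        int index = hash(id);
        for (int probes = 0; probes < keys.length; probes++) {
            if (!occupied[index]) {
                return -1;              // reached a free entry: id is absent
            }
            if (keys[index] == id) {
                return index;
            }
            index = (index + 1) % keys.length;
        }
        return -1;
    }

    public static void main(String[] args) {
        SimpleIdTable table = new SimpleIdTable(10_000);
        System.out.println(table.insert(549886295));   // 6295 (Mary Baker)
        System.out.println(table.insert(861483372));   // 3372 (John Doe)
        System.out.println(table.insert(861493372));   // 3373 (Jane Smith, after a collision)
    }
}
```

Because no deletion is supported, find can stop at the first free entry it meets; a table that allowed removal would need a way to mark vacated entries.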

  6. Things that can go wrong.
    1. In the preceding example, things worked out well, given the nature of the keys and the bounded table size.
    2. Suppose, however, some or all of the following conditions were to arise:
      1. The number of records grew past 10,000.
      2. Due to some coincidence of locale, a large number of ids differed by exactly 10,000.
      3. We wanted to use the name field as the search key instead of id.
    3. In such cases, we need to reconsider one or all of our choices for hashing function, table size, and/or collision resolution strategy.
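
The cost of the second condition can be made concrete with a small experiment. This is a sketch under stated assumptions: the class ClusteringDemo, the method slotsExamined, and the starting id 3372 are made up for illustration, and collisions are resolved by searching forward for the first free entry, wrapping at the end of the table.

```java
/** Hypothetical experiment; the class and method names are illustrative only. */
class ClusteringDemo {

    /**
     * Insert n ids into an empty table of the given size, starting at id 3372
     * and stepping by 'spacing', using id mod tableSize and a forward search
     * for the first free entry. Returns the total number of entries examined.
     * Assumes n <= tableSize, so a free entry always exists.
     */
    static long slotsExamined(int n, int tableSize, long spacing) {
        boolean[] occupied = new boolean[tableSize];
        long examined = 0;
        for (int k = 0; k < n; k++) {
            long id = 3372L + k * spacing;
            int index = (int) (id % tableSize);
            examined++;                          // look at the home entry
            while (occupied[index]) {            // collision: keep searching
                index = (index + 1) % tableSize;
                examined++;
            }
            occupied[index] = true;
        }
        return examined;
    }

    public static void main(String[] args) {
        // Ids one apart each land in their own entry.
        System.out.println(slotsExamined(100, 10_000, 1));        // 100
        // Ids exactly 10,000 apart all hash to 3372 and pile up behind it.
        System.out.println(slotsExamined(100, 10_000, 10_000));   // 5050
    }
}
```

With ids spaced exactly one table size apart, the k-th insertion has to step past all k-1 earlier entries, so the total work grows quadratically instead of linearly.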

  7. Choosing a good hash function.
    1. The choice of hash function depends significantly on what kind of key we have.
    2. In the case of a numeric key with random distribution, the simple modulus hashing function works fine.
      1. However, if numeric keys have some non-random properties, such as divisibility by the table size, the modulus hashing function does not work well at all.
      2. If we use a non-numeric key, such as a name string, we must first convert the string into a number of some form before applying mod.
    3. In practical applications, lookup keys are frequently strings, hence some consideration of good string-valued hash functions is in order.

  8. Good hashing of string-valued keys.
    1. Approach 1: add up the character values of the string and then compute the modulus.
      1. The advantage of this approach is that it's simple and reasonably fast if the number of characters in a string is not too large.
      2. The disadvantage is that it may not distribute key values very well at all.
      3. For example, suppose keys are eight characters or fewer (e.g., UNIX login ids) and the table size is 10,000.
      4. Since ASCII string characters have a maximum value of 127, the summing formula only produces values between 0 and 127*8, which equals 1,016.
      5. This only distributes keys to a bit more than 10% of the table.
    2. Approach 2: use a formula that increases the size of the hash key using some multiplier.
      1. This approach is also simple and fast, but it may also not distribute keys well.
      2. E.g., one formula could be to sum the first three characters of a key string as follows:
        char[0] + (27 * char[1]) + (729 * char[2])
        
        and then compute the modulus.
      3. The rationale for the number 27 is that it's the number of letters in the alphabet, plus one for a space character; 729 is 27^2.
      4. If string name characters are equally likely to occur, this distributes keys in the range 0 to 26^3 = 17,576.
      5. However, empirical analysis of typical names shows that for the first three characters, there are only 2,851 combinations, which is not good coverage for a 10,000-element table.
    3. Approach 3: sum all key characters with a formula that increases the size and mixes up the letters nicely.
      1. An empirically derived formula to do this is the following:
        (37^(l-1) * char[0]) + (37^(l-2) * char[1]) + ... + (37^0 * char[l-1])
        
        where 37 is the empirically-derived constant and l = the string length of the key. As with the other approaches, the modulus is then computed.
      2. This formula, along with similar ones that vary the constant multiplier, has been shown to do a good job of mixing up the string characters and providing good coverage even for large table sizes.
    4. Some code for each of these approaches follows.
    /****
     *
     * Class Hashing contains three different string-valued hash functions, as
     * discussed in Lecture Notes Week 5.
     *
     * @author Gene Fisher (gfisher@calpoly.edu)
     * @version 1may01
     */
    
    public class Hashing {
    
        /**
         * Compute a hash index for the given string by summing the string
         * characters and taking the modulus of the given table size.
         */
        public static int hash1(String key, int tableSize) {
            int hashVal = 0;
    
            for (int i = 0; i < key.length(); i++) {
                hashVal += key.charAt(i);
            }
    
            return hashVal % tableSize;
        }
    
        /**
         * Compute a hash index for the given string by summing the first three
         * string characters with the formula:
         *
         *     char[0] + (27 * char[1]) + (729 * char[2])
         *
         * where 27 is the number of letters in the alphabet + 1 for a blank, and
         * 729 is 27<sup>2</sup>.
         *
         * Return the sum mod the given table size.  The key must have at least
         * three characters; a shorter key causes charAt to throw an
         * IndexOutOfBoundsException.
         */
        public static int hash2(String key, int tableSize) {
            return (key.charAt(0) + (27 * key.charAt(1)) + (729 * key.charAt(2)))
                % tableSize;
        }
    
        /**
         * Compute a hash index for the given string by summing all of the string
         * characters with the formula:
         *
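         *     (37^(l-1) * char[0]) + (37^(l-2) * char[1]) + ... + (37^0 * char[l-1])
         *
         * where 37 is an empirically chosen value to provide good distribution and
         * l = key.length().  The loop evaluates this sum with Horner's rule; int
         * overflow may make the sum negative, which is corrected after the modulus.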
         *
         * Return the sum mod the given table size.
         */
        public static int hash3(String key, int tableSize) {
            int hashVal = 0;
    
            for (int i = 0; i < key.length(); i++) {
                hashVal = 37 * hashVal + key.charAt(i);
            }
    
            hashVal %= tableSize;
            if (hashVal < 0) {
                hashVal += tableSize;
            }
    
            return hashVal;
        }
    
    
    }
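
One way to compare the approaches is to count how many distinct table indices each one reaches for the same set of keys. The following is a rough experiment, not a rigorous analysis. The class HashSpread, its method names, and the made-up keys "user0" through "user4999" are all assumptions. The two hash methods repeat the logic of hash1 and hash3 so that the sketch stands alone, with Math.floorMod replacing the explicit negative check.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical experiment comparing how many table indices two hash functions reach. */
class HashSpread {

    /** Approach 1: sum of the character values, mod the table size. */
    static int sumHash(String key, int tableSize) {
        int hashVal = 0;
        for (int i = 0; i < key.length(); i++) {
            hashVal += key.charAt(i);
        }
        return hashVal % tableSize;
    }

    /** Approach 3: multiply-by-37 accumulation, mod the table size. */
    static int mult37Hash(String key, int tableSize) {
        int hashVal = 0;
        for (int i = 0; i < key.length(); i++) {
            hashVal = 37 * hashVal + key.charAt(i);
        }
        // floorMod keeps the index non-negative even if hashVal overflowed.
        return Math.floorMod(hashVal, tableSize);
    }

    /** Count the distinct indices produced for the keys "user0" .. "user4999". */
    static int distinctIndices(boolean useMult37, int tableSize) {
        Set<Integer> seen = new HashSet<>();
        for (int n = 0; n < 5000; n++) {
            String key = "user" + n;
            seen.add(useMult37 ? mult37Hash(key, tableSize)
                               : sumHash(key, tableSize));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("sum of characters: " + distinctIndices(false, 10_000));
        System.out.println("multiply by 37:    " + distinctIndices(true, 10_000));
    }
}
```

For keys of eight ASCII characters or fewer, the character sum cannot exceed 1,016, so the first count is bounded by 1,017 no matter how many keys are hashed; the second function is not limited in that way.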
    



