Word Histogram


General Instructions

This is an individual assignment.

Collaboration at a reasonable level will not result in substantially similar code. Students may only collaborate with fellow students currently enrolled in this course, the TAs and the lecturer. Collaboration means talking through problems, assisting with debugging, explaining a concept, etc. You should not exchange code or write code for others.


Problem Description

You’re a curious linguist with computer hacking skills and you want to see, roughly, if Zipf’s Law holds for texts contained in files lying around on your disk.

Solution Description

Write a Python module in word_hist.py that includes the following functions. You should copy these function declarations and docstrings verbatim to ensure that we can successfully autograde your submission.

from typing import *

def normalize_text(text: str) -> str:
    """Return copy of text in lowercase with punctuation removed.

    Usage Examples:
    >>> normalize_text("Numchuk skills, bow hunting skills, computer hacking skills...")
    'numchuk skills bow hunting skills computer hacking skills'
    """

def mk_word2count(text: str) -> Dict[str, int]:
    """Return a dictionary mapping words in text to their count in text. Be sure to account for newlines in the text!

    Usage Examples: (Note technique for testing dict equality.)

    >>> mk_word2count('the butcher the baker the candlestick maker') == {'butcher': 1, 'baker': 1, 'candlestick': 1, 'the': 3, 'maker': 1}
    True
    """

def dict2tuples(word_dict: Dict[str, int],
                key: Callable[[Tuple], Any]=None) -> List[Tuple[str, int]]:
    """Convert a str:int dictionary to a sorted (in descending order) list of (str, int) tuples, optionally with a key function for sorting the tuples.

    Usage Examples:
    >>> dict2tuples({'a': 2, 'b': 5, 'c': 1}, key=lambda t: t[1])
    [('b', 5), ('a', 2), ('c', 1)]
    """

def normalize_counts(tuples: Sequence[Tuple[str, int]],
                     max_value: int=100) -> Sequence[Tuple[str, int]]:
    """Return: sequence of tuples with same first elements as input tuples
    but whose second elements are normalized to the range 0 to max_value.

    Usage Examples:
    >>> wctups = [('a', 200), ('the', 180), ('an', 160), ('shenannigans', 50)]
    >>> normalize_counts(wctups, 100)
    [('a', 100), ('the', 90), ('an', 80), ('shenannigans', 25)]
    """

def word_hist(bar_list: Sequence[Tuple[str, int]]) -> List[str]:
    """Create a text-based bar chart from bar_list. Return: List[str] with one
    line per tuple in bar_list. Each string in the returned list has a
    right-aligned label from the first element of the corresponding tuple
    in bar_list, a | character, then a number of Xs equal to the
    second element from the tuple.

    Usage Examples:
    >>> from pprint import pprint
    >>> pprint(word_hist([('a', 10),('the', 9),('an', 8),('shenannigans', 2)]))
    ['           a | XXXXXXXXXX',
     '         the | XXXXXXXXX',
     '          an | XXXXXXXX',
     'shenannigans | XX']
    """
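
The docstring examples above can be reproduced with fairly little code. The following is a minimal sketch of one possible approach, not a reference solution. It assumes that "punctuation" means the ASCII characters in string.punctuation, that words are separated by whitespace, and that normalize_counts scales the largest count to max_value using integer division; the autograder may expect different choices.

```python
import string
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple


def normalize_text(text: str) -> str:
    # Assumption: punctuation is the ASCII set in string.punctuation.
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table)


def mk_word2count(text: str) -> Dict[str, int]:
    counts: Dict[str, int] = {}
    # split() with no argument splits on any whitespace, including newlines.
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def dict2tuples(word_dict: Dict[str, int],
                key: Optional[Callable[[Tuple[str, int]], Any]] = None
                ) -> List[Tuple[str, int]]:
    # reverse=True gives descending order under whichever key is supplied.
    return sorted(word_dict.items(), key=key, reverse=True)


def normalize_counts(tuples: Sequence[Tuple[str, int]],
                     max_value: int = 100) -> List[Tuple[str, int]]:
    largest = max((count for _, count in tuples), default=0)
    if largest == 0:
        return list(tuples)
    # Assumption: integer division, so small counts can scale down to 0.
    return [(word, count * max_value // largest) for word, count in tuples]


def word_hist(bar_list: Sequence[Tuple[str, int]]) -> List[str]:
    # Right-align every label to the width of the longest label.
    width = max((len(label) for label, _ in bar_list), default=0)
    return [f"{label:>{width}} | {'X' * count}" for label, count in bar_list]
```

Because the scaling uses integer division, a word whose count is small relative to the largest count ends up with a bar of zero Xs.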


Structure your main method as we have been taught:

def main(args):
    # code intended to be executed when run as a script
    pass

if __name__ == "__main__":
    import sys
    main(sys.argv)

The user may supply one to three command-line arguments to your Python script. The first element of sys.argv, sys.argv[0], is the name of your script, i.e., args[0] == "word_hist.py", so sys.argv will contain 2 to 4 elements. Pass sys.argv, a list of strings, as-is to the main() function to minimize confusion.
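
As a rough illustration of the argument handling, here is a small sketch that validates the argument count before doing any work. The helper name split_args is hypothetical, and because this section does not spell out what the optional second and third arguments mean, the sketch simply returns them as raw strings.

```python
from typing import List, Tuple


def split_args(args: List[str]) -> Tuple[str, List[str]]:
    """Return (file_name, optional_args) from a sys.argv-style list.

    Hypothetical helper: args[0] is the script name, args[1] is the
    file name, and args[2:] holds up to two optional arguments.
    """
    if not 2 <= len(args) <= 4:
        raise ValueError("expected a file name plus up to two optional arguments")
    return args[1], args[2:]
```

Inside main(args), a call such as split_args(args) would then yield the file name and whatever optional arguments the user supplied.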

Here’s a snippet of code that checks for the existence of a file:

import os.path
os.path.exists("file_name.txt") # returns True if file_name.txt exists

Once you have a valid file name, read the file contents into a string. Here’s a snippet of code that opens a file for reading as text and reads the file contents into a str variable:

infile = open(file_name, 'r') # opens file_name as readable file object infile
text = infile.read()          # dumps text data from infile into text variable
infile.close()                # closes infile
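
The existence check and the read can be combined into a single helper. This is only a sketch: the name read_text is hypothetical, and returning None for a missing file is one possible convention among several. The with statement closes the file automatically, even if read() raises an exception.

```python
import os.path
from typing import Optional


def read_text(file_name: str) -> Optional[str]:
    """Return the contents of file_name, or None if it does not exist."""
    if not os.path.exists(file_name):
        return None
    # The with block closes infile when it exits, so no explicit close().
    with open(file_name, "r") as infile:
        return infile.read()
```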

Once you read the contents of the file into a str, use the functions you created above to:

Here’s a sample program run, using the file i-have-a-dream.txt:

$ python word_hist.py i-have-a-dream.txt 80 80
            is | XXXXXXXXXXXXXXXXX
            in | XXXXXXXXXXXXXXXXX
          this | XXXXXXXXXXXXXXX
       freedom | XXXXXXXXXXXXXXX
            as | XXXXXXXXXXXXXXX
          from | XXXXXXXXXXXXX
          have | XXXXXXXXXXXXX
           our | XXXXXXXXXXXXX
          with | XXXXXXXXXXXX
             i | XXXXXXXXXXX
         negro | XXXXXXXXXX
           not | XXXXXXXXXX
           one | XXXXXXXXXX
           let | XXXXXXXXXX
           day | XXXXXXXXX
          ring | XXXXXXXXX
         dream | XXXXXXXX
          come | XXXXXXX
        nation | XXXXXXX
         every | XXXXXXX
           for | XXXXXX
            go | XXXXXX
          back | XXXXXX
         today | XXXXXX
           are | XXXXXX
          must | XXXXXX
     satisfied | XXXXXX
            by | XXXXXX
           you | XXXXXX
         their | XXXXXX
       justice | XXXXXX
          able | XXXXXX
          when | XXXXX
           all | XXXXX
            it | XXXX
        cannot | XXXX
           men | XXXX
         white | XXXX
          long | XXXX
           now | XXXX
           but | XXXX
         there | XXXX
      together | XXXX
          time | XXX
         which | XXX
            on | XXX
         faith | XXX
      children | XXX
       america | XXX
            my | XXX
          free | XXX
         check | XXX
           has | XXX
         shall | XXX
         great | XXX
           new | XXX
         years | XXX
           who | XXX
            an | XXX
          into | XXX
            so | XXX
         black | XXX
          hope | XXX
       hundred | XXX
   mississippi | XXX
            up | XXX
            us | XXX
          down | XXX
         until | XXX
      mountain | XXX
         later | XXX

Note that the full output would be very long, and the least frequent words would have no Xs.

