In this homework you will practice
This is an individual assignment.
Collaboration at a reasonable level will not result in substantially similar code. Students may only collaborate with fellow students currently enrolled in this course, the TAs and the lecturer. Collaboration means talking through problems, assisting with debugging, explaining a concept, etc. You should not exchange code or write code for others.
Notes:
You’re a curious linguist with computer hacking skills and you want to see, roughly, if Zipf’s Law holds for texts contained in files lying around on your disk.
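As a rough guide to what "holds" means here: Zipf's Law says that a word's frequency is approximately inversely proportional to its rank, so the second most common word should appear about half as often as the most common one, the third about a third as often, and so on. The sketch below computes those idealized counts so you have something to compare your histogram against. The function name `zipf_expected_counts` is made up for this illustration and is not part of the assignment.

```python
def zipf_expected_counts(top_count, num_ranks):
    """Return the counts an idealized Zipf distribution predicts.

    Hypothetical helper for illustration only: rank 1 gets top_count,
    rank r gets top_count / r.
    """
    return [top_count / rank for rank in range(1, num_ranks + 1)]


# If the most common word appears 80 times, an ideal Zipf curve predicts:
print([round(c, 1) for c in zipf_expected_counts(80, 4)])  # [80.0, 40.0, 26.7, 20.0]
```

Real texts only follow this curve roughly, which is why the assignment asks whether the law holds "roughly".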
Write a Python module in word_hist.py that includes the following functions. You should copy these function declarations and docstrings verbatim to ensure that we can successfully autograde your submission.
def normalize_text(text):
    """Return copy of text in lowercase with punctuation removed.
    Parameters:
    text: str - text to normalize
    Return: str which is copy of text converted to lowercase with
    punctuation (the chars in string.punctuation) removed.
    Usage Examples:
    >>> normalize_text("Numchuk skills, bow hunting skills, computer hacking skills...")
    'numchuk skills bow hunting skills computer hacking skills'
    """
def mk_word2count(text):
    """Return a dictionary mapping words in text to their count in text. Be sure to account for newlines in the text!
    Parameters:
    text: str - string containing words separated by spaces
    Return: char_dict: dict - dictionary whose keys are words and
    associated values are the number of times the word appears in text
    Usage Examples: (Note technique for testing dict equality.)
    >>> mk_word2count('the butcher the baker the candlestick maker') == {'butcher': 1, 'baker': 1, 'candlestick': 1, 'the': 3, 'maker': 1}
    True
    """
def dict2tuples(word_dict, key=None):
    """Convert a str:int dictionary to a sorted list of (str, int) tuples, optionally with a key
    Parameters:
    word_dict: dict[str -> int]
    key: (optional) a key function to extract the element of the tuples by which to sort
    Return: a list[(str, int)], sorted in descending order, optionally by a key
    Usage Examples:
    >>> dict2tuples({'a': 2, 'b': 5, 'c': 1}, key=lambda t: t[1])
    [('b', 5), ('a', 2), ('c', 1)]
    """
def normalize_counts(tuples, max_value=100):
    """Normalize the second values in tuples.
    Parameters:
    tuples: Sequence[(str, int)] - (word, count) tuples
    max_value: int - the max value of the normalized counts (min value is 0)
    Return: Sequence[(str, int)] with same first elements as tuples
    but whose second elements are normalized to the range 0 to
    max_value.
    Usage Examples:
    >>> wctups = [('a', 200), ('the', 180), ('an', 160), ('shenannigans', 50)]
    >>> normalize_counts(wctups, 100)
    [('a', 100), ('the', 90), ('an', 80), ('shenannigans', 25)]
    """
def word_hist(bar_list):
    """Create a text-based bar chart from bar_list.
    Parameters:
    bar_list: Sequence[(str, int)] - (label, length) tuples
    Return: list[str] with one line per tuple in bar_list. Each line --
    a str in the returned list -- has the right-aligned label, a |
    character, then length Xs
    Usage Examples:
    >>> from pprint import pprint
    >>> pprint(word_hist([('a', 10),('the', 9),('an', 8),('shenannigans', 2)]))
    ['           a | XXXXXXXXXX',
     '         the | XXXXXXXXX',
     '          an | XXXXXXXX',
     'shenannigans | XX']
    """
main
Structure your main method as we have been taught:
def main(args):
    # code intended to be executed when run as a script
    pass

if __name__ == "__main__":
    import sys
    main(sys.argv)
The user may supply one to three command line arguments to your Python script. The first argument to the Python interpreter, sys.argv[0], is the name of your script, i.e., args[0] = "word_hist.py", so there will be 1 to 4 arguments in sys.argv. sys.argv, a list of strings, should be passed as-is to the main() function to minimize confusion.
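To make the layout of sys.argv concrete, here is a minimal sketch that separates the script name from the arguments the user typed. The helper name `split_args` is hypothetical and only for illustration; the list passed to it stands in for what sys.argv would contain. Note that every element is a str, so any numeric argument must be converted with int() before you do arithmetic with it.

```python
def split_args(args):
    """Split a sys.argv-style list into (script name, user-supplied arguments).

    Hypothetical helper for illustration; not part of the required functions.
    """
    script_name = args[0]   # always the name of the script
    user_args = args[1:]    # whatever the user typed after the script name
    return script_name, user_args


# Stand-in for sys.argv after running: python word_hist.py i-have-a-dream.txt 80 80
script_name, user_args = split_args(["word_hist.py", "i-have-a-dream.txt", "80", "80"])
print(script_name)            # word_hist.py
print(user_args)              # ['i-have-a-dream.txt', '80', '80']
print(int(user_args[1]) + 1)  # 81 (the argument is a str until converted)
```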
Here’s a snippet of code that checks for the existence of a file:
import os.path
os.path.exists("file_name.txt") # returns True if file_name.txt exists
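One way to use this check is to keep asking until the name refers to something that exists. The sketch below assumes that re-prompting is the desired behavior, which the assignment text does not spell out. The function name `resolve_file_name`, its `ask` parameter, and the prompt text are all made up for this illustration.

```python
import os.path


def resolve_file_name(candidate, ask=input):
    """Return a file name that exists, asking again while it does not.

    Hypothetical helper: candidate may be None (no name supplied yet);
    ask is any function that takes a prompt and returns a str.
    """
    while candidate is None or not os.path.exists(candidate):
        candidate = ask("Enter the name of an existing text file: ")
    return candidate
```

Passing the prompt function as a parameter keeps the helper easy to test without typing at the keyboard.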
Once you have a valid file name, read the file contents into a string. Here’s a snippet of code that opens a file for reading as text and reads the file contents into a str variable:
infile = open(file_name, 'r') # opens file_name as readable file object infile
text = infile.read() # dumps text data from infile into text variable
infile.close() # closes infile
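An equivalent way to write the same steps is with a with statement, which closes the file automatically even if reading raises an exception. This is offered as an alternative sketch, not a requirement; the function name `read_text` is hypothetical.

```python
def read_text(file_name):
    """Return the entire contents of file_name as a str (hypothetical helper)."""
    # The with statement closes infile when the block ends, so no explicit close() is needed.
    with open(file_name, 'r') as infile:
        return infile.read()
```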
Once you have read the contents of the file into a str, use the functions you created above to:
create the histogram with the word_hist function, and
print num_lines of the histogram, where num_lines is the number of lines of the histogram to display.
Here’s a sample program run, using the file i-have-a-dream.txt:
$ python word_hist.py i-have-a-dream.txt 80 80
the | XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
of | XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
to | XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
and | XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
a | XXXXXXXXXXXXXXXXXXXXXXXXXXXX
be | XXXXXXXXXXXXXXXXXXXXXXXXX
we | XXXXXXXXXXXXXXXXXXXXXXX
will | XXXXXXXXXXXXXXXXXXXX
that | XXXXXXXXXXXXXXXXXX
is | XXXXXXXXXXXXXXXXX
in | XXXXXXXXXXXXXXXXX
this | XXXXXXXXXXXXXXX
freedom | XXXXXXXXXXXXXXX
as | XXXXXXXXXXXXXXX
from | XXXXXXXXXXXXX
have | XXXXXXXXXXXXX
our | XXXXXXXXXXXXX
with | XXXXXXXXXXXX
i | XXXXXXXXXXX
negro | XXXXXXXXXX
not | XXXXXXXXXX
one | XXXXXXXXXX
let | XXXXXXXXXX
day | XXXXXXXXX
ring | XXXXXXXXX
dream | XXXXXXXX
come | XXXXXXX
nation | XXXXXXX
every | XXXXXXX
for | XXXXXX
go | XXXXXX
back | XXXXXX
today | XXXXXX
are | XXXXXX
must | XXXXXX
satisfied | XXXXXX
by | XXXXXX
you | XXXXXX
their | XXXXXX
justice | XXXXXX
able | XXXXXX
when | XXXXX
all | XXXXX
it | XXXX
cannot | XXXX
men | XXXX
white | XXXX
long | XXXX
now | XXXX
but | XXXX
there | XXXX
together | XXXX
time | XXX
which | XXX
on | XXX
faith | XXX
children | XXX
america | XXX
my | XXX
free | XXX
check | XXX
has | XXX
shall | XXX
great | XXX
new | XXX
years | XXX
who | XXX
an | XXX
into | XXX
so | XXX
black | XXX
hope | XXX
hundred | XXX
mississippi | XXX
up | XXX
us | XXX
down | XXX
until | XXX
mountain | XXX
later | XXX
Note that the full output would be very long and words with low rank would have no Xs.
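The empty bars follow from the normalization arithmetic. The sketch below assumes counts are scaled proportionally with integer (floor) division, which matches the normalize_counts docstring example but is only one reasonable reading of it; whether to truncate or round is not stated. The helper name `scale_count` is hypothetical.

```python
def scale_count(count, max_count, max_value):
    """Scale count so that max_count maps to max_value (hypothetical helper)."""
    # Multiply before floor-dividing so the result stays an int.
    return count * max_value // max_count


# With a top count of 200 and bars at most 80 Xs long:
for count in [200, 180, 50, 2]:
    print(count, scale_count(count, 200, 80))
# 200 80
# 180 72
# 50 20
# 2 0
```

A word that appears only twice scales to a bar of length 0, so its line would show a label and no Xs.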
Submit your word_hist.py file on Canvas as an attachment. When you’re ready, double-check that you have submitted and not just saved a draft.
Practice safe submission! Verify that your HW files were truly submitted correctly, the upload was successful, and that your program runs with no syntax or runtime errors. It is solely your responsibility to turn in your homework and practice this safe submission safeguard.
This procedure helps guard against a few things.