Extra credit homework
In this homework, you will work with character-level language models. These models take as input a sequence of characters and predict the next character. You will first implement functionalities for an abstract language model, then build a new Temporal Convolutional Network (TCN).
This assignment should be solved individually. No collaboration, sharing of solutions, or exchange of models is allowed. Please, do not directly copy existing code from anywhere other than your previous solutions, or the previous master solution. We will check assignments for duplicates. See below for more details.
Starter code and dataset
You will train your model on Barack Obama speeches (we tried other presidents, but Obama has the most publicly available transcribed speeches). For this assignment, we use a simplified character set of 28 characters: the 26 lowercase letters (a to z), space, and period.
A character language model (`LanguageModel` in `models.py`) generates text by predicting a 28-value (log-)probability distribution over the next character given a string `s` (in the function `LanguageModel.predict_next`). For most models, predicting the next log-probabilities for all characters (`LanguageModel.predict_all`) is as efficient as predicting the log-probability of the last character only. This is why, in this assignment, you will only implement the `predict_all` function and compute `predict_next` from `predict_all`.
`predict_all` takes a string `s` of length `n` as input. It predicts the log-probability of the next character for each substring `s[:i]` for $i \in \{0, \ldots, n\}$, including the empty string `''` and the full string `s`. The function returns `n+1` values.
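To make the relationship concrete, here is a minimal sketch of how `predict_next` can be derived from `predict_all`. The exact signatures in the starter code may differ, and the `(28, n+1)` output shape is an assumption based on the description above:

```python
class LanguageModel:
    def predict_all(self, some_text):
        """Return a (28, len(some_text)+1) tensor of log-probabilities.

        Column i is the distribution over the character following
        some_text[:i]; implemented by subclasses.
        """
        raise NotImplementedError

    def predict_next(self, some_text):
        # The last column of predict_all is the distribution over the
        # character that follows the full input string.
        return self.predict_all(some_text)[:, -1]
```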
To get you started, we implemented a simple `Bigram` model; see the starter code for more information. The starter code further contains an `AdjacentLanguageModel` that favors characters that are adjacent in the alphabet. Use both models to debug your code.
Finally, `utils.py` provides some useful functionality. It loads the dataset in `SpeechDataset`, and it implements a `one_hot` encoding function that converts a string `s` into a one-hot encoding of size `(28, len(s))`. You can create a dataset of one-hot encodings by calling `SpeechDataset('data/train.txt', transform=one_hot)`. This might be useful later during training.
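For example (a usage sketch; it assumes the dataset yields individual strings, or one-hot tensors when the transform is given):

```python
from .utils import SpeechDataset, one_hot  # inside the homework package

# Raw strings, e.g. for evaluating log-likelihoods.
raw_data = SpeechDataset('data/train.txt')

# One-hot encoded tensors of shape (28, len(s)), e.g. for training.
encoded_data = SpeechDataset('data/train.txt', transform=one_hot)
```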
You can implement different parts of this homework independently. Feel free to skip parts that seem too hard. However, it might be easiest to follow the order of the assignment.
Log likelihoods of text (10 pts)
We start by implementing a `log_likelihood` function in `language.py`. This function takes a string as input and returns the log-probability of that string under the current language model. Test your implementation using the `Bigram` or the `AdjacentLanguageModel`:
python3 -m homework.language -m Bigram
Hint: Remember that the language model can take an empty string as input.
Hint: Recall that `LanguageModel.predict_all` returns the log-probabilities of the next characters for all substrings.
Hint: The log-likelihood is the sum of the per-character log-likelihoods, not their average.
You can grade your log-likelihood using:
python3 -m grader homework -v
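Here is one possible sketch, assuming `one_hot` returns a `(28, len(s))` tensor and `predict_all` returns `(28, len(s)+1)` log-probabilities as described above:

```python
from .utils import one_hot  # inside the homework package

def log_likelihood(model, some_text: str):
    """Sum the predicted log-probability of each observed character."""
    log_probs = model.predict_all(some_text)   # (28, n+1)
    encoding = one_hot(some_text)              # (28, n)
    # Column i of log_probs predicts character some_text[i]; the final
    # column predicts the character *after* the string, so drop it.
    return (log_probs[:, :-1] * encoding).sum()
```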
Relevant Operations
- LanguageModel.predict_all
- utils.one_hot
- and all previous
Generating text (10 pts)
Next, implement the `sample_random` function. This function takes a language model and samples from it using random sampling: repeatedly draw the next character at random according to the model's predicted distribution. The sample terminates if `max_len` characters are produced, or a period `.` is generated.
Hint: `torch.distributions` contains many useful sampling functions.
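A minimal sketch of this loop (the vocabulary string and its ordering are assumptions; check `utils.py` for the actual character order):

```python
import torch

def sample_random(model, max_length: int = 100):
    vocab = 'abcdefghijklmnopqrstuvwxyz .'   # assumed character order
    result = ''
    for _ in range(max_length):
        log_probs = model.predict_next(result)   # (28,) log-probabilities
        # Categorical accepts logits; log-probabilities are valid logits.
        index = torch.distributions.Categorical(logits=log_probs).sample()
        result += vocab[int(index)]
        if result[-1] == '.':
            break
    return result
```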
Again, test your implementation using the `Bigram` and grade:
python3 -m grader homework -v
Here is what the master solution (TCN) produces:
some of a backnown my but or the understand thats why weve hardships not around work since there one
they will be begin with consider daughters some as more a new but jig go atkeeral westedly.
yet the world.
and when a letter prides.
in the step of support information and rall higher capacity training fighting and defered melined an
Relevant Operations
- torch.distributions
- and all previous
Beam search (20 pts)
Implement the function `beam_search` to generate the top sentences from your language model. Generate character by character, and use beam search to efficiently store the top candidate substrings at each step. At every step of beam search, expand all possible next characters. Terminate a sentence if a period `.` is generated or `max_length` is reached. Beam search returns the top `n_results` sentences, ranked either by their overall log-likelihood or, with `average_log_likelihood=True`, by their average per-character log-likelihood. The per-character log-likelihood encourages longer sentences, while the overall log-likelihood often terminates after a few words.
Hint: You might find `TopNHeap` useful to keep the top `beam_size` beams or `n_results` results around.
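A simplified, unoptimized sketch of the idea follows. It recomputes `log_likelihood` (from the previous part) from scratch for every candidate and sorts full lists; a real implementation would cache per-beam scores and use `TopNHeap` instead. The vocabulary string is again an assumption:

```python
def beam_search(model, beam_size, n_results=10, max_length=100,
                average_log_likelihood=False):
    vocab = 'abcdefghijklmnopqrstuvwxyz .'   # assumed character order
    beams = ['']      # active (unterminated) candidates
    finished = []     # (score, sentence) pairs ending in '.' or at max_length
    for _ in range(max_length):
        candidates = []
        for beam in beams:
            for c in vocab:
                s = beam + c
                score = float(log_likelihood(model, s))
                if average_log_likelihood:
                    score /= len(s)
                candidates.append((score, s))
        # Keep the best beam_size expansions; set terminated ones aside.
        candidates.sort(reverse=True)
        beams = []
        for score, s in candidates[:beam_size]:
            if s[-1] == '.' or len(s) >= max_length:
                finished.append((score, s))
            else:
                beams.append(s)
        if not beams:
            break
    finished.sort(reverse=True)
    return [s for _, s in finished[:n_results]]
```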
Here is a snippet from the master solution (TCN) with `average_log_likelihood=False`:
thats.
today.
in.
now.
And here with `average_log_likelihood=True`:
and we will continue to make sure that we will continue to the united states of american.
and we will continue to make sure that we will continue to the united states of the united states.
and we will continue to make sure that we will continue to the united states of america.
and thats why were going to make sure that will continue to the united states of america.
Grade your beam search:
python3 -m grader homework -v
Relevant Operations
- and all previous
TCN Model (20 pts)
Your TCN model will use a `CausalConv1dBlock`. This block combines a causal 1D convolution with a non-linearity (e.g. `ReLU`). The main `TCN` then stacks multiple dilated `CausalConv1dBlock`s to build a complete model. Use a 1x1 convolution to produce the output. `TCN.predict_all` should use `TCN.forward` to compute the log-probabilities for a single sentence.
Hint: Make sure `TCN.forward` uses batches of data.
Hint: Make sure `TCN.predict_all` returns log-probabilities, not logits.
Hint: Store the distribution of the first character as a parameter of the model (`torch.nn.Parameter`).
Hint: Try to keep your model manageable and small. The master solution trains in 15 minutes on a GPU.
Hint: Try a residual block.
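To illustrate the pieces, here is one possible sketch, not the required architecture: the channel sizes, kernel size, and dilation schedule are assumptions, and `one_hot` is assumed to come from `utils.py`:

```python
import torch
from .utils import one_hot  # inside the homework package

class CausalConv1dBlock(torch.nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, dilation):
        super().__init__()
        # Pad only on the left so output t never sees inputs > t.
        self.pad = torch.nn.ConstantPad1d(((kernel_size - 1) * dilation, 0), 0)
        self.conv = torch.nn.Conv1d(in_channels, out_channels, kernel_size,
                                    dilation=dilation)

    def forward(self, x):
        return torch.nn.functional.relu(self.conv(self.pad(x)))

class TCN(torch.nn.Module):
    def __init__(self, layers=(32, 64, 128), kernel_size=3):
        super().__init__()
        # Log-distribution of the very first character (no context exists yet).
        self.first_char = torch.nn.Parameter(torch.zeros(28))
        blocks, c, dilation = [], 28, 1
        for out_c in layers:
            blocks.append(CausalConv1dBlock(c, out_c, kernel_size, dilation))
            c, dilation = out_c, 2 * dilation
        self.network = torch.nn.Sequential(*blocks)
        self.classifier = torch.nn.Conv1d(c, 28, 1)   # 1x1 output convolution

    def forward(self, x):
        # x: (batch, 28, seq_len) one-hot -> (batch, 28, seq_len) logits
        return self.classifier(self.network(x))

    def predict_all(self, some_text):
        first = torch.nn.functional.log_softmax(self.first_char, dim=0)[:, None]
        if not some_text:                     # empty string: only the prior
            return first
        # Batch of one; convert logits to per-position log-probabilities.
        logits = self.forward(one_hot(some_text)[None])[0]          # (28, n)
        return torch.cat(
            (first, torch.nn.functional.log_softmax(logits, dim=0)), dim=1)
```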
Grade your TCN model:
python3 -m grader homework -v
Relevant Operations
- torch.nn.Parameter
- torch.nn.functional.log_softmax
- torch.nn.ConstantPad1d
- and all previous
TCN Training (40 pts)
Train your `TCN` in `train.py`. You may reuse much of the code from prior homework. Save your model using `save_model`, and test it:
python3 -m grader homework -v
Hint: SGD might work better to train the model, but you might need a high learning rate (e.g. 0.1).
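A bare-bones training loop might look as follows. This is a sketch under assumptions: that `SpeechDataset` with `transform=one_hot` yields fixed-length one-hot tensors that can be batched directly, and that `save_model` lives alongside the model code; adapt it to the actual starter code.

```python
import torch
from .models import TCN, save_model        # assumed module layout
from .utils import SpeechDataset, one_hot

def train(epochs=20, lr=0.1, batch_size=128):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = TCN().to(device)
    data = SpeechDataset('data/train.txt', transform=one_hot)
    loader = torch.utils.data.DataLoader(data, batch_size=batch_size,
                                         shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(epochs):
        for batch in loader:
            batch = batch.to(device)              # (B, 28, n) one-hot
            logits = model(batch[:, :, :-1])      # predicts characters 1..n-1
            targets = batch.argmax(dim=1)[:, 1:]  # indices of characters 1..n-1
            loss = loss_fn(logits, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # Note: the learned first-character distribution is not trained by this
    # loss; a full solution would fit it as well (e.g. from character counts).
    save_model(model)
```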
Grading
You can test your code using
python3 -m grader homework -v
In this homework, it is quite easy to cheat the validation grader. We have a hidden test grader that is much harder to cheat and will likely catch any attempt at fooling it. The point distribution between validation and test will be the same, but we will use additional test cases.
Second, in this homework it is a little harder to overfit, especially if you keep your model small enough. Still, keep in mind that we evaluate your model on the test set, so the performance on the test grader may vary. Try not to overfit to the validation set too much.
We set the testing log-likelihood thresholds such that a `Bigram` with a log-likelihood of -2.3 gets 0 points and a `TCN` with a log-likelihood of -1.3 gets the full score. Grading is linear in between (for example, a log-likelihood of -1.8 would earn half the points).
Submission
Once you finished the assignment, create a submission bundle using
python3 bundle.py homework [YOUR UT ID]
and submit the zip file on Canvas. Note that the maximum file size our grader accepts is 20MB, so please keep your model compact. You can double-check that your zip file was properly created by grading it again:
python3 -m grader [YOUR UT ID].zip
Online grader
We will use an automated grader through Canvas to grade all your submissions. There is a soft limit of 5 submissions per assignment. Please contact the course staff before going over this limit, otherwise your submission might be counted as invalid.
The online grading system will use a slightly modified version of python and the grader:
- Please do not use the `exit` or `sys.exit` commands; they will likely crash the grader.
- Please do not try to access, read, or write files outside the ones specified in the assignment. This, again, will lead to a crash. File writing is disabled.
- Network access is disabled. Please do not try to communicate with the outside world.
- Forking is not allowed!
- `print` or `sys.stdout.write` statements from your code are ignored and not returned.
Please do not try to break or hack the grader. Doing so will have negative consequences for your standing in this class and the program.
Running your assignment on Google Colab
You might need a GPU to train your models. You can get a free one on Google Colab. We provide you with an IPython notebook that can get you started on Colab for each homework.
If you've never used Colab before, go through the colab notebook (tutorial). When you're comfortable with the workflow, feel free to use the colab notebook (shortened). Follow the instructions below to use it.
- Go to http://colab.research.google.com/.
- Sign in to your Google account.
- Select the upload tab, then select the `.ipynb` file.
- Follow the instructions on the homework notebook to upload code and data.
Honor code
This assignment should be solved individually.
What interaction with classmates is allowed?
- Talking about high-level concepts and class material
- Talking about the general structure of the solution (e.g. You should use convolutions and ReLU layers)
- Looking at online solutions, and pytorch samples without directly copying or transcribing those solutions (rule of thumb, do not have your coding window and the other solution open at the same time). Always cite your sources in the code (put the full URL)!
- Using any of your submissions to prior homework
- Using the master solution to prior homework
- Using ipython notebooks from class
What interaction is not allowed?
- Exchange of code
- Exchange of architecture details
- Exchange of hyperparameters
- Directly copied (or only slightly modified) code from online sources
- Any collaboration
- Putting your solution on a public repo (e.g. github). You will fail the assignment if someone copies your code.
Ways students failed in past years (do not do this):
- Student A has a GPU, student B does not. Student B sends his solution to Student A to train 3 days before the assignment is due. Student A promises not to copy it but fails to complete the homework in time. In a last-minute attempt, Student A submits a slightly modified version of Student B's solution. Result: Both students fail the assignment.
- Student A struggles in class. Student B helps Student A and shows him/her the solution. Student A promises not to copy the solution but does so anyway. Result: Both students fail the assignment.
- Student A sits behind Student B in class. Student B works on his homework, instead of paying attention. Student A sees Student B's solution and copies it. Result: Both students fail the assignment.
- Student A and B do not read the honor code and submit identical solutions for all homework. Result: Both students fail the class.
Installation and setup
Installing python 3
Go to https://www.python.org/downloads/ to download python 3. Alternatively, you can install a python distribution such as Anaconda. Please select python 3 (not python 2).
Installing the dependencies
Install all dependencies using
python3 -m pip install -r requirements.txt
Note: On some systems, you might be required to use `pip3` instead of `pip` for python 3.
If you're using conda, use
conda env create -f environment.yml
The test grader will not have any dependencies installed other than native python3 libraries and the libraries mentioned in `requirements.txt`. In particular, packages like `pandas` will not be available. If you want to use additional dependencies, ask on piazza first, or risk the test grader failing.
Manual installation of PyTorch
Go to https://pytorch.org/get-started/locally/, then select the stable PyTorch build, your OS, the package manager (pip if you installed python 3 directly, conda if you installed Anaconda), your python version, and your CUDA version. Run the provided command. Note that CUDA is not required; you can select CUDA = None if you don't have a GPU or don't want to do GPU training locally. We will provide instructions for doing remote GPU training on Google Colab for free.
Manual installation of the Python Imaging Library (PIL)
The easiest way to install PIL is through `pip` or `conda`:
python3 -m pip install -U Pillow
There are a few important considerations when using PIL. First, make sure that your OS uses `libjpeg-turbo` and not the slower `libjpeg` (all modern Ubuntu versions do by default). Second, if you're frustrated with slow image transformations in PIL, use `Pillow-SIMD` instead:
CC="cc -mavx2" python3 -m pip install -U --force-reinstall Pillow-SIMD
The CC="cc -mavx2"
is only needed if your CPU supports AVX2 instructions.
pip
will most likely complain a bit about missing dependencies.
Install them, either through conda
, or your favorite package manager (apt
, brew
, …).