Thesis Design Period: January-February/March

You get ready for your thesis project:

  1. You have found an internship.
  2. Set up your private GitHub repo and start logging and committing everything you do.
  3. Create your thesis design
    • Research questions
    • Data description
    • Exploratory Data Analysis (EDA)
    • Ground your work in existing literature
  4. Find an internal (UvA or VU) supervisor.
  5. Finalize your thesis design, submit to Canvas, and have it approved by your internal supervisor.

You let us know what you do and with whom

  1. If you are all set, fill in https://forms.gle/qvzTQypLHBSkZLxK7.
    • Thanks!!
    • I have more than 150 students to manage. Without this form I am lost.
    • And you too ;-)

Thesis design = ticket to a supervisor

  • At UvA there is a shortage of supervisors
  • So it will be hard to find one
  • Your thesis design can be your ticket to a supervisor
    • the better the design,
    • the more likely you will succeed,
    • the easier and smoother the thesis period,
    • and so the more you will be liked by supervisors
      • and they will choose to supervise you.

Thesis Design

Thesis Design: whom to make happy?

  • Most of you do your thesis at an internship.
  • This may cause friction....
    • The company wants you to make/produce/do $X$
    • You want to (and must) answer a research question

You are graded on

  • Your research question,
  • and how you have answered it
    • Related work
    • experiment/data collection
    • well-described methodology / experimental setup
    • evaluation
    • reflection
  • In the end, this is also most useful to your company.
    • Even if they do not see this at the start ....

Get the expectations on all sides clear!

  • An explicit thesis design helps a lot with this.

Thesis Design sections

  1. A title, supervisor(s) and abstract
    1. Plus clickable links to the email addresses of the author and supervisors, and to the private GitHub repository of the thesis project.
  2. A clearly defined research problem and corresponding subquestions
  3. Overview of the state of the art of the literature
    1. Clearly indicate how your approach is grounded in the literature
  4. Methodology
    1. Describe your "resources" (those that are applicable)
      • data
        • Quite extensive EDA (Exploratory Data Analysis)
      • algorithms
      • software
    2. Describe the methods you will use
    3. Describe how you evaluate your results
  5. Risk assessment
    1. Describe the risks, and describe your backup plan for each of them.
  6. Project plan
    1. Describe what you will have achieved by when.
    2. This typically has the shape of a table with 12 entries (one for each week).
    3. The last entry is clear (Thesis).
    4. The other entries always refer back to your subquestions, methodology, and literature.
    5. Describe concrete achievements, not actions. (E.g., instead of "data preparation", write "all data in XXX format, well described, ready for analysis using YYY".)

Thesis Design assessment form

Below you see the weight of each section and the questions used by the supervisor to assess the sections.

  1. A title, supervisor(s), abstract (10)
    1. Is all clear and neat?
  2. A clearly defined research problem and corresponding subquestions (20)
    1. Can the problem be answered?
    2. Do answers to the subquestions indeed help in understanding the research problem, or even in solving it?
    3. Are the subquestions detailed enough?
  3. Overview of the state of the art of the literature (20)
    1. One expects that the research problem is grounded in the literature and that each subquestion or field has a small section of relevant literature.
    2. All parts of the thesis should be grounded in or at least connected to the literature.
  4. Methodology (20)
    1. Do I get a clear picture of the used resources?
      1. E.g., for data, do I get a clear picture of the data: its state, its availability, its size, how dirty it is, how much work it takes to process, etc.?
    2. Are the methods which will be used described in enough detail, so that I can picture what will be done exactly?
    3. Is the evaluation appropriate? That is, do I understand how each subquestion is answered by the evaluation?
  5. Risk assessment (10)
    1. Is it complete? Is it realistic? Is the backup plan executable?
  6. Project plan (20)
    1. Is it complete? (I.e., is every part of the work covered?)
    2. Is it realistic?
    3. Does it give a clear picture of what will be done when?
    4. Is it possible to evaluate whether the student is on schedule at any point in time?

Absolutely necessary

Data

  • You must have seen it, have access to it, have done EDA on it, and be confident that the data is sufficient to answer your research question.
  • If your company does not want to give access to the data before April, stop the project at once.
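
A first pass at EDA does not need much. The sketch below is only an illustration: it uses the Python standard library, a hypothetical helper `summarize`, and made-up toy data with invented column names (`text`, `contact_reason`). It reports the number of rows, the share of missing values per column, and the label distribution, which is roughly the minimum you want to know before you commit to a research question.

```python
import csv
import io
from collections import Counter


def summarize(rows, label_column):
    """Return row count, missing-value share per column, and label counts."""
    rows = list(rows)
    n = len(rows)
    columns = rows[0].keys() if rows else []
    # Treat empty or whitespace-only strings (and absent fields) as missing.
    missing = {
        col: sum(1 for r in rows if not (r[col] or "").strip()) / n
        for col in columns
    }
    labels = Counter(
        r[label_column] for r in rows if (r[label_column] or "").strip()
    )
    return {"n_rows": n, "missing_share": missing, "label_counts": labels}


# Hypothetical toy data standing in for a real export from your company.
SAMPLE = """text,contact_reason
Where is my order?,delivery
,billing
I was charged twice,billing
Parcel arrived damaged,
"""

summary = summarize(csv.DictReader(io.StringIO(SAMPLE)), "contact_reason")
print(summary["n_rows"])
print(summary["missing_share"])
print(summary["label_counts"].most_common())
```

If a summary like this already shows many missing labels or a heavily skewed label distribution, that belongs in your data description and in your risk assessment.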

A clearly stated research question which is answered by quantitative experiments.

A solid plan about the experiments that you will do.

Research questions

  • Many students find it very difficult to formulate them.

Not good

  • Can Machine learning help/enhance ....?
  • Can ML improve the current algorithm of the company?

Good example

The goal of the research is to determine whether automated text classification methods can reach the human performance level of customer service agents in the task of labelling customer emails with a contact reason by using only the text body. The research is split into four subquestions.

  1. What is the current human service agent performance and can the system dataset labels be validated by the observers?
  2. How accurately can the text body of the first message from the customer to the customer service be extracted?
  3. What performance level can be achieved with a baseline model by using a well-performing (according to literature) standard multi-class text classification pipeline setup?
  4. How much can the automated classification performance be improved by using different pre-processing steps, feature selection methods, sampling techniques for imbalanced datasets, hyperparameter tuning, and alternative models?

Another nice example

It has been observed that forecasting using neural networks can be significantly improved by deseasonalizing the data prior to feeding it into the network [33]. Based on this finding, this study evaluates the effect on GRU model performance of extracting the seasonal component from the inflow data by a separate model. (Inske Groenen, 2018)
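
To make "deseasonalizing" concrete, here is a minimal sketch. It is not the method of the thesis quoted above, which extracts the seasonal component with a separate model. As a stand-in, this sketch simply subtracts the mean of each position in the cycle. The function names, the period of 4, and the toy series are all assumptions made for illustration, and the series is assumed to cover at least one full period.

```python
def seasonal_means(series, period):
    """Average value at each position within the seasonal cycle."""
    buckets = [[] for _ in range(period)]
    for i, value in enumerate(series):
        buckets[i % period].append(value)
    return [sum(b) / len(b) for b in buckets]


def deseasonalize(series, period):
    """Subtract the seasonal mean from every observation."""
    means = seasonal_means(series, period)
    residual = [v - means[i % period] for i, v in enumerate(series)]
    return residual, means


def reseasonalize(residual_forecast, means, start_index):
    """Add the seasonal component back to forecasts made on the residuals."""
    period = len(means)
    return [
        r + means[(start_index + k) % period]
        for k, r in enumerate(residual_forecast)
    ]


# Hypothetical inflow series: a cycle of length 4 plus a slow upward drift.
inflow = [10, 20, 30, 20, 12, 22, 32, 22, 14, 24, 34, 24]
residual, means = deseasonalize(inflow, period=4)
print(means)
print(residual)
```

A forecasting model would then be trained on `residual` instead of `inflow`, and its predictions passed through `reseasonalize` before evaluation. The research question is whether that detour actually improves performance.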

An insufficient thesis

  • You nicely preprocess your dataset
  • You run X different ML algorithms, possibly with a grid search for optimal hyperparameters
  • You present a large table with precision, recall, F1 values.
  • You conclude that algorithm X performs better than baseline/the company algorithm/...

Instead, this would be a sufficient final report for an assignment in an ML class.

This is a master thesis

  • A master is able to explain, to reflect, to analyse.
  • Better to make an informed choice of which features to use than to simply throw everything you have at the algorithm
  • Analyse errors, and figure out why things go wrong
  • Understand why one thing works better than another, and explain it to (semi-)laymen (people in your company).
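
Error analysis can start small. The sketch below is one possible first step, under stated assumptions: the helper names `per_class_scores` and `top_confusions` and the toy labels are made up for illustration, and only the Python standard library is used. Instead of one overall score, it shows precision and recall per class and lists which classes are most often confused with each other.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Precision and recall for each class, computed from label pairs."""
    pairs = Counter(zip(y_true, y_pred))
    classes = sorted(set(y_true) | set(y_pred))
    scores = {}
    for c in classes:
        tp = pairs[(c, c)]
        predicted = sum(n for (t, p), n in pairs.items() if p == c)
        actual = sum(n for (t, p), n in pairs.items() if t == c)
        scores[c] = {
            "precision": tp / predicted if predicted else 0.0,
            "recall": tp / actual if actual else 0.0,
        }
    return scores


def top_confusions(y_true, y_pred, k=3):
    """Most frequent (true, predicted) pairs among the errors."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(k)


# Hypothetical gold labels and model predictions.
y_true = ["billing", "billing", "delivery", "delivery", "returns", "returns"]
y_pred = ["billing", "delivery", "delivery", "delivery", "delivery", "returns"]

scores = per_class_scores(y_true, y_pred)
confusions = top_confusions(y_true, y_pred)
print(scores)
print(confusions)
```

The numbers are only the starting point. The thesis-level work is reading the confused examples and explaining why the model gets them wrong.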

The reader of your thesis wants to learn something.

  • And not simply the accuracy value of your best experiment.
  • She wants to know why something worked or did not work.

Example theses

Rest of today

  1. Plenary Questions
  2. Individual questions about finding supervisors