Movies Part 1 - ETL

Introduction

In this assignment you’ll create and execute an ETL pipeline for movie data.

General Instructions

This is an individual assignment.

Since this project counts for an exam grade you may not collaborate with your classmates. You may discuss general concepts related to the assignment, such as how to use the Python libraries in general terms, but you may not discuss any specifics of this assignment with anyone other than the course instructor or the TAs.

Notes:

Problem Description

You work at a movie production company. Your team has been given the task of creating a risk and reward models for science fiction movie proposals. As a junior data scientist on the team, you have been given the task of creating a CSV data set which can be easily loaded into a data frame for analysis.

Solution Description

Write a script called movies_etl.py which performs the ETL portion of an analytics pipeline (details in the sections below). Data analytics pipelines are typically organized in distinct phases (which are often subdivided into smaller steps):

  1. Extract data from sources such as databases, web sites, sensors, etc.
  2. Transform data into a usable form by filling in missing data, normalizing data, cleaning mislabelled data, etc.
  3. Load data into a form that can be read by analytics tools, such as a database or data exchange format like CSV.
  4. Explore …
  5. Model …
  6. Predict …
  7. Present …

This assignment focuses on the first three major steps, which are often called “ETL” for short. You will collect information from The Movie DB, transform the data into a usable form (sometimes by collecting additional data form other sources) and load the data into a CSV file that can be easily read by Pandas in Part 2.

Step 1: Extract Movie Data

First you’ll need an account at The Movie DB, then apply for an API key on the API settings page (you’ll need to be logged in to your account to apply for an API key).

Get movie data from TMDB

  1. Get a list of the TMDB ids of the 20 top grossing science fiction movies from the United States for each year from 1990 to 2018 (that’s 580 movies) using the discover API. The discover API will give you paginated results, that is, pages of 20 results each. Get the first page of 20 results for each year, and be sure to pass the parameter which gives you results sorted by revenue in descending order. A result from the discover API will look like this, from which you’ll extract the ids:
{'page': 1,
 'results': [{'adult': False,
              'backdrop_path': '/lmZFxXgJE3vgrciwuDib0N8CfQo.jpg',
              'genre_ids': [12, 878, 28, 14],
              'id': 299536,
              'original_language': 'en',
              'original_title': 'Avengers: Infinity War',
              'overview': 'As the Avengers and their allies have continued to '
                          'protect the world from threats too large for any '
                          'one hero to handle, a new danger has emerged from '
                          'the cosmic shadows: Thanos. A despot of '
                          'intergalactic infamy, his goal is to collect all '
                          'six Infinity Stones, artifacts of unimaginable '
                          'power, and use them to inflict his twisted will on '
                          'all of reality. Everything the Avengers have fought '
                          'for has led up to this moment - the fate of Earth '
                          'and existence itself has never been more uncertain.',
              'popularity': 127.094,
              'poster_path': '/7WsyChQLEftFiDOVTGkv3hFpyyt.jpg',
              'release_date': '2018-04-25',
              'title': 'Avengers: Infinity War',
              'video': False,
              'vote_average': 8.3,
              'vote_count': 9499},
             {'adult': False,
              'backdrop_path': '/b6ZJZHUdMEFECvGiDpJjlfUWela.jpg',
              'genre_ids': [28, 12, 14, 878],
              'id': 284054,
              'original_language': 'en',
              'original_title': 'Black Panther',
              'overview': "King T'Challa returns home from America to the "
                          'reclusive, technologically advanced African nation '
                          "of Wakanda to serve as his country's new leader. "
                          "However, T'Challa soon finds that he is challenged "
                          'for the throne by factions within his own country '
                          'as well as without. Using powers reserved to '
                          "Wakandan kings, T'Challa assumes the Black Panther "
                          'mantel to join with girlfriend Nakia, the '
                          'queen-mother, his princess-kid sister, members of '
                          "the Dora Milaje (the Wakandan 'special forces') and "
                          'an American secret agent, to prevent Wakanda from '
                          'being dragged into a world war.',
              'popularity': 50.64,
              'poster_path': '/uxzzxijgPIY7slzFvMotPv8wjKA.jpg',
              'release_date': '2018-02-13',
              'title': 'Black Panther',
              'video': False,
              'vote_average': 7.3,
              'vote_count': 9086},
... 18 additional movies elided
  1. Because the discover API doesn’t give you all the data you want, you’ll need to make a call to the movie details API for each movie. A call to the movie details API for the first movie in the list above will give you:
{'adult': False,
 'backdrop_path': '/lmZFxXgJE3vgrciwuDib0N8CfQo.jpg',
 'belongs_to_collection': {'backdrop_path': '/zuW6fOiusv4X9nnW3paHGfXcSll.jpg',
                           'id': 86311,
                           'name': 'The Avengers Collection',
                           'poster_path': '/qJawKUQcIBha507UahUlX0keOT7.jpg'},
 'budget': 300000000,
 'genres': [{'id': 12, 'name': 'Adventure'},
            {'id': 878, 'name': 'Science Fiction'},
            {'id': 28, 'name': 'Action'},
            {'id': 14, 'name': 'Fantasy'}],
 'homepage': 'http://marvel.com/movies/movie/223/avengers_infinity_war_part_1',
 'id': 299536,
 'imdb_id': 'tt4154756',
 'original_language': 'en',
 'original_title': 'Avengers: Infinity War',
 'overview': 'As the Avengers and their allies have continued to protect the '
             'world from threats too large for any one hero to handle, a new '
             'danger has emerged from the cosmic shadows: Thanos. A despot of '
             'intergalactic infamy, his goal is to collect all six Infinity '
             'Stones, artifacts of unimaginable power, and use them to inflict '
             'his twisted will on all of reality. Everything the Avengers have '
             'fought for has led up to this moment - the fate of Earth and '
             'existence itself has never been more uncertain.',
 'popularity': 127.094,
 'poster_path': '/7WsyChQLEftFiDOVTGkv3hFpyyt.jpg',
 'production_companies': [{'id': 420,
                           'logo_path': '/hUzeosd33nzE5MCNsZxCGEKTXaQ.png',
                           'name': 'Marvel Studios',
                           'origin_country': 'US'}],
 'production_countries': [{'iso_3166_1': 'US',
                           'name': 'United States of America'}],
 'release_date': '2018-04-25',
 'revenue': 2046239637,
 'runtime': 149,
 'spoken_languages': [{'iso_639_1': 'en', 'name': 'English'}],
 'status': 'Released',
 'tagline': 'An entire universe. Once and for all.',
 'title': 'Avengers: Infinity War',
 'video': False,
 'vote_average': 8.3,
 'vote_count': 9499}

Step 2: Transform Movie Data

It’s usually necessary to clean the extracted data, which may include filling in missing data, fixing mislabelled data, and normalizing data. Here’s we’ll gain experience filling in missing data by triangulating with another source.

There are three movies which don’t have runtime values in The Movie DB. Use data from IMDB Datasets to fill in those missing runtimes. You may assume that the IMDB Datasets are in the same directory as your script.

NOTE: read the gzipped version of the IMDB dataset file using Python’s gzip module.

Step 3: Load Movie Data

Create a cleanly-formatted CSV file with the following header:

tmdb_id,imdb_id,title,release_date,belongs_to_collection,runtime,budget,revenue

The first few lines of your movies.csv output should look something like this:

tmdb_id,imdb_id,title,release_date,belongs_to_collection,runtime,budget,revenue
861,tt0100802,Total Recall,1990-06-01,False,113,10000000,261317921
196,tt0099088,Back to the Future Part III,1990-05-25,True,118,40000000,244527583
1498,tt0100758,Teenage Mutant Ninja Turtles,1990-03-30,True,93,13500000,202000000
1551,tt0099582,Flatliners,1990-08-10,False,115,26000000,61489265
169,tt0100403,Predator 2,1990-11-20,True,108,35000000,57120318
9556,tt0099365,Darkman,1990-08-24,True,96,16000000,48878502
5549,tt0100502,RoboCop 2,1990-06-22,True,117,35000000,45681173
2612,tt0100201,Mr. Destiny,1990-10-12,False,110,20000000,15379253
34086,tt0100666,Spaced Invaders,1990-04-27,False,100,3000000,15369573
19384,tt0099817,Dark Angel,1990-01-26,False,91,7000000,9229401
11309,tt0099740,Hardware,1990-09-14,False,93,1500000,5728953
23535,tt0099277,Class of 1999,1990-04-01,True,99,5200000,2459895
3072,tt0099612,Frankenstein Unbound,1990-11-02,False,82,11500000,334748
27274,tt0099611,Frankenhooker,1990-06-01,False,85,2500000,205068
192074,tt0098996,Aftershock,1990-10-05,False,91,0,0
530,tt0104361,A Grand Day Out,1990-05-18,True,23,0,0
18317,tt0103129,Truly Madly Deeply,1990-11-10,False,106,0,0
30647,tt0098962,12:01 PM,1990-01-01,False,25,0,0
545334,tt0099864,Stephen King's It,1990-11-18,False,180,0,0
218650,tt0099853,Invasion of the Space Preachers,1990-01-01,False,100,100000,0
280,tt0103064,Terminator 2: Judgment Day,1991-07-03,True,137,102000000,520000000
174,tt0102975,Star Trek VI: The Undiscovered Country,1991-12-05,True,113,27000000,96900000

Running Your Script

IMPORTANT: your script should cache data so that you aren’t making 610 web service API calls each time you run your script, and your script should report what it’s doing. You should save cached results in two JSON files whose names encode the TMDB genre id, years, sort order and kind of data in the file:

The first time you run your script it will take several minutes and write the following to the console:

$ python movies_etl YOUR_TMDB_API_KEY
Loading movie ids from themoviedb.org, saving in 878-1990-2018-revenue.desc-ids.json.
Loading movie details from themoviedb.org, saving in 878-1990-2018-revenue.desc-details.json.
Writing data to movies.csv

The second time you run your script it should take only a few seconds because of the cache files produced above and write the following to the console:

$ python movies_etl YOUR_TMDB_API_KEY
Loading movie ids from 878-1990-2018-revenue.desc-ids.json.
Loading movie details from 878-1990-2018-revenue.desc-details.json.
Writing data to movies.csv

Tips and Considerations

Scala

Python

General

Grading

This rubric is designed to give you many opportunities for partial credit and to make it easy to grade. To the extend possible, rubric items are independent.

Turn-in Procedure

Submit your movies_etl file on Canvas as an attachment. When you’re ready, double-check that you have submitted and not just saved a draft.

Verify the Success of Your Submission to Canvas

Practice safe submission! Verify that your HW files were truly submitted correctly, the upload was successful, and that your program runs with no syntax or runtime errors. It is solely your responsibility to turn in your homework and practice this safe submission safeguard.