Friday, April 25, 2008

Loading up Netflix Prize Data into MySQL

Disclaimer

I don't consider myself a mysql expert. My goal was to get data installed in roughly a 1-day effort.

Please feel free to post comments on improvements or questions about any of the method used in this blog.

Goal


The goal of today's blog is loading up the Netflix Prize data into MySQL.

OS

It is my assumption that you are doing this on Windows.

Downloads

To do this, it is assumed that you have the following items downloaded:

1. Netflix Prize Data

This is available from:
http://netflixprize.com

On my system, the download took roughly 2 hours.

2. MySQL (I used MySQL 5.0)

This is available from:
http://dev.mysql.com/downloads/

3. MySQL Connector/J (I used MySQL Connector/J 5.1)

This is available from:
http://dev.mysql.com/downloads/connector/

4. Cygwin Tools (I used 1.5.25)

This is available from:
http://cygwin.com

I used the following:
  • vim
  • bash
  • tar
  • gzip

5. JDK (I used Sun JDK 6 Update 6)

This is available from:
http://java.sun.com/javase/downloads/index.jsp

Tasks

1. Set up cygwin tools

(1) Run setup.exe, hit the NEXT button and select "Install from Internet"

(2) When I get to the choices:

I click on the "View" Button

I select the following:
  • bash
  • gzip
  • tar
  • vim
I hit NEXT and it installs.

2. Untar the Netflix Prize data file

(1) Open a cygwin window

(2) cd to the directory where the Netflix data is found and type:
tar xvfz download*.gz
(3) You will now have the following files created:download/rmse.pl
download/README
download/qualifying.txt
download/probe.txt
download/movie_titles.txt
download/training_set.tar
(4) Next, untar the training set by typing
cd download
tar xvf training_set.tar
(5) You will now have the following files created:

download/training_set/mv_0000001.txt
...
download/training_set/mv_0017770.txt
3. Install and configure MySQL

(1) Run setup.exe and choose "Complete" installation.

(2) Run MySQL Server Instance Configuration Wizard (which is now installed) and click NEXT.

(3) I choose the following options:

  • "Detailed Configuration"
  • "Dedicated MySQL Server Machine"
Note: This choice will significantly slow down your computer but greatly speed up the queries.
  • "Non-Transactional Database Only" (since the goal is speed)
  • "Decision Support (DSS)/OLAP
  • "Strict Mode" (to help prevent mistakes)
  • "Include Bin Directory in Windows PATH"
4. Create MySQL tables for Netflix Data

(1) Open a MySQL Command Line client and hit enter for the password

(2) Create the database

create database netflix;

use netflix;

(3) Create tables used for loading

We'll create two tables here: "rating" to hold the training data without the probe data. "rating_all" to hold all the training data which we can use to generate the qualifying totals.

create table rating (
id int(4) unsigned not null primary key auto_increment,
movieid int(2),
userid int(3), rating int(1), date date,
first date, support int(3), status int(1),
avg double, avg2 double, residual double, checkvalue double);

create table rating_all as select * from rating;

create index index_rating_userid on rating(userid);

create index index_rating_all_userid on rating_all(userid);

create index index_rating_movieid on rating(movieid);

create index index_rating_all_movieid on rating_all(movieid);

alter table rating disable keys;

alter table rating_all disable keys;

create table probe_temp (movieid int(2), userid int(3));


5. Install JDK and MySQL Connector/j

(1) Run the jdk download and it self-installs.

(2) Untar the MySQL Connector by typing:
tar xvfz mysql-connector-java*.gz
6. Load up Netflix Data into MySQL

(1) Load the Probe Data into a temporary file

I wrote the following awk script (fixprobe.awk):


{
if ($0 ~ /:/) {
movieid=$1
}
else {
printf("%s,%s\n",movieid,$1);
}
}


I executed the awk script against the probe.txt file:

awk -F: -f fixprobe.awk probe.txt > probe.fixed.txt


I uses the same script with the qualifying.txt file:



awk -F: -f fixprobe.awk qualifying.txt > qualifying.fixed.txt



I then loaded the file up using the following MySQL command:


use netflix;
load data local infile 'probe.fixed.txt' into table probe_temp
fields terminated by ','
lines terminated by '\n';


(2) Create indexes for probe_temp

create index index_probe_temp_userid on probe_temp(userid);

create index index_probe_temp_movieid on probe_temp(movieid);

(3) Load the Ratings

First, I created a java application (JDK 5 or later) to create a Loader sql script (CreateLoader.java):


public class CreateLoader {
public static void main (String[] arg) {
System.out.printf("use netflix\n");
for (int i = 1; i <= 17770; i++) { System.out.printf("load data local infile 'training_set/mv_%07d.txt' into table rating fields terminated by ',' lines terminated by '\\n' ignore 1 lines (@userid,@rating,@date) set movieid=%d, userid=@userid, rating=@rating, date=@date;\n",i,i); } } }


I compiled the above code and ran it:

javac CreateLoader.java


And then to generate the sql script, I ran:

java CreateLoader > LoadRatings.sql


Then, from the MySQL prompt, I ran:

source LoadRatings.sql

On my machine (1GB RAM, Centrino 100 GB), the above loader runs in roughly 15 minutes.

(4) Create probe table and qualifying table

For the probe table, we do:


create table probe (
id int(4) unsigned not null primary key auto_increment,
movieid int(2), userid int(3), date date, rating int(1), first date, support int(3), avg double, avg2 double, prediction double, checkvalue double);


For the qualifying table, we do:


create table qualifying as select * from probe;


(5) Create a rating table that does not include probe data.


update rating r set status=0 where exists (select movieid from probe_temp where
userid=r.userid and movieid=r.movieid);

On my system, the above update took 40 minutes.

Now, the above update can be speeded up using memory tables if you like.

If you have enough memory, you can do the following:


set MAX_HEAP_TABLE_SIZE=70000000;

alter table probe_temp engine=memory;

update rating r set status=0 where exists (select movieid from probe_temp where movieid=r.movieid and userid=r.userid);


The above update took roughly 3 minutes on my machine. A very nice improvement from 40 minutes.

Otherwise, the trick is to divide probe_temp up into smaller tables that fit into memory. On my system, this worked ok at 5 smaller tables. Below is the SQL I used for table #0 (I would similar code for table #1 and table #2):


create table probe_temp_0 as select userid,movieid from probe_temp where userid % 5 = 0;

alter table probe_temp_0 engine=memory;

create index index_probe_temp_0_movieid on probe_temp_0(movieid);

create index index_probe_temp_0_userid on probe_temp_0(userid);

update rating r set status = 0 where exists (select movieid from probe_temp_0 where userid=r.userid and movieid=r.movieid);

drop table probe_temp_0;
On my system, the above sql with memory table took roughly 2 minutes to run.

After updating the status, we are ready to create the separate table and delete it from rating:


insert into probe(movieid,userid,date,rating) select movieid,userid,date,rating from rating where status=0;

create index index_probe_movieid on probe(movieid);
create index index_probe_userid on probe(userid);

delete from rating where status = 0;


The above update and delete took roughly 15 minutes.

(6) Reenable indexes


alter table rating enable keys;


The enable keys took 45 minutes.

(7) Load up qualifying data


use netflix;
load data local infile 'qualifying.fixed.txt' into table qualifying fields terminated by ','
lines terminated by '\n'
(@movieid,@userid,@date) set movieid=@movieid, userid=@userid, date=@date;

create index index_qualifying_userid on qualifying(userid);

create index index_qualifying_movieid on qualifying(movieid);


(8) Set up View for Checking RMSE

To check the current RMSE from probe, I create the following view:


create view probe_rmse as select sqrt(sum((rating - prediction)*(rating - prediction))/count(rating)) as "rmse" from probe;


Then, to check the RMSE, I issue the following SQL command:


select rmse from probe_rmse;



And now, I am ready to process the Netflix Data.

8 comments:

Anonymous said...

You might note that doing your final steps makes the setup only useful for analyzing the probe dataset. When you want to move on to actually submitting your data, you'll have to recreate your rating table to include the ratings from the probe table. For me, it's easier to just have a rating_tbl and a rating_tmp_tbl.

Larry Freeman said...

Thanks for the suggestion. I agree.

I've updated the blog entry to include "rating" (training - probe) and "rating_all" (all training)

Cheers,

-Larry

Anonymous said...

Cool. Any chance you could repost your command for pulling up the RMSE from the probe data?

Larry Freeman said...

Hi Tekivsteki,

I've just added back the command to check rmse.

Cheers,

-Larry

johno said...

Excellent writeup, thanks a lot.

Here are some corrections/suggestions:

1) You seem to have a redundant `id` column for `rating`, `rating_all`, `probe`, `qualifying`... tables. You could save yourself an index (and some space on disk) by creating a composite primary key from `movie_id` and `user_id` columns.

2) query `... select movieid from probe_temp whereuserid=r.userid and movieid=r.movieid` is slow because you do not have a composite index that can be used. AFAIK MySQL cannot use two indexes on one join. Again, composite (primary) index solves this issue.

Keep up the good work!

Unknown said...

Awesome blog!

Unknown said...

Maybe I'm missing something but don't you want to enable keys also on the rating_all table?

phobien_hoaihai said...

I would like find a netflix dataset but it is unavailable. could you send NetFlix dataset to me if you have it. Thanks so much. (my email: ducde1606@gmail.com)