<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-6781135942995440618</id><updated>2011-10-25T00:33:35.364-07:00</updated><title type='text'>Set up and Configuration</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://setupandconfig.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6781135942995440618/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://setupandconfig.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Larry Freeman</name><uri>http://www.blogger.com/profile/06906614246430481533</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/_wqJeUjTB5sE/SKZUgwi8H5I/AAAAAAAABJ0/miQr_4wF1eg/S220/picture.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>1</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-6781135942995440618.post-9062942997125294565</id><published>2008-04-25T19:45:00.000-07:00</published><updated>2008-10-26T23:12:35.643-07:00</updated><title type='text'>Loading up Netflix Prize Data into MySQL</title><content type='html'>&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Disclaimer&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;I don't consider myself a mysql expert.  My goal was to get data installed in roughly a 1-day effort.&lt;br /&gt;&lt;br /&gt;Please feel free to post comments on improvements or questions about any of the method used in this blog.&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;br /&gt;Goal&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The goal of today's blog is loading up the Netflix Prize data into MySQL.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;OS&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;It is my assumption that you are doing this on Windows.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;" &gt;Downloads  &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To do this, it is assumed that you have the following items downloaded:&lt;br /&gt;&lt;br /&gt;1.  &lt;span style="font-weight: bold;"&gt;Netflix Prize Data &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This is available from:&lt;br /&gt;&lt;a href="http://netflixprize.com/"&gt;http://netflixprize.com&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;On my system, the download took roughly 2 hours.&lt;br /&gt;&lt;br /&gt;2.  &lt;span style="font-weight: bold;"&gt;MySQL&lt;/span&gt; (I used MySQL 5.0)&lt;br /&gt;&lt;br /&gt;This is available from:&lt;br /&gt;&lt;a href="http://dev.mysql.com/downloads/"&gt;http://dev.mysql.com/downloads/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;3.  &lt;span style="font-weight: bold;"&gt;MySQL Connector/J&lt;/span&gt; (I used MySQL Connector/J 5.1)&lt;br /&gt;&lt;br /&gt;This is available from:&lt;br /&gt;&lt;a href="http://dev.mysql.com/downloads/connector/"&gt;http://dev.mysql.com/downloads/connector/&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;4.  &lt;span style="font-weight: bold;"&gt;Cygwin Tools  &lt;/span&gt;(I used 1.5.25)&lt;br /&gt;&lt;br /&gt;This is available from:&lt;br /&gt;&lt;a href="http://cygwin.com/"&gt;http://cygwin.com&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I used the following:&lt;br /&gt;&lt;ul&gt;&lt;li style="font-weight: bold;"&gt;vim&lt;/li&gt;&lt;li style="font-weight: bold;"&gt;bash&lt;/li&gt;&lt;li style="font-weight: bold;"&gt;tar&lt;/li&gt;&lt;li style="font-weight: bold;"&gt;gzip&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;5. &lt;span style="font-weight: bold;"&gt;JDK&lt;/span&gt;  (I used Sun JDK 6 Update 6)&lt;br /&gt;&lt;br /&gt;This is available from:&lt;br /&gt;&lt;a href="http://java.sun.com/javase/downloads/index.jsp"&gt;http://java.sun.com/javase/downloads/index.jsp&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&lt;span style="font-size:130%;"&gt;Tasks&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;1.  &lt;span style="font-weight: bold;"&gt;Set up cygwin tools &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;(1)  Run &lt;span style="font-weight: bold;"&gt;setup.exe&lt;/span&gt;, hit the NEXT button and select "Install from Internet"&lt;br /&gt;&lt;br /&gt;(2)  When I get to the choices:&lt;br /&gt;&lt;br /&gt;I click on the "View" Button&lt;br /&gt;&lt;br /&gt;I select the following:&lt;br /&gt;&lt;ul&gt;&lt;li style="font-weight: bold;"&gt;bash&lt;/li&gt;&lt;li style="font-weight: bold;"&gt;gzip&lt;/li&gt;&lt;li style="font-weight: bold;"&gt;tar&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;vim&lt;/span&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;I hit NEXT and it installs.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;2.  &lt;span style="font-weight: bold;"&gt;Untar the Netflix Prize data file&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;(1)  Open a cygwin window&lt;br /&gt;&lt;br /&gt;(2) cd to the directory where the Netflix data is found and type:&lt;br /&gt;&lt;listing&gt;tar xvfz download*.gz&lt;br /&gt;&lt;/listing&gt;(3)  You will now have the following files created:&lt;listing&gt;download/rmse.pl&lt;br /&gt;download/README&lt;br /&gt;download/qualifying.txt&lt;br /&gt;download/probe.txt&lt;br /&gt;download/movie_titles.txt&lt;br /&gt;download/training_set.tar&lt;br /&gt;&lt;/listing&gt;(4) Next, untar the training set by typing&lt;br /&gt;&lt;listing&gt;cd download&lt;br /&gt;tar xvf training_set.tar&lt;br /&gt;&lt;/listing&gt;(5)  You will now have the following files created:&lt;br /&gt;&lt;listing&gt;&lt;br /&gt;download/training_set/mv_0000001.txt&lt;br /&gt;...&lt;br /&gt;download/training_set/mv_0017770.txt&lt;br /&gt;&lt;/listing&gt;3.&lt;span style="font-weight: bold;"&gt;  Install and configure MySQL&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;(1)  Run &lt;span style="font-weight: bold;"&gt;setup.exe&lt;/span&gt; and choose "Complete" installation.&lt;br /&gt;&lt;br /&gt;(2)  Run MySQL Server Instance Configuration Wizard (which is now installed) and click NEXT.&lt;br /&gt;&lt;br /&gt;(3)  I choose the following options:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;"Detailed Configuration"&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;"Dedicated MySQL Server Machine" &lt;/li&gt;&lt;/ul&gt;Note:  This choice will significantly slow down your computer but greatly speed up the queries.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;"Non-Transactional Database Only" (since the goal is speed)&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;"Decision Support (DSS)/OLAP &lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;"Strict Mode" (to help prevent mistakes)&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;"Include Bin Directory in Windows PATH" &lt;/li&gt;&lt;/ul&gt;4.  &lt;span style="font-weight: bold;"&gt;Create MySQL tables for Netflix Data&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;(1)  Open a MySQL Command Line client and hit enter for the password&lt;br /&gt;&lt;br /&gt;(2)  Create the database&lt;br /&gt;&lt;listing&gt;&lt;br /&gt;create database netflix;&lt;br /&gt;&lt;br /&gt;use netflix;&lt;br /&gt;&lt;/listing&gt;&lt;br /&gt;(3)  Create tables used for loading&lt;br /&gt;&lt;br /&gt;We'll create two tables here:  "rating" to hold the training data without the probe data.  "rating_all" to hold all the training data which we can use to generate the qualifying totals.&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create table rating (&lt;br /&gt;id int(4) unsigned not null primary key auto_increment,&lt;br /&gt;movieid int(2),&lt;br /&gt;userid int(3), rating int(1), date date,&lt;br /&gt;first date, support int(3), status int(1),&lt;br /&gt;avg double, avg2 double, residual double, checkvalue double);&lt;br /&gt;&lt;br /&gt;create table rating_all as select * from rating;&lt;br /&gt;&lt;br /&gt;create index index_rating_userid on rating(userid);&lt;br /&gt;&lt;br /&gt;create index index_rating_all_userid on rating_all(userid);&lt;br /&gt;&lt;br /&gt;create index index_rating_movieid on rating(movieid);&lt;br /&gt;&lt;br /&gt;create index index_rating_all_movieid on rating_all(movieid);&lt;br /&gt;&lt;br /&gt;alter table rating disable keys;&lt;br /&gt;&lt;br /&gt;alter table rating_all disable keys;&lt;br /&gt;&lt;br /&gt;create table probe_temp (movieid int(2), userid int(3));&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;5.  &lt;span style="font-weight: bold;"&gt;Install JDK and MySQL Connector/j&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;(1)  Run the jdk download and it self-installs.&lt;br /&gt;&lt;br /&gt;(2)  Untar the MySQL Connector by typing:&lt;br /&gt;&lt;listing&gt;tar xvfz mysql-connector-java*.gz&lt;br /&gt;&lt;/listing&gt;6. &lt;span style="font-weight: bold;"&gt;Load up Netflix Data into MySQL&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;(1)  Load the Probe Data into a temporary file&lt;br /&gt;&lt;br /&gt;I wrote the following awk script (fixprobe.awk):&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="java"&gt;&lt;br /&gt;{&lt;br /&gt;if ($0 ~ /:/) {&lt;br /&gt;movieid=$1&lt;br /&gt;}&lt;br /&gt;else {&lt;br /&gt;printf("%s,%s\n",movieid,$1);&lt;br /&gt;}&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;I executed the awk script against the probe.txt file:&lt;br /&gt;&lt;listing&gt;&lt;br /&gt;awk -F: -f fixprobe.awk probe.txt &gt; probe.fixed.txt&lt;br /&gt;&lt;/listing&gt;&lt;br /&gt;&lt;br /&gt;I uses the same script with the qualifying.txt file:&lt;br /&gt;&lt;br /&gt;&lt;div style="overflow: auto;"&gt;&lt;br /&gt;&lt;listing&gt;&lt;br /&gt;awk -F: -f fixprobe.awk qualifying.txt &gt; qualifying.fixed.txt&lt;br /&gt;&lt;/listing&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;I then loaded the file up using the following MySQL command:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;use netflix;&lt;br /&gt;load data local infile 'probe.fixed.txt' into table probe_temp &lt;br /&gt;fields terminated by ','&lt;br /&gt;lines terminated by '\n';&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;(2) Create indexes for probe_temp&lt;span style="font-family:monospace;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;create index index_probe_temp_userid on probe_temp(userid);&lt;br /&gt;&lt;br /&gt;create index index_probe_temp_movieid on probe_temp(movieid);&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;(3) Load the Ratings&lt;br /&gt;&lt;br /&gt;First, I created a java application (JDK 5 or later) to create a Loader sql script (CreateLoader.java):&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="java"&gt;&lt;br /&gt;public class CreateLoader {&lt;br /&gt;public static void main (String[] arg) {&lt;br /&gt;System.out.printf("use netflix\n");&lt;br /&gt;for (int i = 1; i &lt;= 17770; i++) {                    System.out.printf("load data local infile 'training_set/mv_%07d.txt' into table rating fields terminated by ',' lines terminated by '\\n' ignore 1 lines (@userid,@rating,@date) set movieid=%d, userid=@userid, rating=@rating, date=@date;\n",i,i);              }        }  }  &lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;I compiled the above code and ran it:&lt;br /&gt;&lt;pre name="code" class="java"&gt;&lt;br /&gt;javac CreateLoader.java&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;And then to generate the sql script, I ran:&lt;br /&gt;&lt;pre name="code" class="java"&gt;&lt;br /&gt;java CreateLoader &gt; LoadRatings.sql&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Then, from the MySQL prompt, I ran:&lt;br /&gt;&lt;listing&gt;&lt;br /&gt;source LoadRatings.sql&lt;br /&gt;&lt;/listing&gt;&lt;br /&gt;On my machine (1GB RAM, Centrino 100 GB), the above loader runs in roughly 15 minutes.&lt;br /&gt;&lt;br /&gt;(4) Create probe table and qualifying table&lt;br /&gt;&lt;br /&gt;For the probe table, we do:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create table probe (&lt;br /&gt;id int(4) unsigned not null primary key auto_increment,&lt;br /&gt;movieid int(2), userid int(3), date date, rating int(1), first date, support int(3), avg double, avg2 double, prediction double, checkvalue double);&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;For the qualifying table, we do:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create table qualifying as select * from probe;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;(5)  Create a rating table that does not include probe data.&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;update rating r set status=0 where exists (select movieid from probe_temp where &lt;br /&gt;userid=r.userid and movieid=r.movieid);&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;On my system, the above update took 40 minutes.&lt;br /&gt;&lt;br /&gt;Now, the above update can be speeded up using memory tables if you like.&lt;br /&gt;&lt;br /&gt;If you have enough memory, you can do the following:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;set MAX_HEAP_TABLE_SIZE=70000000;&lt;br /&gt;&lt;br /&gt;alter table probe_temp engine=memory;&lt;br /&gt;&lt;br /&gt;update rating r set status=0 where exists (select movieid from probe_temp where movieid=r.movieid and userid=r.userid);&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The above update took roughly 3 minutes on my machine.  A very nice improvement from 40 minutes.&lt;br /&gt;&lt;br /&gt;Otherwise, the trick is to divide probe_temp up into smaller tables that fit into memory.  On my system, this worked ok at 5 smaller tables.  Below is the SQL I used for table #0 (I would similar code for table #1 and table #2):&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create table probe_temp_0 as select userid,movieid from probe_temp where userid % 5 = 0;&lt;br /&gt;&lt;br /&gt;alter table probe_temp_0 engine=memory;&lt;br /&gt;&lt;br /&gt;create index index_probe_temp_0_movieid on probe_temp_0(movieid);&lt;br /&gt;&lt;br /&gt;create index index_probe_temp_0_userid on probe_temp_0(userid);&lt;br /&gt;&lt;br /&gt;update rating r set status = 0 where exists (select movieid from probe_temp_0 where userid=r.userid and movieid=r.movieid);&lt;br /&gt;&lt;br /&gt;drop table probe_temp_0;&lt;br /&gt;&lt;/pre&gt;On my system, the above sql with memory table took roughly 2 minutes to run.&lt;br /&gt;&lt;br /&gt;After updating the status, we are ready to create the separate table and delete it from rating:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;insert into probe(movieid,userid,date,rating) select movieid,userid,date,rating from rating where status=0;&lt;br /&gt;&lt;br /&gt;create index index_probe_movieid on probe(movieid);&lt;br /&gt;create index index_probe_userid on probe(userid);&lt;br /&gt;&lt;br /&gt;delete from rating where status = 0;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The above update and delete took roughly 15 minutes.&lt;br /&gt;&lt;br /&gt;(6) Reenable indexes&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;alter table rating enable keys;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The enable keys took 45 minutes.&lt;br /&gt;&lt;br /&gt;(7)  Load up qualifying data&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;use netflix;&lt;br /&gt;load data local infile 'qualifying.fixed.txt' into table qualifying fields terminated by ','&lt;br /&gt;lines terminated by '\n'&lt;br /&gt;(@movieid,@userid,@date) set movieid=@movieid, userid=@userid, date=@date;&lt;br /&gt;&lt;br /&gt;create index index_qualifying_userid on qualifying(userid);&lt;br /&gt;&lt;br /&gt;create index index_qualifying_movieid on qualifying(movieid);&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;(8) Set up View for Checking RMSE&lt;br /&gt;&lt;br /&gt;To check the current RMSE from probe, I create the following view:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;create view probe_rmse as select sqrt(sum((rating - prediction)*(rating - prediction))/count(rating)) as "rmse" from probe;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Then, to check the RMSE, I issue the following SQL command:&lt;br /&gt;&lt;br /&gt;&lt;pre name="code" class="sql"&gt;&lt;br /&gt;select rmse from probe_rmse;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;And now, I am ready to process the Netflix Data.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6781135942995440618-9062942997125294565?l=setupandconfig.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://setupandconfig.blogspot.com/feeds/9062942997125294565/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6781135942995440618&amp;postID=9062942997125294565' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6781135942995440618/posts/default/9062942997125294565'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6781135942995440618/posts/default/9062942997125294565'/><link rel='alternate' type='text/html' href='http://setupandconfig.blogspot.com/2008/04/loading-up-netflix-prize-data-into.html' title='Loading up Netflix Prize Data into MySQL'/><author><name>Larry Freeman</name><uri>http://www.blogger.com/profile/06906614246430481533</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://4.bp.blogspot.com/_wqJeUjTB5sE/SKZUgwi8H5I/AAAAAAAABJ0/miQr_4wF1eg/S220/picture.jpg'/></author><thr:total>8</thr:total></entry></feed>
