The R-Podcast Episode 10: Adventures in Data Munging Part 2





Summary: I'm happy to present episode 10 of the R-Podcast! Season 1 of the R-Podcast concludes with part 2 of my series on data munging, in which I discuss issues surrounding importing data sets contained in HTML tables. I share how I used the XML and RCurl packages to validate and import data from hockey-reference.com for storage in a MySQL database. Our listener feedback segment contains another installment of the Pitfalls of R, contributed by listener Frans. I want to thank everyone who has provided such positive feedback throughout the season, and I'm looking forward to providing some exciting new content for season 2. I hope you enjoy the episode, and check out our new contact page if you would like to provide any feedback. Thanks for listening!

The following resources are mentioned in this episode:

New additions to the RStudio team: blog.rstudio.org/2012/08/20/welcome-hadley-winston-and-garrett/
Over 4,000 packages on CRAN: http://blog.revolutionanalytics.com/2012/08/two-r-community-milestones.html
NHL Analysis web-scraping scripts on GitHub: https://github.com/thercast/nhl_analysis/tree/master/web-scraping
XML package: http://cran.r-project.org/web/packages/XML/index.html
RCurl package: http://cran.r-project.org/web/packages/RCurl/
Hockey-Reference data: http://www.hockey-reference.com
Using R for Scraping Data Presentation at UseR! 2012: http://www.slideshare.net/rtelmore/user-2012-talk
Using RMySQL tutorial: http://playingwithr.blogspot.com/2011/05/accessing-mysql-through-r.html
Jeroen Ooms' lme4 web application: http://www.stat.ucla.edu/~jeroen/lme4.html
Coursera Course on R: https://www.coursera.org/course/compdata
RPubs: http://rpubs.com/

Theme music provided by WillRock from the Return All Robots Remix Album at ocremix.org
The closing theme is entitled "The Way" and provided by Jewbei from the Wild Arms: ARMed and Dangerous album at ocremix.org

Episode 10 Time Stamps

00:00 The R-Podcast #010 Adventures in Data Munging Part 2
00:33 Introduction
01:50 Wrapping up season 1 ... wait, what?
03:30 RStudio team expands
05:41 R Community milestone
07:53 Discovering hockey-reference.com
10:54 Tips for readHTMLTable
21:10 Checking for valid data first
29:23 Minor processing needed
35:18 Saving data to MySQL database
45:26 Listener Feedback: Andrew
54:58 Frans: Pitfalls of R segment 2
63:40 Wrapping up: subscribe to the podcast, theRcast@gmail.com, +1-269-849-9780, Twitter @theRcast
69:14 End