Assistant Director and Manager for Systems and Networking Research Technologist Kirk Anne hosted a data science workshop for the programming and data science club on Sunday, April 9, in which students analyzed a dataset to determine whether passengers on the Titanic were likely to survive.
The workshop allowed students to become familiar with data acquisition, cleaning, analysis and presentation.
Students of all programming abilities were welcome at the workshop. Anne and members of the programming and data science club were available an hour before the event to help students download the software they needed to participate.
“We’re working on trying to find ways to expose students to programming and data science,” Anne said. “Even in the humanities and history and English—these types of techniques are being used for research. This spans all disciplines and the ability to work with data—to present a cohesive and coherent argument—is important and these are the tools to do so.”
During the workshop, students examined several variables that may have influenced the survival rate on the Titanic: passenger class, gender, age, whether the passenger had siblings or a spouse on board, the number of parents and children on board, the ticket number, the passage fare, the cabin number and where each passenger boarded the ship.
Attendees typed a series of commands on their screens to access the data and then explored multiple methods to analyze the data, with Anne showing them how to use the Python programming language. Programming and data science club president and mathematics major senior Walter Gerych explained how to use the R programming language.
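The article does not record the commands attendees typed, so the following is only a minimal sketch of the acquisition and cleaning steps, using Python's standard library. The sample rows are made up, and the column names (`pclass`, `sibsp`, `parch`, `embarked` and so on) are assumptions chosen to mirror the variables listed above, not the workshop's actual file.

```python
import csv
import io

# Made-up sample in CSV form, NOT the real Titanic records.
# The column names are assumptions that mirror the variables in the article.
SAMPLE_CSV = """pclass,sex,age,sibsp,parch,fare,embarked,survived
1,female,29,0,0,211.34,S,1
3,male,,1,0,7.25,S,0
2,female,41,0,2,26.00,C,1
"""

def load_passengers(text):
    """Parse CSV text into a list of dicts with typed values."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            "pclass": int(record["pclass"]),
            "sex": record["sex"],
            # A blank age is kept as None rather than guessed.
            "age": float(record["age"]) if record["age"] else None,
            "sibsp": int(record["sibsp"]),
            "parch": int(record["parch"]),
            "fare": float(record["fare"]),
            "embarked": record["embarked"],
            "survived": int(record["survived"]),
        })
    return rows

passengers = load_passengers(SAMPLE_CSV)

# Counting missing values is a typical first cleaning check.
missing_age = sum(1 for p in passengers if p["age"] is None)
print(len(passengers), "rows,", missing_age, "with no age recorded")
```

Keeping a missing age as `None` makes the gap visible in later steps instead of silently treating it as zero.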
Anne described a variety of summary statistics one can examine when evaluating data: the standard deviation, the minimum, the first quartile, the median, the third quartile and the maximum. He suggested using the median rather than the average when examining data.
“If you have a billionaire in your town, the average income might be $4 million—but that doesn’t make sense,” Anne said. “The median is usually a slightly better representation for the outliers. If you have one big outlier or one super small outlier, it messes up the data.”
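Anne's billionaire example can be reproduced in a few lines of Python. This is a rough sketch with invented incomes, not figures from the workshop: it assumes a town of 999 residents who each earn $50,000 and one resident who earns $4 billion.

```python
import statistics

# Invented incomes in dollars: 999 residents at 50,000 plus one billionaire.
incomes = [50_000] * 999 + [4_000_000_000]

mean_income = statistics.mean(incomes)      # pulled far upward by the outlier
median_income = statistics.median(incomes)  # the middle value, unaffected

# The summary statistics named in the article.
q1, _, q3 = statistics.quantiles(incomes, n=4)
summary = {
    "std_dev": statistics.stdev(incomes),
    "min": min(incomes),
    "first_quartile": q1,
    "median": median_income,
    "third_quartile": q3,
    "max": max(incomes),
}

print("mean:  ", mean_income)
print("median:", median_income)
```

Under these assumptions the mean comes out just over $4 million while the median stays at $50,000, which is the gap Anne described.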
After Anne showed how to cross-examine the data with a series of graphs and tables, attendees found that passengers between 20 and 40 years of age were more likely to survive than those in other age groups, and that women had approximately a 74 percent chance of survival.
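A table of survival rates by group, like the ones described above, could be built along the following lines. The rows are made up for illustration and will not reproduce the 74 percent figure; the field names and the age bands are likewise assumptions.

```python
from collections import defaultdict

# Made-up sample rows, NOT the real Titanic records.
passengers = [
    {"sex": "female", "age": 29, "survived": 1},
    {"sex": "female", "age": 41, "survived": 1},
    {"sex": "female", "age": 8, "survived": 0},
    {"sex": "male", "age": 35, "survived": 1},
    {"sex": "male", "age": 52, "survived": 0},
    {"sex": "male", "age": 23, "survived": 0},
]

def survival_rate_by(rows, key):
    """Return {group: fraction survived}, where key(row) names the group."""
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for row in rows:
        group = key(row)
        totals[group] += 1
        survivors[group] += row["survived"]
    return {group: survivors[group] / totals[group] for group in totals}

def age_band(row):
    """Assign a row to one of three assumed age bands."""
    if row["age"] < 20:
        return "under 20"
    if row["age"] <= 40:
        return "20-40"
    return "over 40"

by_sex = survival_rate_by(passengers, lambda row: row["sex"])
by_age = survival_rate_by(passengers, age_band)

print(by_sex)
print(by_age)
```

Passing the grouping rule in as a function means the same helper works for gender, age band or any other variable in the dataset.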
Anne also explained that families might have been less likely to survive, as they focused on staying together and on finding each other before the ship sank, rather than on finding a lifeboat.
Students learned about decision trees during the programming workshop. This algorithmic method helps individuals evaluate data by using a graph that calculates possible outcomes, costs and utility in a tree-like model.
Anne stressed at the workshop that when using such a model, it is essential to check whether the data has been entered correctly. He explained that there are limitations in relying solely on this method for analysis, as it is important to think critically about the data and about other potential factors while investigating any data set.
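The article does not say which decision tree tools the workshop used. As a loose illustration only, the sketch below fits the simplest possible tree, a single split sometimes called a decision stump, in plain Python. The rows are made up, and the function name `fit_stump` and the field names are hypothetical.

```python
from collections import Counter, defaultdict

# Made-up rows for illustration only, NOT the real Titanic records.
rows = [
    {"sex": "female", "pclass": 1, "survived": 1},
    {"sex": "female", "pclass": 2, "survived": 1},
    {"sex": "female", "pclass": 3, "survived": 1},
    {"sex": "female", "pclass": 3, "survived": 0},
    {"sex": "male", "pclass": 1, "survived": 0},
    {"sex": "male", "pclass": 2, "survived": 0},
    {"sex": "male", "pclass": 3, "survived": 0},
    {"sex": "male", "pclass": 3, "survived": 0},
]

def fit_stump(rows, features, target="survived"):
    """Pick the one feature whose majority-vote split best fits the rows."""
    best = None
    for feature in features:
        outcomes = defaultdict(Counter)
        for row in rows:
            outcomes[row[feature]][row[target]] += 1
        # Each branch predicts the most common outcome among its rows.
        rule = {value: counts.most_common(1)[0][0]
                for value, counts in outcomes.items()}
        correct = sum(rule[row[feature]] == row[target] for row in rows)
        if best is None or correct > best[2]:
            best = (feature, rule, correct)
    feature, rule, correct = best
    return feature, rule, correct / len(rows)

feature, rule, accuracy = fit_stump(rows, ["sex", "pclass"])
print("split on:", feature, "rule:", rule, "training accuracy:", accuracy)
```

The accuracy here is measured on the same rows used to build the rule, so it says nothing about new passengers, which is one reason to treat such a model with the caution Anne advised.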
The club’s vice president and cofounder, mathematics major senior Aidan Murphy, encourages students who are interested in programming and data analysis not only to attend the club’s workshops and meetings, but also to take the time to use these skills independently.
“Just like learning any other language, unless you are naturally adept at it, just sitting in the class for Spanish or something isn’t going to be enough to learn the language,” he said. “You have to actively use it, you have to dynamically use it for the best results, and that’s what we hope to have our students do.”