PhD Proposal: Supporting Independent Learning and Rapid Experimentation with Data Science Recommendation Engine
Deepthi Raghunandan
Remote
Friday, December 10, 2021, 12:00-2:00 pm
Abstract
Data science is the practice of discovering knowledge from data and facilitating decision-making with that knowledge. Knowledge derived from this practice must be provable and reproducible by a community of experts and non-experts. The practice of data science involves three main steps: data wrangling, sensemaking, and data interpretation. Data wrangling refers to collecting, consolidating, and cleaning data. Sensemaking is "the process of searching for a representation and encoding data in that representation to answer task-specific questions" (Russell 1993). Sensemaking is performed iteratively in a sensemaking loop (Pirolli 2005). Each sensemaking iteration works to refine and build on the previous insights, ultimately enabling the analyst to address less specialized audiences. The final step involves interpreting results by providing context, validating, and modeling the knowledge. Data interpretation is often collaborative, involving other data scientists or stakeholders. When knowledge is actionable, interpretations can facilitate decision-making by the team. In combination, these steps make up a data science workflow.

To successfully practice data science, scientists must have access to tools that help them iterate and communicate. For data science programmers, computational notebooks are the most popular platforms for developing the data science workflow. Notebooks enable iteration and communication because they are, most notably, interactive and literate development environments. Interactive development environments enable users to "manage" the state of their program by dictating the lines of code they wish to execute and the order in which to execute them. Each execution provides users with feedback on the program's state, which they use to evaluate their next steps. This iterative interactivity parallels how data scientists "make sense" of their data within the sensemaking loop, and it enables them to track and evaluate their iterations within that loop. Notebooks are literate because they encapsulate code, execution results, visualizations, and insights in one document. Literate environments enable authors to use all the components of their data science workflow to form a computational narrative: a storytelling device to communicate and reproduce their results.

The popularity of computational notebooks and, in turn, the need to teach real-world practices have driven the creation of data science tutorials built on computational notebooks. Tutorials built using notebooks enable the audience to discover and explore new material. Their multi-functional interfaces can be beneficial, particularly for data science, where learners must marry data science concepts with programming techniques for insight derivation. However, while templates and tutorials remain static, best practices, libraries, and versions evolve. Keeping up with these trends is becoming increasingly complex, especially for fledgling data scientists. A data science recommendation system that uses current and real-world examples embedded directly into the computational notebook interface can overcome these limitations. To this end, we present Lodestar: an interactive computational notebook sandbox that allows users to quickly explore and construct new data science workflows by selecting from a list of analysis recommendations.

Lodestar derives recommendations from directed graphs (workflows) of known analysis steps, with two input sources: one manually curated from online data science tutorials and another extracted through semi-automatic analysis of a corpus of Jupyter notebooks. Using a Jupyter Notebook corpus, we develop, leverage, and validate methods to identify how data scientists construct data science workflows within a computational notebook in real life. We use these and related findings to develop a novel design for a mixed-initiative recommendation system on the computational notebook sandbox interface. To do this, we identify and label analysis steps, develop and test a recommendation engine, iteratively develop and evaluate an optimal user interface, and qualitatively evaluate the system to ensure that it meets the needs of fledgling data scientists.
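
As an illustration only, the idea of recommending next analysis steps from a directed graph of known steps can be sketched in a few lines of Python. This is not Lodestar's implementation: the step labels, the example workflows, and the function names `build_step_graph` and `recommend_next` are all hypothetical, and ranking successors by how often they follow the current step is an assumed heuristic chosen for simplicity.

```python
from collections import Counter, defaultdict


def build_step_graph(workflows):
    """Build a directed graph whose edge weights count how often
    one analysis step is directly followed by another."""
    graph = defaultdict(Counter)
    for steps in workflows:
        for current, following in zip(steps, steps[1:]):
            graph[current][following] += 1
    return graph


def recommend_next(graph, current_step, k=3):
    """Return up to k successors of current_step, most frequent first."""
    successors = graph.get(current_step, Counter())
    return [step for step, _ in successors.most_common(k)]


# Hypothetical labeled workflows, standing in for steps extracted
# from tutorials or from a notebook corpus.
example_workflows = [
    ["load_data", "clean_data", "plot_histogram", "fit_model"],
    ["load_data", "clean_data", "fit_model"],
    ["load_data", "describe_data", "clean_data", "plot_histogram"],
]

graph = build_step_graph(example_workflows)
print(recommend_next(graph, "clean_data"))  # ['plot_histogram', 'fit_model']
```

In this toy version, a user who has just run a cleaning step would be offered a histogram first and model fitting second, because that ordering appears most often in the example workflows.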

Examining Committee:
Chair:
Department Representative:
Members:
Dr. Niklas Elmqvist
Dr. David Jacobs
Dr. Leilani Battle
Bio

Deepthi Raghunandan is a PhD student advised by Dr. Niklas Elmqvist and Dr. Leilani Battle, working to build better tools for data science education. Before entering the PhD program, Deepthi spent four years as a software developer at Microsoft, an experience that drove home the importance of prioritizing user experience during development. Subsequently, Deepthi spent two years working on personal start-up projects, which planted the motivational seeds for her current research. She is a proud Terp alum who graduated from the University of Maryland with undergraduate degrees in Computer Science and Economics.

This talk is organized by Tom Hurst