Given an unlabeled, high-dimensional time-series dataset requiring deep domain expertise to understand (which we did not have), derive some business value. That was the challenge laid out by Solar Turbines, a Caterpillar-owned company that builds industrial gas turbines and offers an equipment health monitoring platform to its customers.
Using millions of turbine sensor readings and a combination of PCA, creative time partitioning, and clustering, our group generated machine load profiles and machine similarity measures, classified several types of performance outliers (including a valuable subset known as transient states), and built a user interface that lets domain experts use these results to efficiently create a labeled dataset for future predictive modeling.
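A minimal sketch of the dimensionality-reduction and clustering step, assuming the sensor readings have already been partitioned into fixed-length windows (one row per window). The function name, column layout, and parameters are illustrative, not the actual pipeline.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def load_profiles(windows: np.ndarray, n_components: int = 5, n_clusters: int = 8):
    """Reduce windowed sensor readings and group them into load-profile clusters."""
    scaled = StandardScaler().fit_transform(windows)          # put sensors on a common scale
    reduced = PCA(n_components=n_components).fit_transform(scaled)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(reduced)
    return km.labels_, km.cluster_centers_

# Windows that sit far from every cluster center can then be surfaced as
# candidate outliers (e.g. transient states) for domain experts to review.
```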
Dora, the data explorer, is a Python API over three different data sources (Postgres, Solr, AsterixDB) intended for EDA in a mock product recommendation pipeline. Storage details are hidden from API consumers, and recommendation endpoints allow for a feedback loop with a machine learning model.
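A rough sketch of that facade idea: notebooks and scripts talk to Dora, never to a specific store. The class and method names here are illustrative, not Dora's actual API.

```python
from abc import ABC, abstractmethod

class Backend(ABC):
    """One adapter per data store (Postgres, Solr, AsterixDB)."""
    @abstractmethod
    def query(self, q: str) -> list[dict]:
        ...

class SolrBackend(Backend):
    def query(self, q: str) -> list[dict]:
        # full-text search against Solr; details omitted in this sketch
        return []

class PostgresBackend(Backend):
    def query(self, q: str) -> list[dict]:
        # SQL against Postgres; details omitted in this sketch
        return []

class Dora:
    """Single entry point that routes each call to whichever store holds the data."""
    def __init__(self, backends: dict[str, Backend]):
        self._backends = backends

    def search_products(self, text: str) -> list[dict]:
        return self._backends["solr"].query(text)

    def recommendations_for(self, user_id: int) -> list[dict]:
        # results shown to the user can be logged back as feedback for the model
        return self._backends["postgres"].query(
            f"SELECT product_id, score FROM recs WHERE user_id = {int(user_id)}"
        )
```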
Final project for Amit Chourasia's Data Visualization class. With 1/3 of our team stuck in India on visa technicalities, we went into the project hoping to find some secret pattern in a dataset of H1B jobs to make things easier for anyone navigating the work visa maze. We came out with a functional and visually effective view of available positions by job type, employer and location. No secret patterns found...
Filter the H1B job market by state, county, company, and job type. Uses slope charts (two-axis parallel coordinates) to compare the number of available jobs with the average salary across companies and job types. Compared to previous projects for this class, this viz has an improved observer pattern for keeping the different components in sync.
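The gist of that observer pattern, sketched in Python for brevity: the filter state lives in one place, and each chart subscribes to changes. Names are illustrative, not the actual component names.

```python
from typing import Callable

class FilterState:
    """Single source of truth for the active filters; charts subscribe to changes."""
    def __init__(self):
        self._filters: dict[str, str] = {}
        self._observers: list[Callable[[dict], None]] = []

    def subscribe(self, on_change: Callable[[dict], None]) -> None:
        self._observers.append(on_change)

    def set_filter(self, key: str, value: str) -> None:
        self._filters[key] = value
        for notify in self._observers:   # every chart redraws from the same state
            notify(dict(self._filters))

state = FilterState()
state.subscribe(lambda f: print("slope chart redraw with", f))
state.subscribe(lambda f: print("map redraw with", f))
state.set_filter("state", "CA")
```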
This was a simple project meant to explore time-series and geographic data visualization, and I really liked our solution. The seasonal markings show that peak West Nile season is arriving later each year, and the interactivity between timeline selection and geographic shading highlights how cases of the disease spread outward from counties with a high concentration of lakes, rivers, or irrigated land.
I have been hooked on oceanographic data since I started surfing in 1999. As an undergrad, I took Physical Oceanography, and my senior thesis in the Computer Science department was "Predicting Significant Ocean Wave Heights Using Genetic Algorithms", a small survey of early neural network approaches to ocean state modeling with an attempt at applying genetic algorithms to the same problem. I've since left the forecasting to those with knowledge of fluid dynamics, but I've continued to write code around the abundance of data provided by NOAA. Current work includes serverless wrappers around the different sources of wind, wave, tide, and bathymetry data for a more consistent interface, as well as some visualizations built on top of those endpoints. I'm interested in generating labeled datasets for hyper-local surf condition predictions.
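A rough sketch of what one of those serverless wrappers does: pull raw readings from an upstream NOAA source and reshape them into a consistent JSON format. The URL, query parameters, and field names below are placeholders, not the real NOAA endpoints.

```python
import json
import requests

UPSTREAM_URL = "https://example.noaa.gov/api/observations"  # placeholder upstream source

def handler(event, context):
    """AWS Lambda-style handler: fetch raw readings and normalize their shape."""
    params = event.get("queryStringParameters") or {}
    station = params.get("station", "46225")

    raw = requests.get(UPSTREAM_URL, params={"station": station}, timeout=10).json()

    # Every wrapper returns the same shape: timestamp, measurement, units.
    readings = [
        {"time": r["t"], "wave_height_m": float(r["wvht"]), "units": "m"}
        for r in raw.get("data", [])
    ]
    return {
        "statusCode": 200,
        "body": json.dumps({"station": station, "readings": readings}),
    }
```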
As a side project over the summer of 2017, I started exploring genetic data available through the NCBI. With a few pointers from friends at Salk and Scripps, I was on my way to some underinformed science. I didn't get very far before classes started back up, but it's a project and domain I intend to continue with.