Robust Principal Component Analysis via ADMM in Python


Principal Component Analysis (PCA) is an effective tool for dimensionality reduction: it transforms high-dimensional data into a representation with fewer dimensions (although these new dimensions are not drawn from the original set). The new dimensions capture the variation within the high-dimensional dataset. How do you find this space? PCA is equivalent to finding the decomposition M = L + E, where M is the data matrix, L is a matrix with a small number of linearly independent vectors (our dimensions), and E is a matrix of errors (corruption in the data). L is chosen so that the error matrix E is as small as possible. One assumption built into this optimization problem, though, is that the corruption E is small Gaussian noise [1].
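The decomposition above can be sketched with a truncated SVD, which gives the best rank-k approximation of a matrix in the least-squares sense. This is only an illustration of the M = L + E split for standard PCA, not the robust method from the full post: the helper name `low_rank_split`, the synthetic data, and the rank of 2 are all assumptions made for the example, and mean-centering is omitted for brevity.

```python
import numpy as np

def low_rank_split(M, rank):
    """Split M into a rank-`rank` part L and the residual E = M - L (truncated SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return L, M - L

rng = np.random.default_rng(0)

# Hypothetical data: an exactly rank-2 matrix plus small Gaussian noise.
truth = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 20))
M = truth + 0.01 * rng.normal(size=truth.shape)

L, E = low_rank_split(M, rank=2)
err_clean = np.linalg.norm(L - truth) / np.linalg.norm(truth)

# Corrupt a single entry with one large error and repeat the split.
M_bad = M.copy()
M_bad[0, 0] += 100.0
L_bad, E_bad = low_rank_split(M_bad, rank=2)
err_bad = np.linalg.norm(L_bad - truth) / np.linalg.norm(truth)

print("relative error with Gaussian noise only:", err_clean)
print("relative error with one large outlier:  ", err_bad)
```

With only small Gaussian noise the recovered L stays close to the underlying low-rank matrix, while a single large corrupted entry pulls the recovered subspace away from it.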

Sparse but large errors affect the recovered low-dimensional space.
From: Introduction to RPCA


Cognitive Science to Data Science


I joined the UCSD Cognitive Science PhD program with the aim of investigating multi-agent systems. A few years in, I joined a project studying the interactions of bottlenose dolphins. The research group had a massive collection of audio and video recordings that was too large to handle without computational techniques, and I joined to provide the computational support they needed. During this process, I discovered that working with big data is motivating in its own right and that I wanted to pursue the data scientist path in lieu of academia.

Detection Error Tradeoff (DET) curves


In July 2015, I attended DCLDE 2015 (Detection, Classification, Localization, and Density Estimation), a week-long workshop focusing on methods to improve the state of the art in bioacoustics for marine mammal research.

While I was there, I had a conversation with Tyler Helble about efforts to detect and classify blue whale and fin whale calls recorded off the coast of Southern California. While most researchers use Receiver Operating Characteristic (ROC) curves or Precision Recall (PR) curves to display classifier performance, one metric we discussed was the Detection Error Tradeoff (DET) curve [1]. It can be a good choice for a binary classification problem with two species when you care about how often the classifier misclassifies each of them. DET curves have been used several times in speech processing studies and have also been used to look at classification results in marine bioacoustics [2].
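As a rough illustration of what a DET curve contains, the sketch below computes the two error rates (false alarms and misses) at every score threshold. It assumes higher scores mean "more likely the target species"; the function names `det_points` and `probit` and the toy scores are made up for the example. DET curves are conventionally drawn with both axes on a normal-deviate (probit) scale, which is what the second helper provides.

```python
from statistics import NormalDist

def det_points(scores, labels):
    """Return (false_alarm_rate, miss_rate) pairs, one per distinct threshold.

    labels: 1 for the target class, 0 for the other class.
    A sample is predicted as target when its score is >= the threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in sorted(set(scores)):
        false_alarms = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        misses = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        points.append((false_alarms / n_neg, misses / n_pos))
    return points

def probit(p, eps=1e-6):
    """Map a rate onto the normal-deviate axis, clipping away exact 0 and 1."""
    return NormalDist().inv_cdf(min(max(p, eps), 1 - eps))

# Toy example: two non-target and two target calls.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
points = det_points(scores, labels)
for far, miss in points:
    print(f"false alarm rate={far:.2f}  miss rate={miss:.2f}  "
          f"probit=({probit(far):.2f}, {probit(miss):.2f})")
```

Plotting the probit-transformed pairs against each other gives the DET curve; unlike a ROC curve, both axes show an error rate, so the mistakes made on each class are visible directly.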


Batch text file conversion with Python: {.pdf, .doc, .docx, .odt} → .txt


Attending conferences and presenting research are frequent events in the life of an academic. The organizing committees that plan these events have a lot on their plates. Even small conferences, such as the one I organized in 2015 (iSLC 2015), can be very demanding.

One thing that conference organizers have to consider is how to implement an article and/or abstract submission process that allows attendees, reviewers, and organizers to easily access and distribute these documents. Some submission services are free, while others are paid. Some provide a better, more adaptable pipeline for this process than others.

An important feature of these abstract submission sites is allowing the tagging of abstracts so that organizers can appropriately distribute the content to the best reviewers.


Tanzania O-level Performance


Visualizing the state of education in Tanzania

COGS 220 Final Project

This map shows the performance of Form IV students at schools across Tanzania. Schools are ranked from best (large, green circles) to worst (small, red circles). Data was scraped from the NECTA (National Examinations Council of Tanzania) website, and the Google Maps API was queried with school names to retrieve GPS coordinates. Circles that fall outside the country are the result of noisy Google API responses.