A groundbreaking and relatively new discovery upends classical statistics with relevant implications for data science practitioners and statistical consultants
Data science is a fascinating field. C-level executives are enamored of its promised impact on top-line revenue, and practitioners are intrigued by the rapid pace of innovation. There’s already so much to know, and every year seems to bring a few more things to learn.
This article draws attention to a relatively novel idea that is probably controversial to most data scientists and maybe a handful of statisticians: the bias-variance tradeoff generalization does not generalize and only applies to very…
Isolation forest or “iForest” is an astoundingly beautiful and elegantly simple algorithm that identifies anomalies with few parameters. The original paper is accessible to a broad audience and contains minimal math. In this article, I will explain why iForest is the best anomaly detection algorithm for big data right now, summarize the algorithm, recount its history and share a code implementation.
Best-in-class performance that generalizes. iForest performs better than most other outlier detection (OD) algorithms across a variety of datasets, based on ROC performance and Precision. I took benchmark data from the authors of the Python…
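To make the idea concrete, here is a minimal pure-Python sketch of the isolation forest scoring scheme. It is not the implementation from the article or from any library. The function names (`average_path_length`, `build_tree`, `path_length`, `anomaly_scores`) and the defaults of 100 trees and a subsample of 256 points are assumptions chosen for illustration. Each tree splits on a random feature at a random value, and points that get isolated after only a few splits receive scores closer to 1.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """c(n): average path length of an unsuccessful binary search tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split the points on a random feature at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        # Simplification: stop here instead of trying another feature.
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {
        "dim": dim,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(x, node, depth=0):
    """Number of splits needed to reach x's leaf, plus an adjustment for unsplit leaves."""
    if "split" not in node:
        return depth + average_path_length(node["size"])
    child = node["left"] if x[node["dim"]] < node["split"] else node["right"]
    return path_length(x, child, depth + 1)


def anomaly_scores(data, n_trees=100, sample_size=256, seed=0):
    """Return one score in (0, 1] per point; higher means easier to isolate."""
    if len(data) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [
        build_tree(rng.sample(data, sample_size), 0, max_depth, rng)
        for _ in range(n_trees)
    ]
    norm = average_path_length(sample_size)
    return [
        2.0 ** (-sum(path_length(x, t) for t in trees) / (n_trees * norm))
        for x in data
    ]
```

Calling `anomaly_scores(data)` on a list of equal-length numeric tuples returns one score per point. Scores close to 1 flag likely anomalies, while scores well below 0.5 suggest ordinary points. For real workloads, an established library implementation is the safer choice.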
Hint: It’s not programming skills or familiarity with algorithms
The most important aspect of data science is communication. Algorithms, coding languages and software are important to know, but these details are easily and quickly looked up when they become shrouded in the dust of time. Given the strong academic backgrounds of most data scientists, it’s not hard to learn to program in a new language in a very short time, and learning to read a new language well enough for most data science purposes is quicker still (with the exception of lower-level languages like C; that’s hard).
I have yet to see a straightforward and comprehensive guide on how to get JAR files onto every worker node of an EMR cluster, even though this is a critically important and common need. This article addresses that need. What follows is the culmination of my notes from personal struggle, innumerable disparate Google results, Stack Overflow posts and official AWS documentation.
Scala libraries hosted on GitHub often have installation instructions that rely on building JAR files from source code using a program called Maven. But as a data scientist, I just want the JAR file. …
Over the years, I have seen many PhD-holding data scientists spend weeks or months building highly effective machine learning pipelines that (theoretically) will deliver real-world value. Unfortunately, the fruits of that labor can die on the vine if the data scientists fail to communicate the value of their work effectively, a misfortune I have witnessed far too often. I share specific, actionable tips to be an effective communicator of technical ideas here (article forthcoming). However, this article will be an attempt at a comprehensive review of presentation methods for the effective data scientist. …
A quick guide with code (i.e. my rough notes for replication purposes)
There are a lot of interesting applications for packet capture data. I will refrain from stating them for corporate privacy reasons.
This part is straightforward and useful for starting off. In part 2, I show you how to customize your data collection using the pcap file generated in this part (part 1).
This article assumes you are already familiar with what XGBoost/CatBoost/etc. do and that you are here to actually get them to work.
Installing packages on a local machine or single node is easy. Doing the same in a cluster environment in order to work with big data is less so, and that is the motivation for this article. I will share code commands and screenshots to help you follow along.
This article is split into two parts and will teach you how to set up packages such that they are available across all nodes in a cluster environment. In this example, I demonstrate with an…
Because AWS documentation is out of date, wrong, verbose yet not specific enough, or requires you to read 5–10 different link trees of documentation pages.
2. Create an EMR instance (guide here) and download a new .pem. A key pair consists of a public key that AWS stores and a private key file that you store, i.e. a PEM file (aside: PEM stands for Privacy Enhanced Mail). Together, the two keys enable you to securely connect to your EC2 instance using SSH.
Produce website-worthy visualizations
Flourish is a simple browser-based point-and-click, drag-and-drop data visualization creator suitable for well-structured, tabular data in the form of .csv or Excel files. Introduced in March 2016¹, it is a relatively new tool compared to entrenched competitors like Tableau (founded in 2003²) and aims at an audience looking for fewer options and features than Tableau. Its main selling point is ease of use with a focus on the primary objective: effective storytelling with data. An analogy to encapsulate Flourish: Flourish is to Tableau what Beautiful.ai is to PowerPoint; a pared down, GUI-based tool focused on producing beautiful data…
I have had many managers and held multiple leadership roles in my life. This is a living list of notes based on those experiences. While the following concepts are widely applicable, I write with the data science team in mind.