Foundational Knowledge for the Advanced Data Scientist

Introduction

Data science is a fascinating field. C-level executives are enamored of its promised impact on top-line revenue, and practitioners are intrigued by the rapid pace of innovation. There is already so much to know, and every year seems to bring a few more things to learn.

This article draws attention to a relatively novel idea that is probably controversial to most data scientists and maybe a handful of statisticians:


Making Sense of Big Data

Isolation forest, or “iForest”, is a beautifully simple algorithm that identifies anomalies with few parameters. The original paper is accessible to a broad audience and contains minimal math. In this article, I will explain why iForest is the best anomaly detection algorithm for big data right now, summarize the algorithm and its history, and share a code implementation.
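To make the idea concrete, here is a minimal pure-Python sketch of the isolation principle: build random trees on a subsample, then score each point by how few random splits it takes to isolate it. It is an illustration under stated assumptions, not the implementation shared in the full article. The function names (c_factor, build_tree, path_length, anomaly_scores) are hypothetical, and the defaults of 100 trees and a subsample of 256 points are choices made for this sketch.

```python
import math
import random


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over
    n points; used to normalise path lengths into a score."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split the points on a random feature at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # nothing left to separate on this feature
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {
        "dim": dim,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(point, node, depth=0):
    """Number of splits needed to reach a leaf, plus an adjustment for
    leaves that still hold several points."""
    if "split" not in node:
        return depth + c_factor(node["size"])
    side = "left" if point[node["dim"]] < node["split"] else "right"
    return path_length(point, node[side], depth + 1)


def anomaly_scores(data, n_trees=100, sample_size=256, seed=0):
    """Return one score in (0, 1] per point; assumes data is a list of
    equal-length numeric tuples with at least two points."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [
        build_tree(rng.sample(data, sample_size), 0, max_depth, rng)
        for _ in range(n_trees)
    ]
    norm = c_factor(sample_size)
    scores = []
    for point in data:
        mean_path = sum(path_length(point, t) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_path / norm))
    return scores
```

Calling anomaly_scores on a list of numeric tuples returns one score per point. Points that are isolated after only a few splits receive scores closer to 1 and are the likeliest anomalies, while points buried deep in the trees score lower.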

Why iForest is the best anomaly detection algorithm for big data right now

Best-in-class performance that generalizes. iForest performs better than most other outlier detection (OD) algorithms across a variety of datasets, as measured by ROC performance and precision. I took benchmark data from the authors of the Python…


Impactful Data Science

Hint: It’s not programming skills or familiarity with algorithms

The most important aspect of data science is communication. Algorithms, coding languages and software are important to know, but these things are quickly and easily looked up when the details become shrouded in the dust of time. Given the strong academic backgrounds of most data scientists, it is not hard to learn to program in a new language in a very short time, and it is even quicker to learn to read a new language well enough for most data science purposes.


Essential Data Science Skills

Demonstrating where to download JARs and how to install them on AWS EMR clusters for access from EMR Notebooks

I have yet to see a straightforward and comprehensive guide on how to get JAR files onto every worker node of an EMR cluster, even though this is a common and critically important need. This article addresses that need. The following is the culmination of my notes from personal struggle, innumerable disparate Google results, Stack Overflow posts and official AWS documentation.

Background

Scala libraries hosted on GitHub often have installation instructions that rely on building JAR files from source code using a program called Maven. But as a data scientist, I just want the JAR file. …


Impactful Data Science

Communicating a coherent, data-driven story is the most important skill for today’s data scientist, yet it is the least developed. Better tools can help: learn about a new one today.

Over the years, I have seen many PhD-holding data scientists spend weeks or months building highly effective machine learning pipelines that (theoretically) will deliver real-world value. Unfortunately, these fruits of labor can die on the vine if their creators fail to communicate the value of the work effectively, a misfortune I have witnessed far too often. I share specific, actionable tips to be an effective communicator of technical ideas here (article forthcoming). This article, however, attempts a comprehensive review of presentation methods for the effective data scientist. …


A quick guide with code (i.e. my rough notes for replication purposes)

Motivation

There are a lot of interesting applications for packet capture data. I will refrain from stating them for corporate privacy reasons.

Instructions (Part 1: Wireshark GUI)

This part is straightforward and a useful starting point. In Part 2, I show you how to customize your data collection using the pcap file generated in Part 1.

  1. Open Wireshark.
  2. Select the “local interface” in the “Capture” section at the bottom of “The Wireshark Network Analyzer” (wireshark.exe). This can be something like “Wi-Fi [#]” if using a wireless connection, or “Ethernet [#]” if connected…

Advanced Data Science Skills

Setting up your Amazon Web Services (AWS) Elastic MapReduce (EMR) Cluster with XGBoost

Introduction

Installing packages on a local machine or single node is easy. Doing the same in a cluster environment in order to work with big data is less so, and that is the motivation for this article. I will share code commands and screenshots to help you follow along.

This article is split into two parts and will teach you how to set up packages so that they are available across all nodes in a cluster environment. In this example, I demonstrate with an…


  1. Download the latest stable release of PuTTY. The installation should also install the needed utilities, such as puttygen and pageant.

  2. Create an EMR instance (guide here) and download a new .pem file. A key pair consists of a public key that AWS stores and a private key file that you store, i.e. a PEM file. Together, the two keys enable you to securely connect to your EC2 instance using SSH.


Introduction

Flourish is a simple, browser-based, point-and-click, drag-and-drop data visualization creator suitable for well-structured, tabular data in the form of .csv or Excel files. Introduced in March 2016¹, it is a relatively new tool compared to entrenched competitors like Tableau (founded in 2003²) and aims at an audience looking for fewer options and features than Tableau. Its main selling point is ease of use, with a focus on the primary objective: effective storytelling with data. An analogy to encapsulate Flourish: Flourish is to Tableau what Beautiful.ai is to PowerPoint; a pared-down, GUI-based tool focused on producing beautiful data…


I have had many managers and held multiple leadership roles in my life. This is a living list of notes based on those experiences. While the following concepts are widely applicable, I write with the data science team in mind.

  • Soft skills are as important as technical ability. The hierarchies of technical teams, in particular, give greater weight to technical proficiency than to soft skills when determining career advancement. But one problem with this approach is that technical ability (i.e. …

Andrew Young
