Simple Hadoop (HDFS) Commands for Data Science Cheat Sheet

Hadoop Distributed File System (HDFS)

I work for a large information services company that refines petabytes of raw, crude data into insights and products more valuable than oil [1][2][3].

As a consequence of my company’s big data ingestion, data scientists on the R&D team often have to work with raw .gz and .parquet files in HDFS without explicit guidance on their content (i.e., the columns, fields, or variables). Here are some commands that I find useful in my work. Hopefully they help at least one of…
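
As a rough illustration of the kind of inspection described above, here is a minimal sketch for peeking at the first few lines of a gzip-compressed file to see which fields it contains. The helper name `peek_gz` and the HDFS paths in the comments are hypothetical placeholders, not taken from any real cluster; `hdfs dfs -ls` and `hdfs dfs -cat` are standard HDFS shell commands.

```shell
# Hypothetical helper: read a gzip stream on stdin and print its first N lines
# (default 5). Useful for spotting a header row or guessing the field layout.
peek_gz() {
  gunzip -c | head -n "${1:-5}"
}

# Example usage against HDFS (placeholder paths, adjust to your cluster):
#   hdfs dfs -ls /path/to/dataset
#   hdfs dfs -cat /path/to/dataset/part-00000.gz | peek_gz 3
```

Streaming through `head` avoids copying the whole file out of HDFS just to look at its first rows.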