Simple Hadoop (HDFS) Commands for Data Science Cheat Sheet
Hadoop Distributed File System (HDFS)
I work for a large information services company that refines petabytes of raw, crude data into insights and products more valuable than oil [1][2][3].
As a consequence of my company’s big data ingestion, data scientists on the R&D team often have to work with raw .gz and .parquet files in HDFS without explicit guidance on their content (i.e., the columns/fields/variables). Here are some commands that I find useful in my work. Hopefully they help at least one of…
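As a rough sketch of the kind of inspection described above, here is one way to peek at the first few lines of a gzipped file without downloading or fully decompressing it. The helper name `peek_gz` and the HDFS path in the comment are hypothetical, made up for illustration. The only Hadoop command assumed is `hdfs dfs -cat`, which streams a file's bytes to standard output.

```shell
# peek_gz: print the first N lines (default 5) of a gzip stream read from stdin.
# Hypothetical helper for illustration; it is not part of Hadoop.
peek_gz() {
  gzip -dc | head -n "${1:-5}"
}

# Intended use against HDFS (the path below is a made-up placeholder):
#   hdfs dfs -cat /data/raw/events.gz | peek_gz 3
```

Because the helper only reads standard input, it can be tried on any local .gz file first (for example `peek_gz 3 < sample.gz`) before pointing it at a large file in HDFS. The header row, if the file has one, is often enough to recover the column names.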