Stern Center for Research Computing

New York University • Leonard Stern School of Business

Hadoop – Big Data Processing

NYU Stern has a server, bigdata.stern.nyu.edu, that provides access to tools for processing very large datasets. The bigdata server is a Linux server with Hadoop, Hive, Datameer, Tableau, Mahout, SAS, R, Python, MATLAB, MySQL, and Stata installed. The server is accessible using your Stern credentials.

Hadoop is an open-source software framework, written in Java, that enables distributed storage and processing of large data sets across many nodes. What distinguishes Hadoop is that it brings the processing to where the data resides, unlike a typical relational database, where the location of the data and the location of the processing are typically independent of each other.

Stern Research Computing has a small Hadoop cluster with approximately 15 processing nodes, about 50 cores, and about 6 TB of disk.

On this page are a few short, interesting, and informative videos to help you get started with the basics of Hadoop and big data processing.

What is Hadoop?

How Hadoop Works

Map-Reduce

An important component of Hadoop is MapReduce, a programming model that lets you write programs that process large amounts of unstructured data in parallel across the Hadoop nodes.
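
To make the idea concrete, here is a minimal sketch of the MapReduce pattern as a word count, written in plain Python and run in a single process. It does not use Hadoop or the Stern cluster; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. On a real cluster, the framework would run many mappers and reducers in parallel on different nodes and perform the grouping step itself.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key.

    In a real MapReduce job the framework does this between the
    map and reduce steps; here it is simulated with a dictionary.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data big cluster", "data on many nodes"]
    for word, count in word_count(sample).items():
        print(word, count)
```

Because each call to map_phase looks at only one line, and each call to reduce_phase looks at only one key, both steps can in principle be spread across many nodes without the pieces needing to communicate with each other.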