Stern Center for Research Computing

New York University • Leonard Stern School of Business

Getting Started

Linux

If you haven’t used Hadoop and are not at least a little familiar with Unix/Linux, you should probably start with a quick course on Linux. A gentle tutorial can be found here; for a more comprehensive introduction, follow Learning the Shell. The sections on manipulating files and permissions are particularly useful when starting out with Hadoop.

Connecting to Stern Research Computers

Once you are familiar with Linux, you need to know how to connect to the Stern research computers.

On Windows, you need a program such as PuTTY. Instructions on getting and using PuTTY can be found here. Mac users can connect using the Mac Terminal program; instructions for using the Mac Terminal can be found here.

When following those instructions, connect to bigdata.stern.nyu.edu instead of vleda.stern.nyu.edu.
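
As a rough sketch, a connection from a Mac or Linux terminal might look like the following. The function name connect_bigdata is hypothetical, and the argument is whatever login name you were given for the Stern research computers; only the host name comes from this page.

```shell
# Hypothetical helper: open an SSH session on the Stern head node.
# $1 = your Stern login name (an assumption; use the account you were given).
connect_bigdata() {
  ssh "${1}@bigdata.stern.nyu.edu"
}
```

For example, typing connect_bigdata abc123 (where abc123 is a placeholder login name) runs ssh abc123@bigdata.stern.nyu.edu and prompts for your password.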

Moving Data Into the Hadoop File System

Next, you need to move data from your desktop or a remote machine to the Stern head node, bigdata.stern.nyu.edu, and then into the Hadoop file system (HDFS), where you can start processing it with Hadoop.

Here is a short video that walks you through moving remote data to bigdata.stern.nyu.edu and then into the HDFS. It also gives a quick overview of some of the Hadoop file system commands.
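
The two steps can be sketched roughly as follows. This is an illustration under assumptions, not a transcript of the video: the function names, file names, and directories are hypothetical placeholders, and on an older Hadoop installation the equivalent hadoop fs commands may be needed (and -mkdir may not accept -p).

```shell
# Step 1 (run on your desktop): copy a local file to the head node.
# $1 = local file, $2 = your Stern login name, $3 = directory on the head node.
copy_to_headnode() {
  scp "$1" "${2}@bigdata.stern.nyu.edu:${3}/"
}

# Step 2 (run on bigdata.stern.nyu.edu): put the file into the HDFS.
# $1 = file on the head node, $2 = target HDFS directory.
put_into_hdfs() {
  # Create the HDFS directory if it is missing, copy the file from
  # local disk into the HDFS, then list the directory to confirm.
  hdfs dfs -mkdir -p "$2" &&
  hdfs dfs -put "$1" "${2}/" &&
  hdfs dfs -ls "$2"
}
```

A typical sequence would be copy_to_headnode trades.csv abc123 /bigtemp on the desktop, followed by put_into_hdfs /bigtemp/trades.csv /user/abc123/demo after logging in to the head node. All of these names are placeholders.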

One easy way to start processing your data is to define it as a Hive table, which you can then query and manipulate with the Hive Query Language (HQL), which is mostly compatible with SQL. Here is a 20-minute video example that walks through connecting to bigdata, moving some data into a local folder (/bigtemp), moving the data into the HDFS, and creating a Hive table definition so the data can be manipulated with HQL.
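
To give a flavor of what such a table definition can look like, here is a minimal sketch run from the head node's shell. The function name, table name, columns, and comma delimiter are all hypothetical and will differ for your data; the video may use different statements.

```shell
# Hypothetical helper: define a Hive table, load one file into it, and peek at it.
# $1 = HDFS path of a comma-separated file that is already in the HDFS.
# Note: LOAD DATA INPATH moves the file into Hive's own storage area.
create_demo_table() {
  hive -e "
    CREATE TABLE IF NOT EXISTS demo_trades (
      trade_date STRING,
      cusip      STRING,
      price      DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    LOAD DATA INPATH '$1' INTO TABLE demo_trades;

    SELECT * FROM demo_trades LIMIT 10;
  "
}
```

For example, create_demo_table /user/abc123/demo/trades.csv (a placeholder path) would create the table, load the file, and print the first ten rows.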

For a more detailed look at using Hive, the following video uses a directory that contains all of the municipal bond trades from 1997 to 2012, with each year in a separate file. A Hive table definition is created on the directory, which allows queries to be run across all the years. You can see the video here.
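
One common way to define a table over a whole directory is an external table with a LOCATION clause, sketched below. Whether the video uses exactly this form is an assumption, and the function name, table name, columns, and delimiter are hypothetical.

```shell
# Hypothetical helper: define a Hive table over a directory of yearly files.
# $1 = HDFS directory that holds one delimited file per year.
# An EXTERNAL table leaves the files where they are; Hive reads every
# file in the directory, so a single query spans all years.
create_muni_table() {
  hive -e "
    CREATE EXTERNAL TABLE IF NOT EXISTS muni_trades (
      trade_date STRING,
      cusip      STRING,
      price      DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '$1';

    SELECT COUNT(*) FROM muni_trades;
  "
}
```

Calling create_muni_table /user/abc123/muni_trades (a placeholder directory) would define the table and count the rows across every file in that directory.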

This should get you started on the Stern Hadoop cluster. If you have problems, send an email to scrc-list@stern.nyu.edu and we will try to help you.