Data Science Immersive "Installfest"
DSI Computer Setup
Welcome to GA's Data Science Immersive! Before you start class, you'll need to download and install a few tools. Follow this guide to get your computer all set up, and let us know if you have any questions.
Part 1. Operating System
While you can be a data scientist on any operating system, most practicing data scientists choose a Unix-type operating system, typically either Apple's OS X or a popular linux distribution such as Ubuntu or Linux Mint.
- If you are already using Mac or Linux, great! Skip ahead to Part 2 and get started with your installs.
- If you are using a Windows machine, follow the instructions below.
For Windows Users
- Download and install VirtualBox.
- Note: Make sure to also install the "Extension Pack" for your version!
- Download our Ubuntu starter image labeled "For PCs only"
- Note: You'll need at least 2-3 gbs of free disk space.
- Open VirtualBox and import the image
- Double-click the Virtual Machine on the left; the password is "ilovedatascience"
Some parts of week 10 will be run on a Virtualbox Virtual Machine. This is going to be deployed on all computers, both Windows and Mac or Linux, so make sure you follos the instructions
Instructor Note: In order to complete the lessons and labs provided, students will need to install the custom virtual machine that allows them to spin up local versions of these tools. Instructions are as follows:
- Ensure you have a personal Dropbox account. If you don't, please to sign up for a free version. Due to the large size, they will need an account to download the VM files.
- Have students download files from the following link. There are two files to be downloaded:
- Note: This file is 2.5gb so download may take some time. Additionally, you will need to ensure you have enough disk space.
- Download and install Virtualbox
- Download and install Vagrant
Once you have all these installed, you can run the VM by doing:
cd dsi-bigdata-vm vagrant up vagrant ssh
Part 2. Anaconda and Python
In our class, we'll be working closely with tools that utilize the Python programming language. Anaconda is a popular cross-platform tool that helps install and manage python-related data science libraries.
Download Anaconda and follow the installation instructions package for your operating system.
Agree to the terms and let Anaconda go through its default installation. On OS X, there is a graphical installer.
Just type in the package name and Anaconda will install the package and any dependencies. Anaconda should install several packages by default, including:
- python: a programming language very popular with data scientists
- jupyter: an interface for creating interactive python notebooks, great for sharing analyses
- matplotlib: a plotting library for python
- nltk: a toolkit for natural language processing
- numpy: a linear algebra library
- pip & setuptools: software to manage and install python packages
- scikit-learn: a toolkit for machine learning algorithms
- scipy and statsmodels: statistical packages for python
- sqlite: a popular, easy to use database
If one or more of these is missing, just select the packages in the installer, or run the following at the command line (remember, you don't need to type the
$ conda install jupyter python matplotlib nltk numpy pip setuptools scikit-learn scipy sqlite statsmodels
Once Anaconda is installed, you can add additional python packages from the command line as follows:
$ conda install gensim seaborn spacy
Anaconda may also update your packages at this time (which is perfectly ok!).
Just For Linux Users
On Ubuntu, if the
conda install command fails for some reason, restart your terminal or source your
.bashrc like so
$ source ~/.bashrc
Part 3. Confirm Your Python Installation
When you've gotten this far, open up a terminal and enter the Python interpreter:
Depending on your operating system, your terminal should return something like this:
[email protected]:~/Downloads$ python Python 2.7.11 |Anaconda 2.5.0 (32-bit)| (default, Dec 6 2015, 18:08:45) [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2 Type "help", "copyright", "credits" or "license" for more information. Anaconda is brought to you by Continuum Analytics. Please check out: http://continuum.io/thanks and https://anaconda.org >>>
Next, make sure that the necessary packages are installed. For example, to check that
matplotlibis installed, type in your terminal:
>>>> import matplotlib >>>> print matplotlib.__version__ 1.5.1
You may see another version (which is OK). If you get an error like this:
$ import matplotlib ImportError: No module named matplotlib
then you'll need to try to install the Python packages again.
Finally, you can check the installation and versions of all the python libraries. You can do this a couple of different ways.
A) By running this jupyter-notebook:
- Download the installfest-test notebook, labeled "For Everyone"
- Open a terminal, navigate to your download folder, and run
$ cd Downloads $ jupyter-notebook
- Open the notebook by selecting the notebook file
- From the
Restart & run all
If you see any errors then you'll need to reinstall the library that posts the error. Otherwise you should see a bunch of version numbers.
B) By typing
pip freezein the terminal:
- Open a terminal window
$ pip freeze
You will see a list of all the python packages currently installed with their version numbers in the terminal window.
Part 4. Git
We'll also be using git -- a popular version control system used to share code with others -- extensively along with Github. The installation instructions at Github are very good, and you should also make a Github account while you are there (if you don't have one already).
To check if your git installation is successful, open a new terminal window and try to run
gitfrom the command line:
$ git --version
The output should be something like this:
$ git --version git version 2.5.0
Next, you'll need to tell git your name and email. Make sure to use the same email address that you use for github:
$ git config --global user.name "Your Name" $ git config --global user.email [email protected]
These identifiers will be added to your commits and show up when you push your changes to github from the command line!
Part 5. PostgreSQL
PostgreSQL is a database, similar to MySQL, that we'll be using later in class. Install Postgres with the following steps:
Follow the instructions for your operating system below
- Download Postgres.app from www.postgresapp.com
- Move the Postgres.app to your 'Applications' folder.
- Open the Postgres.app (using "right-click + open" since it is an application that isn't from the Mac App Store)
- Look for the elephant in the the menu bar.
$ sudo apt-get install postgresql postgresql-contrib postgresql-client
You need to add yourself as a user in postgres so you can access the
psqlconsole seamlessly. Following the commands below, replace
dsi-studentwith your own user-name and type your own password when prompted.
If you are running Ubuntu, use "ilovedatascience" as your password.
$ sudo -i su - postgres $ createuser dsi-student --superuser --password $ createdb dsi-student $ exit
Test that this works by typing
psql. You should be presented with the postgres shell. To exit type
Part 6. Required Tools
We'll be using Slack, a popular messaging platform, for our class communications.
- Click on the installation instructions for your platform to install the Slack desktop app. You can also sign into Slack using a web interface or via their mobile app!
Note: Add additional market & cohort-specific channel instructions here, as needed.
Chrome is Google's popular web browser, and it comes with a complete set of developer tools built-in. We'll use Chrome to examine code, debug scripts, and view back-end processes. If you don't already have Chrome, make sure to download and install it now.
Tableau is a popular dashboard creation system for visualizing data. As a data scientist you'll need to create visualizations that make your analyses accessible to colleagues, stakeholders, and decision makers.
- Install the sofware for your operating system.
- For now just sign up for a trial account! We'll provide you with a license key for the full version during your first week of class.
Import.io is a popular webscraping tool.
- Grab the download and follow the installation instructions for your OS.
Part 7. Text Editors
A data scientist frequently writes scripts to process data, perform analysis, and create visualizations, webpages, and other end products, so you'll need a good text editor. If you don't already have a preference, try Atom or Sublime. Both editors are available for most platforms.
Instructors should modify these options based on their preferences.
- Download the editor of your choice from their website.
- Install the package by double clicking the file icon or from the command line
- Run your editor from the applications menu, or from the command line, like so:
$ subl $ atom
This example would open up Sublime or Atom, respectively. Whichever editor you choose, be sure to practice using it!
Configure Git with your Text Editor
Finally, you'll want to tell
git which editor it should use for your commits.
If you choose to use Sublime, you would type:
$ git config --global core.editor "subl --wait --new-window"
If you choose to use Atom, you would type:
$ git config --global core.editor "atom --wait"
That's it! Now you're ready to begin GA's Data Science Immersive. See you on the first day of class!