Project 1
Curation and cleaning of a labeled dataset that you will use for the supervised and unsupervised learning tasks in Projects 2 and 3. The dataset can be built from existing data and should be stored in your GitHub repository.
To submit the assignment, submit a link to your project Jupyter notebook on Blackboard.
The goal for this assignment is for you to create a usable dataset from an open-source data collection. You will use your curated dataset for a supervised classification task in Project 2 and an unsupervised learning task in Project 3.
Step 1: Find and download a dataset. Here are some potential places to look:
- Amazon’s AWS datasets: https://aws.amazon.com/opendata/public-datasets/
- Data Portals: http://dataportals.org/
- Kaggle datasets: http://kaggle.com
- NYPL digitizations: http://libguides.nypl.org/eresources
- NYC Open Data: http://opendata.cityofnewyork.us/data/
- Open Data Monitor: http://opendatamonitor.eu/
- Quandl: http://quandl.com/
- UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php
Some guidelines regarding dataset selection:
- Since this data is going to be used for supervised learning, one of its features should be a value that you can predict (e.g., if you chose a real estate dataset and are interested in housing prices, then housing prices for each sample should be available in the dataset)
- Your dataset should be large enough to perform supervised and unsupervised learning on. A useful rule of thumb is that you should have at least 10 samples per dimension or parameter/feature you are fitting.
Step 2: Divide into a training set and a testing set. In a Jupyter notebook, use scikit-learn to divide your data into training and testing sets. Make sure that the testing and training sets are balanced in terms of target classes.
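For example, a minimal sketch using scikit-learn's train_test_split, assuming your data sits in a pandas DataFrame with a target column named "target" (the file name and column name are placeholders for your own):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("my_dataset.csv")  # placeholder file name

# stratify=df["target"] keeps the class proportions the same in both
# splits, which is what "balanced in terms of target classes" asks for
train_set, test_set = train_test_split(
    df, test_size=0.2, stratify=df["target"], random_state=42
)
```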
Step 3: Explore your training set. In a Jupyter notebook, import your data into a pandas DataFrame and use the following pandas functions to explore your data:
- DataFrame.info()
- DataFrame.describe()
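For instance, assuming the training split from Step 2 is in a DataFrame named train_set:

```python
train_set.info()      # column names, dtypes, and non-null counts
train_set.describe()  # count, mean, std, min/max, quartiles per numeric column
```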
Step 4: Data cleaning. Address any missing values in your training set. Include the code in your Jupyter notebook and create a second, cleaned, version of your dataset. Then apply the same procedure to your test set (if you are filling in replacement values, use SimpleImputer from scikit-learn).
- Recall from the Hands-On Machine Learning book (p. 60) that some options are:
- “Get rid of the corresponding samples.”
- “Get rid of the whole attribute (column).”
- “Set the values to some value (zero, the mean, the median, etc.).”
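A minimal sketch of the third option using scikit-learn's SimpleImputer, assuming train_set and test_set from Step 2 (numeric columns only):

```python
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

# Fit on the training set only, so no information leaks from the test set
num_train = train_set.select_dtypes(include="number")
num_test = test_set.select_dtypes(include="number")

train_clean = imputer.fit_transform(num_train)
test_clean = imputer.transform(num_test)  # reuse the training-set medians
```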
Step 5: Visualize the data in your training set. At a minimum, use the following pandas functions to visualize the data in your Jupyter notebook.
- DataFrame.hist()
- plotting.scatter_matrix()
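For example (the three feature names below are placeholders for columns in your own data):

```python
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

train_set.hist(bins=50, figsize=(12, 8))  # one histogram per numeric column
scatter_matrix(train_set[["feat_a", "feat_b", "feat_c"]], figsize=(12, 8))
plt.show()
```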
Step 6: Apply transformations to your data. In your Jupyter notebook, apply squaring, cubing, logarithmic, and exponential transformations to two features in your dataset. Plot the histograms and scatter matrices of the resultant data.
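A sketch of the four transformations, assuming two numeric columns feat_a and feat_b (placeholder names). Note that the logarithm is only defined for positive values; np.log1p is one hedge against zeros:

```python
import numpy as np

for col in ["feat_a", "feat_b"]:  # placeholder feature names
    train_set[col + "_sq"] = train_set[col] ** 2
    train_set[col + "_cube"] = train_set[col] ** 3
    train_set[col + "_log"] = np.log1p(train_set[col])  # log(1 + x) tolerates zeros
    train_set[col + "_exp"] = np.exp(train_set[col])

train_set.hist(bins=50, figsize=(12, 8))  # re-plot to compare distributions
```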
Project 2
Application of two supervised learning techniques on the dataset you created in Project 1. This assignment should be completed as a Jupyter notebook in your GitHub repository.
To submit the assignment, submit a link to your project Jupyter notebook on Blackboard.
The goal for this assignment is to apply different types of supervised learning algorithms with a range of parameter settings to your data and to observe which performs better.
Step 1: Load your data, including testing/training split from Project 1.
- Your testing and training split should be balanced
- Your data should be clean and missing data should be addressed
Step 2: (If not already done in Project 1) Prepare your data
- Make sure that all your categorical variables are encoded numerically (as ordinal or one-hot)
- Perform any necessary feature scaling
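One way to do both at once is scikit-learn's ColumnTransformer; the column lists below are placeholders for your own:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["feat_a", "feat_b"]  # placeholder numeric columns
cat_cols = ["feat_c"]            # placeholder categorical columns

preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),                        # feature scaling
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),  # one-hot encoding
])

X_train = preprocess.fit_transform(train_set)
X_test = preprocess.transform(test_set)  # reuse the fitted scaler/encoder
```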
Step 3: Examine your target attribute. Based on the data exploration you did in Project 1, confirm and examine the attribute you are going to predict.
- Examine and plot the distribution of the target attribute in your training set (e.g., is it Gaussian, uniform, logarithmic). This will help you interpret the performance of different algorithms on your data.
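For a classification target, a bar chart of class counts is usually enough; assuming the target column is named "target" (a placeholder):

```python
import matplotlib.pyplot as plt

train_set["target"].value_counts().plot(kind="bar")
plt.xlabel("class")
plt.ylabel("count")
plt.show()
```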
Step 4: Select two of the following supervised learning algorithms, ideally one from the first half of the list and one from the second half:
- K-Nearest Neighbor
- Linear Models
- Naïve Bayes
- Decision Trees
  - Single tree
  - Random Forest
  - Gradient-boosted decision trees
- Support Vector Machines
Step 5: For each of your selected models
- Run with the default parameters using cross-validation
- Calculate precision, recall, and F1
- (Where possible) adjust 2-3 parameters for each model using grid search
- Report evaluation metrics for the best and worst-performing parameter settings
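A sketch of this loop for one model; RandomForestClassifier and the parameter grid below are just one possible choice, and X_train / y_train are assumed to come from Steps 1-2:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, cross_val_predict

model = RandomForestClassifier(random_state=42)  # one possible choice

# Default parameters, evaluated with 5-fold cross-validation
preds = cross_val_predict(model, X_train, y_train, cv=5)
print(classification_report(y_train, preds))  # per-class precision, recall, F1

# Grid search over 2-3 parameters (grid values are placeholders)
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 30]}
search = GridSearchCV(model, param_grid, cv=5, scoring="f1_macro")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
print(search.cv_results_["mean_test_score"])  # inspect the worst settings too
```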
Tip: You should make notes on what worked well and what didn’t. Such notes will be useful when you write up the paper for your final project.
Project 3
To submit the assignment, submit a link to your project Jupyter notebook on Blackboard.
The goal for this assignment is to apply different types of unsupervised learning techniques to the dataset you created in Project 1.
Step 1: Load your data, including testing/training split from Project 1.
- Your testing and training split should be balanced
- Your data should be clean and missing data should be addressed
- All categorical variables should be encoded numerically (as ordinal or one-hot)
- Any necessary feature scaling should be performed
- YOU SHOULD ONLY WORK ON YOUR TRAINING SET
Step 2: PCA for feature selection
- Show how many components you need to retain to capture 95% of the variance
- Evaluate whether this improves your best-performing model from Project 2
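A minimal sketch: passing a float between 0 and 1 as n_components tells scikit-learn's PCA to keep just enough components for that fraction of the variance (X_train is assumed from Step 1):

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)  # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X_train)
print(pca.n_components_, "components retained")
print(pca.explained_variance_ratio_.sum())  # should be >= 0.95
```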
Step 3: Apply three types of clustering to your data and visualize the output of each, both with and without PCA run on it first. Calculate both the ARI and the Silhouette Coefficient for all six combinations (a sketch follows this step).
- k-Means (use an elbow visualization to determine the optimal number of clusters)
- Agglomerative/Hierarchical
- DBSCAN
If your data from Projects 1 and 2 really doesn't lend itself to clustering, you can use the breast_cancer dataset from scikit-learn. Still submit your attempts on your own data in the notebook.
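A sketch of the six combinations; the cluster counts, eps, and min_samples below are placeholders you should tune (e.g., via the elbow plot for k-means), and y_train is used only to compute the ARI:

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

# X_train and X_reduced (the PCA output from Step 2) are assumed above
for data_name, X in [("raw", X_train), ("pca", X_reduced)]:
    for algo in (KMeans(n_clusters=3, random_state=42),  # 3 is a placeholder
                 AgglomerativeClustering(n_clusters=3),
                 DBSCAN(eps=0.5, min_samples=5)):
        labels = algo.fit_predict(X)
        ari = adjusted_rand_score(y_train, labels)
        # silhouette_score needs at least two distinct labels
        sil = (silhouette_score(X, labels)
               if len(set(labels)) > 1 else float("nan"))
        print(data_name, type(algo).__name__, round(ari, 3), round(sil, 3))
```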
Tip: You should make notes on what worked well and what didn’t. Such notes will be useful when you write up the paper for your final project.
Final Paper
A 5–8 page paper describing the work you did in projects 1–3 (your dataset and your supervised and unsupervised experiments). The paper should describe both what you did technically and what you learned from the relative performance of the machine learning approaches you applied to your dataset. This assignment should be posted as a PDF in your GitHub repository.
1) Describe your dataset, why you chose it, and what you are trying to predict with it.
2) Detail what you did to clean your data and any changes in the data representation that you applied. Discuss any challenges that arose during this process.
3) Discuss what you learned by visualizing your data.
4) Describe your experiments with the two supervised learning algorithms you chose. This should include a brief description, in your own words, of what the algorithms do and the parameters that you adjusted. You should also report the relative performance of the algorithms on predicting your target attribute, reflecting on the reasons for any differences in performance between models and parameter settings.
5) Describe your experiments using PCA for feature selection, discussing whether it improved any of your results with your best-performing supervised learning algorithm.
6) Discuss the results of using PCA as a pre-processing step for clustering. This should include a brief description, in your own words, of what the algorithms do. If you used the breast_cancer dataset for this, briefly explain why your original dataset wasn't appropriate.
7) Summarize what you learned across the three projects, including what you think worked and what you would do differently if you had to do it over.