IST 557 Data Mining: Techniques and Applications
Assignment 2: SVM
A. Data processing
Download this 20newsgroup dataset: 20news.zip. Unzipping the package yields two folders: “rec.sport.hockey” and “soc.religion.christian”. This gives you a dataset with binary labels (rec.sport.hockey and soc.religion.christian). You should have 1997 documents in total (1000 samples in “rec.sport.hockey” and 997 samples in “soc.religion.christian”).
Generate a feature-label matrix from the dataset. Each row is a data sample. The first column of each row is the document ID, i.e., the file name. The second column is the class label, with 0 for class “rec.sport.hockey” and 1 for class “soc.religion.christian”. The rest of the columns are features, and each column corresponds to a unigram. Only consider the 3000 unigrams in this list: dict.txt. Your data matrix should therefore be a 1997*3002 matrix (1 ID column, 1 label column, and 3000 feature columns).
The feature value is the Term Frequency-Inverse Document Frequency (TFIDF) score of the unigram. Refer to here about TFIDF calculation. When generating features, you should calculate the TF score and the IDF score on your own. This means you can use packages that help you tokenize documents, but you should not use packages that turn data into vectors directly (e.g., sklearn.feature_extraction.text.TfidfVectorizer.fit_transform()).
To generate unigrams for a document, you can follow the steps below (you can click on a term if you don’t understand its meaning):
Turn the document into lowercase;
After those steps, you will get a list of words, which can be considered the unigrams of the document. Because we only consider unigrams in dict.txt, words not in this list should be ignored. Because most unigrams in dict.txt will not occur in a given document, there will be many zeros in the features for that document.
Here is the sample code to generate unigrams (in Python 3): sample-code-unigram.py. The input is a string and the output is a list of words (unigrams).
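As a rough illustration of computing TF and IDF on your own, the sketch below assumes raw term counts for TF and log(N/df) for IDF. The TFIDF definition referenced in this assignment may differ (for example in normalization or smoothing), so follow that definition in your submission. The function name `tfidf_rows` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs_tokens, vocab):
    """Return one TF-IDF row per document, with columns in the order of vocab.

    Assumptions (not from the assignment): TF is the raw count of the word
    in the document, and IDF is log(N / df) with the natural logarithm.
    """
    n_docs = len(docs_tokens)
    vocab_index = {word: j for j, word in enumerate(vocab)}

    # Count in-vocabulary words per document and accumulate document frequency.
    df = [0] * len(vocab)
    counts_per_doc = []
    for tokens in docs_tokens:
        counts = Counter(t for t in tokens if t in vocab_index)
        counts_per_doc.append(counts)
        for word in counts:
            df[vocab_index[word]] += 1

    # Words that never occur get IDF 0 so their column stays all zeros.
    idf = [math.log(n_docs / d) if d > 0 else 0.0 for d in df]

    rows = []
    for counts in counts_per_doc:
        row = [0.0] * len(vocab)
        for word, count in counts.items():
            j = vocab_index[word]
            row[j] = count * idf[j]
        rows.append(row)
    return rows

# Toy example: "faith" is not in the vocabulary, so it is ignored.
docs = [["hockey", "game", "hockey"], ["church", "game"], ["church", "faith"]]
vocab = ["hockey", "game", "church"]
rows = tfidf_rows(docs, vocab)
```

With the real data, `vocab` would be the 3000 unigrams read from dict.txt and `docs` the token lists of the 1997 documents.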
What to submit: Code for data processing and the processed data file.
Please name your code A_data_processing.py. Python is required. Write some comments in your code to help the TA understand it.
Please name the processed matrix file matrix.txt. In this file, each line contains the 3002 values that represent one document, separated by commas (e.g., 5 numbers separated by commas look like this: 1,2,3,4,5). There will be 1997 lines in total because you have 1997 documents. Please save the file as Unicode plain text (not in a MATLAB matrix format or the format of any other software).
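One possible way to produce a line in this format is sketched below. The helper name `format_row`, the six-significant-digit rounding, and the sample document ID are assumptions for illustration only.

```python
def format_row(doc_id, label, features):
    """Join document ID, class label, and feature values with commas."""
    # "%.6g" keeps lines short; the precision is an arbitrary choice here.
    values = [str(doc_id), str(label)] + ["%.6g" % v for v in features]
    return ",".join(values)

# Made-up example with only three features instead of 3000.
line = format_row("52550", 0, [0.0, 1.5, 0.25])
print(line)
```

Writing one such line per document to a file opened with `encoding="utf-8"` would give a plain-text file with 1997 lines.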
B. Test SVM
1. Apply linear SVM on the data. Use 5-fold cross-validation to evaluate the performance. Report the average and standard deviation of accuracy values over five folds. Describe the parameter setting you choose.
2. Apply SVM with an RBF kernel. Use 5-fold cross-validation to evaluate the performance. Report the average and standard deviation of accuracy values over five folds. Describe the parameter setting you choose.
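The evaluation protocol in both questions (5 folds, then mean and standard deviation of accuracy) can be sketched as follows. This is only an outline under assumptions: `threshold_classifier` is a made-up stand-in where an SVM from a library such as scikit-learn would be trained, the helper names are hypothetical, and the population standard deviation is used (the sample standard deviation would be an equally reasonable choice, so state which one you report).

```python
import random
import statistics

def kfold_indices(n_samples, k=5, seed=0):
    """Shuffle indices 0..n_samples-1 and deal them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_accuracy(X, y, fit_predict, k=5, seed=0):
    """Return one accuracy per fold.

    fit_predict(X_train, y_train, X_test) is a hypothetical hook that trains
    a model on the training part and returns predicted labels for X_test.
    """
    folds = kfold_indices(len(X), k, seed)
    accuracies = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        preds = fit_predict([X[j] for j in train_idx],
                            [y[j] for j in train_idx],
                            [X[j] for j in test_idx])
        correct = sum(p == y[j] for p, j in zip(preds, test_idx))
        accuracies.append(correct / len(test_idx))
    return accuracies

def threshold_classifier(X_train, y_train, X_test):
    # Stand-in for an SVM: ignores the training data and predicts 1
    # whenever the single feature is positive.
    return [1 if x[0] > 0 else 0 for x in X_test]

# Toy data: 20 samples with one feature each.
X = [[-1.0], [-0.5], [0.5], [1.0]] * 5
y = [0, 0, 1, 1] * 5
accs = cross_val_accuracy(X, y, threshold_classifier)
mean_acc = statistics.mean(accs)
std_acc = statistics.pstdev(accs)
print("accuracy: %.3f +/- %.3f" % (mean_acc, std_acc))
```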
What to submit: Include your code and the reported results.
C. Parameter tuning
1. For linear SVM, tune parameter C and show the corresponding accuracy using 5-fold cross-validation.
2. For linear SVM, tune parameter C using two-layer (nested) 5-fold cross-validation to select the parameter and evaluate the performance. For each round of evaluation (i.e., each fold of the outer loop), show the best C and the accuracy evaluated on the test data.
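The two-layer procedure in question 2 can be outlined as below: the inner loop picks C using only the outer training part, and the outer test fold is touched once per round. This is a sketch under assumptions: `evaluate(c, train_idx, test_idx)` is a hypothetical hook that would train a linear SVM with that C and return its accuracy, and `toy_evaluate` is a made-up stand-in so the outline runs without any data.

```python
import math
import random

def split_folds(indices, k, seed):
    """Shuffle the given indices and deal them into k folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def nested_cv(n_samples, c_values, evaluate, k=5, seed=0):
    """Return a list of (best C, outer test accuracy), one pair per outer fold."""
    results = []
    outer_folds = split_folds(range(n_samples), k, seed)
    for o, outer_test in enumerate(outer_folds):
        outer_train = [j for f, fold in enumerate(outer_folds) if f != o for j in fold]

        # Inner loop: choose C using only the outer training indices.
        inner_folds = split_folds(outer_train, k, seed + 1)
        best_c, best_score = None, -1.0
        for c in c_values:
            scores = []
            for i, inner_val in enumerate(inner_folds):
                inner_train = [j for f, fold in enumerate(inner_folds) if f != i for j in fold]
                scores.append(evaluate(c, inner_train, inner_val))
            mean_score = sum(scores) / len(scores)
            if mean_score > best_score:
                best_c, best_score = c, mean_score

        # Outer evaluation: retrain with the chosen C, test on the held-out fold.
        results.append((best_c, evaluate(best_c, outer_train, outer_test)))
    return results

def toy_evaluate(c, train_idx, test_idx):
    # Stand-in for "train a linear SVM with this C and return its accuracy":
    # pretends that accuracy peaks at C = 1.
    return 1.0 / (1.0 + abs(math.log10(c)))

results = nested_cv(50, [0.01, 0.1, 1, 10, 100], toy_evaluate)
for round_no, (best_c, acc) in enumerate(results, start=1):
    print("round %d: best C = %s, test accuracy = %.3f" % (round_no, best_c, acc))
```

The candidate grid for C shown here is arbitrary; which values to try is part of what you describe in your report.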
What to submit: Include your code and the reported results.
D. Feature importance study
Apply linear SVM to the whole dataset. Show each unigram (i.e., feature) and its corresponding weight. Sort the unigram list by weight in descending order. Include the top-10 and bottom-10 unigrams in the report and discuss whether these features make sense to you.
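Once the weight vector of the trained linear SVM is available (in scikit-learn, for instance, a fitted linear model exposes it as `coef_`), the sorting step might look like the sketch below. The function name `rank_features`, the unigrams, and the weights are made up for illustration.

```python
def rank_features(unigrams, weights, n=10):
    """Sort (unigram, weight) pairs by weight, largest first."""
    ranked = sorted(zip(unigrams, weights), key=lambda pair: pair[1], reverse=True)
    return ranked, ranked[:n], ranked[-n:]

# Made-up weights; real ones come from the trained model.
unigrams = ["god", "team", "church", "game", "the"]
weights = [1.2, -0.9, 0.8, -1.1, 0.01]
ranked, top, bottom = rank_features(unigrams, weights, n=2)
print("top:", top)
print("bottom:", bottom)
```

Under the common convention that positive decision values predict class 1, large positive weights would point toward “soc.religion.christian” and large negative weights toward “rec.sport.hockey”; check this against the library you use.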
What to submit: Include your code, a sorted list of unigrams and their weights, and the report.
Submission requirement for questions B, C, and D:
Please name your code $_xxx.py, where $ is replaced by B, C, or D to indicate the question, and xxx can be any name you choose.
Please put all your reported results in one PDF file named HW2.pdf.
It is important to scale the features for SVM. Normalize the feature values to [-1,1] before you apply SVM.
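A minimal sketch of column-wise scaling to [-1, 1] is given below. The function names are hypothetical, and mapping constant columns to 0 is an assumption made here to avoid division by zero. Computing the minimum and maximum on the training part only and reusing them for the test part keeps test data out of the fitting step.

```python
def fit_min_max(rows):
    """Return per-column minima and maxima of a list of feature rows."""
    columns = list(zip(*rows))
    return [min(col) for col in columns], [max(col) for col in columns]

def scale_to_unit_range(rows, mins, maxs):
    """Map each column linearly so that its min becomes -1 and its max becomes 1."""
    scaled = []
    for row in rows:
        new_row = []
        for value, lo, hi in zip(row, mins, maxs):
            if hi == lo:
                new_row.append(0.0)  # constant column (assumed convention)
            else:
                new_row.append(2.0 * (value - lo) / (hi - lo) - 1.0)
        scaled.append(new_row)
    return scaled

# Toy example: the second column is constant.
train = [[0.0, 5.0], [2.0, 5.0], [4.0, 5.0]]
mins, maxs = fit_min_max(train)
scaled = scale_to_unit_range(train, mins, maxs)
```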
- Submit 6 files (4 $_xxx.py files whose names start with A, B, C, or D, one matrix.txt file, and one HW2.pdf file) by clicking on "add another file" in CANVAS, instead of submitting a single zip file containing them.
- Late submission penalty will be strictly enforced (see syllabus). The assignment should be completed independently.
Assignment 1: Decision Trees
1. Decision tree implementation.
1.1. Implement the decision tree classification method. You should use Python only. Use entropy as the criterion for selecting questions. Stop splitting a node when it contains fewer than 15 instances.
1.2. Download this dataset. Run your decision tree algorithm on this dataset. Show the decision tree you built. Label each node with its question. Show the distribution of classes (i.e., how many instances of each class) in each leaf node.
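The entropy criterion and the stopping rule from 1.1 can be sketched as follows. This is one common formulation, assuming numeric features and questions of the form "value <= threshold"; the function names are hypothetical and the question format is not prescribed by the assignment.

```python
import math
from collections import Counter

MIN_NODE_SIZE = 15  # a node with fewer instances than this becomes a leaf

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from asking "value <= threshold?"."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the question does not separate anything
    n = len(labels)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - weighted

def should_stop(labels):
    """Stop when the node is too small or already pure."""
    return len(labels) < MIN_NODE_SIZE or len(set(labels)) <= 1

# Toy example: threshold 2 separates the two classes perfectly.
gain = information_gain([1, 2, 3, 4], [0, 0, 1, 1], threshold=2)
```

At each node, the question with the largest information gain over all features and candidate thresholds would be selected.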
2. Evaluate decision trees.
2.1. Apply a decision tree package to the WINE dataset (download the dataset from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Wine). We recommend using the scikit-learn package (http://scikit-learn.org/stable/index.html). Describe the splitting criterion and the stopping criterion you choose to use. Plot the tree built by this package.
2.2. Randomly select 80% of the data instances for training and use the remaining 20% for testing. Vary the parameter that controls the stopping criterion. Draw a figure showing the training error and testing error w.r.t. different parameter values.
2.3. Fix the parameter setting. Evaluate the decision tree using 5-fold cross-validation.
2.4. Use two nested layers of 5-fold cross-validation to find the best parameter and evaluate it on the test data. Show the best parameter selected for each round and the corresponding accuracy on the testing data.
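For the random 80/20 split in 2.2, a minimal sketch is shown below. The function name and the fixed seed are assumptions for illustration; 178 is the number of instances listed for the UCI Wine dataset.

```python
import random

def train_test_split_indices(n_samples, train_fraction=0.8, seed=0):
    """Shuffle sample indices and cut them into a training and a testing part."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_fraction * n_samples))
    return idx[:cut], idx[cut:]

train_idx, test_idx = train_test_split_indices(178)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split reproducible, which helps when comparing training and testing error across parameter values.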
3. Apply random forests. Describe your parameter setting. Use 5-fold cross-validation to evaluate the performance.
4. Apply XGBoost on your dataset. Describe your parameter setting. Use 5-fold cross-validation to evaluate the performance.
What to submit: you need to submit three files:
1. The code (.py or .zip) for 1.1. If the source code only has one file, there is no need to zip the file.
2. The code (.py or .zip) for the other questions (all except 1.1). Preferably, put all the code in one file; in that case, there is no need to zip it. If you need several files, please submit a zip file and include a readme file in the folder so the TA knows what each file does.
3. The results (.pdf or .docx) for questions (other than 1.1).
Remember to submit the three files by clicking on "add another file" in Canvas, instead of submitting a single zip file containing all three.
You are required to annotate your code with comment lines. Submit all the files to Assignment 1 on CANVAS.
Late submission penalty will be strictly enforced (see syllabus). The assignment should be completed independently.
Quiz grading statistics.
Grading rubrics for Quiz.
|Question 1 (1.5/5)|As long as you try to deal with punctuation and the corresponding result is right, you will get the full score. Each additional fault deducts 0.5.|
|Question 2 (1.5/5)|Results and figures (with bars) should all be included in the submission. The y-axis of the figure can show months with or without the year specified. Each additional fault deducts 0.5.|
|Question 3 (1.0/5)|Each wrong or missing answer deducts 0.5.|
|Question 4 (1.0/5)|Each wrong or missing answer deducts 0.5.|