SMOTE Algorithm and Classification: overrated prediction success

I'm facing a problem I can't find an answer to. I have a binary classification problem (output Y=0 or Y=1), with Y=1 the minority class (Y=1 indicates default of a company, with proportion 0.02 in the original data frame). I therefore oversampled with the SMOTE algorithm on my training set only (after splitting my data frame into training and test sets). I train a logistic regression on that training set (where the proportion of the "default" class is 0.3) and then look at the ROC curve and the MSE to test whether the algorithm predicts default well. I get very good results in terms of both AUC (AUC=0.89) and MSE (MSE=0.06). However, when I look more closely at the individual predictions, I find that 20% of defaults are not predicted correctly.

Do you have a method to properly evaluate the quality of my predictions ("quality" meaning, for me, how well the model predicts defaults)? I thought AUC was a good criterion... Do you also have a method to improve my regression? Thanks
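For reference, the pipeline looks roughly like the sketch below (illustrative only, not my actual code): `df` stands for the full data frame with a factor response `default` (levels 0/1, 1 = default), `DMwR::SMOTE()` is one possible SMOTE implementation, and `pROC` is used for the AUC.

```r
library(DMwR)   # SMOTE()
library(pROC)   # roc(), auc()

set.seed(42)
idx   <- sample(nrow(df), size = 0.7 * nrow(df))
train <- df[idx, ]
test  <- df[-idx, ]            # the test set is never resampled

# Oversample the minority class on the training set only
# (perc.over / perc.under are illustrative; tune them to reach the
#  class balance you want, about 0.3 defaults in my case)
train_bal <- SMOTE(default ~ ., data = train,
                   perc.over = 200, perc.under = 350)

# Logistic regression on the balanced training set
fit <- glm(default ~ ., data = train_bal, family = binomial)

# Evaluate on the untouched test set
p_hat <- predict(fit, newdata = test, type = "response")
auc(roc(test$default, p_hat))
```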
282 views · Asked by T. Ciffréo
There is 1 answer:
For every classification problem you can build a confusion matrix.
This is a two-way table that lets you see not only the true positives and true negatives (TP/TN), which are your correct predictions, but also the false positives (FP) and false negatives (FN), which are usually what you really care about.
FP and FN are the errors your model makes. You can track how well the model detects the positives or the negatives using sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP).
Note that, for a given model, you usually can't improve one without lowering the other (you trade them off by moving the classification threshold), so sometimes you need to pick one.
A good compromise is the F1-score, the harmonic mean of precision (TP / (TP + FP)) and recall (another name for sensitivity).
So if you're more interested in defaults (let's say defaults are the positive class), you'll prefer a model with higher sensitivity, but don't completely neglect specificity either. Here is some example code in R.
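This is only a minimal sketch: it assumes `fit` is your fitted `glm(..., family = binomial)` model trained on the SMOTE-balanced training set and `test` your untouched test set, with a 0/1 response column `default` (1 = default, the positive class).

```r
# Predicted probabilities of default on the test set
# (`fit` and `test` are assumed objects, see above)
p_hat <- predict(fit, newdata = test, type = "response")

# Turn probabilities into classes; 0.5 is just a starting point,
# lowering the threshold trades specificity for sensitivity
threshold <- 0.5
y_hat <- factor(ifelse(p_hat >= threshold, 1, 0), levels = c(0, 1))
y_obs <- factor(test$default, levels = c(0, 1))

# Confusion matrix: rows = predicted, columns = observed
cm <- table(Predicted = y_hat, Observed = y_obs)
print(cm)

TP <- cm["1", "1"]; FN <- cm["0", "1"]
FP <- cm["1", "0"]; TN <- cm["0", "0"]

sensitivity <- TP / (TP + FN)   # share of true defaults that are caught
specificity <- TN / (TN + FP)   # share of non-defaults correctly rejected
precision   <- TP / (TP + FP)
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

round(c(sensitivity = sensitivity, specificity = specificity,
        precision = precision, F1 = f1), 3)
```

With your numbers (20% of defaults missed), sensitivity would be around 0.8 even though the AUC is 0.89. Sweeping `threshold` over a grid and watching how sensitivity and specificity move shows which operating point on the ROC curve you actually want for detecting defaults. The `caret` package's `confusionMatrix()` (with `positive = "1"`) gives the same numbers in a single call.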