


How to select between models when AUC scores are similar?















2












$begingroup$


I trained two machine learning algorithms for binary classification and got these results:

Algo 1:

 AUC (train): 0.75, AUC (test): 0.65 (higher train score, more overfitting)

Algo 2:

 AUC (train): 0.72, AUC (test): 0.65 (lower train score, less overfitting)

Which one is better?





















$endgroup$
















      machine-learning data-mining metric






      asked 16 hours ago by amal amal; edited 13 hours ago by Esmailian





          3 Answers


















          1












          $begingroup$

          Judged by the test AUC alone, the two models are equivalent. Whether a model overfits matters less than how well it performs on new data (the test score).



          Overfitting only indicates that there may be room for improvement by making your model more general. Until the test score actually increases, the model has not improved, even if it overfits less.






          answered 15 hours ago by Simon Larsson (edited 15 hours ago)











          $endgroup$












          • $begingroup$
            Thanks Simon. So if I understand correctly, I should always pick the model with the highest test score, without giving any importance to the training score?
            $endgroup$
            – amal amal
            15 hours ago










          • $begingroup$
            Yes, that is correct.
            $endgroup$
            – Simon Larsson
            15 hours ago










          • $begingroup$
            Thanks for your help
            $endgroup$
            – amal amal
            15 hours ago










          • $begingroup$
            No problem! Don't forget to mark my answer as correct if you got what you asked for.
            $endgroup$
            – Simon Larsson
            15 hours ago


















          1












          $begingroup$

          Algo 2



          Between models with equal test scores, choose the one with the smaller gap between training and test scores (Algo 2), since the one with the better training score (Algo 1) is more overfitted. We tolerate a more overfitted model only if it has a subjectively better test score.
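One way to make this rule concrete is the sketch below. It is only an illustration: the `ModelScore` class, the `pick_model` function, and the tie tolerance `tol` are hypothetical names and choices, not part of any library, and the numbers are the ones from the question.

```python
from dataclasses import dataclass


@dataclass
class ModelScore:
    name: str
    train_auc: float
    test_auc: float

    @property
    def gap(self) -> float:
        # Train/test gap, used here as a rough proxy for overfitting.
        return self.train_auc - self.test_auc


def pick_model(candidates, tol=0.005):
    """Prefer the highest test AUC; among near-ties, prefer the smaller gap."""
    best_test = max(c.test_auc for c in candidates)
    # Scores within `tol` of the best count as tied (tol is an arbitrary choice).
    tied = [c for c in candidates if best_test - c.test_auc <= tol]
    return min(tied, key=lambda c: c.gap)


algo1 = ModelScore("Algo 1", train_auc=0.75, test_auc=0.65)
algo2 = ModelScore("Algo 2", train_auc=0.72, test_auc=0.65)
chosen = pick_model([algo1, algo2])
print(chosen.name)  # Algo 2
```

A model with a clearly higher test AUC would still win regardless of its gap; the gap only breaks ties.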



          For a better justification, think of how we train a neural network. When the validation score stops improving, we stop training even though the training score would keep improving. If we let training continue, the model starts making extra assumptions based on the training set that are not scrutinized by the critic (the validation set), which makes it more prone to building false assumptions about the data.



          By the same token, a model (Algo 1) that performs the same according to the critic (the test set) but better on the training set is likely to have made untested assumptions about the data.
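The early-stopping analogy can be sketched in a few lines. This is a minimal illustration, not a real training loop: the function name `early_stopping_epoch`, the `patience` parameter, and both score curves are made up for the example.

```python
def early_stopping_epoch(val_scores, patience=2):
    """Return the index of the epoch whose weights we would keep.

    Training stops once the validation score has failed to improve
    for `patience` consecutive epochs.
    """
    best_epoch, best_score, waited = 0, float("-inf"), 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_epoch, best_score, waited = epoch, score, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch


# Made-up curves: training AUC keeps rising, validation AUC plateaus.
train_auc = [0.60, 0.66, 0.70, 0.72, 0.74, 0.75]
val_auc = [0.58, 0.62, 0.65, 0.65, 0.64, 0.64]

stop = early_stopping_epoch(val_auc, patience=2)
print(stop, train_auc[stop], val_auc[stop])
```

The later epochs reach a higher training AUC without any gain on the validation set, which is the situation Algo 1 is in relative to Algo 2.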






          answered 14 hours ago by Esmailian (edited 12 hours ago)











          $endgroup$












          • $begingroup$
            How can you make these assumptions? Test score tells you the generalization ability of the algorithm regardless of the bias/variance. I feel like you can say nothing about which one will perform better on another test set.
            $endgroup$
            – Simon Larsson
            14 hours ago










          • $begingroup$
            Genuinely curious, by the way, in case you know something I have missed. :)
            $endgroup$
            – Simon Larsson
            14 hours ago










          • $begingroup$
            @SimonLarsson cool! I made some updates.
            $endgroup$
            – Esmailian
            14 hours ago










          • $begingroup$
            Thank you for replying! But what I would like to know is how you can assume that one will generalize better than the other on other data when the test score is the same. Just because you know that one model has learned some junk from the training set, it does not follow that the other model has learned something useful in its place.
            $endgroup$
            – Simon Larsson
            14 hours ago






          • 2




            $begingroup$
            @SimonLarsson I think fundamentally it's an Occam's Razor thing, with an assumption that the more-overfit model is "more complicated." In specific situations it's easier; e.g., if the data is time-dependent and the test set is out-of-time, then the train/test score discrepancy might indicate degradation over time, so that future performance may degrade faster in the more-overfit model.
            $endgroup$
            – Ben Reiniger
            13 hours ago


















          1












          $begingroup$

          Based on this metric alone, you cannot tell which one is better, because AUC does not differentiate between these two results. You should use other metrics, such as Kappa, or some benchmarks.



          Disclaimer:



          If you are using Python, I suggest the PyCM module, which takes your confusion matrix as input and calculates about 100 overall and class-based metrics.



          To use this module, first prepare your confusion matrix and look at its recommended parameters with the following code:



          >>> from pycm import ConfusionMatrix

          >>> cm = ConfusionMatrix(matrix={"0": {"0": 1, "1": 0, "2": 0}, "1": {"0": 0, "1": 1, "2": 2}, "2": {"0": 0, "1": 1, "2": 0}})

          >>> print(cm.recommended_list)
          ["Kappa", "SOA1(Landis & Koch)", "SOA2(Fleiss)", "SOA3(Altman)", "SOA4(Cicchetti)", "CEN", "MCEN", "MCC", "J", "Overall J", "Overall MCC", "Overall CEN", "Overall MCEN", "AUC", "AUCI", "G", "DP", "DPI", "GI"]


          Then inspect the metric values, focusing on the recommended metrics, with the following code:



          >>> print(cm)
          Predict    0    1    2
          Actual
          0          1    0    0
          1          0    1    2
          2          0    1    0




          Overall Statistics :

          95% CI (-0.02941,0.82941)
          Bennett_S 0.1
          Chi-Squared 6.66667
          Chi-Squared DF 4
          Conditional Entropy 0.55098
          Cramer_V 0.8165
          Cross Entropy 1.52193
          Gwet_AC1 0.13043
          Joint Entropy 1.92193
          KL Divergence 0.15098
          Kappa 0.0625
          Kappa 95% CI (-0.60846,0.73346)
          Kappa No Prevalence -0.2
          Kappa Standard Error 0.34233
          Kappa Unbiased 0.03226
          Lambda A 0.5
          Lambda B 0.66667
          Mutual Information 0.97095
          Overall_ACC 0.4
          Overall_RACC 0.36
          Overall_RACCU 0.38
          PPV_Macro 0.5
          PPV_Micro 0.4
          Phi-Squared 1.33333
          Reference Entropy 1.37095
          Response Entropy 1.52193
          Scott_PI 0.03226
          Standard Error 0.21909
          Strength_Of_Agreement(Altman) Poor
          Strength_Of_Agreement(Cicchetti) Poor
          Strength_Of_Agreement(Fleiss) Poor
          Strength_Of_Agreement(Landis and Koch) Slight
          TPR_Macro 0.44444
          TPR_Micro 0.4

          Class Statistics :

          Classes 0 1 2
          ACC(Accuracy) 1.0 0.4 0.4
          BM(Informedness or bookmaker informedness) 1.0 -0.16667 -0.5
          DOR(Diagnostic odds ratio) None 0.5 0.0
          ERR(Error rate) 0.0 0.6 0.6
          F0.5(F0.5 score) 1.0 0.45455 0.0
          F1(F1 score - harmonic mean of precision and sensitivity) 1.0 0.4 0.0
          F2(F2 score) 1.0 0.35714 0.0
          FDR(False discovery rate) 0.0 0.5 1.0
          FN(False negative/miss/type 2 error) 0 2 1
          FNR(Miss rate or false negative rate) 0.0 0.66667 1.0
          FOR(False omission rate) 0.0 0.66667 0.33333
          FP(False positive/type 1 error/false alarm) 0 1 2
          FPR(Fall-out or false positive rate) 0.0 0.5 0.5
          G(G-measure geometric mean of precision and sensitivity) 1.0 0.40825 0.0
          LR+(Positive likelihood ratio) None 0.66667 0.0
          LR-(Negative likelihood ratio) 0.0 1.33333 2.0
          MCC(Matthews correlation coefficient) 1.0 -0.16667 -0.40825
          MK(Markedness) 1.0 -0.16667 -0.33333
          N(Condition negative) 4 2 4
          NPV(Negative predictive value) 1.0 0.33333 0.66667
          P(Condition positive) 1 3 1
          POP(Population) 5 5 5
          PPV(Precision or positive predictive value) 1.0 0.5 0.0
          PRE(Prevalence) 0.2 0.6 0.2
          RACC(Random accuracy) 0.04 0.24 0.08
          RACCU(Random accuracy unbiased) 0.04 0.25 0.09
          TN(True negative/correct rejection) 4 1 2
          TNR(Specificity or true negative rate) 1.0 0.5 0.5
          TON(Test outcome negative) 4 3 3
          TOP(Test outcome positive) 1 2 2
          TP(True positive/hit) 1 1 0
          TPR(Sensitivity, recall, hit rate, or true positive rate) 1.0 0.33333 0.0
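As a cross-check of the Kappa value in the output above, here is a dependency-free sketch that computes Cohen's kappa directly from the same confusion matrix. The helper name `cohen_kappa` is made up for this example and is not part of PyCM; it uses the standard definition, kappa = (observed agreement - chance agreement) / (1 - chance agreement).

```python
def cohen_kappa(matrix):
    """Cohen's kappa from a square confusion matrix (rows = actual, columns = predicted)."""
    n = sum(sum(row) for row in matrix)
    k = len(matrix)
    # Observed agreement: fraction of samples on the diagonal (overall accuracy).
    observed = sum(matrix[i][i] for i in range(k)) / n
    row_tot = [sum(row) for row in matrix]
    col_tot = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the row and column marginals.
    expected = sum(row_tot[i] * col_tot[i] for i in range(k)) / (n * n)
    return (observed - expected) / (1 - expected)


cm = [[1, 0, 0],
      [0, 1, 2],
      [0, 1, 0]]
kappa = cohen_kappa(cm)
print(round(kappa, 4))  # 0.0625
```

Here the observed agreement is 0.4 and the chance agreement is 0.36, matching `Overall_ACC` and `Overall_RACC` in the output, so kappa = 0.04 / 0.64 = 0.0625.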
















          $endgroup$








          • 1




            $begingroup$
            You should mention that you are an author of the package. (datascience.stackexchange.com/help/behavior)
            $endgroup$
            – Ben Reiniger
            12 hours ago










          • $begingroup$
            Thanks for the reminder. I just edited my answer.
            $endgroup$
            – alireza zolanvari
            12 hours ago










          • $begingroup$
            @alirezazolanvari In my opinion, changing the measure does not solve the underlying problem. First, the choice of measure depends on the task too; we cannot pick and choose it independently. More importantly, this problem can happen with any other measure (e.g. Kappa) too, so the solution is not simply to change the measure.
            $endgroup$
            – Esmailian
            10 hours ago










          • $begingroup$
            @Esmailian Obviously the evaluation metric is directly related to the task, but research on finding proper metrics for evaluating a learning algorithm has focused on clarifying the difference in performance between algorithms in cases where simple metrics such as AUC cannot say which one is better. Overall, many other things should be considered to answer this question. This answer is not a golden key to the problem, but it can help solve it.
            $endgroup$
            – alireza zolanvari
            9 hours ago










          Your Answer





          StackExchange.ifUsing("editor", function ()
          return StackExchange.using("mathjaxEditing", function ()
          StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
          StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
          );
          );
          , "mathjax-editing");

          StackExchange.ready(function()
          var channelOptions =
          tags: "".split(" "),
          id: "557"
          ;
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function()
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled)
          StackExchange.using("snippets", function()
          createEditor();
          );

          else
          createEditor();

          );

          function createEditor()
          StackExchange.prepareEditor(
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: false,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: null,
          bindNavPrevention: true,
          postfix: "",
          imageUploader:
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          ,
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          );



          );













          draft saved

          draft discarded


















          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f47339%2fhow-to-select-between-models-when-auc-scores-are-similar%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown

























          3 Answers
          3






          active

          oldest

          votes








          3 Answers
          3






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          1












          $begingroup$

          Based on the AUC score they are the same. It does not really matter if the model is overfitting or not. What matters is how well it performs on new data (test score).



          Overfitting is just an indication that there might be room for improvement by making your model more general. But until the test score has increased the model has not improved even if it is overfitting less.






          share|improve this answer











          $endgroup$












          • $begingroup$
            Thanks simon, so if I understand I should always take the biggest test score as the best model without getting any importance to training score?
            $endgroup$
            – amal amal
            15 hours ago










          • $begingroup$
            Yes, that is correct.
            $endgroup$
            – Simon Larsson
            15 hours ago










          • $begingroup$
            Thanks for your help
            $endgroup$
            – amal amal
            15 hours ago










          • $begingroup$
            No problem! Don't forget to mark my answer as correct if you got what you asked for.
            $endgroup$
            – Simon Larsson
            15 hours ago















          1












          $begingroup$

          Based on the AUC score they are the same. It does not really matter if the model is overfitting or not. What matters is how well it performs on new data (test score).



          Overfitting is just an indication that there might be room for improvement by making your model more general. But until the test score has increased the model has not improved even if it is overfitting less.






          share|improve this answer











          $endgroup$












          • $begingroup$
            Thanks simon, so if I understand I should always take the biggest test score as the best model without getting any importance to training score?
            $endgroup$
            – amal amal
            15 hours ago










          • $begingroup$
            Yes, that is correct.
            $endgroup$
            – Simon Larsson
            15 hours ago










          • $begingroup$
            Thanks for your help
            $endgroup$
            – amal amal
            15 hours ago










          • $begingroup$
            No problem! Don't forget to mark my answer as correct if you got what you asked for.
            $endgroup$
            – Simon Larsson
            15 hours ago













          1












          1








          1





          $begingroup$

          Based on the AUC score they are the same. It does not really matter if the model is overfitting or not. What matters is how well it performs on new data (test score).



          Overfitting is just an indication that there might be room for improvement by making your model more general. But until the test score has increased the model has not improved even if it is overfitting less.






          share|improve this answer











          $endgroup$



          Based on the AUC score they are the same. It does not really matter if the model is overfitting or not. What matters is how well it performs on new data (test score).



          Overfitting is just an indication that there might be room for improvement by making your model more general. But until the test score has increased the model has not improved even if it is overfitting less.







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited 15 hours ago

























          answered 15 hours ago









          Simon LarssonSimon Larsson

          4316




          4316











          • $begingroup$
            Thanks simon, so if I understand I should always take the biggest test score as the best model without getting any importance to training score?
            $endgroup$
            – amal amal
            15 hours ago










          • $begingroup$
            Yes, that is correct.
            $endgroup$
            – Simon Larsson
            15 hours ago










          • $begingroup$
            Thanks for your help
            $endgroup$
            – amal amal
            15 hours ago










          • $begingroup$
            No problem! Don't forget to mark my answer as correct if you got what you asked for.
            $endgroup$
            – Simon Larsson
            15 hours ago
















          • $begingroup$
            Thanks simon, so if I understand I should always take the biggest test score as the best model without getting any importance to training score?
            $endgroup$
            – amal amal
            15 hours ago










          • $begingroup$
            Yes, that is correct.
            $endgroup$
            – Simon Larsson
            15 hours ago










          • $begingroup$
            Thanks for your help
            $endgroup$
            – amal amal
            15 hours ago










          • $begingroup$
            No problem! Don't forget to mark my answer as correct if you got what you asked for.
            $endgroup$
            – Simon Larsson
            15 hours ago















          $begingroup$
          Thanks simon, so if I understand I should always take the biggest test score as the best model without getting any importance to training score?
          $endgroup$
          – amal amal
          15 hours ago




          $begingroup$
          Thanks simon, so if I understand I should always take the biggest test score as the best model without getting any importance to training score?
          $endgroup$
          – amal amal
          15 hours ago












          $begingroup$
          Yes, that is correct.
          $endgroup$
          – Simon Larsson
          15 hours ago




          $begingroup$
          Yes, that is correct.
          $endgroup$
          – Simon Larsson
          15 hours ago












          $begingroup$
          Thanks for your help
          $endgroup$
          – amal amal
          15 hours ago




          $begingroup$
          Thanks for your help
          $endgroup$
          – amal amal
          15 hours ago












          $begingroup$
          No problem! Don't forget to mark my answer as correct if you got what you asked for.
          $endgroup$
          – Simon Larsson
          15 hours ago




          $begingroup$
          No problem! Don't forget to mark my answer as correct if you got what you asked for.
          $endgroup$
          – Simon Larsson
          15 hours ago











          1












          $begingroup$

          Algo 2



          Between equal test scores choose the one with less difference between training and test scores (Algo 2), since the one with better training score (Algo 1) is more over-fitted. We tolerate a more over-fitted model only if it has a subjectively better test score.



          For a better justification, think of how we train a neural network. When validation score stops improving, we stop the training process even though training score will keep improving. If we let the training continue, the model will start making extra assumptions based on the training set that are not scrutinized by the critic (validation set) which makes the model more prone to building false assumptions about the data.



          By the same token, a model (Algo 1) that has the same performance based on the critic (test set) but performs better on training set is prone to have made untested assumptions about the data.






          share|improve this answer











          $endgroup$












          • $begingroup$
            How can you make these assumptions? Test score tells you the generalization ability of the algorithm regardless of the bias/variance. I feel like you can say nothing about which one will perform better on another test set.
            $endgroup$
            – Simon Larsson
            14 hours ago










          • $begingroup$
            Genuinely curious btw, incase you know something I have missed. :)
            $endgroup$
            – Simon Larsson
            14 hours ago










          • $begingroup$
            @SimonLarsson cool! I made some updates.
            $endgroup$
            – Esmailian
            14 hours ago










          • $begingroup$
            Thank you for replying! But what I would like to know is how you can assume that one will generalize better than the other on other data when the test score is the same? Just because you know that one model has learned some junk from the training set it does not say that the other model will have learned something useful in its' place.
            $endgroup$
            – Simon Larsson
            14 hours ago






          • 2




            $begingroup$
            @SimonLarsson I think fundamentally it's an Occam's Razor thing, with an assumption that the more-overfit model is "more complicated." In specific situations it's easier; e.g., if the data is time-dependent and the test set is out-of-time, then the train/test score discrepancy might indicate degradation over time, so that future performance may degrade faster in the more-overfit model.
            $endgroup$
            – Ben Reiniger
            13 hours ago















          1












          $begingroup$

          Algo 2



          Between equal test scores choose the one with less difference between training and test scores (Algo 2), since the one with better training score (Algo 1) is more over-fitted. We tolerate a more over-fitted model only if it has a subjectively better test score.



          For a better justification, think of how we train a neural network. When validation score stops improving, we stop the training process even though training score will keep improving. If we let the training continue, the model will start making extra assumptions based on the training set that are not scrutinized by the critic (validation set) which makes the model more prone to building false assumptions about the data.



          By the same token, a model (Algo 1) that has the same performance based on the critic (test set) but performs better on training set is prone to have made untested assumptions about the data.






          share|improve this answer











          $endgroup$












          • $begingroup$
            How can you make these assumptions? Test score tells you the generalization ability of the algorithm regardless of the bias/variance. I feel like you can say nothing about which one will perform better on another test set.
            $endgroup$
            – Simon Larsson
            14 hours ago










          • $begingroup$
            Genuinely curious btw, incase you know something I have missed. :)
            $endgroup$
            – Simon Larsson
            14 hours ago










          • $begingroup$
            @SimonLarsson cool! I made some updates.
            $endgroup$
            – Esmailian
            14 hours ago










          • $begingroup$
            Thank you for replying! But what I would like to know is how you can assume that one will generalize better than the other on other data when the test score is the same? Just because you know that one model has learned some junk from the training set it does not say that the other model will have learned something useful in its' place.
            $endgroup$
            – Simon Larsson
            14 hours ago






          • 2




            $begingroup$
            @SimonLarsson I think fundamentally it's an Occam's Razor thing, with an assumption that the more-overfit model is "more complicated." In specific situations it's easier; e.g., if the data is time-dependent and the test set is out-of-time, then the train/test score discrepancy might indicate degradation over time, so that future performance may degrade faster in the more-overfit model.
            $endgroup$
            – Ben Reiniger
            13 hours ago













          1












          1








          1





          $begingroup$

          Algo 2



          Between equal test scores choose the one with less difference between training and test scores (Algo 2), since the one with better training score (Algo 1) is more over-fitted. We tolerate a more over-fitted model only if it has a subjectively better test score.



          For a better justification, think of how we train a neural network. When validation score stops improving, we stop the training process even though training score will keep improving. If we let the training continue, the model will start making extra assumptions based on the training set that are not scrutinized by the critic (validation set) which makes the model more prone to building false assumptions about the data.



          By the same token, a model (Algo 1) that has the same performance based on the critic (test set) but performs better on training set is prone to have made untested assumptions about the data.






          share|improve this answer











          $endgroup$



          Algo 2



          Between equal test scores choose the one with less difference between training and test scores (Algo 2), since the one with better training score (Algo 1) is more over-fitted. We tolerate a more over-fitted model only if it has a subjectively better test score.



          For a better justification, think of how we train a neural network. When validation score stops improving, we stop the training process even though training score will keep improving. If we let the training continue, the model will start making extra assumptions based on the training set that are not scrutinized by the critic (validation set) which makes the model more prone to building false assumptions about the data.



          By the same token, a model (Algo 1) that has the same performance based on the critic (test set) but performs better on training set is prone to have made untested assumptions about the data.







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited 12 hours ago

























          answered 14 hours ago









          EsmailianEsmailian

          1,096112




          1,096112











          • How can you make these assumptions? The test score tells you the generalization ability of the algorithm regardless of the bias/variance. I feel like you can say nothing about which one will perform better on another test set.
            – Simon Larsson
            14 hours ago

          • Genuinely curious btw, in case you know something I have missed. :)
            – Simon Larsson
            14 hours ago

          • @SimonLarsson cool! I made some updates.
            – Esmailian
            14 hours ago

          • Thank you for replying! But what I would like to know is how you can assume that one will generalize better than the other on other data when the test score is the same. Just because you know that one model has learned some junk from the training set does not mean that the other model has learned something useful in its place.
            – Simon Larsson
            14 hours ago

          • @SimonLarsson I think fundamentally it's an Occam's Razor thing, with an assumption that the more-overfit model is "more complicated." In specific situations it's easier; e.g., if the data is time-dependent and the test set is out-of-time, then the train/test score discrepancy might indicate degradation over time, so that future performance may degrade faster in the more-overfit model.
            – Ben Reiniger
            13 hours ago




























          $begingroup$

          Based on this metric alone, you cannot tell which model is better, because AUC does not differentiate between these two results. You should use other metrics, such as Kappa, or some benchmarks.



          Disclaimer:



          If you are using Python, I suggest the PyCM module, which takes your confusion matrix as input and calculates about 100 overall and class-based metrics.



          To use this module, first prepare your confusion matrix and see its recommended parameters with the following code:



          >>> from pycm import *

          >>> cm = ConfusionMatrix(matrix={"0": {"0": 1, "1": 0, "2": 0}, "1": {"0": 0, "1": 1, "2": 2}, "2": {"0": 0, "1": 1, "2": 0}})

          >>> print(cm.recommended_list)
          ["Kappa", "SOA1(Landis & Koch)", "SOA2(Fleiss)", "SOA3(Altman)", "SOA4(Cicchetti)", "CEN", "MCEN", "MCC", "J", "Overall J", "Overall MCC", "Overall CEN", "Overall MCEN", "AUC", "AUCI", "G", "DP", "DPI", "GI"]


          and then look at the values of the metrics, focusing on the recommended ones, with the following code:



          >>> print(cm)
          Predict          0    1    2
          Actual
          0                1    0    0
          1                0    1    2
          2                0    1    0




          Overall Statistics :

          95% CI (-0.02941,0.82941)
          Bennett_S 0.1
          Chi-Squared 6.66667
          Chi-Squared DF 4
          Conditional Entropy 0.55098
          Cramer_V 0.8165
          Cross Entropy 1.52193
          Gwet_AC1 0.13043
          Joint Entropy 1.92193
          KL Divergence 0.15098
          Kappa 0.0625
          Kappa 95% CI (-0.60846,0.73346)
          Kappa No Prevalence -0.2
          Kappa Standard Error 0.34233
          Kappa Unbiased 0.03226
          Lambda A 0.5
          Lambda B 0.66667
          Mutual Information 0.97095
          Overall_ACC 0.4
          Overall_RACC 0.36
          Overall_RACCU 0.38
          PPV_Macro 0.5
          PPV_Micro 0.4
          Phi-Squared 1.33333
          Reference Entropy 1.37095
          Response Entropy 1.52193
          Scott_PI 0.03226
          Standard Error 0.21909
          Strength_Of_Agreement(Altman) Poor
          Strength_Of_Agreement(Cicchetti) Poor
          Strength_Of_Agreement(Fleiss) Poor
          Strength_Of_Agreement(Landis and Koch) Slight
          TPR_Macro 0.44444
          TPR_Micro 0.4

          Class Statistics :

          Classes 0 1 2
          ACC(Accuracy) 1.0 0.4 0.4
          BM(Informedness or bookmaker informedness) 1.0 -0.16667 -0.5
          DOR(Diagnostic odds ratio) None 0.5 0.0
          ERR(Error rate) 0.0 0.6 0.6
          F0.5(F0.5 score) 1.0 0.45455 0.0
          F1(F1 score - harmonic mean of precision and sensitivity) 1.0 0.4 0.0
          F2(F2 score) 1.0 0.35714 0.0
          FDR(False discovery rate) 0.0 0.5 1.0
          FN(False negative/miss/type 2 error) 0 2 1
          FNR(Miss rate or false negative rate) 0.0 0.66667 1.0
          FOR(False omission rate) 0.0 0.66667 0.33333
          FP(False positive/type 1 error/false alarm) 0 1 2
          FPR(Fall-out or false positive rate) 0.0 0.5 0.5
          G(G-measure geometric mean of precision and sensitivity) 1.0 0.40825 0.0
          LR+(Positive likelihood ratio) None 0.66667 0.0
          LR-(Negative likelihood ratio) 0.0 1.33333 2.0
          MCC(Matthews correlation coefficient) 1.0 -0.16667 -0.40825
          MK(Markedness) 1.0 -0.16667 -0.33333
          N(Condition negative) 4 2 4
          NPV(Negative predictive value) 1.0 0.33333 0.66667
          P(Condition positive) 1 3 1
          POP(Population) 5 5 5
          PPV(Precision or positive predictive value) 1.0 0.5 0.0
          PRE(Prevalence) 0.2 0.6 0.2
          RACC(Random accuracy) 0.04 0.24 0.08
          RACCU(Random accuracy unbiased) 0.04 0.25 0.09
          TN(True negative/correct rejection) 4 1 2
          TNR(Specificity or true negative rate) 1.0 0.5 0.5
          TON(Test outcome negative) 4 3 3
          TOP(Test outcome positive) 1 2 2
          TP(True positive/hit) 1 1 0
          TPR(Sensitivity, recall, hit rate, or true positive rate) 1.0 0.33333 0.0
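
          To make the Kappa value in the listing above less of a black box, here is a small sketch that recomputes Cohen's kappa by hand from the same 3x3 confusion matrix, without PyCM. The function name `cohen_kappa` is only illustrative; it uses the standard definition (observed agreement versus chance agreement), with rows as actual classes and columns as predicted classes.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix.

    matrix[i][j] = number of samples of actual class i predicted as class j.
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(matrix[i][i] for i in range(k)) / n
    # Chance agreement from the row (actual) and column (predicted) totals.
    actual_totals = [sum(row) for row in matrix]
    predicted_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    expected = sum(a * p for a, p in zip(actual_totals, predicted_totals)) / (n * n)
    return (observed - expected) / (1 - expected)


# Same matrix as in the PyCM example above.
matrix = [[1, 0, 0], [0, 1, 2], [0, 1, 0]]
kappa = cohen_kappa(matrix)
print(round(kappa, 5))  # 0.0625
```

          Here the observed agreement is 0.4 and the chance agreement is 0.36, which correspond to the Overall_ACC and Overall_RACC rows in the listing, and the result agrees with the reported Kappa of 0.0625.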





          $endgroup$
















          – alireza zolanvari
          answered 14 hours ago, edited 12 hours ago







          • You should mention that you are an author of the package. (datascience.stackexchange.com/help/behavior)
            – Ben Reiniger
            12 hours ago

          • Thanks for your reminder. I just edited my answer.
            – alireza zolanvari
            12 hours ago

          • @alirezazolanvari In my opinion, changing the measure does not solve the underlying problem. First, the choice of measure depends on the task too; we cannot pick and choose independently. More importantly, this problem can happen for any other measure (e.g. Kappa) too, so the solution is not to simply change the measure.
            – Esmailian
            10 hours ago

          • @Esmailian Obviously the evaluation metric is directly related to the task, but research on finding proper metrics for evaluating a learning algorithm has focused on clarifying the difference in performance between algorithms in cases where simple metrics such as AUC cannot say which one is better. Overall, many other things should be considered to answer this question. This answer is not a golden key for this problem, but it can help to solve it.
            – alireza zolanvari
            9 hours ago






















































































































































