r - dplyr - R: How Do you calculate (percent of) matching observations? - answerstu - answerstu.com answerstu

dplyr - R: How Do you calculate (percent of) matching observations?

So my data set looks like this and the alphabets are factor variables. Pred columns are for predicted observations for each ID and Real columns are the real observations. I want to calculate the overall accuracy of the predicted values for each ID.

ID  Pred1 Pred2 Pred3 Real1 Real2 Real3
1    A      C     E      A     D    B
2    A      B     D      E     C    C
3    E      C     A      A     B    D
4    D      A     B      B     B    D
5    B      A     C      C     A    B

So I want to mutate a column called 'score' which gives you a percentage of the number of matched observations between Pred1,2,3 and Real1,2,3 columns. I only care about finding any 'Pred' values in any of 'Real' columns. If Pred1 is found in one of Real1,Real2 and Real3, then I give a score of 1/3. If Pred1 AND Pred2 (not Pred3) are both found in any of Real1,Real2 and Real3 columns, (*the order does NOT matter. Pred1 can be found in Real2 or Real3- just anywhere in 'Real' columns), then I give a score of 2/3. I hope it makes sense. The order does not matter and I only care about finding any of 'Pred' values in any of 'Real' columns. So I want something like below.

ID  Pred1 Pred2 Pred3 Real1 Real2 Real3  Score
1    A      C     E      A     D    B     1/3
2    A      B     D      E     C    C      0
3    E      C     A      A     B    D     1/3
4    D      A     B      B     E    D     2/3
5    B      A     C      C     A    B      1

I am trying to write a function and tried something like ifelse("Pred1" %in% c("Real1","Real2","Real3") , 1/3 ,0 )) but it didn't work well.. (had error messages with coercing to logical etc which I didn't know how to solve) So I am trying different things too but keep getting stuck with errors... Can anyone help please? Thank you in advance!

2 Answers

  1. Edwin- Reply

    2019-11-14

    It makes it difficult to compare values with different factor levels. We can first convert the columns from factors to characters.

    df[-1] <- lapply(df[-1], as.character)
    

    Find out index of Predicted and Real columns and then for every row check how many of Predicted observations are present in Real ones.

    pred_cols <- grep("^Pred", names(df))
    real_cols <- grep("^Real", names(df))
    
    df$Score <- sapply(1:nrow(df), function(x) 
                sum(df[x, pred_cols] %in% df[x, real_cols]))/length(pred_cols)
    
    df
    #  ID Pred1 Pred2 Pred3 Real1 Real2 Real3 Score
    #1  1     A     C     E     A     D     B  0.33
    #2  2     A     B     D     E     C     C  0.00
    #3  3     E     C     A     A     B     D  0.33
    #4  4     D     A     B     B     B     D  0.67
    #5  5     B     A     C     C     A     B  1.00
    

    As it is row-wise comparison we can also use apply with MARGIN = 1 using the same logic. With this approach we don't need to explicitly convert the columns into characters.

    apply(df, 1, function(x) sum(x[pred_cols] %in% x[real_cols]))/length(pred_cols)
    
  2. Elliott- Reply

    2019-11-14

    One tidyverse possibility could be:

    bind_cols(df %>%
     gather(var, val, -matches("(Real|ID)")) %>%
     select(ID, val), df %>%
     gather(var2, val2, -matches("(Pred|ID)")) %>%
     select(val2)) %>%
     group_by(ID) %>%
     summarise(res = paste0(sum(val %in% val2), "/3")) %>%
     left_join(df, by = c("ID" = "ID"))
    
         ID res   Pred1 Pred2 Pred3 Real1 Real2 Real3
      <int> <chr> <fct> <fct> <fct> <fct> <fct> <fct>
    1     1 1/3   A     C     E     A     D     B    
    2     2 0/3   A     B     D     E     C     C    
    3     3 1/3   E     C     A     A     B     D    
    4     4 2/3   D     A     B     B     B     D    
    5     5 3/3   B     A     C     C     A     B 
    

    It first, separately, transforms from wide to long format the columns that contains Pred and Real. Second, it combines the two by columns. Finally, it groups by "ID", sums the number of matching cases and joins it with the original df.

    Or if the number of pairs is not fixed to 3:

    bind_cols(df %>%
     gather(var, val, -matches("(Real|ID)")) %>%
     select(ID, val), df %>%
     gather(var2, val2, -matches("(Pred|ID)")) %>%
     select(val2)) %>%
     add_count(ID) %>%
     group_by(ID) %>%
     summarise(res = paste(sum(val %in% val2), first(n), sep = "/")) %>%
     left_join(df, by = c("ID" = "ID"))
    

    Or if you want a numeric variable as the result:

    bind_cols(df %>%
               gather(var, val, -matches("(Real|ID)")) %>%
               select(ID, val), df %>%
               gather(var2, val2, -matches("(Pred|ID)")) %>%
               select(val2)) %>%
     add_count(ID) %>%
     group_by(ID) %>%
     summarise(res = sum(val %in% val2)/first(n)) %>%
     left_join(df, by = c("ID" = "ID"))
    
         ID   res Pred1 Pred2 Pred3 Real1 Real2 Real3
      <int> <dbl> <fct> <fct> <fct> <fct> <fct> <fct>
    1     1 0.333 A     C     E     A     D     B    
    2     2 0     A     B     D     E     C     C    
    3     3 0.333 E     C     A     A     B     D    
    4     4 0.667 D     A     B     B     B     D    
    5     5 1     B     A     C     C     A     B 
    

Leave a Reply

Your email address will not be published. Required fields are marked *

You can use these HTML tags and attributes <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>