﻿ r - dplyr - R: How Do you calculate (percent of) matching observations? - answerstu - answerstu.com answerstu

# dplyr - R: How Do you calculate (percent of) matching observations?

So my data set looks like this and the alphabets are factor variables. Pred columns are for predicted observations for each ID and Real columns are the real observations. I want to calculate the overall accuracy of the predicted values for each ID.

``````ID  Pred1 Pred2 Pred3 Real1 Real2 Real3
1    A      C     E      A     D    B
2    A      B     D      E     C    C
3    E      C     A      A     B    D
4    D      A     B      B     B    D
5    B      A     C      C     A    B
``````

So I want to mutate a column called 'score' which gives you a percentage of the number of matched observations between Pred1,2,3 and Real1,2,3 columns. I only care about finding any 'Pred' values in any of 'Real' columns. If Pred1 is found in one of Real1,Real2 and Real3, then I give a score of 1/3. If Pred1 AND Pred2 (not Pred3) are both found in any of Real1,Real2 and Real3 columns, (*the order does NOT matter. Pred1 can be found in Real2 or Real3- just anywhere in 'Real' columns), then I give a score of 2/3. I hope it makes sense. The order does not matter and I only care about finding any of 'Pred' values in any of 'Real' columns. So I want something like below.

``````ID  Pred1 Pred2 Pred3 Real1 Real2 Real3  Score
1    A      C     E      A     D    B     1/3
2    A      B     D      E     C    C      0
3    E      C     A      A     B    D     1/3
4    D      A     B      B     E    D     2/3
5    B      A     C      C     A    B      1
``````

I am trying to write a function and tried something like ifelse("Pred1" %in% c("Real1","Real2","Real3") , 1/3 ,0 )) but it didn't work well.. (had error messages with coercing to logical etc which I didn't know how to solve) So I am trying different things too but keep getting stuck with errors... Can anyone help please? Thank you in advance!

1. 2019-11-14

It makes it difficult to compare values with different factor levels. We can first convert the columns from factors to characters.

``````df[-1] <- lapply(df[-1], as.character)
``````

Find out index of Predicted and Real columns and then for every row check how many of Predicted observations are present in Real ones.

``````pred_cols <- grep("^Pred", names(df))
real_cols <- grep("^Real", names(df))

df\$Score <- sapply(1:nrow(df), function(x)
sum(df[x, pred_cols] %in% df[x, real_cols]))/length(pred_cols)

df
#  ID Pred1 Pred2 Pred3 Real1 Real2 Real3 Score
#1  1     A     C     E     A     D     B  0.33
#2  2     A     B     D     E     C     C  0.00
#3  3     E     C     A     A     B     D  0.33
#4  4     D     A     B     B     B     D  0.67
#5  5     B     A     C     C     A     B  1.00
``````

As it is row-wise comparison we can also use `apply` with `MARGIN = 1` using the same logic. With this approach we don't need to explicitly convert the columns into characters.

``````apply(df, 1, function(x) sum(x[pred_cols] %in% x[real_cols]))/length(pred_cols)
``````
2. 2019-11-14

One `tidyverse` possibility could be:

``````bind_cols(df %>%
gather(var, val, -matches("(Real|ID)")) %>%
select(ID, val), df %>%
gather(var2, val2, -matches("(Pred|ID)")) %>%
select(val2)) %>%
group_by(ID) %>%
summarise(res = paste0(sum(val %in% val2), "/3")) %>%
left_join(df, by = c("ID" = "ID"))

ID res   Pred1 Pred2 Pred3 Real1 Real2 Real3
<int> <chr> <fct> <fct> <fct> <fct> <fct> <fct>
1     1 1/3   A     C     E     A     D     B
2     2 0/3   A     B     D     E     C     C
3     3 1/3   E     C     A     A     B     D
4     4 2/3   D     A     B     B     B     D
5     5 3/3   B     A     C     C     A     B
``````

It first, separately, transforms from wide to long format the columns that contains `Pred` and `Real`. Second, it combines the two by columns. Finally, it groups by "ID", sums the number of matching cases and joins it with the original df.

Or if the number of pairs is not fixed to 3:

``````bind_cols(df %>%
gather(var, val, -matches("(Real|ID)")) %>%
select(ID, val), df %>%
gather(var2, val2, -matches("(Pred|ID)")) %>%
select(val2)) %>%
group_by(ID) %>%
summarise(res = paste(sum(val %in% val2), first(n), sep = "/")) %>%
left_join(df, by = c("ID" = "ID"))
``````

Or if you want a numeric variable as the result:

``````bind_cols(df %>%
gather(var, val, -matches("(Real|ID)")) %>%
select(ID, val), df %>%
gather(var2, val2, -matches("(Pred|ID)")) %>%
select(val2)) %>%