r - Efficiently select subset of rows by rank


How can I use data.table to select a subset of rows by rank? I have a large data set, and I hope to do this efficiently.

> dt <- data.table(id=1:200, category=sample(letters, 200, replace=TRUE))
> dt[, count:=length(id), by=category]
> dt
      id category count
  1:   1        o    13
  2:   2        o    13
 ---
199: 170        n     3
200: 171        h     3

What I want to do is efficiently change the category to 'other' for every category that is not among the k most common ones. Something along the lines of:

dt[rank > 5,category:="other", by=category] 

I'm new to data.table, and I'm not quite sure how to do the ranking in an efficient way. Here's a way that works, but it seems clunky.

counts <- unique(dt$count)
decision <- max(counts[rank(-counts)>5])
dt[count<=decision, category:='other']

I would appreciate any advice. To be honest, I don't need the 'count' column if it's not necessary.
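To make the goal concrete, here is a minimal sketch of the kind of result I'm after, without the helper 'count' column. It is only one possible approach, not necessarily the most efficient one. The names `k` and `top` and the fixed seed are assumptions added for illustration. Unlike the `rank()` version above, this breaks ties at the cutoff by row order in the counts table, so the two can disagree when several categories share the same count.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed so the example is reproducible
dt <- data.table(id = 1:200, category = sample(letters, 200, replace = TRUE))

k <- 5  # assumption: number of most common categories to keep

# Count rows per category (.N), then sort by descending count
counts <- dt[, .N, by = category][order(-N)]

# Names of the k most common categories (hypothetical helper vector)
top <- head(counts$category, k)

# Relabel every other category in place; no 'count' column is created
dt[!category %in% top, category := "other"]
```

After this runs, `dt[, .N, by = category]` should show at most `k + 1` distinct categories: the `k` kept ones plus 'other'.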

