Match the closest time stamp within a group in R


Assume I have two datasets, both sharing a common variable, location. Dataset A has timestamps at second precision, while Dataset B has timestamps at millisecond precision. Is there an efficient way to match the two datasets by time (e.g. get the most recent weather reading for Dataset A) for each location, in R or Python?

I'd really appreciate any thoughts or input.

Example for Dataset A

Location Date Time # items
New York 2019-01-01 09:00:00 50
New York 2019-01-01 09:15:28 10
New York 2019-01-01 09:16:16 69
New York 2019-01-01 10:09:00 47
New York 2019-01-11 19:34:30 777
New York 2019-01-11 22:10:15 276
Miami 2019-01-01 09:00:01 100
Miami 2019-01-01 16:07:09 145
Miami 2019-01-01 20:05:01 56
Boston 2020-12-21 23:09:02 78

Example for Dataset B:

Location Date Time Weather
New York 2019-01-01 05:56:09.456 Rain
New York 2019-01-01 08:59:23.897 Sunny
New York 2019-01-01 09:14:35.897 Cloudy
Boston 2020-12-31 23:25:09.987 Snow

The ideal output would be:

Location Date Time # items Weather Time Weather
New York 2019-01-01 09:00:00 50 08:59:23.897 Sunny
New York 2019-01-01 09:15:28 10 09:14:35.897 Cloudy
New York 2019-01-01 09:16:16 69 09:14:35.897 Cloudy

If I understand correctly, dataset A should be completed by the most recent available weather data for Location from dataset B.

This can be achieved by a rolling join and an update by reference:

library(data.table)
setDT(A)[, dttm := lubridate::ymd_hms(paste(Date, Time))]
setDT(B)[, dttm := lubridate::ymd_hms(paste(Date, Time))]
A[, c("WeatherTime", "Weather") := 
    B[A, on = c("Location", "dttm"), roll = Inf, .(x.dttm, x.Weather)]][]
    Location       Date     Time # items                dttm         WeatherTime Weather
 1: New York 2019-01-01 09:00:00      50 2019-01-01 09:00:00 2019-01-01 08:59:23   Sunny
 2: New York 2019-01-01 09:15:28      10 2019-01-01 09:15:28 2019-01-01 09:14:35  Cloudy
 3: New York 2019-01-01 09:16:16      69 2019-01-01 09:16:16 2019-01-01 09:14:35  Cloudy
 4: New York 2019-01-01 10:09:00      47 2019-01-01 10:09:00 2019-01-01 09:14:35  Cloudy
 5: New York 2019-01-11 19:34:30     777 2019-01-11 19:34:30 2019-01-01 09:14:35  Cloudy
 6: New York 2019-01-11 22:10:15     276 2019-01-11 22:10:15 2019-01-01 09:14:35  Cloudy
 7:    Miami 2019-01-01 09:00:01     100 2019-01-01 09:00:01                <NA>    <NA>
 8:    Miami 2019-01-01 16:07:09     145 2019-01-01 16:07:09                <NA>    <NA>
 9:    Miami 2019-01-01 20:05:01      56 2019-01-01 20:05:01                <NA>    <NA>
10:   Boston 2020-12-21 23:09:02      78 2020-12-21 23:09:02                <NA>    <NA>

Note that weather data is missing for Miami. The Boston weather reading in the sample data is dated ten days *after* the Boston row in A, so the backward-looking rolling join finds no match there either.

Data

A <- structure(list(Location = c("New York", "New York", "New York", 
"New York", "New York", "New York", "Miami", "Miami", "Miami", 
"Boston"), Date = structure(c(17897L, 17897L, 17897L, 17897L, 
17907L, 17907L, 17897L, 17897L, 17897L, 18617L), class = c("IDate", 
"Date")), Time = c("09:00:00", "09:15:28", "09:16:16", "10:09:00", 
"19:34:30", "22:10:15", "09:00:01", "16:07:09", "20:05:01", "23:09:02"
), `# items` = c(50L, 10L, 69L, 47L, 777L, 276L, 100L, 145L, 
56L, 78L)), row.names = c(NA, -10L), class = "data.frame")

B <- structure(list(Location = c("New York", "New York", "New York", 
"Boston"), Date = structure(c(17897L, 17897L, 17897L, 18627L), class = c("IDate", 
"Date")), Time = c("05:56:09.456", "08:59:23.897", "09:14:35.897", 
"23:25:09.987"), Weather = c("Rain", "Sunny", "Cloudy", "Snow"
)), row.names = c(NA, -4L), class = "data.frame")

Explanation

  • Date and Time are combined into a continuous POSIXct datetime to join on. This will avoid gaps caused by day change.

  • The rolling join

    B[A, on = c("Location", "dttm"), roll = Inf, .(x.dttm, x.Weather)]  
    

    returns

                     x.dttm x.Weather
     1: 2019-01-01 08:59:23     Sunny
     2: 2019-01-01 09:14:35    Cloudy
     3: 2019-01-01 09:14:35    Cloudy
     4: 2019-01-01 09:14:35    Cloudy
     5: 2019-01-01 09:14:35    Cloudy
     6: 2019-01-01 09:14:35    Cloudy
     7:                <NA>      <NA>
     8:                <NA>      <NA>
     9:                <NA>      <NA>
    10:                <NA>      <NA>
    
  • The update by reference (c("WeatherTime", "Weather") := ...) appends two new columns to dataset A without copying the whole object. This might help to mitigate resource limitations.
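Since the question asks for the *closest* timestamp rather than the most recent one, it may be worth noting that data.table's rolling joins also accept roll = "nearest", which matches in either direction. A minimal sketch with cut-down stand-ins for A and B (toy data, not the full tables above):

```r
library(data.table)

# toy versions of A and B, reduced to the columns the join needs
A <- data.table(Location = c("New York", "Boston"),
                dttm = as.POSIXct(c("2019-01-01 09:00:00",
                                    "2020-12-21 23:09:02"), tz = "UTC"))
B <- data.table(Location = c("New York", "Boston"),
                dttm = as.POSIXct(c("2019-01-01 08:59:23",
                                    "2020-12-31 23:25:09"), tz = "UTC"),
                Weather = c("Sunny", "Snow"))

# roll = "nearest" picks the closest reading in either direction,
# so Boston now matches the observation dated ten days later
A[, c("WeatherTime", "Weather") :=
    B[A, on = c("Location", "dttm"), roll = "nearest", .(x.dttm, x.Weather)]]
A
```

Whether "nearest" or "last observation carried forward" (roll = Inf) is appropriate depends on whether a later weather reading is a valid stand-in for an earlier event.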

Here’s a brute-force approach that might work acceptably if your data doesn’t have a huge number of Location-Date matches.

library(dplyr); library(lubridate)

# add timestamp to both
Data_A <- Data_A %>% mutate(timestamp = ymd_hms(paste(Date, Time)))
Data_B <- Data_B %>% mutate(timestamp = ymd_hms(paste(Date, Time)))

# join the two tables
Data_A %>%
  left_join(Data_B, by = c("Location", "Date")) %>%

  # compute time differences and keep the closest match
  # for each Location/timestamp in Data_A
  mutate(time_diff = abs(timestamp.x - timestamp.y)) %>%
  group_by(Location, timestamp.x) %>%
  arrange(time_diff) %>%
  slice(1) %>%
  ungroup()
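Because this joins on both Location and Date, readings from a different calendar day are never considered (the Boston case). A hedged variant joining on Location only, shown here with toy stand-in data; this builds every within-location pair, so it is only feasible for small tables:

```r
library(dplyr)
library(lubridate)

# toy stand-ins for Data_A / Data_B (hypothetical values)
Data_A <- tibble(Location = c("New York", "Boston"),
                 Date = c("2019-01-01", "2020-12-21"),
                 Time = c("09:00:00", "23:09:02"))
Data_B <- tibble(Location = c("New York", "Boston"),
                 Date = c("2019-01-01", "2020-12-31"),
                 Time = c("08:59:23.897", "23:25:09.987"),
                 Weather = c("Sunny", "Snow"))

Data_A <- Data_A %>% mutate(timestamp = ymd_hms(paste(Date, Time)))
Data_B <- Data_B %>% mutate(timestamp = ymd_hms(paste(Date, Time)))

# joining by Location alone also considers readings on other days,
# at the cost of a full cartesian join within each location
nearest_by_location <- Data_A %>%
  left_join(Data_B, by = "Location") %>%
  mutate(time_diff = abs(timestamp.x - timestamp.y)) %>%
  group_by(Location, timestamp.x) %>%
  slice_min(time_diff, n = 1, with_ties = FALSE) %>%
  ungroup()
nearest_by_location
```

With this variant Boston picks up the Snow reading from ten days later, which may or may not be desirable; restrict time_diff with a filter if a maximum tolerance is needed.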