The gmmevents function has three primary inputs: time, identity, and location:
The time must be a number representing the time stamp of each observation. This can be the number of seconds since the start of the day, the number of seconds since the start of the study, the hour of the day, the Julian date, etc. The time stamps should represent a meaningful scale given the group membership definition - for example, if an edge is the propensity to observe individuals at the same location on the same day, then the time stamps should be the day values. In the example below, the time stamps are in seconds, because flocks of birds visit feeders over a matter of minutes and the group definition is being in the same flock (and these events occur over seconds to minutes). The input must consist of numeric whole numbers.
The identity is the unique identifier for each individual. This should be consistent across all of the data sets. In the example here, PIT tags are given, but in broader analyses we would convert these to ring (band) numbers, because individuals can be given different PIT tag numbers over the course of the study but never change ring numbers. The function will accept any string or numeric input.
The location is where the observation took place. This should reflect meaningful observation locations for the study. The function will accept any string or numeric input.
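The three inputs above can be passed as simple vectors. Below is a minimal sketch of such a call, assuming the asnipe package is installed; the data values, identities, and feeder names are purely illustrative, and the elements extracted from the result follow the structure described in the asnipe documentation.

```r
# Minimal sketch of a gmmevents call (asnipe package assumed installed;
# all data values below are hypothetical).
library(asnipe)

time     <- c(12, 15, 16, 240, 242, 250)          # seconds since start of day
identity <- c("A1", "B2", "A1", "B2", "C3", "C3") # e.g. PIT tag codes
location <- c("feeder1", "feeder1", "feeder1",
              "feeder2", "feeder2", "feeder2")

events <- gmmevents(time = time,
                    identity = identity,
                    location = location)

# The returned list includes a group-by-individual matrix (gbi) and
# metadata describing each detected gathering event.
gbi      <- events$gbi
metadata <- events$metadata
```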
If the analysis is being conducted as part of a broader analysis of the same population, it can be useful to get the results in a consistent form each time. In that case, the global_ids variable can be used to maintain consistency across runs, regardless of which individuals were identified in the current input data. That is, the group by individual (gbi) matrix will include a column for every individual provided in global_ids.
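A sketch of how global_ids might be used, again assuming the asnipe package; the identifiers and observation values are hypothetical:

```r
# Keep gbi columns consistent across repeated analyses by passing the
# full set of known individuals (IDs here are hypothetical).
library(asnipe)

all_ids  <- c("A1", "B2", "C3", "D4", "E5")

time     <- c(12, 15, 16)
identity <- c("A1", "B2", "A1")
location <- c("feeder1", "feeder1", "feeder1")

events <- gmmevents(time = time,
                    identity = identity,
                    location = location,
                    global_ids = all_ids)

# The gbi matrix now has one column per ID in all_ids, including
# individuals not observed in this particular data set.
ncol(events$gbi)  # equals length(all_ids)
```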
Further notes on usage:
The gmmevents function requires a few careful considerations. First, memory use scales with the square of the amount of data, so a location with many observations can run out of memory. With 16 GB of RAM, up to roughly 10,000 observations per location (per day - see the next point) seems to be a safe limit.
The input data provided for each location should take into account any artificial gaps in the observation stream. For example, if there are gaps in data collection at a given location, then the location information provided to the gmmevents function should be split into two 'locations' to represent each continuous set of observations. In the PIT tag data set provided, there are 8 days of sampling. Providing gmmevents with only the location data from the original data will cause the gaps between days to override any gaps between groups (or flocks) within a given day. To overcome this, instead of providing the gmmevents function with just the location, it is important to provide a location-by-day variable. This variable is then returned in the metadata, and the original information can be extracted again (using strsplit - see the example below).
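The location-by-day approach can be sketched in base R as follows; the variable names and separator character are illustrative choices, not requirements of the function:

```r
# Combine location and day into a single label so that each day at each
# location is treated as its own continuous observation stream
# (variable names and "_" separator are illustrative).
location <- c("feeder1", "feeder1", "feeder1", "feeder1")
day      <- c(1, 1, 2, 2)
location_day <- paste(location, day, sep = "_")
# -> "feeder1_1" "feeder1_1" "feeder1_2" "feeder1_2"

# After running gmmevents with location_day, the two parts can be
# recovered from the metadata with strsplit:
parts     <- strsplit(location_day, "_")
locations <- sapply(parts, "[", 1)
days      <- sapply(parts, "[", 2)
```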
Finally, I have included a new argument called splitGroups. The original function (in both Matlab and R) would return the occasional group that overlapped other groups. This occurred when a small group was extracted from the data, and the remaining observations from the same location then formed a larger group that spanned the smaller one. For example, say detections of individuals are made at the same location at 2, 8, 10, 11, 12, 14, and 20 seconds, and the first group extracted contains 10, 11, 12; the remaining data (2, 8, 14, 20) then look like one evenly-spread group. Setting splitGroups=TRUE identifies such incidences and would split the data into three groups: (2, 8), (10, 11, 12), and (14, 20).
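Using the time stamps from the example above, enabling the check looks like this; the identities are hypothetical, and the exact groups returned depend on the fitted mixture:

```r
# Sketch: enabling the overlap check via splitGroups
# (asnipe package assumed; identities are hypothetical).
library(asnipe)

time     <- c(2, 8, 10, 11, 12, 14, 20)
identity <- c("A", "B", "A", "C", "B", "A", "C")
location <- rep("site1", 7)

events <- gmmevents(time = time,
                    identity = identity,
                    location = location,
                    splitGroups = TRUE)
```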