expandCounts: Repeat individual fish data (including lengths) from tallied counts.

Description

Repeat individual fish data, including lengths, from tallied counts and, optionally, add a random digit to length measurements to simulate the actual length of each fish in the bin. This is useful as a precursor to summaries that require information on individual fish, such as lengths (e.g., length frequency histograms, mean lengths).

Usage

expandCounts(
  data,
  cform,
  lform = NULL,
  removeCount = TRUE,
  lprec = 0.1,
  new.name = "newlen",
  cwid = 0,
  verbose = TRUE,
  ...
)

Value

A data.frame of the same structure as data except that the variable in cform may be deleted and the variable in new.name may be added. The returned data.frame will have more rows than data because of the potential addition of new individuals expanded from the counts in cform.

Arguments

data: A data.frame that contains variables in cform and lform.
cform: A formula of the form ~countvar where countvar generically represents the variable in data that contains the counts of individuals. See details.
lform: An optional formula of the form ~lowerbin+upperbin where lowerbin and upperbin generically represent the variables in data that identify the lower- and upper-values of the length bins. See details.
removeCount: A single logical that indicates if the variable that contains the counts of individuals (as given in cform) should be removed from the returned data.frame. The default is TRUE, so the variable is removed: the returned data.frame contains individuals, and the count of individuals in a tallied bin is not relevant to an individual.
lprec: A single numeric that controls the precision to which the random lengths are recorded. See details.
new.name: A single string that contains a name for the new length variable if random lengths are to be created.
cwid: A single positive numeric that will be added to the lower length bin value in instances where the count exceeds one but only a lower (and not an upper) length was recorded. See details.
verbose: A logical indicating whether progress messages should be printed or not.
...: Not yet implemented.

Author

Derek H. Ogle, DerekOgle51@gmail.com

Details

Fisheries data may be recorded as tallied counts in the field. For example, field biologists may have simply recorded that there were 10 fish in one group, 15 in another, etc. More specifically, the biologist may have recorded that there were 10 male Bluegill from the first sampling event between 100 and 124 mm, 15 male Bluegill from the first sampling event between 125 and 149 mm, and so on. At times, it may be necessary to expand these counts such that the repeated information appears in individual rows in a new data.frame. In this specific example, the tallied counts would be repeated such that the male, Bluegill, first sampling event, 100-124 mm information would be repeated 10 times; the male, Bluegill, first sampling event, 125-149 mm information would be repeated 15 times, and so on. This function facilitates this type of expansion.
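The row repetition described above can be sketched in base R. This is only an illustration of the idea, not the internals of expandCounts; the data.frame tally and its column names are hypothetical and chosen to mirror the Bluegill example.

```r
# Hypothetical tallied data: one row per group, with a count of fish
tally <- data.frame(species = c("Bluegill", "Bluegill"),
                    sex     = c("male", "male"),
                    lwr     = c(100, 125),
                    upr     = c(124, 149),
                    count   = c(10, 15))

# Repeat each row index as many times as its count, then select those rows
idx <- rep(seq_len(nrow(tally)), times = tally$count)

# Drop the count column, since it no longer describes an individual fish
indiv <- tally[idx, names(tally) != "count"]
rownames(indiv) <- NULL

# The number of rows should equal the total count
nrow(indiv) == sum(tally$count)
```

Each row of indiv now represents one fish, which is the shape that per-individual summaries expect.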

Length data have often been collected in a “binned-and-tallied” format (e.g., 10 fish in the 100-124 mm group, 15 in the 125-149 mm group, etc.). This type of data collection does not facilitate easy or precise calculations of summary statistics of length (e.g., mean and standard deviation of length). Expanding the data as described above does not solve this problem because the length data are still essentially categorical (i.e., which group the fish belongs to rather than what its actual length is). To facilitate computation of summary statistics, the data can be expanded as described above and then a length can be randomly selected from within the recorded length bin to serve as a “measured” length for that fish. This function performs this type of expansion by randomly selecting the length from a uniform distribution within the length bin (e.g., each value between 100 and 124 mm has the same probability of being selected).
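The idea of drawing a length uniformly from within a bin can be sketched with base R's runif. This is a minimal illustration under stated assumptions, not the function's actual code: the vectors lwr and upr, the precision prec, and the choice to round down are all hypothetical, and how expandCounts treats lprec at bin edges is not shown here.

```r
set.seed(1)  # only so the illustration is reproducible

# Hypothetical bins for three fish that have already been expanded
lwr  <- c(100, 100, 125)
upr  <- c(124, 124, 149)
prec <- 1    # assumed precision: record lengths to the nearest 1 mm

# Draw one value per fish, uniformly between its own lower and upper bounds
raw <- runif(length(lwr), min = lwr, max = upr)

# Round down to the chosen precision; because each lower bound here is a
# multiple of prec, the result cannot fall below the bin
newlen <- floor(raw / prec) * prec

# Every simulated length stays within its bin
all(newlen >= lwr & newlen <= upr)
```

Because the draws are random, summaries computed from such lengths will vary slightly from run to run unless a seed is set.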

This function makes some assumptions for some coding situations. First, it assumes that each lowerbin value is no greater than the upperbin value in the same row. The function will throw an error if this is not true. Second, it assumes that if a lowerbin but no upperbin value is given then the lowerbin value is the exact measurement for those fish. Third, it assumes that if an upperbin but no lowerbin value is given then this is a data entry error and the upperbin value should be the lowerbin value. Fourth, it assumes that it is a data entry error if countvar is zero or NA and lowerbin or upperbin contains values (i.e., why would there be lengths if no fish were captured?).
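To see which rows of a data set fall under each of these situations before calling the function, a few logical vectors in base R are enough. The sketch below uses a hypothetical data.frame d with made-up column names; it only flags rows and does not reproduce how expandCounts itself handles them.

```r
# Hypothetical data with one row for each special situation
d <- data.frame(lwr   = c(15,   16, NA,   17),
                upr   = c(15.5, NA, 16.5, 17.5),
                count = c(6,    1,  2,    0))

# First: both bounds present but the lower value exceeds the upper value
bad_order  <- !is.na(d$lwr) & !is.na(d$upr) & d$lwr > d$upr

# Second: lower value only (treated as an exact measurement)
exact_len  <- !is.na(d$lwr) & is.na(d$upr)

# Third: upper value only (treated as a misplaced lower value)
upper_only <- is.na(d$lwr) & !is.na(d$upr)

# Fourth: a length is recorded but the count is zero or NA
no_fish    <- (is.na(d$count) | d$count == 0) &
              (!is.na(d$lwr) | !is.na(d$upr))

# Row numbers flagged under each situation
which(bad_order)
which(exact_len)
which(upper_only)
which(no_fish)
```

Checking these flags first makes it easier to correct data entry problems at the source rather than relying on the function's assumptions.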

Examples

# all need expansion
( d1 <- data.frame(name=c("Johnson","Johnson","Jones","Frank","Frank","Max"),
                   lwr.bin=c(15,15.5,16,16,17,17),
                   upr.bin=c(15.5,16,16.5,16.5,17.5,17.5),
                   freq=c(6,4,2,3,1,1)) )
expandCounts(d1,~freq)
expandCounts(d1,~freq,~lwr.bin+upr.bin)

# some need expansion
( d2 <- data.frame(name=c("Johnson","Johnson","Jones","Frank","Frank","Max"),
                   lwr.bin=c(15,15.5,16,16,17.1,17.3),
                   upr.bin=c(15.5,16,16.5,16.5,17.1,17.3),
                   freq=c(6,4,2,3,1,1)) )
expandCounts(d2,~freq)
expandCounts(d2,~freq,~lwr.bin+upr.bin)

# none need expansion
( d3 <- data.frame(name=c("Johnson","Johnson","Jones","Frank","Frank","Max"),
                   lwr.bin=c(15,15.5,16,16,17.1,17.3),
                   upr.bin=c(15,15.5,16,16,17.1,17.3),
                   freq=c(6,4,2,3,1,1)) )
expandCounts(d3,~freq)
expandCounts(d3,~freq,~lwr.bin+upr.bin)

# some need expansion, but different bin widths
( d4 <- data.frame(name=c("Johnson","Johnson","Jones","Frank","Frank","Max"),
                   lwr.bin=c(15,  15,  16,  16,  17.1,17.3),
                   upr.bin=c(15.5,15.9,16.5,16.9,17.1,17.3),
                   freq=c(6,4,2,3,1,1)) )
expandCounts(d4,~freq)
expandCounts(d4,~freq,~lwr.bin+upr.bin)

# some need expansion but include zeros and NAs for counts
( d2a <- data.frame(name=c("Johnson","Johnson","Jones","Frank","Frank","Max","Max","Max","Max"),
                    lwr.bin=c(15,  15.5,16  ,16  ,17.1,17.3,NA,NA,NA),
                    upr.bin=c(15.5,16  ,16.5,16.5,17.1,17.3,NA,NA,NA),
                    freq=c(6,4,2,3,1,1,NA,0,NA)) )
expandCounts(d2a,~freq,~lwr.bin+upr.bin)
 
# some need expansion but include NAs for upper values
( d2b <- data.frame(name=c("Johnson","Johnson","Jones","Frank","Frank","Max"),
                    lwr.bin=c(15,  15.5,16  ,16  ,17.1,17.3),
                    upr.bin=c(NA  ,NA  ,16.5,16.5,17.1,17.3),
                    freq=c(6,4,2,3,1,1)) )
expandCounts(d2b,~freq,~lwr.bin+upr.bin)
 
# some need expansion but include NAs for lower values
( d2c <- data.frame(name=c("Johnson","Johnson","Jones","Frank","Frank","Max"),
                    lwr.bin=c(NA,NA,  16  ,16  ,17.1,17.3),
                    upr.bin=c(15,15.5,16.5,16.5,17.1,17.3),
                    freq=c(6,4,2,3,1,1)) )
expandCounts(d2c,~freq,~lwr.bin+upr.bin)

if (FALSE) {
##!!##!!## Change path to where example file is and then run to demo

## Read in datafile (note periods in names)
df <- read.csv("c:/aaawork/consulting/R_WiDNR/Statewide/Surveysummaries2010.csv")
str(df) 
## narrow variables for simplicity
df1 <- df[,c("County","Waterbody.Name","Survey.Year","Gear","Species",
             "Number.of.Fish","Length.or.Lower.Length.IN","Length.Upper.IN",
             "Weight.Pounds","Gender")]
## Sum the count to see how many fish there should be after expansion
sum(df1$Number.of.Fish)

## Simple expansion
df2 <- expandCounts(df1,~Number.of.Fish)

## Same expansion but include random component to lengths (thus new variable)
##   also note default lprec=0.1
df3 <- expandCounts(df1,~Number.of.Fish,~Length.or.Lower.Length.IN+Length.Upper.IN)

}
