fixCSV: Tidy a comma separated value (CSV) file

Description

Tidies up a Comma Separated Value (CSV) file, ensuring that each row of the table in the file contains the same number of commas, and no empty rows are left below the table.

Usage

fixCSV(file, skip = 0, overwrite = FALSE)

Arguments

file

character: the name of the CSV file to be ‘fixed’.

skip

integer: the number of lines in the CSV file to skip before the header row of the table. The skipped lines are copied directly to the output file unchanged. The default is skip=0, implying that the header row is the first row of the CSV file.

overwrite

logical: Write output to a separate, ‘FIXED’ file (overwrite=FALSE, the default), or overwrite the original file (overwrite=TRUE)? If overwrite=TRUE, the original file is copied to a .BAK file before being overwritten.

Details

fixCSV tidies up a Comma Separated Value (CSV) file to ensure that the CSV file contains a strictly rectangular block of data for input into R (ignoring any preliminary comment rows via the skip= argument).

CSV formatted files are a plain text file format for tabular data, in which cell entries in the same row of a table are separated by commas. When such files are exported from other applications such as spreadsheet software, the software has to decide whether any empty cells to the right-hand side of, or below, the table or spreadsheet should be represented by trailing commas in the CSV file. Such decisions can result in a ‘ragged’ table in the CSV file, in which some rows contain fewer commas (‘short rows’) or more commas (‘long rows’) than others, or where empty rows below the table are included as comma-only rows in the CSV file.

While R's read.table and related functions can sensibly extend short rows as needed, ragged tables in a CSV file can still result in errors, unwanted empty rows (below the table) or unwanted columns (to the right of the table) when the data is loaded into R.

fixCSV reads in a specified CSV file and removes or adds commas to rows, to ensure that each row in the body of the table contains the same number of cells as the header row of the table. Any empty rows below the table are also removed. The resulting table is then written back to file, either to a new file with ‘FIXED’ added to the filename (argument overwrite=FALSE, the default) or overwriting the original file (overwrite=TRUE - the original file is copied to a .BAK file before being overwritten).

Note that:

The table of data in the CSV file must contain a header row of the correct length, since this row is used to determine the correct number of columns for the table. Note: if this header row is too short, then subsequent rows will be truncated to match the length of the header, so beware. Misspecification of the skip= argument (see below) can similarly lead to such corruption of the ‘fixed’ file.

In the header row, any trailing commas representing empty cells to the right of the (non-empty) header entries are first removed before determining the correct number of columns for the table. Thus the length of the header row (and hence the assumed width of the entire table) is determined by the right-most non-empty cell in the header row.

fixCSV does not remove empty cells, rows or columns within the interior (or on the left side) of the table - it is concerned only with the right and bottom boundaries of the table.

A skip= argument is included to tell fixCSV to ignore the specified number of comment rows preceding the header row. Such rows are simply copied over into the output file unchanged. The default for this parameter is skip=0, so that the first row in the data file is assumed to be the header row. As noted above, misspecification of this argument can seriously corrupt the output.

fixCSV can overwrite your data file(s) (via overwrite=TRUE), and althought it makes a backup of your original file, you should still make sure that you have a separate backup of your data file in a safe place before using this function! The author of this code takes no responsibility for any data loss or corruption as a result of the use of this routine...

Examples

Run this code

## Not run: 
# 
# ## Assuming CSV file 'alleleDataFile.csv' exists in the current
# ## directory.  The following overwrites the CSV file - make sure
# ## you have a backup!
# 
# fixCSV("alleleDataFile.csv",overwrite=TRUE)
# 
# ## End(Not run)

Run the code above in your browser using DataLab