parquetize (version 0.5.7)

csv_to_parquet: Convert a csv or a txt file to parquet format

Description

This function converts a csv or a txt file to parquet format.

Two conversion options are offered:

  • Convert to a single parquet file. Argument `path_to_parquet` must then be used;

  • Convert to a partitioned parquet file. Additional arguments `partition` and `partitioning` must then be used.

Usage

csv_to_parquet(
  path_to_file,
  url_to_csv = lifecycle::deprecated(),
  csv_as_a_zip = lifecycle::deprecated(),
  filename_in_zip,
  path_to_parquet,
  columns = "all",
  compression = "snappy",
  compression_level = NULL,
  partition = "no",
  encoding = "UTF-8",
  read_delim_args = list(),
  ...
)

Value

A parquet file, invisibly

Arguments

path_to_file

String that indicates the path to the input file (don't forget the extension).

url_to_csv

DEPRECATED. Use `path_to_file` instead.

csv_as_a_zip

DEPRECATED

filename_in_zip

Name of the csv/txt file in the zip. Required if the zip contains more than one csv/txt file.

path_to_parquet

String that indicates the path to the directory where the parquet files will be stored.

columns

Character vector of columns to select from the input file (by default, all columns are selected).

compression

Compression algorithm. Defaults to "snappy".

compression_level

Compression level. Its meaning depends on the compression algorithm.

partition

String ("yes" or "no"; "no" by default) that indicates whether you want to create a partitioned parquet file. If "yes", the `partitioning` argument must be filled in. In that case, a folder is created for each value of the variable(s) given in `partitioning`. Note that this argument cannot be "yes" if the `max_memory` or `max_rows` argument is not NULL.

encoding

String that indicates the character encoding for the input file.

read_delim_args

List of additional arguments to pass to `read_delim`.

...

Additional format-specific arguments; see arrow::write_parquet() and arrow::write_dataset() for more information.
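As a sketch of these two forwarding mechanisms: `read_delim_args` controls how the input file is parsed, while anything in `...` is passed on to the arrow writer. The input filename below is hypothetical, and assumes a semicolon-delimited file with two lines to skip before the header.

```r
library(parquetize)

# Hypothetical semicolon-delimited input; delim and skip are forwarded
# to read_delim via read_delim_args, while chunk_size is forwarded to
# arrow::write_parquet() via `...`.
csv_to_parquet(
  path_to_file = "data_semicolon.txt",
  path_to_parquet = tempfile(fileext = ".parquet"),
  read_delim_args = list(delim = ";", skip = 2),
  chunk_size = 10000
)
```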

Examples

# Conversion from a local csv file to a single parquet file :

csv_to_parquet(
  path_to_file = parquetize_example("region_2022.csv"),
  path_to_parquet = tempfile(fileext=".parquet")
)

# Conversion from a local txt file to a single parquet file :

csv_to_parquet(
  path_to_file = parquetize_example("region_2022.txt"),
  path_to_parquet = tempfile(fileext=".parquet")
)

# Conversion from a local csv file to a single parquet file and select only
# few columns :

csv_to_parquet(
  path_to_file = parquetize_example("region_2022.csv"),
  path_to_parquet = tempfile(fileext = ".parquet"),
  columns = c("REG","LIBELLE")
)

# Conversion from a local csv file to a partitioned parquet file :

csv_to_parquet(
  path_to_file = parquetize_example("region_2022.csv"),
  path_to_parquet = tempfile(),
  partition = "yes",
  partitioning = c("REG")
)

# Conversion from a URL and a zipped file (csv) :

csv_to_parquet(
  path_to_file = "https://www.nomisweb.co.uk/output/census/2021/census2021-ts007.zip",
  filename_in_zip = "census2021-ts007-ctry.csv",
  path_to_parquet = tempfile(fileext = ".parquet")
)

# Conversion from a URL and a zipped file (txt) :

csv_to_parquet(
  path_to_file = "https://sourceforge.net/projects/irisdss/files/latest/download",
  filename_in_zip = "IRIS TEST data.txt",
  path_to_parquet = tempfile(fileext=".parquet")
)

if (FALSE) {
# Conversion from a URL and a csv file with "gzip" compression :

csv_to_parquet(
  path_to_file =
  "https://github.com/sidsriv/Introduction-to-Data-Science-in-python/raw/master/census.csv",
  path_to_parquet = tempfile(fileext = ".parquet"),
  compression = "gzip",
  compression_level = 5
)
}
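To check a conversion, the output can be read back with the arrow package (a sketch; assumes arrow is installed alongside parquetize):

```r
library(arrow)
library(parquetize)

# Convert the bundled example file, keeping the output path so we can
# read the parquet file back afterwards.
out_path <- tempfile(fileext = ".parquet")
csv_to_parquet(
  path_to_file = parquetize_example("region_2022.csv"),
  path_to_parquet = out_path
)

# Read the converted file back into a data frame to inspect it
df <- read_parquet(out_path)
head(df)
```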
