
sparkbq (version 0.1.1)

spark_read_bigquery: Reading data from Google BigQuery

Description

This function reads data from a Google BigQuery table, or from the result of a BigQuery standard SQL query, into a Spark DataFrame.

Usage

spark_read_bigquery(sc, name,
  billingProjectId = default_billing_project_id(),
  projectId = billingProjectId, datasetId = NULL, tableId = NULL,
  sqlQuery = NULL, type = default_bigquery_type(),
  gcsBucket = default_gcs_bucket(),
  serviceAccountKeyFile = default_service_account_key_file(),
  additionalParameters = NULL, memory = FALSE, ...)

Arguments

sc

A spark_connection provided by sparklyr.

name

The name to assign to the newly generated table (see also spark_read_source).

billingProjectId

Google Cloud Platform project ID for billing purposes. This is the project on whose behalf to perform BigQuery operations. Defaults to default_billing_project_id().

projectId

Google Cloud Platform project ID of the BigQuery dataset. Defaults to billingProjectId.

datasetId

Google BigQuery dataset ID (may contain letters, numbers and underscores). Either both of datasetId and tableId or sqlQuery must be specified.

tableId

Google BigQuery table ID (may contain letters, numbers and underscores). Either both of datasetId and tableId or sqlQuery must be specified.

sqlQuery

Google BigQuery SQL query. Either both of datasetId and tableId or sqlQuery must be specified. The query must be specified in standard SQL (SQL-2011). Legacy SQL is not supported. Tables are specified as `<project_id>.<dataset_id>.<table_id>`.
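
For example (a minimal sketch, assuming an active spark_connection sc and billing/bucket defaults already set via bigquery_defaults(); the query itself is purely illustrative), a standard SQL query can be read directly:

word_counts <- spark_read_bigquery(
  sc,
  name = "word_counts",
  sqlQuery = paste(
    "SELECT word, SUM(word_count) AS total",
    "FROM `bigquery-public-data.samples.shakespeare`",
    "GROUP BY word"))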

type

BigQuery import type to use. Options include "direct", "avro", "json" and "csv". Defaults to default_bigquery_type(). See bigquery_defaults for more details about the supported types.
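
For example (a sketch assuming an active spark_connection sc and a gcsBucket configured via bigquery_defaults(), which file-based import types require), the import type can be overridden per call:

shakespeare_avro <- spark_read_bigquery(
  sc,
  name = "shakespeare_avro",
  projectId = "bigquery-public-data",
  datasetId = "samples",
  tableId = "shakespeare",
  type = "avro")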

gcsBucket

Google Cloud Storage (GCS) bucket to use for storing temporary files. Temporary files are used when importing through BigQuery load jobs and exporting through BigQuery extraction jobs (i.e. when using data extracts such as Parquet, Avro, ORC, ...). The service account specified in serviceAccountKeyFile needs to be given appropriate rights. This should be the name of an existing storage bucket.

serviceAccountKeyFile

Google Cloud service account key file to use for authentication with Google Cloud services. The use of service accounts is highly recommended. Specifically, the service account will be used to interact with BigQuery and Google Cloud Storage (GCS).

additionalParameters

Additional spark-bigquery options. See https://github.com/miraisolutions/spark-bigquery for more information.

memory

Logical specifying whether data should be loaded eagerly into memory, i.e. whether the table should be cached. Note that eager caching prevents predicate pushdown (e.g. in conjunction with dplyr's filter) and therefore the default is FALSE. See also spark_read_source.
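
For example (a minimal sketch, assuming an active spark_connection sc and BigQuery defaults already set), leaving memory = FALSE keeps the table uncached so that dplyr verbs such as filter can be pushed down to the source:

library(dplyr)
shakespeare <- spark_read_bigquery(
  sc,
  name = "shakespeare",
  projectId = "bigquery-public-data",
  datasetId = "samples",
  tableId = "shakespeare",
  memory = FALSE)
hamlet <- shakespeare %>% filter(corpus == "hamlet")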

...

Additional arguments passed to spark_read_source.

Value

A tbl_spark which provides a dplyr-compatible reference to a Spark DataFrame.
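
For example (a small usage sketch, assuming the shakespeare table from the Examples section below has been read), the returned tbl_spark can be manipulated with dplyr verbs and collected into a local data frame:

library(dplyr)
top_words <- shakespeare %>%
  group_by(word) %>%
  summarise(total = sum(word_count)) %>%
  arrange(desc(total)) %>%
  head(10) %>%
  collect()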

References

https://github.com/miraisolutions/spark-bigquery
https://cloud.google.com/bigquery/docs/datasets
https://cloud.google.com/bigquery/docs/tables
https://cloud.google.com/bigquery/docs/reference/standard-sql/
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv
https://cloud.google.com/bigquery/pricing
https://cloud.google.com/bigquery/docs/dataset-locations
https://cloud.google.com/docs/authentication/
https://cloud.google.com/bigquery/docs/authentication/

See Also

spark_read_source, spark_write_bigquery, bigquery_defaults

Other Spark serialization routines: spark_write_bigquery

Examples

library(sparklyr)
library(sparkbq)

config <- spark_config()

sc <- spark_connect(master = "local", config = config)

bigquery_defaults(
  billingProjectId = "<your_billing_project_id>",
  gcsBucket = "<your_gcs_bucket>",
  datasetLocation = "US",
  serviceAccountKeyFile = "<your_service_account_key_file>",
  type = "direct")

# Reading the public shakespeare data table
# https://cloud.google.com/bigquery/public-data/
# https://cloud.google.com/bigquery/sample-tables
shakespeare <-
  spark_read_bigquery(
    sc,
    name = "shakespeare",
    projectId = "bigquery-public-data",
    datasetId = "samples",
    tableId = "shakespeare")