Snowflake is a data-warehouse-as-a-service product. It is designed for fast, efficient storage and compute over large datasets, with full ANSI SQL support. As such, Snowflake competes with products such as Apache Spark, Google BigQuery, and Amazon Redshift. You may have already heard about Snowflake even if you are not in tech: the company behind it, Snowflake Inc., had the largest software IPO of all time in September 2020.
In this article we break down how to use Snowflake's database connector to access a running Snowflake instance from within Spell.
What is Snowflake?
Large enterprise organizations typically run a large number of site-specific business and customer-facing databases. The marketing team might have one set of databases that it uses, the sales team a different set tied to its sales tools, and so on. Internal data applications (reporting, alerting, dashboarding, and even ML) typically need longitudinal access to all of this data at once. Accessing data across many different services is awkward and slow, so large companies typically invest in a centralized data warehouse or data lake. These work by pooling relevant data from the company's various services in a single location, and exposing powerful compute on the pooled data suitable for long-running, expensive analytics queries.
Snowflake is an example of a data warehousing product. Snowflake takes data stored in blob storage (AWS S3, Azure Blob Storage, or Google Cloud Storage) and a SQL query as input. It spins up what Snowflake calls a virtual warehouse—a cluster of compute nodes of a certain preconfigured size—which reads the data in, executes the query job, and produces a result. To configure the speed (and cost) with which the query is executed, adjust the virtual warehouse size.
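Resizing a warehouse is itself just a SQL statement, so you can dial compute up or down between jobs. Here is a minimal sketch; the warehouse name `ANALYTICS_WH` is hypothetical, and in practice you would send the statement through an open connection:

```python
# Sketch: trading cost for speed by resizing a virtual warehouse.
# Snowflake sizes run from X-SMALL up through the XL tiers, with
# compute (and credit consumption) roughly doubling at each step.
def resize_warehouse_sql(name, size):
    """Build an ALTER WAREHOUSE statement for the given target size."""
    return f"ALTER WAREHOUSE {name} SET WAREHOUSE_SIZE = '{size}'"

stmt = resize_warehouse_sql("ANALYTICS_WH", "LARGE")
# With an open connection ctx, you would run: ctx.cursor().execute(stmt)
print(stmt)
```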
Snowflake has a number of interesting properties that potentially give it a leg up over competing data warehousing products. The Freshpaint.IO blog post "What Is Snowflake and Why Is It Exciting?" offers a good summary. Briefly:
- Snowflake is completely serverless. This means you don't need to perform any cluster management, as you would with e.g. Spark or Redshift.
- Snowflake decouples storage and compute. The compute pool Snowflake provides to you via its virtual warehouse is completely separate from its storage layer (which is just "boring" blob storage). This has some interesting implications for e.g. data sharing.
- Snowflake is pay-as-you-go. It has a very simple pricing model, which makes it easy to project costs.
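Because billing is by credits consumed per hour of warehouse uptime, you can project costs with simple arithmetic. The sketch below assumes the standard credit schedule (X-Small at 1 credit/hour, doubling with each size step) and a hypothetical $3.00 per credit; actual per-credit rates vary by edition and cloud region:

```python
# Rough cost projection for a Snowflake workload. Credit rates below
# follow the standard doubling schedule; the price per credit is a
# placeholder, not a quoted Snowflake rate.
CREDITS_PER_HOUR = {
    "X-SMALL": 1, "SMALL": 2, "MEDIUM": 4, "LARGE": 8, "X-LARGE": 16,
}

def projected_cost(size, hours, price_per_credit=3.00):
    """Estimate the dollar cost of running a warehouse of `size` for `hours`."""
    return CREDITS_PER_HOUR[size] * hours * price_per_credit

# A LARGE warehouse for 2.5 hours: 8 credits/hr * 2.5 hr * $3/credit
print(projected_cost("LARGE", 2.5))
```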
Accessing Snowflake on Spell
Snowflake exposes its services via a SQL interface, allowing you to connect to it much as you would with any other database, e.g. Postgres or MySQL.
To follow along with this section in code, you can sign up for a free trial.
The easiest way to connect to a running Snowflake instance from within a Spell run or workspace is to do so using the snowflake-connector-python Python package. You can install this package into your run environment using the --pip flag on spell run or spell jupyter, e.g.:
```shell
$ spell jupyter --lab \
    --machine-type cpu \
    --pip snowflake-connector-python \
    snowflake
```
The following code sample, taken from the Snowflake docs, shows how you can use this package to send queries to and receive results from a running database from within Spell:
```python
import snowflake.connector

ctx = snowflake.connector.connect(
    user='$USERNAME',
    password='$PASSWORD',
    account='$ACCOUNT'
)
cs = ctx.cursor()
try:
    result = cs.execute("""
    SELECT
        (V:main.temp_max - 273.15) * 1.8000 + 32.00 as temp_max_far,
        (V:main.temp_min - 273.15) * 1.8000 + 32.00 as temp_min_far,
        CAST(V:time as timestamp) time,
        V:city.coord.lat lat,
        V:city.coord.lon lon
    FROM snowflake_sample_data.weather.weather_14_total
    WHERE v:city.name = 'New York'
    AND v:city.country = 'US'
    """).fetchall()
finally:
    cs.close()
    ctx.close()
```
Replace $USERNAME and $PASSWORD with your login credentials, and replace $ACCOUNT with the unique account ID, e.g. abc12345, that Snowflake assigned to you at account initialization time.
Of course, in practice, it may not be a good idea to pass sensitive database credentials into run commands in plaintext form; for secure access to shared secrets, we recommend making use of authentication middleware, a security design pattern we discuss in more detail in the following article: "Fine-grained access control in Spell using private pip packages".
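A lightweight alternative is to read the credentials from environment variables at runtime rather than embedding them in the command. Here is a minimal sketch; the variable names (`SNOWFLAKE_USER`, etc.) are our own convention, not a Snowflake or Spell standard:

```python
import os

# Read Snowflake credentials from environment variables instead of
# hard-coding them in the run command or the script itself.
def snowflake_credentials():
    creds = {
        "user": os.environ.get("SNOWFLAKE_USER"),
        "password": os.environ.get("SNOWFLAKE_PASSWORD"),
        "account": os.environ.get("SNOWFLAKE_ACCOUNT"),
    }
    missing = [key for key, value in creds.items() if not value]
    if missing:
        raise RuntimeError(f"missing Snowflake credentials: {missing}")
    return creds

# With the variables set, you would connect with:
# ctx = snowflake.connector.connect(**snowflake_credentials())
```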
Executing this query returns a list of 30,805 items composing our result set. We can quickly convert this result to a pandas DataFrame for display:
```python
import pandas as pd

df = pd.DataFrame(result)
df
```
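One wrinkle: `fetchall()` returns a list of plain tuples, so the DataFrame above has numeric column labels. You can pass the column names yourself, as in this sketch (the sample rows here are made up for illustration):

```python
import pandas as pd

# Label the columns to match the SELECT list in the query, so the
# DataFrame is self-describing. These two rows are fabricated sample
# data standing in for the real fetchall() result.
columns = ["temp_max_far", "temp_min_far", "time", "lat", "lon"]
result = [
    (41.0, 30.2, "2019-07-01 00:00:00", 40.71, -74.01),
    (44.6, 33.8, "2019-07-02 00:00:00", 40.71, -74.01),
]
df = pd.DataFrame(result, columns=columns)
print(df.shape)
```

If the connector is installed with its pandas extra (`snowflake-connector-python[pandas]`), the cursor also offers a `fetch_pandas_all()` method that returns a labeled DataFrame directly.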