How to read and write data in Snowflake from Databricks using PySpark (Beginners)
Databricks and Snowflake are two powerful tools in the world of data management and analysis. Databricks is a cloud-based analytics platform built on Apache Spark that provides a collaborative environment for data teams to work with data at scale. Snowflake, on the other hand, is a cloud-based data warehouse that offers high performance, scalability, and ease of use.
When these two tools are integrated, they provide a powerful solution for managing and analyzing large-scale data sets. In this blog, we will discuss the steps involved in integrating Databricks and Snowflake using PySpark and provide syntax examples.
Step 1: Create a Snowflake Account and Database
Before integrating Databricks with Snowflake, you need to create a Snowflake account and a database. Once you have created the account and database, you need to create a Snowflake user with the required privileges to connect to the database.
Step 2: Configure Databricks
To configure Databricks, you need to create a cluster and install the Snowflake JDBC driver. You can create a cluster by navigating to the Clusters tab in the Databricks workspace and clicking on the Create Cluster button. Once you have created the cluster, you can install the Snowflake JDBC driver by following the instructions provided in the Databricks documentation.
Note: Databricks Runtime 4.2 and later include a native Snowflake Connector, so your Databricks cluster can read data from and write data to Snowflake without installing or importing any libraries. Older Databricks runtimes required importing the Spark connector libraries into your clusters.
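For example, on a cluster running a recent Databricks Runtime, the bundled connector can also be referenced by the short data source name "snowflake". Here is a minimal sketch; it reuses the sfOptions dictionary defined in Step 3 below, and <query> is a placeholder you must replace:
# Minimal sketch: with the native connector, the short name "snowflake" can be
# used instead of the fully qualified "net.snowflake.spark.snowflake".
# sfOptions is the connection-options dictionary defined in Step 3 below.
df = spark.read.format("snowflake").options(**sfOptions).option("query", "<query>").load()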
Step 3: Connect to Snowflake
To connect to Snowflake from Databricks using PySpark, you need to define a dictionary of connection options that includes the Snowflake account URL, database, user, and password. Here is an example:
sfOptions = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
    "sfRole": "<role>"
}
snowflake_source_name = "net.snowflake.spark.snowflake"
The connection itself is established when you run a read or write using these options, as shown in the next two steps.
Tip: Avoid exposing your Snowflake username and password in notebooks by storing them in Databricks Secrets and retrieving them with dbutils.secrets.get.
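For instance, if the credentials are stored in a secret scope, you can build sfOptions without any plain-text passwords. The scope name "snowflake" and the key names below are assumptions; replace them with the names you used when creating your secrets:
# Hedged sketch: read credentials from a Databricks secret scope instead of
# hard-coding them. dbutils is available automatically in Databricks notebooks.
sf_user = dbutils.secrets.get(scope="snowflake", key="sf_user")          # assumed scope/key names
sf_password = dbutils.secrets.get(scope="snowflake", key="sf_password")  # assumed scope/key names

sfOptions = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": sf_user,
    "sfPassword": sf_password,
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
    "sfRole": "<role>"
}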
Step 4: Load Data into Databricks from Snowflake
Once the connection options are defined, you can load data from Snowflake into a Databricks DataFrame using the following syntax:
df = spark.read.format(snowflake_source_name).options(**sfOptions).option("query", "<query>").load()
In this syntax, you need to replace <query> with your SQL query to retrieve data from Snowflake.
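As a concrete illustration, the sketch below reads a small result set and displays it. The CUSTOMERS table and its column names are made-up examples, not objects that exist in your account:
# Hypothetical example: the CUSTOMERS table and its columns are assumptions.
customers_df = (spark.read
                .format(snowflake_source_name)
                .options(**sfOptions)
                .option("query", "SELECT ID, NAME, COUNTRY FROM CUSTOMERS LIMIT 100")
                .load())
customers_df.show(5)  # preview the first five rows in the notebook
Alternatively, you can pass .option("dbtable", "<table>") instead of a query to read an entire table.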
Step 5: Write Data from Databricks to Snowflake
You can also write data from Databricks to Snowflake using the following syntax:
df.write.format(snowflake_source_name).options(**sfOptions).option("dbtable", "<table>").mode("append").save()
In this syntax, you need to replace <table> with the name of the Snowflake table where you want to write the data.
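For example, a small DataFrame created in the notebook could be appended to a Snowflake table like this. The SALES_SUMMARY table name, its columns, and the sample rows are assumptions for illustration only:
# Hypothetical example: SALES_SUMMARY and its columns are assumed names.
summary_df = spark.createDataFrame(
    [("2023-01", 1250.0), ("2023-02", 1480.5)],
    ["MONTH", "TOTAL_SALES"]
)

(summary_df.write
 .format(snowflake_source_name)
 .options(**sfOptions)
 .option("dbtable", "SALES_SUMMARY")
 .mode("append")   # use "overwrite" to replace the existing table contents instead
 .save())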
Conclusion
In this blog, we discussed the steps involved in integrating Databricks and Snowflake using PySpark and provided syntax examples. By following these steps, you can leverage the power of these two tools to manage and analyze large-scale data sets efficiently.