

from delta import configure_spark_with_delta_pip, DeltaTable
from pyspark.sql import SparkSession

builder = (SparkSession.builder
           .appName("create-delta-table")
           .master("spark://ZZHPC:7077")
           .config("spark.executor.memory", "512m")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog"))

spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

%load_ext sparksql_magic
%config SparkSql.limit=20
%%sparksql
CREATE OR REPLACE TABLE default.netflix_titles (
  show_id STRING,
  type STRING,
  title STRING,
  director STRING,
  cast STRING,
  country STRING,
  date_added STRING,
  release_year STRING,
  rating STRING,
  duration STRING,
  listed_in STRING,
  description STRING
)
USING DELTA
LOCATION '../data/delta_lake/netflix_titles';
This statement created the directory 'Chapter03/data/delta_lake/netflix_titles' instead of the expected location 'data/delta_lake/netflix_titles'.
Next, I used a home-directory ('~') path for the location:
%%sparksql
CREATE OR REPLACE TABLE default.netflix_titles (
  show_id STRING,
  type STRING,
  title STRING,
  director STRING,
  cast STRING,
  country STRING,
  date_added STRING,
  release_year STRING,
  rating STRING,
  duration STRING,
  listed_in STRING,
  description STRING
)
USING DELTA
LOCATION '~/zd/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles';
This time, the directory was created under 'Chapter03/spark-warehouse':
zzh@ZZHPC:~/zd/Github/Data-Engineering-with-Databricks-Cookbook-main/data$ find ./ -type d -name netflix_titles
zzh@ZZHPC:~/zd/Github/Data-Engineering-with-Databricks-Cookbook-main/data$ cd ..
zzh@ZZHPC:~/zd/Github/Data-Engineering-with-Databricks-Cookbook-main$ find ./ -type d -name netflix_titles
./Chapter03/spark-warehouse/~/zd/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles
So the '~' was not expanded and the path was still treated as relative. The base of a relative LOCATION path for the write (CREATE TABLE) operation is the 'spark-warehouse' directory, which Spark creates in the same directory as the .ipynb file.
%%sparksql
CREATE OR REPLACE TABLE default.netflix_titles (
  show_id STRING,
  type STRING,
  title STRING,
  director STRING,
  cast STRING,
  country STRING,
  date_added STRING,
  release_year STRING,
  rating STRING,
  duration STRING,
  listed_in STRING,
  description STRING
)
USING DELTA
LOCATION '../../data/delta_lake/netflix_titles';
This time the target directory was created as expected.

Read your data:
df = (spark.read
      .format("csv")
      .option("header", "true")
      .load("../data/netflix_titles.csv"))
For the read operation, the base of a relative path is the directory containing the .ipynb file.
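A quick way to confirm the two different base directories from inside the notebook (a minimal sketch; it only assumes the spark session created above):

import os

# Base for relative paths used by spark.read: the driver's working directory
print(os.getcwd())

# Base for relative LOCATION paths in CREATE TABLE: the Spark SQL warehouse directory
print(spark.conf.get("spark.sql.warehouse.dir"))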
Write the data to Delta Lake:
df.write.format("delta").mode("overwrite").saveAsTable("default.netflix_titles")
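Since default.netflix_titles is an external table pointing at the location above, an equivalent path-based write should also work (a sketch, assuming the absolute path used later in these notes points at the same directory the table was created with):

(df.write.format("delta")
   .mode("overwrite")
   .save("/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles"))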

Query the Delta table:
%%sparksql SELECT * FROM default.netflix_titles LIMIT 3;




from delta import configure_spark_with_delta_pip, DeltaTable
from pyspark.sql import SparkSession

builder = (SparkSession.builder
           .appName("read-delta-table")
           .master("spark://ZZHPC:7077")
           .config("spark.executor.memory", "512m")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
Ivy Default Cache set to: /home/zzh/.ivy2/cache
The jars for the packages stored in: /home/zzh/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-50abd92f-2c69-46f4-b5da-18ff6217d96a;1.0
	confs: [default]
	found io.delta#delta-core_2.12;2.4.0 in central
	found io.delta#delta-storage;2.4.0 in central
	found org.antlr#antlr4-runtime;4.9.3 in central
:: resolution report :: resolve 148ms :: artifacts dl 8ms
	:: modules in use:
	io.delta#delta-core_2.12;2.4.0 from central in [default]
	io.delta#delta-storage;2.4.0 from central in [default]
	org.antlr#antlr4-runtime;4.9.3 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-50abd92f-2c69-46f4-b5da-18ff6217d96a
	confs: [default]
	0 artifacts copied, 3 already retrieved (0kB/8ms)
25/02/04 10:37:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

%load_ext sparksql_magic
%config SparkSql.limit=20
df = spark.read.format("delta").load("../data/delta_lake/netflix_titles")
df.show(3)
+-------+-------+--------------------+---------------+--------------------+-------------+------------------+------------+------+---------+--------------------+--------------------+
|show_id|   type|               title|       director|                cast|      country|        date_added|release_year|rating| duration|           listed_in|         description|
+-------+-------+--------------------+---------------+--------------------+-------------+------------------+------------+------+---------+--------------------+--------------------+
|     s1|  Movie|Dick Johnson Is Dead|Kirsten Johnson|                null|United States|September 25, 2021|        2020| PG-13|   90 min|       Documentaries|As her father nea...|
|     s2|TV Show|       Blood & Water|           null|Ama Qamata, Khosi...| South Africa|September 24, 2021|        2021| TV-MA|2 Seasons|International TV ...|After crossing pa...|
|     s3|TV Show|           Ganglands|Julien Leclercq|Sami Bouajila, Tr...|         null|September 24, 2021|        2021| TV-MA| 1 Season|Crime TV Shows, I...|To protect his fa...|
+-------+-------+--------------------+---------------+--------------------+-------------+------------------+------------+------+---------+--------------------+--------------------+
only showing top 3 rows
%%sparksql SELECT * FROM delta.`../data/delta_lake/netflix_titles` LIMIT 3;
---------------------------------------------------------------------------
AnalysisException Traceback (most recent call last)
Cell In[6], line 1
----> 1 get_ipython().run_cell_magic('sparksql', '', 'SELECT * FROM delta.`../data/delta_lake/netflix_titles` LIMIT 3;\n')
File ~/venvs/zpy311/lib/python3.11/site-packages/IPython/core/interactiveshell.py:2543, in InteractiveShell.run_cell_magic(self, magic_name, line, cell)
2541 with self.builtin_trap:
2542 args = (magic_arg_s, cell)
-> 2543 result = fn(*args, **kwargs)
2545 # The code below prevents the output from being displayed
2546 # when using magics with decorator @output_can_be_silenced
2547 # when the last Python token in the expression is a ';'.
2548 if getattr(fn, magic.MAGIC_OUTPUT_CAN_BE_SILENCED, False):
File ~/venvs/zpy311/lib/python3.11/site-packages/sparksql_magic/sparksql.py:40, in SparkSql.sparksql(self, line, cell, local_ns)
37 print("active spark session is not found")
38 return
---> 40 df = spark.sql(bind_variables(cell, user_ns))
41 if args.cache or args.eager:
42 print('cache dataframe with %s load' % ('eager' if args.eager else 'lazy'))
File ~/venvs/zpy311/lib/python3.11/site-packages/pyspark/sql/session.py:1440, in SparkSession.sql(self, sqlQuery, args, **kwargs)
1438 try:
1439 litArgs = {k: _to_java_column(lit(v)) for k, v in (args or {}).items()}
-> 1440 return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
1441 finally:
1442 if len(kwargs) > 0:
File ~/venvs/zpy311/lib/python3.11/site-packages/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args)
1316 command = proto.CALL_COMMAND_NAME +\
1317 self.command_header +\
1318 args_command +\
1319 proto.END_COMMAND_PART
1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
1323 answer, self.gateway_client, self.target_id, self.name)
1325 for temp_arg in temp_args:
1326 if hasattr(temp_arg, "_detach"):
File ~/venvs/zpy311/lib/python3.11/site-packages/pyspark/errors/exceptions/captured.py:175, in capture_sql_exception.<locals>.deco(*a, **kw)
171 converted = convert_exception(e.java_exception)
172 if not isinstance(converted, UnknownException):
173 # Hide where the exception came from that shows a non-Pythonic
174 # JVM exception message.
--> 175 raise converted from None
176 else:
177 raise
AnalysisException: Unsupported data source type for direct query on files: delta.; line 1 pos 14
Changed the path:
%%sparksql SELECT * FROM delta.`../../data/delta_lake/netflix_titles` LIMIT 3;
Still got the same error.
Changed the path again:
%%sparksql SELECT * FROM delta.`data/delta_lake/netflix_titles` LIMIT 3;
Still got the same error.
Changed the path to an absolute path:
%%sparksql SELECT * FROM delta.`/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles` LIMIT 3;
It succeeded. In this setup, the delta.`<path>` direct-query syntax appears to work only with an absolute path; with a relative path Spark does not recognize it as a path-based Delta table.


%%sparksql DESCRIBE HISTORY '/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles'


%%sparksql
ALTER TABLE delta.`/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles`
SET TBLPROPERTIES (
  delta.logRetentionDuration = 'interval 60 days',
  delta.deletedFileRetentionDuration = 'interval 14 days'
);

from delta import configure_spark_with_delta_pip, DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr, lit

builder = (SparkSession.builder
           .appName("upsert-delta-table")
           .master("spark://ZZHPC:7077")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
           .config("spark.sql.catalogImplementation", "in-memory"))

spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
%load_ext sparksql_magic
%config SparkSql.limit=20
# For PySpark:
deltaTable = DeltaTable.forPath(spark, "/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles")
deltaTable.toDF().show(3)
+-------+-------+--------------------+---------------+--------------------+-------------+------------------+------------+------+---------+--------------------+--------------------+
|show_id|   type|               title|       director|                cast|      country|        date_added|release_year|rating| duration|           listed_in|         description|
+-------+-------+--------------------+---------------+--------------------+-------------+------------------+------------+------+---------+--------------------+--------------------+
|     s1|  Movie|Dick Johnson Is Dead|Kirsten Johnson|                null|United States|September 25, 2021|        2020| PG-13|   90 min|       Documentaries|As her father nea...|
|     s2|TV Show|       Blood & Water|           null|Ama Qamata, Khosi...| South Africa|September 24, 2021|        2021| TV-MA|2 Seasons|International TV ...|After crossing pa...|
|     s3|TV Show|           Ganglands|Julien Leclercq|Sami Bouajila, Tr...|         null|September 24, 2021|        2021| TV-MA| 1 Season|Crime TV Shows, I...|To protect his fa...|
+-------+-------+--------------------+---------------+--------------------+-------------+------------------+------------+------+---------+--------------------+--------------------+
only showing top 3 rows
# Update director to not have nulls
deltaTable.update(
    condition = expr("director IS NULL"),
    set = {"director": lit("")})

deltaTable.toDF().show(3)
+-------+-------+--------------------+---------------+--------------------+-------------+------------------+------------+------+---------+--------------------+--------------------+
|show_id|   type|               title|       director|                cast|      country|        date_added|release_year|rating| duration|           listed_in|         description|
+-------+-------+--------------------+---------------+--------------------+-------------+------------------+------------+------+---------+--------------------+--------------------+
|     s1|  Movie|Dick Johnson Is Dead|Kirsten Johnson|                null|United States|September 25, 2021|        2020| PG-13|   90 min|       Documentaries|As her father nea...|
|     s2|TV Show|       Blood & Water|               |Ama Qamata, Khosi...| South Africa|September 24, 2021|        2021| TV-MA|2 Seasons|International TV ...|After crossing pa...|
|     s3|TV Show|           Ganglands|Julien Leclercq|Sami Bouajila, Tr...|         null|September 24, 2021|        2021| TV-MA| 1 Season|Crime TV Shows, I...|To protect his fa...|
+-------+-------+--------------------+---------------+--------------------+-------------+------------------+------------+------+---------+--------------------+--------------------+
only showing top 3 rows
%%sparksql
UPDATE delta.`/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles`
SET director = ""
WHERE director IS NULL;


from delta import configure_spark_with_delta_pip, DeltaTable
from pyspark.sql import SparkSession

builder = (SparkSession.builder
           .appName("merge-delta-table")
           .master("spark://ZZHPC:7077")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog"))

spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
%load_ext sparksql_magic
%config SparkSql.limit=20
%%sparksql
CREATE OR REPLACE TABLE default.movie_and_show_titles (
  show_id STRING,
  type STRING,
  title STRING,
  director STRING,
  cast STRING,
  country STRING,
  date_added STRING,
  release_year STRING,
  rating STRING,
  duration STRING,
  listed_in STRING,
  description STRING
)
USING DELTA
LOCATION '/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/movie_and_show_titles';
deltaTable_titles = DeltaTable.forPath(spark, "/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/movie_and_show_titles")
deltaTable_titles.toDF().show(5)
+-------+----+-----+--------+----+-------+----------+------------+------+--------+---------+-----------+
|show_id|type|title|director|cast|country|date_added|release_year|rating|duration|listed_in|description|
+-------+----+-----+--------+----+-------+----------+------------+------+--------+---------+-----------+
+-------+----+-----+--------+----+-------+----------+------------+------+--------+---------+-----------+

df_netflix = spark.read.format("delta").load("/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles")
df_netflix_deduped = df_netflix.dropDuplicates(["type", "title", "director", "date_added"])
(deltaTable_titles.alias('movie_and_show_titles')
 .merge(df_netflix_deduped.alias('updates'),
        """lower(movie_and_show_titles.type) = lower(updates.type)
           AND lower(movie_and_show_titles.title) = lower(updates.title)
           AND lower(movie_and_show_titles.director) = lower(updates.director)
           AND movie_and_show_titles.date_added = updates.date_added""")
 .whenMatchedUpdate(set = {
     "show_id": "updates.show_id",
     "type": "updates.type",
     "title": "updates.title",
     "director": "updates.director",
     "cast": "updates.cast",
     "country": "updates.country",
     "date_added": "updates.date_added",
     "release_year": "updates.release_year",
     "rating": "updates.rating",
     "duration": "updates.duration",
     "listed_in": "updates.listed_in",
     "description": "updates.description"})
 .whenNotMatchedInsert(values = {
     "show_id": "updates.show_id",
     "type": "updates.type",
     "title": "updates.title",
     "director": "updates.director",
     "cast": "updates.cast",
     "country": "updates.country",
     "date_added": "updates.date_added",
     "release_year": "updates.release_year",
     "rating": "updates.rating",
     "duration": "updates.duration",
     "listed_in": "updates.listed_in",
     "description": "updates.description"})
 .execute())
%%sparksql DESCRIBE HISTORY "/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/movie_and_show_titles"

df_titles = (spark.read.format("csv")
             .option("header", "true")
             .load("../data/titles.csv"))
df_titles_deduped = df_titles.dropDuplicates(["type", "title"])
df_titles_deduped.createOrReplaceTempView("titles_deduped")
The createOrReplaceTempView() function in PySpark creates (or replaces) a temporary view over a DataFrame so it can be queried with SQL syntax. It essentially registers the DataFrame as a temporary table in the Spark session, allowing you to run SQL queries against it.
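A minimal check that the view is queryable (a sketch; the column names come from titles.csv as used in the merge below, and the LIMIT is arbitrary):

spark.sql("SELECT id, type, title, release_year FROM titles_deduped LIMIT 5").show()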
(deltaTable_titles.alias('movie_and_show_titles')
 .merge(df_titles_deduped.alias('updates'),
        """lower(movie_and_show_titles.type) = lower(updates.type)
           AND lower(movie_and_show_titles.title) = lower(updates.title)
           AND movie_and_show_titles.release_year = updates.release_year""")
 .whenMatchedUpdate(set = {
     "show_id": "updates.id",
     "type": "updates.type",
     "title": "updates.title",
     "country": "updates.production_countries",
     "release_year": "updates.release_year",
     "rating": "updates.age_certification",
     "duration": "updates.runtime",
     "listed_in": "updates.genres",
     "description": "updates.description"})
 .whenNotMatchedInsert(values = {
     "show_id": "updates.id",
     "type": "updates.type",
     "title": "updates.title",
     "country": "updates.production_countries",
     "release_year": "updates.release_year",
     "rating": "updates.age_certification",
     "duration": "updates.runtime",
     "listed_in": "updates.genres",
     "description": "updates.description"})
 .execute())
%%sparksql
MERGE INTO default.movie_and_show_titles
USING titles_deduped
ON lower(default.movie_and_show_titles.type) = lower(titles_deduped.type)
   AND lower(default.movie_and_show_titles.title) = lower(titles_deduped.title)
   AND default.movie_and_show_titles.release_year = titles_deduped.release_year
WHEN MATCHED THEN UPDATE SET
  show_id = titles_deduped.id,
  type = titles_deduped.type,
  title = titles_deduped.title,
  country = titles_deduped.production_countries,
  release_year = titles_deduped.release_year,
  rating = titles_deduped.age_certification,
  duration = titles_deduped.runtime,
  listed_in = titles_deduped.genres,
  description = titles_deduped.description
WHEN NOT MATCHED THEN
  INSERT (
    show_id, type, title, country, release_year, rating, duration, listed_in, description
  ) VALUES (
    titles_deduped.id, titles_deduped.type, titles_deduped.title, titles_deduped.production_countries,
    titles_deduped.release_year, titles_deduped.age_certification, titles_deduped.runtime,
    titles_deduped.genres, titles_deduped.description
  )

%%sparksql DESCRIBE HISTORY "/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/movie_and_show_titles"


from delta import configure_spark_with_delta_pip, DeltaTable
from pyspark.sql import SparkSession

builder = (SparkSession.builder
           .appName("change-data-feed-delta-table")
           .master("spark://ZZHPC:7077")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog"))

spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
%load_ext sparksql_magic
%config SparkSql.limit=20
3.

Create the bronze table (as an append-only table):
%%sparksql
CREATE OR REPLACE TABLE default.movie_and_show_titles_cdf (
  show_id STRING,
  type STRING,
  title STRING,
  director STRING,
  cast STRING,
  country STRING,
  date_added STRING,
  release_year STRING,
  rating STRING,
  duration STRING,
  listed_in STRING,
  description STRING
)
USING DELTA
LOCATION '/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/movie_and_show_titles_cdf'
TBLPROPERTIES (delta.enableChangeDataFeed = true, medallionLevel = 'bronze');

4. Write data into the bronze table:
df = (spark.read.format("csv") .option("header", "true") .load("../data/netflix_titles.csv")); df.write.format("delta").mode("append").saveAsTable("default.movie_and_show_titles_cdf")
5.

%%sparksql
CREATE OR REPLACE TABLE default.movie_and_show_titles_cleansed (
  show_id STRING,
  type STRING,
  title STRING,
  director STRING,
  cast STRING,
  country STRING,
  date_added STRING,
  release_year STRING,
  rating STRING,
  duration STRING,
  listed_in STRING,
  description STRING
)
USING DELTA
LOCATION '/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/movie_and_show_titles_cleansed'
TBLPROPERTIES (
  delta.enableChangeDataFeed = true,
  medallionLevel = 'silver',
  updatedFromTable = 'default.movie_and_show_titles_cdf',
  updatedFromTableVersion = '-1'
);

lastUpdateVersion = int(spark.sql("SHOW TBLPROPERTIES default.movie_and_show_titles_cleansed ('updatedFromTableVersion')").first()["value"]) + 1
lastUpdateVersion  # 0


latestVersion = spark.sql("DESCRIBE HISTORY default.movie_and_show_titles_cdf").first()["version"]
latestVersion  # 1
8.

%%sparksql
CREATE OR REPLACE TEMPORARY VIEW bronzeTable_latest_version AS
SELECT * FROM (
  SELECT *,
         RANK() OVER (PARTITION BY (lower(type), lower(title), lower(director), date_added) ORDER BY _commit_version DESC) AS rank
  FROM table_changes('default.movie_and_show_titles_cdf', {lastUpdateVersion}, {latestVersion})
  WHERE type IS NOT NULL AND title IS NOT NULL AND director IS NOT NULL
    AND _change_type != 'update_preimage'
)
WHERE rank = 1;
The table_changes function in SQL is used to retrieve row-level changes from a Delta table between two specific versions. It is part of Delta Change Data Feed (CDF), which allows users to track inserts, updates, and deletes over time.
SELECT * FROM table_changes('<schema>.<table_name>', startVersion, endVersion)
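The same change feed can also be read through the DataFrame API (a sketch; readChangeFeed/startingVersion/endingVersion are the standard Delta CDF read options, and the version numbers here are just examples):

changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 0)
           .option("endingVersion", 1)
           .table("default.movie_and_show_titles_cdf"))

changes.select("show_id", "_change_type", "_commit_version").show(5)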



9.

%%sparksql
MERGE INTO default.movie_and_show_titles_cleansed t
USING bronzeTable_latest_version s
ON lower(t.type) = lower(s.type)
   AND lower(t.title) = lower(s.title)
   AND lower(t.director) = lower(s.director)
   AND t.date_added = s.date_added
WHEN MATCHED AND s._change_type='update_postimage' OR s._change_type='update_preimage' THEN UPDATE SET *
WHEN MATCHED AND s._change_type='delete' THEN DELETE
WHEN NOT MATCHED AND s._change_type='insert' THEN INSERT *

10.

%%sparksql
ALTER TABLE default.movie_and_show_titles_cleansed
SET TBLPROPERTIES (updatedFromTableVersion = {latestVersion});
11.

%%sparksql DROP VIEW bronzeTable_latest_version
12.

%%sparksql DELETE FROM default.movie_and_show_titles_cdf WHERE country is NULL

%%sparksql UPDATE default.movie_and_show_titles_cdf SET director = '' WHERE director is NULL


6.
# get the value of the last updated version from the silver table
lastUpdateVersion = int(spark.sql("SHOW TBLPROPERTIES default.movie_and_show_titles_cleansed ('updatedFromTableVersion')").first()["value"]) + 1
lastUpdateVersion  # 2
7.
# get the latest version of the bronze table
latestVersion = spark.sql("DESCRIBE HISTORY default.movie_and_show_titles_cdf").first()["version"]
latestVersion  # 3
8.
%%sparksql
CREATE OR REPLACE TEMPORARY VIEW bronzeTable_latest_version AS
SELECT * FROM (
  SELECT *,
         RANK() OVER (PARTITION BY (lower(type), lower(title), lower(director), date_added) ORDER BY _commit_version DESC) AS rank
  FROM table_changes('default.movie_and_show_titles_cdf', {lastUpdateVersion}, {latestVersion})
  WHERE type IS NOT NULL AND title IS NOT NULL AND director IS NOT NULL
    AND _change_type != 'update_preimage'
)
WHERE rank = 1;
9.
%%sparksql
MERGE INTO default.movie_and_show_titles_cleansed t
USING bronzeTable_latest_version s
ON lower(t.type) = lower(s.type)
   AND lower(t.title) = lower(s.title)
   AND lower(t.director) = lower(s.director)
   AND t.date_added = s.date_added
WHEN MATCHED AND s._change_type='update_postimage' OR s._change_type='update_preimage' THEN UPDATE SET *
WHEN MATCHED AND s._change_type='delete' THEN DELETE
WHEN NOT MATCHED AND s._change_type='insert' THEN INSERT *

10.
%%sparksql
ALTER TABLE default.movie_and_show_titles_cleansed
SET TBLPROPERTIES (updatedFromTableVersion = {latestVersion});
11.
%%sparksql DROP VIEW bronzeTable_latest_version


from delta import configure_spark_with_delta_pip, DeltaTable
from pyspark.sql import SparkSession

builder = (SparkSession.builder
           .appName("optimize-delta-table")
           .master("spark://ZZHPC:7077")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog"))

spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
%load_ext sparksql_magic
%config SparkSql.limit=20
df = spark.read.format("delta").load("/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles")
%%sparksql DESCRIBE HISTORY "/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles"


Optimize performance with file management - Compaction (bin-packing)
deltaTable = DeltaTable.forPath(spark, "/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles")
# For Hive metastore-based tables: deltaTable = DeltaTable.forName(spark, tableName)

deltaTable.optimize().executeCompaction()
DataFrame[path: string, metrics: struct<numFilesAdded:bigint,numFilesRemoved:bigint,filesAdded:struct<min:bigint,max:bigint,avg:double,totalFiles:bigint,totalSize:bigint>,filesRemoved:struct<min:bigint,max:bigint,avg:double,totalFiles:bigint,totalSize:bigint>,partitionsOptimized:bigint,zOrderStats:struct<strategyName:string,inputCubeFiles:struct<num:bigint,size:bigint>,inputOtherFiles:struct<num:bigint,size:bigint>,inputNumCubes:bigint,mergedFiles:struct<num:bigint,size:bigint>,numOutputCubes:bigint,mergedNumCubes:bigint>,numBatches:bigint,totalConsideredFiles:bigint,totalFilesSkipped:bigint,preserveInsertionOrder:boolean,numFilesSkippedToReduceWriteAmplification:bigint,numBytesSkippedToReduceWriteAmplification:bigint,startTimeMs:bigint,endTimeMs:bigint,totalClusterParallelism:bigint,totalScheduledTasks:bigint,autoCompactParallelismStats:struct<maxClusterActiveParallelism:bigint,minClusterActiveParallelism:bigint,maxSessionActiveParallelism:bigint,minSessionActiveParallelism:bigint>,deletionVectorStats:struct<numDeletionVectorsRemoved:bigint,numDeletionVectorRowsRemoved:bigint>,numTableColumns:bigint,numTableColumnsWithStats:bigint>]
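If the returned DataFrame is captured, the compaction metrics can be inspected directly (a small sketch; the field names are taken from the metrics schema printed above, and re-running the call should simply find nothing left to compact):

result = deltaTable.optimize().executeCompaction()
result.select("metrics.numFilesAdded", "metrics.numFilesRemoved", "metrics.filesAdded.totalSize").show(truncate=False)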

%%sparksql
-- Optimizes the path-based Delta Lake table
OPTIMIZE "/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles"



Data skipping - Z-Ordering (multi-dimensional clustering)
deltaTable = DeltaTable.forPath(spark, "/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles")  # path-based table
# For Hive metastore-based tables: deltaTable = DeltaTable.forName(spark, tableName)

deltaTable.optimize().executeZOrderBy("country")
DataFrame[path: string, metrics: struct<numFilesAdded:bigint,numFilesRemoved:bigint,filesAdded:struct<min:bigint,max:bigint,avg:double,totalFiles:bigint,totalSize:bigint>,filesRemoved:struct<min:bigint,max:bigint,avg:double,totalFiles:bigint,totalSize:bigint>,partitionsOptimized:bigint,zOrderStats:struct<strategyName:string,inputCubeFiles:struct<num:bigint,size:bigint>,inputOtherFiles:struct<num:bigint,size:bigint>,inputNumCubes:bigint,mergedFiles:struct<num:bigint,size:bigint>,numOutputCubes:bigint,mergedNumCubes:bigint>,numBatches:bigint,totalConsideredFiles:bigint,totalFilesSkipped:bigint,preserveInsertionOrder:boolean,numFilesSkippedToReduceWriteAmplification:bigint,numBytesSkippedToReduceWriteAmplification:bigint,startTimeMs:bigint,endTimeMs:bigint,totalClusterParallelism:bigint,totalScheduledTasks:bigint,autoCompactParallelismStats:struct<maxClusterActiveParallelism:bigint,minClusterActiveParallelism:bigint,maxSessionActiveParallelism:bigint,minSessionActiveParallelism:bigint>,deletionVectorStats:struct<numDeletionVectorsRemoved:bigint,numDeletionVectorRowsRemoved:bigint>,numTableColumns:bigint,numTableColumnsWithStats:bigint>]

%%sparksql OPTIMIZE "/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles" ZORDER BY (country)


%%sparksql DESCRIBE HISTORY "/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles"

Partitioning
df = (spark.read.format("json") .option("multiLine", "true") .load("../data/nobel_prizes.json"))
df.printSchema()
root
 |-- category: string (nullable = true)
 |-- laureates: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstname: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- motivation: string (nullable = true)
 |    |    |-- share: string (nullable = true)
 |    |    |-- surname: string (nullable = true)
 |-- overallMotivation: string (nullable = true)
 |-- year: string (nullable = true)
# Write the data to a Delta Lake table with partitioning
(df.write.format("delta")
   .mode("overwrite")
   .partitionBy("year")
   .save("/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/nobel_prizes"))
# Query the partitioned table
df = spark.read.format("delta").load("/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/nobel_prizes")
df.show(10, truncate=60)
+----------+------------------------------------------------------------+------------------------------------------------------------+----+
| category| laureates| overallMotivation|year|
+----------+------------------------------------------------------------+------------------------------------------------------------+----+
| chemistry|[{Paul J., 281, "for their work in atmospheric chemistry,...| null|1995|
| economics|[{Robert E., 714, "for having developed and applied the h...| null|1995|
|literature|[{Seamus, 672, "for works of lyrical beauty and ethical d...| null|1995|
| peace|[{Joseph, 560, "for their efforts to diminish the part pl...| null|1995|
| physics|[{Martin L., 147, "for the discovery of the tau lepton", ...|"for pioneering experimental contributions to lepton phys...|1995|
| medicine|[{Edward B., 452, "for their discoveries concerning the g...| null|1995|
| chemistry|[{Otto, 169, "in recognition of his services to organic c...| null|1910|
|literature|[{Paul, 580, "as a tribute to the consummate artistry, pe...| null|1910|
| peace|[{Permanent International Peace Bureau, 477, "for acting ...| null|1910|
| physics|[{Johannes Diderik, 15, "for his work on the equation of ...| null|1910|
+----------+------------------------------------------------------------+------------------------------------------------------------+----+
only showing top 10 rows
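Because the table is partitioned by year, a filter on the partition column lets Delta skip whole partitions; one way to verify is to look for PartitionFilters in the physical plan (a sketch; the year value is arbitrary):

df_1995 = (spark.read.format("delta")
           .load("/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/nobel_prizes")
           .where("year = '1995'"))

df_1995.explain()  # the plan should list the year predicate under PartitionFilters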

from delta import configure_spark_with_delta_pip, DeltaTable
from pyspark.sql import SparkSession

builder = (SparkSession.builder
           .appName("time-travel-delta-table")
           .master("spark://ZZHPC:7077")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog"))

spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
%load_ext sparksql_magic
%config SparkSql.limit=20
%%sparksql DESCRIBE DETAIL delta.`/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles`


df = spark.read.format("delta").option("versionAsOf", 1).load("/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles") df.show(5, truncate=15)
+-------+-------+---------------+---------------+---------------+-------------+---------------+------------+------+---------+---------------+---------------+
|show_id|   type|          title|       director|           cast|      country|     date_added|release_year|rating| duration|      listed_in|    description|
+-------+-------+---------------+---------------+---------------+-------------+---------------+------------+------+---------+---------------+---------------+
|     s1|  Movie|Dick Johnson...|Kirsten Johnson|           null|United States|September 25...|        2020| PG-13|   90 min|  Documentaries|As her fathe...|
|     s2|TV Show|  Blood & Water|           null|Ama Qamata, ...| South Africa|September 24...|        2021| TV-MA|2 Seasons|Internationa...|After crossi...|
|     s3|TV Show|      Ganglands|Julien Leclercq|Sami Bouajil...|         null|September 24...|        2021| TV-MA| 1 Season|Crime TV Sho...|To protect h...|
|     s4|TV Show|Jailbirds Ne...|           null|           null|         null|September 24...|        2021| TV-MA| 1 Season|Docuseries, ...|Feuds, flirt...|
|     s5|TV Show|   Kota Factory|           null|Mayur More, ...|        India|September 24...|        2021| TV-MA|2 Seasons|Internationa...|In a city of...|
+-------+-------+---------------+---------------+---------------+-------------+---------------+------------+------+---------+---------------+---------------+
only showing top 5 rows
Alternatively, you could do this in SQL:
%%sparksql SELECT * FROM delta.`/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles` VERSION AS OF 1 LIMIT 5;


%%sparksql DESCRIBE HISTORY delta.`/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles`;

deltaTable = DeltaTable.forPath(spark, "/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles")  # path-based tables

deltaTable.restoreToVersion(3)

Alternatively, you could do this in SQL:
%%sparksql RESTORE TABLE delta.`/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles` TO VERSION AS OF 5;

%%sparksql DESCRIBE HISTORY "/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles"



from delta import configure_spark_with_delta_pip, DeltaTable
from pyspark.sql import SparkSession

builder = (SparkSession.builder
           .appName("manage-delta-table")
           .master("spark://ZZHPC:7077")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog"))

spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
%load_ext sparksql_magic
%config SparkSql.limit=20
df = spark.read.format("delta").load("/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles") df.show(5, truncate=25)
+-------+-------+---------------------+---------------+-------------------------+-------------+------------------+------------+------+---------+-------------------------+-------------------------+
|show_id|   type|                title|       director|                     cast|      country|        date_added|release_year|rating| duration|                listed_in|              description|
+-------+-------+---------------------+---------------+-------------------------+-------------+------------------+------------+------+---------+-------------------------+-------------------------+
|     s1|  Movie| Dick Johnson Is Dead|Kirsten Johnson|                     null|United States|September 25, 2021|        2020| PG-13|   90 min|            Documentaries|As her father nears th...|
|     s2|TV Show|        Blood & Water|               |Ama Qamata, Khosi Ngem...| South Africa|September 24, 2021|        2021| TV-MA|2 Seasons|International TV Shows...|After crossing paths a...|
|     s3|TV Show|            Ganglands|Julien Leclercq|Sami Bouajila, Tracy G...|         null|September 24, 2021|        2021| TV-MA| 1 Season|Crime TV Shows, Intern...|To protect his family ...|
|     s4|TV Show|Jailbirds New Orleans|               |                     null|         null|September 24, 2021|        2021| TV-MA| 1 Season|   Docuseries, Reality TV|Feuds, flirtations and...|
|     s5|TV Show|         Kota Factory|               |Mayur More, Jitendra K...|        India|September 24, 2021|        2021| TV-MA|2 Seasons|International TV Shows...|In a city of coaching ...|
+-------+-------+---------------------+---------------+-------------------------+-------------+------------------+------------+------+---------+-------------------------+-------------------------+
only showing top 5 rows

%%sparksql ALTER TABLE delta.`/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles` ALTER COLUMN title NOT NULL
%%sparksql ALTER TABLE delta.`/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles` ADD CONSTRAINT validType CHECK (type IN ('Movie', 'Show','TV Show'));

%%sparksql
CREATE OR REPLACE TABLE delta.`/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles_shallow_clone`
SHALLOW CLONE delta.`/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles`;

Immutability: Delta data files are immutable, and a shallow clone only creates a new transaction log that references the original table's data files rather than copying them. Changes made to the clone are written to the clone's own location, so they do not affect the original table.
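A quick check of this (a minimal sketch; the paths are the source and clone locations from the statement above):

src = "/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles"
clone = "/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles_shallow_clone"

# Both tables show the same rows, but the clone's directory contains only a
# _delta_log that points at the source's data files; no data files are copied.
print(spark.read.format("delta").load(src).count())
print(spark.read.format("delta").load(clone).count())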

%%sparksql ALTER TABLE delta.`/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles` ADD COLUMNS (ID INT)

%%sparksql ALTER TABLE delta.`/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles` ALTER COLUMN country AFTER show_id;
%%sparksql ALTER TABLE delta.`/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles` ALTER COLUMN release_year AFTER show_id;

%%sparksql
ALTER TABLE delta.`/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles`
SET TBLPROPERTIES (
  'delta.minReaderVersion' = '2',
  'delta.minWriterVersion' = '5',
  'delta.columnMapping.mode' = 'name'
)



%%sparksql DESCRIBE TABLE delta.`/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles`;

%%sparksql DESCRIBE DETAIL delta.`/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles`;





%%sparksql ALTER TABLE delta.`/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles` RENAME COLUMN listed_in TO genres

%%sparksql ALTER TABLE delta.`/zdata/Github/Data-Engineering-with-Databricks-Cookbook-main/data/delta_lake/netflix_titles` DROP COLUMN ID
