{"id":1897,"date":"2025-01-12T21:17:04","date_gmt":"2025-01-12T12:17:04","guid":{"rendered":"https:\/\/oboki.net\/workspace\/?p=1897"},"modified":"2025-02-09T19:34:41","modified_gmt":"2025-02-09T10:34:41","slug":"exploring-the-iceberg","status":"publish","type":"post","link":"https:\/\/oboki.net\/workspace\/data-engineering\/exploring-the-iceberg\/","title":{"rendered":"A Taste of Iceberg"},"content":{"rendered":"<p>My company is finally going to adopt Iceberg, so I went looking for material and came across a fairly recent review paper whose content seems solid.<\/p>\n<blockquote>\n<p><a href=\"https:\/\/papers.ssrn.com\/sol3\/papers.cfm?abstract_id=4987315\">Disruptor in Data Engineering &#8211; Comprehensive Review of Apache Iceberg<\/a><\/p>\n<\/blockquote>\n<p>Before getting into Iceberg proper as a &quot;table format&quot;, the paper first goes over the background terms such as Open File Format, Open Table Format, and Data Warehouse\/Data Lake\/Lakehouse. It also lays out the three-layer architecture (Catalog, Metadata, Snapshot) clearly and explains in sufficient detail how row-level deletes work.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/iceberg.apache.org\/assets\/external\/iceberg.apache.org\/assets\/images\/iceberg-metadata.png\" alt=\"\" \/><\/p>\n<p>It also compares Iceberg with the alternative table formats Hudi and Delta Lake, and the supported features, strengths, 
limitations, and so on are well summarized.<\/p>\n<h3>5.3 Capabilities<\/h3>\n<blockquote>\n<p>Iceberg as a table format provides several capabilities that\u2019s causing increased adoption of this<br \/>\ntechnology at a rapid pace in the industry:<\/p>\n<\/blockquote>\n<ul>\n<li>ACID compliance\n<ul>\n<li>data manipulations are atomic providing transaction capability<\/li>\n<\/ul>\n<\/li>\n<li>Hidden partitioning\n<ul>\n<li>Hive requires manual partition column definitions which is inefficient and complex. Iceberg addresses this limitation by providing a hidden partitioning concept. Iceberg manages partitioning internally by using data structure and metadata and users remain completely unaware of the partitioning scheme [24].<\/li>\n<\/ul>\n<\/li>\n<li>Partition Evolution\n<ul>\n<li>if performance degrades over time, Iceberg changes the partitioning scheme on its own enabling partition evolution.<\/li>\n<\/ul>\n<\/li>\n<li>Schema evolution\n<ul>\n<li>Schema evolution is straightforward in Iceberg as only metadata fields are updated. Standard operations such adding a column, dropping an existing column or renaming it are all possible in Iceberg using metadata file manipulation. The data files remain unchanged [26].<\/li>\n<\/ul>\n<\/li>\n<li>Time Travel\n<ul>\n<li>Change in Iceberg causes creating a new version of metadata called snapshot. Old snapshot remains in the system for a while. This allows users to time travel over data using date range or version number of a snapshot [27].<\/li>\n<\/ul>\n<\/li>\n<li>Concurrency\n<ul>\n<li>Iceberg allows concurrent reads and writes by multiple engines at the same time leveraging optimistic concurrency control[10, 28]. 
When there are multiple concurrent requests, Iceberg checks for conflicts at the file level, allowing multiple updates in a partition as long as there are no conflicts.<\/li>\n<\/ul>\n<\/li>\n<li>File filtering\n<ul>\n<li>The metadata files contain min, max values for a column. This allows query that searches over petabytes of data, come back with result in single digit seconds producing significant fast performance.<\/li>\n<\/ul>\n<\/li>\n<li>Table Migration\n<ul>\n<li>Iceberg provides a mechanism to create Iceberg metadata files using metadata files of other table formats such as Delta Lake using its table migration feature. It comes in 2 flavors- full data migration and in-place data migration. The full data migration creates a copy of the data files along with creating the Iceberg metadata files. The in-place migration on the other hand reuses the data files and only creates new Iceberg metadata files [29].<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>6.3 Architecture &amp; Capability Comparison<\/h3>\n<table>\n<thead>\n<tr>\n<th>Iceberg<\/th>\n<th>Delta Lake<\/th>\n<th>Hudi<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Supports Parquet, ORC and Avro data file format<\/td>\n<td>Supports only Parquet format<\/td>\n<td>Supports Parquet, ORC and Avro format<\/td>\n<\/tr>\n<tr>\n<td>Metadata implementation hierarchical in nature: catalog, metadata layer (metadata file, manifest-list and manifest file)<\/td>\n<td>Metadata implementation tabular in nature: transaction log file and checkpoint file<\/td>\n<td>Metadata implementation tabular in nature: partition metadata and timeline metadata<\/td>\n<\/tr>\n<tr>\n<td>No caching support for performance optimization<\/td>\n<td>Delta Cache support for performance optimization<\/td>\n<td>No caching support for performance optimization<\/td>\n<\/tr>\n<tr>\n<td>Supports Copy-On-Write(COW) and Merge-On-Read(MOR) for write operations<\/td>\n<td>Supports only Copy-On-Write(COW)<\/td>\n<td>Supports both Copy-On-Write(COW) and 
Merge-On-Read(MOR)<\/td>\n<\/tr>\n<tr>\n<td>Hidden partitioning and partition evolution supported<\/td>\n<td>Explicit partitioning required, partition pruning supported<\/td>\n<td>Explicit partitioning required and partition evolution supported<\/td>\n<\/tr>\n<tr>\n<td>Snapshot versions are used for time travel<\/td>\n<td>Transaction log based versioning for time travel<\/td>\n<td>Time travel based on incremental commits and versions<\/td>\n<\/tr>\n<tr>\n<td>Schema evolution supported including addition and rename of fields<\/td>\n<td>Supports schema evolution without requiring schema migration<\/td>\n<td>Schema evolution supported with automatic handling of schema migration<\/td>\n<\/tr>\n<tr>\n<td>Concurrent writes are supported using Optimistic Concurrency Control and snapshots<\/td>\n<td>Transaction log based Optimistic Concurrency Control<\/td>\n<td>Optimistic concurrency control using multi-version concurrency control<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Iceberg Table Spec<\/h2>\n<p>For more accurate and detailed information, refer to the official documentation.<\/p>\n<blockquote>\n<p><a href=\"https:\/\/iceberg.apache.org\/spec\/\">https:\/\/iceberg.apache.org\/spec\/<\/a><\/p>\n<\/blockquote>\n<h2>Quickstart<\/h2>\n<p>As a first taste, I also tried spark-sql by following the quickstart guide in the official documentation:<br \/>\n<a href=\"https:\/\/iceberg.apache.org\/spark-quickstart\/\">https:\/\/iceberg.apache.org\/spark-quickstart\/<\/a><\/p>\n<p>I used the provided Docker Compose spec as-is:<\/p>\n<pre><code class=\"language-yml\">services:\n  spark-iceberg:\n    image: tabulario\/spark-iceberg\n    container_name: spark-iceberg\n    build: spark\/\n    networks:\n      iceberg_net:\n    depends_on:\n      - rest\n      - minio\n    
volumes:\n      - .\/warehouse:\/home\/iceberg\/warehouse\n      - .\/notebooks:\/home\/iceberg\/notebooks\/notebooks\n    environment:\n      - AWS_ACCESS_KEY_ID=admin\n      - AWS_SECRET_ACCESS_KEY=password\n      - AWS_REGION=us-east-1\n    ports:\n      - 8888:8888\n      - 8080:8080\n      - 10000:10000\n      - 10001:10001\n  rest:\n    image: apache\/iceberg-rest-fixture\n    container_name: iceberg-rest\n    networks:\n      iceberg_net:\n    ports:\n      - 8181:8181\n    environment:\n      - AWS_ACCESS_KEY_ID=admin\n      - AWS_SECRET_ACCESS_KEY=password\n      - AWS_REGION=us-east-1\n      - CATALOG_WAREHOUSE=s3:\/\/warehouse\/\n      - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO\n      - CATALOG_S3_ENDPOINT=http:\/\/minio:9000\n  minio:\n    image: minio\/minio\n    container_name: minio\n    environment:\n      - MINIO_ROOT_USER=admin\n      - MINIO_ROOT_PASSWORD=password\n      - MINIO_DOMAIN=minio\n    networks:\n      iceberg_net:\n        aliases:\n          - warehouse.minio\n    ports:\n      - 9001:9001\n      - 9000:9000\n    command: [&quot;server&quot;, &quot;\/data&quot;, &quot;--console-address&quot;, &quot;:9001&quot;]\n  mc:\n    depends_on:\n      - minio\n    image: minio\/mc\n    container_name: mc\n    networks:\n      iceberg_net:\n    environment:\n      - AWS_ACCESS_KEY_ID=admin\n      - AWS_SECRET_ACCESS_KEY=password\n      - AWS_REGION=us-east-1\n    entrypoint: |\n      \/bin\/sh -c &quot;\n      until (\/usr\/bin\/mc config host add minio http:\/\/minio:9000 admin password) do echo &#039;...waiting...&#039; &amp;&amp; sleep 1; done;\n      \/usr\/bin\/mc rm -r --force minio\/warehouse;\n      \/usr\/bin\/mc mb minio\/warehouse;\n      \/usr\/bin\/mc policy set public minio\/warehouse;\n      tail -f \/dev\/null\n      &quot;\nnetworks:\n  iceberg_net:<\/code><\/pre>\n<p>Then I brought up the services and opened a spark-sql session:<\/p>\n<pre><code class=\"language-bash\">docker compose up -d\ndocker exec -it spark-iceberg 
spark-sql<\/code><\/pre>\n<p>I ran the quickstart queries with a DELETE operation added, as shown here, and they work fine.<\/p>\n<pre><code class=\"language-sql\">CREATE TABLE\n       demo.nyc.taxis\n     ( vendor_id          BIGINT\n     , trip_id            BIGINT\n     , trip_distance      FLOAT\n     , fare_amount        DOUBLE\n     , store_and_fwd_flag STRING\n     ) PARTITIONED BY (vendor_id);\n\nINSERT\n  INTO demo.nyc.taxis\nVALUES\n     ( 1, 1000371, 1.8, 15.32, &#039;N&#039; ),\n     ( 2, 1000372, 2.5, 22.15, &#039;N&#039; ),\n     ( 2, 1000373, 0.9, 9.01 , &#039;N&#039; ),\n     ( 1, 1000374, 8.4, 42.13, &#039;Y&#039; );\n\nSELECT *\n  FROM demo.nyc.taxis;\n-- 1       1000371 1.8     15.32   N\n-- 1       1000374 8.4     42.13   Y\n-- 2       1000372 2.5     22.15   N\n-- 2       1000373 0.9     9.01    N\n\nDELETE\n  FROM demo.nyc.taxis\n WHERE store_and_fwd_flag = &#039;N&#039;;\n\nSELECT *\n  FROM demo.nyc.taxis;\n-- 1       1000374 8.4     42.13   Y<\/code><\/pre>\n<p>Using the mc container included in the quickstart Docker spec, I inspected the warehouse state between SQL operations; the final state looks like this.<\/p>\n<pre><code class=\"language-bash\"># tree taxis\ntaxis\n\u251c\u2500\u2500 data\n\u2502   \u251c\u2500\u2500 vendor_id=1\n\u2502   \u2502   \u251c\u2500\u2500 00000-4-2732d28f-f6ec-4654-9bdc-736051cc9b7e-0-00001.parquet\n\u2502   \u2502   \u2514\u2500\u2500 00000-9-7e70c547-66cd-41d4-9403-d5aac793cf38-0-00001.parquet\n\u2502   \u2514\u2500\u2500 vendor_id=2\n\u2502       \u2514\u2500\u2500 00000-4-2732d28f-f6ec-4654-9bdc-736051cc9b7e-0-00002.parquet\n\u2514\u2500\u2500 metadata\n    \u251c\u2500\u2500 00000-22489044-d255-4126-b1c1-be202143039c.metadata.json\n    \u251c\u2500\u2500 00001-47cba4e3-2a1f-4cb4-85ef-ea023c692eae.metadata.json\n    
\u251c\u2500\u2500 00002-34f71565-4d72-479e-966a-a449d0a93f91.metadata.json\n    \u251c\u2500\u2500 4d1505dc-d959-4426-a205-b563e48f4268-m0.avro\n    \u251c\u2500\u2500 5c72ac74-0755-4fc1-b368-b55b0e363f43-m0.avro\n    \u251c\u2500\u2500 5c72ac74-0755-4fc1-b368-b55b0e363f43-m1.avro\n    \u251c\u2500\u2500 snap-10300693507100651-1-5c72ac74-0755-4fc1-b368-b55b0e363f43.avro\n    \u2514\u2500\u2500 snap-2391828917792661493-1-4d1505dc-d959-4426-a205-b563e48f4268.avro<\/code><\/pre>\n<p>I am not sure whether it is because the row count is so small or because I went through spark-sql, but in the final snapshot the DELETE statement was recorded as an overwrite.<\/p>\n<pre><code class=\"language-json\">{\n  ...\n  &quot;snapshots&quot; : [ {\n    &quot;sequence-number&quot; : 1,\n    &quot;snapshot-id&quot; : 2391828917792661493,\n    &quot;summary&quot; : {\n      &quot;operation&quot; : &quot;append&quot;,\n      &quot;spark.app.id&quot; : &quot;local-1736957601732&quot;,\n      &quot;added-data-files&quot; : &quot;2&quot;,\n      &quot;added-records&quot; : &quot;4&quot;,\n      &quot;added-files-size&quot; : &quot;3074&quot;,\n      &quot;changed-partition-count&quot; : &quot;2&quot;,\n      &quot;total-records&quot; : &quot;4&quot;,\n      &quot;total-files-size&quot; : &quot;3074&quot;,\n      &quot;total-data-files&quot; : &quot;2&quot;,\n      &quot;total-delete-files&quot; : &quot;0&quot;,\n      &quot;total-position-deletes&quot; : &quot;0&quot;,\n      &quot;total-equality-deletes&quot; : &quot;0&quot;\n    },\n    &quot;manifest-list&quot; : &quot;s3:\/\/warehouse\/nyc\/taxis\/metadata\/snap-2391828917792661493-1-4d1505dc-d959-4426-a205-b563e48f4268.avro&quot;,\n    &quot;schema-id&quot; : 0\n  }, {\n    &quot;sequence-number&quot; : 2,\n    &quot;snapshot-id&quot; : 10300693507100651,\n    &quot;parent-snapshot-id&quot; : 
2391828917792661493,\n    &quot;summary&quot; : {\n      &quot;operation&quot; : &quot;overwrite&quot;,\n      &quot;spark.app.id&quot; : &quot;local-1736957601732&quot;,\n      &quot;added-data-files&quot; : &quot;1&quot;,\n      &quot;deleted-data-files&quot; : &quot;2&quot;,\n      &quot;added-records&quot; : &quot;1&quot;,\n      &quot;deleted-records&quot; : &quot;4&quot;,\n      &quot;added-files-size&quot; : &quot;1494&quot;,\n      &quot;removed-files-size&quot; : &quot;3074&quot;,\n      &quot;changed-partition-count&quot; : &quot;2&quot;,\n      &quot;total-records&quot; : &quot;1&quot;,\n      &quot;total-files-size&quot; : &quot;1494&quot;,\n      &quot;total-data-files&quot; : &quot;1&quot;,\n      &quot;total-delete-files&quot; : &quot;0&quot;,\n      &quot;total-position-deletes&quot; : &quot;0&quot;,\n      &quot;total-equality-deletes&quot; : &quot;0&quot;\n    },\n    &quot;manifest-list&quot; : &quot;s3:\/\/warehouse\/nyc\/taxis\/metadata\/snap-10300693507100651-1-5c72ac74-0755-4fc1-b368-b55b0e363f43.avro&quot;,\n    &quot;schema-id&quot; : 0\n  } ],\n  ...\n}<\/code><\/pre>\n<p>I would like to observe the point at which some rows only appear to be deleted through delete files, but that will take a bit more investigation.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>My company is finally going to adopt Iceberg, so I went looking for material and came across a fairly recent review paper whose content seems solid. 
Disruptor in Data Engineering &#8211; Comprehensive Review of Apache Iceberg Before getting into Iceberg proper as a &quot;table format&quot;, the paper first goes over the background terms such as Open File Format, Open Table Format, and Data Warehouse\/Data Lake\/Lakehouse, and the three Catalog, Metadata, Snapshot [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9,218],"tags":[],"class_list":["post-1897","post","type-post","status-publish","format-standard","hentry","category-data-engineering","category-papers"],"_links":{"self":[{"href":"https:\/\/oboki.net\/workspace\/wp-json\/wp\/v2\/posts\/1897","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/oboki.net\/workspace\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/oboki.net\/workspace\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/oboki.net\/workspace\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/oboki.net\/workspace\/wp-json\/wp\/v2\/comments?post=1897"}],"version-history":[{"count":0,"href":"https:\/\/oboki.net\/workspace\/wp-json\/wp\/v2\/posts\/1897\/revisions"}],"wp:attachment":[{"href":"https:\/\/oboki.net\/workspace\/wp-json\/wp\/v2\/media?parent=1897"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/oboki.net\/workspace\/wp-json\/wp\/v2\/categories?post=1897"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/oboki.net\/workspace\/wp-json\/wp\/v2\/tags?post=1897"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}