Blog rovingdev

What is the '_spark_metadata' Directory in Spark Structured Streaming?


Spark Structured Streaming is a real-time data processing framework in Apache Spark. It enables continuous, scalable, and fault-tolerant data ingestion, transformation, and analysis from various sources, like Kafka or files. Unlike traditional batch processing, it treats data as an unending stream, allowing for low-latency analytics. Queries are expressed using familiar SQL or DataFrame API, making it user-friendly. It provides built-in support for event-time processing and ensures exactly-once processing semantics, making it suitable for various real-time applications like monitoring, ETL, and more.

In this article, we’ll explore a critical component of Spark Structured Streaming: the _spark_metadata directory.


In this article, we will assume a streaming pipeline of the following shape:

Apache Kafka -> Apache Spark -> Apache Hadoop (HDFS)
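A minimal sketch of such a job, assuming hypothetical broker addresses, topic names, and HDFS paths (none of these come from the article):

```scala
import org.apache.spark.sql.SparkSession

object KafkaToHdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-hdfs").getOrCreate()

    // Read a stream of records from Kafka (broker and topic are placeholders).
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // Write to HDFS as Parquet. It is this file sink that creates the
    // _spark_metadata directory under the output path.
    val query = events.writeStream
      .format("parquet")
      .option("path", "hdfs:///tmp/landing/streaming_store")
      .option("checkpointLocation", "hdfs:///tmp/landing/checkpoints")
      .start()

    query.awaitTermination()
  }
}
```

Note that the checkpoint location (offsets and state) lives outside the output path, while _spark_metadata lives inside it.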

Architecture

What is _spark_metadata Directory?

The _spark_metadata directory is a write-ahead log that Spark's file sink (FileStreamSink) keeps inside the output directory itself. For every completed micro-batch, it records exactly which output files were successfully committed. This log is what gives the file sink its exactly-once guarantee, and it tells later readers of the directory which files are valid.

How to run multiple structured streaming jobs for the same output path?
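Two file-sink queries cannot safely share one output directory, because each would write to the same _spark_metadata log and corrupt it. A common workaround is to let each job write through foreachBatch, which performs an ordinary batch write and therefore does not create _spark_metadata at all. A sketch, with hypothetical topics and paths; note that plain batch appends weaken the file sink's exactly-once guarantee to at-least-once on retries:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SharedOutputPath {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("shared-output").getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()

    // Each micro-batch arrives as a regular DataFrame. The batch Parquet
    // writer below does NOT maintain _spark_metadata, so other jobs may
    // append to the same directory. (An explicit function type avoids the
    // Scala/Java overload ambiguity of foreachBatch on Scala 2.12.)
    val writeBatch: (DataFrame, Long) => Unit = (batch, batchId) =>
      batch.write.mode("append").parquet("hdfs:///tmp/landing/streaming_store")

    val query = stream.writeStream
      .foreachBatch(writeBatch)
      .option("checkpointLocation", "hdfs:///tmp/landing/checkpoints/job-1")
      .start()

    query.awaitTermination()
  }
}
```

Each job still needs its own checkpointLocation; only the output data directory is shared.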

The directory name is defined in Spark's own source, in FileStreamSink:

package org.apache.spark.sql.execution.streaming

import scala.util.control.NonFatal
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.internal.Logging
import org.apache.spark.internal.io.FileCommitProtocol
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.execution.datasources.{FileFormat, FileFormatWriter}

object FileStreamSink extends Logging {
  // The name of the subdirectory that is used to store metadata about which files are valid.
  val metadataDir = "_spark_metadata"
  // ...
}

Structure of _spark_metadata directory:

Each committed micro-batch adds one log file under _spark_metadata, named after its batch ID (0, 1, 2, ...). A log file begins with a version header (v1), followed by one JSON entry per output file committed in that batch:

/tmp/landing/streaming_store/_spark_metadata/0
/tmp/landing/streaming_store/_spark_metadata/1
$ hadoop fs -cat /tmp/landing/streaming_store/_spark_metadata/0
v1
{"path":"hdfs://tmp/landing/streaming_store/part-00000-34bdc752-70a2-310f-92dd-7ca9c000c34b-c000.snappy.parquet","size":2287,"isDir":false,"modificationTime":1616075186000,"blockReplication":1,"blockSize":33554432,"action":"add"}
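To make the format concrete, here is a small standalone sketch that pulls fields out of one such v1 log entry. A real reader would use a proper JSON parser (Spark itself bundles Jackson for this); the regex here is only to keep the example dependency-free. The entry string is copied verbatim from the listing above:

```scala
object MetadataEntry {
  // One entry from _spark_metadata/0, after the "v1" version header.
  val entry: String =
    """{"path":"hdfs://tmp/landing/streaming_store/part-00000-34bdc752-70a2-310f-92dd-7ca9c000c34b-c000.snappy.parquet","size":2287,"isDir":false,"modificationTime":1616075186000,"blockReplication":1,"blockSize":33554432,"action":"add"}"""

  // Extract a string-valued field such as "path" or "action".
  def stringField(json: String, name: String): Option[String] =
    ("\"" + name + "\":\"([^\"]+)\"").r.findFirstMatchIn(json).map(_.group(1))

  // Extract a numeric field such as "size" or "modificationTime".
  def longField(json: String, name: String): Option[Long] =
    ("\"" + name + "\":(\\d+)").r.findFirstMatchIn(json).map(_.group(1).toLong)
}
```

For example, MetadataEntry.stringField(MetadataEntry.entry, "action") returns Some("add"), and MetadataEntry.longField(MetadataEntry.entry, "size") returns Some(2287).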

What will happen if you delete _spark_metadata directory?

The file sink treats _spark_metadata as its source of truth. On restart, the query consults this log to know which batches were already committed, and batch readers of the output path use it to list only the files that were successfully written. If the directory is deleted, Spark falls back to plain directory listing, so partially written or uncommitted files may be picked up, and a restarted query can rewrite batches it had already completed, producing duplicate data.

PS: 🚨 _spark_metadata should not be deleted at all. 🚨

