Meet “spark-sight”: Spark Performance at a Glance
by Alfredo Fomitchenko, May 2022

An open-source project I created

Photograph by Jez Timms on Unsplash

I like Spark, but I don’t love the Spark UI.

This is why, at first, I was excited to find out that Data Mechanics was creating Delight, sharing with the world

  • an open-source agent gathering data from within the Spark application;
  • a closed-source dashboard displaying the performance of a Spark application in terms of CPU efficiency and memory usage.

Although the service is free of charge,

  1. you need to integrate into your Spark application their custom listener, which collects data and sends it out to their servers;
  2. it may be hard for your boss to approve such a practice (privacy concerns), or outright impossible (e.g. your application runs in a Glue job inside a VPC without internet access).

Eager to put together everything I know about Spark, I took on the challenge of recreating the same wonderful experience of Delight for everybody to enjoy and contribute to.

This is why I’m sharing spark-sight with you.

spark-sight is a less detailed, more intuitive representation of what is going on inside your Spark application in terms of performance:

  • CPU time spent doing the “actual work”
  • CPU time spent doing shuffle reading and writing
  • CPU time spent doing serialization and deserialization
  • (coming) Memory usage per executor
  • (coming) Memory spill intensity per stage

spark-sight is not meant to replace the Spark UI altogether; rather, it provides a bird’s-eye view of the stages, allowing you to identify at a glance which portions of the execution may need improvement.

The Plotly figure consists of two charts with a synced x-axis.

Top: efficiency in terms of CPU cores available for tasks

Bottom: stages timeline

To install it,

$ pip install spark-sight

To meet it,

$ spark-sight --help

To launch it,

$ spark-sight \
    --path "/path/to/spark-application-12345" \
    --cpus 32 \
    --deploy_mode "cluster_mode"

or, on Windows PowerShell,

$ spark-sight `
    --path "C:\path\to\spark-application-12345" `
    --cpus 32 `
    --deploy_mode "cluster_mode"

A new browser tab will be opened.

For more information, head over to the spark-sight GitHub repo.

The Spark event log is a simple text file that Spark is natively able to store somewhere for you to open and look through:

--conf spark.eventLog.enabled=true
--conf spark.eventLog.dir=file:///c:/somewhere
--conf spark.history.fs.logDirectory=file:///c:/somewhere
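
If you configure your application in code rather than on the command line, the same settings can be passed to the session builder. A minimal PySpark sketch (the directory is a placeholder; adjust it to your environment):

from pyspark.sql import SparkSession

# Enable the event log programmatically; "file:///c:/somewhere" is a placeholder.
# spark.history.fs.logDirectory belongs to the history server, not the application.
spark = (
    SparkSession.builder
    .appName("my-app")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "file:///c:/somewhere")
    .getOrCreate()
)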

As described in the Spark documentation, the performance data are hidden somewhere in the text file (see the parsing sketch after this list):

  • SparkListenerTaskEnd events: how long the task was shuffling, serializing, deserializing, and doing the “actual work” it was supposed to do in the first place
  • SparkListenerStageCompleted events: how long the corresponding stage was in the submitted state
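
Under the hood, the event log is just one JSON object per line, so pulling these events out takes only a few lines of Python. A minimal sketch (not the actual spark-sight code; the path is a placeholder):

import json

# Each line of the event log is a standalone JSON event
with open("/path/to/spark-application-12345") as f:
    events = [json.loads(line) for line in f]

# Keep only the events carrying performance data
task_ends = [e for e in events if e["Event"] == "SparkListenerTaskEnd"]
stages = [e for e in events if e["Event"] == "SparkListenerStageCompleted"]

# e.g. the "actual work" CPU time (nanoseconds) reported for each task
cpu_times = [e["Task Metrics"]["Executor CPU Time"] for e in task_ends]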

Given a stage, the efficiency of the stage is the ratio between

  1. Used CPU time:
    total CPU time of “actual work”
    across all the tasks of the stage
  2. Available CPU time:
    total CPU time (idle or busy)
    across all cluster nodes during the stage submission

Can we compute this per stage, then? No: multiple stages can be submitted onto the cluster at the same time, so you can’t really compute the efficiency of a stage on its own, because in the meantime the cluster may be executing other stages.

We need to change the definition of efficiency.

Given a time interval, the efficiency of the time interval is the ratio between

  1. Used CPU time:
    total CPU time of “actual work”
    across all the tasks of all the stages submitted in that time interval
    (no longer just across the tasks of a single stage)
  2. Available CPU time:
    total CPU time (idle or busy)
    across all cluster nodes in that time interval
    (no longer just during a single stage’s submission)
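
In code, the new definition boils down to a simple ratio. A minimal sketch (not the actual spark-sight code), assuming the used CPU time of the interval has already been summed up:

def interval_efficiency(used_cpu_seconds, interval_seconds, cluster_cpus):
    # Available CPU time is every core of the cluster, idle or busy,
    # for the whole duration of the interval
    available_cpu_seconds = interval_seconds * cluster_cpus
    return used_cpu_seconds / available_cpu_seconds

# e.g. 200 s of "actual work" during a 10 s interval on a 32-core cluster
print(interval_efficiency(200, 10, 32))  # 0.625, i.e. 62.5% efficiency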

You may think stage boundaries would be enough to delimit these time intervals, but tasks can run across those boundaries. In the following diagram, notice task 1 running across the boundary:

How do you split the task metrics in two?

For task 1, only compound information about CPU usage is reported. For example, very different situations are all reported equivalently as

  • the task ran for 10 seconds
  • the task used the CPU for 4 seconds
{
    "Event": "SparkListenerTaskEnd",
    "Stage ID": 0,
    "Task Info": {
        "Task ID": 0,
        ...
    },
    "Task Metrics": {
        "Executor CPU Time": 4000000000 (nanoseconds)
        ...
    }
}

The simplest solution for splitting the task is to split the CPU usage and the other metrics proportionally to the resulting duration of each part of the split (technically speaking, we assume a uniform probability distribution across the interval).

Notice that this approximation may create artifacts, e.g. efficiency going above 100%.
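
For the curious, here is a minimal sketch (not the actual spark-sight code) of this proportional split for a single task straddling one boundary; the start, end, and boundary values are illustrative:

def split_task_cpu(task_start, task_end, cpu_time_ns, boundary):
    # Attribute the CPU time to the two sides of the boundary
    # proportionally to the time the task spent on each side
    left_fraction = (boundary - task_start) / (task_end - task_start)
    left_cpu = cpu_time_ns * left_fraction
    return left_cpu, cpu_time_ns - left_cpu

# The task above: ran from t=5 s to t=15 s, used the CPU for 4 s,
# with a stage boundary at t=12 s
print(split_task_cpu(5.0, 15.0, 4_000_000_000, 12.0))
# (2800000000.0, 1200000000.0) nanoseconds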

I used the Python library Plotly, which makes it really easy to put together a simple visualization like this one, providing a lightweight and interactive interface.

Notice that the visualization improves on the time intervals discussed above. In fact, the top bar chart further splits the time intervals by identifying when the first and last task of each stage actually started.
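
To give an idea of the kind of figure involved, here is a minimal Plotly sketch (not the actual spark-sight code): an efficiency bar chart on top and a stage timeline at the bottom, sharing the same x-axis. All the data is made up:

import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Made-up, precomputed data
interval_centers = [5, 15, 25]            # seconds
efficiency = [0.85, 0.40, 0.95]           # used CPU / available CPU
stage_ids = ["stage 0", "stage 1", "stage 2"]
stage_start = [0, 8, 20]                  # seconds
stage_duration = [10, 14, 9]              # seconds

fig = make_subplots(rows=2, cols=1, shared_xaxes=True,
                    row_heights=[0.7, 0.3], vertical_spacing=0.05)

# Top: efficiency per 10-second time interval
fig.add_trace(
    go.Bar(x=interval_centers, y=efficiency, width=10, name="efficiency"),
    row=1, col=1,
)

# Bottom: stage timeline as horizontal bars (length = duration, base = start)
fig.add_trace(
    go.Bar(y=stage_ids, x=stage_duration, base=stage_start,
           orientation="h", name="stages"),
    row=2, col=1,
)

fig.update_yaxes(tickformat=".0%", row=1, col=1)
fig.update_xaxes(title_text="time (s)", row=2, col=1)
fig.show()  # opens the interactive figure in a browser tab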

Medium term

I plan to add charts for

  • Memory usage per executor
  • Memory spill intensity per stage

and then convert the simple figure into a full-fledged Dash (Plotly) application to improve the UX.

Long term

I plan to add

  • the ability to read the Spark event log from other data sources (e.g. S3, GCS, Azure Storage, HDFS, …)
  • displaying multiple Spark applications at the same time so that performance can be compared (e.g. you ran the same application with different spark.sql.shuffle.partitions, spark.executor.memory, …)
  • displaying the efficiency of multiple Spark applications running on the same cluster at the same time
  • taking into account non-static cluster configurations (currently the number of CPUs is assumed not to change)

If you find this project useful, here are the instructions

  1. Head over to the spark-sight GitHub repo
  2. Don’t be gentle on the star button

If you encounter any problems, here are the instructions

  1. Head over to the spark-sight GitHub repo
  2. Don’t be gentle on the issue button

If you want to be notified about future improvements, here are the instructions

  1. Head over to the spark-sight GitHub repo
  2. Don’t be gentle on the watch button
