Change One Line of Code to Make Your Spark Jobs Work Again | by Sarah Floris | May, 2022

How to fix those Spark machine learning or data transformation jobs that are failing on the new Apple M1 chips

Photo by Maxim Hopman on Unsplash

After more than a decade of creating and testing industry-leading chips for the iPhone, iPad, and Apple Watch, Apple introduced the Apple M1 chip for the MacBooks.¹ The M1 chip features “the world’s fastest CPU core in low-power silicon, the world’s best CPU performance per watt, the world’s fastest integrated graphics in a personal computer, and breakthrough machine learning performance with the Apple Neural Engine,” delivering “up to 3.5x faster CPU performance, up to 6x faster GPU performance, and up to 15x faster machine learning, all while enabling battery life up to 2x longer.”¹

Although these numbers sounded fantastic, it was a bit of a shock when I realized that most of the software I use as a data engineer or scientist no longer worked. And the concern was heightened because I no longer had the option to buy a MacBook laptop with an Intel processor.

I’ll take you on the adventure of how I noticed that my Spark jobs were no longer working, and how I ended up having to change just one line of code to make them work again.

Let’s talk about the software I use regularly. I’m a data engineer who sets up her environment using Docker, Kubernetes, Airflow, and Spark.

Docker had a quick turnaround in supporting the Apple M1 chip, implementing a new buildx command that lets you build an image with multi-platform support:

docker buildx build --platform linux/amd64,linux/arm64 .
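Depending on how your Docker setup is configured, multi-platform builds may first require creating a dedicated buildx builder. A minimal sketch, where the builder name m1builder is just an example:

# create a buildx builder and make it the active one (name is arbitrary)
docker buildx create --name m1builder --use

# verify the builder started and see which platforms it can target
docker buildx inspect --bootstrap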

Unfortunately, even though you can download Docker and Kubernetes for the Apple M1 chip, that doesn’t guarantee the images will run, as I quickly found out while investigating my jobs. I was seeing odd behavior when I ran k9s, a Kubernetes command-line tool that helps me manage my clusters. You’ll see a screen similar to the one shown in this screenshot.

Have a look at the red-circled column labeled “RESTART.”

Image by Author.

In that column, any number above 0 or 1 usually means that the image isn’t built correctly or that something is erroring. While I was working, I would see that number climb to over 300, and only for the pods related to the Spark servers or the Hadoop ecosystem.
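You don’t need k9s to spot this, by the way; plain kubectl shows the same restart counts. A quick sketch, assuming your Spark pods live in a namespace called spark (replace the namespace and pod name with your own):

# list pods and their RESTARTS count
kubectl get pods -n spark

# dig into why a particular pod keeps restarting
kubectl describe pod <pod-name> -n spark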

Spark allows interactive analysis of large datasets and provides high-level APIs in Java, Scala, Python, and R. The Spark API can be used to write new applications or to use the pre-built libraries to solve common machine learning problems such as clustering analysis or classification.

Spark requires Java, specifically Java 8 or 11. I had installed Java using brew:

brew install openjdk@8

which installs Oracle’s Java distribution. And Oracle’s distribution doesn’t support ARM64 until Java 17. I needed to figure out a way around this, and that’s when I found out about Azul.

I mentioned earlier that Oracle’s distribution doesn’t currently support ARM64 until Java 17. Once I figured that out, I went looking for a Java build that does support ARM64. Azul is one of the companies that provides Java 8 and 11 for ARM64. Here’s the installer.

The default installation folder is

/Library/Java/JavaVirtualMachines/<zulu_folder>/Contents/Home

where zulu_folder is the name of the Azul package you downloaded.

Make sure it’s installed correctly by running:

java -version
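If java -version still reports a different JDK, you may also need to point JAVA_HOME at the Zulu installation. A minimal sketch, reusing the placeholder folder from above:

# point JAVA_HOME at the Zulu JDK (replace <zulu_folder> with your actual package name)
export JAVA_HOME=/Library/Java/JavaVirtualMachines/<zulu_folder>/Contents/Home
export PATH="$JAVA_HOME/bin:$PATH"

java -version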

Installing the Correct Java Version via Dockerfile

Unfortunately, the image that I was using to download Spark doesn’t support ARM64 yet (see this backlog item from Bitnami), so I had to create my own Dockerfile and build it.

I started out with the Azul image that supports ARM64 and then added the extra packages for Spark and the Hadoop jar files.

This Dockerfile was inspired by https://datachef.co/blog/run-spark-applications-on-aws-fargate/.
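The embedded Dockerfile isn’t reproduced here, but a minimal sketch of the same idea looks roughly like this; the Azul base image tag, the Spark and Hadoop versions, and the download URL are assumptions, so adjust them to whatever your cluster actually runs:

# Azul Zulu OpenJDK 11 base image with ARM64 support (tag is an assumption)
FROM azul/zulu-openjdk:11

# Spark and Hadoop versions are examples; pick the ones you need
ENV SPARK_VERSION=3.2.1 \
    HADOOP_VERSION=3.2 \
    SPARK_HOME=/opt/spark

# download a pre-built Spark distribution that already bundles the Hadoop jars
RUN apt-get update && apt-get install -y curl ca-certificates && \
    curl -fsSL https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
      | tar -xz -C /opt && \
    mv /opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} ${SPARK_HOME}

ENV PATH="${SPARK_HOME}/bin:${SPARK_HOME}/sbin:${PATH}"

# start a Spark master by default; override the command for workers or spark-submit
CMD ["/opt/spark/bin/spark-class", "org.apache.spark.deploy.master.Master"]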

First, I built the image using the new Docker feature buildx and tagged it as spark using -t:

docker buildx build --platform linux/arm64 -t spark:latest .

and then ran the Spark image

docker run spark:latest

to get this beautiful screen:

Screenshot by Author

And now, we’re back in business.

And that solved the problem for me when building the Spark server. This was only one Dockerfile of many, so I still have some building to do. But I hope that you’re now able to check your big dataset’s data quality, run those transform jobs, or run your machine learning models using the newfound powers of the Apple M1 chip.

Thanks for reading, and I’ll see you next time!

[1] https://www.apple.com/newsroom/2020/11/apple-unleashes-m1/
