ETL Pipelines and Neural Network Classification – Python vs Scala

In this article, we will compare Python and Scala for building an ETL (Extract, Transform, Load) pipeline and implementing a Neural Network classifier. We will look at ease of use and implementation speed for that task, discuss some of the library support, and weigh some pros and cons of the complexity-handling mechanisms present in both languages. We will be using the MNIST of Handwritten Digits dataset.

Tools and Data 

MNIST of Handwritten Digits is one of the most popular open-source contemporary Machine Learning datasets. It comprises 60k 28×28 training images and 10k 28×28 test images of handwritten digits, amounting to 10 classes in total. We will be using it because of its popularity and the numerous examples you can find online. MNIST is a great dataset that comes in a horrible format, so we will not be using the original packaging provided by Yann LeCun, but a modified version in flattened csv format provided by pjreddie. In this csv format, each row is an "image", each column is a "pixel", and there is an additional column holding the actual label of the image.

Over the last few decades, data has become the new "gold", as some analysts put it. More and more companies' business models revolve around processing and acting on massive amounts of data. This rapid development of the industry and its technologies raises a different question – what language should be used to implement the infrastructure for it? You need something that performs fast, allows for easy extendability, is expressive (allows implementation of complex operations in small amounts of code), has good library support, and is easy for a human operator to read and comprehend.

We also want to avoid unnecessary implementation overhead. Thus our options diminish – languages like Java and C++ might be fast and robust, but they require a lot of extra code to be written, which can significantly hamper the productivity of a data scientist. That boilerplate is exactly what makes Python a natural fit for the job. And since Java would be a good contender if it didn't require you to write so much extra code, enter Java's younger cousin, specifically designed to be more expressive and require less implementation overhead – Scala.

For the Python side of things we will be using a fairly standard stack of packages when it comes to ML and especially NNs: 

  • Tensorflow 
  • Numpy 
  • Pandas 
  • Scikit-Learn 

As for the Scala side of things, we will be using: 

  • Intel’s BigDL 
  • Apache Spark, which, depending on your choice of implementation, can be more of a dependency of BigDL than a tool you use directly – more on that later. 

BigDL and Apache Spark have full support in Python, so anything you do in Scala using those libraries is one-to-one replicable in Python. The underlying implementation is the same; you are just provided with wrappers in a different language, so performance should be nearly, if not exactly, the same.

We will not be discussing GPU training and speeds, as the Scala/Spark support there is not great; instead we will focus on CPU-based training and inference, with emphasis on parallelization and how easy it is to process and transform data. The reason is that a typical Machine Learning Engineer's or Data Scientist's job in most companies involves loading some data, cleaning it (removing unwanted outliers, applying normalization or other augmentations, etc.), applying some additional transformations, and then feeding it into an ML algorithm – sometimes a Neural Network, sometimes a simpler algorithm. This type of streamlined work is called an ETL (Extract, Transform, Load) pipeline, and for those pipelines extendability, performance, and ease of use are the most important factors, all of which are largely determined by the language and its ecosystem.

Code 

Project structure and environment setup 

For the purposes of this article, we will be using the pip package manager. Pip and Python have one general flaw – packages are installed globally, which can clutter, or even break, your system's Python installation. Thankfully there is a simple solution – virtual environments. In the old days you would reach for virtualenv, but more recent tools build on this idea and make it even easier to set up a project environment. My personal recommendation and favorite is poetry. The guide on their website is easy to follow and quite exhaustive, so we will not be going over the setup in this article.

Scala might be considered a bit more organized in this regard. Here, package management tools are bundled into so-called build tools. Since Scala is a compiled language, these tools take care of packages outside the scope of your project and then build your project into bytecode, which is later executed on the Java Virtual Machine. We will be using sbt, but many alternatives exist – link at the end of the article. Since Scala runs on the JVM, it has a hidden ace up its sleeve – it can use Java packages with native performance and no extra prerequisites. In fact, the two languages are so closely related that, for the most part, they share the same build tools.

Objectives 

Essentially we need to implement the following steps: 

  • Load the data. 
  • Arrange it into tensors. 
  • Normalize the data. 
  • Train a NN model in parallel. 
  • Evaluate the model. 

Of course, if you want to be fancy, you can then arrange the results of the model evaluation into graphs and so on, but we will not be covering that here, as there are way too many tutorials online already. The Neural Network we will be implementing is LeNet-5, again by Yann LeCun. It is one of the oldest and most popular examples of a Convolutional Neural Network, a lot of people are familiar with it, and many Neural Network packages even include it by default.

When it comes to the Python packages, you need Python 3 and a TensorFlow version >= 2.0 (the newest versions of both are recommended). For Scala, things are a bit more complicated. Scala is similar to Python in that it has two main stable versions, Scala 2 and Scala 3. The former has been the standard for quite a long time; over the last few years Scala 3 has been taking over, but the transition is still not finished, so many libraries and tools do not yet offer complete support for Scala 3. Therefore, the recommended versions for the Scala packages are as follows:

  • Scala 2.12 
  • Spark 3.1.2 
  • BigDL 0.13.0 

Make sure you check out the build.sbt file in the Scala code. 
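
For reference, a minimal build.sbt for this stack could look roughly like the sketch below. The Spark coordinates are standard; the BigDL artifact name, however, is an assumption from the 0.13 release line, so double-check it against the official BigDL documentation for your exact Spark build.

    // build.sbt – a minimal sketch; the BigDL artifact id is an assumption,
    // verify it against the BigDL documentation before relying on it.
    ThisBuild / scalaVersion := "2.12.15"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % "3.1.2",
      "org.apache.spark" %% "spark-sql"   % "3.1.2",
      "org.apache.spark" %% "spark-mllib" % "3.1.2",
      "com.intel.analytics.bigdl" % "bigdl-SPARK_3.0" % "0.13.0" // assumed artifact id
    )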

Next, we need to download our data. It is as easy as following the link, downloading the train and test sets, and unpacking them into a folder – your csv files are then ready to be used by the code!

Loading and processing the data 

On the Python side, things are relatively straightforward – we use the Pandas library to read the csv data, take the underlying NumPy array from it, and then use NumPy to apply a function over all the rows that reshapes each one from a flat vector into a 28×28 image. This is a fast operation, as the reshaping itself is vectorized (i.e. it uses specific machine code which executes the same operation on all data points simultaneously), and voila! We have our train and test data arranged into tensors with labels. All of this is done in the load_transform_encode_dataset() function.

On the Scala side, the actual file reading is a bit fancier – you can choose between reading the csv into a Spark DataFrame (not to be confused with the Pandas DataFrame, even though the latter is modeled after the former), a Spark RDD, or a plain Scala array. This is done to showcase a couple of things: what polymorphic functions look like in Scala, and the relative difference in performance between the approaches. The Spark approaches are robust and much simpler to use, but they come at a small computational cost – creating these objects is expensive. The processing of the data is then similar to the Python version – we map a higher-order function over the rows of the data structure, reshaping the pixel columns into a 2D tensor.
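
To give a feel for the DataFrame route, here is a minimal sketch of what reading and reshaping could look like. The file path, the absence of a header row, and the division by 255 for normalization are assumptions based on the csv layout described above, not the exact code from the repository.

    import org.apache.spark.sql.SparkSession

    object LoadMnist {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("mnist-etl").master("local[*]").getOrCreate()

        // Each row of the pjreddie csv is assumed to be: label, pixel0, ..., pixel783 (no header).
        val raw = spark.read
          .option("header", "false")
          .option("inferSchema", "true")
          .csv("data/mnist_train.csv")

        // Map a function over every row: pull out the label, scale the pixels to [0, 1],
        // and reshape the flat 784-element vector into a 28x28 array.
        val images = raw.rdd.map { row =>
          val label  = row.getInt(0)
          val pixels = (1 until row.length).map(i => row.getInt(i).toFloat / 255f)
          (label, pixels.grouped(28).map(_.toArray).toArray)
        }

        println(images.count()) // force evaluation so we can see the pipeline run
        spark.stop()
      }
    }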

It is worth noting that the reshape function in both the Python and Scala versions is robust as long as we keep the column ordering the same. An alternative would be to go over each pixel one by one and put it in the cell corresponding to its column index (e.g. pixel 23×22 would go to row 23, column 22 of the image), but that operation is sequential and would take far longer to complete – on the order of hundreds of times more, as there are 784 pixels in a 28×28 image and you would be visiting them one by one.

Model Definition and Training 

The model definition in both implementations is almost completely identical, with only slight differences. This is mainly because the BigDL library took inspiration from popular libraries such as TensorFlow and PyTorch, which are mostly used from Python. Defining a sequential model is as easy as initializing the empty object and adding the layers one by one, as visible in the makeLeNet5() function in both implementations. The same goes for configuring an optimizer, loss, evaluation metrics, and train and validation data, again with some slight differences.
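
The repository's makeLeNet5() is not reproduced here, but a sketch of the Scala/BigDL version, modeled on BigDL's own LeNet-5 example, looks roughly like this – treat the layer sizes as illustrative rather than as the exact code:

    import com.intel.analytics.bigdl.nn._
    import com.intel.analytics.bigdl.numeric.NumericFloat

    // A LeNet-5-style sequential model for 28x28 single-channel images.
    def makeLeNet5(classNum: Int): Sequential[Float] = {
      val model = Sequential[Float]()
      model
        .add(Reshape(Array(1, 28, 28)))      // flat 784 vector -> 1x28x28 tensor
        .add(SpatialConvolution(1, 6, 5, 5)) // 6 feature maps, 5x5 kernels
        .add(Tanh())
        .add(SpatialMaxPooling(2, 2, 2, 2))
        .add(SpatialConvolution(6, 12, 5, 5))
        .add(Tanh())
        .add(SpatialMaxPooling(2, 2, 2, 2))
        .add(Reshape(Array(12 * 4 * 4)))     // flatten before the dense layers
        .add(Linear(12 * 4 * 4, 100))
        .add(Tanh())
        .add(Linear(100, classNum))
        .add(LogSoftMax())
      model
    }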

Model Evaluation 

Evaluation and results are also almost identical, with no considerable differences. One thing worth noting is that in Python's TensorFlow you define the evaluation metrics once, during optimizer setup, and they are used both for fitting and for evaluation, whereas in Scala's BigDL you can pass separate validation methods when calling the evaluation itself. The returned objects from both can then be used to plot and analyze the model's performance.
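
On the BigDL side, a rough fragment of what that evaluation call can look like is shown below; model and valSamples (an RDD of Sample[Float]) are assumed to have been produced by the earlier loading and training steps, and the batch size is arbitrary.

    import com.intel.analytics.bigdl.optim.Top1Accuracy

    // model and valSamples are assumed from earlier steps; Top1Accuracy is one of
    // BigDL's built-in validation methods.
    val results = model.evaluate(valSamples, Array(new Top1Accuracy[Float]()), Some(128))
    results.foreach { case (result, method) => println(s"$method: $result") }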


Discussion 

Looking at both implementations, code-wise they are surprisingly similar in structure, syntax, and implementation complexity. One thing that stands out is the implementation of polymorphic higher-order functions, which you can see in the Scala implementation at the start of runTestHarness(), in the private function _readFile – its implementation requires a trait with case class wrappers used as the return type, and then a match-case statement with a wrapper for each type. This approach has some major benefits and drawbacks; even though they stem from facts about the structure of the language, they are also subject to the personal preference of the language's user. Both Scala and Python are strongly typed languages – their variables always have a type – it's just that Python's types are resolved dynamically at runtime, while Scala's must be declared and known at compile time.
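
A stripped-down sketch of that pattern, with hypothetical names (the repository's trait and wrappers may differ), could look like this:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Hypothetical wrapper hierarchy: one sealed trait as the common return type,
    // with a case class per concrete container the reader can produce.
    sealed trait RawData
    case class AsDataFrame(df: DataFrame)   extends RawData
    case class AsRdd(rdd: RDD[String])      extends RawData
    case class AsArray(rows: Array[String]) extends RawData

    sealed trait ReadMode
    case object DataFrameMode extends ReadMode
    case object RddMode       extends ReadMode
    case object ArrayMode     extends ReadMode

    // The match-case picks the concrete reader and wraps its result in the trait.
    def readFile(path: String, mode: ReadMode, spark: SparkSession): RawData = mode match {
      case DataFrameMode => AsDataFrame(spark.read.option("header", "false").csv(path))
      case RddMode       => AsRdd(spark.sparkContext.textFile(path))
      case ArrayMode     => AsArray(scala.io.Source.fromFile(path).getLines().toArray)
    }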

Scala’s approach of requiring such an explicit declaration of types: 

  • Allows you to catch errors and type mismatches during compilation, and with modern IDEs even as you write the code. Admittedly you can do this in Python as well using type hints and static checkers like mypy, but it is still not as robust.
  • Can slow down the prototyping stages of development, which can be detrimental to Data Scientists – their work usually involves using such pipelines, in the form of structured libraries, to run everyday experiments, so every extra line of code they have to write is time not spent analyzing results.
  • Can actually improve performance considerably, as modern compilers use the type signatures of functions and methods to emit more efficient instructions.

Python’s approach of not requiring such an explicit declaration of types: 

  • Allows for flexible and dynamic prototyping and development.
  • Can allow for a highly polymorphic design – i.e. with the proper abstractions you can technically run the same code on float, double, int, and even strings or other literals (doing the same in a language like Scala can be a nightmare).
  • Can result in some really easy-to-miss, hard-to-diagnose bugs – you can accidentally cast a float to an integer, which can silently invalidate your whole dataset.
  • Can make code really challenging to test and validate – for the reason above; take a pipeline with 20–30 parameters as an example, and the code complexity can suddenly skyrocket.
  • Can actually slow down performance significantly – extra work has to be done to resolve types at runtime, although this is significantly mitigated by modern interpreters and just-in-time compilers.

When it comes to parallel code execution, Scala one-ups Python considerably. The Scala code is designed to be run in parallel: the functional paradigm of immutable values and copy-on-modification provides considerable protection against race conditions right from the start, and operations executed on Spark DataFrames/RDDs actually run in parallel without you having to specify anything, whereas if you try to do the same in Pandas you quickly run into considerable problems. Python has another inherent flaw in its parallelization approach – the Global Interpreter Lock. In simplest terms, only one thread at a time can hold control of the Python interpreter, so even multi-threaded code is effectively sequential under the hood. The multiprocessing package mostly mitigates this by spawning multiple interpreter processes at the OS level, but it still has its flaws, and parallelizing some operations remains very expensive.
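
To make the contrast concrete, here is a small self-contained sketch of an operation that Spark distributes across all available cores with no explicit thread handling on our part (the numbers are arbitrary):

    import org.apache.spark.sql.SparkSession

    object ParallelDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("parallel-demo").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        // The map runs as independent tasks, one per partition, spread over the local cores;
        // no locks or threads are managed by hand, and the data itself is never mutated.
        val total = sc.parallelize(1 to 1000000, numSlices = 8)
          .map(x => math.sqrt(x.toDouble))
          .sum()

        println(f"sum of square roots: $total%.2f")
        spark.stop()
      }
    }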

Even though the Spark framework is implemented in Scala and is undoubtedly faster there, its other most popular API is PySpark. Essentially you get almost the same syntax and the same performance benefits (parallelization and so on), but within the Python ecosystem. So that is yet another alternative.

Expression evaluation strategies – in general, there are two main branches of evaluation strategies: strict and non-strict. Strict evaluation evaluates an expression immediately, while non-strict evaluation postpones it until a later stage (many variations exist). Both Python and Scala use strict evaluation by default, but both offer opt-in laziness – Python through generators and iterators, and Scala through the lazy keyword, which defers evaluation of a value until it is first accessed. Being able to choose between the two strategies can be very beneficial in ETL pipelines, as you can avoid, or delay, computation until a certain stage. The cost you pay is higher memory requirements, as your variables now hold unevaluated expressions rather than plain values.
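
A tiny, self-contained example of the difference – the println calls are only there to show when evaluation actually happens:

    object LazyDemo extends App {
      def expensiveTransform(xs: Seq[Int]): Seq[Int] = {
        println("transform runs") // side effect that reveals the moment of evaluation
        xs.map(_ * 2)
      }

      val data = Seq(1, 2, 3)

      val eager = expensiveTransform(data)          // strict: "transform runs" prints immediately
      lazy val deferred = expensiveTransform(data)  // lazy: nothing happens yet

      println("before first use of deferred")
      println(deferred.sum) // first access: "transform runs" prints now
      println(deferred.sum) // cached: the transform does not run again
    }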

Final Thoughts 

Looking at these implementations in a vacuum, we can see there is no clear "winner"; neither language is absolutely better or worse. Each does one set of things better at the cost of complicating others, and vice versa. Therefore it seems that the right approach, when building a big production environment, is to leverage each language's strengths where needed and use them interchangeably – similar to the way TensorFlow does it: the library is implemented almost entirely in C++ and CUDA, yet it provides API wrappers for Python, and that is the most used version of the library; in fact, it is the main distribution.

So unless you are willing to use two or more languages in your project, leveraging each one's strengths where needed, you will have to choose one and stick to it. Therefore you should carefully analyze your task and what it entails, then go with the language you believe is best for the job, or easiest to maintain, or simply the one your team is most comfortable with. Because remember – you want to build something that lasts, not re-implement it from scratch or slap extra code next to the existing code every few months whenever you want to add a feature.

Dictionary and additional reading 

  • MNIST of Handwritten Digits – A popular dataset (or database, as some refer to it) hosted and maintained by Yann LeCun. MNIST stands for Modified National Institute of Standards and Technology. For simplicity, we will be referring to the dataset as simply MNIST for most of this article. 
  • NN – Neural Network. The subject is way too big to cover in this small section; if you are unfamiliar with it, a good semi-technical description is provided by IBM here.
  • ML – Machine Learning 
  • ETL Pipeline – Extract Transform Load Pipeline 
  • Interpreted language – a language whose code is executed from a human-readable format, or at least from something that is not machine code (the term can be flexible). A quick read on the subject 
  • Just-in-Time Compilation 
  • Tensor – an algebraic n-dimensional object which describes multi-linear relationships. In less technical terms, a generalization of the two-dimensional matrix. 
  • Normalization – putting a set of data onto a common scale, in essence into a range between 0 and 1 (the term can have a very broad meaning). 
  • Sequential Model – a neural network model where all the layers and data flow are in sequence, one after the other. 

Github and resource links 

By George Krastev | Motion Software

George Krastev is a Data Scientist and Machine Learning Engineer with 3 years of experience in Python, ETL Pipelines, and production Machine Learning environments. An avid tech geek, passionate gamer, and anime otaku with hopes to witness the AI singularity.
