Your Guide to DL with MLSQL Stack (3)

This is the third article in the Your Guide to DL with MLSQL Stack series. We hope this series shows you how the MLSQL stack helps people do AI jobs.

As we have seen in the previous posts, the MLSQL stack gives you the power to use both the built-in algorithms and Python ML frameworks. The ability to use Python ML frameworks means you are totally free to use deep learning tools like PyTorch and TensorFlow. This time, however, we will show you how to use the built-in DL framework called BigDL to accomplish an image classification task.

Requirements

This guide requires MLSQL Stack 1.3.0-SNAPSHOT. You can set up the MLSQL stack with the following links. We recommend you deploy the MLSQL stack locally.

  1. Docker
  2. Manually Compile
  3. Prebuilt Distribution

If you meet any problems when deploying, please let me know; feel free to file an issue at this link.

Project Structure

I have created a project named store1, and there is a directory called image_classify which contains all the MLSQL scripts we talk about today. It looks like this:

[Image: the store1/alg/image_classify project structure in the MLSQL Console]
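Since the screenshot does not carry over here, the layout is roughly as follows (only env.mlsql and classify_train.mlsql are named in this article; the other file name is illustrative):

store1
└── alg
    └── image_classify
        ├── env.mlsql             -- shared path variables
        ├── preprocess.mlsql      -- illustrative name for the preprocessing steps below
        └── classify_train.mlsql  -- training and prediction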

We will teach you how to build the project step by step.

Upload Image

First, download the cifar10 raw images from this url: https://github.com/allwefantasy/spark-deep-learning-toy/releases/download/v0.01/cifar.tgz, then gunzip it so that you get a tar file.

Though the MLSQL Console supports directory uploading, the huge number of files in the directory will crash the uploading component in the web page (we hope to fix this issue in the future). For now, the way to work around this crash is to package the directory as a tar file.

[Image: uploading cifar.tar in the MLSQL Console]

Then save the uploaded tar file to the target directory:

-- download cifar data from https://github.com/allwefantasy/spark-deep-learning-toy/releases/download/v0.01/cifar.tgz
!fs -mkdir -p /tmp/cifar;
!saveUploadFileToHome /cifar.tar /tmp/cifar;

The console will show a real-time log indicating that the system is extracting the images.

[Image: real-time extraction log in the console]

This may take a while because there are almost 60,000 pictures.
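Once the extraction finishes, you can verify that the files landed where you expect. This is just a sanity check; it assumes the !fs command accepts the usual Hadoop-style -ls flag, just as it accepts -mkdir above:

-- list the extracted files (sanity check)
!fs -ls /tmp/cifar;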

Set up some paths

We create an env.mlsql script which contains the path-related variables:

set basePath="/tmp/cifar"; 
set labelMappingPath = "${basePath}/si";
set trainDataPath = "${basePath}/cifar_train_data";
set testDataPath = "${basePath}/cifar_test_data";
set modelPath = "${basePath}/bigdl";

The other scripts will include this script to get all these paths.
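For example, any script that needs these paths starts with an include statement and can then reference the variables. This is a minimal sketch of the pattern the scripts below use (the load only works once the data has been saved in the later steps):

include store1.`alg.image_classify.env.mlsql`;

-- ${trainDataPath} now resolves to /tmp/cifar/cifar_train_data
load parquet.`${trainDataPath}` as trainData;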

Resize the pictures

We want to resize the images to 28x28; you can achieve this with the ET ImageLoaderExt. Here is how we use it:

include store1.`alg.image_classify.env.mlsql`;

-- {} or {number} is used as parameter holder.
set imageResize='''
run command as ImageLoaderExt.`/tmp/cifar/cifar/{}` where
code="
    def apply(params:Map[String,String]) = {
         Resize(28, 28) ->
          MatToTensor() -> ImageFrameToSample()
      }
"
as {}
''';

-- train should be quoted because it's a keyword.
!imageResize "train" data;
!imageResize test testData;

In the above code, because we need to resize both the train and test datasets, we wrap the resize code as a command to avoid duplication, then use this command to process the train and test datasets separately.
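To make the parameter substitution concrete, here is roughly what !imageResize "train" data; expands to: the first {} becomes train and the second becomes data:

run command as ImageLoaderExt.`/tmp/cifar/cifar/train` where
code="
    def apply(params:Map[String,String]) = {
         Resize(28, 28) ->
          MatToTensor() -> ImageFrameToSample()
      }
"
as data;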

Extract label

For example, when we see the following path, we know that this picture contains a frog, so we should extract frog from the path.

/tmp/cifar/cifar/train/38189_frog.png
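To see how nested split calls will pull frog out of such a file name, here is a minimal standalone query you can try (a quick sanity check, not part of the project scripts):

-- "38189_frog.png" -> ["38189","frog.png"] -> ["frog","png"] -> "frog"
select split(split("38189_frog.png","_")[1],"\.")[0] as labelStr
as output;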

Again, we wrap the SQL as a command and process the train and test data separately.

set extractLabel='''
-- extract the string label from the image name
select split(split(imageName,"_")[1],"\.")[0] as labelStr,features from {} as {}
''';

!extractLabel data newdata;
!extractLabel testData newTestData;

We then convert the label to a number and add 1 (because BigDL requires labels to start from 1 instead of 0).

set numericLabel='''
train {0} as StringIndex.`/tmp/cifar/si` where inputCol="labelStr" and outputCol="labelIndex" as newdata1;
predict {0} as StringIndex.`/tmp/cifar/si` as newdata2;
select (cast(labelIndex as int) + 1) as label,features from newdata2 as {1}
''';

!numericLabel newdata trainData;
!numericLabel newTestData testData;

Save what we have so far

We will save all this data so we can reuse the processed data in the future without executing the preprocessing repeatedly:

save overwrite trainData as parquet.`${trainDataPath}`;
save overwrite testData as parquet.`${testDataPath}`;
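To make sure the save worked, you can load the data back and count the rows (a quick check that reuses the same load syntax the next script uses):

load parquet.`${trainDataPath}` as checkData;
select count(*) as total from checkData as output;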

Train on the images with DL

We create a new script file named classify_train.mlsql. We should load the data first and convert the label to an array:

include store1.`alg.image_classify.env.mlsql`;

load parquet.`${trainDataPath}` as tmpTrainData;
load parquet.`${testDataPath}` as tmpTestData;

select array(cast(label as float)) as label,features from tmpTrainData as trainData;
select array(cast(label as float)) as label,features from tmpTestData as testData;

Finally, we use our algorithm to train on the data:

train trainData as BigDLClassifyExt.`${modelPath}` where
disableSparkLog = "true"
and fitParam.0.featureSize="[3,28,28]"
and fitParam.0.classNum="10"
and fitParam.0.maxEpoch="300"

-- print evaluate message
and fitParam.0.evaluate.trigger.everyEpoch="true"
and fitParam.0.evaluate.batchSize="1000"
and fitParam.0.evaluate.table="testData"
and fitParam.0.evaluate.methods="Loss,Top1Accuracy"
-- for unbalanced class 
-- and fitParam.0.criterion.classWeight="[......]"
and fitParam.0.code='''
                   def apply(params:Map[String,String])={
                        val model = Sequential()
                        model.add(Reshape(Array(3, 28, 28), inputShape = Shape(28, 28, 3)))
                        model.add(Convolution2D(6, 5, 5, activation = "tanh").setName("conv1_5x5"))
                        model.add(MaxPooling2D())
                        model.add(Convolution2D(12, 5, 5, activation = "tanh").setName("conv2_5x5"))
                        model.add(MaxPooling2D())
                        model.add(Flatten())
                        model.add(Dense(100, activation = "tanh").setName("fc1"))
                        model.add(Dense(params("classNum").toInt, activation = "softmax").setName("fc2"))
                    }
'''
;

In the code block, we use Keras-style code to build our model, and we tell the system some information, e.g. how many classes there are (classNum) and what the feature size is (featureSize).

If this training stage takes too long, you can decrease fitParam.0.maxEpoch to a small value.
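For example, you might change that single line of the train statement above to an illustrative small value:

-- a quick smoke test while experimenting
and fitParam.0.maxEpoch="10"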

The console will print messages while training:

[Image: training log in the console]

and finally the validation result:

[Image: validation results]

Use the !model command to check the model's training history:

!model history /tmp/cifar/bigdl;

Here is the result:

[Image: model training history]

Register the model as a function

Since we have built our model, let us now learn how to predict images. First, we load some data:

include store1.`alg.image_classify.env.mlsql`;

load parquet.`${trainDataPath}` as tmpTrainData;
load parquet.`${testDataPath}` as tmpTestData;

select array(cast(label as float)) as label,features from tmpTrainData as trainData;
select array(cast(label as float)) as label,features from tmpTestData as testData;

Now, we can register the model as a function:

register BigDLClassifyExt.`${modelPath}` as cifarPredict;

Finally, we can use the function to predict new pictures:

select
vec_argmax(cifarPredict(vec_dense(to_array_double(features)))) as predict_label,
label from testData limit 10 
as output;
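If you want a rough accuracy number over the whole test set, a sketch like the following may work. It assumes vec_argmax is zero-based while our labels were shifted to start at 1, and that label is the one-element array we built above; adjust the offset if your output differs:

-- hypothetical accuracy check; note the +1 label offset from earlier
select avg(if(cast(vec_argmax(cifarPredict(vec_dense(to_array_double(features)))) as int) + 1
              = cast(label[0] as int), 1.0, 0.0)) as accuracy
from testData as output;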

Of course, you can also predict a whole table:

predict testData as BigDLClassifyExt.`${modelPath}` as predictdata;

Why BigDL

GPUs are very expensive, while most companies already have lots of CPUs. If we can make full use of these CPUs, we can save a lot of money; BigDL is designed to run deep learning workloads on CPU clusters on top of Spark, which fits this situation exactly.