
Image processing language Halide

1. Image processing language Halide

Dmitry Kurtaev, IOTG Computer Vision

2. Agenda

Halide’s basics
Samples: RGB → Gray, Box filter (blurring), Histogram computation, K-means
Scheduling
Gamma correction on GPU in Halide
Ahead-of-time compilation
Halide in OpenCV
Internet of Things Group

3. Halide

Programming language embedded in C++ (C++11 or higher)
Designed for image processing tasks
LLVM compiler backend
Generates C code, OpenCL/CUDA kernels, OpenGL shaders
Cross-platform compilation with AVX/SSE/FMA/F16 features
High-level scheduling
It isn’t Turing complete

4. Halide

[Diagram: Algorithm, Scheduling → Halide → Intermediate representation (IR) → LLVM → Machine code; OpenCL compiler; CUDA]

5. Halide’s basic entities

Variables
Halide::Var x("x"), xo("xo"), xi("xi");
Reduction domains
Halide::RDom r(-1, 3); // r in [-1, 1]
Expressions
Halide::Expr e = sin((x + r) * 0.1f);
Functions
Halide::Func f("f");
f(x) = sum(e);
Scheduling directives
f.split(x, xo, xi, 16)
 .parallel(xo)
 .vectorize(xi);
f.compile_jit(Halide::get_host_target());
Buffers
Halide::Buffer<float> output(1000);
f.realize(output);

6. Halide pipeline

Write algorithm
Halide::Var x("x"), xo("xo"), xi("xi");
Halide::RDom r(-1, 3); // r in [-1, 1]
Halide::Expr e = sin((x + r) * 0.1f);
Halide::Func f("f");
f(x) = sum(e);
Schedule
f.split(x, xo, xi, 16)
 .parallel(xo)
 .vectorize(xi);
Choose OS, architecture, features; Compile
f.compile_jit(Halide::get_host_target());
Realize
Halide::Buffer<float> output(1000);
f.realize(output);

7. Image representations

[Figure: an image shown in four representations: Binary; Grayscale; RGB, interleaved; RGB, planar]

8. RGB2Gray

void rgb2gray(const uint8_t* src, uint8_t* dst, int height, int width) {
    for (int i = 0; i < width * height; ++i) {
        dst[i] = (uint8_t)(0.299f * src[i * 3 + 0] +  // red
                           0.587f * src[i * 3 + 1] +  // green
                           0.114f * src[i * 3 + 2]);  // blue
    }
}

9. RGB2Gray using TBB

tbb::parallel_for(
    tbb::blocked_range<int>(0, height * width),
    [&](tbb::blocked_range<int> r) {
        uint16_t red, green, blue, R2GRAY = 77, G2GRAY = 150, B2GRAY = 29;
        int begin = r.begin();
        int end = r.end();
        const uint8_t* __restrict__ srcData = src;
        uint8_t* __restrict__ dstData = dst;
        for (int i = begin; i < end; ++i) {
            red = srcData[i * 3];
            green = srcData[i * 3 + 1];
            blue = srcData[i * 3 + 2];
            dstData[i] = (R2GRAY * red + G2GRAY * green + B2GRAY * blue) >> 8;
        }
    });
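The integer constants 77, 150 and 29 in the TBB version appear to be the floating-point weights 0.299, 0.587 and 0.114 scaled by 256 and rounded, which is why the result is shifted right by 8. The sketch below checks that arithmetic in plain C++; the helper names to_fixed8, gray_fixed and gray_float are hypothetical and not part of the slides.

```cpp
#include <cmath>
#include <cstdint>

// Convert a floating-point weight to fixed point with 8 fractional bits.
inline uint16_t to_fixed8(float w) {
    return static_cast<uint16_t>(std::lround(w * 256.0f));
}

// Grayscale value computed with the integer weights used on the slides.
inline uint8_t gray_fixed(uint8_t r, uint8_t g, uint8_t b) {
    const uint16_t R2GRAY = 77, G2GRAY = 150, B2GRAY = 29;
    // The operands are promoted to int, so the sum cannot overflow.
    return static_cast<uint8_t>((R2GRAY * r + G2GRAY * g + B2GRAY * b) >> 8);
}

// Grayscale value computed with the floating-point weights.
inline uint8_t gray_float(uint8_t r, uint8_t g, uint8_t b) {
    return static_cast<uint8_t>(0.299f * r + 0.587f * g + 0.114f * b);
}
```

Because the three integer weights add up to exactly 256, a pure white pixel maps to 255 with no overflow and no loss.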

10. RGB2Gray in Halide

uint16_t R2GRAY = 77, G2GRAY = 150, B2GRAY = 29;
Func f("rgb2gray");
auto input = Buffer<uint8_t>::make_interleaved(src, width, height, 3);
Var x("x"), y("y");
Expr r = cast<uint16_t>(input(x, y, 0));
Expr g = cast<uint16_t>(input(x, y, 1));
Expr b = cast<uint16_t>(input(x, y, 2));
f(x, y) = cast<uint8_t>((R2GRAY * r + G2GRAY * g + B2GRAY * b) >> 8);
Buffer<uint8_t> output(dst, {width, height});
f.realize(output);

11. RGB2Gray efficiency comparison (1920x1280)

Intel® Core™ i5-4460 CPU @ 3.20GHz x 4
OpenCV (GNU 5.4.0): 0.76ms
TBB (Intel® C++ Compiler): 0.674ms
Halide (LLVM 5.0.1): 1.315ms

12. Scheduling

Var x("x"), y("y");
Func f("f");
f(x, y) = 0;
f.print_loop_nest(); // print the loop nest for debugging
f.trace_stores();    // log every store to f during realize

f.realize(10, 10);
Printed loop nest:
produce f:
  for y:
    for x:
      f(...) = ...

13. Scheduling

Var x("x"), y("y"), yo("yo"), yi("yi");
Func f("f");
f(x, y) = 0;
f.bound(x, 0, 10).bound(y, 0, 10)
.split(y, yo, yi, 5)
.parallel(yo);
f.print_loop_nest();
f.trace_stores();
f.realize(10, 10);
Printed loop nest:
produce f:
  parallel y.yo:
    for y.yi in [0, 4]:
      for x:
        f(...) = ...
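The loop nest produced by split(y, yo, yi, 5) can be written out by hand. The following is a minimal plain C++ sketch, not Halide output; the function name split_y_order is hypothetical, the outer loop that Halide runs in parallel is kept serial here, and the height is assumed to be divisible by the split factor.

```cpp
#include <utility>
#include <vector>

// Returns the (x, y) visit order of a width x height domain after
// splitting y into an outer loop yo and an inner loop yi of size `factor`.
std::vector<std::pair<int, int>> split_y_order(int width, int height, int factor) {
    std::vector<std::pair<int, int>> order;
    for (int yo = 0; yo < height / factor; ++yo) {  // "parallel y.yo" in Halide
        for (int yi = 0; yi < factor; ++yi) {       // "for y.yi in [0, factor - 1]"
            const int y = yo * factor + yi;         // original y is rebuilt from yo, yi
            for (int x = 0; x < width; ++x) {
                order.emplace_back(x, y);
            }
        }
    }
    return order;
}
```

For the 10x10 example with factor 5, each of the two yo iterations covers five full rows, so each worker thread would get one contiguous half of the image.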

14. Scheduling

Var x("x"), y("y"), yo("yo"), yi("yi"),
xo("xo"), xi("xi"), tile("tile");
Func f("f");
f(x, y) = 0;
f.bound(x, 0, 10).bound(y, 0, 10)
.split(y, yo, yi, 5)
.split(x, xo, xi, 5)
.reorder(xi, yi, xo, yo)
.fuse(xo, yo, tile)
.parallel(tile);
f.print_loop_nest();
f.trace_stores();
f.realize(10, 10);
Printed loop nest:
produce f:
  parallel x.xo.tile:
    for y.yi in [0, 4]:
      for x.xi in [0, 4]:
        f(...) = ...
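The tiled schedule (two splits, a reorder and a fuse) corresponds to one loop over tiles with two small loops inside. Below is a hedged plain C++ sketch of that traversal; tiled_order is a hypothetical name, the tile loop is serial here instead of parallel, and both image dimensions are assumed to be multiples of the tile size.

```cpp
#include <utility>
#include <vector>

// Returns the (x, y) visit order for square tiles, with the two outer
// loops xo and yo fused into a single `tile` index (xo is the inner one).
std::vector<std::pair<int, int>> tiled_order(int width, int height, int tile_size) {
    const int tiles_x = width / tile_size;
    const int tiles_y = height / tile_size;
    std::vector<std::pair<int, int>> order;
    for (int tile = 0; tile < tiles_x * tiles_y; ++tile) {  // "parallel x.xo.tile"
        const int xo = tile % tiles_x;  // recover the fused indices
        const int yo = tile / tiles_x;
        for (int yi = 0; yi < tile_size; ++yi) {
            for (int xi = 0; xi < tile_size; ++xi) {
                order.emplace_back(xo * tile_size + xi, yo * tile_size + yi);
            }
        }
    }
    return order;
}
```

With a 10x10 domain and 5x5 tiles there are four tiles, so the fused loop gives four independent units of work.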

15. Scheduling

Var x("x"), y("y"), yo("yo"), yi("yi");
Func f("f");
f(x, y) = 0;
f.bound(x, 0, 10).bound(y, 0, 10)
.split(y, yo, yi, 5)
.parallel(yo)
.vectorize(x, 4);
f.print_loop_nest();
f.trace_stores();
f.realize(10, 10);
Printed loop nest:
produce f:
  parallel y.yo:
    for y.yi in [0, 4]:
      for x.x:
        vectorized x.v8 in [0, 3]:
          f(...) = ...
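In this example the row width (10) is not a multiple of the vector width (4). The sketch below shows one way such a tail can be handled, assuming the last vector is shifted inwards so that it stays inside the row and recomputes a few points; this is an assumption about the tail handling, and vector_starts is a hypothetical helper, not a Halide function.

```cpp
#include <algorithm>
#include <vector>

// Returns the starting x of every vector of `lanes` elements needed to
// cover [0, extent). Requires extent >= lanes. The last vector is shifted
// inwards instead of running past the end of the row.
std::vector<int> vector_starts(int extent, int lanes) {
    std::vector<int> starts;
    const int count = (extent + lanes - 1) / lanes;  // round up
    for (int xo = 0; xo < count; ++xo) {
        starts.push_back(std::min(xo * lanes, extent - lanes));
    }
    return starts;
}
```

For extent 10 and 4 lanes this gives vectors starting at 0, 4 and 6, so the points 6 and 7 are computed twice.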

16. Scheduling

Var x("x"), y("y"), yo("yo"), yi("yi");
Func f("f");
f(x, y) = 0;
f.estimate(x, 0, 10).estimate(y, 0, 10);
Pipeline(f).auto_schedule(get_host_target());
f.print_loop_nest();
f.trace_stores();
f.realize(10, 10);
Printed loop nest:
produce f:
  parallel y:
    parallel x.x_vo:
      vectorized x.x_vi in [0, 7]:
        f(...) = ...

17. RGB2Gray efficiency comparison (1920x1280)

Intel® Core™ i5-4460 CPU @ 3.20GHz x 4
OpenCV (GNU 5.4.0): 0.76ms
TBB (Intel® C++ Compiler): 0.674ms
Halide (LLVM 5.0.1), by schedule:
1.315ms (serial)
0.796ms with
f.split(y, yo, yi, 64)
 .parallel(yo)
 .vectorize(x, 8);
1.221ms with
f.split(y, yo, yi, 64)
 .split(x, xo, xi, 64)
 .reorder(xi, yi, xo, yo)
 .fuse(xo, yo, tile)
 .parallel(tile)
 .vectorize(xi, 8); // x no longer exists after the split, so vectorize xi
0.869ms with (auto scheduling)
f.parallel(y)
 .vectorize(x, 32);

18. Box filter in Halide

Func f("box_filter");
auto input = Buffer<uint8_t>::make_interleaved(src, width, height, 3);
Func padded = BoundaryConditions::constant_exterior(input, 0);
Var x("x"), y("y"), c("c");
Func input_uint16("input_uint16");
input_uint16(x, y, c) = cast<uint16_t>(padded(x, y, c));
RDom r(-1, 3, -1, 3);
Expr s = sum(input_uint16(x + r.x, y + r.y, c));
float ratio = 1.0f / 9;
f(x, y, c) = cast<uint8_t>(s * ratio);
f.output_buffer().dim(0).set_stride(3).set_bounds(0, width);
f.output_buffer().dim(1).set_stride(3 * width).set_bounds(0, height);
f.output_buffer().dim(2).set_stride(1).set_bounds(0, 3);
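A plain C++ reference is useful for checking the Halide box filter. The following is a minimal sketch under the same assumptions as the slide: interleaved 3-channel data, a 3x3 window, and zeros outside the image (matching constant_exterior(input, 0)). The function name box_filter_3x3 is hypothetical.

```cpp
#include <cstdint>
#include <vector>

// 3x3 box filter over an interleaved 3-channel image.
// Pixels outside the image contribute zero to the sum.
std::vector<uint8_t> box_filter_3x3(const std::vector<uint8_t>& src,
                                    int width, int height) {
    const int channels = 3;
    const float ratio = 1.0f / 9;
    std::vector<uint8_t> dst(src.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            for (int c = 0; c < channels; ++c) {
                uint16_t sum = 0;  // 9 * 255 = 2295 fits in 16 bits
                for (int dy = -1; dy <= 1; ++dy) {
                    for (int dx = -1; dx <= 1; ++dx) {
                        const int yy = y + dy, xx = x + dx;
                        if (yy >= 0 && yy < height && xx >= 0 && xx < width) {
                            sum = static_cast<uint16_t>(
                                sum + src[(yy * width + xx) * channels + c]);
                        }
                    }
                }
                dst[(y * width + x) * channels + c] =
                    static_cast<uint8_t>(sum * ratio);
            }
        }
    }
    return dst;
}
```

Because of the zero padding, border pixels come out darker than interior ones: on a constant image a corner sees only four non-zero samples out of nine.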

19. Box filter efficiency comparison (1920x1280)

Intel® Core™ i5-4460 CPU @ 3.20GHz x 4
OpenCV (GNU 5.4.0): 3.603ms
TBB (Intel® C++ Compiler): 4.779ms (is not well auto-vectorized)
Halide (LLVM 5.0.1): 3.784ms

20. Scheduling: producer-consumer

81.5ms @ 1920x1280
Var x("x"), y("y");
Func producer("producer"), consumer("consumer");
producer(x, y) = sin(x + y);
consumer(x, y) =
producer(x, y - 1) +
producer(x - 1, y) + producer(x, y) + producer(x + 1, y) +
producer(x, y + 1);
consumer.realize(5, 5);
producer is inlined into consumer ⇒ 125 calls to sin (25 outputs x 5 calls each)!
Var x("x"), y("y");
Func consumer("consumer");
consumer(x, y) =
sin(x + y - 1) +
sin(x - 1 + y) + sin(x + y) + sin(x + 1 + y) +
sin(x + y + 1);
consumer.realize(5, 5);
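The count of 125 can be reproduced without Halide. The sketch below is a plain C++ stand-in for the inlined pipeline that counts every call to sin; InlineResult and consumer_inlined are hypothetical names used only for illustration.

```cpp
#include <cmath>
#include <vector>

struct InlineResult {
    std::vector<float> values;  // consumer output, row-major
    int sin_calls;              // how many times sin was evaluated
};

// Evaluates the 5-point stencil with the producer recomputed at every use,
// which is what inlining the producer into the consumer amounts to.
InlineResult consumer_inlined(int width, int height) {
    InlineResult out{std::vector<float>(width * height), 0};
    auto producer = [&](int x, int y) {
        ++out.sin_calls;
        return std::sin(static_cast<float>(x + y));
    };
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            out.values[y * width + x] =
                producer(x, y - 1) +
                producer(x - 1, y) + producer(x, y) + producer(x + 1, y) +
                producer(x, y + 1);
        }
    }
    return out;
}
```

For a 5x5 output, consumer_inlined(5, 5).sin_calls is 125, matching the figure on the slide.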

21. Scheduling: producer-consumer

33.4ms @ 1920x1280
Var x("x"), y("y");
Func producer("producer"), consumer("consumer");
producer(x, y) = sin(x + y);
consumer(x, y) =
producer(x, y - 1) +
producer(x - 1, y) + producer(x, y) + producer(x + 1, y) +
producer(x, y + 1);
producer.compute_root();
producer.trace_loads();
producer.trace_stores();
consumer.trace_stores();
consumer.realize(5, 5);
[Figure: producer and consumer panels]
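With compute_root the producer is evaluated once over the whole region the consumer needs, then read back. The plain C++ sketch below imitates that for the 5-point stencil, assuming the required region is the bounding box of all accesses (one extra row and column on each side); RootResult and consumer_compute_root are hypothetical names.

```cpp
#include <cmath>
#include <vector>

struct RootResult {
    std::vector<float> values;  // consumer output, row-major
    int sin_calls;              // how many times sin was evaluated
};

// Computes the producer once over x in [-1, width], y in [-1, height],
// stores it, and then evaluates the consumer from the stored values.
RootResult consumer_compute_root(int width, int height) {
    const int pw = width + 2, ph = height + 2;
    std::vector<float> producer(pw * ph);
    RootResult out{std::vector<float>(width * height), 0};
    for (int py = 0; py < ph; ++py) {
        for (int px = 0; px < pw; ++px) {
            // Buffer index (px, py) holds producer(px - 1, py - 1).
            producer[py * pw + px] =
                std::sin(static_cast<float>((px - 1) + (py - 1)));
            ++out.sin_calls;
        }
    }
    auto at = [&](int x, int y) { return producer[(y + 1) * pw + (x + 1)]; };
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            out.values[y * width + x] =
                at(x, y - 1) + at(x - 1, y) + at(x, y) + at(x + 1, y) + at(x, y + 1);
        }
    }
    return out;
}
```

For a 5x5 output this evaluates sin 49 times (a 7x7 region) instead of 125, at the cost of storing the whole intermediate image.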

22. Scheduling: producer-consumer

82.5ms @ 1920x1280
Var x("x"), y("y");
Func producer("producer"), consumer("consumer");
producer(x, y) = sin(x + y);
consumer(x, y) =
producer(x, y - 1) +
producer(x - 1, y) + producer(x, y) + producer(x + 1, y) +
producer(x, y + 1);
producer.compute_at(consumer, y);
producer.trace_loads();
producer.trace_stores();
consumer.trace_stores();
consumer.realize(5, 5);
[Figure: producer and consumer panels]
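With compute_at(consumer, y) the producer is recomputed for every output row, which may explain why the timing is close to the inlined version. The plain C++ sketch below imitates that schedule, assuming each row needs the bounding box of its accesses (three rows, one extra column on each side); RowwiseResult and consumer_compute_at_y are hypothetical names.

```cpp
#include <cmath>
#include <vector>

struct RowwiseResult {
    std::vector<float> values;  // consumer output, row-major
    int sin_calls;              // how many times sin was evaluated
};

// For every output row y, recomputes producer rows y-1..y+1 into a small
// scratch buffer and then evaluates the consumer row from that buffer.
RowwiseResult consumer_compute_at_y(int width, int height) {
    const int pw = width + 2;
    RowwiseResult out{std::vector<float>(width * height), 0};
    std::vector<float> scratch(3 * pw);
    for (int y = 0; y < height; ++y) {
        for (int r = 0; r < 3; ++r) {
            for (int px = 0; px < pw; ++px) {
                // scratch (px, r) holds producer(px - 1, y - 1 + r).
                scratch[r * pw + px] =
                    std::sin(static_cast<float>((px - 1) + (y - 1 + r)));
                ++out.sin_calls;
            }
        }
        auto at = [&](int x, int dy) { return scratch[(dy + 1) * pw + (x + 1)]; };
        for (int x = 0; x < width; ++x) {
            out.values[y * width + x] =
                at(x, -1) + at(x - 1, 0) + at(x, 0) + at(x + 1, 0) + at(x, 1);
        }
    }
    return out;
}
```

For a 5x5 output this performs 105 sin calls (21 per row), because neighbouring rows recompute the same producer values; only a small scratch buffer is needed in exchange.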

23. Scheduling: producer-consumer

28.1ms @ 1920x1280
Var x("x"), y("y");
Func producer("producer"), consumer("consumer");
producer(x, y) = sin(x + y);
consumer(x, y) =
producer(x, y - 1) +
producer(x - 1, y) + producer(x, y) + producer(x + 1, y) +
producer(x, y + 1);
producer.store_root();
producer.compute_at(consumer, y);
producer.trace_loads();
producer.trace_stores();
consumer.trace_stores();
consumer.realize(5, 5);
[Figure: producer and consumer panels]

24. Scheduling: producer-consumer

50.5ms @ 1920x1280
(needs to be parallelized)
Var x("x"), y("y");
Func producer("producer"), consumer("consumer");
producer(x, y) = sin(x + y);
consumer(x, y) =
producer(x, y - 1) +
producer(x - 1, y) + producer(x, y) + producer(x + 1, y) +
producer(x, y + 1);
producer.store_root();
producer.compute_at(consumer, x);
producer.trace_loads();
producer.trace_stores();
consumer.trace_stores();
consumer.realize(5, 5);
[Figure: producer and consumer panels]

25. Histogram computation in Halide

void histogram(uint8_t* src, int* dst, int height, int width) {
    static Func f("hist");
    static Buffer<int> output(dst, {256, 3});
    if (!f.defined()) {
        auto input = Buffer<uint8_t>::make_interleaved(src, width, height, 3);
        Var c("c"), i("i");
        RDom r(0, width, 0, height);
        f(i, c) = 0;
        Expr lum = clamp(input(r.x, r.y, c), 0, 255);
        f(lum, c) += 1;
        f.estimate(i, 0, 256).estimate(c, 0, 3);
        Pipeline(f).auto_schedule(get_host_target());
    }
    f.realize(output);
}
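The reduction above scatters into f(lum, c), one counter per intensity and channel. A plain C++ reference for the same computation might look like the sketch below; histogram_reference is a hypothetical name, and the layout hist[c * 256 + value] is an assumption chosen to match a dense 256 x 3 buffer with the intensity as the innermost dimension.

```cpp
#include <cstdint>
#include <vector>

// Per-channel histogram of an interleaved 3-channel 8-bit image.
// Result layout: hist[c * 256 + value].
std::vector<int> histogram_reference(const uint8_t* src, int height, int width) {
    const int channels = 3;
    std::vector<int> hist(256 * channels, 0);
    for (int i = 0; i < width * height; ++i) {
        for (int c = 0; c < channels; ++c) {
            ++hist[c * 256 + src[i * channels + c]];
        }
    }
    return hist;
}
```

Comparing this output against dst after the Halide call is a simple way to validate a schedule.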

26. K-means in Halide
