Apache Beam Study Notes: Several Common Processing Transforms

Chiclaim, published 2019-08-16 10:47

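Summary: The most commonly seen function in Apache Beam is surely apply(). PTransform is an abstract class that implements the Serializable interface; its expand method does the data processing, and subclasses are forced to implement it. What SimpleFunction's apply and DoFn's processElement have in common is that each handles a single element of a PCollection. These are notes from one day of study; corrections are welcome.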

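Before reading this post, you may want to look through the official WordCount code to get a rough idea of Apache Beam.

The most commonly seen function in Apache Beam is surely apply(). It is usually written like this: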

[Final Output PCollection] = [Initial Input PCollection].apply([First Transform])
.apply([Second Transform])
.apply([Third Transform])

Even the simplest WordCount code passes several different kinds of argument to apply(). Apart from the input and output steps, they include:
1) Using ParDo.of():

.apply("ExtractWords-joe",
        ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processElement(ProcessContext context) {
                System.out.println(context.element() + "~");
                for (String word : context.element().split(" ")) {
                    if (!word.isEmpty()) {
                        // Emit the word to the output PCollection
                        context.output(word);
                    }
                }
            }
        })
)

2) Using MapElements.via():

.apply("FormatResults",
        MapElements.via(new SimpleFunction<KV<String, Long>, String>() {
            @Override
            public String apply(KV<String, Long> input) {
                return input.getKey() + ":" + input.getValue();
            }
        }))

3) And using a PTransform subclass:

.apply(new CountWords())

  public static class CountWords extends PTransform<PCollection<String>,
      PCollection<KV<String, Long>>> {
    @Override
    public PCollection<KV<String, Long>> expand(PCollection<String> lines) {

      // Convert lines of text into individual words.
      PCollection<String> words = lines.apply(
          ParDo.of(new ExtractWordsFn()));

      // Count the number of times each word occurs.
      PCollection<KV<String, Long>> wordCounts =
          words.apply(Count.perElement());

      return wordCounts;
    }
  }

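How are all these ways of passing arguments related? Looking at the source, the apply function (this overload is the one on Pipeline, since it starts from begin()) is defined as follows:

  public <OutputT extends POutput> OutputT apply(
      String name, PTransform<? super PBegin, OutputT> root) {
    return begin().apply(name, root);
  }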

The parameter is a PTransform object, so all of these argument styles are really variations on PTransform.
PTransform is an abstract class that implements the Serializable interface. Its method public abstract OutputT expand(InputT input); is where the data processing happens, and subclasses are forced to implement it.
That makes approach (3) easy to understand: CountWords is defined by extending PTransform and implementing expand, and a CountWords object is passed to apply.
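To make the relationship between apply and expand concrete, here is a minimal plain-Java sketch. It does not use Beam at all: ToyTransform, ToyCollection, ExtractWords, and CountPerElement are hypothetical stand-ins I made up, a "collection" is just an in-memory List, and the toy apply does nothing except call expand (the real apply also records the transform in the pipeline graph, among other things). It only illustrates how a composite transform like CountWords can chain other transforms inside its own expand.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy stand-ins, NOT Beam classes: a "collection" is just an in-memory List.
public class ApplySketch {

    /** Plays the role of PTransform: subclasses must implement expand. */
    abstract static class ToyTransform<InT, OutT> {
        abstract OutT expand(InT input);
    }

    /** Plays the role of PCollection: apply hands this collection to the transform. */
    static final class ToyCollection<T> {
        final List<T> elements;

        ToyCollection(List<T> elements) {
            this.elements = elements;
        }

        <OutT> OutT apply(ToyTransform<? super ToyCollection<T>, OutT> transform) {
            return transform.expand(this);
        }
    }

    /** Splits each line into non-empty words. */
    static final class ExtractWords
            extends ToyTransform<ToyCollection<String>, ToyCollection<String>> {
        @Override
        ToyCollection<String> expand(ToyCollection<String> lines) {
            List<String> words = new ArrayList<>();
            for (String line : lines.elements) {
                for (String word : line.split(" ")) {
                    if (!word.isEmpty()) {
                        words.add(word);
                    }
                }
            }
            return new ToyCollection<>(words);
        }
    }

    /** Counts how often each word occurs, in first-seen order. */
    static final class CountPerElement
            extends ToyTransform<ToyCollection<String>, Map<String, Long>> {
        @Override
        Map<String, Long> expand(ToyCollection<String> words) {
            Map<String, Long> counts = new LinkedHashMap<>();
            for (String word : words.elements) {
                counts.merge(word, 1L, Long::sum);
            }
            return counts;
        }
    }

    /** Composite transform, analogous to CountWords: expand chains two applies. */
    static final class CountWords
            extends ToyTransform<ToyCollection<String>, Map<String, Long>> {
        @Override
        Map<String, Long> expand(ToyCollection<String> lines) {
            return lines.apply(new ExtractWords()).apply(new CountPerElement());
        }
    }

    static Map<String, Long> countWords(List<String> lines) {
        return new ToyCollection<>(lines).apply(new CountWords());
    }

    public static void main(String[] args) {
        // prints {to=2, be=2, or=1, not=1}
        System.out.println(countWords(List.of("to be or", "not to be")));
    }
}
```

The caller only ever sees apply(new CountWords()); which steps run inside is entirely up to the subclass's expand.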

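In approach (2), MapElements is a subclass of PTransform and implements expand. Its implementation calls the data-processing method defined on its member @Nullable private final SimpleFunction<InputT, OutputT> fn;, and MapElements.via() is a static method that initializes fn. It is defined as follows:

  public static <InputT, OutputT> MapElements<InputT, OutputT> via(
      final SimpleFunction<InputT, OutputT> fn) {
    return new MapElements<>(fn, null, fn.getClass());
  }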

It takes a SimpleFunction object. SimpleFunction is an abstract class whose subclasses must implement public OutputT apply(InputT input), and the user does the data processing inside that apply method.
So this approach works as follows:
define a subclass of SimpleFunction, implement its apply method, and pass an object of that subclass to MapElements.via().

In approach (1), ParDo.of() takes a DoFn object and returns a SingleOutput object:

  public static <InputT, OutputT> SingleOutput<InputT, OutputT> of(DoFn<InputT, OutputT> fn) {
    validate(fn);
    return new SingleOutput<InputT, OutputT>(
        fn, Collections.<PCollectionView<?>>emptyList(), displayDataForFn(fn));
  }

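Like MapElements, SingleOutput is a subclass of PTransform that implements expand, and it uses the method on its member private final DoFn<InputT, OutputT> fn; to process the data.
DoFn is an abstract class, but it does not declare an abstract method to override. Instead, the user supplies a method marked with the @ProcessElement annotation, conventionally public void processElement(ProcessContext c).
So this approach works as follows:
define a subclass of DoFn, write its @ProcessElement method, and pass an object of that subclass to ParDo.of().
Note that processElement differs from the previous two approaches: both the input and the output travel through the ProcessContext c parameter, rather than the output being passed back with return.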
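The difference between the two calling conventions can be sketched in plain Java. Again, nothing here is Beam: ToyContext, ToyDoFn, parDo, and mapElements are hypothetical names, and the collections are ordinary Lists. What the sketch tries to show is one practical consequence of routing output through the context: a context-style function can emit zero, one, or many outputs per input element, while a return-style function yields exactly one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy stand-ins, NOT Beam classes.
public class ContextVsReturnSketch {

    /** Plays the role of ProcessContext: holds the input and collects outputs. */
    static final class ToyContext<InT, OutT> {
        private final InT element;
        private final List<OutT> sink;

        ToyContext(InT element, List<OutT> sink) {
            this.element = element;
            this.sink = sink;
        }

        InT element() {
            return element;
        }

        void output(OutT value) {
            sink.add(value);
        }
    }

    /** Plays the role of DoFn: input and output both go through the context. */
    interface ToyDoFn<InT, OutT> {
        void processElement(ToyContext<InT, OutT> c);
    }

    /** Stand-in for ParDo: runs a context-style function over every element. */
    static <InT, OutT> List<OutT> parDo(List<InT> input, ToyDoFn<InT, OutT> fn) {
        List<OutT> out = new ArrayList<>();
        for (InT element : input) {
            fn.processElement(new ToyContext<>(element, out));
        }
        return out;
    }

    /** Stand-in for MapElements: runs a return-style function over every element. */
    static <InT, OutT> List<OutT> mapElements(List<InT> input, Function<InT, OutT> fn) {
        List<OutT> out = new ArrayList<>();
        for (InT element : input) {
            out.add(fn.apply(element));
        }
        return out;
    }

    /** Context style: one line may produce several words, or none at all. */
    static List<String> extractWords(List<String> lines) {
        return parDo(lines, c -> {
            for (String word : c.element().split(" ")) {
                if (!word.isEmpty()) {
                    c.output(word);
                }
            }
        });
    }

    /** Return style: exactly one output per input. */
    static List<Integer> lengths(List<String> words) {
        return mapElements(words, String::length);
    }

    public static void main(String[] args) {
        List<String> words = extractWords(List.of("hello world", "", "beam"));
        System.out.println(words);          // prints [hello, world, beam]
        System.out.println(lengths(words)); // prints [5, 5, 4]
    }
}
```

Three input lines go in and three words come out only by coincidence: the first line contributes two words and the empty line contributes none.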

That is my summary after one day of studying Apache Beam; corrections are welcome.

Day 2 addendum: differences and connections among the three approaches:

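1) MapElements.via(SimpleFunction) and PTransform
MapElements is a subclass of PTransform:
public class MapElements<InputT, OutputT>
extends PTransform<PCollection<? extends InputT>, PCollection<OutputT>>
Judging by the generic parameters, a PTransform processes a whole PCollection, while the function handed to MapElements processes a single element of that PCollection. Comparing the implementation of SimpleFunction's apply method with that of PTransform's expand method confirms this.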

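2) MapElements.via(SimpleFunction) and ParDo.of(DoFn)
The difference was covered earlier: DoFn's processElement method receives both its input and its output channel through its parameter, whereas SimpleFunction's apply method takes the input as a parameter and hands the output back with return.
What they share is that both methods process a single element of a PCollection.
Here is the source of MapElements' expand method: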

@Override
public PCollection<OutputT> expand(PCollection<? extends InputT> input) {
  checkNotNull(fn, "Must specify a function on MapElements using .via()");
  return input.apply(
      "Map",
      ParDo.of(
          new DoFn<InputT, OutputT>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
              c.output(fn.apply(c.element()));
            }
            // some code omitted
          }));
}

As you can see, MapElements is itself implemented with a DoFn subclass: inside the DoFn's processElement method it calls the SimpleFunction object's apply method to do the processing.
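The same delegation can be reproduced in a few lines of plain Java. This is a sketch under loud assumptions, not Beam's implementation: ToyContext, ToyDoFn, parDo, and ToyMapElements are hypothetical stand-ins, a java.util.function.Function takes the place of SimpleFunction, and Lists take the place of PCollections. The point is only the adapter shape: via stores the return-style function, and expand wraps it in a context-style function before handing it to the per-element runner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy stand-ins, NOT Beam classes.
public class MapViaDoFnSketch {

    /** Plays the role of ProcessContext. */
    static final class ToyContext<InT, OutT> {
        private final InT element;
        private final List<OutT> sink;

        ToyContext(InT element, List<OutT> sink) {
            this.element = element;
            this.sink = sink;
        }

        InT element() {
            return element;
        }

        void output(OutT value) {
            sink.add(value);
        }
    }

    /** Plays the role of DoFn. */
    abstract static class ToyDoFn<InT, OutT> {
        abstract void processElement(ToyContext<InT, OutT> c);
    }

    /** Stand-in for ParDo: calls processElement once per input element. */
    static <InT, OutT> List<OutT> parDo(List<InT> input, ToyDoFn<InT, OutT> fn) {
        List<OutT> out = new ArrayList<>();
        for (InT element : input) {
            fn.processElement(new ToyContext<>(element, out));
        }
        return out;
    }

    /** Plays the role of MapElements: keeps fn and wraps it in a DoFn in expand. */
    static final class ToyMapElements<InT, OutT> {
        private final Function<InT, OutT> fn;

        private ToyMapElements(Function<InT, OutT> fn) {
            this.fn = fn;
        }

        static <InT, OutT> ToyMapElements<InT, OutT> via(Function<InT, OutT> fn) {
            return new ToyMapElements<>(fn);
        }

        List<OutT> expand(List<InT> input) {
            return parDo(input, new ToyDoFn<InT, OutT>() {
                @Override
                void processElement(ToyContext<InT, OutT> c) {
                    // The return value of fn becomes the single output.
                    c.output(fn.apply(c.element()));
                }
            });
        }
    }

    /** Formats (word, count) pairs the way the FormatResults step does. */
    static List<String> formatCounts(List<Map.Entry<String, Long>> counts) {
        return ToyMapElements.<Map.Entry<String, Long>, String>via(
                kv -> kv.getKey() + ":" + kv.getValue()).expand(counts);
    }

    public static void main(String[] args) {
        // prints [beam:2, java:1]
        System.out.println(formatCounts(List.of(Map.entry("beam", 2L), Map.entry("java", 1L))));
    }
}
```

Seen this way, the return-style approach is a convenience layer over the context-style one for the common one-in, one-out case.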