Try Apache Beam
You can try an Apache Beam pipeline using our interactive notebooks.
- Java SDK
- Python SDK
- Go SDK
Interactive WordCount in Colab
This interactive notebook shows you what a simple, minimal version of WordCount looks like.
package samples.quickstart;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

import java.util.Arrays;

public class WordCount {
  public static void main(String[] args) {
    // File pattern to read and filename prefix for the output shards.
    String inputsDir = "data/*";
    String outputsPrefix = "outputs/part";

    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);
    pipeline
        .apply("Read lines", TextIO.read().from(inputsDir))
        // Split each line on runs of non-letter characters.
        .apply("Find words", FlatMapElements.into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
        // split() can produce empty strings, so drop them before counting.
        .apply("Filter empty words", Filter.by((String word) -> !word.isEmpty()))
        .apply("Count words", Count.perElement())
        // Format each (word, count) pair as "word: count".
        .apply("Write results", MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Long> wordCount) ->
                wordCount.getKey() + ": " + wordCount.getValue()))
        .apply(TextIO.write().to(outputsPrefix));

    // Block until the pipeline finishes so the program does not exit early.
    pipeline.run().waitUntilFinish();
  }
}
To learn how to install and run the Apache Beam Java SDK on your own computer, follow the instructions in the Java Quickstart.
import apache_beam as beam
import re

inputs_pattern = 'data/*'
outputs_prefix = 'outputs/part'

# The pipeline runs when the `with` block exits.
with beam.Pipeline() as pipeline:
  (
      pipeline
      | 'Read lines' >> beam.io.ReadFromText(inputs_pattern)
      | 'Find words' >> beam.FlatMap(lambda line: re.findall(r"[a-zA-Z']+", line))
      | 'Pair words with 1' >> beam.Map(lambda word: (word, 1))
      | 'Group and sum' >> beam.CombinePerKey(sum)
      | 'Format results' >> beam.Map(lambda word_count: str(word_count))
      | 'Write results' >> beam.io.WriteToText(outputs_prefix)
  )
To learn how to install and run the Apache Beam Python SDK on your own computer, follow the instructions in the Python Quickstart.
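To get a feel for what each named step in the Python pipeline above computes, here is a minimal sketch in plain Python, with no Beam dependency. It is not how Beam executes the pipeline (Beam distributes the work and does not guarantee output order); the `word_count` function and the sample `lines` are hypothetical names introduced only for illustration.

```python
import re
from collections import defaultdict


def word_count(lines):
    """Plain-Python imitation of the WordCount steps (illustration only)."""
    # 'Find words': like FlatMap, emit zero or more words per input line.
    words = [word for line in lines for word in re.findall(r"[a-zA-Z']+", line)]

    # 'Pair words with 1': like Map, turn each word into a (key, value) pair.
    pairs = [(word, 1) for word in words]

    # 'Group and sum': like CombinePerKey(sum), group values by key and sum them.
    grouped = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)
    counts = {word: sum(ones) for word, ones in grouped.items()}

    # 'Format results': str() applied to each (word, count) tuple.
    return [str(item) for item in counts.items()]


# Hypothetical in-memory input standing in for the files matched by 'data/*'.
lines = ["to be or not to be", "that is the question"]
for row in word_count(lines):
    print(row)
```

Because the formatting step calls `str()` on a tuple, each row looks like `('to', 2)`, whereas the Java and Go samples format rows as `word: count`.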
package main

import (
	"context"
	"flag"
	"fmt"
	"log"
	"regexp"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/io/textio"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/runners/direct"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/transforms/stats"

	// Registers the local filesystem so textio can read and write local paths.
	_ "github.com/apache/beam/sdks/v2/go/pkg/beam/io/filesystem/local"
)

var (
	input  = flag.String("input", "data/*", "File(s) to read.")
	output = flag.String("output", "outputs/wordcounts.txt", "Output filename.")
)

// wordRE matches ASCII letters, optionally followed by an apostrophe and one lowercase letter.
var wordRE = regexp.MustCompile(`[a-zA-Z]+('[a-z])?`)

func main() {
	flag.Parse()

	beam.Init()

	pipeline := beam.NewPipeline()
	root := pipeline.Root()

	lines := textio.Read(root, *input)
	// Emit every word found in each line.
	words := beam.ParDo(root, func(line string, emit func(string)) {
		for _, word := range wordRE.FindAllString(line, -1) {
			emit(word)
		}
	}, lines)
	counted := stats.Count(root, words)
	// Format each (word, count) pair as "word: count".
	formatted := beam.ParDo(root, func(word string, count int) string {
		return fmt.Sprintf("%s: %v", word, count)
	}, counted)
	textio.Write(root, *output, formatted)

	// Run on the direct runner and report any execution error.
	if _, err := direct.Execute(context.Background(), pipeline); err != nil {
		log.Fatalf("Failed to execute pipeline: %v", err)
	}
}
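To learn how to install and run the Apache Beam Go SDK on your own computer, follow the instructions in the Go Quickstart.
For a more detailed explanation of how WordCount works, see the WordCount Example Walkthrough.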
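One detail worth noticing when comparing the three samples on this page is that each one tokenizes lines with a different regular expression, so their word counts can differ on the same input. The sketch below reproduces the three patterns in plain Python to make the difference visible. Two assumptions apply: Python's `re` module has no `\p{L}` class, so `[\W\d_]+` stands in as an approximation of the Java sample's `[^\p{L}]+`, and the Go pattern's group is written as non-capturing so that `findall` returns whole matches. The function names and the sample line are hypothetical.

```python
import re

# Python sample: runs of ASCII letters and apostrophes.
PYTHON_WORD = re.compile(r"[a-zA-Z']+")

# Go sample: ASCII letters, optionally followed by an apostrophe and one
# lowercase letter. (?:...) is used so findall returns the whole match.
GO_WORD = re.compile(r"[a-zA-Z]+(?:'[a-z])?")

# Java sample splits on runs of non-letters ([^\p{L}]+). Python's re has no
# \p{L}, so this pattern is only an approximation (an assumption of this sketch).
JAVA_SPLIT_APPROX = re.compile(r"[\W\d_]+")


def tokenize_python(line):
    return PYTHON_WORD.findall(line)


def tokenize_go(line):
    return GO_WORD.findall(line)


def tokenize_java_approx(line):
    # Splitting can yield empty strings, which the Java sample removes
    # in its "Filter empty words" step.
    return [word for word in JAVA_SPLIT_APPROX.split(line) if word]


line = "Don't stop: 'tis café time"
print(tokenize_python(line))
print(tokenize_go(line))
print(tokenize_java_approx(line))
```

On this input, the Python and Go patterns keep `Don't` as one word, while the split-based approach yields `Don` and `t`. Only the split-based approach keeps `café` whole, because the other two patterns match ASCII letters only.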
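Next Steps
- Walk through additional WordCount examples in the WordCount Example Walkthrough.
- Take a self-paced tour through our Learning Resources.
- Dive in to some of our favorite Videos and Podcasts.
- Join the Beam users@ mailing list.
- If you're interested in contributing to the Apache Beam codebase, see the Contribution Guide.
Please don't hesitate to reach out if you encounter any issues!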
Last updated on 2024/10/31