PipelineWise illustrates the power of Singerx
Stitch is based on Singer, an open source standard for moving data between databases, web APIs, files, queues, and just about anything else. Because it's open source, anyone can use Singer to write data extraction and loading scripts or more comprehensive utilities. TransferWise, the company I work for, used Singer to create a data pipeline framework called PipelineWise that replicates data from multiple sources to multiple destinations.
TransferWise uses more than a hundred microservices, which means we have hundreds of different type of data sources (MySQL, PostgreSQL, Kafka, Zendesk, Jira, etc.). We wanted to create a centralised analytics data store that could hold data from all of our sources, with due attention paid to security and scalability. We wanted to use change data capture (CDC) wherever possible to keep lag low. In addition, our solution had to:
- Apply schema changes automatically
- Avoid vendor lock-in — we wanted access to the source code to develop new features and fix issues quickly
- Keep configuration as code
We looked at traditional ETL tools, commercial replication tools, and Kafka streaming ETL. None of them met all of our needs. (You can read more details in my post on Medium.)
After several months we found the Singer specification and realised that we could get to a solution more quickly by building on this great work.
A data pipeline is born
Our analytics platform team created PipelineWise as an experiment in close cooperation with our data analysts and some of the product teams that use the data. It proved to be successful — PipelineWise now meets all of our initial requirements. We use it to replicate hundreds of gigabytes of data every day from 120 microservices, 1,500+ tables, and a bunch of external tools into our Snowflake data warehouse, with only minutes of lag.
Monitoring with Grafana: Replicating 120 data sources, 1,500+ tables into Snowflake with PipelineWise on three nodes of c5.2xlarge EC2 instances
Like any tool, PipelineWise has limitations:
- Not real-time: The currently supported target connectors are microbatch-oriented. We have to load data from S3 via the COPY command into Snowflake or Amazon Redshift because individual
INSERT
statements are inefficient. Creating these batches adds an extra layer to the process, so replication is not real-time. The replication lag from source to target is between 5 and 30 minutes depending on the data source. - Very active transactional tables: PipelineWise tries to do parallel processing wherever possible. Microbatches are created in parallel as well, one batch for each table, but currently we can’t create one individual batch in parallel. This means that replicating extremely large tables with millions of only
INSERTS
andUPDATES
can be slow when the CDC replication method is enabled. In this case key-based incremental replication is faster and still reliable, as there are no deleted rows in source.
An evolving solution
PipelineWise is likely to evolve for some time to come, but it’s mature enough to release back to the open source community. Our hope is that others might benefit from and contribute toward the project, and possibly open up new and exciting ways of analysing data.
For detailed information on PipelineWise features and architecture, check out the documentation.
原文地址:https://www.cnblogs.com/rongfengliang/p/11531537.html
- Basic Calculator 基本计算器-Leetcode
- python yield函数深入浅出理解
- 十分钟搞定 Tensorflow 服务
- datapump跨平台升级迁移的总结 (r8笔记第77天)
- Java中isAssignableFrom()方法与instanceof()方法用法
- 与Ajax同样重要的jQuery(1)
- Java中Class类详解、用法及泛化
- python 函数编程的位置参数、默认参数、关键字参数以及函数的递归
- Java子类的父类和要实现的接口有相同的方法/函数会冲突吗
- Java关键字 Finally执行与break, continue, return等关键字的关系
- #if 和#ifdef的区别
- python高阶函数:map(f,[list]),reduce(f,[list],可选初始值),
- 深入探讨 Java 类加载器
- 斐波那契查找原理详解与实现
- JavaScript 教程
- JavaScript 编辑工具
- JavaScript 与HTML
- JavaScript 与Java
- JavaScript 数据结构
- JavaScript 基本数据类型
- JavaScript 特殊数据类型
- JavaScript 运算符
- JavaScript typeof 运算符
- JavaScript 表达式
- JavaScript 类型转换
- JavaScript 基本语法
- JavaScript 注释
- Javascript 基本处理流程
- Javascript 选择结构
- Javascript if 语句
- Javascript if 语句的嵌套
- Javascript switch 语句
- Javascript 循环结构
- Javascript 循环结构实例
- Javascript 跳转语句
- Javascript 控制语句总结
- Javascript 函数介绍
- Javascript 函数的定义
- Javascript 函数调用
- Javascript 几种特殊的函数
- JavaScript 内置函数简介
- Javascript eval() 函数
- Javascript isFinite() 函数
- Javascript isNaN() 函数
- parseInt() 与 parseFloat()
- escape() 与 unescape()
- Javascript 字符串介绍
- Javascript length属性
- javascript 字符串函数
- Javascript 日期对象简介
- Javascript 日期对象用途
- Date 对象属性和方法
- Javascript 数组是什么
- Javascript 创建数组
- Javascript 数组赋值与取值
- Javascript 数组属性和方法
- Go语言入门(十一) 接口编程
- 一个价值百万的思路,如何下载去水印视频
- Go语言入门(十) Mysql与Redis操作
- 【Vue进阶】——如何实现组件属性透传?
- Go语言入门(九) 文件操作
- zookeeper完整详细版
- redis学习(十九)
- Android开发6年,互联网寒冬公司倒闭后,耗时3个月北上广求职,终拿到头条Offer!
- 直播软件开发如何使用FFMPEG推流并保存在本地
- react-router学习笔记
- 尤大 3 天前发在 GitHub 上的 vue-lit 是啥?
- BFE.dev前端刷题 23. 实现一个sum()方法
- 彻底深刻理解js原型链之prototype,proto以及constructor(一)
- SAP Spartacus取cart的HTTP请求
- 记一次Netty连接池FixedChannelPool连接未释放问题的排查总结