CppCon 2017: Carl Cook "When a Microsecond Is an Eternity: High Performance Trading Systems in C++"

关于Carl Cook在CppCon 2017上对低延迟代码介绍的一些笔记

Slowpath removal

Bad:

if (checkForErrorA())
    handleErrorA();
else if (checkForErrorB())
    handleErrorB();
else if (checkForErrorC())
    handleErrorC();
else
    sendOrderToExchange();

Good:

int64_t errorFlags;
    ...
if (!errorFlags)
    sendOrderToExchange();
else
    handleError(errorFlags);

原因分析：

第一种写法分支较多，影响硬件的分支预测
第一种写法指令较多，影响硬件的指令缓存
第一种写法中的check函数也是有可能影响数据缓存的

Template-based configuration

使用配置文件来控制代码走向通常是很方便的，比起使用虚函数，模板能更高效的完成这一工作，因为没有了运行时的开销，同时减少了分支以及不必要的代码

代码实例：

struct OrderSenderA {
    void SendOrder() {
        ...
    }
};

struct OrderSenderB {
    void SendOrder() {
        ...
    }
};

template<typename T>
struct OrderManager : public IOrderManager {
    void MainLoop() final {
        // ... and at some stage in the future...
        mOrderSender.SendOrder();
    }
    T mOrderSender;
};

std::unique_ptr<IOrderManager> Factory(const Config& config) {
    if (config.UseOrderSenderA())
        return std::make_unique<OrderManager<OrderSenderA>>();
    else
        return std::make_unique<OrderManager<OrderSenderB>>();
};

int main(int argc, char* argv[]) {
    auto manager = Factory(config);
    manager->MainLoop();
}

总之，如果在编译时就能确定的事情，就不需要等到运行时再去做，提高代码运行速度

Lambda functions are fast and convenient

作者表示使用Lambda不会更快，但是展示了C++的高效和强大。如果在编译时能确定将被执行的函数，那么倾向于使用Lambda

代码实例：

// Definition
template <typename T>
void SendMessage(T&& lambda) {
    auto msg = PrepareMessage();
    lambda(msg);
    send(msg);
}

// Usage
SendMessage([&](auto& message) {
    message.instrument = x;
    message.price = z;
    ...
});

上面的函数有可能被编译器直接内联成为非常底层的操作（例如，两个mov指令），但却是用高级语言写出来的，由此可见为什么C++成了低延迟系统的选择

Memory allocation

内存的分配非常慢，因此尽量避免内存分配。可能的优化措施包括：

使用池化技术，预分配好一些对象
避免释放对象，尽量复用对象
- delete不会引起系统调用，内存也没有被还给操作系统。但是glibc的free函数有大概400行的记录代码
- 重复使用对象可以帮助减少内存碎片
如果一定要销毁大的对象，尽量在其他线程做

Exceptions

不要害怕使用异常（如果使用gcc, clang或者msvc）。只要异常没有被抛出就没有开销
不要在控制流中使用异常，这会导致至少1.5us的开销，而且代码会很难看

Prefer templates to branches

跟前面的有点类似，也是分支消除的技术。如果已经预先知道所有选择以及对应如何处理，可以使用模板来去除分支

代码实例：

// Branching approach
enum class Side { Buy, Sell };

void RunStrategy(Side side) {
    const float orderPrice = CalcPrice(side, fairValue, credit);
    CheckRiskLimits(side, orderPrice);
    SendOrder(side, orderPrice);
}

float CalcPrice(Side side, float value, float credit) {
    return side == Side::Buy ? value - credit : value + credit;
}

// Templated approach
template<Side T>
void Strategy<T>::RunStrategy() {
    const float = CalcPrice(fairValue, credit);
    CheckRiskLimits(orderPrice);
    SendOrder(orderPrice);
}

template<>
float Strategy<Side::Buy>::CalcPrice(float value, float credit) {
    return value - credit;
}

template<>
float Strategy<Side::Sell>::CalcPrice(float value, float credit) {
    return value + credit;
}

并不是所有的分支语句都需要这样处理，只需要保证hotpath能最快运行即可

Multi-threading

最好避免使用多线程

数据同步开销很大
无锁编程可能仍然需要硬件锁
多线程代码难读懂
生产者容易使消费者饱和如果一定要使用：
尽可能共享少的数据。多线程写同一个缓存时容易增加开销
考虑复制数据而非共享数据。例如：single writer，single reader lock free queue
如果一定要共享数据，考虑不使用同步

Data lookups

假设从市场中找到某支股票的价格，且它们使用marketId进行关联，常规做法

struct Market {
    int32_t id;
    char shortName[4];
    int16_t quantityMultiplier;
        ...
};

struct Instrument {
    float price;
        ...
    int32_t marketId;
};

Message orderMessage;
orderMessage.price = instrument.price;
Market& market = Markets.FindMarket(instrument.marketId);
orderMessage.qty = market.quantityMultiplier * qty;
...

更好的做法

struct Market {
    int32_t id;
    char shortName[4];
    int16_t quantityMultiplier;
        ...
};

struct Instrument {
    float price;
        ...
    int32_t marketId;
    int16_t quantityMultiplier;
};

优化的原理在于，从缓存中一次性取出所有需要的数据。第一个做法先读instrument.price，当我们需要读取这4个字节时，我们其实会读取64个字节到缓存行中（硬件结构如此）这意味着整个Instrument结构体都将被加载到缓存中，然后又访问了market.quantityMultiplier，需要将整个Market结构体加载到缓存中，这就有可能发生缓存未命中从而影响速度。第二个做法加载了Instrument结构体后，因为缓存的存在，我们无需其他开销就能读到第二个值。比起节省的内存来说，还是提升的速度更为重要

Fast associative containers(std::unordered_map)

标准库的哈希表在发生冲突时，使用链表连接冲突的哈希值，这将会比顺序访问内存慢得多另一种解决冲突的方法是开放寻址法，即冲突之后将数据放在下一个位置 Optiver的做法是：两者结合哈希表存放的是哈希值和指针的pair，因此每个pair是16字节，所以每个缓存行会读到4个pair（回想一下，缓存行是64字节）根据哈希值得到的索引去找，如果哈希值不匹配就直接去下一个索引，对缓存很友好

((always inline)) and ((noinline))

这两个选项对编译器来说影响很大，最好经过测试再决定是否要内联

代码实例

CheckMarket();
if (notGoingToSendAnOrder)
    ComplexLoggingFunction();
else
    SendOrder();

__attribute__((noline))
void ComplexLoggingFunction() {
    ...
}

为什么这里一定不要内联？因为发送订单才是hotpath，如果进行了内联，那么hotpath中就会充斥毫无必要的打印日志的指令，从而影响了指令缓存的性能。如果不内联，那么它只是一个call调用，而且更有利于进行分支预测

Keeping the cache hot

要记住，hotpath是很少被完全执行到的，中间可能因为各种因素而停止发送订单，比如风控、持仓限制，等等，所以我们的缓存中可能充斥着非hotpath的数据和指令简单的解决方案：频繁的假装发送订单，一直保持缓存是准备就绪的状态。有的低延迟网卡（Mellanox, Solar Flare）甚至支持将数据写到卡里但不发送，就是为了这一目的

上篇THU OS Lab1

下篇Understanding mutex