updates

author Riccardo Casatta <riccardo@casatta.it>

Mon, 18 Jan 2021 17:59:12 +0000 (18:59 +0100)

committer Riccardo Casatta <riccardo@casatta.it>

Mon, 18 Jan 2021 17:59:12 +0000 (18:59 +0100)
diff --git a/content/blog/2021/fee_estimation_for_light_clients.md b/content/blog/2021/fee_estimation_for_light_clients.md

index d0da8225e1b018150ccff450c4092064c2ca122d..c090ba7015ddd7a35e9c2da30fcc67d8ccd3825b 100644 (file)
--- a/content/blog/2021/fee_estimation_for_light_clients.md
+++ b/content/blog/2021/fee_estimation_for_light_clients.md
@@ -10,42 +10,39 @@ draft: false
  
  ## Introduction: what's fee estimation?
  
-Fee estimation is the process of selecting the fee rate for a bitcoin transaction according to two factors:
+Fee estimation is the process of selecting the fee rate [^fee rate] for a bitcoin transaction according to two factors:
  
  * Current traffic of the Bitcoin network
  * The urgency, or lack of urgency, of the sender to see the transaction confirmed in a block.
  
-Selecting a too high fee rate could mean losing money because we may obtain the exact same thing (confirm in the same block) with a lower expense.
+Selecting too high a fee rate means losing money, since the same result could have been achieved at a lower cost.
  
  Selecting too low a fee rate could mean waiting a long time before the transaction confirms or, even worse, never seeing the transaction confirmed.
  
  ## The problem
  
-Bitcoin core node offers fee estimation through the RPC method `estimatesmartfee`, there are also a lot of [estimators] online, so why we need yet another estimator?
+Bitcoin Core offers fee estimation through the RPC method [estimatesmartfee], and there are also a lot of [fee estimators] online, so why do we need yet another estimator?
  
  The Bitcoin Core model is not suitable for light clients such as mobile wallets, even in pruned mode. Online estimators are problematic because:
  
-* Privacy: Asking something to a server could leak the caller ip, if using tor, the timing is an information leak because it could relate to a transaction made soon after.
-* Security: A compromised source of fee rates could provide too high fee rates causing loss of money or too low ones causing tx to never confirm.
+* Privacy: Contacting the server may leak the caller's IP address, and the request timing may be used to relate the request to a transaction made soon after.
+* Security: A compromised source of fee rates could provide fee rates that are too high, causing loss of money, or too low, causing transactions to never confirm.
  
  Replace By Fee (RBF) and Child Pays For Parent (CPFP) are techniques that mitigate the fee estimation problem, because one could simply underestimate the fee rate and raise it if needed, however:
-* RBF and CPFP may leak more information, such as detecting patterns that may leak the wallet used.
-* Requires some level of interaction, requiring the client to be online again to perform the fee bump. Sometimes this could be very costly such in a context with offline signer.
+* RBF and CPFP may leak more information, for instance patterns that reveal the kind of wallet used.
+* They require additional interaction: the client must be online again to perform the fee bump. Sometimes this is very costly, for instance when using an offline signer.
  
-So this work is an effort to build a **good fee estimator for purely peer to peer light clients** such as [neutrino] based ones or determine it is infeasible.
+This work is an effort to build a **good fee estimator for purely peer to peer light clients** such as [neutrino] based ones, or to determine whether it is infeasible.
  
-In the meantime, two other sub-goals are pursued:
+In the meantime, another sub-goal is pursued: attracting the interest of data scientists. Indeed, the initial step of this analysis consists in constructing a data set, which might be the starting point for different kinds of studies.
  
-* I want to share what I learned while tackling this problem
-* I would like to attract data-scientist interest, I hope this is just the beginning as you will see while reading and in the big "Futures development" section.
+## The difficulties and the solution
  
-## The difficulties and the solution?
-
-The difficult part in doing fee estimation on a light client is the lack of information available, for example, bitcoin core `estimatesmartfee` use up to the last 1008 blocks and has full information about the mempool [^1], such as the fee rate of every one of these transactions but a light-client cannot rely on all this information.
+The difficult part in doing fee estimation on a light client is the lack of available information. For example, Bitcoin Core `estimatesmartfee` uses up to the last 1008 blocks and has full information about the mempool [^mempool], such as the fee rate of every one of its transactions, but a light client cannot rely on all this information.
  
  However, other factors are available and may help in fee estimation, such as the day of the week since it's well-known the mempool usually empties during the [weekend]. Or the hour of the day to predict recurring daily events such as [bitmex withdrawals].
  
-The idea is to apply Machine Learning (ML) techniques [^2] to discover patterns over this informations and see if they are enough to achieve good estimations.
+The idea is to apply Machine Learning (ML) techniques [^disclaimer] to discover patterns in this information and see whether they are enough to achieve good estimations.
  
  However, this creates another problem: machine learning needs data, a lot of data, to work well. Is this information available?
  
@@ -53,7 +50,7 @@ However this creates another problem, machine learning needs data, a lot of data
  
  We are going to use a DNN (Deep Neural Network), an ML technique in the supervised learning branch. The ELI5 is: give a lot of example inputs with the desired output to a black box; if there are relations between inputs and outputs, and if there are enough examples, the black box will predict outputs for inputs it has never seen before.
  
-To understand what's our input and outputs, we need the question we want to answer. The question a fee estimator need to answer is:
+To define our inputs and outputs, we need the question we want to answer. The question a fee estimator needs to answer is:
  
  *"Which fee rate should I use if I want this transaction to be confirmed in at most `n` blocks?"*
  
@@ -78,7 +75,7 @@ The [data logger] is MIT licensed open source software written in Rust.
  
  We need to save the time at which transactions enter the node's mempool. To be more efficient and precise we should not only call the RPC endpoints but also listen to [ZMQ] events. Luckily, the just-released Bitcoin Core 0.21.0 added a new [ZMQ] topic `zmqpubsequence` [notifying] mempool events (and block events). The logger also listens to the `zmqpubrawtx` and `zmqpubrawblock` topics, to make fewer RPC calls.
  
-Other than the time, we save other data for several reasons, in the end, we are not interested only in the timestamp of the transaction when enters the mempool, but more importantly, how many blocks will pass until this transaction is confirmed. In the final dataset this field is called `confirms_in`, if `1` it means the transaction is confirmed in the next block after it has been seen. This is a reason to save blocks in the raw logs (to see when a seen tx gets confirmed).
+Other than the time, we save other data for several reasons. In the end, we are interested not only in the timestamp of the transaction when it enters the mempool but, more importantly, in how many blocks will pass until this transaction is confirmed. In the final dataset this field is called `confirms_in`; if it is `1`, the transaction is confirmed in the next block after it has been seen. This is a reason to save blocks in the raw logs (to see when a seen transaction gets confirmed).
  
  Another critical piece of information is the `fee_rate` of the transaction. Since the absolute fee paid by a bitcoin transaction is neither available in nor derivable from the transaction itself, we need the values of the transaction's previous outputs.
  
@@ -96,48 +93,50 @@ I expect and hope raw logs will be useful also for other projects, for example,
  
  The [dataset] is publicly available (~400MB gzip compressed, ~1.6GB as plain CSV).
  
-The output of the model, is clear and it's the fee rate, expressed in `[satoshi/vbytes]`.
+The output of the model is the fee rate, expressed in `[satoshi/vbytes]`.
  
  What about the inputs? In general we want two things:
  
-* Something that it's correlated to the output, even with a non-linear relation.
+* Something that is correlated to the output, even with a non-linear relation.
  * It must be available in a light client; for example, assuming the client has information about the last 1000 blocks is considered too much.
  
  We want to compare model results with another available estimation, thus we have also data to compute bitcoin core `estimatesmartfee` errors, but we are not going to use this data for the model.
  
-The dataset will contain only transaction with already confirmed inputs. To consider transactions with unconfirmed inputs the fee rate should be computed as a whole, for example if transaction `t2` has an unconfirmed input coming from `t1` outputs (`t1` has all confirmed inputs) and all unspent outputs, a unique fee rate of the two txs is to consider. Supposing `f()` is the absolute fee and `w()` is tx weight, the fee rate is `(f(t1)+f(t2))/(w(t1)+w(t2))`. At the moment the model simply discard this txs for complexity reasons.
+The dataset will contain only transactions with already confirmed inputs. To consider transactions with unconfirmed inputs, the fee rate should be computed as a whole. For example, if transaction `t2` has an unconfirmed input coming from `t1` outputs (`t1` has all confirmed inputs) and all outputs unspent, a single fee rate for the two transactions has to be considered. Supposing `f()` is the absolute fee and `w()` is the transaction weight, the fee rate is `(f(t1)+f(t2))/(w(t1)+w(t2))`. At the moment the model simply discards these transactions for complexity reasons.
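
As an illustration of the combined fee rate formula above, here is a minimal Python sketch. The function name `package_fee_rate` and the `(fee, weight)` tuple layout are assumptions made for this example and are not taken from the data logger; fees are in satoshi and weights in virtual bytes.

```python
def package_fee_rate(transactions):
    """Combined fee rate of a group of dependent transactions.

    `transactions` is a list of (absolute_fee, weight) tuples, a
    hypothetical layout chosen for this sketch. The result is
    (f(t1) + f(t2) + ...) / (w(t1) + w(t2) + ...).
    """
    total_fee = sum(fee for fee, _ in transactions)
    total_weight = sum(weight for _, weight in transactions)
    return total_fee / total_weight


# t1 pays 1 sat/vbyte on its own, t2 pays 9 sat/vbyte on its own,
# but a miner evaluating them together sees a single combined rate.
t1 = (200, 200)
t2 = (1800, 200)
combined = package_fee_rate([t1, t2])
```

Note that the combined rate differs from both individual rates, which is why such transactions cannot be treated as independent observations.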
  
-For similar reasons there is the flag `parent_in_cpfp`. When a tx has inputs confirmed (so it's not excluded by previous rule) but 1 or more of its output has been spent in the same block, `parent_in_cpfp` it's 1.
-Transactions with `parent_in_cpfp=1` are included in the dataset but excluded by current model, since the miner considered a merged fee rate of the 2 to build the block.
+For similar reasons there is the flag `parent_in_cpfp`. When a transaction has confirmed inputs (so it's not excluded by the previous rule) but one or more of its outputs have been spent in the same block, `parent_in_cpfp` is 1.
+Transactions with `parent_in_cpfp=1` are included in the dataset but excluded by the current model, since the miner considered a merged fee rate of the transaction group to build the block.
  
  #### The mempool
  
-The most important information come from the mempool status, however, we cannot feed the model with a list of mempool txs fee rate because this array has a variable length. To overcome this the mempool is converted in buckets which basically are counter of transactions with a fee rate in a specific range. The mempool buckets is defined by two parameters, the percentage increment and the max value.
-Supposing to choose the mempool bucket to have parameter `50%` and `500.0 sat/vbytes` the buckets are like the following
+The most important information comes from the mempool status. However, we cannot feed the model with a list of the fee rates of mempool transactions because this array has a variable length. To overcome this, the mempool is converted into buckets, which are basically counters of transactions with a fee rate in a specific range. The mempool buckets array is defined by two parameters, the `percentage_increment` and the `array_max` value.
+Suppose we choose the mempool buckets array to have parameters `percentage_increment = 50%` and `array_max = 500.0 sat/vbytes`; the buckets then look like the following:
  
-bucket | min fee rate | max fee rate
+bucket | bucket min fee rate | bucket max fee rate
  -|-|-
-a0|1.0|1.5 = min*(1+increment)
+a0|1.0|1.5 = min*(1+`percentage_increment`)
  a1|1.5 = previous max|2.25
  a2|2.25| 3.375
  ...|...|...
  a15|437.89|inf
  
-We previously stated this model is for light-client such as [neutrino] based ones, on this clients the mempool is already available (it's needed to check for receiving tx) but the problem is we can't compute fee rates of this transactions because previous confirmed inputs are not in the mempool!
+The array stops at `a15` because `a16` would have a bucket min greater than `array_max`.
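
The bucket construction described above can be sketched in a few lines of Python. This is only an illustration and not the code of the [data logger]: the function names, the inclusive lower bound of each bucket and the choice to ignore fee rates below the first bucket are assumptions made for this example.

```python
import bisect


def bucket_limits(percentage_increment, array_max, first_min=1.0):
    """Lower bounds of the buckets: 1.0, 1.5, 2.25, ... up to array_max."""
    limits = []
    low = first_min
    while low <= array_max:
        limits.append(low)
        low *= 1.0 + percentage_increment / 100.0
    return limits


def mempool_buckets(fee_rates, limits):
    """Count the fee rates falling in each bucket; the last bucket is open-ended."""
    counts = [0] * len(limits)
    for rate in fee_rates:
        if rate < limits[0]:
            continue  # assumption: rates below the first bucket are ignored
        # Index of the last lower bound that is <= rate.
        index = bisect.bisect_right(limits, rate) - 1
        counts[index] += 1
    return counts


limits = bucket_limits(50, 500.0)
counts = mempool_buckets([1.0, 1.2, 1.5, 3.0, 1000.0], limits)
```

With these parameters `limits` has 16 entries (`a0` to `a15`), so the variable-length list of mempool fee rates becomes a fixed-length vector that can be fed to the model.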
+
+We previously stated this model is for light clients such as [neutrino] based ones. On these clients the mempool is already available (it's needed to check for incoming transactions), but the problem is that we can't compute the fee rates of these transactions because their previous confirmed inputs are not in the mempool!
  
-Luckily, **thanks to temporal locality, an important part of mempool transactions spend outputs created very recently**, for example in the last 6 blocks.
-The blocks are available trough the p2p network, and downloading the last 6 is considered a good compromise between resource consumption and accurate prediction. We need the model to be built with the same data available in the prediction phase, as a consequence the mempool data in the dataset refers only to txs having their inputs in the last 6 blocks. However the `bitcoin-csv` tool inside the [data logger] allows to change this parameter from command line.
+Luckily, **thanks to temporal locality [^temporal locality], an important part of mempool transactions spend outputs created very recently**, for example in the last 6 blocks.
+The blocks are available through the p2p network, and downloading the last 6 is considered a good compromise between resource consumption and accurate prediction. We need the model to be built with the same data available in the prediction phase; as a consequence, the mempool data in the dataset refers only to transactions having their inputs in the last 6 blocks. However, the `bitcoin-csv` tool inside the [data logger] allows configuring this parameter.
  
  #### The outliers
  
-Another information the dataset contain is the block percentile fee rate, considering `r_i` to be the rate of the `ith` tx in a block, `q_k` is the fee rate value such that for each tx in a block `r_i` < `q_k` returns the `k%` txs in the block that are paying lower fees.
+Another piece of information the dataset contains is the block percentile fee rate. Considering `r_i` to be the fee rate of the `ith` transaction in a block, `q_k` is the fee rate value such that the transactions with `r_i` < `q_k` are the `k%` of transactions in the block paying the lowest fee rates.
  
  Percentiles are not used to feed the model but to filter out some outlier transactions.
-Removing this observations is controversial at best and considered cheating at worse. However, it should be considered bitcoin core `estimatesmartfee` doesn't even bother to give estimation for the next block, I think because of the many txs that are confirming in the next block are huge overestimation, or clear errors like [this one] I found when I started logging data.
-This outliers are a lot for txs confirming in the next block (`confirms_in=1`), less so for 2, mostly disappeared for 3 or more. It's counterintuitive overestimation exist for `confirms_in=2`, how it's possible an overestimation doesn't enter the very next block? The reason for that is network latency, the miner didn't see that tx yet, or block building latency, the miner saw the tx, but decided it's not efficient to rebuild the block template yet.
+Removing these observations is controversial at best and considered cheating at worst. However, it should be considered that Bitcoin Core `estimatesmartfee` doesn't even bother to give an estimation for the next block, I think because many of the transactions confirming in the next block are huge overestimations, or clear errors like [this one] I found when I started logging data.
+These outliers are numerous for transactions confirming in the next block (`confirms_in=1`), fewer for 2, and mostly disappear for 3 or more. It's counterintuitive that overestimations exist for `confirms_in=2`: how is it possible that an overestimation doesn't enter the very next block? The reason is either network latency (the miner didn't see that transaction yet) or block building latency (the miner saw the transaction but decided it was not efficient to rebuild the block template yet).
  
-To keep the model balanced, when over-estimation are filtered out, simmetrycally under-estimation are filtered out too. This also has the effect to remove some txs that are included because fee is payed out-of-band.
-Another reason to filter txs, is that the dataset is over-represented by txs with low `confirms_in`, like more tha 50% of txs confirms in the next block, so I think it's good to filter some of this txs
+To keep the model balanced, when overestimations are filtered out, underestimations are symmetrically filtered out too. This also has the effect of removing some transactions that are included because the fee is paid out-of-band.
+Another reason to filter transactions is that the dataset over-represents transactions with low `confirms_in`: more than 50% of transactions confirm in the next block, so I think it's good to filter some of these transactions.
  
  The filters applied are the following:
  
@@ -153,19 +152,19 @@ Not yet convinced to remove this outliers? The [dataset] contains all the observ
  
  column | used in the model | description
  -|-|-
-txid | no | Transaction hash, useful for debugging purpose to check correctness
-timestamp | converted |The moment when the tx has been added in the mempool, in the model is used in the form `day_of_week` and `hour`
-current_height | no |The blockchain height seen by the node in this moment
-confirms_in | yes | This tx confirms at block height `current_height+confirms_in`
-fee_rate | target | This tx fee rate measured in `[sat/vbytes]`
+txid | no | Transaction hash, useful for debugging
+timestamp | converted | The time when the transaction was added to the mempool; in the model it is used in the form `day_of_week` and `hour`
+current_height | no | The blockchain height seen by the node at this moment
+confirms_in | yes | This transaction confirmed at block height `current_height+confirms_in`
+fee_rate | target | This transaction fee rate measured in `[sat/vbytes]`
  fee_rate_bytes | no | fee rate in satoshi / bytes, used to check bitcoin core `estimatesmartfee` predictions
  block_avg_fee | no | block average fee rate `[sat/vbytes]` of block `current_height+confirms_in`
-core_econ | no | bitcoin estimate smart fee result for `confirms_in` block target and in economic mode. Could be not available `?` when a block is connected more recently than the estimation has been requested, estimation are requested every 10 secs.
-core_cons | no | same as previos but with conservative mode
-mempool_len | no | sum of the mempool tx wit fee rate available (sum of every `a*` field)
-parent_in_cpfp | no | it's 1 when the tx has outputs that are spent in the same block as the tx is confirmed (they are parent in a CPFP relations).
-q1-q30-... | no | Tx confirming fast could be outliers, usually paying a lot more than required, this percentiles are used to filter those txs,
-a1-a2-... | yes | contains the number of tx in the mempool with known fee rate in the ith bucket.
+core_econ | no | Bitcoin Core `estimatesmartfee` result for the `confirms_in` block target in economic mode. It may be unavailable (`?`) when a block has been connected more recently than the estimation was requested; estimations are requested every 10 seconds.
+core_cons | no | Same as previous but in conservative mode
+mempool_len | no | Sum of the mempool transactions with fee rate available (sum of every `a*` field)
+parent_in_cpfp | no | It's 1 when the transaction has outputs that are spent in the same block in which the transaction is confirmed (it is the parent in a CPFP relation).
+q1-q30-... | no | Transactions confirming fast could be outliers, usually paying a lot more than required; these percentiles are used to filter those transactions.
+a1-a2-... | yes | Contains the number of transactions in the mempool with known fee rate in the ith bucket.
  
  
  ![The good, the bad and the ugly](/images/the-good-the-bad-the-ugly.jpg)
@@ -177,9 +176,9 @@ a1-a2-... | yes | contains the number of tx in the mempool with known fee rate i
  The code building and training the model with [tensorflow] is available in a [google colab notebook] (jupyter notebook); you can also download the file as plain python and execute it locally. About 30 minutes are needed to train the model, but this heavily depends on the available hardware.
  
  ![graph confirm_in blocks vs fee_rate](/images/20210115-111008-confirms_in-fee_rate.png)
-<div align="center">Tired to read and want a couple simple statement? In the last month a ~50 sat/vbyte tx never took more than a day to confirm and a ~5 sat/vbyte never took more than a week</div><br/>
+<div align="center">Tired of reading and want a couple of simple statements? In the last month a ~50 sat/vbyte transaction never took more than a day to confirm and a ~5 sat/vbyte one never took more than a week</div><br/>
  
-As a reference, in the code we have a calculation of the bitcoin core `estimatesmartfee` MAE [^3] and drift [^4], note this are `[satoshi/bytes]` (not virtual bytes).
+As a reference, in the code we have a calculation of the Bitcoin Core `estimatesmartfee` MAE [^MAE] and drift [^drift]; note these are in `[satoshi/bytes]` (not virtual bytes).
  MAE is computed as `avg(abs(fee_rate_bytes - core_econ))` when `core_econ` is available (about 1.2M observations; sometimes the value is not available because it is considered too old).
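
To make the two metrics concrete, here is a minimal Python sketch of MAE and drift. It is not the notebook code: the function name, the `(fee_rate_bytes, core_econ)` pair layout and the use of `None` for an unavailable estimation are assumptions made for this example, and the numbers are made up.

```python
def mae_and_drift(observations):
    """Return (MAE, drift) over (real, estimate) pairs.

    Pairs whose estimate is None (estimation not available) are skipped.
    MAE averages the absolute errors; drift averages the signed errors.
    """
    errors = [real - estimate
              for real, estimate in observations
              if estimate is not None]
    if not errors:
        raise ValueError("no observation with an available estimate")
    mae = sum(abs(error) for error in errors) / len(errors)
    drift = sum(errors) / len(errors)
    return mae, drift


# Illustrative values only: (fee_rate_bytes, core_econ)
observations = [(10.0, 12.0), (20.0, 18.0), (30.0, None)]
mae, drift = mae_and_drift(observations)
```

In this toy example the two errors have the same magnitude and opposite sign, so the drift cancels out to zero while the MAE does not; this is why both values are worth reporting.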
  
  
@@ -246,7 +245,7 @@ Honestly, about the neural network parameters, they are mostly the one taken fro
  
  A significant part of an ML model is its activation functions; `relu` (Rectified Linear Unit) is one of the most used lately, because it's simple and works well, as I learned in this [introducing neural network video]. `relu` is equal to zero for negative values and equal to the input for positive values. Being non-linear, it allows the whole model to be non-linear.
  
-For the last layer it's different, we want to enforce a minimum for the output, which is the minimum relay fee `1.0` [^5]. One could not simply cut the output of the model after prediction because all the training would not consider this constraint. So we need to build a custom activation function on which the model training will be able to use for the [gradient descent] optimization step. Luckily is very simple using tensorflow primitives:
+For the last layer it's different: we want to enforce a minimum for the output, which is the minimum relay fee `1.0` [^minimum relay fee]. One could not simply cut the output of the model after prediction, because the training would not take this constraint into account. So we need to build a custom activation function that the model training will be able to use for the [gradient descent] optimization step. Luckily, this is very simple using tensorflow primitives:
  
  ```
  def clip(x):
@@ -318,7 +317,7 @@ This is just a starting point, there are many future improvements such as:
  
  * Building a separate model with full knowledge, meant for full, always-connected nodes, could be interesting and could improve network resource allocation in comparison with the current estimator.
  * Tensorflow is a huge dependency, and since it contains all the features to build and train a model, most of them are not needed in the prediction phase. In fact tensorflow lite exists, which is specifically created for embedded and mobile devices; the [prediction test tool] and the final integration in [bdk] should use that.
-* There are other fields that should be explored that could improve model predictions, such as, tx weight, time from last block, etc. Luckily the architecture of the logger allows the recreation of the dataset from the raw logs very quickly. Also some fields like `confirms_in` are so important that the model could benefit from expansion during pre-processing with technique such as [hashed feature columns].
+* There are other fields that should be explored and could improve model predictions, such as transaction weight, time from last block, etc. Luckily the architecture of the logger allows the recreation of the dataset from the raw logs very quickly. Also, some fields like `confirms_in` are so important that the model could benefit from expansion during pre-processing with techniques such as [hashed feature columns].
  * Bitcoin logger could be improved by a merge command to unify raw logs files, reducing redundancy and consequently disk occupation.
  * At the moment I am training the model on a threadripper CPU; training the model on a GPU or even a TPU will be needed to decrease training time, especially because input data will grow and capture more mempool situations.
  * The [prediction test tool] should estimate only with p2p, without requiring a node. This work would be a prerequisite for [bdk] integration.
@@ -346,16 +345,19 @@ And also this tweet that remembered me I had this work in my TODO list
  
  <br/><br/>
  
-[^1]: mempool is the set of transactions that are valid by consensus rules (for example, they are spending existing bitcoin), broadcasted in the bitcoin peer to peer network, but they are not yet part of the blockchain.
-[^2]: DISCLAIMER: I am not an expert data-scientist!
-[^3]: MAE is Mean Absolute Error, which is the average of the series built by the absolute difference between the real value and the estimation.
-[^4]: drift like MAE, but without the absolute value
-[^5]: Most node won't relay transactions with fee lower than the min relay fee, which has a default of `1.0`
+[^fee rate]: The transaction fee rate is the ratio between the absolute fee, expressed in satoshi, and the weight of the transaction, measured in virtual bytes. The weight of the transaction is similar to the byte size, however a part of the transaction (the segwit part) is discounted: its byte size counts less because it creates less burden for the network.
+[^mempool]: The mempool is the set of transactions that are valid by consensus rules (for example, they are spending existing bitcoin) and have been broadcast in the bitcoin peer to peer network, but are not yet part of the blockchain.
+[^temporal locality]: In computer science temporal locality refers to the tendency to access recent data more often than older data.
+[^disclaimer]: DISCLAIMER: I am not an expert data-scientist!
+[^MAE]: MAE is Mean Absolute Error, which is the average of the series built by the absolute difference between the real value and the estimation.
+[^drift]: Drift is like MAE, but without the absolute value.
+[^minimum relay fee]: Most nodes won't relay transactions with a fee rate lower than the min relay fee, which has a default of `1.0`
  
  
+[estimatesmartfee]: https://bitcoincore.org/en/doc/0.20.0/rpc/util/estimatesmartfee/
  [core]: https://bitcoincore.org/
  [bitmex withdrawals]: https://b10c.me/mempool-observations/2-bitmex-broadcast-13-utc/
-[estimators]: https://b10c.me/blog/003-a-list-of-public-bitcoin-feerate-estimation-apis/
+[fee estimators]: https://b10c.me/blog/003-a-list-of-public-bitcoin-feerate-estimation-apis/
  [neutrino]: https://github.com/bitcoin/bips/blob/master/bip-0157.mediawiki
  [weekend]: https://www.blockchainresearchlab.org/2020/03/30/a-week-with-bitcoin-transaction-timing-and-transaction-fees/
  [notifying]: https://github.com/bitcoin/bitcoin/blob/master/doc/zmq.md#usage