What to expect when running a Bitcoin Core Full Node

This post is not about how to install Bitcoin Core and run it as a full node.

It’s about what you can expect when you install Bitcoin Core and run it as a full node.

About full nodes

Firstly, let’s clarify the role of a full node. A node is any system that connects to the Bitcoin network. A full node is any system that connects to the Bitcoin network and retains a full and up-to-date copy of the blockchain.

A full node is not a miner. Nodes simply relay information. By contrast, a miner specifically listens for transactions and tries to create new blocks. Miners do not typically hold a copy of the blockchain.

A full node fulfils one of two roles.

If you simply run a full node, and do not use it as a wallet, all your node is doing is adding to the capacity of the Bitcoin Network to relay information between nodes. This is of some value, but not hugely significant.

If you run a full node and use it as a wallet (either directly or by linking a client wallet to your node), your full node adds to the economic strength of the Bitcoin Network. It does this by enforcing the consensus rules for transactions. The more nodes that enforce the consensus rules, the more difficult it is for malicious nodes to break that consensus.

It is also worth pointing out that running your Bitcoin transactions through your own node is the purest form of transacting in Bitcoin. You have your own copy of the blockchain and can verify transactions for which you are a beneficiary without having to rely on someone else’s copy of the blockchain. It is the most accurate interpretation of the common Bitcoin axiom of “being your own bank”.

Hardware

If you’re an individual and not employed by a software firm involved in Bitcoin, or some other agency tasked with promoting Bitcoin, chances are you’re going to run Bitcoin Core on some old system that you might otherwise have recycled/dumped.

Generally, this is OK. Where your system is going to need the most poke is in downloading the blockchain and verifying each block as it comes in. Once you have downloaded the entire blockchain, a new block is created roughly every 10 minutes, so your system will have a 10 minute break between processing calls. Prior to this, when you’re downloading blocks one after the next in order to complete the blockchain, your system will exhaust its RAM and disk IO. It’s quite normal for your system to become momentarily unresponsive in this phase.

For reference, I downloaded the blockchain on an 8 year old mini-PC with 4GB of RAM and a 300GB disk. At the time of writing, you’re going to need 180GB of disk to accommodate the current blockchain.

Network

Something similar applies in respect of the network. The current blockchain is ~180GB, so you’re going to have to download this. There is no hurry with this. You can stop and start the Bitcoin daemon as often as you want. It will just pick up where it left off when you restart. I set up a cron schedule on mine to start the daemon at 23:00 and stop it again at 08:00, so that the node wasn’t interfering with my day-to-day download requirements. It took me 5-6 weeks to get the entire blockchain.
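For what it’s worth, that schedule amounted to nothing more than two crontab entries along the following lines (the binary paths and data directory are illustrative; adjust for your own install):

# Start the daemon at 23:00 and ask it to shut down cleanly at 08:00
0 23 * * * /usr/local/bin/bitcoind -daemon -datadir=/data/bitcoin
0 8 * * * /usr/local/bin/bitcoin-cli -datadir=/data/bitcoin stop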

At the beginning, blocks will rack up really quickly, as the first blocks weren’t full and were considerably smaller than the 1MB limit. As you get into the last 50k blocks (out of ~500k at the time of writing), where all blocks are full, things slow down significantly.

Once you have the entire chain, the load on the network eases, as you’re only picking up one new 1MB block every 10 minutes. There is also a bit of chatter regarding notifications, but nothing substantial.

One point to note:

If the Bitcoin daemon isn’t shut down cleanly, the next time it starts, it will re-verify all the blocks it downloaded during its last run. During this time, it won’t download any new blocks, and the RPC service won’t be available to process calls. If the previous run was particularly long, this process will also take a long time. You can check the log to see that this is happening. All you can do is let it run. If the daemon gets improperly killed again, the whole process will start again when the Bitcoin daemon is restarted. You should really, really try to avoid letting the daemon stop unexpectedly. Never kill the daemon.
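As a rough guide, the clean way to stop the daemon is via RPC, and you can follow the re-verification in the debug log (assuming the default data directory):

bitcoin-cli stop                 # clean shutdown; never kill -9 the process
tail -f ~/.bitcoin/debug.log     # watch block re-verification progress after a restart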

Checking status

How do you know when you have downloaded the full blockchain?

One way you’ll know is when the Bitcoin daemon is running, but your disk isn’t thrashing and your system is generally responsive. That generally means you have all the blocks and the daemon is just sitting there waiting to pick up the next one that becomes available.

You can obviously verify this with an RPC call too:

root@ubuntu:~# bitcoin-cli getblockchaininfo | grep blocks -A 1
 "blocks": 508695,
 "headers": 508695,

This tells you that the service can see headers for 508,695 blocks on the network, and also that there are 508,695 blocks on your system.

If you stop your system for a few hours or days, and run this command again when you restart it, the number of blocks will be lower than the number of headers, and your system will start thrashing again as it catches up. The longer the gap, the longer the catch-up period. When your system is catching up, it has no value to the Bitcoin network, so try and organise your system so that it is always on, with cron controlling whether or not the Bitcoin daemon is running.

In my next post, I will explain how to use your full node with your Bitcoin wallet.

 

Everything you want to know about AWS Direct Connect but were afraid to ask

AWS Direct Connect is one of those AWS services that everybody knows about but not too many people use. I’ve recently been involved in the setup of a redundant AWS Direct Connect link. To assist others considering doing the same, I’m sharing what I’ve learned.

MTUs?

This is big.

Within an AWS availability zone, and between availability zones in the same region, EC2 instances use jumbo frames. However, jumbo frames are not supported on AWS Direct Connect links, so you will be limited to a maximum MTU of 1500. You may wish to consider the implications of this before you consider using AWS Direct Connect.

Otherwise…

What is AWS Direct Connect?

It’s a dedicated link between a 3rd party network and AWS. Data flows over a dedicated, isolated connection, which means you get consistent bandwidth, unlike a VPN, whose traffic flows over the public Internet.

How is it provisioned?

You have 2 choices. AWS partners with co-location data centre providers across their various regions. This involves AWS dropping wholesale connectivity directly into the Meet Me Rooms in these 3rd party data centres. If your equipment is located in one of these data centres, your AWS Direct Connect connection is then simply patched from your cabinet into the Meet Me Room. This is called a Cross Connect.

If you are not using one of AWS’s co-location data centre partners, you can still make a Direct Connect link from your corporate network to AWS. This involves linking your corporate network to one of the data centres where AWS has a presence in their Meet Me Room, from where you can make an onward connection to AWS. The Direct Connect documentation lists telecoms providers in each region who can provide this service, and the data centres to which they can make connections.
https://aws.amazon.com/directconnect/partners/

What speeds are available?

By default, you can get either a 10Gbps or 1Gbps connection, but you can also consult directly with the AWS partners to get lower speed connections.

What do you pay?

You pay per hour for the amount of time your connection is “up” (connected at both ends). What you pay per hour depends on the speed of your connection. If you provision a connection but it isn’t “up”, you don’t pay, unless you leave that unconnected connection in place for > 90 days (after which you start paying the standard rate).

You also pay per GB of data transferred from AWS to your location. You don’t pay for data transferred from your location to AWS.

What if I need more than 10Gbps?

You can aggregate multiple 10Gbps connections together.

How stable are the connections?

Whereas connecting to AWS with a VPN provides for 2 BGP routes from your location to AWS, a Direct Connect link is a single point of failure. It is thought (presumed?) that AWS provide for a certain level of redundancy once the connection leaves the Meet Me Room in the data centre, but there are no guarantees about this and AWS do not offer an SLA for connectivity.

What hardware do I need?

You will need L3 network hardware. It will need to be able to do BGP routing and support MD5-authenticated BGP sessions (BGP passphrases). It will need to have sufficient port speed to connect to the Direct Connect uplinks you have provisioned. If this is a virgin install in a co-location data centre, there are switches available that can do both L3 and L2, handle BGP and provide redundancy for 2 Direct Connect connections. This negates the need to purchase both routers and switches. You should be able to get this kit for < €20,000. Providers will almost certainly try to sell you more expensive kit. If you’re using Direct Connect, they presume money is no object for you.

What are the steps required to set up a connection?

Decide if you need a single connection or if you’re going to need a pair of redundant connections.

Decide what speed connection you need. Don’t guess this. Estimate it based on current network traffic in your infrastructure.

Design your IP topology.

If you are going to use one of the co-location data centres, contact them. Otherwise, contact one of the Telecoms Provider partners. They will provide pricing/guidance in terms of connecting your equipment or location to the relevant Meet Me Room.

Procure the termination hardware on your side of the connection.

Once you have provisioned your connection and hardware, start building your configuration on the AWS side of the connection.

What do I need in terms of configuring the VPC I am connecting to?

Typically, you will be connecting resources in a VPC to your co-location data centre or on-premises infrastructure. There are a number of hops between a VPC and a Direct Connect connection.

Working out from the VPC, the first thing you need is a Virtual Private Gateway (AWS denotes these as VGW, rather than VPG). This is logically a point of ingress/egress to your VPC. You will be asked to choose a BGP identifier when creating this. If you use BGP already, supply what you need. Otherwise, let AWS generate one for you.

When you have created this, you next create a Route Table that contains a route for the CIDR of your co-location data centre or on-premises infrastructure that points to the VGW you created earlier.

Next, create a subnet (or subnets, or use an existing one) and associate the Route Table with that subnet. Any resources that need to use the Direct Connect connection must be deployed in those subnets. It’s probably worth deploying an EC2 instance in one of these subnets for testing.

This is all you need to do in the VPC configuration (you can apply NACLs, security groups etc. later; leave everything open for now for testing).
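If you prefer the CLI to the console, the VPC side of the setup looks roughly like the sketch below (all IDs, the CIDR and the ASN are hypothetical):

# Create and attach the Virtual Private Gateway (VGW)
aws ec2 create-vpn-gateway --type ipsec.1 --amazon-side-asn 64512
aws ec2 attach-vpn-gateway --vpn-gateway-id vgw-0abc123 --vpc-id vpc-0abc123

# Route the on-premises CIDR via the VGW and associate the Route Table with the subnet
aws ec2 create-route-table --vpc-id vpc-0abc123
aws ec2 create-route --route-table-id rtb-0abc123 --destination-cidr-block 10.100.0.0/16 --gateway-id vgw-0abc123
aws ec2 associate-route-table --route-table-id rtb-0abc123 --subnet-id subnet-0abc123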

How do I set up the Direct Connect configuration on the AWS side?

Once you’ve configured your VPC, you now need to configure your Direct Connect service (you don’t need to do these in any particular order. You can start with Direct Connect if you like).

Create the connections (dxcon) you require in the AWS Direct Connect console. You’ll be asked for a location to connect to and to choose a speed of either 10Gbps or 1Gbps (if you want a lower speed, you’ll need to talk to your Telco or data centre before you can proceed).

The connection will be provisioned fairly quickly, and show itself in a “provisioning” state. After a few hours, it will be in a “down” state. At this point, you can select actions and download what is called a Letter of Authority (LOA) for the connection. This will specify what ports in the Meet Me Room your connection should be patched in to. You need to forward this to your co-location data centre or Telco for them to action.

Note: it is not infrequent to find the ports you have been allocated are already in use by someone else. In this case, delete the connection and start again. If you can, check with the data centre verbally that the ports are free before you submit the LOA to them. Repeat all of above if you have multiple connections. Redundancy is dealt with later in the process.

To be able to use your connection, you now need to attach a Virtual Interface (dxvif) to it. You have options here, and as is always the case, options make things a bit more complicated.

You can connect a Virtual Interface to either a VGW (Virtual Private Gateway) or a Direct Connect Gateway (not the same thing as a Direct Connect connection).

If you connect to a VGW, you will only ever be able to connect to the VPC to which that VGW provides access.

If you connect to a Direct Connect Gateway, you can associate multiple VGWs with that Gateway, allowing you access to multiple VPCs *across all AWS regions*. If you want to use this option, you need to create a Direct Connect Gateway before you create a Virtual Interface.

I can’t see any reason other than corporate governance and security why you would not want to use a Direct Connect Gateway, so I’d suggest using that option if in doubt.

So now proceed and create your Virtual Interface. If you only want to attach it to the VGW you created earlier, that option is there for you. Otherwise, attach it to the Direct Connect Gateway you created.

Once you have your Virtual Interface, go back to the Connections panel and associate that with one of your connections. You will need a dedicated Virtual Interface for each connection (you can also attach multiple Virtual Interfaces to the same connection, but that isn’t relevant here).

The final step here only occurs if you are using a Direct Connect Gateway. If you are, you need to associate the VGW you created in your VPC with the Direct Connect Gateway. It should be presented as an option for you in the list of available VGWs. Start typing its identifier into the search field if not. The UI can be a bit flaky here.

That should be everything. Redundancy is the next piece.
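For reference, the Direct Connect side can also be scripted; the sketch below loosely mirrors the console steps above (the location code, connection IDs, names, VLAN and ASNs are all hypothetical):

# Order the connection and, once allocated, download the LOA for the data centre
aws directconnect create-connection --location EqDC2 --bandwidth 1Gbps --connection-name dx-primary
aws directconnect describe-loa --connection-id dxcon-abc123 --query loaContent --output text | base64 --decode > loa.pdf

# Create a Direct Connect Gateway and a private Virtual Interface attached to it
aws directconnect create-direct-connect-gateway --direct-connect-gateway-name dxgw-main --amazon-side-asn 64512
aws directconnect create-private-virtual-interface --connection-id dxcon-abc123 \
    --new-private-virtual-interface virtualInterfaceName=dxvif-primary,vlan=101,asn=65000,directConnectGatewayId=abcd-1234

# Associate the VGW from the VPC with the Direct Connect Gateway
aws directconnect create-direct-connect-gateway-association --direct-connect-gateway-id abcd-1234 --virtual-gateway-id vgw-0abc123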

How do I configure redundancy on the AWS side?

If you want redundant connectivity, you really need to use a Direct Connect Gateway rather than linking your connection directly to a VGW. I *think* this is a requirement for redundancy. If not, it’s still my recommendation.

If you have done that, you should now have 2 Virtual Interfaces and 1 VGW associated with your Direct Connect Gateway. Think of the Direct Connect Gateway as a router. The 2 Virtual Interfaces are on the external side of the router, linking in to 2 Direct Connect connections. The VGW is on the AWS side of the router, linking back to the VPC.

That should be all that is required. Traffic will flow out of the VPC through the VGW into the Direct Connect Gateway, which is BGP enabled and links into the 2 Virtual Interfaces, which are also BGP enabled. If one connection goes down, BGP routes the traffic on to the other connection. This is transparent to the VPC.

What about redundancy on the other side of the connection?

This is a matter for your network administrator or service provider. Typically, the 2 connections will terminate in a logical stack of redundant routers/switches which are BGP enabled and can transfer traffic flow between the external connections.

How do I know it’s working?

You won’t see the state of your connections and Virtual Interfaces switch to “available” until L2 connectivity is established and the necessary BGP authentication handshake has occurred. At that point, you should be able to send ICMP requests from your termination hardware to the EC2 instance you created in your VPC earlier.

Good luck!

 

 

 

Slaying the Development branch – Evolving from git-flow (Part 2)

In Part 1, I talked about how we developed a git branching strategy that allowed for both a single historical deployment branch, but also for multiple release branches to exist at the same time. The residual issue with this was that we had to suppress merge conflicts during our branch sync’ing operations, so we had to find another way of exposing merge conflicts in a timely and reliable way.

In dealing with this, we devolved responsibility for merge conflict detection to Jenkins.

Builds that were deployed to our Development, QA and UAT environments all originated in Jenkins Build jobs. We had lots of build jobs, but we maintained these programmatically using python, so updating them was not difficult.

We added an extra build step before our integration tests. This step queried the source control repo for all current release branches in a particular project. Each branch was checked out and merged to the branch being built. If any branch failed to merge, the Build job was failed. The only way the development team lead could get the Build to succeed was to resolve the merge conflict.
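The build step itself was nothing exotic. A hypothetical shell version of it might look like the sketch below (the release* branch naming and origin remote are assumptions; our real implementation was generated by the Python job builder mentioned above):

#!/bin/bash
# Trial-merge every extant release branch into the branch being built;
# fail the Jenkins build on the first conflict.
set -e
git fetch origin
for branch in $(git ls-remote --heads origin 'release*' | awk '{print $2}' | sed 's|refs/heads/||'); do
    echo "Test-merging origin/${branch}..."
    if ! git merge --no-commit --no-ff "origin/${branch}"; then
        echo "Merge conflict with ${branch} - failing build"
        git merge --abort
        exit 1
    fi
    git reset --hard HEAD   # discard the trial merge, keep the workspace clean
done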

When I first proposed this, there was some consternation. The main argument was that the development of one release branch should not be delayed by code in another release branch. If you consider release branches in isolation, there is some merit in this argument.

However, while timely deployment of releases is an important consideration, it is not the concern of Source Control. The concern of Source Control is that the code underpinning the product is well managed, which means conflicts between different development strands should be exposed and resolved as soon as possible, even if this impinges on one particular group who are contributing to the overall process.

It was also the case that merge conflicts were reasonably rare, and I was able to argue that a few minutes spent resolving a merge conflict every couple of months was a small price to pay to not have to unravel a bug based on a merge conflict that had found its way into Production.

We proceeded with the system, and as predicted, Build failures due to merge conflicts were rare. However, they did happen, which was something of a relief, as it proved the system was working as designed.

More generally, the development teams were given a whole new lease of life by removing the development branch from the SDLC. In fact, it really only became apparent how much of a bind the development branch was after it had been removed.

The overall success of the system was apparent when a minor bug in the scripting that underpinned the system led to a brief period of confusion before it was found and resolved. During this incident, I offered the option of temporarily re-opening the development branch.

The resounding NO that this was greeted with gave me not insignificant satisfaction!

Renaming files with non-interactive sftp

SFTP hangs around the IT Operations world like a bit of a bad smell.

It’s pretty secure, it works, and it’s similar enough to FTP for software developers and business managers to understand, so it’s not uncommon to find it underpinning vast data transfer processes that have been designed in a hurry.

Of course, it’s very rudimentary in terms of what it can do, and very dependent on the underlying security of the OS on which it resides, so it’s not really something that should find a home in Enterprise IT solutions.

Anyway, sometimes you just have to deal with it. One problem that you will often encounter is that while you have SFTP access to a system, you may not have shell access via OpenSSH. This makes bulk operations on files a bit more difficult, but not impossible.

SFTP has a batch mode that allows you to pass STDIN commands to the processor. If used in conjunction with non-interactive login (ie an OpenSSH public/private key pair), you can actually process bulk operations.

Let’s say you want to rename 500 files in a particular directory:

You can list the files as follows:

echo "ls -1" | sftp -q -i ~/.ssh/id_rsa -b - user@sftp.mycompany.com:/dir1/

In this case, the parameter:

-b -

tells sftp to read its commands from STDIN.

You can now incorporate this into a BASH loop to complete the operation:

for f in `echo "ls -1" | sftp -q -i ~/.ssh/id_rsa -b - user@sftp.mycompany.com:/dir1/ | grep -v sftp | grep -v Changing`;
    do
    echo "Renaming $f...";
    echo "rename $f $f.renamed" | sftp -q -i ~/.ssh/id_rsa -b - user@sftp.mycompany.com:/dir1/;
done

 

Slaying the Development branch – Evolving from git-flow (Part 1)

Continuous Integration and Continuous Delivery (CI/CD) is essential to any modern day, mission critical software development life cycle.

The economic logic is simple. If you’re paying a developer a lot of money to fix bugs or add features to your software, it doesn’t make sense to have those bug fixes and features sitting in a build pipeline for 2 months waiting to be deployed. It’s the equivalent of spending money on stock for your grocery store and leaving it on a shelf in your loading bay instead of putting it in the shop window.

But taking legacy software development life cycles and refactoring them so that they can use CI/CD is a significant challenge. It is much harder to refactor embedded, relatively stable processes than to design new ones from the ground up.

This was a challenge I was faced with in my most recent employment. This article, and its sequel, describe some of the challenges I encountered and how they were resolved, focusing specifically on how we evolved our source control management strategy from one based on git-flow to one that permitted merging code changes directly from Feature branches to Production.

I’ll begin by describing the environment as I found it.

This was a multi-tenant Software as a Service (SaaS) platform provided over the Internet on a business-to-business basis. The SaaS comprised 16 individual services backed by a mix of MySQL and PostgreSQL data stores. The services were built with Java (for processing and ETL operations) and Rails (for Web UI and API operations).

The business profile required parallel development streams, so source control was based on the git-flow model. Each project had a development branch, from which feature branches were taken. Feature branches were merged into concurrent release branches. Builds were created from release branches and deployed in the infrastructure tiers (Dev, QA, UAT, Staging, Prod). There was no historical deployment branch and no tagging. Each release cycle lasted approximately 6 weeks. A loose Agile framework applied, in that stories were part of releases, but Agile processes were not strictly followed.

Infrastructure used in the software development life cycle was shared. There were monolithic central Dev, QA, UAT environments etc. Local developer environments were not homogeneous. Java developers couldn’t run local Rails apps and vice versa. All code was tested manually in centralised, shared environments.

The situation described above would be reasonably typical in software development environments which have evolved without a DevOps culture and dedicated Operations resources (ie where development teams build the environments and processes).

While the development/deployment process in this environment was working, it was sub-optimal, resulting in delays, cost overruns and issues with product quality.

A plan was developed to incrementally migrate from the current process to a CI/CD-based process. This involved changes to various functions, but perhaps the most important change was to the source control management strategy, which is what I want to deal with in detail in this article.

A typical development cycle worked as follows.

In every git project, the development branch was the main branch. That is to say, the state of all other branches was relative to the development branch (ahead or behind in terms of commits).

For a scheduled release, a release branch was created from the development branch. For the purposes of illustration, let’s call this release1. Stories scheduled for release1 were developed in feature branches taken from development, which were then merged into release1. These features also had to be merged to development. When all features were merged to release1, release1 was built and deployed to QA.

At the same time, work would start on release2, but a release2 branch would not be created, nor would release2 features be merged to development, as development was still being used as a source for release1 features. Only when development for release1 was frozen could release2 features be merged to development, and only when release1 was built for Production was a release2 branch created.

This system had been inherited from a simpler time when the company was younger and the number of applications comprising the platform was much smaller. Its limitations were obvious to all concerned, but the company didn’t have a dedicated “DevOps” function until later in its evolution, so no serious attempt had been made to re-shape it.

From talking to developers, it became clear that the primary source of frustration with the system was the requirement to have to merge features to multiple branches. This was particularly painful when a story was pulled from a release, where the commit was reversed in the release branch but not the development branch. It was not infrequent for features to appear in one release when they were scheduled for another.

After talking through the challenge, we decided on a number of requirements:

1. Features would only be merged to one other branch

2. We could have concurrent release branches at any time

3. We would have a single historical “Production” branch, called “deploy”, which was tagged at each release

4. At the end of the process, we would only be one migration away from true CI/CD (merging features directly to deploy)

5. We would no longer have a development branch

From the outset, we knew the requirement that would present the biggest challenge was to be able to maintain concurrent release branches, because when multiple branches are stemmed from the same source branch, you always run the risk of creating merge conflicts when you try to merge those branches back to the source.

At this juncture, it’s probably wise to recap on what a merge conflict is, as this is necessary to understand how we approached the challenge in the way that we did.

A merge conflict occurs between 2 branches when those branches have a shared history, but an update is made to the same line in the same file after those branches have diverged from their common history. If a conflict exists, only one of the branches can be merged back to the common source.

If you think of a situation in which 2 development teams are working on 2 branches of the same project taken from the same historical branch, and those 2 branches ultimately have to be merged back to that historical branch, you can see how this could present a problem.

When you then extrapolate that problem out over 16 individual development projects, you see how you’re going to need a very clearly defined strategy for dealing with merge conflicts.

Our first step was to define at which points in the development cycle interaction with the source control management system would be required. This was straightforward enough:

1. When a new release branch was created

2. When a new patch branch was created

3. When a release was deployed to Production

We understood that at each of these points, source control would have to be updated to ensure that all release branches were in sync, and that whatever method we used to ensure they were in sync would have to be automated. In this instance, “in sync” means that every release branch should have any commits that are necessary for that release. For instance, if release1 were being deployed to Production, it was important that release2 should have all release1 commits after the point of deployment. Similarly, if we were creating release3, release3 should have all commits from release2 etc etc.

However, we knew that managing multiple branches in multiple projects in this way was bound to produce merge conflicts, but at the same time, we didn’t want a situation in which a company-wide source control management operation was held up by a single merge conflict in a single project.

In light of this, we decided to do something a little bit controversial.

If our aim was to keep branches in sync and up to date, we decided that branching operations should focus on this goal, and that we would use a separate mechanism to expose and resolve merge conflicts. Crucially, this part of the process would occur prior to global branching updates, so that all branches arrived at the point of synchronisation in good order.

So, to return to the 3 points where we interacted with source control, we decided on the following:

1. When a new release branch was created

This branch would be created from the latest tag on the historical production branch (“deploy”). All commits from all other extant release branches would be merged to this branch, resulting in a new release branch that already contained all forward development. When a merge conflict existed between the new branch and an extant release branch, the change from the extant branch would be automatically accepted (git merge -X theirs, i.e. --strategy-option=theirs; see the sketch after this list).

2. When a new patch branch was created

This branch would be created from the latest tag on the deploy branch. No commits from any other extant release branches would be merged to this branch, because we did not want forward development going into a patch. Because no extant release branches were being merged to the patch, there was no need to deal with merge conflicts at this point.

3. When a release was deployed to Production

At this point, the release branch would be merged to the deploy branch, and the deploy branch would then be merged to any extant release branches. This would ensure that everything that had been deployed to Production was included in forward development. When a merge conflict existed between the deploy branch and an extant branch, the change from the extant branch would be automatically accepted (git merge -X ours). The release branch that had been merged would be deleted, and the deploy branch tagged with the release version number.
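In git terms, the two conflict-suppressing operations above look roughly like this (a sketch only; branch names, the tag and the remote are illustrative):

# 1. Cut a new release branch from the latest Production tag and pull in forward development
git fetch origin --tags
git checkout -b release3 $(git describe --tags --abbrev=0 origin/deploy)
for branch in release1 release2; do
    git merge -X theirs "origin/${branch}"    # conflicts resolve in favour of the extant branch
done
git push -u origin release3

# 3. release1 goes to Production, then Production is folded back into forward development
git checkout deploy
git merge --no-ff origin/release1
git tag -a 1.42.0 -m "release1"               # hypothetical version number
git push origin deploy --tags
for branch in release2 release3; do
    git checkout "${branch}"
    git merge -X ours origin/deploy           # conflicts resolve in favour of the extant branch
    git push origin "${branch}"
done
git push origin --delete release1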

We decided to refer to the automatic acceptance of changes during merge operations as “merge conflict suppression”. In the next part of the article, I’ll explain how we decided to deal with merge conflicts in a stable and predictable way.

 

 

 

Breaking up 3’s mobile broadband scam

3 mobile offer a mobile broadband service. You buy a router which takes a mobile SIM, which has mobile data enabled and which is subscribed to a mobile data plan. The router connects to the mobile network and then shares that connection over a WIFI network in your home.

3 offer a number of plans based on the amount of data you want to download, ranging from 3GB per month up to 100GB per month. For example, 30GB per month costs €29.99.

If you go over your limit, you pay penalty charges. The penalty rate is 5c per MB. This doesn’t sound like much, but if you go 1GB over your limit (half a Netflix movie), that’s €50. This applies whether you are on the 3GB or the 100GB plan. Once you get to 1GB over your limit, 3 stops the connection and you can’t download any more data until your plan renews at the start of your next billing cycle.

Does this sound like a scam? No, not really, but I haven’t included all the detail.

Given the penal rate that applies if you go over your limit, you would think that 3 had some sure fire mechanism of alerting you that you are approaching your limit or are at your limit, so that you could stop using the service.

Well, they don’t.

All they do is send a warning SMS message to the mobile phone number that is associated with the SIM card in your router. The only way that you’ll ever see this message is if you login to the router and check the messages, which you’re never going to do, because there is no reason to login to the router other than when you first set it up.

It would of course be much more logical for them to send you an email (you supply your email address when you buy the router and/or subscribe to the data plan), but they don’t do this. Nor do they make any functionality available whereby you can set up an email warning yourself.

Nor do they shut off your connection when you reach your limit. Of course, they could argue that they want to give users the capability to burst beyond their limit, in case of some emergency, but why then do they go ahead and shut down your connection when you go over your limit by 1GB, which costs €50? And why don’t they give users the option of having their connection shut off when they breach their limit? And regardless of both these points, you can also buy a data add-on for a couple of euro if you want to go above your limit in a given month.

The only explanation that stacks up here is that 3 are making a small fortune from users breaching their mobile data limits. I use the service, and am technically savvy, but in the first 4 months of using it, I breached my limit twice. I had budgeted paying €29.99 per month for 30GB per month over 4 months (Total: €119.96 for 120GB of data), but ended up paying €219.96 for 122GB of data.

Certain that 3 were playing offside here, I contacted ComReg to make a complaint. Surely, they would recognise how ridiculous it was for service providers to be warning users about breaching data limits by sending SMS messages to SIM cards buried in routers.

But no. To my amazement, I got a response from ComReg saying that mobile providers were not obliged to warn users about limit breaches by email. They were only required to send SMS messages!

Now my blood was really up.

I got to work with some Python and Selenium and wrote a script that logs into my 3 account once per hour and picks up my remaining allowance. This allowance is then posted to my website, where it is checked by a simple web content checker app running on my phone. Every time the content changes, the app alerts me.

My plan now is to extend this functionality to other users, so that they too can cut off the supply of €50 fines being delivered to 3. If you would like to have your allowance monitored in this way, please let me know and I will send you what you need.

If you’re tech savvy, you can do it yourself. See here:

https://github.com/garrethmcdaid/3allowancecheck/

How Bitcoin mining pools work

I’ve written this to clarify my own understanding. Treat with caution.

Bitcoin mining pools exist because the computational power required to mine Bitcoins on a regular basis is so vast that it is beyond the financial and technical means of most people. Rather than investing a huge amount of money in mining equipment that will (hopefully) give you a return over a period of decades, a mining pool allows the individual to accumulate smaller amounts of Bitcoin more frequently.

The principle is quite straightforward: lots of people come together to combine their individual computing power into a single logical computing unit and share any rewards (Bitcoins) proportionally based on the amount of computing effort they contributed.

If you have been following Bitcoin news lately, then you know that the complexity arises in the regulation of the network, for instance:

How do you know how much effort was contributed by each member?

How do you prevent members jumping in and out of pools, particularly pools that haven’t mined a Bitcoin in a while, and where it is likely they will mine a Bitcoin in the near future?

How do you prove that individual members are actually working?

How do you prevent more powerful members from hogging the network bandwidth of the master miner, preventing less powerful members from contributing?

How do you ensure that the master miner is getting enough throughput to have a reasonable chance of mining a Bitcoin?

These are just a sample of the problems that exist in relation to efficient and fair Bitcoin mining in pools. The following is a general explanation of how these problems are dealt with.

Let’s clarify terminology first (I’m assuming basic knowledge of how Bitcoin works here).

Hash: the end product, a binary number, of a hashing operation performed by a miner. Each new Hash is created by the miner adding a sequential value (a nonce) to the source data of the hashing operation. Modern computers can generate hundreds of thousands of Hashes per second.

Target: In Bitcoin, each new Block has a Target. This is a binary number. To succeed in creating a new Block, a miner has to compute a Hash that is lower than that number. The Bitcoin protocol adjusts that number depending on the amount of activity on the network. If there is a lot of activity, the number becomes smaller, if there is less activity, the number becomes larger. The objective is to regulate the creation of new Blocks (ie Bitcoins) to 1 every 10 minutes or so.

Difficulty: This is a measure of how difficult it is for a miner to derive a Hash that is less than the current Target (ie mine Bitcoins). It is extrapolated from the amount of time it took to generate the last 2,016 blocks. At a rate of 1 block every 10 minutes, it should take 2 weeks to generate 2,016 blocks. If it has actually taken less than this, the difficulty will increase (ie the Target will be a smaller number). If it has taken more than this, the difficulty will decrease (ie the Target will be a larger number). The unit of measurement of difficulty is Hashes, ie the number of Hashes generated across the entire network to create 2,016 Blocks.

Master Node: a full Bitcoin node that operates on the Bitcoin P2P network and which regulates a pool of members, who do not directly communicate on the Bitcoin P2P network, but who use the Master Node as a proxy.

Share: a Share is something that is particular to Mining pools. It does not form part of the wider Bitcoin protocol. It is the primary method used by the Master Node to regulate the activity of members of a pool. The next section deals with Shares.

When you start running a Bitcoin mining process, you will probably be aware of your Hash Rate. This is the number of Hashes your Bitcoin mining hardware is generating per second. These days, this is normally measured in Ghps, which means billions (Giga) of Hashes per second. A typical high-end Graphics Card (GPU) on a modern PC can generate about 0.5 Ghps. A dedicated ASIC mining rig, which will cost over €1,000, might be able to generate 8,000 Ghps. A mining pool will typically measure its combined compute power in Thps, or trillions (Tera) of Hashes per second.

When you participate in a mining pool, and you see your hardware generating (let’s say) 5,000 Ghps, this does not mean that you are submitting 5,000 Ghps to the Bitcoin network, via the Master Node. If that were the case, and all the pool members were doing the same, the Master Node that controls the pool would simply explode.

What it means is that your mining hardware can generate 5,000 Ghps locally on your computer.

This capacity isn’t used directly on the Bitcoin network. Instead, the Master Node that controls the pool acts as a proxy between the pool members and the main Bitcoin network. For this to work, the Master Node has to ensure both that the members are supplying enough Hashes for the Master Node to be able to compete on the main Bitcoin mining network, and that the allocation of any Bitcoin mined is divided proportionately according to the amount of compute effort supplied by the individual members.

To do this, the Master Node observes the Hash Rate of each of the members, and distributes computational challenges to them that all have a slightly lower Difficulty rating than the Difficulty rating of the current Block Target (let’s call this lower bar the Proxy Target). If a Hash below the Proxy Target is found, the Master Node accepts this as a “Proof of Work Accepted”. If a Hash that is lower than the Proxy Target is not found within the allotted time, the member completes the work anyway before moving on to the next computational challenge distributed by the Master Node.

In this way, your mining process will log “Work Units Started”, which will always be a lower number than “Proof Of Works Accepted”. The smaller the gap between these numbers, the “luckier” you are. The greater the gap, the “unluckier” you are. In reality however, and over time, the gap should be consistent across all members, as the Master Node will adjust the Difficulty of the computational challenges based on the Hash Rate of the member, which can change over time.

A Share is therefore the equivalent of a “Proof of Work Accepted”. The Master Node will keep a record of all “Proof of Works Accepted” from each member, and distribute Bitcoin mined based on the number of Shares each member has contributed.

Confused? Of course you are, so let’s go through that again, from a different perspective.

If the Master Node were sending computational challenges to members that had a Difficulty rating that was equal to the Difficulty rating of the current Target in the Blockchain, the Master Node would just be sitting there idly for days on end waiting for one of the members to come up with the necessary Hash to create the new Block. The Master Node would have no knowledge of what effort the other members contributed, and would have no option but to award the full reward to the successful member, even if that member only contributed 0.001% of the computational effort involved in creating the Block.

Instead, the Master Node lowers the bar on the Difficulty rating (relative to the actual difficulty rating of the current Block) so that it receives lots of Hashes from the members. All but (at most) one of these Hashes will still be above the current Blockchain Target, but at least now the Master Node can confirm that its members are working, and at what rate they are working. It can then use that information to both regulate the traffic received from the members and proportionately divide any rewards.
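If it helps to make this concrete, here is a toy hashing loop (a sketch only, using a single SHA-256 of a string rather than real Bitcoin block hashing); a hash with four leading zeros stands in for the easier Proxy Target (a Share), while eight leading zeros stands in for the real Block Target. Running it should turn up a handful of Shares and, almost certainly, no Block:

# Toy illustration only: not real Bitcoin block hashing
for nonce in $(seq 0 200000); do
    hash=$(echo -n "block-header-${nonce}" | sha256sum | awk '{print $1}')
    if [[ ${hash} == 00000000* ]]; then
        echo "nonce=${nonce} beats the (toy) Block Target: ${hash}"
    elif [[ ${hash} == 0000* ]]; then
        echo "nonce=${nonce} beats the (toy) Proxy Target - a Share: ${hash}"
    fi
done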

Additionally, it can ensure that more powerful members, whose submissions are rate-limited to allow submissions from less powerful members, are not discriminated against. These more powerful nodes are given challenges with higher Difficulty ratings, but any “Proofs of Work Accepted” (ie Shares) that they accumulate are weighted according to the Difficulty rating that was set, giving them a proportionately higher share of any ultimate reward.

While all of the above may sound complex, it is still in fact just a general introduction to mining pools, and is based on the mining pool that I use, Bitminter.

Other pools use variations of this methodology, but all follow the general principle that the Master Node is a proxy to the main Bitcoin network, that members must prove to the Master Node that they are working and that rewards are allocated based on the amount of work done.

A brief note about payment methodologies is also warranted. Many pools will use either the PPS (Pay Per Share) or PPLNS (Pay Per Last N Shares) method to distribute rewards.

In the PPS model, you get a payment for each Share you contribute regardless of the success or failure of the mining pool. This makes for a regular income, but doesn’t allow you to benefit when the pool has a lucky streak.

In the PPLNS model, you get paid only when Bitcoins are mined, and only on the basis of the Shares you submitted to that effort. This makes for more irregular income, but allows you to benefit when the pool has a lucky streak.


 

 

How to monitor Docker containers with Nagios and NRPE

Monitoring whether or not a Docker container is alive on a remote host should be fairly easy, right?

The standard approach in this is to include a suitable NRPE script on the remote host, and call that remotely from your Nagios server via the NRPE TCP daemon on the remote host. This script is a good example of same, and we’ll refer to it in the rest of the article.

This generally works fine when you’re doing innocuous things like checking free disk space or if a certain process is running. Checking a Docker container is a little bit harder, because the command:

docker inspect

can only be run as root (or by a member of the docker group), whereas the NRPE service on the remote host runs as a non-privileged user (usually called nagios).

As such, when you test your NRPE call from the Nagios server, like so:

/usr/lib64/nagios/plugins/check_nrpe -H dockerhost.yourdomain.com -c check_docker_container1

You will see a response like:

NRPE: Unable to read output

or

UNKNOWN - container1 does not exist.

You get this response because the nagios user cannot execute the docker control command.

You could get around this by running NRPE on the remote host as the root user, but that really isn’t a good idea, and you should never do this.

A better play (if you are confident that your Nagios setup is secure) is to extend controlled privileges to the nagios user via sudo. You can create the following file as /etc/sudoers.d/docker to achieve this:

nagios    ALL=(ALL:ALL)  NOPASSWD: /usr/bin/docker inspect *
nagios    ALL=(ALL:ALL)  NOPASSWD: /usr/lib64/nagios/plugins/check-docker-container.sh *

This allows the nagios user to run both the wrapper script around the docker inspect command and the docker control command itself, without requiring a password. Note, only inspect permission is granted. Obviously, we don’t want to give nagios permission to actually manipulate containers.

In addition to this, we must make provision for NRPE to run the command using sudo when called via the NRPE TCP daemon. So, in nrpe.cfg, instead of:

command[check_docker_container1]=/usr/lib64/nagios/plugins/check-docker-container.sh container1

we have:

command[check_docker_container1]=sudo /usr/lib64/nagios/plugins/check-docker-container.sh container1
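The wrapper script itself isn’t reproduced in this post, but for reference, a minimal version might look something like the sketch below (a hypothetical script, not necessarily the one linked above; exit codes follow the Nagios convention and the container name is passed as the first argument):

#!/bin/bash
# check-docker-container.sh <container-name>
# Reports OK if the named container is running, CRITICAL otherwise.
CONTAINER="$1"
RUNNING=$(sudo /usr/bin/docker inspect --format '{{.State.Running}}' "${CONTAINER}" 2>/dev/null)
if [ "${RUNNING}" == "true" ]; then
    echo "OK - ${CONTAINER} is running"
    exit 0
else
    echo "CRITICAL - ${CONTAINER} is not running or does not exist"
    exit 2
fi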

 

 

Using Elasticsearch Logstash Kibana (ELK) to monitor server performance

There are myriad tools that claim to be able to monitor server performance for you, but when you’ve already got a sizeable bag of tools doing various automated operations, it’s always nice to be able to fulfil an operational requirement using one of those rather than having to onboard another one.

I love Elasticsearch. It can be a bit of a minefield to learn, but when you get to grips with it, and bolt on Kibana, you realise that there is very little you can’t do with it.

Even better, Amazon AWS now have their own Elasticsearch Service, so you can reap all the benefits of the technology without having to worry about maintaining a cluster of Elasticsearch servers.

In this case, my challenge was to expose performance data from a large fleet of Amazon EC2 server instances. Yes, there is a certain amount of data available in AWS Cloudwatch, but it lacks key metrics like memory usage and load average, which are invariably the metrics you most want to review.

One approach to this would be to put some sort of agent on the servers and have a server poll the agent, but again, that’s extra tools. Another approach would be to put scripts on the servers that push metrics to Cloudwatch, so that you can augment the existing EC2 Cloudwatch data. This was something we considered, but with this method, the metrics aren’t logged to the same place in Cloudwatch as the EC2 data, so it all felt a bit clunky. And you only get 2 weeks of backlog.

This is where we turned to Elasticsearch. We were already using Elasticsearch to store information about access to our S3 buckets, which we were happy with. I figured there had to be a way to leverage this to monitor server performance, so I set about some testing.

Our basic setup was a Logstash server using the S3 Input plugin, and the Elasticsearch output plugin, which was configured to send output to our Elasticsearch domain in AWS:

output {
 if [type] == "s3-access" {
     elasticsearch {
         index => "s3-access-%{+YYYY.MM.dd}"
         hosts => ["search-*********-5isan2svbmpipm2xznyupbeabe.us-west-2.es.amazonaws.com:443"]
         ssl => true
    }
 } 
}

We now wanted to create a different type of index, which would hold our performance metric data. This data was going to be taken from lots of servers, so Logstash needed a way to ingest the data from lots of remote hosts. The easiest way to do this is with the Logstash syslog input plugin. We first set up Logstash to listen for syslog input.

input {
     syslog {
         type => syslog
         port => 8514
     }
}

We then get our servers to send their syslog output to our Logstash server, by giving them a universal rsyslogd configuration, where logs.mydomain.com is our Logstash server:

#Logstash Configuration
$WorkDirectory /var/lib/rsyslog # where to place spool files
$template LogFormat,"%HOSTNAME% ops %syslogtag% %msg%"
*.* @@logs.mydomain.com:8514;LogFormat

We now update our output plugin in Logstash to create the necessary Index in Elasticsearch:

output {
 if [type] == "syslog" {
    elasticsearch {
       index => "test-syslog-%{+YYYY.MM.dd}"
       hosts => ["search-*********-5isan2svbmpipm2xznyupbeabe.us-west-2.es.amazonaws.com:443"]
       ssl => true
    }
 } else {
    elasticsearch {
       index => "s3-access-%{+YYYY.MM.dd}"
       hosts => ["search-*********-5isan2svbmpipm2xznyupbeabe.us-west-2.es.amazonaws.com:443"]
       ssl => true
    }
 }
}

Note that I have called the syslog Index “test-syslog-…”. I will explain this in a moment, but it’s important that you do this.

Once these steps have been completed, it should be possible to see syslog data in Kibana, as indexed by Logstash and stored in our AWS Elasticsearch domain.

Building on this, all we had to do next was get our performance metric data into the syslog stream on each of our servers. This is very easy. Logger is a handy little utility that comes pre-installed on most Linux distros and allows you to send messages to syslog (/var/log/messages by default).

We trialled this with Load Average. To get the data to syslog, we set up the following cronjob on each server:

* * * * * root cat /proc/loadavg | awk '{print "LoadAverage: " $1}' | xargs logger

This writes the following line to /var/log/messages every minute:

Jun 21 17:02:01 server1 root: LoadAverage: 0.14

It should then be possible to search for this line in Kibana

message: "LoadAverage"

to verify that it is being stored in Elasticsearch. When we do find results in Kibana, we can see that the LogFormat template we used in our server rsyslog conf has converted the log line to:

server1 ops root: LoadAverage: 0.02

To really make this data useful however, we need to be able to perform visualisation logic on the data in Kibana. This means exposing the fields we require and making sure those fields have the correct data type for numerical visualisations. This involves using some extra filters in your Logstash configuration.

filter {
   if [type] == "syslog" {
       grok {
          match => { "message" => '(%{HOSTNAME:hostname})\s+ops\s+root:\s+(%{WORD:metric-name}): (%{NUMBER:metric-value:float})' }
       }
   }
}

This filter operates on the message field after it has been converted by rsyslog, rather than on the format of the log line in /var/log/messages. The crucial part of this is to expose the Load Average value (metric-value) as a float, so that Kibana/Elasticsearch can deal with it as a number rather than a string. If you only specify NUMBER as your grok data type, it will be exposed as a string, so you need to add the “:float” to complete the conversion to a numeric data type.

To check how the field is currently exposed, look in Kibana under Settings -> Indices. You should only have a single Index Pattern at this point (test-syslog-*). Refresh the field list for this, and search for “metric-value”. At this point, it may indicate that the data type for this field is “String”, which we can now deal with. If it already has data type “Number”, you’re all set.

In Elasticsearch indices, you can only set the data type for a field when the index is created. If your “test-syslog-” index was created before we properly converted “metric-value” to a float, you can now create a new index and verify that metric-value is numeric. To do this, update the output plugin in your Logstash configuration and restart Logstash.

output {
 if [type] == "syslog" {
    elasticsearch {
       index => "syslog-%{+YYYY.MM.dd}"
       hosts => ["search-*********-5isan2svbmpipm2xznyupbeabe.us-west-2.es.amazonaws.com:443"]
       ssl => true
    }
 } 
}

A new Index (syslog-) will now be created. Delete the existing Index pattern in Kibana and create a new one for syslog-*, using @timestamp as the default time field. Once this has been created, Kibana will obtain an updated field list (after a few seconds), and in this, you should see that “metric-value” now has a data type of “Number”.
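If you want to double-check the mapping outside Kibana, you can also query Elasticsearch directly (substitute your own AWS Elasticsearch endpoint):

curl -s "https://search-yourdomain.us-west-2.es.amazonaws.com/syslog-*/_mapping/field/metric-value?pretty"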

(For neatness, you may want to replace the “test-syslog-” index with a properly named index even if your data type for “metric-value” is already “Number”.)

Now that you have the data you need in Elasticsearch, you can graph it with a visualisation.

First, set your interval to “Last Hour” and create/save a Search for what you want to graph, eg:

metric-name: "LoadAverage" AND hostname: "server1"

Now, create a Line Graph visualisation for that Search, setting the Y-Axis to Average for field “metric-value” and the X-Axis to Date Histogram. Click “Apply” and you should see a graph like below:

[Screenshot: Kibana line graph of the Load Average metric for server1]

 

 

Migrating MySQL from AWS RDS to EC2

Applications that use MySQL as their underlying RDBMS commonly evolve as follows:

  1. Application and MySQL server on same EC2 instance
  2. Application balanced between multiple EC2 instances and MySQL server moved to RDS instance
  3. MySQL server moved back to EC2 with DIY High Availability infrastructure
  4. MySQL server moved to Bare Metal with DIY High Availability infrastructure in Co-Lo data centre

As each of these migration steps arrives, the size of the dataset under MySQL management is larger, and the availability of the application more critical, making each step exponentially more complex.

In recent months, I have had to manage Step 3 in this life cycle (the migration back to EC2 from RDS). The following is an account of my experience.

The dataset involved was 4TB in size. That isn’t huge by today’s standards, but it’s large enough to involve multiple days of data transfer and to require something more than a mysqldump and import in your planning.

The dataset was also highly volatile, in that it was being augmented 24/7, and relied on stored procedures to aggregate data on a daily basis on which commercial SLAs were based. In other words, stopping updates to the dataset for anything more than a couple of hours was not an option.

Time pressure was a further consideration. RDS has a hard limit of 6TB of disk space for an instance (and a 2TB file size limit), and our application was due to introduce new functionality that would increase the rate of data accumulation dramatically. We estimated that we had 2-3 months to complete the transition before the 6TB limit appeared on the horizon.

We did our research and decided on a strategy. We would create a Read Replica of our RDS master and allow it to come into sync. When it was in sync, we would promote it to a standard RDS instance and note the replication point in the Bin Log. We would then do a full mysqldump of the database and inject that directly into our EC2 master, which we estimated would take 96 hours. When this was complete, we would make the EC2 master a slave of the RDS master, and start replication from the point in the Bin Log we had previously noted. We estimated that the data gap would take 18-20 hours to fill, after which we would have a full and intact dataset in EC2.

This plan was fine except for one detail. Because our data relies extensively on stored procedures, it requires a lot of RAM and CPU grunt to get through its workload. Under normal circumstances, we maintained a Read Replica for the RDS master, to allow for intensive read queries that would not impact on the processing capability of the RDS master. On occasion, when there were replication issues, the Bin Log on the RDS would grow rapidly, consuming several hundred GBs of disk space. This isn’t supposed to be an issue in MySQL, but the internal mechanics of RDS and how the Bin Log is managed seem to make it an issue. When we saw the Bin Log growing to this extent, performance on the RDS master rapidly degraded, requiring us to terminate replication completely (in order that RDS would flush the Bin Log).

Given that our plan involved allowing the Bin Log to grow over 96 hours, we were obviously concerned. We discussed this with our support partners, Percona, who recommended an alternative strategy.

They suggested using the MySQL Bin Log utility (mysqlbinlog) to back up the Bin Log to a location outside RDS, which we could then stream into our EC2 master. This would involve extra steps in the process, and tighter co-ordination, but it seemed a lot less risky in terms of impact on the RDS master. Our new plan was therefore as follows:

  1. Ensure all applications are using a DNS record for MySQL server that has 0 sec TTL
  2. Create a Read Replica of the RDS master and allow to come in sync
  3. Stop replication on the replica, note the replication point and promote to master
  4. Configure RDS master to retain at least 12 hours of Bin Log, and wait for 12 hours (ensuring that Bin Log growth does not impact on performance during this time)
  5. Start Bin Log backup from RDS master to disk on EC2 master (see the command sketch after this list)
  6. Commence mysqldump from RDS master and inject directly into EC2 master
  7. On completion of mysqldump and injection, start restore of Bin Log file into EC2 master
  8. Verify that RDS master and EC2 master are approximately in sync
  9. Pause updates to dataset in RDS master for approx. 1 hour
  10. Verify that RDS master and EC2 master are fully in sync
  11. Stop Bin Log backup and Bin Log restore
  12. Re-create stored procedures on EC2 master
  13. Change DNS record for MySQL Server to point to EC2 master
  14. Re-commence updates to dataset
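For reference, steps 4, 5 and 7 of this plan boil down to commands along the following lines (a sketch only; host names, credentials and the binlog file name are illustrative, and RDS names its binlogs mysql-bin-changelog.NNNNNN):

# Step 4: tell RDS to retain at least 12 hours of Bin Log
mysql -h rds-master.mydomain.com -u admin -p \
  -e "CALL mysql.rds_set_configuration('binlog retention hours', 12);"

# Step 5: stream the Bin Log from the RDS master to disk on the EC2 master
mysqlbinlog --read-from-remote-server --host=rds-master.mydomain.com --user=repl -p \
  --raw --stop-never mysql-bin-changelog.000001 &

# Step 7: replay the backed-up Bin Log into the EC2 master from the replication point noted in step 3
mysqlbinlog --start-position=$NOTED_POSITION mysql-bin-changelog.* | mysql -h ec2-master.mydomain.com -u root -p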

On completion of this process, we had moved our 4TB dataset from RDS to EC2 with only a 1 hour interruption in the data update process. For High Availability, we created 2 slaves and managed these with MySQL Utilities. We placed 2 HA Proxy nodes in front of this MySQL server farm and balanced traffic to the HA Proxy nodes with an Elastic Load Balancer listening for TCP (rather than HTTP) connections.

It’s probably also worth mentioning that EC2 also has disk limits. A single EBS volume can have a maximum size of 16TB. To overcome this, you can combine multiple EBS volumes into an LVM set, or use software-based RAID 0. We were initially concerned about using this sort of virtual disk for storing data, but it should be less of a concern when you remember that EBS itself has multiple layers of redundancy. We went for an LVM configuration.
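As an illustration, combining three EBS volumes into a single striped LVM volume for the MySQL data directory looks something like this (device names and sizes depend on your instance):

pvcreate /dev/xvdf /dev/xvdg /dev/xvdh
vgcreate vg_mysql /dev/xvdf /dev/xvdg /dev/xvdh
lvcreate -i 3 -I 256 -l 100%FREE -n lv_mysql vg_mysql    # stripe across the 3 volumes
mkfs.xfs /dev/vg_mysql/lv_mysql
mount /dev/vg_mysql/lv_mysql /var/lib/mysql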