Introduction

Webhooks are user-defined HTTP callbacks. At WePay, we use webhooks that we call Instant Payment Notifications (IPNs) to update our partners on the status of transactions happening in our system. IPNs allow our partners to receive notifications whenever something important happens to objects such as a checkout, credit card, merchant account, or user. For example, if the state of a checkout with id 12345 changes from “authorized” to “captured”, we notify our partner with an IPN containing object_type = “checkout” and object_id = 12345. Partners can then make an API call to look up the exact changes. Our IPN delivery system also handles the following use cases:

  • IPN Batching - We batch IPNs that are generated within a very short span of time for a given object. For example, if checkout 12345 transitions from “authorized” to “captured” and then from “captured” to “reserved” in quick succession, we send one IPN instead of two, so that a single lookup is enough to get the latest state.
  • Retries for IPN deliveries that fail - In cases where the partner is unable to respond to our HTTP POST request, we do telescopic retries so that the partner can receive the IPN once their system comes back up.
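The batching behavior described above can be sketched in a few lines. This is only an illustration, not our production code: the function name batch_ipns and the shape of the event tuples are assumptions made for the example.

```python
from typing import Iterable, List, Tuple

Event = Tuple[str, int]  # (object_type, object_id); hypothetical shape


def batch_ipns(events: Iterable[Event]) -> List[dict]:
    """Collapse all events seen in one batching window into one IPN per object."""
    # dict.fromkeys keeps the first occurrence of each key and preserves order.
    # An IPN carries no state, so one entry per object is enough.
    unique = dict.fromkeys(events)
    return [
        {"object_type": object_type, "object_id": object_id}
        for object_type, object_id in unique
    ]


# Two quick state changes on checkout 12345, plus one change on a user.
window = [("checkout", 12345), ("checkout", 12345), ("user", 7)]
ipns = batch_ipns(window)
print(ipns)
```

Because the IPN only names the object, the partner's single lookup returns the latest state no matter how many transitions were collapsed.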

Problems with existing infrastructure

Previously, IPN deliveries were handled by our monolith using Gearman. This system had several issues:

  • We were using Gearman in our monolith for a lot of asynchronous tasks, including sending emails, processing payments, sending IPNs, and creating reports. As load increased, we began to encounter operational issues: Gearman failed often, and we experienced worker connection issues with it.
  • A slow partner could cause a backup in our IPN system because the number of Gearman workers was limited.
  • Since IPN delivery was handled by the monolith, there was no easy way for other services outside of the monolith to send IPNs.

IPNs with Google Cloud Pub/Sub

Google Cloud Pub/Sub is a publisher/subscriber messaging system that provides many-to-many asynchronous messaging and decouples senders and receivers. We evaluated Pub/Sub and decided to create a microservice for IPN delivery with it. Some things to be aware of when using Pub/Sub include:

  • The ordering of messages is not guaranteed, so messages received by a subscriber can be out of order. This was fine for IPN delivery, because our IPNs don’t contain any information about what changed for the object. For example, suppose two IPNs for checkout 12345 are generated, one for the “authorized” state and another for the “captured” state, and they are delivered out of order. Since an IPN only contains the data “checkout=12345”, and the actual state has to be gathered by doing a lookup on that checkout, we are able to provide correct information without strict ordering of message deliveries.
  • Cloud Pub/Sub provides an “at least once” delivery guarantee, which means a message can be delivered multiple times. In our case, we were using Redis as temporary storage of IPN data to manage IPN batching and retries for failed IPN deliveries. Since we were batching IPNs at the object level, these duplicate messages were automatically deduplicated.
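To make the two points above concrete, here is a minimal sketch of why duplicates and reordering are harmless for this kind of payload. Everything in it is hypothetical: a Python set stands in for Redis, and fake_lookup stands in for the partner's lookup API call.

```python
from typing import Callable, Dict, Set, Tuple

ObjectKey = Tuple[str, int]


def handle_ipn(ipn: dict, lookup: Callable[[str, int], str],
               local_state: Dict[ObjectKey, str]) -> None:
    """Partner side: refresh the local copy from a lookup, ignoring IPN order."""
    key = (ipn["object_type"], ipn["object_id"])
    local_state[key] = lookup(*key)


def fake_lookup(object_type: str, object_id: int) -> str:
    # Stand-in for the lookup call: the checkout has already reached "captured".
    return "captured"


ipn = {"object_type": "checkout", "object_id": 12345}

# Service side: duplicate Pub/Sub messages collapse into one pending entry.
pending: Set[ObjectKey] = set()
for message in [ipn, ipn]:
    pending.add((message["object_type"], message["object_id"]))

# Partner side: repeated or reordered IPNs all lead to the same final state.
state: Dict[ObjectKey, str] = {}
for _ in range(3):
    handle_ipn(ipn, fake_lookup, state)
print(pending, state)
```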

Implementation

The implementation of our IPN delivery microservice is shown in the following diagram:

[Figure: IPN Service diagram]

The IPN Delivery Service consists of the following components:

  • Message Poller: This component is responsible for pulling data from Cloud Pub/Sub and handling duplicates.

  • Message Processor: This component is responsible for processing pending IPNs at a fixed interval. This processing interval also defines our batching interval. In the case of multiple IPNs for the same object, we send only one. This is how IPN batching is implemented in the service.

  • Retry Handler: This component is responsible for retrying any failed IPN later, as per our IPN retry schedule.
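The way the three components fit together can be sketched as follows. This is a simplified, in-memory illustration under several assumptions: the class and method names are invented for the example, a set and a list stand in for Redis, and the values in RETRY_DELAYS_S are placeholders rather than our real retry schedule.

```python
from typing import Callable, Iterable, List, Set, Tuple

ObjectKey = Tuple[str, int]
RETRY_DELAYS_S = [60, 300, 1800]  # hypothetical schedule, in seconds


class IpnService:
    def __init__(self, send: Callable[[ObjectKey], bool]) -> None:
        self.send = send                      # returns True if the partner accepted the IPN
        self.pending: Set[ObjectKey] = set()  # stand-in for Redis
        self.retries: List[Tuple[float, ObjectKey, int]] = []  # (due_at, key, attempt)

    def poll(self, messages: Iterable[dict]) -> None:
        """Message Poller: store pulled messages; the set drops duplicates."""
        for message in messages:
            self.pending.add((message["object_type"], message["object_id"]))

    def process(self, now: float) -> None:
        """Message Processor: run once per batching interval."""
        batch, self.pending = self.pending, set()
        for key in batch:
            self._deliver(key, attempt=0, now=now)

    def retry_due(self, now: float) -> None:
        """Retry Handler: re-send every failed IPN whose retry time has come."""
        due = [entry for entry in self.retries if entry[0] <= now]
        self.retries = [entry for entry in self.retries if entry[0] > now]
        for _, key, attempt in due:
            self._deliver(key, attempt, now)

    def _deliver(self, key: ObjectKey, attempt: int, now: float) -> None:
        if self.send(key):
            return
        if attempt < len(RETRY_DELAYS_S):  # give up once the schedule is exhausted
            self.retries.append((now + RETRY_DELAYS_S[attempt], key, attempt + 1))


sent: List[ObjectKey] = []
partner_up = {"value": False}


def fake_send(key: ObjectKey) -> bool:
    if partner_up["value"]:
        sent.append(key)
    return partner_up["value"]


service = IpnService(fake_send)
ipn = {"object_type": "checkout", "object_id": 12345}
service.poll([ipn, ipn])        # two messages for one object become one pending IPN
service.process(now=0.0)        # partner is down, so a retry is scheduled for t=60
partner_up["value"] = True
service.retry_due(now=60.0)     # partner is back, so the retry succeeds
print(sent)
```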

Every Cloud Pub/Sub message received by the subscriber needs to be acknowledged. If a message remains unacknowledged for longer than the acknowledgment deadline, Cloud Pub/Sub makes it available for pulling again. Cloud Pub/Sub keeps messages that are unacknowledged or not yet pulled for 7 days. This property enables automatic recovery in case of application failure. Typical scenarios include:

  • Microservice goes down: In this case, Cloud Pub/Sub still retains the data for 7 days. As soon as the Python service comes back, it starts receiving IPN messages from Cloud Pub/Sub and delivering them.
  • Redis goes down: We acknowledge a Pub/Sub message only after we have attempted an IPN delivery. If Redis goes down, IPN delivery doesn’t happen, which means we don’t acknowledge the Pub/Sub message either. Cloud Pub/Sub redelivers that message later. As a result, as soon as Redis comes back, we start processing IPNs again.
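The rule of acknowledging only after a delivery attempt can be sketched as below. This is a simplification that collapses the batching step into a single call. FakeMessage, handle, store_pending, and attempt_delivery are hypothetical names for the example, and ConnectionError stands in for whatever error the storage client raises when Redis is unreachable.

```python
from typing import Callable, List


class FakeMessage:
    """Stand-in for a pulled Pub/Sub message; it only tracks whether it was acked."""

    def __init__(self, data: dict) -> None:
        self.data = data
        self.acked = False

    def ack(self) -> None:
        self.acked = True


def handle(message: FakeMessage,
           store_pending: Callable[[dict], None],
           attempt_delivery: Callable[[dict], None]) -> bool:
    """Ack the message only once a delivery attempt has been made."""
    try:
        store_pending(message.data)  # e.g. a write to Redis
    except ConnectionError:
        # Storage is unreachable: leave the message unacked so that
        # Pub/Sub redelivers it after the ack deadline passes.
        return False
    attempt_delivery(message.data)   # assumed to schedule its own retries on failure
    message.ack()
    return True


def redis_down(_: dict) -> None:
    raise ConnectionError("redis unavailable")


delivered: List[dict] = []
first = FakeMessage({"object_type": "checkout", "object_id": 12345})
handle(first, redis_down, delivered.append)         # not acked, will be redelivered

second = FakeMessage({"object_type": "checkout", "object_id": 12345})
handle(second, lambda data: None, delivered.append)  # stored, attempted, acked
print(first.acked, second.acked)
```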

Lessons

We have been running our IPN service on Cloud Pub/Sub for the last few months, and these points summarize what we have learned:

  • Give a thought to the type of delivery flow you choose. Cloud Pub/Sub provides two types of delivery flow: pull and push. In pull mode, the subscriber requests incoming messages from Pub/Sub. In push mode, you provide an HTTPS URL to which Pub/Sub sends messages. Choose the delivery mode based on your application’s requirements. A few questions to ask:

    • How much traffic is your application expected to serve? If it is low, pull mode might cause more network usage, because the subscriber keeps sending requests to Cloud Pub/Sub and gets empty responses most of the time.
    • Do messages need to be delivered to the subscriber in near real time? In pull mode, message delivery can sometimes be delayed by a couple of seconds.
    • Does your application need to manage its own flow control? In pull mode, you can do so by acknowledging or not acknowledging messages, in combination with modifyAckDeadline. In push mode, you rely on the flow control done by Cloud Pub/Sub.
    • Can subscribers come up dynamically? If so, pull mode is easier to implement, because push mode requires some configuration and verification of the push endpoint if the subscriber is not running on App Engine.

    We used pull mode for our service because we wanted to handle flow control ourselves, and intermittent delays in message delivery didn’t affect our use case. More can be found on this topic here.

  • Make sure that the current ack deadline value works for you. Cloud Pub/Sub expects a message to be acknowledged by the subscriber receiving it. If a message is not acknowledged within the time defined by the ack deadline, it gets redelivered. Make sure that the ack deadline is suitable for your app. For example, if your app does batch processing at an interval of 5 minutes, make sure that the ack deadline is longer than that to avoid duplicate messages. The ack deadline of an incoming message can be modified dynamically using the modifyAckDeadline API.
  • Expect duplicate and out-of-order messages. Even after choosing a suitable ack deadline, you will sometimes receive duplicate messages, and messages can be delivered out of order. Make sure to handle both according to your use case.
  • Retry in the case of intermittent server errors. If you pull messages in a loop using subscription.pull(), the client intermittently receives errors with a message suggesting an internal failure or backend error. In almost all cases, the very next subscription.pull() call succeeds. These errors have become less frequent as Cloud Pub/Sub has matured, but they can still happen.
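As an illustration of the last point, a pull loop can wrap each pull in a small retry helper. This is a sketch under assumptions, not the client library's API: pull is any callable that returns a list of messages, and TransientPullError is a hypothetical stand-in for the intermittent backend error.

```python
import time
from typing import Callable, List


class TransientPullError(Exception):
    """Hypothetical stand-in for the intermittent backend error raised by a pull."""


def pull_with_retry(pull: Callable[[], List[str]],
                    max_attempts: int = 3,
                    delay_s: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Call pull(), retrying on transient errors up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return pull()
        except TransientPullError:
            if attempt == max_attempts:
                raise  # out of attempts, so surface the error to the caller
            sleep(delay_s)
    return []  # only reached when max_attempts is less than 1


calls = {"count": 0}


def flaky_pull() -> List[str]:
    # Fails on the first call and succeeds on the next one, as described above.
    calls["count"] += 1
    if calls["count"] == 1:
        raise TransientPullError("backend error")
    return ["checkout=12345"]


messages = pull_with_retry(flaky_pull, sleep=lambda seconds: None)
print(messages, calls["count"])
```

Passing sleep as a parameter keeps the helper easy to test; in a real loop the default time.sleep would apply.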