I tweeted something a few days ago about queues and MySQL.
MySQL is not a data queue. MySQL is not a data queue. MySQL is not a data queue. MySQL is not a data queue. MySQL is not a data queue.
— Kevin Schroeder (@kpschrade) October 23, 2016
… and I got a few confused looks, largely because you can use MySQL as a data queue (as in it is possible). So here, now, I will explain why I said that MySQL is not a data queue, even though it can be used as one, and also why you should use a queue instead of MySQL if you need a queue.
It actually isn’t.
MySQL is a Relational Database Management System. According to Wikipedia, the source of all knowledge, a relational database management system has, at minimum, the following characteristics:
- Present the data to the user as relations (a presentation in tabular form, i.e. as a collection of tables with each table consisting of a set of rows and columns);
- Provide relational operators to manipulate the data in tabular form.
According to this other Wikipedia entry on Relational Databases, “A relational database is a digital database whose organization is based on the relational model of data, as proposed by E. F. Codd in 1970. The various software systems used to maintain relational databases are known as a relational database management system (RDBMS). Virtually all relational database systems use SQL (Structured Query Language) as the language for querying and maintaining the database.”
Let’s compare this to the venerable Wikipedia’s entry on Message Queue: “Message queues provide an asynchronous communications protocol, meaning that the sender and receiver of the message do not need to interact with the message queue at the same time. Messages placed onto the queue are stored until the recipient retrieves them.”
“Wait, that kind of sounds like it’s not making your point,” you might be thinking. “MySQL provides an asynchronous communication protocol.”
Actually, it does not. MySQL does not do communication. Its intention is not to provide communication. It is not about communication. It is about data storage and retrieval.
And so, class, what is a word that can describe a computer science scenario where there is a mismatch between intention and use?
An anti-pattern.
That’s not to say that you can’t do it, only that you’re doing it wrong. And, truthfully, there are times when it is OK to do it wrong. But don’t pretend that you’re doing it right. Instead, the proper approach would be to say that you had a choice between different options and you believed that the anti-pattern provided more value. You won’t find me complaining about people who use expediency when expediency is necessary.
It is not asynchronous
“Lies!” you may exclaim. “MySQL does things asynchronously all the time!”
Yes, but let’s look at the typical implementation of a queuing system in MySQL.
* * * * * php /var/www/bin/queue.php |
Umm, that ain’t asynchronous. And not only that, it is blocking. Two problems right out of the box.
But!! You have a fix!!
* * * * * php /var/www/bin/queue.php & |
And now you have three problems (I don’t even know if the example will work. It’s meant as a joke). The third problem is that now you have to manage the queue items. You will need to manage handling error conditions if one of those jobs does not execute properly. And this will be difficult.
It does not handle state
UPDATE queue SET state = 'PROCESSING' WHERE job_id = '123456'; |
Just proved me wrong, eh? OK, what happens with this code?
$obj = null; $obj->someMethod(); |
That, friends, is what we call a fatal error. I hear you saying that you will catch all those. You might (you won’t, actually), but not everyone will. And then your queue will be in an inconsistent state. Yes, I’m sure that you will also add a timestamp on there and if the job hasn’t finished executing after 30 seconds or 5 minutes you will retry it. Well, you’ve just significantly increased your feature set. Additionally, what if the job runs 6 minutes? How will you know when the job has finished or if it’s crashed instead? IPC? TCP communication?
Whew! This is getting deep.
A queue in MySQL is really a batch job
While queuing data, or work, in MySQL sounds queue-like, the two look very different. This is MySQL:
This is a queue
The consumers are able to process data as they are able. They are not bound to a specific schedule.
It is not immediate
One of the great features of a queue is that it gives control of data processing to the infrastructure. One of the key, though perhaps unsaid, tenets of a queuing system is that it processes data when it is able. That smooths out CPU usage, which is good. If you are waiting a minute to get a batch of operations to execute you have lost a minute of useful time on the CPU. This is your system on a MySQL-based queue.
In addition, your system may have dependencies that cannot wait or that would impact UX while they had to wait.
Say, for example, a “forgot password” link. Say you put that into a queue in MySQL that is managed by cron. Your customer will be waiting at least a minute, perhaps longer for that email during which time they may decide it’s not worth it or go on to doing something else. It may not go out at all if some other previous job takes too long or crashes.
“But I wouldn’t use a queue for that. I would just send the email using Sendmail.”
Sendmail is a queue!!! It is an SMTP-based message queue.
You have very little control
In other words, every job is just another job. You can’t say that a job of type X can have 20 workers and a job of type Y can have 2. You could add a column to the queue table that has a priority on it and sort on that, but what happens if another job is inserted after the SELECT occurred? Then the queue items are getting processed out of priority order.
Compare that with a message queue where, as I noted earlier, you could have 20 workers on an important queue and 2 workers on a less important queue. All you have to do is spawn more or fewer workers and you have your priorities defined right there.
Why should you use an actual queue?
Or perhaps, more accurately, when should you use an actual queuing system?
- When you need control over your infrastructure usage
- When you need results as immediately as possible, as long as it doesn’t overwhelm the system
- When you need some level of guarantee that when a message has been processed it has actually been processed
- When you need some kind of communication between disparate systems
- When you need either topics or queues
- When you need to have asynchronous communication
- When you want your messages or jobs to run in isolation (You want this)
Conclusion-ish
I have some deliverables that need to be delivered and so I’m just going to stop right here. But if you are using MySQL as a queue you might actually be doing the right thing. MySQL can be used in this way.
BUT! Compare the very minimal feature set of any queuing system written using MySQL with a poorly copied, truncated list of features for ActiveMQ (my personal favorite). Sometimes it’s tough for web developers (particularly of the PHP sort) to drink the queuing Kool-Aid, but once you have you find that there is a world of opportunity.
Broadcasting
Group Membership
Broker Camel Component
Clustering
MasterSlave
JDBC Master Slave
Replicated Message Store
Consumer Dispatch Async
Consumer Priority
Exclusive Consumer
Manage Durable Subscribers
Message Groups
Redelivery Policy
Retroactive Consumer
Selectors
Slow Consumer Handling
Subscription Recovery Policy
Destination Features
Composite Destinations
Mirrored Queues
Message Dispatching Features
Async Sends
Dispatch Policies
Message Cursors
Optimized Acknowledgement
Producer Flow Control
Total Ordering
ActiveMQ Message Properties
Advisory Message
Blob Messages
Delay and Schedule Message Delivery
Message Transformation
ObjectMessage
Structured Message Properties and MapMessages
Comments
Zen
The biggest advantage of building a job queue on top of MySQL is that you can enqueue jobs transactionally as part of the same database transaction. So after you enqueue the job, if the transaction fails and rolls back later in the code for some reason, your job would get rolled back as well.
Kevin Schroeder
OK, but if a job depends on a commit why wouldn’t you just send the message after the commit? No worker process will see the job until the commit anyway.
Benjamin Morel
That’s what I used to do at some point: commit the transaction, then push to Amazon SQS. But what if the push fails? You now have a committed transaction, and a lost job. If the job is sending an email, not a big deal. If the job is to perform an asynchronous task that’s essential to the business, then you’re screwed. For this reason, I’m now using MySQL as a queue, with INSERT to the queue as part of the transaction. I’m not using cron jobs, but multiple workers that poll the queue table with SELECT FOR UPDATE SKIP LOCKED, avoiding all concurrency issues. If a worker dies unexpectedly, the database transaction is automatically rolled back and the record immediately made available to another worker. Sure, MySQL is not meant to be used as a queue, but speed is not too bad for a lot of applications (> 1000 concurrent enqueues/dequeues per second to a single queue table), and the transactional integrity is a must have to me. If someone has an actual solution that provides transactional integrity using an external queue, I’m all ears!
Bernard
I enjoy the article