I’ll try to explain the situation here as much as I can; we have a .net web app that uses MySQL as the backend database; database size is about 250GB and has 130GB of memory; the app is multi-tenant; one of the app components is a windows services that runs continuously loading data based on newly added data, it basically assigns trainings to users based on rules; rules are nothing but business defined criterias, for example, if user A is in Group B he gets assigned training T1 and T2; the part that says “if user A is in group B” is called the rule “business defined criteria”; and the part that says “he gets assigned T1 and T2” is what the service does.
The core logic of the service is in C# but it also utilizes stored procedures; it uses two procedures
Procedure 1 (P1)
Has a cursor, which loops over all tenants that have rules changed, for each tenant it calls Procedure 2
Procedure 2 (P2)
It uses holding tables for ease of data manipulation, then JOINs these tables with the actual app tables for data changes “insert, update or delete”, before the holding tables are populated with data for each tenant, they get cleaned up within the stored procedure then get populated with the current tenant data.
The code to clean the holding tables look like this
DELETE * FROM tmp_table1;
DELETE * FROM tmp_table2;
One will ask, why don’t you use TRUNCATE, it’s faster; that is true, but the app code uses NHibernate; the code was written in a way if TRUNCATE is used in a stored procedure, the transaction that NHibernate initiated will commit at that point, which is something we don’t want, so we had to use DELETE instead.
But we attempted to switch from DELETE to TRUNCATE after changing the app code so it does not commit when TRUNCATE is used in stored procedure; but something very strange happened, which took us about 4 weeks until we figured it out. One thing to mention, the change from DELETE to TRUNCATE was not the only change during the product release; there were other changes at the app level hence it took us a while to figure out the root cause.
The behavior was, users who are using the app would experience delay when navigation through the screens or timeout; looking at the app servers, we see spikes at IIS level, looking at the database, we see flood of transactions with different statuses, normal statuses though –or at least what I think-; we suspected the service could be the issue, we stopped it and the issue disappeared; we turned it back on, the issue came back again; what we noticed is that, when the service is turned on – while stored procedure uses TRUNCATE-, it seems it’s sending many transactions to the database, which makes sense because TRUNCATE is faster that DELETE, but was causing lots of locks at the database level; the fix for our situation was to switch TRUNCATE back to DELETE until we find out why but we could not and gave up!!
One thing to note is that all PKs, FKs and indexes are set correctly with the right cardinality.
We did an optimization in the past where data in the tmp_table1 gets chunked to minimize the transaction size and process records faster, so the tmp_table1 can have 10s or 100s of records depending on how many new assignments/training need to get assigned to the learner; each chunk is 25 records.
P2 does the work based on logical operators; so P1 calls P2 with (AND) operator, then P1 calls P2 with (OR) operators; the secret behind these operators is to check all users who match all rules (AND operator), then all users who match any of the rules (OR operator).
# of calls to these two procs vary depending on how many tenants to be processed, and the # of records in tmp_table1
So guessing a daily average of calls, at low load it could be in low 100s calls, and during heavy loads it could be between 1000-2000 calls
With my many years of experience in the database world; I still don’t get why would using TRUNCATE cause such behavior?
Is it something at the database level that is not configured right? We brought Percona -the MySQL consulting company-, but they could not figure out why and could not answer this odd behavior?
Is it something in the service code?
Have you experienced something similar? If so, what was your resolution?
Thanks for sharing your thoughts
Author: Mohammad Musleh