This post has already been read 1888 times!

Approximate joins

from the Artful Common Queries page


There are two main ways to reconcile payments against charges:

  • Open Item: match payments against individual charges, typically by carrying the charge number in the payments table
  • Statement: list and sum all charges and all payments, and show the difference as the outstanding balance.

The Open Item method needs a foolproof way to match payments to charges, but what if the customer neglected to return a copy of the invoice, or to write the invoice number on the cheque? Reconciliation staff spend much of their time resolving such problems.

Can we help? Yes! It won't be entirely foolproof, but it will drastically cut down the onerous work of reconciliation.

Here is DDL for a test case:

CREATE SCHEMA approx;
USE approx;
CREATE TABLE charges (
ID INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
custID INT UNSIGNED,
amount DECIMAL(10,2) NOT NULL
);
CREATE TABLE payments (
ID INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
custID INT UNSIGNED,
amount DECIMAL( 10,2) NOT NULL
);

Both tables carry a custID column to identify whose charge or payment it is, but there is no foreign key linking payments to specific charges--that is the link we are going to approximate.

Now populate the tables with a few rows of sample charges and payments for customer #1, ensuring that you have a variety of payments – some that match the charge exactly, some that are close but not enough, and some that are slight overpayments.

INSERT INTO approx.charges VALUES
(NULL,1,100),(NULL,1,12),(NULL,1,56),(NULL,1,43),(NULL,1,59),(NULL,1,998);
INSERT INTO approx.payments VALUES
(NULL,1,99),(NULL,1,62),(NULL,1,40),(NULL,1,50),(NULL,1,12),(NULL,1,1000);

SELECT * FROM charges;
+----+--------+--------+
| ID | custID | amount |
+----+--------+--------+
|  1 |      1 | 100.00 |
|  2 |      1 |  12.00 |
|  3 |      1 |  56.00 |
|  4 |      1 |  43.00 |
|  5 |      1 |  59.00 |
|  6 |      1 | 998.00 |
+----+--------+--------+
SELECT * FROM payments;
+----+--------+---------+
| ID | custID | amount  |
+----+--------+---------+
|  1 |      1 |   99.00 |
|  2 |      1 |   62.00 |
|  3 |      1 |   40.00 |
|  4 |      1 |   50.00 |
|  5 |      1 |   12.00 |
|  6 |      1 | 1000.00 |
+----+--------+---------+

The first thing to do is define an approximation threshold: how close must the amount paid be to the amount charged before we conclude that the amounts are related? For this example we define the proximity threshold as 2. In a real-world example, it might be 10, or 50, or perhaps percentage of the charge. It all depends on the nature of the organisation and the typical total purchase. A house builder may make frequent purchases valued at $1000 and more. You scale the threshold to the typical situation.

Since the amount paid might be more or less or even equal to the amount charged, to link a payment to a charge we need not an equi-join but a theta-join that tests a range both below and above the charge amount. That might suggest a BETWEEN clause. Here is a better idea: use the ABS() function:

SET  @proximity = 2;   -- change this value to suit your situation
SELECT
c.ID AS ChargeNo,
c.Amount AS Charge,
p.ID AS PaymentNo,
p.Amount AS Payment
FROM charges c
JOIN payments p
ON c.custID = p.custID
AND ABS(c.amount - p.amount) <= @proximity
WHERE c.custID = 1;

Before you run this query, look at the data to anticipate the result.

Here it is:

+----------+--------+-----------+---------+
| ChargeNo | Charge | PaymentNo | Payment |
+----------+--------+-----------+---------+
|        1 | 100.00 |         1 |   99.00 |
|        2 |  12.00 |         5 |   12.00 |
|        6 | 998.00 |         6 | 1000.00 |
+----------+--------+-----------+---------+

The solution is correct, as far as it goes, but it doesn’t go far enough. We correctly identified the three situations: underpayment, exact payment and overpayment, but we suppressed all charges that don’t have a matching payment. Reconciliation staff are probably interested in a bigger picture of the situation. Fix this by changing the INNER JOIN to a LEFT JOIN:

SET @proximity = 2;
SELECT
c.ID AS ChargeNo,
c.amount AS Charge,
p.ID AS PaymentNo,
p.amount AS Payment
FROM
charges c
LEFT JOIN payments p
ON c.custID = p.custID
AND ABS(c.amount - p.amount) <= @proximity
WHERE c.custID = 1;
+----------+--------+-----------+---------+
| ChargeNo | Charge | PaymentNo | Payment |
+----------+--------+-----------+---------+
|        1 | 100.00 |         1 |   99.00 |
|        2 |  12.00 |         5 |   12.00 |
|        3 |  56.00 |      NULL |    NULL |
|        4 |  43.00 |      NULL |    NULL |
|        5 |  59.00 |      NULL |    NULL |
|        6 | 998.00 |         6 | 1000.00 |
+----------+--------+-----------+---------+

Much better! The reconciliation people now know that three charges have no matching payment.

What if the customer mistakenly pays for something twice? Add a row to the Payments table with an amount of $1000, then re-run the last query:

+----------+--------+-----------+---------+
| ChargeNo | Charge | PaymentNo | Payment |
+----------+--------+-----------+---------+
|        1 | 100.00 |         1 |   99.00 |
|        2 |  12.00 |         5 |   12.00 |
|        3 |  56.00 |      NULL |    NULL |
|        4 |  43.00 |      NULL |    NULL |
|        5 |  59.00 |      NULL |    NULL |
|        6 | 998.00 |         6 | 1000.00 |
|        6 | 998.00 |         7 | 1000.00 |
+----------+--------+-----------+---------+

How convenient! We can see at once that charge number 6 was paid for twice.

Somebody in the reconciliation department owes you lunch.

Cascading JOINs

Show parents, children and grandchildren including parents without children

SELECT parent.id AS ParentID,
IFNULL(child.parent_id,') AS ChildParentID,
IFNULL(child.id,') AS ChildID,
IFNULL(grandchild.child_id,') AS GrandchildChildID
FROM parent
LEFT JOIN child ON parent.id=child.parent_id
LEFT JOIN grandchild ON child.id=grandchild.child_id;

Classroom scheduling

from the Artful Common Queries page


You have n student classes of known size, and m classrooms of known size, where m>=n. What's the best algorithm for assigning as many classes as possible to rooms of adequate size?

It's a version of the combinatorial knapsack problem. It's known to be NP-complete, which means it's possible to verify any correct solution but there is no known algorithm for quickly finding a correct solution. How then to proceed?

Early in 2010 Joe Celko resurrected the problem in a Simple Talk column, and challenged readers to improve on SQL Server solutions he'd published in the third edition of his "SQL for Smarties". Here's his small version of the problem modified for MySQL:

DROP TABLE IF EXISTS Rooms, Classes;
CREATE TABLE Rooms(
room_nbr CHAR(2) NOT NULL PRIMARY KEY, room_size INTEGER NOT NULL
) ENGINE=MyISAM;
CREATE TABLE Classes(
class_nbr CHAR(2) NOT NULL PRIMARY KEY, class_size INTEGER NOT NULL
) ENGINE=MyISAM;
INSERT INTO Classes
VALUES ('c1', 80),('c2', 70),('c3', 65),('c4', 55),('c5', 50),('c6', 40);
INSERT INTO Rooms
VALUES ('r1', 70),('r2', 40),('r3', 50),('r4', 85),('r5', 30),('r6', 65),('r7', 55);

And here is the best solution posted by his contributors. It works in SQL Server 2005 and 2008:

WITH
Matches AS (
SELECT
class_nbr, class_size, room_nbr, room_size,
exact_match = CASE WHEN class_size = room_size THEN 1 ELSE 0 END
FROM Classes, Rooms
WHERE class_size <= room_size
),
Preferences AS (
SELECT
class_nbr, class_size,
class_room_pref = ROW_NUMBER() OVER (
PARTITION BY class_nbr ORDER BY exact_match, room_size, room_nbr
),
room_nbr, room_size,
room_class_pref = ROW_NUMBER() OVER (
PARTITION BY room_nbr ORDER BY exact_match, class_size DESC, class_nbr
)
FROM Matches m
WHERE NOT EXISTS (
SELECT 1 FROM Matches WHERE room_nbr = m.room_nbr AND class_size > m.class_size
)
),
Final AS (
SELECT
class_nbr, class_size, room_nbr, room_size,
final_pref = ROW_NUMBER() OVER (PARTITION BY class_nbr ORDER BY class_room_pref)
FROM Preferences p
WHERE NOT EXISTS (
SELECT 1 FROM Preferences
WHERE room_nbr = p.room_nbr AND class_room_pref = room_class_pref AND   room_class_pref < p.room_class_pref
)
)
SELECT c.class_nbr, c.class_size, f.room_nbr, f.room_size
FROM Classes c
LEFT JOIN Final f ON c.class_nbr = f.class_nbr AND f.final_pref = 1
ORDER BY 1;

It quickly yields this correct answer:

class_nbr class_size room_nbr room_size
c1         80        r4       85
c2         70        r1       70
c3         65        r6       65
c4         55        r7       55
c5         50        r3       50
c6         40        r2       40

As a MySQL user, you may be unfamiliar with two constructs in this query— ROW_NUMBER() OVER ...[PARTITION]..., and Common Table Expressions (CTEs) introduced by the keyword WITH.

ROW_NUMBER() numbers resultset rows based on row values. This entry shows two MySQL equivalents for it, one relying on user variables, the other on aggregation. For this problem we will use the user variable method.

CTEs provide an elegant syntax for building derived tables. The above SQL Server query builds the derived table Matches, from which it builds the derived table Preferences, from which it builds the table Final, which it joins with Classes for the final result.

Can this be done in MySQL? Yes, but not nearly so elegantly. Here we'll lay out an unoptimised step-by-step. Basically, we build the Matches, Preferences and Final tables, one at a time, then copy the final step of the SQL Server query.

First the Matches table:

DROP TABLE IF EXISTS Matches;
CREATE TABLE Matches
SELECT class_nbr, class_size, room_nbr, room_size, IF(class_size=room_size,1,0) AS exact_match
FROM Classes
JOIN Rooms ON class_size <= room_size;

The Preferences table has two Row_Number() expressions to port, so we build each, then join them:

DROP TABLE IF EXISTS room_prefs;
SET @class_nbr_prev='', @ordPrev=0;
CREATE TABLE room_prefs
SELECT ID, class_nbr, class_size, room_nbr, room_size, class_room_pref
FROM (
SELECT
ID, class_size, room_nbr, room_size,
@ordPrev := IF( @class_nbr_prev=class_nbr, @ordPrev+1, 1 ) as class_room_pref,
@class_nbr_prev := class_nbr AS class_nbr
FROM Matches m
WHERE NOT EXISTS ( SELECT 1 FROM Matches WHERE room_nbr = m.room_nbr AND class_size > m.class_size )
ORDER BY class_nbr, exact_match, room_size, room_nbr
) AS tmp ;

DROP TABLE IF EXISTS class_prefs;
SET @room_nbr_prev = '', @ordPrev=0;
CREATE TABLE class_prefs
SELECT ID, room_class_pref
FROM (
SELECT
ID,
@ordPrev := IF( @room_nbr_prev=room_nbr, @ordPrev+1, 1 ) as room_class_pref,
@room_nbr_prev := room_nbr AS room_nbr
FROM Matches m
WHERE NOT EXISTS ( SELECT 1 FROM Matches WHERE room_nbr = m.room_nbr AND class_size > m.class_size )
ORDER BY room_nbr, exact_match, class_size DESC, class_nbr
) AS tmp ;

DROP TABLE IF EXISTS Preferences;
CREATE TABLE Preferences
SELECT a.class_nbr, a.class_size, a.room_nbr, a.class_room_pref, a.room_size, b.room_class_pref
FROM room_prefs a
JOIN class_prefs b USING(ID);

Now build the Final table from Preferences:

DROP TABLE IF EXISTS Final;
SET @class_nbr_prev = '', @ordPrev=0;
CREATE TABLE Final
SELECT
room_nbr, room_size, class_size,
@ordPrev := IF( @class_nbr_prev=class_nbr, @ordPrev+1, 1 ) as final_pref,
@class_nbr_prev := class_nbr AS class_nbr
FROM Preferences p
WHERE NOT EXISTS (
SELECT 1 FROM Preferences
WHERE room_nbr = p.room_nbr AND class_room_pref = room_class_pref AND room_class_pref < p.room_class_pref
)
ORDER BY class_nbr;

The final step is identical to the last step in the SQL Server version:

SELECT c.class_nbr, c.class_size, f.room_nbr, f.room_size
FROM Classes c
LEFT JOIN Final f ON c.class_nbr = f.class_nbr AND f.final_pref = 1
ORDER BY 1;
+-----------+------------+----------+-----------+
| class_nbr | class_size | room_nbr | room_size |
+-----------+------------+----------+-----------+
| c1        |         80 | r4       |        85 |
| c2        |         70 | r1       |        70 |
| c3        |         65 | r6       |        65 |
| c4        |         55 | r7       |        55 |
| c5        |         50 | r3       |        50 |
| c6        |         40 | r2       |        40 |
+-----------+------------+----------+-----------+

Data-driven joins

from the Artful Common Queries page


Data-driven table relationships are hard to maintain, but sometimes they cannot be avoided. How do we build joins for them? One way is to use a CASE statement in the SELECT list to handle the joining possibilities. In this example, the parent.linktable column determines the name of the table where a particular parent row's data is. The method is fine when the number of child tables is small:

USE test;
DROP TABLE IF EXISTS parent, child1, child2;

CREATE TABLE parent (
id INT UNSIGNED PRIMARY KEY,
linktable CHAR(64) NOT NULL
);
INSERT INTO parent VALUES (1, 'child1'), (2, 'child2');

CREATE TABLE child1 (
id INT UNSIGNED PRIMARY KEY,
data CHAR(10)
);
INSERT INTO child1 VALUES (1, 'abc');

CREATE TABLE child2 (
id INT UNSIGNED PRIMARY KEY,
data CHAR(10)
);
INSERT INTO child2 VALUES (2, 'def');

To retrieve all child data for all parents, include in the SELECT list a CASE statement which handles all child table possibilities:

SELECT
p.id,
p.linktable,
CASE linktable
WHEN 'child1' THEN c1.data
WHEN 'child2' THEN c2.data
ELSE 'Error'
END AS Data
FROM parent AS p
LEFT JOIN child1 AS c1 ON p.id=c1.id
LEFT JOIN child2 AS c2 ON p.id=c2.id;
+----+-----------+------+
| id | linktable | Data |
+----+-----------+------+
|  1 | child1    | abc  |
|  2 | child2    | def  |
+----+-----------+------+

When the number of child tables is too large for a convenient CASE statement, PREPARE the query in a stored procedure.

(Based on a MySQL Forum post by Felix Geerinckx)

Comments are closed.

Post Navigation