Time to be creative. Here is the situation....
I work for a company that develops a PCIe card. I am putting together the database to track these cards. Every card is individually serialized and is tracked from the time the part number and serial number are programmed onto the board EEPROM, through board test, system test, etc. The board part number can change, and a board can get higher-level part numbers if it is combined with an add-on card or integrated into another system. All this data is helpful for identifying (and addressing) manufacturing issues, design issues, etc.
We have the cards manufactured by a local contract manufacturer (CM). For most cards, they do the initial board and system test. Then the card is either shipped to a customer (volume/OEM) or to our system integrator (development systems and single-function appliances). More tests are done if a card goes to the system integrator. Some portion of the cards may come in-house directly from the contract manufacturer, either before any testing at all or at some stage in the test flow.
So, we have three locations where the card database can be updated. There is connectivity to both remote locations for access to the central database. For the CM, the test infrastructure is on a private network that is only reachable through the VPN, so an SSH tunnel is used for connections back to the db server. The system integrator can come back either over an SSH tunnel across the VPN connection or through our DMZ, as the test infrastructure there has outbound access to the internet. Sounds all good. The various test harnesses at the sites all talk back to the database and life is good.
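(For concreteness, the harness side of that tunnel looks roughly like the sketch below. This assumes the harness code is Python using the sshtunnel package and MySQL Connector/Python; the host names and credentials are made up, not our real ones.)

    from sshtunnel import SSHTunnelForwarder
    import mysql.connector

    # Forward a local port through the VPN gateway to the central db server.
    tunnel = SSHTunnelForwarder(
        ('vpn-gateway.example.com', 22),
        ssh_username='harness',
        ssh_pkey='/home/harness/.ssh/id_rsa',
        remote_bind_address=('central-db.example.com', 3306),
    )
    tunnel.start()

    # The harness then talks to the tunnel's local end as if it were the db.
    db = mysql.connector.connect(
        host='127.0.0.1',
        port=tunnel.local_bind_port,
        user='testharness',
        password='...',
        database='cards',
    )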
The problem is that all testing stops if the connection back to the central database server goes down. It is a big enough problem if a new card can't be started through the test flow, but even worse is stopping a test sequence and having to restart it because the database connection went down. Allowing missing data potentially causes all sorts of issues. In the case of a missing test result, it just leaves a hole if we need to come back and do an analysis of the card lifecycle later; at the meta level, most of the data analysis would still be valid. However, if the data update is a new card entry or a part number change, or something like that, then you have problems later when you try to log entries against a non-existent card, or you end up selling the wrong card because of a bad part number. It is also important from a business perspective to track who is doing what to a card and where it went before going to a customer.
So here is where the creativity comes in. How do I remove the need to have a constant connection and yet keep the view of the data at all three sites consistent? The database is MySQL using the InnoDB storage engine. Both are changeable, but potentially at the cost of a lot of time. MySQL does support full replication, but it is one master replicating to one or more slaves, and only the master may be updated. MySQL Cluster has so many caveats that I haven't determined for sure that it can even work. Based on the note "MySQL replication will not work correctly if updates are done on multiple MySQL servers.", I think the answer is "No."
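(For reference, standing up a read-only slave per site is the easy part. A minimal sketch via MySQL Connector/Python, run against the site-local slave; host/user names are placeholders, and the binlog coordinates come from SHOW MASTER STATUS on the central master:)

    import mysql.connector

    slave = mysql.connector.connect(host='127.0.0.1', user='root', password='...')
    cur = slave.cursor()
    cur.execute("SET GLOBAL read_only = 1")   # slave refuses direct writes
    cur.execute(
        "CHANGE MASTER TO "
        "MASTER_HOST='central-db.example.com', "
        "MASTER_USER='repl', MASTER_PASSWORD='...', "
        "MASTER_LOG_FILE='mysql-bin.000042', MASTER_LOG_POS=4")
    cur.execute("START SLAVE")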
So, here is what I am thinking right now. If I set up a replicated database for each site, in-house included, that would ensure that each site has current data for queries (at least as current as the network connection state allows). All data updates go to a local "staging" table set. A data replication process watches that staging table and, as connectivity allows, sends the updates to the main DB server. Each staging table has a sequence number that gets updated as part of an update and is also sent to the main server, so a quick check of the sequence will tell you whether the data is in sync. I figure rows can be removed from the staging table as they are successfully sent to the main server, so the staging tables act as a write buffer in case the network link goes down. I still need to think about what the behavior should be if a query to the main database replica comes back with stale data (sequence numbers don't match). The case that would need to be handled carefully is when data has been sent from the staging buffer to the main database successfully but has not yet made it back through replication. Perhaps the synchronization process shouldn't remove an entry from the staging buffer table until it shows up in the replicated main table. That way, all queries could be run as the union of the main and buffer tables, and all writes would go to the buffer table. A rough sketch of that sync process is below.
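(All the table and column names here are invented for illustration; the real schema is bigger. The key points are the per-site sequence column, an idempotent push so a retried batch after a dropped link does no harm, and deleting from the buffer only after the row has replicated back into the local copy of the main table.)

    import mysql.connector

    # Local write buffer; every site-local update lands here first.
    # 'seq' is the per-site sequence number described above.
    STAGING_DDL = """
    CREATE TABLE IF NOT EXISTS test_results_staging (
      seq        BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
      serial_no  VARCHAR(32) NOT NULL,
      test_name  VARCHAR(64) NOT NULL,
      result     VARCHAR(16) NOT NULL,
      created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    ) ENGINE=InnoDB
    """

    SITE = 'CM'  # each site pushes under its own tag

    def sync_once(local, central):
        """Push staged rows to the central master, then prune only the
        rows that have already replicated back into the local replica
        of the main table."""
        lcur = local.cursor(dictionary=True)
        ccur = central.cursor()

        lcur.execute("SELECT * FROM test_results_staging ORDER BY seq")
        for row in lcur.fetchall():
            # The main table carries (origin_site, origin_seq) with a
            # unique key on the pair, so a retry after a dropped link
            # is a harmless no-op.
            ccur.execute(
                "INSERT IGNORE INTO test_results "
                "(origin_site, origin_seq, serial_no, test_name, result, created_at) "
                "VALUES (%s, %s, %s, %s, %s, %s)",
                (SITE, row['seq'], row['serial_no'], row['test_name'],
                 row['result'], row['created_at']))
        central.commit()

        # Delete from the buffer only once replication has brought the
        # row back around, so local reads never lose sight of a write.
        pcur = local.cursor()
        pcur.execute(
            "DELETE s FROM test_results_staging s "
            "JOIN test_results m "
            "ON m.origin_site = %s AND m.origin_seq = s.seq", (SITE,))
        local.commit()

    # Local reads that must see not-yet-replicated writes union the two,
    # which is what makes the buffer transparent to the harnesses:
    READ_SQL = """
    SELECT serial_no, test_name, result FROM test_results WHERE serial_no = %s
    UNION
    SELECT serial_no, test_name, result FROM test_results_staging WHERE serial_no = %s
    """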
Seems like it would work well until you have a non-trivial network outage. If we test a card at the CM while the network is down and then bring it in-house or send it to the system integrator for use before the databases resync, then the card doesn't exist in the database copies outside the CM and we are stuck again. This would happen even with full-blown clustering, but the real problem is that a card not existing in the database when it comes in-house is not, in itself, an error, so it would be very difficult to programmatically detect and trap this case.
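(Just to make the ambiguity concrete: at check-in, all the receiving site can do is a lookup like the one below, and an empty result genuinely cannot distinguish "never registered" from "registration still stranded in another site's staging buffer". Table names invented as before.)

    def card_known(local, serial_no):
        """Check-in lookup at the receiving site. Returns False both for
        a card that was never registered AND for one whose registration
        is still sitting in another site's staging buffer -- the local
        data alone can't tell the difference."""
        cur = local.cursor()
        cur.execute(
            "SELECT 1 FROM cards WHERE serial_no = %s "
            "UNION SELECT 1 FROM cards_staging WHERE serial_no = %s "
            "LIMIT 1",
            (serial_no, serial_no))
        return cur.fetchone() is not None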
I know it's a long post and I don't have a TL;DR version, but I'm interested to hear people's thoughts.
--SS
It's complex, with a lot of moving parts, but on the face of it, it seems less painful than