Time to be creative. Here is the situation....
I work for a company that develops a PCIe card. I am putting together the database to track these cards. Every card is individually serialized and is tracked from the time the part number and serial number are programmed onto the board EEPROM, through board test, system test, etc. The board part number can change, and a board can get higher-level part numbers if it is combined with an add-on card or integrated into another system. All this data is helpful for identifying (and addressing) manufacturing issues, design issues, etc.
We have the cards manufactured by a local contract manufacturer (CM). For most cards, they do the initial board and system test. Then the card is either shipped to a customer (volume/OEM) or to our system integrator (development systems and single-function appliances). More tests are done if a card goes to the system integrator. Some portion of the cards may come in-house directly from the contract manufacturer, either before any testing at all or at some stage in the test flow.
So, we have three locations where the card database can be updated. There is connectivity to both remote locations for access to the central database. For the CM, the test infrastructure is on a private network that is only reachable through the VPN, so an SSH tunnel is used for connections back to the db server. The system integrator can come back either over an SSH tunnel across the VPN connection or through our DMZ, as the test infrastructure there has outbound access to the internet. Sounds all good. The various test harnesses at the sites all talk back to the database and life is good.
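(For concreteness, the harness side of that tunnel looks roughly like the sketch below. This assumes the harness code is Python using the sshtunnel package and MySQL Connector/Python; the host names and credentials are made up, not our real ones.)

    from sshtunnel import SSHTunnelForwarder
    import mysql.connector

    # Forward a local port through the VPN gateway to the central db server.
    tunnel = SSHTunnelForwarder(
        ('vpn-gateway.example.com', 22),
        ssh_username='harness',
        ssh_pkey='/home/harness/.ssh/id_rsa',
        remote_bind_address=('central-db.example.com', 3306),
    )
    tunnel.start()

    # The harness then talks to the tunnel's local end as if it were the db.
    db = mysql.connector.connect(
        host='127.0.0.1',
        port=tunnel.local_bind_port,
        user='testharness',
        password='...',
        database='cards',
    )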
The problem is that all testing stops if the connection back to the central database server goes down. It is a big enough problem if a new card can't be started through the test flow, but even worse is stopping a test sequence and having to restart it because the database connection went down. Allowing missing data potentially causes all sorts of issues. In the case of a missing test result, it just leaves a hole if we need to come back and do an analysis of the card lifecycle later; at the meta level, most of the data analysis would still be valid. However, if the data update is a new card entry or a part number change, or something like that, then you have problems later when you try to log entries against a non-existent card, or you end up selling the wrong card because of a bad part number. It is also important from a business perspective to track who is doing what to a card and where it went before going to a customer.
So here is where the creativity comes in. How do I remove the need to have a constant connection and yet keep the view of the data at all three sites consistent? The database is MySQL using the InnoDB storage engine. Both are changeable, but potentially at the cost of a lot of time. MySQL does support full replication, but it is one master replicating to one or more slaves, and only the master may be updated. MySQL Cluster has so many caveats that I haven't determined for sure that it can even work. Based on the note "MySQL replication will not work correctly if updates are done on multiple MySQL servers.", I think the answer is "No."
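(For reference, standing up a read-only slave per site is the easy part. A minimal sketch via MySQL Connector/Python, run against the site-local slave; host/user names are placeholders, and the binlog coordinates come from SHOW MASTER STATUS on the central master:)

    import mysql.connector

    slave = mysql.connector.connect(host='127.0.0.1', user='root', password='...')
    cur = slave.cursor()
    cur.execute("SET GLOBAL read_only = 1")   # slave refuses direct writes
    cur.execute(
        "CHANGE MASTER TO "
        "MASTER_HOST='central-db.example.com', "
        "MASTER_USER='repl', MASTER_PASSWORD='...', "
        "MASTER_LOG_FILE='mysql-bin.000042', MASTER_LOG_POS=4")
    cur.execute("START SLAVE")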
So, here is what I am thinking right now. If I set up a replicated database for each site, in-house included, that would ensure that each site has current data for queries (at least as current as the network connection state allows). All data updates go to a local "staging" table set. A data replication process watches that staging table and, as connectivity allows, sends the updates to the main DB server. Each staging table has a sequence number that gets updated as part of an update and is also sent to the main server, so a quick check of the sequence will tell you whether the data is in sync. I figure rows can be removed from the staging table as they are successfully sent to the main server, so the staging tables act as a write buffer in case the network link goes down. I still need to think about what the behavior should be if a query to the main database replica comes back with stale data (sequence numbers don't match). The case that would need to be handled carefully is when data has been sent from the staging buffer to the main database successfully but has not yet made it back through replication. Perhaps the synchronization process shouldn't remove an entry from the staging buffer table until it shows up in the replicated main table. That way, all queries could be run as the union of the main and buffer tables, and all writes would go to the buffer table. A rough sketch of that sync process is below.
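(All the table and column names here are invented for illustration; the real schema is bigger. The key points are the per-site sequence column, an idempotent push so a retried batch after a dropped link does no harm, and deleting from the buffer only after the row has replicated back into the local copy of the main table.)

    import mysql.connector

    # Local write buffer; every site-local update lands here first.
    # 'seq' is the per-site sequence number described above.
    STAGING_DDL = """
    CREATE TABLE IF NOT EXISTS test_results_staging (
      seq        BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
      serial_no  VARCHAR(32) NOT NULL,
      test_name  VARCHAR(64) NOT NULL,
      result     VARCHAR(16) NOT NULL,
      created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    ) ENGINE=InnoDB
    """

    SITE = 'CM'  # each site pushes under its own tag

    def sync_once(local, central):
        """Push staged rows to the central master, then prune only the
        rows that have already replicated back into the local replica
        of the main table."""
        lcur = local.cursor(dictionary=True)
        ccur = central.cursor()

        lcur.execute("SELECT * FROM test_results_staging ORDER BY seq")
        for row in lcur.fetchall():
            # The main table carries (origin_site, origin_seq) with a
            # unique key on the pair, so a retry after a dropped link
            # is a harmless no-op.
            ccur.execute(
                "INSERT IGNORE INTO test_results "
                "(origin_site, origin_seq, serial_no, test_name, result, created_at) "
                "VALUES (%s, %s, %s, %s, %s, %s)",
                (SITE, row['seq'], row['serial_no'], row['test_name'],
                 row['result'], row['created_at']))
        central.commit()

        # Delete from the buffer only once replication has brought the
        # row back around, so local reads never lose sight of a write.
        pcur = local.cursor()
        pcur.execute(
            "DELETE s FROM test_results_staging s "
            "JOIN test_results m "
            "ON m.origin_site = %s AND m.origin_seq = s.seq", (SITE,))
        local.commit()

    # Local reads that must see not-yet-replicated writes union the two,
    # which is what makes the buffer transparent to the harnesses:
    READ_SQL = """
    SELECT serial_no, test_name, result FROM test_results WHERE serial_no = %s
    UNION
    SELECT serial_no, test_name, result FROM test_results_staging WHERE serial_no = %s
    """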
Seems like it would work well until you have a non-trivial network outage. If we test a card at the CM while the network is down and then bring it in-house or send it to the system integrator for use before the databases resync, then the card doesn't exist in the database copies outside the CM and we are stuck again. This would happen even with full-blown clustering, but the real problem is that a card not existing in the database when it comes in-house is not, in itself, an error, so it would be very difficult to programmatically detect and trap this case.
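(Just to make the ambiguity concrete: at check-in, all the receiving site can do is a lookup like the one below, and an empty result genuinely cannot distinguish "never registered" from "registration still stranded in another site's staging buffer". Table names invented as before.)

    def card_known(local, serial_no):
        """Check-in lookup at the receiving site. Returns False both for
        a card that was never registered AND for one whose registration
        is still sitting in another site's staging buffer -- the local
        data alone can't tell the difference."""
        cur = local.cursor()
        cur.execute(
            "SELECT 1 FROM cards WHERE serial_no = %s "
            "UNION SELECT 1 FROM cards_staging WHERE serial_no = %s "
            "LIMIT 1",
            (serial_no, serial_no))
        return cur.fetchone() is not None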
I know it's a long post and I don't have a TL;DR version, but I'm interested to hear people's thoughts.
--SS
It's complex, with a lot of moving parts, but on the face of it, it seems less painful than