This page contains a list of "gotchas" that I've come across during my long tenure as a PhD candidate (and hacker extraordinaire). They relate to my work in (very large-scale) distributed systems, data mining and data collection.
- Java
- Not news, but recently news to me: lots of booleans are not a "compact" way to represent a bitmap. Here's the problem: Java does not pack booleans into bits. On typical JVMs each element of a boolean[] takes a full byte, and a boolean local variable takes a whole 4-byte slot. Solution? Make a custom Bitmap class where the bitmap is stored in an array of ints of size (numberOfBits + Integer.SIZE - 1) / Integer.SIZE. (Don't write Math.ceil(numberOfBits/Integer.SIZE): the integer division truncates before ceil ever sees the value.) Note that for counting set bits, Java provides the Integer.bitCount() method -- as if they knew the flaw in their boolean representation would lead developers to this workaround.
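The Bitmap idea above can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the class name `Bitmap` and its methods (`set`, `clear`, `get`, `cardinality`) are names made up for this example, and it assumes a fixed size chosen at construction time.

```java
/** Minimal fixed-size bitmap backed by an int[] (illustrative sketch). */
final class Bitmap {
    private final int[] words;
    private final int numberOfBits;

    Bitmap(int numberOfBits) {
        if (numberOfBits < 0) {
            throw new IllegalArgumentException("numberOfBits must be >= 0");
        }
        this.numberOfBits = numberOfBits;
        // Ceiling division done in long arithmetic so the "+ 31" cannot overflow.
        this.words = new int[(int) (((long) numberOfBits + Integer.SIZE - 1) / Integer.SIZE)];
    }

    void set(int index) {
        check(index);
        words[index / Integer.SIZE] |= 1 << (index % Integer.SIZE);
    }

    void clear(int index) {
        check(index);
        words[index / Integer.SIZE] &= ~(1 << (index % Integer.SIZE));
    }

    boolean get(int index) {
        check(index);
        return (words[index / Integer.SIZE] & (1 << (index % Integer.SIZE))) != 0;
    }

    /** Number of set bits, summed word by word with Integer.bitCount(). */
    int cardinality() {
        int count = 0;
        for (int word : words) {
            count += Integer.bitCount(word);
        }
        return count;
    }

    private void check(int index) {
        if (index < 0 || index >= numberOfBits) {
            throw new IndexOutOfBoundsException("bit index " + index);
        }
    }

    public static void main(String[] args) {
        Bitmap bits = new Bitmap(70); // needs 3 ints: 70 bits do not fit in 2
        bits.set(0);
        bits.set(69);
        System.out.println(bits.cardinality()); // prints 2
    }
}
```

If you don't need control over the layout, the standard library's java.util.BitSet covers the same ground and is probably the easier choice.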
- MySQL
- LOAD DATA INFILE is really fast. If you're loading a copy of a table that has GBs of indexed data, don't use inserts.
- MySQL + NFS = misery. You can use MyISAM tables, which may not have the best performance, but will certainly be easy to work with if you have to recover from crashes.
- By default, a 64-bit machine will support an extremely large number of rows per table, but a 32-bit machine will support only 2^32 rows. If you add rows beyond the limit, MySQL simply rolls the row count back to zero and counts up from there. So if you have 2^32+1 rows, MySQL will tell you that you have only 1. To raise the limit, use ALTER TABLE xxx MAX_ROWS= [something large].
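To make the wraparound arithmetic above concrete, here is a tiny sketch. It only models the behavior as described in the note (a row counter that keeps the low 32 bits); it is not MySQL's code, and `reportedRows` is a hypothetical helper invented for the illustration.

```java
final class RowCounterWraparound {
    /** Value a 32-bit unsigned counter would show for the given true row count. */
    static long reportedRows(long actualRows) {
        return actualRows & 0xFFFFFFFFL; // keep only the low 32 bits
    }

    public static void main(String[] args) {
        long limit = 1L << 32; // 2^32 = 4294967296
        System.out.println(reportedRows(limit - 1)); // prints 4294967295
        System.out.println(reportedRows(limit));     // prints 0
        System.out.println(reportedRows(limit + 1)); // prints 1
    }
}
```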
- PHP
- Caching of opcodes can improve performance significantly. The Alternative PHP Cache (APC) module is quite handy for this.
- The built-in bzip2 library is much less efficient (i.e., slower) than the standalone bzip2 binary, at least on Windows.
- If your script is used for data mining and it grows to more than 100 lines, you probably want to use a statically typed language instead. This not only makes the code faster (generally) but also takes up less memory.
- Web service + MySQL
- Use INSERT DELAYED wherever feasible. This allows the service to return immediately to the caller that is blocking on the response.
- If the delayed insert data is coming in faster than your server can handle it, don't use DELAYED (or selectively drop data).
- Beware the thundering herd. Always use randomness in your timeout values, so that clients that failed together don't all come back at the same instant.
- If your web service is humming along and all of a sudden the load on your DB server drops dramatically, make sure that the web server is not dropping connections. One likely culprit when you have a large number of clients: the default value of ip_conntrack_max is way too low. You can verify this by checking /var/log/messages for dropped packets.
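For the thundering-herd point above, one simple way to add randomness to a timeout is to pick it uniformly from a window around a base value. The sketch below assumes that approach; `JitteredTimeout`, `withJitter` and the 20% window are illustrative choices, not something prescribed by these notes.

```java
import java.util.concurrent.ThreadLocalRandom;

final class JitteredTimeout {
    /**
     * Returns a timeout drawn uniformly from
     * [base - base*fraction, base + base*fraction], in milliseconds.
     */
    static long withJitter(long baseMillis, double jitterFraction) {
        if (baseMillis < 0 || jitterFraction < 0.0 || jitterFraction > 1.0) {
            throw new IllegalArgumentException("need baseMillis >= 0 and 0 <= jitterFraction <= 1");
        }
        long spread = (long) (baseMillis * jitterFraction);
        // nextLong(origin, bound) excludes bound, hence the + 1.
        return ThreadLocalRandom.current()
                .nextLong(baseMillis - spread, baseMillis + spread + 1);
    }

    public static void main(String[] args) {
        // Each client ends up with a slightly different retry delay around 1 second.
        for (int i = 0; i < 3; i++) {
            System.out.println(withJitter(1000, 0.2));
        }
    }
}
```

With a base of 1000 ms and a fraction of 0.2, every client waits somewhere between 800 and 1200 ms, which spreads retries out instead of stacking them on the same tick.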
- Linux servers
- If you see a consistent 5 or 10 second delay before certain operations complete, your DNS settings are probably wrong. For example, the DNS servers you configured may not exist.
- From Pred: You must have at least a 1 second sleep in scripts after the last partition command before the mkfs command, or else you run the risk of the mkfs not finding the newly-created partition. Harsh.
- Kernel Hacking
- printk works everywhere ... except when you're using it in the scheduler. In this case it causes deadlock.
- Calling kmalloc before the memory manager has initialized will do nothing. It will not cause an error, however, nor will it be caught at compile time. So make sure you don't use kmalloc until the kernel is ready for it.
- More to come...
