This is a follow-up to an item in the last post. It’s all general pointers and nothing here is specific to any one problem.
Know your tools. Using a relational database on a home computer to hold billions of values in an OEIS sequence is probably not a good idea. My gut instinct is that you can comfortably handle around 10k records in an in-memory database (e.g., H2) and up to 100k records in a ‘real’ database (e.g., Oracle, SQL Server, PostgreSQL or MySQL). The in-memory estimate is probably too small since these are small records, so you won’t chew up much memory. These aren’t hard limits, of course, just my instinct for where you will start to need to pay attention to database issues beyond ‘how do I get to my data?’.
A second issue is how you store the data. Simple sequences are trivial to store. Collections, e.g., sets of amicable numbers, should be stored using a standard one-to-many approach instead of creating one row with a bunch of columns. It’s more work up front but much more useful later.
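To make the one-to-many idea concrete, here is a minimal sketch using SQLite (standing in for whichever database you actually use); the table and column names are my own invention, not a standard schema:

```python
import sqlite3

# One-to-many storage for collections such as amicable pairs:
# one row per member, not one wide row per collection.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE collection (
        id INTEGER PRIMARY KEY,
        description TEXT
    );
    CREATE TABLE member (
        collection_id INTEGER NOT NULL REFERENCES collection(id),
        value INTEGER NOT NULL
    );
""")
conn.execute("INSERT INTO collection (id, description) VALUES (1, 'amicable pair')")
conn.executemany("INSERT INTO member (collection_id, value) VALUES (?, ?)",
                 [(1, 220), (1, 284)])
members = [v for (v,) in conn.execute(
    "SELECT value FROM member WHERE collection_id = 1 ORDER BY value")]
print(members)  # [220, 284]
```

The payoff comes later: a one-to-many layout lets you ask questions like “which collections contain 220?” with a plain query instead of scanning a row full of ad hoc columns.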
Don’t do individual inserts. An individual query may only take 1–2 ms, but if you’re adding hundreds of thousands or even millions of records that time adds up. Every serious database allows you to do a bulk load from an external file, e.g., the COPY command in PostgreSQL. Use it.
Be careful here. There are tools like DbUnit that allow you to specify bulk data, but as I recall it’s ultimately implemented as individual insert statements. You want to use your database’s own tools, not a third-party tool that might not be as efficient.
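The difference is easy to see in miniature. This sketch uses SQLite’s `executemany` inside a single transaction as an in-process stand-in for a real server-side bulk loader like PostgreSQL’s COPY; the table name and the sample sequence are just for illustration:

```python
import sqlite3

# Batch the inserts: one statement, one transaction, all the rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seq (n INTEGER PRIMARY KEY, value INTEGER)")

rows = [(n, n * n) for n in range(1, 100001)]  # e.g., the squares (A000290)
with conn:  # a single transaction for the whole batch
    conn.executemany("INSERT INTO seq (n, value) VALUES (?, ?)", rows)

(count,) = conn.execute("SELECT COUNT(*) FROM seq").fetchone()
print(count)  # 100000
```

Issuing the same 100k rows as separate auto-committed inserts pays the per-statement overhead 100k times; the batched form pays it once.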
Use a bulk load format compatible with Excel et al. Yes, you can read the file directly. But it’s a pain, and you might miss what’s tripping up your database. E.g., are you properly escaping quotes?
In practice this means a tab- or comma-separated values (CSV) file. This is such a widely used format that it should be a safe bet for all databases.
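Quote escaping is exactly the sort of thing to delegate to a library rather than string concatenation. A small sketch with Python’s `csv` module (any decent CSV library behaves the same way, per RFC 4180):

```python
import csv
import io

# Let the CSV library handle quoting: a value containing commas or
# quotes is wrapped in quotes, and embedded quotes are doubled.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow([1, 'contains "quotes", and a comma'])
line = buf.getvalue().strip()
print(line)  # 1,"contains ""quotes"", and a comma"

# Reading it back recovers the original value exactly.
(row,) = csv.reader(io.StringIO(line))
assert row[1] == 'contains "quotes", and a comma'
```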
Something to keep in mind is that there are file limitations. E.g., Excel files are limited to a bit over 65k records. Other databases may restrict the number of records in a bulk insert, or they may have a limitation on the total number of bytes. You may need to split your bulk inserts into multiple files.
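Splitting is mechanical once you know the cap. A sketch, with the 65,000-row limit as an illustrative number rather than any particular tool’s actual limit:

```python
# Split one large bulk-load data set into fixed-size chunks when the
# target tool caps the rows per file. MAX_ROWS is illustrative.
MAX_ROWS = 65000

def chunked(rows, size=MAX_ROWS):
    """Yield successive lists of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

rows = list(range(150000))
parts = list(chunked(rows))
print([len(p) for p in parts])  # [65000, 65000, 20000]
```

Each chunk then becomes its own file for the bulk loader.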
Create solid file creation applications. By that I mean that the algorithms should be well-documented and self-validating. E.g., if you know that every third value should be even then you should verify that before writing the file. It’s a good way to document what you know in addition to being a Really Good Idea in general.
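As a sketch of what self-validating means in practice: `write_sequence` below is my own illustrative helper, using the Fibonacci numbers (A000045), where every third value genuinely is even, as the known invariant:

```python
import csv
import os
import tempfile

def write_sequence(path, values):
    # Document what we know by checking it: every third Fibonacci
    # number is even (F3 = 2, F6 = 8, F9 = 34, ...).
    for i in range(2, len(values), 3):
        assert values[i] % 2 == 0, f"invariant broken at index {i}: {values[i]}"
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for n, v in enumerate(values, start=1):
            writer.writerow([n, v])

fib = [1, 1]
while len(fib) < 30:
    fib.append(fib[-1] + fib[-2])

path = os.path.join(tempfile.mkdtemp(), "a000045.csv")
write_sequence(path, fib)
```

If a bug ever breaks the generator, the file is never written, and the assertion message tells you exactly where to look.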
Capture what you learn. A lot of PE problems will create (or at least use) new OEIS sequences. Add them to your database.
Follow conventions. This is easy – if your data corresponds to an OEIS sequence then make sure your sequence matches. This is mostly a matter of knowing what value to start with, e.g., a sequence whose offset starts it at index 2.
Finally, don’t be afraid to use external resources. You definitely want to do the PE problem first. Sometimes there are only a few more values, e.g., the list of perfect numbers, and it makes sense to just list them explicitly after you’ve finished the PE problem. You never know when you’ll need those values later.
Implementation note: the H2 database (among others?) supports read-only tables loaded directly from CSV files within a zip file. This is a clean way to support an in-memory database – you’ll need to pregenerate the zip file, but once you have it you can quickly create an in-memory relational database containing the OEIS sequences and more, without requiring anything beyond one jar file and one external data file.
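The same pattern works outside H2. Here is a sketch of the idea with SQLite standing in for H2: pregenerate a zip of CSV files, then load them into an in-memory database at startup (the file names and the tiny sample data are mine, for illustration):

```python
import csv
import io
import os
import sqlite3
import tempfile
import zipfile

# Pregenerate a zip containing one CSV per sequence.
zpath = os.path.join(tempfile.mkdtemp(), "oeis.zip")
with zipfile.ZipFile(zpath, "w") as z:
    z.writestr("a000040.csv", "1,2\n2,3\n3,5\n4,7\n")  # first few primes

# At startup, load the CSVs straight into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a000040 (n INTEGER PRIMARY KEY, value INTEGER)")
with zipfile.ZipFile(zpath) as z:
    with z.open("a000040.csv") as f:
        rows = list(csv.reader(io.TextIOWrapper(f, "utf-8")))
        conn.executemany("INSERT INTO a000040 VALUES (?, ?)", rows)

primes = [v for (_, v) in conn.execute("SELECT n, value FROM a000040 ORDER BY n")]
print(primes)  # [2, 3, 5, 7]
```

One data file, one dependency, and the whole reference dataset is queryable in memory a moment after startup.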
UPDATE 2/28/2011: I goofed. The legacy format for Excel (.xls) was limited to 65k records but the new format (.xlsx – Excel 2007) has a limit of 1,048,576 rows. I was thrown off since the Java library I use most often only supports the older format so I’m acutely aware of its limitations.