Tips for Importing Data

Bear Giles | January 25, 2015

I’m currently importing a large amount of spatial data into a PostgreSQL/PostGIS database and realized others could learn from my experience. Most of the advice is not specific to PostgreSQL or PostGIS.

Know the basic techniques

Know the basic techniques for loading bulk data. Use COPY if possible. If not, use batch processing if possible. If neither is an option, turn off auto-commit before doing individual INSERT or UPDATE calls and commit only every Nth call. Use auto-committed INSERT and UPDATE statements as an absolute last resort. Unfortunately, the latter is the usual default with JDBC connections.
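
For example, a batched JDBC upload with auto-commit disabled might look roughly like this. The table and column names are hypothetical and the batch size of 1,000 is only a starting point:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    public class BatchUploader {
        // Hypothetical table city(name, population), used for illustration only.
        public void upload(String url, String user, String password,
                           List<Object[]> rows) throws SQLException {
            try (Connection conn = DriverManager.getConnection(url, user, password)) {
                conn.setAutoCommit(false);                   // no auto-commit
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO city (name, population) VALUES (?, ?)")) {
                    int count = 0;
                    for (Object[] row : rows) {
                        ps.setString(1, (String) row[0]);
                        ps.setInt(2, (Integer) row[1]);
                        ps.addBatch();
                        if (++count % 1000 == 0) {           // commit every Nth call
                            ps.executeBatch();
                            conn.commit();
                        }
                    }
                    ps.executeBatch();                       // flush the final partial batch
                    conn.commit();
                }
            }
        }
    }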

Another standard technique is to drop any indexes before loading large amounts of data and recreate them afterwards. This is not always possible, e.g., if you’re updating a live table, but it can mean a big performance boost.

Finally, remember to update your index statistics. In PostgreSQL this is VACUUM ANALYZE.
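
Putting the last two tips together, here is a minimal sketch using a hypothetical city table and index; the bulk load itself is elided:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class IndexMaintenance {
        public static void reloadWithIndexDrop(Connection conn) throws SQLException {
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("DROP INDEX IF EXISTS city_name_idx");
                // ... bulk load the data here (COPY or batched INSERTs) ...
                stmt.execute("CREATE INDEX city_name_idx ON city (name)");
                // Refresh the planner statistics. VACUUM cannot run inside a
                // transaction, so the connection must be in auto-commit mode here.
                stmt.execute("VACUUM ANALYZE city");
            }
        }
    }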

Know your database

PostgreSQL allows tables to be created as “UNLOGGED”. That means data can be lost if the system crashes while (or soon after) you upload your data, but so what? If that happens you’ll probably want to restart the upload from the start anyway.

I haven’t done performance testing (yet) but it’s another trick to keep in mind.
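
A quick sketch of the idea, with a hypothetical staging table:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class UnloggedStaging {
        // UNLOGGED skips WAL writes, so the load is faster but not crash-safe.
        public static void createStagingTable(Connection conn) throws SQLException {
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE UNLOGGED TABLE staging_city ("
                        + "name text, population integer)");
            }
        }
    }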

Know how to use the ExecutorService for multithreading

Every call to the database has dead time due to network latency plus the time the database needs to complete the call. On the flip side, the database is idle while it waits for the desktop to prepare and upload each call. You can fill that dead time by using a multithreaded uploader.

The ExecutorService makes it easy to break the work into meaningful chunks. E.g., each thread uploads a single table or a single data file.

The ExecutorService also allows you to be more intelligent about how you upload your data.
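
A bare-bones sketch of an ExecutorService-based uploader, one task per shapefile. The uploadFile() method is a placeholder for the real per-file work, and each task should use its own JDBC connection:

    import java.io.File;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ParallelUploader {
        public static void uploadAll(List<File> shapefiles) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(4);   // tune to taste
            for (File shapefile : shapefiles) {
                pool.submit(() -> uploadFile(shapefile));
            }
            pool.shutdown();                           // no new tasks accepted
            pool.awaitTermination(8, TimeUnit.HOURS);  // wait for the uploads to finish
        }

        private static void uploadFile(File shapefile) {
            // ... open the shapefile and upload its records via JDBC ...
        }
    }

The pool size should reflect what the database can usefully handle in parallel, not just the number of cores on the desktop.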

Upload to a dedicated schema

If possible upload to a dedicated schema. This gives you more flexibility later.

In PostgreSQL this can be transparent to most users by calling ALTER DATABASE database SET SEARCH_PATH=schema1,schema2,schema3 and specifying both public and the schema containing the uploaded data. In other databases you can only have one active schema at a time and you will need to explicitly specify the schema later.
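
For instance, something along these lines (the database and schema names are hypothetical):

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class SearchPathSetup {
        // Adding the upload schema to the default search_path makes the new
        // tables visible to users without schema qualification.
        public static void addSchemaToSearchPath(Connection conn) throws SQLException {
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("ALTER DATABASE mydb SET search_path = public, uploaded_data");
            }
        }
    }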

Upload unprocessed data

I’ve sometimes found it faster to upload raw data and process it in the database (with queries and stored procedures) than to process it on the desktop and upload the cooked data. This is especially true with fixed-format records. There are benefits besides performance; see the following items.
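
As a rough illustration, the "cooking" step might be a single INSERT ... SELECT that parses fixed-format lines already sitting in a raw staging table. All table and column layouts here are hypothetical:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class RawThenCook {
        // staging_raw(line) holds the unprocessed fixed-format records;
        // the database does the parsing instead of the desktop.
        public static void cookRawData(Connection conn) throws SQLException {
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("INSERT INTO city (name, population) "
                        + "SELECT trim(substr(line, 1, 40)), "
                        + "       cast(substr(line, 41, 9) AS integer) "
                        + "FROM staging_raw");
            }
        }
    }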

Keep track of your source

Few things are more frustrating than having questions about what’s in the database and no way to trace it back to its source. (Obviously this applies to data you uploaded from elsewhere, not data your application generated itself.) Adding a couple of fields, e.g., filename and record number, can save a lot of headaches later.
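
A hypothetical table definition showing those two extra fields:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class SourceTracking {
        // Every row remembers which file and record it came from.
        public static void createTable(Connection conn) throws SQLException {
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE TABLE city ("
                        + "  name          text,"
                        + "  population    integer,"
                        + "  source_file   text,"       // e.g., the shapefile's pathname
                        + "  source_record integer)");  // record number within that file
            }
        }
    }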

Plan your work

In my current app, one of my first steps is to decompress and open each shapefile (data file), get the number of records, and then close it and delete the uncompressed file. This seems like pointless work, but it gives me a solid estimate of the time required to upload each file. That, in turn, lets me make intelligent decisions about how to schedule the work. E.g., one standard approach is to use a priority queue where you always grab the most expensive remaining work item (e.g., the file with the most records to process) when a thread becomes available. This generally results in the fastest overall upload.

This has two other benefits. First, it allows me to verify that I can open and read all of the shapefiles, so I won’t waste hours before running into a fatal problem. Second, it allows me to verify that I wrote everything expected. There’s a problem if I found 813 records while planning but can later find only 811 rows in the table. Or, worse, 815 rows.
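
Here is a rough sketch of the planning step as a priority queue, using a hypothetical WorkItem that pairs a shapefile's path with its pre-counted record total. Worker threads poll this queue whenever they need their next file:

    import java.util.Comparator;
    import java.util.List;
    import java.util.concurrent.PriorityBlockingQueue;

    public class WorkPlanner {
        // Hypothetical work item: a shapefile path plus its pre-counted record total.
        public static class WorkItem {
            final String path;
            final int recordCount;
            WorkItem(String path, int recordCount) {
                this.path = path;
                this.recordCount = recordCount;
            }
        }

        // Largest files first, so the most expensive uploads start as early as possible.
        public static PriorityBlockingQueue<WorkItem> plan(List<WorkItem> items) {
            PriorityBlockingQueue<WorkItem> queue = new PriorityBlockingQueue<>(
                    Math.max(1, items.size()),
                    Comparator.comparingInt((WorkItem w) -> w.recordCount).reversed());
            queue.addAll(items);
            return queue;
        }
    }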

Log everything

I write to a log table after every database call. I write to a log table after every exception. It’s expensive but it’s a lot easier to query a database table or two later than to parse log files. It’s also less expensive than you think if you have a large batch size.

I log the time, the shapefile’s pathname, the number of records read, and the number of records written. The exception entries log the time, the shapefile’s pathname, and the exception messages. I log all exception messages, both ‘cause’ and ‘nextException’ (for SQLExceptions).
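
A sketch of writing one entry to a hypothetical upload_log table:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Timestamp;

    public class UploadLog {
        // upload_log(logged_at, pathname, records_read, records_written) is hypothetical.
        public static void logBatch(Connection conn, String pathname,
                                    int recordsRead, int recordsWritten) throws SQLException {
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO upload_log (logged_at, pathname, records_read, records_written) "
                    + "VALUES (?, ?, ?, ?)")) {
                ps.setTimestamp(1, new Timestamp(System.currentTimeMillis()));
                ps.setString(2, pathname);
                ps.setInt(3, recordsRead);
                ps.setInt(4, recordsWritten);
                ps.executeUpdate();
            }
        }
    }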

Perform quality checks

Check that you’ve written all expected records and had no exceptions.

Check the data for self-consistency. Add unique indexes. Add referential integrity constraints. There’s a real cost to adding indexes when inserting and updating data but there’s no downside if these will be read-only tables.

Check for null values. Check for reasonable values, e.g., populations and ages can never be negative.

You can often perform checks specific to the data. E.g., I know that states/provinces must be WITHIN their respective countries, and cities must be WITHIN their respective states/provinces. (Some US states require cities to be WITHIN a single county but other states allow cities to cross county boundaries.)
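
A few illustrative checks against hypothetical tables; the spatial check assumes PostGIS geometries in a geom column. Each query should come back with a zero count if the data passes:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class QualityChecks {
        public static void run(Connection conn) throws SQLException {
            try (Statement stmt = conn.createStatement()) {
                // Populations can never be negative.
                check(stmt, "SELECT count(*) FROM city WHERE population < 0");
                // Missing source information defeats traceability.
                check(stmt, "SELECT count(*) FROM city WHERE source_file IS NULL");
                // PostGIS: every state geometry must lie within its country.
                check(stmt, "SELECT count(*) FROM state s "
                        + "JOIN country c ON c.id = s.country_id "
                        + "WHERE NOT ST_Within(s.geom, c.geom)");
            }
        }

        private static void check(Statement stmt, String sql) throws SQLException {
            try (ResultSet rs = stmt.executeQuery(sql)) {
                rs.next();
                if (rs.getInt(1) > 0) {
                    throw new IllegalStateException("QC failed: " + sql);
                }
            }
        }
    }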

There are two possibilities if the QC fails. First, your upload could be faulty. This is easy to check if you recorded the source of each record. Second, the data itself could be faulty. In either case you want to know this as soon as possible.

Backup your work

This should be a no-brainer: immediately perform a database backup of these tables after the QC passes.

Lock down the data

This only applies if the tables will never or rarely be modified after loading. (Think things like tables containing information about states or area codes.)

Lock down the tables. You can use both belts and suspenders:

REVOKE INSERT, UPDATE, TRUNCATE ON TABLE table FROM PUBLIC;

ALTER TABLE table SET READ ONLY;

Some people prefer VACUUM FREEZE table.

If you uploaded the data into a dedicated schema, you can use the shorthand:

REVOKE INSERT, UPDATE, TRUNCATE ON ALL TABLES IN SCHEMA schema FROM PUBLIC;

You can also REVOKE CREATE permissions on a schema.
