Invariant Properties

Proactive Database Defenses Using Triggers

Bear Giles | January 15, 2017

I’m sure I’ve discussed this a number of years ago but a question came up after the recent Boulder Linux User Group meeting and I decided this would be a good time to revisit it.

The question: how do you protect sensitive information from illicit insertion or modification when the attacker has full SQL access as the website user?

Important: I am focused on approaches we can use in the database itself, not our application, since the former will protect our data even if an attacker has full access to the database. These approaches are invisible to our database frameworks, e.g., JPA, once we have created the tables.

An Approach Without Triggers

At a minimum we can ensure that the database was properly configured with multiple users:

app_owner – owns the schema and tables. Often does not have INSERT/UPDATE/DELETE (or even SELECT) privileges on the tables.

app_user – owns the data but cannot modify the schema, tables, etc.

We can make this much more secure by splitting app_user into two users, app_reader and app_writer. The former only has SELECT privileges on the tables and is the only account used by user-facing code. The app_writer user adds INSERT/UPDATE/DELETE privileges and is used only by the methods that actually need to modify the data. Data is typically read so much more often than it is written that it often makes sense to view an application as two (or more) separate but related applications. In fact they may be – you can improve security by handling all data manipulation via microservices visible only to the application.
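
A minimal sketch of that privilege split in PostgreSQL syntax (the role, schema, and authentication details here are illustrative):

CREATE ROLE app_owner  LOGIN;   -- owns the schema and tables; used only for migrations
CREATE ROLE app_reader LOGIN;   -- user-facing code, read-only
CREATE ROLE app_writer LOGIN;   -- data-modifying services only

GRANT USAGE ON SCHEMA app TO app_reader, app_writer;
GRANT SELECT ON ALL TABLES IN SCHEMA app TO app_reader;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA app TO app_writer;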

There is a big downside to this – modern database frameworks, e.g., JPA or Hibernate, make heavy use of caching to improve performance. You need to ensure that the cache used by the app_reader account is updated (or invalidated) whenever the corresponding record(s) are modified through the app_writer account.

Security Defense

This is highly database specific – does the database maintain logs that show when a user attempts to perform a non-permitted action? If so you can watch the logs on the app_reader account. Any attempt to insert or update data is a strong indication of an attacker.

Triggers Based On Related Information

A 3NF (or higher) database requires that each column be independent. In practice we often perform partial denormalization for performance reasons, e.g., adding a column for the day of the week in addition to the full date. We can easily compute the former from the latter but it takes time and can’t be indexed.

There’s a risk that a bug or intruder will introduce inconsistencies. One common solution is to use an INSERT OR UPDATE trigger that calculates the value at the time the data is inserted into the database. E.g.,

CREATE FUNCTION calculate_day_of_week() ....

CREATE TABLE date_with_dow (
    date text,
    dow  text
);

CREATE FUNCTION set_day_of_week() RETURNS trigger AS $$
    BEGIN
        -- always derive the day of week from the incoming date, ignoring whatever the client supplied
        NEW.dow = calculate_day_of_week(NEW.date);
        RETURN NEW;
    END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER set_day_of_week BEFORE INSERT OR UPDATE ON date_with_dow
   FOR EACH ROW EXECUTE PROCEDURE set_day_of_week();

This ensures that the day of week is properly set. A software bug, or attacker, can try specifying an invalid value but they’ll fail.
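
A quick illustration, assuming the objects above exist and calculate_day_of_week() is implemented (2017-01-15 was a Sunday):

INSERT INTO date_with_dow (date, dow) VALUES ('2017-01-15', 'Friday');

SELECT * FROM date_with_dow;
--     date     |  dow
-- -------------+--------
--  2017-01-15  | Sunday     -- the trigger ignored the bogus 'Friday'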

Of course we don’t really care (much) if the day of the week is incorrect. However there are other times when we care a great deal, e.g., cached attributes from digital certificates. If someone can insert a certificate with mismatched cached values, esp. if they can replace an existing table entry, then they can do a lot of damage unless the code assumes the database could be corrupted and performs its own validation checks on everything it gets back. (First rule of security: never trust anything.) Even with such checks we’ll only know that the data has been corrupted, not when and not how broadly.

Security Defense

Developers are information packrats. Can we learn anything from the provided day of week value?

Yes. It’s a huge red flag if the provided value doesn’t match the calculated value (modulo planned exceptions, e.g., passing null or a sentinel value to indicate that the application is deferring to the database). It’s easy to add a quick test:

CREATE FUNCTION calculate_day_of_week() ....

-- user-defined function that can do anything from adding an entry into
-- a table to sending out an email, SMS, etc., alert
CREATE FUNCTION security_alert() ....

CREATE TABLE date_with_dow (
    date text,
    dow  text
);

CREATE FUNCTION set_day_of_week() RETURNS trigger AS $$
    DECLARE
        calculated_dow text;
    BEGIN
        calculated_dow = calculate_day_of_week(NEW.date);
        -- null means "defer to the database"; any other mismatch is a red flag
        IF (NEW.dow IS NOT NULL AND NEW.dow <> calculated_dow) THEN
            PERFORM security_alert('bad dow value!');
            RETURN null;
        END IF;
        NEW.dow = calculated_dow;
        RETURN NEW;
    END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER set_day_of_week BEFORE INSERT OR UPDATE ON date_with_dow
    FOR EACH ROW EXECUTE PROCEDURE set_day_of_week();

Sidenote: check out your database documentation for more ideas. For instance many applications use @PrePersist annotations to autofill creationDate and lastUpdateDate. It’s easy to do this via a trigger – and a trigger ensures the data is updated even if an attacker modifies it via SQL injection or direct access. More impressively, you can write audit information to a separate table, perhaps even in a separate schema on which app_user only has INSERT privileges. That prevents an attacker from learning what the system has recorded about them, much less altering or deleting that information.
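
As a sketch, filling those timestamps in the database rather than in the application looks like this (assuming a hypothetical table my_table with creation_date and last_update_date columns):

CREATE FUNCTION set_audit_dates() RETURNS trigger AS $$
    BEGIN
        IF (TG_OP = 'INSERT') THEN
            NEW.creation_date = now();
        END IF;
        NEW.last_update_date = now();
        RETURN NEW;
    END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER set_audit_dates BEFORE INSERT OR UPDATE ON my_table
    FOR EACH ROW EXECUTE PROCEDURE set_audit_dates();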

I’ve written triggers that generate XML representations of the OLD and NEW values and write them to an audit table together with date, etc. On INSERT the OLD data is null, on DELETE the NEW data is null. Using XML allows us to use a common audit table (table name is just a field) and potentially allows you to add transaction id, etc.

It is then easy to use a bit of simple XML diff code to see exactly what changed when by reviewing the audit table.
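
Here is a sketch of that audit pattern; it uses jsonb rather than XML to keep the example short, and the audit table layout is illustrative:

CREATE TABLE audit_log (
    id         bigserial PRIMARY KEY,
    table_name text        NOT NULL,
    changed_at timestamptz NOT NULL DEFAULT now(),
    old_row    jsonb,   -- null on INSERT
    new_row    jsonb    -- null on DELETE
);

CREATE FUNCTION audit_row_change() RETURNS trigger AS $$
    BEGIN
        INSERT INTO audit_log (table_name, old_row, new_row)
        VALUES (TG_TABLE_NAME,
                CASE WHEN TG_OP <> 'INSERT' THEN to_jsonb(OLD) END,
                CASE WHEN TG_OP <> 'DELETE' THEN to_jsonb(NEW) END);
        RETURN NULL;   -- return value is ignored for AFTER triggers
    END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER audit_date_with_dow AFTER INSERT OR UPDATE OR DELETE ON date_with_dow
    FOR EACH ROW EXECUTE PROCEDURE audit_row_change();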

Resources:

  • PostgreSQL
  • MySQL
  • Oracle

Triggers Based On Secrets

What about tables where there are no “related” columns? Can we use a trigger to detect an illicit attempt to INSERT or UPDATE a record?

Yes!

In this case we want to add an extra column to the table. It can be anything – the sole purpose is to create a way to pass a validation token to the trigger.

What are validation tokens?

A validation token can be anything you want. A few examples are:

A constant – this is the easiest approach, but it is only as strong as your ability to keep the value secret. An example is ’42’. An obvious variant is the sum of several of the other columns of the table. This value should not be written to the database or it will be exposed to anyone with SELECT privileges.

A time-based value – your webserver and database will have closely synced clocks so you can use a time-based protocol such as Time-based One-time Password (TOTP) Algorithm. If both the database and application servers use NTP you can keep the window as small as a few seconds. Just remember to include one tick on either side when validating the token – NTP keeps the clocks synchronized but there can still be a very small skew plus network latency to consider.

Note: TOTP requires a shared secret and is independent of the contents of the INSERT or UPDATE statement.

You can save a time-based value but it is meaningless without a timestamp – and some algorithms can be cracked if you have a series of values and the starting time.

An HMAC value – most people will be familiar with standard cryptographic hashes such as MD5 or SHA-1 (both considered cracked) or SHA-256. They’re powerful tools – but anyone can compute the same value given the same input, so a plain hash proves nothing about who wrote the record.

In our case we want an HMAC – a cryptographically strong message digest that also requires a secret key. An attacker cannot generate a valid HMAC without the key, but anyone holding the shared secret can verify one. An HMAC needs a message to process, and that message should be intrinsic to the record’s value – for instance a digital certificate, a PDF document, even a hashed password. Don’t use it to hash the primary key or any value that can be readily reused.

You can freely save an HMAC value.
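
On the application side the token is easy to compute with the JDK’s javax.crypto.Mac. A minimal sketch (the hard-coded key and sample message are purely illustrative; key management is the real work):

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public final class HmacToken {
    private HmacToken() { }

    /** Compute a base64-encoded HMAC-SHA256 token over the record's intrinsic value. */
    public static String compute(byte[] key, byte[] message) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        return Base64.getEncoder().encodeToString(mac.doFinal(message));
    }

    public static void main(String[] args) throws Exception {
        byte[] key = "demo-secret".getBytes(StandardCharsets.UTF_8);              // illustrative only
        byte[] certificate = "DER-encoded certificate".getBytes(StandardCharsets.UTF_8);
        System.out.println(compute(key, certificate));
    }
}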

Subsequent validation

We would like to know that values haven’t been corrupted, e.g., by an attacker knowledgeable enough to disable the trigger, insert bad values, and then restore the trigger. The last step is important since we can / should run periodic scans to ensure all security-related features like these database triggers are still in place. Can we use these techniques to validate the records after the fact?

Constant value: no.

Time-based value: only if we record a timestamp as well, and if we do then we have to assume that the secret has been compromised. So… no.

HMAC value: yes.
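
If your database has a crypto extension you can also rescan the table periodically. A sketch using PostgreSQL’s pgcrypto against a hypothetical certificates table (id, certificate bytea, token text); in practice the key would not be embedded in the query:

CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- list rows whose stored token no longer matches the HMAC of the certificate bytes
SELECT id
  FROM certificates
 WHERE token IS DISTINCT FROM
       encode(hmac(certificate, 'demo-secret'::bytea, 'sha256'), 'hex');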

Backups and Restorations

Backups and restorations have the same problems as subsequent validations. You can’t allow any magic values to be backed up (or an attacker could learn it by stealing the backup media) and you can’t allow the time-based values plus timestamps to be backed up (or an attacker could learn the shared secret by stealing the backup media). That means you would need to disable the trigger when restoring data to the database and you can’t verify that it’s properly validated afterwards. Remember: you can’t trust backup media!

The exception is HMAC tokens. They can be safely backed up and restored even if the triggers are in place.

Security Defense

You can add a token column to any table. As always it’s a balance between security and convenience and the less powerful techniques may be Good Enough for your needs. But for highly sensitive records, esp. those that are inserted or updated relatively infrequently, an HMAC token may be a good investment.

Implementation-wise: on the application side you can write a @PrePersist method that handles the creation of the TOTP or HMAC token. It’s a standard calculation and the biggest issue, as always, is key management. On the database side you’ll need to have a crypto library that supports whatever token method you choose.

Shadow Tables

Finally, there are two concerns with the last approach. First, it requires crypto libraries to be available in your database, and that may not be the case. Second, if a value is inserted it’s impossible to know for sure that it was your application that inserted it.

There’s a solution to this which is entirely app-side. It might not give you immediate notification of a problem but it still gives you some strong protection when you read the data.

You start as above – add a @PrePersist method that calculates an HMAC code. Only now you edit the domain bean so that the HMAC column is mapped to a @SecondaryTable instead of the main table. (I think you can even specify a different schema if you want even higher security.) From the database perspective this is just two tables with a 1:1 relationship, but from the source code perspective it’s still a single object.

Putting this into a separate table, if not a separate schema as well, means that a casual attacker will not know that it is there. They might succeed in inserting or modifying data but not realize that the changes will be detected even if audit triggers are disabled.

The final step is adding a @PostLoad method that verifies the HMAC code. If it’s good you can have confidence the data hasn’t been corrupted. If it’s incorrect or missing you know there’s a problem and you shouldn’t trust it.
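
A sketch of the whole idea under JPA; the entity, table, and HmacSupport helper names are illustrative, not the actual code:

import javax.persistence.*;

@Entity
@Table(name = "certificates")
@SecondaryTable(name = "certificate_hmac")           // optionally in a separate schema
public class Certificate {
    @Id
    @GeneratedValue
    private Long id;

    @Lob
    private byte[] encoded;                          // the intrinsic value the HMAC covers

    @Column(table = "certificate_hmac")
    private String hmac;                             // stored in the shadow table

    @PrePersist
    @PreUpdate
    void computeHmac() {
        hmac = HmacSupport.compute(encoded);         // hypothetical wrapper around javax.crypto.Mac
    }

    @PostLoad
    void verifyHmac() {
        if (hmac == null || !hmac.equals(HmacSupport.compute(encoded))) {
            throw new IllegalStateException("record " + id + " failed HMAC verification");
        }
    }
}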

For advanced users: the developer won’t even know that the extra data is present. You can do a lot with AOP – some teams are organized so that the developers write unsecured code and a security team, focused entirely on security rather than features, adds the security entirely through AOP code woven into the existing code. But that’s a topic for a future blog….


Building Hadoop on Ubuntu 16.10

Bear Giles | January 2, 2017

Edge Nodes and Rolling your own Hadoop Packages

I must answer an important question before I start. Why would anyone want to build Hadoop themselves? Isn’t it much saner to use one of the commercial distributions like Cloudera or Hortonworks? Both have ‘express’ versions that are free to use and ideal for developers and small-scale testing. (They’re distributed as VMWare images but it’s straightforward to convert a VMWare image into an AWS image that can be run on an EC2 instance. In fact it’s an item on my to-blog list!) The Cloudera Express version also has an option that gives you a straight Hadoop cluster without the Cloudera enhancements.

Why bother building our own packages?

The answer is edge nodes. The classic Hadoop environment, e.g., what you’ll see in a Coursera specialization, involves a tidy Hadoop cluster that has map/reduce jobs uploaded to it and run. Everything goes through a handful of clean interfaces.

In practice any site that needs a Hadoop cluster will probably have its own software that solves its business needs and that software uses the Hadoop cluster as a resource no differently than an existing database, mail, or jms service. It needs access to the cluster but only via the well-defined wire protocol. In addition any CISO on the ball will want to keep the Hadoop cluster tucked away on its own locked-down VPC, one that has the Hadoop cluster on it but nothing else. She’ll also want anything that talks to the cluster on a relatively locked-down VPC, ideally one that’s not directly accessible from the corporate VPN much less the internet at large.

Hence ‘edge nodes’. The idea is that a Hadoop cluster consists of two types of nodes. Compute nodes run the actual Hadoop services, edge nodes do not but are able to communicate with the services. They’re often the only way to communicate with the services. Edge nodes can be located in the same VPC as the compute nodes or they can be located in a different VPC. There are benefits and drawbacks to both approaches.

Setting up an edge node is pretty straightforward. Besides access to the Hadoop compute nodes you need:

  • The configuration files from the cluster (typically in /etc/hadoop)
  • The appropriate client jars from the cluster
  • The Hadoop client programs (e.g., ‘hadoop’ or ‘hdfs’) if you’ll do anything from the command line or in scripts
  • The shared libraries for native code. (Optional but it improves performance)

In addition you’ll need the Kerberos client apps and the /etc/krb5.conf file if you use Kerberos authentication within the cluster. Kerberos is a good idea even if you’re on a dedicated VPC. The commercial distributions make it easy to set up if you don’t already have a corporate KDC. Again there are benefits and drawbacks to using the KDC bundled in the Hadoop distribution vs. using a corporate KDC.
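
As a rough sketch (the hostnames, paths, and package names below are assumptions, not taken from a real cluster):

#
# copy the cluster configuration from a compute node
#
$ scp -r hadoop-master:/etc/hadoop /tmp/hadoop-conf
$ sudo cp -r /tmp/hadoop-conf /etc/hadoop

#
# Kerberos client tools and realm configuration (Ubuntu package name)
#
$ sudo apt-get install krb5-user
$ scp hadoop-master:/etc/krb5.conf /tmp/krb5.conf
$ sudo cp /tmp/krb5.conf /etc/krb5.conf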

Developer systems will also often be set up as edge nodes to development-internal clusters.

So Why Do We Need To Build Our Own Packages?

We have commercial distributions. They provide free ‘express’ versions and even a free prebundled standard Hadoop cluster. So why would we possibly want to build our own packages?

There are two reasons. The first is the most obvious – we want or need to use a version of a service that’s not supported by the commercial distribution. Cloudera takes a very conservative approach, so even the newest release contains older versions of the services, albeit ones that have often had new features backported to them. Hortonworks takes a more aggressive approach and will have newer versions of the services, but there’s no guarantee that it will have the version you need. In these cases you need to provide the tarball yourself, and it’s often better to build it locally than to download a prebuilt image in order to reduce the risk of a nasty surprise caused by mismatched system libraries. That’s rare but it can be a real pain to track down when it occurs.

The second reason is more subtle. There’s no requirement that the edge nodes run the same operating system as the compute nodes. This could be a small change, e.g., a developer laptop running Ubuntu vs. a CDH cluster running on RedHat, or it could be more substantial such as a macbook developer laptop. In the case of Ubuntu it should be possible to copy the RedHat files but again rebuilding the packages on your own system ensures there won’t be any surprises.

The Easy (But Incomplete) Solutions

You can build the most recent versions of Hadoop by cloning https://github.com/apache/hadoop and running the start-build-env.sh script. That creates a Docker environment with all of the required libraries so builds will go quickly.

$ git clone https://github.com/apache/hadoop
$ cd hadoop
$ ./start-build-env.sh

Unfortunately that script was introduced in Hadoop 2.8. I need to support earlier versions and that means building Hadoop on the command line instead of a preconfigured Docker container. At first glance this is straightforward:

#
# get source...
#
$ git clone https://github.com/apache/hadoop
$ cd hadoop
$ git checkout release-2.5.0

#
# install development libraries
#
$ sudo apt-get install libssl-dev zlib1g-dev libbz2-dev libsnappy-dev

#
# install Kerberos client and development libraries (just in case)
#
$ sudo apt-get install krb5-user libkrb5-dev

#
# install required tools
#
$ sudo apt-get install cmake protobuf-compiler

#
# build Hadoop distribution tarball including native libraries
#
$ mvn package -Pdist,native -DskipTests -Dtar -Dmaven.javadoc.skip=true -Drequire.snappy -Drequire.openssl

At this point we should fly along… until we do a good impression of a bug hitting a windshield. Ubuntu 16.10 provides version 3.0.0 of the protocol buffer tool but Hadoop has a hard requirement on version 2.5.0. That version hasn’t been supported by Ubuntu since 14.04.
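
You can confirm the mismatch up front:

$ protoc --version
libprotoc 3.0.0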

We could manually download and install the 14.04 .deb packages but that runs the risk of getting into library dependency hell. There is a better solution.

Building Protocol Buffers 2.5.0 on Ubuntu 16.10

The solution is a bit of Ubuntu-fu. We don’t want to install the Ubuntu 14.04 binary packages but there’s no problem downloading the Ubuntu 14.04 source package and rebuilding it in our Ubuntu 16.10 environment. We can then safely install these packages as either a direct replacement or in a separate location (using ‘dpkg --root=dir’) without worrying about introducing other outdated libraries.

#
# download source package. There's a way to do this with dpkg-source but with older source packages
# I prefer to do it manually.
#
$ wget https://launchpad.net/ubuntu/+archive/primary/+files/protobuf_2.5.0.orig.tar.gz
$ wget https://launchpad.net/ubuntu/+archive/primary/+files/protobuf_2.5.0-9ubuntu1.debian.tar.gz
$ wget https://launchpad.net/ubuntu/+archive/primary/+files/protobuf_2.5.0-9ubuntu1.dsc

#
# unpack source
#
$ dpkg-source --extract protobuf_2.5.0-9ubuntu1.dsc

#
# build binary packages. this will take awhile.
#
# note: you might need to install additional packages in order to build this package.
# Build dependencies are listed in the control file under "Build-Depends".
#
$ cd protobuf-2.5.0
$ dpkg-buildpackage -us -uc -nc

#
# the binary packages are now available in the original directory. You can install
# them using 'dpkg', or 'dpkg --root=dir' if you want them to exist in parallel with
# the current libraries. In the latter case you will need to specify the new location
# when you build hadoop.
#
$ cd ..
$ sudo dpkg -i *deb

Note: your system will revert to the 3.0.0 version with the next ‘apt-get upgrade’ unless you pin the version at 2.5.0.
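
One way to do that is to put the locally built packages on hold; the package names below are the ones the 2.5.0 source package should produce, so double-check against what dpkg-buildpackage actually generated:

$ sudo apt-mark hold protobuf-compiler libprotobuf8 libprotoc8 libprotobuf-dev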

Finishing and Deploying the Build

We can now finish the build. When it is done there will be a large .tar.gz file in the hadoop-dist/target directory. For instance hadoop-2.5.0.tar.gz is over 133 MB. This file is traditionally untarred in the /opt directory.

#
# untar package
#
$ sudo tar xzf hadoop-dist/target/hadoop-2.5.0.tar.gz -C /opt

#
# create symlink to make life easier
#
$ cd /opt
$ sudo ln -s hadoop-2.5.0 hadoop

#
# make the native libraries available
# (note: file must be created as root. showing 'echo' for convenience.)
#
$ echo /opt/hadoop/lib/native > /etc/ld.so.conf.d/hadoop.conf
$ sudo ldconfig

#
# verify shared libraries are now visible
#
ldconfig -p | grep hadoop

#
# add hadoop binaries to PATH. Note: in the long term you'll want to update
# /etc/profile or ~/.bashrc.
#
export PATH=$PATH:/opt/hadoop/bin

#
# set HADOOP_HOME. Or is it HADOOP_COMMON_HOME? HADOOP_PREFIX? This seems to change between
# Hadoop versions so check your documentation.
#

In total the contents of /opt/hadoop include seven directories: bin, etc, include, lib, libexec, sbin, and share. Edge nodes need to keep bin, lib, libexec, share/doc, and the client libraries from share/hadoop. The easiest way to find them is

$ find /opt/hadoop/share/hadoop/common/lib
$ find /opt/hadoop/share/hadoop -name "*-client*-2.5.0.jar"
$ find /opt/hadoop/share/hadoop -name "*-common*-2.5.0.jar"

In my case that’s

  • ./common/hadoop-common-2.5.0.jar
  • ./common/hadoop-nfs-2.5.0.jar (needed?)
  • ./common/lib/hadoop-annotations-2.5.0.jar
  • ./common/lib/hadoop-auth-2.5.0.jar
  • ./mapreduce/hadoop-mapreduce-client-app-2.5.0.jar
  • ./mapreduce/hadoop-mapreduce-client-common-2.5.0.jar
  • ./mapreduce/hadoop-mapreduce-client-core-2.5.0.jar
  • ./mapreduce/hadoop-mapreduce-client-hs-2.5.0.jar
  • ./mapreduce/hadoop-mapreduce-client-hs-plugins-2.5.0.jar
  • ./mapreduce/hadoop-mapreduce-client-jobclient-2.5.0.jar
  • ./mapreduce/hadoop-mapreduce-client-shuffle-2.5.0.jar
  • ./yarn/hadoop-yarn-client-2.5.0.jar
  • ./yarn/hadoop-yarn-common-2.5.0.jar
  • ./yarn/hadoop-yarn-server-common-2.5.0.jar (needed?)

plus a large number of third-party libraries. Some of the less common ones that are unlikely to already be in your app include

  • ./common/lib/apacheds-i18n-2.0.0-M15.jar
  • ./common/lib/apacheds-kerberos-codec-2.0.0-M15.jar
  • ./common/lib/avro-1.7.4.jar
  • ./common/lib/guava-11.0.2.jar
  • ./common/lib/jsch-0.1.42.jar
  • ./common/lib/paranamer-2.3.jar
  • ./common/lib/protobuf-java-2.5.0.jar
  • ./common/lib/snappy-java-1.0.4.1.jar
  • ./common/lib/zookeeper-3.4.6.jar

We don’t need the contents of the /opt/hadoop/etc directory – we should use a copy of the configuration files from one of the compute nodes. We don’t need the contents of the include, sbin, or rest of the shared directories since they are only required when we run the Hadoop services.

Distribution

We don’t need to go through this process on every edge node – in the case of Ubuntu it’s easy to create a binary package for redistribution via ‘dpkg -b’. We have to follow a few simple rules and we’ll have a package that can be safely installed, updated, and removed. One huge benefit of using a binary package is that we can safely put the files in their standard places instead of the /opt directory.
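
A minimal sketch of such a package (the package name, version, install path, and maintainer are illustrative):

$ mkdir -p hadoop-client_2.5.0-1/DEBIAN hadoop-client_2.5.0-1/usr/lib/hadoop
$ cp -a /opt/hadoop-2.5.0/* hadoop-client_2.5.0-1/usr/lib/hadoop/
$ cat > hadoop-client_2.5.0-1/DEBIAN/control <<'EOF'
Package: hadoop-client
Version: 2.5.0-1
Architecture: amd64
Maintainer: you@example.com
Description: Hadoop client libraries and tools for edge nodes
EOF
$ dpkg -b hadoop-client_2.5.0-1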

I’m not familiar with the RPM creation process but I’m sure it’s equally easy to do in that environment.

Finally, I am debating creating a Debian/Ubuntu PPA with these packages for multiple Hadoop projects and versions. Watch this blog for announcements.

Other Hadoop Projects

There are other Hadoop projects that we will want to bundle for edge nodes. One good example is Hive – building the package from source gives us the ‘beeline’ and ‘hplsql’ command line tools. The process should go smoothly once you have an environment that can build the main Hadoop project. Just be careful to examine the pom file since available profiles and final distribution location will differ.


DataSource Classloader Headaches

Bear Giles | January 2, 2017

I haven’t been posting since I’ve been very busy learning Hadoop + Kerberos for multiple client environments and getting into shape before it’s too late. (I know it’s “never too late” in principle but I’m seeing family and friends my age who are now unable to do hard workouts due to medical issues. For them it is “too late” to get into better shape so this is no longer an abstract concern for me.)

Part of my broader work is supporting applications with user-provided JDBC drivers. We bundle the datasource (typically HikariCP) and allow the user to specify the JDBC driver jar. Support has been very ad hoc, so I’ve been working on parameterized tests that use Aether to query the Maven Central repository for all versions of the datasource and JDBC jars and then verify that I can make a connection to our test servers using all possible combinations. That’s not always possible, e.g., older JDBC drivers might not support a method required by newer versions of the datasource class, especially for more obscure databases such as Hive.

(Note: I don’t mean to pick on Hikari here. I’m seeing this problem in several libraries and I’m just using it as an example.)

The test should be straightforward. With one test class per datasource version:

ClassLoader oldClassLoader = Thread.currentThread().getContextClassLoader();
for (Artifact artifact : /* Aether query */) {
    try {
        // wrap the JDBC driver jar in its own classloader, with the test classloader as parent
        URL[] urls = new URL[] {
            artifact.getFile().toURI().toURL()
        };
        ClassLoader cl = new URLClassLoader(urls, oldClassLoader);
        Thread.currentThread().setContextClassLoader(cl);

        HikariConfig config = new HikariConfig();
        config.setJdbcUrl(TEST_URL);
        config.setDriverClassName(DRIVER_CLASSNAME);
        DataSource ds = new HikariDataSource(config);
        try (Connection conn = ds.getConnection();
                Statement stmt = conn.createStatement();
                ResultSet rs = stmt.executeQuery("select 1 as x")) {
            assertThat(rs.next(), equalTo(true));
            assertThat(rs.getInt("x"), equalTo(1));
        }
    } finally {
        Thread.currentThread().setContextClassLoader(oldClassLoader);
    }
}

(Note: I’m actually using a parameterized junit test that uses the loop to produce the list of parameters. Each parameterized test is then run individually. I’m using an explicit loop here to emphasize the need to restore the environment after each test.)

Only one problem – it can’t find the driver class. Looking at the source code in github reveals the problem:

public void setDriverClassName(String driverClassName) {
    Class c = HikariConfig.class.getClassLoader().loadClass(driverClassName);
    ...
}

The Hikari classes were loaded by a different classloader than the JDBC driver classes and the ‘parent’ relationship between the classloaders goes the wrong way.

The fix isn’t hard – I need to modify my classloader so it loads both the Hikari datasource library and the JDBC driver library. This requires the use of reflection to create and configure the HikariConfig and HikariDataSource classes but that’s not too hard if I use commons-lang3 helper classes. There’s even a benefit to this approach – I can specify both datasource and JDBC driver jars in the test parameters and no longer need a separate test class for each version of the datasource.
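
A sketch of what that approach might look like; the jar variables, constants, and use of commons-lang3’s MethodUtils are illustrative:

// inside the test method: one classloader that sees both the datasource and the driver
URL[] urls = new URL[] {
    hikariJar.toURI().toURL(),
    driverJar.toURI().toURL()
};
try (URLClassLoader cl = new URLClassLoader(urls, getClass().getClassLoader())) {
    Thread.currentThread().setContextClassLoader(cl);

    Class<?> configClass = cl.loadClass("com.zaxxer.hikari.HikariConfig");
    Object config = configClass.getConstructor().newInstance();
    MethodUtils.invokeMethod(config, "setJdbcUrl", TEST_URL);
    MethodUtils.invokeMethod(config, "setDriverClassName", DRIVER_CLASSNAME);

    Class<?> dsClass = cl.loadClass("com.zaxxer.hikari.HikariDataSource");
    DataSource ds = (DataSource) dsClass.getConstructor(configClass).newInstance(config);
    try (Connection conn = ds.getConnection();
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery("select 1 as x")) {
        assertThat(rs.next(), equalTo(true));
    }
}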

Unfortunately it doesn’t work. I haven’t dug deeper into the class but I noticed the setter only verifies that the class is visible. It’s actually loaded and used elsewhere and it might use a different classloader at that point. Research continues….

But wait, it gets worse!

As an alternative I tried to explicitly register the JDBC driver in order to create the datasource without explicitly naming the JDBC driver classname (if possible):

ClassLoader oldClassLoader = Thread.currentThread().getContextClassLoader();
Driver driver = null;
for (Artifact artifact : /* Aether query */) {
    try {
        // wrap the JDBC driver jar in its own classloader, with the test classloader as parent
        URL[] urls = new URL[] {
            artifact.getFile().toURI().toURL()
        };
        ClassLoader cl = new URLClassLoader(urls, oldClassLoader);
        Thread.currentThread().setContextClassLoader(cl);

        Class<? extends Driver> driverClass = cl.loadClass(DRIVER_NAME).asSubclass(Driver.class);
        driver = driverClass.newInstance();
        DriverManager.registerDriver(driver);

        ...

    } finally {
        if (driver != null) {
            DriverManager.deregisterDriver(driver);
        }
        Thread.currentThread().setContextClassLoader(oldClassLoader);
    }
}

Incredibly this fails – the deregisterDriver() call throws a SecurityException! This happens even when I explicitly set a permissive SecurityManager in the test setup. Digging into the code I discovered that the DriverManager checks whether the caller has the ability to load the class being deregistered. That sounds like a basic sanity check against malicious behavior but it introduces a classloader dependency. Again it’s not using the classloader I created in order to isolate my tests. The DriverManager is a core class so there’s no solution to this problem.

Edited to add…

I meant that there’s no clean solution to this problem. The DriverManager class uses reflection to learn the classloader of the calling method and verifies that the driver is accessible to it. In our case it’s not – we created a new classloader and it’s still our thread’s contextClassLoader but we’re calling the deregisterDriver() method from a class loaded by the original classloader.

One solution is to write and maintain another class that exists solely to deregister the driver class. That is non-obvious and will be a pain to maintain.

The other solution is to use reflection to make the internal registeredDrivers collection accessible and directly manipulate it in our ‘finally’ clause. That was my final solution.
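
A sketch of that reflective cleanup; it pokes at OpenJDK’s private registeredDrivers list and its DriverInfo wrapper (the "driver" field name is an assumption about that internal class), so treat it as test-only code that may break on other JDK implementations:

static void forceDeregister(Driver driver) throws Exception {
    Field field = DriverManager.class.getDeclaredField("registeredDrivers");
    field.setAccessible(true);
    @SuppressWarnings("unchecked")
    CopyOnWriteArrayList<Object> registeredDrivers = (CopyOnWriteArrayList<Object>) field.get(null);
    for (Object driverInfo : registeredDrivers) {
        Field driverField = driverInfo.getClass().getDeclaredField("driver");
        driverField.setAccessible(true);
        if (driverField.get(driverInfo) == driver) {
            registeredDrivers.remove(driverInfo);   // safe: CopyOnWriteArrayList iterators don't fail fast
        }
    }
}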

Lessons learned

If we’re writing a library that allows the user to specify a classname at runtime, we MUST test the scenario where the user loads the containing jar in a separate classloader. It’s not enough to test only when the containing jar is on the same classpath as our library – the jar might ultimately be provided by the end user and not the developer.

