Invariant Properties


Hadoop: User Impersonation with Kerberos Authentication

Bear Giles | February 10, 2018

In my nonexistent free time I’ve been working on unit tests to cover JAAS Kerberos authentication and Hadoop user authentication in mind-numbing detail. I don’t want to just find something that works (or seems to work) – I want to know that it’s working as I expect and that I understand the consequences of my implementation. This is especially true at my company since we support third parties who have different needs and different solutions. I plan to create blog posts that follow this work, but see the comment above about nonexistent free time. Sigh.

Something that has consumed a ridiculous amount of time recently is user impersonation with Kerberos authentication on an HDFS cluster. This situation arises when you are running a cluster in a secure environment (hence Kerberos authentication) and you want your application to be able to impersonate individual users when accessing the HDFS filesystem. You could use ACLs to grant read access to existing files, but how would the application create new files with the necessary ownership, permissions, and ACLs?

Performing user impersonation is a straightforward two-step process. First you must create a proxy user with UserGroupInformation.createProxyUser() and execute your code in a PrivilegedAction via doAs(). Within the action, UserGroupInformation#getCurrentUser() returns the impersonated user.

    UserGroupInformation application = ... // get user via one of the standard methods, e.g. getLoginUser()
    UserGroupInformation user = UserGroupInformation.createProxyUser("impersonatedUser", application);

    user.doAs(new PrivilegedAction<Void>() {
        public Void run() {
            // do required work as the impersonated user
            return null;
        }
    });
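One caveat: PrivilegedAction#run() cannot throw checked exceptions, while most HDFS calls throw IOException. In practice you’ll usually use the doAs(PrivilegedExceptionAction) overload instead. A minimal sketch, continuing from the snippet above (PrivilegedExceptionAction is in java.security):

    // doAs(PrivilegedExceptionAction) itself throws IOException and
    // InterruptedException, which the calling code must handle
    user.doAs(new PrivilegedExceptionAction<Void>() {
        public Void run() throws IOException {
            // work that may throw IOException goes here
            return null;
        }
    });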

Then you must update your Hadoop configuration. By default no user can impersonate any other user; you must add entries to core-site.xml to enable this on a per-user basis. To allow the user ‘myApplication’ to impersonate other users you need to add:

   <property>
       <name>hadoop.proxyuser.myApplication.hosts</name>
       <value>*</value>
   </property>
   <property>
       <name>hadoop.proxyuser.myApplication.users</name>
       <value>*</value>
   </property>

Naturally you’ll want to specify the narrowest possible values. E.g., “hosts” should be limited to your application server(s), and “users” should be limited to the users your application actually needs to impersonate. A “*” in “users” would let the application impersonate anyone, including the Hadoop service accounts, which is a Very Bad Idea.
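For example, a narrowed configuration might look something like the following. The hostnames and group name are illustrative, and hadoop.proxyuser.<user>.groups is an alternative to .users that whitelists by group membership instead of listing individual users:

   <property>
       <name>hadoop.proxyuser.myApplication.hosts</name>
       <value>appserver1.example.com,appserver2.example.com</value>
   </property>
   <property>
       <name>hadoop.proxyuser.myApplication.groups</name>
       <value>application-users</value>
   </property>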

The “problem”

The “problem” I hit is that I needed to create a Hadoop FileSystem object. There are two factory methods:

   FileSystem fs = FileSystem.newInstance(uri, conf);

and

   FileSystem fs = FileSystem.newInstance(uri, conf, impersonatedUser);

I was using the second form – it’s the one that supports user impersonation – so it’s the one that I should use, neh? Yet I was getting an “Unable to Authenticate [KERBEROS, TOKEN]” message despite everything I did. It was very confusing since I could look at my “current user” and clearly see that the Kerberos authentication information was present.

The “solution”

The “solution” was that I was too close to the problem. We use different Maven modules for user authentication, HDFS access, and Hive access. When I looked at where I was creating my UserGroupInformation instance I saw that I was properly setting up Kerberos authentication and creating a proxy user. When I looked at where I was creating the HDFS filesystem I saw that I was properly calling the method that allows user impersonation.

It took far longer than I want to admit to realize that you can’t do both – at least not in the (old) version of Hadoop we support. It seems that the second factory method calls createProxyUser() and uses a privileged action behind the scenes (good) but doesn’t check whether the current user and the impersonated user are the same. The rest of the library only looks one level down the ‘real user’ chain and doesn’t find the required credentials in a proxy user of a proxy user of a Kerberos-authenticated user. My problems went away once I switched to the first form.
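Concretely, the working combination looks like the sketch below: do the impersonation explicitly with createProxyUser()/doAs() and call the two-argument newInstance() inside the privileged action, so the filesystem picks up the proxy user as the current user. The principal, keytab path, URI, and username are illustrative.

    import java.net.URI;
    import java.security.PrivilegedExceptionAction;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class ImpersonatedHdfsAccess {
        public static void main(String[] args) throws Exception {
            final Configuration conf = new Configuration();
            conf.set("hadoop.security.authentication", "kerberos");
            UserGroupInformation.setConfiguration(conf);

            // authenticate the application itself (illustrative principal and keytab)
            UserGroupInformation.loginUserFromKeytab("myApplication@EXAMPLE.COM",
                    "/etc/security/keytabs/myApplication.keytab");
            UserGroupInformation application = UserGroupInformation.getLoginUser();

            // impersonate an end user ("alice" is illustrative)
            UserGroupInformation alice =
                    UserGroupInformation.createProxyUser("alice", application);

            FileSystem fs = alice.doAs(new PrivilegedExceptionAction<FileSystem>() {
                public FileSystem run() throws Exception {
                    // two-argument form: it reads the current (proxy) user itself
                    return FileSystem.newInstance(URI.create("hdfs://namenode:8020"), conf);
                }
            });

            // files created here are owned by "alice", not by the application principal
            fs.mkdirs(new Path("/user/alice/example"));
            fs.close();
        }
    }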

The lessons

I take away three lessons from this. The first is that this is exactly the type of thing I want to cover in my unit tests. It’s the type of mistake a developer with a mid-level amount of experience will make – someone who knows there’s more than one approach and has a basic understanding of why you would choose one over another, but doesn’t yet have enough experience to know the gritty details of how the different pieces of the Hadoop libraries interact. Javadoc will never be enough – you need increasingly complete test coverage where you can see exactly what works, what fails, and why.

The second lesson is the importance of test environments where you can cover all of the cases you need and where you can single-step through your actual application. This can be difficult with Hadoop clusters since they’re often on AWS EC2 instances, and there’s a well-known problem with using HDFS from outside the AWS VPC: the HDFS namenodes insist on handing out the internal IP addresses of the datanodes – addresses that are meaningless when you’re testing your software on your dev box. I wasn’t able to close several of my recent issues until I had both an embedded HDFS cluster (which I could use in unit/functional tests) and a Cloudera Express instance running in a local VirtualBox instance (which I could use in integration tests).
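For the embedded cluster, Hadoop ships a MiniDFSCluster test utility (in the hadoop-minicluster artifact) that runs HDFS inside a single JVM. A minimal, non-kerberized sketch; a kerberized setup additionally needs something like MiniKdc:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.MiniDFSCluster;

    public class EmbeddedHdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // start a one-datanode HDFS cluster inside this JVM
            MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
                    .numDataNodes(1)
                    .build();
            try {
                FileSystem fs = cluster.getFileSystem();
                fs.mkdirs(new Path("/user/test"));
                // ... exercise the code under test against 'fs' ...
            } finally {
                cluster.shutdown();
            }
        }
    }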

Of course having these test environments isn’t enough. You still need to actually write the tests to verify your understanding of the library.

The third lesson is the importance of being able to step back from the immediate problem. I know this. I know that sometimes you make a lot more progress by finding something else to work on for a while so you can see the forest instead of the trees when you return to the problem. It’s still hard to tell your boss that you know your top priority is one issue but that the best next step is to ignore it and work on something entirely unrelated for a few days to a week.
