Hadoop: User Impersonation with Kerberos Authentication
Bear Giles | February 10, 2018

In my nonexistent free time I’ve been working on unit tests to cover JAAS Kerberos authentication and Hadoop user authentication in mind-numbing detail. I don’t want to just find something that works (or seems to work); I want to know that it’s working as I expect and that I understand the consequences of my implementation. This is especially true at my company since we support third parties who have different needs and different solutions. I plan to write blog posts that follow this work, but see the comment above about nonexistent free time. Sigh.
Something that has consumed a ridiculous amount of time recently is user impersonation with Kerberos authentication on an HDFS cluster. This situation arises when you are running a cluster in a secure environment (hence Kerberos authentication) and you want your application to be able to impersonate individual users when accessing the HDFS filesystem. You could use ACLs to grant read access to existing files, but how would an application create new files with the necessary ownership, permissions, and ACLs?
Performing user impersonation is a straightforward two-step process. First, you must create a proxy user and execute your code in a PrivilegedAction; this sets the value returned by UserGroupInformation#getCurrentUser().
    UserGroupInformation application = ...; // get user via one of the standard methods
    UserGroupInformation user = UserGroupInformation.createProxyUser("applicationUser", application);
    user.doAs(new PrivilegedAction<Void>() {
        public Void run() {
            // do required work
            return null;
        }
    });
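One of the “standard methods” for getting that first UserGroupInformation is a keytab login. A minimal sketch, assuming Kerberos is enabled in the configuration; the principal and keytab path here are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    Configuration conf = new Configuration();
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);
    // The principal and keytab path are hypothetical - substitute your own.
    UserGroupInformation application = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
            "myApplication@EXAMPLE.COM", "/etc/security/keytabs/myApplication.keytab");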
Then you must update your Hadoop configuration. By default no user can impersonate any other user; you must add entries to core-site.xml to enable impersonation on a per-user basis. To allow the user ‘myApplication’ to impersonate other users you need to add
    <property>
      <name>hadoop.proxyuser.myApplication.hosts</name>
      <value>*</value>
    </property>
    <property>
      <name>hadoop.proxyuser.myApplication.users</name>
      <value>*</value>
    </property>
Naturally you’ll want to specify the narrowest possible values. E.g., “hosts” should be limited to your application server(s), and “users” should be limited to your application users. A “*” for “users” allows the application to impersonate the Hadoop service accounts themselves, which is a Very Bad Idea.
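A narrowed configuration might look like the following. The host and group names are hypothetical, and Hadoop also accepts a hadoop.proxyuser.<name>.groups property if you would rather whitelist a group than individual users:

    <property>
      <name>hadoop.proxyuser.myApplication.hosts</name>
      <value>appserver1.example.com,appserver2.example.com</value>
    </property>
    <property>
      <name>hadoop.proxyuser.myApplication.groups</name>
      <value>application-users</value>
    </property>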
The “problem”
The “problem” I hit is that I needed to create a Hadoop FileSystem object. There are two factory methods:
    FileSystem fs = FileSystem.newInstance(uri, conf);
and
    FileSystem fs = FileSystem.newInstance(uri, conf, impersonatedUser);
I was using the second form – it’s the one that supports user impersonation – so it’s the one I should use, neh? Yet I was getting an “Unable to Authenticate [KERBEROS, TOKEN]” message no matter what I tried. It was very confusing since I could look at my “current user” and clearly see that the Kerberos authentication information was present.
The “solution”
The “solution” was that I was too close to the problem. We use different maven modules for user authentication, HDFS access, and Hive access. When I looked at where I was creating my UserGroupInformation instance I saw that I was properly setting up Kerberos authentication and creating a proxy user. When I looked at where I was creating the HDFS filesystem I saw that I was properly calling the method that allows user impersonation.
It took far longer than I want to admit to realize that you can’t do both. At least not in the (old) version of Hadoop we support. It seems that the second factory method calls createProxyUser() and uses a privileged action behind the scenes (good) but doesn’t check whether the current user and the impersonated user are the same. The rest of the library only looks one level down in the ‘real user’ chain, so it doesn’t see the required information in a proxy user of a proxy user of a Kerberos-authenticated user. My problems went away once I switched to the first form.
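Putting the pieces together, the working combination is the explicit proxy user plus the two-argument newInstance() inside the doAs() block. A sketch, assuming application, uri, and conf are defined as in the earlier snippets:

    UserGroupInformation user = UserGroupInformation.createProxyUser("applicationUser", application);
    // Use the two-argument form here; the three-argument form would create
    // a proxy user of a proxy user, which the library cannot authenticate.
    FileSystem fs = user.doAs(new PrivilegedExceptionAction<FileSystem>() {
        public FileSystem run() throws IOException {
            return FileSystem.newInstance(uri, conf);
        }
    });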
The lessons
I take away three lessons from this. The first lesson is that this is exactly the type of thing I want to cover in my unit tests. It’s the type of mistake that a developer with a mid-level amount of experience will make: someone who knows there’s more than one approach and has a basic understanding of why you would choose one approach over another, but doesn’t yet have enough experience to know the gritty details of how different pieces of the Hadoop libraries interact. Javadoc will never be enough – you need increasingly complete test coverage where you see exactly what works, what fails, and why.
The second lesson is the importance of test environments where you can cover all of the cases you need in your tests and where you can single-step through your actual application. This can be difficult with Hadoop clusters since they’re often on AWS EC2 instances and there’s a well-known problem with using HDFS from outside of the AWS VPC: the HDFS namenodes insist on giving out the internal IP addresses of the datanodes – something that’s meaningless when you’re testing your software on your dev box. I wasn’t able to close several of my recent issues until I had both an embedded HDFS cluster (which I could use in unit/functional tests) and a Cloudera Express instance running in a local VirtualBox instance (which I could use in integration tests).
Of course having these test environments isn’t enough. You still need to actually write the tests to verify your understanding of the library.
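For the embedded cluster, Hadoop ships a MiniDFSCluster test harness. A minimal sketch, assuming the hadoop-minicluster test artifact is on the test classpath:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.MiniDFSCluster;

    // Spin up a single-datanode HDFS cluster inside the test JVM.
    Configuration conf = new Configuration();
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build();
    try {
        FileSystem fs = cluster.getFileSystem();
        fs.mkdirs(new Path("/tmp/test"));
        // ... exercise the code under test against 'fs' ...
    } finally {
        cluster.shutdown();
    }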
The third lesson is the importance of being able to step back from the immediate problem. I know this. I know that sometimes you make a lot more progress by finding something else to work on for a while so you can see the forest instead of the trees when you return to the problem. It’s still hard to tell your boss that you know your top priority is one issue but the best next step is to ignore it and work on something entirely unrelated for a few days to a week.