Using an S3 Hive Metastore with EMR

When configuring Hive to use EMRFS (i.e. s3:// URLs) for its warehouse instead of the cluster's implied HDFS storage, which is vital if you want a persistent metastore that can survive clusters being destroyed and recreated, you might encounter this message: Access Denied (Service: Amazon S3; Status Code: 403;...).

Before going into the cause and workarounds, I'll give a quick introduction.

Hive Metastore Options

There are two different ways to directly access S3 data from EMR:
  • AWS’s EMRFS, which is accessed via s3:// or s3n:// URLs.
  • The Apache S3A Filesystem, which is accessed via s3a://
AWS’s implementation is based on the old Apache s3n FileSystem client. It is touted as ‘optimised’ for running EMR on AWS with S3, and AWS doesn't support the Apache s3a file system.
The main differences between it and the standard Apache file system connector are:
  1. Per path/user/group mapping to Role to access S3
    This is done via “EMR Security configurations”, which give a shared/central point for configuring at-rest and client-side encryption. (Or the options can be manually specified as configuration JSON, as I do here.)
    The standard Apache implementation can do some bucket-level variations, but you’ll need a custom credential provider to do anything more.
  2. The EMRFS S3-optimized Committer for Spark
    This “avoid(s) list and rename operations done in Amazon S3 during job and task commit phases”. I haven’t had a chance to compare this one.
  3. The consistent view handling
    This uses a DynamoDB table to keep track of which files have recently changed in S3, so that nodes aren't tripped up by S3's eventual consistency (the file not being listed yet, or an old version being downloaded).
    This is a bit complicated and my initial use of it has been fiddly.
    It requires the EMR cluster to have access to the DynamoDB table, which by default is a shared one.
  4. The emulation of directories
    S3, being a key/value storage system, doesn’t have the concept of directories. The EMRFS implementation appends a “_$folder$” suffix to directory paths to indicate a directory, whilst s3a simply uses a trailing / (just as the AWS S3 console does to indicate a directory placeholder).
    This causes problems because, even if there is already a file at a lower location, EMRFS will try to create placeholders all the way up the tree, and it will fail if the role doesn’t have permission to write to
    s3://YourBucket/_$folder$,
    s3://YourBucket/team1_$folder$,
    s3://YourBucket/team1/view_$folder$
    for a file stored at s3://YourBucket/team1/view/thefile.csv.
    There’s a vague description but not really an explanation at this FAQ: When I use Amazon EMR with Amazon S3, empty files with the _$folder$ suffix appear in my S3 bucket. Can I safely delete these files?
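A quick way to see which of these placeholder objects already exist in a bucket is a recursive listing filtered for the suffix. This isn't from the EMR tooling, just a plain AWS CLI check using the placeholder bucket name from this post:

# List every "_$folder$" marker object EMRFS has written to the bucket.
aws s3 ls s3://YourBucket --recursive | grep '_\$folder\$'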

The initial configuration

This was to launch a small persistent cluster, which uses an RDS MySQL database to store the hive metadata. The javax.jdo.option... options are omitted from the hive-site section for brevity.
[
  {
    "classification": "hive-site",
    "properties": {
      "hive.metastore.warehouse.dir": "s3://YourBucket/team1/hive/metastore"
    }
  },
  {
    "classification": "emrfs-site",
    "properties": {
      "fs.s3.consistent": "true",
      "fs.s3.enableServerSideEncryption": "true"
    }
  }
]
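For reference, this is roughly how that JSON can be supplied when launching the cluster with the AWS CLI. The file name, release label, instance details and role name below are placeholders rather than the values used for this cluster:

aws emr create-cluster \
  --name "team1-hive" \
  --release-label emr-5.20.0 \
  --applications Name=Hive \
  --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=YourEmrEc2Role,KeyName=your-key \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --configurations file://./configurations.json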

Getting AccessDenied when using an S3 Hive warehouse

Trying to create a hive database, I immediately hit this error:
hive> show databases;
OK
default
Time taken: 0.169 seconds, Fetched: 1 row(s)

hive> describe database default;
OK
default Default Hive database s3://YourBucket/team1/hive/warehouse public  ROLE
Time taken: 0.028 seconds, Fetched: 1 row(s)

hive> create database test1;
FAILED:
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: ...; S3 Extended Request ID: ...), S3 Extended Request ID: ...)
If consistent view is off and you copy a dummy file into the location where you want to create the db, then it works:
cat /dev/null | aws s3 cp - s3://YourBucket/team1/hive/empty.txt
hive> create database test2;
OK
Time taken: 5.106 seconds

hive> describe database test2;
OK
test2           s3://YourBucket/team1/hive/test2    hive    USER
Time taken: 0.253 seconds, Fetched: 1 row(s)

Debugging it

As soon as the master node has started, ssh hadoop@ec2:
sudo vim /etc/hive/conf/hive-log4j2.properties
# change the following property:
property.hive.log.level = DEBUG
This will give you detailed logging of the calls made to S3, with request/response headers, at /mnt/var/log/hive/user/hive/hive.log (assuming you've invoked hive via sudo -u hive hive).
Once the master is started, start the hive cli, run the create database (or create table), then check the above log.
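With DEBUG on, the log is very noisy. Something like the following grep helps pull out the failing S3 calls; the exact message text depends on the SDK version, so treat it as a starting point:

# Show the context around any 403 responses in the hive log.
sudo grep -B 5 -A 5 'Status Code: 403' /mnt/var/log/hive/user/hive/hive.log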

The cause: the _$folder$ problem

The EC2 instance's Role allows it to Read/Write to s3://YourBucket/team1/* because we share the bucket and silo access to the data by the first path segment of the prefix. Loosening this restriction immediately showed me the problem when trying to create an external table, using the example from the AWS EMR documentation:
echo "123123343,1111,http://example.com/index.html,https://www.google.com/,127.20.50.21,AU" > pv_2008-06-08.txt
aws s3 cp ./pv_2008-06-08.txt s3://YourBucket/team1/view/pv_2008-06-08.txt
sudo -u hive hive
CREATE EXTERNAL TABLE page_view_s3(viewTime INT, userid BIGINT,
   page_url STRING, referrer_url STRING,
   ip STRING COMMENT 'IP Address of the User',
   country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://YourBucket/team1/view/';
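To confirm it really is the placeholder keys (and not the data path) that the role can't write, a check I find useful, outside of Hive and not from the original debugging above, is to attempt the same PUTs directly with the CLI under the instance role:

# Mimic the zero-byte directory markers EMRFS tries to create.
# Under the restricted role the first two PUTs are denied, while the
# one under team1/ succeeds. The SSE header is included because our
# bucket policy requires it.
aws s3api put-object --bucket YourBucket --key '_$folder$' --server-side-encryption AES256
aws s3api put-object --bucket YourBucket --key 'team1_$folder$' --server-side-encryption AES256
aws s3api put-object --bucket YourBucket --key 'team1/view_$folder$' --server-side-encryption AES256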

Solution 1: permissions workaround

This involved making peace with EMRFS wanting to create these marker files everywhere, and giving the EC2 Role permission to do so, but only for those specific files.
The EC2 Role I used had an inline copy of the AmazonElasticMapReduceforEC2Role policy with the S3 access removed. Then the following inline policy was added to grant access only to the specific buckets we wanted to expose:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "HadoopDirectoryMarkers",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::YourBucket/_$folder$",
        "arn:aws:s3:::YourBucket/team1_$folder$"
      ]
    },
    {
      "Sid": "TeamRepoReadWriteDelete",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::YourBucket/team1/*"
    },
    {
      "Sid": "ListBuckets",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": [
        "arn:aws:s3:::YourBucket",
        "arn:aws:s3:::YourBucket-logs"
      ]
    },
    {
      "Sid": "WriteLogs",
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::YourBucket-logs/team1/emr/*"
    }
  ]
}
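If the role is managed from the CLI rather than the console, attaching that inline policy looks something like this; the role and policy names are placeholders:

aws iam put-role-policy \
  --role-name YourEmrEc2Role \
  --policy-name team1-emr-s3-access \
  --policy-document file://team1-emr-s3-access.json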

Solution 2: don’t use EMRFS for metastore

The other simple solution is to not use EMRFS for the hive database / metastore, and just use s3a:// instead.
That makes creating the hive database work, but you may not want to change it for the tables. However, if you're reading from (and writing to) S3 locations that already have files in the base paths of the external tables' LOCATION, then it won't try creating “_$folder$” directory markers at the root.
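As a sketch, the only change from the initial configuration is the URL scheme of the warehouse directory, plus the s3a equivalent of the SSE option (covered in the encryption section below) since our buckets enforce it:

[
  {
    "classification": "hive-site",
    "properties": {
      "hive.metastore.warehouse.dir": "s3a://YourBucket/team1/hive/metastore"
    }
  },
  {
    "classification": "core-site",
    "properties": { "fs.s3a.server-side-encryption-algorithm": "AES256" }
  }
]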
But I did get errors with consistent view enabled, which were solved by syncing the location of the external table into EMRFS:
emrfs sync s3://YourBucket/team1/view

Handling server side encryption

This is a simplification I noticed during the debugging process; it had been overlooked when the permissions were originally set up, and is useful if you're not already doing it.
We’d originally configured our buckets to require Server Side Encryption (SSE-S3 / AWS managed keys) and had a bucket policy enforcing that every PUT request sends an SSE header. For EMR to write to these buckets, it required at least the following options set in the configuration JSON:
[
  {
    "classification": "presto-connector-hive",
    "properties": { "hive.s3.sse.enabled": "true" }
  },
  {
    "classification": "core-site",
    "properties": { "fs.s3a.server-side-encryption-algorithm": "AES256" }
  },
  {
    "classification": "emrfs-site",
    "properties": { "fs.s3.enableServerSideEncryption": "true" }
  }
]
But there is a simpler, less error-prone way. You can have the bucket automatically encrypt every uploaded object that doesn’t specify an encryption method by setting Default Bucket Encryption on the bucket.
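Enabling it from the CLI, if you prefer that to the console, looks like this (again with the placeholder bucket name):

aws s3api put-bucket-encryption \
  --bucket YourBucket \
  --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'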
If you previously had a bucket policy requiring that the header be sent, you’ll need to change it; otherwise you’ll have to keep sending the header. You could remove the policy entirely, but then it’s possible for a client to override the default with a different encryption setting. Instead, here is a policy that checks that, if the SSE header is sent, it matches what we set our default to (in this case, AWS managed SSE):
{
  "Version": "2012-10-17",
  "Id": "PutObjPolicy",
  "Statement": [
    {
      "Sid": "DenyIncorrectEncryptionHeader",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::YourBucket/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "AES256"
        },
        "Null": {
          "s3:x-amz-server-side-encryption": "false"
        }
      }
    }
  ]
}
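Applying the updated policy from the CLI (the file name is a placeholder):

aws s3api put-bucket-policy --bucket YourBucket --policy file://require-sse.json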
Once the default encryption has been set on the bucket and the policy updated, you can omit all the SSE-related options from the EMR configuration JSON.
