Using an S3 Hive Metastore with EMR
When configuring Hive to use EMRFS (i.e. s3://) instead of the cluster's implied HDFS storage for the metastore warehouse (vital if you want a persistent metastore that survives clusters being destroyed and recreated), you might encounter this message:
Access Denied (Service: Amazon S3; Status Code: 403;...)
Before going into the cause and workarounds, I'll give a quick introduction.
Hive Metastore Options
There's 2 different ways to directly access s3 data from EMR:

- AWS's EMRFS, which is accessed via s3:// or s3n:// URLs.
- The Apache S3A filesystem, which is accessed via the s3a:// FileSystem client.

EMRFS is touted to be 'optimised' for running EMR on AWS with S3, and AWS doesn't support the Apache s3a file system. The main differences between it and the standard Apache file system connector are:
- Per path/user/group mapping to Role to access S3
This is done via "EMR Security configurations", which give a shared / central point for configuring at-rest and client-side encryption. (Or they can be manually specified as configuration options in JSON, as I do here.) The standard Apache impl can do some bucket-level variations, but you'll need to use a custom credential provider to do anything more.
- The EMRFS S3-optimized Committer for Spark
Which "avoid(s) list and rename operations done in Amazon S3 during job and task commit phases". I haven't had a chance to compare this one.
- The consistent view handling
This uses a DynamoDB table to keep track of which files have recently changed in S3, so that nodes aren't hit by the eventual consistency problems (the file not being there, or an old version being downloaded). This is a bit complicated and my initial use of it has been fiddly. It requires the EMR cluster to have access to the DynamoDB table, which by default is a shared one.
- The emulation of directories
Being a key/value storage system, S3 doesn't have the concept of directories; the EMRFS impl uses a "_$folder$" suffix on directory paths to indicate a directory, whilst s3a simply uses a trailing / (just as the AWS S3 console does to indicate a directory placeholder). This causes problems because, even if there is a file at a lower location, EMRFS will try to create placeholders all the way up the tree and will fail if the role doesn't have permission to write to s3://YourBucket/_$folder$, s3://YourBucket/team1_$folder$ and s3://YourBucket/team1/view_$folder$ for a file that was stored in s3://YourBucket/team1/view/thefile.csv.
There's a vague description, but not really an explanation, in this FAQ: When I use Amazon EMR with Amazon S3, empty files with the _$folder$ suffix appear in my S3 bucket. Can I safely delete these files?
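If you want to see which markers EMRFS has left behind (and, per that FAQ, clean them up), something like the following should do it; the bucket and prefix are just the placeholders used in this post:

aws s3 ls s3://YourBucket/ --recursive | grep '_\$folder\$'
aws s3 rm 's3://YourBucket/team1_$folder$'
# if consistent view is on, follow up with: emrfs sync s3://YourBucket/team1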
The initial configuration
This was to launch a small persistent cluster, which uses an RDS MySQL database to store the hive metadata. The javax.jdo.option... options are omitted from the hive-site section for brevity.

[
  {
    "classification": "hive-site",
    "properties": {
      "hive.metastore.warehouse.dir": "s3://YourBucket/team1/hive/metastore"
    }
  },
  {
    "classification": "emrfs-site",
    "properties": {
      "fs.s3.consistent": "true",
      "fs.s3.enableServerSideEncryption": "true"
    }
  }
]
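For context, this is roughly the kind of launch command that JSON was used with (saved locally as configurations.json, with the javax.jdo options added back in). The cluster name, release label, instance sizes, key pair and instance profile below are only illustrative:

aws emr create-cluster \
  --name "team1-hive" \
  --release-label emr-5.20.0 \
  --applications Name=Hive \
  --configurations file://./configurations.json \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --service-role EMR_DefaultRole \
  --ec2-attributes KeyName=my-key-pair,InstanceProfile=Team1EmrEc2Role \
  --log-uri s3://YourBucket-logs/team1/emr/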
Getting AccessDenied when using an s3 hive warehouse
Trying to create a hive database, I immediately hit this error:

hive> show databases;
OK
default
Time taken: 0.169 seconds, Fetched: 1 row(s)
hive> describe database default;
OK
default  Default Hive database  s3://YourBucket/team1/hive/warehouse  public  ROLE
Time taken: 0.028 seconds, Fetched: 1 row(s)
hive> create database test1;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: ...; S3 Extended Request ID: ...), S3 Extended Request ID: ...)

If consistent view is off and you copy a dummy file into the location where you want to create the db, then it works:
cat /dev/null | aws s3 cp - s3://YourBucket/team1/hive/empty.txt
hive> create database test2;
OK
Time taken: 5.106 seconds
hive> describe database test2;
OK
test2  s3://YourBucket/team1/hive/test2  hive  USER
Time taken: 0.253 seconds, Fetched: 1 row(s)
Debugging it
As soon as the master node starts, ssh in as the hadoop user and:

sudo vim /etc/hive/conf/hive-log4j2.properties
# change the following property:
property.hive.log.level = DEBUG

This will give you detailed logging of the calls made to s3, with request/response headers, at /mnt/var/log/hive/user/hive/hive.log (assuming you've invoked hive via sudo -u hive hive).
Once the master is started, start the hive cli, run the create database (or create table), then check the above log.
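To find which request is being rejected, grepping the debug log for the 403 is usually enough (the exact wording of the log lines varies between EMR/Hive versions):

sudo grep -n -i -E 'status code: 403|accessdenied' /mnt/var/log/hive/user/hive/hive.log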
The cause: the _$folder$ problem
The EC2 Role allows it to Read/Write to s3://YourBucket/team1/*, as we share the bucket and silo access to the data by the first path element of the prefix. Loosening this restriction immediately showed me the problem when trying to create the external table. Using the example from the AWS EMR documentation:
echo "123123343,1111,http://example.com/index.html,https://www.google.com/,127.20.50.21,AU" > pv_2008-06-08.txt
aws s3 cp ./pv_2008-06-08.txt s3://YourBucket/team1/view/pv_2008-06-08.txt
sudo -u hive hive
CREATE EXTERNAL TABLE page_view_s3(
  viewTime INT, userid BIGINT,
  page_url STRING, referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User',
  country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://YourBucket/team1/view/';
Solution 1: permissions workaround
This involved making peace with emrfs wanting to create these files everywhere, and giving the EC2 Role permission to do so, but only for those files.

The EC2 Role I used had an inline copy of the AmazonElasticMapReduceforEC2Role policy with the s3 access removed. Then the following inline policy was added to give only specific access to the buckets we wanted to expose:
{ "Version": "2012-10-17", "Statement": [ { "Sid": "HadoopDirectoryMarkers", "Effect": "Allow", "Action": [ "s3:GetObject", "s3:PutObject" ], "Resource": [ "arn:aws:s3:::YourBucket/_$folder$", "arn:aws:s3:::YourBucket/team1_$folder$" ] }, { "Sid": "TeamRepoReadWriteDelete", "Effect": "Allow", "Action": [ "s3:PutObject", "s3:GetObject", "s3:DeleteObject" ], "Resource": "arn:aws:s3:::YourBucket/team1/*" }, { "Sid": "ListBuckets", "Effect": "Allow", "Action": "s3:ListBucket", "Resource": [ "arn:aws:s3:::YourBucket", "arn:aws:s3:::YourBucket-logs" ] }, { "Sid": "WriteLogs", "Effect": "Allow", "Action": "s3:PutObject", "Resource": "arn:aws:s3:::YourBucket-logs/team1/emr/*" } ] }
Solution 2: don’t use EMRFS for metastore
The other simple solution is to not use EMRFS for the hive database / metastore: just use s3a://. That makes creating the hive database work, but you may not want to change it for the tables. However, if you're reading from (and writing to) s3 locations that already have files in the base paths of the external tables' LOCATION, then it won't try creating "_$folder$" directory markers at the root.

But I did get errors with consistent view enabled, which were solved by sync'ing the location of the external table into EMRFS:
emrfs sync s3://YourBucket/team1/view
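For reference, a minimal sketch of what the s3a approach looks like, using the same placeholder paths as above: either point the warehouse directory at an s3a:// path in the cluster configuration, or give an individual database an s3a:// location (test3 here is just an example name):

[
  {
    "classification": "hive-site",
    "properties": {
      "hive.metastore.warehouse.dir": "s3a://YourBucket/team1/hive/metastore"
    }
  }
]

hive> create database test3 location 's3a://YourBucket/team1/hive/test3';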
Handling server side encryption
This is a simplification I noticed during the debugging process that had been overlooked when the permissions were originally set up; it's worth doing if you're not already.

We'd originally configured our buckets to require Server Side Encryption (SSE-S3 / AWS managed keys) and had a bucket policy enforcing that every PUT request sends an SSE header. For EMR to write to these buckets it required at least the following options set in the configuration JSON:
[ { "classification": "presto-connector-hive", "properties": { "hive.s3.sse.enabled": "true" } }, { "classification": "core-site", "properties": { "fs.s3a.server-side-encryption-algorithm": "AES256" } }, { "classification": "emrfs-site", "properties": { "fs.s3.enableServerSideEncryption": "true" } } ]But there is a simpler, less error prone way. You can set the bucket to automatically encrypt all objects uploaded that don’t specify their an encryption by setting the Default Bucket Encryption on the bucket.
If you previously had a bucket policy requiring that the header be sent, you'll need to change it; otherwise you'll have to keep sending the header. You could simply remove the policy, but then it's possible for a client to override the default. Instead, here is a policy that checks that, if the SSE header is sent, it matches what we set our default to (in this case, AWS managed SSE):
{ "Version": "2012-10-17", "Id": "PutObjPolicy", "Statement": [ { "Sid": "DenyIncorrectEncryptionHeader", "Effect": "Deny", "Principal": "*", "Action": "s3:PutObject", "Resource": "arn:aws:s3:::YourBucket/*", "Condition": { "StringNotEquals": { "s3:x-amz-server-side-encryption": "AES256" }, "Null": { "s3:x-amz-server-side-encryption": "false" } } } ] }Once the default encryption has been set on the bucket and the policy updated then you can omit all the SSE relation options in the EMR config json.