[AWS] EMR(hadoop) - Let's migrate data!

코린이s 2022. 1. 30. 22:50

Hadoop's distcp command makes it easy to move data to the storage you want. The common cases can be summarized as follows.

HDFS -> S3
hadoop distcp \
-Dfs.s3.awsAccessKeyId=[target access key] \
-Dfs.s3.awsSecretAccessKey=[target secret key] \
/data/file s3://[target_url]

S3 -> S3
hadoop distcp \
-Dfs.s3n.awsAccessKeyId=[source access key] \
-Dfs.s3n.awsSecretAccessKey=[source secret key] \
-Dfs.s3.awsAccessKeyId=[target access key] \
-Dfs.s3.awsSecretAccessKey=[target secret key] \
s3a://[source_url] s3a://[target_url]

HDFS -> HDFS
# copy folder a to b
$ hadoop distcp hdfs:///user/a hdfs:///user/b

# copy folders a and b to c
$ hadoop distcp hdfs:///user/a hdfs:///user/b hdfs:///user/c

# pass the map memory size with the -D option
$ hadoop distcp -Dmapreduce.map.memory.mb=2048 hdfs:///user/a hdfs:///user/b

# compare file name and size, and copy only files that changed
$ hadoop distcp -update hdfs:///user/a hdfs:///user/b hdfs:///user/c

# overwrite files at the destination
$ hadoop distcp -overwrite hdfs:///user/a hdfs:///user/b hdfs:///user/c

# split the copy across 10 map tasks
$ hadoop distcp \
-m 10 \
hdfs://source-nn/dir \
hdfs://target-nn/dir
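
For reference, these options can be combined in a single run. The sketch below is illustrative only (the paths, bucket name, and key placeholders are made up, not from the commands above); it does an incremental HDFS -> S3 copy using 10 map tasks:

# incremental copy: only files whose name/size differ are re-copied,
# spread across 10 map tasks
$ hadoop distcp \
-Dfs.s3a.access.key=[access key] \
-Dfs.s3a.secret.key=[secret key] \
-update \
-m 10 \
hdfs:///data/logs s3a://[target_bucket]/data/logs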

Now let's try it out!

1) HDFS -> S3

Move the data from Hadoop cluster #1 to S3.

$ hadoop fs -ls /data/tb_user
Found 1 items
-rw-r--r--   3 root hdfsadmingroup         59 2022-01-23 09:32 /data/tb_user/0000

$ hadoop fs -cat /data/tb_user/0000
1,ParkHyunJun
2,LeeHoSeong
3,thewayhj
4,LeeNow

// enter the key id and key secret used to access S3, then run distcp
$ hadoop distcp \
-Dfs.s3.awsAccessKeyId=[s3 key id] \
-Dfs.s3.awsSecretAccessKey=[s3 secretkey] \
/data/tb_user/0000 s3a://emr-hong/data/test/tb_user/
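
To confirm the object actually landed in the bucket, the target path can also be listed straight from the cluster. A minimal sketch, assuming the fs.s3a.access.key / fs.s3a.secret.key properties (used later in this post) are how credentials are passed here:

// optional check: list the copied object in the target bucket
$ hadoop fs \
-Dfs.s3a.access.key=[s3 key id] \
-Dfs.s3a.secret.key=[s3 secretkey] \
-ls s3a://emr-hong/data/test/tb_user/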

Create a table on Hadoop cluster #2.

CREATE EXTERNAL TABLE tb_user(
    id int,
    name string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 's3a://emr-hong/data/test/tb_user';

Reading the data back shows it loads correctly.

hive> select * from tb_user;
OK
1	ParkHyunJun
2	LeeHoSeong
3	thewayhj
4	LeeNow
5	hongYooLee

2) S3 -> S3

Move the test file from the 'test-a' directory in S3 to 'test-b'.

hadoop distcp \
-Dfs.s3n.awsAccessKeyId=[source access key] \
-Dfs.s3n.awsSecretAccessKey=[source secret key] \
-Dfs.s3.awsAccessKeyId=[target access key] \
-Dfs.s3.awsSecretAccessKey=[target secret key] \
s3://emr-hong/data/test/test-a/test s3://emr-hong/data/test/test-b/test
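
If the AWS CLI is configured with access to the bucket, the destination can also be checked without going through Hadoop, for example:

// optional check: list the destination prefix
$ aws s3 ls s3://emr-hong/data/test/test-b/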

3) HDFS -> HDFS

Upload the /user/root/HIVE_TB_USER_FAVORITY_COLOR_02 file from one Hadoop cluster to the /data/ path on another Hadoop cluster.

$ hadoop distcp hdfs://ec2-x.x.x.x.ap-northeast-2.compute.amazonaws.com:8020/user/root/HIVE_TB_USER_FAVORITY_COLOR_02 hdfs://ec2-x.x.x.x.ap-northeast-2.compute.amazonaws.com:8020/data/

The copy finished in about 30 seconds; check on Hadoop cluster #2 that the data arrived correctly.

$ hadoop fs -ls /data/
Found 1 items
-rw-r--r--   3 root hdfsadmingroup        996 2022-01-23 08:42 /data/HIVE_TB_USER_FAVORITY_COLOR_02

$ hadoop fs -cat /data/HIVE_TB_USER_FAVORITY_COLOR_02
use genie_tmp;

-- 2. 선호 국내/외(uno, lowcode_id, lowcode_name, cnt)
-- 1) 선호 국내외 데이터 추출

 

In the author's case, Hadoop 2.6 was running in an on-premises environment, and that data had to be migrated to AWS.

To do this, the firewall was first opened from the Hadoop servers to AWS S3.

Then the AWS CLI was used to run an s3 command and confirm that S3 was reachable.

$ aws s3 ls
2022-03-17 19:03:17 architecture-group-emr-data
2022-03-19 12:46:59 architecture-group-emr-file
2022-03-16 18:36:30 aws-logs-746920558207-ap-northeast-2
2022-03-16 12:21:38 hong-tmp
2022-03-23 10:04:42 s3-test-endpoint
2022-03-22 16:52:07 thewayhj-buckets

Now that the firewall is open, run the copy with the hadoop distcp command.

$ hadoop distcp -Dfs.s3a.access.key=[accesskey] -Dfs.s3a.secret.key=[secretkey] -Dfs.s3a.endpoint=s3.ap-northeast-2.amazonaws.com /data/test/yyyy/000000_0 s3a://hong-tmp/

22/03/29 11:54:51 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, overwrite=false, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=100, sslConfigurationFile='null', copyStrategy='uniformsize', preserveStatus=[], preserveRawXattrs=false, atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[/data/test/yyyy/000000_0], targetPath=s3a://hong-tmp/, targetPathExists=true, filtersFile='null'}
22/03/29 11:54:51 INFO client.RMProxy: Connecting to ResourceManager at ...:8050
22/03/29 11:54:51 INFO client.AHSProxy: Connecting to Application History server at ...:10200
22/03/29 11:54:51 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 1; dirCnt = 0
22/03/29 11:54:51 INFO tools.SimpleCopyListing: Build file listing completed.
22/03/29 11:54:51 INFO tools.DistCp: Number of paths in the copy list: 1
22/03/29 11:54:52 INFO tools.DistCp: Number of paths in the copy list: 1
22/03/29 11:54:52 INFO client.RMProxy: Connecting to ResourceManager at ...:8050
22/03/29 11:54:52 INFO client.AHSProxy: Connecting to Application History server at...:10200
22/03/29 11:54:52 INFO mapreduce.JobSubmitter: number of splits:1
22/03/29 11:54:52 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1645420289747_13193
22/03/29 11:54:52 INFO impl.YarnClientImpl: Submitted application application_1645420289747_13193
22/03/29 11:54:52 INFO mapreduce.Job: The url to track the job: http://.../proxy/application_1645420289747_13193/
22/03/29 11:54:52 INFO tools.DistCp: DistCp job-id: job_1645420289747_13193
22/03/29 11:54:52 INFO mapreduce.Job: Running job: job_1645420289747_13193
22/03/29 11:54:57 INFO mapreduce.Job: Job job_1645420289747_13193 running in uber mode : false
22/03/29 11:54:57 INFO mapreduce.Job:  map 0% reduce 0%
22/03/29 11:55:02 INFO mapreduce.Job:  map 100% reduce 0%
22/03/29 11:55:04 INFO mapreduce.Job: Job job_1645420289747_13193 completed successfully
22/03/29 11:55:04 INFO mapreduce.Job: Counters: 38
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=149361
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=356
		HDFS: Number of bytes written=0
		HDFS: Number of read operations=10
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
		S3A: Number of bytes read=0
		S3A: Number of bytes written=5
		S3A: Number of read operations=10
		S3A: Number of large read operations=0
		S3A: Number of write operations=15
	Job Counters
		Launched map tasks=1
		Other local map tasks=1
		Total time spent by all maps in occupied slots (ms)=3612
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=3612
		Total vcore-milliseconds taken by all map tasks=3612
		Total megabyte-milliseconds taken by all map tasks=3698688
	Map-Reduce Framework
		Map input records=1
		Map output records=0
		Input split bytes=114
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=107
		CPU time spent (ms)=4460
		Physical memory (bytes) snapshot=508637184
		Virtual memory (bytes) snapshot=42862161920
		Total committed heap usage (bytes)=718798848
	File Input Format Counters
		Bytes Read=237
	File Output Format Counters
		Bytes Written=0
	org.apache.hadoop.tools.mapred.CopyMapper$Counter
		BYTESCOPIED=5
		BYTESEXPECTED=5
		COPY=1

Checking with the aws s3 command confirms the file was copied over successfully.

$ aws s3 ls hong-tmp                                         
2022-03-29 11:55:02          5 000000_0
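
The contents can also be spot-checked by streaming the object to stdout (assuming the AWS CLI has read access to the bucket):

// print the copied object for a quick content check
$ aws s3 cp s3://hong-tmp/000000_0 -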

A few errors that came up while migrating to S3 are shared below for reference when troubleshooting.

// s3 deprecated error

- Cause: the s3 scheme (S3FileSystem) is no longer supported; the warning says to use S3AFileSystem (s3a) or NativeS3FileSystem (s3n) instead.

- Fix: change the S3 URL scheme from s3 to s3a or s3n.

22/03/29 11:18:10 WARN fs.FileSystem: S3FileSystem is deprecated and will be removed in future releases. Use NativeS3FileSystem or S3AFileSystem instead.
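
In practice only the URL scheme in the distcp target needs to change; a sketch of the fix (the exact command is illustrative and the other options are elided):

// before: deprecated s3 scheme
$ hadoop distcp ... /data/test/yyyy/000000_0 s3://hong-tmp/

// after: S3A connector
$ hadoop distcp ... /data/test/yyyy/000000_0 s3a://hong-tmp/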

// unknown error

- Cause: the endpoint was entered incorrectly, so the host could not be found.

- Fix: correct the endpoint (change [aws region] to 's3.[aws region].amazonaws.com').

22/03/29 12:11:52 INFO http.AmazonHttpClient: Unable to execute HTTP request: hong-tmp.ap-northeast-2: unknown error
java.net.UnknownHostException: hong-tmp.ap-northeast-2: unknown error
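
As the exception shows, the client prepends the bucket name to the endpoint, so passing only the region makes it try to resolve 'hong-tmp.ap-northeast-2' as a hostname. A sketch of the change:

// before: region only, cannot be resolved
-Dfs.s3a.endpoint=ap-northeast-2

// after: full regional S3 endpoint
-Dfs.s3a.endpoint=s3.ap-northeast-2.amazonaws.com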

// Invalid arguments

- Cause: invalid arguments were used, causing the error.

- Fix: replace the invalid arguments with valid ones.

- Before:

-Dfs.s3a.awsSecretAccessKey
-Dfs.s3a.awsAccessKeyId

- After:

-Dfs.s3a.access.key
-Dfs.s3a.secret.key

22/03/29 13:18:27 ERROR tools.DistCp: Invalid arguments:
java.io.InterruptedIOException: doesBucketExist on hong-tmp: com.amazonaws.AmazonClientException: No AWS Credentials provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider : com.amazonaws.AmazonClientException: Unable to load credentials from Amazon EC2 metadata service
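
With the property names corrected, the S3A connector picks up the credentials. A before/after sketch (the remaining options and paths are elided):

// fails: these property names are not recognized by the S3A connector
$ hadoop distcp -Dfs.s3a.awsAccessKeyId=[accesskey] -Dfs.s3a.awsSecretAccessKey=[secretkey] ...

// works: S3A reads fs.s3a.access.key / fs.s3a.secret.key
$ hadoop distcp -Dfs.s3a.access.key=[accesskey] -Dfs.s3a.secret.key=[secretkey] ...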

// 400 error

- Cause: a required parameter was not passed.

- Fix: pass the required parameter (in the author's case, -Dfs.s3a.endpoint was missing; adding it made the job run normally).

Invalid arguments: doesBucketExist on hong-tmp: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: 645ETJ9DGHZ05TEH), S3 Extended Request ID: 2JZZZCTlPc2TRvym628zE0dhUstMQ3ks0b26eeRsLbxtGfbn2TG1gewjzUHQq31GUf3h09rbs1s=: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: 645ETJ9DGHZ05TEH)

 
