[EMR] ClassCastException on an ORC Table


코린이s 2022. 8. 3. 15:07


We migrated our data to AWS, and to read it I recreated the tables by copying the table schema from the existing Hadoop cluster as-is.

I looked up the table definition as follows:

hive> show create table tb_test;
CREATE EXTERNAL TABLE `tb_test`(
  `date` string)
PARTITIONED BY (
  `yyyy` int,
  `mm` int,
  `dd` int,
  `hh` int)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  's3://.../'
TBLPROPERTIES (
  'orc.bloom.filter.columns'='songid,date');

I copied that CREATE statement unchanged and ran it on AWS EMR (Hadoop). The resulting table definition looked like this:

hive> show create table tb_test;
CREATE EXTERNAL TABLE `tb_test`(
  `date` string)
PARTITIONED BY (
  `yyyy` int,
  `mm` int,
  `dd` int,
  `hh` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'=',',
  'line.delim'='\n',
  'serialization.format'=',')
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  's3://.../'
TBLPROPERTIES (
  'orc.bloom.filter.columns'='songid,date');

I then recovered the partitions so that the table would pick up the migrated data:

hive> msck repair table tb_test;

Reading the data then failed with a cast error:

hive> select * from tb_test where yyyy=2022 and mm=08 and dd=02 limit 10;
org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.BinaryComparable
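
A quick way to see why this cast fails is to compare the SerDe and the input format that Hive registered for the table. The sketch below uses `DESCRIBE FORMATTED`; the partition values in the second statement (in particular `hh=0`) are assumptions for illustration, not values taken from the original cluster.

```sql
-- Print the storage metadata Hive registered for the table.
-- In the "Storage Information" section, compare "SerDe Library"
-- with "InputFormat": a text SerDe paired with OrcInputFormat is a mismatch.
DESCRIBE FORMATTED tb_test;

-- Partitions carry their own storage metadata, so one can be checked as well.
-- Assumption: this partition exists; hh=0 is a hypothetical value.
DESCRIBE FORMATTED tb_test PARTITION (yyyy=2022, mm=8, dd=2, hh=0);
```

If the SerDe line shows `LazySimpleSerDe` while the input format is `OrcInputFormat`, the table is in the state shown in the definition above.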

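The cause turned out to be the format settings used when the table was created: the recreated table pairs LazySimpleSerDe, a text SerDe, with OrcInputFormat, so the SerDe receives OrcStruct rows that it cannot cast. To read the data correctly, I recreated the table with 'STORED AS ORC' as the format, as below.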

CREATE EXTERNAL TABLE `tb_test`(
  `date` string)
PARTITIONED BY (
  `yyyy` int,
  `mm` int,
  `dd` int,
  `hh` int)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS ORC
LOCATION
  's3://../';

Checking the table definition again shows that it is now registered in ORC format. (We had been using Hive 2, and Hive 3 seems to differ in how it displays and creates table format information.)

hive> show create table tb_test;
CREATE EXTERNAL TABLE `tb_test`(
  `date` string)
PARTITIONED BY (
  `yyyy` int,
  `mm` int,
  `dd` int,
  `hh` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
  'field.delim'=',',
  'line.delim'='\n',
  'serialization.format'=',')
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  's3://.../'
TBLPROPERTIES (
  'orc.bloom.filter.columns'='songid,date');

After recovering the partitions again, queries return results normally.
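
Put together, the recovery steps might look like the sketch below. It assumes `tb_test` is an external table that is not configured to purge data on drop, so dropping it removes only the metastore entry and leaves the files in S3 in place. Dropping and recreating is used here rather than altering the table, because partitions that were already registered keep their own SerDe setting.

```sql
-- Assumption: external table, no purge-on-drop; the S3 data is left untouched.
DROP TABLE IF EXISTS tb_test;

-- Re-run the CREATE EXTERNAL TABLE ... STORED AS ORC statement at this point.

-- Register the partitions again, now with ORC storage metadata.
MSCK REPAIR TABLE tb_test;

-- Confirm the partitions are visible, then retry the original query.
SHOW PARTITIONS tb_test;
SELECT * FROM tb_test WHERE yyyy=2022 AND mm=8 AND dd=2 LIMIT 10;
```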

Done!
