Wednesday, February 6, 2019

Greenplum - Kafka integration (gpkafka)

----------------------------------------------
0. gpkafka script
----------------------------------------------
1) Original author: Youngho Park (ypark.pivotal.io)
2) Kafka & Greenplum Docker author: Jonghyun Hong (jhong@pivotal.io)


----------------------------------------------
1. Building the Kafka Docker container
----------------------------------------------
(1) Pull the Docker image
docker pull centos:6.8
(2) Check the Docker image
docker image ls
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
centos              6.8                 e54faac158ff        5 weeks ago         195MB
(3) Create the Docker container (kafkalab)
docker run --name kafkalab --hostname testdk -it e54faac158ff /bin/bash
(4) Install required packages
yum install -y net-tools which openssh-clients openssh-server less zip unzip iproute.x86_64 java wget
(5) Change the root password (optional)
passwd
(6) Generate SSH host keys
ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key
ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key
(7) Start the ssh daemon (must be run manually every time the Docker container is restarted)
/usr/sbin/sshd
(8) Check /etc/hosts
172.17.0.2      testdk
(9) Change the hostname in /etc/sysconfig/network
HOSTNAME=testdk
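(Reference) sshd is not restarted automatically when the container restarts (see step (7)). A minimal sketch for bringing kafkalab back up later, using the names above:
docker start kafkalab
docker exec kafkalab /usr/sbin/sshd
docker exec -it kafkalab /bin/bash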


--------------------------------------------------
2. ZooKeeper installation
--------------------------------------------------
(1) Download
cd /root
wget http://apache.mirror.cdnetworks.com/zookeeper/zookeeper-3.4.13/zookeeper-3.4.13.tar.gz
(2) Install and configure
cd /usr/local/
tar xvfz /root/zookeeper-3.4.13.tar.gz
ln -s zookeeper-3.4.13 zookeeper
     
mkdir /zdata
echo 1 > /zdata/myid
cd /usr/local/zookeeper/conf/
cp zoo_sample.cfg zoo.cfg
vi zoo.cfg
==>
dataDir=/zdata
server.1=localhost:2888:3888
--------------------------------------------------
3. Kafka installation
--------------------------------------------------
(1) Download
cd /root
wget http://apache.mirror.cdnetworks.com/kafka/2.0.0/kafka_2.12-2.0.0.tgz
(2) Install and configure
cd /usr/local
tar xvfz /root/kafka_2.12-2.0.0.tgz
ln -s kafka_2.12-2.0.0/ kafka
mkdir /kdata1 /kdata2
vi /usr/local/kafka/config/server.properties
==>
broker.id=1
log.dirs=/kdata1,/kdata2
zookeeper.connect=localhost:2181/greenplum-kafka
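(Reference) The load configuration in section 6 step (8) uses BROKERS: testdk:9092, so the broker must be reachable from the Greenplum container under the testdk hostname. If clients cannot connect by that name, the listener can be pinned explicitly in server.properties; this is an assumption for a plaintext, single-broker setup, not something the default configuration requires:
listeners=PLAINTEXT://testdk:9092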
(3) Create test data
vi /tmp/sample_data.csv
==>
"1313131","12","1313.13"
"3535353","11","761.35"
"7979797","10","4489.00"
"7979797","11","18.72"
"3535353","10","6001.94"
"7979797","12","173.18"
"1313131","10","492.83"
"3535353","12","81.12"
"1313131","11","368.27"


----------------------------------------------
4. Building the Greenplum Docker container
----------------------------------------------
(1) Check the Docker image
docker image ls
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
centos              6.8                 e54faac158ff        5 weeks ago         195MB
(2) Create the Docker container (dblab)
docker run --name dblab --hostname dwserver -it e54faac158ff /bin/bash
(3) Install required packages
yum install -y net-tools which openssh-clients openssh-server less zip unzip iproute.x86_64
(4) Change the root password (optional)
passwd
(5) Generate SSH host keys
ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key
ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key
(6) Start the ssh daemon (must be run manually every time the Docker container is restarted)
/usr/sbin/sshd
(7) Check /etc/hosts
172.17.0.3      dwserver
172.17.0.2      testdk           <== add this entry
(8) Change the hostname in /etc/sysconfig/network
HOSTNAME=dwserver
(9) Add the following to /etc/security/limits.conf
* soft nofile 65536
* hard nofile 65536
* soft nproc 131072
* hard nproc 131072
(10) Create the gpadmin user and group
groupadd -g 501 gpadmin
useradd -g 501 -u 501 -m -d /home/gpadmin -s /bin/bash gpadmin
chown -R gpadmin:gpadmin /home/gpadmin
echo gpadmin | passwd  gpadmin --stdin
(11) Verify ssh connectivity to testdk
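A simple connectivity check, assuming the root password set on the kafkalab container in section 1:
ssh root@testdk hostname
The command should print testdk; it also confirms the /etc/hosts entry added in step (7).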


--------------------------------------------------
5. Greenplum installation
--------------------------------------------------
(1) Upload the product (from the host prompt)
docker cp greenplum-db-5.11.3-rhel6-x86_64.zip dblab:/root
(2) Unzip
unzip greenplum*
(3) Install the product
./greenplum-db-5.11.3-rhel6-x86_64.bin
(4) Create the data directories
mkdir -p /data/primary /data/master
chown -R gpadmin:gpadmin /data
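(Reference) Steps (2) through (4) are run as root inside the dblab container, in /root. If the extracted installer is not executable after unzip (a minor assumption about the archive permissions), make it executable before step (3); accepting the installer defaults is expected to leave the product reachable at /usr/local/greenplum-db, which section 6 relies on:
chmod +x greenplum-db-5.11.3-rhel6-x86_64.bin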


--------------------------------------------------
6. Greenplum initialization
--------------------------------------------------
(1) Connect to the Docker container
docker exec -it dblab /bin/bash
If the container is not running, run docker start dblab first.
(2) Switch to gpadmin
su - gpadmin
(3) Set up the configuration file
source /usr/local/greenplum-db/greenplum_path.sh
cp /usr/local/greenplum-db/docs/cli_help/gpconfigs/gpinitsystem_config .
vi gpinitsystem_config
==>
MASTER_HOSTNAME=dwserver
declare -a DATA_DIRECTORY=(/data/primary /data/primary /data/primary)
(4) Set up the install host file
vi /tmp/host
==>
dwserver
(5) Initialize
gpssh-exkeys -f /tmp/host
gpinitsystem -c gpinitsystem_config -h /tmp/host
(6) Configure the gpadmin environment
vi /home/gpadmin/.bash_profile
==>
export MASTER_DATA_DIRECTORY=/data/master/gpseg-1
source /usr/local/greenplum-db/greenplum_path.sh
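Apply the profile to the current session before moving on (or log out and back in as gpadmin), so that MASTER_DATA_DIRECTORY is set for the management utilities:
source /home/gpadmin/.bash_profile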
(7) Set up the test database
createdb testdb
psql testdb -c "CREATE TABLE data_from_kafka( customer_id int8, expenses decimal(9,2), tax_due decimal(7,2)) distributed by (customer_id)"
(8) Create the Kafka load configuration file
vi /home/gpadmin/loadcfg.yaml
==>
DATABASE: testdb
USER: gpadmin
HOST: localhost
PORT: 5432
KAFKA:
   INPUT:
     SOURCE:
        BROKERS: testdk:9092
        TOPIC: topic_for_gpkafka
     COLUMNS:
        - NAME: cust_id
          TYPE: int
        - NAME: __IGNORED__
          TYPE: int
        - NAME: expenses
          TYPE: decimal(9,2)
     FORMAT: csv
     ERROR_LIMIT: 125
   OUTPUT:
     TABLE: data_from_kafka
     MAPPING:
        - NAME: customer_id
          EXPRESSION: cust_id
        - NAME: expenses
          EXPRESSION: expenses
        - NAME: tax_due
          EXPRESSION: expenses * .0725
   COMMIT:
     MINIMAL_INTERVAL: 10
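In this configuration the second CSV column (the month) is dropped via the __IGNORED__ column name, and tax_due is never read from Kafka: it is computed inside Greenplum as 7.25% of expenses (for example, 1313.13 * .0725 ≈ 95.20). A quick syntax check of the file, assuming the PyYAML module bundled with Greenplum's Python is available:
python -c "import yaml; print(yaml.safe_load(open('/home/gpadmin/loadcfg.yaml')))"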


-------------------------------------------------
7. Starting Kafka
--------------------------------------------------
(1) Start ZooKeeper
/usr/local/zookeeper/bin/zkServer.sh start
/usr/local/zookeeper/bin/zkServer.sh status
(2) Start Kafka
/usr/local/kafka/bin/kafka-server-start.sh -daemon /usr/local/kafka/config/server.properties
(Reference) To stop:
/usr/local/kafka/bin/kafka-server-stop.sh
/usr/local/zookeeper/bin/zkServer.sh stop
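(Reference) A quick check that both daemons are listening, using the net-tools package installed in section 1:
netstat -ntlp | grep -E '2181|9092'
Both the ZooKeeper client port (2181) and the Kafka broker port (9092) should appear in LISTEN state.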


--------------------------------------------------
8. Creating and checking a Kafka topic
--------------------------------------------------
(1) Create the topic
/usr/local/kafka/bin/kafka-topics.sh --zookeeper localhost:2181/greenplum-kafka --topic topic_for_gpkafka --partitions 1 --replication-factor 1 --create
(2) Verify the topic
/usr/local/kafka/bin/kafka-topics.sh --list --zookeeper localhost:2181/greenplum-kafka
(Reference) To delete the topic: /usr/local/kafka/bin/kafka-topics.sh --zookeeper localhost:2181/greenplum-kafka --topic topic_for_gpkafka --delete
(3) Produce the data
/usr/local/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic topic_for_gpkafka < /tmp/sample_data.csv
(4) Verify the data
/usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic topic_for_gpkafka --from-beginning
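The console consumer keeps waiting for new messages; press Ctrl-C to exit, or cap the read to the size of the sample file, e.g.:
/usr/local/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic topic_for_gpkafka --from-beginning --max-messages 9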


--------------------------------------------------
9. Loading the data
--------------------------------------------------
(1) One-time run (quit at end of stream)
gpkafka load --quit-at-eof ./loadcfg.yaml
(2) Continuous run (wait for new data)
gpkafka load ./loadcfg.yaml
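(Reference) To verify the load, query the target table from the dblab container as gpadmin; 9 rows are expected from the sample file, with tax_due filled in by the MAPPING expression:
psql testdb -c "SELECT count(*) FROM data_from_kafka"
psql testdb -c "SELECT * FROM data_from_kafka ORDER BY customer_id"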

--------------------------------------------------
10. gpkafka video
--------------------------------------------------
https://www.youtube.com/watch?v=YqTrLb4sqmU
