2017년 9월 21일 목요일

GPDB 장애대응 - gpstart 실패

1. GPDB - gpstart 실패 시 확인 사항

    1) pg_hba.conf 파일 수정 여부
        -  ip 등록시 class 적용하지 않았을 경우
        - ex ) 0.0.0.0 으로 등록 할 경우 장애 발생 => 0.0.0.0/0 으로 등록

$ cd $MASTER_DATA_DIRECTORY
$ vi pg_hba.conf
# TYPE  DATABASE    USER        CIDR-ADDRESS          METHOD
# Type : local : 마스터 노드에서 접속, host : IP로 접속
# Database : all 은 모든 데이터베이스 접속 허용, 특정 db만 접근할 경우 database 명 기입.
# user : all / user명
# CIDR-ADDRESS: IP/0 : 전부 허용,  IP/8: A class 허용, IP/16 B Class 허용
#                       IP/24 C Class 허용, IP/32 IP 일치시 허용
# METHOD : md5 : 비밀번호 필요, trust : 비밀번호 필요치 않음(비밀번호가 틀리더라도 접속 가능 함)
# 예제
local   all   gpadmin      trust
host    all   all  172.16.0.0/16   md5  # 172.16.X.X B class, 패스워드 필요
host    bmt   all  192.168.10.0/28 trust# bmt DB에 모든 user가 192.168.10.X C class로
                                        # 패스워드 없이 접속 가능함
host    bmt   u1234 192.168.10.101/32  md5 # bmt 데이터 베이스에 u1234 / 해당IP에 대해서 
                                   # 패스워드가 일치해야지만 접속할 수 있음

host    bmt   u1234 192.168.10.101     md5  ###==> 장애 유발 "/" Class 지정되어 있지 않음.  


   2) postgresql.conf 파일 잘 못 적용된 경우
        - 파라미터의 값이 잘 못 저장되었거나, value 임계치 보다 높을 경우 발생
        - 예시
$ cd $MASTER_DATA_DIRECTORY
$ vi postgresql.conf
gp_resqueue_memory_policy=eager_freea   ## value 에러 발생 eager_free 으로 등록해야 함.

 
    3) Log 파일 확인
        - tail -f gpdb-yyyy-mm-dd-xxxxxx.csv
        - tail -f startup.log
$ cd $MASTER_DATA_DIRECTORY
$ cd pg_log
[gpadmin@mdw pg_log]$ tail -f gpdb-2017-09-19_061729.csv

2017-09-19 06:17:29.802815 KST,,,p65274,th708482848,,,,0,,,seg-1,,,,,"LOG","F0000","invalid IP mask ""md5"": Name or service not known",,,,,"line 88 of configuration file ""/data/master/xpseg-1/pg_hba.conf""",,0,,"hba.c",992,
2017-09-19 06:17:29.802913 KST,,,p65274,th708482848,,,,0,,,seg-1,,,,,"FATAL","F0000","could not load pg_hba.conf",,,,,,,0,,"postmaster.c",1509,
^C
[gpadmin@mdw pg_log]$ tail -f startup.log

2. GPDB - 특정 세그먼트가 startup 되어서 대기하는 경우 
     1) GPDB 4.3. old 버전에서 발생되었으나, 현재 Fix 되었음.
     2) 프로세스 확인 (마스터 및 세그먼트 노드)


mdw $ps -ef | grep start | grep StrictHostKeyChecking --프로세스가 있을경우 startup 되지 않음.
sdwx$ps -ef | grep primarymirror -- 세그먼트가 아직 startup 되지 않음을 의미 
    
    3) 세그먼트 노드에서 확인

sdwx $ ps -ef | grep primarymirror 
gpadmin  791028 790943  0 05:40 ?        00:00:00 /usr/local/greenplum-db/./bin/gp_primarymirror -h localhost -p 1028 -i /data1/primary/gpseg19/gp_pmtransition_args
* Check the tail of the segment logs to confirm that it is waiting on file transfer from primary to mirror:

## sdw x segment db log 
2016-03-15 16:03:26.104678 UTC,,,p467438,th1170523936,,,,0,,,seg-1,,,,,"LOG","00000","recovering pg_distributedlog",,,,,,,0,,"slru.c",1363,
2016-03-15 16:04:12.826187 UTC,,,p467362,th1170523936,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2058,
2016-03-15 16:05:03.116920 UTC,,,p467362,th1170523936,,,,0,,,seg-1,,,,,"LOG","00000","TransitiontoPrimary: waiting for filerep startup",,,,,,,0,,"primary_mirror_mode.c",2058

## 세그먼트 Primary process 확인 
gpadmin  595121      1  0 16:15 ?        00:00:00 /usr/local/GP-4.3.7.1/bin/postgres -D /data1/primary/gpseg1064 -p 1025 -b 2122 -z 1152 --silent-mode=true -i -M quiescent -C 1064
gpadmin  595129 595121  0 16:15 ?        00:00:00 postgres: port  1025, logger process                               
gpadmin  595156 595055  0 16:15 ?        00:00:00 /usr/local/greenplum-db/./bin/gp_primarymirror -h localhost -p 1025 -i /data1/primary/gpseg1064/gp_pmtransition_args
gpadmin  595158 595121  0 16:15 ?        00:00:00 postgres: port  1025, filerep transition process                               
gpadmin  595160 595121  0 16:15 ?        00:00:00 postgres: port  1025, primary process  
gpadmin  595164 595160  0 16:15 ?        00:00:00 postgres: port  1025, primary receiver ack process                               
gpadmin  595166 595160  0 16:15 ?        00:00:00 postgres: port  1025, primary sender process                               
gpadmin  595168 595160  0 16:15 ?        00:00:00 postgres: port  1025, primary consumer ack process                               
gpadmin  595169 595160  0 16:15 ?        00:00:00 postgres: port  1025, primary recovery process                               
gpadmin  595170 595160  0 16:15 ?        00:00:00 postgres: port  1025, primary verification process                               
gpadmin  595329 595273  0 16:23 pts/2    00:00:00 grep 1025 
* Issue a kill -11 on the primary process to reset it. Please be careful to not issue it on postmaster, that would cause the segment to fail.

    4) 해당 포트 이슈 프로세스 Kill

       sdwX $ kill -11  595160


    5) 정상적으로 Database startup 수행 됨.
         - 만약 여러개의 프로세스가 걸릴 경우에는 동일한 패턴으로 처리
         - gpdb 4.3.10.0 이상 버전에서는 Fix 됨.




GPDB 5 소개

1. GPDB 구성 환경
 

2017년 9월 13일 수요일

Greenplum Master Failover


1.      Greenplum master failover


1)        Summary


§  This is for Greenplum database of HA for soft only not DCA

§  Limitation:

-          The script checks only VM/Node status of Greenplum database master as ping not gpdb process.

§  Configuration:

-          If the Greenplum database is installed on cloud as vSphere, need to disable vmotion.

2)        File list


 

File Path
Filename
Description
node
/usr/local/bin
 
gpfailover.sh
Daemon for GPDB Master HA on smdw
smdw
gpfailovershutdown.sh
Shutdown Daemon
smdw
vip_env.sh
Vip configuration
Mdw, smdw
vip_start.sh
Start vip
mdw, smdw
vip_stop.sh
Stop vip
mdw, smdw
/etc/rc.d/init.d/
vip
vip service
mdw, smdw
Gpfailover
gpfailover service
smdw
/etc/rc.d/rc3.d/
S99gpfailover
Auto Damon Start when system reboot
smdw

 

3)        Source


 

n  gpfailover.sh

#!/bin/bash
 
## get vip config
. /usr/local/bin/vip_env.sh
 
. /usr/local/greenplum-db/greenplum_path.sh
export PGPORT=5432
 
## killing previous gpfailover daemon
NO_OF_PROCESS=`ps -ef | grep gpfailover.sh | grep -v grep | grep -v status | grep -v stop | wc -l`
 
if [ $NO_OF_PROCESS -gt 2 ]; then
    exit
fi
 
while true
do
    POSTGRESCNT=`ps -ef | grep postgres | wc -l`
    if [ $POSTGRESCNT -gt 8 ]
        then
            echo "${GPMDW} was activated !!!"
            exit 0
        else
        echo "${GPSMDW} is standby !!!"
    fi
 
    CNT_A=`ping -c 6 -i 1 ${GPMDW} | grep ", 0% packet loss" | wc -l`
    if [ $CNT_A -eq 1 ]
    then
        echo "alive"
    else
        echo "run gpactivatestandby!!!"
 
        logger -i -p user.emerg "GP:WARNING : GPDB MASTER VM IS NOT AVAILABLE !!!"
        logger -i -p user.emerg "GP:INFO : GPDB STANDBY MASTER IS STARTING TO FAILOVER !!!"
 
        su - gpadmin -c "gpactivatestandby -d /data/master/gpseg-1 -a -q"
 
        ### Checking the gpactivatestandby.log
        cd /home/gpadmin/gpAdminLogs
        SUCCESS_FG=`ls -lrt gpactivatestandby_*.log | tail -1  | awk '{print $9}' | xargs tail -30 | grep "The activation of the standby master has completed successfully" | wc -l`
        if [ ${SUCCESS_FG} -eq 1 ]
        then
             logger -i -p user.emerg "GP:INFO : The activation of the standby master has completed successfully "
        else
             logger -i -p user.emerg "GP:ERROR : Failed the activation of the standby master !!!"
             exit 1
        fi
 
        ifconfig ${VIP_INTERFACE} ${VIP}  netmask 255.255.255.0 up
 
        ### Checking VIP
        VIP_FG=`ifconfig | grep '{VIP}' | wc -l`
        if [ ${VIP_FG} -eq 1 ]
        then
             logger -i -p user.emerg "GP:INFO : Virtual IP( ${VIP} ) is up  "
        else
             logger -i -p user.emerg "GP:ERROR : Failed to start Virtual IP( ${VIP} ) "
             exit 1
        fi
 
        ### arping
        arping -f -w 10 -s ${VIP}  -U ${VIP_GW} -I ${VIP_INTERFACE}
        logger -i -p user.emerg "GP:IFNO : Executed arping."
        logger -i -p user.emerg "GP:INFO : Greenplum master failover has completed successfully "
        exit 0
    fi
 
    sleep 60
done
exit 0

 

n  gpfailovershutdown.sh

ps -ef | grep -i gpfailover.sh | grep -v grep | grep -v service| awk '{print $2}' | xargs kill -11

 

 

n  vip_env.sh

GPMDW=mdw
GPSMDW=smdw
VIP_INTERFACE=eth3:0
VIP=172.16.150.128
VIP_GW=172.16.150.1

 

 

n  vip_start.sh

#!/bin/bash
 
## get vip configure
. /usr/local/bin/vip_env.sh
 
ifconfig  ${VIP_INTERFACE} ${VIP}  netmask 255.255.255.0 up
arping -f -w 10 -s ${VIP}  -U ${VIP_GW} -I ${VIP_INTERFACE}

 

n  vip_stop.sh

#!/bin/bash
 
## get vip config
. /usr/local/bin/vip_env.sh
 
ifconfig ${VIP_INTERFACE} down

 

n  gpfailover

## cat /etc/rc.d/init.d/gpfailover
#!/bin/bash
 
## get vip config
. /usr/local/bin/vip_env.sh
 
ifconfig ${VIP_INTERFACE} down
[gpadmin@smdw bin]$ cat /etc/rc.d/init.d/gpfailover
case "$1" in
    start) #
        echo -n "Starting GPDB Auto failover Daemon: "
        nohup /usr/local/bin/gpfailover.sh > /dev/null 2>&1 &
        sleep 0.1
        echo
        $0 status
        ;;
 
    stop)
        echo -n "Sutting GPDB Auto failover Daemon: "
        /usr/local/bin/gpfailovershutdown.sh > /dev/null 2>&1
        sleep 0.1
        echo
        $0 status
        ;;
 
    restart)
        $0 stop
        $0 start
        ;;
 
    status)
        NO_OF_PROCESS=`ps -ef | grep gpfailover | grep -v grep | grep -v status | grep -v stop | wc -l`
        if [ $NO_OF_PROCESS -eq 0 ]; then
            echo "gpfailover daemon is not running"
        else
            echo "gpfailover daemon is running"
        fi
        ;;
    *)
    echo "Usage : $0 {start|stop|restart|status}"
    exit 1
esac
exit 0

 

n  S99gpfailover

## cat /etc/rc.d/rc3.d/S99gpfailover
service gpfailover start

 

4)        File copy & location on Server


n   

 
[Master and Standby Master]
tar xf gpdb_master_failover.tar
cd bin
chmod +x *
vi vip_env.sh   ### editing
cp *.sh /usr/local/bin/
cp vip /etc/rc.d/init.d/
cp gpfailover /etc/rc.d/init.d/
 
[Only Standby Master]
cp S99gpfailover /etc/rc.d/rc3.d/
 

5)        Standby Master configuration of bash_profile of gpadmin


n  Bash_profile

# su - gpadmin
$ vi ~/.bash_profile
 
## added below lines
. /usr/local/greenplum-db/greenplum_path.sh
export MASTER_DATA_DIRECTORY=/data/master/gpseg-1
export PGPORT=5432

2.      Greenplum master failback


1)        Checking Standby Master on smdw


$ ssh smdw
$ gpstate -f
20170824:04:21:41:006303 gpstate:smdw:gpadmin-[INFO]:-Starting gpstate with args: -f
20170824:04:21:41:006303 gpstate:smdw:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 4.3.16.0 build 1'
20170824:04:21:41:006303 gpstate:smdw:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.2.15 (Greenplum Database 4.3.16.0 build 1) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on Jul 28 2017 17:17:49'
20170824:04:21:41:006303 gpstate:smdw:gpadmin-[INFO]:-Obtaining Segment details from master...
20170824:04:21:41:006303 gpstate:smdw:gpadmin-[INFO]:-Standby master instance not configured
20170824:04:21:41:006303 gpstate:smdw:gpadmin-[INFO]:--------------------------------------------------------------
20170824:04:21:41:006303 gpstate:smdw:gpadmin-[INFO]:--pg_stat_replication
20170824:04:21:41:006303 gpstate:smdw:gpadmin-[INFO]:--------------------------------------------------------------
20170824:04:21:41:006303 gpstate:smdw:gpadmin-[INFO]:-No entries found.
20170824:04:21:41:006303 gpstate:smdw:gpadmin-[INFO]:--------------------------------------------------------------

 

2)        Checking for filespace directory /data/master/gpseg-1 on mdw


$ ssh mdw
[gpadmin@smdw ~]$ gpinitstandby -s mdw
20170824:04:24:06:006838 gpinitstandby:smdw:gpadmin-[INFO]:-Validating environment and parameters for standby initialization...
20170824:04:24:07:006838 gpinitstandby:smdw:gpadmin-[INFO]:-Checking for filespace directory /data/master/gpseg-1 on mdw
20170824:04:24:07:006838 gpinitstandby:smdw:gpadmin-[ERROR]:-Filespace directory already exists on host mdw
20170824:04:24:07:006838 gpinitstandby:smdw:gpadmin-[ERROR]:-Failed to create standby
20170824:04:24:07:006838 gpinitstandby:smdw:gpadmin-[ERROR]:-Error initializing standby master: master data directory exists
[gpadmin@smdw ~]$

 

3)        Checking for filespace directory /data/master/gpseg-1 on mdw


$ ssh smdw
[gpadmin@smdw ~]$ gpinitstandby -s mdw
20170824:04:24:06:006838 gpinitstandby:smdw:gpadmin-[INFO]:-Validating environment and parameters for standby initialization...
20170824:04:24:07:006838 gpinitstandby:smdw:gpadmin-[INFO]:-Checking for filespace directory /data/master/gpseg-1 on mdw
20170824:04:24:07:006838 gpinitstandby:smdw:gpadmin-[ERROR]:-Filespace directory already exists on host mdw
20170824:04:24:07:006838 gpinitstandby:smdw:gpadmin-[ERROR]:-Failed to create standby
20170824:04:24:07:006838 gpinitstandby:smdw:gpadmin-[ERROR]:-Error initializing standby master: master data directory exists
[gpadmin@smdw ~]$
 
## mdw
[gpadmin@smdw ~]$ ssh mdw
Last login: Thu Aug 24 03:52:02 2017 from smdw
[gpadmin@mdw ~]$ cd /data/master/gpseg-1
[gpadmin@mdw gpseg-1]$ cd ..
[gpadmin@mdw master]$ mv gpseg-1 gpseg-1.old
[gpadmin@mdw master]$ exit
logout
Connection to mdw closed.
 

 

4)        Initialize standby master on smdw


$ ssh smdw
[gpadmin@smdw ~]$ gpinitstandby -s mdw
Do you want to continue with standby master initialization? Yy|Nn (default=N):
> y
20170824:04:26:16:007218 gpinitstandby:smdw:gpadmin-[INFO]:-Successfully created standby master on mdw
[gpadmin@smdw ~]$

5)        Failback on mdw


·         Shutdown master instance on smdw

·         Activate master instance on mdw

## SMDW
$ ssh smdw
[gpadmin@smdw ~]$ gpstop -m
20170824:04:30:29:008291 gpstop:smdw:gpadmin-[INFO]:-Starting gpstop with args: -m
 
Continue with master-only shutdown Yy|Nn (default=N):
> y
20170824:04:30:33:008291 gpstop:smdw:gpadmin-[INFO]:-Terminating processes for segment /data/master/gpseg-1
[gpadmin@smdw ~]$
 
$ ssh mdw
[gpadmin@mdw ~]$ gpactivatestandby -d $MASTER_DATA_DIRECTORY
Do you want to continue with standby master activation? Yy|Nn (default=N):
> y
20170824:04:32:59:057251 gpactivatestandby:mdw:gpadmin-[INFO]:-found standby postmaster process
20170824:04:33:00:057251 gpactivatestandby:mdw:gpadmin-[INFO]:-mdw is now the new primary master.
20170824:04:33:00:057251 gpactivatestandby:mdw:gpadmin-[INFO]:-You will need to update your user access mechanism to reflect
 [gpadmin@mdw ~]$

6)        Remove old version filespace on smdw


·         Remove filespace directory /data/master/gpseg-1

## SMDW
$ ssh smdw
[gpadmin@smdw ~]$ cd $MASTER_DATA_DIRECTORY
 [gpadmin@smdw master]$ mv gpseg-1 gpseg-1.old

7)        Initilize Standby master on smdw


·         Remove filespace directory /data/master/gpseg-1

# MDW
$ ssh mdw
[gpadmin@mdw ~]$ gpinitstandby -s smdw
20170824:04:43:32:066222 gpinitstandby:mdw:gpadmin-[INFO]:-Validating environment and parameters for standby initialization...
..
Do you want to continue with standby master initialization? Yy|Nn (default=N):
> y
20170824:04:43:52:066222 gpinitstandby:mdw:gpadmin-[INFO]:-Successfully created standby master on smdw
[gpadmin@mdw ~]$

8)        VIP Failback


·         vip down on smdw

·         vip up on mdw

$ ssh smdw
$ su –
# service vip stop
 
# ssh mdw
# service vip start

9)        Connection test with pgadmin (client tool)

Greenplum 6 마스터 Port 변경

Greenplum 마스터 노드 Port 변경 - 이미 설치된 Greenplum 클러스터에서 마스터 port 변경 절차 - https://knowledge.broadcom.com/ external /article?articleNumber= 296...