Trying Out Keras on Google Colaboratory's Free GPU
Running Keras on the GPU, with TensorFlow as the backend.
05 March 2018
Trying Out Keras on Colab's GPU
- Note: this takes the multi-GPU example from the official Keras documentation and modifies it to run on Colab: https://keras.io/utils/#multi_gpu_model
- Colab's performance is decent: you currently get about 13 GB of memory and an Intel(R) Xeon(R) CPU @ 2.30GHz.
- The GPU is a Tesla K80, and multiple GPUs are not supported.
Viewing Colaboratory's memory info
!cat /proc/meminfo
MemTotal: 13341960 kB
MemFree: 8533256 kB
MemAvailable: 12413480 kB
Buffers: 592664 kB
Cached: 3106704 kB
SwapCached: 0 kB
Active: 2154112 kB
Inactive: 1833004 kB
Active(anon): 457024 kB
Inactive(anon): 114432 kB
Active(file): 1697088 kB
Inactive(file): 1718572 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 380 kB
Writeback: 0 kB
AnonPages: 287764 kB
Mapped: 164784 kB
Shmem: 283716 kB
Slab: 746368 kB
SReclaimable: 717980 kB
SUnreclaim: 28388 kB
KernelStack: 3184 kB
PageTables: 4448 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 6670980 kB
Committed_AS: 1548724 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
AnonHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 384972 kB
DirectMap2M: 10100736 kB
DirectMap1G: 5242880 kB
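If you prefer to check this from Python rather than a shell command, here is a minimal sketch that parses /proc/meminfo (it assumes the standard Linux "Key: value kB" format; read_meminfo is a hypothetical helper name):
def read_meminfo():
    # Parse /proc/meminfo into a dict of {field: value in kB}.
    info = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, value = line.split(':', 1)
            info[key] = int(value.split()[0])
    return info

print('MemTotal: {:.1f} GB'.format(read_meminfo()['MemTotal'] / 1024 ** 2))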
Viewing Colaboratory's CPU info
!cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU @ 2.30GHz
stepping : 0
microcode : 0x1
cpu MHz : 2300.000
cache size : 46080 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms xsaveopt
bugs :
bogomips : 4600.00
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU @ 2.30GHz
stepping : 0
microcode : 0x1
cpu MHz : 2300.000
cache size : 46080 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms xsaveopt
bugs :
bogomips : 4600.00
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
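The dump shows two logical processors (siblings: 2 on one physical core, i.e. hyper-threading). As a quick cross-check from the Python standard library (note that multiprocessing.cpu_count() counts logical CPUs, not physical cores):
import multiprocessing
print('Logical CPUs:', multiprocessing.cpu_count())  # 2 on this VM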
Installing and importing Keras
# https://keras.io/
!pip install -q keras
import keras
Keras uses TensorFlow or Theano as its backend.
- Since Colaboratory is a Google product, it appears to use TensorFlow.
Using TensorFlow backend.
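You can also confirm which backend is active programmatically; keras.backend.backend() returns the backend name as a string:
from keras import backend as K
print(K.backend())  # prints 'tensorflow' on Colab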
Running the Keras example code
- As of March 5, 2018, Colaboratory does not support multiple GPUs, so Keras's multi_gpu_model cannot be used.
- So how should we use the GPU?
import tensorflow as tf
from keras.applications import Xception
from keras.utils import multi_gpu_model
import numpy as np
# The original example uses 1,000 samples; to run quickly, reduce it to 100.
# The width and height are also reduced from 224 to Xception's minimum size of 71.
# Running the original example exhausts memory and CPU and never finishes.
num_samples = 100
height = 71
width = 71
num_classes = 100
# Instantiate the base model (or "template" model).
# We recommend doing this under a CPU device scope,
# so that the model's weights are hosted on CPU memory.
# Otherwise they may end up hosted on a GPU, which would
# complicate weight sharing.
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)
# The original example replicates the model on 8 GPUs and assumes the
# machine has 8 available GPUs; Colab has only one, so we pass gpus=1,
# which raises a ValueError (shown below).
parallel_model = multi_gpu_model(model, gpus=1)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')
# Generate dummy data.
x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))
# In the original example this `fit` call is distributed across 8 GPUs;
# with a batch size of 256, each GPU would process 32 samples.
parallel_model.fit(x, y, epochs=20, batch_size=256)
# Save model via the template model (which shares the same weights):
model.save('my_model.h5')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-4353c6d558e3> in <module>()
6 # Replicates the model on 8 GPUs.
7 # This assumes that your machine has 8 available GPUs.
----> 8 parallel_model = multi_gpu_model(model, gpus=1)
9 parallel_model.compile(loss='categorical_crossentropy',
10 optimizer='rmsprop')
/usr/local/lib/python3.6/dist-packages/keras/utils/training_utils.py in multi_gpu_model(model, gpus)
121 raise ValueError('For multi-gpu usage to be effective, '
122 'call `multi_gpu_model` with `gpus >= 2`. '
--> 123 'Received: `gpus=%d`' % gpus)
124 num_gpus = gpus
125 target_gpu_ids = range(num_gpus)
ValueError: For multi-gpu usage to be effective, call `multi_gpu_model` with `gpus >= 2`. Received: `gpus=1`
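If you want a script that runs unchanged on Colab and on a real multi-GPU machine, one workaround is to wrap the model only when two or more GPUs are requested. A minimal sketch (maybe_parallelize is a hypothetical helper name):
def maybe_parallelize(model, gpus):
    # multi_gpu_model raises a ValueError for gpus < 2, so fall back to
    # the plain model when only one (or no) GPU is available.
    if gpus >= 2:
        return multi_gpu_model(model, gpus=gpus)
    return model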
Listing the machine's CPU and GPU devices with TensorFlow
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 4221312434634366830
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 356515840
locality {
bus_id: 1
}
incarnation: 11454811533186484289
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7"
]
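The same device list can be filtered down to just the GPU names, which is handy for deciding how many GPUs to pass to multi_gpu_model. A minimal sketch reusing device_lib from above (get_available_gpus is a hypothetical helper name):
def get_available_gpus():
    # Return the names of all GPU devices visible to TensorFlow.
    return [d.name for d in device_lib.list_local_devices()
            if d.device_type == 'GPU']

print(get_available_gpus())  # expect ['/device:GPU:0'] on Colab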
# Unlike the multi-GPU template above, instantiate the model directly
# under a GPU device scope this time, and measure how long training takes.
import datetime
start = datetime.datetime.now()
with tf.device('/gpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop')
# Generate dummy data.
x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))
# The original 8-GPU comments no longer apply; this run trains on the
# single K80 with a smaller batch size.
# model.fit(x, y, epochs=10, batch_size=256)
model.fit(x, y, epochs=3, batch_size=16)
# Save model via the template model (which shares the same weights):
model.save('my_model.h5')
end = datetime.datetime.now()
time_delta = end - start
Epoch 1/3
100/100 [==============================] - 7s 66ms/step - loss: 235.2676
Epoch 2/3
100/100 [==============================] - 1s 11ms/step - loss: 231.3431
Epoch 3/3
100/100 [==============================] - 1s 11ms/step - loss: 228.3584
print('GPU elapsed time: {} seconds'.format(time_delta.seconds))
GPU elapsed time: 29 seconds
start = datetime.datetime.now()
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop')
# Generate dummy data.
x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))
# As above, the original 8-GPU comments no longer apply; this run is
# pinned to the CPU for comparison.
# model.fit(x, y, epochs=10, batch_size=256)
model.fit(x, y, epochs=3, batch_size=16)
# Save model via the template model (which shares the same weights):
model.save('my_model.h5')
end = datetime.datetime.now()
time_delta = end - start
Epoch 1/3
100/100 [==============================] - 25s 248ms/step - loss: 271.1356
Epoch 2/3
100/100 [==============================] - 19s 187ms/step - loss: 258.8897
Epoch 3/3
100/100 [==============================] - 19s 186ms/step - loss: 247.9633
print('CPU elapsed time: {} seconds'.format(time_delta.seconds))
CPU elapsed time: 88 seconds
- Conclusion: on this single-CPU, single-GPU setup, training on the GPU is much faster, finishing in 29 seconds versus 88 seconds on the CPU (roughly 3x).
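As a final note, the start/end timing boilerplate above appears twice; it can be folded into a small context manager. This is a minimal sketch (the timer helper is hypothetical, and it uses total_seconds() rather than .seconds, which silently ignores whole days):
import contextlib
import datetime

@contextlib.contextmanager
def timer(label):
    # Hypothetical helper: print the wall-clock time of the wrapped block.
    start = datetime.datetime.now()
    yield
    elapsed = (datetime.datetime.now() - start).total_seconds()
    print('{} elapsed time: {:.1f} seconds'.format(label, elapsed))

# Example usage, timing just the training call:
# with timer('GPU'):
#     model.fit(x, y, epochs=3, batch_size=16)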