Trying Out Keras on Google Colaboratory's Free GPU
Running Keras on the GPU, with TensorFlow as the backend.
05 March 2018
Trying Out Keras on Colab's GPU
- Note: this takes the multi-GPU example from the official Keras documentation and modifies it to run on Colab: https://keras.io/utils/#multi_gpu_model
- Colab's performance is decent: you currently get about 13 GB of memory and an Intel(R) Xeon(R) CPU @ 2.30GHz.
- The GPU is a Tesla K80, and multiple GPUs are not supported.
Viewing Colaboratory's memory info
!cat /proc/meminfo
MemTotal: 13341960 kB
MemFree: 8533256 kB
MemAvailable: 12413480 kB
Buffers: 592664 kB
Cached: 3106704 kB
SwapCached: 0 kB
Active: 2154112 kB
Inactive: 1833004 kB
Active(anon): 457024 kB
Inactive(anon): 114432 kB
Active(file): 1697088 kB
Inactive(file): 1718572 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 380 kB
Writeback: 0 kB
AnonPages: 287764 kB
Mapped: 164784 kB
Shmem: 283716 kB
Slab: 746368 kB
SReclaimable: 717980 kB
SUnreclaim: 28388 kB
KernelStack: 3184 kB
PageTables: 4448 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 6670980 kB
Committed_AS: 1548724 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
AnonHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 384972 kB
DirectMap2M: 10100736 kB
DirectMap1G: 5242880 kB
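If you prefer to check this from Python rather than a shell command, here is a minimal sketch that parses /proc/meminfo (it assumes the standard Linux "Key: value kB" format; read_meminfo is a hypothetical helper name):
def read_meminfo():
    # Parse /proc/meminfo into a dict of {field: value in kB}.
    info = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, value = line.split(':', 1)
            info[key] = int(value.split()[0])
    return info

print('MemTotal: {:.1f} GB'.format(read_meminfo()['MemTotal'] / 1024 ** 2))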
Viewing Colaboratory's CPU info
!cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU @ 2.30GHz
stepping : 0
microcode : 0x1
cpu MHz : 2300.000
cache size : 46080 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms xsaveopt
bugs :
bogomips : 4600.00
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU @ 2.30GHz
stepping : 0
microcode : 0x1
cpu MHz : 2300.000
cache size : 46080 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms xsaveopt
bugs :
bogomips : 4600.00
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
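The dump shows two logical processors (siblings: 2 on one physical core, i.e. hyper-threading). As a quick cross-check from the Python standard library (note that multiprocessing.cpu_count() counts logical CPUs, not physical cores):
import multiprocessing
print('Logical CPUs:', multiprocessing.cpu_count())  # 2 on this VM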
Installing and importing Keras
# https://keras.io/
!pip install -q keras
import keras
Keras uses TensorFlow or Theano as its backend.
- Since Colaboratory is a Google product, it appears to use TensorFlow.
Using TensorFlow backend.
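You can also confirm which backend is active programmatically; keras.backend.backend() returns the backend name as a string:
from keras import backend as K
print(K.backend())  # prints 'tensorflow' on Colab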
Running the Keras example code
- As of March 5, 2018, Colaboratory does not support multiple GPUs, so Keras's multi_gpu_model cannot be used.
- So how should we use the GPU?
import tensorflow as tf
from keras.applications import Xception
from keras.utils import multi_gpu_model
import numpy as np
# The original example uses 1,000 samples; to run quickly, reduce it to 100.
# The width and height are also reduced from 224 to Xception's minimum size of 71.
# Running the original example exhausts memory and CPU and never finishes.
num_samples = 100
height = 71
width = 71
num_classes = 100
# Instantiate the base model (or "template" model).
# We recommend doing this under a CPU device scope,
# so that the model's weights are hosted on CPU memory.
# Otherwise they may end up hosted on a GPU, which would
# complicate weight sharing.
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)
# The original example replicates the model on 8 GPUs and assumes the
# machine has 8 available GPUs; Colab has only one, so we pass gpus=1,
# which raises a ValueError (shown below).
parallel_model = multi_gpu_model(model, gpus=1)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')
# Generate dummy data.
x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))
# In the original example this `fit` call is distributed across 8 GPUs;
# with a batch size of 256, each GPU would process 32 samples.
parallel_model.fit(x, y, epochs=20, batch_size=256)
# Save model via the template model (which shares the same weights):
model.save('my_model.h5')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-4353c6d558e3> in <module>()
6 # Replicates the model on 8 GPUs.
7 # This assumes that your machine has 8 available GPUs.
----> 8 parallel_model = multi_gpu_model(model, gpus=1)
9 parallel_model.compile(loss='categorical_crossentropy',
10 optimizer='rmsprop')
/usr/local/lib/python3.6/dist-packages/keras/utils/training_utils.py in multi_gpu_model(model, gpus)
121 raise ValueError('For multi-gpu usage to be effective, '
122 'call `multi_gpu_model` with `gpus >= 2`. '
--> 123 'Received: `gpus=%d`' % gpus)
124 num_gpus = gpus
125 target_gpu_ids = range(num_gpus)
ValueError: For multi-gpu usage to be effective, call `multi_gpu_model` with `gpus >= 2`. Received: `gpus=1`
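If you want a script that runs unchanged on Colab and on a real multi-GPU machine, one workaround is to wrap the model only when two or more GPUs are requested. A minimal sketch (maybe_parallelize is a hypothetical helper name):
def maybe_parallelize(model, gpus):
    # multi_gpu_model raises a ValueError for gpus < 2, so fall back to
    # the plain model when only one (or no) GPU is available.
    if gpus >= 2:
        return multi_gpu_model(model, gpus=gpus)
    return model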
Listing the machine's CPU and GPU devices with TensorFlow
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 4221312434634366830
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 356515840
locality {
bus_id: 1
}
incarnation: 11454811533186484289
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7"
]
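The same device list can be filtered down to just the GPU names, which is handy for deciding how many GPUs to pass to multi_gpu_model. A minimal sketch reusing device_lib from above (get_available_gpus is a hypothetical helper name):
def get_available_gpus():
    # Return the names of all GPU devices visible to TensorFlow.
    return [d.name for d in device_lib.list_local_devices()
            if d.device_type == 'GPU']

print(get_available_gpus())  # expect ['/device:GPU:0'] on Colab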
# Unlike the multi-GPU template above, instantiate the model directly
# under a GPU device scope this time, and measure how long training takes.
import datetime
start = datetime.datetime.now()
with tf.device('/gpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop')
# Generate dummy data.
x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))
# The original 8-GPU comments no longer apply; this run trains on the
# single K80 with a smaller batch size.
# model.fit(x, y, epochs=10, batch_size=256)
model.fit(x, y, epochs=3, batch_size=16)
# Save model via the template model (which shares the same weights):
model.save('my_model.h5')
end = datetime.datetime.now()
time_delta = end - start
Epoch 1/3
100/100 [==============================] - 7s 66ms/step - loss: 235.2676
Epoch 2/3
100/100 [==============================] - 1s 11ms/step - loss: 231.3431
Epoch 3/3
100/100 [==============================] - 1s 11ms/step - loss: 228.3584
print('GPU elapsed time: {} seconds'.format(time_delta.seconds))
GPU elapsed time: 29 seconds
start = datetime.datetime.now()
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop')
# Generate dummy data.
x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))
# As above, the original 8-GPU comments no longer apply; this run is
# pinned to the CPU for comparison.
# model.fit(x, y, epochs=10, batch_size=256)
model.fit(x, y, epochs=3, batch_size=16)
# Save model via the template model (which shares the same weights):
model.save('my_model.h5')
end = datetime.datetime.now()
time_delta = end - start
Epoch 1/3
100/100 [==============================] - 25s 248ms/step - loss: 271.1356
Epoch 2/3
100/100 [==============================] - 19s 187ms/step - loss: 258.8897
Epoch 3/3
100/100 [==============================] - 19s 186ms/step - loss: 247.9633
print('CPU elapsed time: {} seconds'.format(time_delta.seconds))
CPU elapsed time: 88 seconds
- Conclusion: on this single-CPU, single-GPU setup, training on the GPU is much faster, finishing in 29 seconds versus 88 seconds on the CPU (roughly 3x).
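As a final note, the start/end timing boilerplate above appears twice; it can be folded into a small context manager. This is a minimal sketch (the timer helper is hypothetical, and it uses total_seconds() rather than .seconds, which silently ignores whole days):
import contextlib
import datetime

@contextlib.contextmanager
def timer(label):
    # Hypothetical helper: print the wall-clock time of the wrapped block.
    start = datetime.datetime.now()
    yield
    elapsed = (datetime.datetime.now() - start).total_seconds()
    print('{} elapsed time: {:.1f} seconds'.format(label, elapsed))

# Example usage, timing just the training call:
# with timer('GPU'):
#     model.fit(x, y, epochs=3, batch_size=16)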