Tuning RHEL for Databases
Sanjay Rao
Principal Performance Engineer, Red Hat
Red Hat Enterprise Linux: 2002 - 2014
2002 - Red Hat Advanced Server 2.1: bringing Linux and open source to the enterprise
2003 - Red Hat Enterprise Linux 3: multi-architecture support, more choices with a family of offerings
2005 - Red Hat Enterprise Linux 4: delivering RAS, storage, military-grade security
2007 - Red Hat Enterprise Linux 5: virtualization, stateless Linux; any application, anywhere, anytime
2010 - Red Hat Enterprise Linux 6: Linux becomes mainstream for physical, virtual, and cloud
2014 - Red Hat Enterprise Linux 7: the foundation for the open hybrid cloud
What To Tune
I/O
Memory
CPU
Network
What is tuned?
tuned profile settings: balanced vs. throughput-performance (network-throughput inherits from throughput-performance and raises the network buffer maximums)

Setting                          Units           balanced        throughput-performance
sched_min_granularity_ns         nanoseconds     auto-scaling    10000000
sched_wakeup_granularity_ns      nanoseconds     3000000         15000000
dirty_ratio                      percent         20              40
dirty_background_ratio           percent         10              10
swappiness                       weight 1-100    60              10
I/O scheduler (elevator)                         deadline        deadline
Filesystem barriers              Boolean         Enabled         Enabled
CPU governor                                     ondemand        performance
Disk read-ahead                  KB              128             4096
Energy perf bias                                 normal          performance
kernel.sched_migration_cost_ns   nanoseconds     500000          5000000
min_perf_pct                     percent         N/A             100
Network rmem/wmem/udp_mem max    Bytes / Pages   auto-scaling    auto-scaling (network-throughput: Max=16777216)
tuned profile settings: balanced vs. latency-performance (network-latency inherits from latency-performance and adds the network settings at the bottom)

Setting                          Units           balanced        latency-performance
sched_min_granularity_ns         nanoseconds     auto-scaling    10000000
sched_wakeup_granularity_ns      nanoseconds     3000000         10000000
dirty_ratio                      percent         20              10
dirty_background_ratio           percent         10              3
swappiness                       weight 1-100    60              10
I/O scheduler (elevator)                         deadline        deadline
Filesystem barriers              Boolean         Enabled         Enabled
CPU governor                                     ondemand        performance
CPU C-states (cpu_dma_latency)                   N/A             Locked @ 1
Disk read-ahead                  KB              N/A             N/A
Energy perf bias                                 normal          performance
kernel.sched_migration_cost_ns   nanoseconds     N/A             5000000
min_perf_pct                     percent         N/A             100
net.core.busy_read / busy_poll   microseconds    N/A             50 / 50 (network-latency)
TCP fastopen                     Boolean         N/A             Enabled (network-latency)
Kernel NUMA balancing            Boolean         N/A             Disabled (network-latency)
tuned profile settings: throughput-performance (virtual-host and virtual-guest inherit from throughput-performance)

Setting                          Units           throughput-performance    Inherits From / Notes
sched_min_granularity_ns         nanoseconds     10000000
sched_wakeup_granularity_ns      nanoseconds     15000000
dirty_ratio                      percent         40                        virtual-guest: 30
dirty_background_ratio           percent         10
swappiness                       weight 1-100    10                        virtual-guest: 30
I/O Scheduler (Elevator)                         deadline
Filesystem Barriers              Boolean         Enabled
CPU Governor                                     performance
Disk Read-ahead                  KB              4096
Energy Perf Bias                                 performance
kernel.sched_migration_cost_ns   nanoseconds     5000000
min_perf_pct                     percent         100
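Profiles are inspected and switched with tuned-adm; a minimal sketch (profile names as shipped with RHEL 7):

tuned-adm active                           # show the currently applied profile
tuned-adm list                             # list the available profiles
tuned-adm profile throughput-performance   # apply a profile (persists across reboots)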
[Chart: OLTP transactions/min vs. user count (10, 40, 80) on kernel 3.10.0-113, comparing tuned profiles: balanced, throughput-performance (TP), latency-performance (LP), powersave]
Multiple HBAs
Device-mapper multipath
Provides multipathing capabilities and LUN persistence
Check your storage vendor's recommendations (up to 20% performance gains with correct settings)
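A minimal sketch of enabling multipathing on RHEL; the vendor-specific device settings go in /etc/multipath.conf:

mpathconf --enable --with_multipathd y    # create /etc/multipath.conf and start multipathd
multipath -ll                             # list multipath devices and their paths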
Deadline
Two queues per device, one for reads and one for writes
I/Os dispatched based on time spent in queue
Used for multi-process applications and systems running enterprise storage
CFQ
Per-process queues
Each process queue gets a fixed time slice (based on process priority)
Default setting - slow storage (SATA), root file system
Noop
FIFO
Simple I/O merging
Lowest CPU cost
Low-latency storage and applications (solid state devices)
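The scheduler can be checked and changed per block device; a sketch, with sdb standing in for a database LUN:

cat /sys/block/sdb/queue/scheduler               # the active scheduler is shown in brackets
echo deadline > /sys/block/sdb/queue/scheduler   # switch this device to deadline for the current boot
# To make it the default, add elevator=deadline to the kernel command line or let a tuned profile set it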
Direct I/O
Predictable performance
Avoid double caching: data is not held in both the DB cache and the file cache in memory
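A quick way to see the difference outside the database, assuming a scratch file under /data with enough free space (illustrative only):

dd if=/dev/zero of=/data/dio_test bs=1M count=1024                # buffered write through the page cache
dd if=/dev/zero of=/data/dio_test bs=1M count=1024 oflag=direct   # O_DIRECT write, bypassing the page cache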
Asynchronous I/O
Eliminates synchronous I/O stalls for databases that use flat files on file systems
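fio can exercise the same I/O modes the database uses; a sketch with an arbitrary file name, size, and block size:

fio --name=dbsim --filename=/data/fio_test --size=4G --bs=8k \
    --rw=randwrite --ioengine=libaio --iodepth=32 --direct=1 \
    --runtime=60 --time_based      # asynchronous (libaio) direct I/O at queue depth 32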
[Chart: job completion time (hh:mm:ss) comparing settings of 256 and 1024; this shows I/O for only the disks that are in use]
[Charts: OLTP transactions/min at 10U/40U/80U with database logs on Fusion-io vs. Fibre Channel, and transactions/min for 4 database instances on FC vs. Fusion-io]
dm-cache
Caching in the device-mapper stack to improve performance
Frequently accessed data is cached on a faster target
Technology Preview
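dm-cache is most easily managed through LVM (lvmcache); a minimal sketch, assuming a volume group vg_db containing the slow data LV lv_data and a fast PCI SSD /dev/fioa (all names are placeholders):

lvcreate -L 100G -n lv_cache vg_db /dev/fioa       # cache data LV on the fast device
lvcreate -L 1G -n lv_cache_meta vg_db /dev/fioa    # cache metadata LV
lvconvert --type cache-pool --poolmetadata vg_db/lv_cache_meta vg_db/lv_cache
lvconvert --type cache --cachepool vg_db/lv_cache vg_db/lv_data   # attach the cache pool to the slow LV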
[Chart: OLTP transactions/min across 10U-100U user sets: SATA alone vs. dm-cache backed with Fusion-io (two runs)]
Memory Tuning
NUMA
Huge Pages
Manage Virtual Memory pages
Flushing of dirty pages
Swapping behavior
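Flushing and swapping behavior are controlled by vm.* sysctls; a sketch using the throughput-performance values from the tables above:

sysctl -w vm.dirty_ratio=40               # force synchronous flushing when dirty pages reach 40% of memory
sysctl -w vm.dirty_background_ratio=10    # start background flushing at 10%
sysctl -w vm.swappiness=10                # make the kernel less eager to swap
# persist the settings in a file under /etc/sysctl.d/, or let a tuned profile manage them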
[Diagram: four-socket NUMA system - S = socket, C = core, M = memory bank attached to each socket, D = database instance; access paths run between sockets, and between sockets and memory. With no NUMA optimization, each database (D1-D4) is spread across sockets S1-S4; with NUMA optimization, each database is confined to one socket and its local memory bank.]
CPU pinning tools (see the sketch below)
Taskset - CPU pinning
cgroups - cpusets, cpu and memory cgroups
Libvirt - CPU pinning for KVM guests
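A minimal sketch of each approach, assuming NUMA node 0 is the target; db_start, PID 1234, and guest1 are placeholders:

numactl --cpunodebind=0 --membind=0 db_start    # bind CPUs and memory to NUMA node 0
taskset -pc 0-7 1234                            # pin an already-running PID to CPUs 0-7
cgcreate -g cpuset:/db1                         # cpuset cgroup pinned to node 0
cgset -r cpuset.mems=0 -r cpuset.cpus=0-7 db1
virsh vcpupin guest1 0 2                        # pin a KVM guest's vCPU 0 to host CPU 2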
[Charts: transactions/min at 10U/40U/80U for 2 and 4 database instances with huge pages, with and without NUMA pinning; 4 KVM guests unpinned vs. NUMA-pinned vs. numad-managed (~9% improvement in performance); and regular 4K pages vs. huge pages across 10U-80U user set counts]
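Static huge pages are reserved through vm.nr_hugepages and then used by the database's shared memory; a sketch sized at 8 GB of 2 MB pages (adjust to the buffer pool):

sysctl -w vm.nr_hugepages=4096    # reserve 4096 x 2 MB huge pages, best done at or shortly after boot
grep -i huge /proc/meminfo        # verify HugePages_Total / HugePages_Free
# persist in /etc/sysctl.d/ and enable the database's huge/large page option so the SGA or buffer pool uses them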
Unused memory vs. cache: free pagecache, free slabcache
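Those caches are dropped through vm.drop_caches; a sketch, useful for testing and measurement rather than routine production use:

sync; echo 1 > /proc/sys/vm/drop_caches   # free the pagecache
echo 2 > /proc/sys/vm/drop_caches         # free reclaimable slab objects (dentries, inodes)
echo 3 > /proc/sys/vm/drop_caches         # free both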
Objectives of this session
Anything Hyper has to be good ... right?
Using hyperthreads improves performance with database workloads, but the mileage will vary depending on how the database workload scales.
Having more CPUs sharing the same physical cache can also help performance in some cases.
Some workloads lend themselves to scaling efficiently and will do very well with hyperthreads, but if a workload's scaling factor is not linear with physical CPUs, it probably won't be a good candidate for scaling with hyperthreads.
[Chart: transactions/min at 10U/40U/80U user sets for a single instance spanning 1, 2, 3, and 4 NUMA nodes, with and without hyperthreading]
Single instance scaled across NUMA nodes, one node at a time. The 1-node test shows the best gain in performance. As more NUMA nodes come into play, the performance difference is harder to predict because of memory placement and the CPU cache sharing among physical threads and hyperthreads of the CPUs.
Average % gain: 1 node 30% / 2 nodes 20% / 3 nodes 14% / 4 nodes 5.5%
[Chart: transactions/min across 10U-80U user sets for 4 instances with vs. without hyperthreading; ~35% improvement with HT]
Each of the 4 instances was aligned to an individual NUMA node. This test shows the best gain in performance because other factors that influence performance, such as NUMA placement and I/O, are not in play.
Network Performance
Separate networks for different functions (private network for database traffic)
[Charts: OLTP workload at 10U/40U/100U and DSS workloads, comparing MTU 1500 vs. MTU 9000 (jumbo frames)]
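Jumbo frames are enabled by raising the MTU on the database-facing interface; every host and switch on that private network must match. A sketch, with em2 as a placeholder interface name:

ip link set dev em2 mtu 9000                                    # apply for the current boot
echo "MTU=9000" >> /etc/sysconfig/network-scripts/ifcfg-em2     # persist across reboots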
Database Performance
Application tuning
Design
Reduce locking / waiting
Database tools (optimize regularly)
Resiliency is not a friend of performance
Please attend
Oracle Database 12c on Red Hat Enterprise Linux: Best practices
Thursday, April 17 11 a.m. to 12 p.m. in Room 102
Cgroups
Resource management - memory, CPUs, I/O, network
For performance
For application consolidation
Dynamic resource allocation
Application isolation
I/O cgroups
At the device level, control the percentage of I/O for each cgroup when the device is shared
At the device level, put a cap on throughput
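On RHEL 7 (cgroup v1) both controls live in the blkio controller; a sketch using the libcgroup tools, with the group name db1, the device numbers 8:16, and db_start as placeholders:

cgcreate -g blkio:/db1                                        # blkio group for one database instance
cgset -r blkio.weight=200 db1                                 # proportional share of a contended device (100-1000)
cgset -r blkio.throttle.read_bps_device="8:16 10485760" db1   # hard cap: 10 MB/s reads on device 8:16
cgexec -g blkio:/db1 db_start                                 # run the instance inside the group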
[Charts: transactions/min for instances 1-4 with no resource control, and for 4 instances without NUMA control vs. with cgroup NUMA pinning]
[Chart: swap-in/swap-out pages over time (up to ~35 K) for 4 instances, regular vs. throttled with a memory cgroup]
Even though one application runs out of resources and starts swapping, the other applications are not affected.
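One way to get that isolation is a memory cgroup limit on the over-committed instance; a sketch, with db2, the 8 GB cap, and db_start as placeholders:

cgcreate -g memory:/db2
cgset -r memory.limit_in_bytes=8G db2    # this instance swaps once it exceeds 8 GB; the others keep their memory
cgexec -g memory:/db2 db_start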
Database on RHEV
[Diagram: two guests (VM1, VM2) on a host - with caching enabled, guest I/O passes through the host file cache; with cache=none it bypasses it]
[Chart: transactions/min across 10U-80U user set counts with cache=none vs. cache=WT (writethrough), two runs]
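The guest disk cache mode is a per-disk libvirt setting; cache=none bypasses the host page cache and is the usual choice for database storage. A sketch, with the guest name and device path as placeholders:

virsh attach-disk guest1 /dev/vg_db/lv_vm1_data vdb \
      --subdriver raw --cache none --live --persistent
# equivalent to <driver name='qemu' type='raw' cache='none'/> in the guest's domain XML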
[Chart: transactions/min for KVM guests with no NUMA pinning vs. manual pinning vs. NUMAD]
[Chart: transactions/min across 10U-100U user sets: THP with scan=10000, THP with scan=100, and static hugepages]
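Transparent huge pages are controlled under /sys/kernel/mm/transparent_hugepage; assuming the scan values above refer to khugepaged's scan interval, a sketch:

cat /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs    # how often khugepaged scans (ms)
echo 100 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
echo never > /sys/kernel/mm/transparent_hugepage/enabled                   # or disable THP and use static huge pages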
[Chart: transactions/minute over time at 40U/80U/100U user sets: regular run vs. runs with migration bandwidth set to 32 and to 0]
Tools
perf stat with regular 4k pages
Performance counter stats for <database workload>:

      7344954.315998 task-clock                #     6.877 CPUs utilized
          64,577,684 context-switches          #     0.009 M/sec
          23,074,271 cpu-migrations             #     0.003 M/sec
           1,621,164 page-faults                #     0.221 K/sec
  16,251,715,158,810 cycles                     #     2.213 GHz                     [83.35%]
  12,106,886,605,229 stalled-cycles-frontend    #    74.50% frontend cycles idle    [83.33%]
   8,559,530,346,324 stalled-cycles-backend     #    52.67% backend cycles idle     [66.66%]
   5,909,302,532,078 instructions               #     0.36  insns per cycle
                                                #     2.05  stalled cycles per insn [83.33%]
   1,585,314,389,085 branches                   #   215.837 M/sec                   [83.31%]
      43,276,126,707 branch-misses              #     2.73% of all branches         [83.35%]

      1068.000304798 seconds time elapsed
[Charts: OLTP workload on PCI SSD storage and on Fibre Channel storage]
Memory stats
I/O stats
CPU stats
Swap stats
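A minimal set of commands covering those categories (sar, iostat, and mpstat come from the sysstat package):

vmstat 5            # memory, swap in/out, run queue, CPU
sar -W 5            # swapping activity (pages swapped in/out per second)
iostat -dmxz 5      # per-device I/O rates, utilization, await
mpstat -P ALL 5     # per-CPU utilization
free -m             # memory and cache snapshot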
Thank you