In the previous post I benchmarked various Mono HTTP backends on Linux and found that the Nginx + mono-server-fastcgi pair is very slow compared with the others. It served several times fewer requests per second! That raised two questions: "Why is it so slow?" and "What can be done to improve performance?" In this post I'll try to answer both.
Why is it so slow?
Let's profile the FastCGI Mono server. You may remember that profiling can be enabled by setting the appropriate MONO_OPTIONS environment variable. If you don't, you can read about the web server profiling options in the first part.
After running the profiler I got these results:
Total(ms) Self(ms) Calls Method name
243637 4 1002 (wrapper remoting-invoke-with-check) Mono.WebServer.FastCgi.ApplicationHost:ProcessRequest (Mono.WebServer.FastCgi.Responder)
140963 4 591 (wrapper runtime-invoke) :runtime_invoke_void__this___object (object,intptr,intptr,intptr)
140863 60 501 Mono.FastCgi.Server:OnAccept (System.IAsyncResult)
140570 25 501 Mono.FastCgi.Connection:Run ()
129977 3 501 Mono.FastCgi.Request:AddInputData (Mono.FastCgi.Record)
129971 5 501 Mono.FastCgi.ResponderRequest:OnInputDataReceived (Mono.FastCgi.Request,Mono.FastCgi.DataReceivedArgs)
129964 0 501 Mono.FastCgi.ResponderRequest:Worker (object)
129963 1 501 Mono.WebServer.FastCgi.Responder:Process ()
129959 34 501 (wrapper xdomain-invoke) Mono.WebServer.FastCgi.ApplicationHost:ProcessRequest (Mono.WebServer.FastCgi.Responder)
122777 3 501 (wrapper xdomain-dispatch) Mono.WebServer.FastCgi.ApplicationHost:ProcessRequest (object,byte[]&,byte[]&)
113673 3 501 Mono.WebServer.FastCgi.ApplicationHost:ProcessRequest (Mono.WebServer.FastCgi.Responder)
112227 14 501 Mono.WebServer.BaseApplicationHost:ProcessRequest (Mono.WebServer.MonoWorkerRequest)
112205 2 501 Mono.WebServer.MonoWorkerRequest:ProcessRequest ()
111942 2 501 System.Web.HttpRuntime:ProcessRequest (System.Web.HttpWorkerRequest)
111761 3 501 System.Web.HttpRuntime:RealProcessRequest (object)
111745 11 501 System.Web.HttpRuntime:Process (System.Web.HttpWorkerRequest)
110814 7 501 System.Web.HttpApplication:System.Web.IHttpHandler.ProcessRequest (System.Web.HttpContext)
110785 7 501 System.Web.HttpApplication:Start (object)
110148 14 501 System.Web.HttpApplication:Tick ()
110133 346 501 System.Web.HttpApplication/c__Iterator1:MoveNext ()
73347 92 6012 System.Web.HttpApplication/c__Iterator0:MoveNext ()
64025 32 501 System.Web.Security.FormsAuthenticationModule:OnAuthenticateRequest (object,System.EventArgs)
62704 141 21042 Mono.WebServer.FastCgi.WorkerRequest:GetKnownRequestHeader (int)
62550 250 45647 System.Runtime.Serialization.Formatters.Binary.ObjectReader:ReadObject (System.Runtime.Serialization.Formatters.Binary.BinaryElement,System.IO.BinaryReader,long&,object&,System.Runtime.Serialization.SerializationInfo&)
62273 5 1002 System.Web.HttpRequest:get_Cookies ()
62203 134 20040 Mono.WebServer.FastCgi.WorkerRequest:GetUnknownRequestHeaders ()
56381 6 1002 (wrapper remoting-invoke-with-check) Mono.WebServer.FastCgi.Responder:GetParameters ()
56373 34 501 (wrapper xdomain-invoke) Mono.WebServer.FastCgi.Responder:GetParameters ()
54634 368 44653 System.Runtime.Serialization.Formatters.Binary.ObjectWriter:WriteObjectInstance (System.IO.BinaryWriter,object,bool)
51554 16 1514 System.Runtime.Serialization.Formatters.Binary.BinaryFormatter:Deserialize (System.IO.Stream)
51537 47 1514 System.Runtime.Serialization.Formatters.Binary.BinaryFormatter:NoCheckDeserialize (System.IO.Stream,System.Runtime.Remoting.Messaging.HeaderHandler)
51531 34 12007 System.Runtime.Remoting.RemotingServices:DeserializeCallData (byte[])
50521 19 1514 System.Runtime.Serialization.Formatters.Binary.ObjectReader:ReadObjectGraph (System.Runtime.Serialization.Formatters.Binary.BinaryElement,System.IO.BinaryReader,bool,object&,System.Runtime.Remoting.Messaging.Header[]&)
48246 46 7536 System.Runtime.Serialization.Formatters.Binary.ObjectReader:ReadNextObject (System.IO.BinaryReader)
47020 999 54096 System.Runtime.Serialization.Formatters.Binary.ObjectReader:ReadValue (System.IO.BinaryReader,object,long,System.Runtime.Serialization.SerializationInfo,System.Type,string,System.Reflection.MemberInfo,int[])
35051 143 22013 System.Runtime.Remoting.RemotingServices:SerializeCallData (object)
34198 7 1516 System.Runtime.Serialization.Formatters.Binary.BinaryFormatter:Serialize (System.IO.Stream,object)
34190 15 1516 System.Runtime.Serialization.Formatters.Binary.BinaryFormatter:Serialize (System.IO.Stream,object,System.Runtime.Remoting.Messaging.Header[])
33354 28 1516 System.Runtime.Serialization.Formatters.Binary.ObjectWriter:WriteObjectGraph (System.IO.BinaryWriter,object,System.Runtime.Remoting.Messaging.Header[])
33253 78 1516 System.Runtime.Serialization.Formatters.Binary.ObjectWriter:WriteQueuedObjects (System.IO.BinaryWriter)
29792 539 16549 System.Runtime.Serialization.Formatters.Binary.ObjectWriter:WriteObject (System.IO.BinaryWriter,long,object)
28486 656 49652 System.Runtime.Serialization.Formatters.Binary.ObjectWriter:WriteValue (System.IO.BinaryWriter,System.Type,object)
26041 101 501 System.Runtime.Serialization.Formatters.Binary.ObjectReader:ReadGenericArray (System.IO.BinaryReader,long&,object&)
24552 16 501 System.Web.HttpApplication:PipelineDone ()
23851 58 501 System.Web.HttpApplication:OutputPage ()
23782 20 501 System.Web.HttpResponse:Flush (bool)
23079 598 16539 System.Runtime.Serialization.Formatters.Binary.ObjectReader:ReadObjectContent (System.IO.BinaryReader,System.Runtime.Serialization.Formatters.Binary.ObjectReader/TypeMetadata,long,object&,System.Runtime.Serialization.SerializationInfo&)
22542 24 501 (wrapper xdomain-dispatch) Mono.WebServer.FastCgi.Responder:GetParameters (object,byte[]&,byte[]&)
19536 39 3030 System.Runtime.Serialization.Formatters.Binary.ObjectWriter:WriteArray (System.IO.BinaryWriter,long,System.Array)
18377 105 501 System.Runtime.Serialization.Formatters.Binary.ObjectWriter:WriteGenericArray (System.IO.BinaryWriter,long,System.Array)
In the profile you can see a lot of binary serialization calls, which take most of the processing time. But if you look into the Mono FastCGI code, you won't find any explicit calls to BinaryFormatter. What is going on? I hope you've already guessed what causes this serialization overhead; if not, let's look at the picture:
A new FastCGI request handler is created for every request from Nginx. The handler then looks up the corresponding web application by the HTTP_HOST server variable and, once the application is found, creates a new HttpWorkerRequest inside it and calls the Process method to process it. While processing, the web application communicates with the FastCGI request handler (asks for HTTP headers, returns the HTTP response and so on). Because the FastCGI request handler and the web application live in different application domains, all calls between them go through remoting. Remoting binary-serializes the objects that are passed, and this makes the application slow. I'd rather say remoting makes an application VERY VERY VERY SLOW if you pass complex types between endpoints. It's the prime evil of distributed applications that need to be performant. Don't use remoting if you have another way to communicate between your apps.
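To get a feeling for why marshalling every call hurts, here is a minimal Python sketch. It is only an analogy: `pickle` stands in for the .NET binary formatter, and the `HEADERS` dictionary and both function names are made up for illustration. It compares handing out a reference to an object with serializing and deserializing the same object on every call, which is roughly what a cross-domain call has to do.

```python
import pickle
import timeit

# Hypothetical stand-in for the request headers the web application asks for.
HEADERS = {f"Header-{i}": f"value-{i}" for i in range(20)}


def get_headers_direct():
    """Same-domain call: the caller receives a reference to the existing object."""
    return HEADERS


def get_headers_marshalled():
    """Cross-boundary call: the result is serialized and then deserialized into a copy."""
    return pickle.loads(pickle.dumps(HEADERS))


if __name__ == "__main__":
    calls = 20_000
    direct = timeit.timeit(get_headers_direct, number=calls)
    marshalled = timeit.timeit(get_headers_marshalled, number=calls)
    print(f"direct: {direct:.4f}s  marshalled: {marshalled:.4f}s")
```

The absolute numbers depend on the machine, but the marshalled variant pays for a full copy of the data on every call, and the profile above shows that same kind of cost multiplied by tens of thousands of header lookups.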
OK, we found that the FastCGI server actively uses remoting internally and that this can reduce performance. But is remoting the only thing that dramatically reduces performance? Maybe the FastCGI protocol itself is very slow and we cannot have a fast and reliable Mono web server behind nginx?
To check this I decided to write a simple application based on the mono-server-fastcgi source code. The application should instantly return a "Hello, world!" HTTP response for every HTTP request without using remoting. If I could write such an app and it performed better, it would prove that a faster, more reliable web server could be created.
Proof of concept
I took the FastCGI server sources and wrote my own network server based on async sockets. From the old sources I kept only the FastCGI record parser and got rid of everything else. After the simple app was complete, I ran some benchmarks.
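For readers who have not seen the protocol, a FastCGI record parser is small. The sketch below is not the code from mono-server-fastcgi or from my server; it is a Python illustration of the 8-byte record header defined by the FastCGI specification (version, type, request id, content length, padding length, one reserved byte). The function name `parse_records` is hypothetical.

```python
import struct

# FastCGI record header: version, type, requestId, contentLength, paddingLength, reserved.
FCGI_HEADER = struct.Struct("!BBHHBx")

# A few record types from the FastCGI specification.
FCGI_BEGIN_REQUEST = 1
FCGI_PARAMS = 4
FCGI_STDIN = 5


def parse_records(buffer: bytes):
    """Split a byte buffer into complete FastCGI records.

    Returns (records, leftover), where records is a list of
    (type, request_id, content) tuples and leftover holds the bytes of a
    trailing record that has not fully arrived yet.
    """
    records = []
    offset = 0
    while len(buffer) - offset >= FCGI_HEADER.size:
        _version, rtype, request_id, content_len, padding_len = FCGI_HEADER.unpack_from(
            buffer, offset
        )
        body_start = offset + FCGI_HEADER.size
        end = body_start + content_len + padding_len
        if end > len(buffer):
            break  # the rest of this record is still on the wire
        records.append((rtype, request_id, buffer[body_start:body_start + content_len]))
        offset = end
    return records, buffer[offset:]
```

With async sockets the leftover bytes are simply prepended to the next chunk that arrives, so a record split across two reads is handled without blocking.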
Before publishing the results, let's recall the mono-server-fastcgi benchmarks made in the previous post.
Configuration | Requests/sec | Standard deviation | std dev % | Comments |
Nginx+fastcgi-server+ServiceStack | 571.36 | 8.81 | 1.54 | Memory Leaks |
Nginx+fastcgi-server hello.html | 409.48 | 9.14 | 2.23 | Memory Leaks |
Nginx+fastcgi-server hello.aspx | 458.55 | 9.89 | 2.16 | Memory Leaks, Crashes |
Nginx+proxy xsp4+ServiceStack | 1402.33 | 45.42 | 3.24 | Unstable Results, Errors |
These benchmarks were made with the Apache ab tool using 10 concurrent requests. You can see that the FastCGI Mono server handles roughly 400-570 requests per second. In the new benchmarks I additionally vary the number of concurrent requests to see its influence on the results. The command was:
ab -n 100000 -c <concurrency> http://testurl
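The "std dev %" column in the tables is the standard deviation divided by the mean (for example 8.81 / 571.36 ≈ 1.54%). A minimal Python sketch of that summary follows. The run values in it are invented for illustration and are not measurements from this post, and using the sample (rather than population) standard deviation is an assumption.

```python
import statistics


def summarize(runs):
    """Return (mean, standard deviation, standard deviation as % of mean)."""
    mean = statistics.mean(runs)
    stdev = statistics.stdev(runs)  # sample standard deviation (assumption)
    return mean, stdev, 100.0 * stdev / mean


if __name__ == "__main__":
    # Hypothetical requests/sec figures from repeated ab runs, not real data.
    example_runs = [2600.0, 2650.0, 2580.0, 2700.0, 2620.0]
    mean, stdev, percent = summarize(example_runs)
    print(f"mean={mean:.2f} stdev={stdev:.2f} ({percent:.2f}%)")
```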
Nginx configuration:
server {
    listen 81;
    server_name ssbench3;
    access_log /var/log/nginx/ssbench3.log;
    location / {
        root /var/www/ssbench3/;
        index index.html index.htm default.aspx Default.aspx;
        fastcgi_index Default.aspx;
        fastcgi_pass 127.0.0.1:9000;
        include /etc/nginx/fastcgi_params;
    }
}
Benchmark results:
Nginx fastcgi settings | Concurrency | Requests/Sec | Standard deviation | std dev % |
TCP sockets | 10 | 2619.56 | 49.95 | 1.83 |
TCP sockets | 20 | 2673.198 | 19.43 | 0.72 |
TCP sockets | 30 | 2681.166 | 15.83 | 0.59 |
A significant difference, isn't it? These results give us hope that we can increase the throughput of the FastCGI server if we change the architecture and remove remoting from it. And there is still room to increase performance. Are you ready to go further?
Faster, higher, stronger
The next step was switching the communication between nginx and the server from TCP sockets to Unix sockets. Config and results:
server {
    listen 81;
    server_name ssbench3;
    access_log /var/log/nginx/ssbench3.log;
    location / {
        root /var/www/ssbench3/;
        index index.html index.htm default.aspx Default.aspx;
        fastcgi_index Default.aspx;
        fastcgi_pass unix:/tmp/fastcgi.socket;
        include /etc/nginx/fastcgi_params;
    }
}
Results
Nginx fastcgi settings | Concurrency | Requests/Sec | Standard deviation | std dev % |
Unix sockets | 10 | 2743.622 | 40.91 | 1.49 |
Unix sockets | 20 | 2952.244 | 67.86 | 2.29 |
Unix sockets | 30 | 2949.118 | 86.19 | 2.92 |
That gained 5-10%. Not so bad, but I want to increase performance further, because when we replace the simple HTTP response from the FastCGI request handler with the real ASP.NET process method we will lose a lot of performance.
One question whose answer could help increase performance: is there a way to keep the connection between nginx and the FastCGI server open instead of creating it for every request? In the configurations above, nginx requires the FastCGI server to close the connection to confirm the end of request processing. However, the FastCGI protocol has an EndRequest command, and keeping the connection open and using EndRequest instead of closing it could save a huge amount of time when processing small requests. Fortunately, nginx supports this feature; it's called keepalive. I enabled keepalive and set the maximum number of idle connections that nginx keeps open to my server to 32. I chose this number because it was higher than the maximum number of concurrent requests I made with ab.
upstream fastcgi_backend {
    # server 127.0.0.1:9000;
    server unix:/tmp/fastcgi.socket;
    keepalive 32;
}
server {
    listen 81;
    server_name ssbench3;
    access_log /var/log/nginx/ssbench3.log;
    location / {
        root /var/www/ssbench3/;
        index index.html index.htm default.aspx Default.aspx;
        fastcgi_index Default.aspx;
        fastcgi_keep_conn on;
        fastcgi_pass fastcgi_backend;
        include /etc/nginx/fastcgi_params;
    }
}
Nginx fastcgi settings | Concurrency | Requests/Sec | Standard deviation | std dev % |
TCP sockets. KeepAlive | 10 | 3720.23 | 49.36 | 1.33 |
TCP sockets. KeepAlive | 30 | 3907.85 | 80.48 | 2.06 |
Unix sockets. KeepAlive | 10 | 4024.678 | 122.33 | 3.04 |
Unix sockets. KeepAlive | 20 | 4458.714 | 72.87 | 1.63 |
Unix sockets. KeepAlive | 30 | 4482.648 | 19.40 | 0.43 |
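On the wire, keepalive comes down to one flag and one record. The Python sketch below is an illustration based on the FastCGI specification, not the code of my server: the web server sets the FCGI_KEEP_CONN bit in the BeginRequest body, and the application finishes the request with an EndRequest record instead of closing the socket. The helper names (`record`, `keep_connection`, `build_response`) are hypothetical.

```python
import struct

FCGI_HEADER = struct.Struct("!BBHHBx")            # version, type, requestId, contentLength, paddingLength, reserved
FCGI_BEGIN_REQUEST_BODY = struct.Struct("!HB5x")  # role, flags, 5 reserved bytes
FCGI_END_REQUEST_BODY = struct.Struct("!IB3x")    # appStatus, protocolStatus, 3 reserved bytes

FCGI_VERSION_1 = 1
FCGI_END_REQUEST = 3
FCGI_STDOUT = 6
FCGI_RESPONDER = 1          # role value in BeginRequest
FCGI_KEEP_CONN = 1          # flag bit in BeginRequest
FCGI_REQUEST_COMPLETE = 0   # protocolStatus in EndRequest


def record(rtype: int, request_id: int, content: bytes = b"") -> bytes:
    """Build one record without padding; content must fit in 65535 bytes."""
    return FCGI_HEADER.pack(FCGI_VERSION_1, rtype, request_id, len(content), 0) + content


def keep_connection(begin_request_body: bytes) -> bool:
    """True if the web server asked to reuse the connection after this request."""
    _role, flags = FCGI_BEGIN_REQUEST_BODY.unpack(begin_request_body)
    return bool(flags & FCGI_KEEP_CONN)


def build_response(request_id: int, payload: bytes) -> bytes:
    """STDOUT data, an empty STDOUT record to end the stream, then EndRequest."""
    end_body = FCGI_END_REQUEST_BODY.pack(0, FCGI_REQUEST_COMPLETE)
    return (
        record(FCGI_STDOUT, request_id, payload)
        + record(FCGI_STDOUT, request_id)
        + record(FCGI_END_REQUEST, request_id, end_body)
    )


if __name__ == "__main__":
    body = b"Content-Type: text/plain\r\n\r\nHello, world!"
    print(len(build_response(1, body)), "bytes to send before reusing the socket")
```

When `keep_connection` returns true, the server writes the bytes from `build_response` and goes back to reading the next request on the same socket, which skips the connect and close work for every request.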
Wow! That is a huge performance gain! Up to 50% compared with the previous results! So I thought this was enough for a proof of concept and I could start creating a faster FastCGI Mono web server. To check this, I made a simple .NET web server (without nginx) that always returns a "Hello, world!" HTTP response and tested it with ab. It showed ~5000 reqs/sec, which is close to my FastCGI proof-of-concept server.
HyperFastCGI server
The target is clear now. I am going to create a fast and reliable FastCGI server for Mono that serves as many requests per second as possible and stays stable. Unfortunately, this cannot be done just by performance-tweaking the current Mono FastCGI server. The architecture needs to be changed to avoid cross-domain calls while processing requests.
What I did:
- I wrote my own connection handling using async sockets. It should also decrease CPU usage, but I did not compare the servers by this parameter.
- I totally rewrote FastCGI packet parsing, trying to decrease the number of operations needed to handle packets.
- I changed the architecture by moving FastCGI packet handling into the same domain where the web application is located.
- Currently there are no known memory leaks when processing requests.
This helped to improve the performance of the server. Here are the benchmarks:
Url | Nginx fastcgi settings/Concurrency | Requests/Sec | Standard deviation | std dev % |
/hello.aspx | TCP keepalive/10 | 1404.174 | 24.93 | 1.78 |
/servicestack/json | TCP keepalive/10 | 1671.15 | 21.40 | 1.28 |
/servicestack/json | TCP keepalive/20 | 1718.158 | 41.46 | 2.41 |
/servicestack/json | TCP keepalive/30 | 1752.69 | 34.56 | 1.97 |
/servicestack/json | Unix sockets keepalive/10 | 1755.55 | 40.30 | 2.30 |
/servicestack/json | Unix sockets keepalive/20 | 1817.488 | 39.30 | 2.16 |
/servicestack/json | Unix sockets keepalive/30 | 1822.984 | 36.48 | 2.00 |
Performance rose several times compared to the original Mono FastCGI server! But this is not enough. While testing I found that threads were created and destroyed very often. Creating threads is an expensive operation, so I decided to increase the minimum number of threads in the thread pool. I added a new option, /minthreads, to the server and set it to /minthreads=20,8, which means that there will be at least 20 worker threads in the thread pool and 8 IO threads (for async socket communication).
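The idea can be sketched in Python as follows. This is an analogy, not the server's code: `parse_minthreads` and `threads_used` are hypothetical helpers, and Python's `ThreadPoolExecutor` takes an upper bound rather than a minimum, so the second function only shows that pooled threads are reused instead of being created per request. In .NET the analogous call is `ThreadPool.SetMinThreads(workerThreads, completionPortThreads)`; that /minthreads maps onto it exactly is an assumption here.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def parse_minthreads(option: str):
    """Parse an option such as '/minthreads=20,8' into (worker_threads, io_threads)."""
    name, _, value = option.partition("=")
    if name != "/minthreads":
        raise ValueError(f"unexpected option: {option}")
    workers, io_threads = (int(part) for part in value.split(","))
    return workers, io_threads


def threads_used(pool_size: int, tasks: int) -> int:
    """Run many small jobs on a fixed pool and count the distinct threads that served them."""
    seen = set()
    lock = threading.Lock()

    def job(_):
        with lock:
            seen.add(threading.get_ident())

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        list(pool.map(job, range(tasks)))
    return len(seen)


if __name__ == "__main__":
    workers, io_threads = parse_minthreads("/minthreads=20,8")
    print(f"worker threads: {workers}, IO threads: {io_threads}")
    print(f"500 jobs were served by {threads_used(workers, 500)} pooled threads")
```

However many jobs arrive, the number of threads created never exceeds the pool size, and avoiding that per-request thread creation is the saving the option aims for.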
/minthreads=20,8 benchmarks:
Url | Nginx fastcgi settings/Concurrency | Requests/Sec | Standard deviation | std dev % |
/servicestack/json | TCP keepalive/10 | 2041.246 | 23.18 | 1.14 |
/servicestack/json | TCP keepalive/20 | 2070.08 | 10.95 | 0.53 |
/servicestack/json | TCP keepalive/30 | 2093.526 | 24.27 | 1.16 |
/servicestack/json | Unix sockets keepalive/10 | 2156.754 | 37.74 | 1.75 |
/servicestack/json | Unix sockets keepalive/20 | 2182.774 | 42.96 | 1.97 |
/servicestack/json | Unix sockets keepalive/30 | 2268.676 | 28.39 | 1.25 |
Such an easy thing gives a performance boost of about 20%!
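The gain can be checked directly from the two /servicestack/json tables above. The short Python sketch below copies the requests/sec values from those tables and computes the relative improvement per configuration; the dictionary keys and the `gain_percent` helper are just illustrative names.

```python
# Requests/sec for /servicestack/json, copied from the tables above.
default_threads = {
    "TCP keepalive/10": 1671.15,
    "TCP keepalive/20": 1718.158,
    "TCP keepalive/30": 1752.69,
    "Unix sockets keepalive/10": 1755.55,
    "Unix sockets keepalive/20": 1817.488,
    "Unix sockets keepalive/30": 1822.984,
}
with_minthreads = {
    "TCP keepalive/10": 2041.246,
    "TCP keepalive/20": 2070.08,
    "TCP keepalive/30": 2093.526,
    "Unix sockets keepalive/10": 2156.754,
    "Unix sockets keepalive/20": 2182.774,
    "Unix sockets keepalive/30": 2268.676,
}


def gain_percent(old: float, new: float) -> float:
    """Relative improvement of new over old, in percent."""
    return 100.0 * (new - old) / old


gains = {
    name: gain_percent(default_threads[name], with_minthreads[name])
    for name in default_threads
}

for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}%")
```

For these six rows the improvement lands between roughly 19% and 24%.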
Finally, I put the benchmarks for all nginx configurations in one chart.
HyperFastCgi server can be found on GitHub. It's not well tested yet, so use it at your own risk. But at least all the ServiceStack (v3) WebHosts.Integration tests that passed with XSP passed with HyperFastCgi too. To install HyperFastCgi simply do:
git clone https://github.com/xplicit/HyperFastCgi.git
cd HyperFastCgi
./autogen.sh --prefix=/usr && make
sudo make install
Configuration options are the same as for mono-server-fastcgi, plus a few new parameters:
/minthreads=nw,nio - minimum number of worker and IO threads
/maxthreads=nw,nio - maximum number of worker and IO threads
/keepalive=<true|false> - use the keepalive feature or not. Default is true
/usethreadpool=<true|false> - use the thread pool for processing requests. Default is true
If HyperFastCgi server turns out to be interesting to others for production use, I am going to improve it. What can be improved:
- Support several virtual paths in one server. Currently only one web application is supported
- Write unit tests to be sure, that the server is working properly
- Catch and properly handle the UnloadDomain() command from ASP.NET. This command is raised when web.config is changed or during some health checks by the web server. (Edit: already done)
- Add a management and monitoring application which shows server statistics (number of requests served and so on) and recommends performance tweaks
- Additional performance improvements
Links:
HyperFastCgi server source code
ServiceStack performance in mono. Part 1
ServiceStack performance in mono. Part 2
ServiceStack performance in mono. Part 4