[ ] Use 'set follow-fork-mode child' for gdb to follow children.
[ ] Not clear what happens with CFLAGS passed by rpmbuild (for example).
[ ] Do not set affinity if we have a single worker?
[ ] Should I have a link to the master, to remove a lot of pointers to common stuff? For example C->web. Or the callbacks.
[ ] We share C->web between connections! So struct Conn_web must contain only 'urls' (for now), to be able to dispatch requests, but the rest of the fields must be per C. Check Conn_web_dispatch for the logic.
[ ] Shouldn't Conn_web_script receive a *web parameter instead of C?! Because we do not expose the web stuff!
[ ] Parent must register the pipe socket to be able to receive notifications! Also, the loglevel when entering a function should be the minimum of all Log calls inside the function. If something is not ok, log full info in errors!
[ ] We must standardize on [C->id __func__] in all functions.
[ ] Before Conn_commit, call Conn_private_size(C, xxx); and auto-alloc the priv area.
[ ] This is bad: if we start only one worker, it seems we do not accept connections!
[ ] I need to pass the bind info using the control socket and do the bind in the client. Think about a crash. Think about multiple listening sockets. Maybe the best thing is for the master to not deal with any Conn structure, only with the control interfaces. Not really possible, because the API controls a Conn struct. But, maybe, do not init anything except callbacks.
[ ] On master sockets, we must not try to get the peer name:
    getpeername(5, 0x7ffc5c935fe0, [16]) = -1 ENOTCONN (Transport endpoint is not connected)
[ ] I must fork workers in Conn_commit, to have the callbacks.
[ ] Check SO_BUSY_POLL (man 7 socket).
[ ] Start as many threads as the cpumask allows: avoid forbidden CPUs.
[ ] 4.4: "Add setsockopt() support for SO_INCOMING_CPU and extend SO_REUSEPORT selection logic": if a TCP listener or UDP socket has this option set, a packet is delivered to this socket only if the CPU handling the packet matches the specified one. This allows building very efficient TCP servers, using one listener per RX queue, as the associated TCP listener should only accept flows handled in softirq by the same CPU. This provides optimal NUMA behavior and keeps CPU caches hot. (See the listener sketch after this list.)
[ ] http://thread.gmane.org/gmane.linux.network/337836 - SO_INCOMING_CPU
[ ] SO_INCOMING_CPU http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=2c8c56e15df3d4c2af3d656e44feb18789f75837
[ ] https://lwn.net/Articles/659199/ listener improvements - see the NUMA part
[ ] https://lwn.net/Articles/655299/ MSG_ZEROCOPY
[ ] When allocating a pool of Conns, check the alignment!
[ ] Scenario: we answer a request, see that we have nothing more to send, the data is still in the output buffer, and we call close => the data is lost! We must do shutdown! (See the shutdown sketch after this list.)
[ ] Draw a diagram of the states a connection goes through; I suspect that when obuf is 0 I do not do shutdown instead of close.
[ ] For HTTP/1.1, the default is to not close the connection.
[ ] Check the new batch mode of epoll.
[ ] Use SO_REUSEPORT for accept():
    !!! http://lists.dragonflybsd.org/pipermail/users/2013-July/053632.html
    !!! https://github.com/monkey/monkey/commit/d1da249a0b5e8f5765ea8031919fb32e93c57cb8
[ ] Use defer accept!

== Devel point ==
[ ] I think I must switch back to processes. Too much overhead for threads. And I do not know if I gain anything by using threads.
[ ] Now I am working on simple web requests. Static (/) and dynamic (/cgi?a=1).
[ ] We must send "HTTP/x.x code message" respecting the incoming request. Our API must deal with it.
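Listener sketch (assumption, not libConn code): roughly what a per-worker listening socket could look like with SO_REUSEPORT, SO_INCOMING_CPU (Linux >= 4.4) and deferred accept, using only the plain socket API. The helper name is made up and error handling is trimmed.

/* Per-worker listener: each worker binds its own socket to the same port
 * and adds it to its own epoll set, so accepted flows stay on the CPU
 * that handled them in softirq. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int listener_for_cpu(unsigned short port, int cpu)
{
	int fd, one = 1, defer = 5;	/* wait up to 5s for data before accept */
	struct sockaddr_in sa;

	fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
	if (fd == -1)
		return -1;

	/* Every worker can bind the same port. */
	setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

	/* Only flows handled in softirq by @cpu are steered to this socket. */
#ifdef SO_INCOMING_CPU
	setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, sizeof(cpu));
#endif

	/* Deferred accept: do not wake us until the client actually sent data. */
	setsockopt(fd, IPPROTO_TCP, TCP_DEFER_ACCEPT, &defer, sizeof(defer));

	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	sa.sin_port = htons(port);
	sa.sin_addr.s_addr = htonl(INADDR_ANY);
	if (bind(fd, (struct sockaddr *) &sa, sizeof(sa)) == -1 ||
	    listen(fd, 128) == -1) {
		close(fd);
		return -1;
	}
	return fd;
}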
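Shutdown sketch for the close-vs-shutdown scenario above (assumption, not the library's actual teardown path): signal EOF with shutdown(SHUT_WR) after the last write, drain, and only then close. In the real epoll-driven code the drain would be done from the event loop instead of the blocking loop used here for brevity.

/* Hypothetical helper: flush the last data with a FIN instead of losing it. */
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

static void finish_and_close(int fd)
{
	char buf[4096];
	ssize_t n;

	/* The kernel still sends the queued output data, then the FIN. */
	shutdown(fd, SHUT_WR);

	/* Drain until the peer closes, so nothing in flight triggers a reset. */
	do {
		n = read(fd, buf, sizeof(buf));
	} while (n > 0 || (n == -1 && errno == EINTR));

	close(fd);
}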
== Some history ==

2014-04-02: It seems I am beating gwan. Around 4300 vs 3700. But I do not do
everything it does yet. I am moving to an API for creating a web server.
Let's see roughly how it should look:
    C = Conn_alloc();
    wp = Conn_wpool_create();
    Conn_set_wp(C, wp);
    Conn_commit(C);
    while (1) { Conn_poll(-1); }

    ws = Conn_ws_create(C);
    Conn_ws_path(C, "/static", "/home/x/public_html");
    Conn_ws_script(C, "/cgi-bin/script1", function_script1);
Sounds good. Another thing: libConn - 40k, wpool2 - 10k!

2014-03-25: It seems my syscalls take longer than his. I really have no
explanation. How the hell does that happen? Is the code between syscalls also
counted? Probably the waiting is counted too! Then it is correct. But what
explanation do I have for shutdown?! It seems that after an 'ab' run, wpool2
does not stop eating CPU!

2014-03-25: Conclusion: I spend 87% of the time in epoll_wait! gwan only 6%!!!
It seems EPOLLIN and EPOLLRDHUP arrive and I do nothing! I disable EPOLLRDHUP!
Wow! 5600 req/s! But it seems I still have 95% in epoll_wait: 112724 calls
versus 10k! Still a lot! It seems I block somewhere and make no progress from
there!

strace -c (-n20000 -c10):

gwan:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 19.84    0.031245           1     60000           setsockopt
 19.52    0.030749           2     20000           writev
 18.32    0.028859           1     40000     20000 shutdown
 16.58    0.026120           0     60000     20000 epoll_ctl
  8.68    0.013667           1     20002           close
  6.23    0.009811           0     20183       183 accept4
  5.94    0.009357           1     10440           epoll_wait
  4.88    0.007684           0     20042        20 read
  0.00    0.000000           0         2           open
  0.00    0.000000           0        12           stat
  0.00    0.000000           0         2           fstat
  0.00    0.000000           0         2           mmap
  0.00    0.000000           0         1           mprotect
  0.00    0.000000           0         2           munmap
------ ----------- ----------- --------- --------- ----------------
100.00    0.157492                250688     40203 total

wpool2 (me):
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 87.27    3.011111          46     65802           epoll_wait
  3.17    0.109265           5     20000           sendto
  2.37    0.081884           4     20146       145 accept4
  2.28    0.078617           1     63925           recvfrom
  1.90    0.065479           3     20005           epoll_ctl
  1.75    0.060324           3     20000           shutdown
  1.22    0.041954           2     20008           close
  0.03    0.001000         500         2           socketpair
  0.02    0.000535          20        27        19 open
  0.00    0.000077          19         4           munmap
  0.00    0.000038           5         7           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0         5           fstat
  0.00    0.000000           0        19           mmap
  0.00    0.000000           0        12           mprotect
  0.00    0.000000           0         4           brk
  0.00    0.000000           0         2           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           socket
  0.00    0.000000           0         1           bind
  0.00    0.000000           0         1           listen
  0.00    0.000000           0         2           setsockopt
  0.00    0.000000           0         2           clone
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           getcwd
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         3         1 futex
  0.00    0.000000           0         2           sched_setaffinity
  0.00    0.000000           0         1           sched_getaffinity
  0.00    0.000000           0         1           epoll_create
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         3           set_robust_list
  0.00    0.000000           0         3           epoll_create1
------ ----------- ----------- --------- --------- ----------------
100.00    3.450284                229996       166 total

2014-03-24: I added cancel_disable.
ab -n20000 -c10 http://localhost:60000/100.html: 4763 req/s under perf.
WITHOUT PERF: 4374 req/s. WTF?!
perf report:
 14.45%  wpool2  [kernel.kallsyms]  [k] ep_poll
 13.84%  wpool2  [kernel.kallsyms]  [k] set_normalized_timespec
 13.63%  wpool2  [vdso]             [.] 0x0000000000000cb0
  6.24%  wpool2  [kernel.kallsyms]  [k] read_hpet
  5.71%  wpool2  [kernel.kallsyms]  [k] select_estimate_accuracy
  5.41%  wpool2  libConn.so.1.0.33  [.] Conn_wpool_worker_func
Probably I call gettimeofday too many times.
Yes, it seems 0x0000000000000cb0 is gettimeofday.
If I remove sched_yield, with perf record I get 4778 req/s.
    [pid 1787] SYS_mmap(0, 0x8000000, 0, 0x4022) = 0x7f5bb264f000
    [pid 1787] SYS_munmap(0x7f5bb264f000, 26939392) = 0
    [pid 1787] SYS_munmap(0x7f5bb8000000, 40169472) = 0
It seems an mmap is done and then, immediately, an munmap. WTF?! Later it does
not happen any more.

Me                  gwan
epoll_wait          epoll_wait
accept4             accept4
mmap                -
munmap              -
munmap              -
mprotect            -
-                   setsockopt(NODELAY)
epoll_ctl           epoll_ctl
epoll_wait          epoll_wait
recvfrom            read
-                   setsockopt(NODELAY)
-                   open
-                   fstat
-                   open,stat,fstat,mmap,read,close,munmap
sendto              writev
shutdown            shutdown
-                   setsockopt(NODELAY)
epoll_wait          epoll_wait
-                   epoll_ctl(DEL)
-                   shutdown!
-                   epoll_ctl(DEL)!
close               close

Clearly I can do better than this!

2014-03-11: Use pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, NULL);
Maybe this way cancellation will not show up in perf reports.

2013-12-12: Conclusions:
- I spend much more time in epoll_wait. Very strange. It is about 22 extra seconds!
- I call accept4 40000 times more! Fuck!
- How the hell do I call shutdown 50,000 times, without errors, while he calls
  it 100,000 times and it takes less time?!
- It is incredible how he manages it. Unless my syscalls get interrupted too
  many times.
- At least I do 3 times fewer setsockopt calls.
Next steps: Do not call accept4 one more time, because the notification will
come anyway. I believe in EPOLLET less and less. What the fuck?! Rather than
making an accept call on every thread, it is better to call epoll_wait.

wpool2:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.87   35.710386         187    191175           epoll_wait
  1.26    0.459349           3    141576     91576 accept4
  0.52    0.189052           4     50000           sendto
  0.22    0.080552           2     50000           shutdown
  0.06    0.023073           0     50000           close
  0.03    0.010337           0     50306           recvfrom
  0.03    0.009216           0     50000           epoll_ctl
  0.02    0.007091           0     50000           setsockopt
------ ----------- ----------- --------- --------- ----------------
100.00   36.489056                633057     91576 total

gwan:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.60   13.464343          89    150462           epoll_wait
  0.91    0.124850           2     50000           writev
  0.49    0.067336           1    100000     50000 shutdown
  0.44    0.060369           1    101351     51351 accept4
  0.25    0.034678           0    150000           setsockopt
  0.13    0.017909           0    150000     50000 epoll_ctl
  0.11    0.015268           0     50123         4 read
  0.07    0.009198           0     50002           close
------ ----------- ----------- --------- --------- ----------------
100.00   13.795050                802006    151355 total

2013-12-10: after -O3, me prof:
    -c2+5000:  8%
    -c2+50000: 3719

2013-11-24: on r1 (after putting free structures in front of the free list): me
    -c10+50000: 5034
    -c2+50000:  3777  4700
    -c1+50000:  2973  3900

2013-11-21: on r1: me / gwan
    -c1+50000:  1703  3206
    -c2+50000:  3619  4900
    -c10+50000: 3647  5028

2013-11-20: on r1 (after doing allocations per thread): me / branch
    1+5000:     7
    -c1+50000:
    -c2+50000:  5093 !broken?

2013-11-17: on r1 (after doing accept in all workers): me / me+Log / branch
    1+5000:     6.4!
    -c1+50000:  2180
    -c2+50000:  3400

2013-11-16: on r1: me / me+Log / gwan / branch
    1+5000:     9.5%
    -c1+50000:  3882  2094!
    -c2+50000:  4100  4927

2013-11-13: on r1: ~4270 req/s (-n50000 -c2 + taskset + nice),
    branch mispredict: 8%, K: 3.11.4-201, so(Conn)=448,
    gwan -c1: ~2000, me -c1: 3875

2013-11-12: on r1: ~3850 req/s (-n50000 -c2 + taskset + nice),
    branch mispredict: 9%, K: 3.11.4-201, so(Conn)=480

== SHOWSTOPPERS ==
[ ] Call Conn_ws_free when freeing a Conn.
[ ] Make sure we compile with -O3.
[ ] Should we call accept again or go to poll mode? I think we should go to poll.
[ ] Compile with -s to obtain profiling on assembly code.
[ ] We may get rid of NODELAY because we write and then do shutdown. I hope this triggers a flush. To be tested.
[ ] The first time, I should ignore O, because there is no way I already have something in the buffer.
[ ] I need a mechanism, preferably without locking, to send statistics to the master. Possibly only on request, to avoid useless traffic. But the connection for statistics will arrive on a worker. I can probably signal through a pipe: copy the current statistics into a buffer, then send the pointer through the pipe. That is for updates. When a statistics request comes in, I have to ask the master for them and then serve them. (See the pipe sketch at the end of this file.)
[ ] Stop using callbacks for send/receive to speed up operations.
[ ] We should not call the initial out hook. We can just try to send at the first kick and react to EAGAIN. Very probably we can send.
[ ] We will probably have different structures: one for what the client sets (Conn_alloc/commit) and another for the internal bookkeeping.
[ ] Move the main pollfd to all threads. This way they will be "equal" and every core will run at full speed without migrations.
[ ] Check with gdb why we get a segmentation fault in line 2267.
[ ] Limit the number of accepts so we do not starve read/write.

== HIGH PRIORITY ==
[ ] Should we do Conn_now per thread? It is updated from all worker threads!
[ ] Verify likely/unlikely. I suspect they are not working correctly.
[ ] http://lwn.net/Articles/257209/
[ ] http://highscalability.com/blog/2013/5/13/the-secret-to-10-million-concurrent-connections-the-kernel-i.html
[ ] http://fasterdata.es.net/host-tuning/linux/
[ ] Investigate MSG_MORE when sending. (See the MSG_MORE sketch at the end of this file.)
[ ] When initializing a Conn, preallocate a 1-worker wp and set it; when the user requests another wp, just put(wp) and set the new one? Or at commit time? Use case: we want to allocate 1 core for one listen port and many cores for another.
[ ] Split Conn_poll_cb into MASTER/NON_MASTER and do not make it a callback, but inline it.
[ ] The ->next pointer can be removed from struct Conn. This way I can save a lot of space.
[ ] Do not set the fd to -1, it is pointless.
[ ] Replace Conn_X with Conn_get_socket_X!
[ ] Use shutdown(2) before closing a connection. Done, but see the link: http://www.developerweb.net/forum/archive/index.php/t-2940.html.
[ ] Switch all the callback pointers to a single callback with parameters + a flag that says for what type of callback to call it. What happens when I want to change one callback?
[ ] It does not look like I close the connection: I do shutdown, but nothing more.
[ ] Conn_free_intern is not called. Because of callbacks?
[ ] Try to alloc bigger chunks for wpool and maybe other stuff.
[ ] Alloc the private area just after the Conn structure. Add a function to set the private size. (See the private-area sketch at the end of this file.)
[ ] Set the needed in/out buffer sizes on the master socket and inherit them in accepted ones. This is because we may need different buffers for different masters.
[ ] If we got HUP, we must not allow parsing to continue!
[ ] We should remove from the active list the C that we are freeing.
[ ] Align Conn structures to 8 bytes in allocation blocks.
[ ] Investigate the idea of putting free buffers at the front of the queue because they are hot.

== LOW PRIORITY ==
[ ] Use enums for enum types.
[ ] Cache getaddrinfo responses.
[ ] Investigate moving the TCP stack to userspace.
[ ] Conn_join(C1, C2) (bridge 2 connections together for proxy stuff).
[ ] See http://highscalability.com/blog/2012/9/10/russ-10-ingredient-recipe-for-making-1-million-tps-on-5k-har.html
[ ] Dump all memory statistics.
[ ] SCTP
[ ] .error_state -> error_type
[ ] if (.error_state...) -> if (.state == CONN_STATE_ERROR)
[ ] Audit CONN_STATE_EMPTY vs CONN_STATE_FREE.
[ ] Add a function to set the maximum number of connections.
[ ] Fix the whole-list scanning for expiration, bandwidth and closing.
[ ] Put callbacks in a structure to free some space in struct Conn.
[ ] wpool: When we free a Conn structure, we have to call Conn_del_wp!
[ ] wpool: What if we also add master sockets to workers and do nothing in the main thread? Check the ma.c example. Verified: accept wakes up only one thread. Still to check if epoll wakes up all threads! It seems it wakes all threads! Not very good.
[ ] Investigate splice.
[ ] Investigate MSG_MORE as an alternative to CORK or writev. (See the MSG_MORE sketch below.)
[ ] Check if we are swapping and warn.
[ ] Log faults and io.
[ ] Add access control:
    Conn_ac_set_default(C, CONN_AC_DENY);             - default deny (or CONN_AC_ALLOW)
    Conn_ac_add(C, CONN_AC_ALLOW, "2001::1/64");      - for IPv6
    Conn_ac_add(C, CONN_AC_ALLOW, "192.168.0.0/25");  - for IPv4
[ ] A la redir stuff.
[ ] Check PACKET: can we send with "send" without knowing the MAC?
[ ] UDP
[ ] What happens if we reach ~the end of the buffer and still cannot process the data? We should log and close the connection. It is the programmer's fault or a DoS.
[ ] Queue for delete/trytoconnect/etc.

Performance:
[ ] net.core.somaxconn
[ ] Keep an eye on /proc/net/netstat.
[ ] /proc/sys/net/ipv4/tcp_mem
    Now (512M): 49152 65536 98304
    Now (256M): 24576 32768 49152                        - 55 conns/sec
    Test with:  80000 120000 240000                      - 92 conns/sec
    Test with:  160000 240000 480000                     - 96 conns/sec
    After: echo "16000 64000 512000" > tcp_[rw]mem       - 96
    After: echo 1 > /proc/sys/net/ipv4/tcp_low_latency   - 156 conns/sec
    To reduce the number of connections in TIME-WAIT:
    echo 200 > /proc/sys/net/ipv4/tcp_max_tw_buckets
[ ] Add load balancing and failover in the base code.
[ ] Automatically put \0 at the end of received data. What for?!
[ ] Add the possibility to wait for a char/string before calling the recv/data callback. Maybe do this with socket filtering or in the kernel?
[ ] Change the socket buffer according to user settings, to minimize the memory needed.
[ ] Dump how much memory is in use for the various parts of the internal data.
[ ] Do not mix slot and id and fd in examples.
[ ] Test suite.
[ ] Free memory when the number of connections goes down.
[ ] The bandwidth part should have a separate pointer, so we do not load the Conn structure too much.
[ ] Maybe we should have Bandwidth classes so we can group connections.
[ ] http://www.erlang-solutions.com/thesis/tcp_optimisation/tcp_optimisation.html

=== When we switch to Conn version 2 library ===
[ ] Conn_socket will call Conn_socket_proto.
[ ] Use enums!
[ ] http://urbanairship.com/blog/2010/09/29/linux-kernel-tuning-for-c500k/
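Private-area sketch for the "priv area just after Conn" items above (assumption: hypothetical layout and names, not the real libConn structure or API): the user's private data lives in the same block, right behind the Conn, so one allocation and one free cover both and the data stays on nearby cache lines. The alignment of the area is whatever struct Conn ends on, which is what the "check the alignment" item is about.

/* Hypothetical: one allocation per connection, private area right behind it. */
#include <stdlib.h>

struct Conn {
	int fd;
	size_t priv_size;	/* set via something like Conn_private_size() */
	/* ... rest of the real structure ... */
};

static struct Conn *conn_alloc_with_priv(size_t priv_size)
{
	struct Conn *C = calloc(1, sizeof(*C) + priv_size);

	if (C)
		C->priv_size = priv_size;
	return C;
}

/* Accessor for the private area: the bytes just past the Conn. */
static void *conn_priv(struct Conn *C)
{
	return C + 1;
}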
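MSG_MORE sketch for the two MSG_MORE items above (plain socket API, not libConn code; partial sends are not handled): the first send is flagged MSG_MORE so the kernel holds it back and coalesces it with the next one, which is what TCP_CORK or writev would otherwise be used for.

/* Illustrative only: send a header and a body as one coalesced segment. */
#include <sys/socket.h>
#include <sys/types.h>

static ssize_t send_header_and_body(int fd,
				    const void *hdr, size_t hdr_len,
				    const void *body, size_t body_len)
{
	ssize_t n;

	/* MSG_MORE: do not push this segment yet, more data is coming. */
	n = send(fd, hdr, hdr_len, MSG_MORE | MSG_NOSIGNAL);
	if (n < 0)
		return n;

	/* Last piece: no MSG_MORE, so the kernel can flush the packet. */
	return send(fd, body, body_len, MSG_NOSIGNAL);
}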
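Pipe sketch for the "statistics to the master through a pipe" item (assumption: hypothetical types and names, not libConn code): the worker snapshots its counters and writes the pointer down a pipe; pipe writes of at most PIPE_BUF bytes are atomic, so no locking is needed, and the master frees the snapshot after consuming it from its epoll loop.

/* Hypothetical lock-free stats hand-off between worker and master. */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct stats {
	unsigned long long requests, bytes_in, bytes_out;
};

/* Worker side: copy the current counters and hand the copy to the master.
 * The master owns (and frees) the snapshot once the pointer is read. */
static int stats_push(int pipe_wfd, const struct stats *current)
{
	struct stats *snap = malloc(sizeof(*snap));

	if (!snap)
		return -1;
	memcpy(snap, current, sizeof(*snap));

	/* A pointer is far below PIPE_BUF, so this write is atomic. */
	if (write(pipe_wfd, &snap, sizeof(snap)) != sizeof(snap)) {
		free(snap);
		return -1;
	}
	return 0;
}

/* Master side: read one pointer per notification. */
static struct stats *stats_pop(int pipe_rfd)
{
	struct stats *snap = NULL;

	if (read(pipe_rfd, &snap, sizeof(snap)) != sizeof(snap))
		return NULL;
	return snap;
}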