DragonFly BSD
DragonFly users List (threaded) for 2005-03

Re: Current stable tag slip status 23-Mar-2005

From: Bill Hacker <wbh@xxxxxxxxxxxxx>
Date: Thu, 24 Mar 2005 13:12:49 +0800

Matthew Dillon wrote:

:Dodgy TCP/IP  - or even the possibility of it - is a
:'showstopper' to pushing this into a 'production' setting.
:Can a 'blueprint' be assembled and posted with all
:necessary environment, configuration, test procedures,
:and necessary documentation of same, so that some of
:us 'non-coders' can help with *relevant* tests?
:I can put 4 to 8 servers onto it over the next 48 hours, but
:am not sure where to start, what to look for, or what
:coders need as output from the process in order to
:isolate - and either fix or confirm as safe to ignore.

    It's not quite so easy.

If it was 'easy' I'd have asked a local script kiddie ;-)

    Just reproducing these issues can be a major task, because there are
    a huge number of unknowns... the kernel the originator was running
    might be old, the bug might have been indirectly fixed by some other
    commit made later on, the originator may be using odd compiler
    optimizations or (as in the case of the ppbus report) gcc-3.4 (which
    nobody running a production system should be using yet).  The
    particular hardware could be bad, the issue could be driver related...
    the originator might have tweaked some sysctls in odd ways that are
    causing problems.  There are a lot of unknowns.

ACK. The very reason I was/am trying for 'group think' advice as to where to look first...

    Most of the bugs were reported on machines running older kernels, and
    those people are now trying to reproduce them on the latest kernels.
    This is a particular liability for TCP related bugs due to the number
    of bug fixes that have been committed recently.

IF any have NOT been reproduced (at least) by the same folks on the same platforms, I am happy to consider those 'no longer of interest'.

    If you want to have a go at reproducing some of these issues please do!
    I'm working on NFS right now.  If you have an SMP box see if you can
    reproduce the IPV4 connection issue reported by Peter Avalos.

Not running SMP presently. Can do, but this isn't a casus belli.

    I'll give a quick summary of where my thinking is on the issues still
    open.  The stable tag is still going to be slipped today regardless,
    because the existing stable has become a liability... there are just
    too many bugs that have been fixed since then, some quite serious
    (certainly more serious than a non-fatal repeatable tcp issue).

ACK. Put that way, it makes good sense. Risk/reward ratio is positive.

    If we suddenly get a flurry of bug reports from other people related to
    these particular bugs, you can be sure that the issue will be tracked
    down quickly and fixed.

ACK. A Fresh starting point, as it were.

    IPV4 connection problems    - still diagnosing (probably SMP related)
        (reported by Peter Avalos)

	This issue is either due to an (as yet unknown) wildcard listen
	socket problem or some resource is getting blown out and we just
	haven't figured out which resource it is.

	Symptoms:  connections to apache or ftp on localhost sometimes
	timeout, but then a few seconds later connect just fine.  A packet
	trace shows the TCP connection requests going out but no
	acknowledgement or RST coming back.  Only one person has reported
	the problem, so we don't know if it is a software bug or if some
	resource is being blown out or if the machine is being attacked or

Archaeology in both cases. Neither ftp nor Apache of current interest.

	I am going to try to reproduce it on my SMP test box today.
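
For anyone who wants to help chase the symptom described above (connects to a local service that sometimes time out, then succeed a few seconds later), a minimal Python sketch of a probe loop follows. It is not a procedure anyone in this thread specified. The function name `probe`, the attempt count, the timeout, and the pause are all assumptions chosen for illustration.

```python
import socket
import time


def probe(host, port, attempts=100, timeout=3.0, pause=0.05):
    """Open and close `attempts` TCP connections to (host, port).

    Returns a list of (attempt_index, seconds_elapsed, error_text) tuples,
    one per failed connect. An empty list means every connect succeeded.
    All default values are illustrative guesses, not tuned numbers.
    """
    failures = []
    for i in range(attempts):
        start = time.monotonic()
        try:
            # create_connection raises OSError (including timeouts) on failure.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError as exc:
            failures.append((i, time.monotonic() - start, repr(exc)))
        time.sleep(pause)
    return failures
```

Something like `probe("127.0.0.1", 80)` against a local apache, run alongside a packet trace, would at least record when a connect failed and how long it waited, which is the sort of output that could be matched up against the trace.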

    TCP Flooding issue          - still diagnosing
        (reported by Atte Peltomaki)

Atte's report is:

	Every once in a while, when application crashes and leaves an open TCP
	connection, data starts flowing full speed back and forth between the boxes.

Here's tcpdump output from one occasion where Opera crashed:

06:56:33.123424 IP webserverip.80 > myboxip.1632: . ack 1 win 33580 <nop,nop,timestamp 859 139667162>
06:56:33.123461 IP myboxip.1632 > webserverip.80: F 838440838:838440838(0) ack 3219133104 win 65535 <nop,nop,timest
	[repeats at a high rate]

netstat says the connection is in CLOSE_WAIT state.
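
To check a saved capture for the "repeats at a high rate" pattern, a rough Python sketch is given here. It assumes tcpdump text lines shaped like the two in the trace (timestamp, `IP`, `src > dst:`). The names `packets_per_second` and `floods` and the threshold of 1000 packets per second are hypothetical, picked only for illustration.

```python
import re
from collections import Counter

# Start of a tcpdump text line: "HH:MM:SS.frac IP src > dst:"
LINE = re.compile(r"^(\d\d:\d\d:\d\d)\.\d+ IP (\S+) > (\S+):")


def packets_per_second(lines):
    """Count packets per (second, endpoint pair), ignoring direction."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line)
        if m:
            second, src, dst = m.groups()
            # frozenset makes A > B and B > A count as the same connection.
            counts[(second, frozenset((src, dst)))] += 1
    return counts


def floods(lines, threshold=1000):
    """Return the buckets with at least `threshold` packets (assumed cutoff)."""
    return {k: n for k, n in packets_per_second(lines).items() if n >= threshold}
```

Feeding it the lines of a capture file, e.g. `floods(open("capture.txt"))`, would list the seconds and endpoint pairs where the exchange ran away. The file name is a placeholder.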

This worried me, as I had a similar happening within the 'house' side of an ADSL router on 3 occasions, evidenced by switching hub LEDs going bug-fsck and loss of usable connectivity all around. Simply updated DragonFly 'til it ceased rather than analyze.

Two other boxen active at the time, PowerBook 17",
Mac OS X, patched to 2005-002, FreeBSD 4.11-STABLE
of 2 FEB 05. No stack tweaks.  'factory' defaults.

	Atte was originally running a fairly old kernel and is now retesting
	with the latest kernel.

'Whatever it is' has not reappeared *here* in either gcc2 or gcc3 ISO's nor cvs head with back-to-back make cycles in any release since o/a 13 MAR. May have gone away earlier, I am not updating every day.

	It is unknown whether the bug exists in the latest kernel.  I have
	not been able to replicate it as yet.  Clearly some TCP parameters
	have been tweaked (e.g. the default window size is not 65535), perhaps
	the issue is related to some of the tweaks that have been made.

Highly probable. My undocumented issues, above, were bog-standard, as-issued stacks, had no TCP tweaks.

I still have that ISO on CD, could try to back-pedal
and replicate, but disinclined to bother with scatology.

Whether fixed by design,  or 'sideswipe' fixed as
peripheral to some other work, that one seems
to have gone away.

    NFS TCP connection failure  - still diagnosing
        (reported by someone through Jeffrey Hsu)

	The report is that a large volume of NFS operations over a TCP
	mount result in the TCP connection dying.  A TCP trace seems to
	show the TCP connection getting a FIN and resetting.

	We don't have any further information, so we don't know why the TCP
	connection was closed.  It could have been an nfsd being killed on
	the server, or it could have been something else.

Way too little info here, and far too many external variables in general with NFS. Not generally used here for other reasons.

	I have run gigabytes through an NFS tcp connection and not yet been
	able to replicate the problem.



Too little meat on the NFS bone, other items appear to
have been local aberrations, and/or not reproduced with
any certainty, if at all, on later releases.

Benefits of slipping tag clearly outweigh risks, IMO.

I'm for it!

Thanks for the analysis and sitrep.

