Discussion:
Kernel oops on setting sky2 interfaces down
Rene Mayrhofer
2009-07-21 16:26:39 UTC
Hi everybody,

[Please CC me in replies, I am not currently subscribed to this list.]

I have a fully reproducible kernel oops in the sky2 module in kernel
2.6.28.10. The kernel is a vanilla 2.6.28.10 (and I can't switch to
anything newer at this time because of missing squashfs-lzma support),
patched with PaX, netfilter-layer7, squashfs (with LZMA), and IMQ. The
base system is a Debian Lenny with some updates from testing/unstable.

Whenever interfaces using the sky2 module (this box has 8 network
interfaces in a 19" rack appliance) go down, the oops occurs:

[~]# ifdown -a --exclude=lo
[ 1535.000069] sky2 0000:01:00.0: error interrupt status=0xffffffff
[ 1535.006649] sky2 0000:01:00.0: PCI hardware error (0xffff)
[ 1535.012608] sky2 0000:01:00.0: PCI Express error (0xffffffff)
[ 1535.018821] sky2 wan: ram data read parity error
[ 1535.023827] sky2 wan: ram data write parity error
[ 1535.028913] sky2 wan: MAC parity error
[ 1535.032992] sky2 wan: RX parity error
[ 1535.036983] sky2 wan: TCP segmentation error
[ 1535.041655] general protection fault: 0000 [#1] PREEMPT SMP
[ 1535.045601] last sysfs file:
/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
[ 1535.045601] Modules linked in: xt_multiport cpufreq_userspace xt_DSCP
xt_length xt_mark xt_dscp xt_MARK xt_CONNMARK xt_comment xt_policy
ipt_REDIRECT ip6t_LOG xt_tcpudp ip6table_mangle iptable_mangle
ip6table_filter ip6_tables sit tunnel4 8021q garp stp llc ipt_LOG
xt_limit xt_state iptable_nat iptable_filter ip_tables x_tables dm_mod
p4_clockmod speedstep_lib freq_table tun imq nf_nat_ftp nf_nat
nf_conntrack_ftp nf_conntrack_ipv6 nf_conntrack_ipv4 nf_conntrack
nf_defrag_ipv4 ipv6 evdev parport_pc parport serio_raw pcspkr i2c_i801
i2c_core iTCO_wdt rng_core intel_agp agpgart squashfs sqlzma unlzma loop
aufs exportfs nls_utf8 nls_cp437 ide_generic sd_mod ide_gd_mod
ata_generic pata_acpi ata_piix piix ide_pci_generic ide_core skge sky2
thermal_sys
[ 1535.045601]
[ 1535.045601] Pid: 9960, comm: mv Not tainted (2.6.28.10 #2)
[ 1535.045601] EIP: 0060:[<f808085a>] EFLAGS: 00010286 CPU: 0
[ 1535.045601] EIP is at sky2_mac_intr+0x22/0x9d [sky2]
[ 1535.045601] EAX: f8090f88 EBX: 00000001 ECX: 00000008 EDX: 000000ff
[ 1535.045601] ESI: 00000000 EDI: f682cb80 EBP: 00000080 ESP: f5f13ed4
[ 1535.045601] DS: 0068 ES: 0068 FS: 00d8 GS: 0033 SS: 0068
[ 1535.045601] Process mv (pid: 9960, ti=f5f12000 task=f4a961c0
task.ti=f5f12000)
[ 1535.045601] Stack:
[ 1535.045601] ff08340b f682cb88 ffffffff ffffffff f712b800 f80839d6
00000040 f682cb88
[ 1535.045601] 00000000 00000001 f682cb80 c082111a 00000000 00000000
00000003 f7014b80
[ 1535.045601] c0a604e8 00000246 f7014b80 c0838f21 00000000 c0a604e8
00000101 c1d10124
[ 1535.045601] Call Trace:
[ 1535.045601] [<f80839d6>] sky2_poll+0x1cb/0xbed [sky2]
[ 1535.045601] [<c082111a>] __wake_up+0x29/0x39
[ 1535.045601] [<c0a604e8>] _spin_unlock_irqrestore+0x22/0x39
[ 1535.045601] [<c0838f21>] __queue_work+0x4d/0x5a
[ 1535.045601] [<c0a604e8>] _spin_unlock_irqrestore+0x22/0x39
[ 1535.045601] [<c09eda45>] net_rx_action+0xb8/0x1f6
[ 1535.045601] [<c082f954>] __do_softirq+0x95/0x142
[ 1535.045601] [<c082fa49>] do_softirq+0x48/0x57
[ 1535.045601] [<c082fbc9>] irq_exit+0x3b/0x78
[ 1535.045601] [<c081218f>] smp_apic_timer_interrupt+0x75/0x7f
[ 1535.045601] [<c0804f48>] apic_timer_interrupt+0x28/0x30
[ 1535.045601] [<c0a60000>] rwsem_down_failed_common+0xa4/0x175
[ 1535.045601] Code: c0 83 c4 14 5b 5e 5f 5d c3 55 89 d5 57 89 c7 56 53
89 d3 c1 e5 07 83 ec 04 8b 74 90 30 8d 85 08 0f 00 00 03 07 8a 10 88 54
24 03 <f6> 86 0d 05 00 00 02 74 12 0f b6 c2 50 56 68 84 5b 08 f8 e8 cd
[ 1535.045601] EIP: [<f808085a>] sky2_mac_intr+0x22/0x9d [sky2] SS:ESP
0068:f5f13ed4
[ 1535.302490] Kernel panic - not syncing: Fatal exception in interrupt
[ 1535.309412] Rebooting in 30 seconds..


Or even when doing it more slowly, interface by interface:

[~]# ifdown tun6to4; cat /proc/net/dev | cut -d: -f1 | grep -v Inter |
grep -v face | sort -u | while read iface; do echo $iface; ifdown
$iface; sleep 3s; done
hb
lo
dmz
lan
[ 1127.000261] sky2 0000:04:00.0: error interrupt status=0xffffffff
[ 1127.007348] sky2 0000:04:00.0: PCI hardware error (0xffff)
[ 1127.013745] sky2 0000:04:00.0: PCI Express error (0xffffffff)
[ 1127.020468] sky2 lan: ram data read parity error
[ 1127.025834] sky2 lan: ram data write parity error
[ 1127.031302] sky2 lan: MAC parity error
[ 1127.035671] sky2 lan: RX parity error
[ 1127.039910] sky2 lan: TCP segmentation error
[ 1127.045079] general protection fault: 0000 [#1] PREEMPT SMP
[ 1127.048879] last sysfs file:
/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
[ 1127.048879] Modules linked in: xt_multiport cpufreq_userspace xt_DSCP
xt_length xt_mark xt_dscp xt_MARK xt_CONNMARK xt_comment xt_policy
ipt_REDIRECT ip6t_LOG xt_tcpudp ip6table_mangle iptable_mangle
ip6table_filter ip6_tables sit tunnel4 8021q garp stp llc ipt_LOG
xt_limit xt_state iptable_nat iptable_filter ip_tables x_tables dm_mod
p4_clockmod speedstep_lib freq_table tun imq nf_nat_ftp nf_nat
nf_conntrack_ftp nf_conntrack_ipv6 nf_conntrack_ipv4 nf_conntrack
nf_defrag_ipv4 ipv6 evdev parport_pc parport pcspkr serio_raw i2c_i801
i2c_core iTCO_wdt rng_core intel_agp agpgart squashfs sqlzma unlzma loop
aufs exportfs nls_utf8 nls_cp437 ide_generic sd_mod ide_gd_mod
ata_generic pata_acpi ata_piix piix ide_pci_generic ide_core skge sky2
thermal_sys
[ 1127.048879]
[ 1127.048879] Pid: 20150, comm: rndc Not tainted (2.6.28.10 #2)
[ 1127.048879] EIP: 0060:[<f808085a>] EFLAGS: 00010286 CPU: 0
[ 1127.048879] EIP is at sky2_mac_intr+0x22/0x9d [sky2]
[ 1127.048879] EAX: f80d8f88 EBX: 00000001 ECX: 00000008 EDX: 000000ff
[ 1127.048879] ESI: 00000000 EDI: f68c2a80 EBP: 00000080 ESP: eb83fb38
[ 1127.048879] DS: 0068 ES: 0068 FS: 00d8 GS: 0000 SS: 0068
[ 1127.048879] Process rndc (pid: 20150, ti=eb83e000 task=f695bb00
task.ti=eb83e000)
[ 1127.048879] Stack:
[ 1127.048879] ff08340b f68c2a88 ffffffff ffffffff f712c000 f80839d6
00000040 f68c2a88
[ 1127.048879] c0a78d54 f70344e0 f68c2a80 f695bb00 c0a78d54 c0a604e8
c1d10980 c0a78d54
[ 1127.048879] c0827013 00000000 0000000f 00000246 f70344e0 00000102
c0be5180 c0832dc6
[ 1127.048879] Call Trace:
[ 1127.048879] [<f80839d6>] sky2_poll+0x1cb/0xbed [sky2]
[ 1127.048879] [<c0a604e8>] _spin_unlock_irqrestore+0x22/0x39
[ 1127.048879] [<c0827013>] try_to_wake_up+0x158/0x162
[ 1127.048879] [<c0832dc6>] process_timeout+0x0/0x5
[ 1127.048879] [<c09eda45>] net_rx_action+0xb8/0x1f6
[ 1127.048879] [<c082f954>] __do_softirq+0x95/0x142
[ 1127.048879] [<c082fa49>] do_softirq+0x48/0x57
[ 1127.048879] [<c082fbc9>] irq_exit+0x3b/0x78
[ 1127.048879] [<c081218f>] smp_apic_timer_interrupt+0x75/0x7f
[ 1127.048879] [<c0804f48>] apic_timer_interrupt+0x28/0x30
[ 1127.048879] [<c0867764>] get_page_from_freelist+0x2b8/0x3df
[ 1127.048879] [<c0867ae0>] __alloc_pages_internal+0x98/0x37f
[ 1127.048879] [<c0862ee0>] find_lock_page+0x10/0x43
[ 1127.048879] [<c0a60555>] _spin_unlock+0x10/0x23
[ 1127.048879] [<c086fb70>] __do_fault+0xaa/0x3bc
[ 1127.048879] [<c08718e1>] handle_mm_fault+0x54a/0xbfa
[ 1127.048879] [<c0a60555>] _spin_unlock+0x10/0x23
[ 1127.048879] [<c089442d>] __d_lookup+0xfa/0x116
[ 1127.048879] [<c088cb78>] do_lookup+0x53/0x153
[ 1127.048879] [<c0893375>] dput+0x16/0xfc
[ 1127.048879] [<c088eb25>] __link_path_walk+0xb01/0xbfb
[ 1127.048879] [<c0a60555>] _spin_unlock+0x10/0x23
[ 1127.048879] [<c086f05e>] kmap_high+0x17c/0x186
[ 1127.048879] [<c0819b76>] default_spin_lock_flags+0x5/0x7
[ 1127.048879] [<c081a64b>] do_page_fault+0x335/0x86e
[ 1127.048879] [<c0a60555>] _spin_unlock+0x10/0x23
[ 1127.048879] [<c0870a91>] unmap_vmas+0x498/0x6ab
[ 1127.048879] [<c087321e>] free_pgtables+0x7d/0x93
[ 1127.048879] [<c086d42e>] vma_prio_tree_insert+0x17/0x7f
[ 1127.048879] [<c0874a45>] vma_link+0x51/0x73
[ 1127.048879] [<c0a60555>] _spin_unlock+0x10/0x23
[ 1127.048879] [<c0874a5f>] vma_link+0x6b/0x73
[ 1127.048879] [<c08763b8>] mmap_region+0x475/0x58c
[ 1127.048879] [<c08767a4>] do_mmap_pgoff+0x2d5/0x326
[ 1127.048879] [<c08081db>] sys_mmap2+0x62/0x77
[ 1127.048879] [<c08081e9>] sys_mmap2+0x70/0x77
[ 1127.048879] [<c081a316>] do_page_fault+0x0/0x86e
[ 1127.048879] [<c0a60805>] error_code+0x75/0x80
[ 1127.048879] Code: c0 83 c4 14 5b 5e 5f 5d c3 55 89 d5 57 89 c7 56 53
89 d3 c1 e5 07 83 ec 04 8b 74 90 30 8d 85 08 0f 00 00 03 07 8a 10 88 54
24 03 <f6> 86 0d 05 00 00 02 74 12 0f b6 c2 50 56 68 84 5b 08 f8 e8 cd
[ 1127.048879] EIP: [<f808085a>] sky2_mac_intr+0x22/0x9d [sky2] SS:ESP
0068:eb83fb38
[ 1127.470534] Kernel panic - not syncing: Fatal exception in interrupt
[ 1127.478035] Rebooting in 30 seconds..



It seems that the oops occurs when the last network interface using the
sky2 module goes down, although I am not completely certain about this.
I am also fairly sure that the other patches applied to 2.6.28.10 are
not at fault, as the same kernel works perfectly well on different
hardware (which is not using the sky2 NIC module).

Attached are the lspci -v output and the kernel config.

Any hints on what may be wrong would be highly appreciated. I am able to
try patches to sky2 and/or give remote ssh access to the box (although
it will be offline for 5 minutes after triggering the oops...).

best regards,
Rene
Stephen Hemminger
2009-07-21 16:58:53 UTC
On Tue, 21 Jul 2009 18:26:39 +0200
Post by Rene Mayrhofer
Hi everybody,
[Please CC me in replies, I am not currently subscribed to this list.]
I have a fully reproducible kernel oops in the sky2 module in kernel
2.6.28.10. The kernel is a vanilla 2.6.28.10 (and I can't switch to
anything newer at this time because of missing squashfs-lzma support),
patched with PaX, netfilter-layer7, squashfs (with LZMA), and IMQ. The
base system is a Debian Lenny with some updates from testing/unstable.
Whenever interfaces using the sky2 module (this box has 8 network
Looks like the device is disappearing from the PCI bus when
brought down. Can you reproduce it with 2.6.30.2 or 2.6.31-rc3?
Try later kernels.
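
A status word of 0xffffffff is what a read typically returns once a PCI device stops responding, which would explain why every error message in the log fires at once: each flag test sees its bit set. A minimal user-space sketch of that effect follows. The message strings come from the log above, but the bit positions and all demo_ names are made up for illustration; the real masks are defined in the driver's sky2.h and are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Messages as seen in the log; the bit positions are made up for this demo. */
static const char *const demo_names[] = {
    "ram data read parity error",
    "ram data write parity error",
    "MAC parity error",
    "RX parity error",
    "TCP segmentation error",
};
#define DEMO_NAME_COUNT (sizeof(demo_names) / sizeof(demo_names[0]))

/* Store the message for every set demo bit in out[]; return how many. */
static size_t demo_decode(uint32_t status, const char **out, size_t cap)
{
    size_t n = 0;

    assert(out != NULL);
    for (size_t bit = 0; bit < DEMO_NAME_COUNT && n < cap; bit++)
        if (status & (UINT32_C(1) << bit))
            out[n++] = demo_names[bit];
    return n;
}
```

With an all-ones status, demo_decode() reports all five messages; with a single bit set it reports only the matching one.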
--
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Rene Mayrhofer
2009-07-21 19:59:50 UTC
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Post by Stephen Hemminger
Looks like the device is disappearing from the PCI bus when
brought down. Can you reproduce it with 2.6.30.2 or 2.6.31-rc3?
Try later kernels.
This is, as mentioned in my initial email, unfortunately not an option
until later kernels support squashfs-lzma again. These embedded
appliances boot from compact flash with the root FS in squashfs.

The question probably is if newer sky2 module sources would compile and
work with 2.6.28.10. Is this expected to work?

Rene
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkpmHjIACgkQq7SPDcPCS95QPwCdHoFXzDWKJVc7ZX5pfGvGq0JB
q6AAnR633XVUQA0DlVDObSKLGvBIJ6nM
=drfT
-----END PGP SIGNATURE-----
Stephen Hemminger
2009-07-21 20:54:52 UTC
On Tue, 21 Jul 2009 21:59:50 +0200
Post by Rene Mayrhofer
The question probably is if newer sky2 module sources would compile and
work with 2.6.28.10. Is this expected to work?
The kernel API changes (net_device_ops) are not in 2.6.28, as far as I remember.
Stephen Hemminger
2009-07-23 17:28:48 UTC
You could try commenting out sky2_shutdown, which does the WoL
power-down work. Changing the Wake-on-LAN setting might
help as well.

What happens if you take the interface down with 'ip link set eth0 down' (or ifconfig)?

There are several different register writes in the shutdown path.
You could add code to check whether a particular access is disabling
the PCI bus with:

sky2_write(... some register...)
BUG_ON(sky2_read16(sky2->hw, B0_CTST) == 0xffff);
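
The idea of checking after each write can be tried out in user space first. The following is only a sketch under stated assumptions: the device is simulated, every demo_ name and register number is hypothetical, and instead of halting like BUG_ON it returns the index of the first write after which the device reads back as 0xffff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated device: once it stops responding, every read returns all ones. */
struct demo_hw {
    bool     responding;
    unsigned kill_reg;   /* hypothetical register whose write drops the device */
};

/* Stand-in for a 16-bit status read such as the one in the BUG_ON above. */
static uint16_t demo_read16(const struct demo_hw *hw)
{
    return hw->responding ? 0x0001 : 0xffff;
}

/* Stand-in for a register write in the shutdown path. */
static void demo_write(struct demo_hw *hw, unsigned reg)
{
    if (reg == hw->kill_reg)
        hw->responding = false;
}

/*
 * Replay a sequence of register writes and return the index of the first
 * write after which the device reads back as 0xffff, or -1 if none does.
 */
static int demo_first_fatal_write(struct demo_hw *hw,
                                  const unsigned *regs, size_t n)
{
    assert(hw != NULL && regs != NULL);
    for (size_t i = 0; i < n; i++) {
        demo_write(hw, regs[i]);
        if (demo_read16(hw) == 0xffff)
            return (int)i;
    }
    return -1;
}
```

In the driver itself the same pattern would mean one check per write in the shutdown path, so that the first failing check names the access that takes the device off the bus.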
Rene Mayrhofer
2009-07-27 11:03:17 UTC
Post by Stephen Hemminger
You could try commenting out sky2_shutdown which does the Wol
power down stuff. Maybe changing setting of Wake On Lan would
help as well.
What happens if you take interface down 'ip link set eth0 down' (or ifconfig)?
There are several different register writes in the shutdown path.
You could add code to check if a particular access is disabling
the PCI bus with:
sky2_write(... some register...)
BUG_ON(sky2_read16(sky2->hw, B0_CTST) == 0xffff);
I tried adding that wherever it seemed to make sense, resulting in

[~]# modinfo sky2
filename: /lib/modules/2.6.28.10/kernel/drivers/net/sky2.ko
version: 1.22
license: GPL
author: Stephen Hemminger <***@linux-foundation.org>
description: Marvell Yukon 2 Gigabit Ethernet driver
srcversion: 1A63521E698522157C4F229
alias: pci:v000011ABd00004380sv*sd*bc*sc*i*
alias: pci:v000011ABd00004370sv*sd*bc*sc*i*
alias: pci:v000011ABd0000436Dsv*sd*bc*sc*i*
alias: pci:v000011ABd0000436Csv*sd*bc*sc*i*
alias: pci:v000011ABd0000436Bsv*sd*bc*sc*i*
alias: pci:v000011ABd0000436Asv*sd*bc*sc*i*
alias: pci:v000011ABd00004369sv*sd*bc*sc*i*
alias: pci:v000011ABd00004368sv*sd*bc*sc*i*
alias: pci:v000011ABd00004367sv*sd*bc*sc*i*
alias: pci:v000011ABd00004366sv*sd*bc*sc*i*
alias: pci:v000011ABd00004365sv*sd*bc*sc*i*
alias: pci:v000011ABd00004364sv*sd*bc*sc*i*
alias: pci:v000011ABd00004363sv*sd*bc*sc*i*
alias: pci:v000011ABd00004362sv*sd*bc*sc*i*
alias: pci:v000011ABd00004361sv*sd*bc*sc*i*
alias: pci:v000011ABd00004360sv*sd*bc*sc*i*
alias: pci:v000011ABd0000435Asv*sd*bc*sc*i*
alias: pci:v000011ABd00004357sv*sd*bc*sc*i*
alias: pci:v000011ABd00004356sv*sd*bc*sc*i*
alias: pci:v000011ABd00004355sv*sd*bc*sc*i*
alias: pci:v000011ABd00004354sv*sd*bc*sc*i*
alias: pci:v000011ABd00004353sv*sd*bc*sc*i*
alias: pci:v000011ABd00004352sv*sd*bc*sc*i*
alias: pci:v000011ABd00004351sv*sd*bc*sc*i*
alias: pci:v000011ABd00004350sv*sd*bc*sc*i*
alias: pci:v000011ABd00004347sv*sd*bc*sc*i*
alias: pci:v000011ABd00004346sv*sd*bc*sc*i*
alias: pci:v000011ABd00004345sv*sd*bc*sc*i*
alias: pci:v000011ABd00004344sv*sd*bc*sc*i*
alias: pci:v000011ABd00004343sv*sd*bc*sc*i*
alias: pci:v000011ABd00004342sv*sd*bc*sc*i*
alias: pci:v000011ABd00004341sv*sd*bc*sc*i*
alias: pci:v000011ABd00004340sv*sd*bc*sc*i*
alias: pci:v00001186d00004B03sv*sd*bc*sc*i*
alias: pci:v00001186d00004B02sv*sd*bc*sc*i*
alias: pci:v00001186d00004001sv*sd*bc*sc*i*
alias: pci:v00001186d00004B00sv*sd*bc*sc*i*
alias: pci:v00001148d00009E00sv*sd*bc*sc*i*
alias: pci:v00001148d00009000sv*sd*bc*sc*i*
depends:
vermagic: 2.6.28.10 SMP preempt mod_unload 586
parm: debug:Debug level (0=none,...,16=all) (int)
parm: copybreak:Receive copy threshold (int)
parm: disable_msi:Disable Message Signaled Interrupt (MSI) (int)
parm: entropy:Allow sky2 to populate the /dev/random entropy
pool (int)
[~]# uname -a
Linux gibraltar3-esys-master 2.6.28.10 #3 SMP PREEMPT Wed Jul 22
10:31:57 UTC 2009 i686 GNU/Linux

[~]# ifdown -a
[ 311.812527] Dead loop on virtual device tun6to4, fix it urgently!
[ 312.000037] sky2 0000:01:00.0: error interrupt status=0xffffffff
[ 312.006303] sky2 0000:01:00.0: PCI hardware error (0xffff)
[ 312.011902] sky2 0000:01:00.0: PCI Express error (0xffffffff)
[ 312.017885] sky2 wan: ram data read parity error
[ 312.022680] sky2 wan: ram data write parity error
[ 312.027546] sky2 wan: MAC parity error
[ 312.031444] sky2 wan: RX parity error
[ 312.035241] sky2 wan: TCP segmentation error
[ 312.039705] general protection fault: 0000 [#1] PREEMPT SMP
[ 312.043677] last sysfs file:
/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
[ 312.043677] Modules linked in: xt_multiport cpufreq_userspace xt_DSCP
xt_length xt_mark xt_dscp xt_MARK xt_CONNMARK xt_comment xt_policy
ipt_REDIRECT ip6t_LOG xt_tcpudp ip6table_mangle iptable_mangle
ip6table_filter ip6_tables sit tunnel4 8021q garp stp llc ipt_LOG
xt_limit xt_state iptable_nat iptable_filter ip_tables x_tables dm_mod
p4_clockmod speedstep_lib freq_table tun imq nf_nat_ftp nf_nat
nf_conntrack_ftp nf_conntrack_ipv6 nf_conntrack_ipv4 nf_conntrack
nf_defrag_ipv4 ipv6 evdev parport_pc parport i2c_i801 serio_raw pcspkr
i2c_core iTCO_wdt rng_core intel_agp agpgart squashfs sqlzma unlzma loop
aufs exportfs nls_utf8 nls_cp437 ide_generic sd_mod ide_gd_mod
ata_generic pata_acpi ata_piix piix ide_pci_generic ide_core skge sky2
thermal_sys
[ 312.043677]
[ 312.043677] Pid: 0, comm: swapper Not tainted (2.6.28.10 #3)
[ 312.043677] EIP: 0060:[<f8080b6d>] EFLAGS: 00010286 CPU: 0
[ 312.043677] EIP is at sky2_mac_intr+0x22/0x9d [sky2]
[ 312.043677] EAX: f8090f88 EBX: 00000001 ECX: 00000008 EDX: 000000ff
[ 312.043677] ESI: 00000000 EDI: f681ab80 EBP: 00000080 ESP: c0badec0
[ 312.043677] DS: 0068 ES: 0068 FS: 00d8 GS: 0000 SS: 0068
[ 312.043677] Process swapper (pid: 0, ti=c0bac000 task=c0b59228
task.ti=c0bac000)
[ 312.043677] Stack:
[ 312.043677] ff0844de f681ab88 ffffffff ffffffff f7146800 f8084acb
00000040 f681ab88
[ 312.043677] c0819b76 c0a6033a f681ab80 c0be5180 c08330e4 f6873c00
c0a604e8 c0be5180
[ 312.043677] c0be5180 c0833277 00000bd6 00000282 f6873c00 00000102
c0be5180 f86c8789
[ 312.043677] Call Trace:
[ 312.043677] [<f8084acb>] sky2_poll+0x1cb/0xbec [sky2]
[ 312.043677] [<c0819b76>] default_spin_lock_flags+0x5/0x7
[ 312.043677] [<c0a6033a>] _spin_lock_irqsave+0x2d/0x33
[ 312.043677] [<c08330e4>] lock_timer_base+0x19/0x35
[ 312.043677] [<c0a604e8>] _spin_unlock_irqrestore+0x22/0x39
[ 312.043677] [<c0833277>] __mod_timer+0xc9/0xd2
[ 312.043677] [<f86c8789>] garp_join_timer+0x0/0x46 [garp]
[ 312.043677] [<c0832bcb>] run_timer_softirq+0x145/0x1a4
[ 312.043677] [<c0a6030a>] _spin_lock_irq+0x1e/0x21
[ 312.043677] [<c09eda45>] net_rx_action+0xb8/0x1f6
[ 312.043677] [<c082f954>] __do_softirq+0x95/0x142
[ 312.043677] [<c082fa49>] do_softirq+0x48/0x57
[ 312.043677] [<c082fbc9>] irq_exit+0x3b/0x78
[ 312.043677] [<c081218f>] smp_apic_timer_interrupt+0x75/0x7f
[ 312.043677] [<c0804f48>] apic_timer_interrupt+0x28/0x30
[ 312.043677] [<c080a2c6>] mwait_idle+0x2f/0x3b
[ 312.043677] [<c0802ac9>] cpu_idle+0x7a/0xad
[ 312.043677] Code: 06 89 10 5f 5b 5e 5f 5d c3 55 89 d5 57 89 c7 56 53
89 d3 c1 e5 07 83 ec 04 8b 74 90 30 8d 85 08 0f 00 00 03 07 8a 10 88 54
24 03 <f6> 86 0d 05 00 00 02 74 12 0f b6 c2 50 56 68 13 6c 08 f8 e8 ba
[ 312.043677] EIP: [<f8080b6d>] sky2_mac_intr+0x22/0x9d [sky2] SS:ESP
0068:c0badec0
[ 312.315704] Kernel panic - not syncing: Fatal exception in interrupt
[ 312.322253] Rebooting in 30 seconds..

Does that help in any way?

best regards,
Rene
Stephen Hemminger
2009-07-27 16:30:18 UTC
On Mon, 27 Jul 2009 13:03:17 +0200
Post by Rene Mayrhofer
I tried adding that wherever it seemed to make sense, resulting in
Does the platform use MSI? Perhaps it generates a bogus interrupt when
powered off.
Rene Mayrhofer
2009-07-28 07:21:27 UTC
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Post by Stephen Hemminger
Does the platform use MSI? Perhaps it generates a bogus interrupt when
powered off.
Potentially:

[***@gibraltar3-esys-master ~]# cat /proc/interrupts
CPU0
0: 311 IO-APIC-edge timer
1: 2 IO-APIC-edge i8042
2: 0 XT-PIC-XT cascade
4: 440 IO-APIC-edge serial
7: 0 IO-APIC-edge parport0
8: 87 IO-APIC-edge rtc0
14: 124214 IO-APIC-edge ide0
15: 0 IO-APIC-edge ide1
19: 95 IO-APIC-fasteoi ata_piix
20: 962 IO-APIC-fasteoi asak
21: 3644 IO-APIC-fasteoi testnet
22: 108021 IO-APIC-fasteoi hb
23: 0 IO-APIC-fasteoi ehci_hcd:usb1, uhci_hcd:usb2, voip
504: 188598 PCI-MSI-edge lan
505: 7989 PCI-MSI-edge dmz
506: 317686 PCI-MSI-edge gibsrv
507: 84129 PCI-MSI-edge wan
NMI: 0 Non-maskable interrupts
LOC: 11928173 Local timer interrupts
RES: 0 Rescheduling interrupts
CAL: 0 Function call interrupts
TLB: 0 TLB shootdowns
TRM: 0 Thermal event interrupts
SPU: 0 Spurious interrupts
ERR: 0
MIS: 0

Do I interpret this correctly that MSI is used by sky2 (those above are
the network interface names)? Sorry for my ignorance in this regard, but
I haven't consciously used or debugged MSI so far.
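
The table above can also be read mechanically. As a rough sketch (the demo_ helper names are made up for illustration, and lines are assumed to have single spaces between fields and no trailing whitespace), rows whose type column contains PCI-MSI are MSI vectors, and the last word on the row is the handler name, here the interface name:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return nonzero if a /proc/interrupts line describes an MSI vector. */
static int demo_is_msi_line(const char *line)
{
    assert(line != NULL);
    return strstr(line, "PCI-MSI") != NULL;
}

/*
 * Return a pointer to the last space-separated word of the line.
 * For rows shared by several handlers this is only the last of them.
 */
static const char *demo_last_word(const char *line)
{
    const char *last;

    assert(line != NULL);
    last = strrchr(line, ' ');
    return last ? last + 1 : line;
}
```

Applied to the output above, this would flag lan, dmz, gibsrv and wan as MSI users. The modinfo output earlier in the thread lists a disable_msi parameter for sky2, which might be one way to compare behaviour without MSI.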

best regards,
Rene
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkpupvcACgkQq7SPDcPCS94ULQCgkcTQe5/HepuXuncx4grujtrv
adwAoOIERCJIVph/uwPTjVAwDQj7vnBC
=4/Xu
-----END PGP SIGNATURE-----
Stephen Hemminger
2009-07-27 22:35:48 UTC
Does this help?

--- a/drivers/net/sky2.c 2009-07-27 15:28:27.653757064 -0700
+++ b/drivers/net/sky2.c 2009-07-27 15:34:24.358730966 -0700
@@ -2763,6 +2763,11 @@ static int sky2_poll(struct napi_struct
int work_done = 0;
u16 idx;

+ if (unlikely(status == ~0)) {
+ dev_info(&hw->pdev->dev, "device status error\n");
+ goto clear_napi;
+ }
+
if (unlikely(status & Y2_IS_ERROR))
sky2_err_intr(hw, status);

@@ -2779,6 +2784,7 @@ static int sky2_poll(struct napi_struct
goto done;
}

+clear_napi:
napi_complete(napi);
sky2_read32(hw, B0_Y2_SP_LISR);
done:
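The idea behind the check, as far as it can be read from the patch: a read from a PCI device that no longer answers typically comes back as all ones, so every "error" bit in such a status word appears set and none of them can be trusted. The sketch below illustrates that in plain userspace C. The bit names are hypothetical and are NOT the real sky2 register layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error bits for illustration; NOT the real sky2 values. */
enum {
    EX_ERR_PARITY = 1u << 0,
    EX_ERR_MAC    = 1u << 1,
    EX_ERR_TSO    = 1u << 2
};

/* An all-ones word is what a read usually returns when no device answers. */
int status_is_all_ones(uint32_t status)
{
    return status == UINT32_MAX;
}

/* Count the illustrative error bits a status word reports.
 * Returns -1 when the word is all ones and therefore not trustworthy. */
int count_reported_errors(uint32_t status)
{
    const uint32_t bits[] = { EX_ERR_PARITY, EX_ERR_MAC, EX_ERR_TSO };
    int n = 0;

    if (status_is_all_ones(status))
        return -1;
    for (size_t i = 0; i < sizeof(bits) / sizeof(bits[0]); i++)
        if (status & bits[i])
            n++;
    return n;
}

/* Small self-check showing the intended behaviour. */
void self_check(void)
{
    assert(count_reported_errors(0xffffffffu) == -1);
    assert(count_reported_errors(EX_ERR_MAC) == 1);
}
```

Without the early return, the all-ones word would be decoded as every error at once, which matches the burst of parity/MAC/TSO messages in the logs.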
Rene Mayrhofer
2009-07-28 07:25:03 UTC
Permalink
Post by Stephen Hemminger
Does this help?
Trying right now, will report results as soon as I have them.

Rene
Rene Mayrhofer
2009-07-28 09:48:05 UTC
Permalink
Rene Mayrhofer
2009-08-03 11:55:34 UTC
Permalink

I have now tried again with the newest stable kernel (2.6.30.4), without
PaX and squashfs-lzma support. Still the same problem:

[~]# uname -a
Linux gibraltar3-esys-master 2.6.30.4 #9 SMP PREEMPT Fri Jul 31 15:32:55
UTC 2009 i686 GNU/Linux
[~]# /etc/init.d/networking restart
Reconfiguring network interfaces...[ 277.816049] sky2 0000:01:00.0:
error interrupt status=0xffffffff
[ 277.822124] sky2 0000:01:00.0: PCI hardware error (0xffff)
[ 277.827656] sky2 0000:01:00.0: PCI Express error (0xffffffff)
[ 277.833449] sky2 wan: ram data read parity error
[ 277.838107] sky2 wan: ram data write parity error
[ 277.842852] sky2 wan: MAC parity error
[ 277.846643] sky2 wan: RX parity error
[ 277.850345] sky2 wan: TCP segmentation error
[ 277.854688] BUG: unable to handle kernel NULL pointer dereference at
0000038d
[ 277.858653] IP: [<f8050ca5>] sky2_mac_intr+0x30/0xc1 [sky2]
[ 277.858653] *pde = 00000000
[ 277.858653] Oops: 0000 [#1] PREEMPT SMP
[ 277.858653] last sysfs file:
/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
[ 277.858653] Modules linked in: xt_multiport cpufreq_userspace xt_DSCP
xt_length xt_mark xt_dscp xt_MARK xt_CONNMARK xt_comment xt_policy
ipt_REDIRECT ip6t_LOG xt_tcpudp ip6table_mangle iptable_mangle
ip6table_filter ip6_tables sit tunnel4 8021q garp stp llc ipt_LOG
xt_limit xt_state iptable_nat iptable_filter ip_tables x_tables dm_mod
p4_clockmod speedstep_lib freq_table tun imq nf_nat_ftp nf_nat
nf_conntrack_ftp nf_conntrack_ipv6 nf_conntrack_ipv4 nf_conntrack
nf_defrag_ipv4 ipv6 evdev parport_pc parport serio_raw i2c_i801 pcspkr
i2c_core iTCO_wdt rng_core intel_agp loop aufs exportfs nls_utf8
nls_cp437 ide_generic sd_mod ide_gd_mod ata_generic pata_acpi skge
ata_piix piix ide_pci_generic ide_core sky2 thermal_sys
[ 277.858653]
[ 277.858653] Pid: 9423, comm: tlsmgr Not tainted (2.6.30.4 #9)
[ 277.858653] EIP: 0060:[<f8050ca5>] EFLAGS: 00010286 CPU: 0
[ 277.858653] EIP is at sky2_mac_intr+0x30/0xc1 [sky2]
[ 277.858653] EAX: f8068f88 EBX: 00000001 ECX: 00000008 EDX: 000000ff
[ 277.858653] ESI: 00000000 EDI: f6901b80 EBP: f6acfce4 ESP: f6acfccc
[ 277.858653] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 277.858653] Process tlsmgr (pid: 9423, ti=f6ace000 task=f7176e70
task.ti=f6ace000)
[ 277.858653] Stack:
[ 277.858653] 00000080 ff901b80 968c5f08 f71ed840 ffffffff ffffffff
f6acfd6c f80542d8
[ 277.858653] 00000000 c181d260 00000040 f6901b88 f6acfd08 c04ee2b5
f6901b80 ffffffff
[ 277.858653] c022ded2 f71ef000 00000000 00000000 0000000f c181d260
00000000 00000246
[ 277.858653] Call Trace:
[ 277.858653] [<f80542d8>] ? sky2_poll+0x1d2/0xb1e [sky2]
[ 277.858653] [<c04ee2b5>] ? _spin_unlock_irqrestore+0x31/0x44
[ 277.858653] [<c022ded2>] ? try_to_wake_up+0x291/0x2ac
[ 277.858653] [<c022df62>] ? wake_up_process+0x1b/0x2e
[ 277.858653] [<c04772f4>] ? __qdisc_run+0x73/0x1ca
[ 277.858653] [<c0463cc2>] ? net_rx_action+0x9e/0x1a2
[ 277.858653] [<c0237b5e>] ? __do_softirq+0xb2/0x188
[ 277.858653] [<c0237c73>] ? do_softirq+0x3f/0x5c
[ 277.858653] [<c0237dfd>] ? irq_exit+0x37/0x80
[ 277.858653] [<c0213cfd>] ? smp_apic_timer_interrupt+0x7c/0x9b
[ 277.858653] [<c02037dd>] ? apic_timer_interrupt+0x31/0x38
[ 277.858653] [<c0371524>] ? radix_tree_lookup_slot+0x34/0x79
[ 277.858653] [<c0284852>] ? find_get_page+0x34/0xc6
[ 277.858653] [<c0284c9e>] ? find_lock_page+0x21/0x67
[ 277.858653] [<c0285214>] ? filemap_fault+0x97/0x366
[ 277.858653] [<c0297054>] ? __do_fault+0x56/0x3b0
[ 277.858653] [<c02503a2>] ? getnstimeofday+0x5f/0xf3
[ 277.858653] [<c0252d85>] ? clockevents_program_event+0xe8/0x108
[ 277.858653] [<c0298f33>] ? handle_mm_fault+0x2b9/0x668
[ 277.858653] [<c024b121>] ? hrtimer_interrupt+0x13e/0x15f
[ 277.858653] [<c021d3f6>] ? do_page_fault+0x1fb/0x21b
[ 277.858653] [<c021d1fb>] ? do_page_fault+0x0/0x21b
[ 277.858653] [<c04ee72a>] ? error_code+0x7a/0x80
[ 277.858653] Code: c7 56 53 89 d3 83 ec 0c 65 a1 14 00 00 00 89 45 f0
31 c0 8b 74 97 3c c1 e2 07 89 d0 05 08 0f 00 00 89 55 e8 03 07 8a 10 88
55 ef <f6> 86 8d 03 00 00 02 74 12 0f b6 c2 50 56 68 30 64 05 f8 e8 74
[ 277.858653] EIP: [<f8050ca5>] sky2_mac_intr+0x30/0xc1 [sky2] SS:ESP
0068:f6acfccc
[ 277.858653] CR2: 000000000000038d
[ 278.173200] ---[ end trace bec12ce036036cbf ]---
[ 278.177861] Kernel panic - not syncing: Fatal exception in interrupt
[ 278.184259] Pid: 9423, comm: tlsmgr Tainted: G D 2.6.30.4 #9
[ 278.190654] Call Trace:
[ 278.193140] [<c04eb04e>] ? printk+0x1d/0x30
[ 278.197452] [<c04eaf8c>] panic+0x53/0xf8
[ 278.201506] [<c0206368>] oops_end+0x9f/0xbf
[ 278.205817] [<c021ceb4>] no_context+0x11a/0x135
[ 278.210480] [<c021d005>] __bad_area_nosemaphore+0x136/0x14f
[ 278.216177] [<c0374e68>] ? vsnprintf+0x91/0x332
[ 278.220840] [<c04ee2b5>] ? _spin_unlock_irqrestore+0x31/0x44
[ 278.226622] [<c04ee2b5>] ? _spin_unlock_irqrestore+0x31/0x44
[ 278.232404] [<c0232f3f>] ? release_console_sem+0x18b/0x1c9
[ 278.238015] [<c021d03b>] bad_area_nosemaphore+0x1d/0x34
[ 278.243370] [<c021d30b>] do_page_fault+0x110/0x21b
[ 278.248287] [<c021d1fb>] ? do_page_fault+0x0/0x21b
[ 278.253209] [<c04ee72a>] error_code+0x7a/0x80
[ 278.257693] [<c037007b>] ? kobject_uevent_env+0x42/0x387
[ 278.263141] [<f8050ca5>] ? sky2_mac_intr+0x30/0xc1 [sky2]
[ 278.268673] [<f80542d8>] sky2_poll+0x1d2/0xb1e [sky2]
[ 278.273850] [<c04ee2b5>] ? _spin_unlock_irqrestore+0x31/0x44
[ 278.279632] [<c022ded2>] ? try_to_wake_up+0x291/0x2ac
[ 278.284818] [<c022df62>] ? wake_up_process+0x1b/0x2e
[ 278.289914] [<c04772f4>] ? __qdisc_run+0x73/0x1ca
[ 278.294750] [<c0463cc2>] net_rx_action+0x9e/0x1a2
[ 278.299578] [<c0237b5e>] __do_softirq+0xb2/0x188
[ 278.304321] [<c0237c73>] do_softirq+0x3f/0x5c
[ 278.308801] [<c0237dfd>] irq_exit+0x37/0x80
[ 278.313111] [<c0213cfd>] smp_apic_timer_interrupt+0x7c/0x9b
[ 278.318807] [<c02037dd>] apic_timer_interrupt+0x31/0x38
[ 278.324165] [<c0371524>] ? radix_tree_lookup_slot+0x34/0x79
[ 278.329869] [<c0284852>] find_get_page+0x34/0xc6
[ 278.334619] [<c0284c9e>] find_lock_page+0x21/0x67
[ 278.339447] [<c0285214>] filemap_fault+0x97/0x366
[ 278.344276] [<c0297054>] __do_fault+0x56/0x3b0
[ 278.348842] [<c02503a2>] ? getnstimeofday+0x5f/0xf3
[ 278.353847] [<c0252d85>] ? clockevents_program_event+0xe8/0x108
[ 278.359899] [<c0298f33>] handle_mm_fault+0x2b9/0x668
[ 278.364997] [<c024b121>] ? hrtimer_interrupt+0x13e/0x15f
[ 278.370445] [<c021d3f6>] do_page_fault+0x1fb/0x21b
[ 278.375364] [<c021d1fb>] ? do_page_fault+0x0/0x21b
[ 278.380287] [<c04ee72a>] error_code+0x7a/0x80
[ 278.384779] Rebooting in 30 seconds..
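A note on reading the dump, assuming the standard IA-32 instruction encoding: the bytes at the faulting EIP, `<f6> 86 8d 03 00 00 02`, decode as `testb $0x2, 0x38d(%esi)`. With ESI = 00000000 the effective address is 0x38d, which matches CR2. So the code appears to be reading a byte at offset 0x38d of a structure whose pointer is NULL. A few bytes earlier, `8b 74 97 3c` loads ESI from an array indexed by the register that also ends up in EBX (= 1), which suggests, but does not prove, that the NULL pointer is a per-port entry for port index 1. The arithmetic can be sketched as follows (helper names are made up):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 32-bit displacement from an instruction stream. */
uint32_t read_disp32(const uint8_t *p)
{
    assert(p != NULL);
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Effective address of "testb $imm8, disp32(%reg)" for a given register
 * value. insn points at the opcode; layout: opcode, ModRM, disp32, imm8. */
uint32_t fault_address(const uint8_t *insn, uint32_t base_reg)
{
    assert(insn != NULL);
    assert(insn[0] == 0xf6);            /* group-3 opcode, 8-bit operand */
    assert((insn[1] & 0x38) == 0x00);   /* reg field 000: TEST */
    assert((insn[1] & 0xc0) == 0x80);   /* mod = 10: 32-bit displacement */
    assert((insn[1] & 0x07) != 0x04);   /* rm != 100: no SIB byte follows */
    return base_reg + read_disp32(insn + 2);
}
```

Calling `fault_address()` on the seven bytes above with a base of 0 gives 0x38d, the address reported in the oops.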

To allow easier debugging, I have now put our whole kernel tree up in a
public (read-only) git repository at
https://www.gibraltar.at/git/linux-2.6-gibraltar.git. The branch for
this kernel is origin/gibraltar-3.0, although the above dump was
produced by a version slightly "older" than HEAD, which did not yet have
the latest PaX patch applied (no PaX and no lzma-squashfs in this kernel).

Any hints/pointers/patches/etc. would be highly appreciated.

best regards,
Rene

Rene Mayrhofer
2009-08-03 18:19:09 UTC
Permalink
Post by Rene Mayrhofer
I have now tried again with the newest stable kernel (2.6.30.4), without
Sorry for replying to myself, but I tried a few more things to do with MSI:

Neither
find /sys -name "msi_bus" | while read f; do echo 0 > $f; done
nor booting with
pci=nomsi
changed anything. The oops still happens when setting the last sky2
interface down.
Post by Rene Mayrhofer
To allow easier debugging, I have now put our whole kernel tree up in a
public (read-only) git repository at
https://www.gibraltar.at/git/linux-2.6-gibraltar.git. The branch for
this kernel is origin/gibraltar-3.0, although the above dump was
produced by a version slightly "older" than HEAD, which did not yet have
the latest PaX patch applied (no PaX and no lzma-squashfs in this kernel).
I have now updated the branch with both patches (the one from Stephen
and the other one from Mike). I am still testing whether they change
anything with 2.6.30.4 (they didn't help with 2.6.28.10, though...).

best regards,
Rene

Rene Mayrhofer
2009-08-04 07:38:19 UTC
Permalink
Post by Rene Mayrhofer
Post by Rene Mayrhofer
To allow easier debugging, I have now put our whole kernel tree up in a
public (read-only) git repository at
https://www.gibraltar.at/git/linux-2.6-gibraltar.git. The branch for
this kernel is origin/gibraltar-3.0, although the above dump was
produced by a version slightly "older" than HEAD, which did not yet have
the latest PaX patch applied (no PaX and no lzma-squashfs in this kernel).
I have now updated the branch with both patches (the one from Stephen
and the other one from Mike). I am still testing whether they change
anything with 2.6.30.4 (they didn't help with 2.6.28.10, though...).
Result with both patches: there is no immediate crash when setting all
sky2 interfaces down, but I get the following messages repeated roughly
every second:

2009-08-04T09:35:31.030812+02:00 gibraltar3-esys-master kernel: [
592.000071] sky2 0000:01:00.0: device status error
2009-08-04T09:35:32.030908+02:00 gibraltar3-esys-master kernel: [
593.000058] sky2 0000:01:00.0: device status error
2009-08-04T09:35:33.030839+02:00 gibraltar3-esys-master kernel: [
594.000082] sky2 0000:01:00.0: device status error
2009-08-04T09:35:34.030864+02:00 gibraltar3-esys-master kernel: [
595.000118] sky2 0000:01:00.0: device status error
2009-08-04T09:35:35.030975+02:00 gibraltar3-esys-master kernel: [
596.000259] sky2 0000:01:00.0: device status error
2009-08-04T09:35:36.030974+02:00 gibraltar3-esys-master kernel: [
597.000198] sky2 0000:01:00.0: device status error
2009-08-04T09:35:37.030980+02:00 gibraltar3-esys-master kernel: [
598.000203] sky2 0000:01:00.0: device status error

and the network interface fails to work (no ping, nothing with tcpdump,
etc.).

Does anybody have an idea on what might be wrong in sky2_down?

best regards,
Rene
Mike McCormack
2009-08-04 11:18:36 UTC
Permalink
Post by Rene Mayrhofer
Does anybody have an idea on what might be wrong in sky2_down?
I had a look into this, and noticed that we don't hold phy_lock when calling
sky2_phy_power_down() in sky2_down(). sky2_phy_power_down() does some PCI
manipulation, so it's possible this could cause bad things to happen...

Does the following patch help?

Mike



Subject: [PATCH] sky2: Hold phy_lock when powering down phy

Make sure to hold phy_lock when calling sky2_phy_power_down(),
as is done when calling sky2_phy_power_up(),

Signed-off-by: Mike McCormack <***@ring3k.org>
---
drivers/net/sky2.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c
index e9cb1e7..47e5bae 100644
--- a/drivers/net/sky2.c
+++ b/drivers/net/sky2.c
@@ -1894,7 +1894,9 @@ static int sky2_down(struct net_device *dev)
synchronize_irq(hw->pdev->irq);
napi_synchronize(&hw->napi);

+ spin_lock_bh(&sky2->phy_lock);
sky2_phy_power_down(hw, port);
+ spin_unlock_bh(&sky2->phy_lock);

/* turn off LED's */
sky2_write16(hw, B0_Y2LED, LED_STAT_OFF);
--
1.5.6.5
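The calling convention the patch aims for can be modelled in a few lines of userspace C. This is a toy, single-threaded sketch: the `toy_` names and the boolean "lock" are hypothetical and only record whether the caller followed the rule. It is not real locking and not the driver's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a port; the flag only records the calling convention. */
struct toy_port {
    bool phy_lock_held;
    bool phy_powered;
};

void toy_lock(struct toy_port *p)
{
    assert(p != NULL && !p->phy_lock_held);
    p->phy_lock_held = true;
}

void toy_unlock(struct toy_port *p)
{
    assert(p != NULL && p->phy_lock_held);
    p->phy_lock_held = false;
}

/* Both PHY transitions insist on the same precondition: lock held. */
void toy_phy_power_up(struct toy_port *p)
{
    assert(p != NULL && p->phy_lock_held);
    p->phy_powered = true;
}

void toy_phy_power_down(struct toy_port *p)
{
    assert(p != NULL && p->phy_lock_held);
    p->phy_powered = false;
}

/* Same shape as the patched down path: lock, power down, unlock. */
void toy_down(struct toy_port *p)
{
    toy_lock(p);
    toy_phy_power_down(p);
    toy_unlock(p);
}
```

Calling `toy_phy_power_down()` without `toy_lock()` first trips the assertion, which is the asymmetry the patch removes.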
Rene Mayrhofer
2009-08-04 21:31:53 UTC
Permalink
Rene Mayrhofer
2009-08-04 22:55:04 UTC
Permalink
Post by Rene Mayrhofer
Does anybody have an idea on what might be wrong in sky2_down?
btw. for 2.6.30, I found I could copy sky2.c from the netdev git into
my 2.6.30 tree if I added the following line at the end of
dev->trans_start = jiffies; /* prevent tx timeout */
This seems to be already included in the current netdev git.

Nonetheless, the current unmodified version from netdev git solves the
oops in sky2. I have not diffed my old vs. this version, but whoever is
interested in which change fixed the oops, it should be somewhere in
commit 0a1449c in our Gibraltar kernel git repository.

Thanks a lot for that hint!
Rene
Rene Mayrhofer
2009-08-04 22:59:43 UTC
Permalink
Post by Rene Mayrhofer
Nonetheless, the current unmodified version from netdev git solves the
oops in sky2.
Actually, it doesn't. I managed to run networking restart twice without
an oops (with the netdev git version of sky2.c), but after generating
some minor traffic and trying to restart again, I still get this oops:

[~]# /etc/init.d/networking restart
Reconfiguring network interfaces...[ 844.000236] sky2 0000:01:00.0:
error interrupt status=0xffffffff
[ 844.007309] sky2 0000:01:00.0: PCI hardware error (0xffff)
[ 844.013657] sky2 0000:01:00.0: PCI Express error (0xffffffff)
[ 844.020290] sky2 wan: ram data read parity error
[ 844.025697] sky2 wan: ram data write parity error
[ 844.031148] sky2 wan: MAC parity error
[ 844.035522] sky2 wan: RX parity error
[ 844.039812] sky2 wan: TCP segmentation error
[ 844.044966] BUG: unable to handle kernel NULL pointer dereference at
0000038d
[ 844.048782] IP: [<f8050d2d>] sky2_mac_intr+0x30/0xc1 [sky2]
[ 844.048782] *pde = 00000000
[ 844.048782] Oops: 0000 [#1] PREEMPT SMP
[ 844.048782] last sysfs file:
/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
[ 844.048782] Modules linked in: xt_multiport cpufreq_userspace xt_DSCP
xt_length xt_mark xt_dscp xt_MARK xt_CONNMARK xt_comment xt_policy
ipt_REDIRECT ip6t_LOG xt_tcpudp ip6table_mangle iptable_mangle
ip6table_filter ip6_tables sit tunnel4 8021q garp stp llc ipt_LOG
xt_limit xt_state iptable_nat iptable_filter ip_tables x_tables dm_mod
p4_clockmod speedstep_lib freq_table tun imq nf_nat_ftp nf_nat
nf_conntrack_ftp nf_conntrack_ipv6 nf_conntrack_ipv4 nf_conntrack
nf_defrag_ipv4 ipv6 evdev parport_pc parport serio_raw i2c_i801 i2c_core
iTCO_wdt rng_core pcspkr intel_agp loop aufs exportfs nls_utf8 nls_cp437
ide_generic sd_mod ide_gd_mod ata_generic pata_acpi ata_piix skge piix
ide_pci_generic ide_core sky2 thermal_sys
[ 844.048782]
[ 844.048782] Pid: 13285, comm: postfix Not tainted (2.6.30.4 #2)
[ 844.048782] EIP: 0060:[<f8050d2d>] EFLAGS: 00010286 CPU: 0
[ 844.048782] EIP is at sky2_mac_intr+0x30/0xc1 [sky2]
[ 844.048782] EAX: f8068f88 EBX: 00000001 ECX: 00000008 EDX: 000000ff
[ 844.048782] ESI: 00000000 EDI: f6901b80 EBP: e1c83e9c ESP: e1c83e84
[ 844.048782] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 844.048782] Process postfix (pid: 13285, ti=e1c82000 task=e1d105b0
task.ti=e1c82000)
[ 844.048782] Stack:
[ 844.048782] 00000080 ff901b80 eda21a93 f71ed840 ffffffff ffffffff
e1c83f28 f8054181
[ 844.048782] c022594e 00000000 00000040 f6901b88 00000003 eda21a93
f6901b80 ffffffff
[ 844.048782] c181d7a4 f71ef000 c0243594 00000000 c181d7a0 f702e130
eda21a93 e1c83eec
[ 844.048782] Call Trace:
[ 844.048782] [<f8054181>] ? sky2_poll+0x1d2/0xb65 [sky2]
[ 844.048782] [<c022594e>] ? __wake_up+0x41/0x5c
[ 844.048782] [<c0243594>] ? insert_work+0xa5/0xbf
[ 844.048782] [<c04ee2a5>] ? _spin_unlock_irqrestore+0x31/0x44
[ 844.048782] [<c0243e4b>] ? __queue_work+0x36/0x4d
[ 844.048782] [<c047731c>] ? __qdisc_run+0x73/0x1ca
[ 844.048782] [<c0463ce6>] ? net_rx_action+0x9e/0x1a2
[ 844.048782] [<c0237b6e>] ? __do_softirq+0xb2/0x188
[ 844.048782] [<c0237c83>] ? do_softirq+0x3f/0x5c
[ 844.048782] [<c0237e0d>] ? irq_exit+0x37/0x80
[ 844.048782] [<c0213cfd>] ? smp_apic_timer_interrupt+0x7c/0x9b
[ 844.048782] [<c02037dd>] ? apic_timer_interrupt+0x31/0x38
[ 844.048782] Code: c7 56 53 89 d3 83 ec 0c 65 a1 14 00 00 00 89 45 f0
31 c0 8b 74 97 3c c1 e2 07 89 d0 05 08 0f 00 00 89 55 e8 03 07 8a 10 88
55 ef <f6> 86 8d 03 00 00 02 74 12 0f b6 c2 50 56 68 d0 64 05 f8 e8 df
[ 844.048782] EIP: [<f8050d2d>] sky2_mac_intr+0x30/0xc1 [sky2] SS:ESP
0068:e1c83e84
[ 844.048782] CR2: 000000000000038d
[ 844.345647] ---[ end trace d7398807329498ac ]---
[ 844.351055] Kernel panic - not syncing: Fatal exception in interrupt
[ 844.358606] Pid: 13285, comm: postfix Tainted: G D 2.6.30.4
#2
[ 844.366298] Call Trace:
[ 844.369278] [<c04eb041>] ? printk+0x1d/0x30
[ 844.374388] [<c04eaf7f>] panic+0x53/0xf8
[ 844.379197] [<c0206368>] oops_end+0x9f/0xbf
[ 844.384303] [<c021ceb4>] no_context+0x11a/0x135
[ 844.389791] [<c021d005>] __bad_area_nosemaphore+0x136/0x14f
[ 844.396489] [<c0374f60>] ? vsnprintf+0x91/0x332
[ 844.401994] [<c04ee2a5>] ? _spin_unlock_irqrestore+0x31/0x44
[ 844.408787] [<c04ee2a5>] ? _spin_unlock_irqrestore+0x31/0x44
[ 844.415546] [<c0232f4f>] ? release_console_sem+0x18b/0x1c9
[ 844.422152] [<c021d03b>] bad_area_nosemaphore+0x1d/0x34
[ 844.428464] [<c021d30b>] do_page_fault+0x110/0x21b
[ 844.434271] [<c021d1fb>] ? do_page_fault+0x0/0x21b
[ 844.440026] [<c04ee71a>] error_code+0x7a/0x80
[ 844.445442] [<c037007b>] ? add_uevent_var+0x17/0xb9
[ 844.451413] [<f8050d2d>] ? sky2_mac_intr+0x30/0xc1 [sky2]
[ 844.457981] [<f8054181>] sky2_poll+0x1d2/0xb65 [sky2]
[ 844.464050] [<c022594e>] ? __wake_up+0x41/0x5c
[ 844.469437] [<c0243594>] ? insert_work+0xa5/0xbf
[ 844.475055] [<c04ee2a5>] ? _spin_unlock_irqrestore+0x31/0x44
[ 844.481817] [<c0243e4b>] ? __queue_work+0x36/0x4d
[ 844.487516] [<c047731c>] ? __qdisc_run+0x73/0x1ca
[ 844.493201] [<c0463ce6>] net_rx_action+0x9e/0x1a2
[ 844.498883] [<c0237b6e>] __do_softirq+0xb2/0x188
[ 844.504446] [<c0237c83>] do_softirq+0x3f/0x5c
[ 844.509720] [<c0237e0d>] irq_exit+0x37/0x80
[ 844.514791] [<c0213cfd>] smp_apic_timer_interrupt+0x7c/0x9b
[ 844.521488] [<c02037dd>] apic_timer_interrupt+0x31/0x38
[ 844.527811] Rebooting in 30 seconds..

This is with the newest version of sky2 as of today. Is this any
indication that traffic is needed to reproduce it? E.g. that a certain
number of interrupts must have already been handled to trigger the bug?

Again, any hints would be greatly appreciated (and sorry for being
persistent about this annoying little bug...).

best regards,
Rene
Stephen Hemminger
2009-08-04 23:08:05 UTC
Permalink
On Wed, 05 Aug 2009 00:59:43 +0200
Post by Rene Mayrhofer
Post by Rene Mayrhofer
Nonetheless, the current unmodified version from netdev git solves the
oops in sky2.
Actually, it doesn't. I managed to run networking restart twice without
an oops (with the netdev git version of sky2.c), but after generating
[~]# /etc/init.d/networking restart
error interrupt status=0xffffffff
[ 844.007309] sky2 0000:01:00.0: PCI hardware error (0xffff)
[ 844.013657] sky2 0000:01:00.0: PCI Express error (0xffffffff)
There is something about the hardware on your system that causes
the Marvell chip to not be present on the bus after the steps taken
in sky2_down. Is there something unique about how it is wired to
the PCI express bus?

The sky2 driver has to handle the rare case of a dual-port board, so
sky2_down only shuts off part of the chip. The driver turns off the PHY
and stops the receiver/transmitter. It could be that the power control bits
on your hardware turn off more than just the PHY. Or perhaps
most systems have a low-power input to keep the chip alive for Wake on
LAN, and that isn't present on your system.

Maybe an option to not power down phy would be the simplest fix.
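To make the suggestion concrete, here is a rough userspace model of what such an option could look like. Everything in it is hypothetical: the struct layout, the `keep_phy_powered` knob and the function name are illustration only, and no such option is claimed to exist in the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-port state; field names are illustrative only. */
struct port_state {
    bool rx_tx_running;
    bool phy_powered;
};

/* Hypothetical knob standing in for a driver option. */
struct down_options {
    bool keep_phy_powered;
};

/* Stop one port: always halt receiver and transmitter, and power the PHY
 * down only when the option does not forbid it. */
void port_down(struct port_state *port, const struct down_options *opt)
{
    assert(port != NULL && opt != NULL);
    port->rx_tx_running = false;
    if (!opt->keep_phy_powered)
        port->phy_powered = false;
}
```

With `keep_phy_powered` set, the port stops passing traffic but the PHY (and whatever else shares its power control on this board) stays up.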
Mike McCormack
2009-08-04 23:53:55 UTC
Permalink
Post by Rene Mayrhofer
Again, any hints would be greatly appreciated (and sorry for being
persistent about this annoying little bug...).
Hi Rene,

Thanks for being persistent in testing :-) Looks like you've got a
fairly unusual piece of hardware, as Stephen indicated.

Would you mind adding the phy_lock fix on top of the latest net-2.6
git version of sky2 and testing that?

thanks,

Mike
Rene Mayrhofer
2009-08-05 12:14:24 UTC
Permalink
Post by Mike McCormack
Post by Rene Mayrhofer
Again, any hints would be greatly appreciated (and sorry for being
persistent about this annoying little bug...).
Thanks for being persistent in testing :-) Looks like you've got a
fairly unusual piece of hardware, as Stephen indicated.
Indeed, although I didn't think it _that_ unusual. It's just a 19" rack
appliance with 2 expansion slots for 4x LAN ports each. And those are
based around sky2.

But we have had problems before with kernel 2.4.34/.36 as well with that
hardware. They just weren't as easily reproducible but manifested
themselves in occasional malfunctions of the network devices that could
be solved by an ifdown/ifup cycle.
We still have one spare box and will try that one in case the hardware
is really flaky (which would be strange, given how reproducible it is
right now).
Post by Mike McCormack
Would you mind adding the phy_lock fix on top of the latest net-2.6
git version of sky2 and testing that?
Tried it, doesn't fix the issue.

What would be the simplest change to stop disabling phy when the last
device goes down?

best regards,
Rene
Mike McCormack
2009-08-05 22:50:38 UTC
Permalink
Post by Rene Mayrhofer
What would be the simplest change to stop disabling phy when the last
device goes down?
Commenting out the following line should stop all the phys from powering off:

sky2_phy_power_down(hw, port);

If you have a chance, please also test "sky2: Add a mutex around ethtools operations".
It probably won't fix the problem you're seeing, but you never know...

thanks,

Mike

Rene Mayrhofer
2009-08-10 10:28:58 UTC
Permalink
Post by Mike McCormack
Post by Rene Mayrhofer
What would be the simplest change to stop disabling phy when the last
device goes down?
sky2_phy_power_down(hw, port);
If you have a chance, please test "sky2: Add a mutex around ethtools operations" also.
it probably won't fix the problem you're seeing, but you never know...
It seems that the hardware is faulty, although in a very "interesting" way.
We tried changing the "slot" modules with 4 NICs each, which did not
change matters. However, another similar hardware appliance works. I am
thus not sure which component is at fault here, as (parts of) the NICs
were changed. Maybe the interrupt controller is weird on the "faulty"
box? ACPI issues? If anybody wants to track this any further, I am still
willing to test patches.

best regards,
Rene

Rene Mayrhofer
2009-08-11 08:54:53 UTC
Permalink
Rene Mayrhofer
2009-08-19 07:01:57 UTC
Permalink
Hi everybody,
Thus, there really seems to be an uncaught case in sky2.c. When
sky2_phy_power_down is not called, the chip should not go down, right? But
sky2_poll still seems to be called (maybe by an interrupt belonging to
another network interface on the same chip)?
Is there anything else I could try? We still have this issue, making one range
of hardware appliances unusable with 2.6 kernels...

best regards,
Rene
Mike McCormack
2009-08-19 15:00:31 UTC
Permalink
Post by Rene Mayrhofer
Hi everybody,
Thus, there really seems to be an uncaught case in sky2.c. When
sky2_phy_power_down is not called, chip should not go down, right? But
still sky2_poll seems to be called (maybe by an interrupt belonging to
another network interface but the same chip)?
Is there anything else I could try? We still have this issue, making one range
of hardware appliances unusable with 2.6 kernels...
Hi Rene,

There's a couple of things to try:

* try the latest sky2 code from net-next-2.6
* try adding an msleep(1) after sky2_rx_stop() in sky2_down()
* try adding a check for rx_ring and tx_ring being NULL in
sky2_status_intr(), and disable napi while freeing the buffers in
sky2_down()

I've got an untested, ad-hoc patch against net-next-2.6 for the last
two bits ...
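The last item in the list above can be sketched roughly as follows. This is a userspace model under stated assumptions, not the patch mentioned in the message: the struct, the `poll_enabled` flag and both function names are hypothetical stand-ins for the driver's ring pointers and NAPI state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-port driver state; not the real layout. */
struct port_rings {
    void *rx_ring;      /* NULL once the receive ring is gone */
    void *tx_ring;      /* NULL once the transmit ring is gone */
    bool  poll_enabled; /* cleared before teardown starts */
};

/* A status handler should only touch the rings while both still exist
 * and polling has not been switched off. */
bool rings_usable(const struct port_rings *p)
{
    return p != NULL && p->poll_enabled &&
           p->rx_ring != NULL && p->tx_ring != NULL;
}

/* Teardown order: stop polling first, then drop the ring pointers. */
void port_teardown(struct port_rings *p)
{
    assert(p != NULL);
    p->poll_enabled = false;
    p->rx_ring = NULL;  /* a real driver would free the memory here */
    p->tx_ring = NULL;
}
```

The point of the ordering is that a late status event sees `rings_usable()` return false instead of dereferencing a pointer that has already been released.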

thanks,

Mike
Rene Mayrhofer
2009-08-19 15:11:23 UTC
Permalink
Hi Mike,
Post by Mike McCormack
* try the latest sky2 code from net-next-2.6
* try adding an msleep(1) after sky2_rx_stop() in sky2_down()
* try adding a check for rx_ring and tx_ring being NULL in
sky2_status_intr(), and disable napi while freeing the buffers in
sky2_down()
I've got an untested, ad-hoc patch against net-next-2.6 for the last
two bits ...
I will try all these (hopefully in the next 24h), starting with the patch you
sent me in PM, then the patch attached to this email, and finally pulling
sky2.c from net-next-2.6 and applying this patch.

best regards,
Rene
--
-------------------------------------------------
Gibraltar firewall http://www.gibraltar.at/
Rene Mayrhofer
2009-08-19 21:07:21 UTC
Permalink
Hi Mike,
Post by Mike McCormack
* try the latest sky2 code from net-next-2.6
* try adding an msleep(1) after sky2_rx_stop() in sky2_down()
* try adding a check for rx_ring and tx_ring being NULL in
sky2_status_intr(), and disable napi while freeing the buffers in
sky2_down()
I've got an untested, ad-hoc patch against net-next-2.6 for the last
two bits ...
Pulling the latest sky2.c and sky2.h from net-next-2.6 and applying the patch
rids me of the oops - it is unreproducible right now. However, a networking
restart (i.e. all interfaces attached to sky2) leaves the devices in a state
where they no longer receive any network packets (at least nothing visible in
tcpdump). In this state, rmmod sky2 / modprobe sky2 gives:

[ 718.502717] sky2 0000:01:00.0: unsupported chip type 0xff
[ 718.510517] sky2: probe of 0000:01:00.0 failed with error -95
[ 718.517900] sky2 0000:02:00.0: unsupported chip type 0xff
[ 718.524408] sky2: probe of 0000:02:00.0 failed with error -95
[ 718.531617] sky2 0000:03:00.0: unsupported chip type 0xff
[ 718.538104] sky2: probe of 0000:03:00.0 failed with error -95
[ 718.545344] sky2 0000:04:00.0: unsupported chip type 0xff
[ 718.551818] sky2: probe of 0000:04:00.0 failed with error -95
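For reference when reading these lines: on Linux, errno 95 is EOPNOTSUPP ("operation not supported"), and a chip type of 0xff is an 8-bit register reading back as all ones, which would be consistent with the device no longer answering on the bus. A small sketch of both checks, assuming a Linux userspace where EOPNOTSUPP is 95 (the helper names are made up):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* True when a probe return code is the negated "not supported" errno. */
int is_not_supported(int probe_ret)
{
    return probe_ret == -EOPNOTSUPP;
}

/* True for an 8-bit register value that reads back as all ones. */
int chip_id_is_all_ones(uint8_t id)
{
    return id == 0xff;
}

/* Human-readable text for a negative kernel-style return code. */
const char *describe_probe_error(int probe_ret)
{
    assert(probe_ret < 0);
    return strerror(-probe_ret);
}
```

So the probe failure here is the driver declining an ID it does not recognise, not a separate fault; the all-ones ID is the interesting part.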

I will now try the net-next-2.6 version without your patch again but was under
the impression that it still oopsed when I tried it initially. Better to
double-check before I give erroneous debugging results, though...

best regards,
Rene
--
-------------------------------------------------
Gibraltar firewall http://www.gibraltar.at/
Mike McCormack
2009-08-19 22:05:03 UTC
Permalink
Post by Rene Mayrhofer
Pulling the latest sky2.c and sky2.h from net-next-2.6 and applying the patch
rids me of the oops - it is unreproducible right now. However, a networking
restart (i.e. all interfaces attached to sky2) leaves the devices in a state
where they no longer receive any network packets (at least nothing visible in
After you've got it into that state, does "rmmod sky2; modprobe sky2"
change anything?

thanks,

Mike
Stephen Hemminger
2009-08-20 00:46:29 UTC
Permalink
On Thu, 20 Aug 2009 07:05:03 +0900
Post by Mike McCormack
Post by Rene Mayrhofer
Pulling the latest sky2.c and sky2.h from net-next-2.6 and applying the patch
rids me of the oops - it is unreproducible right now. However, a networking
restart (i.e. all interfaces attached to sky2) leaves the devices in a state
where they no longer receive any network packets (at least nothing visible in
After you've got it into that state, does "rmmod sky2; modprobe sky2"
change anything?
thanks,
Mike
Please send (I forget) the hardware info (lspci) and the register values
from ethtool -d ethX.

Some part of the power control doesn't work on Rene's system, so the
device falls off the bus. Probably no auxiliary +5 is supplied; the register
values will tell whether the driver is at fault (turning on aux when it is
not available) or the hardware is lying about vaux.
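
The "falls off the bus" symptom is visible in the logs quoted in this thread: a read from a PCI device that no longer responds typically returns all ones, which would explain a status of 0xffffffff and a chip type of 0xff. The following is only a minimal sketch in Python for spotting such lines in a saved kernel log; the helper name find_all_ones and the sample list are hypothetical and not part of sky2 or any existing tool.

```python
import re

# Hex literals consisting only of 'ff' bytes, e.g. 0xff, 0xffff, 0xffffffff.
ALL_ONES = re.compile(r"\b0x(?:ff)+\b", re.IGNORECASE)
# sky2 log lines that carry a PCI address such as 0000:01:00.0.
PCI_LINE = re.compile(r"sky2 ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):")

def find_all_ones(log_lines):
    """Return (pci_address, value) pairs for sky2 lines reporting all-ones values."""
    hits = []
    for line in log_lines:
        m = PCI_LINE.search(line)
        if not m:
            continue  # not a per-device sky2 message
        for value in ALL_ONES.findall(line):
            hits.append((m.group(1), value.lower()))
    return hits

# Sample lines copied from the logs posted in this thread.
sample = [
    "[ 1535.000069] sky2 0000:01:00.0: error interrupt status=0xffffffff",
    "[ 718.502717] sky2 0000:01:00.0: unsupported chip type 0xff",
    "[ 1.637718] sky2 0000:01:00.0: Yukon-2 EC chip revision 2",
]
for addr, value in find_all_ones(sample):
    print(addr, value)
```

An all-ones value alone does not say whether the driver or the hardware is at fault; it only marks the point in the log where the device stopped answering.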
Rene Mayrhofer
2009-08-20 20:37:43 UTC
Hi Stephen,
Post by Stephen Hemminger
On Thu, 20 Aug 2009 07:05:03 +0900
Post by Mike McCormack
Post by Rene Mayrhofer
Pulling the latest sky2.c and sky2.h from net-next-2.6 and applying the
patch rids me of the oops - it is unreproducible right now. However, a
networking restart (i.e. all interfaces attached to sky2) leaves the
devices in a state where they no longer receive any network packets (at
least nothing visible in tcpdump). In this state, rmmod sky2 / modprobe
After you've got it into that state, does "rmmod sky2; modprobe sky2"
change anything?
Nope, tried it multiple times.


I've also tried the net-next-2.6 version of sky2.[ch] as of yesterday without
Mike's "bandaid" patches. With that version (the last one in branch
gibraltar-3.0 at https://www.gibraltar.at/git/linux-2.6-gibraltar.git), I
managed to successfully do a networking restart (with "light" traffic on one
interface), leaving the interfaces functional after the restart. This worked
even twice in a row, so maybe we are onto something here. This is certainly an
improvement over the version with Mike's last patch (from yesterday) applied,
which left the interfaces broken after a restart (and with the quoted errors
on a rmmod/modprobe sky2).

However, after doing a ping -f on one of the (GBit) interfaces to another host
on the same switch for a few seconds and then executing networking restart,
the result was an immediate reboot of the box without any oops being printed
to the (serial) console beforehand.

The same thing happened without particularly heavy traffic after executing
networking restart twice and then just letting the box sit idle (with some
traffic on the interfaces) for a few minutes.
Post by Stephen Hemminger
Please send (I forget) the hardware info (lspci) and the register values
from ethtool -d ethX.
Attached (lspci -v).
Post by Stephen Hemminger
Some part of the power control doesn't work on Rene's system, so the
device falls off the bus. Probably no auxiliary +5 is supplied; the register
values will tell whether the driver is at fault (turning on aux when it is
not available) or the hardware is lying about vaux.
Is it possible to disable (kernel-level) power control for particular PCI
devices to check if this is indeed the issue? Or do you suspect hardware-level
power-off when the chip state goes down?

best regards,
Rene
--
-------------------------------------------------
Gibraltar firewall http://www.gibraltar.at/
Mike McCormack
2009-08-21 11:03:26 UTC
Post by Rene Mayrhofer
I've also tried the net-next-2.6 version of sky2.[ch] as of yesterday without
Mike's "bandaid" patches. With that version (the last one in branch
gibraltar-3.0 at https://www.gibraltar.at/git/linux-2.6-gibraltar.git), I
managed to successfully do a networking restart (with "light" traffic on one
interface), leaving the interfaces functional after the restart. This worked
even twice in a row, so maybe we are onto something here. This is certainly an
improvement over the version with Mike's last patch (from yesterday) applied,
which left the interfaces broken after a restart (and with the quoted errors
on a rmmod/modprobe sky2).
However, after doing a ping -f on one of the (GBit) interfaces to another host
on the same switch for a few seconds and then executing networking restart,
the result was an immediate reboot of the box without any oops being printed
to the (serial) console beforehand.
How about trying to remove the skge module, then running tests on the
sky2 interfaces only? This way you might be able to isolate the
remaining problems to sky2 or skge...?

thanks,

Mike
Rene Mayrhofer
2009-08-20 19:42:30 UTC
Post by Mike McCormack
Post by Rene Mayrhofer
Pulling the latest sky2.c and sky2.h from net-next-2.6 and applying the
patch rids me of the oops - it is unreproducible right now. However, a
networking restart (i.e. all interfaces attached to sky2) leaves the
devices in a state where they no longer receive any network packets (at
least nothing visible in tcpdump). In this state, rmmod sky2 / modprobe
After you've got it into that state, does "rmmod sky2; modprobe sky2"
change anything?
Nope, tried that multiple times...

Rene
--
-------------------------------------------------
Gibraltar firewall http://www.gibraltar.at/
Rene Mayrhofer
2009-08-19 21:25:08 UTC
Post by Rene Mayrhofer
In this state, rmmod sky2 / modprobe
[ 718.502717] sky2 0000:01:00.0: unsupported chip type 0xff
[ 718.510517] sky2: probe of 0000:01:00.0 failed with error -95
[ 718.517900] sky2 0000:02:00.0: unsupported chip type 0xff
[ 718.524408] sky2: probe of 0000:02:00.0 failed with error -95
[ 718.531617] sky2 0000:03:00.0: unsupported chip type 0xff
[ 718.538104] sky2: probe of 0000:03:00.0 failed with error -95
[ 718.545344] sky2 0000:04:00.0: unsupported chip type 0xff
[ 718.551818] sky2: probe of 0000:04:00.0 failed with error -95
Just for the sake of completeness, I get this during bootup (i.e. before
networking restart):

[ 1.637607] sky2 driver version 1.24
[ 1.637665] sky2 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[ 1.637686] sky2 0000:01:00.0: setting latency timer to 64
[ 1.637718] sky2 0000:01:00.0: Yukon-2 EC chip revision 2
[ 1.637832] sky2 0000:01:00.0: irq 28 for MSI/MSI-X
[ 1.638395] sky2 eth0: addr 00:90:0b:09:55:42
[ 1.638418] sky2 0000:02:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
[ 1.638431] sky2 0000:02:00.0: setting latency timer to 64
[ 1.638458] sky2 0000:02:00.0: Yukon-2 EC chip revision 2
[ 1.638567] sky2 0000:02:00.0: irq 29 for MSI/MSI-X
[ 1.639117] sky2 eth1: addr 00:90:0b:09:55:43
[ 1.639140] sky2 0000:03:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
[ 1.639152] sky2 0000:03:00.0: setting latency timer to 64
[ 1.639180] sky2 0000:03:00.0: Yukon-2 EC chip revision 2
[ 1.639289] sky2 0000:03:00.0: irq 30 for MSI/MSI-X
[ 1.639869] sky2 eth2: addr 00:90:0b:09:55:44
[ 1.639894] sky2 0000:04:00.0: PCI INT A -> GSI 19 (level, low) -> IRQ 19
[ 1.639906] sky2 0000:04:00.0: setting latency timer to 64
[ 1.639934] sky2 0000:04:00.0: Yukon-2 EC chip revision 2
[ 1.680096] sky2 0000:04:00.0: irq 31 for MSI/MSI-X
[ 1.680786] sky2 eth3: addr 00:90:0b:09:55:45
[ 106.751517] sky2 wan: enabling interface
[ 107.404801] sky2 gibsrv: enabling interface
[ 107.631583] sky2 dmz: enabling interface
[ 107.869878] sky2 lan: enabling interface
[ 109.225106] sky2 wan: Link is up at 1000 Mbps, full duplex, flow control both
[ 109.232519] sky2 gibsrv: Link is up at 100 Mbps, full duplex, flow control both
[ 109.983861] sky2 dmz: Link is up at 1000 Mbps, full duplex, flow control rx
[ 110.453276] sky2 lan: Link is up at 1000 Mbps, full duplex, flow control rx

Then, at some point the interfaces get disabled (most probably by the
networking restart):

[ 224.652146] sky2 lan: disabling interface
[ 224.684483] sky2 0000:04:00.0: PCI INT A disabled
[ 224.700164] sky2 dmz: disabling interface
[ 224.736502] sky2 0000:03:00.0: PCI INT A disabled
[ 224.760219] sky2 gibsrv: disabling interface
[ 224.812491] sky2 0000:02:00.0: PCI INT A disabled
[ 224.832221] sky2 wan: disabling interface
[ 224.904479] sky2 0000:01:00.0: PCI INT A disabled

best regards,
Rene
--
-------------------------------------------------
Gibraltar firewall http://www.gibraltar.at/