
20 and 30 Series VGA card repair guide

1.

20 and 30 Series VGA card
repair guide

2.

1. Requirements of the testing platform
2. Make sure there is no short circuit before power-on
Power-on voltage and timing confirmation
Analysis when the voltage is normal but there is no display
Analysis when the voltage is abnormal
3. Analysis of Common Bad Codes

3.

1.Requirements of the testing platform

4.

1.Requirements of the testing platform
1. 275/FCT/NVLINK and other functional tests: either a GEN3 or a GEN4 platform can be used.
2. GEN4 test: only PCIe platforms that support GEN4 can be used, and the memory must be dual-channel.
3. NVUberstress requires 32 GB of memory or more.

5.

1.Requirements of the testing platform
AMD GEN4 Platform Setup:
Before running the diag, set IOMMU to Disabled, SMT Control to Disabled,
and SVM Mode to Enabled.

6.

1.Requirements of the testing platform
AMD GEN4 Platform Setup:
AMD X570 platform: for NV models, change the GPULIST ID in test.cfg
to "2d:00.0" when running the diag.

7.

1.Requirements of the testing platform
AMD GEN4 Platform Setup:
For the Gen4 test, PCIe can be set to either Auto or Gen4.

8.

1.Requirements of the testing platform
AMD GEN4 Platform Setup:
FCT and other diags default to the GEN3 platform test items, so either set
PCIe to Gen3 on the platform, or set the platform to Gen4 and change the
pcie speed in the diag's xxx_fct_659.spc file from 8000 to 16000.

9.

2. Make sure there is no short circuit before power-on
Power-on voltage and timing confirmation
Analysis when the voltage is normal but there is no display
Analysis when the voltage is abnormal

10.

Make sure no short circuit before power on
- Check that the board is free of wrong parts, missing parts, impact damage, and other
process problems.
- Verify that no voltage rail measures as a short circuit:
12V/3V3/5V/1V8/NVVDD/MSVDD/FBVDD/PEX_VDD
- Check that the 12V fuse is not blown.
If NVVDD/MSVDD/FBVDD is shorted:
- If the PWM side is shorted, repair the PWM side first.
- If there is no abnormality on the PWM side, repair the GPU or memory.
If any of the above checks finds a problem, repair it first.

11.

Power-on confirmation voltage and timing
12V
5V
3V3
1V8
NVVDD: 0.75V
MSVDD: 0.75V
PEX_VDD: 0.95V
FBVDDQ: 1.36V
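
The nominal values above can be turned into a quick bench checklist. The sketch below is only an illustration: the function name `check_rails` and the 5% tolerance are assumptions, not values from this guide, so substitute the tolerance from the board's own spec.

```python
# Nominal rail voltages (in volts) as listed on the slide above.
NOMINAL = {
    "12V": 12.0, "5V": 5.0, "3V3": 3.3, "1V8": 1.8,
    "NVVDD": 0.75, "MSVDD": 0.75, "PEX_VDD": 0.95, "FBVDDQ": 1.36,
}

def check_rails(measured, tolerance=0.05):
    """Return {rail: deviation_in_volts} for rails outside the tolerance band.

    `tolerance` is a fraction of nominal; 5% is an assumed placeholder.
    Rails missing from `measured` are reported with a deviation of None.
    """
    bad = {}
    for rail, nominal in NOMINAL.items():
        value = measured.get(rail)
        if value is None:
            bad[rail] = None
        elif abs(value - nominal) > tolerance * nominal:
            bad[rail] = round(value - nominal, 3)
    return bad

# Example: NVVDD is missing entirely, FBVDDQ is slightly low but in band.
readings = dict(NOMINAL, NVVDD=0.0, FBVDDQ=1.35)
print(check_rails(readings))
```

Any rail reported here is a starting point for the voltage analysis later in the guide, not a diagnosis by itself.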

12.

Power-on confirmation voltage and timing
Power-up: 12V→5V→3V3_SEQ→1V8_AON→DDC_5V
1V8 > MSVDD > PEXVDD > NVVDD > FBVDDQ
(1V8 > NVVDD > PEXVDD > FBVDDQ)
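
If the rise time of each rail is read off an oscilloscope capture, the power-up order above can be checked mechanically. This is a minimal sketch; the function name and the millisecond timestamps are hypothetical and only the rail order comes from the slide.

```python
# Power-up order as given on the slide above.
POWER_UP_ORDER = ["12V", "5V", "3V3_SEQ", "1V8_AON", "DDC_5V"]

def first_out_of_order(rise_times_ms, order=POWER_UP_ORDER):
    """Return the first rail that breaks the expected order, or None.

    `rise_times_ms` maps rail name -> time (ms) at which the rail came up.
    A rail with no entry is treated as never having come up and is
    reported immediately, since every later rail depends on it.
    """
    previous = None
    for rail in order:
        t = rise_times_ms.get(rail)
        if t is None:
            return rail
        if previous is not None and t < previous:
            return rail
        previous = t
    return None

capture = {"12V": 0.0, "5V": 1.2, "3V3_SEQ": 2.0, "1V8_AON": 1.5, "DDC_5V": 4.0}
print(first_out_of_order(capture))  # 1V8_AON rose before 3V3_SEQ
```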

13.

Power-on confirmation voltage and timing
• Power-down:

14.

The analysis voltage is normal but no display
Check the gold fingers.
Measure whether the impedance of the 16 pairs of capacitors is normal; repair any that
are abnormal.
Use an oscilloscope to measure whether the phase signal, RST signal, and power
sequence of each group are normal.
Use another card for display, or set the onboard graphics as the boot display.
Check the BIOS. If it reads correctly, run a memory scan: if the memory has an
error bit, replace the corresponding memory; if the scan passes, replace the
GPU.
If the GPU/BIOS cannot be detected and the BIOS information cannot be read:
Check whether ROM IC U614 and its peripheral circuit components are normal;
replace anything abnormal first.
Once the ROM is confirmed OK, replace the GPU.

15.

The analysis voltage is normal but no display
Scan commands, run from the mods0 folder.
Test-machine scan:
./mods mods.js -skip_rm_state_init
./mats -n 1 -e 50
Legacy system scan:
./mats -e 50
UEFI system scan: run the edited batch file

16.

The analysis voltage is normal but no display
./mats.sh
The nvmt scan uses the following command:
./nvmt ts >> xxx.log
Note: the nvmt scan command is executed before running the ./nvflash -v command to
check the BIOS.
If the test-machine scan fails, use the message to find the bad memory. Take
V389 as an example:

17.

The analysis voltage is normal but no display
If the test-machine scan fails, use the prompt message to find the bad
memory. Taking V389 as an example:
NV_PFB_FBPA_0_TRAINING_STATUS=0x00000000 ------ channel A pass
NV_PFB_FBPA_1_TRAINING_STATUS=0x00000000 ------ channel B pass
NV_PFB_FBPA_2_TRAINING_STATUS=0x00000000 ------ channel C pass
NV_PFB_FBPA_3_TRAINING_STATUS=0xbadf2013 ------ channel D not on
NV_PFB_FBPA_4_TRAINING_STATUS=0x00000000 ------ channel E pass
A failing status value is read bit by bit:
0x00000002 = 00000010 (hex to binary): bit[1:0] fail, SUBP0 fail, F0 fail
0x00000008 = 00001000 (hex to binary): bit[3:2] fail, SUBP1 fail, F1 fail
0x0000000A = 00001010 (hex to binary): bit[3:0] fail, SUBP0&1 fail, F0&F1 fail
Note: 0 stands for PASS; 1 stands for FAIL.
Bits 0 and 1: low memory chip (F0). Bits 2 and 3: high memory chip (F1).
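
The bit reading described above can be sketched as a small decoder. The function names are hypothetical, and treating every 0xbadfXXXX value as "channel not on" is an assumption generalised from the single 0xbadf2013 example in this guide.

```python
def channel_letter(fbpa_index):
    """FBPA_0 -> 'A', FBPA_1 -> 'B', and so on, as in the log above."""
    return chr(ord("A") + fbpa_index)

def decode_training_status(status):
    """Interpret one NV_PFB_FBPA_n_TRAINING_STATUS value."""
    if status == 0:
        return "pass"
    # Assumption: the 0xbadf prefix marks a channel that is not on.
    if status >> 16 == 0xBADF:
        return "not on"
    failed = []
    if status & 0b0011:   # bits 0 and 1: SUBP0, low memory chip F0
        failed.append("F0")
    if status & 0b1100:   # bits 2 and 3: SUBP1, high memory chip F1
        failed.append("F1")
    return "fail " + "&".join(failed) if failed else "unknown"

log = [0x00000000, 0x00000000, 0x00000000, 0xBADF2013, 0x0000000A]
for index, value in enumerate(log):
    print(channel_letter(index), decode_training_status(value))
```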

18.

The analysis voltage is normal but no display
If the test-machine scan fails, the scan command can also be run with the
channel turned off:
./mods mods.js -skip_rm_state_init -floorsweep fbio_disable:0xXX:fbp_disable:0xXX

19.

The analysis voltage is normal but no display
- nvmt scan method
Copy the nvmt tool to the mods0 folder under diag.
Run the command: ./nvmt ts >> x.log
View the x.log file. A byte whose BRLSHFT value differs from the others
may be a bad point and is judged as bad memory.
C0, 0 means the byte with bits C00~C07
C0, 1 means the byte with bits C08~C15
C0, 2 means the byte with bits C16~C23
C0, 3 means the byte with bits C24~C31
C1, 0 means the byte with bits C32~C39
C1, 1 means the byte with bits C40~C47
C1, 2 means the byte with bits C48~C55
C1, 3 means the byte with bits C56~C63
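
The label-to-bit table above follows a regular pattern, which can be sketched as below. The function name is hypothetical, and extending the pattern to channel letters other than C is an assumption.

```python
def nvmt_bit_range(group, byte_index):
    """Map an nvmt label such as ('C1', 2) to its bit range 'C48~C55'.

    `group` is the channel letter followed by 0 or 1; `byte_index` is 0~3.
    """
    channel, half = group[0], int(group[1:])
    if half not in (0, 1) or byte_index not in range(4):
        raise ValueError(f"unexpected label: {group}, {byte_index}")
    low = (half * 4 + byte_index) * 8
    return f"{channel}{low:02d}~{channel}{low + 7:02d}"

print(nvmt_bit_range("C1", 2))
```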

20.

Analysis of voltage abnormalities
12V/5V/3V3/1V8 no voltage
Follow the timing sequence:
12V→5V→3V3_SEQ→1V8_AON→NV3V3→DDC_5V
First check whether the enable voltage is normal. If it is, check whether the IC
generating the voltage is OK, then check the circuit components related to that
voltage supply.
Example: V389-1.0 1V8 no voltage analysis
(PS: Make sure to measure the impedance without short circuit before power on)
First measure whether the PS_1V8_EN voltage is OK. If it is abnormal, check U212 and
related components.

21.

Analysis of voltage abnormalities
Example: V389-1.0 1V8 no voltage analysis
(PS: Make sure to measure the impedance without short circuit before power on)
If the enable voltage is normal, first disconnect L1, then measure whether the
front-end voltage is abnormal or the back-end line is pulling 1V8 down.
If the front end is abnormal, check U3 and related components. If the back end is
pulling 1V8 down, check the 1V8 power supply line.

22.

Analysis of voltage abnormalities
Example: V389-1.0 NVVDD without voltage analysis
(PS: Make sure to measure the impedance without short circuit before power on)
Measure the voltage of NVVDD_EN. If there is no voltage, check U16 and
surrounding parts.

23.

Analysis of voltage abnormalities
Example: V389-1.0 NVVDD without voltage analysis
(PS: Make sure to measure the impedance without short circuit before power on)
If the NVVDD_EN voltage is normal, check the lines and components related to PWM IC U812:
I2C related: R348,R363
Voltage related: R1031,R996,R1026,R1022,R1011,R1032,R550

24.

Analysis of voltage abnormalities
Example: V389-1.0 NVVDD without voltage analysis
(PS: Make sure to measure the impedance without short circuit before power on)
Use an oscilloscope to capture the waveform of each phase during boot-up.
If no phase has a waveform, replace the PWM IC U812 first.
If one or more groups are abnormal, first check whether the MOSFETs and
peripheral components of the group without a waveform are OK.
If all of the above are confirmed OK, replace the GPU.

25.

3. Analysis of Common Bad Codes

26.

Analysis of Common Bad Codes
582: GPU stress test found pixel miscompares. Cause: GPU/memory overclocking defect.
083: CRC/checksum miscompare. Cause: GPU/memory.
194: Bad memory. Cause: memory.
134: EDC detected a memory-bus error. Cause: memory overclocking defect.
Example: 242194: check the log file, find the error bit, and replace the
corresponding memory.
If the failing bit is B048, replace the memory chip corresponding to B1.
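
The B048 example can be worked through in code. This is a sketch under two assumptions inferred from the examples in this guide rather than stated by it: that the last three digits of a full code such as 242194 are the error code, and that bits 0~31 of a channel belong to chip 0 and bits 32~63 to chip 1. The function names are hypothetical.

```python
import re

def split_code(code):
    """Split e.g. 242194 into (242, 194): leading digits, 3-digit error code."""
    return divmod(int(code), 1000)

def chip_for_failing_bit(label):
    """Map a failing-bit label such as 'B048' to a memory chip such as 'B1'."""
    match = re.fullmatch(r"([A-Z])(\d+)", label.strip().upper())
    if not match:
        raise ValueError(f"unrecognised failing-bit label: {label!r}")
    channel, bit = match.group(1), int(match.group(2))
    if bit > 63:
        raise ValueError("bit index must be in the range 0~63")
    # Assumed split: bits 0~31 -> chip 0, bits 32~63 -> chip 1.
    return f"{channel}{bit // 32}"

print(split_code(242194), chip_for_failing_bit("B048"))
```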

28.

Analysis of Common Bad Codes
134 failure: check the log file and replace the memory that corresponds to the
bad channel and byte. Taking V389 as an example: 242134
Channel A: if Byte0 and Byte3 are bad, the error bits are 0~7 and 24~31, so replace
the corresponding memory A0.

29.

Analysis of Common Bad Codes
134 failure: check the log file, find the bad memory chip, and replace the
corresponding memory.
Conversion between bad bytes and bits:
Byte 0 : bit 0~7
Byte 1 : bit 8~15
Byte 2 : bit 16~23
Byte 3 : bit 24~31
Byte 4 : bit 32~39
Byte 5 : bit 40~47
Byte 6 : bit 48~55
Byte 7 : bit 56~63
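
The byte-to-bit table above is a multiply-by-eight rule. The sketch below reproduces it and adds a byte-to-chip helper; the helper assumes bytes 0~3 belong to chip 0 and bytes 4~7 to chip 1, which is inferred from the A0 and B1 examples in this guide and is not stated outright. Both function names are hypothetical.

```python
def bits_for_byte(byte_index):
    """Return (first_bit, last_bit) for Byte 0~7, e.g. Byte 3 -> (24, 31)."""
    if not 0 <= byte_index <= 7:
        raise ValueError("byte index must be in the range 0~7")
    low = byte_index * 8
    return low, low + 7

def chips_for_bad_bytes(channel, bad_bytes):
    """List the chips to replace for the bad bytes of one channel."""
    chips = set()
    for byte_index in bad_bytes:
        bits_for_byte(byte_index)                   # validates the index
        chips.add(f"{channel}{byte_index // 4}")    # assumed 4 bytes per chip
    return sorted(chips)

print(chips_for_bad_bytes("A", [0, 3]))
```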

30.

Analysis of Common Bad Codes
582/083 analysis steps:
Confirm whether the board passes the test with the down-clocked BIOS.
Run the single-item scan with the memory overclocked and check for a fail bit; if
there is an error bit, replace the corresponding memory chip.
If there is no error bit, test with channels turned off.
Note: More than one memory chip may be bad; you can enable a single channel
to verify.

31.

Analysis of Common Bad Codes
- Overclocked scan
Edit the test.cfg file (nano test.cfg) to comment out the FTB and 275 items so
they are not tested: add # at the start of the corresponding test item.

32.

Analysis of Common Bad Codes
Overclocked scan
Go into the mods0 folder and edit the pg132sku30_fta_659.spc file: raise the
memory frequency by 5 steps, add the single item to be tested, and enable
_run_on_error.
Note: When testing a single item, the -skip command must not appear in the
same command.

34.

Close channels or open one channel
Set the onboard graphics as the display output.
Confirm the actual memory channels of the board (check which memory is actually
populated on the board, or see the results of the mats scan).
Turn off the channel that you want to turn off together with any channel that is
not populated.
Add the command to the spc file in the mods0 folder and enable _run_on_error;
it must be added to both the fta and ftb spc files.
Edit the test.cfg file to comment out the 275 items so they are not tested.
Test results with codes 196668 and 14773, which are caused by the memory channel
changes, are considered PASS.

35.

Close channels or open one channel
Change spc file.

36.

Close channels or open one channel
Corresponding values per channel.
Channel A 0x01
Channel B 0x02
Channel C 0x04
Channel D 0x08
Channel E 0x10
Channel F 0x20

37.

Close channels or open one channel
Note: The above values are hexadecimal.
Example 1: V389-1.2-07S
Actual memory: check which memory is populated on the board, or look at the
mats scan results. The mats scan results are as follows:

38.

Close channels or open one channel
From the mats scan results for V389-1.2-07S, we can see that the board
does not have channel D populated.

39.

Close channels or open one channel
Example 1: V389-1.2-07S
Turn off channel A: channel A must be turned off together with channel D,
so: 0x01 + 0x08 = 0x09
The command is: -floorsweep fbio_disable:0x09:fbp_disable:0x09
Turn off channels A, B, and C at the same time: channels A, B, C, and D must be
turned off together: 0x01 + 0x02 + 0x04 + 0x08 = 0x0F
The command is: -floorsweep fbio_disable:0x0F:fbp_disable:0x0F

40.

Close channels or open one channel
Example 2: V388-1.1-11S, memory actually populated.
From the above we can see that channels A, B, C, D, E, and F are all populated.

41.

Close channels or open one channel
• Example 2: V388-1.2-11S
• Turn off channel A (0x01). The command is
-floorsweep fbio_disable:0x01:fbp_disable:0x01
• Turn off channels D, E, and F at the same time:
0x08 + 0x10 + 0x20 = 0x38. The command is
-floorsweep fbio_disable:0x38:fbp_disable:0x38

42.

Analysis of Common Bad Codes
- 539: NVRM generic falcon error
- 818: MODS detected an assertion failure
The test machine or diag cannot run the test.
First make sure the diag/BIOS is correct.
Run the mats scan command. If there is an error bit, replace the corresponding
memory; if it passes, scan with the nvmt tool, then verify by turning off channels.

43.

Analysis of Common Bad Codes
275270: Acoustic test failed, noise too high
Excessive noise, mostly due to the fan or fan control.
275281: Temperature above specified limit
The temperature is too high, mostly due to the fan or fan control.
139: Acceptable temperature limits exceeded or the thermal sensor is broken or
miscalibrated
The temperature is too high, mostly due to the fan or fan control.

44.

Analysis of Common Bad Codes
- 275270/275281/139
Confirm that the fan assembly is OK and the fan blades are not blocked by foreign objects.
Verify that the thermal paste is OK; too little or too much is not conducive to heat
dissipation.
Verify that the crossover fan is not defective.
Verify that the fan controller circuit on the board is OK.

45.

Analysis of Common Bad Codes
- 78599: Fan does not seem to cool the chip (the fan cannot reduce the chip
temperature)
The diag's default fan duty error is 10% and needs to be adjusted to 35%.
Command: -testarg 78 EndpointError 35
Note: For fans with a stop function, skip test 78.

46.

Analysis of Common Bad Codes
- 275280: Power above specified limit (power consumption exceeds the set spec)
First, make sure the BIOS/diag is up to date.
Check the log file. If power consumption exceeds spec by 20%, check the
power detection related lines.
If the log file shows power consumption exceeding spec by less than that, confirm
whether the board cooling (thermal paste application, fan operation, etc.) is OK.
Down-clocking
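
The 275280 triage above depends on how far the logged power is over the limit. A minimal sketch of that calculation follows; the function names are hypothetical, and treating "exceeds spec by 20%" as "20% or more" is an assumption.

```python
def power_overage_pct(measured_w, limit_w):
    """Percentage by which the logged power exceeds the spec limit."""
    return (measured_w - limit_w) / limit_w * 100.0

def next_check(measured_w, limit_w, threshold_pct=20.0):
    """Suggest which check from the steps above to start with."""
    over = power_overage_pct(measured_w, limit_w)
    if over <= 0:
        return "within spec"
    if over >= threshold_pct:   # assumed to mean 20% or more
        return "check power detection lines"
    return "check board cooling"

print(next_check(300.0, 240.0))   # 25% over the limit
print(next_check(250.0, 240.0))   # about 4% over the limit
```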

47.

Analysis of Common Bad Codes
- 275273: Clock speed below specified limit (frequency below the spec limit).
Appears when the base or boost frequency is too low.
- 10273: Clock speed below specified limit (frequency below the spec limit)
P0.max GPU voltage is about 750 mV; power consumption detection is abnormal.
- 10854: Unexpected hardware slowdown
P0.max GPU voltage is about 750 mV; power consumption detection is abnormal.

48.

Analysis of Common Bad Codes
- 275273/10273/10854
Run ./mods -s to check the NVVDD voltage. If the voltage is low (normal is 1.xxxV):
a. Check the power consumption detection IC and peripheral components first.
b. Check the circuit components related to the external 12V and PCIe bus power
supply.
c. Check that the components on the NVVDD voltage regulation line are correct.
Check the log file. If the voltage is normal but the test fails because the GPU boost
frequency is too low, first confirm the BIOS is the latest, then check for hardware problems:
a. Capture the signal of each phase group and confirm it is working properly.
b. Measure the choke voltage difference under load to confirm the current
balance.
After all of the above are confirmed OK, convert the board to a down-clocked model
if one exists; if none exists or down-clocking is ineffective, replace the
GPU.

49.

Analysis of Common Bad Codes
Example: Power consumption detection IC and related circuitry

50.

Analysis of Common Bad Codes
Example: 12V input and power consumption detection related lines

51.

Analysis of Common Bad Codes
13855: ADC calibration error
The ADC (GPU voltage) error is large.
Verify with the latest versions of the BIOS and diag.
Run ./mods -s (./mods -s mods.js -adc_cal_check_ignore -no_gold) and compare the
displayed NVVDD value with the actual measured voltage (at the capacitors on the
back of the GPU).

52.

Analysis of Common Bad Codes
13855: NVVDD PWM controller IC is MP2888A.
This PWM IC controls the NVVDD voltage through data burned into it. Check
whether the IC's soldering and the sense resistors are OK, and then replace
the IC.

53.

Analysis of Common Bad Codes
13855: If the NVVDD PWM controller IC is uP9512/NCP81610, check
the resistors that set the NVVDD voltage value:
R1(R746), R2(R770), R3(R763), R4(R760), R5(R740)
If all of the above are confirmed OK, ask HW for further help.

54.

Analysis of Common Bad Codes
If the test crashes or the screen goes black, reboot and view the log file, then
analyze according to the log file.
