NVIDIA L4 GPU Review – Low-Power Inferencing Wizard



March 13, 2026
In the unrelenting wave of innovation in today’s AI landscape, measuring and understanding the capabilities of various hardware platforms is critical. Not all AI applications require massive GPU training farms; there is a vital segment of inferencing AI that often demands less GPU power, particularly at the edge. In this review, we examine several NVIDIA L4 GPUs across three different Dell servers and a range of workloads, including MLPerf, to evaluate how the L4 performs.
 
 
NVIDIA L4 GPU
At its core, the L4 delivers an impressive 30.3 teraFLOPs of FP32 performance, making it ideal for high-precision computational tasks. Its capabilities extend to mixed-precision computations via TF32, FP16, and BFLOAT16 Tensor Cores—critical features for enhancing deep learning efficiency. According to the L4 spec sheet, performance in these mixed-precision modes ranges from 60 to 121 teraFLOPs.
 
The L4 excels in low-precision tasks, boasting 242.5 teraFLOPs with its FP8 and INT8 Tensor Cores, which significantly boost neural network inferencing performance. Equipped with 24GB of GDDR6 memory and a 300GB/s bandwidth, it can easily handle large datasets and complex models. What stands out most about the L4, however, is its energy efficiency: with a 72W TDP, it is well-suited for a wide variety of computing environments. This combination of high performance, memory efficiency, and low power consumption makes the NVIDIA L4 a compelling option for addressing edge computing challenges.
 
 
NVIDIA L4 Specifications
FP32                              30.3 teraFLOPs
TF32 Tensor Core                  60 teraFLOPs
FP16 Tensor Core                  121 teraFLOPs
BFLOAT16 Tensor Core              121 teraFLOPs
FP8 Tensor Core                   242.5 teraFLOPs
INT8 Tensor Core                  242.5 TOPS
GPU Memory                        24GB GDDR6
GPU Memory Bandwidth              300GB/s
Max Thermal Design Power (TDP)    72W
Form Factor                       1-slot low-profile PCIe
Interconnect                      PCIe Gen4 x16

 

 

Of course, with the L4 priced somewhere near $2,500, the A2 coming in at roughly half that, and the aged (yet still quite capable) T4 available used for under $1,000, the obvious question is: what is the difference between these three inferencing GPUs?

NVIDIA L4, A2 and T4 Specifications

                                  NVIDIA L4          NVIDIA A2        NVIDIA T4
FP32                              30.3 teraFLOPs     4.5 teraFLOPs    8.1 teraFLOPs
TF32 Tensor Core                  60 teraFLOPs       9 teraFLOPs      N/A
FP16 Tensor Core                  121 teraFLOPs      18 teraFLOPs     N/A
BFLOAT16 Tensor Core              121 teraFLOPs      18 teraFLOPs     N/A
FP8 Tensor Core                   242.5 teraFLOPs    N/A              N/A
INT8 Tensor Core                  242.5 TOPS         36 TOPS          130 TOPS
GPU Memory                        24GB GDDR6         16GB GDDR6       16GB GDDR6
GPU Memory Bandwidth              300GB/s            200GB/s          320+ GB/s
Max Thermal Design Power (TDP)    72W                40-60W           70W
Form Factor                       1-slot low-profile PCIe (all three)
Interconnect                      PCIe Gen4 x16      PCIe Gen4 x8     PCIe Gen3 x16

 

 

One thing to understand when looking at these three cards is that they’re not exactly generational one-to-one replacements, which explains why the T4 still remains, many years later, a popular choice for some use cases. The A2 came out as a replacement for the T4 as a low-power and more compatible (x8 vs x16 mechanical) option. Technically, the L4 then is a replacement for the T4, with the A2 straddling an in-between that may or may not get refreshed at some point in the future.
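To put those trade-offs in concrete terms, here is a rough back-of-the-envelope comparison of INT8 throughput per watt and per dollar, using the spec-sheet numbers and the approximate street prices mentioned above. Note the assumptions: the A2 price is inferred from "roughly half" the L4's, and the A2's 60W upper TDP bound is used.

```python
# Rough efficiency comparison of the three inferencing GPUs.
# Specs from the table above; prices are street figures, not MSRPs.
cards = {
    #       (INT8 TOPS, max TDP in W, approx. price in USD)
    "L4": (242.5, 72, 2500),   # "somewhere near $2500"
    "A2": (36.0, 60, 1250),    # assumed: roughly half the L4's price
    "T4": (130.0, 70, 1000),   # "under $1000 used"
}

for name, (tops, tdp, price) in cards.items():
    per_watt = tops / tdp
    per_kusd = tops / price * 1000
    print(f"{name}: {per_watt:.2f} TOPS/W, {per_kusd:.1f} TOPS per $1000")
```

On these rough numbers the L4 leads comfortably on efficiency per watt, while a used T4 can still come out ahead on raw TOPS per dollar, which helps explain why it remains popular so many years on.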

MLPerf Inference 3.1 Performance

MLPerf is a consortium of AI leaders from academia, research, and industry established to provide fair and relevant AI hardware and software benchmarks. These benchmarks are designed to measure the performance of machine learning hardware, software, and services on various tasks and scenarios.

Our tests focus on two specific MLPerf benchmarks: Resnet50 and BERT.

  • Resnet50: This is a convolutional neural network used primarily for image classification. It’s a good indicator of how well a system can handle deep-learning tasks related to image processing.
  • BERT (Bidirectional Encoder Representations from Transformers): This benchmark focuses on natural language processing tasks, offering insights into how a system performs in understanding and processing human language.

Both these tests are crucial for evaluating AI hardware’s capabilities in real-world scenarios involving image and language processing.

Evaluating the NVIDIA L4 with these benchmarks is critical in helping to understand the capabilities of the L4 GPU in specific AI tasks. It also offers insights into how different configurations (single, dual, and quad setups) influence performance. This information is vital for professionals and organizations looking to optimize their AI infrastructure.

The models run under two key modes: Server and Offline.

  • Offline Mode: This mode measures a system’s performance when all data is available for processing simultaneously. It’s akin to batch processing, where the system processes a large dataset in a single batch. Offline mode is crucial for scenarios where latency is not a primary concern, but throughput and efficiency are.
  • Server Mode: In contrast, server mode evaluates the system’s performance in a scenario mimicking a real-world server environment, where requests come in one at a time. This mode is latency-sensitive, measuring how quickly the system can respond to each request. It’s essential for real-time applications, such as web servers or interactive applications, where immediate response is necessary.
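The two modes can be illustrated with a toy measurement harness. This is only a sketch: the `infer` function here is a hypothetical stand-in for a real model call, and MLPerf itself drives these scenarios through its LoadGen library rather than anything this simple.

```python
import time

def infer(batch):
    """Stand-in for a real model call; pretend each sample costs ~1 ms."""
    time.sleep(0.001 * len(batch))
    return [x * 2 for x in batch]

samples = list(range(64))

# Offline mode: all data is available up front; throughput is what matters.
start = time.perf_counter()
infer(samples)  # one large batch
offline_qps = len(samples) / (time.perf_counter() - start)

# Server mode: requests arrive one at a time; per-request latency matters.
latencies = []
for s in samples:
    start = time.perf_counter()
    infer([s])  # batch size 1
    latencies.append(time.perf_counter() - start)
p99 = sorted(latencies)[int(0.99 * len(latencies)) - 1]

print(f"offline: {offline_qps:.0f} samples/s, server p99: {p99 * 1000:.1f} ms")
```

The same model gives two very different headline numbers depending on the scenario, which is why MLPerf reports Server and Offline results separately.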

1 x NVIDIA L4 – Dell PowerEdge XR7620

 


As part of our recent review of the Dell PowerEdge XR7620, outfitted with a single NVIDIA L4, we took it to the edge to run several tasks, including MLPerf.

Our test system configuration included the following components:

  • 2 x Xeon Gold 6426Y – 16-core 2.5GHz
  • 1 x NVIDIA L4
  • 8 x 16GB DDR5
  • 480GB BOSS RAID1
  • Ubuntu Server 22.04
  • NVIDIA Driver 535
Dell PowerEdge XR7620 (1x NVIDIA L4)    Score
Resnet50 – Server                       12,204.40
Resnet50 – Offline                      13,010.20
BERT K99 – Server                       898.945
BERT K99 – Offline                      973.435

 

 

The server and offline results for Resnet50 and BERT K99 are closely matched here; as we move between server platforms below, the L4 maintains this consistent level of performance across different server models.

1, 2 & 4 NVIDIA L4s – Dell PowerEdge T560


Our review unit configuration included the following components:

  • 2 x Intel Xeon Gold 6448Y (32-core/64-thread each, 225-watt TDP, 2.1-4.1GHz)
  • 8 x 1.6TB Solidigm P5520 SSDs w/ PERC 12 RAID card
  • 1-4x NVIDIA L4 GPUs
  • 8 x 64GB RDIMMs
  • Ubuntu Server 22.04
  • NVIDIA Driver 535
Moving back from the edge to the data center and utilizing the versatile Dell T560 tower server, we noted that the L4 performs just as well in the single-GPU test. This shows that both platforms can provide a solid foundation for the L4 without bottlenecks.
 
Dell PowerEdge T560 (1x NVIDIA L4)      Score
Resnet50 – Server                       12,204.40
Resnet50 – Offline                      12,872.10
BERT K99 – Server                       898.945
BERT K99 – Offline                      945.146

 

 

In our tests with two L4s in the Dell T560, we observed near-linear scaling in performance for both the Resnet50 and BERT K99 benchmarks. This scaling is a testament to the efficiency of the L4 GPUs and their ability to work in tandem without significant losses to overhead or inefficiency.

Dell PowerEdge T560 (2x NVIDIA L4)      Score
Resnet50 – Server                       24,407.50
Resnet50 – Offline                      25,463.20
BERT K99 – Server                       1,801.28
BERT K99 – Offline                      1,904.10

 

 

The consistent linear scaling we witnessed with two NVIDIA L4 GPUs extends impressively to configurations featuring four L4 units. This scaling is particularly noteworthy as maintaining linear performance gains becomes increasingly challenging with each added GPU due to the complexities of parallel processing and resource management.

Dell PowerEdge T560 (4x NVIDIA L4)      Score
Resnet50 – Server                       48,818.30
Resnet50 – Offline                      51,381.70
BERT K99 – Server                       3,604.96
BERT K99 – Offline                      3,821.46
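Using the Resnet50 Server scores from the three T560 tables, the scaling efficiency (measured speedup divided by GPU count) can be computed directly:

```python
# Resnet50 Server scores from the T560 tables above, keyed by GPU count.
scores = {1: 12204.40, 2: 24407.50, 4: 48818.30}
base = scores[1]

for gpus, score in scores.items():
    speedup = score / base
    efficiency = speedup / gpus
    print(f"{gpus}x L4: speedup {speedup:.2f}x, efficiency {efficiency:.1%}")
```

Both the 2x and 4x configurations land at essentially 100% scaling efficiency on this workload, which is what the tables show qualitatively.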

 

 

These results are for illustrative purposes only and are not competitive or official MLPerf results. For the complete list of official results, please visit the MLPerf Results Page.

In addition to validating the linear scalability of the NVIDIA L4 GPUs, our tests in the lab shed light on the practical implications of deploying these units in different operational scenarios. For instance, the consistency in performance between server and offline modes across all configurations with the L4 GPUs reveals their reliability and versatility.

This aspect is particularly relevant for businesses and research institutions where operational contexts vary significantly. Furthermore, our observations on the minimal impact of interconnect bottlenecks and the efficiency of GPU synchronization in multi-GPU setups provide valuable insights for those looking to scale their AI infrastructure. These insights go beyond mere benchmark numbers, offering a deeper understanding of how such hardware can be optimally utilized in real-world scenarios, guiding better architectural decisions and investment strategies in AI and HPC infrastructure.

NVIDIA L4 – Application Performance

We compared the performance of the new NVIDIA L4 against the NVIDIA A2 and NVIDIA T4 that came before it. To showcase the performance uplift over the past models, we deployed all three cards in a server in our lab running Windows Server 2022 and the latest NVIDIA drivers, and ran our entire GPU test suite.

These cards were tested on a Dell PowerEdge R760 with the following configuration:

  • 2 x Intel Xeon Gold 6430 (32 Cores, 2.1GHz)
  • Windows Server 2022
  • NVIDIA Driver 538.15
  • ECC Disabled on all cards for 1x sampling

As we kick off performance testing across this group of three enterprise GPUs, it is worth noting the differences between the earlier A2 and T4 models. When the A2 was released, it offered notable improvements such as lower power consumption and a smaller PCIe Gen4 x8 connector, in place of the larger PCIe Gen3 x16 slot the older T4 required. Right off the bat, that smaller footprint allowed it to slot into more systems.

Blender OptiX 4.0

Blender is an open-source 3D modeling application; this test renders its benchmark scenes on the GPU using NVIDIA’s OptiX backend. The benchmark can be run on both CPU and GPU, but as with most tests here, we ran GPU only, using the Blender Benchmark CLI utility. The score is samples per minute, with higher being better.

Blender 4.0
(Higher is Better)
NVIDIA L4 NVIDIA A2 NVIDIA T4
GPU Blender CLI – Monster 2,207.765 458.692 850.076
GPU Blender CLI – Junkshop 1,127.829 292.553 517.243
GPU Blender CLI – Classroom 1,111.753 262.387 478.786

 

 

Blackmagic RAW Speed Test

We test CPUs and GPUs with Blackmagic’s RAW Speed Test, which measures RAW video decode and playback speeds. This is a hybrid test that reports separate CPU and GPU results for real-world RAW decoding; since we are focusing on the GPUs here, the CPU results are omitted.

Blackmagic RAW Speed Test
(Higher is Better)
NVIDIA L4 NVIDIA A2 NVIDIA T4
8K CUDA 95 FPS 38 FPS 53 FPS

Cinebench 2024 GPU

Maxon’s Cinebench 2024 is a CPU and GPU rendering benchmark; the CPU portion utilizes all CPU cores and threads. Since we are focusing on GPU results, we did not run the CPU portion of the test. Higher scores are better.

Cinebench 2024
(Higher is Better)
NVIDIA L4 NVIDIA A2 NVIDIA T4
GPU 15,263 4,006 5,644

GPU PI

GPUPI 3.3.3 is a lightweight benchmarking utility that calculates π (pi) to billions of decimal places using hardware acceleration through OpenCL and CUDA, on both CPUs and GPUs. We ran the CUDA path on all three GPUs; the numbers here are the calculation time without the reduction time added. Lower is better.

GPU PI Calculation Time in seconds
(Lower is Better)
NVIDIA L4 NVIDIA A2 NVIDIA T4
GPUPI v3.3 – 1B 3.732s 19.799s 7.504s
GPUPI v3.3 – 32B 244.380s 1,210.801s 486.231s

While the previous results looked at a single instance of each card, we also had the chance to test a 5x NVIDIA L4 deployment inside the Dell PowerEdge T560.

GPU PI Calculation Time in seconds
(Lower is Better)
Dell PowerEdge T560 (2x Xeon Gold 6448Y) with 5x NVIDIA L4
GPUPI v3.3 – 1B 0.850s
GPUPI v3.3 – 32B 50.361s
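Dividing the single-L4 32B time from the earlier table by this 5x result gives a rough multi-GPU speedup figure, with the caveat that the two systems use different host CPUs, so this is an estimate rather than a controlled comparison:

```python
# GPUPI 32B calculation times from the tables above, in seconds.
single_l4_32b = 244.380  # 1x L4 in the R760
five_l4_32b = 50.361     # 5x L4 in the T560 (0sec 850ms / 50sec 361ms format)

speedup = single_l4_32b / five_l4_32b
print(f"5x L4 speedup on the 32B run: {speedup:.2f}x "
      f"({speedup / 5:.0%} scaling efficiency)")
```

A speedup of roughly 4.85x across five cards (about 97% efficiency) lines up with the near-linear scaling seen in the MLPerf results.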

 

 

Octanebench

OctaneBench is a benchmarking utility for OctaneRender, another 3D renderer with RTX support similar to V-Ray.

Octane (Higher is Better)
Scene Kernel NVIDIA L4 NVIDIA A2 NVIDIA T4
Interior Info channels 15.59 4.49 6.39
  Direct lighting 50.85 14.32 21.76
  Path tracing 64.02 18.46 25.76
Idea Info channels 9.30 2.77 3.93
  Direct lighting 39.34 11.53 16.79
  Path tracing 48.24 14.21 20.32
ATV Info channels 24.38 6.83 9.50
  Direct lighting 54.86 16.05 21.98
  Path tracing 68.98 20.06 27.50
Box Info channels 12.89 3.88 5.42
  Direct lighting 48.80 14.59 21.36
  Path tracing 54.56 16.51 23.85
Total Score 491.83 143.71 204.56

 

 

Geekbench 6 GPU

Geekbench 6 is a cross-platform benchmark that measures overall system performance. There are test options for both CPU and GPU benchmarking. Higher scores are better. Again, we only looked at the GPU results.

You can find comparisons to any system you want in the Geekbench Browser.

Geekbench 6.1.0
(Higher Is Better)
NVIDIA L4 NVIDIA A2 NVIDIA T4
Geekbench GPU OpenCL 156,224 35,835 83,046

Luxmark

LuxMark is an OpenCL cross-platform benchmarking tool from the maintainers of the open-source 3D rendering engine LuxRender. It looks at GPU performance in 3D modeling, lighting, and video work. For this review, we used the newest version, v4alpha0. In LuxMark, a higher score is better.

Luxmark v4.0alpha0
OpenCL GPUs
(Higher is Better)
NVIDIA L4 NVIDIA A2 NVIDIA T4
Hall Bench 14,328 3,759 5,893
Food Bench 5,330 1,258 2,033

GROMACS CUDA

We also compiled GROMACS, a molecular dynamics package, from source specifically for CUDA. This bespoke compilation was done to leverage the parallel processing capabilities of the five NVIDIA L4 GPUs, essential for accelerating computational simulations.

The process involved nvcc, NVIDIA’s CUDA compiler, along with many iterations of optimization flags to ensure the binaries were properly tuned to the server’s architecture. Building GROMACS with CUDA support allows the software to interface directly with the GPU hardware, which can drastically improve computation times for complex simulations.

The Test: Custom Protein Interaction in GROMACS

Leveraging a community-provided input file from our Discord, which contained parameters and structures tailored for a specific protein-interaction study, we initiated a molecular dynamics simulation. The results were remarkable: the system achieved a simulation rate of 170.268 nanoseconds per day.

GPU               System                                   ns/day     core time (s)
NVIDIA A4000      Whitebox, AMD Ryzen 5950X                84.415     163,763
NVIDIA RTX 4070   Whitebox, AMD Ryzen 7950X3D              131.85     209,692.3
5x NVIDIA L4      Dell T560 w/ 2x Intel Xeon Gold 6448Y    170.268    608,912.7

More Than AI

With the hype around AI reaching fever pitch, it’s easy to fixate solely on the L4’s performance with AI models—but it has a few other tricks up its sleeve, unlocking a world of possibilities for video applications. The L4 can host up to 1,040 concurrent AV1 video streams at 720p30, a capability that can transform how content is live-streamed to edge users, elevate creative storytelling, and enable exciting use cases for immersive AR/VR experiences.
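A quick sanity check on the scale of that claim: 1,040 concurrent 720p30 streams correspond to nearly 29 gigapixels encoded every second.

```python
# Aggregate encode throughput implied by 1,040 concurrent 720p30 streams.
streams, width, height, fps = 1040, 1280, 720, 30
pixels_per_second = streams * width * height * fps

print(f"{pixels_per_second / 1e9:.2f} gigapixels/s")  # → 28.75 gigapixels/s
```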
 
The NVIDIA L4 also shines when it comes to optimizing graphics performance, as evidenced by its prowess in real-time rendering and ray tracing. In an edge office environment, the L4 is capable of delivering robust, high-powered graphical computation acceleration for VDI, catering to end users who rely on high-quality, real-time graphics rendering for their work.
 
Closing Thoughts
The NVIDIA L4 GPU provides a solid foundation for edge AI and high-performance computing, offering unmatched efficiency and versatility across a range of applications. Its ability to handle intensive AI workloads, acceleration tasks, or video pipelines—along with its optimized graphics performance—makes it an ideal choice for edge inferencing or virtual desktop acceleration. The L4’s unique combination of high computational power, advanced memory capabilities, and energy efficiency positions it as a key player in driving the acceleration of edge workloads, particularly in AI and graphics-intensive industries.
 
 
There’s no denying that AI is at the center of the current IT storm, and demand for high-end H100/H200 GPUs remains through the roof. However, there’s also a major push to deploy more robust IT infrastructure at the edge—where data is generated and analyzed. In these scenarios, a more appropriately sized GPU is needed, and the NVIDIA L4 excels here. It should be the default choice for edge inferencing, whether deployed as a single unit or scaled together, as we tested in the T560.
 
Beijing Qianxing Jietong Technology Co., Ltd.
Sandy Yang/Global Strategy Director
WhatsApp / WeChat: +86 13426366826
Email: yangyd@qianxingdata.com
Website: www.qianxingdata.com / www.storagesserver.com

Business Focus:
ICT Product Distribution/System Integration & Services/Infrastructure Solutions
With 20+ years of IT distribution experience, we partner with leading global brands to deliver reliable products and professional services.
“Using Technology to Build an Intelligent World” – Your Trusted ICT Product Service Provider!
 
 