

LARGE LANGUAGE MODEL PLAYGROUND EDA (PT 3)
Silas Liu - Oct. 22, 2024
Large Language Models
This time I present my latest addition: an autonomous agent designed to perform Exploratory Data Analysis (EDA) and capable of running Python code. It can generate commands dynamically, analyze data on its own and provide meaningful summaries based on the gathered information. The agent is programmed to generate qualitative insights, guiding the user toward deeper analyses.

One key challenge was ensuring data privacy. In my system there is a layer of anonymization that masks sensitive data before the analysis begins. The implementation itself also presented many challenges around automating code generation and execution.
Recently, I implemented a new agent in the LLM Playground that focuses on performing exploratory data analysis (EDA) autonomously. The primary goal of this agent is to streamline and accelerate the analysis process, allowing it to not only generate raw metrics but also provide qualitative insights that help guide users in further exploring their datasets. This agent is designed to detect patterns in the data and suggest additional approaches, making the analysis more strategic and practical.
Integrating this new agent into my existing workflow, which I had previously set up with LangGraph, was very easy. Although building the agent itself required careful planning and implementation, adding it to the flow took only a few adjustments.
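For context, here is a minimal sketch of what that kind of integration looks like in LangGraph. The state schema and node names are illustrative, not the playground's actual code; the point is that a new agent is simply another node wired into the compiled graph:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

# Hypothetical state shared across the playground's nodes.
class PlaygroundState(TypedDict):
    messages: list   # conversation history
    dataset: dict    # handle/metadata for the uploaded data

def eda_agent_node(state: PlaygroundState) -> PlaygroundState:
    # The agent's generate -> execute -> summarize loop runs here.
    return state

builder = StateGraph(PlaygroundState)
builder.add_node("eda_agent", eda_agent_node)
builder.set_entry_point("eda_agent")  # in the full flow, a router node leads here
builder.add_edge("eda_agent", END)
graph = builder.compile()
```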

One of the first concerns was ensuring data privacy. To address this, I added a layer of anonymization that hides the original column names before the analysis. Even without knowing the real variable names, the agent is capable of interpreting the data effectively, focusing on distributions, correlations, and potential data quality issues. This allows it to deliver valuable insights regardless of the specific nature of the dataset.
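As a rough illustration of the idea (the actual masking layer is more involved), replacing column names with neutral placeholders and keeping a reverse mapping is enough for the agent to work blind. The function below is a sketch, not the playground's real code:

```python
import pandas as pd

def anonymize_columns(df: pd.DataFrame):
    """Mask real column names before the agent sees the data."""
    mapping = {col: f"col_{i}" for i, col in enumerate(df.columns)}
    reverse = {v: k for k, v in mapping.items()}  # to translate findings back
    return df.rename(columns=mapping), reverse

df = pd.DataFrame({"salary": [50_000, 62_000, 58_000], "age": [31, 45, 29]})
masked, reverse = anonymize_columns(df)
print(masked.columns.tolist())  # ['col_0', 'col_1']
print(reverse["col_0"])         # 'salary'
```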
What makes this agent stand out is its ability to autonomously generate Python commands based on the data it receives. Instead of relying on predefined instructions, it adapts its behavior according to the data's characteristics. The agent follows a closed feedback loop where it creates commands, executes them, evaluates the results, and generates new commands based on what was discovered. This structure ensures it has complete freedom to make decisions throughout the analysis, making it both flexible and responsive.
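In pseudocode terms, the loop looks roughly like the sketch below. The llm_generate and llm_summarize callables are stand-ins for the model client, and the exec-based runner is deliberately simplified; in practice the generated code should run in a sandbox:

```python
import io
import contextlib

def run_python(code: str, env: dict) -> str:
    """Execute a generated command and capture its output as the observation."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, env)  # simplified: a real system should sandbox this
        return buffer.getvalue()
    except Exception as exc:
        return f"ERROR: {exc}"

def eda_loop(df, llm_generate, llm_summarize, max_steps: int = 5) -> str:
    env = {"df": df}
    history = []  # (command, observation) pairs fed back to the model
    for _ in range(max_steps):
        code = llm_generate(history)         # next command, given findings so far
        if code is None:                     # the model decides it has seen enough
            break
        observation = run_python(code, env)  # execute and observe
        history.append((code, observation))
    return llm_summarize(history)            # qualitative summary + next steps
```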
To illustrate how this works, below is an example of a typical analysis. The agent can generate initial commands, execute them, and summarize the exploratory analysis by focusing on key aspects such as Distribution and Skewness, Missing Values and Duplicates, Correlations, Data Quality and Patterns. Once it identifies these points, it also suggests potential next steps, helping the user decide whether to dive deeper into specific variables or generate relevant visualizations.
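The first round of generated commands typically boils down to standard pandas probes along these lines (an illustrative reconstruction covering the aspects above, not output captured from the agent):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the masked dataset the agent receives.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["col_0", "col_1", "col_2"])

print(df.describe())           # distributions per column
print(df.skew())               # skewness cues
print(df.isnull().sum())       # missing values
print(df.duplicated().sum())   # duplicate rows
print(df.corr())               # pairwise correlations
```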

The agent's autonomy simplifies complex workflows by breaking the process into manageable iterations. After receiving the data, it runs the necessary analyses and highlights the key insights at each stage. It does not just present numbers but interprets them and provides practical recommendations for further exploration.
Below is an example where I ran the agent on a reasonably sized dataset with 32 columns to test its performance. The results were very reasonable: it identified skewness in some distributions and found significant correlations between specific columns, like radius_mean, perimeter_mean and area_mean, which are indeed related and provide meaningful insights about the dataset. However, one challenge I noticed is the high token consumption. When working with high-dimensional datasets, the cost of tokens can increase considerably, which may be unavoidable depending on the complexity of the data.
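That particular finding is easy to verify by hand. Assuming the same dataset is available locally as a CSV with those column names (the path below is a placeholder), a direct correlation check confirms it:

```python
import pandas as pd

df = pd.read_csv("breast_cancer.csv")  # placeholder path for the 32-column dataset
print(df[["radius_mean", "perimeter_mean", "area_mean"]].corr())
# Radius, perimeter and area describe the same geometry, so near-perfect
# correlations between these columns are expected.
```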

With the help of this agent, I can now perform exploratory data analysis more efficiently and with a sharper focus on results. Automating many of the repetitive tasks and allowing the agent to independently handle decision-making can significantly reduce the time and effort needed to reach insights.