This page documents the methodology and data processing behind the visualisations on this website. There is a section for each type of visualisation (grouped by data source), covering the data source, the processing steps and the metrics used. The questions underlying these visualisations are the outcome of collaborative exploration between the CLIx technology team and the implementation, domain, research and other teams. Please refer to these scripts for a more detailed look at the data processing methodology.
Module-level progress csvs are generated every hour in each school and packaged into the syncthing tar file, which is referred to as thin data. Each csv lists the students, along with their buddies if any, and quantitative data recorded cumulatively. This quantitative data covers the total lessons and activities visited, the percentage of completion, and the number of times each activity was visited. For example, if a file generated on a particular day at 12h09min contains data for a few students, the next file at 13h09min contains the data of the 12h09min file plus the data generated between 12h09min and 13h09min. Hence the data is cumulative.
Because students can explore the courses without logging into the platform, every such entry is attributed to the anonymous user ‘0’. However, when fetching data for a particular module, the anonymous user’s data is not extracted. This differs from tools data, which is generated by third-party tools rather than by the platform itself.
The following steps are used to collate the progress data across all schools of a particular state for a given month of engagement:
Since the data is cumulative, we find the latest file created on the last day of that month. This file includes all the data from the start of engagement in that school up to that month.
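Selecting the latest cumulative file can be sketched as follows. This is a minimal illustration, not the project's script: it assumes each file has already been paired with its creation timestamp, and the function name `latest_snapshot_in_month` and the sample file names are hypothetical.

```python
from datetime import datetime

def latest_snapshot_in_month(snapshots, year, month):
    """Return the (created_at, path) pair with the latest timestamp in the
    given month, or None if no file was created in that month."""
    in_month = [s for s in snapshots if s[0].year == year and s[0].month == month]
    return max(in_month, key=lambda s: s[0]) if in_month else None

# Hypothetical hourly files from one school.
snapshots = [
    (datetime(2019, 1, 30, 12, 9), "progress_a.csv"),
    (datetime(2019, 1, 30, 13, 9), "progress_b.csv"),
    (datetime(2019, 2, 1, 10, 9), "progress_c.csv"),
]
latest = latest_snapshot_in_month(snapshots, 2019, 1)
```

Because every file is cumulative, the single file selected this way is enough to represent the month; earlier files from the same month add nothing.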
As part of cleaning, we removed all rows pertaining to internal accounts, so the resulting csv holds data only for student ids ending with the school's code, e.g. red-bull-rj1.
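A minimal sketch of this cleaning step is shown below. The column name `user_name`, the school code `rj1` and the internal account name are assumptions made for illustration; the actual csv headers may differ.

```python
def keep_student_rows(rows, school_code, id_column="user_name"):
    """Keep only rows whose id ends with the school code.
    Internal accounts do not carry the school code suffix and are dropped."""
    return [row for row in rows if row[id_column].endswith(school_code)]

# Hypothetical rows from one school's progress csv.
rows = [
    {"user_name": "red-bull-rj1", "module_name": "m1"},
    {"user_name": "administrator", "module_name": "m1"},
]
students = keep_student_rows(rows, "rj1")
```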
The latest course progress csvs from every school are then collated into a single csv, with a few new columns added as follows:
- Date: The last date of the given month on which engagement took place in a particular school.
- Day, Month, Year: These columns are populated by splitting the Date column value.
- UnitCode: Each unit has a unique code assigned by the Research Team, and this column is populated with it.
Aggregation is done on the final collated csv by grouping on server_id and module_name columns to calculate the following:
- Number of students who visited a particular module
- Average percentage of activities visited in a module
This gives us the total number of students who have gone through a particular module. In a given school, for each module, we count the unique user_ids to get the total number of students.
This gives us the average percentage of activities visited by the students who have gone through a particular module. In a given school, for each module, we calculate the sum of percentage_activities_visited and then divide it by the number of students.
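The two aggregations can be sketched as below, using the column names mentioned above (server_id, module_name, user_id, percentage_activities_visited). This is an illustrative sketch under the assumption that the collated csv has one row per student and module; the function name `aggregate_modules` is hypothetical.

```python
from collections import defaultdict

def aggregate_modules(rows):
    """Group rows by (server_id, module_name) and return, per group,
    the number of unique students and their mean percentage visited."""
    pct_by_user = defaultdict(dict)
    for row in rows:
        key = (row["server_id"], row["module_name"])
        # One entry per user_id; a repeated user_id overwrites the earlier value.
        pct_by_user[key][row["user_id"]] = float(row["percentage_activities_visited"])
    return {
        key: {
            "students": len(users),
            "avg_percentage_visited": sum(users.values()) / len(users),
        }
        for key, users in pct_by_user.items()
    }

# Hypothetical collated rows.
rows = [
    {"server_id": "s1", "module_name": "m1", "user_id": "1", "percentage_activities_visited": "50"},
    {"server_id": "s1", "module_name": "m1", "user_id": "2", "percentage_activities_visited": "100"},
    {"server_id": "s1", "module_name": "m2", "user_id": "1", "percentage_activities_visited": "20"},
]
summary = aggregate_modules(rows)
```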
The primary source of tools data is the json file generated whenever a user accesses the Tools/Apps section of the CLIx software. These files, along with other thin data, are available from schools as part of the syncthing data. In a given school (on a particular machine), a separate json file is generated for each tool. This file holds the logs of all users of the tool on that machine.
One specific aspect of the project implementation is that not all students using the platform are registered. Non-registered students are logged as anonymous users under a common user_id (=0), so the logs of the anonymous user usually come from more than one student. In all these cases we have left out the anonymous user's observations, as it is difficult to infer student-level activity from them, even though they make up a substantial part of the total observations. We currently have a simple methodology to approximate the number of users behind the anonymous user_id, but have deferred its implementation until we discuss it with the research and other teams.
To process tools data, we first convert the json files into csv files, extracting only the information relevant for our analysis. The tools csv files from all the schools are then collated into one big csv file. Each row of this csv file corresponds to a unique log of a user, with the following features (columns):
- school_server_code: Unique code of the school in which the log is generated
- user_id: Unique id of the student creating the log
- tool_name: Name of the tool being accessed
- created_at: Timestamp of creation of the log
- date_created: Date of creation of the log
- time_spent: Time spent by a student (=user_id) on the tool (=tool_name) on a given day (=date_created). This is calculated as the time difference between the first and last log timestamps of each user_id on a given day.
- state_code: State in which the machine is located
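The time_spent calculation described above can be sketched as follows. The sketch assumes that created_at is an ISO 8601 string, which may not match the real log format, and the function name `time_spent_per_day` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def time_spent_per_day(logs):
    """Map (school_server_code, user_id, tool_name, date) to the seconds
    between the first and last log of that user on that tool that day."""
    stamps = defaultdict(list)
    for log in logs:
        ts = datetime.fromisoformat(log["created_at"])  # assumed ISO 8601
        key = (log["school_server_code"], log["user_id"], log["tool_name"], ts.date())
        stamps[key].append(ts)
    return {key: (max(v) - min(v)).total_seconds() for key, v in stamps.items()}

# Hypothetical logs: one student, one tool, two logs on the same day.
logs = [
    {"school_server_code": "s1", "user_id": 7, "tool_name": "t1", "created_at": "2019-01-30T12:00:00"},
    {"school_server_code": "s1", "user_id": 7, "tool_name": "t1", "created_at": "2019-01-30T12:30:00"},
]
spent = time_spent_per_day(logs)
```

One consequence of this definition is that a student with a single log on a given day gets a time_spent of zero for that day.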
This corresponds to the total number of days on which the tools section was accessed by students, irrespective of the time spent. In a given school, for each tool, we count all unique date entries (a date entry means at least one student accessed the tool on that date). This count gives us the total number of unique days on which a tool was accessed by students during the observation period.
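As an illustration, the count of unique days per tool could be computed as below. The sketch uses the column names of the collated tools csv and assumes the anonymous user (user_id 0) is skipped, as described for this section; `days_tool_accessed` is a hypothetical name.

```python
from collections import defaultdict

def days_tool_accessed(rows):
    """Return {(school_server_code, tool_name): number of unique dates}."""
    dates = defaultdict(set)
    for row in rows:
        if str(row["user_id"]) == "0":  # anonymous user is left out
            continue
        dates[(row["school_server_code"], row["tool_name"])].add(row["date_created"])
    return {key: len(days) for key, days in dates.items()}

# Hypothetical rows of the collated tools csv.
rows = [
    {"school_server_code": "s1", "tool_name": "t1", "date_created": "2019-01-05", "user_id": "1"},
    {"school_server_code": "s1", "tool_name": "t1", "date_created": "2019-01-05", "user_id": "2"},
    {"school_server_code": "s1", "tool_name": "t1", "date_created": "2019-01-06", "user_id": "1"},
    {"school_server_code": "s1", "tool_name": "t2", "date_created": "2019-01-07", "user_id": "0"},
]
days = days_tool_accessed(rows)
```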
In a given school, the number of unique users in a day is calculated for each tool. This number is then averaged, for every tool, across all days of the observation period. Please note that for each tool the averaging considers only the dates on which it was used, so the period across which we average can differ from tool to tool. The tool averages in a school therefore cannot be added up to comment on the average number of students in a day.
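The averaging over used days only can be sketched as follows. Column names follow the collated tools csv; skipping user_id 0 and the function name `avg_daily_users` are assumptions of this sketch.

```python
from collections import defaultdict

def avg_daily_users(rows):
    """Return {(school_server_code, tool_name): mean unique users per day},
    averaged only over the days on which the tool was used."""
    users = defaultdict(set)  # (school, tool, date) -> set of user ids
    for row in rows:
        if str(row["user_id"]) == "0":  # anonymous user is left out
            continue
        key = (row["school_server_code"], row["tool_name"], row["date_created"])
        users[key].add(row["user_id"])
    counts = defaultdict(list)  # (school, tool) -> daily unique-user counts
    for (school, tool, _day), ids in users.items():
        counts[(school, tool)].append(len(ids))
    return {key: sum(c) / len(c) for key, c in counts.items()}

# Hypothetical rows: t1 is used on two days, t2 on one day.
rows = [
    {"school_server_code": "s1", "tool_name": "t1", "date_created": "2019-01-05", "user_id": "1"},
    {"school_server_code": "s1", "tool_name": "t1", "date_created": "2019-01-05", "user_id": "2"},
    {"school_server_code": "s1", "tool_name": "t1", "date_created": "2019-01-06", "user_id": "1"},
    {"school_server_code": "s1", "tool_name": "t2", "date_created": "2019-01-05", "user_id": "1"},
]
averages = avg_daily_users(rows)
```

In this sample the average for t1 is taken over two days and the average for t2 over one day, which is why the two numbers do not share a denominator and should not be summed.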
This is the aggregate number of unique students engaged with each tool of the platform. In a given school, for each tool, we count the unique user_ids over the whole period of observation to get the number of students. The summed-up number on the y-axis can be interpreted as the total number of unique students engaged with the tools in the corresponding school.
In a given school, the time spent in a day by all students on each tool is calculated (only for days on which students logged into that particular tool). For each tool, this number is averaged across the observation period. Please note that the days across which each tool's time spent is averaged can differ, so the tool numbers in a school cannot be added up to give the total time spent in a day.
In a given month, the daily time spent on different tools by all students is calculated for all schools. The range of these daily time-spent numbers is represented using min/max.
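A possible way to compute this range is sketched below. It assumes the input has already been reduced to one row per school, student, tool and day, that time_spent is in seconds, and that date_created is a `YYYY-MM-DD` string; it also assumes daily totals are taken per school before the range is computed per tool. None of these details are confirmed by the description above, and `monthly_time_range` is a hypothetical name.

```python
from collections import defaultdict

def monthly_time_range(rows, year_month):
    """Return {tool_name: (min, max)} of the daily totals of time_spent
    (summed over students, per school and day) within 'YYYY-MM'."""
    daily = defaultdict(float)  # (school, tool, date) -> total seconds
    for row in rows:
        if row["date_created"].startswith(year_month):
            key = (row["school_server_code"], row["tool_name"], row["date_created"])
            daily[key] += row["time_spent"]
    ranges = {}
    for (_school, tool, _day), total in daily.items():
        low, high = ranges.get(tool, (total, total))
        ranges[tool] = (min(low, total), max(high, total))
    return ranges

# Hypothetical rows: one row per school, student, tool and day.
rows = [
    {"school_server_code": "s1", "tool_name": "t1", "date_created": "2019-01-05", "user_id": "1", "time_spent": 600},
    {"school_server_code": "s1", "tool_name": "t1", "date_created": "2019-01-05", "user_id": "2", "time_spent": 300},
    {"school_server_code": "s1", "tool_name": "t1", "date_created": "2019-01-06", "user_id": "1", "time_spent": 100},
    {"school_server_code": "s1", "tool_name": "t1", "date_created": "2019-02-01", "user_id": "1", "time_spent": 5000},
]
january = monthly_time_range(rows, "2019-01")
```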
The primary source of data for these visuals is the syncthing data, which comprises tools data (json logs) and modules data (csv files). At the time these visuals were developed, activity timestamp data was not available. Tools data is the json file generated whenever a user accesses the Tools/Apps section of the CLIx software. Each module (progress) csv lists the students, along with their buddies if any, and quantitative data recorded cumulatively.
One important aspect of the project implementation is that a typical student keeps the same id throughout the observation period only 60% of the time (as per rough estimates), so it is hard to track the progress of an individual student. Modules data is also registered with a margin of error close to a day, as the scripts are triggered periodically whenever there is power. Tools data is registered at the precise time each event occurs.
To process tools data, we first convert the json files into csv files, extracting only the information relevant for our analysis. The tools csv files from all the schools are then collated into one big csv file. Each row of this csv file corresponds to a unique log of a user, with the following features (columns):
- school_server_code: Unique code of the school in which the log is generated
- user_id: Unique id of the student creating the log
- tool_name: Name of the tool being accessed
- created_at: Timestamp of creation of the log
- date_created: Date of creation of the log
- time_spent: Time spent by a student (=user_id) on the tool (=tool_name) on a given day (=date_created). This is calculated as the time difference between the first and last log timestamps of each user_id on a given day.
- state_code: State in which the machine is located
Modules data is generated by collating the progress csv files generated in a given school. Collation is achieved by combining the csvs and then keeping only those logs that show an increase in the percentage of student activity. This results in a single csv for a given school, with data for all students and their progression through the modules.
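The filtering on increases can be sketched as below. The sketch assumes the files are processed oldest first, that progress is tracked per (user_id, module_name), and that a student's first appearance counts as an increase only when the percentage is above zero; these are assumptions for illustration, and `progression_rows` is a hypothetical name.

```python
def progression_rows(snapshots):
    """snapshots: list of row lists, ordered oldest first.
    Keep a row only when the percentage for that user and module is higher
    than the highest value seen so far (starting from 0)."""
    best = {}
    kept = []
    for rows in snapshots:
        for row in rows:
            key = (row["user_id"], row["module_name"])
            pct = float(row["percentage_activities_visited"])
            if pct > best.get(key, 0.0):
                best[key] = pct
                kept.append(row)
    return kept

# Hypothetical cumulative files from one school, oldest first.
snapshots = [
    [{"user_id": "u1", "module_name": "m1", "percentage_activities_visited": "20"}],
    [{"user_id": "u1", "module_name": "m1", "percentage_activities_visited": "20"},
     {"user_id": "u2", "module_name": "m1", "percentage_activities_visited": "0"}],
    [{"user_id": "u1", "module_name": "m1", "percentage_activities_visited": "50"}],
]
progress = progression_rows(snapshots)
```

Because each file is cumulative, most rows repeat unchanged from one hour to the next; keeping only the increases leaves one row per step of progress.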
Each bar represents the number of unique logins for the tools and modules sections of the platform. In a given school, for each day that has some login information, the number of unique login ids is calculated. This is done for tools and modules data independently, and a stacked bar chart is constructed from the result. Please note that there is a margin of error in determining the exact day on which a module was done, though it is exact 80 percent of the time.
In a given school, we try to estimate the number of days on which only tools were done, only modules were done, and modules and tools were done together. The key idea is this: if a student does a module and the related tool within 1 day of each other (earlier or later), we consider that the student has done that tool and module together. We know the timestamp of module usage only to the nearest day and the timestamp of tool usage exactly, so there is always a margin of error, but this is the best estimate we could think of. We also consider a tool or module as done if there is a log for it, irrespective of how much the student engaged with it.
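One possible reading of this rule is sketched below. The mapping from a module to its related tool (`related_tool`), the tuple layout of the logs and the function name `classify_days` are all hypothetical. The description above does not say how to treat a day that has both tool and module logs with no match between them; in this sketch such a day appears in both "only" sets.

```python
from datetime import date

def classify_days(module_logs, tool_logs, related_tool):
    """module_logs: (user_id, module_name, day); tool_logs: (user_id, tool_name, day).
    A module day and a tool day are 'together' when the same student used the
    related tool within 1 day (earlier or later) of the module."""
    tool_days = {}
    for user, tool, day in tool_logs:
        tool_days.setdefault((user, tool), set()).add(day)
    together = set()
    for user, module, day in module_logs:
        tool = related_tool.get(module)
        for tool_day in tool_days.get((user, tool), ()):
            if abs((tool_day - day).days) <= 1:
                together.update({day, tool_day})
    modules_only = {day for _, _, day in module_logs} - together
    tools_only = {day for _, _, day in tool_logs} - together
    return {"together": together, "modules_only": modules_only, "tools_only": tools_only}

# Hypothetical logs for one school.
module_logs = [("u1", "m1", date(2019, 1, 10)), ("u2", "m1", date(2019, 1, 20))]
tool_logs = [("u1", "t1", date(2019, 1, 11)), ("u3", "t1", date(2019, 1, 15))]
result = classify_days(module_logs, tool_logs, {"m1": "t1"})
```

The number of days in each category is then the size of the corresponding set.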
For a given school, we calculate the number of unique logins attempting different modules on a particular day. Please note that the same student might have done more than one module in a given day, and we count the student once for every different module attempted.
For a given school, we calculate the number of unique logins attempting different tools on a particular day. Please note that whenever the anonymous user (=0) appears, we count it as one login. The same student might also have done more than one tool in a day, and we count the student once for every different tool attempted.
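The counting rule for tools can be sketched as follows, using the column names of the collated tools csv for a single school. The function name `logins_per_tool_per_day` is hypothetical. The same pattern would apply to modules with module_name in place of tool_name.

```python
from collections import defaultdict

def logins_per_tool_per_day(rows):
    """Return {(date_created, tool_name): number of distinct user_ids}."""
    ids = defaultdict(set)
    for row in rows:
        # The anonymous id "0" is a single set member, so it counts as one login.
        ids[(row["date_created"], row["tool_name"])].add(str(row["user_id"]))
    return {key: len(users) for key, users in ids.items()}

# Hypothetical rows for one school and one day.
rows = [
    {"date_created": "2019-01-05", "tool_name": "t1", "user_id": 0},
    {"date_created": "2019-01-05", "tool_name": "t1", "user_id": 0},
    {"date_created": "2019-01-05", "tool_name": "t1", "user_id": 5},
    {"date_created": "2019-01-05", "tool_name": "t2", "user_id": 5},
]
logins = logins_per_tool_per_day(rows)
```

In this sample, student 5 is counted once under t1 and once under t2, and the two anonymous logs under t1 count as a single login.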